# general
m
Hi, I have several applications, and I would like to watch a metric each of them exposes and send it over Kafka as a message. When I ingest the Kafka messages into Pinot, is there a way to aggregate them so that only the latest message sent by each application is kept? If not, that is, if we have to keep all the messages, is there a way to query Pinot to show only the latest message for each application?
k
You can use the upsert feature
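For reference, a minimal sketch of what enabling upserts looks like, pieced together from the upsert docs. The schema/table name `appMetrics` and the `appId` / `metricValue` / `ts` columns are placeholders, not anything from this thread. The schema declares the primary key:

```json
{
  "schemaName": "appMetrics",
  "dimensionFieldSpecs": [
    { "name": "appId", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "metricValue", "dataType": "DOUBLE" }
  ],
  "dateTimeFieldSpecs": [
    { "name": "ts", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS" }
  ],
  "primaryKeyColumns": ["appId"]
}
```

and the realtime table config turns on full upsert mode:

```json
{
  "tableName": "appMetrics",
  "tableType": "REALTIME",
  "upsertConfig": { "mode": "FULL" }
}
```

With `"mode": "FULL"`, queries return only the latest record per primary key, which matches the "keep only the latest message per application" requirement.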
m
Thanks @User! I read from https://docs.pinot.apache.org/basics/data-import/upsert that
An important requirement for the Pinot upsert table is to partition the input stream by the primary key. For Kafka messages, this means the producer shall set the key in the `send` API. If the original stream is not partitioned, then a streaming processing job (e.g. Flink) is needed to shuffle and repartition the input stream into a partitioned one for Pinot's ingestion.
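Concretely, "set the key in the `send` API" means producing keyed records. A minimal sketch with the Java client, where the topic name `app-metrics`, the broker address, and the choice of the application id as the key are all assumptions for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class MetricProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by application id lets Kafka's default partitioner route all
            // messages from the same application to the same partition.
            String appId = "app-42"; // hypothetical primary-key value
            String payload = "{\"appId\":\"app-42\",\"metricValue\":0.97,\"ts\":1630000000000}";
            producer.send(new ProducerRecord<>("app-metrics", appId, payload));
        }
    }
}
```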
Does that mean that if the primary key column has thousands or even millions of distinct values, I need to create that many partitions in my Kafka topic?
k
Partition-to-key mapping is one-to-many, i.e. one partition can hold many unique keys, as long as they all hash to the same partition after applying a partitioning function
something like `hash(key) % num_partitions`
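To make the one-to-many point concrete: Kafka's default partitioner for keyed records hashes the key bytes with murmur2 and takes the result modulo the partition count, so any number of distinct keys fold into a fixed set of partitions. A sketch using the hash helpers that ship with the Kafka client (assumed on the classpath):

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class PartitionDemo {
    public static void main(String[] args) {
        int numPartitions = 8; // topic partition count stays small and fixed
        // Millions of distinct keys still land in only numPartitions buckets.
        for (String key : new String[] {"app-1", "app-2", "app-42", "app-99999"}) {
            int partition = Utils.toPositive(Utils.murmur2(key.getBytes(StandardCharsets.UTF_8))) % numPartitions;
            System.out.println(key + " -> partition " + partition);
        }
    }
}
```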
m
ah thanks, I misread it as requiring a 1:1 mapping
k
@User we should probably mention this in the docs
m
Also, just wanted to point out that fields like `primaryKeyColumns` and `segmentAssignmentStrategy`, and the value `strictReplicaGroup` for `instanceSelectorType`, are not documented in https://docs.pinot.apache.org/configuration-reference/table or https://docs.pinot.apache.org/configuration-reference/schema
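For anyone landing here before the docs catch up, a hedged sketch of where those fields sit in the table config, based on the upsert docs page (the `segmentAssignmentStrategy` value shown is a common default and purely illustrative; `primaryKeyColumns` belongs in the schema, as in the sketch further up):

```json
{
  "tableName": "appMetrics",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
  },
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  },
  "upsertConfig": { "mode": "FULL" }
}
```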
k
@User ^^
j
Thanks for pointing it out. We'll add them to the documentation
👍 1