# troubleshooting
a
Hi, is there any possibility that Pinot will create 2 rows for one Kafka message?
m
What’s the use case?
a
Data from a Kafka topic is ingested into a Pinot table, and there are duplicate rows in this table. My first guess is that there are duplicate messages in the Kafka topic, but I'm curious whether there's any possibility that Pinot creates duplicate rows from one message.
😅
m
Oh I see, I thought you were asking for a feature. No, there is no known bug today that would make Pinot create multiple rows from one message.
x
Well, if you want that as a feature, you can check out unnesting complex types (such as array fields) during ingestion: https://docs.pinot.apache.org/basics/data-import/complex-type#ingestion-configurations
a
@Mayank Glad to hear it.
@Xiaobing Not for this feature, but still thanks.
l
Duplicate data being pushed to Kafka is the likely culprit, and it's a very common scenario. It can be very difficult to enforce exactly-once event processing.
You could have a read about potential deduplication patterns here: https://medium.com/lydtech-consulting/kafka-deduplication-patterns-1-of-2-ef0371a3331b
Alternatively, you can do this using batch ingestion. There's a bit of overhead in getting batch ingestion set up, unless you opt for the CSV ingestion approach, which is supported through Spark or Hadoop.
m
Side note, there’s also deduplication in Pinot available in master now: https://github.com/apache/pinot/pull/8708
But I know in @Alice’s case the upstream cannot be custom partitioned.
a
@Mayank About the deduplication feature, is there any doc I can refer to for configuring my table?
m
@saurabh dubey
s
@Alice You can refer to https://docs.pinot.apache.org/basics/data-import/dedup. This feature is available from v0.11 onwards.
Although partitioning the stream by the desired primary key is still needed for dedup to work.
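For anyone landing on this thread later, here is a minimal sketch of what the dedup-related config fragments might look like, built as Python dicts so they can be dumped to JSON. The field names (`dedupConfig`, `dedupEnabled`, `hashFunction`, `primaryKeyColumns`) are as I recall them from the dedup doc linked above, so please verify them against your Pinot version. The table name `events` and the column `event_id` are hypothetical, and the helper functions are not part of any Pinot client.

```python
import json


def build_dedup_table_fragment(table_name: str) -> dict:
    """Return only the dedup-related part of a REALTIME table config.

    Illustrative fragment, not a complete table config. Key names are
    assumed from the Pinot dedup documentation.
    """
    return {
        "tableName": table_name,
        "tableType": "REALTIME",
        "dedupConfig": {
            "dedupEnabled": True,
            # Assumption: "NONE" keeps the raw primary key instead of hashing it.
            "hashFunction": "NONE",
        },
    }


def build_schema_fragment(schema_name: str, primary_key_columns: list) -> dict:
    """Return the schema part that declares the primary key used for dedup."""
    return {
        "schemaName": schema_name,
        "primaryKeyColumns": list(primary_key_columns),
    }


if __name__ == "__main__":
    # "events" and "event_id" are hypothetical names for illustration.
    print(json.dumps(build_dedup_table_fragment("events"), indent=2))
    print(json.dumps(build_schema_fragment("events", ["event_id"]), indent=2))
```

The Kafka topic still has to be partitioned by the same primary key, as noted above; nothing in the config can make up for that.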
a
@saurabh dubey Got it. Thanks.
s
@saurabh dubey If we enable upsert on a table with a primary key, does it also fix the duplication issue caused by Kafka?
s
If it's full upsert, then yes, only one row for a given PK will be present in the table (assuming querying without the skipUpsert flag). But if it's partial upsert, then it won't (depending on what the merge strategy for the different columns is)
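To make the full upsert case concrete, here is a rough Python model of what a query sees: one row per primary key, with the row carrying the greatest comparison value winning and ties going to the later-arriving row. This is only a toy simulation of the semantics, not how Pinot implements it; the column names `id` and `event_time` are hypothetical, and the `skip_upsert` argument just mimics the effect of the skipUpsert query flag.

```python
def full_upsert_view(rows, pk="id", cmp_col="event_time", skip_upsert=False):
    """Return the rows a query would see under full upsert (toy model).

    rows: list of dicts in arrival order.
    skip_upsert=True mimics querying with the skipUpsert flag: all rows show up.
    """
    if skip_upsert:
        return list(rows)

    latest = {}
    for row in rows:
        key = row[pk]
        current = latest.get(key)
        # Assumption: on equal comparison values the later arrival replaces
        # the earlier one, so exact duplicates collapse into a single row.
        if current is None or row[cmp_col] >= current[cmp_col]:
            latest[key] = row
    return list(latest.values())


if __name__ == "__main__":
    rows = [
        {"id": "a", "event_time": 1, "value": 10},
        {"id": "a", "event_time": 1, "value": 10},  # duplicate Kafka message
        {"id": "b", "event_time": 2, "value": 20},
    ]
    print(len(full_upsert_view(rows)))                    # rows visible by default
    print(len(full_upsert_view(rows, skip_upsert=True)))  # rows visible with skipUpsert
```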
s
For partial upsert as well: if primary_key and event_time are the same for 2 records, can't it just drop one of them, since both are identical?
And can we have deduplication and upsert on the same table, with different primary keys?
s
If the partial upsert strategy for each of the columns is OVERWRITE, UNION, IGNORE, MAX, or MIN, it should be able to dedup records, because applying the same record twice gives the same result. If it's INCREMENT or APPEND, it won't be able to dedup correctly. Dedup and upsert can't be enabled simultaneously.
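A small Python sketch of why that split falls where it does: a strategy handles a duplicate message safely only if merging a value with itself gives the same value back. The `merge` function below is a simplified stand-in for the strategies named above, not Pinot's actual merger code, and it assumes numeric values for MAX, MIN, and INCREMENT and list values for UNION and APPEND.

```python
def merge(strategy, previous, incoming):
    """Simplified model of a partial upsert merge for a single column."""
    if strategy == "OVERWRITE":
        return incoming
    if strategy == "IGNORE":
        return previous
    if strategy == "MAX":
        return max(previous, incoming)
    if strategy == "MIN":
        return min(previous, incoming)
    if strategy == "UNION":
        # Multi-value column: keep distinct values only.
        return sorted(set(previous) | set(incoming))
    if strategy == "APPEND":
        # Multi-value column: concatenate, duplicates included.
        return list(previous) + list(incoming)
    if strategy == "INCREMENT":
        return previous + incoming
    raise ValueError(f"unknown strategy: {strategy}")


def survives_duplicate(strategy, value):
    """True if replaying the same value leaves the column unchanged."""
    return merge(strategy, value, value) == value


if __name__ == "__main__":
    for strategy in ("OVERWRITE", "IGNORE", "MAX", "MIN", "INCREMENT"):
        print(strategy, survives_duplicate(strategy, 5))
    for strategy in ("UNION", "APPEND"):
        print(strategy, survives_duplicate(strategy, [1, 2]))
```

INCREMENT double-counts a replayed value and APPEND repeats it in the list, which is why duplicate Kafka messages stay visible with those two strategies.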
s
Thanks, that clears up a lot of confusion between upsert and dedup.