Does Pinot's batch insert have any way to avoid in...
# general
a
Does Pinot's batch insert have any way to avoid inserting duplicate data? Say that every day I want to batch-insert the previous day of data, and I have multiple batches of data per day (say each batch of data corresponds to data from a different ice cream flavor). If I'm generating + batch inserting yesterday's data for each ice cream flavor in parallel, and the "strawberry" job fails, so I rerun it, how do I make sure I'm not batch-inserting "strawberry" data that was already inserted?
k
Segment names are unique within a table. As long as your segment naming is idempotent across multiple runs, it will be fine.
So in your case, make sure you encode the value of the flavor in the segment name.
Then even if you push the same data again, the existing segment with that name will be overwritten.
a
Ok, super cool. So I just need to make sure I set the segment name correctly -- like in segmentNameGeneratorSpec?
k
Right
We typically use the date plus some kind of partition id as the convention.
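A minimal sketch of that convention, for illustration only. The table name `icecream_sales`, the helper `segment_name`, and the in-memory `segments` dict are all hypothetical; the dict is only a toy stand-in for the table's segment store and is not a Pinot API. The point is that a name derived purely from the date and the partition value is the same on every rerun, so a re-push replaces the earlier segment instead of adding a duplicate.

```python
from datetime import date


def segment_name(table: str, day: date, partition: str) -> str:
    """Build a deterministic segment name from table, date, and partition value."""
    # Normalize the partition value so reruns always produce the same name.
    safe = partition.strip().lower().replace(" ", "_")
    return f"{table}_{day.isoformat()}_{safe}"


# Toy stand-in for the table's segment store (hypothetical, not Pinot itself):
# segment name -> rows. Pushing under an existing name replaces the old entry.
segments = {}


def push(name, rows):
    segments[name] = rows


name = segment_name("icecream_sales", date(2021, 3, 14), "Strawberry")
push(name, ["partial-row"])          # first attempt, job fails midway
push(name, ["row-1", "row-2"])       # rerun pushes under the same name
```

After the rerun, the store still holds exactly one "strawberry" segment for that day, containing only the rows from the second push.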
a
Awesome, thank you
Also just for my understanding -- is there any point in time where partially complete segments or partially overwritten segments are visible to consumers?
k
Is this a hybrid or batch-only table?
a
I'm curious about the answer for both!
k
With batch only, it's visible as soon as a segment is pushed.
In hybrid, it's only visible after the time boundary moves from one day to the next.