Does Pinot's batch insert have any way to avoid inserting duplicate data? (Apache Pinot #general)


Aaron Wishnick

03/19/2021, 6:33 PM

Does Pinot's batch insert have any way to avoid inserting duplicate data? Say that every day I want to batch-insert the previous day of data, and I have multiple batches of data per day (say each batch corresponds to data for a different ice cream flavor). If I'm generating + batch inserting yesterday's data for each ice cream flavor in parallel, and the "strawberry" job fails, so I rerun it, how do I make sure I'm not batch-inserting "strawberry" data that was already inserted?

Kishore G

03/19/2021, 6:44 PM

Segment names are unique across the table. As long as you keep the segment names idempotent across multiple runs, it will be fine.

Kishore G

03/19/2021, 6:45 PM

So in your case, make sure you encode the value of the flavor in the segment name.

Kishore G

03/19/2021, 6:45 PM

So even if you push the same data again, the existing segment will be overwritten.
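To make the overwrite behavior described above concrete, here is a minimal sketch in Python. It is not Pinot code: the dictionary is a toy stand-in for the table's segment list, and `segment_name`, `push_segment`, and the `iceCreamSales` table name are hypothetical names used only for illustration. It shows that when the name is derived only from the table, the day, and the flavor, a rerun replaces the earlier segment instead of adding a duplicate.

```python
from datetime import date


def segment_name(table, day, flavor):
    """Build a deterministic name from table, day, and flavor (hypothetical convention)."""
    return f"{table}_{day.isoformat()}_{flavor}"


# Toy stand-in for the table's segments: segment name -> rows.
table_segments = {}


def push_segment(name, rows):
    """Pushing under an existing name replaces that segment rather than adding one."""
    table_segments[name] = rows


day = date(2021, 3, 18)
rows = [{"flavor": "strawberry", "sales": 10}]
name = segment_name("iceCreamSales", day, "strawberry")

push_segment(name, rows)  # first attempt, assume the job failed afterwards
push_segment(name, rows)  # rerun: same name, so the segment is replaced

print(name, len(table_segments))
```

The important property is that nothing run-specific (a timestamp, a random id, an attempt counter) ends up in the name.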

Aaron Wishnick

03/19/2021, 6:47 PM

Ok, super cool. So I just need to make sure I set the segment name correctly, like in segmentNameGeneratorSpec

Kishore G

03/19/2021, 6:49 PM

Right
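As a rough illustration of what that could look like, the sketch below builds a `segmentNameGeneratorSpec` fragment per flavor as a Python dict and prints it as JSON. The `normalizedDate` type and the `segment.name.prefix` key reflect my understanding of the ingestion job spec and should be checked against the docs for your Pinot version. The helper name and the `iceCreamSales` prefix are made up for this example, and it assumes each flavor's job reads the same input files on every run.

```python
import json


def segment_name_generator_spec(flavor):
    """Return a spec fragment whose prefix encodes the flavor (hypothetical helper)."""
    return {
        "type": "normalizedDate",  # assumed generator type; verify for your version
        "configs": {
            # Assumed config key; the flavor makes names distinct across parallel jobs.
            "segment.name.prefix": f"iceCreamSales_{flavor}",
        },
    }


spec = segment_name_generator_spec("strawberry")
print(json.dumps({"segmentNameGeneratorSpec": spec}, indent=2))
```

Because the prefix depends only on the flavor and the date part comes from the data itself, rerunning the strawberry job for the same day should produce the same segment name.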

Kishore G

03/19/2021, 6:50 PM

We typically use the date and some kind of partition id as the convention.

Aaron Wishnick

03/19/2021, 6:52 PM

Awesome, thank you

Aaron Wishnick

03/19/2021, 7:06 PM

Also just for my understanding -- is there any point in time where partially complete segments or partially overwritten segments are visible to consumers?

Kishore G

03/19/2021, 7:07 PM

Is this a hybrid or batch-only table?

Aaron Wishnick

03/19/2021, 7:40 PM

I'm curious about the answer for both!

Kishore G

03/19/2021, 7:40 PM

With batch only, it's visible as soon as a segment is pushed.

Kishore G

03/19/2021, 7:41 PM

In hybrid, it's only visible after the time boundary moves from one day to the next.
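The hybrid case can be sketched roughly as follows. This is a simplified model, not Pinot's implementation: it assumes daily pushes, that the time boundary sits one day before the latest offline segment's end day, and that queries for days at or before the boundary are served from offline data while later days are served from realtime data. The function names are hypothetical.

```python
from datetime import date, timedelta


def time_boundary(offline_end_days):
    """Assumed rule: boundary is one day before the latest offline segment end day."""
    return max(offline_end_days) - timedelta(days=1)


def served_from(day, boundary):
    """Days at or before the boundary come from offline data, later days from realtime."""
    return "offline" if day <= boundary else "realtime"


target = date(2021, 3, 18)

# Offline segments exist through 2021-03-18, so the boundary is 2021-03-17.
before = time_boundary([date(2021, 3, 17), date(2021, 3, 18)])

# Once a 2021-03-19 segment is pushed, the boundary moves to 2021-03-18.
after = time_boundary([date(2021, 3, 17), date(2021, 3, 18), date(2021, 3, 19)])

print(served_from(target, before), served_from(target, after))
```

Under these assumptions, a freshly pushed offline segment for a given day starts serving queries only once the next day's segment moves the boundary forward.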
