
Jonathan Meyer

05/14/2021, 10:47 AM
Hello! 👋 I've got the following scenario:
• Data is integrated in multiple batches per day (in an OFFLINE table)
  ◦ Batch 1: _01/01/2021 (data date) - DATA 1, DATA 3, DATA 6 -> `Segment_1(date=01/01/2021, data=[DATA 1, DATA 3, DATA 6])`_
  ◦ Batch 2: _01/01/2021 (data date) - DATA 2, DATA 4, DATA 5 -> `Segment_2(date=01/01/2021, data=[DATA 2, DATA 4, DATA 5])`_
• Data must be available ASAP, so 2 separate segments are generated & ingested into Pinot
• Some data needs to be corrected after the initial ingestion, say DATA 1 & DATA 2
I know it is possible to replace segments, but how can we handle replacing data across multiple segments? Can we generate a new segment with only the modified data and ignore the old data in the previous segments?
(`Segment_1` & `Segment_2`) -> `Segment_3(date=01/01/2021, data=[DATA 1, DATA 2])`
Or do we have to regenerate the 2 segments entirely? (If so, we need to identify what they contain.) Possibly after merging them?

Mayank

05/14/2021, 11:51 AM
You can regenerate the two segments (using the same names as the existing segments) and push them to Pinot. Currently this is not an atomic transaction, so there may be a small window when one segment is old and the other is new. A fix for this is being worked on. @Seunghyun

Jonathan Meyer

05/14/2021, 11:53 AM
Thanks @Mayank. So it is necessary to know the contents of the 2 segments and regenerate them with the same data as before (+ updates)? Sounds like this could be non-trivial in some cases.
What is Pinot's behavior when duplicated data exists? E.g. a 3rd segment with some data already present in the first 2. The notion of "duplicated" implies we have a primary key, which is not the case on an OFFLINE table IIRC, so I guess we would simply have "duplicated" rows.

Mayank

05/14/2021, 3:14 PM
Pinot won't know that it is duplicate data, and it will be included in query processing.
If you are generating daily segments, then replacing one day's segments should be straightforward.

Jonathan Meyer

05/14/2021, 4:11 PM
> If you are generating daily segments, then replacing one day's segments should be straightforward.
The difficulty is that not all of a day's data may / will arrive at the same time. Hence:
• Batch 1: _01/01/2021 (data date) - DATA 1, DATA 3, DATA 6 -> `Segment_1(date=01/01/2021, data=[DATA 1, DATA 3, DATA 6])`_
• Batch 2: _01/01/2021 (data date) - DATA 2, DATA 4, DATA 5 -> `Segment_2(date=01/01/2021, data=[DATA 2, DATA 4, DATA 5])`_
In the end, I feel like my question is "how can we update part of a segment?", and I feel like that's not possible. It looks like there are only 2 ways to reach my goal then:
1. Only have a single segment per day at a time, so:
   a. Drop the day's segment
   b. Regenerate the segment with updated data [99% of the data may not have changed, so pretty inefficient]
2. Identify impacted segments & regenerate only the impacted ones (in their entirety)
What do you think? 🙂
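Option 2 hinges on bookkeeping which records went into which segment. A minimal sketch, assuming you keep such a record-to-segment mapping yourself (Pinot does not expose one for OFFLINE tables; all names here are illustrative, not Pinot APIs):

```python
def impacted_segments(segment_contents, corrected_ids):
    """Return the names of segments containing any corrected record.

    segment_contents: dict mapping segment name -> iterable of record IDs
    corrected_ids: iterable of record IDs that were corrected upstream
    """
    corrected = set(corrected_ids)
    return sorted(
        name for name, ids in segment_contents.items()
        if corrected & set(ids)
    )

segment_contents = {
    "Segment_1": ["DATA 1", "DATA 3", "DATA 6"],
    "Segment_2": ["DATA 2", "DATA 4", "DATA 5"],
}

# Correcting DATA 1 and DATA 2 touches both segments, so both must be
# regenerated in their entirety.
print(impacted_segments(segment_contents, ["DATA 1", "DATA 2"]))
# → ['Segment_1', 'Segment_2']
```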

Mayank

05/14/2021, 4:17 PM
Is your offline pipeline not generating daily partitions? Typically, offline pipelines create time-partitioned folders, and generating a segment from one folder guarantees it will not overlap with other days.

Jonathan Meyer

05/14/2021, 4:24 PM
It is, but we have 3 additional constraints:
• Data for a given day can arrive in multiple parts (for the same day) [imagine the case with N timezones]
• Partial data needs to be available ASAP (can't wait for the other parts)
• Need to be able to update some data later on (doesn't need to be perfectly efficient, as it's clearly not an ideal case for OLAP)

Mayank

05/14/2021, 4:43 PM
Do you not have a realtime component? If you do, then you could serve data from realtime while your offline data settles.

Jonathan Meyer

05/14/2021, 4:44 PM
I feel like this would help, but no, data comes in batches from external sources... I'll keep that in mind, still.

Mayank

05/14/2021, 4:52 PM
What is the max delay for data to arrive? Does one day's worth of data settle in a day or so, or can it take several days / weeks?
Also, even if your incoming data is not partitioned, you can always generate segments that are guaranteed to contain data for only one day (e.g. pick several folders to scan and select data only for a single day to generate the input for a Pinot segment).
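That selection step can be sketched like this, assuming each batch is a list of (date, payload) rows — the row shape is made up for illustration, not a Pinot format:

```python
def select_day(batches, day):
    """Collect the rows belonging to one day across all batch files,
    so the resulting input covers exactly that day's data."""
    return [row for batch in batches for row in batch if row[0] == day]

# Two illustrative batches; batch_2 also carries a row for another day.
batch_1 = [("2021-01-01", "DATA 1"), ("2021-01-01", "DATA 3")]
batch_2 = [("2021-01-01", "DATA 2"), ("2021-01-02", "DATA 7")]

rows = select_day([batch_1, batch_2], "2021-01-01")
# rows now holds only 2021-01-01 data; this becomes the segment input
```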

Jonathan Meyer

05/14/2021, 4:55 PM
> What is the max delay for data to arrive? Does one day's worth of data settle in a day or so, or can it take several days / weeks?
Typically much less than a month but it is technically unbounded - customer data could theoretically be corrected months after first ingestion

Mayank

05/14/2021, 4:57 PM
When it arrives after a month, which folder does it land in? Is it in the correct date folder? Also, how do you know which older folders got changed?
Throwing out an idea: if you can compute the delta between what was pushed to Pinot as part of the daily push and the corrections accumulated so far across all days, you can keep one set of segments for the daily data, plus another set (perhaps very small, say 1 or 2 segments) that represents the delta across all days, and keep refreshing those delta segments.
This works if your delta is tiny, but may not scale if the delta is huge.
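The delta computation itself might look like this — a hedged sketch comparing what was originally pushed per key against the current source of truth; keys and values are illustrative:

```python
def compute_delta(pushed, current):
    """Rows whose value changed (or newly appeared) since the daily
    push; these would back the small, refreshable delta segment(s)."""
    return {k: v for k, v in current.items() if pushed.get(k) != v}

# What went into Pinot at daily-push time vs. the source of truth now.
pushed = {"DATA 1": "v1", "DATA 2": "v1", "DATA 3": "v1"}
current = {"DATA 1": "v2", "DATA 2": "v1", "DATA 3": "v1", "DATA 9": "v1"}

delta = compute_delta(pushed, current)
# → {'DATA 1': 'v2', 'DATA 9': 'v1'}
```

Note this only computes which rows belong in the delta set; as discussed above, queries would still need the daily segments and delta segments to not overlap, which is why the idea fits best when corrections are rare and small.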

Jonathan Meyer

05/14/2021, 4:59 PM
> Also, even if your incoming data is not partitioned, you can always generate segments to guarantee the data belongs to one day (eg pick several folders to scan and select data only for single day to generate input for pinot segment)
If I understand correctly, you're saying that after every batch, we regenerate the whole Pinot segment? For example, we've got a single file per batch, and after every batch we could regenerate a single Pinot segment from all of those files. Meaning we always keep a single Pinot segment (per day) at a time, and replacing it is straightforward.

Mayank

05/14/2021, 5:25 PM
Discussed offline, @Jonathan Meyer to summarize.

Jonathan Meyer

05/14/2021, 5:26 PM
Yes 🙂
Summary:
Context:
• Data comes in batches every day (for the same day)
• Each batch generates a new file
• Data must be available ASAP (i.e. we can't wait to have all the data before generating a segment)
• Data corrections can come in later (weeks)
Solution discussed:
• While every batch generates a new separate file, the goal is to keep a single Pinot segment per day at a time
• To do so, after every new batch:
  ◦ Merge every file for the day before calling CreateSegment, to generate a new segment containing all (existing) data for the day
    ▪︎ Later, a new feature will allow generating a single Pinot segment out of multiple input files, dropping the need for file concatenation
  ◦ This new segment replaces the existing one (for the day)
This solution means we only need to regenerate the single segment of each day impacted by a data correction. However, if data correction happens along a dimension other than time — say we have (date, entity, value) and we correct all values for a given entity — that will result in the regeneration of all segments.
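The merge step from the summary can be sketched as follows, assuming one CSV file per batch in a per-day folder (paths, naming, and the CSV format are hypothetical; the merged file would then be fed to segment creation, e.g. the CreateSegment admin command):

```python
from pathlib import Path

def merge_day_files(day_dir: Path, merged: Path) -> Path:
    """Concatenate every batch file for one day into a single input
    file, so the day is always represented by exactly one segment."""
    with merged.open("w") as out:
        # Sorted for a deterministic row order across regenerations.
        for batch_file in sorted(day_dir.glob("batch_*.csv")):
            out.write(batch_file.read_text())
    return merged
```

After each new batch lands, rerun the merge over all of the day's files and regenerate the day's segment from the merged output, replacing the previous segment of the same name.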
@Mayank Summary sounds OK?

Mayank

05/14/2021, 5:41 PM
Yes, thanks

reallyonthemove tous

10/13/2022, 12:32 AM
> a new feature will allow generating a single Pinot segment out of multiple input files, dropping the need for file concatenation
Are there any updates on this? I'm looking for something similar.