# general
e
Hi folks, here are some questions about ingesting data from Apache Iceberg into Pinot, and I would like to hear your advice. What I want is to ingest data from Apache Iceberg elegantly, e.g. use SQL to select some data in Iceberg and use the existing plugins to generate and push segments.
As I understand it, there are 3 steps to generate a segment in batch ingestion (rough sketch below):
1. Get the file list from inputDirURI
2. Use include/exclude patterns to filter the files
3. Use a Record Reader from pinot-input-format to read those files and generate the segment
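Conceptually, steps 1 and 2 look something like this (a minimal plain-Java sketch of the idea, not the actual SegmentGenerationJobRunner code; the directory path and glob pattern are made up):
```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListInputFiles {
  public static void main(String[] args) throws IOException {
    // Step 1: list everything under the configured input directory (stands in for inputDirURI)
    Path inputDir = Paths.get("/data/iceberg-export");
    // Step 2: keep only the files matching the include pattern
    PathMatcher include = FileSystems.getDefault().getPathMatcher("glob:**/*.parquet");
    try (Stream<Path> walk = Files.walk(inputDir)) {
      List<Path> inputFiles = walk.filter(Files::isRegularFile)
          .filter(include::matches)
          .collect(Collectors.toList());
      // Step 3: each remaining file would then be handed to a Record Reader to build a segment
      inputFiles.forEach(System.out::println);
    }
  }
}
```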
But since Iceberg has snapshots and versions, I cannot just configure an inputDirURI, because that may include out-of-date data files and cause chaos in the Pinot table.
I have a few possible solutions, but I don't know which of them has hidden problems.
The first is to use the Iceberg Java API to get the latest version and snapshot, then get the list of files I need. This solution does not require many code changes, just a few functions in SegmentGenerationJobRunner that can replace the file list. But it is an all-or-nothing solution, meaning I can only ingest the whole Iceberg table and cannot select rows using conditions.
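Something like this is what I have in mind (a rough sketch with the Iceberg Java API; the table location is made up, and exact APIs may vary by Iceberg version):
```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class ListIcebergDataFiles {
  public static void main(String[] args) throws Exception {
    // Load the table and plan a scan of the current (latest) snapshot
    Table table = new HadoopTables(new Configuration())
        .load("s3://my-bucket/warehouse/db/events");   // made-up location
    List<String> dataFiles = new ArrayList<>();
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        // Each task points at a data file belonging to the latest snapshot
        dataFiles.add(task.file().path().toString());
      }
    }
    // This list could then replace the inputDirURI-based listing in SegmentGenerationJobRunner
    dataFiles.forEach(System.out::println);
  }
}
```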
The second is an optimization of the first: I can do some work in the Record Reader, filtering the generic rows with a condition before generating and pushing the segment.
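Roughly what I mean by filtering in the Record Reader (a hypothetical decorator over Pinot's RecordReader SPI; the class name and predicate are made up, and the method signatures may differ slightly between Pinot versions):
```java
import java.io.File;
import java.io.IOException;
import java.util.Set;
import java.util.function.Predicate;
import org.apache.pinot.spi.data.readers.GenericRow;
import org.apache.pinot.spi.data.readers.RecordReader;
import org.apache.pinot.spi.data.readers.RecordReaderConfig;

/** Hypothetical decorator that drops rows not matching a predicate before segment generation. */
public class FilteringRecordReader implements RecordReader {
  private final RecordReader _delegate;
  private final Predicate<GenericRow> _filter;
  private GenericRow _pending;  // one-row lookahead so hasNext() stays accurate

  public FilteringRecordReader(RecordReader delegate, Predicate<GenericRow> filter) {
    _delegate = delegate;
    _filter = filter;
  }

  @Override
  public void init(File dataFile, Set<String> fieldsToRead, RecordReaderConfig recordReaderConfig)
      throws IOException {
    _delegate.init(dataFile, fieldsToRead, recordReaderConfig);
  }

  private void advance() throws IOException {
    // Pull rows from the delegate until one passes the filter (or the input is exhausted)
    while (_pending == null && _delegate.hasNext()) {
      GenericRow row = _delegate.next(new GenericRow());
      if (_filter.test(row)) {
        _pending = row;
      }
    }
  }

  @Override
  public boolean hasNext() {
    try {
      advance();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    return _pending != null;
  }

  @Override
  public GenericRow next() throws IOException {
    return next(new GenericRow());
  }

  @Override
  public GenericRow next(GenericRow reuse) throws IOException {
    // Note: this sketch ignores the reuse row and hands back the buffered one
    advance();
    GenericRow row = _pending;
    _pending = null;
    return row;
  }

  @Override
  public void rewind() throws IOException {
    _pending = null;
    _delegate.rewind();
  }

  @Override
  public void close() throws IOException {
    _delegate.close();
  }
}
```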
The third is to use the Iceberg Java API and Spark to read the table at the row level and generate segments. I don't think this is a good solution, because it breaks the whole batch ingestion design, which means a lot of work to finish it, and it also lacks compatibility with other formats. So I put it last.
I think the second is better, but I am not sure whether it has hidden problems I don't know about, or whether a better solution already exists. I would appreciate any advice.
x
@Saurabh Lambe may have some ideas
👍 1
s
Hi @Xiang Fu, I may take a look at this, but did you mean the other Saurabh @saurabh dubey?
x
ah, my bad 😛
s
@Ehsan Irshad We had a similar problem statement, but for another popular data lake storage format, Delta. We went with more or less the first approach: on each minion task run, we get the changelog (the list of added/removed files) using Delta's Java library, create segments for the added parquet files, and use the segment replace protocol in Pinot to atomically add all the new segments and remove the segments corresponding to the removed files. Like you mentioned, this approach is good for cases where we want to keep the Pinot table always in sync with the data lake's latest version, which was exactly our problem statement. A similar approach should work for other copy-on-write storage formats (I know Iceberg supports both COW and MOR (merge-on-read), so you might have to be careful about the MOR transaction logs).
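For the Iceberg side, a rough analogue of that per-run changelog step could look like this (a sketch only: it uses the older TableScan appendsBetween API, only captures appended files, and the snapshot-id bookkeeping plus the segment-replace controller calls are assumed rather than shown):
```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class IncrementalIcebergSync {
  public static void main(String[] args) throws Exception {
    Table table = new HadoopTables(new Configuration())
        .load("s3://my-bucket/warehouse/db/events");        // made-up location
    long lastProcessedSnapshotId = Long.parseLong(args[0]); // persisted from the previous task run
    long currentSnapshotId = table.currentSnapshot().snapshotId();

    // Data files appended between the last processed snapshot and the current one
    List<String> addedFiles = new ArrayList<>();
    try (CloseableIterable<FileScanTask> tasks =
        table.newScan().appendsBetween(lastProcessedSnapshotId, currentSnapshotId).planFiles()) {
      for (FileScanTask task : tasks) {
        addedFiles.add(task.file().path().toString());
      }
    }
    // Next steps (not shown): build segments for addedFiles, swap them in atomically via the
    // segment replace protocol, and persist currentSnapshotId for the next run. Removed or
    // rewritten files need extra handling, especially for MOR tables.
    addedFiles.forEach(System.out::println);
  }
}
```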
m
@Eric Song ^^
😀 2
e
interesting 😄