# general
e
Hi folks, here are some questions about ingesting data from Apache Iceberg into Pinot, and I would like to hear your advice. What I want is to ingest data from Apache Iceberg elegantly, e.g. use SQL to select some data in Iceberg and use the existing plugins to generate and push segments.
As I understand it, there are 3 steps to generate a segment in batch ingestion (rough sketch below):
1. Get the file list from inputDirURI
2. Use include/exclude patterns to filter the files
3. Use a Record Reader from pinot-input-format to read those files and generate the segment
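Conceptually, steps 1 and 2 look something like this (a minimal plain-Java sketch of the idea, not the actual SegmentGenerationJobRunner code; the directory path and glob pattern are made up):
```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListInputFiles {
  public static void main(String[] args) throws IOException {
    // Step 1: list everything under the configured input directory (stands in for inputDirURI)
    Path inputDir = Paths.get("/data/iceberg-export");
    // Step 2: keep only the files matching the include pattern
    PathMatcher include = FileSystems.getDefault().getPathMatcher("glob:**/*.parquet");
    try (Stream<Path> walk = Files.walk(inputDir)) {
      List<Path> inputFiles = walk.filter(Files::isRegularFile)
          .filter(include::matches)
          .collect(Collectors.toList());
      // Step 3: each remaining file would then be handed to a Record Reader to build a segment
      inputFiles.forEach(System.out::println);
    }
  }
}
```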
But since Iceberg has snapshots and versions, I cannot just configure an inputDirURI, because that may include out-of-date data files and cause chaos in the Pinot table.
I have a few possible solutions, but I don't know which of them has hidden problems.
The first is to use the Iceberg Java API to get the latest version and snapshot, then get the list of files I need. This solution does not require many code changes, just a few functions in SegmentGenerationJobRunner that can replace the file list. But it is an all-or-nothing solution, meaning I can only ingest the whole Iceberg table and cannot select rows using conditions.
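Something like this is what I have in mind (a rough sketch with the Iceberg Java API; the table location is made up, and exact APIs may vary by Iceberg version):
```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class ListIcebergDataFiles {
  public static void main(String[] args) throws Exception {
    // Load the table and plan a scan of the current (latest) snapshot
    Table table = new HadoopTables(new Configuration())
        .load("s3://my-bucket/warehouse/db/events");   // made-up location
    List<String> dataFiles = new ArrayList<>();
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        // Each task points at a data file belonging to the latest snapshot
        dataFiles.add(task.file().path().toString());
      }
    }
    // This list could then replace the inputDirURI-based listing in SegmentGenerationJobRunner
    dataFiles.forEach(System.out::println);
  }
}
```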
The second is an optimization of the first: I can do some work in the Record Reader, filtering the generic rows with a condition before generating and pushing the segment.
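Roughly what I mean by filtering in the Record Reader (a hypothetical decorator over Pinot's RecordReader SPI; the class name and predicate are made up, and the method signatures may differ slightly between Pinot versions):
```java
import java.io.File;
import java.io.IOException;
import java.util.Set;
import java.util.function.Predicate;
import org.apache.pinot.spi.data.readers.GenericRow;
import org.apache.pinot.spi.data.readers.RecordReader;
import org.apache.pinot.spi.data.readers.RecordReaderConfig;

/** Hypothetical decorator that drops rows not matching a predicate before segment generation. */
public class FilteringRecordReader implements RecordReader {
  private final RecordReader _delegate;
  private final Predicate<GenericRow> _filter;
  private GenericRow _pending;  // one-row lookahead so hasNext() stays accurate

  public FilteringRecordReader(RecordReader delegate, Predicate<GenericRow> filter) {
    _delegate = delegate;
    _filter = filter;
  }

  @Override
  public void init(File dataFile, Set<String> fieldsToRead, RecordReaderConfig recordReaderConfig)
      throws IOException {
    _delegate.init(dataFile, fieldsToRead, recordReaderConfig);
  }

  private void advance() throws IOException {
    // Pull rows from the delegate until one passes the filter (or the input is exhausted)
    while (_pending == null && _delegate.hasNext()) {
      GenericRow row = _delegate.next(new GenericRow());
      if (_filter.test(row)) {
        _pending = row;
      }
    }
  }

  @Override
  public boolean hasNext() {
    try {
      advance();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    return _pending != null;
  }

  @Override
  public GenericRow next() throws IOException {
    return next(new GenericRow());
  }

  @Override
  public GenericRow next(GenericRow reuse) throws IOException {
    // Note: this sketch ignores the reuse row and hands back the buffered one
    advance();
    GenericRow row = _pending;
    _pending = null;
    return row;
  }

  @Override
  public void rewind() throws IOException {
    _pending = null;
    _delegate.rewind();
  }

  @Override
  public void close() throws IOException {
    _delegate.close();
  }
}
```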
The third is to use the Iceberg Java API and Spark to read the table at the row level and generate segments. I don't think this is a good solution, because it breaks the whole batch ingestion design, which means a lot of work to finish it, and it also lacks compatibility with other formats. So I put it last.
I think the second is better, but I am not sure whether it has hidden problems I don't know about, or whether a better solution already exists. I would appreciate any advice.
x
@Saurabh Lambe may have some ideas
👍 1
s
Hi @Xiang Fu, I may take a look at this, but did you mean the other Saurabh @saurabh dubey?
x
ah, my bad 😛
s
@Ehsan Irshad We had a similar problem statement, but for another popular data lake storage format, Delta. We went with more or less the first approach: on each minion task run, we get the changelog (the list of added/removed files) using Delta's Java library, create segments for the added parquet files, and use the segment replace protocol in Pinot to atomically add all the new segments and remove the segments corresponding to the removed files. Like you mentioned, this approach is good for cases where we want to keep the Pinot table always in sync with the data lake's latest version, which was exactly our problem statement. A similar approach should work for other copy-on-write storage formats (I know Iceberg supports both COW and MOR (merge-on-read), so you might have to be careful about the MOR transaction logs).
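For the Iceberg side, a rough analogue of that per-run changelog step could look like this (a sketch only: it uses the older TableScan appendsBetween API, only captures appended files, and the snapshot-id bookkeeping plus the segment-replace controller calls are assumed rather than shown):
```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class IncrementalIcebergSync {
  public static void main(String[] args) throws Exception {
    Table table = new HadoopTables(new Configuration())
        .load("s3://my-bucket/warehouse/db/events");        // made-up location
    long lastProcessedSnapshotId = Long.parseLong(args[0]); // persisted from the previous task run
    long currentSnapshotId = table.currentSnapshot().snapshotId();

    // Data files appended between the last processed snapshot and the current one
    List<String> addedFiles = new ArrayList<>();
    try (CloseableIterable<FileScanTask> tasks =
        table.newScan().appendsBetween(lastProcessedSnapshotId, currentSnapshotId).planFiles()) {
      for (FileScanTask task : tasks) {
        addedFiles.add(task.file().path().toString());
      }
    }
    // Next steps (not shown): build segments for addedFiles, swap them in atomically via the
    // segment replace protocol, and persist currentSnapshotId for the next run. Removed or
    // rewritten files need extra handling, especially for MOR tables.
    addedFiles.forEach(System.out::println);
  }
}
```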
m
@Eric Song ^^
😀 2
e
interesting 😄