Eric Song
10/20/2022, 7:43 AM
Eric Song
10/20/2022, 7:44 AM
1. Get the file list from inputDirURI
2. Use the include/exclude patterns to filter the files
3. Use the Record Reader in pinot-input-format to read these files and generate segments
But since Iceberg has snapshots and versions, I cannot just configure an inputDirURI, because that may include some outdated data and cause chaos in the Pinot table.
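As a rough illustration of steps 1 and 2 above, here is a minimal sketch of include/exclude filtering over an already-listed set of files. It assumes glob-style patterns resolved through the JDK's `PathMatcher`; the class name `InputFileFilter` and the method `filterFiles` are hypothetical and are not taken from the Pinot codebase. Plain local-style paths are used, since remote URIs would need a different matcher input.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not a Pinot class).
public class InputFileFilter {

  /**
   * Keeps the files that match includePattern (when it is non-null)
   * and do not match excludePattern (when it is non-null).
   * Patterns use the JDK syntax, e.g. "glob:**&#47;*.parquet".
   */
  public static List<String> filterFiles(
      List<String> files, String includePattern, String excludePattern) {
    PathMatcher include = includePattern == null
        ? null : FileSystems.getDefault().getPathMatcher(includePattern);
    PathMatcher exclude = excludePattern == null
        ? null : FileSystems.getDefault().getPathMatcher(excludePattern);

    List<String> kept = new ArrayList<>();
    for (String file : files) {
      Path path = Paths.get(file);
      if (include != null && !include.matches(path)) {
        continue; // not selected by the include pattern
      }
      if (exclude != null && exclude.matches(path)) {
        continue; // explicitly excluded
      }
      kept.add(file);
    }
    return kept;
  }
}
```

Note that this kind of filtering only looks at file names, which is why it cannot tell a file that belongs to the current Iceberg snapshot from a stale one sitting in the same directory.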
Eric Song
10/20/2022, 7:44 AM
First is using the Iceberg Java API to get the newest version and snapshot, then get the list of files I need. This solution doesn't require many code changes: just add some functions in SegmentGenerationJobRunner that can replace the file list. But it is an all-or-nothing solution, which means I can only ingest the whole Iceberg table and cannot select a subset using conditions.
Second is an optimization of the first: I can add logic in the Record Reader that uses some conditions to filter the generic records, then generate the segments and push them.
Third is using the Iceberg Java API and Spark to read the table at row level and generate segments. I don't think this is a good solution, because it breaks the whole batch ingestion design, which means I would need to do a lot of work to finish it, and it also lacks compatibility with other formats. So I put it last.
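To make the difference between the first two options concrete, here is a dependency-free sketch. It deliberately does not call the Iceberg or Pinot APIs: the map of snapshot id to file paths is a hypothetical stand-in for table metadata, and `SnapshotFileList`, `filesToIngest`, and `filterRecords` are made-up names. With the real Iceberg Java API, the file list would presumably come from a table scan such as `table.newScan().planFiles()` rather than from a map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model, for illustration only (not Iceberg or Pinot code).
public class SnapshotFileList {

  /**
   * Option 1: replace the directory listing with the files that belong
   * to the current snapshot, so stale files from older snapshots are skipped.
   */
  public static List<String> filesToIngest(
      Map<Long, List<String>> filesBySnapshot, long currentSnapshotId) {
    List<String> current = filesBySnapshot.get(currentSnapshotId);
    if (current == null) {
      throw new IllegalArgumentException("Unknown snapshot id: " + currentSnapshotId);
    }
    return new ArrayList<>(current);
  }

  /**
   * Option 2: on top of option 1, drop records that do not satisfy a
   * condition while reading, before they are turned into a segment.
   */
  public static List<Map<String, Object>> filterRecords(
      List<Map<String, Object>> records, Predicate<Map<String, Object>> condition) {
    List<Map<String, Object>> kept = new ArrayList<>();
    for (Map<String, Object> record : records) {
      if (condition.test(record)) {
        kept.add(record);
      }
    }
    return kept;
  }
}
```

In this model, option 1 changes only which files are read, while option 2 still reads every file of the snapshot and discards rows afterwards, so it narrows what is ingested but does not reduce the read cost.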
Eric Song
10/20/2022, 7:44 AM
Xiang Fu
Saurabh Lambe
10/20/2022, 8:53 AM
Xiang Fu
saurabh dubey
10/20/2022, 9:21 AM
Mayank
Ehsan Irshad
10/21/2022, 3:07 AM