Hey everyone, we are using the iceberg sink to write data to … | Apache Flink #random

Bhupendra Yadav

12/01/2023, 11:29 AM

Hey everyone, we are using the Iceberg sink to write data to an Iceberg table (Hive catalog, S3 storage) in table format v2 with upsert enabled. We have written ~25 million records to a daily-partitioned table whose sort order is on 3 fields (all strings, at most 32 characters each). We are only in the POC phase, and the data was written successfully (just 25M records in one day partition, no other data). The checkpoint interval is 60 s, and around 350 files have been written to S3, each ranging from ~10 KB to ~3 MB.

Issue: when we try to read data from this table using Trino, our Trino workers (4 workers, 20 GB each) use up all their memory, major GCs trigger frequently, and eventually the workers die. At first I thought we might have a problem with our Trino infra, so I wrote a Flink job that reads the data with a simple query: select a, b, c, count(1) from table group by a, b, c; (this returns at most 5-6 rows). Surprisingly, we hit the same issue: heap OOM in the TaskManager.

Then I came across this doc, https://iceberg.apache.org/docs/latest/flink-actions/, on rewriting small files into larger ones. So I wrote another Flink job with parallelism 4 and a TaskManager with 16 GB of memory and 8 CPU cores to perform this rewrite action. The job runs for some time and then the TM dies; TM CPU goes to 100% and heap usage to ~95%. I went through a bunch of GitHub issues like https://github.com/apache/iceberg/issues/6104 but couldn't figure out what is causing the problem.

Flink version: 1.16.0
Iceberg version: 1.4.1
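A rough sketch of the commit cadence implied by a 60 s checkpoint interval, assuming the Flink Iceberg sink commits once per completed checkpoint. The metadata count uses a hypothetical simplified model in which each commit writes one table metadata file, one manifest list and a small number of manifests; `commits_per_day` and `metadata_files_per_day` are illustrative helpers, not Iceberg APIs.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def commits_per_day(checkpoint_interval_s: int) -> int:
    """Upper bound on commits per day, assuming one Iceberg commit per completed checkpoint."""
    return SECONDS_PER_DAY // checkpoint_interval_s


def metadata_files_per_day(checkpoint_interval_s: int, manifests_per_commit: int = 1) -> int:
    """Hypothetical estimate: each commit writes one table metadata file,
    one manifest list and `manifests_per_commit` manifests."""
    return commits_per_day(checkpoint_interval_s) * (2 + manifests_per_commit)


print(commits_per_day(60))                                 # 1440
print(metadata_files_per_day(60))                          # 4320
print(metadata_files_per_day(60, manifests_per_commit=2))  # 5760 (data + delete manifests)
```

In format v2, data files and delete files are tracked in separate manifests, so with upsert enabled a commit that adds delete files would typically add a delete manifest as well; that is why the last call assumes two manifests per commit.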

Bhupendra Yadav

12/01/2023, 11:30 AM

Rewrite action job screenshot (error):
Caused by: java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id redacted-taskmanager-1-4 timed out.

Giannis Polyzos

12/01/2023, 3:19 PM

My initial thought would be that this is because the framework tries to read way too many files. Iceberg doesn't have automatic compaction, for example, and having too many files (both data files and metadata files) is quite problematic. At the same time, you need to make MoR/CoW trade-offs, and overall the performance is poor. Maybe it's worth giving Apache Paimon (https://paimon.apache.org/) a try as well: it's the best connector to Flink and it also provides automatic compaction out of the box (just a thought).

Giannis Polyzos

12/01/2023, 3:19 PM

The file metadata layout is the same as Iceberg's, but it's native to Flink and uses an LSM tree for better performance.

Bhupendra Yadav

12/01/2023, 5:07 PM

Thank you, I will check it out. But I was wondering 🤔 whether 350 files are really too many to process. Maybe we are missing something very obvious 😕
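A quick back-of-the-envelope check on the "are 350 files too many?" question, using only the numbers from the first message (350 files of ~10 KB to ~3 MB, 4 Trino workers with 20 GB each) and assuming decimal units. Data files are usually compressed, so their in-memory size will be larger than these on-disk figures, but even the upper bound is a small fraction of the total worker heap. That hints that file count and raw data volume alone are unlikely to explain the OOMs. `on_disk_bounds` is a hypothetical helper.

```python
N_FILES = 350
MIN_FILE_BYTES = 10 * 1000             # ~10 KB, smallest file reported
MAX_FILE_BYTES = 3 * 1000 * 1000       # ~3 MB, largest file reported
CLUSTER_HEAP_BYTES = 4 * 20 * 1000**3  # 4 Trino workers x 20 GB


def on_disk_bounds(n_files: int, min_bytes: int, max_bytes: int) -> tuple[int, int]:
    """Lower and upper bound of total on-disk size if every file sat at one extreme."""
    return n_files * min_bytes, n_files * max_bytes


low, high = on_disk_bounds(N_FILES, MIN_FILE_BYTES, MAX_FILE_BYTES)
print(low, high)                                                 # 3500000 1050000000
print(f"{high / CLUSTER_HEAP_BYTES:.1%} of total worker heap")  # 1.3% of total worker heap
```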

Ramesh Motaparthy

02/20/2024, 3:00 PM

@Bhupendra Yadav any updates on the above? Interested to know what route you went and if you solved the above issue

Bhupendra Yadav

02/23/2024, 7:40 AM

Hey, I created this GitHub issue: https://github.com/apache/iceberg/issues/9193#issuecomment-1842146242, where they correctly pointed out that the problem is due to a lot of deletes in the metadata files. Even though our upsert key was chosen so that there would not be many upserts, we were still seeing a lot of upserts in the metadata files. We weren't able to figure out why, and due to the time constraints of the task, we didn't go ahead with Iceberg.
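A simplified model of why deletes can pile up even when keys rarely repeat. To our understanding of Iceberg's Flink upsert writer, every incoming row first deletes its key: if the key was already written since the last checkpoint, the old row is dropped with a position delete; otherwise the writer cannot know whether older files contain the key, so it emits an equality delete. `UpsertWriterModel` is a hypothetical toy (one writer, no partitions, counters instead of files), not Iceberg code. `raw_delete_key_bytes` is a hypothetical helper assuming 3 string key fields of at most 32 ASCII characters, as described in the first message; readers that apply equality deletes generally need the delete keys in memory, with JVM object overhead on top of this raw figure.

```python
from dataclasses import dataclass, field


@dataclass
class UpsertWriterModel:
    """Toy model of one upsert-mode writer (hypothetical, simplified)."""
    data_rows: int = 0
    position_deletes: int = 0
    equality_deletes: int = 0
    _keys_this_checkpoint: set = field(default_factory=set, repr=False)

    def upsert(self, key: tuple) -> None:
        if key in self._keys_this_checkpoint:
            # Key written earlier in this checkpoint: drop that row by position.
            self.position_deletes += 1
        else:
            # Older files are unknown to the writer: delete by key equality.
            self.equality_deletes += 1
            self._keys_this_checkpoint.add(key)
        self.data_rows += 1

    def checkpoint(self) -> None:
        # After a commit, the writer no longer tracks which keys it wrote.
        self._keys_this_checkpoint.clear()


def raw_delete_key_bytes(n_keys: int, fields: int = 3, max_chars: int = 32) -> int:
    """Upper bound on raw key payload: 1 byte per char, no JVM object overhead."""
    return n_keys * fields * max_chars


writer = UpsertWriterModel()
for cp in range(10):
    for i in range(100):
        writer.upsert((f"a{cp}-{i}", "b", "c"))  # distinct keys, no real updates
    writer.checkpoint()

print(writer.data_rows, writer.equality_deletes, writer.position_deletes)  # 1000 1000 0
print(raw_delete_key_bytes(25_000_000))                                    # 2400000000
```

In this model, 1,000 distinct keys spread over 10 checkpoints produce 1,000 equality deletes without a single actual update, and 25 million such keys correspond to roughly 2.4 GB of raw key data before any JVM overhead.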
