Distributed Data Community #apache-iceberg

jay

05/20/2024, 7:49 PM

Hey folks! Has anyone tried using the `daft.read_iceberg`/`daft.write_iceberg` APIs? Curious to hear if folks have run into any problems or have feature requests here

jay

05/30/2024, 5:31 PM

Talk recording from Iceberg Summit 2024:

https://youtu.be/W-JxwRadGiY?si=5pu4h2H8tPPgSX5h

🙌 2

Geronimo Gil

07/25/2024, 11:36 AM

hi all! With pyiceberg 0.7 around the corner, I was wondering if the latest version of daft will be compatible with it. Thanks in advance!

ChanChan Mao

09/24/2024, 4:00 PM

@channel Hey folks! We’re partnering with Apache Iceberg to co-host the next Bay Area Apache Iceberg Community Meetup on Monday, November 4 in San Francisco! If you're interested in giving a talk and sharing your Apache Iceberg best practices and expertise with the community, submit your talk by Wed, Oct 9: https://forms.gle/ibj6Yhfv2WDhqAXC6 More event details coming soon…

🙌 7

ChanChan Mao

10/18/2024, 4:00 PM

@channel As promised, more event details about the next Apache Iceberg Community Meetup! We have a very exciting lineup featuring speakers from Daft, AWS, RisingWave, Netflix, and HANSETAG sharing their experiences and developments with Apache Iceberg -- such as implementing Iceberg in a distributed fashion, improvements in Iceberg FileIO, building an Iceberg connector in Rust, Netflix’s journey from Hive to Iceberg, and building a Rust-native modular Iceberg REST Catalog. Head over to our Luma page for more information about all the talks! https://lu.ma/fholq6oz

ChanChan Mao

11/14/2024, 12:50 AM

@channel @Kevin Wang’s talk from the recent Apache Iceberg Community Meetup was posted!

https://youtu.be/-2Vd02A_Jy4

He walked through how we adapted PyIceberg for distributed workloads and introduced workarounds we implemented for challenges with using existing Python/Rust Iceberg tooling. Learn about what it means for an Iceberg library to provide useful abstractions while giving the query engine proper control over execution.

🧊 2

🙌 3

ChanChan Mao

11/14/2024, 12:50 AM

You can find recordings of the other presentations on the Apache Iceberg Meetup YouTube channel: https://www.youtube.com/@IcebergMeetup

Geronimo Gil

02/13/2025, 1:06 PM

hi all! is there a way to estimate how much parquet disk space would be used if we write a daft dataframe? If so, it would be really nice to have a `df.into_partitions()` inside `df.write_iceberg()` to try to get parquet files of size close to the iceberg table property `write.target-file-size-bytes` (default is 512 MiB)
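A rough sketch of the arithmetic behind this request, assuming you already have some estimate of the written Parquet size in bytes. The helper name `partitions_for_target` and the estimate itself are hypothetical; the thread does not confirm that Daft exposes such an estimate, and in-memory size is usually not the same as compressed Parquet size. Only `df.into_partitions()` and `df.write_iceberg()` come from the message above.

```python
import math

# Default of the Iceberg table property write.target-file-size-bytes (512 MiB),
# as quoted in the message above.
DEFAULT_TARGET_FILE_SIZE_BYTES = 512 * 1024 * 1024


def partitions_for_target(
    estimated_parquet_bytes: int,
    target_file_size_bytes: int = DEFAULT_TARGET_FILE_SIZE_BYTES,
) -> int:
    """Hypothetical helper: partition count that puts each file near the target size.

    `estimated_parquet_bytes` is an assumption supplied by the caller, for
    example measured from a small sample write and scaled up.
    """
    if estimated_parquet_bytes < 0:
        raise ValueError("estimated_parquet_bytes must be non-negative")
    if target_file_size_bytes <= 0:
        raise ValueError("target_file_size_bytes must be positive")
    # Round up so no file exceeds the target; always keep at least one partition.
    return max(1, math.ceil(estimated_parquet_bytes / target_file_size_bytes))


# Assumed usage with Daft (not executed here):
#   n = partitions_for_target(my_size_estimate)
#   df.into_partitions(n).write_iceberg(table)
```

With this, a 10 GiB estimate and the default target would give 20 partitions. Whether each partition ends up as exactly one file depends on the writer, so treat the result as a starting point.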

Sandeep Devarapalli

04/25/2025, 1:50 PM

And this is why OLake (Open Source) is fast! Here's something for your weekend read: Exploring OLake's Architecture. If you're diving into real-time data replication or building modern data lakehouse architectures with Apache Iceberg, we've just shared an in-depth look at how OLake actually works behind the scenes. Whether your stack includes MongoDB, PostgreSQL, or MySQL, and you're targeting formats like Apache Iceberg or Parquet, this article has practical insights on designing scalable, efficient data pipelines. OLake is an open-source tool specifically built for high-speed data ingestion.

Key Highlights:

⚡ Speed: Load data 4x to 10x faster compared to traditional ETL tools.
🕒 Real-Time CDC: Minimal-lag Change Data Capture from MongoDB, PostgreSQL, and MySQL.
🧩 Plug-and-Play Architecture: Cleanly separated core, drivers, and writers make extending OLake straightforward.
📊 Schema Flexibility: Seamlessly handles schema evolution and type changes compatible with Apache Iceberg.
🔄 Reliable Syncs: Built-in state management means your sync operations can resume effortlessly if interrupted.

https://olake.io/blog/olake-architecture-deep-dive

Sandeep Devarapalli

05/08/2025, 10:01 AM

🚨 Benchmark Alert: OLake is rewriting the rules. 🚨 We ran head-to-head sync benchmarks and here’s what shook out:

✅ 100× faster than Airbyte
✅ 99× cheaper than Fivetran
✅ 3× faster than Debezium
✅ 11× faster and far cheaper than Estuary

OLake synced 4 billion rows for only $75. Competitors? Either took hours… or cost thousands. 😳 You seriously need to take a closer look at OLake. Happy to share details or set up a deeper dive, just ping me. More details here: https://olake.io/docs/connectors/postgres/benchmarks

Giridhar Pathak

05/25/2025, 10:20 PM

Trouble querying iceberg table using daft.