Distributed Data Community #announcements

Zapier

03/25/2025, 8:32 PM

We're hosting a meetup with Ray this Thursday 3/27! Scaling Data Processing and ML Training with Daft + Ray. We are excited to invite you to our upcoming event, co-hosted with Ray, where we delve into the intricacies of scaling data processing and machine learning training using Daft + Ray. This is a fantastic opportunity to gain insights from engineers from Daft and Ray and to learn more about how to develop AI applications from local to production! Agenda: • 5pm: Doors open and networking • 6pm: Talk by Daft software engineer Desmond Cheong • 6:30pm: Talk by Anyscale product manager Ricardo Decal • 7pm: Networking & Pizza. Enjoy pizza as you engage with the local AI community. Don’t miss the chance to connect and learn from each other! https://lu.ma/u5p41kte

ChanChan Mao

03/26/2025, 6:36 PM

The Daft team is going to Iceberg Summit 2025 🧊! Explore the complexities of S3 Tables and their advantages, such as native support for Iceberg and Parquet. While S3 Tables have their perks, are they easy to use beyond AWS services? Discover how Daft's Catalog API enables seamless interaction with S3 Tables, allowing you to read them as Daft DataFrames and extend our existing Iceberg read and write capabilities. Register now if you haven't already and come check out our talk on April 8 at 10:30am ⤵️ https://www.icebergsummit2025.com/

ChanChan Mao

03/26/2025, 6:42 PM

We're also co-hosting a happy hour with AWS and UST to continue the conversations after Iceberg Summit Day 1, walking distance from the conference venue. This is an exclusive invite-only happy hour and we'd love folks from our Daft community to attend! https://lu.ma/ti0nkvp6

ChanChan Mao

03/28/2025, 8:27 PM

📢 Upcoming Webinar: Beyond JVMs – Reinventing Catalogs with Daft & Delta Lake Bogged down by JVM dependencies? Python users, this one’s for you! With tools like PyIceberg, you can now interact with Iceberg and Delta Lake without Java dependencies—but what if you could take it even further? 🤔 Daft is redefining how users interact with tables and catalogs natively in Python. Join us on Monday, April 7 at 10:00 AM PT to explore: ✅ How Daft enables fast, parallel reading from Delta Lake with data skipping optimizations ✅ How its full-featured DataFrame API makes ML/AI data transformation seamless ✅ How Daft unifies modern data and ML stacks, simplifying the path from raw data to model ingestion 🔗 Register now → https://lu.ma/BeyondJVMs

Zapier

04/07/2025, 3:00 PM

New video out! Scaling Data Processing and ML Training with Daft + Ray. Co-hosted with Ray, delve into the intricacies of scaling data processing and machine learning training using Daft + Ray. Hear from Desmond Cheong https://www.linkedin.com/in/desmondcheongzx/, Software Engineer at Daft, and Ricardo Decal https://www.linkedin.com/in/richarddecal/, Product Manager at Anyscale, to learn more about how to develop AI applications from local to production! Scaling Data Processing and ML Training with Daft + Ray, hosted by Eventual Computing (the team building Daft). 💜 Get to know Daft ‣ Learn more about Daft: https://www.getdaft.io/ ‣ Join our Distributed Data Slack Community: https://join.slack.com/t/dist-data/shared_invite/zt-1t44ss4za-1rtsJNIsQOnjlf8BlG05yw ‣ Star Daft on GitHub: https://github.com/Eventual-Inc/Daft ‣ Subscribe to the Daft Engineering Blog: https://blog.getdaft.io/ #daft #distributed #dataframe #data #dataengineering 0:00 Daft Session by Desmond Cheong 27:30 Ray Session by Ricardo Decal

https://www.youtube.com/watch?v=3JWrg1DitaA

daft bro 3

ChanChan Mao

04/22/2025, 3:22 PM

Want to find the best developers on GitHub? Try out sashimi4talent.com 🐠 (For some reason, all of our projects end up fish-related...) As a fun internal hackathon project (motivated by an omakase dinner as the prize), in 2 days Colin and Sammy created Sashimi 4 Talent by cloning 15K+ repositories, analyzing 33M+ commits, and creating a dataset of 250K+ contributors -- all using Daft! Read our technical blog about how this complex project, combining data engineering, batch inference, and analytics, was made very easy with Daft: https://blog.getdaft.io/p/we-cloned-over-15000-repos-to-find

ChanChan Mao

04/24/2025, 3:44 PM

We're LIVE on the Unity Catalog webinar! 🙂 LinkedIn

YouTube

❤️ 4

ChanChan Mao

04/29/2025, 6:27 PM

🚗 Solving Critical Data Challenges in Autonomous Vehicle Development Mobileye’s autonomous driving technology processes massive amounts of camera data where critical information like lane markings occupy only a fraction of each frame. This sparse data structure created a significant challenge - while their storage formats compressed data on disk, files would expand up to 500x when loaded into memory, causing crashes and performance bottlenecks. After exploring various solutions, Mobileye turned to Daft and made several contributions to develop a custom approach called "Sparse Tensor Delta Encoding." By storing only the differences between consecutive indices of non-zero values, they achieved exceptional compression while maintaining processing speed. The results have been transformative: enhanced parsing speed, storage requirements reduced from 60MB to just 117KB per data sample (a 500x improvement), and significantly improved memory efficiency during global operations. This has allowed Mobileye to process more autonomous driving data faster than ever before without the memory crashes that previously hampered their pipeline. Curious about the technical details of how Mobileye implemented this solution? Check out their full engineering blog post where they break down the implementation and share their benchmarks: https://medium.com/@sageahrac/cracking-the-code-a-smarter-way-to-store-sparse-data-23d28363829b Thank you @Sagi for authoring this blog piece!
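To make the idea of storing only the differences between consecutive non-zero indices concrete, here is a minimal plain-Python sketch. It is not Mobileye's or Daft's implementation (see the linked blog post for that); the function names `delta_encode` and `delta_decode` and the flat-list layout are assumptions made purely for illustration.

```python
from itertools import accumulate


def delta_encode(dense):
    """Encode a flat, mostly-zero sequence as (deltas, values, length).

    Hypothetical helper for illustration only.
    """
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    # The first delta is the first index itself; every later delta is the
    # gap to the previous non-zero index, which stays small for clustered data.
    deltas = [b - a for a, b in zip([0] + indices[:-1], indices)]
    return deltas, values, len(dense)


def delta_decode(deltas, values, length):
    """Rebuild the dense sequence from its delta-encoded form."""
    dense = [0] * length
    # A running sum of the deltas recovers the absolute indices.
    for index, value in zip(accumulate(deltas), values):
        dense[index] = value
    return dense


row = [0, 0, 0, 7, 9, 0, 0, 0, 0, 0, 4, 0]
encoded = delta_encode(row)  # deltas are [3, 1, 6], values are [7, 9, 4]
assert delta_decode(*encoded) == row
```

Small gaps need fewer bits than large absolute indices, which is one reason a delta representation tends to compress well when the non-zero entries sit close together.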

❤️ 3

ChanChan Mao

05/01/2025, 4:00 PM

Daft v0.4.12 release 🚀 Here are the highlights: • Advanced Window Functions: Comprehensive window function support including dynamic/sliding window aggregations, SQL syntax, and functions like LAG/LEAD, RANK/DENSE_RANK with PARTITION BY and ORDER BY capabilities (https://www.getdaft.io/projects/docs/en/stable/core_concepts/#window-functions) • Improved AWS Integration: Enhanced GlueCatalog with botocore session support and fixed Lance file read/write on S3 in Ray mode (https://www.getdaft.io/projects/docs/en/stable/integrations/glue/) • Performance Boost: Flight client implementation in Rust and optimized Arrow operations will significantly improve data processing speed • Better Catalog Management: Support for arbitrary properties in CREATE TABLE statements for more flexible metadata management • Enhanced Documentation: New comprehensive window function documentation and usage examples for Session and Catalog classes (https://www.getdaft.io/projects/docs/en/stable/api/sessions/#daft.session.Session, https://www.getdaft.io/projects/docs/en/stable/api/catalogs_tables/#daft.catalog.Catalog) See full release notes here: https://github.com/Eventual-Inc/Daft/releases/tag/v0.4.12

🙌 6

🙌🏽 1

ChanChan Mao

05/05/2025, 9:21 PM

Daft v0.4.13 release 🚀 This is a minor release with some nice productivity upgrades: • DataFrame pipe method to apply a sequence of UDFs https://www.getdaft.io/projects/docs/en/stable/api/dataframe/#daft.DataFrame.pipe • Several internal improvements for better compatibility (glibc 2.24 support) and fixing nightly tests Here are the full release notes: https://github.com/Eventual-Inc/Daft/releases/tag/v0.4.13

💯 4

ChanChan Mao

05/06/2025, 4:00 PM

If you’re new to Daft and/or new to S3 Tables, watch Conner & ChanChan’s session from Iceberg Summit 2025!

https://www.youtube.com/watch?v=4IPMD6z2FvY

The first 8 minutes introduce Daft and cover Daft DataFrames, Python & SQL, and local and distributed execution at a high level. The remainder of the talk dives into the background of what S3 Tables are, why they’re useful and the problems they solve, how S3 Tables relate to Iceberg tables, and how you can easily use Daft to read and write via the S3 Iceberg REST endpoint 🙌 Here’s the slide deck if you’re interested: https://docs.google.com/presentation/d/1WzlHkfrbkpv2zW5_g412Nn2GC9h8tWt-/edit?slide=id.g339a2d034dd_1_0#slide=id.g339a2d034dd_1_0

daft party 3

👍 3

ChanChan Mao

05/07/2025, 4:00 PM

Daft was one of the first to natively integrate with Unity Catalog! In the latest Unity Catalog Zero to Hero webinar, Kevin & Sammy dive into how Daft’s catalog API enables you to query Unity tables using both SQL and the Python DataFrame API. In the first 10 minutes, Sammy gives an introduction to Daft (DataFrames, Python, SQL, local & distributed execution) and how Daft differs from Spark — a question we get asked a lot! Following that is a live coding session where Kevin walks Victoria through how to set up Daft, connect it to a Unity catalog, and work with multimodal data in a Unity table. They iterate and run their code locally, then scale the completed code up and deploy it to a remote cluster. If you’re curious to see how Daft interacts with Unity Catalog for multimodal workloads, both locally and distributed, this one’s for you: https://www.youtube.com/live/bSQ-Bs1UxPY

daft party 3

ChanChan Mao

05/07/2025, 9:52 PM

Daft v0.4.14 release 🚀 Here are the key highlights: 🕰️ Comprehensive Temporal Functions: New date/time capabilities including quarter(), unix_date(), unix_micros(), unix_millis(), unix_seconds(), day_of_month(), and week_of_year() for more powerful time series analysis 🪟 Window Function Improvements: Dynamic window frame incremental updates for more efficient streaming analytics ⏱️ Interval Arithmetic Enhancement: New multiplication operations with Interval data types for more flexible time calculations 🔄 CI/Build Improvements: Updated PyArrow support (19.0.1) and enhanced build processes Special shoutout to our external contributors! 🙌 @petern48 contributed all of the temporal functions listed above that will make time-based aggregations and filtering much simpler 🙌 @flaneur2020 enabled multiplication arithmetics with Interval data types, a great enhancement for time-based calculations Find the full release notes here: https://github.com/Eventual-Inc/Daft/releases/tag/v0.4.14

daft party 4

🚀 3

ChanChan Mao

05/08/2025, 8:00 PM

How do you use Daft to combine traditional data processing, multimodal data, and AI workloads at scale? At the recent Data for AI Meetup in SF, @Colin Ho explored practical examples such as how to use Daft’s UDFs to run arbitrary Python code and make concurrent API calls for batch inference, together with traditional processing methods like aggregations and joins, all running on distributed clusters with minimal code changes. Here’s the recording:

https://youtu.be/Qnw6059ddgE

daft bro 6

ChanChan Mao

05/09/2025, 4:00 PM

Want to help build THE open-source data engine for modern data and AI workloads at any modality and any scale that can outperform Spark? Do you use Daft or similar DataFrame tools in your daily work? We’d love for you to contribute to Daft and we have some great first issues waiting for you! Why contribute? ✅ Join us on our mission to defeat Spark 😎 ✅ Work with Rust and Python in a high-performance system 🚀 ✅ Opportunity to write blog posts about your contributions 📝 ✅ Be a part of our growing Daft community 💜 Check out our beginner-friendly issues labeled “good first issue” on GitHub: https://github.com/Eventual-Inc/Daft/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22

ChanChan Mao

05/13/2025, 8:53 PM

🎉 Thanks to @Péter Ferenc Gyarmati, Chapter 1 of Daft has been launched on Marimo! https://marimo-team.github.io/learn/daft/01_what_makes_daft_special.html This is a great starting resource if you're new to Daft! TL;DR: What Makes Daft Special? 🤔 🦀 Built with Rust: Performance and Simplicity. Daft’s core engine is built with Rust and leverages Apache Arrow for its in-memory data format, providing better performance through memory efficiency and native Python bindings, while its lazy execution allows it to process massive datasets by building logical plans that only execute when explicitly requested (.show, .collect). 🌐 Scale Your Work: From Laptop to Cluster. Daft’s “write once, scale anywhere” approach allows developers to use the same Python API and move from local development (multiple cores on a single machine) to large-scale distributed execution (via Ray) without significantly refactoring their code. 🖼️ Handling More Than Just Tables: Multimodal Data Support. Daft natively supports multimodal data types, enabling direct processing of complex data like images, audio, URLs, and tensors within the same framework, particularly valuable for ML/AI pipelines and advanced analytics involving diverse sources. 🧑‍💻 Designed for Developers: Python and SQL Interfaces. Daft enhances developer experience through a flexible dual-interface approach, offering both a familiar Pythonic DataFrame API and a robust SQL interface, allowing users to choose their preferred method or use both interchangeably. Check out the Marimo notebook for a full introduction to Daft and stay tuned for future chapters that dive deep into Daft’s core concepts, working with data catalogs, multimodal data, and more! https://marimo-team.github.io/learn/daft/01_what_makes_daft_special.html Follow this issue for a list of planned courses: https://github.com/marimo-team/learn/issues/43
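For anyone unfamiliar with lazy execution, the toy class below sketches the general idea of recording a plan and running it only when results are requested. It is plain Python, not Daft's engine or API: `LazyFrame` and its methods are hypothetical names invented for this illustration, and a real engine would also optimize the plan before running it.

```python
class LazyFrame:
    """Toy lazy table: records steps and runs them only on collect()."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (step name, function) pairs

    def where(self, predicate):
        # Nothing is filtered here; the step is only appended to the plan.
        step = ("filter", lambda rows: [r for r in rows if predicate(r)])
        return LazyFrame(self._rows, self._plan + (step,))

    def select(self, *names):
        step = ("project", lambda rows: [{n: r[n] for n in names} for r in rows])
        return LazyFrame(self._rows, self._plan + (step,))

    def explain(self):
        """Return the names of the recorded steps without running them."""
        return [name for name, _ in self._plan]

    def collect(self):
        """Execute every recorded step in order and return the result."""
        rows = self._rows
        for _, step in self._plan:
            rows = step(rows)
        return rows


frame = LazyFrame([{"id": 1, "kind": "image"}, {"id": 2, "kind": "audio"}])
query = frame.where(lambda r: r["kind"] == "image").select("id")
# Up to this point no row has been touched; collect() triggers the work.
result = query.collect()
```

Because the whole plan is known before anything runs, an engine built this way has the chance to reorder or skip work, which is the part this toy leaves out.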

🙌 3

🎉 3

ChanChan Mao

05/15/2025, 4:51 PM

Just shipped Daft v0.4.15 release 🚀 Here are the key highlights: 📊 Performance Enhancements: new TopN operator & optimization for faster queries on large datasets and local distinct optimizations for count_distinct aggregations (https://github.com/Eventual-Inc/Daft/pull/4325) 🔄 Window Function Improvements: order by-only ranking, range between for partition-by windows (https://www.getdaft.io/projects/docs/en/stable/api/window/#daft.window.Window.range_between), and added support for several TPC-DS queries (https://github.com/Eventual-Inc/Daft/pull/4283) 📈 Data Flow Capabilities: generic interface for custom data sinks, user-defined DataFrame source APIs, async file writers for improved I/O performance 🔍 Observability: added support for OpenTelemetry metrics and tracing (https://github.com/Eventual-Inc/Daft/pull/4322) 📚 Developer Experience: improved documentation layout and search with Algolia search integration (https://github.com/Eventual-Inc/Daft/pull/4330) Find the full release notes here: https://github.com/Eventual-Inc/Daft/releases/tag/v0.4.15

❤️ 4

🔥 5

ChanChan Mao

05/19/2025, 5:41 PM

Window Functions: Session Ranking — This or that? How do you perform a session ranking WITHOUT window functions? 1. A self-join to compare each session with all other sessions 2. Filtering to count sessions with more chocolates 3. Grouping and aggregating to calculate the rank 4. Joining back to the original data 5. Handling nulls for top-ranked sessions ❌ Complex, error-prone, and hard to understand at a glance. How about WITH window functions? 1. Define a window partitioned by contestant and ordered by chocolates (descending) 2. Apply the dense_rank() function over this window (https://www.getdaft.io/projects/docs/en/stable/api/window/#daft.functions.dense_rank) ✅ All the comparison logic is handled internally, code is more concise and more efficient. Check out our documentation and try out Daft's window functions yourself! https://www.getdaft.io/projects/docs/en/stable/core_concepts/#window-functions
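The contrast between the two approaches can be sketched in plain Python. This illustrates what a dense rank over a partitioned, ordered window computes; it is not Daft code and does not use Daft's API. The `contestant` and `chocolates` fields come from the example above, while the `session` field and both function names are assumptions made for illustration.

```python
from itertools import groupby


def dense_rank_with_window(rows):
    """Partition by contestant, order by chocolates descending, then rank."""
    ranked = []
    by_contestant = sorted(rows, key=lambda r: r["contestant"])
    for _, sessions in groupby(by_contestant, key=lambda r: r["contestant"]):
        rank, previous = 0, None
        for row in sorted(sessions, key=lambda r: r["chocolates"], reverse=True):
            if row["chocolates"] != previous:  # ties share a rank, no gaps
                rank += 1
                previous = row["chocolates"]
            ranked.append({**row, "rank": rank})
    return ranked


def dense_rank_with_self_join(rows):
    """Compare each session with every other session of the same contestant."""
    ranked = []
    for row in rows:
        better = {
            other["chocolates"]
            for other in rows
            if other["contestant"] == row["contestant"]
            and other["chocolates"] > row["chocolates"]
        }
        # A top session has no better counts, so the empty set gives rank 1.
        ranked.append({**row, "rank": len(better) + 1})
    return ranked


sessions = [
    {"contestant": "ann", "session": 1, "chocolates": 5},
    {"contestant": "ann", "session": 2, "chocolates": 8},
    {"contestant": "ann", "session": 3, "chocolates": 5},
    {"contestant": "bob", "session": 1, "chocolates": 2},
]
order = lambda r: (r["contestant"], r["session"])
assert sorted(dense_rank_with_window(sessions), key=order) == sorted(
    dense_rank_with_self_join(sessions), key=order
)
```

Both functions return the same ranks, but the self-join version compares every pair of rows, while the windowed version sorts each partition once.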

ChanChan Mao

05/20/2025, 5:39 PM

📺 Daft is purpose-built for processing data of any modality and at any scale. Use Daft to query data of all different shapes and sizes, from tabular (Parquet, CSV) to semi-structured (JSON) to unstructured (text, images, audio). In this MDS Fest session, Kevin and I dive into the technical details that allow Daft to handle multimodal data while maximizing I/O throughput, including distributed reads of large files, memory stability via morsel-based execution, and I/O-aware query optimizations. https://www.youtube.com/live/2GTUGv_S5Gc

🙌 2

ChanChan Mao

05/22/2025, 8:10 PM

Daft v0.4.16 just dropped with a PACKED release 💪 ✨ Native Local Parquet Writer - Efficient local Parquet processing without external dependencies (https://github.com/Eventual-Inc/Daft/pull/4260) 🛣️ Daft Roadmap - See what features the Daft team is prioritizing in the coming year (https://www.getdaft.io/projects/docs/en/stable/roadmap/) 🔧 Python Partitioning Classes - Better data organization and query optimization capabilities (https://github.com/Eventual-Inc/Daft/pull/4366) 🐠 Flotilla Scheduler & Planner - Enhanced distributed computing with smarter resource management ☁️ S3 Multipart Uploads - Native support for efficient large file uploads to S3-compatible storage 🐍 Optional Spark Dependencies - Cleaner installs with PySpark connector now optional (pip install -U "daft[spark]") 🪟 Window Functions Tutorial - Transform complex analytical challenges into elegant solutions with Daft’s window functions (https://colab.research.google.com/github/Eventual-Inc/Daft/blob/main/tutorials/window_functions/window_functions.ipynb) 📊 Performance Wins: • Smart projection splitting for granular batching • Optimized morsel sizing for project/filter ops • Better batch sizing with URL download connections • CSV predicate pushdown improvements The team is cooking 🔥 Find the full release notes here: https://github.com/Eventual-Inc/Daft/releases/tag/v0.4.16

🚀 3

🔥 6

ChanChan Mao

05/28/2025, 4:00 PM

Daft v0.4.17 is out 🎉 This release focuses on stability and developer experience: 📅 Duration Expressions - Built-in time duration operations for better temporal data processing 🔧 Enhanced Spark Connect - Fixed column renaming operations that preserve non-renamed columns properly ⚡ Rust-Python Catalog Integration - Native Rust support for Python catalogs and tables 🛠️ Developer Experience Wins: - Progress bar stability improvements (no more panics!) - Better error handling with null arguments in string operations - S3n protocol support for legacy systems - Stricter MyPy enforcement for better code quality Find the full release notes here: https://github.com/Eventual-Inc/Daft/releases/tag/v0.4.17

🎉 4

Kevin Wang

05/31/2025, 1:19 AM

I'm excited to announce the release of Daft v0.5.0! 🎉 For those migrating from v0.4, please take a look at the migration guide located in the release notes. Thank you!

daft party 6

ChanChan Mao

06/02/2025, 6:02 PM

@channel KICKING OFF LAUNCH WEEK 🔥 Hear from Sammy, Co-Founder & CEO at Eventual, on what's coming in hot this week. ✅ New brand ✅ New website: www.getdaft.io ✅ New docs: docs.getdaft.io/ 🎯 Same mission: simple and reliable multimodal data processing for modern AI workloads. 💥 BUCKLE UP — we’re dropping something new EVERY DAY this week. This is just Day 1. Try out Daft today:

pip install daft

sammy launch.mov

daft party 12

🙌 12

ChanChan Mao

06/03/2025, 5:11 PM

@channel DAY 2 OF DAFT LAUNCH WEEK: Multimodal. Data. Dynamic. Execution. 🚨 Images, audio, text, video, documents – you name it. Multimodal data is fundamentally different. Daft’s dynamic execution tunes batch sizes adaptively based on data type, operation, and downstream pressure. 🔄 Streaming downloads? Daft auto-batches to saturate the network without blowing up memory. 📦 Writing Parquet? Daft is optimized for columnar writes. 🧠 All handled in a single pipeline. Zero config. Maximum performance. See it in action with an image preprocessing pipeline featuring @Colin Ho, Software Engineer at Eventual. ✨ Ready to ditch fixed batches and unlock adaptive data flows?

pip install daft

Check out our docs on dynamic execution: https://docs.getdaft.io/en/stable/core_concepts/#dynamic-execution-for-multimodal-workloads

🙌 6

daft party 4

clapclap e 4

ChanChan Mao

06/04/2025, 5:12 PM

@channel ✅ No JVMs ✅ No JARs ✅ Local or Distributed ✅ One engine for all. Daft Launch Week Day 3 🚀 Spark Connect for Daft. Tired of juggling JVM, JARs, and legacy pipelines just to run Spark? 😭 What if your existing PySpark code “just worked”... but without Spark…? With Spark Connect for Daft, you can now: ⚡ Run the SAME Spark queries using Daft’s blazing-fast engine ⚙️ Switch from PySpark to Daft with just 2 lines of code: daft.pyspark and .local() or .remote() 🌍 Scale from local to distributed with Ray—no JVM required ❌ No need to maintain separate stacks for both legacy and multimodal workloads. Same Spark code, radically better experience. Multimodal + traditional data under one unified engine 💜 Follow along as @Kevin Wang, Software Engineer at Eventual, shows you just how easy it is. It’s fast. It’s simple. Try it out yourself:

pip install "daft[spark]"

Learn more about Spark Connect in our docs: https://docs.getdaft.io/en/stable/spark_connect/

🔥 11

🎉 3

❤️ 12

ChanChan Mao

06/05/2025, 10:20 PM

@channel Introducing User-Defined Data Sources and Sinks. Now you can plug in any data source and sink into Daft in just ~100 lines of code. Happy Day 4 of Daft #LaunchWeek Want to write to a vector database like Chroma? We did just that – LIVE with distributed workers – follow along with @Desmond Cheong, Software Engineer at Eventual. 👉 Get started now and write your own custom integration:

pip install daft

(Ping us once your PR is ready for review 😉) Check out our docs on user defined sources & sinks to learn more: https://docs.getdaft.io/en/stable/io/#user-defined

🐐 6

daft bro 4

clapclap e 10

ChanChan Mao

06/06/2025, 5:05 PM

@channel It’s been a thrilling launch week – from dynamic execution to Spark Connect to user-defined I/O. To wrap things up, hear from @Sammy Sidhu, CEO & Co-Founder of Eventual, about what’s next for Daft and what we’re excited about! Let’s build together:

pip install daft

Check out the Daft roadmap: https://docs.getdaft.io/en/stable/roadmap/ And join the team! https://jobs.ashbyhq.com/eventualcomputing

sammy roadmap.mov

👀 5

daft party 7

ChanChan Mao

06/09/2025, 4:00 PM

In the midst of last week’s exciting launch week, we didn’t mention that we released Daft v0.5 with major improvements! We’re focused on making data processing more intuitive and performant. Here’s what’s new: → Native runner is now the default: Better performance out of the box, deprecated PyRunner (use daft.context.set_runner_native()) → Simplified SQL interface: daft.sql("SELECT * FROM df", df=df), no more catalog boilerplate → Rust-powered in-memory catalogs: Faster operations, cleaner architecture → Cleaner Catalog API: e.g. daft.catalog.read_table → daft.read_table → New deserialize function with support for JSON. We’ve kept breaking changes minimal while making substantial improvements to the core engine. If you’re working with multimodal data at scale, the combination of Python ergonomics with Rust performance is worth checking out. → Get started with Daft today:

pip install daft

Here are the full release notes: https://github.com/Eventual-Inc/Daft/releases/tag/v0.5.0

daft party 5

ChanChan Mao

06/26/2025, 4:41 PM

Another huge milestone achieved this week - Daft has surpassed 3000 stars! Thank you to our growing community for continuing to support us and sharing the love of Daft. And thank you to everyone who is using Daft and believing in the future that we're building. https://github.com/Eventual-Inc/Daft

❤️ 6

🙌 5

ChanChan Mao

07/09/2025, 7:35 PM

Hey folks in Seattle! Tune in to @Robert Howell's session - “Working with Iceberg Predicates: Representation, Translation, and Interop” - next Tuesday July 15 at the Seattle Apache Iceberg™ Community Meetup! Join us as we break down how Daft’s Rust-based optimizer pushes expressions directly into PyIceberg for optimizing reads. The current state of expression representation across table formats is fragmented and inefficient. We’ll show you: • How Daft translates expressions for PyIceberg optimization • Why direct Iceberg-Rust integration changes everything • The path to unified expression representation across Iceberg, Delta Lake, DataFusion, and Lance. When & Where: Tuesday, July 15 | 5-8:30pm | Bellevue, WA. Register here: https://lu.ma/q9w0j1ky

daft party 4