can cai
08/26/2025, 10:10 AM

can cai
08/26/2025, 11:41 AM

can cai
08/26/2025, 11:42 AM

Colin Ho
08/26/2025, 7:03 PM

Malcolm Greaves
08/26/2025, 7:06 PM
When we execute a query (e.g. when you call a .collect(), a .show(), or a .write_XYZ), we generate a logical plan describing what the query does, we optimize it, and then we execute it.
Once we execute it, however, this turns into something that's more like a hybrid between an offline, batch-oriented data processing system and a streaming engine. The actual internal execution in Daft is much more like a streaming engine: we have backpressure, and we stream data in and results out. But Daft is still batch-oriented in that it needs to know the work to do ahead of time. That ahead-of-time work is the plan, and Daft's runtime is guided by the plan.
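The plan-then-stream model described above can be sketched with a toy generator pipeline. This is an illustration of the idea, not Daft's actual internals: the "plan" is declared up front as a chain of lazy generators, and nothing runs until a collect step pulls batches through it.

```python
# Toy illustration (not Daft's internals): a query is described ahead of
# time as a plan, then executed as a streaming pipeline that pulls small
# batches through one at a time instead of materializing everything.

def scan(rows, batch_size=2):
    """Source operator: yields the input as small batches."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def filter_op(batches, predicate):
    """Keeps only rows matching the predicate, batch by batch."""
    for batch in batches:
        kept = [r for r in batch if predicate(r)]
        if kept:
            yield kept

def project(batches, fn):
    """Applies fn to every row, batch by batch."""
    for batch in batches:
        yield [fn(r) for r in batch]

# "Logical plan": declared up front, nothing executes yet.
rows = [1, 2, 3, 4, 5, 6]
plan = scan(rows)
plan = filter_op(plan, lambda x: x % 2 == 0)
plan = project(plan, lambda x: x * 10)

# Execution (the .collect() step): batches stream through the pipeline.
result = [r for batch in plan for r in batch]
print(result)  # [20, 40, 60]
```

Because each stage pulls from the one before it, only one batch per stage is in flight at a time, which is the property that lets a streaming runtime bound memory and apply backpressure.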
But it starts up and runs very quickly, so Daft is a great fit for small data jobs and general exploratory data analysis too. Compared to something like Spark (or even Hadoop), there's a big difference in UX when using Daft versus a more offline-oriented system.

Colin Ho
08/26/2025, 7:07 PM

can cai
08/27/2025, 2:41 AM

Kesav Kolla
08/27/2025, 11:24 AM

can cai
08/27/2025, 12:05 PM

Colin Ho
08/27/2025, 1:15 PM

Malcolm Greaves
08/27/2025, 9:15 PM
> I'm looking for a compelling justification to advocate for adopting Daft over Ray Data within our company. Could you offer any suggestions?
Yes! 🙂 For choosing Daft over Ray Data, I'd summarize the top points as:
• Performance, both memory and throughput.
  ◦ Daft is significantly faster than Ray Data when using native operations such as URL downloads. For executing Python code (aka UDFs) it's a bit faster than Ray Data (not blow-it-out-of-the-water faster, like our built-in expressions, but you'll still get your results faster).
  ◦ Daft is a lot better at managing memory than Ray Data. You will essentially never hit an OOM when using a built-in expression. Even when running your own UDFs, you'll have a hard time pushing Daft to the point where it runs out of memory and crashes. Why? Because we all used to use Spark and hated getting OOMs in the middle of our pipelines, so we built backpressure into Daft from day 1: at runtime it can detect when some operator is eating through memory and reduce throughput and batch sizes to bring that memory use back down.
• Better dev UX: there's no need for a cluster when a single node will do! And Swordfish, the single-node runner, is insanely fast, from a laptop up to a big EC2 instance you rent.
• Better shuffles: Daft is really good at joins and aggregations. When you're taking lots of different data sources and combining them, you can run into bad runtime behaviors like scanning unnecessary data (e.g. because it gets filtered out later) or exploding memory use (e.g. because something that could have been streamed as batches is instead loaded into memory all at once). Daft is implemented more like a database when it comes to things like joins, so it has all sorts of optimizations, like predicate pushdown, that make joins much faster.
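The predicate pushdown mentioned above can be shown with a toy example. This is not Daft's optimizer, just an illustration of the idea: moving a filter below a join shrinks the inputs before rows are combined, so the join does far less work while producing the same answer.

```python
# Toy illustration of predicate pushdown (not Daft's optimizer): filtering
# before the join instead of after it gives the same result with less work.

left = [{"id": i, "region": "us" if i % 2 == 0 else "eu"} for i in range(6)]
right = [{"id": i, "score": i * 10} for i in range(6)]

def join(l, r, key):
    """Hash join: index the right side, then probe it with the left side."""
    index = {row[key]: row for row in r}
    return [{**a, **index[a[key]]} for a in l if a[key] in index]

# Naive plan: join all 6x6 candidate rows, then throw most of them away.
naive = [row for row in join(left, right, "id") if row["region"] == "us"]

# Pushed-down plan: filter the left side first, then join far fewer rows.
pushed = join([row for row in left if row["region"] == "us"], right, "id")

assert naive == pushed  # same answer, much cheaper join
print([row["id"] for row in pushed])  # [0, 2, 4]
```

An optimizer that rewrites the first plan into the second automatically is what lets a database-style engine avoid scanning and joining data that a later filter would discard anyway.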
can cai
08/28/2025, 8:46 AM

can cai
08/28/2025, 8:51 AM
> Hey @can cai, stay tuned for benchmarks, we're working on them!
Hi @Colin Ho, I would like to do the benchmark work of Ray Data vs. Daft. May I ask if there are any suggestions for carrying out this work? For instance, how to test, how to construct the data, which scenarios to test, and so on.
Malcolm Greaves
08/28/2025, 5:50 PM
> data skew occurs, does daft have the ability to handle it automatically?
As in, uneven partition sizes? Daft has a repartition expression, so you can rebalance your partition sizes. There's equivalent functionality in Ray Data.
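What rebalancing skewed partitions means can be sketched in a few lines. The `repartition` helper below is hypothetical, not Daft's API: it flattens the skewed partitions and redistributes rows round-robin so every output partition ends up roughly the same size.

```python
# Toy sketch of rebalancing skewed partitions (hypothetical helper, not
# Daft's API): flatten all rows, then deal them out round-robin.

def repartition(partitions, num_partitions):
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        out[i % num_partitions].append(row)
    return out

# One huge partition next to two tiny ones: classic data skew.
skewed = [[1, 2, 3, 4, 5, 6, 7], [8], [9, 10]]
balanced = repartition(skewed, 3)
print([len(p) for p in balanced])  # [4, 3, 3]
```

After rebalancing, parallel workers each get a similar amount of work, instead of one worker grinding through the big partition while the others sit idle.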
can cai
08/29/2025, 1:51 AM

Malcolm Greaves
08/29/2025, 5:26 PM.into_batches(batch_size: int) on your dataframe. This will ensure that reads are done in chunks of this batch_size. And Daft will parallelize across these batches. So it means it doesn’t matter if your partitions are not even at this point: into_batches ensures you’re operating with evenly sized chunks.