# general
c
Hi, may I ask if there is a benchmark report of daft vs ray data? https://docs.daft.ai/en/stable/benchmarks/
The official documentation of ray states that in terms of offline distributed inference, ray data is more suitable for GPU workloads in machine learning compared to daft. Is this really the case? https://docs.ray.io/en/latest/data/comparisons.html
gently ping @Colin Ho
c
Hey @can cai , stay tuned for benchmarks, we’re working on them!
🙏 1
m
Hi @can cai! I'm not quite sure what Ray really means by "a streaming paradigm" in that sentence where they're contrasting against Daft. 🤔 Ray Data isn't designed for any sort of realtime streaming. And if you're doing offline work, you really care about throughput. Latency matters more when you're doing realtime things (e.g. you care about time to first token for LLMs, or the P99 latency for a complete model response). (And if you're doing something like realtime model serving, you want a model server! So I'd look to services like Modal or the Nvidia Triton inference server, or roll your own.)

For GPU throughput, you want something that can saturate the memory transfer between main memory and the GPU. And in an offline setting, when you know how much work is to be done, you have the best chance of doing this! (If you've used PyTorch, this is why the Dataset + DataLoader + collate_fn abstraction works really well for training!)

Daft is made for batch-oriented workflows, but that's all in the logical plan stage. As in, when you materialize your Daft query (via a `.collect()`, a `.show()`, or a `.write_XYZ()`), we generate a logical plan describing what the query does, we optimize it, then we execute it. Once we execute it, however, it turns into something that's more like a hybrid between an offline batch-oriented data processing system and a streaming engine. The actual internal execution in Daft is much more like a streaming engine: we have back-pressure, and we stream data in and results out. Daft is batch-oriented in that it needs to know the work to do ahead of time; that ahead-of-time work is the plan, and Daft's runtime is guided by the plan. But it starts up and runs very quickly, so Daft is a great fit for small data jobs & general exploratory data analysis too. Compared to something like Spark (or even Hadoop) there's a big difference in UX when using Daft vs. a more offline-based system.
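To make that concrete, here's a rough sketch of the lazy-plan-then-materialize flow (the parquet path and column names are made up, and `explain()`'s exact arguments may differ by Daft version):

```python
import daft
from daft import col

# Building the query is lazy: Daft only records a logical plan here.
df = daft.read_parquet("s3://my-bucket/events/*.parquet")  # hypothetical path
df = df.where(col("status") == "ok").with_column("latency_ms", col("latency_s") * 1000)

# Inspect the plan (and its optimized form) without executing anything.
df.explain(show_all=True)

# Materialization is what triggers optimization + streaming execution.
df.show(8)             # stream a small sample to the console
result = df.collect()  # or materialize the full result
```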
🙏 1
c
Daft does GPU inference pretty well too! Check out this blog for an example https://www.daft.ai/blog/embedding-millions-of-text-documents-with-qwen3 Ray data has tighter integrations with the rest of the ray ecosystem, like train or serve, but if you don’t need that, or prefer working with dataframes / sql, daft will work for you.
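For a flavor of what that looks like, here's a minimal sketch of a stateful Daft UDF doing batch embedding, loosely in the spirit of that blog post. The model name, the sentence-transformers setup, and the input/output paths are all illustrative assumptions, and resource options (GPUs, concurrency) are configured separately depending on your Daft version:

```python
import daft
from daft import col, DataType

@daft.udf(return_dtype=DataType.list(DataType.float32()))
class EmbedText:
    def __init__(self):
        # Heavy setup (loading model weights onto the GPU) happens once per worker.
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", device="cuda")

    def __call__(self, texts):
        # `texts` arrives as a batch (a Daft Series); return one embedding per row.
        return self.model.encode(texts.to_pylist()).tolist()

df = daft.read_parquet("s3://my-bucket/docs.parquet")   # hypothetical input
df = df.with_column("embedding", EmbedText(col("text")))
df.write_parquet("s3://my-bucket/docs_embedded")        # hypothetical output
```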
🙏 1
c
Thanks for your reply. @Colin Ho @Malcolm Greaves In fact, our company has already adopted Ray internally, including Ray Core and Ray Data. We are now evaluating the integration of Daft. Currently, we are analyzing the differences between Daft and Ray Data to define their strategic positioning and clarify the future roadmaps for both. Setting aside the DataFrame and SQL APIs, the core distinctions between Ray Data and Daft appear minimal: both frameworks are capable of performing large-scale ETL and executing offline distributed inference. I'm looking for a compelling justification to advocate for adopting Daft over Ray Data within our company. Could you offer any suggestions? Thank you very much.
k
How does daft handle distributed and parallel workloads? Currently I use Spark with 8 nodes in a cluster to achieve parallel workload execution. Without Ray, can daft leverage multiple nodes for handling large volumes of data and parallel workloads?
c
As far as I know, the distributed execution of daft depends on whether the runner it uses supports distribution, such as the ray runner or spark runner. However, its native runner does not support distributed execution.
c
@can cai is correct. Currently daft has 2 runners, native and ray. We recommend using the native runner on a single machine, and using Ray for distributed. As of now, we don't have concrete plans for providing distributed execution without ray, but if we do, we will update here: https://docs.daft.ai/en/stable/roadmap/
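In code, picking a runner looks roughly like this (the Ray address is a placeholder; you'd normally set the runner once, before running any queries):

```python
import daft

# Single machine: the native (swordfish) runner is the default, but you can be explicit.
daft.context.set_runner_native()

# Distributed: connect to an existing Ray cluster instead (pick one runner per process).
# daft.context.set_runner_ray(address="ray://head-node:10001")  # placeholder address

df = daft.from_pydict({"x": [1, 2, 3]})
df.show()
```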
👍 1
m
Hi @can cai! I'd like to chime in here on:
I'm looking for a compelling justification to advocate for adopting Daft over Ray Data within our company. Could you offer any suggestions?
Yes! 🙂 For choosing Daft over Ray Data, I'd summarize the top points as:
• Performance, both memory and throughput.
  ◦ Daft is significantly faster than Ray Data when using native operations such as URL downloads (see the sketch after this list). For executing Python code (aka UDFs) it's a bit faster than Ray Data (not blow-it-out-of-the-water faster, like our built-in expressions, but you'll still get your results faster).
  ◦ Daft is a lot better at managing memory than Ray Data. You will essentially never have an OOM when using a built-in expression. Even when running your own UDFs, you'll still have a hard time pushing Daft to the point where it runs out of memory and crashes. Why? Because we all used to use Spark and hated getting OOMs in the middle of our pipelines, so we built functionality like backpressure into Daft from day 1, so that at runtime we can detect when an operator is eating through memory and reduce throughput + batch sizes to bring memory use back down.
• Better dev UX: there's no need for a cluster when a single node will do! And swordfish, the single-node runner, is insanely fast everywhere from a laptop to a big EC2 instance you rent.
• Better shuffles: Daft is really good at joins and aggregations. When you're taking lots of different data sources and combining them, you can run into bad runtime behaviors like scanning unnecessary data (e.g. because it gets filtered out later) or exploding memory use (e.g. because something that could have been streamed as batches is instead loaded into memory all at once). Daft is implemented more like a database when it comes to things like joins, so it has all sorts of optimizations like predicate pushdown that make joins much faster.
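To illustrate the first and last points, here's a rough sketch combining a native expression (URL downloads) with a join + filter that the optimizer can push down (all paths and column names are made up):

```python
import daft
from daft import col

# Native expression: download many files in parallel, no UDF required.
imgs = daft.read_parquet("s3://my-bucket/image_urls.parquet")   # hypothetical path
imgs = imgs.with_column("bytes", col("url").url.download())

# Join + filter: the optimizer can push the filter below the join,
# so rows that would be discarded anyway never get shuffled.
meta = daft.read_parquet("s3://my-bucket/metadata.parquet")     # hypothetical path
joined = imgs.join(meta, on="id").where(col("label") == "cat")
joined.explain(show_all=True)  # inspect the optimized plan
joined.collect()
```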
👍 1
c
Thank you @Malcolm Greaves! btw, when performing a join in Daft, if data skew occurs, does Daft have the ability to handle it automatically? If not, would users need to use the Ray Core API directly to solve the data skew problem?
> Hey @can cai , stay tuned for benchmarks, we’re working on them!
Hi @Colin Ho, I would like to do some benchmarking work of Ray Data vs Daft. May I ask if you have any suggestions for carrying out this work? For instance, how to test, how to construct the data, which scenarios to cover, and so on.
m
> data skew occurs, does Daft have the ability to handle it automatically?
As in, uneven partition sizes? Daft has a repartition expression, so you can rebalance your partition sizes. There's equivalent functionality in Ray Data.
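In Daft that looks roughly like this (the path, column name, and partition count are just for illustration):

```python
import daft
from daft import col

df = daft.read_parquet("s3://my-bucket/events/*.parquet")  # hypothetical path

df = df.repartition(64)                    # rebalance into ~even partitions
# df = df.repartition(64, col("user_id"))  # or hash-partition by key to keep equal keys together
```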
c
> As in, uneven partition sizes?
@Malcolm Greaves Yes. However, what if I shuffle by key? Data with the same key must be placed in the same partition, so it can't just be randomly repartitioned.
m
If you don’t want to repartition after your shuffle, you can call `.into_batches(batch_size: int)` on your dataframe. This will ensure that reads are done in chunks of this `batch_size`, and Daft will parallelize across these batches. So it doesn’t matter if your partitions are not even at this point: `into_batches` ensures you’re operating with evenly sized chunks.
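Roughly like this (a sketch assuming the `into_batches` API as described above; the groupby, column names, and paths are made up):

```python
import daft
from daft import col

df = daft.read_parquet("s3://my-bucket/events/*.parquet")  # hypothetical path
grouped = df.groupby("user_id").agg(col("amount").sum())   # key-based shuffle, possibly skewed

# Downstream work proceeds in evenly sized chunks regardless of partition skew.
evened = grouped.into_batches(10_000)
evened.write_parquet("s3://my-bucket/out")                 # hypothetical output
```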
🙏 1