# general
v
Hi group, is there any benchmark comparing ray data & daft?
💯 1
n
I had a question along the same lines, not with respect to performance but rather functional aspects. What more do I get out of Daft that I cannot achieve using Ray Data? Because from what I see, the APIs are somewhat similar. Keen to know more about this.
v
IMO Daft provides an easy way to operate on data, and it also provides a DuckDB-like SQL interface to interact with. It has an optimizer and planner that can optimize the logical plan before execution, which I expect to give better performance (but I haven't finished benchmarking yet).
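To make the SQL interface concrete, here is a minimal sketch of querying a DataFrame through SQL. It assumes a recent Daft version that ships daft.sql; the toy data is made up purely for illustration.
Copy code
import daft

# Hypothetical toy data, just to show the DuckDB-style SQL entry point
df = daft.from_pydict({"id": [1, 2, 3], "label": ["cat", "dog", "cat"]})

# daft.sql builds a logical plan over DataFrames in scope; the optimizer
# can rewrite that plan before anything is executed.
result = daft.sql("SELECT id, label FROM df WHERE id > 1")
result.show()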
k
@Colin Ho are there any benchmarks we can share versus Ray Data?
j
Because of Ray Data's architecture, Daft blows it out of the water in terms of performance and development UX
c
I am preparing some benchmarks, you can expect them in the next few weeks!
👀 1
🎉 1
v
@jay Could you provide some detail on why Daft's performance is so much better than Ray Data's, on the technical side? Thanks!
j
Streaming execution: Ray Data just does Pandas/PyArrow/Python under the hood. If you have to do URL downloads, image decoding, etc., it all runs in inefficient, non-parallelized Python code. You'll see that Daft performs way better in real-world situations, such as when you have URLs in a Parquet table.
👍 1
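To illustrate the kind of workload being described, here is a minimal sketch of downloading and decoding images from URLs stored in a Parquet table using Daft's native expressions. The bucket path and the image_url column name are made up for the example.
Copy code
import daft

# Read a table that has a column of image URLs (path is hypothetical)
df = daft.read_parquet("s3://my-bucket/images.parquet")

# Native expressions: fetch the bytes, then decode them into images.
# Both steps run inside Daft's parallel, streaming execution engine
# rather than in per-row Python code.
df = df.with_column(
    "image",
    df["image_url"].url.download().image.decode(),
)
df.show()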
n
I am guessing Daft expressions are the real game changer.
c
We will likely be posting the benchmarks soon, together with some detailed comparisons between Daft vs Ray Data vs Spark. But in the meantime I can give a little sneak peek. Here's a large-scale image classification job where Daft took 2m 46s, Ray Data took 28m 5s, and Spark took 2h 38m. https://gist.github.com/colin-ho/bb0e31159d3c361a20bee85911057236

First of all, I found Spark to be terrible when it comes to anything with multimodal data, models, GPUs, Python dependencies, UDFs, etc. It was built for the Java world and analytics. It is hard to tell Spark that only this UDF needs GPUs, this UDF needs CPUs, this UDF does I/O, this UDF needs some batch size, please schedule accordingly. It's just impossible.

Ok, what about Ray Data and Daft then? Both natively support and embrace Python. You can write Python UDFs and give the engines hints about CPU usage, batch size, concurrency, etc. But some of the things where Daft shines:
1. Native multimodal expressions: image decoding/encoding/cropping/resizing, text and image embedding/classification APIs, LLM APIs, text tokenization, cosine similarity, URL downloads/uploads, reading video into image frames. These native multimodal expressions are highly optimized in Daft. In Ray Data you have to write your own Python UDFs that use external dependencies like Pillow, NumPy, spaCy, Hugging Face, etc. This comes at the cost of extra data movement, because these libraries each have their own data format, plus just overhead for the user.
2. Native I/O: we wrote our own readers and writers for Parquet, CSV, JSON, Lance, Iceberg, Delta, WARC, you name it. This means I/O is tightly integrated into the engine's streaming execution model, and fully parallelized. Ray Data simply uses PyArrow for reading/writing, which is less performant, and they have less control over it.
3. Query optimizer: once your query starts involving more and more operations, it gets very complicated for you, the user, to understand how to optimize. Things like projection pushdowns, filter pushdowns, and join reordering you get for free in Daft. In Ray Data, you have to manually decide which columns you need, where to put the filter, etc. Imagine how hard this can be when you have >100 columns.
❤️ 2
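As a concrete illustration of point 3, here is a minimal sketch of a query where Daft's optimizer can push the column selection and the filter down into the Parquet scan, so only the needed columns are read. The path and column names are made up.
Copy code
import daft

# A wide table with many columns (hypothetical path)
df = daft.read_parquet("s3://my-bucket/wide_table.parquet")

# Written naively: filter, then keep two columns. The optimizer rewrites
# this so the scan itself only reads "user_id" and "score" and applies
# the predicate as early as possible.
result = (
    df
    .where(df["score"] > 0.9)
    .select("user_id", "score")
)
result.explain(show_all=True)  # inspect the unoptimized vs. optimized plans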
v
I think the memory copy / data transfer overhead might also be a game changer in the future (with the Flotilla runner).
a
Hi Colin, super interesting! I'm trying to access the dataset in s3://daft-public-datasets/imagenet/sample_100k but I'm currently getting an Access Denied error -- could you open up the permissions?
c
the daft-public-datasets bucket is a requester pays bucket, you will need to add credentials:
Copy code
import os

import daft
from daft.io import IOConfig, S3Config

# Requester-pays access: the requester is billed, so credentials are required
s3_config = S3Config(
    requester_pays=True,
    key_id=os.environ["AWS_ACCESS_KEY_ID"],
    access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    anonymous=False,
)

IO_CONFIG = IOConfig(s3=s3_config)
daft.set_planning_config(default_io_config=IO_CONFIG)
🙌 1
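As an aside, the same IOConfig can also be passed per read instead of relying on the global planning config, which can be handy when only one dataset is requester-pays. A small sketch reusing IO_CONFIG from the snippet above:
Copy code
df = daft.read_parquet(
    "s3://daft-public-datasets/imagenet/sample_100k",
    io_config=IO_CONFIG,
)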
r
Hey @Colin Ho! I tried accessing s3://daft-public-datasets/imagenet/benchmark with requester_pays=True, but I got an error. Could you help me out?
Copy code
Traceback (most recent call last):
  File "/home/ray/default/image_classification/daft_main.py", line 64, in <module>
    df = daft.read_parquet(INPUT_PATH)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/daft/io/_parquet.py", line 84, in read_parquet
    builder = get_tabular_files_scan(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/daft/io/common.py", line 39, in get_tabular_files_scan
    scan_op = ScanOperatorHandle.glob_scan(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
daft.exceptions.DaftCoreException: DaftError::External Unhandled Error for path: s3://daft-public-datasets/imagenet/benchmark
Details:
unhandled error: Error { s3_extended_request_id: "cGWuXOF0y3KYHoUYKQ7M4VGR7LqGRrj26sPLMVQNZi0ELQoP4JB/jpH7PmCQyoE4D0SuanIOvig=", aws_request_id: "DHHPYAJ65GSQVSSZ" } (Unhandled(Unhandled { source: ErrorMetadata { code: None, message: None, extras: Some({"s3_extended_request_id": "cGWuXOF0y3KYHoUYKQ7M4VGR7LqGRrj26sPLMVQNZi0ELQoP4JB/jpH7PmCQyoE4D0SuanIOvig=", "aws_request_id": "DHHPYAJ65GSQVSSZ"}) }, meta: ErrorMetadata { code: None, message: None, extras: None } }))
c
how are you setting the s3 config?
r
So, AWS credentials are already configured on my nodes. I've tried this:
Copy code
s3_config = S3Config(
    requester_pays=True,
    anonymous=False,
)

IO_CONFIG = IOConfig(s3=s3_config)
daft.set_planning_config(default_io_config=IO_CONFIG)
And also manually exporting my credentials and doing this:
Copy code
s3_config = S3Config(
    requester_pays=True,
    key_id=os.environ["AWS_ACCESS_KEY_ID"],
    access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    anonymous=False,
)

IO_CONFIG = IOConfig(s3=s3_config)
daft.set_planning_config(default_io_config=IO_CONFIG)
c
can you try setting the region to us-west-2?
r
Hmm...seems like that didn't work.
Copy code
import daft
from daft.io import S3Config, IOConfig

INPUT_PATH = "<s3://daft-public-datasets/imagenet/benchmark>"

AWS_ACCESS_KEY_ID = ...
AWS_SECRET_ACCESS_KEY = ...
AWS_SESSION_TOKEN = ...


s3_config = S3Config(
    requester_pays=True,
    anonymous=False,
    region_name="us-west-2",
    session_token=AWS_SESSION_TOKEN,
    access_key=AWS_ACCESS_KEY_ID
)

df = daft.read_parquet(INPUT_PATH)
Error:
Copy code
daft.exceptions.DaftCoreException: DaftError::External Unhandled Error for path: s3://daft-public-datasets/imagenet/benchmark
Details:
unhandled error: Error { aws_request_id: "8AW1EBDTYWY531X9", s3_extended_request_id: "w9v2jFJlXt/6VgCYKFUkU66eESOxdi6z6mOI7i/h2gG1BBDp611r+OQLXL5ylG/hE4sLdu58dJI=" } (Unhandled(Unhandled { source: ErrorMetadata { code: None, message: None, extras: Some({"aws_request_id": "8AW1EBDTYWY531X9", "s3_extended_request_id": "w9v2jFJlXt/6VgCYKFUkU66eESOxdi6z6mOI7i/h2gG1BBDp611r+OQLXL5ylG/hE4sLdu58dJI="}) }, meta: ErrorMetadata { code: None, message: None, extras: None } }))
c
I think it should be
Copy code
s3_config = S3Config(
    requester_pays=True,
    anonymous=False,
    region_name="us-west-2",
    session_token=AWS_SESSION_TOKEN,
    key_id=AWS_ACCESS_KEY_ID,
    access_key=AWS_SECRET_ACCESS_KEY
)
r
Tried that and it didn't work either 😕
Copy code
s3_config = S3Config(
    requester_pays=True,
    anonymous=False,
    region_name="us-west-2",
    session_token=AWS_SESSION_TOKEN,
    key_id=AWS_ACCESS_KEY_ID,
    access_key=AWS_SECRET_ACCESS_KEY
)
IO_CONFIG = IOConfig(s3=s3_config)
daft.set_planning_config(default_io_config=IO_CONFIG)
df = daft.read_parquet(INPUT_PATH)
Is there any way I can get a more descriptive error message than ErrorMetadata { code: None, message: None, extras: None }?
c
ok, it might be an issue with this bucket key, does accessing s3://daft-public-datasets/imagenet/sample_100k work?