# general
v
Hi group, is there any benchmark comparing ray data & daft?
💯 1
n
I had a question along the same lines, not with respect to performance but rather functional aspects. What more do I get out of Daft that I cannot achieve using Ray Data? Because from what I see, the APIs are somewhat similar. Keen to know more about this.
v
IMO Daft provides an easy way to operate on data, and it also provides a DuckDB-like SQL interface to interact with. It has an optimizer and planner that can optimize the logical plan before execution, which I expect to give better performance (but I haven't finished benchmarking yet).
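To make the SQL interface concrete, here is a minimal sketch of querying a DataFrame through SQL. It assumes a recent Daft version that ships daft.sql; the toy data is made up purely for illustration.
Copy code
import daft

# Hypothetical toy data, just to show the DuckDB-style SQL entry point
df = daft.from_pydict({"id": [1, 2, 3], "label": ["cat", "dog", "cat"]})

# daft.sql builds a logical plan over DataFrames in scope; the optimizer
# can rewrite that plan before anything is executed.
result = daft.sql("SELECT id, label FROM df WHERE id > 1")
result.show()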
k
@Colin Ho are there any benchmarks we can share versus Ray Data?
j
Because of Ray Data's architecture, Daft blows it out of the water in terms of performance and development UX
c
I am preparing some benchmarks, you can expect them in the next few weeks!
👀 1
🎉 1
v
@jay Could you provide some detail on why Daft's performance is so much better than Ray Data's, on the technical side? Thanks!
j
Streaming execution: Ray Data just does Pandas/PyArrow/Python under the hood. If you have to do URL downloads, image decoding, etc., it all runs in inefficient, non-parallelized Python code. You'll see that Daft performs way better in real-world situations, such as when you have URLs in a Parquet table.
👍 1
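To illustrate the kind of workload being described, here is a minimal sketch of downloading and decoding images from URLs stored in a Parquet table using Daft's native expressions. The bucket path and the image_url column name are made up for the example.
Copy code
import daft

# Read a table that has a column of image URLs (path is hypothetical)
df = daft.read_parquet("s3://my-bucket/images.parquet")

# Native expressions: fetch the bytes, then decode them into images.
# Both steps run inside Daft's parallel, streaming execution engine
# rather than in per-row Python code.
df = df.with_column(
    "image",
    df["image_url"].url.download().image.decode(),
)
df.show()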
n
I am guessing Daft expressions are the real game changer.
c
We will likely be posting the benchmarks soon, together with some detailed comparisons between Daft vs Ray Data vs Spark. But in the meantime I can give a little sneak peek. Here's a large-scale image classification job where Daft took 2m 46s, Ray Data took 28m 5s, and Spark took 2h 38m. https://gist.github.com/colin-ho/bb0e31159d3c361a20bee85911057236

First of all, I found Spark to be terrible when it comes to anything with multimodal data, models, GPUs, Python dependencies, UDFs, etc. It was built for the Java world and analytics. It is hard to tell Spark that only this UDF needs GPUs, this UDF needs CPUs, this UDF does I/O, this UDF needs some batch size, please schedule accordingly. It's just impossible.

Ok, what about Ray Data and Daft then? Both natively support and embrace Python. You can write Python UDFs and give the engines hints about CPU usage, batch size, concurrency, etc. But some of the things where Daft shines:
1. Native multimodal expressions: image decoding/encoding/cropping/resizing, text and image embedding/classification APIs, LLM APIs, text tokenization, cosine similarity, URL downloads/uploads, reading video into image frames. These native multimodal expressions are highly optimized in Daft. In Ray Data you have to write your own Python UDFs that use external dependencies like Pillow, NumPy, spaCy, Hugging Face, etc. This comes at the cost of extra data movement, because these libraries each have their own data format, plus just overhead for the user.
2. Native I/O: we wrote our own readers and writers for Parquet, CSV, JSON, Lance, Iceberg, Delta, WARC, you name it. This means I/O is tightly integrated into the engine's streaming execution model, and fully parallelized. Ray Data simply uses PyArrow for reading/writing, which is less performant, and they have less control over it.
3. Query optimizer: once your query starts involving more and more operations, it gets very complicated for you, the user, to understand how to optimize. Things like projection pushdowns, filter pushdowns, and join reordering you get for free in Daft. In Ray Data, you have to manually decide which columns you need, where to put the filter, etc. Imagine how hard this can be when you have >100 columns.
❤️ 2
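As a concrete illustration of point 3, here is a minimal sketch of a query where Daft's optimizer can push the column selection and the filter down into the Parquet scan, so only the needed columns are read. The path and column names are made up.
Copy code
import daft

# A wide table with many columns (hypothetical path)
df = daft.read_parquet("s3://my-bucket/wide_table.parquet")

# Written naively: filter, then keep two columns. The optimizer rewrites
# this so the scan itself only reads "user_id" and "score" and applies
# the predicate as early as possible.
result = (
    df
    .where(df["score"] > 0.9)
    .select("user_id", "score")
)
result.explain(show_all=True)  # inspect the unoptimized vs. optimized plans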
v
I think the memory copy / data transfer overhead might also be a game changer in the future (with the Flotilla runner).
a
Hi Colin, super interesting! I'm trying to access the dataset in s3://daft-public-datasets/imagenet/sample_100k but I'm currently getting an Access Denied error -- could you open up the permissions?
c
the daft-public-datasets bucket is a requester pays bucket, you will need to add credentials:
Copy code
import os

import daft
from daft.io import IOConfig, S3Config

# Requester-pays access: the requester is billed, so credentials are required
s3_config = S3Config(
    requester_pays=True,
    key_id=os.environ["AWS_ACCESS_KEY_ID"],
    access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    anonymous=False,
)

IO_CONFIG = IOConfig(s3=s3_config)
daft.set_planning_config(default_io_config=IO_CONFIG)
🙌 1
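As an aside, the same IOConfig can also be passed per read instead of relying on the global planning config, which can be handy when only one dataset is requester-pays. A small sketch reusing IO_CONFIG from the snippet above:
Copy code
df = daft.read_parquet(
    "s3://daft-public-datasets/imagenet/sample_100k",
    io_config=IO_CONFIG,
)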
r
Hey @Colin Ho! I tried accessing s3://daft-public-datasets/imagenet/benchmark with requester_pays=True, but I got an error. Could you help me out?
Copy code
Traceback (most recent call last):
  File "/home/ray/default/image_classification/daft_main.py", line 64, in <module>
    df = daft.read_parquet(INPUT_PATH)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/daft/io/_parquet.py", line 84, in read_parquet
    builder = get_tabular_files_scan(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/daft/io/common.py", line 39, in get_tabular_files_scan
    scan_op = ScanOperatorHandle.glob_scan(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
daft.exceptions.DaftCoreException: DaftError::External Unhandled Error for path: s3://daft-public-datasets/imagenet/benchmark
Details:
unhandled error: Error { s3_extended_request_id: "cGWuXOF0y3KYHoUYKQ7M4VGR7LqGRrj26sPLMVQNZi0ELQoP4JB/jpH7PmCQyoE4D0SuanIOvig=", aws_request_id: "DHHPYAJ65GSQVSSZ" } (Unhandled(Unhandled { source: ErrorMetadata { code: None, message: None, extras: Some({"s3_extended_request_id": "cGWuXOF0y3KYHoUYKQ7M4VGR7LqGRrj26sPLMVQNZi0ELQoP4JB/jpH7PmCQyoE4D0SuanIOvig=", "aws_request_id": "DHHPYAJ65GSQVSSZ"}) }, meta: ErrorMetadata { code: None, message: None, extras: None } }))
c
how are you setting the s3 config?
r
So, AWS credentials are already configured on my nodes. I've tried this:
Copy code
s3_config = S3Config(
    requester_pays=True,
    anonymous=False,
)

IO_CONFIG = IOConfig(s3=s3_config)
daft.set_planning_config(default_io_config=IO_CONFIG)
And also manually exporting my credentials and doing this:
Copy code
s3_config = S3Config(
    requester_pays=True,
    key_id=os.environ["AWS_ACCESS_KEY_ID"],
    access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    anonymous=False,
)

IO_CONFIG = IOConfig(s3=s3_config)
daft.set_planning_config(default_io_config=IO_CONFIG)
c
can you try setting the region to us-west-2?
r
Hmm...seems like that didn't work.
Copy code
import daft
from daft.io import S3Config, IOConfig

INPUT_PATH = "<s3://daft-public-datasets/imagenet/benchmark>"

AWS_ACCESS_KEY_ID = ...
AWS_SECRET_ACCESS_KEY = ...
AWS_SESSION_TOKEN = ...


s3_config = S3Config(
    requester_pays=True,
    anonymous=False,
    region_name="us-west-2",
    session_token=AWS_SESSION_TOKEN,
    access_key=AWS_ACCESS_KEY_ID
)

df = daft.read_parquet(INPUT_PATH)
Error:
Copy code
daft.exceptions.DaftCoreException: DaftError::External Unhandled Error for path: s3://daft-public-datasets/imagenet/benchmark
Details:
unhandled error: Error { aws_request_id: "8AW1EBDTYWY531X9", s3_extended_request_id: "w9v2jFJlXt/6VgCYKFUkU66eESOxdi6z6mOI7i/h2gG1BBDp611r+OQLXL5ylG/hE4sLdu58dJI=" } (Unhandled(Unhandled { source: ErrorMetadata { code: None, message: None, extras: Some({"aws_request_id": "8AW1EBDTYWY531X9", "s3_extended_request_id": "w9v2jFJlXt/6VgCYKFUkU66eESOxdi6z6mOI7i/h2gG1BBDp611r+OQLXL5ylG/hE4sLdu58dJI="}) }, meta: ErrorMetadata { code: None, message: None, extras: None } }))
c
I think it should be
Copy code
s3_config = S3Config(
    requester_pays=True,
    anonymous=False,
    region_name="us-west-2",
    session_token=AWS_SESSION_TOKEN,
    key_id=AWS_ACCESS_KEY_ID,
    access_key=AWS_SECRET_ACCESS_KEY
)
r
Tried that and it didn't work either 😕
Copy code
s3_config = S3Config(
    requester_pays=True,
    anonymous=False,
    region_name="us-west-2",
    session_token=AWS_SESSION_TOKEN,
    key_id=AWS_ACCESS_KEY_ID,
    access_key=AWS_SECRET_ACCESS_KEY
)
IO_CONFIG = IOConfig(s3=s3_config)
daft.set_planning_config(default_io_config=IO_CONFIG)
df = daft.read_parquet(INPUT_PATH)
Is there any way I can get a more descriptive error message than ErrorMetadata { code: None, message: None, extras: None }?
c
ok, it might be an issue with this bucket key, does accessing s3://daft-public-datasets/imagenet/sample_100k work?