# general
  • r

    Rakesh Jain

    09/12/2025, 10:15 PM
    Hello Daft team, for the Lakevision project, which is for visualizing Iceberg based Lakehouses, we use daft for SQL and Sample Data, and we are very happy with it. Thanks for the great work!
    ❤️ 4
  • k

    Kyle

    09/15/2025, 6:22 AM
    Are there any plans to have a function to generate the perplexity of a model on a given text? E.g. perplexity of qwen1.5b on a particular string?
    s
    • 2
    • 2
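    (There is no built-in perplexity expression mentioned here; below is a minimal sketch of computing it yourself with a Daft UDF, assuming transformers and torch are installed. The Qwen checkpoint name is illustrative only.)
    Copy code
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    import daft
    from daft import DataType, udf

    MODEL = "Qwen/Qwen2.5-1.5B"  # illustrative checkpoint name

    @udf(return_dtype=DataType.float64())
    def perplexity(texts: daft.Series):
        # Loading inside the UDF keeps the sketch short; in practice use a
        # stateful UDF so the model is loaded once per worker.
        tok = AutoTokenizer.from_pretrained(MODEL)
        model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
        out = []
        for text in texts.to_pylist():
            ids = tok(text, return_tensors="pt").input_ids
            with torch.no_grad():
                loss = model(ids, labels=ids).loss  # mean token cross-entropy
            out.append(float(torch.exp(loss)))  # perplexity = exp(loss)
        return out

    df = daft.from_pydict({"text": ["hello world", "the quick brown fox"]})
    df = df.with_column("ppl", perplexity(df["text"]))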
  • u

    吕威

    09/16/2025, 7:22 AM
    Hi guys, I'm confused by DataType.image(). I'm trying to use a YOLO model to detect objects in images, then crop each detected object to pass to a downstream embedding step.
    Copy code
    import cv2
    import numpy as np
    from typing import List

    from daft import DataType, Series, udf

    @udf(
        return_dtype=DataType.list(
            DataType.struct(
                {
                    "class": DataType.string(),
                    "score": DataType.float64(),
                    "cropped_img": DataType.image(),
                    "bbox": DataType.list(DataType.int64()),
                }
            )
        ),
        num_gpus=1,
        batch_size=16,
    )
    class YOLOWorldOnnxObjDetect:
        def __init__(
            self,
            model_path: str,
            device: str = "cuda:0",
            confidence: float = 0.25,
        ):
            # init model (sets self.yolo and self.confidence)
            pass

        def __call__(self, images_2d_col: Series) -> List[List[dict]]:
            images: List[np.ndarray] = images_2d_col.to_pylist()
            results = self.yolo.predict(source=images, conf=self.confidence)
            objs: List[List[dict]] = []
            for r in results:
                img_result = []
                orig_img = r.orig_img
                for box in r.boxes:
                    x1, y1, x2, y2 = box.xyxy[0].cpu().numpy().astype(int)
                    x1, y1 = max(0, x1), max(0, y1)
                    x2, y2 = min(orig_img.shape[1], x2), min(orig_img.shape[0], y2)
                    x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
                    cls = int(box.cls[0])
                    img_result.append(
                        {
                            "class": self.yolo.names[cls],
                            "score": float(box.conf[0]),
                            "cropped_img": {
                                "cropimg": cv2.cvtColor(
                                    orig_img[y1:y2, x1:x2], cv2.COLOR_BGR2RGB
                                ),
                            },
                            "bbox": [x1, y1, x2, y2],
                        }
                    )
                objs.append(img_result)
            return objs
    The cropped_img has to be returned wrapped in a dict; if I return the np.ndarray directly, it raises `Could not convert array(..., dtype=uint8) with type numpy.ndarray: was expecting tuple of (key, value) pair`. Why?
    e
    r
    • 3
    • 9
  • c

    ChanChan Mao

    09/22/2025, 4:00 PM
    hey all! i wanted to share this use case that we came across in our community 😊 authored by @YK
    When John Shelburne of CatFIX Technology needed to process 59 million bond market records (28M bonds + 31M prices + trade ideas) for his neural network model, he had a few options:
    • Continue using pandas (taking 3 days for processing)
    • Move to Vertex AI or another cloud service (more costs)
    Essentially, he was choosing between painfully slow local processing or expensive cloud compute. That's when he discovered Daft. By switching from pandas to Daft, John achieved:
    • 7.3 minutes total runtime (vs 3 days previously)
    • 600x faster processing
    • 50% lower memory usage
    • All running locally on his iMac
    As John put it: "HOLY SMOKES WAS IT FAST! My neural network model is now trainable in real time."
    See his original post: https://www.linkedin.com/posts/shelburne_daft-eventual-katana-activity-7371644995176460288-a02g
    daft party 6
    🙌 6
  • c

    ChanChan Mao

    09/23/2025, 12:33 AM
    Hey everyone! Just wanted to share in this channel that we're bringing back the Daft Contributor Sync series, where we'll highlight work in the open source project, cover the latest releases and features, and shout out our contributors! This month's contributor sync will be this Thursday, September 25, at 4pm PT. We'll be talking about major improvements that we've shipped in the last few months, like Model APIs, UDF improvements, integrations with Turbopuffer, Clickhouse, and Lance, and our new
    daft.File
    datatype. Following that, @Colin Ho will dive into his work on Flotilla, our distributed engine, and showcase some exciting benchmark results 👀 We'll leave plenty of time at the end for questions and discussions. Add to your calendar and we'll see you then! 👋
    daft party 4
  • g

    Garrett Weaver

    09/23/2025, 10:28 PM
    any recommendations on dealing with protobuf? technically I can convert to JSON, but I would like to convert to a structured type automatically for easier access.
    s
    • 2
    • 2
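    (One possible approach in the meantime: decode the protobuf bytes in a UDF and return a struct. A minimal sketch; MyEvent, my_proto_pb2, and the struct fields are hypothetical stand-ins for your compiled message class and its schema:)
    Copy code
    import daft
    from daft import DataType, udf
    from my_proto_pb2 import MyEvent  # hypothetical generated message class

    @udf(
        return_dtype=DataType.struct(
            {"id": DataType.int64(), "name": DataType.string()}
        )
    )
    def decode_events(raw: daft.Series):
        out = []
        for payload in raw.to_pylist():
            msg = MyEvent()
            msg.ParseFromString(payload)  # protobuf-encoded bytes -> message
            out.append({"id": msg.id, "name": msg.name})
        return out

    df = df.with_column("event", decode_events(df["raw_bytes"]))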
  • a

    Amir Shukayev

    09/24/2025, 11:38 PM
    any recommendations on running daft in a managed instance group on GCP?
    c
    • 2
    • 1
  • n

    Nathan Cai

    09/24/2025, 11:59 PM
    Hey guys, quick question, but isn't this supposed to say
    Copy code
    # Supply actual values for the s3
    Not
    Copy code
    # Supply actual values for the se
    in the docs? https://docs.daft.ai/en/stable/connectors/aws/#rely-on-environment
    Copy code
    from daft.io import IOConfig, S3Config
    
    # Supply actual values for the se
    io_config = IOConfig(s3=S3Config(key_id="key_id", session_token="session_token", secret_key="secret_key"))
    
    # Globally set the default IOConfig for any subsequent I/O calls
    daft.set_planning_config(default_io_config=io_config)
    
    # Perform some I/O operation
    df = daft.read_parquet("s3://my_bucket/my_path/**/*")
    d
    • 2
    • 2
  • c

    ChanChan Mao

    09/25/2025, 11:01 PM
    Hey everyone, we're live for the contributor sync! Join us 🙂 https://us06web.zoom.us/j/89647699067?pwd=b0jsNnL9yT1L2wTsDoG6kh9e83kcp7.1&jst=2
    ❤️ 3
  • g

    Garrett Weaver

    09/26/2025, 4:58 AM
    when running a UDF with native runner and
    use_process=True
    everything works fine locally (mac), but seeing
    /usr/bin/bash: line 1:    58 Bus error               (core dumped)
    when run on k8s (argo workflows). any thoughts?
    c
    • 2
    • 3
  • g

    Garrett Weaver

    09/26/2025, 5:11 PM
    I am using
    explode
    with the average result being 1 row --> 12 rows and the max 1 row --> 366 rows (~5m rows --> ~66m rows). I'm seeing decently high memory usage during the explode even with a repartition prior to the explode. Is the only remedy more partitions and/or a reduced number of CPUs to reduce parallelism?
    c
    k
    • 3
    • 10
  • n

    Nathan Cai

    09/29/2025, 1:00 PM
    Hi there, just a quick question, what value does Daft provide compared to me just using multi-threaded programming? In which use cases does Daft go above and beyond just multi-threading?
    c
    k
    • 3
    • 3
  • r

    Robert Howell

    09/30/2025, 4:39 PM
    👋 Looking to get started with Daft? We’ve curated some approachable Good First Issues which the team is happy to help you with!
    New Expressions: these are fun because you'll get to introduce entirely new capabilities.
    • Add `Expression.var`
    • Add `Expression.pow`
    • Add `Expression.product`
    • Support `ddof` argument for `stddev`
    • Support simple case and searched case with a list of branches
    • UUID function <-- very easy and very 🆒
    Enhancements: these are interesting because you'll get to learn from the existing functionality and extend current capabilities.
    • Support `Series[start:end]` slicing
    • Support image hash functions for deduplication
    • Equality on structured types <-- this would be a great addition
    • Hash rows of dataframes
    • Custom CSV delimiters <-- also a great addition!!
    • Custom CSV quote characters
    • Custom CSV date/time formatting
    Documentation: we are ALWAYS interested in documentation PRs that improve docs.daft.ai. Here's a mega-list which you could work through with coding agents to curate something nice!
    • https://github.com/Eventual-Inc/Daft/issues/4125
    Full List:
    • https://github.com/Eventual-Inc/Daft/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22&page=1
    ❤️ 6
    🙌 1
  • d

    Dan Reverri

    09/30/2025, 9:50 PM
    I’m trying to read a large csv file and want to iterate over arrow record batches but daft is reading the whole file before starting to yield batches. Are there any options to process the csv file in chunks?
    d
    c
    • 3
    • 19
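    (As a fallback outside Daft, assuming the file is readable locally and pyarrow is installed, pyarrow's streaming CSV reader yields record batches incrementally instead of materializing the whole file. A minimal sketch; process() is a hypothetical per-batch handler:)
    Copy code
    import pyarrow.csv as pacsv

    reader = pacsv.open_csv(
        "large.csv",
        read_options=pacsv.ReadOptions(block_size=64 * 1024 * 1024),  # 64 MB chunks
    )
    for batch in reader:  # each item is a pyarrow.RecordBatch
        process(batch)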
  • c

    ChanChan Mao

    10/03/2025, 9:35 PM
    If you're new to Daft and our community, I highly recommend reading this blog post by @Euan Lim giving an overview of Daft! The technical concepts were explained very well and are easily digestible. Thanks for this writeup!
    🙌 6
    e
    • 2
    • 1
  • c

    Colin Ho

    10/15/2025, 4:14 PM
    Hey everyone! We just pushed some updates to the Daft docs on optimization and debugging — especially around memory and partitioning.
    • Managing memory: tips on how to tune Daft to lower memory usage and avoid OOMs.
    • Optimizing partitions / batches: details on how Daft parallelizes data across cores / machines, and how to control it.
    Do check them out if you face any difficulties like high memory usage or low CPU utilization within your workloads!
    daft party 4
    s
    • 2
    • 1
  • f

    fr1ll

    10/15/2025, 11:36 PM
    Hi there- an API question for you. I’m working on a pipeline to generate input data for Apple’s new
    embedding-atlas
    library. At the end, I need to run UMAP against the full set of embeddings. I currently resort to the clunky approach below. Am I missing a better way to add a column from a numpy array of the same length as the dataframe?
    Copy code
    import numpy as np
    import pyarrow as pa

    vecs = np.stack([r["embedding"] for r in df.select("embedding").to_pylist()], axis=0)
    
    # little helper wrapping umap
    xy, knn_indices, knn_distances = umap_with_neighbors(vecs).values()
    
    df = df.with_column("_row_index", daft.sql_expr("row_number()") - 1)
    
    umap_cols = daft.from_pydict({
        "_row_index": list(range(xy.shape[0])),
        "umap_x": daft.Series.from_arrow(pa.array(xy[:,0])),
        "umap_y": daft.Series.from_arrow(pa.array(xy[:,1])),
    })
     
    df = df.join(umap_cols, on="_row_index")
    m
    • 2
    • 5
  • n

    Nathan Cai

    10/17/2025, 4:00 PM
    Hi there, I'm making a contribution for https://github.com/Eventual-Inc/Daft/issues/3786 I think I've done it, but I was wondering if someone could help explain to me how the testing works. I already see function signatures in
    tests/connect/test_io.py
    Copy code
    @pytest.mark.skip(reason="https://github.com/Eventual-Inc/Daft/issues/3786")
    def test_write_csv_with_delimiter(make_df, make_spark_df, spark_session, tmp_path):
        pass
    But the thing is, I don't know what these arguments mean, and I don't see that function being invoked anywhere else, so I don't know where those arguments are being provided from. The other thing is, its function signature is different from the ones above it:
    Copy code
    def test_csv_basic_roundtrip(make_spark_df, assert_spark_equals, spark_session, tmp_path):
    And I think I might need the
    assert_spark_equals
    function in my scenario. I was wondering if someone would be willing to point me in the right direction.
    r
    m
    • 3
    • 16
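    (For context on those arguments: they are pytest fixtures, injected by matching parameter names against fixtures registered in a conftest.py somewhere up the test tree, which is likely where make_df, make_spark_df, spark_session, and assert_spark_equals come from; tmp_path is a built-in pytest fixture. A minimal illustration with a hypothetical fixture:)
    Copy code
    import pytest

    @pytest.fixture
    def greeting():
        # pytest calls this and passes the return value to any test that
        # declares a parameter named `greeting`
        return "hello"

    def test_uses_fixture(greeting, tmp_path):
        # `tmp_path` is pytest's built-in temporary-directory fixture
        (tmp_path / "out.txt").write_text(greeting)
        assert (tmp_path / "out.txt").read_text() == "hello"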
  • b

    Ben Cornelis

    10/17/2025, 5:37 PM
    Hello! I have a question about flotilla runner memory usage. I'm running a daft query against a ray cluster running on k8s (azure). I'm attempting to read a large parquet table (~1.8T, 50k files) from azure blob store, performing a group by with a count aggregation, and then writing the resulting dataframe to another directory. The script and query look like:
    Copy code
    import daft
    from daft import col

    df = daft.read_parquet(
        "abfs://my_container@my_service_account.dfs.core.windows.net/in_table_path/"
    )
    df = df.groupby("group_col").agg(col("count_col").count())
    df.write_parquet("abfs://my_container@my_service_account.dfs.core.windows.net/out_table_path/")
    (it's worth noting that I have daft pinned to 0.5.19 since I currently get an error using
    write_parquet
    to azure blob store on later versions. It is the same issue as here: https://github.com/Eventual-Inc/Daft/issues/5336) When I submit a job running this script to my ray cluster, it eventually fails with this error:
    Copy code
    ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
    Memory on the node (IP: 10.224.2.36, ID: 637c809d2d9dfed8dd5c109e7ff9b81e4c10a024322086029a8f6def) where the task (actor ID: 53ff539b6349c1bee7efc01802000000, name=flotilla-plan-runner:RemoteFlotillaRunner.__init__, pid=1631, memory used=24.82GB) was running was 30.46GB / 32.00GB (0.951874), which exceeds the memory usage threshold of 0.95.
    Is it expected that the flotilla runner on the head node consumes this much memory?
    c
    • 2
    • 12
  • b

    Ben Cornelis

    10/17/2025, 9:34 PM
    I'm getting a connection reset error periodically reading parquet from ADLS:
    Copy code
    daft.exceptions.DaftCoreException: DaftError::External Cached error: Unable to open file abfs://path/to/my.parquet: Error { context: Full(Custom { kind: Io, error: reqwest::Error { kind: Decode, source: hyper::Error(Body, Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }) } }, "error converting `reqwest` request into a byte stream") }
    With the job I mentioned above with 50k files it's happening often enough that the job always fails. Has anyone ever seen this?
    k
    • 2
    • 35
  • b

    Ben Cornelis

    10/20/2025, 8:13 PM
    Is it possible to pass an ADLS uri containing a storage account name instead of passing env / config variables? I'm getting this error `Azure Storage Account not set and is required. Set either
    AzureConfig.storage_account
    or the
    AZURE_STORAGE_ACCOUNT
    environment variable.` but I also see code attempting to parse it from the uri: https://github.com/Eventual-Inc/Daft/blob/main/src/daft-io/src/azure_blob.rs#L116. I'm not sure why that's not working. I'm using an abfs uri of the form
    abfs://container@account_name.dfs.core.windows.net/path/
    k
    • 2
    • 6
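    (One workaround, assuming the config route is acceptable: set the storage account explicitly on an IOConfig and pass it to the read. A minimal sketch; the account and container names are placeholders:)
    Copy code
    import daft
    from daft.io import AzureConfig, IOConfig

    io_config = IOConfig(azure=AzureConfig(storage_account="account_name"))
    df = daft.read_parquet(
        "abfs://container@account_name.dfs.core.windows.net/path/",
        io_config=io_config,
    )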
  • n

    Nathan Cai

    10/21/2025, 9:03 PM
    By the way, does anyone actually run
    make test
    on their own devices? It burns a hole in my RAM when I run it on my laptop
    d
    m
    s
    • 4
    • 3
  • v

    VOID 001

    10/23/2025, 7:40 AM
    Hi, is there any native data source support for Daft reading from Hive?
    k
    • 2
    • 1
  • b

    Ben Cornelis

    10/24/2025, 7:40 PM
    I'm running a group by query on the 1.8TB dataset I mentioned above. The ray job is failing because some RaySwordfishActors are running out of memory, either during the repartition or write parquet step (it looks like they're interleaved, so it's hard to tell). The log I'm seeing is:
    (raylet) node_manager.cc:3193: 1 Workers (tasks / actors) killed due to memory pressure (OOM)
    I've tried various k8s pod sizes - currently I have 80 worker pods with 8 CPU / 32GB. Does anyone know why so much memory would be used, or how to tune / think about queries like this?
    ✅ 1
    d
    • 2
    • 14
  • g

    Garrett Weaver

    10/24/2025, 8:17 PM
    Is
    parquet_inflation_factor
    still respected in flotilla? I have a parquet file that is ~2GB but expands to ~16GB in memory. When trying to run a select with a window function, it tries to load all the data into memory and struggles (OOMing). What is the best course of action here?
    ✅ 1
    m
    c
    • 3
    • 8
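    (For reference, a sketch of raising the inflation hint via the execution config, assuming the knob is still exposed there in the version being used; whether flotilla respects it is exactly the open question above:)
    Copy code
    import daft

    # Hint that parquet data expands ~8x when decoded, so the planner
    # schedules smaller scan tasks / batches.
    daft.set_execution_config(parquet_inflation_factor=8.0)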
  • g

    Garrett Weaver

    10/24/2025, 9:35 PM
    another question: in the latest version, I am seeing an arrow schema mismatch on write, with some new metadata in the schema. It works fine with the pyarrow writer (
    native_parquet_writer=False
    )
    ✅ 1
    d
    • 2
    • 7
  • t

    takumi hayase

    10/28/2025, 12:39 PM
    I am looking for a new position as a full-stack developer now.
  • b

    Ben Cornelis

    10/28/2025, 3:40 PM
    I'm seeing lower than expected CPU utilization, 30-40% (I originally posted 5-20% but misread it), on my k8s pods running a query - any tips on debugging or optimizing this?
    j
    • 2
    • 4
  • v

    VOID 001

    10/29/2025, 1:33 PM
    I've found that Daft is now moving to a new stateful UDF implementation: daft.cls. For developers who are already using daft.udf, does the community suggest switching to daft.cls now (for production-related workloads)? And what functionality does daft.cls currently not support? Also, what's the cost of migrating from daft.udf to daft.cls? IMO users only need to change the decorator name plus the function call arguments. Is there anything else I need to be concerned about? Thanks!
    c
    e
    • 3
    • 8
  • n

    Nathan Cai

    10/30/2025, 10:30 PM
    Hi there, I was planning on working on this issue of adding a custom delimiter to the
    write_csv
    option https://github.com/Eventual-Inc/Daft/issues/3786, but I realized that since I'm using one field of PyArrow's csv WriteOptions (https://arrow.apache.org/docs/python/generated/pyarrow.csv.WriteOptions.html#pyarrow.csv.WriteOptions), I might as well implement support for all the fields, as it shouldn't be too difficult. This is the original PyArrow class:
    Copy code
    @dataclass(kw_only=True)
    class WriteOptions(lib._Weakrefable):
        include_header: bool = field(default=True, kw_only=False)
        batch_size: int = 1024
        delimiter: str = ","
        quoting_style: Literal["needed", "all_valid", "none"] = "needed"
    
        def validate(self) -> None: ...
    I want to implement the corresponding class in Rust like this:
    Copy code
    #[derive(Debug, Clone)]
    pub struct CSVWriteOptions {
        pub include_header: bool,
        pub batch_size: usize,
        pub delimiter: char,
        pub quoting_style: CSVQuotingStyle,
    }
    
    #[derive(Debug, Clone)]
    pub enum CSVQuotingStyle {
        Needed,
        AllValid,
        None,
    }
    However, there are a couple of obstacles. The first is that it seems Rust doesn't support default field values, so I guess I'd create a special constructor (or a Default impl) for the defaults? But the main problem is the
    quoting_style
    field: it accepts only specific strings. Normally this is fine, but the problem is that when that Rust struct gets passed to Daft's Python CSV writer class, which hands those options to PyArrow, PyArrow accepts a string for
    quoting_style
    , not an enum. What is the cleanest approach to this? As far as I'm aware you can't restrict a string to specific values in Rust; you have to use an enum. Should I just create two structs: the one above, and a new one that's the same but accepts
    quoting_style
    as a String, with the original
    CSVWriteOptions
    having a function that converts it into the new struct by turning the
    quoting_style
    into a string? How does Daft deal with interoperability between Python string literals and Rust enums? I feel like this is something you have encountered before, and I want to know your standard way of doing this.
    d
    • 2
    • 4