https://www.getdaft.io logo
Join Slack
Powered by
# general
  • g

    Garrett Weaver

    07/25/2025, 4:44 AM
    👋 added an issue around column order when writing out to parquet. In latest version on native runner, if I do a final
    select
    prior to writing to parquet, it is not necessarily respected such that the order when reading back is different, is this expected?
    👀 1
    c
    • 2
    • 1
  • g

    Garrett Weaver

    07/28/2025, 5:33 PM
    Hi, do window functions and joins work in a single query with the new distributed engine on Ray, with the join pieces falling back to old Ray distributed engine? I know if I toggle the new distribution engine off, I get a not implemented error
    daft.exceptions.DaftCoreException: Not Yet Implemented: Window functions are currently only supported on the native runner.
    A small test with new engine on seems to work, but want to make sure there are not any caveats.
    c
    • 2
    • 6
  • e

    Everett Kleven

    07/28/2025, 9:41 PM
    Hey Daft Team, how expensive is the average concat operation? is it more recommended to append rows with pyarrow recordbatches?
    c
    • 2
    • 2
  • y

    Yufan

    07/29/2025, 7:30 AM
    Hey folks, does anyone know if Daft has anything similar to ray data's
    AggregateFnV2
    interface to define an efficient aggregation UDF
    k
    r
    • 3
    • 18
  • a

    Amir Shukayev

    07/31/2025, 5:10 PM
    hey maybe a stupid question, but what is the best way do pandarallel-type parallelism with nativerunner on a single machine?
    c
    • 2
    • 8
  • p

    Piqi Chen

    07/31/2025, 11:59 PM
    Hi, @Kevin Wang, starting daft 0.5.5, our pipeline begins to fail with auth to Azure blob storage. it was working with 0.5.4 and we made no changes. We highly suspect it is due to the change here: https://github.com/Eventual-Inc/Daft/pull/4508/files#diff-710768e09629e57839fbb2446756db7c7349c641346bd17cfe7ee9bf4d0a4f8f
    k
    • 2
    • 3
  • g

    Garrett Weaver

    08/01/2025, 4:53 PM
    qq, on native executor (not distributed), would a cross join + project udf + filter + write parquet be memory stable (tbl 1 rows = ~700, tbl 2 rows ~2m). note I generally would not cross join, just a special case I am looking at. currently I am seeing memory explode and the job dies, but works on ray runner with decent partitioning
    c
    • 2
    • 2
  • g

    Giridhar Pathak

    08/06/2025, 9:43 PM
    hi folks! we been exploring databricks and there are usecases involving delta sharing.. im curious if anyone has tried setting up daft with delta sharing to be able to access data from a unity catalog share
    k
    • 2
    • 3
  • s

    Sky Yin

    08/09/2025, 3:54 PM
    Hi, new here. I'm curious what scheduler folks often use with Daft? Airflow, Dagster, or something else?
    j
    • 2
    • 1
  • k

    Kesav Kolla

    08/14/2025, 5:10 AM
    Hi, how to use external ecosystem of libraries with daft? I have a requirement to convert tiff images into PDF. I'm using Java library to do the conversion. How easy is it to integrate Java ecosystem into daft?
    s
    • 2
    • 1
  • m

    Michele Tasca

    08/24/2025, 4:24 PM
    Hi guys 👋🏼 👋🏼 Daft looks super cool. I was wondering, does it support
    “first”
    and
    “last”
    aggregation strategies for window functions? Are there plans to support them? I commented in this git issue, but also asking here in case i missed something (Btw.. I’m evaluating different framewroks for a new project of ours, and it’s amazing how many things “just work” in daft. Too bad no first or last is a deal breaker for us)
    👀 1
    s
    d
    • 3
    • 4
  • c

    can cai

    08/26/2025, 10:10 AM
    Hi, may I ask if there is a benchmark report of daft vs ray data? https://docs.daft.ai/en/stable/benchmarks/
    c
    m
    k
    • 4
    • 15
  • g

    Garrett Weaver

    08/27/2025, 5:54 AM
    👋 I have a class UDF that depends on an environment variable being available. I know that the environment variable is set, but it is complaining that it doesn't exist. context running native executor in Argo workflows step.
    j
    • 2
    • 3
  • k

    Kesav Kolla

    08/27/2025, 11:26 AM
    Is there any benefit of writing rust functions instead of Python UDF? Wondering what's the performance penalty of Python UDFs? I have billions of rows in my dataframe and need to operate row wise transformations.
    c
    m
    +2
    • 5
    • 7
  • g

    Garrett Weaver

    08/27/2025, 6:18 PM
    Is there general guidance on using
    daft.func
    vs
    daft.udf
    ? I would guess that if the the underlying python code is not taking advantage of any vectorization but maybe just a list comprehension
    [my_func(x) for x in some_series],
    then just use
    daft.func
    ?
    j
    k
    s
    • 4
    • 38
  • g

    Garrett Weaver

    08/28/2025, 4:21 PM
    sqlmesh python models + daft would be 🔥 https://sqlmesh.readthedocs.io/en/latest/concepts/models/python_models/#pyspark
    m
    • 2
    • 2
  • v

    VOID 001

    08/29/2025, 3:55 AM
    Hi, does daft json unnset support in SQL queries? Is there any grammar like DuckDB struct explode? Something similar to the following SQL would be nice
    Copy code
    df = daft.from_pydict({
        "json": [
            '{"a": 1, "b": 2}',
            '{"a": 3, "b": 4}',
        ],
    })
    df = daft.sql("SELECT json.* FROM df")
    df.collect()
    r
    • 2
    • 4
  • a

    Amir Shukayev

    08/29/2025, 4:01 AM
    is
    concat
    lazy? Like
    Copy code
    df = reduce(
        lambda df1, df2: df1.concat(df2),
        [
            df_provider[i].get_daft_df()
            for i in range(num_dfs)
        ],
    )
    Is there any way to lazily combine a set of dfs? in any order
    j
    m
    • 3
    • 5
  • s

    Sky Yin

    08/29/2025, 10:31 PM
    When looking at the document, I don't see any data connector for GCP. How does Daft query data in Google cloud storage?
    c
    k
    +2
    • 5
    • 5
  • g

    Garrett Weaver

    09/04/2025, 8:41 PM
    Hi, with the new flotilla runner, should I expect OOM on the head node? I see
    get_next_partition
    is running there.
    k
    c
    • 3
    • 13
  • d

    Desmond Cheong

    09/04/2025, 11:58 PM
    Apparently we're trending on github (in rust) now! Thank you for all your support and love, and thank you to everyone who's been using Daft and building Daft alongside us :') https://www.reddit.com/r/rust/comments/1n8o8ud/daft_is_trending_on_github_in_rust/
    🔥 6
    ❤️ 8
  • v

    VOID 001

    09/05/2025, 5:56 AM
    Hi group, is there any benchmark comparing ray data & daft?
    💯 1
    n
    k
    +2
    • 5
    • 10
  • p

    Peer Schendel

    09/07/2025, 9:10 AM
    Hi guys, I am a data engineer in an ai team. I stumbled over daft and see alot of benefits using it 🙂 I saw the llm.generate() function. I was wondering is also working with llm-proxy providers like liteLLM? I also heard in the video about it, that it is running in batches like batch inference. But I was wondering, if there might be a nice implementation to run the real batch_api from AzureOpenai, Openai or other providers: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/batch?tabs=global-bat[…]ndard-input%2Cpython-key&pivots=programming-language-python
    Copy code
    import os
    from openai import AzureOpenAI
        
    client = AzureOpenAI(
        api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
        api_version="2025-03-01-preview",
        azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
        )
    
    # Upload a file with a purpose of "batch"
    file = client.files.create(
      file=open("test.jsonl", "rb"), 
      purpose="batch",
      extra_body={"expires_after":{"seconds": 1209600, "anchor": "created_at"}} # Optional you can set to a number between 1209600-2592000. This is equivalent to 14-30 days
    )
    
    
    print(file.model_dump_json(indent=2))
    
    print(f"File expiration: {datetime.fromtimestamp(file.expires_at) if file.expires_at is not None else 'Not set'}")
    
    file_id = file.id
    j
    e
    • 3
    • 6
  • e

    Edmondo Porcu

    09/07/2025, 4:17 PM
    Hello world, minor member of the DataFusion community here 🙂 I actually think I met one of the founders of Daft at a startup event
    ❤️ 5
  • c

    ChanChan Mao

    09/08/2025, 5:29 PM
    The growth we saw last week was absolutely incredible 🔥 In the past 10 days, we've gained 700+ stars 🤯 Thank you to all for your support and for believing in Daft 🫶
    🎉 6
    🚀 8
    daft party 9
    c
    e
    y
    • 4
    • 4
  • c

    ChanChan Mao

    09/09/2025, 6:23 PM
    aaaaand we're live on Hugging Face documentation! Thank you to Quentin Lhoest, Daniel van Strien, and the Hugging Face team for all their help pushing this through, and excited for our continued collaboration! https://huggingface.co/docs/hub/datasets-daft
    🤗 7
    🙌 6
  • k

    Kyle

    09/11/2025, 5:04 AM
    For llm_generate is it possible to run a local huggingface model? Perhaps by directly putting the local model repo path in the params instead of an open huggingface repo name?
    k
    s
    +2
    • 5
    • 14
  • e

    Edmondo Porcu

    09/12/2025, 6:36 PM
    Quick question about Daft: how re-usable is its integration with Ray? The reason I asked is that datafusion-ray was an interesting project, tried to do some work, had no time, someone else picked it up, they had no time... Daft seems to be using Ray for distributing the workload
    j
    • 2
    • 2
  • r

    Rakesh Jain

    09/12/2025, 10:15 PM
    Hello Daft team, for the Lakevision project, which is for visualizing Iceberg based Lakehouses, we use daft for SQL and Sample Data, and we are very happy with it. Thanks for the great work!
    ❤️ 4
  • k

    Kyle

    09/15/2025, 6:22 AM
    Are there any plans to have a function to generate the perplexity of a model on a given text? E.g. perplexity of qwen1.5b on a particular string?