https://www.getdaft.io logo
Join Slack
Powered by
# general
  • g

    Garrett Weaver

    06/27/2025, 7:23 PM
    fyi, I am hitting this error https://github.com/apache/arrow/issues/21526 with
    pyarrow
    when using Ray data directly and also seeing errors trying to read the same data with
    daft
    .
    pyspark
    works fine. I assume daft might be impacted by the same issue (too large row groups)?
    s
    • 2
    • 2
  • d

    delasoul

    07/14/2025, 8:47 AM
    Hello, are you planning to support the Ducklake format?
    j
    • 2
    • 1
  • a

    Amir Shukayev

    07/19/2025, 12:32 AM
    Hey! it seems like a lot of the docs indexed on google are pointing to dead links 🤔
    r
    • 2
    • 1
  • c

    Coury Ditch

    07/22/2025, 7:34 PM
    Has anyone had the experience of the native runner being faster than the ray runner?
    j
    c
    c
    • 4
    • 34
  • g

    Garrett Weaver

    07/24/2025, 3:45 AM
    👋 I am seeing the following error with native parquet writer on
    0.5.9
    , goes away if I set
    native_parquet_writer=False
    , I am using
    anonymous
    (we use alternative way to authenticate in k8s)
    Copy code
    daft.exceptions.DaftCoreException: DaftError::External task 6617 panicked with message "Failed to create S3 multipart writer: Generic { store: S3, source: UploadsCannotBeAnonymous }"
    👀 1
    s
    c
    +2
    • 5
    • 6
  • g

    Garrett Weaver

    07/25/2025, 4:44 AM
    👋 added an issue around column order when writing out to parquet. In latest version on native runner, if I do a final
    select
    prior to writing to parquet, it is not necessarily respected such that the order when reading back is different, is this expected?
    👀 1
    c
    • 2
    • 1
  • g

    Garrett Weaver

    07/28/2025, 5:33 PM
    Hi, do window functions and joins work in a single query with the new distributed engine on Ray, with the join pieces falling back to old Ray distributed engine? I know if I toggle the new distribution engine off, I get a not implemented error
    daft.exceptions.DaftCoreException: Not Yet Implemented: Window functions are currently only supported on the native runner.
    A small test with new engine on seems to work, but want to make sure there are not any caveats.
    c
    • 2
    • 6
  • e

    Everett Kleven

    07/28/2025, 9:41 PM
    Hey Daft Team, how expensive is the average concat operation? is it more recommended to append rows with pyarrow recordbatches?
    c
    • 2
    • 2
  • y

    Yufan

    07/29/2025, 7:30 AM
    Hey folks, does anyone know if Daft has anything similar to ray data's
    AggregateFnV2
    interface to define an efficient aggregation UDF
    k
    r
    • 3
    • 18
  • a

    Amir Shukayev

    07/31/2025, 5:10 PM
    hey maybe a stupid question, but what is the best way do pandarallel-type parallelism with nativerunner on a single machine?
    c
    • 2
    • 8
  • p

    Piqi Chen

    07/31/2025, 11:59 PM
    Hi, @Kevin Wang, starting daft 0.5.5, our pipeline begins to fail with auth to Azure blob storage. it was working with 0.5.4 and we made no changes. We highly suspect it is due to the change here: https://github.com/Eventual-Inc/Daft/pull/4508/files#diff-710768e09629e57839fbb2446756db7c7349c641346bd17cfe7ee9bf4d0a4f8f
    k
    • 2
    • 3
  • g

    Garrett Weaver

    08/01/2025, 4:53 PM
    qq, on native executor (not distributed), would a cross join + project udf + filter + write parquet be memory stable (tbl 1 rows = ~700, tbl 2 rows ~2m). note I generally would not cross join, just a special case I am looking at. currently I am seeing memory explode and the job dies, but works on ray runner with decent partitioning
    c
    • 2
    • 2
  • g

    Giridhar Pathak

    08/06/2025, 9:43 PM
    hi folks! we been exploring databricks and there are usecases involving delta sharing.. im curious if anyone has tried setting up daft with delta sharing to be able to access data from a unity catalog share
    k
    • 2
    • 3
  • s

    Sky Yin

    08/09/2025, 3:54 PM
    Hi, new here. I'm curious what scheduler folks often use with Daft? Airflow, Dagster, or something else?
    j
    • 2
    • 1
  • k

    Kesav Kolla

    08/14/2025, 5:10 AM
    Hi, how to use external ecosystem of libraries with daft? I have a requirement to convert tiff images into PDF. I'm using Java library to do the conversion. How easy is it to integrate Java ecosystem into daft?
    s
    • 2
    • 1
  • m

    Michele Tasca

    08/24/2025, 4:24 PM
    Hi guys 👋🏼 👋🏼 Daft looks super cool. I was wondering, does it support
    “first”
    and
    “last”
    aggregation strategies for window functions? Are there plans to support them? I commented in this git issue, but also asking here in case i missed something (Btw.. I’m evaluating different framewroks for a new project of ours, and it’s amazing how many things “just work” in daft. Too bad no first or last is a deal breaker for us)
    👀 1
    s
    d
    • 3
    • 4
  • c

    can cai

    08/26/2025, 10:10 AM
    Hi, may I ask if there is a benchmark report of daft vs ray data? https://docs.daft.ai/en/stable/benchmarks/
    c
    m
    k
    • 4
    • 15
  • g

    Garrett Weaver

    08/27/2025, 5:54 AM
    👋 I have a class UDF that depends on an environment variable being available. I know that the environment variable is set, but it is complaining that it doesn't exist. context running native executor in Argo workflows step.
    j
    • 2
    • 3
  • k

    Kesav Kolla

    08/27/2025, 11:26 AM
    Is there any benefit of writing rust functions instead of Python UDF? Wondering what's the performance penalty of Python UDFs? I have billions of rows in my dataframe and need to operate row wise transformations.
    c
    m
    +2
    • 5
    • 7
  • g

    Garrett Weaver

    08/27/2025, 6:18 PM
    Is there general guidance on using
    daft.func
    vs
    daft.udf
    ? I would guess that if the the underlying python code is not taking advantage of any vectorization but maybe just a list comprehension
    [my_func(x) for x in some_series],
    then just use
    daft.func
    ?
    j
    k
    s
    • 4
    • 38
  • g

    Garrett Weaver

    08/28/2025, 4:21 PM
    sqlmesh python models + daft would be 🔥 https://sqlmesh.readthedocs.io/en/latest/concepts/models/python_models/#pyspark
    m
    • 2
    • 2
  • v

    VOID 001

    08/29/2025, 3:55 AM
    Hi, does daft json unnset support in SQL queries? Is there any grammar like DuckDB struct explode? Something similar to the following SQL would be nice
    Copy code
    df = daft.from_pydict({
        "json": [
            '{"a": 1, "b": 2}',
            '{"a": 3, "b": 4}',
        ],
    })
    df = daft.sql("SELECT json.* FROM df")
    df.collect()
    r
    • 2
    • 4
  • a

    Amir Shukayev

    08/29/2025, 4:01 AM
    is
    concat
    lazy? Like
    Copy code
    df = reduce(
        lambda df1, df2: df1.concat(df2),
        [
            df_provider[i].get_daft_df()
            for i in range(num_dfs)
        ],
    )
    Is there any way to lazily combine a set of dfs? in any order
    j
    m
    • 3
    • 5
  • s

    Sky Yin

    08/29/2025, 10:31 PM
    When looking at the document, I don't see any data connector for GCP. How does Daft query data in Google cloud storage?
    c
    k
    +2
    • 5
    • 5
  • g

    Garrett Weaver

    09/04/2025, 8:41 PM
    Hi, with the new flotilla runner, should I expect OOM on the head node? I see
    get_next_partition
    is running there.
    k
    c
    • 3
    • 13
  • d

    Desmond Cheong

    09/04/2025, 11:58 PM
    Apparently we're trending on github (in rust) now! Thank you for all your support and love, and thank you to everyone who's been using Daft and building Daft alongside us :') https://www.reddit.com/r/rust/comments/1n8o8ud/daft_is_trending_on_github_in_rust/
    🔥 5
    ❤️ 7
  • v

    VOID 001

    09/05/2025, 5:56 AM
    Hi group, is there any benchmark comparing ray data & daft?
    💯 1
    n
    k
    +2
    • 5
    • 5
  • p

    Peer Schendel

    09/07/2025, 9:10 AM
    Hi guys, I am a data engineer in an ai team. I stumbled over daft and see alot of benefits using it 🙂 I saw the llm.generate() function. I was wondering is also working with llm-proxy providers like liteLLM? I also heard in the video about it, that it is running in batches like batch inference. But I was wondering, if there might be a nice implementation to run the real batch_api from AzureOpenai, Openai or other providers: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/batch?tabs=global-bat[…]ndard-input%2Cpython-key&pivots=programming-language-python
    Copy code
    import os
    from openai import AzureOpenAI
        
    client = AzureOpenAI(
        api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
        api_version="2025-03-01-preview",
        azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
        )
    
    # Upload a file with a purpose of "batch"
    file = client.files.create(
      file=open("test.jsonl", "rb"), 
      purpose="batch",
      extra_body={"expires_after":{"seconds": 1209600, "anchor": "created_at"}} # Optional you can set to a number between 1209600-2592000. This is equivalent to 14-30 days
    )
    
    
    print(file.model_dump_json(indent=2))
    
    print(f"File expiration: {datetime.fromtimestamp(file.expires_at) if file.expires_at is not None else 'Not set'}")
    
    file_id = file.id
    j
    • 2
    • 4
  • e

    Edmondo Porcu

    09/07/2025, 4:17 PM
    Hello world, minor member of the DataFusion community here 🙂 I actually think I met one of the founders of Daft at a startup event
    ❤️ 4
  • c

    ChanChan Mao

    09/08/2025, 5:29 PM
    The growth we saw last week was absolutely incredible 🔥 In the past 10 days, we've gained 700+ stars 🤯 Thank you to all for your support and for believing in Daft 🫶
    daft party 5
    🎉 3
    🚀 5