# general

    Andrew Fuqua

    04/21/2025, 10:59 PM
    Does the read splitting mentioned in this blog post work in the native runner? I'm running into resource exhaustion when reading a single 2.5GB parquet file and attempting to write it back out repartitioned across 30 buckets using iceberg writer. This works for smaller files (tested with 200MB). I've set the daft execution config to the same values as in the post. The node has 32GB RAM which seems like enough for the task, but it is actually quickly exhausted (within 2 minutes), same for swap. Any other params I could tune to make this work? Would a local Ray cluster handle this task within the same resources?
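    For context, the split settings from the post look roughly like this (a sketch assuming daft.set_execution_config and its scan-task size options; the byte values are illustrative, not recommendations):
    import daft

    # Scan tasks smaller than the min get merged; files larger than the max
    # get split into multiple scan tasks (the read splitting from the post).
    daft.set_execution_config(
        scan_tasks_min_size_bytes=96 * 1024 * 1024,
        scan_tasks_max_size_bytes=384 * 1024 * 1024,
    )

    df = daft.read_parquet("big_file.parquet")
    df = df.repartition(30)  # then hand off to the Iceberg writer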

    Everett Kleven

    04/24/2025, 4:45 PM
    Is anyone else experiencing strange scrolling behavior on the docs? It seems to have started after the update: if you scroll down, it automatically scrolls back to the top.
    Screen Recording 2025-04-24 at 11.39.28 AM.mov

    Yufan

    04/25/2025, 3:56 PM
    Hey folks, wanna get some inputs from the experts to see if this pattern is something Daft could support 🧵

    Andrew Kursar

    04/25/2025, 9:29 PM
    Hello! The recent daft 0.4.11 release added a pyarrow <=16 constraint (see https://github.com/Eventual-Inc/Daft/pull/4225), but the latest pyiceberg requires >=17 (see https://github.com/apache/iceberg-python/blob/pyiceberg-0.9.0/pyproject.toml#L64). Does anyone know what the newly added upper bound in daft is about? Or, more directly, are there any issues to watch that are blocking use of later pyarrow releases? I've been using pyiceberg 0.9.0 with daft 0.4.10, but now I can't upgrade daft without downgrading pyiceberg.

    yashovardhan chaturvedi

    05/02/2025, 2:34 PM
    hey folks, does daft have something like https://fastht.ml/docs/#getting-help-from-ai (i.e. https://fastht.ml/docs/llms-ctx.txt) that can be fed to LLMs? It might make using daft with Cursor etc. more productive.

    Neil Wadhvana

    05/04/2025, 1:55 AM
    Hey guys, I'd like to do something like this in daft without needing to specify each column (since I can have anywhere from 1 to 5). Is there another syntax that would work? This is not working as is:
    import daft
    import numpy as np
    from typing import Dict, List

    @daft.udf(return_dtype=daft.DataType.python())
    def mean_ensemble(*depth_value_series: daft.Series) -> List[Dict[str, np.ndarray]]:
        """Apply mean ensemble to depth maps."""
        depth_value_lists = [series.to_pylist() for series in depth_value_series]
        reduced_depth_maps: List[Dict[str, np.ndarray]] = []
        # Zip across the input series so each iteration sees one row's depth maps
        # (iterating the per-column lists directly yields one result per column,
        # not per row, which breaks the UDF's output-length contract).
        for depth_maps in zip(*depth_value_lists):
            # Mean and standard deviation across this row's depth maps
            stacked_depths = np.stack(depth_maps, axis=0)
            reduced_depth_maps.append(
                {
                    "mean": np.mean(stacked_depths, axis=0),
                    "std": np.std(stacked_depths, axis=0),
                }
            )

        return reduced_depth_maps
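    A call site can then unpack however many depth columns exist, along these lines (sketch; the column names are hypothetical, and it assumes daft UDFs accept variadic expression arguments):
    depth_cols = ["depth_a", "depth_b", "depth_c"]  # anywhere from 1 to 5 of these
    df = df.with_column("ensemble", mean_ensemble(*[df[c] for c in depth_cols]))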

    Garrett Weaver

    05/14/2025, 1:05 PM
    👋 any timeline for when window functions will be available for ray runner?

    Yufan

    05/21/2025, 11:29 PM
    hey folks, want your input on what's the most efficient way of resizing partitions
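    For context, the two obvious candidates in daft's API look like this (sketch; if I'm not mistaken, into_partitions avoids a full shuffle while repartition does a hash-partitioned shuffle):
    import daft

    df = daft.read_parquet("data/*.parquet")

    # Split/coalesce to a target partition count without a full shuffle
    df = df.into_partitions(64)

    # Full shuffle, optionally hash-partitioned on one or more columns
    df = df.repartition(64, "user_id")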

    Yuri Gorokhov

    05/23/2025, 9:29 PM
    I am trying to explode a column that is a list of structs (it's a fairly nested schema) and encountering this error:
    Attempting to downcast Map { key: Utf8, value: List(Utf8) } to "daft_core::array::list_array::ListArray"
    Wondering if someone has seen this before?
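    A minimal repro of the shape involved would be something like this (sketch; note the error above mentions a Map type, so the offending field may actually be a map rather than a plain list of structs):
    import daft
    from daft import col

    df = daft.from_pydict(
        {"events": [[{"k": "a", "v": ["x"]}, {"k": "b", "v": ["y"]}], [{"k": "c", "v": []}]]}
    )
    df.explode(col("events")).show()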

    Giridhar Pathak

    05/25/2025, 1:37 AM
    hey folks, I'm getting a weird TypeError when reading from an iceberg table:
    TypeError: pyarrow.lib.large_list() takes no keyword arguments
    the code:
    table = catalog.load_table(table)
    return daft.read_iceberg(table)
    has anyone experienced this before?
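    This smells like a pyarrow version mismatch (older pyarrow's large_list doesn't take keyword arguments, while newer callers such as pyiceberg pass them). A quick triage step is to print the versions in play (sketch):
    import daft, pyarrow, pyiceberg

    print(daft.__version__, pyarrow.__version__, pyiceberg.__version__)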

    Giridhar Pathak

    05/25/2025, 10:18 PM
    I'm querying an iceberg table from a Jupyter notebook (backed by 12GB RAM and 4 CPUs):
    daft.read_table("platform.messages").filter("event_time > TIMESTAMP '2025-05-24T00:00:00Z'").limit(5).show()
    Running this makes the process crash - memory goes through the roof. Not sure if it's trying to read the whole table into memory. Pre-materialization, I can get the schema just fine.
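    One way to check whether the filter and limit are actually being pushed into the scan is to inspect the plan before materializing (sketch, assuming DataFrame.explain):
    df = (
        daft.read_table("platform.messages")
        .filter("event_time > TIMESTAMP '2025-05-24T00:00:00Z'")
        .limit(5)
    )
    df.explain(show_all=True)  # look for the filter/limit inside the scan node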

    Everett Kleven

    05/28/2025, 2:15 PM
    https://github.com/JanKaul/iceberg-rust 👀

    Yuri Gorokhov

    05/28/2025, 4:14 PM
    Is there an equivalent to pyspark's
    .dropDuplicates(subset: Optional[List[str]] = None)
    where you can specify which columns to consider?
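    daft has DataFrame.distinct() for whole-row dedup; for a subset, one workaround is grouping by the key columns and keeping an arbitrary value for the rest (sketch, assuming the any_value aggregation):
    from daft import col

    # Whole-row equivalent of dropDuplicates()
    df = df.distinct()

    # Rough equivalent of dropDuplicates(subset=["key_a", "key_b"])
    df = df.groupby("key_a", "key_b").agg(col("payload").any_value())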

    Pat Patterson

    05/29/2025, 11:37 PM
    Hi there - I'm trying out Daft after meeting @ChanChan Mao and @Sammy Sidhu at Data Council a few weeks ago. I got all the queries from my recent Iceberg and Backblaze B2 blog post working - see https://gist.github.com/metadaddy/ec9e645fa0929321b626d8be6e11162e. Performance in general is not great, but one query in particular is extremely slow:
    # How many records are in the current Drive Stats dataset?
    count, elapsed_time = time_collect(drivestats.count())
    print(f"Total record count: {count.to_pydict()['count'][0]} ({elapsed_time:.2f} seconds)")
    With the other systems I tested in my blog post, the equivalent query takes between a fraction of a second and 15 seconds. That Daft call to drivestats.count() takes 80 seconds. I'm guessing it's doing way more work than it needs to - reading the record counts from each of the 365 Parquet files rather than simply reading total-records from the most recent metadata file. Since SELECT COUNT(*) is such a common operation, I think it's worth short-circuiting the current behavior.
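    For comparison, the metadata-only version of that count via pyiceberg looks something like this (sketch; the exact summary accessor may vary across pyiceberg versions):
    # `table` is the pyiceberg table backing the daft DataFrame
    snapshot = table.current_snapshot()
    total_records = int(snapshot.summary["total-records"])
    print(f"Total record count: {total_records}")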

    Giridhar Pathak

    06/03/2025, 2:17 PM
    Question on the Daft Native runtime 🧵

    Everett Kleven

    06/03/2025, 2:59 PM
    Hey daft squad, If I'm using a MemoryCatalog to track lancedb tables, am I restricted to only using dataframes at the moment?

    Pat Patterson

    06/06/2025, 10:47 PM
    Where does the work take place when I use Daft with Ray? For example, consider the following minimal code:
    import daft
    import ray
    from pyiceberg.catalog import load_catalog

    ray.init("ray://head_node_host:10001", runtime_env={"pip": ["daft"]})

    daft.context.set_runner_ray("ray://head_node_host:10001")

    catalog = load_catalog(
        'iceberg',
        **{
            'uri': 'sqlite:///:memory:',
            # configuration to access Backblaze B2's S3-compatible API such as
            # s3.endpoint, s3.region, etc
        }
    )

    catalog.create_namespace('default', {'location': 's3://my-bucket/'})
    table = catalog.register_table('default.drivestats', metadata_location)

    drivestats = daft.read_iceberg(table)

    result = drivestats.count().collect()
    print(f"Total record count: {result.to_pydict()['count'][0]}")
    Presumably, the code that reads Parquet files from Backblaze B2 via the AWS SDK executes on the Ray cluster, so I have to either install the necessary packages there ahead of time or specify them, and the environment variables, in runtime_env? For example:
    ray.init("ray://<head_node_host>:10001", runtime_env={
        "pip": ["daft==0.5.2", "boto3==1.34.162", "botocore==1.34.162", ...etc...],
        "env_vars": {
            "AWS_ENDPOINT_URL": os.environ["AWS_ENDPOINT_URL"],
            "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
            "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
            ...etc...
        }
    })

    Dimitris

    06/10/2025, 8:41 PM
    Hi, how does one insert a row into a df with daft? All the examples I see load an existing dataset and add columns. Thanks!
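    DataFrames in daft are immutable, so the usual pattern is to build a one-row frame and concatenate it (sketch using from_pydict and concat):
    import daft

    df = daft.from_pydict({"id": [1, 2], "name": ["a", "b"]})

    # Build a one-row DataFrame with the same schema and append it
    new_row = daft.from_pydict({"id": [3], "name": ["c"]})
    df = df.concat(new_row)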

    Tabrez Mohammed

    06/16/2025, 10:41 PM
    We use daft on Ray with Glue and Iceberg. We recently changed the table config from CoW to MoR to improve write perf in our Spark jobs. Unfortunately, Daft now can't even read the tables, let alone run operations like joins, without running out of memory - we tried up to a 200GB worker with a 20GB table, just loading it in daft and converting to a Ray dataset.

    Dimitris

    06/16/2025, 11:22 PM
    Hi, what are the recommended ways to debug UDFs? Is there a way to print or log to the console when using ray?
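    Plain print/logging inside the UDF generally works - on Ray, worker stdout/stderr is forwarded to the driver and shows up in the dashboard. A sketch:
    import logging

    import daft

    logger = logging.getLogger(__name__)

    @daft.udf(return_dtype=daft.DataType.int64())
    def debug_udf(s: daft.Series):
        values = s.to_pylist()
        # This output appears in the Ray driver / dashboard logs
        logger.warning("batch of %d values, first=%r", len(values), values[:1])
        return [v * 2 for v in values]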

    Sasha Shtern

    06/17/2025, 8:06 PM
    Hi Data Engineering Friends, I'm looking for someone to help us finalize our Daft / Ray setup. Is there anyone in the group who has solid experience with Daft and is willing to do some consulting work? We're currently using Spark, and I've ported our code over to Daft/Ray to give it a try. We really love the tooling and API compared to Spark so far. I was hoping to see performance improvements, but we're hitting a few snags out of the gate: Iceberg table writes are much slower than in Spark, and we're seeing some OOMs we weren't seeing before. I'm optimistic that these issues are solvable, but I could use the expertise of someone who's been around the block with Daft. Thank you!

    Dimitris

    06/18/2025, 10:21 PM
    Hi, do you have recommendations for performing semantic search (with the use of embeddings) on daft? I haven’t found much so far. I’m mainly interested in the distributed case.
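    One pattern is to compute embeddings in a UDF and score similarity as a column (sketch; sentence_transformers and the model name are assumptions, not daft built-ins, and for the distributed case the model load would ideally live in a stateful UDF so each worker initializes it once):
    import daft
    import numpy as np
    from sentence_transformers import SentenceTransformer  # hypothetical encoder choice

    model = SentenceTransformer("all-MiniLM-L6-v2")

    @daft.udf(return_dtype=daft.DataType.python())
    def embed(texts: daft.Series):
        return list(model.encode(texts.to_pylist()))

    df = daft.from_pydict({"text": ["a cat on a mat", "stock prices fell"]})
    df = df.with_column("embedding", embed(df["text"]))

    query_vec = model.encode(["animals at home"])[0]

    @daft.udf(return_dtype=daft.DataType.float64())
    def cosine_sim(vecs: daft.Series):
        return [
            float(np.dot(v, query_vec) / (np.linalg.norm(v) * np.linalg.norm(query_vec)))
            for v in vecs.to_pylist()
        ]

    df = df.with_column("score", cosine_sim(df["embedding"])).sort("score", desc=True)
    df.show()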

    Garrett Weaver

    06/19/2025, 4:25 PM
    👋 qq, I am seeing mixed support with respect to identifiers for Iceberg tables. Specifically, for some methods I can provide catalog.namespace.table (e.g. write_table), but others break and seem to expect namespace.table (e.g. create_table_if_not_exists, which calls pyiceberg under the hood, and pyiceberg had a breaking change that no longer allows including the catalog). Any advice on how I should be providing identifiers?

    Garrett Weaver

    06/20/2025, 7:52 PM
    👋 I am running into a weird issue where the native runner is "hanging" when trying to read an Iceberg table, but switching to the Ray runner (local cluster) works fine (plan in 🧵). Maybe I am hitting an edge case, as this is a single-row test table in a staging environment for testing the Nessie integration.

    Marco Gorelli

    06/24/2025, 8:58 AM
    If daft.pyspark were complete, then assuming it covers the operations one needs, would there still be an advantage to using Daft's own API instead?

    Everett Kleven

    06/24/2025, 4:11 PM
    LFG DAFT SQUAD!

    Sammy Sidhu

    06/24/2025, 4:19 PM
    Today we're thrilled to announce that Eventual has raised $30M in funding to power the future of multimodal AI infrastructure! Jay Chia and I started this journey 3 years ago, frustrated by the same wall every AI team hits: processing images, video, and documents at scale with tools built for entirely different use cases. What began as pure frustration in my basement has become the data engine trusted by Amazon, CloudKitchens, Essential AI, Together AI and other Fortune 25 companies.

    The numbers speak for themselves: Daft improved Amazon's most critical data job efficiency by 24%, saving them 40,000 years of compute time annually. Together AI replaced their custom pipelines with simple Daft queries for 100TB+ datasets while achieving 10x performance gains.

    But this is just the beginning. AI applications are now generating massive amounts of multimodal data at machine speed, and we're building the engine to power it all. Thank you to our incredible community and supporters who made this possible - from our earliest believers at Y Combinator and Array Ventures to our lead investors CRV and Felicis, plus M12, Microsoft's Venture Fund, Citi, and everyone who's been part of this journey.

    What's next? We're launching early access to Eventual Cloud - the first production platform built from scratch for multimodal AI workloads. We're also hiring across Engineering, DevRel, Design and Product Marketing. Check out our open roles here. Please help share the love on LinkedIn & Twitter! https://daft-amplify.lovable.app/ (PS, check out the video in the posts, it's pretty cool)

    Artur Ciocanu

    06/25/2025, 7:55 PM
    Hello community, I saw https://daft.ai/ linked from this repo: https://github.com/Eventual-Inc/Daft, but the address doesn't open. Is this a known issue?

    ChanChan Mao

    06/26/2025, 4:41 PM
    Another huge milestone achieved this week - Daft has surpassed 3000 stars! Thank you to our growing community for continuing to support us and sharing the love of Daft. And thank you to everyone who is using Daft and believing in the future that we're building. https://github.com/Eventual-Inc/Daft

    Garrett Weaver

    06/27/2025, 7:23 PM
    fyi, I am hitting this error https://github.com/apache/arrow/issues/21526 with pyarrow when using Ray data directly, and also seeing errors trying to read the same data with daft. pyspark works fine. I assume daft might be impacted by the same issue (too-large row groups)?