# cool-work
  • j

    jay

    08/15/2024, 11:09 PM
    set the channel description: Show off any cool work that we’re working on, Daft or otherwise!
  • j

    jay

    08/15/2024, 11:16 PM
    /invite_all
  • j

    jay

    08/15/2024, 11:47 PM
    Welcome to #cool-work, y’all 🙂 We’ll be showing off cool upcoming work from the Daft team here. Some of it is experimental and not quite ready for prime time, but I think it’s really cool haha
    🎉 4
  • s

    Sammy Sidhu

    08/16/2024, 10:37 PM
    Enabling Explain Analyze for the new local executor!
    clapclap e 6
    🔥 11
  • c

    Cory Grinstead

    08/20/2024, 7:41 PM
    Huggingface datasource!!
    😍 4
    🙌 1
  • k

    Kevin Wang

    08/28/2024, 8:37 PM
    Some benchmarks on our anti and semi joins. Highlights:
    • more than 40% speedup on Q4
    • almost 25% speedup on Q22
    • almost halving the time (along with no spilling!) to do a workflow simulating the slowest part of Together AI's document deduplication process!
    🔥 5
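For anyone unfamiliar with the join types being benchmarked: a semi join keeps the left rows that have a match on the right, an anti join keeps the left rows that have none, and neither adds right-side columns. A minimal plain-Python sketch of those semantics follows. The helper names `semi_join` and `anti_join` and the sample rows are made up for illustration, and this is not how Daft implements these joins.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def semi_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Keep left rows whose key appears on the right; no right columns are added."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def anti_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Keep left rows whose key does NOT appear on the right."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


# Hypothetical sample data: documents, and ids that were already seen elsewhere.
docs = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}, {"id": 3, "text": "c"}]
seen = [{"id": 2}, {"id": 2}, {"id": 3}]

matched = semi_join(docs, seen, "id")    # rows with ids 2 and 3
unmatched = anti_join(docs, seen, "id")  # row with id 1
```

Note that the duplicate `id` 2 on the right does not duplicate the left row, which is the main difference from an inner join followed by a projection.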
  • s

    Sammy Sidhu

    08/30/2024, 11:15 PM
    @jay's latest PR to take limits into account when performing ScanTask right-sizing! 8x improvement when dealing with a ton of small files and doing a limit! https://github.com/Eventual-Inc/Daft/pull/2758
    🤣 1
    🔥 4
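A toy model of why a limit matters when merging many small files into larger scan tasks. This is only an illustration of the general idea, not the logic in the linked PR: `FileInfo`, `merge_small_files`, the byte target, and the file sizes are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileInfo:
    path: str
    size_bytes: int
    num_rows: int


def merge_small_files(
    files: List[FileInfo], target_bytes: int, limit: Optional[int] = None
) -> List[List[FileInfo]]:
    """Greedily group files into tasks of roughly target_bytes.

    With a row limit, a task is also closed as soon as it holds enough rows
    to satisfy the limit, so the first task stays small.
    """
    tasks: List[List[FileInfo]] = []
    current: List[FileInfo] = []
    current_bytes = 0
    current_rows = 0
    for f in files:
        current.append(f)
        current_bytes += f.size_bytes
        current_rows += f.num_rows
        big_enough = current_bytes >= target_bytes
        satisfies_limit = limit is not None and current_rows >= limit
        if big_enough or satisfies_limit:
            tasks.append(current)
            current, current_bytes, current_rows = [], 0, 0
    if current:
        tasks.append(current)
    return tasks


# Twenty hypothetical small files of 100 rows each.
files = [FileInfo(f"part-{i}.parquet", size_bytes=1_000, num_rows=100) for i in range(20)]
no_limit = merge_small_files(files, target_bytes=10_000)
with_limit = merge_small_files(files, target_bytes=10_000, limit=150)
```

In this toy model the first task reads ten files without a limit but only two with `limit=150`, so a downstream limit can be satisfied after far less I/O.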
  • j

    jay

    09/10/2024, 6:16 PM
    This is really slick 😛 (running filters using SQL expressions)
  • j

    jay

    09/10/2024, 6:17 PM
    Unfortunately this doesn’t work though:
    DaftError::TypeError Cannot perform comparison on types: Date, Utf8
    Perhaps this is a SQL-level optimization we’d need to make @Cory Grinstead?
    😄 1
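The error above is a type mismatch between a Date column and a string literal. The same mismatch shows up in plain Python, sketched below with made-up values. This is not Daft code, and whether Daft's SQL layer should insert the cast automatically is the open question in this thread.

```python
from datetime import date

# Hypothetical values standing in for a Date column and a string literal.
rows = [date(2024, 9, 1), date(2024, 9, 10), date(2024, 9, 20)]
cutoff_text = "2024-09-10"

# Comparing a date with a string is a type error in plain Python too.
try:
    [d for d in rows if d > cutoff_text]
    raised = False
except TypeError:
    raised = True

# Parsing the literal into a date first makes the comparison well-typed.
cutoff = date.fromisoformat(cutoff_text)
after_cutoff = [d for d in rows if d > cutoff]
```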
  • d

    David Blum

    09/12/2024, 6:08 PM
    Curious if anyone is aware of efforts to use GluonTS-formatted time series data within a Daft dataframe in order to support distributed training of time series models? (https://ts.gluon.ai/stable/tutorials/forecasting/extended_tutorial.html) I’m following the recipe for multivariate training of the HuggingFace Informer model (https://huggingface.co/blog/informer), and the variates offer a natural partition. I think I can write a UDF to load my data into a Daft df, where each partition stores a GluonTS dataset consisting of a single variate. However, I’m pretty sure I’ll have to substantially refactor Informer’s PyTorch dataloader, which may or may not derail this effort 😂
  • c

    Colin Ho

    09/13/2024, 7:13 PM
    Memory usage for native executor (streaming) writes vs python executor when doing a simple read_parquet -> write parquet on TPCH lineitem SF1:
    • Native peak memory usage: 1.6 GiB
    • Py peak memory usage: 2.3 GiB
    PR here: https://github.com/Eventual-Inc/Daft/pull/2822
    🔥 7
  • j

    jay

    09/19/2024, 11:00 PM
    Imports go zoom zoom https://dist-data.slack.com/archives/C052CA6Q9N1/p1726786409375909
    🤯 4
  • s

    Sammy Sidhu

    09/23/2024, 4:50 AM
    Our very own @Andrew Gazelka's Minecraft server is at the top of the Rust subreddit https://www.reddit.com/r/rust/s/JhrTIgTxN6
    🤯 5
    🔥 8
  • j

    jay

    09/24/2024, 9:49 PM
    Upcoming SQL support from @Cory Grinstead with some pretty advanced complex data accessor functionality 😮 😮 😮
    🥳 6
  • j

    jay

    09/26/2024, 8:43 AM
    Whee our first happy user of Delta/Iceberg partitioned writes
    ❤️ 7
  • c

    Cory Grinstead

    09/26/2024, 10:20 PM
    Soon you won't even need to create intermediate dataframes for SQL!!
    👀 6
    🔥 5
  • c

    Cory Grinstead

    10/07/2024, 9:49 PM
    Very close to finishing the Arrow Interval datatype. This'll allow for relative date comparisons
    🔥 7
  • j

    jay

    10/27/2024, 3:14 AM
    Playing around with better observability/visualizations of our queries! This one shows scheduling of various stages across nodes/workers
    daft party 6
  • j

    jay

    10/27/2024, 3:17 AM
    Another view to see how our scheduler interacts with Ray: You can see that sometimes we schedule something pretty early, but Ray only runs it much later. This view can help us debug that behavior (e.g. perhaps Ray doesn’t have enough resources to schedule something)
  • j

    jay

    12/11/2024, 4:28 AM
    Ray tracing of task retries after OOMs… Pretty interesting!
    daft party 4
  • c

    Colin Ho

    12/14/2024, 1:20 AM
    Here's an MVP progress bar for swordfish (Daft's upcoming local streaming engine), feedback appreciated!
    Progress Bar.mov
    daft party 10
  • s

    Sandeep

    12/17/2024, 8:38 PM
    Here is a blog post I wrote recently that scans an S3 bucket using Daft: 104M rows on a 2-core/16 GB node in Microsoft Fabric https://fabric.guru/using-fabric-orgapps-notebooks-for-geospatial-data-exploration
    🙌 4
  • k

    Kevin Wang

    01/30/2025, 10:29 PM
    New Daft docs!!!
    daft party 8
  • r

    Robert Howell

    03/11/2025, 10:05 PM
    We just published a new web page with more details on our data types! daft bro https://www.getdaft.io/projects/docs/en/latest/sql/datatypes/
    🔥 6
    👍 3
  • h

    Hongbo Miao

    03/16/2025, 9:20 PM
    I opened a feature request in the Narwhals repo to support Daft DataFrames: https://github.com/narwhals-dev/narwhals/issues/2222 This would let all of these tools support Daft (https://narwhals-dev.github.io/narwhals/ecosystem/) 🚀 • altair • bokeh • darts • hierarchicalforecast • marimo • metalearners • panel-graphic-walker • plotly • pointblank • pymarginaleffects • py-shiny • rio • scikit-lego • scikit-playtime • tabmat • tea-tasting • timebasedcv • tubular • vegafusion • wimsey
    🙌 7
  • e

    Everett Kleven

    04/01/2025, 11:46 PM
    Yooo Daft Team. Long time no see. I'm building an ECS architecture with Daft and stumbled across the in-memory Catalog, only to realize it wasn't quite ready yet. I opened a discussion and attached the component store implementation, and would love any feedback on the usage patterns! https://github.com/Eventual-Inc/Daft/discussions/4135
    👋 1
  • e

    Everett Kleven

    04/28/2025, 10:58 PM
    Hey folks, just wanted to draw your attention to some efforts over at LanceDB on a catalog: https://github.com/lancedb/lance-namespace If you're interested in adding support, I’d be happy to write an issue.
    👀 6
  • s

    Srihari Thyagarajan

    05/12/2025, 6:29 AM
    Hi all — we just kicked off a Daft tutorial series in the marimo learn repo with the first notebook underway (thanks to @Péter Ferenc Gyarmati for getting it started; proposing the issue and the structure!). Tracking issue (w/ proposed outline: https://github.com/marimo-team/learn/issues/43)
    🙌 5
    daft party 3
  • r

    Robert Howell

    06/02/2025, 11:22 PM
    We recently merged in support for a deserialize / try_deserialize pair, which lets you parse your JSON into a Daft value given some schema. This pairs nicely with jq, allowing you to write complex jq filters to manipulate your JSON before extracting it into a Daft value. Here's an example with some messy sensor data, finding the "outlier sample", i.e. whichever point is furthest from the origin.
    import daft
    from daft import DataType as dt
    from daft import col
    
    # here's our raw sample data which is just some json dump from a sensor
    df = daft.from_pydict(
        {
            "sample": [
                '{ "x": 1 }',  # missing y, we'll insert 0 in its place
                '{ "x": 1, "y": 1 }',  # ok
                '"HELLO, WORLD!"',  # you're not supposed to be here..
                '{ "x": 3, "y": 3 }',  # ok
                '{ "x": 4, "y": 4 }',  # ok
                '{ "x": false }',  # wrong data type..
            ]
        }
    )
    
    # select all objects, using 0 as the default for missing keys
    filter = """
        (. | objects?) | { x: .x // 0, y: .y // 0 }
    """
    
    # our point type is an x/y pair.
    point_t = dt.struct({"x": dt.int64(), "y": dt.int64()})
    
    # extract each sample point with jq, then deserialize into our type; failures become null and are dropped.
    points = (df.select(col("sample").jq(filter).try_deserialize("json", point_t).alias("point"))).drop_null()
    
    # now find the point furthest from the origin; squared distance is enough, so no need to sqrt it.
    p = col("point")
    furthest_point = (
        points.with_column("distance", p["x"] * p["x"] + p["y"] * p["y"])
        .sort("distance", desc=True)
        .limit(1)
        .select(p)
        .to_pydict()["point"][0]
    )
    
    assert furthest_point == {"x": 4, "y": 4}
    Links • https://gist.github.com/rchowell/6d03fca6a44be2d8ef71a8d837acc4fa#file-test_jq-py • https://github.com/Eventual-Inc/Daft/pull/4470
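For readers without Daft or jq at hand, here is a rough plain-Python equivalent of the pipeline above, using only the standard library. It is a sketch of the same idea, not a description of Daft's behavior: `jq_alt` and `try_point` are hypothetical helpers. One jq detail is worth knowing: the `//` operator falls back on `false` as well as `null`, so in this sketch the `{ "x": false }` sample becomes the point (0, 0) rather than being rejected.

```python
import json
from typing import Any, Dict, Optional

samples = [
    '{ "x": 1 }',
    '{ "x": 1, "y": 1 }',
    '"HELLO, WORLD!"',
    '{ "x": 3, "y": 3 }',
    '{ "x": 4, "y": 4 }',
    '{ "x": false }',
]


def jq_alt(value: Any, default: Any) -> Any:
    """Mimic jq's `//`: fall back when the value is null or false."""
    return default if value is None or value is False else value


def try_point(raw: str) -> Optional[Dict[str, int]]:
    """Return an {x, y} point, or None if the sample is unusable."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):  # like `objects?`: skip non-objects
        return None
    point = {k: jq_alt(obj.get(k), 0) for k in ("x", "y")}
    # Reject anything that is not a plain integer (bool is an int subclass).
    if any(isinstance(v, bool) or not isinstance(v, int) for v in point.values()):
        return None
    return point


points = [p for p in map(try_point, samples) if p is not None]
# Squared distance from the origin is enough for ranking.
furthest_point = max(points, key=lambda p: p["x"] ** 2 + p["y"] ** 2)
```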
  • c

    ChanChan Mao

    06/06/2025, 6:16 PM
    wanted to show off @Everett Kleven's cool project i came across on linkedin -- ECS simulation engine built with Daft and LanceDB! https://www.linkedin.com/posts/everett-kleven_id-like-to-formally-introduce-a-project-activity-7336591450958114818-nRVr/ (everett, please feel free to share more!)
    🥲 4
    👏 7