Distributed Data Community #cool-work

jay

08/15/2024, 11:09 PM

set the channel description: Show off any cool work that we’re working on, Daft or otherwise!

jay

08/15/2024, 11:16 PM

/invite_all

jay

08/15/2024, 11:47 PM

Welcome to #cool-work yall 🙂 We’ll be showing off any cool new upcoming work from the Daft team here… Some stuff is experimental and not quite ready for prime time yet, but I think it's really cool haha

🎉 4

Sammy Sidhu

08/16/2024, 10:37 PM

Enabling Explain Analyze for the new local executor!

clapclap e 6

🔥 11

Cory Grinstead

08/20/2024, 7:41 PM

Huggingface datasource!!

😍 4

🙌 1

Kevin Wang

08/28/2024, 8:37 PM

Some benchmarks on our anti and semi joins. Highlights:
• more than 40% speedup on Q4
• almost 25% speedup on Q22
• almost halving the time (along with no spilling!) to do a workflow simulating the slowest part of Together AI's document deduplication process!

🔥 5
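For anyone unfamiliar with the join types in the benchmark above, here is a minimal plain-Python sketch of what semi and anti joins compute. It is only an illustration of the semantics, not Daft's implementation or API; the helper names and the sample rows are made up for the example.

```python
def semi_join(left, right, key):
    """Keep left rows whose key appears in right. No right columns are added."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def anti_join(left, right, key):
    """Keep left rows whose key does NOT appear in right."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


# Hypothetical sample data, only for illustration.
orders = [
    {"order_id": 1, "customer": "a"},
    {"order_id": 2, "customer": "b"},
    {"order_id": 3, "customer": "c"},
]
# Duplicate keys on the right side do not duplicate left rows.
flagged = [{"customer": "b"}, {"customer": "b"}]

kept = semi_join(orders, flagged, "customer")
dropped = anti_join(orders, flagged, "customer")
```

Unlike an inner join, each left row appears at most once in the output, which is why these joins map well onto EXISTS / NOT EXISTS style queries.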

Sammy Sidhu

08/30/2024, 11:15 PM

@jay's latest PR to take limits into account when performing ScanTask right-sizing! 8x improvement when dealing with a ton of small files and doing a limit! https://github.com/Eventual-Inc/Daft/pull/2758

🤣 1

🔥 4

jay

09/10/2024, 6:16 PM

This is really slick 😛 (running filters using SQL expressions)

jay

09/10/2024, 6:17 PM

Unfortunately this doesn’t work though:

DaftError::TypeError Cannot perform comparison on types: Date, Utf8

Perhaps this is a SQL-level optimization we’d need to make @Cory Grinstead?

😄 1
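The error above reads as if a date value was compared against a string literal without a cast. Below is a plain-Python analogue of that situation, assuming that is what happened; it does not use Daft, and the row data and names are hypothetical. The point is only that the literal has to be turned into a date before the comparison is well-typed.

```python
from datetime import date

# Hypothetical rows with a real date column.
rows = [
    {"id": 1, "shipped": date(2024, 9, 1)},
    {"id": 2, "shipped": date(2024, 9, 15)},
]
cutoff_text = "2024-09-10"

# Comparing a date with a string directly is a type error in Python too.
try:
    [r for r in rows if r["shipped"] > cutoff_text]
    direct_comparison_failed = False
except TypeError:
    direct_comparison_failed = True

# Casting the literal to a date first makes the comparison valid.
cutoff = date.fromisoformat(cutoff_text)
recent_ids = [r["id"] for r in rows if r["shipped"] > cutoff]
```

A SQL frontend could do the equivalent by coercing the string literal to the column's date type during planning, which seems to be the kind of SQL-level change being asked about.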

David Blum

09/12/2024, 6:08 PM

Curious if anyone is aware of efforts to utilize GluonTS-formatted timeseries data within a Daft dataframe in order to support distributed training of timeseries models? (https://ts.gluon.ai/stable/tutorials/forecasting/extended_tutorial.html) I’m following the recipe for multivariate training of the HuggingFace informer model (https://huggingface.co/blog/informer), and the variates offer a natural partition. I think I can write a UDF to load my data into a Daft df, where each partition stores a GluonTS dataset consisting of a single variate. However, I’m pretty sure I’ll have to substantially refactor informer’s PyTorch dataloader, which may or may not derail this effort 😂

Colin Ho

09/13/2024, 7:13 PM

Memory usage for native executor (streaming) writes vs Python executor when doing a simple read_parquet -> write_parquet on TPCH lineitem SF1:
• Native peak memory usage: 1.6 GiB
• Py peak memory usage: 2.3 GiB
PR here: https://github.com/Eventual-Inc/Daft/pull/2822

🔥 7

jay

09/19/2024, 11:00 PM

Imports go zoom zoom https://dist-data.slack.com/archives/C052CA6Q9N1/p1726786409375909

🤯 4

Sammy Sidhu

09/23/2024, 4:50 AM

Our very own @Andrew Gazelka's Minecraft server is at the top of the Rust subreddit https://www.reddit.com/r/rust/s/JhrTIgTxN6

🤯 5

🔥 8

jay

09/24/2024, 9:49 PM

Upcoming SQL support from @Cory Grinstead with some pretty advanced complex data accessor functionality 😮 😮 😮

🥳 6

jay

09/26/2024, 8:43 AM

Whee our first happy user of Delta/Iceberg partitioned writes

❤️ 7

Cory Grinstead

09/26/2024, 10:20 PM

Soon you won't even need to create intermediate dataframes for SQL!!

👀 6

🔥 5

Cory Grinstead

10/07/2024, 9:49 PM

very close to finishing the arrow Interval datatype. This'll allow for relative date comparisons

🔥 7
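As a rough idea of what "relative date comparisons" means, here is a plain-Python sketch using the standard library's timedelta as a stand-in for an interval value. This is not the Daft or Arrow Interval API; the event data and the 30-day window are made up for the example.

```python
from datetime import date, timedelta

# A fixed "today" keeps the example deterministic.
today = date(2024, 10, 7)
window = timedelta(days=30)  # stand-in for an interval such as "30 days"

# Hypothetical events.
events = [
    {"name": "old", "day": date(2024, 8, 1)},
    {"name": "recent", "day": date(2024, 9, 20)},
    {"name": "today", "day": date(2024, 10, 7)},
]

# Relative comparison: keep events no older than `today - window`.
threshold = today - window
recent_names = [e["name"] for e in events if e["day"] >= threshold]
```

The interval lets the filter be written relative to a reference date instead of hard-coding an absolute cutoff.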

jay

10/27/2024, 3:14 AM

Playing around with better observability/visualizations of our queries! This one shows scheduling of various stages across nodes/workers

daft party 6

jay

10/27/2024, 3:17 AM

Another view to see how our scheduler interacts with Ray: You can see that sometimes we schedule something pretty early, but Ray only runs it much later. This view can help us debug that behavior (e.g. perhaps Ray doesn’t have enough resources to schedule something)

jay

12/11/2024, 4:28 AM

Ray tracing of task retries after OOMs… Pretty interesting!

daft party 4

Colin Ho

12/14/2024, 1:20 AM

Here's an MVP progress bar for swordfish (Daft's upcoming local streaming engine), feedback appreciated!

Progress Bar.mov

daft party 10

Sandeep

12/17/2024, 8:38 PM

Here is a blog post I wrote recently that scans an S3 bucket using Daft: 104M rows on a 2-core/16 GB node in Microsoft Fabric https://fabric.guru/using-fabric-orgapps-notebooks-for-geospatial-data-exploration

🙌 4

Kevin Wang

01/30/2025, 10:29 PM

New Daft docs!!!

daft party 8

Robert Howell

03/11/2025, 10:05 PM

We just published a new web page with more details on our data types! daft bro https://www.getdaft.io/projects/docs/en/latest/sql/datatypes/

🔥 6

👍 3

Hongbo Miao

03/16/2025, 9:20 PM

I opened a feature request in the Narwhals repo to support Daft DataFrames: https://github.com/narwhals-dev/narwhals/issues/2222
This would let all of these tools support Daft (https://narwhals-dev.github.io/narwhals/ecosystem/) 🚀
• altair
• bokeh
• darts
• hierarchicalforecast
• marimo
• metalearners
• panel-graphic-walker
• plotly
• pointblank
• pymarginaleffects
• py-shiny
• rio
• scikit-lego
• scikit-playtime
• tabmat
• tea-tasting
• timebasedcv
• tubular
• vegafusion
• wimsey

🙌 7

Everett Kleven

04/01/2025, 11:46 PM

Yooo Daft Team. Long time no see. I'm building an ECS architecture with Daft and stumbled across the in-memory Catalog, only to realize it wasn't quite ready yet. I opened a discussion and attached the component store implementation and would love any feedback on the usage patterns! https://github.com/Eventual-Inc/Daft/discussions/4135

👋 1

Everett Kleven

04/28/2025, 10:58 PM

Hey folks, just wanted to draw your attention to some efforts over at LanceDB for a catalog: https://github.com/lancedb/lance-namespace. If you guys are interested in adding support, I’d be happy to write an issue.

👀 6

Srihari Thyagarajan

05/12/2025, 6:29 AM

Hi all, we just kicked off a Daft tutorial series in the marimo learn repo with the first notebook underway (thanks to @Péter Ferenc Gyarmati for getting it started by proposing the issue and the structure!). Tracking issue (w/ proposed outline): https://github.com/marimo-team/learn/issues/43

🙌 5

daft party 3

Robert Howell

06/02/2025, 11:22 PM

We recently merged in support for a deserialize / try_deserialize pair which enables you to parse your JSON into a Daft value given some schema. This pairs nicely with jq, allowing you to write complex jq filters to manipulate your JSON before needing to extract into some Daft value. Here's an example with some messy sensor data, finding the "outlier sample", aka whichever point is furthest from the origin.

import daft
from daft import DataType as dt
from daft import col

# here's our raw sample data which is just some json dump from a sensor
df = daft.from_pydict(
    {
        "sample": [
            '{ "x": 1 }',  # missing y, we'll insert 0 in its place
            '{ "x": 1, "y": 1 }',  # ok
            '"HELLO, WORLD!"',  # you're not supposed to be here..
            '{ "x": 3, "y": 3 }',  # ok
            '{ "x": 4, "y": 4 }',  # ok
            '{ "x": false }',  # wrong data type..
        ]
    }
)

# select all objects, using 0 as the default for missing keys
filter = """
    (. | objects?) | { x: .x // 0, y: .y // 0 }
"""

# our point type is an x/y pair.
point_t = dt.struct({"x": dt.int64(), "y": dt.int64()})

# we have successfully extracted each sample point, now deserialize into our type.
points = (df.select(col("sample").jq(filter).try_deserialize("json", point_t).alias("point"))).drop_null()

# now find the point furthest from the origin; comparing squared distances avoids the sqrt.
p = col("point")
furthest_point = (
    points.with_column("distance", p["x"] * p["x"] + p["y"] * p["y"])
    .sort("distance", desc=True)
    .limit(1)
    .select(p)
    .to_pydict()["point"][0]
)

assert furthest_point == {"x": 4, "y": 4}

Links
• https://gist.github.com/rchowell/6d03fca6a44be2d8ef71a8d837acc4fa#file-test_jq-py
• https://github.com/Eventual-Inc/Daft/pull/4470
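To make the intended semantics of the example above easier to follow without Daft installed, here is a standard-library sketch of the same pipeline. It is an approximation under stated assumptions, not how Daft implements it: it assumes try_deserialize yields null on a type mismatch (as the drop_null() call suggests), and the helper name try_parse_point is made up for this illustration.

```python
import json

# Same raw samples as in the example above.
samples = [
    '{ "x": 1 }',
    '{ "x": 1, "y": 1 }',
    '"HELLO, WORLD!"',
    '{ "x": 3, "y": 3 }',
    '{ "x": 4, "y": 4 }',
    '{ "x": false }',
]


def try_parse_point(raw):
    """Return {"x": int, "y": int}, or None when the sample cannot be used."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(value, dict):  # like `objects?`: skip non-objects
        return None
    point = {}
    for key in ("x", "y"):
        field = value.get(key)
        # jq's `//` falls back to the default for both null and false.
        if field is None or field is False:
            field = 0
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(field, bool) or not isinstance(field, int):
            return None
        point[key] = field
    return point


# Drop unusable samples, then pick the largest squared distance.
points = [p for p in map(try_parse_point, samples) if p is not None]
furthest = max(points, key=lambda p: p["x"] ** 2 + p["y"] ** 2)
```

One detail worth noticing: because jq's `//` treats false like null, the `{ "x": false }` sample becomes the point (0, 0) in this sketch rather than being dropped.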

ChanChan Mao

06/06/2025, 6:16 PM

wanted to show off @Everett Kleven's cool project i came across on linkedin -- ECS simulation engine built with Daft and LanceDB! https://www.linkedin.com/posts/everett-kleven_id-like-to-formally-introduce-a-project-activity-7336591450958114818-nRVr/ (everett, please feel free to share more!)

🥲 4

👏 7