This message was deleted Hamilton Open Source #hamilton-help

# hamilton-help

Slackbot

08/02/2023, 6:13 PM

This message was deleted.

Elijah Ben Izzy

08/02/2023, 6:28 PM

In a call now — will get back to you shortly!

👍 1

Stefan Krawczyk

08/02/2023, 7:39 PM

@stephen bias want to jump on a call? We’re available this afternoon.

Stefan Krawczyk

08/02/2023, 7:46 PM

Otherwise some questions (h/t @Elijah Ben Izzy):
• Are you passing around pandas dataframes, dask dataframes, or are you not sure?
• Are you running on a cluster or a single machine?
• What scale challenge is motivating you to use dask: data size, the time it takes to compute something, or both?

stephen bias

08/02/2023, 8:10 PM

It's 9pm my time so will pass on the call today, but can arrange for another time if this takes some time. To answer your questions:
• passing around a dask dataframe*
• single machine (for now)
• it's currently size, pure pandas is just going to require exponentially more ram as the dataset increases

Code snippet of what I've tried so far, based on this example:

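# Imports assumed from the libraries the snippet references (not in the original message).
import dask_bigquery
from dask.distributed import Client, LocalCluster
from hamilton import base, driver
from hamilton.plugins import h_dask

# project, dataset, table, features and OUTPUT_COLUMNS are defined elsewhere.
df = dask_bigquery.read_gbq(project_id=project, dataset_id=dataset, table_id=table)
dask_df_map = {c: df[c] for c in df}  # dict of dask series keyed by column name

cluster = LocalCluster(processes=False)  # threads only, single machine
client = Client(cluster)
dga = h_dask.DaskGraphAdapter(client, base.PandasDataFrameResult())
dr = driver.Driver(dask_df_map, features, adapter=dga)
df = dr.execute(OUTPUT_COLUMNS)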

*a dict of dask series, as that seemed to be what it wanted

Stefan Krawczyk

08/02/2023, 8:13 PM

@stephen bias thanks. Yeah I think a call would be quicker for us to understand the situation a bit better and to ensure we can replicate/understand the scenario. We can try to catch you first thing our morning tomorrow / your afternoon?

stephen bias

08/02/2023, 8:14 PM

sure thing - can we do Friday as I'm out of office most of tomorrow but free anytime of day Friday

Stefan Krawczyk

08/02/2023, 8:16 PM

Cool

Stefan Krawczyk

08/02/2023, 8:18 PM

We can work with that. Want to grab a slot here? Otherwise, did you try visualizing the dask plan, in case that’s insightful?

👍 1

stephen bias

08/02/2023, 8:43 PM

Have just done the visualisation - it all collects into one point (sort of as expected). Can chat about it more on friday 😎

👍 1

Stefan Krawczyk

08/03/2023, 10:14 PM

@stephen bias I created a PR https://github.com/DAGWorks-Inc/hamilton/pull/251 that has a DaskDataFrameResult. See the new run.py. Otherwise, in a comment in the PR I added a few things to try to understand, which we can walk through on the call. Note: Dask on a single machine might not help with memory issues if you’re running into them with Pandas.

Stefan Krawczyk

08/04/2023, 4:49 PM

Okay to close the loop: • new dask graph adapter additions helped. Will push that out today/tomorrow.

👏 1

🎉 1

🙌 1
