# hamilton-help
e
In a call now — will get back to you shortly!
👍 1
s
@stephen bias want to jump on a call? We’re available this afternoon.
Otherwise some questions (h/t @Elijah Ben Izzy):
• Are you passing around pandas dataframes, dask dataframes, or are you not sure?
• Are you running on a cluster or a single machine?
• What scale challenge is motivating you to use dask: data size, the time it takes to compute something, or both?
s
It's 9pm my time so I'll pass on the call today, but we can arrange another time if this takes a while. To answer your questions:
• passing around a dask dataframe*
• single machine (for now)
• it's currently size; pure pandas is just going to require exponentially more RAM as the dataset increases
Code snippet of what I've tried so far, based on this example:
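import dask_bigquery
from dask.distributed import Client, LocalCluster
from hamilton import base, driver
from hamilton.experimental import h_dask  # assumption: import path varies by Hamilton version

# project, dataset, table, features and OUTPUT_COLUMNS are defined elsewhere
df = dask_bigquery.read_gbq(project_id=project, dataset_id=dataset, table_id=table)
dask_df_map = {c: df[c] for c in df}  # one dask series per column

cluster = LocalCluster(processes=False)  # threads only, single machine
client = Client(cluster)
dga = h_dask.DaskGraphAdapter(client, base.PandasDataFrameResult())  # builds a pandas DataFrame
dr = driver.Driver(dask_df_map, features, adapter=dga)
df = dr.execute(OUTPUT_COLUMNS)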
*a dict of dask series, as that seemed to be what it wanted
s
@stephen bias thanks. Yeah I think a call would be quicker for us to understand the situation a bit better and to ensure we can replicate/understand the scenario. We can try to catch you first thing our morning tomorrow / your afternoon?
s
Sure thing. Can we do Friday? I'm out of office most of tomorrow but free any time of day Friday.
s
Cool
We can work with that. Want to grab a slot here? Otherwise, did you try visualizing the dask plan, in case that’s insightful?
👍 1
s
Have just done the visualisation - it all collects into one point (sort of as expected). Can chat about it more on Friday 😎
👍 1
s
@stephen bias I created a PR https://github.com/DAGWorks-Inc/hamilton/pull/251 that has a DaskDataFrameResult. See the new run.py. Otherwise, in a comment on the PR I added a few things to try so we can understand the issue; we can walk through them on the call. Note: Dask on a single machine might not help with memory issues if you’re running into them with Pandas.
Okay to close the loop: • new dask graph adapter additions helped. Will push that out today/tomorrow.
👏 1
🎉 1
🙌 1