Hamilton Open Source

# hamilton-help

Slackbot

03/13/2023, 8:28 PM

This message was deleted.

❤️ 1

Elijah Ben Izzy

03/13/2023, 8:40 PM

Quick note — we’ve moved docs to hamilton.readthedocs.io :) Awesome question! In the current version Hamilton is more concerned with extracting and transforming, but loading is absolutely going to be in scope shortly — we have a plan and I’ll be building it out soon 🙂 For now, however, the best way to do this is adjacent to the driver.

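from hamilton import driver

dr = driver.Driver(config, *modules)
df = dr.execute(final_vars)  # final_vars: list of the output names you want computed
save_df(df)  # save_df is your own function, e.g. one that calls df.to_csv(...)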

In the future (quick teaser, API not set in stone), we’re thinking something like this:

dr = driver.Driver(config, *modules)
df = dr.materialize(
    SaveToCSV('col1', 'col2', 'col3', path='training_data.csv'), 
    CustomMetricLogger('metric1', 'metric2'))

Which could also be done by defining custom data adapters:

# materialize.py

@materialize_to(
    adapter=SaveToCSV,
    path=config('training_data_path') # or hardcoded
)
def training_data(col1: pd.Series, col2: pd.Series, col3: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({'col1': col1, 'col2': col2, 'col3': col3})

@materialize_to(
    adapter=CustomMetricLogger
)
def metrics(metric1: float, metric2: float) -> Dict[str, float]:
    return {'metric1' : metric1, 'metric2' : metric2}

👍 1

💡 1

Elijah Ben Izzy

03/13/2023, 8:42 PM

Would love your thoughts on the API above! Getting this built is my plan for this upcoming week, so I’ll be iterating through some APIs. The idea is you’d run the first option (in the driver) for more ad-hoc stuff, then translate it to code (the second option) when you’re ready. The classes would be pluggable so you could define your own loaders as well. The implementation would be such that the `materialize` call in the driver basically adds the nodes (both saving + joining) to the end of the DAG before executing, so you can see it in your visualize output.
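To make the "adds the nodes to the end of the DAG" idea concrete, here is a minimal pure-Python sketch. It is not Hamilton's implementation or API: the toy graph format, `execute`, `add_materializer`, and the `__joined` / `__saved` node names are all assumptions made up for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

# A node is (function, names of upstream nodes). Toy format, not Hamilton's.
Node = Tuple[Callable[..., Any], List[str]]


def execute(graph: Dict[str, Node], target: str, cache: Dict[str, Any]) -> Any:
    """Compute `target`, computing (and caching) its upstream nodes first."""
    if target not in cache:
        fn, deps = graph[target]
        cache[target] = fn(*[execute(graph, dep, cache) for dep in deps])
    return cache[target]


def add_materializer(
    graph: Dict[str, Node], columns: List[str], saver: Callable[[Dict[str, Any]], Any]
) -> Dict[str, Node]:
    """Hypothetical helper: append a join node and a save node to a copy of the graph."""
    extended = dict(graph)  # leave the original graph untouched
    # Join node: gathers the requested columns into a single mapping.
    extended["__joined"] = (lambda *values: dict(zip(columns, values)), list(columns))
    # Save node: terminal node that hands the joined data to the saver.
    extended["__saved"] = (saver, ["__joined"])
    return extended


graph: Dict[str, Node] = {
    "col1": (lambda: [1, 2, 3], []),
    "col2": (lambda col1: [x * 10 for x in col1], ["col1"]),
}

sink: List[Dict[str, Any]] = []  # stands in for a file, bucket, or tracking server


def save_to_memory(data: Dict[str, Any]) -> Dict[str, int]:
    sink.append(data)
    return {"rows_saved": len(data["col1"])}


extended = add_materializer(graph, ["col1", "col2"], save_to_memory)
status = execute(extended, "__saved", {})
```

Because the join and save steps are ordinary nodes in `extended`, anything that walks the graph (such as a visualizer) would see them alongside the transform nodes.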

Stefan Krawczyk

03/13/2023, 9:01 PM

@Luke thanks for the question! +1 to what Elijah said. We’ve been thinking about it. In short, today there isn’t really a standard “hamiltonian way”. There are two paths: either you save it yourself after getting something from Hamilton, or you write a function to do it and have Hamilton execute it for you (you’d probably want to switch to using a DictResult builder in the driver and then request `save_to_s3` as an output), e.g.:

def save_to_s3(col1: pd.Series, col2: pd.Series, ..., s3_client: 'Client', s3_path: str) -> dict:
    """saves df of data to S3"""
    _df = pd.DataFrame({"col1": col1, ... })
    result = s3_client.save_dataframe(s3_path, _df)
    return {"status": result}

👍 1

Luke

03/13/2023, 9:26 PM

Great stuff, guys! It’ll take some experimentation to see which approach works best for my team. I’ll plan on syncing back up once we have a chance to sandbox this some.

👍 1

Luke

03/13/2023, 9:39 PM

> Would love your thoughts on the API above! Getting this built is my plan for this upcoming week, so I’ll be iterating through some APIs.

Are both `SaveToCSV` and `CustomMetricLogger` writing to disk here? Is `CustomMetricLogger` appending the specified metrics to ‘training_data.csv’? This is outside of Hamilton’s core concerns, but I’m anticipating how I may use this tentative API to link experiment parameters with experiment artifacts. There are already tools to do this (DVC, MLFlow, Kedro, etc.), so the solution is probably in integration. In my mind, a good API would allow integration with these other tools in a lightweight way without requiring it. How that would actually work without feature creep for Hamilton is nebulous.

➕ 1

Elijah Ben Izzy

03/13/2023, 10:04 PM

Good qs. So yeah, API still TBD, but the idea is they’d each be separate materializations. E.g. `SaveToCSV` would be provided by Hamilton and write to disk. `CustomMetricLogger` would be something you write, but you could imagine an `MLFlowMetricsLogger` that would write to MLFlow. Great point re: feature creep, we specifically don’t want to be in the business of storing data or processing metrics, only throwing them over the wall to potential partners. I’m thinking these would be extendable classes so you could plug into whatever you want, allowing pretty natural data-saving interfaces with the options you mentioned + more. Also, we are planning to build similar adapter technology, so you could load from any of these 🙂 Thoughts?
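As a rough illustration of "separate materializations" built from extendable classes, here is a standard-library-only sketch. The class names `SaveToCSV` and `CustomMetricLogger` come from the teaser in this thread, but the `Materializer` base class, the `save` method, and `run_materializers` are assumptions invented for this example, not Hamilton's API.

```python
import csv
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Any, Dict


class Materializer(ABC):
    """Hypothetical extension point: subclass it to send outputs anywhere."""

    def __init__(self, *outputs: str):
        self.outputs = list(outputs)  # names of the outputs this instance handles

    @abstractmethod
    def save(self, results: Dict[str, Any]) -> None:
        ...


class SaveToCSV(Materializer):
    """Writes the selected columns to a CSV file on disk."""

    def __init__(self, *outputs: str, path: str):
        super().__init__(*outputs)
        self.path = path

    def save(self, results: Dict[str, Any]) -> None:
        rows = zip(*(results[name] for name in self.outputs))
        with open(self.path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(self.outputs)  # header row
            writer.writerows(rows)


class CustomMetricLogger(Materializer):
    """User-written logger; here it only records metrics in memory."""

    def __init__(self, *outputs: str):
        super().__init__(*outputs)
        self.logged: Dict[str, Any] = {}

    def save(self, results: Dict[str, Any]) -> None:
        self.logged.update({name: results[name] for name in self.outputs})


def run_materializers(results: Dict[str, Any], *materializers: Materializer) -> None:
    """Each materializer independently picks out and saves its own outputs."""
    for materializer in materializers:
        materializer.save(results)


results = {"col1": [1, 2], "col2": [3, 4], "metric1": 0.9}
csv_path = os.path.join(tempfile.mkdtemp(), "training_data.csv")
metric_logger = CustomMetricLogger("metric1")
run_materializers(results, SaveToCSV("col1", "col2", path=csv_path), metric_logger)
```

In this sketch the metrics never touch the CSV file: the two materializers share nothing except the results they read from. A logger that writes to an external tracking tool would be one more subclass, living in its own package so the core stays dependency-free.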

➕ 1

Stefan Krawczyk

03/13/2023, 10:47 PM

> In my mind, a good API would allow integration with these other tools in a lightweight way without requiring it.

Yep, we got you 😉 ! For example, we absolutely don’t want Python dependency bloat. So if you want an implementation, it’ll be a separate dependency.
