Slackbot
03/13/2023, 8:28 PM
Elijah Ben Izzy
03/13/2023, 8:40 PM
dr = driver.Driver(config, *modules)
df = dr.execute(vars)
save_df(df)
In the future (quick teaser, API not locked in stone), we’re thinking something like this:
dr = driver.Driver(config, *modules)
df = dr.materialize(
    SaveToCSV('col1', 'col2', 'col3', path='training_data.csv'),
    CustomMetricLogger('metric1', 'metric2'))
Which could also be done by defining custom data adapters:
# materialize.py
from typing import Dict

import pandas as pd

@materialize_to(
    adapter=SaveToCSV,
    path=config('training_data_path')  # or hardcoded
)
def training_data(col1: pd.Series, col2: pd.Series, col3: pd.Series) -> pd.DataFrame:
    # Build the frame from a dict so each series becomes a named column.
    return pd.DataFrame({'col1': col1, 'col2': col2, 'col3': col3})

@materialize_to(
    adapter=CustomMetricLogger
)
def metrics(metric1: float, metric2: float) -> Dict[str, float]:
    return {'metric1': metric1, 'metric2': metric2}
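A toy sketch of what a materialize-style call could do under the hood: append a joining node and a saving node to the end of a DAG, then execute them. Everything here (`ToyDriver`, `ToyCSVSaver`, the `(function, dependencies)` node layout) is a hypothetical stand-in for illustration, not Hamilton's API, which was still undecided at this point in the thread.

```python
import csv
import io
from typing import Any, Callable, Dict, List, Tuple

Node = Tuple[Callable[..., Any], List[str]]  # (function, names of upstream nodes)


class ToyCSVSaver:
    """Hypothetical saver: joins the named columns and writes them as CSV."""

    def __init__(self, *columns: str, sink: io.StringIO):
        self.columns = list(columns)
        self.sink = sink

    def join(self, *values: List[Any]) -> List[Tuple[Any, ...]]:
        # Joining node: zip the column values into rows.
        return list(zip(*values))

    def save(self, rows: List[Tuple[Any, ...]]) -> int:
        # Saving node: write a header plus one line per row, return the row count.
        writer = csv.writer(self.sink)
        writer.writerow(self.columns)
        writer.writerows(rows)
        return len(rows)


class ToyDriver:
    """Hypothetical driver holding a DAG as {name: (function, dependencies)}."""

    def __init__(self, dag: Dict[str, Node]):
        self.dag = dict(dag)

    def _run(self, name: str, cache: Dict[str, Any]) -> Any:
        # Depth-first evaluation with memoization.
        if name not in cache:
            fn, deps = self.dag[name]
            cache[name] = fn(*(self._run(d, cache) for d in deps))
        return cache[name]

    def materialize(self, *savers: ToyCSVSaver) -> Dict[str, Any]:
        finals = []
        for i, saver in enumerate(savers):
            join_name, save_name = f"join_{i}", f"save_{i}"
            # Append the joining and saving nodes to the end of the DAG.
            self.dag[join_name] = (saver.join, saver.columns)
            self.dag[save_name] = (saver.save, [join_name])
            finals.append(save_name)
        cache: Dict[str, Any] = {}
        return {name: self._run(name, cache) for name in finals}


dag = {
    "col1": (lambda: [1, 2, 3], []),
    "col2": (lambda col1: [x * 10 for x in col1], ["col1"]),
}
sink = io.StringIO()
dr = ToyDriver(dag)
result = dr.materialize(ToyCSVSaver("col1", "col2", sink=sink))
```

Because the extra nodes live in the same DAG as everything else, anything that draws the graph would show them too.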
Elijah Ben Izzy
03/13/2023, 8:42 PM
materialize call in the driver basically adds the nodes (both saving + joining) to the end of the DAG before executing, so you can see it in your visualize output.
Stefan Krawczyk
03/13/2023, 9:01 PM
save_to_s3 as an output) - e.g.:
def save_to_s3(col1: pd.Series, col2: pd.Series, ..., s3_client: 'Client', s3_path: str) -> dict:
    """saves df of data to S3"""
    _df = pd.DataFrame({"col1": col1, ... })
    result = s3_client.save_dataframe(s3_path, _df)
    return {"status": result}
Luke
03/13/2023, 9:26 PM
Luke
03/13/2023, 9:39 PM
Would love your thoughts on the API above! Getting this built is my plan for this upcoming week, so I’ll be iterating through some APIs.
Are both SaveToCSV and CustomMetricLogger writing to disk here? Is CustomMetricLogger appending the specified metrics to ‘training_data.csv’? This is outside of Hamilton’s core concerns, but I’m anticipating how I may use this tentative API to link experiment parameters with experiment artifacts. There are already tools to do this (DVC, MLFlow, Kedro, etc.), so the solution is probably in integration. In my mind, a good API would allow integration with these other tools in a lightweight way without requiring it. How that would actually work without feature creep for Hamilton is nebulous.
Elijah Ben Izzy
03/13/2023, 10:04 PM
SaveToCSV would be provided by Hamilton and write to disk. CustomMetricLogger would be something you write, but you could imagine an MLFlowMetricsLogger that would write to MLFlow.
Great point re: feature-creep, we specifically don’t want to be in the business of storing data or processing metrics, only throwing them over the wall to potential partners.
I’m thinking these would be extendable classes so you could plug into whatever you want, allowing pretty natural data saving interfaces with the options you mentioned + more. Also, we are planning to build similar adapter technology, so you could load from any of these 🙂 Thoughts?
Stefan Krawczyk
03/13/2023, 10:47 PM
In my mind, a good API would allow integration with these other tools in a lightweight way without requiring it.
Yep, we got you 😉! For example, we absolutely don’t want Python dependency bloat. So if you want an implementation, it’ll be a separate dependency.
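One way the "extendable classes, separate dependency" idea might look in practice, as a minimal sketch: a small base class in the core, and an MLFlow-backed adapter that only imports `mlflow` when it is actually used. `MetricAdapter`, `InMemoryMetricLogger`, and the `save` method are hypothetical names chosen for illustration, and this `MLFlowMetricsLogger` is a guess at the shape, not a shipped class; `mlflow.log_metrics` is the only real third-party call.

```python
from abc import ABC, abstractmethod
from typing import Dict


class MetricAdapter(ABC):
    """Hypothetical base class; not an actual Hamilton interface."""

    @abstractmethod
    def save(self, metrics: Dict[str, float]) -> None:
        ...


class InMemoryMetricLogger(MetricAdapter):
    """Dependency-free adapter, handy for tests."""

    def __init__(self) -> None:
        self.logged: Dict[str, float] = {}

    def save(self, metrics: Dict[str, float]) -> None:
        self.logged.update(metrics)


class MLFlowMetricsLogger(MetricAdapter):
    """Would live in a separate, optional package so the core stays light."""

    def save(self, metrics: Dict[str, float]) -> None:
        try:
            import mlflow  # imported lazily: only needed if this adapter is used
        except ImportError as exc:
            raise ImportError(
                "MLFlowMetricsLogger requires mlflow (pip install mlflow)"
            ) from exc
        mlflow.log_metrics(metrics)


logger = InMemoryMetricLogger()
logger.save({"metric1": 0.9, "metric2": 0.1})
```

Users who never touch `MLFlowMetricsLogger` never need `mlflow` installed, which is the point of keeping the implementation out of the core package.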