Thierry Jean
08/25/2022, 12:40 AM
# definition
from typing import List

from hamilton import driver
from sklearn.base import BaseEstimator, TransformerMixin

class HamiltonDriver(BaseEstimator, TransformerMixin):
    def __init__(self, outputs: List[str] = None, config: dict = None, modules: list = None):
        self.outputs = [] if outputs is None else outputs
        self.config = {} if config is None else config
        self.modules = [] if modules is None else modules

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        hdriver = driver.Driver(self.config, *self.modules)
        results = hdriver.execute(final_vars=self.outputs, inputs=X)
        return results
# usage
import pandas as pd

import location_transform
import time_transform

df = pd.read_csv("./filepath")
transformer = HamiltonDriver(
    outputs=["day_of_week", "location_distance"],
    modules=[time_transform, location_transform],
)
transformer.transform(df.to_dict(orient="series"))
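For anyone wiring this into a pipeline: the wrapper follows sklearn's stateless-transformer contract (fit is a no-op returning self; transform does all the work). A minimal self-contained sketch of that contract, with the Hamilton call stubbed out so it runs without any dependencies:

```python
from typing import List, Optional

class StatelessTransformer:
    """Same contract as the HamiltonDriver wrapper above: fit() learns
    nothing and returns self; transform() computes the requested outputs."""

    def __init__(self, outputs: Optional[List[str]] = None):
        self.outputs = [] if outputs is None else outputs

    def fit(self, X, y=None):
        return self  # stateless: nothing to fit

    def transform(self, X, y=None):
        # stand-in for hdriver.execute(final_vars=self.outputs, inputs=X)
        return {name: X[name] for name in self.outputs if name in X}

t = StatelessTransformer(outputs=["day_of_week"]).fit(None)
print(t.transform({"day_of_week": 3, "location_distance": 1.2}))
# {'day_of_week': 3}
```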
Ben
09/02/2022, 12:30 AM
def myfunc(col1: pd.Series(freq="M", index=pd.PeriodIndex)): ...
Presumably I can do this with validators (which I haven't tried using yet) but it feels more like a job for types. (?)
2. Errors that occur in the execute() stage are a black box, because the traceback just points to execute() and not the actual problem in your code (e.g. incompatible indexes).
Elijah Ben Izzy
10/11/2022, 4:56 PM
James Marvin
11/02/2022, 9:31 AM
@tag(type='metadata')
def create_some_metadata(input: pd.Series) -> pd.Series:
    return helpers._get_some_metadata(input)

@tag(type='metadata')
def create_some_more_metadata(current_time: datetime) -> pd.Series:
    return pd.Series(current_time)

def get_metadata_table(create_some_metadata: pd.Series, create_some_more_metadata: pd.Series) -> pd.DataFrame:
    return pd.DataFrame([create_some_metadata, create_some_more_metadata])
It could be useful - especially where we are creating a new function accepting a high number of nodes of the same type as input - to have some feature enabling us to refer to nodes by type, as opposed to by name.
Hopefully this example shows in principle what I mean:
@tag(type='metadata')
def create_some_metadata(input: pd.Series) -> pd.Series:
    return helpers._get_some_metadata(input)

@tag(type='metadata')
def create_some_more_metadata(current_time: datetime) -> pd.Series:
    return pd.Series(current_time)

@nodes_by_tag(type='metadata')
def get_metadata_table(**kwargs) -> pd.DataFrame:
    return pd.DataFrame(**kwargs)
In this example:
• All nodes have been assigned the same 'type' using the @tag feature
• Some mechanism (in this case, a new decorator @nodes_by_tag) lets us refer to all nodes of a given type when defining a new node
• The new node can act on the assumption that all nodes of a given type have been provided as inputs, without having to refer to each node by name
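Not a Hamilton feature today; just to make the proposal concrete, here's a dependency-free sketch of the mechanism, using a module-level registry in place of Hamilton's graph (all names hypothetical):

```python
# Hypothetical registry standing in for Hamilton's graph metadata.
_TAGS = {}

def tag(**tags):
    """Toy stand-in for Hamilton's @tag: record tags per function."""
    def register(fn):
        _TAGS[fn.__name__] = tags
        return fn
    return register

def nodes_by_tag(**wanted):
    """Hypothetical @nodes_by_tag: pass every matching node's result as a kwarg."""
    def decorate(fn):
        def wrapper(results):
            matching = {
                name: results[name]
                for name, t in _TAGS.items()
                if name in results and all(t.get(k) == v for k, v in wanted.items())
            }
            return fn(**matching)
        return wrapper
    return decorate

@tag(type="metadata")
def meta_a():
    return "a"

@tag(type="metadata")
def meta_b():
    return "b"

@nodes_by_tag(type="metadata")
def metadata_table(**kwargs):
    return dict(sorted(kwargs.items()))

# "other" is not tagged, so it is filtered out before metadata_table runs.
print(metadata_table({"meta_a": meta_a(), "meta_b": meta_b(), "other": 0}))
# {'meta_a': 'a', 'meta_b': 'b'}
```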
What do you think?
Ian Hoffman
12/12/2022, 12:27 AM
hello_world example and ran into an issue. Specifically, if I change the output_columns in my_script.py to ["spend_mean"], Hamilton crashes with the following:
WARNING:hamilton.base:It appears no Pandas index type was detected. This will likely break when trying to create a DataFrame. E.g. are you requesting all scalar values? Use a different result builder or return at least one Pandas object with an index. Ignore this warning if you're using DASK for now.
ERROR:hamilton.driver:-------------------------------------------------------------------
Oh no an error! Need help with Hamilton?
Join our slack and ask for help! <https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg>
-------------------------------------------------------------------
Traceback (most recent call last):
File "my_script.py", line 29, in <module>
df = dr.execute(output_columns)
File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/hamilton/driver.py", line 142, in execute
raise e
File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/hamilton/driver.py", line 139, in execute
return self.adapter.build_result(**outputs)
File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/hamilton/base.py", line 171, in build_result
raise ValueError(f"Cannot build result. Cannot handle type {value}.")
ValueError: Cannot build result. Cannot handle type 28.333333333333332.
This is happening because the default Driver uses the SimplePythonDataFrameGraphAdapter if no adapter is explicitly specified, and the SimplePythonDataFrameGraphAdapter can't handle scalar values in its build_result method. So I don't know that this qualifies as a bug, which is why I'm not filing a bug report. It seems it is expected, and yet it's unintuitive.
It seems that, if I can run an entire DAG, I should be able to inspect intermediate values in that DAG without Hamilton throwing an error.
I was wondering whether it would make sense to modify SimplePythonDataFrameGraphAdapter, or to introduce a new default adapter, which maintains the behavior of SimplePythonDataFrameGraphAdapter for DataFrames and Series but simply treats anything else as a "normal" value and lets it through. This would give the expected DX while still maintaining the nice data validation benefits that SimplePythonDataFrameGraphAdapter provides today (e.g. in check_pandas_index_types_match).
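To illustrate the proposed behavior: a dependency-free sketch of a permissive build_result, with plain lists standing in for indexed pandas Series (the function name and broadcast logic are hypothetical, not Hamilton's actual adapter code):

```python
def build_result_permissive(**outputs):
    """Hypothetical build_result: if every requested output is a scalar,
    pass the plain dict through instead of raising; otherwise broadcast
    scalars to the length of the list-like outputs (standing in for a
    pandas index) and build a column dict."""
    list_like = {k: v for k, v in outputs.items() if isinstance(v, (list, tuple))}
    if not list_like:
        # e.g. {'spend_mean': 28.33} instead of "Cannot handle type 28.33..."
        return outputs
    n = len(next(iter(list_like.values())))
    return {k: (list(v) if k in list_like else [v] * n) for k, v in outputs.items()}

print(build_result_permissive(spend_mean=28.33))
# {'spend_mean': 28.33}
print(build_result_permissive(spend=[10, 20], spend_mean=15.0))
# {'spend': [10, 20], 'spend_mean': [15.0, 15.0]}
```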
A minor thing for sure, since most people aren't introspecting intermediate nodes, but thought I'd bring it up since it bit me.
CC @Elijah Ben Izzy as we were talking about this.
Stefan Krawczyk
12/27/2022, 6:41 AM
Stefan Krawczyk
01/02/2023, 5:00 PM
Stefan Krawczyk
01/23/2023, 9:52 PM
@here for once), I would love some thoughts on this API update, i.e. the ability to pass functions into execute() rather than just strings?
e.g.
import data_loaders, transforms, model_pipeline
...
dr = driver.Driver(config, data_loaders, transforms, model_pipeline)
kaggle_submission_df: pd.DataFrame = dr.execute([model_pipeline.kaggle_submission_df]) # <-- in addition to strings, you can pass in the function and we'll take the name from it.
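Mechanically this could be as simple as normalizing final_vars up front; a rough sketch of the idea (hypothetical helper, not the actual driver code):

```python
def normalize_final_vars(final_vars):
    """Hypothetical sketch of the proposal: execute() accepts both
    strings and functions, taking __name__ from anything that isn't a str."""
    return [v if isinstance(v, str) else v.__name__ for v in final_vars]

def kaggle_submission_df():
    """Stand-in for the transform function defined in model_pipeline."""

print(normalize_final_vars([kaggle_submission_df, "spend_mean"]))
# ['kaggle_submission_df', 'spend_mean']
```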
Thoughts? React with your emoji of choice: …
Stefan Krawczyk
01/25/2023, 6:46 PM
Elijah Ben Izzy
02/16/2023, 10:13 PM
Elijah Ben Izzy
04/06/2023, 2:57 AM
@load_from and @save_to.
1. @load_from does data loading and injects the result into a parameter in a function. It has many different flavors; this one can load from a CSV into a pandas dataframe:
# load the parameter training_data_raw from the path `./training_data.csv`
@load_from.csv(path=value("./training_data.csv"))
def training_data(training_data_raw: pd.DataFrame) -> pd.DataFrame:
    return some_small_modifications(training_data_raw)
2. @save_to does the opposite; it saves the result of a function:
# save the result of final_output to the value of the node/input `output_result_json`
@save_to.json(path=source("output_result_json"), artifact_name_="final_save_node")
def final_output(...) -> Dict[str, Any]:
    return {...: ...}
The idea is that:
• if you want final_output, you should be able to just call it with final_output as the input to the driver.
• if you want to save the node, you need to call it with final_save_node (the artifact name) as a var to the driver.
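To make the two-name behavior concrete, here's a dependency-free sketch of that idea (hypothetical names, a toy dict standing in for Hamilton's node registry; no file is actually written):

```python
# Toy registry standing in for Hamilton's graph of nodes.
NODES = {}

def save_to_json(path, artifact_name_):
    """Hypothetical sketch of @save_to.json: register the compute node
    under the function's own name, and a second saving node under the
    artifact name."""
    def decorate(fn):
        NODES[fn.__name__] = fn                  # the "final_output" node
        def save_node(**kwargs):
            result = fn(**kwargs)
            # a real saver would write `result` as JSON to `path` here
            return {"path": path, "saved": True}
        NODES[artifact_name_] = save_node        # the "final_save_node" node
        return fn
    return decorate

@save_to_json(path="out.json", artifact_name_="final_save_node")
def final_output():
    return {"a": 1}

print(NODES["final_output"]())     # {'a': 1}
print(NODES["final_save_node"]())  # {'path': 'out.json', 'saved': True}
```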
Some questions for you:
1. What "adapters" would you like to see? I've got json, csv, and a smattering of others. They can work for load, save, or both…
2. What name would you like for the artifact_name_ parameter? I don't like what we have now…
Jan Hurst
04/21/2023, 5:04 AM
dot.render(output_file_path...) is erroring out for some reason in my environment.
Would it be possible to make output_file_path a keyword argument defaulting to None, and render_kwargs default to {"view": False}? Or is that just going to screw up everyone else?
David Wesolowski
04/23/2023, 5:57 AM
function_modifiers/expanders.py fails at line: 384
@inject(params=source('my_func__params'))
def my_func(params: int) -> int:
    """Whoops"""
    return 1

temp_module = ad_hoc_utils.create_temporary_module(
    my_func, module_name="my_module"
)
dr = driver.Driver({'my_func__params': 2}, temp_module)
df = dr.execute(final_vars=['my_func'])
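For readers unfamiliar with @inject, here is a rough dependency-free sketch of what the decorator is meant to do (toy stand-ins for source() and the driver's value resolution; not Hamilton's actual implementation):

```python
def source(name):
    """Toy stand-in for Hamilton's source(): tag a dependency by name."""
    return ("source", name)

def inject(**bindings):
    """Toy sketch of @inject: bind each parameter to a named upstream
    value before the function is called."""
    def decorate(fn):
        def wrapper(available, **kwargs):
            for param, (_, upstream) in bindings.items():
                kwargs.setdefault(param, available[upstream])
            return fn(**kwargs)
        return wrapper
    return decorate

@inject(params=source("my_func__params"))
def my_func(params: int) -> int:
    return params

# the value for my_func__params comes from config/inputs, as in the repro above
print(my_func({"my_func__params": 2}))  # 2
```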
Jarrod Hamilton
08/27/2023, 10:52 PM
Jan Hurst
08/30/2023, 8:59 AM
@save_to functionality.
I have something that looks a little like this:
@save_to.feature(feature_name="my_feature_1")
def my_feature_1(foo: pd.Series, bar: pd.Series) -> pd.Series:
    return foo + bar
My saving logic is some existing infrastructure, but essentially it just uses the name to figure out the right place to save to... but really I want the function name to be accessible in my DataSaver. I couldn't see that this is available at present; any ideas on how I could achieve this?
Arthur Andres
01/04/2024, 6:30 PM
Pieter Wijkstra
03/16/2024, 9:29 PM