Slackbot
09/08/2022, 2:40 PM
Elijah Ben Izzy
09/08/2022, 2:56 PM
parameterize comes close, but it's not quite the same (e.g. you already have to define parameters).
In most cases we've found that there are a few specific configurations, so a combination of config.when, parameterize, and optional params with defaults tends to work best and ensures self-documenting, clear-to-read pipelines.
dynamic_node but it's not proven generally useful yet and tends to build really confusing pipelines. It's one of those things that was built for a workflow at Stitch Fix that we probably should have bypassed entirely.
Ben
09/08/2022, 3:01 PM
Elijah Ben Izzy
09/08/2022, 3:02 PM
from typing import List
import numpy as np
import pandas as pd
def df(...):
    return pd.DataFrame([[1,1,3,np.nan], [np.nan,2,3,4], [np.nan,0,5,6]], index=['2022-01', '2022-02', '2022-03'], columns=['v202009', 'v202010', 'v202011', 'v202012'])
def last_value_series(df: pd.DataFrame, parameterized_cols: List[str]) -> pd.Series:
    # Keep only the requested columns, then take the last row.
    return df.loc[:, parameterized_cols].iloc[-1]
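A self-contained sketch of the two functions above, called directly instead of through the Hamilton driver. `df()` takes no arguments here, which is an assumption because the original elides its parameters, and the chosen column list is hypothetical.

```python
from typing import List

import numpy as np
import pandas as pd


def df() -> pd.DataFrame:
    """Toy frame: one row per month, one column per data version."""
    return pd.DataFrame(
        [[1, 1, 3, np.nan], [np.nan, 2, 3, 4], [np.nan, 0, 5, 6]],
        index=["2022-01", "2022-02", "2022-03"],
        columns=["v202009", "v202010", "v202011", "v202012"],
    )


def last_value_series(df: pd.DataFrame, parameterized_cols: List[str]) -> pd.Series:
    """Last row of the frame, restricted to the requested columns."""
    return df.loc[:, parameterized_cols].iloc[-1]


# Hypothetical choice of columns; in a real run this list comes from outside.
selected = ["v202011", "v202012"]
result = last_value_series(df(), parameterized_cols=selected)
print(result)
```

The column list is an ordinary argument here, so changing which versions are read means changing a value rather than redefining a function.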
Then you pass parameterized_cols into the driver as part of config or a runtime input.
Elijah Ben Izzy
09/08/2022, 3:03 PM
version or something. If I understand what you're doing -- at Stitch Fix this is typically done with partitions over a dataset (e.g. we save a new dataset each time we regenerate, then run using that as of date)
Ben
09/08/2022, 3:29 PM
Elijah Ben Izzy
09/08/2022, 3:36 PM
data_loaders then passing in overrides, but entirely depends on your preferred approach 🙂