# hamilton-help
s
This message was deleted.
e
Glad to hear you're having a good time! And awesome question! So, a few thoughts:
1. We don't support dynamic parameterization/column creation -- this issue gets at it (although it's slightly different). In general, requiring static (non-runtime-determined) parameterization makes the framework slightly less powerful, so supporting the dynamic case is something we might consider.
2. Creating everything, IMO, isn't actually bad (although there's some nuance). Execution only runs the nodes upstream of the outputs you request -- thus you can create 1000 of them and only use 2, and the other 998 will never be run, so you won't take a performance hit.
3. You can also do this by passing around dataframes instead of series. It becomes slightly less expressive, but you could imagine something like this:
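from typing import List
import pandas as pd
from hamilton.function_modifiers import parameterize
@parameterize(**lag_parameterization)  # lag_parameterization: dict defined elsewhere (not shown here)
def lag_series(feature: pd.Series, lags: List[int]) -> pd.DataFrame:
    # pretty sure there's a better vectorized way to do this, but...
    return pd.DataFrame({f"lag_{lag}": feature.shift(lag) for lag in lags})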
Then downstream functions can depend on that dataframe.
But yeah, which one you use depends a bit on exactly how you're using the lagged data downstream. E.g. (2) works if you keep adding transforms that use new lagged versions of features. (3) is nice if you want the set of lags in use to be determined at runtime. If I understand what you're going for, I think your approach is pretty good so far -- I'd be curious how it impacts your workflow!
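To make point (2) concrete, here is a minimal sketch of why unused nodes cost nothing at execution time. It is a toy executor in plain Python, not Hamilton's implementation; `lag_values`, `build_graph`, and `execute` are hypothetical names invented for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Toy graph format (an assumption for this sketch):
# node name -> (function, names of its upstream dependencies)
Graph = Dict[str, Tuple[Callable[..., Any], List[str]]]


def lag_values(values: List[float], lag: int) -> List[Optional[float]]:
    """Shift a list forward by `lag` positions, padding the front with None."""
    return [None] * min(lag, len(values)) + values[: max(len(values) - lag, 0)]


def build_graph(max_lag: int, executed: List[str]) -> Graph:
    """Define one lag node per lag in 1..max_lag. Defining a node does not run it."""

    def make_node(lag: int) -> Callable[[List[float]], List[Optional[float]]]:
        def node(feature: List[float]) -> List[Optional[float]]:
            executed.append(f"lag_{lag}")  # record that this node actually ran
            return lag_values(feature, lag)

        return node

    return {f"lag_{lag}": (make_node(lag), ["feature"]) for lag in range(1, max_lag + 1)}


def execute(graph: Graph, inputs: Dict[str, Any], requested: List[str]) -> Dict[str, Any]:
    """Compute only the requested nodes and whatever they depend on."""
    cache: Dict[str, Any] = dict(inputs)

    def resolve(name: str) -> Any:
        if name not in cache:
            fn, deps = graph[name]
            cache[name] = fn(*[resolve(dep) for dep in deps])
        return cache[name]

    return {name: resolve(name) for name in requested}


executed: List[str] = []
graph = build_graph(1000, executed)
result = execute(graph, {"feature": [1.0, 2.0, 3.0, 4.0]}, ["lag_1", "lag_2"])
```

After this runs, `executed` holds only the two requested node names even though 1000 nodes were defined, which is the behavior point (2) relies on.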
s
@Elijah Ben Izzy I think another approach could be to pass in the parametrization as a config? So the DAG would be static, it's just the parametrization could be passed in, rather than requiring it to be defined in the module... We'd need to build this feature in though.
g
In #3, I suppose I could just throw an @extract_columns on top to get the individual lagged columns back out? Seems similar to #2 but would allow me to pass in `lags` in initial_data or config
Unfortunately my use case has me passing around custom types (built on Polars dataframes) instead of pd.Series as in the example, so I'm likely sacrificing the utility of @extract_columns
e
@Gregory Jeffrey yep, but then you'd be knowing them at compile-time, so why not just create those series individually? I guess the thing I'm not understanding is how you plan to use them -- is it:
• You want to run a test that uses a different set of lags every run (e.g. for optimizing some hyperparameter), or
• You want to use a bunch of different ones in different places, but it's mostly the same run-to-run
Also, I would love polars support -- we'd have to be careful about pulling in dependencies, but I don't think it would be too hard for `extract_columns` to support multiple dataframe types -- if you're interested in making an OS contribution at some point...
@Stefan Krawczyk yeah -- was thinking along similar lines. We have `source`, `value`, and `config` could be a third, but it's unclear whether it's the value or the source... One could even imagine `{'lag' : source(config('lag'))}` meaning that it's a source, the value of which comes from config. Will need to noodle.
g
@Elijah Ben Izzy yep, essentially that first use case is what I'm going for
e
Got it -- so yeah, I think the cleanest way to do this would be something like the dataframe approach. If you're optimizing on a single lag it's simple (this can return a series); if you're optimizing on a set of lags it can return a dataframe:
from typing import List
import pandas as pd
def lagged_feature(feature: pd.Series, lags: List[int]) -> pd.DataFrame:
    # pretty sure there's a better vectorized way to do this, but...
    return pd.DataFrame({f"lag_{lag}": feature.shift(lag) for lag in lags})
Then you use that downstream. Exactly how depends more on how you intend to model your problem, but the core idea is labeling the lags in a way that is meaningful to your workflow. E.g. you could have `lag_a`, `lag_b`, and `lag_c` (if you're looking for features based on three different lags 🤷) and extract those from your dataframe, passing the actual lag values in via the `lags` parameter. Makes sense?
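For the "different set of lags every run" use case, the dataframe approach could be exercised roughly like this. It is a sketch in plain pandas outside of Hamilton; `lagged_features`, `score`, and the candidate lag sets are made-up placeholders, and the objective is only a stand-in for whatever you are really optimizing.

```python
from typing import Dict, List, Tuple

import pandas as pd


def lagged_features(feature: pd.Series, lags: List[int]) -> pd.DataFrame:
    """Return one column per lag, named lag_<n>, aligned on the index of `feature`."""
    return pd.DataFrame({f"lag_{lag}": feature.shift(lag) for lag in lags})


def score(feature: pd.Series, lagged: pd.DataFrame) -> float:
    """Placeholder objective (an assumption): mean absolute correlation with `feature`."""
    return float(lagged.corrwith(feature).abs().mean())


# Toy data and candidate lag sets, both invented for this sketch.
feature = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0, 16.0], name="feature")
candidate_lag_sets: List[List[int]] = [[1], [1, 2], [1, 3]]

# The function definitions never change; only the runtime `lags` input does.
scores: Dict[Tuple[int, ...], float] = {
    tuple(lags): score(feature, lagged_features(feature, lags))
    for lags in candidate_lag_sets
}
best_lags = max(scores, key=scores.get)
```

The point of the design is that `lags` is ordinary input data, so each run of the sweep reuses the same function definitions with a different list.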
g
Makes sense, thanks again. As I get further along using Polars I'll certainly think about where I might be able to contribute!
e
Awesome! Good luck on your work and let us know if you have any more questions.
s
@Gregory Jeffrey just to plant the thought. We'd love a Polars example for the examples section in the hamilton repo; nothing extensive, but something to show people how to get started would be great please :) (I'm happy to help document it - just need some code)