This message was deleted Hamilton Open Source #hamilton-help

# hamilton-help

Slackbot

11/23/2022, 5:28 PM

This message was deleted.

👀 1

Elijah Ben Izzy

11/23/2022, 5:36 PM

Glad to hear you're having a good time! And awesome question! So, a few thoughts:

1. We don't support dynamic parameterization/column creation -- this issue gets at it (although it's slightly different). In general, requiring static (non-runtime-determined) parameterization makes the framework slightly less powerful, so supporting dynamic parameterization is something we might consider.

2. Creating everything, IMO, isn't actually bad (although there's some nuance). Execution only runs the nodes upstream of the outputs you request -- thus you can create 1000 of them and only use 2, and the other 998 will never be run, so you won't take a performance hit.

3. You can also do this by passing around dataframes instead of series. It becomes slightly less expressive, but you could imagine something like this:

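from typing import List
import pandas as pd
from hamilton.function_modifiers import parameterize
# lag_parameterization is assumed to be defined elsewhere in the module
@parameterize(**lag_parameterization)
def lag_series(feature: pd.Series, lags: List[int]) -> pd.DataFrame:
    # pretty sure there's a better vectorized way to do this, but...
    return pd.DataFrame({f"lag_{lag}": feature.shift(lag) for lag in lags})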

Then you can use that upstream.
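
To make the "create 1000, run only 2" point concrete, here is a minimal pure-Python sketch. It is not how Hamilton is implemented; `make_lag`, `NODES`, and `execute` are hypothetical names invented for this illustration, and plain lists stand in for series. The idea is only that declaring a node is cheap and that work happens solely for the nodes that are requested.

```python
from typing import Callable, Dict, List, Optional, Set

Column = List[Optional[float]]


def make_lag(lag: int) -> Callable[[Column], Column]:
    """Build one statically declared lag node (hypothetical helper)."""

    def lag_node(feature: Column) -> Column:
        # Shift right by `lag`, padding the front with None as a missing value.
        return ([None] * lag + feature)[: len(feature)]

    return lag_node


# 1000 nodes are declared up front, named lag_1 ... lag_1000.
NODES: Dict[str, Callable[[Column], Column]] = {
    f"lag_{k}": make_lag(k) for k in range(1, 1001)
}


def execute(requested: List[str], feature: Column, executed: Set[str]) -> Dict[str, Column]:
    """Compute only the requested nodes and record which ones actually ran."""
    results: Dict[str, Column] = {}
    for name in requested:
        executed.add(name)
        results[name] = NODES[name](feature)
    return results


executed: Set[str] = set()
out = execute(["lag_1", "lag_2"], [1.0, 2.0, 3.0, 4.0], executed)
```

After this runs, `executed` contains only the two requested names even though `NODES` holds 1000 entries, which is the sense in which the unused 998 cost nothing at execution time.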

Elijah Ben Izzy

11/23/2022, 5:40 PM

But yeah, which one you use depends a bit on exactly how you're using the lagged data downstream. E.g. (2) works if you keep adding transforms that use new lagged versions of features. (3) is nice if you want which lags (dependencies) are used to be determined at runtime. If I understand what you're going for, I think your approach is pretty good so far -- I'd be curious how it impacts your workflow!

Stefan Krawczyk

11/23/2022, 5:50 PM

@Elijah Ben Izzy I think another approach could be to pass in the parametrization as a config? So the DAG would be static, it's just the parametrization could be passed in, rather than requiring it to be defined in the module... We'd need to build this feature in though.

Gregory Jeffrey

11/23/2022, 5:54 PM

In #3, I suppose I could just throw an @extract_columns on top to get the individual lagged columns back out? Seems similar to #2 but would allow me to pass in lags in initial_data or config.

Gregory Jeffrey

11/23/2022, 5:55 PM

Unfortunately my use case has me passing around custom types (built on Polars dataframes) instead of pd.Series as in the example, so I'm likely sacrificing the utility of @extract_columns

Elijah Ben Izzy

11/23/2022, 5:59 PM

@Gregory Jeffrey yep, but then you'd know them at compile time, so why not just create those series individually? I guess the thing I'm not understanding is how you plan to use them -- is it:

• You want to run a test that uses a different set of lags every run (e.g. for optimizing some hyperparameter), or

• You want to use a bunch of different ones in different places, but it's mostly the same run-to-run

Also, I would love polars support -- we'd have to be careful about pulling in dependencies, but I don't think it would be too hard for extract_columns to support multiple dataframes -- if you're interested in making an OS contribution at some point...

Elijah Ben Izzy

11/23/2022, 6:01 PM

@Stefan Krawczyk yeah -- was thinking along similar lines. We have source, value, and config could be a third, but it's unclear whether it's the value, or the source... One could even imagine {'lag' : source(config('lag'))} meaning that it's a source, the value of which comes from config. Will need to noodle.

Gregory Jeffrey

11/23/2022, 6:10 PM

@Elijah Ben Izzy yep, essentially that first use case is what I'm going for

Elijah Ben Izzy

11/23/2022, 6:18 PM

Got it -- so yeah, I think the cleanest way to do this would be something like the dataframe approach. If you have a single lag you're optimizing on, it's simple (this can return a series); if you're optimizing on a set of lags, it can return a dataframe:

from typing import List
import pandas as pd
def lagged_feature(feature: pd.Series, lags: List[int]) -> pd.DataFrame:
    # pretty sure there's a better vectorized way to do this, but...
    return pd.DataFrame({f"lag_{lag}": feature.shift(lag) for lag in lags})

Then you use that downstream. Exactly how depends more on how you intend to model your problem, but the core idea is labeling the lags in a way that is meaningful to your workflow. E.g. you could have lag_a, lag_b, and lag_c (if you're looking for features based on three different lags 🤷 ) and extract that from your dataframe/pass it in via the lags parameter. Makes sense?
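
As a rough illustration of the "different set of lags every run" use case, the sketch below calls a `lagged_feature` function of the shape discussed in the thread directly with pandas, outside of Hamilton. The sample data, `candidate_lag_sets`, and `runs` are assumptions made up for this example; in a real setup the lags would presumably be supplied as an input or config value at execution time.

```python
from typing import Dict, List, Tuple

import pandas as pd


def lagged_feature(feature: pd.Series, lags: List[int]) -> pd.DataFrame:
    """Return one column per requested lag, named lag_<n>."""
    return pd.DataFrame({f"lag_{lag}": feature.shift(lag) for lag in lags})


# Made-up sample data for the illustration.
feature = pd.Series([10.0, 20.0, 30.0, 40.0], name="feature")

# Hypothetical search space: each run tries a different set of lags.
candidate_lag_sets: List[List[int]] = [[1], [1, 2], [1, 3]]

# Each run gets its own dataframe; the column set is decided at runtime.
runs: Dict[Tuple[int, ...], pd.DataFrame] = {
    tuple(lags): lagged_feature(feature, lags) for lags in candidate_lag_sets
}

# Downstream code can pull an individual lagged column back out by name.
lag_2 = runs[(1, 2)]["lag_2"]
```

Because the column names are derived from the lags, downstream code only needs to agree on the naming scheme, not on a fixed list of lags known ahead of time.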

Gregory Jeffrey

11/23/2022, 6:25 PM

Makes sense- thanks again. As I get further along using Polars I'll certainly think about where I might be able to contribute!

Elijah Ben Izzy

11/23/2022, 6:26 PM

Awesome! Good luck on your work and let us know if you have any more questions.

Stefan Krawczyk

11/23/2022, 7:01 PM

@Gregory Jeffrey just to plant the thought. We'd love a Polars example for the examples section in the hamilton repo; nothing extensive, but something to show people how to get started would be great please :) (I'm happy to help document it - just need some code)
