This message was deleted Hamilton Open Source #hamilton-help


# hamilton-help

Slackbot

04/20/2023, 1:04 AM

This message was deleted.

👀 1

Elijah Ben Izzy

04/20/2023, 1:07 AM

Hey! So, yes, I’m pretty sure it can be done. To clarify — is the custom function something you want set on a case-by-case basis? A few different configurations? Or hardcoded?

David Wesolowski

04/20/2023, 1:29 AM

It is hardcoded, may take parameters in addition to the data

Elijah Ben Izzy

04/20/2023, 1:33 AM

Awesome! So yeah, I think this is pretty straightforward.

import pandas as pd

def raw_data() -> pd.DataFrame:
    # load your data or pass it in
    ...

def processed_data(raw_data: pd.DataFrame, groupby_apply_param_1: ...) -> pd.DataFrame:
    return raw_data.groupby(...).apply(...)  # use your params
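
A minimal runnable sketch of the snippet above, under stated assumptions: the `group` and `value` columns, the `scale` parameter, and the `scaled_range` helper are hypothetical and not from the thread. The functions are called directly as plain Python here; in Hamilton the same parameter names would be matched to nodes and inputs.

```python
import pandas as pd


def raw_data() -> pd.DataFrame:
    """Stand-in for loading data; the columns are hypothetical."""
    return pd.DataFrame(
        {"group": ["a", "a", "b", "b"], "value": [1.0, 3.0, 10.0, 30.0]}
    )


def processed_data(raw_data: pd.DataFrame, scale: float) -> pd.DataFrame:
    """Group by `group` and apply a hardcoded function that also takes a parameter."""

    def scaled_range(values: pd.Series) -> float:
        # Per-group computation: spread of the group, scaled by the extra parameter.
        return (values.max() - values.min()) * scale

    # Select the value column first so the function sees one Series per group.
    result = raw_data.groupby("group")["value"].apply(scaled_range)
    return result.rename("scaled_range").reset_index()


# Direct call with explicit arguments, as one might do in a unit test.
out = processed_data(raw_data(), scale=2.0)
```

Because each node is an ordinary function, the per-group logic can be tested with a small hand-built frame and no driver at all.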

Elijah Ben Izzy

04/20/2023, 1:36 AM

Note that this just uses dataframes. If you want to do processing on a per-column basis prior to grouping, it should be pretty easy:

import pandas as pd
from hamilton.function_modifiers import extract_columns

@extract_columns('col_1', 'col_2', ...)
def raw_data() -> pd.DataFrame:
    ...

def col_1_processed(col_1: pd.Series) -> pd.Series:
    return do_something_with(col_1)

def processed_data(col_1_processed: pd.Series, col_2_processed: pd.Series, ...) -> pd.DataFrame:
    # rebuild the frame from the processed columns, then group and apply
    return pd.DataFrame({'col_1': col_1_processed, 'col_2': col_2_processed, ...}).groupby(...).apply(...)

Does this get at what you’re trying to do? I think the trick here is that Hamilton can happily process any type of object (pandas series, dataframes, primitives, parameters, etc.)

David Wesolowski

04/20/2023, 1:46 AM

I was thinking that the apply function is explicitly treated as a node. I will be passing around dataframes rather than series in this case

Elijah Ben Izzy

04/20/2023, 1:50 AM

Ahh yep, definitely doable then:

from typing import Callable

import pandas as pd

def raw_data() -> pd.DataFrame:
    # load your data or pass it in
    ...

def apply_function(params: ...) -> Callable:
    def apply(...):
        # apply function
        ...
    return apply

def processed_data(raw_data: pd.DataFrame, apply_function: Callable, groupby_apply_param_1: ...) -> pd.DataFrame:
    return raw_data.groupby(...).apply(...)  # pass apply_function and your params

In this case the node is returning a function. You can pass it in as an override to the driver, or leave it as an input (but the above hardcodes it, which seems like what you want). That said, I’d be curious why it would be its own node as opposed to mixed with the processed_data function?
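
A small sketch of the "node returns a function" idea, assuming hypothetical `group`/`value` columns and a made-up `threshold` parameter. The functions are invoked by hand here to show the data flow; this is an illustration, not the thread author's implementation.

```python
from typing import Callable

import pandas as pd


def apply_function(threshold: float) -> Callable[[pd.Series], float]:
    """Node that builds the per-group function, closed over its parameter."""

    def share_above(values: pd.Series) -> float:
        # Fraction of the group's values that exceed the threshold.
        return float((values > threshold).mean())

    return share_above


def processed_data(
    raw_data: pd.DataFrame, apply_function: Callable[[pd.Series], float]
) -> pd.Series:
    """Node that consumes the callable and applies it to each group."""
    return raw_data.groupby("group")["value"].apply(apply_function)


raw = pd.DataFrame({"group": ["a", "a", "b", "b"], "value": [1.0, 3.0, 10.0, 30.0]})
out = processed_data(raw, apply_function(threshold=2.0))
```

Keeping the callable as its own node means a test can swap in a trivial function for `apply_function` without touching the grouping logic.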

David Wesolowski

04/20/2023, 1:53 AM

I often use this pattern where the computation for one member is called by a function which is passed the collection. The computation is involved

David Wesolowski

04/20/2023, 1:55 AM

It's not that important. I am new to using DAGs. Thought it might be helpful to decompose things in this way. But it seems like it's not a good fit in the framework

Elijah Ben Izzy

04/20/2023, 1:57 AM

I think it’s fine either way. I wouldn’t say it’s non-Hamiltonian. I’m (personally) hesitant to send non-serializable data across function boundaries, but we do that in quite a few places and it’s not a problem. It’s more a question of what you + your team find readable/easy to write.

David Wesolowski

04/20/2023, 2:01 AM

No problem. I am very thankful for this package. My primary use case is unit testing data transformation stages. I was sick of passing parameters across many functions to parameterise each test. Hamilton solves this problem very nicely. It's all explicit and I can inject data at any stage cleanly.

👍 1

Elijah Ben Izzy

04/20/2023, 2:02 AM

Awesome! Glad to hear 🙂 We’ll be here to answer any more questions you have.

David Wesolowski

04/20/2023, 2:03 AM

great work

🙏 2

Stefan Krawczyk

04/20/2023, 2:17 AM

to throw in one more idea — you could make the apply an explicit node that takes in a grouped data frame — and then have “config” determine which one to call.

import pandas as pd
from pandas.core.groupby import DataFrameGroupBy  # pandas' groupby result type
from hamilton.function_modifiers import config

def raw_data() -> pd.DataFrame:
    # load your data or pass it in
    ...

def grouped_data(raw_data: pd.DataFrame, groupby_apply_param_1: ...) -> DataFrameGroupBy:
    return raw_data.groupby(...)  # use your params

@config.when(apply_type="mean")
def processed_data__mean(grouped_data: DataFrameGroupBy) -> pd.DataFrame:
    return grouped_data.mean()

@config.when(apply_type="foo-bar-apply")
def processed_data__foo_bar_apply(grouped_data: DataFrameGroupBy) -> pd.DataFrame:
    return grouped_data.apply(lambda x: foo(x) + bar(x))

But as @Elijah Ben Izzy says, there are a few ways, and what’s more important is what is more ergonomic and whether it is going to be updated regularly or not…
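
To make the grouped-frame idea concrete without Hamilton installed, here is a rough sketch in which a plain dictionary stands in for the config-based selection. The `VARIANTS` dict, the `range` variant, and the column names are assumptions for illustration only; in Hamilton the choice would come from the config passed to the driver rather than from a lookup like this.

```python
import pandas as pd
from pandas.core.groupby import DataFrameGroupBy


def grouped_data(raw_data: pd.DataFrame, group_col: str) -> DataFrameGroupBy:
    """Node that only groups; downstream nodes decide what to compute."""
    return raw_data.groupby(group_col)


def processed_data__mean(grouped_data: DataFrameGroupBy) -> pd.DataFrame:
    return grouped_data.mean()


def processed_data__range(grouped_data: DataFrameGroupBy) -> pd.DataFrame:
    # Hypothetical second variant: per-group spread of each column.
    return grouped_data.max() - grouped_data.min()


# Stand-in for config-driven selection (not Hamilton's API).
VARIANTS = {"mean": processed_data__mean, "range": processed_data__range}

raw = pd.DataFrame({"group": ["a", "a", "b", "b"], "value": [1.0, 5.0, 10.0, 30.0]})
grouped = grouped_data(raw, group_col="group")
mean_out = VARIANTS["mean"](grouped)
range_out = VARIANTS["range"](grouped)
```

The grouped object is computed once and shared, so adding another variant only means adding another function that accepts `grouped_data`.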

David Wesolowski

04/20/2023, 5:53 AM

I will keep this trick in mind. Thank you.

Thierry Jean

04/20/2023, 12:07 PM

Interesting convo! I also encountered similar groupby scenarios a few times before

Elijah Ben Izzy

04/20/2023, 2:51 PM

@Thierry Jean nice! If you want to contribute a post or content/docs about it I’ll happily edit 🙂

Stefan Krawczyk

04/21/2023, 5:16 AM

One more thought after re-reading your initial post: a fan-out/fan-in pattern is possible. To do a fan-in, you’d either manually list out what is being fanned in via function parameter arguments, or, if you need something more dynamic, using @resolve + @inject (docs here and here) could also work.
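
A minimal sketch of the manual fan-in described above, assuming a hypothetical `value` input series and made-up node names. It covers only the static case where the fanned-in nodes are listed as parameters; the dynamic variant is not shown.

```python
import pandas as pd


# Fan-out: two independent nodes derived from the same input.
def value_doubled(value: pd.Series) -> pd.Series:
    return value * 2


def value_centered(value: pd.Series) -> pd.Series:
    return value - value.mean()


# Fan-in: the upstream nodes are listed explicitly as parameters.
def features(value_doubled: pd.Series, value_centered: pd.Series) -> pd.DataFrame:
    return pd.DataFrame(
        {"value_doubled": value_doubled, "value_centered": value_centered}
    )


value = pd.Series([1.0, 2.0, 3.0])
out = features(value_doubled(value), value_centered(value))
```

Listing the parameters by hand keeps the dependencies visible in the function signature, at the cost of editing the signature whenever a branch is added.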
