# hamilton-help
e
Hey! So, this seems like a reasonable strategy. Two parts, assuming I’ve got the right mental model:
1. Split/apply
2. Apply in parallel
For (1), it’s pretty easy to express in Hamilton. Let me know if this is similar to your case: (a) how to partition it out
@extract_fields(
    {
        "A": pd.DataFrame,
        "B": pd.DataFrame,
        ...
    }
)
def dataframe_partition(partitions: list[str], df: pd.DataFrame) -> dict:
    # group df by the partition keys, then
    # return a dict mapping partition name -> dataframe
(b) how to apply a pipeline of operations: this uses `@pipe` to transform
def _rename_columns(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    ...

def _filter_columns(df: pd.DataFrame, cols_to_drop: list[str]) -> pd.DataFrame:
    ...

@pipe(
    step(_rename_columns, column_map={"foo" : "bar"}),
    step(_filter_columns, cols_to_drop=...),
    ... # add as many as you'd like
)
def A_processed(A: pd.DataFrame) -> pd.DataFrame:
    print("I've just done a ton of transformations, each one of which is a node in the DAG")
    return A # it gets passed the result of transforming them
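For intuition (this is plain pandas, not Hamilton itself, and the column names are invented), `@pipe` applies the steps in order, so the result is roughly equivalent to nesting the helper functions directly:

```python
import pandas as pd

# possible bodies for the helpers above (sketches, not the real ones):
def _rename_columns(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    return df.rename(columns=column_map)

def _filter_columns(df: pd.DataFrame, cols_to_drop: list) -> pd.DataFrame:
    return df.drop(columns=cols_to_drop)

df = pd.DataFrame({"foo": [1, 2], "baz": [3, 4]})

# roughly what @pipe(step(_rename_columns, ...), step(_filter_columns, ...)) computes:
out = _filter_columns(_rename_columns(df, {"foo": "bar"}), ["baz"])
# out is left with a single column, "bar"
```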
Re: parallelism — you have a few ways to do this — what’s your data size?
n
Hi Elijah!
Thanks for your answer. Our dataset is quite small, so maybe we don't require parallelization at all
e
Yeah I’d suggest first seeing how far you can get without it, then adding it in
n
Seems reasonable
e
There are a few strategies but the devil is in the details (especially with small datasets, serialization costs can hammer you)
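If parallelism does become worth it later, one simple strategy (sketched here with plain stdlib code, nothing Hamilton-specific; `process` and the partition dict are placeholders) is a thread pool over the partitions, which sidesteps the serialization costs that process pools incur:

```python
from concurrent.futures import ThreadPoolExecutor

def process(rows: list) -> int:
    # stand-in for the per-partition pipeline of transformations
    return sum(rows)

partitions = {"A": [1, 2, 3], "B": [4, 5]}

# apply the same processing function to each partition concurrently
with ThreadPoolExecutor() as pool:
    results = dict(zip(partitions, pool.map(process, partitions.values())))
# results == {"A": 6, "B": 9}
```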
n
Got a question about naming. Is there somehow a way to do this:
@extract_fields(
    {
        "A White Horse": pd.DataFrame,
        "On a cherry tree": pd.DataFrame,
        ...
    }
)
def dataframe_partition(partitions: list[str], df: pd.DataFrame) -> dict:
    # group df by the partition keys, then
    # return a dict mapping partition name -> dataframe
@pipe(
    step(_rename_columns, column_map={"foo" : "bar"}),
    step(_filter_columns, cols_to_drop=...),
    ... # add as many as you'd like
)
# A should be applied on "A White Horse" Dataframe
def A_processed(A: pd.DataFrame) -> pd.DataFrame:
    print("I've just done a ton of transformations, each one of which is a node in the DAG")
    return A # it gets passed the result of transforming them
e
Is the question whether you want to have spaces in the name? Or whether you want to have a different name in the input to the `A_processed` function than was output by `@extract_fields`?
n
Both, actually!
I'm asking because in my case the key is probably going to be the query we used to partition the DataFrame. So something like:
@extract_fields(
    {
        "col1 == 'A' and col2 =='B'": pd.DataFrame,
        "col1 == 'B' and col2 =='C'": pd.DataFrame,
        ...
    }
)
def dataframe_partition(partitions: list[str], df: pd.DataFrame) -> dict:
    # group df by the partition keys, then
    # return a dict mapping partition name -> dataframe
e
So there are ways to do it but it’s not super supported TBH. The way people have gotten around this is to have a mapping of column name -> partition key (often a simple translation to a variable); we’ve found it helps add semantic meaning to the data (and often it’s not 1:1). That said, curious to see if that works for you!
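A minimal sketch of that mapping idea (the names and queries here are made up): keep a dict from a semantic, identifier-safe name to the partition query, use the keys for `@extract_fields`, and use the values when actually filtering:

```python
# hypothetical mapping: semantic name (valid Python identifier) -> partition query
PARTITIONS = {
    "segment_ab": "col1 == 'A' and col2 == 'B'",
    "segment_bc": "col1 == 'B' and col2 == 'C'",
}

# the extract_fields declaration could then be built from the keys:
#   @extract_fields({name: pd.DataFrame for name in PARTITIONS})
# ...while dataframe_partition uses the values, e.g. df.query(query)
```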
s
@ThreadScribe can you show an example of `@inject`?
t
@Stefan Krawczyk asked "can you show an example of `@inject`?" Let me see what I can find... 🤔
Certainly! Here's an example of how `@inject` can be used in Hamilton:
from typing import List

from hamilton.function_modifiers import group, inject, source, value

@inject(nums=group(source('a'), value(10), source('b'), value(2)))
def a_plus_10_plus_b_plus_2(nums: List[int]) -> int:
    return sum(nums)
This example demonstrates how `@inject` allows you to replace parameters with values passed in, similar to a `@parameterize` call. The complete reference documentation for `@inject` can be found [here](https://hamilton.dagworks.io/en/latest/reference/decorators/inject/).
s
@Nicolas Huray if you must have spaces (for the most part we assume valid python variable names), then I believe you’d need to use a decorator like `@inject` or `@parameterize*` that can take in declarations of `source("name with space")`. Otherwise, are the partitions fairly static?
n
yeah they are fairly static
s
Okay so `@inject` and `@pipe` don’t work together… 😕
So the short-term option is to substitute `_` for spaces/non-python variable characters.
In `dataframe_partition` you could return a dictionary of the mappings, which you could request in the output to then post-process things back.
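A quick sketch of that substitution (the helper name `to_node_name` is invented): replace every character that isn't legal in a Python identifier with `_`, and return the mapping alongside the data so you can translate back when post-processing:

```python
import re

def to_node_name(partition_key: str) -> str:
    # hypothetical helper: turn a partition query into a legal node name
    name = re.sub(r"\W", "_", partition_key)
    # identifiers can't start with a digit
    return "_" + name if name[0].isdigit() else name

queries = ["col1 == 'A' and col2 == 'B'", "col1 == 'B' and col2 == 'C'"]
# a mapping dataframe_partition could return alongside the partitions
name_to_query = {to_node_name(q): q for q in queries}
```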
n
Should we open a bug for that limitation?
s
yep, let's. Would you mind?
👍 1
Or this is a work around:
@inject(A=source("col1 == 'A' and col2 =='B'"))
# A gets bound to the dataframe extracted under that partition key
def A_unprocessed(A: pd.DataFrame) -> pd.DataFrame:
    return A

def _echo(df: pd.DataFrame, v: int) -> pd.DataFrame:
    ...

@pipe(
    step(_echo, v=1),
    step(_echo, v=2),
)
def A_processed2(A_unprocessed: pd.DataFrame) -> pd.DataFrame:
    print("I've just done a ton of transformations, each one of which is a node in the DAG")
    return A_unprocessed
n
Just an indirection. But I guess it adds an unnecessary node in the visualization
s
yep — a work around. 😬
n
that's ok! Both solutions are ok for now.
Thanks so much, guys, for your help!
🫡 1
@Stefan Krawczyk I created the issue but I don't have much context about why it does not work. Would you mind updating it?
s
Perfect thanks
n
Hi guys, I'm working on an example to demonstrate what we discussed yesterday. Basically it's just a simple example doing a split/apply/combine.
I'm probably going to ask some questions over the weekend! No pressure to answer quickly, it can clearly wait til next week!
🙌 2
Have a great weekend!
🫡 2
That's it!! First contribution 🎉
👍 1
e
Awesome! Thank you so much. We’ll be able to look over this soon 🙂