# hamilton-help
e
Hey! So, this seems like a reasonable strategy. Two parts, assuming I’ve got the right mental model:
1. Split/apply
2. Apply in parallel
For (1), it’s pretty easy to express in Hamilton. Let me know if this is similar to your case: (a) how to partition it out
@extract_fields(
    {
        "A": pd.DataFrame,
        "B": pd.DataFrame,
        ...
    }
)
def dataframe_partition(partitions: list[str], df: pd.DataFrame) -> dict:
    # group df by the partition keys, then
    # return a dict mapping partition name -> dataframe
(b) how to apply a pipeline of operations: this uses `@pipe` to transform
def _rename_columns(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    ...

def _filter_columns(df: pd.DataFrame, cols_to_drop: list[str]) -> pd.DataFrame:
    ...

@pipe(
    step(_rename_columns, column_map={"foo" : "bar"}),
    step(_filter_columns, cols_to_drop=...),
    ... # add as many as you'd like
)
def A_processed(A: pd.DataFrame) -> pd.DataFrame:
    print("I've just done a ton of transformations, each one of which is a node in the DAG")
    return A # it gets passed the result of transforming them
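For intuition (this is plain pandas, not Hamilton itself, and the column names are invented), `@pipe` applies the steps in order, so the result is roughly equivalent to nesting the helper functions directly:

```python
import pandas as pd

# possible bodies for the helpers above (sketches, not the real ones):
def _rename_columns(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    return df.rename(columns=column_map)

def _filter_columns(df: pd.DataFrame, cols_to_drop: list) -> pd.DataFrame:
    return df.drop(columns=cols_to_drop)

df = pd.DataFrame({"foo": [1, 2], "baz": [3, 4]})

# roughly what @pipe(step(_rename_columns, ...), step(_filter_columns, ...)) computes:
out = _filter_columns(_rename_columns(df, {"foo": "bar"}), ["baz"])
# out is left with a single column, "bar"
```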
Re: parallelism — you have a few ways to do this — what’s your data size?
n
Hi Elijah!
Thanks for your answer. Our dataset is quite small, so maybe we don't require parallelization at all
e
Yeah I’d suggest first seeing how far you can get without it, then adding it in
n
Seems reasonable
e
There are a few strategies but the devil is in the details (especially with small datasets, serialization costs can hammer you)
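If parallelism does become worth it later, one simple strategy (sketched here with plain stdlib code, nothing Hamilton-specific; `process` and the partition dict are placeholders) is a thread pool over the partitions, which sidesteps the serialization costs that process pools incur:

```python
from concurrent.futures import ThreadPoolExecutor

def process(rows: list) -> int:
    # stand-in for the per-partition pipeline of transformations
    return sum(rows)

partitions = {"A": [1, 2, 3], "B": [4, 5]}

# apply the same processing function to each partition concurrently
with ThreadPoolExecutor() as pool:
    results = dict(zip(partitions, pool.map(process, partitions.values())))
# results == {"A": 6, "B": 9}
```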
n
Got a question about naming. Is there somehow a way to do this:
@extract_fields(
    {
        "A White Horse": pd.DataFrame,
        "On a cherry tree": pd.DataFrame,
        ...
    }
)
def dataframe_partition(partitions: list[str], df: pd.DataFrame) -> dict:
    # group df by the partition keys, then
    # return a dict mapping partition name -> dataframe
@pipe(
    step(_rename_columns, column_map={"foo" : "bar"}),
    step(_filter_columns, cols_to_drop=...),
    ... # add as many as you'd like
)
# A should be applied on "A White Horse" Dataframe
def A_processed(A: pd.DataFrame) -> pd.DataFrame:
    print("I've just done a ton of transformations, each one of which is a node in the DAG")
    return A # it gets passed the result of transforming them
e
Is the question whether you want to have spaces in the name? Or whether you want to have a different name in the input to the `A_processed` function than was output by `@extract_fields`?
n
Both, actually!
I'm asking because in my case the key is probably going to be the query we used to partition the DataFrame. So something like:
@extract_fields(
    {
        "col1 == 'A' and col2 =='B'": pd.DataFrame,
        "col1 == 'B' and col2 =='C'": pd.DataFrame,
        ...
    }
)
def dataframe_partition(partitions: list[str], df: pd.DataFrame) -> dict:
    # group df by the partition keys, then
    # return a dict mapping partition name -> dataframe
e
So there are ways to do it but it’s not super supported TBH. The way people have gotten around this is to have a mapping of column name -> partition key (often a simple translation to a variable); we’ve found it helps add semantic meaning to the data (and often it’s not 1:1). That said, curious to see if that works for you!
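A minimal sketch of that mapping idea (the names and queries here are made up): keep a dict from a semantic, identifier-safe name to the partition query, use the keys for `@extract_fields`, and use the values when actually filtering:

```python
# hypothetical mapping: semantic name (valid Python identifier) -> partition query
PARTITIONS = {
    "segment_ab": "col1 == 'A' and col2 == 'B'",
    "segment_bc": "col1 == 'B' and col2 == 'C'",
}

# the extract_fields declaration could then be built from the keys:
#   @extract_fields({name: pd.DataFrame for name in PARTITIONS})
# ...while dataframe_partition uses the values, e.g. df.query(query)
```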
s
@ThreadScribe can you show an example of `@inject`?
t
@Stefan Krawczyk asked "can you show an example of `@inject`?" Let me see what I can find... 🤔
Certainly! Here's an example of how `@inject` can be used in Hamilton:
from typing import List

from hamilton.function_modifiers import group, inject, source, value

@inject(nums=group(source('a'), value(10), source('b'), value(2)))
def a_plus_10_plus_b_plus_2(nums: List[int]) -> int:
    return sum(nums)
This example demonstrates how `@inject` allows you to replace parameters with values passed in, similar to a `@parameterize` call. The complete reference documentation for `@inject` can be found [here](https://hamilton.dagworks.io/en/latest/reference/decorators/inject/).
s
@Nicolas Huray if you must have spaces (for the most part we assume valid python variable names), then I believe you’d need to use a decorator like `@inject` or `@parameterize*` that can take in declarations of `source("name with space")`. Otherwise, are the partitions fairly static?
n
yeah they are fairly static
s
Okay so `@inject` and `@pipe` don’t work together… 😕
So the short-term option is to substitute `_` for spaces/non-python variable characters.
In `dataframe_partition` you could return a dictionary of the mappings, which you could request in the output to then post-process things back.
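A quick sketch of that substitution (the helper name `to_node_name` is invented): replace every character that isn't legal in a Python identifier with `_`, and return the mapping alongside the data so you can translate back when post-processing:

```python
import re

def to_node_name(partition_key: str) -> str:
    # hypothetical helper: turn a partition query into a legal node name
    name = re.sub(r"\W", "_", partition_key)
    # identifiers can't start with a digit
    return "_" + name if name[0].isdigit() else name

queries = ["col1 == 'A' and col2 == 'B'", "col1 == 'B' and col2 == 'C'"]
# a mapping dataframe_partition could return alongside the partitions
name_to_query = {to_node_name(q): q for q in queries}
```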
n
Should we open a bug for that limitation?
s
yep, let's. Would you mind?
👍 1
Or this is a work around:
@inject(A=source("col1 == 'A' and col2 =='B'"))
# A gets bound to the dataframe extracted under that partition key
def A_unprocessed(A: pd.DataFrame) -> pd.DataFrame:
    return A

def _echo(df: pd.DataFrame, v: int) -> pd.DataFrame:
    ...

@pipe(
    step(_echo, v=1),
    step(_echo, v=2),
)
def A_processed2(A_unprocessed: pd.DataFrame) -> pd.DataFrame:
    print("I've just done a ton of transformations, each one of which is a node in the DAG")
    return A_unprocessed
n
Just an indirection. But I guess it adds an unnecessary node in the visualization
s
yep — a work around. 😬
n
that's ok! Both solutions are ok for now.
Thanks so much, guys, for your help!
🫡 1
@Stefan Krawczyk I created the issue but I don't have much context about why it does not work. Would you mind updating it?
s
Perfect thanks
n
Hi guys, I'm working on an example to demonstrate what we discussed yesterday. Basically it's just a simple example doing a split/apply/combine.
I'm probably going to ask some questions over the weekend! No pressure to answer quickly, it can clearly wait til next week!
🙌 2
Have a great weekend!
🫡 2
That's it!! First contribution 🎉
👍 1
e
Awesome! Thank you so much. We’ll be able to look over this soon 🙂