# hamilton-help
j
I have created a transform which renames the columns such that they align with the list I pass as 'final_vars' to dr.execute() but still get an error
e
Yep! Definitely possible if I understand correctly. That said, I think you'll need an `extract_columns` decorator on top of it. Let me know if this makes sense:
```python
@extract_columns(...)
def df_with_columns_renamed(df_original: pd.DataFrame) -> pd.DataFrame:
    # rename columns and return the original df
    ...
```
Basically you can think about it using two layers:
1. Hamilton deals in just variables; it doesn't really care whether they're DataFrames, series, etc.
2. The decorators/drivers have df-specific logic (which extends past pandas as well).
So you can make functions return anything, but if you want series to be variables they have to be individual nodes (hence `extract_columns`, which turns one DataFrame into multiple nodes, one per series).
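A minimal pandas-only sketch of the idea (the column names `OldA`/`OldB` are made up for illustration): the function returns one renamed DataFrame, and `@extract_columns` would then expose each renamed column as its own Hamilton node.

```python
import pandas as pd

# Sketch: the renaming step that @extract_columns would sit on top of.
def df_with_columns_renamed(df_original: pd.DataFrame) -> pd.DataFrame:
    # hypothetical rename mapping for illustration
    return df_original.rename(columns={"OldA": "a", "OldB": "b"})

df = pd.DataFrame({"OldA": [1, 2], "OldB": [3, 4]})
renamed = df_with_columns_renamed(df)

# Each renamed column is now addressable as its own series --
# this is what @extract_columns exposes as separate nodes.
series_a = renamed["a"]
```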
s
@James Marvin I’m happy to spend time with you this week if you wanted to screen share what you’re doing. Just DM me if you’re interested.
j
Thanks folks
Just to check one thing on this answer - would I then need to reference the original df ('df_original' in the example) in the hamilton config?
s
You somehow need to supply it for computation. So yes, one way to do that is as part of the config. I'll be free in an hour if you wanted to chat live.
@James Marvin I’m around if you wanted to chat live.
j
Hi @Stefan Krawczyk, apologies I missed you
To expand on my problem: I'm trying to build a pipeline whose last step is to perform a group operation on the target dataframe. The expected outcome of the operation is that it would reduce the total number of records in the dataframe. I am trying to understand:
1. How I can create a transform which is applied at the 'whole dataframe' level; and
2. How I can ensure that this transform is considered as the last stage in a DAG.
Would this be possible?
s
Yes, there are a couple of ways to do this. Are you around in two hours to chat?
j
I'll be around in about 4hrs time - is that OK?
s
Yes that works for me.
@James Marvin I’m around for the next couple of hours — so just let me know when you want to chat.
Here are the basic mechanics of the different approaches. Option 1: Encode this as a transform. Create a Hamilton function:
```python
def grouped_df(col1: pd.Series, ..., colN: pd.Series) -> pd.DataFrame:
    # your group logic
    # new_df = ...
    return new_df
```
in your driver:
```python
dr = driver.Driver(config, logic_module, adapter=base.SimplePythonGraphAdapter(base.DictResult()))
result = dr.execute(['grouped_df'])
```
Option 2: Do it as a post step after running execute() in your driver
```python
dr = driver.Driver(config, logic_module)
df = dr.execute(['col1', ..., 'colN'])
grouped_df = ...  # your logic here.
```
Option 3: Run two Hamilton DAGs. This is the merger of Options 1 & 2.
```python
dr1 = driver.Driver(config, logic_module)
pre_grouped_df = dr1.execute(['col1', ..., 'colN'])

dr2 = driver.Driver(other_config, grouping_logic_module, adapter=base.SimplePythonGraphAdapter(base.DictResult()))
result = dr2.execute(['grouped_df'], inputs={"raw_df": pre_grouped_df})  # you can write the function to operate on a dataframe, or columns
```
Option 4: Add a custom Result Builder to do this
```python
import typing

import pandas as pd

from hamilton import base, driver


class GroupedByResult(base.ResultMixin):
    """Custom result builder: it must implement build_result as a static method."""

    @staticmethod
    def build_result(*, group_by_names: typing.List[str], **outputs: typing.Dict[str, pd.Series]) -> pd.DataFrame:
        """This function builds the result given the computed values."""
        df = pd.DataFrame(outputs)
        grouped_df = df.groupby(
            by=group_by_names,
        )
        # more logic here.
        return grouped_df


# driver
dr = driver.Driver({..., "group_by_names": ["COLUMN", "NAMES"], ...}, modulez,
                   adapter=base.SimplePythonGraphAdapter(result_builder=GroupedByResult()))

# to wire configuration through to the build_result function, you need to request it as an output.
output = ['USUAL', 'COLUMNS'] + ['group_by_names']
df = dr.execute(output)
```
Option 5: Do the filter/group on data load. Not sure how applicable this would be, but if it's based on an index or something, do this as a step when you load the data. That way downstream functions only operate over the already filtered/grouped data.
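A pandas-only sketch of Option 5, with a made-up in-memory CSV standing in for the real source: the filter happens inside the loader function, so every downstream node only ever sees the reduced rows.

```python
import io

import pandas as pd

# Hypothetical source; in practice this would be a file path or DB query.
CSV = io.StringIO("id,value\n1,10\n1,20\n2,30\n3,40\n")

def raw_df(keep_ids: list) -> pd.DataFrame:
    """Load the data and filter immediately, so downstream
    functions only operate over the already-reduced data."""
    df = pd.read_csv(CSV)
    return df[df["id"].isin(keep_ids)].reset_index(drop=True)

filtered = raw_df([1, 2])  # rows for id 3 never enter the DAG
```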
They have different pros/cons; for me it depends on who will run/maintain the code, where, and what you want to be easy (or hard) to change.
e
IMO it's generally nice to have it all in the same DAG, but yeah, it really depends on the use-case/how you want to reuse the transformations.
j
When I use the Hamilton transform option (#1), how can I specify where in the DAG/sequence the step happens? I guess I'm trying to understand how to use a transform which potentially affects the vertical 'length' of a dataframe in the middle of a DAG. To see an example of what I'm trying to do, you can take a look at this sheet
s
Following up: we chatted and discussed that, via pandas indexing, it's possible to do this by operating over a series. As long as the index on the series is what you want it to be, pandas will stitch everything together correctly into a dataframe.
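A small pandas example of that index-alignment behavior (the data is made up): a grouped series has a reduced index, and assembling it into a DataFrame alongside another series on that same reduced index lines everything up automatically.

```python
import pandas as pd

# One series per group key; the index carries the grouping information.
sales = pd.Series([10, 20, 30, 40], index=["a", "a", "b", "b"], name="sales")

# Grouping reduces the index to the unique keys ["a", "b"].
total_per_key = sales.groupby(level=0).sum()
count_per_key = sales.groupby(level=0).size()

# pandas aligns both series on that shared reduced index.
df = pd.DataFrame({"total": total_per_key, "count": count_per_key})
```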