# hamilton-help
j
I have created a transform which renames the columns such that they align with the list I pass as 'final_vars' to dr.execute() but still get an error
e
Yep! Definitely possible if I understand correctly. That said, I think you'll need an `extract_columns` decorator on top of it. Let me know if this makes sense:
```python
@extract_columns(...)
def df_with_columns_renamed(df_original: pd.DataFrame) -> pd.DataFrame:
    # rename columns and return the original df
    ...
```
Basically you can think about it using two layers:
1. Hamilton deals in just variables; it doesn't really care whether they're DataFrames, series, etc.
2. The decorators/drivers have df-specific logic (which extends past pandas as well).
So you can make functions return anything, but if you want series to be variables they have to be individual nodes (hence `extract_columns`, which turns one DataFrame into multiple nodes, one per series).
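A minimal pandas-only sketch of the idea (the column names `OldA`/`OldB` are made up for illustration): the function returns one renamed DataFrame, and `@extract_columns` would then expose each renamed column as its own Hamilton node.

```python
import pandas as pd

# Sketch: the renaming step that @extract_columns would sit on top of.
def df_with_columns_renamed(df_original: pd.DataFrame) -> pd.DataFrame:
    # hypothetical rename mapping for illustration
    return df_original.rename(columns={"OldA": "a", "OldB": "b"})

df = pd.DataFrame({"OldA": [1, 2], "OldB": [3, 4]})
renamed = df_with_columns_renamed(df)

# Each renamed column is now addressable as its own series --
# this is what @extract_columns exposes as separate nodes.
series_a = renamed["a"]
```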
s
@James Marvin I’m happy to spend time with you this week if you wanted to screen share what you’re doing. Just DM me if you’re interested.
j
Thanks folks
Just to check one thing on this answer - would I then need to reference the original df ('df_original' in the example) in the hamilton config?
s
You somehow need to supply it for computation. So yes, one way to do that is as part of the config. I'll be free in an hour if you wanted to chat live.
@James Marvin I’m around if you wanted to chat live.
j
Hi @Stefan Krawczyk, apologies I missed you
To expand on my problem: I'm trying to build a pipeline whose last step is to perform a group operation on the target dataframe. The expected outcome of the operation is that it would reduce the total number of records in the dataframe. I am trying to understand:
1. How I can create a transform which is applied at the 'whole dataframe' level; and
2. How I can ensure that this transform is considered as the last stage in a DAG.
Would this be possible?
s
Yes, there are a couple of ways to do this. Are you around in two hours to chat?
j
I'll be around in about 4hrs time - is that OK?
s
Yes that works for me.
@James Marvin I’m around for the next couple of hours — so just let me know when you want to chat.
Here are the basic mechanics of the different approaches. Option 1: Encode this as a transform. Create a Hamilton function:
```python
def grouped_df(col1: pd.Series, ..., colN: pd.Series) -> pd.DataFrame:
    # your group logic
    # new_df = ...
    return new_df
```
in your driver:
```python
dr = driver.Driver(config, logic_module, adapter=base.SimplePythonGraphAdapter(base.DictResult()))
result = dr.execute(['grouped_df'])
```
Option 2: Do it as a post step after running execute() in your driver
```python
dr = driver.Driver(config, logic_module)
df = dr.execute(['col1', ..., 'colN'])
grouped_df = ...  # your logic here.
```
Option 3: Run two Hamilton DAGs. This is the merger of Options 1 & 2.
```python
dr1 = driver.Driver(config, logic_module)
pre_grouped_df = dr1.execute(['col1', ..., 'colN'])

dr2 = driver.Driver(other_config, grouping_logic_module, adapter=base.SimplePythonGraphAdapter(base.DictResult()))
result = dr2.execute(['grouped_df'], inputs={"raw_df": pre_grouped_df})  # you can write the function to operate on a dataframe, or columns
```
Option 4: Add a custom Result Builder to do this
```python
import typing

import pandas as pd

from hamilton import base, driver


class GroupedByResult(base.ResultMixin):
    """Custom result builder: it must implement build_result as a static method."""

    @staticmethod
    def build_result(*, group_by_names: typing.List[str], **outputs: typing.Dict[str, pd.Series]) -> pd.DataFrame:
        """This function builds the result given the computed values."""
        df = pd.DataFrame(outputs)
        grouped_df = df.groupby(
            by=group_by_names,
        )
        # more logic here.
        return grouped_df


# driver
dr = driver.Driver({..., "group_by_names": ["COLUMN", "NAMES"], ...}, modulez,
                   adapter=base.SimplePythonGraphAdapter(result_builder=GroupedByResult()))

# to wire configuration through to the build_result function, you need to request it as an output.
output = ['USUAL', 'COLUMNS'] + ['group_by_names']
df = dr.execute(output)
```
Option 5: Do the filter/group on data load. Not sure how applicable this would be, but if it's based on an index or something, do this as a step when you load the data. That way downstream functions only operate over the already filtered/grouped data.
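A pandas-only sketch of Option 5, with a made-up in-memory CSV standing in for the real source: the filter happens inside the loader function, so every downstream node only ever sees the reduced rows.

```python
import io

import pandas as pd

# Hypothetical source; in practice this would be a file path or DB query.
CSV = io.StringIO("id,value\n1,10\n1,20\n2,30\n3,40\n")

def raw_df(keep_ids: list) -> pd.DataFrame:
    """Load the data and filter immediately, so downstream
    functions only operate over the already-reduced data."""
    df = pd.read_csv(CSV)
    return df[df["id"].isin(keep_ids)].reset_index(drop=True)

filtered = raw_df([1, 2])  # rows for id 3 never enter the DAG
```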
They have different pros/cons; for me it depends on who will run/maintain the code, where, and what you want to be easy (or hard) to change.
e
IMO it's generally nice to have it all in the same DAG, but yeah, it really depends on the use-case/how you want to reuse the transformations.
j
When I use the Hamilton transform option (#1), how can I specify where in the DAG/sequence the step happens? I guess I'm trying to understand how to use a transform which potentially affects the vertical 'length' of a dataframe in the middle of a DAG. To see an example of what I'm trying to do, you can take a look at this sheet
s
Following up: we chatted and discussed that, via pandas indexing, it's possible to do this by operating over a series. As long as the index on the series is what you want it to be, pandas will stitch everything together correctly into a dataframe.
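A small pandas example of that index-alignment behavior (the data is made up): a grouped series has a reduced index, and assembling it into a DataFrame alongside another series on that same reduced index lines everything up automatically.

```python
import pandas as pd

# One series per group key; the index carries the grouping information.
sales = pd.Series([10, 20, 30, 40], index=["a", "a", "b", "b"], name="sales")

# Grouping reduces the index to the unique keys ["a", "b"].
total_per_key = sales.groupby(level=0).sum()
count_per_key = sales.groupby(level=0).size()

# pandas aligns both series on that shared reduced index.
df = pd.DataFrame({"total": total_per_key, "count": count_per_key})
```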