This message was deleted Hamilton Open Source #hamilton-help

# hamilton-help

Slackbot

09/21/2023, 1:53 PM

This message was deleted.

Thierry Jean

09/21/2023, 2:08 PM

Hi Tobias! It depends on the level of visibility you want in your transforms, but I'd suggest 2 approaches:

• If filling a missing value is not that meaningful as a separate step, you can do all your fill steps together and use `@extract_columns`

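import pandas as pd
from hamilton.function_modifiers import extract_columns

# Each listed column becomes its own node in the dataflow.
@extract_columns("housing_type", "feature_b", "feature_c")
def filled_df(raw_df: pd.DataFrame) -> pd.DataFrame:
    filled_df = raw_df.copy()
    filled_df = ...  # do your fill on all columns
    return filled_df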

• To preserve the lineage of columnwise operations, you can give your transforms distinct names and add a rename step at the end. It's a bit hacky and probably harder to maintain.

import pandas as pd

def housing_type_filled(housing_type: pd.Series) -> pd.Series:
   return housing_type.fillna("unknown")

...  # more transforms

def joined_dataset(
  housing_type_filled: pd.Series,
  feature_b: pd.Series,
  feature_c: pd.Series,
) -> pd.DataFrame:
  # you will need to pass the name for each series
  return pd.concat([
    pd.Series(housing_type_filled, name="housing_type"),
    pd.Series(feature_b, name="feature_b"),
    pd.Series(feature_c, name="feature_c"),
  ], axis=1)
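For the first approach, the elided fill step could look something like the sketch below. This is only an illustration: the `FILL_VALUES` defaults and the `fill_all_columns` helper are hypothetical names, not part of Hamilton, and the column names are taken from the snippets above. The body of this helper is what would go inside a function like `filled_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical per-column defaults; adjust to your own data.
FILL_VALUES = {"housing_type": "unknown", "feature_b": 0.0, "feature_c": 0.0}

def fill_all_columns(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values in every column in one step, leaving the input untouched."""
    filled = raw_df.copy()
    # DataFrame.fillna accepts a dict mapping column name -> fill value.
    return filled.fillna(value=FILL_VALUES)

# Small made-up frame to show the behavior.
raw_df = pd.DataFrame({
    "housing_type": ["house", None],
    "feature_b": [1.0, np.nan],
    "feature_c": [np.nan, 2.0],
})
filled = fill_all_columns(raw_df)
```

The trade-off is the one mentioned above: all fills happen in a single node, so the lineage of each individual column fill is not visible.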

Tobias

09/21/2023, 2:17 PM

Hey Thierry! Thanks for the quick answer. 🙂 It's not always the same transformation, but often very minor ones, e.g. lowercasing strings, filling missing values, mapping values (e.g. -999 to NaN), etc. There are multiple such cases, but some are specific to a single column. We currently also handle this by renaming, but in reverse order: we have a renaming step first, and then refer to the renamed series in the function. However, as you mentioned, this feels quite hacky and also adds unnecessary complexity to the code (even though it's not much). We could also give the initial columns a suffix like `_raw` directly in the query, since we create the initial DF by querying our DWH. However, I thought there might be a simple and straightforward way to handle those cases, since it feels like there should be.

Thierry Jean

09/21/2023, 2:29 PM

For Hamilton, I think the clearest pattern is to add the `_raw` suffix to your initial inputs and have the final name of your column be the function name (i.e., the desired output should be the function named `housing_type()`). That should lead to the easiest code to read and maintain. Otherwise, we currently have a GitHub issue to add a pattern for redefining nodes. If you have specific requirements or features that would be useful, let me and @Elijah Ben Izzy know in this thread 🙂
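A minimal sketch of the `_raw` suffix pattern, assuming the raw columns arrive as `housing_type_raw` and `feature_b_raw` (hypothetical names, as is the sample data). In Hamilton, the parameter names would be matched to inputs by the driver; here the functions are called directly so the example runs with pandas alone.

```python
import numpy as np
import pandas as pd

def housing_type(housing_type_raw: pd.Series) -> pd.Series:
    """Final column: lowercase the strings, then fill missing values."""
    return housing_type_raw.str.lower().fillna("unknown")

def feature_b(feature_b_raw: pd.Series) -> pd.Series:
    """Final column: map the -999 sentinel to NaN."""
    return feature_b_raw.replace(-999, np.nan)

# Made-up raw data, e.g. as returned by a query that aliases columns with `_raw`.
raw_df = pd.DataFrame({
    "housing_type_raw": ["House", None, "FLAT"],
    "feature_b_raw": [1.0, -999.0, 3.0],
})

# Direct calls stand in for what the Hamilton driver would wire up by name.
cleaned = pd.DataFrame({
    "housing_type": housing_type(raw_df["housing_type_raw"]),
    "feature_b": feature_b(raw_df["feature_b_raw"]),
})
```

Each function name is the final column name, so no rename step is needed at the end, and each column-specific transform stays visible as its own node.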

Elijah Ben Izzy

09/21/2023, 2:43 PM

@Thierry Jean has it perfectly. One thing that I've been meaning to add (and plan to shortly) is the `pipe` decorator, which allows you to effectively rename. Would love your feedback! https://github.com/DAGWorks-Inc/hamilton/issues/372
