# hamilton-help
t
Hi Tobias! It depends on how much visibility you want into your transforms, but I'd suggest two approaches:
• If filling a missing value is not that meaningful on its own, you can do all your fill steps together and use `@extract_columns`:
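import pandas as pd
from hamilton.function_modifiers import extract_columns

# column names are passed as separate arguments, not as a list
@extract_columns("housing_type", "feature_b", "feature_c")
def filled_df(raw_df: pd.DataFrame) -> pd.DataFrame:
   filled_df = raw_df.copy()
   filled_df = ...  # do your fill on all columns
   return filled_df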
• To preserve the lineage of column-wise operations, you can give your transforms distinct names and add a rename step at the end. It's a bit hacky and probably harder to maintain:
def housing_type_filled(housing_type: pd.Series) -> pd.Series:
   return housing_type.fillna("unknown")

...  # more transforms

def joined_dataset(
  housing_type_filled: pd.Series,
  feature_b: pd.Series,
  feature_c: pd.Series,
) -> pd.DataFrame:
  # you will need to pass the name for each series
  return pd.concat([
    pd.Series(housing_type_filled, name="housing_type"),
    pd.Series(feature_b, name="feature_b"),
    pd.Series(feature_c, name="feature_c"),
  ], axis=1)
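For the first approach, the elided fill step could look something like the sketch below. It only assumes pandas; the fill values in `FILL_VALUES` and the function name `fill_all` are hypothetical placeholders, not something from this thread. The `@extract_columns` decorator is left off so the snippet runs without Hamilton installed.

```python
import numpy as np
import pandas as pd

# Hypothetical per-column fill values; replace with whatever suits your data.
FILL_VALUES = {"housing_type": "unknown", "feature_b": 0.0, "feature_c": "missing"}


def fill_all(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values for several columns in one step."""
    # Passing a dict to fillna fills each listed column with its own value
    # and returns a new DataFrame, leaving raw_df untouched.
    return raw_df.fillna(value=FILL_VALUES)


raw_df = pd.DataFrame(
    {
        "housing_type": ["flat", None],
        "feature_b": [1.5, np.nan],
        "feature_c": [None, "x"],
    }
)
filled = fill_all(raw_df)
```

The trade-off is the one described above: all fills happen in a single node, so the lineage of each individual column is not visible.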
t
Hey Thierry! Thanks for the quick answer. 🙂 It's not always the same transformation, but they are often very minor ones, e.g. lowercasing strings, filling missing values, mapping values (e.g. -999 to NaN), etc. There are several of those cases, but some are specific to a single column. We currently handle this by renaming as well, but in reverse order: we have a renaming step first and then refer to the renamed series in the function. However, as you mentioned, this feels quite hacky and adds unnecessary complexity to the code (even though it's not much). Since we create the initial DF by querying our DWH, we could also rename the initial columns to something like `_raw` directly in the query. Still, I thought there might be a simple and straightforward way to handle those cases, since it feels like there should be.
t
For Hamilton, I think the clearest pattern is to add the `_raw` suffix to your initial inputs and use the final column name as the function name (i.e., the desired output column should come from a function named `housing_type()`). That should lead to the code that is easiest to read and maintain. Otherwise, we currently have a GitHub issue open to add a pattern for redefining nodes. If you have specific requirements or features that would be useful, let me and @Elijah Ben Izzy know in this thread 🙂
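A minimal sketch of that naming pattern, assuming the inputs arrive as `housing_type_raw` and `feature_b_raw` (hypothetical names following the `_raw` suffix idea) and using the kinds of minor transforms mentioned earlier in the thread. The functions are plain Python, so they are called directly here; with Hamilton, the driver would match the `_raw` inputs to the parameters by name.

```python
import numpy as np
import pandas as pd


def housing_type(housing_type_raw: pd.Series) -> pd.Series:
    """Lowercase the raw strings, then fill missing values."""
    return housing_type_raw.str.lower().fillna("unknown")


def feature_b(feature_b_raw: pd.Series) -> pd.Series:
    """Map the -999 sentinel to NaN."""
    return feature_b_raw.replace(-999, np.nan)


# Illustrative input with the `_raw` suffix already applied,
# e.g. by aliasing the columns in the warehouse query.
raw = pd.DataFrame(
    {
        "housing_type_raw": ["Flat", None, "HOUSE"],
        "feature_b_raw": [1.0, -999.0, 3.0],
    }
)

cleaned = pd.DataFrame(
    {
        "housing_type": housing_type(raw["housing_type_raw"]),
        "feature_b": feature_b(raw["feature_b_raw"]),
    }
)
```

Each cleaned column keeps its own function, so per-column lineage is preserved and no rename step is needed at the end.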
e
@Thierry Jean has it exactly right. One thing that I've been meaning to add (and plan to shortly) is the `pipe` decorator, which allows you to effectively rename. Would love your feedback! https://github.com/DAGWorks-Inc/hamilton/issues/372