This message was deleted Hamilton Open Source #hamilton-help



# hamilton-help

Slackbot

03/12/2024, 5:12 PM

This message was deleted.

Seth Stokes

03/12/2024, 5:13 PM

or would it just be a mapping before the pipeline/`@extract_columns`?


def input_field_mapping() -> dict:
    """Field mapping step to ensure downstream nodes don't break should the input field names change."""
    # raw field name -> standard field name (contents elided in the original message)
    mapping = {...}
    return mapping


import pandas as pd

def raw_df(data_path: str) -> pd.DataFrame:
    return pd.read_csv(data_path)


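from hamilton.function_modifiers import extract_columns

@extract_columns(
    "YearBuilt",
    "LotFrontage",
    "GarageArea",
    "OverallQual",
    "OverallCond",
    "MSZoning",
    "TotalBsmtSF",
)
def raw_data_w_standard_field_names(raw_df: pd.DataFrame, input_field_mapping: dict) -> pd.DataFrame:
    # Hamilton wires inputs by parameter name, so this parameter is named after
    # the `input_field_mapping` node defined above.
    # some work
    return raw_df.rename(columns=input_field_mapping)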

Stefan Krawczyk

03/12/2024, 6:09 PM

Yep, that seems like a reasonable approach. Maintaining a mapping helps keep things standard for everyone else downstream.

Thierry Jean

03/15/2024, 3:27 PM

Maybe I'm not fully covering your use case, but I like having the columns as a module-level constant. This allows you to reuse it throughout the module for consistency. It also depends on how dynamic the mapping is. For example:


RAW_COLUMN_MAPPING = {
    ...: "YearBuilt", 
    ...: "LotFrontage", 
    ...: "GarageArea", 
    ...: "OverallQual", 
    ...: "OverallCond", 
    ...: "MSZoning", 
    ...: "TotalBsmtSF",
}

# allows you to do
@extract_columns(*RAW_COLUMN_MAPPING.values())  # unpack dictionary values
def raw_data_w_standard_field_names(raw_df: pd.DataFrame, mapping: dict = RAW_COLUMN_MAPPING) -> pd.DataFrame:
    return raw_df.rename(columns=mapping)

If your dataflow spans multiple modules, you can still access the mapping / column names via

my_module.RAW_COLUMN_MAPPING
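
One thing worth guarding against with this pattern: `DataFrame.rename` silently ignores mapping keys that are not present in the frame, so a renamed input field would only surface later as a missing column. Below is a minimal sketch of a fail-fast check, not something from the thread. The raw names on the left of the mapping and the helper `check_raw_columns` are hypothetical, and the check works on a plain list of column names so it runs without pandas; with a DataFrame you would pass `list(raw_df.columns)`.

```python
from __future__ import annotations

# Hypothetical raw names on the left; the thread elides the real ones.
RAW_COLUMN_MAPPING = {
    "year_built": "YearBuilt",
    "lot_frontage": "LotFrontage",
    "garage_area": "GarageArea",
}


def check_raw_columns(raw_columns: list[str], mapping: dict[str, str]) -> list[str]:
    """Return the standardized names for raw_columns, failing fast on missing inputs."""
    # Any mapped raw name absent from the input means the upstream schema changed.
    missing = [raw for raw in mapping if raw not in raw_columns]
    if missing:
        raise KeyError(f"raw columns missing from input: {missing}")
    # Columns without a mapping entry keep their original name.
    return [mapping.get(col, col) for col in raw_columns]


standard = check_raw_columns(
    ["year_built", "lot_frontage", "garage_area", "id"],
    RAW_COLUMN_MAPPING,
)
```

A check like this could sit inside the renaming function, before the call to `rename`, so the failure points at the input rather than at a downstream node.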
