This message was deleted Hamilton Open Source #hamilton-help


# hamilton-help

Slackbot

03/08/2023, 9:01 AM

This message was deleted.

👀 1

Elijah Ben Izzy

03/08/2023, 3:54 PM

Hey! So, yeah, it’s an interesting question. I think you have a few approaches here (and there’s a tool we’re thinking of building that’ll simplify your life; see this one: https://github.com/DAGWorks-Inc/hamilton/issues/65). Inline:

1. Do I miss something in my pipeline?

Not sure what you mean, but I think your overall design is reasonable from what I can see. We’re working on the ability to save/materialize data within Hamilton so it could handle more, but that’s not available yet.

2. How and where to fix datatypes?

3. What is the best way to do this (if in Hamilton)?

You have some options here. One is to do this within the Hamilton pipeline, e.g. with a step called something like `df_with_fixed_datatypes`. It could even take in a bunch of configuration arguments for how to do this:


import pandas as pd

def df_with_fixed_datatypes(
    raw_df: pd.DataFrame,
    id_column_type: str,
    # ... (further configuration arguments elided)
) -> pd.DataFrame:
    return ...
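To make the idea concrete, here is a minimal sketch of what such a step might look like in plain pandas. The `column_types` argument and the `id` and `price` column names are hypothetical, invented for illustration; they are not from the original question.

```python
import pandas as pd


def df_with_fixed_datatypes(
    raw_df: pd.DataFrame,
    id_column_type: str,
    column_types: dict,  # hypothetical: maps column name -> dtype string
) -> pd.DataFrame:
    """Return a copy of raw_df with columns cast to the configured dtypes."""
    # Assumes the identifier column is called "id" (illustrative only).
    wanted = {"id": id_column_type, **column_types}
    # Only cast columns that are actually present in the frame.
    present = {col: dtype for col, dtype in wanted.items() if col in raw_df.columns}
    return raw_df.astype(present)


raw = pd.DataFrame({"id": ["1", "2"], "price": ["1.5", "2.0"]})
fixed = df_with_fixed_datatypes(
    raw, id_column_type="int64", column_types={"price": "float64"}
)
```

In a Hamilton pipeline, `id_column_type` and `column_types` would be supplied as inputs or configuration rather than hard-coded, so the casting rules stay outside the function body.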

4. Should I make features “ideal” on stage of Hamilton or it will be better to handle them later in terms of “bad values”?

If I understand what you’re getting at, I think you want to do this earlier rather than later. Keep it in the Hamilton pipeline, so that the first piece makes the data good and the latter piece processes it. You can take the strategy I showed above and make a specific function for doing this, or you can break it out into columns. If you’re dropping rows, you might want a single function at the beginning that does something like:


def df_features_cleaned(
    raw_df: pd.DataFrame,
    replace_nulls_with: str = 'mean',
    # ... (further arguments elided)
) -> pd.DataFrame:
    ...  # body elided

Then you pass `replace_nulls_with` to the driver on execution. Or you can do it at the column level. The piece that we’re adding (https://github.com/DAGWorks-Inc/hamilton/issues/65) would allow you to do conditional redefinition of functions, which would make this really natural (so you could do it in multiple steps). Does this help?
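As a rough illustration of the single-function approach, the cleaning step could branch on `replace_nulls_with` like this. The accepted values other than `'mean'` (`'zero'` and `'drop'`) are assumptions made up for this sketch, and it only touches numeric columns.

```python
import pandas as pd


def df_features_cleaned(
    raw_df: pd.DataFrame,
    replace_nulls_with: str = "mean",
) -> pd.DataFrame:
    """Return a cleaned copy of raw_df; nulls in numeric columns are handled per strategy."""
    cleaned = raw_df.copy()
    numeric_cols = list(cleaned.select_dtypes(include="number").columns)
    if replace_nulls_with == "mean":
        # fillna with a Series fills each column with its own mean.
        cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].mean())
    elif replace_nulls_with == "zero":  # hypothetical option
        cleaned[numeric_cols] = cleaned[numeric_cols].fillna(0)
    elif replace_nulls_with == "drop":  # hypothetical option: drop rows with nulls
        cleaned = cleaned.dropna(subset=numeric_cols)
    else:
        raise ValueError(f"Unknown strategy: {replace_nulls_with!r}")
    return cleaned


raw = pd.DataFrame({"a": [1.0, None, 3.0]})
by_mean = df_features_cleaned(raw, replace_nulls_with="mean")
by_zero = df_features_cleaned(raw, replace_nulls_with="zero")
dropped = df_features_cleaned(raw, replace_nulls_with="drop")
```

Called directly like this it is just a pandas function; inside Hamilton, the strategy string would arrive as an input at execution time, which is what keeps the choice in the hands of whoever runs the pipeline.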

Stefan Krawczyk

03/08/2023, 6:06 PM

From another side I will lose flexibility to preprocess data in different ways: “replace missing not with 0 but with mean”, etc. for DS team

@Игорь Хохолко if you are worried about flexibility, that’s when you can use `@config.when` to switch out an implementation based on configuration, e.g. don’t replace NaNs if you don’t have to. So you’d have a node that does whatever cleaning/formatting it needs and just have different implementations for different contexts.

👀 1

🔥 1
