# hamilton-help
h
i.e. looking to pass all my derived variables to a function to create a test/train split without laboriously naming each one in the call to the function.
s
Yep, so writing out all the node names as function arguments is one way:
def model_function(col1: pd.Series, ..., colN: pd.Series) -> ...:
    # update the function signature for each and every column we want for the model
It’s then very clear when things are changing from a change-management perspective, but yes, this requires updates any time something changes, and it can be verbose to write out. Question, just to understand a bit better before answering more: what’s the pain point for you? Development? Or is this going to change frequently, and if so, how frequently?
h
Hi Stefan, it's a classification problem, so I'm engineering quite a few derived variables, splitting into test/train, and then fitting the model and assessing its accuracy. I guess I was stuck on how, logically, I go from my last derived variable to passing a new dataframe of all my derived columns to be split. As you mention, this is a process that's likely to get reviewed semi-regularly (every quarter, maybe) with the creation/removal of certain DVs. To avoid verbosity, it'd be nice to not have to add every DV as a series to the function call, although agreed that doing so would make it clear what's involved. Realistically I was looking for a shortcut, but it sounds like this might just be the go-to approach at this point?
s
Yep, you can operate over dataframes. I’d just construct a function to do that, and then have the rest operate over dataframes, e.g.
from typing import Dict

import pandas as pd
from hamilton.function_modifiers import extract_fields

def data_set(col1: pd.Series, ..., colN: pd.Series) -> pd.DataFrame:
    # this function describes the columns that go into the data set
    # logic to create the dataframe
    return df

@extract_fields({'train_set': pd.DataFrame, 'test_set': pd.DataFrame})
def train_test_split(data_set: pd.DataFrame, split_ratio: float, ...) -> Dict[str, pd.DataFrame]:
    # logic to split the data_set
    return {'train_set': train_df, 'test_set': test_df}

def train_model(train_set: pd.DataFrame, ...) -> ...:
    # fit the model...
You can see some of this structure in the scikit-learn example.
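To make that concrete, here's a minimal sketch of what the driver side could look like, assuming the functions above live in a hypothetical module called pipeline, and using a dict result builder so the two dataframes come back separately (exact setup can vary by Hamilton version):
from hamilton import base, driver

import pipeline  # hypothetical module containing data_set, train_test_split, train_model

# return results as a dict rather than stitching the outputs into one dataframe
adapter = base.SimplePythonGraphAdapter(base.DictResult())
dr = driver.Driver({}, pipeline, adapter=adapter)

results = dr.execute(
    ['train_set', 'test_set'],
    inputs={'split_ratio': 0.8},  # plus whatever raw inputs your column functions need
)
train_df, test_df = results['train_set'], results['test_set']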
Happy to jump on a call too to help explain.
h
Cheers Stefan I'll give it a go and maybe revisit if I encounter any blockers 👍
👍 1
s
Cool, in terms of structuring your code, it might be helpful to split your code into:
• feature creation Python modules
• modules to create the data sets
• model fitting modules
That way you can instantiate a Driver and the DAG easily to include everything, or just the portions that you want to run, i.e. giving yourself flexibility with respect to the driver code you want to write and run.
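As a rough sketch (the module names here are hypothetical), the driver code could then look like:
from hamilton import driver

import features    # hypothetical module: feature creation functions
import data_sets   # hypothetical module: data_set / train_test_split functions
import models      # hypothetical module: model fitting functions

config = {}
# build the full DAG from everything...
dr = driver.Driver(config, features, data_sets, models)
# ...or instantiate with only the portion you care about, e.g. just feature creation
feature_dr = driver.Driver(config, features)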
h
Yes I can see the benefit of that I think 🤔 I'll discuss further with James tmrw, tyvm!
w
If I may introduce my opinion here: I use Hamilton to generate dataframes for training and testing, but I split feature generation from model train/score. This makes it very easy to select the features I want to use without needing to explicitly put them in a function.
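Roughly, the pattern is to keep the feature functions in their own module and then ask the driver for whichever feature nodes you want by name (illustrative sketch only, not my actual code, and the names are made up):
import pandas as pd
from hamilton import driver

import features  # hypothetical module: one function per derived feature

dr = driver.Driver({}, features)
# pick whichever feature nodes you want for this run by name;
# the default result builder assembles the requested series into one dataframe
selected_features = ['age_bucket', 'spend_last_30d', 'num_logins']  # illustrative names
training_df = dr.execute(
    selected_features,
    inputs={'raw_data': pd.read_csv('data.csv')},  # whatever raw inputs the feature functions expect
)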
👍 1
s
@Wit Jakuczun love it! A blog post, perhaps? Or do you want to submit an example to the examples folder?