# hamilton-help
h
i.e. looking to pass all my derived variables to a function to create a test/train split without laboriously naming each one in the call to the function.
s
Yep, so writing out all the node names as function arguments is one way:
def model_function(col1: pd.Series, ..., colN: pd.Series) -> ...:
    # update the function signature for each and every column we want for the model
It’s then very clear when things are changing from a change-management perspective, but yes, this requires updates any time something changes, and it can be verbose to write out. Question, just to understand a bit better before answering more: what’s the pain point for you? Development? Or is this going to change frequently, and if so, how frequently?
h
Hi Stefan, it's a classification problem, so I'm engineering quite a few derived variables, splitting into test/train, and then fitting the model and assessing its accuracy. I guess I was stuck on how, logically, I go from my last derived variable to passing a new dataframe of all my derived columns to be split. As you mention, this is a process that's likely to get reviewed semi-regularly (every quarter, maybe) with the creation/removal of certain DVs. To avoid verbosity, it'd be nice to not have to add every DV as a series to the function call, although agreed that doing so would make it clear what's involved. Realistically I was looking for a shortcut, but it sounds like this might just be the go-to approach at this point?
s
Yep, you can operate over dataframes. I’d just construct a function to do that, and then have the rest operate over dataframes, e.g.
from typing import Dict

import pandas as pd
from hamilton.function_modifiers import extract_fields

def data_set(col1: pd.Series, ..., colN: pd.Series) -> pd.DataFrame:
    # this function describes the columns that go into the data set
    # logic to create the dataframe
    return df

@extract_fields({'train_set': pd.DataFrame, 'test_set': pd.DataFrame})
def train_test_split(data_set: pd.DataFrame, split_ratio: float, ...) -> Dict[str, pd.DataFrame]:
    # logic to split the data_set
    return {'train_set': train_df, 'test_set': test_df}

def train_model(train_set: pd.DataFrame, ...) -> ...:
    # fit the model...
You can see some of this structure in the scikit-learn example.
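To make that concrete, here's a minimal sketch of what the driver side could look like, assuming the functions above live in a hypothetical module called pipeline, and using a dict result builder so the two dataframes come back separately (exact setup can vary by Hamilton version):
from hamilton import base, driver

import pipeline  # hypothetical module containing data_set, train_test_split, train_model

# return results as a dict rather than stitching the outputs into one dataframe
adapter = base.SimplePythonGraphAdapter(base.DictResult())
dr = driver.Driver({}, pipeline, adapter=adapter)

results = dr.execute(
    ['train_set', 'test_set'],
    inputs={'split_ratio': 0.8},  # plus whatever raw inputs your column functions need
)
train_df, test_df = results['train_set'], results['test_set']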
Happy to jump on a call too to help explain.
h
Cheers Stefan I'll give it a go and maybe revisit if I encounter any blockers 👍
👍 1
s
Cool, in terms of structuring your code, it might be helpful to split your code into:
• feature creation Python modules
• modules to create the data sets
• model fitting modules
That way you can instantiate a Driver and the DAG easily to include everything, or just the portions that you want to run, i.e. giving yourself flexibility with respect to the driver code you want to write and run.
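As a rough sketch (the module names here are hypothetical), the driver code could then look like:
from hamilton import driver

import features    # hypothetical module: feature creation functions
import data_sets   # hypothetical module: data_set / train_test_split functions
import models      # hypothetical module: model fitting functions

config = {}
# build the full DAG from everything...
dr = driver.Driver(config, features, data_sets, models)
# ...or instantiate with only the portion you care about, e.g. just feature creation
feature_dr = driver.Driver(config, features)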
h
Yes I can see the benefit of that I think 🤔 I'll discuss further with James tmrw, tyvm!
w
If I may introduce my opinion here: I use Hamilton to generate dataframes for training and testing, but I split feature generation from model train/score. This makes it very easy to select the features I want to use without needing to explicitly put them in a function.
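Roughly, the pattern is to keep the feature functions in their own module and then ask the driver for whichever feature nodes you want by name (illustrative sketch only, not my actual code, and the names are made up):
import pandas as pd
from hamilton import driver

import features  # hypothetical module: one function per derived feature

dr = driver.Driver({}, features)
# pick whichever feature nodes you want for this run by name;
# the default result builder assembles the requested series into one dataframe
selected_features = ['age_bucket', 'spend_last_30d', 'num_logins']  # illustrative names
training_df = dr.execute(
    selected_features,
    inputs={'raw_data': pd.read_csv('data.csv')},  # whatever raw inputs the feature functions expect
)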
👍 1
s
@Wit Jakuczun love it! A blog post, perhaps? Or do you want to submit an example to the examples folder?