# general
j
Hey folks, happy Wednesday. I had a question about scenarios for ML training. There's a nice example for iris on the website: https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/model_examples/scikit-learn However, I'm interested in models that have more complex pipelines (e.g. text transformation). It's relatively easy to set up a training pipeline, and it's then possible to override the pipeline stages with some data for an inference pipeline, but that just winds up feeling super fragile.
I think the key thing I'm realizing is that I'm trying to think of "looping" through the dag for training and inference, and that perhaps best practice would suggest that I create two distinct phases in a single DAG, even if there's some duplication.
I was trying to minimize the duplication as best as possible. I'm realizing that I really just want copies of the node functions with slightly altered names (so e.g. inference phase names don't collide with the training phase names).
I suppose one approach would involve using a function that creates names and aliases for all the functions in another module. It could be given some special treatment in the visualization as well maybe. I wanted to throw this idea out there and see if I'm barking up the right tree.
s
Could you draw a flowchart of what you want and what the difference between training & inference would be? Some ideas:
• @with_columns for pandas/polars (operations are namespaced)
• Hamilton within Hamilton (no namespace collisions)
• Burr as high-level orchestration of "stages", Hamilton for what happens inside (no namespace collisions)
• subdag + @config.when? (you manage namespacing more manually here)
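On that last idea, a rough sketch of how @subdag gives each phase its own namespace; the featurization module and the node names (raw_data, feature_tensor, training_data, inference_data) are placeholders for illustration, and @config.when could additionally gate which phase gets built:
```python
# Reuse one featurization module for both phases via @subdag, which namespaces
# the inner nodes. training_data / inference_data would be supplied as driver
# inputs or defined elsewhere in the DAG.
import pandas as pd
from hamilton.function_modifiers import subdag, source

import featurization  # hypothetical module: raw_data -> ... -> feature_tensor


@subdag(featurization, inputs={"raw_data": source("training_data")})
def training_features(feature_tensor: pd.DataFrame) -> pd.DataFrame:
    # inner nodes live under the "training_features." namespace
    return feature_tensor


@subdag(featurization, inputs={"raw_data": source("inference_data")})
def inference_features(feature_tensor: pd.DataFrame) -> pd.DataFrame:
    # same logic, separate namespace, so no name collisions with training
    return feature_tensor
```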
j
Yeah, let me just write the thing out with full duplication. I'll go with that as a first pass anyways.
I wound up figuring out a way to design around it. My main learning was to avoid mutating data until it's necessary... sort of a "just in time" approach.
The other thing I'm learning is to take advantage of the fact that you can label nodes that refer to the different states of the same reference. E.g. you can have two nodes named "init_model" and "trained_model", and it's clear what the state is, even though they're referring to the same memory location.
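A minimal sketch of that naming pattern, with sklearn's LogisticRegression standing in for the actual model class and made-up parameter names:
```python
# Name nodes after the *state* of the object rather than the object itself.
import numpy as np
from sklearn.linear_model import LogisticRegression


def init_model(penalty: str = "l2") -> LogisticRegression:
    """A freshly constructed, untrained model."""
    return LogisticRegression(penalty=penalty)


def trained_model(
    init_model: LogisticRegression, train_X: np.ndarray, train_y: np.ndarray
) -> LogisticRegression:
    """The same object after fitting; the node name documents the new state."""
    return init_model.fit(train_X, train_y)
```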
The other thing I'm learning is to create nodes that have multiple outputs, and to use something lightweight like NamedTuples to capture the bundle of fields. Then, create a few nodes to expose the individual results.
s
I’m learning is to create nodes that have multiple outputs
you mean more complex objects? TypedDict +
@extract_fields
can work well here
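A hedged sketch of what that could look like; all names here (TrainOutput, train, the field names) are illustrative:
```python
from typing import TypedDict

import pandas as pd
from hamilton.function_modifiers import extract_fields
from sklearn.linear_model import LogisticRegression


class TrainOutput(TypedDict):
    """Documents the bundle of values the training node produces."""
    model: LogisticRegression
    metrics: pd.DataFrame


@extract_fields({"model": LogisticRegression, "metrics": pd.DataFrame})
def train(train_X: pd.DataFrame, train_y: pd.Series) -> dict:
    # Return a plain dict (which a TypedDict constructor produces) so each
    # declared field becomes its own downstream node: `model` and `metrics`.
    fitted = LogisticRegression().fit(train_X, train_y)
    metrics = pd.DataFrame({"accuracy": [fitted.score(train_X, train_y)]})
    return TrainOutput(model=fitted, metrics=metrics)
```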
👍 1
j
E.g. here's a simple train_inputs one where I capture the output in a type called "TrainResult":
```python
from typing import NamedTuple
import polars as pl
from sklearn.preprocessing import LabelBinarizer

class TrainResult(NamedTuple):
    metrics: pl.DataFrame
    model: SimpleClassifier  # the project's own model class
    binarizer: LabelBinarizer
```
👍 2
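The "few nodes to show the results" then end up as thin accessors over that tuple. A sketch, assuming the node that returns TrainResult is named train_result (adjust to the real name), with SimpleClassifier, LabelBinarizer, and pl as in the snippet above:
```python
def trained_model(train_result: TrainResult) -> SimpleClassifier:
    return train_result.model


def trained_binarizer(train_result: TrainResult) -> LabelBinarizer:
    return train_result.binarizer


def training_metrics(train_result: TrainResult) -> pl.DataFrame:
    return train_result.metrics
```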
Oh I missed the @extract_fields, thanks for that.
s
extract_fields should be able to work on a named tuple, but we might need to add / change a type check for it.
j
I'm pretty happy as-is. I'll try to do a pull request if I get super fussy about it 🙂
👍 1
I like having the return types defined outside of the DAG code. Partly because they're part of the code's contract, separate from the implementation, and partly because they're really just tedious to read. Most of the time I'm just bundling a few values together in a struct.
Have a great evening, thanks again!
e
Cool how training_metrics is a sink node, whereas trained_model and trained_binarizer are not. E.g. you may not always want training_metrics but will want trained_model, and caching can allow you to get training_metrics after the fact!
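A hedged sketch of that caching flow: my_pipeline is a placeholder for the module containing these nodes, and with_cache() is Hamilton's built-in caching in recent versions, which persists node results between runs:
```python
from hamilton import driver

import my_pipeline  # hypothetical module with the nodes above

dr = (
    driver.Builder()
    .with_modules(my_pipeline)
    .with_cache()
    .build()
)

# Ask only for the model now...
model = dr.execute(["trained_model"])["trained_model"]

# ...and pull the metrics later: the expensive upstream training node is
# already cached, so this doesn't refit anything.
metrics = dr.execute(["training_metrics"])["training_metrics"]
```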
j
Yeah this last week or two I've rewired my brain around DAGs a bit better thanks to you two. 🙂
❤️ 1
Ok, even after I cleaned up the code I still hit a bit of a wall here.
I have a batch_predict that takes the entire training dataset, and produces predictions for each record.
Next, I want to apply the trained_model to a new dataset composed of records that have no categorization labels.
I thought subdag would work here, but it doesn't appear to let me use an override.
(attachment: Screenshot 2024-12-06 at 6.04.14 PM.png)
In essence, I want to make the batch_label_unlabeled node.
I can do it by hacking things with driver output and overrides, but I'd rather have the whole thing in a single DAG.
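For concreteness, that driver-level workaround looks roughly like this; pipeline, raw_data, and batch_predictions are placeholder names, and labeled_df / unlabeled_df stand in for however the two datasets get loaded:
```python
from hamilton import driver

import pipeline  # hypothetical module containing the full dataflow

dr = driver.Builder().with_modules(pipeline).build()

labeled_df = ...    # placeholder: the labeled training records
unlabeled_df = ...  # placeholder: the new, label-free records

# Run 1: train on the labeled data and keep the fitted model.
train_results = dr.execute(
    ["trained_model", "training_metrics"],
    inputs={"raw_data": labeled_df},
)

# Run 2: swap in the unlabeled dataset and the already-fitted model via
# `overrides`, and only ask for the predictions.
unlabeled_results = dr.execute(
    ["batch_predictions"],
    inputs={"raw_data": unlabeled_df},
    overrides={"trained_model": train_results["trained_model"]},
)
```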
s
Because you want to reuse the featurization part?
What's the part you want to reuse?
j
I'd like to re-use the part that generates the input tensor. I have it spread across two modules... one creates embeddings, the other combines them into a single tensor.
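One possible single-DAG shape, sketched with @subdag wrapping those two modules; embeddings, combine, and the node names (raw_text, input_tensor, unlabeled_text) are placeholders, and SimpleClassifier is the model class from earlier in the thread:
```python
import numpy as np
from hamilton.function_modifiers import subdag, source

import embeddings  # hypothetical module: raw_text -> ... embeddings ...
import combine     # hypothetical module: ... embeddings ... -> input_tensor


@subdag(embeddings, combine, inputs={"raw_text": source("unlabeled_text")})
def unlabeled_input_tensor(input_tensor: np.ndarray) -> np.ndarray:
    """The same featurization as training, namespaced under this node."""
    return input_tensor


def batch_label_unlabeled(
    trained_model: SimpleClassifier, unlabeled_input_tensor: np.ndarray
) -> np.ndarray:
    """Predictions for the records that have no labels."""
    return trained_model.predict(unlabeled_input_tensor)
```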
s
For those watching I have a sketch here - https://github.com/DAGWorks-Inc/hamilton/pull/1251