# general
j
Hey folks, happy Wednesday. I had a question about scenarios for ML training. There's a nice example for iris on the website: https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/model_examples/scikit-learn However, I'm interested in models that have more complex pipelines (e.g. text transformation). It's relatively easy to set up a training pipeline, and it's then possible to override the pipeline stages with some data for an inference pipeline, but that just winds up feeling super fragile.
I think the key thing I'm realizing is that I'm trying to think of "looping" through the dag for training and inference, and that perhaps best practice would suggest that I create two distinct phases in a single DAG, even if there's some duplication.
I was trying to minimize the duplication as best as possible. I'm realizing that I really just want copies of the node functions with slightly altered names (so e.g. inference phase names don't collide with the training phase names).
I suppose one approach would involve using a function that creates names and aliases for all the functions in another module. It could be given some special treatment in the visualization as well maybe. I wanted to throw this idea out there and see if I'm barking up the right tree.
s
Could you draw a flowchart of what you want and what the difference between training & inference would be? Some ideas:
• @with_columns for pandas/polars (operations are namespaced)
• Hamilton within Hamilton (no namespace collisions)
• Burr as high-level orchestration of "stages", Hamilton for what happens inside (no namespace collisions)
• subdag + @config.when? (you manage namespacing more manually here)
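On that last idea, a rough sketch of how @subdag gives each phase its own namespace; the featurization module and the node names (raw_data, feature_tensor, training_data, inference_data) are placeholders for illustration, and @config.when could additionally gate which phase gets built:
```python
# Reuse one featurization module for both phases via @subdag, which namespaces
# the inner nodes. training_data / inference_data would be supplied as driver
# inputs or defined elsewhere in the DAG.
import pandas as pd
from hamilton.function_modifiers import subdag, source

import featurization  # hypothetical module: raw_data -> ... -> feature_tensor


@subdag(featurization, inputs={"raw_data": source("training_data")})
def training_features(feature_tensor: pd.DataFrame) -> pd.DataFrame:
    # inner nodes live under the "training_features." namespace
    return feature_tensor


@subdag(featurization, inputs={"raw_data": source("inference_data")})
def inference_features(feature_tensor: pd.DataFrame) -> pd.DataFrame:
    # same logic, separate namespace, so no name collisions with training
    return feature_tensor
```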
j
Yeah, let me just write the thing out with full duplication. I'll go with that as a first pass anyways.
I wound up figuring out a way to design around it. My main learning was to avoid mutating data until it's necessary... sort of a "just in time" approach.
The other thing I'm learning is to take advantage of the fact that you can label nodes that refer to the different states of the same reference. E.g. you can have two nodes named "init_model" and "trained_model", and it's clear what the state is, even though they're referring to the same memory location.
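A minimal sketch of that naming pattern, with sklearn's LogisticRegression standing in for the actual model class and made-up parameter names:
```python
# Name nodes after the *state* of the object rather than the object itself.
import numpy as np
from sklearn.linear_model import LogisticRegression


def init_model(penalty: str = "l2") -> LogisticRegression:
    """A freshly constructed, untrained model."""
    return LogisticRegression(penalty=penalty)


def trained_model(
    init_model: LogisticRegression, train_X: np.ndarray, train_y: np.ndarray
) -> LogisticRegression:
    """The same object after fitting; the node name documents the new state."""
    return init_model.fit(train_X, train_y)
```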
The other thing I'm learning is to create nodes that have multiple outputs, and to use something lightweight like NamedTuples to capture the bundle of fields. Then, create a few nodes to expose the individual results.
s
I’m learning is to create nodes that have multiple outputs
you mean more complex objects? TypedDict +
@extract_fields
can work well here
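A hedged sketch of what that could look like; all names here (TrainOutput, train, the field names) are illustrative:
```python
from typing import TypedDict

import pandas as pd
from hamilton.function_modifiers import extract_fields
from sklearn.linear_model import LogisticRegression


class TrainOutput(TypedDict):
    """Documents the bundle of values the training node produces."""
    model: LogisticRegression
    metrics: pd.DataFrame


@extract_fields({"model": LogisticRegression, "metrics": pd.DataFrame})
def train(train_X: pd.DataFrame, train_y: pd.Series) -> dict:
    # Return a plain dict (which a TypedDict constructor produces) so each
    # declared field becomes its own downstream node: `model` and `metrics`.
    fitted = LogisticRegression().fit(train_X, train_y)
    metrics = pd.DataFrame({"accuracy": [fitted.score(train_X, train_y)]})
    return TrainOutput(model=fitted, metrics=metrics)
```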
👍 1
j
E.g. here's a simple train_inputs one where I capture the output in a type called "TrainResult":
```python
from typing import NamedTuple
import polars as pl
from sklearn.preprocessing import LabelBinarizer

class TrainResult(NamedTuple):
    metrics: pl.DataFrame
    model: SimpleClassifier  # the project's own model class
    binarizer: LabelBinarizer
```
👍 2
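The "few nodes to show the results" then end up as thin accessors over that tuple. A sketch, assuming the node that returns TrainResult is named train_result (adjust to the real name), with SimpleClassifier, LabelBinarizer, and pl as in the snippet above:
```python
def trained_model(train_result: TrainResult) -> SimpleClassifier:
    return train_result.model


def trained_binarizer(train_result: TrainResult) -> LabelBinarizer:
    return train_result.binarizer


def training_metrics(train_result: TrainResult) -> pl.DataFrame:
    return train_result.metrics
```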
Oh I missed the @extract_fields, thanks for that.
s
extract_fields should be able to work on a named tuple, but we might need to add / change a type check for it.
j
I'm pretty happy as-is. I'll try to do a pull request if I get super fussy about it 🙂
👍 1
I like having the return types defined outside of the DAG code. Partly because they're part of the code's contract, separate from the implementation, and partly because they're really just tedious to read. Most of the time I'm just bundling a few values together in a struct.
Have a great evening, thanks again!
e
Cool how training_metrics is a sink node, whereas trained_model and trained_binarizer are not. E.g. you may not always want training_metrics but will want trained_model, and caching can allow you to get training_metrics after the fact!
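A hedged sketch of that caching flow: my_pipeline is a placeholder for the module containing these nodes, and with_cache() is Hamilton's built-in caching in recent versions, which persists node results between runs:
```python
from hamilton import driver

import my_pipeline  # hypothetical module with the nodes above

dr = (
    driver.Builder()
    .with_modules(my_pipeline)
    .with_cache()
    .build()
)

# Ask only for the model now...
model = dr.execute(["trained_model"])["trained_model"]

# ...and pull the metrics later: the expensive upstream training node is
# already cached, so this doesn't refit anything.
metrics = dr.execute(["training_metrics"])["training_metrics"]
```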
j
Yeah this last week or two I've rewired my brain around DAGs a bit better thanks to you two. 🙂
❤️ 1
Ok, even after I cleaned up the code I still hit a bit of a wall here.
I have a batch_predict that takes the entire training dataset, and produces predictions for each record.
Next, I want to apply the trained_model to a new dataset composed of records that have no categorization labels.
I thought subdag would work here, but it doesn't appear to let me use an override.
(attachment: Screenshot 2024-12-06 at 6.04.14 PM.png)
In essence, I want to make the batch_label_unlabeled node.
I can do it by hacking things with driver output and overrides, but I'd rather have the whole thing in a single DAG.
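For concreteness, that driver-level workaround looks roughly like this; pipeline, raw_data, and batch_predictions are placeholder names, and labeled_df / unlabeled_df stand in for however the two datasets get loaded:
```python
from hamilton import driver

import pipeline  # hypothetical module containing the full dataflow

dr = driver.Builder().with_modules(pipeline).build()

labeled_df = ...    # placeholder: the labeled training records
unlabeled_df = ...  # placeholder: the new, label-free records

# Run 1: train on the labeled data and keep the fitted model.
train_results = dr.execute(
    ["trained_model", "training_metrics"],
    inputs={"raw_data": labeled_df},
)

# Run 2: swap in the unlabeled dataset and the already-fitted model via
# `overrides`, and only ask for the predictions.
unlabeled_results = dr.execute(
    ["batch_predictions"],
    inputs={"raw_data": unlabeled_df},
    overrides={"trained_model": train_results["trained_model"]},
)
```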
s
Because you want to reuse the featurization part?
What's the part you want to reuse?
j
I'd like to re-use the part that generates the input tensor. I have it spread across two modules... one creates embeddings, the other combines them into a single tensor.
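One possible single-DAG shape, sketched with @subdag wrapping those two modules; embeddings, combine, and the node names (raw_text, input_tensor, unlabeled_text) are placeholders, and SimpleClassifier is the model class from earlier in the thread:
```python
import numpy as np
from hamilton.function_modifiers import subdag, source

import embeddings  # hypothetical module: raw_text -> ... embeddings ...
import combine     # hypothetical module: ... embeddings ... -> input_tensor


@subdag(embeddings, combine, inputs={"raw_text": source("unlabeled_text")})
def unlabeled_input_tensor(input_tensor: np.ndarray) -> np.ndarray:
    """The same featurization as training, namespaced under this node."""
    return input_tensor


def batch_label_unlabeled(
    trained_model: SimpleClassifier, unlabeled_input_tensor: np.ndarray
) -> np.ndarray:
    """Predictions for the records that have no labels."""
    return trained_model.predict(unlabeled_input_tensor)
```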
s
For those watching I have a sketch here - https://github.com/DAGWorks-Inc/hamilton/pull/1251