Hamilton Open Source #hamilton-help

Tom Barber

03/15/2024, 4:42 PM

oh, it seems I was indeed just thinking about it wrong, as when I include the earlier dataframe as a function object it doesn't create 2 of them. I could have sworn it was running the upstream function twice last night. I'll return to my hole

😁 2

🙌 2

Thierry Jean

03/15/2024, 4:56 PM

glad you found a solution!

Stefan Krawczyk

03/15/2024, 5:06 PM

@Tom Barber would you mind summarizing what you missed? Could be helpful for others / we could add something to the docs? 🙂

Tom Barber

03/15/2024, 5:13 PM

sure Stefan, there are a few bits that are still a bit fuzzy, although I'm getting there, and I must say the examples in the repo and the docs are great; I can see how much work must go in there.

The first is materialization, and how you might chain storage, loading and processing of other data around the materialization process. Although it seems obvious now, it wasn't obvious that an array of materializers and various inputs on the function side would result in the functions being called once and flowed through the pipeline.

The other thing I haven't yet figured out, and maybe it doesn't matter as much, is subdags. For example, here I want 1 top-level DAG that then calls a number of subdags. But if I want to materialize inside a subdag, it seems (and this is probably the bit I'm missing) that I still need to set up a driver within the subdag to call dr.materialize.

Lastly, config properties: can they get passed down from the parent DAG to a subdag? I see the config dict in the subdag decorator, but it's not clear whether I can pass things from the parent DAG that might be compile-time or global, for example.

👍 1
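
For anyone reading along later, here is a minimal sketch of the "called once" behaviour Tom describes. It is plain Python, not Hamilton's actual implementation: the `run` helper, the `call_counts` counter and the three example functions are all hypothetical names made up for illustration. The only thing borrowed from Hamilton is the convention that a function's parameter names refer to the upstream functions it depends on.

```python
import inspect

# Hypothetical counter, only here to show how often the upstream node runs.
call_counts = {"raw_df": 0}


def raw_df() -> list:
    """Upstream node: stands in for the 'earlier dataframe'."""
    call_counts["raw_df"] += 1
    return [1, 2, 3]


def doubled(raw_df: list) -> list:
    """First downstream node that depends on raw_df."""
    return [x * 2 for x in raw_df]


def total(raw_df: list) -> int:
    """Second downstream node that depends on raw_df."""
    return sum(raw_df)


def run(functions, outputs):
    """Toy executor (an assumption, not Hamilton's driver): resolve each
    requested output by parameter name, caching every node's result."""
    by_name = {f.__name__: f for f in functions}
    cache = {}

    def resolve(name):
        if name not in cache:
            fn = by_name[name]
            # Each parameter name is treated as the name of an upstream node.
            kwargs = {p: resolve(p) for p in inspect.signature(fn).parameters}
            cache[name] = fn(**kwargs)
        return cache[name]

    return {name: resolve(name) for name in outputs}


results = run([raw_df, doubled, total], ["doubled", "total"])
```

Both `doubled` and `total` ask for `raw_df`, but because the result is cached per run, `raw_df` executes a single time and its value is shared by both consumers.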

Stefan Krawczyk

03/15/2024, 5:18 PM

When you get time (doesn’t have to be now), can you say more about the reason to want to model things as a top-level DAG with subdags?

> But if I want to materialize inside a subdag, it seems(and this is probably the bit I’m missing), that I still need to setup a driver within the subdag to call dr.materialize

What you’ve articulated is correct around materialization and subdags. Are things fairly static, or are they dynamic? You should be able to inject the materializers for a subdag, but I don’t think we have a test case for that.

> Also lastly, config properties, can they get passed down from parent dag to subdag? I see the config dict in the subdag decorator but its not clear whether I can pass things from the parent dag that might be compile time or global, for example.

Right now, IIRC (I believe this is true), subdags can “reach into” the global config space to pull things out. In terms of DAG construction, everything is done before execution, unless you have a driver inside your function that you are manually creating, in which case that will run when that function is called.

Tom Barber

03/15/2024, 5:21 PM

well, so that's part of the question that's still a bit fuzzy. I'm migrating from a homebrew pipeline; the pipeline has a number of steps for data processing. So I could call dr.execute a bunch of times to run them in order. Or I could wire them up as subdags and chain them that way. I don't know if there is a right way or a wrong way. Perhaps, judging by your question, subdags are the wrong way 🙂

Stefan Krawczyk

03/15/2024, 5:27 PM

> I don’t know if there is a right way, or a wrong way. Perhaps, judging by your question, subdags are the wrong way 🙂

No, I don’t think so. I was more curious about the use case and motivation. The design decision is whether you want to model things as a single giant DAG; if so, subdag allows you to parameterize parts and still see it all as one giant DAG. So no “wrong answer” here, just maybe what is more ergonomic for the task at hand (what are you going to be updating, are there going to be many subdags, etc).

Tom Barber

03/15/2024, 5:30 PM

Yeah, so it's basically data ingestion, model training and prediction stuff. As I said earlier, it's currently a bunch of legacy stuff I'm porting in to get it working. But eventually it'll get rewritten in a more Hamilton-esque way, unlike the huge blob of insanely complex DBT it currently is. But even when it's rewritten, it still makes sense to chunk it up into "data processing", "training", "prediction", etc. So in that sense I guess there won't be a huge number of subdags; it's more control flow. But there will probably come a time where we end up managing multiple iterations of the same flow in the same process at the same time (multi-tenancy, for example)

Stefan Krawczyk

03/15/2024, 5:34 PM

Cool. Next week at the meet-up we’ll be doing a deep dive on approaches to “parameterize”/“reuse” logic. It’ll be recorded if you miss it.

🙌 3
