# hamilton-help
m
Say moduleA defines nodeA, which depends on nodeB defined in moduleB. NodeA will fail unless I pass both moduleA and moduleB to the driver
t
Since decorators (config, parameterize, etc.) affect DAG creation when the Driver is created from modules, you need an instantiated Driver to resolve the question "can I run node A". The most efficient way to check whether your request will succeed is Driver.validate_execution(). Notice that the question "can I run node A" implies "can I run node A *for given inputs and overrides*".
You can add a test that creates a Driver with module A and module B and ensures that you can request node X. It would add confidence that changes to A or B won't break your ability to request node X.
Splitting functions into modules helps in many regards, but it brings two challenges:
• ensuring all required nodes are available and properly connected (the one you mentioned)
• avoiding name collisions from nodes defined in separate modules
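(For illustration, a minimal sketch of such a test; pytest is assumed, the module/node/input names are made up, and validate_execution is assumed to take the same inputs/overrides arguments as execute:)
Copy code
# test_dag.py -- hypothetical module, node, and input names
import module_a
import module_b
from hamilton import driver


def test_node_x_is_computable():
    dr = driver.Driver({}, module_a, module_b)
    # raises if "node_x" cannot be computed from the given inputs/overrides
    dr.validate_execution(["node_x"], overrides={}, inputs={"raw_input": "some_value"})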
To avoid code duplication, you can still use the contents of a Hamilton module as regular functions (losing other benefits, though). For example:
Copy code
# module_a.py
import pandas as pd

# underscore prefix: a util function, not a Hamilton node
def _string_to_lowercase(string: str) -> str:
    return string.lower()

# a Hamilton node
def set_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    return ...
Copy code
# module_b.py
import pandas as pd

import module_a

def load_data(path: str) -> pd.DataFrame:
    df = pd.read_parquet(path)
    return module_a.set_dtypes(df)

def get_user_id(...) -> str:
    user_id = ...
    return module_a._string_to_lowercase(user_id)
m
Is there a programmatic way to create a master module that just imports all the node definitions that exist in a certain folder?
I tried it but couldn’t get it to work yet
e
Yep, so adding to what @Thierry Jean said, one valid strategy is to do a big “import-only” module. It depends on how you want to break it up. And yep, there should be a way with some Python-fu… can dig in.
t
@Elijah Ben Izzy there's probably some shenanigans about dumping the source code of all modules from importlib into an ad-hoc module and registering it in sys.modules. The util function could take in a list of Paths or module objects
Would be hard to debug though if you hit "function foo is defined twice", because you wouldn't know which import it came from
e
Yeah, I think defining an __init__.py module and importing everything else in it (import *) is pretty clean, if you want to break it up but have them all be part of the same DAG. Should validate that it works though :) And yeah, you could easily build something that looks for all of them — what did you try and what broke?
m
That’s what I tried, but somehow the driver didn’t allow it - will give it another shot, likely due to my lack of low-level Python skills
Intent is to have all node names be unique, btw
e
If you want to pair on getting it to work I probably have some time later this afternoon (1PT)
Heh good, cause that’s required :)
💯 1
m
Cannot do today but let me hack a bit over the weekend at home with a repo I can share then
👍 1
@Elijah Ben Izzy I tried your idea but it doesn’t seem to work
# nodes.py
from moduleA import *
from moduleB import *
Then call the driver with
import nodes
dr = driver.Driver(…, nodes, …)
lst = dr.list_available_variables()
The lst does NOT contain the nodes from moduleA and moduleB… so somehow, under the hood, the star imports don’t seem to get picked up. It feels like it should work but it doesn’t :-( Will dig a bit more, just wanted to post an update here (AFK today)
e
Hmm — I can also recreate it and debug. This might be an edge case worth fixing, or a use-case worth thinking through. Auto-detecting the modules might even be easier…
My guess is that we have a check that looks for the origin of the variables in the file — that’s not triggering for this import style. It’s been a bit since I looked at that code though.
m
Yeah, nothing urgent to fix. My hunch is that this indirection messes with the originating-function lookup
e
Another option is to have __init__.py define some global var “MODULES” — then all you have to do is import those. So adding a new one is (1) creating it and (2) importing it from __init__ and sticking it in that list
💡 1
But that’s not too different from putting it elsewhere (closer to where the driver is instantiated), just that it lives close to the code
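(For illustration, a minimal sketch of that layout; the package and module names are made up:)
Copy code
# my_dag/__init__.py
from . import module_a, module_b

# adding a new module = create it, import it here, append it to this list
MODULES = [module_a, module_b]
Copy code
# driver code
from hamilton import driver

import my_dag

dr = driver.Driver({}, *my_dag.MODULES)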
m
Oh, that’s not a bad idea, as I control a bit more what goes into it. Actually, this might work well… my end goal is to define a bunch of snowflake tables, and besides the table specs, I need to specify somewhere which nodes should make it into each table. So having the modules defined at that level might work well for my use case.
One thing I haven’t figured out yet:
1. Define a snowflake table spec and tell it which nodes to include
2. Define a bunch of stuff in the node tags and automatically have that determine what goes into a particular snowflake table
(1) feels a bit cleaner; also, it’s possible that the same node gets persisted into various snowflake tables
And then somehow throw materializer into the mix, still exploring
e
Interesting — so yeah, it depends what you want to be the “source of truth”, and what you want to be modifying to make changes. Both seem reasonable:
• The advantage of (1) is that it’s decoupled from the code (allowing you to change it at the data level, if I understand correctly) — this is better if you have multiple ways you want to call each metric that can’t easily be expressed in tags, or need to be configurable separately from Hamilton.
• (2) is nice because the code is attached to the data — there are multiple ways to go about it, but a system of tags with a custom decorator that validates them (delegating to the tag decorator) could allow your driver code to first query and then decide what to materialize, based off of the tags that exist.
The big question is what the standard workflow should be — E.G. what code do you want to change to make the common adjustments, and how can you ensure that’s the lowest possible cognitive burden for encouraged operations, and makes you think about the right things by virtue of the workflow itself.
m
(2) is closer to our current SQL-only setup. The end goal is to have exactly 1 place to touch to add a new column to a table. Or maybe 2 places: define a new node in Hamilton + register it to the snowflake table it should be added to
e
Yep, I think the trick here is to start with the desired workflow and work backwards. Also, got a function that’ll import everything in a subdirectory, presuming a basic __init__.py in that module:
Copy code
from hamilton import driver
import sample_module

from types import ModuleType
from typing import List
import pkgutil
import importlib

def import_all(base_module: ModuleType) -> List[ModuleType]:
    """Imports every submodule directly under base_module and returns them as a list."""
    modules = []
    for module_info in pkgutil.iter_modules(base_module.__path__):
        module_name = f"{base_module.__name__}.{module_info.name}"
        module = importlib.import_module(module_name)
        modules.append(module)
    return modules

all_modules = import_all(sample_module)

dr = driver.Driver({}, *all_modules)

print(dr.execute(["foo", "bar"]))
This is import_all.py — overall structure is: module_1 contains def foo, module_2 contains def bar. Haven’t tried it recursively, but might be easy enough.
Copy code
.
├── import_all.py
└── sample_module
    ├── __init__.py
    ├── module_1.py
    └── module_2.py
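(In case it's useful, an untested sketch of a recursive variant that swaps pkgutil.iter_modules for pkgutil.walk_packages:)
Copy code
import importlib
import pkgutil
from types import ModuleType
from typing import List


def import_all_recursive(base_module: ModuleType) -> List[ModuleType]:
    """Imports every module under base_module, descending into sub-packages."""
    modules = []
    prefix = f"{base_module.__name__}."
    for module_info in pkgutil.walk_packages(base_module.__path__, prefix=prefix):
        module = importlib.import_module(module_info.name)  # name is fully qualified thanks to prefix
        if not module_info.ispkg:  # only collect leaf modules with node definitions
            modules.append(module)
    return modules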
m
This looks pretty good, will try tomorrow when back at the computer
👍 1
e
LMK how it works when you do — could be a nice recipe to include!
m
@Elijah Ben Izzy Pondering this a bit more, your example could really work well for my use case. That is, I could define one “import_all.py”-like file per table, ie take the logic and call it my_tableA.py, my_tableB.py and run them as separate Airflow tasks (that’s what we currently use). Maybe even wrap a Python class around it to store things like Snowflake schema, and other db details, etc… This way, everything that defines my final snowflake table would be defined in its my_tableXYZ.py file - and I can easily re-use nodes for multiple snowflake tables. Overall, this feels pretty clean and easy to navigate once you have 100+ tables and many nodes define across different business lines. Let me toy with this but right now I cannot see why this shouldn’t work well. As mentioned before, main goal is to codify tribal knowledge of my team into “define once, re-use everywhere code” and above pattern might just do the trick
t
Not Hamilton-sponsored, but Booking just released a PyData talk titled "Tables as Code". You might find some interesting design decisions in it! talk:

https://www.youtube.com/watch?v=gTARHyGrcq0

🙏 1
m
@Thierry Jean just watched it, that’s a really great talk; a lot of their principles are very much aligned with how my team has been pondering these things - thanks for sharing!
😁 1
e
Nice talk, super related! Yep, I think that makes a lot of sense as an approach. They can share some utility functions, but there’s a clear mapping of code to table, it’s easy to reuse definitions, and you can visualize lineage for each table individually. If you wanted to visualize them all together you could do some clever stuff with subdag, but it could get pretty unwieldy
🙌 1
m
@Elijah Ben Izzy toying with my “use Hamilton to generate multiple snowflake tables that could share various Hamilton node columns” experiment, I believe I found a good way to structure the whole thing with a helper class HamiltonTable that takes care of whatever logic (define schema/table_name/required nodes, then call the driver to compute it all and stitch the columns together). And then I had an epiphany… would it actually make sense to simply define each of my Snowflake tables (basically a data frame output) as its own Hamilton node that simply takes, say, N nodes as input and stitches them together just like the driver would do? I guess what I’m trying to figure out:
1. Should I let the driver do the stitching and pass the nodes for a given table to the driver? Or,
2. Should I simply define a final node that returns a data frame and takes N nodes as input? Or,
3. Would it be better to structure each table as a subgraph?
Curious if you have any thoughts on this. There’s probably no “right” answer here, but maybe certain trade-offs I should consider?
In a sense, each of my snowflake tables is like a feature_df data frame
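(For illustration, option 2 above could be as small as one function per table; the node and column names here are made up:)
Copy code
# tables.py -- hypothetical: one final node per snowflake table
import pandas as pd


def my_table_a(metric_1: pd.Series, metric_2: pd.Series) -> pd.DataFrame:
    """Stitches the metric columns together; the driver only needs to request this node."""
    return pd.DataFrame({"metric_1": metric_1, "metric_2": metric_2})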
e
So yeah! That’s completely reasonable. To me the meta-question is what you want to enable changing as part of the design. Three general options:
1. Store the config for each table somewhere on the snowflake side, query Hamilton for the components + parameters
2. Store the config for each table in code, one per file (as you were planning), query Hamilton + parameters
3. Store the config in the DAG itself (as you suggested just now)
So, it’s a question of what the most common operations are, who will do them, and what code you want them to touch. For example…
• If it is an analyst querying it and data scientists defining it, then maybe it makes sense to have the analyst create a new config (either somewhere in snowflake/externally, or maybe in your codebase), and have your driver load it up (or create a new driver, using your utility class)
• If it is the same people doing everything, then putting it in code makes sense — they’ll be comfortable touching the internal code.
From a feasibility perspective the big question is whether they are views of the same execution instance (E.G. have the same inputs), or if they have different inputs for each one (E.G. time grouping granularity, etc…). If they have the same parameters (or some fixed set that you run over), then you can easily get away with having one dataframe per table and declaring their dependencies. This shares compute (which is potentially a double-edged sword). If they’re different, you’d use @subdag to stitch them together, which essentially looks the same, but with a little more complexity. E.G. a subdag that specifies granularity, as well as some config stuff. So, doable to represent in Hamilton.
🙏 1
This said, you might like the pattern of materializers for (2). Think:
Copy code
from hamilton import base
from hamilton.io.materialization import to  # materializer targets live under the `to` namespace

dr.materialize(
    to.snowflake(  # assumes a custom snowflake saver has been written and registered
        id="save_to_snowflake",
        dependencies=["metric_1", "metric_2", ...],
        table="...",
        combine=base.PandasDataFrameResult(),
    ),
    inputs={...},
)
The cool thing about this is that it’s represented centrally (I/O is not included in the DAG), it’s customizable (you write the snowflake adapter and register it), but it actually does do DAG operations — it’s effectively appending a save_to_snowflake node to the end of the DAG and calling that. You can see this with the corresponding call visualize_materialization().
🆒 1
Anyway will be out for a bit but happy to answer more questions later! Docs for adapters are here: https://hamilton.dagworks.io/en/latest/reference/drivers/Driver/#hamilton.driver.Driver.materialize
m
Thanks for this! Very helpful while I’m pondering this. Your last proposal is very close to my current class implementation:
Copy code
table = HamiltonTable(
    table_name="schema.xyz",
    nodes_to_incl=[...],  # all nodes that go into the table here
    # some more metadata fields here
    inputs=...,
    config=...,
)
And then you have a method like .run(persist=True). If persist=True, it would call your materialize wrapper. Looks like I’m on the right track here
🔥 1
That is, I’m after a good abstraction for the “not so savvy data scientist” to define a new table and configure which nodes should go into it. Seems like I’m almost there with my approach.
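(For reference, a minimal sketch of what that wrapper could look like; the field names mirror the snippet above, and the Snowflake writer is a stand-in:)
Copy code
import dataclasses
from types import ModuleType
from typing import Any, Dict, List

import pandas as pd
from hamilton import driver


def _save_to_snowflake(df: pd.DataFrame, table_name: str) -> None:
    """Stand-in for the materialize wrapper / Snowflake writer."""
    ...


@dataclasses.dataclass
class HamiltonTable:
    table_name: str
    nodes_to_incl: List[str]
    modules: List[ModuleType]  # Hamilton modules defining the nodes
    config: Dict[str, Any]

    def run(self, inputs: Dict[str, Any], persist: bool = True) -> pd.DataFrame:
        dr = driver.Driver(self.config, *self.modules)
        df = dr.execute(self.nodes_to_incl, inputs=inputs)  # default result builder returns a DataFrame
        if persist:
            _save_to_snowflake(df, self.table_name)
        return df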
e
Yeah! So I think you’re pretty much there. A common framework we think about when designing this stuff is the “two-layer API”. The first layer is what everyone touches, so it’s highly optimized towards the standard use-cases. The second layer is what power-users touch/people learn when they do new things. So, the first layer in your case is your class — they’ll easily be able to copy/paste and it’s pretty clear what’s going on. The second layer is the nodes themselves — adding new metrics, etc…
The really cool thing you can do is add a layer of validation in your wrapper. For example, if you have nodes that are “intermediate-only”, you can add a tag that makes it so no one can have them in nodes_to_incl. You can also do automated schema inspection/documentation that way…
🆒 1
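(A rough sketch of that validation layer, assuming an intermediate_only="true" tag on the relevant nodes and the node metadata from list_available_variables():)
Copy code
from typing import List

from hamilton import driver


def validate_nodes_to_incl(dr: driver.Driver, nodes_to_incl: List[str]) -> None:
    """Hypothetical helper: reject unknown nodes and nodes tagged as intermediate-only."""
    tags_by_name = {var.name: var.tags for var in dr.list_available_variables()}
    for name in nodes_to_incl:
        if name not in tags_by_name:
            raise ValueError(f"Unknown node: {name}")
        if tags_by_name[name].get("intermediate_only") == "true":
            raise ValueError(f"Node {name} is intermediate-only and can't go into a table")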
m
Nice trick, and yes, have been starting to toy with my own tag-like decorator already… getting there :)
🔥 1
👍 1
s
wow completely missed this thread! Some thoughts / side-notes:
1. Instead of subdag, we have talked about driver chaining, but haven’t gotten there yet.
2. I think I like having tables map to a function, rather than being defined by an instance of a materializer. That would then allow you to package up the state of the world at a particular point in time more easily, if that’s important (or not). Materializers in this instance would just take the function name as an argument and be only responsible for saving things. You can also add in pandera checks here too; which you can also do via a custom materializer…
3. But otherwise yeah, we have a few “isomorphic” ways to do things, and thus your intended UX should probably dictate how things are broken up/split up 🙂
m
@Stefan Krawczyk thanks for the additional thoughts. You make a good point in your #2 that a map of table name to a hard-coded set of nodes/funcs makes things a bit more point-in-time deterministic. Kinda hammers home my thinking that this is probably the way I wanna go with this - thanks! Adding node value checks is also on my list :-)
e
Agreed with @Stefan Krawczyk, definitely some trade-offs. But there’s a nice clarity to adding a table as a specific node! The only thing to think about is whether or not different tables need to run the same nodes with different parameters, E.G. one table for monthly, one table for weekly granularity, etc… If so, you’ll need to use @subdag, which is a slightly more advanced concept in Hamilton, so it’s worth thinking about the right way to expose that to your users!
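(A rough sketch of what that could look like; metrics_module, the granularity input, and the metric names are all made up:)
Copy code
import pandas as pd
from hamilton.function_modifiers import subdag, value

import metrics_module  # hypothetical module whose nodes take a "granularity" input


@subdag(metrics_module, inputs={"granularity": value("weekly")})
def weekly_table(metric_1: pd.Series, metric_2: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({"metric_1": metric_1, "metric_2": metric_2})


@subdag(metrics_module, inputs={"granularity": value("monthly")})
def monthly_table(metric_1: pd.Series, metric_2: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({"metric_1": metric_1, "metric_2": metric_2})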
m
That’s a good angle. For now, I’m focusing on everything being at daily resolution, and will sort out the weekly/monthly aggs later. But definitely good to keep these in mind
👍 1
e
Yeah — lots of ways to bridge that when you get to it