# general
e
Hey! Will get back to you shortly (have a few meetings)
m
sure, no rush, thanks
e
Hey! So you have a few options here, and we’ve also been prototyping some framework-sponsored approaches. Some quick Qs:
• Where will this be running (saving to disk, or do you want to save to S3 or something)?
• What’s the size of the data?
m
good questions. in this particular case, this would be just "local" -- either truly local while developing or testing features and other changes, or deployed but still only writing those stages locally for debugging in case something goes wrong
and we're talking below 1GB at the moment. this will likely grow, but still low single-digit GBs at most for the foreseeable future
e
Got it. So, flight’s about to take off, but your initial intuition is pretty reasonable. Looks like it’s on-disk/in-memory-sized data. Two approaches that should be easy:
1. Use overrides — have the functions log their output given a run ID, then before running the DAG, load up the cache from the run ID and pass it in as overrides.
2. Use config.when (but I’d use when_in) like you suggested — same deal: you’d load it up and pass a list of cache keys to the driver in the config, then use the config.when key to decide which implementation runs. With a bit of meta-programming, you could probably create a decorator delegating to config.when to do that…
I can put some pseudo code to make this a little clearer later tonight!
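(To make approach 1 concrete, a rough sketch — the pickled layout under cache/<run_id>/<node>.pkl, the helper, and my_dataflow are all assumptions, not Hamilton APIs:)
```python
import pickle
from pathlib import Path

from hamilton import driver

import my_dataflow  # hypothetical module holding the Hamilton functions


def load_cached_outputs(run_id: str, cache_dir: str = "cache") -> dict:
    """Load every pickled node output saved under cache/<run_id>/<node>.pkl."""
    cache = {}
    for path in Path(cache_dir, run_id).glob("*.pkl"):
        cache[path.stem] = pickle.loads(path.read_bytes())
    return cache


cache = load_cached_outputs("some-previous-run-id")
dr = driver.Driver({}, my_dataflow)
# anything passed via `overrides` is used as-is instead of being recomputed
result = dr.execute(["final_output"], overrides=cache)
```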
m
ok, thanks, I'll think this over. hopefully I'll get some time to work on that soon 🙂
> I can put some pseudo code to make this a little clearer later tonight!
thanks, but don't worry about it for now. I think I have a good-enough idea, I'll just need to maybe do it on a small example to get a closer look. I'll come back with some concrete questions if I hit an obstacle 🙂
👍 1
e
Great! Good luck — I think it should be pretty easy 🙂
s
@Michal Siedlaczek you could also write your own graph adapter to do this caching logic; but that has other pros/cons 🙂
👍 1
m
this is what I have come up with so far: https://gist.github.com/elshize/ca82102786b7b739aa20e4c591e57289 at the moment it assumes that the node returns a dataframe, and it's quite limited, but looks to be working.
I couldn't figure out how I could use the function name instead of `label` for node resolution
let me know if you have any comments
👀 1
s
@Michal Siedlaczek interesting — left a comment. Wasn’t expecting you to inherit from `config` directly, but yeah, there are many ways to get a solution here.
m
yeah, I tried with a simpler decorator, but I couldn't figure out how to do both config and auto-generated caching at the same time
but maybe I was just confused.
s
how many places do you want to use the caching?
m
I'm also thinking about how one could make it co-exist with things like extracting fields
> how many places do you want to use the caching?
didn't count, but might be 10-ish
maybe less
I'm still working on a proof of concept to see what kind of problems may arise
e
Piping in — can’t look now — but I’ll take a look later :)
👍 1
s
a naive way would be to write two functions:
```python
import pandas as pd

from hamilton.function_modifiers import config


@config.when(use_cache="False")
def foo__compute(...) -> pd.DataFrame:
    df = ...  # compute it
    cache_df(df, name=...)  # your helper that persists the dataframe
    return df


@config.when(use_cache="True")
def foo__cached() -> pd.DataFrame:
    df = load_df(name=...)  # your helper that loads the cached dataframe
    return df
```
it’s simple, but potentially less manageable depending on scale/need.
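(For context, the driver side of that would look roughly like this — my_module is a hypothetical module holding the two functions above:)
```python
from hamilton import driver

import my_module  # hypothetical module containing foo__compute / foo__cached

# config.when picks foo__cached when use_cache == "True", else foo__compute;
# both resolve to a single node named `foo` via the dunder-suffix convention.
dr = driver.Driver({"use_cache": "True"}, my_module)
result = dr.execute(["foo"])
```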
m
this is essentially how I'm using my decorators at the moment.
the `load_cache` one really doesn't require a function body, but it is annoying that I have to define it
it is certainly manageable in my case, though I wouldn't mind improving on it
(especially considering how much more manageable hamilton makes the whole codebase to begin with)
s
yep — a vanilla python decorator wrapping `foo` would also work — but then whether that works well depends on how you want things to be configurable
e.g.
```python
@my_cache_decorator
def foo(...) -> pd.DataFrame:
    df = ...
    return df
```
and if done correctly Hamilton would still crawl this, and create the node `foo` — it’d just have the decorator check the path for the file; i.e. load from cache if it exists, else compute it.
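(A hedged sketch of what `my_cache_decorator` could look like — plain Python, nothing Hamilton-specific; pickle and the cache directory layout are assumptions:)
```python
import functools
import pickle
from pathlib import Path

CACHE_DIR = Path("cache")


def my_cache_decorator(fn):
    """Load fn's result from disk if cached; otherwise compute and persist it."""

    @functools.wraps(fn)  # keeps __name__/__annotations__ so Hamilton can still crawl it
    def wrapper(*args, **kwargs):
        path = CACHE_DIR / f"{fn.__name__}.pkl"
        if path.exists():
            return pickle.loads(path.read_bytes())
        result = fn(*args, **kwargs)
        CACHE_DIR.mkdir(exist_ok=True)
        path.write_bytes(pickle.dumps(result))
        return result

    return wrapper
```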
this is effectively similar to creating your own graph adapter — except there you’d not only have access to the function name, but also any tags that were on the function.
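(Roughly, such an adapter might look like this — a hedged, illustrative sketch assuming the Hamilton version you're on exposes execute_node and nodes with .name/.tags/.callable; not the actual gist code:)
```python
import pickle
from pathlib import Path
from typing import Any, Dict

from hamilton import base


class CachingGraphAdapter(base.SimplePythonGraphAdapter):
    """Caches results of nodes tagged cache="true"; reuses them on later runs."""

    def __init__(self, cache_dir: str, result_builder: base.ResultMixin):
        super().__init__(result_builder)
        self.cache_dir = Path(cache_dir)

    def execute_node(self, node, kwargs: Dict[str, Any]) -> Any:
        path = self.cache_dir / f"{node.name}.pkl"
        if node.tags.get("cache") == "true" and path.exists():
            return pickle.loads(path.read_bytes())
        result = node.callable(**kwargs)
        if node.tags.get("cache") == "true":
            self.cache_dir.mkdir(exist_ok=True)
            path.write_bytes(pickle.dumps(result))
        return result
```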
m
right, I haven't looked into graph adapters yet
s
Otherwise inheriting from `config` could also work — you can get the function name by just doing `fn.__name__`, I believe.
m
yes, I tried `fn.__name__` when writing/loading the file -- that works
my problem is when I'm passing the resolver to the base class
I don't have the function yet, and the resolver only takes the config
s
ah right, makes sense.
m
I'll definitely revisit this later this week
I first want to get a PoC running with this, it's almost ready 🙂
👍 1
s
@Elijah Ben Izzy might have some better ideas here
👍 1
m
sure, I'll keep digging as well. thanks for the help!
e
Ok, looking at the code — the problem is that there’s a decoupling between nodes in the DAG and function names (although they usually map 1:1) — the ideal would be that one could pass a `target` to the caching function, then we could load it up when we actually run it…
```python
@checkpoint(target="foo")  # hypothetical decorator; doesn't exist yet
@extract_columns('bar', 'baz')
def foo() -> pd.DataFrame:
    """Some expensive computation"""
```
But, currently, this would require some surgery into the way things work (particularly being aware of nodes)
s
@Michal Siedlaczek I added my simple graph adapter to the gist - just so you could see it
e
@Michal Siedlaczek I think yours is a good approach for now btw -- it’s slightly messy but clean enough, and clearly works. You can probably also switch on the output type of the function. The other way is to decorate it with a function that saves the result to cache, then when you call it, have a utility that looks at the cache, crawls it, loads the available ones, and injects them with `overrides`. You’d still have to use the function name (and only have ones where the function name corresponds to the node), but then you’d only need a single decorator/implementation, and get short-circuiting as well.
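(A sketch of that pattern — save_to_cache, load_available_cache, and the pickle-based layout are all hypothetical names/assumptions:)
```python
import functools
import pickle
from pathlib import Path

CACHE_DIR = Path("cache")


def save_to_cache(fn):
    """Decorator: always run the node, then persist its result under fn's name."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        CACHE_DIR.mkdir(exist_ok=True)
        (CACHE_DIR / f"{fn.__name__}.pkl").write_bytes(pickle.dumps(result))
        return result

    return wrapper


def load_available_cache() -> dict:
    """Crawl the cache dir and load whatever is there, keyed by node name."""
    return {p.stem: pickle.loads(p.read_bytes()) for p in CACHE_DIR.glob("*.pkl")}


# then, to short-circuit cached nodes:
# dr.execute(["final_output"], overrides=load_available_cache())
```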
m
Thank you both. @Stefan Krawczyk the adapter looks interesting, I'll play with that at some point!
👍 1
e
Only caveat is that it doesn’t stop upstream nodes from running — it just skips the one expensive node. @Stefan Krawczyk and I will be thinking about this and getting back to you!
m
ok, I came up with something that is getting really close to what I need, based on @Stefan Krawczyk's adapter example: https://gist.github.com/elshize/ca82102786b7b739aa20e4c591e57289 (last comment) it still doesn't integrate with things like `@extract_fields` for example, but I can work around it.
implemented my first iteration! 🎉
I'm just geeking out about this so much, I really love how it went and I'm sure there's still lots to improve
e
Oh that’s awesome!! Nice and self-documenting/easy to figure out, too.
If we came up with a hamilton-specific caching/system would you be game to try it out?
m
sure, I'd definitely try it out.
👍 1
🙌 1
s
@Michal Siedlaczek also if you have time, we’d love to have more examples in the hamilton/examples folder if you think what you’re doing would be useful for other people to consume.
m
yes, I'll make sure to come up with something once I'm done and clean up, etc.
👍 1
e
Ok, this is pretty cool. Nice work! It seems that:
1. the graph adapter keeps state so you can invalidate downstream items/rerun
2. it won’t bypass any upstream computation
Pretty clever using the state/propagating downstream 🫡 Working now on a caching mechanism that’s built in — will likely take a different approach, but I want to ensure that it has at least as much functionality as yours 🙂
m
> Working now on a caching mechanism that’s built in — will likely take a different approach, but I want to ensure that it has at least as much functionality as yours
I'm definitely interested in seeing what comes out of it. I'm sure my approach is limited, including in ways I'm not even aware of. but in my use case, it seems to be working great so far.
e
Yeah! It fits in nicely to a few changes, so the fact that you’re unblocked and happy lets me go to the drawing board 🙂
m
this is really for development: in production, I don't care about caching much, but when testing stuff, this is super convenient
e.g., I'm implementing this thing where I can run it to a point A, then using A prepare some external data for testing, and then resume from that point to see the results
this is so easy now, where I can simply say, get me "A". then do my custom magic for testing. then say, "ok" now pick up this additional file and give me "Z"
I'm really benefiting from graph computation under the hood, and caching just lets me resume quickly. same goes for when I'm developing and get an error in some node, then I don't need to waste time, especially since some nodes involve some slow I/O over network
e
This is 🔥 — really appreciate the use-case here. And happy it’s helping you out 🙂 So, to sum up — the main benefit is you can iterate without having to wait the full amount of time you’d need to recompute, right? Something breaks (or something is weird and you want to fix it), then you can use the cache.
And, as opposed to a jupyter notebook (which offers interactivity), you can (a) persist state and (b) have prod-ready code?
m
Yeah, I think the real gain is that this is the production code, not just some functions I'm running in a notebook. plus, the automated resolution of everything that's needed for each step I'd like to resume from
and once I play with it, then I quickly run it without cache to verify all is good
it also means that if a new dev wants to do something in one stage of it, they don't have to have deep knowledge of what needs to be done before they start testing their bit. it's easier to instruct them to simply: if you get an error, just run again. if you want to rerun a step, just pass a flag, etc.
it really simplifies things
e.g., I have a new data scientist working on our team, and I don't want her to get bogged down in all the engineering stuff, so I'm always trying to come up with some simplifications
e
Awesome! That all makes sense. Re: cache management tooling — are you worried that you’ll accidentally shoot yourself in the foot by changing an upstream function but reading the downstream one from cache? (or maybe your graph adapter handles it — didn’t think so…). Or, put more simply, are you worried about changing the code/some input and not overwriting it in the cache, then getting the wrong value?
m
yeah, that can certainly happen if I forget to invalidate the cache (force recomputation) when I run it. but I'm not too concerned about it
worst case, I'll waste some time swearing at my monitor until I remember that it still doesn't work because I forgot to invalidate
😄
again, this is a dev tool for me -- there will always be a full run executed in acceptance tests
I'll definitely keep thinking about this -- maybe it's better to not cache by default, but only explicitly (say, a `--cache` flag or something)
e
Ha! Fair enough. Caching also gets a good deal tougher/less transparent the more intelligence you apply, so there’s something nice about a dead-simple approach (just one cache, you can iterate on that and only that), especially for dev.
But yeah, curious what you find helps your workflow — keep sharing as you learn more!
s
(and you can instrument what came from the cache in a log message…)
👍 1
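(e.g., a tiny helper you could call from the adapter/decorator sketches above — log_cache_hit is a hypothetical name:)
```python
import logging

logger = logging.getLogger(__name__)


def log_cache_hit(node_name: str, path: str) -> None:
    """Call this wherever a cached value is used instead of computed."""
    logger.info("loading %s from cache at %s", node_name, path)
```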
m
I do
but I can still miss it 😄
I'll let you know if I run into any problems that I haven't thought of
👍 1
and if at any point I try improving it -- there's much it won't work with, like extracting columns or fields (though some of it could be remediated)