# general
e
Hey! Will get back to you shortly (have a few meetings)
m
sure, no rush, thanks
e
Hey! So you have a few options here, and we’ve also been prototyping some framework-sponsored approaches. Some quick Qs:
• Where will this be running (saving to disk, or do you want to save to S3 or something)?
• What’s the size of the data?
m
good questions. in this particular case, this would be just "local" -- either truly local while developing or testing features and other changes, or deployed but still only writing those stages locally for debugging in case something goes wrong
and we're talking below 1GB at the moment. this will likely grow, but still low single-digit GBs at most for the foreseeable future
e
Got it. So, flight’s about to take off, but your initial intuition is pretty reasonable. Looks like it’s on-disk/in-memory-sized data. Two approaches that should be easy:
1. Use overrides — have the functions log their output given a run ID, then before running the DAG, load up the cache from the run ID and pass it in as overrides.
2. Use config.when (but I’d use when_in) like you suggested — same deal: you’d load it up and pass a list of cache keys to the driver in the config, then use the config.when key to decide which implementation runs. With a bit of meta-programming, you could probably create a decorator delegating to config.when to do that…
I can put some pseudo code to make this a little clearer later tonight!
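(To make approach 1 concrete, a rough sketch — the pickled layout under cache/<run_id>/<node>.pkl, the helper, and my_dataflow are all assumptions, not Hamilton APIs:)
```python
import pickle
from pathlib import Path

from hamilton import driver

import my_dataflow  # hypothetical module holding the Hamilton functions


def load_cached_outputs(run_id: str, cache_dir: str = "cache") -> dict:
    """Load every pickled node output saved under cache/<run_id>/<node>.pkl."""
    cache = {}
    for path in Path(cache_dir, run_id).glob("*.pkl"):
        cache[path.stem] = pickle.loads(path.read_bytes())
    return cache


cache = load_cached_outputs("some-previous-run-id")
dr = driver.Driver({}, my_dataflow)
# anything passed via `overrides` is used as-is instead of being recomputed
result = dr.execute(["final_output"], overrides=cache)
```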
m
ok, thanks, I'll think this over. hopefully I'll get some time to work on that soon 🙂
> I can put some pseudo code to make this a little clearer later tonight!
thanks, but don't worry about it for now. I think I have a good-enough idea, I'll just need to maybe do it on a small example to get a closer look. I'll come back with some concrete questions if I hit an obstacle 🙂
👍 1
e
Great! Good luck — I think it should be pretty easy 🙂
s
@Michal Siedlaczek you could also write your own graph adapter to do this caching logic; but that has other pros/cons 🙂
👍 1
m
this is what I have come up with so far: https://gist.github.com/elshize/ca82102786b7b739aa20e4c591e57289 at the moment it assumes that the node returns a dataframe, and it's quite limited, but looks to be working.
I couldn't figure out how I could use the function name instead of `label` for node resolution
let me know if you have any comments
👀 1
s
@Michal Siedlaczek interesting — left a comment. Wasn’t expecting you to inherit from `config` directly, but yeah, there are many ways to get a solution here.
m
yeah, I tried with a simpler decorator, but I couldn't figure out how to do both config and auto-generated caching at the same time
but maybe I was just confused.
s
how many places do you want to use the caching?
m
I'm also thinking about how one could make it co-exist with things like extracting fields
> how many places do you want to use the caching?
didn't count, but might be 10-ish
maybe less
I'm still working on a proof of concept to see what kind of problems may arise
e
Piping in — can’t look now — but I’ll take a look later :)
👍 1
s
a naive way would be to write two functions:
```python
import pandas as pd

from hamilton.function_modifiers import config


@config.when(use_cache="False")
def foo__compute(...) -> pd.DataFrame:
    df = ...  # compute it
    cache_df(df, name=...)  # your helper that persists the dataframe
    return df


@config.when(use_cache="True")
def foo__cached() -> pd.DataFrame:
    df = load_df(name=...)  # your helper that loads the cached dataframe
    return df
```
it’s simple, but potentially less manageable depending on scale/need.
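(For context, the driver side of that would look roughly like this — my_module is a hypothetical module holding the two functions above:)
```python
from hamilton import driver

import my_module  # hypothetical module containing foo__compute / foo__cached

# config.when picks foo__cached when use_cache == "True", else foo__compute;
# both resolve to a single node named `foo` via the dunder-suffix convention.
dr = driver.Driver({"use_cache": "True"}, my_module)
result = dr.execute(["foo"])
```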
m
this is essentially how I'm using my decorators at the moment.
the `load_cache` one really doesn't require a function body, but it is annoying that I have to define it
it is certainly manageable in my case, though I wouldn't mind improving on it
(especially considering how much more manageable hamilton makes the whole codebase to begin with)
s
yep — a vanilla python decorator wrapping `foo` would also work — but then whether that works well depends on how you want things to be configurable
e.g.
```python
@my_cache_decorator
def foo(...) -> pd.DataFrame:
    df = ...
    return df
```
and if done correctly Hamilton would still crawl this, and create the node `foo` — it’d just have the decorator check the path for the file; i.e. load from cache if it exists, else compute it.
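(A hedged sketch of what `my_cache_decorator` could look like — plain Python, nothing Hamilton-specific; pickle and the cache directory layout are assumptions:)
```python
import functools
import pickle
from pathlib import Path

CACHE_DIR = Path("cache")


def my_cache_decorator(fn):
    """Load fn's result from disk if cached; otherwise compute and persist it."""

    @functools.wraps(fn)  # keeps __name__/__annotations__ so Hamilton can still crawl it
    def wrapper(*args, **kwargs):
        path = CACHE_DIR / f"{fn.__name__}.pkl"
        if path.exists():
            return pickle.loads(path.read_bytes())
        result = fn(*args, **kwargs)
        CACHE_DIR.mkdir(exist_ok=True)
        path.write_bytes(pickle.dumps(result))
        return result

    return wrapper
```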
this is effectively similar to creating your own graph adapter — except there you’d not only have access to the function name, but also any tags that were on the function.
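(Roughly, such an adapter might look like this — a hedged, illustrative sketch assuming the Hamilton version you're on exposes execute_node and nodes with .name/.tags/.callable; not the actual gist code:)
```python
import pickle
from pathlib import Path
from typing import Any, Dict

from hamilton import base


class CachingGraphAdapter(base.SimplePythonGraphAdapter):
    """Caches results of nodes tagged cache="true"; reuses them on later runs."""

    def __init__(self, cache_dir: str, result_builder: base.ResultMixin):
        super().__init__(result_builder)
        self.cache_dir = Path(cache_dir)

    def execute_node(self, node, kwargs: Dict[str, Any]) -> Any:
        path = self.cache_dir / f"{node.name}.pkl"
        if node.tags.get("cache") == "true" and path.exists():
            return pickle.loads(path.read_bytes())
        result = node.callable(**kwargs)
        if node.tags.get("cache") == "true":
            self.cache_dir.mkdir(exist_ok=True)
            path.write_bytes(pickle.dumps(result))
        return result
```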
m
right, I haven't looked into graph adapters yet
s
Otherwise inheriting from `config` could also work — you can get the function name by just doing `fn.__name__`, I believe.
m
yes, I tried `fn.__name__` when writing/loading the file -- that works
my problem is when I'm passing the resolver to the base class
I don't have the function yet, and the resolver only takes the config
s
ah right, makes sense.
m
I'll definitely revisit this later this week
I first want to get a PoC running with this, it's almost ready 🙂
👍 1
s
@Elijah Ben Izzy might have some better ideas here
👍 1
m
sure, I'll keep digging as well. thanks for the help!
e
Ok, looking at the code — the problem is that there’s a decoupling between nodes in the DAG and function names (although they usually map 1:1) — the ideal would be that one could pass a `target` to the caching function, then we could load it up when we actually run it…
```python
@checkpoint(target="foo")  # hypothetical decorator; doesn't exist yet
@extract_columns('bar', 'baz')
def foo() -> pd.DataFrame:
    """Some expensive computation"""
```
But, currently, this would require some surgery into the way things work (particularly being aware of nodes)
s
@Michal Siedlaczek I added my simple graph adapter to the gist - just so you could see it
e
@Michal Siedlaczek I think yours is a good approach for now btw -- it’s slightly messy but clean enough, and clearly works. You can probably also switch on the output type of the function. The other way is to decorate it with a function that saves the result to cache, then when you call it, have a utility that looks at the cache, crawls it, loads the available ones, and injects them with `overrides`. You’d still have to use the function name (and only have ones where the function name corresponds to the node), but then you’d only need a single decorator/implementation, and get short-circuiting as well.
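(A sketch of that pattern — save_to_cache, load_available_cache, and the pickle-based layout are all hypothetical names/assumptions:)
```python
import functools
import pickle
from pathlib import Path

CACHE_DIR = Path("cache")


def save_to_cache(fn):
    """Decorator: always run the node, then persist its result under fn's name."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        CACHE_DIR.mkdir(exist_ok=True)
        (CACHE_DIR / f"{fn.__name__}.pkl").write_bytes(pickle.dumps(result))
        return result

    return wrapper


def load_available_cache() -> dict:
    """Crawl the cache dir and load whatever is there, keyed by node name."""
    return {p.stem: pickle.loads(p.read_bytes()) for p in CACHE_DIR.glob("*.pkl")}


# then, to short-circuit cached nodes:
# dr.execute(["final_output"], overrides=load_available_cache())
```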
m
Thank you both. @Stefan Krawczyk the adapter looks interesting, I'll play with that at some point!
👍 1
e
Only caveat is that it doesn’t stop upstream nodes from running — it just skips the one expensive node. @Stefan Krawczyk and I will be thinking about this and getting back to you!
m
ok, I came up with something that is getting really close to what I need, based on @Stefan Krawczyk's adapter example: https://gist.github.com/elshize/ca82102786b7b739aa20e4c591e57289 (last comment) it still doesn't integrate with things like `@extract_fields` for example, but I can work around it.
implemented my first iteration! 🎉
I'm just geeking out about this so much, I really love how it went and I'm sure there's still lots to improve
e
Oh that’s awesome!! Nice and self-documenting/easy to figure out, too.
If we came up with a hamilton-specific caching/system would you be game to try it out?
m
sure, I'd definitely try it out.
👍 1
🙌 1
s
@Michal Siedlaczek also if you have time, we’d love to have more examples in the hamilton/examples folder if you think what you’re doing would be useful for other people to consume.
m
yes, I'll make sure to come up with something once I'm done and clean up, etc.
👍 1
e
Ok, this is pretty cool. Nice work! It seems that:
1. the graph adapter keeps state so you can invalidate downstream items/rerun
2. it won’t bypass any upstream computation
Pretty clever using the state/propagating downstream 🫡 Working now on a caching mechanism that’s built in — will likely take a different approach, but I want to ensure that it has at least as much functionality as yours 🙂
m
> Working now on a caching mechanism that’s built in — will likely take a different approach, but I want to ensure that it has at least as much functionality as yours
I'm definitely interested in seeing what comes out of it. I'm sure my approach is limited, including in ways I'm not even aware of. but in my use case, it seems to be working great so far.
e
Yeah! It fits in nicely to a few changes, so the fact that you’re unblocked and happy lets me go to the drawing board 🙂
m
this is really for development: in production, I don't care about caching much, but when testing stuff, this is super convenient
e.g., I'm implementing this thing where I can run it to a point A, then using A prepare some external data for testing, and then resume from that point to see the results
this is so easy now, where I can simply say, get me "A". then do my custom magic for testing. then say, "ok" now pick up this additional file and give me "Z"
I'm really benefiting from graph computation under the hood, and caching just lets me resume quickly. same goes for when I'm developing and get an error in some node, then I don't need to waste time, especially since some nodes involve some slow I/O over network
e
This is 🔥 — really appreciate the use-case here. And happy it’s helping you out 🙂 So, to sum up — the main benefit is you can iterate without having to wait the full amount of time you’d need to recompute, right? Something breaks (or something is weird and you want to fix it), then you can use the cache.
And, as opposed to a jupyter notebook (which offers interactivity), you can (a) persist state and (b) have prod-ready code?
m
Yeah, I think the real gain is that this is the production code, not just some functions I'm running in a notebook. plus, the automated resolution of everything that's needed for each step I'd like to resume from
and once I play with it, then I quickly run it without cache to verify all is good
it also means that if a new dev wants to do something in one stage of it, they don't have to have deep knowledge of what needs to be done before they start testing their bit. it's easier to instruct them to simply: if you get an error, just run again. if you want to rerun a step, just pass a flag, etc.
it really simplifies things
e.g., I have a new data scientist working on our team, and I don't want her to get bogged down in all the engineering stuff, so I'm always trying to come up with some simplifications
e
Awesome! That all makes sense. Re: cache management tooling — are you worried that you’ll accidentally shoot yourself in the foot by changing an upstream function but reading the downstream one from cache? (or maybe your graph adapter handles it — didn’t think so…). Or, put more simply, are you worried about changing the code/some input and not overwriting it in the cache, then getting the wrong value?
m
yeah, that can certainly happen if I forget to invalidate the cache (force recomputation) when I run it. but I'm not too concerned about it
worst case, I'll waste some time swearing at my monitor until I remember that it still doesn't work because I forgot to invalidate
😄
again, this is a dev tool for me -- there will always be a full run executed in acceptance tests
I'll definitely keep thinking about this -- maybe it's better to not cache by default, but only explicitly (say, a `--cache` flag or something)
e
Ha! Fair enough. Caching also gets a good deal tougher/less transparent the more intelligence you apply, so there’s something nice about a dead-simple approach (just one cache, you can iterate on that and only that), especially for dev.
But yeah, curious what you find helps your workflow — keep sharing as you learn more!
s
(and you can instrument what came from the cache in a log message…)
👍 1
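(e.g., a tiny helper you could call from the adapter/decorator sketches above — log_cache_hit is a hypothetical name:)
```python
import logging

logger = logging.getLogger(__name__)


def log_cache_hit(node_name: str, path: str) -> None:
    """Call this wherever a cached value is used instead of computed."""
    logger.info("loading %s from cache at %s", node_name, path)
```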
m
I do
but I can still miss it 😄
I'll let you know if I run into any problems that I haven't thought of
👍 1
and if at any point I try improving it -- there's much it won't work with, like extracting columns or fields (though some of it could be remediated)