Hamilton Open Source #hamilton-help

Roy Kid

07/17/2024, 10:05 AM

Hi, I am a little confused by the design of `cacheadapter`. In `CachingGraphAdapter.execute_node`, the docstring says the third condition that recomputes a node is that an upstream node was recomputed. But when I rerun the program with a different input, the cached node does not seem to be recomputed.

"""Executes nodes conditionally according to caching rules.

        This node is executed if at least one of these is true:

        * no cache is present,
        * it is explicitly forced by passing it to the adapter in ``force_compute``,
        * at least one of its upstream nodes that had a @cache annotation was computed,
          either due to lack of cache or being explicitly forced.

        """

What if we wrote the inputs of this node to the cache file? Then, when checking whether a node needs to be recomputed, we could compare the inputs from the upstream nodes with the inputs stored in the cache file. I think the advantage is that the caching system would still work even when we rerun the program. PS: maybe I don't fully understand the source code...
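The idea in this message can be sketched in plain Python. This is only an illustration of the proposal, not Hamilton code: `fingerprint` and `cached_call` are hypothetical helpers, and the sketch assumes the inputs are picklable and pickle to the same bytes on every run.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def fingerprint(inputs: dict) -> str:
    """Hash the inputs (assumes they pickle deterministically)."""
    payload = pickle.dumps(sorted(inputs.items()))
    return hashlib.sha256(payload).hexdigest()


def cached_call(fn, inputs: dict, cache_dir: Path):
    """Hypothetical helper: reuse the cached result only if the inputs match."""
    cache_file = cache_dir / f"{fn.__name__}.pkl"
    key = fingerprint(inputs)
    if cache_file.exists():
        stored = pickle.loads(cache_file.read_bytes())
        if stored["inputs"] == key:
            return stored["result"], "cache hit"
    # No cache file, or the stored input fingerprint differs: recompute.
    result = fn(**inputs)
    cache_file.write_bytes(pickle.dumps({"inputs": key, "result": result}))
    return result, "recomputed"


def double(x: int) -> int:
    return 2 * x


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    first = cached_call(double, {"x": 1}, cache_dir)   # nothing cached yet
    second = cached_call(double, {"x": 1}, cache_dir)  # same input as before
    third = cached_call(double, {"x": 5}, cache_dir)   # input changed
```

Because the input fingerprint is stored next to the result on disk, the comparison still works after the program is restarted, which is the advantage described above.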

Thierry Jean

07/17/2024, 1:40 PM

Just to be sure, you're talking about `hamilton.experimental.h_cache.CachingGraphAdapter`? You're correct, this adapter doesn't "recompute a node on new inputs". The best way to explain this adapter IMO is "only compute once"; it doesn't have the smarts to automatically determine when to recompute.

On the other hand, we have `hamilton.lifecycle.CacheAdapter`, which has the smarts to figure out when to recompute. Its downside is that it uses the `pickle` format and doesn't automatically create reliable files like materializers do (e.g., `parquet`, `csv`, `json`). You can read more about the two in this example. Note that the `DiskCacheAdapter` is like `hamilton.lifecycle.CacheAdapter` but requires a 3rd party library.

It's normal that the topic is confusing. We're realizing that "caching" is a very loaded term and it's an aspect that we're currently overhauling and promoting towards a first-class feature of Hamilton. It's a sizeable feature to ship, so it may take some time 😅
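To make the "only compute once" behaviour concrete, here is a minimal sketch of the three rules from the docstring quoted earlier. It is not the adapter's implementation: `should_execute`, `plan`, and the three-node graph are hypothetical, and every node is assumed to carry a cache annotation.

```python
def should_execute(node, cached, force_compute, recomputed, upstream):
    """Apply the three docstring rules to one node (hypothetical helper)."""
    if node not in cached:          # rule 1: no cache is present
        return True
    if node in force_compute:       # rule 2: explicitly forced
        return True
    # rule 3: at least one upstream node was computed in this run
    return any(dep in recomputed for dep in upstream.get(node, []))


# Hypothetical graph: raw -> features -> model
upstream = {"features": ["raw"], "model": ["features"]}
execution_order = ["raw", "features", "model"]


def plan(cached, force_compute=frozenset()):
    """Return the set of nodes that would run, walking in execution order."""
    recomputed = set()
    for node in execution_order:
        if should_execute(node, cached, force_compute, recomputed, upstream):
            recomputed.add(node)
    return recomputed


cold_run = plan(cached=set())
warm_run = plan(cached={"raw", "features", "model"})
forced_run = plan(cached={"raw", "features", "model"}, force_compute={"features"})
```

None of the rules looks at input values, so in this sketch a warm run recomputes nothing no matter what inputs are passed; only a missing cache or `force_compute` triggers work, and that work then propagates downstream.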

Roy Kid

07/17/2024, 1:48 PM

Yes, I am talking about `CachingGraphAdapter`. I started using Hamilton a long time ago, and those two experimental methods are the only ones I know. I just read the new code and found `CachingAdapter`. It seems the docs have not been updated yet? Should I update to the latest version and use `CachingAdapter` instead?

Thierry Jean

07/17/2024, 1:56 PM

They do slightly different things, so it depends on your use case. If you're doing a lot of iterative development, I'd suggest `CacheAdapter`. If you're using it to create "checkpoints" for a large pipeline, I'd suggest `CachingGraphAdapter`.

In short:

`CachingGraphAdapter`

• "explicit" rule-based cache recompute (the rules mentioned above), making it easier to debug
• uses robust file formats via materializers

`CacheAdapter`

• automatically determines whether it needs to recompute based on inputs and code version
• uses pickle, so the artifacts are not easy to introspect and are Python-version dependent

I think both might encounter issues with `Parallelizable[]/Collect[]`.
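As a rough illustration of recomputing "based on inputs and code version", a cache key can combine a fingerprint of the function's code with a fingerprint of its inputs. This is a sketch under assumptions, not how `CacheAdapter` is implemented: `code_version` and `cache_key` are hypothetical, and hashing bytecode plus constants is just one simple stand-in for a code version.

```python
import hashlib
import pickle


def code_version(fn) -> str:
    """Stand-in for a code version: hash of the bytecode and constants."""
    code = fn.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()


def cache_key(fn, inputs: dict) -> str:
    """Key changes when either the code or the inputs change."""
    payload = code_version(fn).encode() + pickle.dumps(sorted(inputs.items()))
    return hashlib.sha256(payload).hexdigest()


def add_one(x):
    return x + 1


def add_two(x):  # plays the role of an edited version of the same node
    return x + 2


same_key = cache_key(add_one, {"x": 1}) == cache_key(add_one, {"x": 1})
input_changed = cache_key(add_one, {"x": 1}) != cache_key(add_one, {"x": 2})
code_changed = cache_key(add_one, {"x": 1}) != cache_key(add_two, {"x": 1})
```

A cached result would be reused only when the key matches, so editing the function or passing new inputs both lead to a recompute without any explicit forcing.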

Roy Kid

07/17/2024, 2:00 PM

super! crystal clear! thx for your help!

Thierry Jean

07/17/2024, 2:00 PM

My pleasure!
