# hamilton-help
r
Hi, I am a little confused by the design of `CachingGraphAdapter`. In `CachingGraphAdapter.execute_node`, condition 3 says that a node is recomputed when an upstream node was recomputed. But when I rerun the program with different inputs, the cached node doesn't seem to get recomputed.
```python
"""Executes nodes conditionally according to caching rules.

This node is executed if at least one of these is true:

* no cache is present,
* it is explicitly forced by passing it to the adapter in ``force_compute``,
* at least one of its upstream nodes that had a @cache annotation was computed,
  either due to lack of cache or being explicitly forced.
"""
```
What if we wrote the input of each node to the cache file? Then, when checking whether a node needs to be recomputed, we could compare the input coming from its upstream nodes with the input stored in the cache file. The advantage is that the caching system would still work even if we rerun the program with different inputs. PS: maybe I don't fully understand the source code...
t
Just to be sure, you're talking about `hamilton.experimental.h_cache.CachingGraphAdapter`? You're correct, this adapter doesn't "recompute a node on new inputs". The best way to explain this adapter IMO is "only compute once"; it doesn't have the smarts to automatically determine when to recompute. On the other hand, we have `hamilton.lifecycle.CacheAdapter`, which has the smarts to figure out when to recompute. Its downside is that it uses the `pickle` format and doesn't automatically create reliable files like materializers do (e.g., `parquet`, `csv`, `json`). You can read more about the two in this example. Note that the `DiskCacheAdapter` is like `hamilton.lifecycle.CacheAdapter` but requires a 3rd-party library. It's normal that the topic is confusing. We're realizing that "caching" is a very loaded term, and it's an aspect that we're currently overhauling and promoting to a first-class feature of Hamilton. It's a sizeable feature to ship, so it may take some time 😅
r
Yes, I am talking about `CachingGraphAdapter`. I started using Hamilton a long time ago, and those two experimental methods are the only ones I knew. I just read the new code and found `CacheAdapter`. It seems the docs are not updated yet? Should I update to the latest version and use `CacheAdapter` instead?
t
They do slightly different things, so it depends on your use case. If you're doing a lot of iterative development, I'd suggest `CacheAdapter`. If you're using it to create "checkpoints" for a large pipeline, I'd suggest `CachingGraphAdapter`.

In short:

`CachingGraphAdapter`
• "explicit" rule-based cache recompute (the rules mentioned above), making it easier to debug
• uses robust file formats via materializers

`CacheAdapter`
• automatically determines if it needs to recompute based on inputs and code version
• uses `pickle`, so the artifacts are not easy to introspect and are Python-version dependent

I think both might encounter issues with `Parallelizable[]`/`Collect[]`.
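To illustrate the difference in recompute behavior, here is a hypothetical sketch of input-based cache keys, the general idea behind an adapter that "automatically determines if it needs to recompute based on inputs and code version". This is not `CacheAdapter`'s real implementation; `cache_key` and `cached_call` are made-up names, and the function's bytecode stands in for "code version". The point is that the key changes whenever the inputs or the code change, so new inputs naturally miss the cache:

```python
import hashlib
import pickle

_cache = {}


def cache_key(func, inputs):
    # Hash a stand-in for the code version (the function's bytecode)
    # together with the pickled input values; changing either one
    # produces a different key.
    h = hashlib.sha256(func.__code__.co_code)
    h.update(pickle.dumps(sorted(inputs.items())))
    return h.hexdigest()


def cached_call(func, **inputs):
    """Hypothetical sketch: recompute only on a cache miss for this key."""
    key = cache_key(func, inputs)
    if key not in _cache:
        _cache[key] = func(**inputs)
    return _cache[key]
```

Contrast this with the rule-based sketch earlier in the thread: there, the cache lookup ignores inputs entirely, while here the inputs are part of the key, so rerunning with different inputs triggers recomputation automatically.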
r
super! crystal clear! thx for your help!
t
My pleasure!