Roy Kid
07/17/2024, 10:05 AM
cacheadapter. In CachingGraphAdapter.execute_node, it says that condition 3 for recomputing a node is that an upstream node was recomputed. I reran the program with a different input, but the cached node doesn't seem to be recomputed.
"""Executes nodes conditionally according to caching rules.
This node is executed if at least one of these is true:
* no cache is present,
* it is explicitly forced by passing it to the adapter in ``force_compute``,
* at least one of its upstream nodes that had a @cache annotation was computed,
either due to lack of cache or being explicitly forced.
"""
What if we write the input of this node to the cache file? Then, when we check whether a node needs to be recomputed, we can compare the input from the upstream node with the input stored in the cache file. I think the advantage is that the caching system would still work even if we rerun the program.
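A minimal sketch of the idea above, using only the standard library. This is not Hamilton code: the helper names (fingerprint, cached_call), the JSON layout, and the double node are all assumptions made for illustration. The cache file stores a hash of the inputs next to the result, and the result is reused only when the hash matches.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(inputs: dict) -> str:
    """Hash JSON-serializable inputs into a stable hex digest (hypothetical helper)."""
    payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_call(fn, inputs: dict, cache_file: Path):
    """Return (result, recomputed); reuse the cache only if the stored inputs hash matches."""
    digest = fingerprint(inputs)
    if cache_file.exists():
        entry = json.loads(cache_file.read_text())
        if entry["inputs_hash"] == digest:
            return entry["result"], False
    # No cache file, or the inputs changed: compute and overwrite the cache.
    result = fn(**inputs)
    cache_file.write_text(json.dumps({"inputs_hash": digest, "result": result}))
    return result, True


def double(x: int) -> int:
    """Toy node used only for this example."""
    return 2 * x


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "double.json"
    first = cached_call(double, {"x": 1}, cache_file)   # no cache yet, computed
    second = cached_call(double, {"x": 1}, cache_file)  # same input, cache reused
    third = cached_call(double, {"x": 5}, cache_file)   # new input, recomputed
```

Because the inputs hash lives in the file, the comparison also works across separate runs of the program, which is the advantage described above. The sketch only handles JSON-serializable inputs and results.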
PS: maybe I don't fully understand the source code...
Thierry Jean
07/17/2024, 1:40 PM
hamilton.experimental.h_cache.CachingGraphAdapter?
You're correct, this adapter doesn't "recompute a node on new inputs". The best way to explain this adapter IMO is "only compute once"; it doesn't have the smarts to automatically determine when to recompute.
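To make the "only compute once" behavior concrete, here is a small standalone sketch of the three rules from the docstring quoted earlier. should_compute and its parameters are hypothetical names, not the adapter's actual code; the point is that none of the rules looks at input values.

```python
def should_compute(node, cached, force_compute, recomputed_upstream):
    """Return True if any of the three docstring rules applies (hypothetical helper)."""
    return (
        node not in cached             # rule 1: no cache is present
        or node in force_compute       # rule 2: explicitly forced via force_compute
        or bool(recomputed_upstream)   # rule 3: a cached upstream node was computed
    )


# First run: nothing is cached yet, so the node is computed.
first_run = should_compute("features", set(), set(), set())

# Second run with a different external input: no rule inspects input values,
# so the cached result is reused.
second_run = should_compute("features", {"features"}, set(), set())

# Forcing the node brings recomputation back.
forced_run = should_compute("features", {"features"}, {"features"}, set())

# A downstream node is recomputed because its cached upstream node was.
downstream_run = should_compute("model", {"model"}, set(), {"features"})
```

Under these assumptions, rerunning with new inputs only triggers a recompute if the affected node is forced or its cache is removed.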
On the other hand, we have hamilton.lifecycle.CacheAdapter, which has the smarts to figure out when to recompute. Its downside is that it uses the pickle format and doesn't automatically create reliable files like materializers do (e.g., parquet, csv, json).
You can read more about the two in this example. Note that the DiskCacheAdapter is like hamilton.lifecycle.CacheAdapter but requires a 3rd party library.
It's normal that the topic is confusing. We're realizing that "caching" is a very loaded term and it's an aspect that we're currently overhauling and promoting towards a first-class feature of Hamilton. It's a sizeable feature to ship, so it may take some time 😅
Roy Kid
07/17/2024, 1:48 PM
CachingGraphAdapter, since I started to use Hamilton a long time ago, and those two experimental methods are the only ones I know.
I just read the new code and found CachingAdapter. It seems the docs haven't been updated yet? Should I update to the latest version and use CachingAdapter instead?
Thierry Jean
07/17/2024, 1:56 PM
CacheAdapter. If you're using it to create "checkpoints" for a large pipeline, I'd suggest CachingGraphAdapter.
In short:
CachingGraphAdapter
• "explicit" rule-based cache recompute (the rules mentioned above), making it easier to debug
• uses robust file formats via materializers
CacheAdapter
• automatically determines whether it needs to recompute based on inputs and code version
• uses pickle, so the artifacts are not easy to introspect and are Python-version dependent
I think both might encounter issues with Parallelizable[]/Collect[]
Roy Kid
07/17/2024, 2:00 PM
Thierry Jean
07/17/2024, 2:00 PM