# hamilton-help
j
Hi there 🙂 I would like to add different node highlighting to the DAG based on the state of the cache of the diskcache_adapter, to achieve something similar to the visualisations of the `targets` R package (https://books.ropensci.org/targets/walkthrough.html#change-code). Is there a way of querying the driver for differences between its cache and workflow? Something like a dry-run that checks the nodes but doesn't execute them.
👀 2
s
I'm on my phone, so I'd have to look at the source code. Technically that should be possible, but my guess is that it might not be exposed.
Otherwise, what's your use case? Is this for development? Or production? Or?
Follow-up: what output would you like to get from it? An image? Or?
j
Production and development, though "production" means something slightly different in my case. I'm primarily a research scientist, so production is more in the sense of publishing an analysis, for which it is nice to have a visual confirmation that the output of a workflow is up-to-date. And for development I like to periodically check my workflow for "If I ran this now, which nodes would be recomputed based on the cache and changes to the code?" before e.g. deciding if this is a run I start before a coffee or just a small update I execute immediately.
As far as output goes, I was trying to get a list of nodes with an outdated or up-to-date cache, and then use it to modify the graphviz object and color the nodes differently.
s
Cool. Are you developing in a notebook? Or an IDE? Or running through a CLI?
j
And if this works, I would ask whether you're interested in a PR to make it part of the driver visualization.
s
So something like https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/cli#diff in terms of output? (for context, this does a diff between git commits on a module)
j
I'm working in Neovim, where I have workflows as python modules and import those into a quarto notebook (quarto is like plain text jupyter notebooks) from which I interactively run code in an ipython console or execute/render the whole notebook to various output formats (works out of the box with hamilton so far!)
💡 1
ohhh, that looks good! So my goal is to take essentially this and augment it with the state of the diskcache
similar to this:
t
yup, so the diff viz here is based on the node's code version. Now, we'd have to create a mechanism to consider the "data version" used internally by the diskcache feature. Just to be sure: you'd want to see the visualization before executing the dataflow?
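To make "data version" concrete, here is a hand-wavy illustration (not the adapter's actual scheme) of how a cache key could combine a node's code version with the versions of its input data:

```python
import hashlib
import json


def cache_key(node_name: str, code_version: str, input_data_versions: dict) -> str:
    # hypothetical: hash the code version together with the input data versions
    payload = json.dumps(
        {"code": code_version, "inputs": input_data_versions}, sort_keys=True
    )
    return f"{node_name}-{hashlib.sha256(payload.encode()).hexdigest()}"


# code versions are known before running anything, but a downstream node's
# input data versions only exist once its parents have produced outputs
print(cache_key("model", "abc123", {"data": "d41d8cd9"}))
```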
j
👍 1
yes
t
The way we have it right now, would it make sense to have a viz that distinguishes:
• the inputs you pass
• the nodes we're certain we can read from cache
• the nodes we might have to compute (don't know until execution)
• the nodes we must compute
It might be a bit difficult to know 100% what can be read from cache before execution.
👍 1
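A minimal sketch of that four-way split, assuming all we know before execution is the DAG's edges plus whether each node's code version matches the cache (names here are illustrative, not Hamilton APIs):

```python
from enum import Enum


class State(Enum):
    INPUT = "input you pass"
    CACHED = "certain we can read from cache"
    MAYBE = "might have to compute"
    MUST = "must compute"


def classify(
    deps: dict[str, list[str]],     # node -> its upstream dependencies
    code_matches: dict[str, bool],  # node -> current code version found in cache?
    inputs: set[str],               # names passed in as inputs
) -> dict[str, State]:
    states: dict[str, State] = {}

    def visit(node: str) -> State:
        if node in states:
            return states[node]
        if node in inputs:
            states[node] = State.INPUT
        elif not code_matches.get(node, False):
            # code changed (or never cached): must recompute
            states[node] = State.MUST
        else:
            upstream = [visit(d) for d in deps.get(node, [])]
            if any(s in (State.MUST, State.MAYBE) for s in upstream):
                # upstream may produce new data versions, so a hit is uncertain
                states[node] = State.MAYBE
            else:
                states[node] = State.CACHED
        return states[node]

    for n in deps:
        visit(n)
    return states


# e.g. file -> raw -> features -> model, where only features' code changed
deps = {"raw": ["file"], "features": ["raw"], "model": ["features"]}
print(classify(deps, {"raw": True, "model": True}, inputs={"file"}))
# raw: CACHED, features: MUST (code changed), model: MAYBE
```

Anything downstream of a changed node can only be "maybe", since its cache key depends on data versions that don't exist until the parents have run; that is exactly the pre-execution uncertainty mentioned above.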
s
@Jannik Buhr what is the precedence that you'd expect?
1. Write code.
2. Run - things are cached.
3. Modify the code of a node.
4. Run - only the modified node & things downstream of it are rerun?
5. Is this invariant to inputs being changed? Or?
j
oh, btw. I don't want to come across as that "guy who has just discovered someones framework and now wants to make it work like this other framework he likes" :D
😁 1
t
No worries! You have good timing because we're currently looking at overhauling the cache features haha
j
Feel free to tell me if I'm completely missing a point or something :)
s
heh no worries - we're trying to figure out the best path forward for "caching" and this is great input. E.g. is it checkpointing (invariant of code & inputs)? Is it some sort of intelligent caching (change code, rerun only things downstream of it; change an input, rerun only what's downstream of that)? Or something in between? How do people want to use it? etc.
j
In that case, taking inspiration from the UX of targets might actually be helpful, because this works incredibly well for many of those data science tasks. I'll try to summarize the key features.
🙌 1
So the philosophy is actually really similar (just implemented differently, with R's metaprogramming instead of Python's dependency injection). It's also declarative and function-based. You define a list of targets and call `tar_make()`, which only recomputes the targets (= Hamilton nodes) whose inputs, code, or dependencies have changed, or you call `tar_visnetwork()` (= Hamilton's `dr.display_all_functions()`) to look at your workflow and also see which nodes are cached. And similar to Hamilton data loaders, there are special targets you can use to declare a file path as an input, such that the node gets invalidated (so will be recomputed) when the file changes.
I believe there is a slight difference in how functions that are not themselves nodes/targets, but rather are used by nodes, are handled. E.g. this:
```r
# _targets.R file
library(targets)
source("R/functions.R")
tar_option_set(packages = c("readr", "dplyr", "ggplot2"))
list(
  tar_target(file, "data.csv", format = "file"),
  tar_target(data, get_data(file)),
  tar_target(model, fit_model(data)),
  tar_target(plot, plot_model(model, data))
)
```
turns into the following DAG. Notice how functions like `get_data` that are used within the targets automatically become part of the DAG, such that changes to them can be tracked.
👍 1
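For comparison, a rough Hamilton analogue of the targets pipeline above, written as a plain Python module (names and bodies are illustrative; in Hamilton every function is itself a node, so a helper like get_data has no special status):

```python
# my_dataflow.py - illustrative Hamilton analogue of the _targets.R example
import pandas as pd


def data(file: str) -> pd.DataFrame:
    # `file` is passed as a driver input; targets' format = "file" additionally
    # invalidates the node when the file's *contents* change
    return pd.read_csv(file)


def model(data: pd.DataFrame) -> pd.Series:
    # stand-in for fit_model(): just column means here
    return data.mean(numeric_only=True)


def plot(model: pd.Series, data: pd.DataFrame) -> str:
    # stand-in for plot_model()
    return f"plot of {len(data)} rows against {len(model)} fitted values"
```

Here the functions are the nodes, whereas targets hashes helper functions like get_data separately to decide invalidation, which is the difference called out above.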
When it comes to caching, targets basically always caches everything (also on disk, as R objects, similar to Python pickles), but you can also manually invalidate targets if you want to recompute them.
💡 1
s
Looking at the code - we have the individual bits, but they're not wired together in the same place right now.
• The driver visualization path doesn't have access to adapters, so nothing quick to be done there.
• The disk cache knows which nodes it has a cache for and their code hash versions. This could be pulled out of the adapter, e.g. we could expose a function: what do I have, and which code versions?
• We have the building blocks to provide node hashes given a driver.
• We can then figure out the paths between nodes and outputs.
So right now you could probably code up a utility function that does the following:
```python
from typing import Any, Dict

from hamilton import driver


def hash_hamilton_nodes(dr: driver.Driver) -> Dict[str, str]:
    """Hash the source code of Hamilton functions from nodes in a Driver."""
    from hamilton import graph_types

    graph = graph_types.HamiltonGraph.from_graph(dr.graph)
    return {n.name: n.version for n in graph.nodes}


def what_is_still_valid_in_cache(dr: driver.Driver, disk_cache: Any) -> list[str]:
    """Return nodes whose current code version still has an entry in the cache."""
    node_hashes = hash_hamilton_nodes(dr)
    nodes_in_cache = disk_cache.nodes_history
    result = []
    for node, node_versions in nodes_in_cache.items():
        # caveat: assumes we produce the same hashes -- would need to double-check
        # code paths here. .get() also guards against cached nodes that no
        # longer exist in the current graph.
        current_hash = node_hashes.get(node)
        if current_hash in node_versions:
            result.append(node)
    return result

# then you could minimally style the viz that way
# see https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/styling_visualization
```
ideally we’d figure out the paths impacted, but at least visually you could see something quickly…
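Putting the pieces together, a hypothetical end-to-end sketch. It assumes the `h_diskcache.CacheAdapter` from the Hamilton docs and the `custom_style_function` hook from the styling_visualization example linked above (keyword-only `node` and `node_class` arguments, returning a `(style_dict, node_class, legend_name)` triple); double-check both against your Hamilton version:

```python
from hamilton import driver
from hamilton.plugins import h_diskcache

import my_dataflow  # hypothetical: your workflow module

cache = h_diskcache.CacheAdapter()  # assumption: the default cache path is fine
dr = driver.Builder().with_modules(my_dataflow).with_adapters(cache).build()

# nodes whose current code version already has a cache entry (see above)
valid_nodes = set(what_is_still_valid_in_cache(dr, cache))


def cache_style(*, node, node_class):
    """Fill cached nodes green and everything else red."""
    if node.name in valid_nodes:
        return {"fillcolor": "lightgreen", "style": "filled"}, node_class, "in cache"
    return {"fillcolor": "lightcoral", "style": "filled"}, node_class, "needs compute"


dr.display_all_functions(custom_style_function=cache_style)
```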
created https://github.com/DAGWorks-Inc/hamilton/issues/940 to track some of these thoughts. please add links/thoughts/ideas for requirements & APIs.
🙏 1