# hamilton-help
j
Hi there 🙂 I would like to add different node highlighting to the DAG based on the state of the cache of the diskcache_adapter, to achieve something similar to the visualisations of the `targets` R package (https://books.ropensci.org/targets/walkthrough.html#change-code). Is there a way of querying the driver for differences between its cache and workflow? Something like a dry-run that checks the nodes but doesn't execute them.
👀 2
s
I'm on my phone, so I'd have to look at the source code. Technically that should be possible, but my guess is that it might not be exposed.
Otherwise, what's your use case? Is this for development? Or production? Or?
Follow-up: what output would you like to get from it? An image? Or?
j
Production and development, though "production" means something slightly different in my case. I'm primarily a research scientist, so production is more in the sense of publishing an analysis, for which it is nice to have a visual confirmation that the output of a workflow is up-to-date. And for development I like to periodically check my workflow for "If I ran this now, which nodes would be recomputed based on the cache and changes to the code?" before e.g. deciding if this is a run I start before a coffee or just a small update I execute immediately.
As far as output goes, I was trying to get a list of nodes with an outdated or up-to-date cache, and then use it to modify the graphviz object and color the nodes differently.
s
Cool. Are you developing in a notebook? Or an IDE? Or running through a CLI?
j
And if this works, I would ask whether you're interested in a PR to make it part of the driver visualization.
s
So something like https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/cli#diff in terms of output? (for context, this does a diff between git commits on a module)
j
I'm working in Neovim, where I have workflows as python modules and import those into a quarto notebook (quarto is like plain text jupyter notebooks) from which I interactively run code in an ipython console or execute/render the whole notebook to various output formats (works out of the box with hamilton so far!)
💡 1
ohhh, that looks good! So my goal is to take essentially this and augment it with the state of the diskcache
similar to this:
t
yup, so the diff viz here is based on the node's code version. Now, we'd have to create a mechanism to consider the "data version" used internally by the diskcache feature. Just to be sure: you'd want to see the visualization before executing the dataflow?
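To make "data version" concrete, here is a hand-wavy illustration (not the adapter's actual scheme) of how a cache key could combine a node's code version with the versions of its input data:

```python
import hashlib
import json


def cache_key(node_name: str, code_version: str, input_data_versions: dict) -> str:
    # hypothetical: hash the code version together with the input data versions
    payload = json.dumps(
        {"code": code_version, "inputs": input_data_versions}, sort_keys=True
    )
    return f"{node_name}-{hashlib.sha256(payload.encode()).hexdigest()}"


# code versions are known before running anything, but a downstream node's
# input data versions only exist once its parents have produced outputs
print(cache_key("model", "abc123", {"data": "d41d8cd9"}))
```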
j
👍 1
yes
t
The way we have it right now, would it make sense to have a viz that distinguishes:
• the inputs you pass
• the nodes we're certain we can read from cache
• the nodes we might have to compute (don't know until execution)
• the nodes we must compute
It might be a bit difficult to know 100% what can be read from cache before execution.
👍 1
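A minimal sketch of that four-way split, assuming all we know before execution is the DAG's edges plus whether each node's code version matches the cache (names here are illustrative, not Hamilton APIs):

```python
from enum import Enum


class State(Enum):
    INPUT = "input you pass"
    CACHED = "certain we can read from cache"
    MAYBE = "might have to compute"
    MUST = "must compute"


def classify(
    deps: dict[str, list[str]],     # node -> its upstream dependencies
    code_matches: dict[str, bool],  # node -> current code version found in cache?
    inputs: set[str],               # names passed in as inputs
) -> dict[str, State]:
    states: dict[str, State] = {}

    def visit(node: str) -> State:
        if node in states:
            return states[node]
        if node in inputs:
            states[node] = State.INPUT
        elif not code_matches.get(node, False):
            # code changed (or never cached): must recompute
            states[node] = State.MUST
        else:
            upstream = [visit(d) for d in deps.get(node, [])]
            if any(s in (State.MUST, State.MAYBE) for s in upstream):
                # upstream may produce new data versions, so a hit is uncertain
                states[node] = State.MAYBE
            else:
                states[node] = State.CACHED
        return states[node]

    for n in deps:
        visit(n)
    return states


# e.g. file -> raw -> features -> model, where only features' code changed
deps = {"raw": ["file"], "features": ["raw"], "model": ["features"]}
print(classify(deps, {"raw": True, "model": True}, inputs={"file"}))
# raw: CACHED, features: MUST (code changed), model: MAYBE
```

Anything downstream of a changed node can only be "maybe", since its cache key depends on data versions that don't exist until the parents have run; that is exactly the pre-execution uncertainty mentioned above.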
s
@Jannik Buhr what is the precedence that you'd expect?
1. Write code.
2. Run - things are cached.
3. Modify the code of a node.
4. Run - only the modified node & things downstream of it are rerun?
5. Is this invariant to inputs being changed? Or?
j
oh, btw. I don't want to come across as that "guy who has just discovered someones framework and now wants to make it work like this other framework he likes" :D
😁 1
t
No worries! You have good timing because we're currently looking at overhauling the cache features haha
j
Feel free to tell me if I'm completely missing a point or something :)
s
heh no worries - we're trying to figure out the best path forward for "caching" and this is great input. E.g. is it checkpointing (invariant of code & inputs)? Is it some sort of intelligent caching (change code, rerun only things downstream of it; change an input, rerun only what's downstream of that)? Or something in between? How do people want to use it? etc.
j
In that case, taking inspiration from the UX of targets might actually be helpful, because this works incredibly well for many of those data science tasks. I'll try to summarize the key features.
🙌 1
So the philosophy is actually really similar (just implemented differently, with R's metaprogramming instead of Python's dependency injection). It's also declarative and function-based. You define a list of targets and call `tar_make()`, which only recomputes the targets (= Hamilton nodes) whose inputs, code, or dependencies have changed, or you call `tar_visnetwork()` (= Hamilton's `dr.display_all_functions()`) to look at your workflow and also see which nodes are cached. And similar to Hamilton data loaders, there are special targets you can use to declare a file path as an input, such that the node gets invalidated (so will be recomputed) when the file changes.
I believe there is a slight difference in how functions that are not themselves nodes/targets, but rather are used by nodes, are handled. E.g. this:
```r
# _targets.R file
library(targets)
source("R/functions.R")
tar_option_set(packages = c("readr", "dplyr", "ggplot2"))
list(
  tar_target(file, "data.csv", format = "file"),
  tar_target(data, get_data(file)),
  tar_target(model, fit_model(data)),
  tar_target(plot, plot_model(model, data))
)
```
turns into the following DAG. Notice how functions like `get_data` that are used within the targets automatically become part of the DAG, such that changes to them can be tracked.
👍 1
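For comparison, a rough Hamilton analogue of the targets pipeline above, written as a plain Python module (names and bodies are illustrative; in Hamilton every function is itself a node, so a helper like get_data has no special status):

```python
# my_dataflow.py - illustrative Hamilton analogue of the _targets.R example
import pandas as pd


def data(file: str) -> pd.DataFrame:
    # `file` is passed as a driver input; targets' format = "file" additionally
    # invalidates the node when the file's *contents* change
    return pd.read_csv(file)


def model(data: pd.DataFrame) -> pd.Series:
    # stand-in for fit_model(): just column means here
    return data.mean(numeric_only=True)


def plot(model: pd.Series, data: pd.DataFrame) -> str:
    # stand-in for plot_model()
    return f"plot of {len(data)} rows against {len(model)} fitted values"
```

Here the functions are the nodes, whereas targets hashes helper functions like get_data separately to decide invalidation, which is the difference called out above.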
When it comes to caching, targets basically always caches everything (also on disk, as R objects, similar to Python pickles), but you can also manually invalidate targets if you want to recompute them.
💡 1
s
Looking at the code - we have the individual bits, but they're not wired together in the same place right now.
• The driver visualization path doesn't have access to adapters, so nothing quick to be done there.
• The disk cache knows which nodes it has a cache for and their code hash versions. This could be pulled out of the adapter, e.g. we could expose a function: what do I have, and which code versions?
• We have the building blocks to provide node hashes given a driver.
• We can then figure out the paths between nodes and outputs.
So right now you could probably code up a utility function that does the following:
```python
from typing import Any, Dict

from hamilton import driver


def hash_hamilton_nodes(dr: driver.Driver) -> Dict[str, str]:
    """Hash the source code of Hamilton functions from nodes in a Driver."""
    from hamilton import graph_types

    graph = graph_types.HamiltonGraph.from_graph(dr.graph)
    return {n.name: n.version for n in graph.nodes}


def what_is_still_valid_in_cache(dr: driver.Driver, disk_cache: Any) -> list[str]:
    """Return nodes whose current code version still has an entry in the cache."""
    node_hashes = hash_hamilton_nodes(dr)
    nodes_in_cache = disk_cache.nodes_history
    result = []
    for node, node_versions in nodes_in_cache.items():
        # caveat: assumes we produce the same hashes -- would need to double-check
        # code paths here. .get() also guards against cached nodes that no
        # longer exist in the current graph.
        current_hash = node_hashes.get(node)
        if current_hash in node_versions:
            result.append(node)
    return result

# then you could minimally style the viz that way
# see https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/styling_visualization
```
ideally we’d figure out the paths impacted, but at least visually you could see something quickly…
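Putting the pieces together, a hypothetical end-to-end sketch. It assumes the `h_diskcache.CacheAdapter` from the Hamilton docs and the `custom_style_function` hook from the styling_visualization example linked above (keyword-only `node` and `node_class` arguments, returning a `(style_dict, node_class, legend_name)` triple); double-check both against your Hamilton version:

```python
from hamilton import driver
from hamilton.plugins import h_diskcache

import my_dataflow  # hypothetical: your workflow module

cache = h_diskcache.CacheAdapter()  # assumption: the default cache path is fine
dr = driver.Builder().with_modules(my_dataflow).with_adapters(cache).build()

# nodes whose current code version already has a cache entry (see above)
valid_nodes = set(what_is_still_valid_in_cache(dr, cache))


def cache_style(*, node, node_class):
    """Fill cached nodes green and everything else red."""
    if node.name in valid_nodes:
        return {"fillcolor": "lightgreen", "style": "filled"}, node_class, "in cache"
    return {"fillcolor": "lightcoral", "style": "filled"}, node_class, "needs compute"


dr.display_all_functions(custom_style_function=cache_style)
```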
created https://github.com/DAGWorks-Inc/hamilton/issues/940 to track some of these thoughts. please add links/thoughts/ideas for requirements & APIs.
🙏 1