# hamilton-help
e
Yes! This is a feature we just released as a supported API; it's called the lifecycle API. We have a blog post in flight that'll document it well, and we're still improving the docs, but you can:
1. Implement the class referenced here: https://hamilton.dagworks.io/en/latest/concepts/customizing-execution/#execution-hooks
2. Look at the `PrintLnHook` example here: https://github.com/DAGWorks-Inc/hamilton/blob/main/hamilton/lifecycle/default.py
3. Pass it into the driver as part of a list via `adapter=`, or use `with_adapters` and pass it in as `*args` if you're using the new driver builder (see the sketch below).

Should take 5 minutes, although the blog post will make it much simpler (we just made the API public-facing). If you're having trouble I'm happy to share the draft of the post; it walks you through this.
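A minimal sketch of step 3 with the new driver builder (`my_dataflow_module` and `MyHook` are hypothetical names; the hook bodies are stubs):

```python
# Sketch only: wiring a lifecycle hook into the new driver builder.
from typing import Any

from hamilton import driver
from hamilton.lifecycle import NodeExecutionHook

import my_dataflow_module  # hypothetical module of Hamilton functions


class MyHook(NodeExecutionHook):
    def run_before_node_execution(self, **future_kwargs: Any) -> None:
        pass  # e.g. start a timer

    def run_after_node_execution(self, **future_kwargs: Any) -> None:
        pass  # e.g. log the result


dr = (
    driver.Builder()
    .with_modules(my_dataflow_module)
    .with_adapters(MyHook())  # takes *args, so several hooks/adapters can go here
    .build()
)
```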
a
ah thanks, let me give it a try. I need to move to the driver builder API as well, it’s much better.
👍 1
e
Yep! To be clear, that's not necessary for this, but you absolutely should; it's cleaner. Lmk if you have any problems with the API for hooks. It's still pretty new, but well-tested and quite powerful.
a
How do I pass the execution hook to the driver? I'm already using a custom-ish adapter: `.with_adapter(SimplePythonGraphAdapter(DictResult()))`
oh got it, just pass a list of adapters / hooks
e
Yep! You shouldn't need the simple graph adapter btw; it's the default one. And you'll want to use `with_adapters` (pluralized), it's a bit more ergonomic.
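A sketch of the before/after (untested; `my_module` is hypothetical, `MyHook` is from the sketch above):

```python
from hamilton import base, driver

# Before (legacy): explicit graph adapter wrapping a result builder.
dr = (
    driver.Builder()
    .with_modules(my_module)
    .with_adapter(base.SimplePythonGraphAdapter(base.DictResult()))
    .build()
)

# After: dict results are the default, so just pass the hook(s).
dr = (
    driver.Builder()
    .with_modules(my_module)
    .with_adapters(MyHook())
    .build()
)
```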
a
Thanks for your help, I've managed to do what I wanted. I can log the shape and memory usage of Arrow tables:
```python
import logging
from typing import Any, Dict, Optional

import pyarrow as pa
import humanize
from hamilton.lifecycle import NodeExecutionHook

logger = logging.getLogger(__name__)

class LogTableStatsNodeExecutionHook(NodeExecutionHook):
    def run_before_node_execution(
        self,
        *,
        node_name: str,
        node_tags: Dict[str, Any],
        node_kwargs: Dict[str, Any],
        node_return_type: type,
        task_id: Optional[str],
        **future_kwargs: Any,
    ):
        pass

    def run_after_node_execution(
        self,
        *,
        node_name: str,
        node_tags: Dict[str, Any],
        node_kwargs: Dict[str, Any],
        node_return_type: type,
        result: Any,
        error: Optional[Exception],
        success: bool,
        task_id: Optional[str],
        **future_kwargs: Any,
    ):
        if isinstance(result, pa.Table):
            logger.info(
                f"Table {node_name}: {result.num_rows:_d} * {result.num_columns}"
                f" = {humanize.naturalsize(result.nbytes)}"
            )
```
🙌 2
🔥 1
e
Love it! We can easily add this to Hamilton as part of a pyarrow plugin if you think others will use it. Quick tip: you don't have to include any params you don't need; `**future_kwargs` will handle that.
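In other words, the hook above could slim down to something like this (a sketch, reusing the imports from the snippet above):

```python
# Same hook, keeping only the parameters it actually uses;
# **future_kwargs absorbs everything else the framework passes in.
class LogTableStatsNodeExecutionHook(NodeExecutionHook):
    def run_before_node_execution(self, **future_kwargs: Any) -> None:
        pass

    def run_after_node_execution(
        self, *, node_name: str, result: Any, **future_kwargs: Any
    ) -> None:
        if isinstance(result, pa.Table):
            logger.info(
                f"Table {node_name}: {result.num_rows:_d} * {result.num_columns}"
                f" = {humanize.naturalsize(result.nbytes)}"
            )
```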
s
@Arthur Andres a pyarrow plugin would be great - it would end up here https://hamilton.dagworks.io/en/latest/reference/lifecycle-hooks/#available-adapters It also seems like we should add support for pyarrow in general.
a
sure, though I'm not sure exactly what the plugin would do. At the moment I'm happy to use hamilton in plain mode. Each node outputs a `pyarrow.Table` that I then join together myself.
BTW we've put hamilton in our production pipeline. Thanks for your help. 🙏
e
That's awesome! Glad to hear it made it in; thanks for your feedback/thoughts along the way! A plugin is pretty general, although we do have specific constructs. Some ideas (can share what code they would involve writing in a bit):
1. We have specific types for dataframes/series, and a way of registering them so `extract_columns` works. I think this is a little less relevant to pyarrow (most people use tables), but might be worth exploring.
2. Moving your pyarrow hooks into something like `plugins.h_pyarrow`.
3. Adding a result builder to do the joining, if it's common logic (see the sketch below).
Basically any bespoke piece that you wrote, we might be able to fit into a plugin.
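A rough sketch of idea 3, assuming the legacy `base.ResultMixin` interface; the column-wise combination (equal row counts, unique column names) is an assumption, and real join logic would need keys:

```python
from typing import Any

import pyarrow as pa

from hamilton import base


class PyarrowTableResult(base.ResultMixin):
    """Combine the pyarrow Table output of each node into a single Table."""

    @staticmethod
    def build_result(**outputs: Any) -> pa.Table:
        columns = {}
        for node_name, value in outputs.items():
            if not isinstance(value, pa.Table):
                raise ValueError(f"{node_name} did not return a pyarrow.Table")
            for name in value.column_names:
                columns[name] = value.column(name)  # ChunkedArray
        return pa.table(columns)
```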
a
Besides the adapter that logs table size stats, I haven't written anything arrow-related in the hamilton framework. We do have some internal tooling that we use for pyarrow. For example, we have a decorator to enforce schemas of output columns (I believe you have something similar). But it is very opinionated, e.g.: what do you do with missing columns, extra columns, columns that you could cast, do you fill non-nullable nulls with empty values, etc.? And it is independent from hamilton, though maybe it could be interesting: hamilton could leverage it to document the schema of the output. I'm not sure where you want to draw the line between the framework and users' custom business logic. I guess maybe you'd want to reimplement what you've done for pandas, but for pyarrow?
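For illustration, a minimal sketch of what such a schema-enforcing decorator might look like (hypothetical, and it picks one opinionated policy: fail on missing columns, drop extras, cast the rest):

```python
import functools

import pyarrow as pa


def enforce_schema(schema: pa.Schema):
    """Hypothetical decorator enforcing an output schema on a pyarrow Table."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            table = fn(*args, **kwargs)
            missing = set(schema.names) - set(table.column_names)
            if missing:
                raise ValueError(f"{fn.__name__}: missing columns {sorted(missing)}")
            # Drop extra columns, reorder to the schema, and cast declared types.
            return table.select(schema.names).cast(schema)
        return wrapper
    return decorator


@enforce_schema(pa.schema([("id", pa.int64()), ("amount", pa.float64())]))
def my_node(raw: pa.Table) -> pa.Table:  # hypothetical Hamilton node
    return raw
```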
s
I think those all sound like reasonable features to have!
> I guess maybe you'd want to reimplement what you've done for pandas, but for pyarrow?

yep, supporting all "table"-like data types is something we should have.