# hamilton-help
  • Elton Cardoso Do Nascimento
    01/16/2025, 9:23 PM
    Hi! I'm creating some functions and I'd like them to also be usable without Hamilton. My problem is with the Parallelizable annotation: when I use it, I can't indicate what the return type would be if the function were called directly, which in my case is a list[str], so the linter doesn't work correctly. Is there any way to annotate a "Parallelizable ∩ list" without changing the Hamilton code?
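    A minimal sketch of the situation being described, with an invented function name; the Parallelizable import is the standard one, and it assumes a Parallelizable node may simply return the list it would otherwise yield item by item:
    from hamilton.htypes import Parallelizable


    def file_names(raw_listing: str) -> Parallelizable[str]:
        # Hamilton fans out over the returned iterable; a plain caller gets a
        # list[str] back, but the annotation no longer tells the linter that.
        return raw_listing.split(",")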
  • Elton Cardoso Do Nascimento
    01/16/2025, 9:23 PM
    I managed to make it work by changing Hamilton's code. If there is no alternative, may I submit a pull request?
  • Hadi Gharibi
    01/18/2025, 12:46 PM
    Hello, I'm relatively new to Hamilton, so my questions might be basic, but I couldn't find answers in the docs. It would be great if you could point me in the right direction. The problem I want to solve: my objective is to build a system of models and ETLs that can describe a business. The models need to be defined and developed independently and then connected together for overall optimization and simulations. I want to chain some models so that the outputs of some become the inputs of others (plus some extra features). With this system I can train models independently and generate simulations and what-if scenarios by running the DAG end to end while changing a few inputs or variables. What I considered is having an abstraction for models (since they need to know how to run hyperparameter searches for themselves). A combination of models then defines a model collection. These two objects should be interchangeable; the only difference is that the model collection should create the DAGs based on the models' inputs and outputs. The DAGs need to have two modes: when we train, we know the inputs and outputs, while when we predict to create simulations, inputs and outputs need to be chained. What I think I need:
    1. Can I define DAGs independently and then merge them later? I saw the subdag decorator but wasn't sure if that's what I need (for the model collection).
    2. How can I combine this way of defining DAGs with classes and abstractions? I need to define interfaces/protocols so developers implement specific methods in their class (e.g. fit, predict). Is there any way for me to create the DAGs myself instead of the automated version? For example, I'd expect the fit signature to say what data sources it needs, but since fit is a method on a class, I can't just wrap it in a decorator.
    3. How can I have two modes of execution for the DAG? Create two DAGs in parallel, where one has no dependencies and the other is the actual DAG? Use config? Any other way?
  • Stefan Krawczyk
    01/21/2025, 5:38 PM
    sorry everyone, no office hours today - we’re a little tied up.
  • Ethan Kim
    01/23/2025, 6:17 PM
    Hi everyone, new to using Hamilton as part of our MLOps stack! I couldn't find a reference for passing a column from a dataframe to another function. For example, we're initially loading a table from Postgres for user IDs, and then using those user IDs in subsequent SQL queries to filter other dataframes. Any help would be greatly appreciated!
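    A minimal sketch of one way to express this (table and function names are invented): a node can return a single column as a Series, and downstream functions depend on it by parameter name. The @extract_columns decorator from hamilton.function_modifiers is another option for splitting a dataframe into column nodes.
    import pandas as pd


    def users() -> pd.DataFrame:
        # in the real pipeline this would be loaded from Postgres
        return pd.DataFrame({"user_id": [1, 2, 3]})


    def user_ids(users: pd.DataFrame) -> pd.Series:
        # a node can simply return one column; downstream nodes ask for it by name
        return users["user_id"]


    def filtered_orders(orders: pd.DataFrame, user_ids: pd.Series) -> pd.DataFrame:
        # `orders` would be another node or an input to the DAG
        return orders[orders["user_id"].isin(user_ids)]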
  • Gilad Rubin
    01/27/2025, 9:03 AM
    Has anyone established a good workflow for working with an AI editor like Cursor/Windsurf and Hamilton in Jupyter Notebooks? I'm struggling with the lack of native notebook support in these editors, plus having some difficulty "teaching" these AIs how to use Hamilton properly @Thierry Jean
  • Stefan Krawczyk
    01/28/2025, 5:31 PM
    https://hamilton-opensource.slack.com/archives/C03AJNGDGQL/p1738085510752839
  • Matthew K
    01/29/2025, 3:37 PM
    Hello, I'm new to Hamilton and I'm interested in adopting it to organize data science pipelines. I've dug around the docs for a few hours and made some cursory searches in Slack, and I have some follow-up questions now. Please forgive me if some of these have been answered elsewhere.

    One reason I'm interested in Hamilton is that it seems to offer an intriguing way to organize processing/analytic steps across batch pipelines written by various data scientists and engineers. The passing of function names as nodes seems like a way to enforce a sort of schema across various processing jobs when sharing base components, as well as baked-in documentation and knowledge management. I envision making an analytic store, where data scientists write their pipelines on a one-module-per-pipeline basis, and a corresponding team analytic library, where commonly used nodes and subgraphs can be imported for various pipelines in the analytic store. Given consistent naming enforced at the function level, a setup like this should make it very clear which analytics consume which inputs, what sort of follow-up processing on common nodes is being done, etc., both to reduce duplication of effort and to increase visibility of analytic deficiencies. I imagine writing a little code so a data scientist could query something like "what other pipelines use the process_data_from_special_api function?", so they could find that a coworker's DAG brings in data from another source to enrich the data in their pipeline that would also be useful in their own, or that so many analytics use the same follow-on steps that perhaps they should be modularized in a component library to increase consistency across analytics. The thing is, this doesn't actually seem that easy to do in Hamilton, so I first have to ask: does the above setup I have in mind seem reasonable?

    From reading the documentation and doing a little experimenting, function reuse is tricky in general, or at least not intuitive. Imagine I made a module of reusable components that I import into my_analytic like Builder().with_modules([component_library, my_analytic]), but I really only want one function from component_library, say process_special_data. My understanding is that to hook it into the my_analytic DAG, I would need to make sure I have upstream nodes that are the expected inputs to process_special_data and output nodes that will consume process_special_data as input. What happens to any extra nodes/subgraphs that I don't want from the module? Is there a way to just import the process_special_data function to make it explicit in the my_analytic calling code exactly what the analytic depends on? Is there any good way to further decorate an imported function like this, if I wanted to tag further metadata or change parameterizations specific to the analytic? If another analytic uses a function I want in mine, is there a better way to bring it into my own graph/model outside of copy-pasting it?

    I have a separate, unrelated question about materialization and external APIs in general. Is materialization a concept that should only be used for initial and final I/O? If I grab data from a local parquet file using the built-in parquet loader, process it, then grab data from an external API based on that processing to further enrich it, should I treat building a client, making the request, and processing the response as nodes in my graph, or should I write and register a data loader for the API that does this under the hood, so I can just decorate a node in my graph with @load_from.my_api(request_params=source["parquet_processing"])? Static materialization, if I understand correctly, is out of the picture for this case, since I don't know how to construct the request until I do the initial parquet processing? I don't really see any examples in the Hamilton docs on data loading/materialization beyond fairly simple static examples. I appreciate any help on these questions.
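    On the "what other pipelines use this function" part, a sketch of the driver-level introspection that I believe exists (list_available_variables and what_is_downstream_of are Driver methods; the module names here are hypothetical):
    from hamilton import driver

    import component_library  # hypothetical shared library module
    import my_analytic        # hypothetical pipeline module

    dr = driver.Builder().with_modules(component_library, my_analytic).build()

    # every node the combined DAG knows about, with its tags
    for var in dr.list_available_variables():
        print(var.name, var.tags)

    # everything that (transitively) consumes process_special_data
    print([v.name for v in dr.what_is_downstream_of("process_special_data")])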
  • Seth Stokes
    01/30/2025, 9:47 PM
    Hey Hamilton users, has anyone else used Hamilton to create metrics? If so, how did you go about it?
  • Stefan Krawczyk
    02/04/2025, 5:43 PM
    @here office hours starting now - meet.google.com/enx-bhus-fae
  • Zoltan Eisler
    02/04/2025, 10:38 PM
    Hi All, is there an easy way to ensure that a function is always called (and always at the beginning of the DAG run) even if no requested final variable requires it? I can think of hacky solutions, but maybe there’s a “proper” way? (In case you wonder why: This would serve to initialize a module that the different DAG nodes themselves require. I usually just initialize from outside the DAG before running it. However, now I’m moving to a Ray runner and have to do the initialization step remotely.)
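    A sketch of the dependency-forcing flavor of this (admittedly one of the "hacky" options mentioned; names are invented): make the initialization an explicit node and have the nodes that need it accept its output as a parameter, so it always runs first wherever those nodes execute.
    def module_initialized() -> bool:
        # one-time (remote) initialization would go here
        return True


    def cleaned_data(module_initialized: bool, raw_data: list) -> list:
        # depends on module_initialized purely to force it to run first
        return [x for x in raw_data if x is not None]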
  • Joao Castro
    02/05/2025, 1:44 PM
    Hi all, is there a way to create dependencies between nodes when one of them returns None? For example, having a node create a table in a DB and then running another node after that. The way I thought of was to return a boolean, but I'm wondering if there's another recommended way, as I couldn't find anything in the docs. Thanks
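    A sketch of the boolean-return workaround described in the question (names invented), just to make the ordering mechanism concrete:
    def table_created(connection_string: str) -> bool:
        # the CREATE TABLE side effect happens here
        return True


    def rows_loaded(table_created: bool, records: list) -> int:
        # listing table_created as a parameter makes Hamilton run it first
        return len(records)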
  • João Paulo Vieira
    02/06/2025, 1:49 PM
    Hey all, I would like to know if it's possible to create loops with Hamilton like in the image below. Essentially, I want to use different functions for each step of my pipeline (read data, preprocess, train/test split, etc.). I want to be able to execute the train/test split multiple times, saving the results of each model's inference, and then displaying/saving one of the models (with a fixed seed). The reason I want to execute the train/test split multiple times is that I want to test different seeds and then calculate mean/std values across the N models. At the same time, I would like to be able to execute this pipeline in a single pass (with N=1, for instance), without activating that loop.
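    A sketch of one way to express the loop with Parallelizable/Collect (function names invented; the driver needs dynamic execution enabled, and passing a single seed effectively disables the loop):
    from hamilton.htypes import Collect, Parallelizable


    def seed(seeds: list[int]) -> Parallelizable[int]:
        # one branch per seed; pass seeds=[42] for a single run
        for s in seeds:
            yield s


    def run_metrics(seed: int, dataset: dict) -> float:
        # train/test split, fit, and inference for this seed would go here
        return float(seed)


    def metrics_summary(run_metrics: Collect[float]) -> dict:
        values = list(run_metrics)
        return {"mean": sum(values) / len(values), "n": len(values)}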
  • sahil-shetty
    02/09/2025, 6:21 PM
    Hi all, is there a way to access my config parameters while building the driver? (my use case in thread below)
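    Without the thread details, a sketch of the simplest reading of this (module and key names invented): the config is a plain dict you construct yourself, so it is available both to with_config and to the code that builds the driver.
    from hamilton import driver

    import my_pipeline  # hypothetical module

    config = {"mode": "backfill", "region": "eu"}

    dr = driver.Builder().with_modules(my_pipeline).with_config(config).build()

    # the same dict is still in scope for the driver-building code
    final_vars = ["history_report"] if config["mode"] == "backfill" else ["daily_report"]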
  • Seth Stokes
    02/10/2025, 9:12 PM
    Hello, I want to use .with_materializers() to save out an intermediate step in a DAG. I am currently trying the following but am encountering some issues. Details in thread.
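    Without the thread details, a sketch of the general shape this usually takes, assuming a pandas node and the to.parquet materializer (the module, node name, id, and path below are invented):
    from hamilton import driver
    from hamilton.io.materialization import to

    import my_dag  # hypothetical module defining an `intermediate_df` node

    dr = (
        driver.Builder()
        .with_modules(my_dag)
        .with_materializers(
            to.parquet(
                id="intermediate_df__parquet",
                dependencies=["intermediate_df"],
                path="./intermediate_df.parquet",
            )
        )
        .build()
    )
    # the materializer becomes a node that can be requested alongside others
    dr.execute(["intermediate_df__parquet", "final_result"])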
  • Volker Lorrmann
    02/11/2025, 1:33 PM
    Hi guys, I have a question regarding the Hamilton UI. As described here (https://hamilton.dagworks.io/en/latest/hamilton-ui/ui/#changing-behavior-of-what-is-captured), I am able to disable capturing data statistics for each run. However, I don't get what MAX_LIST_LENGTH_CAPTURE and MAX_DICT_LENGTH_CAPTURE are used for. Furthermore, I'd like to set up a Hamilton dataflow (using flowerpower) that cleans all stats older than a threshold time from the Hamilton UI Postgres DB, but I need some more input. Which are the relevant database tables? Do I just need to delete all entries from those tables that are older than the threshold time?
  • Stefan Krawczyk
    02/11/2025, 5:35 PM
    @here office hours starting now - https://meet.google.com/enx-bhus-fae
  • Zoltan Eisler
    02/12/2025, 8:48 AM
    Hi folks, I am repurposing a fairly large DAG that used to run locally, to deploy it remotely on Ray. This is super easy with the adapters provided. I have run into two performance bottlenecks though. First, I move polars dataframes that can have a billion rows. I have noticed that when attaching a HamiltonTracker, up to 20% of my DAG runs can be spent calculating dataframe stats for display in the UI. I have short-circuited the calculations by overriding hamilton_sdk.tracking.polars_stats._compute_stats_ with an empty function, but any less hacky suggestions are welcome. Second, and more crucially, I hit a very steep serialization penalty when moving data between nodes. If I understood correctly, we can eliminate serialization (at the cost of parallelism, possibly) by putting all the nodes in the same group, but I can't seem to pull this off. Here's my code:
    class MyGS(grouping.GroupingStrategy):
        def group_nodes(self, nodes: List[Node]) -> List[NodeGroup]:
            group = NodeGroup(base_id="whatever", spawning_task_base_id=None, nodes=nodes, purpose=NodeGroupPurpose.EXECUTE_BLOCK)
            return [group]
    Then I just attach this to my driver by calling
    driver.Builder().with_grouping_strategy(MyGS())
    However, this doesn't seem to have the desired effect; stuff still gets serialized. Any suggestion on what I am doing wrong? Or another possible workaround?
  • Elton Cardoso Do Nascimento
    02/12/2025, 7:42 PM
    Is it possible to use parameterize to define a new node in a file different from the original function? For example, I have file a.py:
    def a(input1: int) -> int:
      ...
    And a file b.py:
    b = parameterize(b={"input1": value(1)})(a)
    When I try this and import both a and b in the module __init__.py, I get "Cannot define function a more than once." when building the driver.
  • Stefan Krawczyk
    02/18/2025, 5:47 PM
    @here sorry - a bit late on the office hours, but they’re happening now for the next 45 mins. meet.google.com/enx-bhus-fae
  • Seth Stokes
    02/18/2025, 10:33 PM
    Hey, if I am trying to debug this function final_result, will all the step functions be applied to upstream_int by the time I get it in the function body? What if it were pipe_output?
    @pipe_input(
        step(_add_one),
        step(_multiply, y=2),
        step(_sum, y=value(3)),
        step(_multiply, y=source("upstream_node_to_multiply")),
    )
    def final_result(upstream_int: int) -> int:
        pdb.set_trace()
        return upstream_int
  • Seth Stokes
    02/20/2025, 9:04 PM
    Hey, is it possible to see all the configs available for a DAG? For example, I want to see all possible last_run config options. When I pass the config (in green), the DAG is nice enough to tell me which last_run is being used, and it's reflected by the function getting used, _with_filter_run_date. However, for documentation purposes, how would you suggest amending this so that all last_run config options are visible in the DAG? Is this possible?
    @extract_fields({
        "child_process_run_id": int,
        "child_process_completed_at": pd.Timestamp,
    })
    @pipe_input(
        step(_filter_run_bom).when(last_run="bom"),
        step(_filter_run_latest).when(last_run="last"),
        step(_filter_run_date, completion_date=source("completion_date")).when(last_run="completion_date"),
    )
    def child_process_run_info(
        child_process_runs: pd.DataFrame,
        child_rec_process_code: str,
        completion_date: Optional[str] = None
    ) -> Dict[int, pd.Timestamp]:
        """Return run id for process from `last_run` logic."""
  • Jonas Meyer-Ohle
    02/21/2025, 4:36 PM
    Hiya, thanks for the Hamilton package. I'm currently setting up the Hamilton UI on a domain subpath, using nginx as a reverse proxy. I'm having some issues getting this to work, since the frontend expects the main.js and CSS files to be located under the root path. Has anyone had any success getting this to work? I've done something similar with Grafana, where the following environment variables exist to make it work: - GF_SERVER_ROOT_URL=%(protocol)s://%(domain)s/grafana - GF_SERVER_SERVE_FROM_SUB_PATH=true Thanks for any help :)
  • Emeka Anyanwu
    02/21/2025, 7:37 PM
    Hey all! Is there any way to access a node's metadata, like run_id, from within the node itself? I'd like to use it to add to some manual telemetry.
  • Seth Stokes
    02/25/2025, 4:40 PM
    Looking forward to office hours today to hopefully work through this. I recently migrated from looping over the config outside of the DAG and executing the driver for each config, to using Parallelizable/Collect to handle this and get some speedup. The issue I am having is that previously, the DAG depended on the config to know which downstream functions to call via config.when. Now that this config variable is the iterated list, is there a way to still expose each of its values so that the DAG can know which downstream functions to call?
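    For context, a sketch of the pre-migration pattern being described (module and node names invented): one driver per config, so @config.when can resolve differently on each run.
    from hamilton import driver

    import my_dag  # hypothetical module using @config.when(...)

    configs = [{"source": "eu"}, {"source": "us"}]
    results = []
    for cfg in configs:
        dr = driver.Builder().with_modules(my_dag).with_config(cfg).build()
        results.append(dr.execute(["final_output"]))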
  • Elijah Ben Izzy
    02/25/2025, 5:37 PM
    @here office hours starting now - https://meet.google.com/enx-bhus-fae
  • Benoit Aubuchon
    03/03/2025, 6:07 PM
    I'm slowly porting an existing codebase to Hamilton and I'm trying to access the result of one of the workflow steps. The result is a DataFrame. I made a function to return it (and requested it from execute()), but when I access it in the execute result, each column is a key in the result dictionary. Is there a way to not have Hamilton deconstruct the dataframe in the result?
  • Seth Stokes
    03/03/2025, 6:19 PM
    When using @check_output, does Hamilton have the ability to split the offending rows into two datasets?
    from typing import Any

    import pandas as pd
    from hamilton.function_modifiers import check_output

    # positions_schema is defined elsewhere
    @check_output(positions_schema)
    def final_dataframe_before_api_call(prepped_df: pd.DataFrame) -> pd.DataFrame:
        return prepped_df

    def final_dataframe_validated(prepped_df: pd.DataFrame) -> pd.DataFrame:
        return prepped_df

    def final_dataframe_violated(prepped_df: pd.DataFrame) -> pd.DataFrame:
        return prepped_df

    def api_request_object(final_df_validated: pd.DataFrame) -> dict:
        return final_df_validated.to_dict(orient="records")

    def post_request_to_api(api_request_object: dict) -> Any:
        ...
  • Seth Stokes
    03/04/2025, 5:10 PM
    Hey, should I be able to use extract_columns on a Collect'ed node? I am getting a seemingly unrelated error, but I just added that, so that's what I am suspecting.
  • Stefan Krawczyk
    03/04/2025, 5:46 PM
    Sorry no office hours today - but do ping here if you have Qs!