Elijah Ben Izzy
09/10/2024, 7:50 PM
sf-hamilton==1.76.0
• sf-hamilton-sdk==0.7.1
🌟 Features:
• OpenLineage integration — all materializers can now report to any OpenLineage provider!
◦ @Stefan Krawczyk will be giving a talk at the meetup thursday!
• Ray graph adapter now works with Hamilton Tracker
◦ As well as additional remote execution hooks. Big thanks to @Jernej Frank for taking the charge here!
📚 Docs/Examples
• OpenLineage example — play around with Hamilton + your favorite OL provider (we’ve been testing with marquez)
• A few typo fixes by Rizebos
🐛 Bug fixes
• (sf-hamilton-sdk) Fixed broken polars API call in Hamilton SDK (thanks @Yaser Martínez Palenzuela!)
----------------------------
Reminder: Office hours
----------------------------
• They’re on most Tuesdays at 9:30am PT.
--------------------------------
Reminder: Meet-up Group
--------------------------------
We’re going to be planning our next meet up for October. Event will be scheduled soon so you can sign up.
If you’d like to speak - do reach out!
See full release notes here: https://github.com/DAGWorks-Inc/hamilton/releases/tag/sf-hamilton-1.76.0
Stefan Krawczyk
09/17/2024, 4:31 PM
Elijah Ben Izzy
09/18/2024, 6:59 PM
sf-hamilton==1.77.0
🌟 Features
• Ability to override nodes with separate modules (see 🧵 )
◦ Thanks to @Jernj Frank for the implementation and @Yijun Tang for the request!
• New pydantic data validator (Thanks to @Charles Swartz for the implementation, see below!)
• A few updates in the docs/minor fixes.
🔍 Pydantic validation
You can now validate any pydantic model/dict with model contents against a schema with `check_output`:
from pydantic import BaseModel

from hamilton.function_modifiers import check_output
from hamilton.plugins import h_pydantic

class MyModel(BaseModel):
    name: str

@check_output(model=MyModel)
def foo() -> dict:
    return {"name": "hamilton"}

# or
@h_pydantic.check_output()
def foo() -> MyModel:
    return MyModel(name="hamilton")
----------------------------
Reminder: Office hours
----------------------------
• They’re on most Tuesdays at 9:30am PT — listen for an announcement in #C03M33QB4M8, or put the link on your calendar (it’s stable)
--------------------------------
Reminder: Meet-up Group
--------------------------------
Our next meetup will be 10/15. It will feature Sholto from Capitec, talking about the decision tree library they built on top of Hamilton. Sign up here!
Furthermore, read about other instances of Hamilton in the wild in our recent blog post. Up next is new and improved caching; reach out if you want to playtest!
Elijah Ben Izzy
09/19/2024, 5:44 AM
• sf-hamilton-sdk==0.7.2: this has the polars fix (thanks @sahil-shetty!)
• sf-hamilton-ui==0.0.15: corresponding UI updates
Elijah Ben Izzy
09/23/2024, 7:02 PM
sf-hamilton==1.77.1: this has a quick fix for some pydantic import issues (backwards compatibility with pydantic=1.x)
Release notes: https://github.com/DAGWorks-Inc/hamilton/releases/tag/sf-hamilton-1.77.1
Elijah Ben Izzy
09/26/2024, 11:24 PM
sf-hamilton==1.78.0
📖 Changes
• @pipe_output: new decorator to make post-function modifications to node results
◦ Thank you @Jernj Frank for leading the charge here! Really drove the implementation + cleaned up a lot of the existing code 🙌
◦ Note that @pipe has been deprecated in favor of @pipe_input. We will keep @pipe around, however, until Hamilton 2.0 comes out!
• Fixed an issue in which materializers could not apply default values in certain cases
◦ Thanks @Thierry Jean for the fix! and thanks to Riezebos for catching the bug!
• Additional documentation on importing functions with the jupyter notebook
◦ Thanks creative-resort for the contribution!
🔍 Post pipe:
You can now run a chain of transformations on the output of a function; the node will have the value of the final transformation. Here’s a pretty simple pipeline. Use source / value to inject results in!
from hamilton.function_modifiers import pipe_output, source, step, value

def _add_one(x: int) -> int:
    return x + 1

def _sum(x: int, y: int) -> int:
    return x + y

def _multiply(x: int, y: int) -> int:
    return x * y

@pipe_output(
    step(_add_one),
    step(_multiply, y=2),
    step(_sum, y=value(3)),
    step(_multiply, y=source("upstream_node_to_multiply")),
)
def initial_data() -> int:
    return 1  # gets turned into (((1 + 1) * 2) + 3) * upstream_node_to_multiply with post-function modifications
See docs here: https://hamilton.dagworks.io/en/latest/reference/decorators/pipe/#pipe-output
---------------------------
Reminder: Office hours
----------------------------
• They’re on most Tuesdays at 9:30am PT — listen for an announcement in #C03M33QB4M8, or put the link on your calendar (it’s stable)
--------------------------------
Reminder: Meet-up Group
--------------------------------
Our next meetup will be 10/15. It will feature Sholto from Capitec, talking about the decision tree library they built on top of Hamilton. Sign up here!
Caching coming soon. Let us know if you want to mess with it! We can make an RC and help you debug anything you find.
Stefan Krawczyk
10/01/2024, 4:33 PM
Thierry Jean
10/03/2024, 9:37 PM
sf-hamilton==1.79.0
Release notes
---------------------------------------
🥋 Getting started with caching
---------------------------------------
To try caching, simply add .with_cache() to your Builder():
from hamilton import driver
import my_dataflow

dr = (
    driver.Builder()
    .with_modules(my_dataflow)
    .with_cache()  # <- one line addition
    .build()
)
dr.execute([...])
dr.execute([...])
The first execution will store metadata and results under ./.hamilton_cache in the current directory. The next execution will retrieve results from cache when possible to skip node execution.
The cache can be accessed via Driver.cache. It comes with many utilities to introspect it, including a new visualization!
dr.cache.view_run()
-------------------
🎭 Caching 101
-------------------
Caching determines whether a node needs to be recomputed based on the node's code and its input data. It finds the minimal set of nodes to execute and runs only those, saving time and resources (computations, API calls, etc.)
There are many benefits and use cases:
• Lightning fast execution and iteration in notebooks.
• Persist results for your notebook. On the next day, you can pick up where you left off
• If your Parallel/Collect fails, you can restart and skip the successful branches
• Debug failed dataflows by loading cached data in a notebook or script
• Works great with the Hamilton UI and the MLFlowTracker
Try it for yourself and we're eager to hear your findings and use cases!
-----------------
📚 Resources
-----------------
• The caching tutorial notebook is highly recommended. Try it on Google Colab or view it in the user guide documentation.
• The concepts page on Caching should answer most questions.
----------------
📣 Roadmap
----------------
This is only the beginning for caching! We have many items on the roadmap and it currently has some limitations (more on that in the docs). Getting your feedback will help us prioritize!
Roadmap
• caching does not yet work with lifecycle adapters that use run_to_execute_node() (PDBDebugger, GracefulErrorAdapter, etc.)
• add async support
• add more backends for storing results and metadata (S3, Redis, and more)
• automated cache eviction (delete old results, set max storage space, etc.)
• integrate remote execution: execute expensive nodes using Ray, Skypilot, Modal, Runhouse, etc. and cache results locally
View the full discussion on GitHub
------------
📰 Blogs
------------
See the awesome guest post by @Ryan Whitten titled Building a Better Feature Platform with Hamilton where he details the journey evaluating feature platform vendors and eventually building one in-house. He shares insights about how to reuse Hamilton transforms across execution contexts (batch, web services, backfills, and streaming) and their platform architecture!
-----------------------------------
⌚ Reminder: Meet-up group
-----------------------------------
Sign up here for next week's meetup! On Tuesday, October 15, Sholto from Capitec will be sharing the decision tree library they built on top of Hamilton.
Elijah Ben Izzy
10/10/2024, 10:20 PM
sf-hamilton==1.80.0
📖 Changes
• Adds new @mutate decorator (see below) for post-hoc modification of functions
• Adds on_output parameter to @pipe_output to control which nodes the transforms get applied to
• Improves h_slack notifications plugin (see thread for pic!)
◦ Adds information on inputs/outputs
◦ Reformats to be easier to read/get to root cause
◦ Fixes up stack traces to show more informative python error
• A few docs updates (thanks @Seth Stokes!)
🔍 @mutate
You can now mutate the results of a node outside of the function itself! This effectively applies a chain of transformations to the output of functions. Use this for feature transformations, etc…
In the following case, we’re mutating a features dataframe step-by-step. Note this is isomorphic to (and uses the underlying tooling of) @pipe_output:
import pandas as pd

from hamilton.function_modifiers import mutate

def features() -> pd.DataFrame:
    return ...

@mutate(features)
def _normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    for column in df.columns:
        df[column] = (df[column] - df[column].min()) / (df[column].max() - df[column].min())
    return df

@mutate(features, outlier_threshold=10)
def _remove_outliers(df: pd.DataFrame, outlier_threshold: float) -> pd.DataFrame:
    return df[df < outlier_threshold]
See docs here: https://hamilton.dagworks.io/en/latest/reference/decorators/pipe/#hamilton.function_modifiers.macros.mutate
----------------------------
Reminder: Office hours
----------------------------
• They’re on most Tuesdays at 9:30am PT — listen for an announcement in #C03M33QB4M8, or put the link on your calendar (it’s stable)
• Next week we will not hold them (see below) — if you need help, feel free to reach out and we can find time!
--------------------------------
Reminder: Meet-up Group
--------------------------------
Our next meetup will be next Tuesday (10/15) — it will feature Sholto from Capitec, talking about the decision tree library they built on top of Hamilton. Sign up here!
We just released caching and we’re really excited about it. We’re also thinking a lot about Hamilton’s async mode.
If you’re curious about the new features/how they can help you, don’t hesitate to reach out or come to office hours!
⭐ Also, if you haven’t starred the GitHub repository (github.com/dagworks-inc/hamilton), we’d love a star. Stars + star growth give a good signal to others about adopting Hamilton 🙂
Stefan Krawczyk
10/15/2024, 4:42 PM
Thierry Jean
10/16/2024, 6:24 PM
on_input! One of the reasons I wasn't using @pipe more was the restriction around having the same function signature for all functions (e.g., def features(df):, def _nulls_removed(df):, etc.). This provides much more flexibility
Elijah Ben Izzy
10/24/2024, 8:35 PM
sf-hamilton==1.81.2 (and, prior, 1.81.1)
📖 Changes
• Jupyter notebook for reusing functions example
• Fixes bug in which keep_dot was not propagated through all visualization functions
• DockerX setup for Hamilton UI builds — now cross-platform (and faster!)
• Fixes up convention for naming in mutate and check_output (<node_name>.raw is the new name for the raw one, replacing a common conflict node_name_raw)
◦ Thanks @Jernj Frank!
• Fixes .get_run_ids to return all run IDs (thanks @Thierry Jean for the quick turnaround, and Evan Lutins for flagging!)
Reminder: Office hours
• They’re on most Tuesdays at 9:30am PT — listen for an announcement in #C03M33QB4M8, or put the link on your calendar (it’s stable)
• Next week we will not hold them (see below) — if you need help, feel free to reach out and we can find time!
Call to action: Meet-up Group
We will be holding our next meetup in December! If you’re interested in presenting, please reach out. We’ve had some really exciting presentations recently. Yesterday Sholto from Capitec spoke on the decision tree engine they open-sourced that leveraged Hamilton! https://github.com/capitec/dsp-decision-engine
Stefan Krawczyk
10/29/2024, 4:31 PM
Elijah Ben Izzy
11/08/2024, 4:40 PM
1.82.0 and 1.83.0
🚀 What’s been released 🚀
• sf-hamilton==1.82.0
• sf-hamilton==1.83.0
📖 Changes
• In-memory caching — cache results in memory then persist to disk
◦ Thanks @Thierry Jean!
• with_columns for pandas
◦ Nice work @Jernj Frank!
• allow_module_overrides for AsyncDriver
◦ Appreciate it @Ryan Whitten!
🔍 In memory caching/`@with_columns`
You can now use an in-memory implementation of the Hamilton cache. This stores results in process, and you can sync with disk whenever you want!
from hamilton import driver
from hamilton.caching.stores.sqlite import SQLiteMetadataStore
from hamilton.caching.stores.file import FileResultStore
from hamilton.caching.stores.memory import InMemoryMetadataStore, InMemoryResultStore
import my_dataflow

dr = (
    driver.Builder()
    .with_modules(my_dataflow)
    .with_cache(
        metadata_store=InMemoryMetadataStore(),
        result_store=InMemoryResultStore(),
    )
    .build()
)

# execute the Driver several times. This will populate the in-memory stores
dr.execute(...)

# persist to disk
dr.cache.metadata_store.persist_to(SQLiteMetadataStore(path="./.hamilton_cache"))
dr.cache.result_store.persist_to(FileResultStore(path="./.hamilton_cache"))
Read more about it here.
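For intuition about the "work in memory, sync to disk when you choose" pattern, here is a toy sketch using only the standard library. `ToyMemoryStore` and `ToyFileStore` are hypothetical classes made up for this example; they are not Hamilton's stores and only borrow the persist_to idea from the snippet above.

```python
import json
import os
import tempfile


class ToyMemoryStore:
    """Hypothetical in-process store: results live in a plain dict."""

    def __init__(self):
        self._data = {}

    def set(self, key: str, value) -> None:
        self._data[key] = value

    def get(self, key: str):
        return self._data[key]

    def persist_to(self, other) -> None:
        # Copy every entry into another store that exposes set().
        for key, value in self._data.items():
            other.set(key, value)


class ToyFileStore:
    """Hypothetical on-disk store: one JSON file per key."""

    def __init__(self, path: str):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def _file(self, key: str) -> str:
        return os.path.join(self.path, f"{key}.json")

    def set(self, key: str, value) -> None:
        with open(self._file(key), "w") as f:
            json.dump(value, f)

    def get(self, key: str):
        with open(self._file(key)) as f:
            return json.load(f)


memory = ToyMemoryStore()
memory.set("node_a", [1, 2, 3])        # fast, in-process writes during execution
memory.set("node_b", {"total": 6})

disk = ToyFileStore(os.path.join(tempfile.mkdtemp(), ".toy_cache"))
memory.persist_to(disk)                # sync to disk only when you decide to
print(disk.get("node_a"), disk.get("node_b"))
```

The trade-off is the usual one: in-memory writes are cheap but vanish with the process, so persisting is an explicit step.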
Also: we have an implementation of the with_columns decorator for pandas (read the docs here). This allows you to run a set of series -> series transformations as a subDAG on the columns of an input dataframe. As you might be aware, this is an extension of the pyspark with_columns decorator, which allows you to do the same on Spark dataframes.
# with_columns_module.py
def a_plus_b(a: pd.Series, b: pd.Series) -> pd.Series:
    return a + b

# the with_columns call
@with_columns(
    *[my_module],  # Load from any module
    *[a_plus_b],  # or list operations directly
    columns_to_pass=["a_from_df", "b_from_df"],  # The columns to pass from the dataframe to the subdag
    select=["a", "b", "a_plus_b", "a_b_average"],  # The columns to select from the dataframe
)
def final_df(initial_df: pd.DataFrame) -> pd.DataFrame:
    # process, or just return unprocessed
    ...
Reminder: Office hours
They’re on most Tuesdays at 9:30am PT — listen for an announcement in #C03M33QB4M8, or put the link on your calendar (it’s stable)
Call to action: Meet-up Group
We will be holding our next meetup in December! If you’re interested in presenting, please reach out. We’ve had some really exciting presentations recently. Last time Sholto from Capitec spoke on the decision tree engine they open-sourced that leveraged Hamilton! https://github.com/capitec/dsp-decision-engine
Stefan Krawczyk
11/12/2024, 6:03 PM
Stefan Krawczyk
11/21/2024, 10:08 PM
pip install sf-hamilton-sdk==0.8.0
🚀 What’s been released 🚀
• ability to provide configuration for what is or is not captured by the SDK
There is a new module, hamilton_sdk.tracking.constants. It allows you to tweak what is or is not captured and sent to the Hamilton UI. Modify it before running your Hamilton code.
from hamilton import driver
from hamilton_sdk import adapters
from hamilton_sdk.tracking import constants

# for example we want to limit the length of lists or dictionaries
constants.MAX_LIST_LENGTH_CAPTURE = 100
constants.MAX_DICT_LENGTH_CAPTURE = 200

tracker = adapters.HamiltonTracker(
    project_id=PROJECT_ID,
    username="USERNAME/EMAIL_YOU_PUT_IN_THE_UI",
    dag_name="my_version_of_the_dag",
    tags={"environment": "DEV", "team": "MY_TEAM", "version": "X"}
)
dr = (
    driver.Builder()
    .with_config(your_config)
    .with_modules(*your_modules)
    .with_adapters(tracker)
    .build()
)
dr.execute(...)
If you don’t want to capture any data statistics at all, you can simply do:
from hamilton_sdk.tracking import constants
constants.CAPTURE_DATA_STATISTICS = False
There are also environment-variable and config-file ways to drive this behavior. Please see the docs for more details.
We’re going to add more constants and configuration over time, so do reach out; adding new configuration should be straightforward.
🕐 Reminder: Office hours 🕐
They’re on most Tuesdays at 9:30am PT — listen for an announcement in #C03M33QB4M8, or put the link on your calendar (it’s stable).
☎️ Call to action: Meet-up Group ☎️
We will be holding our next meetup in December! If you’re interested in presenting, please reach out. Sign up here.
Stefan Krawczyk
11/26/2024, 2:00 PM
pip install sf-hamilton==1.83.3
🚀 What’s been released 🚀
A few 🐛 🛠️ (bug fixes):
• Fixes polars to parquet saving issue via Hamilton’s caching functionality.
• Fixes pandas Timestamp hashing for use with Hamilton’s caching. Thanks @Thierry Jean.
• Fix pipe_input example doc string. Thanks @Seth Stokes!
🕐 Reminder: Office hours 🕐
They’re on most Tuesdays at 9:30am PT. Listen for an announcement in #C03M33QB4M8, or put the link on your calendar (it’s stable). Look for it in a couple of hours.
☎️ Call to action: Meet-up Group ☎️
We will be holding our next meetup on Tuesday December 17th at 9:30am PT! If you’re interested in presenting, please reach out. We’d love to spotlight what you’re working on. Sign up here.
Stefan Krawczyk
11/26/2024, 6:03 PM
Elijah Ben Izzy
12/05/2024, 1:15 AM
with_columns: apply a submodule/set of functions to a dataframe, encapsulating transformations. Useful for feature engineering + more. In this case we apply a set of series-level transformations to a pandas dataframe:
# with_columns_module.py
def a_from_df(initial_df: pd.Series) -> pd.Series:
    return initial_df["a_from_df"] / 100

def b_from_df(initial_df: pd.Series) -> pd.Series:
    return initial_df["b_from_df"] / 100

# the with_columns call
@with_columns(
    *[my_module],
    *[a_from_df],
    on_input="initial_df",
    select=["a_from_df", "b_from_df", "a", "b", "a_plus_b", "a_b_average"],
)
def final_df(initial_df: pd.DataFrame, ...) -> pd.DataFrame:
    # process, or just return unprocessed
    ...
• pandas
• polars
• polars lazyframe
• pyspark
Elijah Ben Izzy
12/10/2024, 5:30 PM
Elijah Ben Izzy
12/12/2024, 8:50 PM
pip install sf-hamilton==1.85.0
pip install sf-hamilton-ui==0.16.0
🚢 What’s been released 🚢
• @extract_fields now supports TypedDict. Thanks to @Stefan Krawczyk for implementing and OS user earshinov for the issue
◦ See docs here
• Fix in the Hamilton UI to display all inputs in the “run parameters” section
• New modular subdag example in the repository. It shows how to tie together/reuse a few modules using @subdag
More in 🧵
Full release notes: https://github.com/DAGWorks-Inc/hamilton/releases/tag/sf-hamilton-1.85.0
🕐 Reminder: Office hours 🕐
They’re on most Tuesdays at 9:30am PT. Listen for an announcement in #C03M33QB4M8, or put the link on your calendar (it’s stable). Look for it in a couple of hours.
☎️ Call to action: Meet-up Group ☎️
We will be holding our next meetup on Tuesday December 17th at 9:30am PT! @Jernj Frank will be talking about his experience contributing/working with decorators! Sign up here.
Elijah Ben Izzy
12/17/2024, 5:32 PM
Elijah Ben Izzy
12/17/2024, 9:36 PM
data_loader with annotations
• Examples for running the hamilton UI on snowflake
pip install sf-hamilton==1.85.1
Release notes here
Stefan Krawczyk
12/24/2024, 10:57 PM
Elijah Ben Izzy
01/07/2025, 4:47 AM
pip install sf-hamilton==1.86.1
🚢 What’s been released 🚢
FutureAdapter that delegates to a threadpool for parallelization.
from hamilton import driver
from hamilton.plugins.h_threadpool import FutureAdapter

dr = (
    driver.Builder()
    .with_modules(my_module_with_lots_of_io_in_parallel)
    .with_adapter(FutureAdapter())
    .build()
)
dr.execute(["my_variable"], inputs={...}, overrides={...})
This makes your DAG run optimally in parallel using a Threadpool! Use if you have highly I/O bound, non-async code that you want to parallelize. Hamilton will optimize execution in a thread-safe manner (assuming your code, itself, is thread-safe/doesn’t access shared resources. This is likely not an issue with the global interpreter lock, but use at your own risk, especially as the GIL is getting phased out…)
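To see why a threadpool helps with I/O-bound, non-async code, here is a standard-library sketch of the underlying idea. It is not the FutureAdapter implementation: `fetch` and `combine` are hypothetical stand-ins for DAG nodes, and time.sleep plays the role of a slow network or database call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(name: str, delay: float = 0.2) -> str:
    """Stand-in for an I/O-bound call (HTTP request, DB query, ...)."""
    time.sleep(delay)
    return f"{name}-done"


def combine(a: str, b: str, c: str) -> list:
    """Downstream step that needs all three upstream results."""
    return [a, b, c]


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as pool:
    # The three independent "nodes" are submitted at once and wait concurrently.
    futures = [pool.submit(fetch, name) for name in ("a", "b", "c")]
    # Results are only resolved where the downstream step needs them.
    combined = combine(*(future.result() for future in futures))
elapsed = time.perf_counter() - start

print(combined, f"{elapsed:.2f}s")
```

Run serially, the three calls would take about the sum of their delays; with threads the waits overlap. The same caveat as above applies: this only stays safe if the functions do not mutate shared state.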
Release notes: https://github.com/DAGWorks-Inc/hamilton/releases/tag/sf-hamilton-1.86.1
Otherwise we’re looking for future meetup speakers and guest blog-post writers! Reach out if that’s interesting to you.
Stefan Krawczyk
01/07/2025, 5:33 PM
Stefan Krawczyk
01/28/2025, 5:31 PM
Elijah Ben Izzy
04/02/2025, 4:05 AM
Stefan Krawczyk
05/08/2025, 8:54 PM
Stefan Krawczyk
06/23/2025, 3:00 PM