Hello I am getting familiar with Hamilton and trying to figu Hamilton Open Source #hamilton-help

Artem

04/20/2024, 5:12 PM

Hello, I am getting familiar with Hamilton and trying to figure out how to use functions which use primitive python types instead of pd.Series. This is a hello world example that uses pd.Series.

# functions.py - declare and link your transformations as functions....
import pandas as pd

def a(input: pd.Series) -> pd.Series:
    return input % 7

def b(a: pd.Series) -> pd.Series:
    return a * 2

# run.py - and run them!
import pandas as pd
import functions
from hamilton import driver
dr = driver.Driver({}, functions)
result = dr.execute(
   ['a', 'b'], 
   inputs={'input': pd.Series([1, 2, 3, 4, 5])}
)
print(result)

I want to define my functions as

def a(input: float) -> float:
    return input % 7

def b(a: float) -> float:
    return a * 2

and apply them to the same input and get the same output as in the example above. My goal is to avoid using pandas in the functions, so that I can apply them in a real-time production environment (which does not use pandas) and also apply them to pandas DataFrames and Spark DataFrames in notebooks while developing those functions. I would very much appreciate help and recommendations. This is blocking my decision on whether to use Hamilton for our feature development. Thanks.

Stefan Krawczyk

04/20/2024, 5:46 PM

@Artem yep you can use primitives.

Stefan Krawczyk

04/20/2024, 5:49 PM

So the only thing to think about is type checking. Hamilton has some defaults that could be overridden here. You could also define a union type. For operations that aren't 1-1 between environments, you could swap them based on configuration.

Artem

04/20/2024, 5:50 PM

@Stefan Krawczyk Is there an example that uses primitives?

Stefan Krawczyk

04/20/2024, 5:52 PM

https://hamilton.dagworks.io/en/latest/concepts/driver/

Stefan Krawczyk

04/20/2024, 5:54 PM

Hamilton is data type agnostic. Hamilton constructs a graph from the functions and checks that the types match between the function inputs, arguments, and any functions it finds. The other thing to think about is the return type you want from .execute(). In the code you showed, the default is to create a pandas DataFrame. But if you look at the documentation and use the Builder, the output of .execute() defaults to a dictionary; this is customizable too.

Stefan Krawczyk

04/20/2024, 5:57 PM

https://blog.dagworks.io/p/feature-engineering-with-hamilton could also be a useful read.

Artem

04/20/2024, 6:05 PM

Thank you, @Stefan Krawczyk. Is there an example that shows how to apply these functions to a DataFrame?

Stefan Krawczyk

04/20/2024, 6:07 PM

What are the rough requirements you want to fulfill? I'm on my phone at the moment but can get/find some code later today.

Artem

04/20/2024, 6:19 PM

I want the Driver to take a function (or a list of functions, or a module) and apply it to a DataFrame. Simplified example:

def add(a: int, b: int) -> int:
    return a + b

df = pd.DataFrame({'a': [1,4], 'b': [3,2]})

Input dataframe:

   a  b
0  1  3
1  4  2

Output data frame:

   a  b  add
0  1  3    4
1  4  2    6

Stefan Krawczyk

04/20/2024, 10:35 PM

One way to do this is to define the types so that they can be either primitives or Series.

import pandas as pd
from typing import Union

INT = Union[pd.Series, int]
FLOAT = Union[pd.Series, float]

def add(a: INT, b: INT) -> INT:
    return a + b

def bar(add: INT, c: INT) -> INT:
    return add * 2 + c * 3

Then run.py:

# And run them!
import functions
from hamilton import driver, base
import pandas as pd
dr = (
   driver.Builder()
   .with_config({})
   .with_modules(functions)
   .with_adapters(base.PandasDataFrameResult())
   .build()
)
df = pd.DataFrame({'a': pd.Series([1,2,3,4]), 
           'b': pd.Series([3,5,6,6]),
           'c': pd.Series([1,1,1,1])
   })

result = dr.execute(
   ['a', 'b', 'add', 'bar'], 
   inputs=df.to_dict(orient='series')
)
print(result)
dr.display_all_functions(
   "graph.dot", orient="TB", show_legend=False)

On the online side you'd just change the adapter to return a dictionary result and pass in primitive values.

Stefan Krawczyk

04/20/2024, 10:38 PM

Otherwise, to push back a little on pandas not being used in production: how tight is your SLA? Creating a single-row pandas DataFrame isn't much overhead, depending on the SLA.

Artem

04/21/2024, 1:41 AM

@Stefan Krawczyk Thank you so much for the details. The trick with the Union worked out. As for the latency, it is quite critical. All the ML features must be generated in <1sec.

👍 1

Stefan Krawczyk

04/21/2024, 1:44 AM

@Artem note that due to the type hints you'll need to guard against pandas not being in the environment. You could even choose which type to use depending on the environment. Actually, never mind: I forgot Hamilton still technically comes with a pandas dependency, so what I said doesn't make that much sense. 😅

Artem

04/22/2024, 12:20 PM

Thank you @Stefan Krawczyk. I am mainly thinking of Hamilton as a framework for developing ML features (as pure functions) and simulating them offline during the R&D phase. The main production pipeline serves as the orchestrator with no dependency on Hamilton.

Stefan Krawczyk

04/22/2024, 4:35 PM

The main production pipeline serves as the orchestrator with no dependency on Hamilton.

Not having the same implementation can be a source of training-serving skew. But it sounds like you'd be reusing the functions in production, just not using Hamilton to run them? Is that correct?

Artem

04/22/2024, 4:50 PM

That's correct @Stefan Krawczyk
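A sketch of what that reuse could look like with no Hamilton in the serving path, assuming the feature functions are kept in a plain module. The `try`/`except` guard and the `compute_features` helper are hypothetical names for illustration, not part of Hamilton; the guard is the "guard against pandas not being in the environment" idea mentioned earlier in the thread.

```python
from typing import Dict, Union

# Hypothetical guard: fall back to primitive-only aliases when pandas is absent.
try:
    import pandas as pd

    INT = Union[pd.Series, int]
except ImportError:
    INT = int


def add(a: INT, b: INT) -> INT:
    return a + b


def bar(add: INT, c: INT) -> INT:
    return add * 2 + c * 3


def compute_features(a: int, b: int, c: int) -> Dict[str, int]:
    """Hypothetical serving-side wiring: call the functions by hand in dependency order."""
    add_value = add(a, b)
    return {"add": add_value, "bar": bar(add_value, c)}


features = compute_features(1, 3, 1)
print(features)
```

The hand-written call order in `compute_features` repeats what Hamilton derives from the parameter names, so it is the part that would need to stay in sync with the offline graph to avoid skew.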

👍 1
