# hamilton-help
a
Hello, I am getting familiar with Hamilton and trying to figure out how to use functions that take primitive Python types instead of pd.Series. This is a hello world example that uses pd.Series:
# functions.py - declare and link your transformations as functions....
import pandas as pd

def a(input: pd.Series) -> pd.Series:
    return input % 7

def b(a: pd.Series) -> pd.Series:
    return a * 2

# run.py - and run them!
import pandas as pd  # needed for the pd.Series input below

import functions
from hamilton import driver

dr = driver.Driver({}, functions)
result = dr.execute(
    ['a', 'b'],
    inputs={'input': pd.Series([1, 2, 3, 4, 5])}
)
print(result)
I want to define my functions as
def a(input: float) -> float:
    return input % 7

def b(a: float) -> float:
    return a * 2
and apply them to the same input and get the same output as in the example above. My goal is to avoid using pandas in the functions so that I can apply them in a real-time production environment (which does not use pandas), and also apply them to pandas dataframes and Spark dataframes in notebooks while developing those functions. I would very much appreciate help and recommendations. This is blocking my decision on whether to use Hamilton for our feature development. Thanks.
s
@Artem yep you can use primitives.
So the only thing to think about is type checking. Hamilton has some defaults that could be overridden here. You could also define a union type. For operations that aren't 1-1 between environments, you could swap them based on configuration.
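To illustrate the idea of swapping an operation based on configuration, here is a minimal plain-Python sketch. It is not Hamilton's mechanism (Hamilton provides a `config.when` decorator in `hamilton.function_modifiers` for this purpose). The feature name `log_amount`, the `IMPLEMENTATIONS` registry, the `resolve` helper, and the `"mode"` config key are all hypothetical names chosen for illustration.

```python
import math


def log_amount_online(amount: float) -> float:
    """Scalar implementation for the real-time path (standard library only)."""
    return math.log1p(amount)


def log_amount_batch(amount):
    """Vectorized implementation for notebooks; expects a pandas Series or array."""
    import numpy as np  # imported lazily so the online path does not need numpy

    return np.log1p(amount)


# Hypothetical registry mapping a configuration value to an implementation.
IMPLEMENTATIONS = {"online": log_amount_online, "batch": log_amount_batch}


def resolve(config: dict):
    """Pick the implementation that matches the configured mode."""
    return IMPLEMENTATIONS[config["mode"]]


log_amount = resolve({"mode": "online"})
print(log_amount(0.0))  # 0.0
```

Both implementations compute the same feature; only the one selected by the configuration is ever called in a given environment.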
a
@Stefan Krawczyk Is there an example that uses primitives?
s
Hamilton is data type agnostic. It constructs a graph from the functions and checks that the types match between each function's input arguments and the functions that supply them. The other thing to think about is what return type you want from .execute(). In the code you showed, the default is to create a pandas dataframe. But if you look at the documentation and use the Builder, the output of .execute() defaults to a dictionary; this is customizable too.
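The graph construction and type check described above can be sketched with the standard library alone. This is a simplified illustration of the idea, not Hamilton's actual implementation; `build_graph` and `execute` are hypothetical helpers, and the functions `a` and `b` are the primitive-typed versions from earlier in the thread.

```python
import inspect
from typing import Callable, Dict, get_type_hints


def a(input: float) -> float:
    return input % 7


def b(a: float) -> float:
    return a * 2


def build_graph(*funcs: Callable) -> Dict[str, Callable]:
    """Index functions by name and check that linked annotations agree."""
    nodes = {f.__name__: f for f in funcs}
    for f in funcs:
        hints = get_type_hints(f)
        for param in inspect.signature(f).parameters:
            upstream = nodes.get(param)
            if upstream is None:
                continue  # no function with this name, so it is an external input
            produced = get_type_hints(upstream).get("return")
            if produced != hints.get(param):
                raise TypeError(
                    f"{f.__name__}({param}) expects {hints.get(param)}, "
                    f"but {param}() returns {produced}"
                )
    return nodes


def execute(nodes: Dict[str, Callable], outputs: list, inputs: dict) -> dict:
    """Compute the requested outputs, resolving dependencies by parameter name."""
    values = dict(inputs)

    def compute(name: str):
        if name not in values:
            func = nodes[name]
            kwargs = {p: compute(p) for p in inspect.signature(func).parameters}
            values[name] = func(**kwargs)
        return values[name]

    return {name: compute(name) for name in outputs}


nodes = build_graph(a, b)
result = execute(nodes, ["a", "b"], {"input": 10.0})
print(result)  # {'a': 3.0, 'b': 6.0}
```

Nothing here depends on the data type flowing through the functions, which is why primitives work just as well as pd.Series as long as the annotations line up.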
a
Thank you, @Stefan Krawczyk. Is there an example that shows how to apply these functions to a dataframe?
s
What are the rough requirements you want to fulfill? I'm on my phone at the moment but can get/find some code later today.
a
I want the Driver to take a function (or a list of functions, or a module) and apply it to a dataframe. Simplified example:
def add(a: int, b: int) -> int:
    return a + b

df = pd.DataFrame({'a': [1,4], 'b': [3,2]})
Input dataframe:
   a  b
0  1  3
1  4  2
Output data frame:
   a  b  add
0  1  3    4
1  4  2    6
s
One way to do things is like this: define the types to possibly be primitives or series.
import pandas as pd
from typing import Union

INT = Union[pd.Series, int]
FLOAT = Union[pd.Series, float]

def add(a: INT, b: INT) -> INT:
    return a + b

def bar(add: INT, c: INT) -> INT:
    return add * 2 + c * 3
Then run.py:
# And run them!
import pandas as pd

import functions
from hamilton import driver, base

# Build a driver whose execute() returns a pandas DataFrame.
dr = (
    driver.Builder()
    .with_config({})
    .with_modules(functions)
    .with_adapters(base.PandasDataFrameResult())
    .build()
)
df = pd.DataFrame({
    'a': pd.Series([1, 2, 3, 4]),
    'b': pd.Series([3, 5, 6, 6]),
    'c': pd.Series([1, 1, 1, 1]),
})

# Each dataframe column is passed in as a pd.Series input keyed by column name.
result = dr.execute(
    ['a', 'b', 'add', 'bar'],
    inputs=df.to_dict(orient='series')
)
print(result)
# Write a visualization of the function graph (requires graphviz).
dr.display_all_functions(
    "graph.dot", orient="TB", show_legend=False)
On the online side you'd just change the adapter to be a dictionary result and pass in primitive values.
Otherwise, to push back a little on pandas not being used in production: how tight is your SLA? Creating a single-row pandas dataframe isn't that much overhead, depending on the SLA.
a
@Stefan Krawczyk Thank you so much for the details. The trick with the Union worked out. As for the latency, it is quite critical. All the ML features must be generated in under 1 second.
👍 1
s
@Artem note that due to the type hints you'll need to guard against pandas not being in the environment. You could even choose which type to use depending on the environment. Actually nvm, I forgot Hamilton still technically comes with a pandas dependency. So what I said doesn't make that much sense. 😅
a
Thank you @Stefan Krawczyk. I am mainly thinking of Hamilton as a framework for developing ML features as pure functions and simulating them in offline mode during the R&D phase. The main production pipeline serves as the orchestrator, with no dependency on Hamilton.
s
The main production pipeline serves as the orchestrator with no dependency on Hamilton.
Not having the same implementation can be a source of training-serving skew. But it sounds like you'd be reusing the functions in production, just not using Hamilton to run them? Is that correct?
a
That's correct @Stefan Krawczyk
👍 1
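For the setup agreed on above (the same feature functions reused in production, with neither Hamilton nor pandas installed there), a minimal sketch could look like the following. It assumes the `add` and `bar` functions from earlier in the thread. The guarded import follows the suggestion to choose the type alias depending on the environment, and `compute_features` is a hypothetical stand-in for the production orchestrator.

```python
from typing import Union

# Guard the type alias: use the Union where pandas is available (notebooks),
# and fall back to the plain primitive where it is not (production).
try:
    import pandas as pd

    INT = Union[pd.Series, int]
except ImportError:
    INT = int


def add(a: INT, b: INT) -> INT:
    return a + b


def bar(add: INT, c: INT) -> INT:
    return add * 2 + c * 3


def compute_features(a: int, b: int, c: int) -> dict:
    """Hypothetical production entry point: call the functions in dependency order."""
    add_value = add(a, b)
    return {"add": add_value, "bar": bar(add_value, c)}


features = compute_features(a=1, b=3, c=1)
print(features)  # {'add': 4, 'bar': 11}
```

Because the function bodies are identical in both environments, the feature logic itself cannot drift between training and serving. Only the wiring (Hamilton offline, direct calls online) differs, and that wiring has to be kept in sync by hand.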