James Marvin - 06/30/2022, 11:06 AM
James Marvin - 06/30/2022, 11:11 AM
James Marvin - 07/04/2022, 8:48 AM
Simon Helmig - 08/02/2022, 5:59 PM
Brian Ritz - 08/04/2022, 8:01 PM
• Each Subcategory has a corresponding subcategory_tag (think a primary key for subcategories).
• Each Category has a corresponding category_tag (a primary key for categories).
My Ask: At “query time,” I’d like to be able to specify a subcategory_tag or a category_tag, and get back a category.
One way to do this is as follows, but it feels “against the framework”. Is there a better way?
# node_definitions.py
def data() -> list:
    return [
        {"category": "Peanut Butter", "category_tag": "PB", "subcategory": "Natural Peanut Butter", "subcategory_tag": "NPB"},
        {"category": "Peanut Butter", "category_tag": "PB", "subcategory": "Conventional Peanut Butter", "subcategory_tag": "CPB"},
    ]

def category_tag() -> str:
    return None

def subcategory_tag() -> str:
    return None

def subcategory(subcategory_tag: str, data: list) -> str:
    if subcategory_tag is None:
        return None
    d = {d['subcategory_tag']: d['subcategory'] for d in data}
    return d[subcategory_tag]

def category(category_tag: str, subcategory: str, data: list) -> str:
    if category_tag is None and subcategory is None:
        raise ValueError
    if category_tag is not None:
        d = {d['category_tag']: d['category'] for d in data}
    else:
        d = {d['subcategory']: d['category'] for d in data}
    val = category_tag if category_tag is not None else subcategory
    return d[val]
# test.py
from hamilton import driver
from hamilton.base import SimplePythonGraphAdapter, DictResult
import node_definitions
dr = driver.Driver({}, node_definitions, adapter=SimplePythonGraphAdapter(DictResult))
result_1 = dr.execute(['category'], overrides={"category_tag": "PB"})
print(result_1)
result_2 = dr.execute(['category'], overrides={"subcategory_tag": "NPB"})
print(result_2)
Running it:
root@ae148f8887a5:/project/src/hamilton_tests# python test.py
{'category': 'Peanut Butter'}
{'category': 'Peanut Butter'}
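For comparison, here is the same tag-to-category resolution with the framework stripped away. This is only a restatement of the lookup logic in the two functions above (the name resolve_category is mine), not a claim about the idiomatic Hamilton way to do it:

```python
# Same records as data() above.
data = [
    {"category": "Peanut Butter", "category_tag": "PB",
     "subcategory": "Natural Peanut Butter", "subcategory_tag": "NPB"},
    {"category": "Peanut Butter", "category_tag": "PB",
     "subcategory": "Conventional Peanut Butter", "subcategory_tag": "CPB"},
]

def resolve_category(data: list, category_tag: str = None, subcategory_tag: str = None) -> str:
    # Prefer the direct category_tag mapping; fall back to subcategory_tag.
    if category_tag is not None:
        return {d["category_tag"]: d["category"] for d in data}[category_tag]
    if subcategory_tag is not None:
        return {d["subcategory_tag"]: d["category"] for d in data}[subcategory_tag]
    raise ValueError("provide category_tag or subcategory_tag")

print(resolve_category(data, category_tag="PB"))      # Peanut Butter
print(resolve_category(data, subcategory_tag="NPB"))  # Peanut Butter
```

Seen this way, the question is really how to express an either/or input in the DAG, which is why the override-based version above feels awkward.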
Ben - 09/02/2022, 5:19 PM
Is it possible to combine @parameterize[-sources] with @extract_columns (i.e. to get multiple columns back from a parameterized function)? I can't wrap my head around how it would work, if it is.
Ben - 09/08/2022, 2:40 PM
With @does I can pass in a list of kwargs to the replacing function, but I still need to define the actual function arguments in the originating function. E.g., is there a way to rewrite this in hamilton, where parameterized_cols would be passed in as kwargs or similar?
df = pd.DataFrame(
    [[1, 1, 3, np.nan], [np.nan, 2, 3, 4], [np.nan, 0, 5, 6]],
    index=['2022-01', '2022-02', '2022-03'],
    columns=['v202009', 'v202010', 'v202011', 'v202012'],
)
parameterized_cols = ['v202009', 'v202011']
last_value_series = df.loc[:, parameterized_cols].iloc[-1]
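The computation being parameterized is small; as plain pandas, a single function taking the column list as an argument (names taken from the snippet above, with its missing bracket and variable-name typo fixed) would look like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 3, np.nan], [np.nan, 2, 3, 4], [np.nan, 0, 5, 6]],
    index=["2022-01", "2022-02", "2022-03"],
    columns=["v202009", "v202010", "v202011", "v202012"],
)

def last_value_series(df: pd.DataFrame, parameterized_cols: list) -> pd.Series:
    # Last row of just the selected columns.
    return df.loc[:, parameterized_cols].iloc[-1]

print(last_value_series(df, ["v202009", "v202011"]))
```

The open question in the message above is how to get a decorator to inject parameterized_cols, rather than how to compute the series itself.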
Olaide Joseph - 09/28/2022, 3:08 PM
John Herr - 10/05/2022, 11:15 PM
If I provide column A in the input dataframe, but Hamilton has a function to calculate column A based on other fields found in the dataframe, will Hamilton recalculate and overwrite the column, or skip the recalculation, having realized the column was already provided?
Avnish Pal - 10/07/2022, 3:20 PM
Zouhair Mahboubi - 10/25/2022, 8:24 PM
James Marvin - 10/25/2022, 8:31 PM
Zouhair Mahboubi - 10/26/2022, 11:32 PM
Zouhair Mahboubi - 10/27/2022, 4:00 PM
James Marvin - 10/31/2022, 5:13 PM
Zouhair Mahboubi - 11/02/2022, 7:00 PM
Filip Piasevoli - 11/16/2022, 9:46 PM
Zouhair Mahboubi - 11/17/2022, 7:59 PM
Seth Stokes - 11/18/2022, 9:17 PM
I'm unsure how to use hamilton when many of the features in the source data need to be filled via mapping/merging. Is my assumption correct that these need to be handled before passing the DataFrame to the hamilton driver? Do hamilton functions allow for merges between DataFrames? If the data was whole beforehand, I could simply define functions for each column and execute df.pipe() (here I can see the benefit of hamilton composing these functions instead). My issue is the merges that are done in between to fill missing values for each subsequent fn. Any direction would be welcome.
Filip Piasevoli - 11/18/2022, 9:44 PM
Gregory Jeffrey - 11/23/2022, 5:28 PM
@parameterize(
    feature_lag_1={"lag": value(1)},
    feature_lag_2={"lag": value(2)},
)
def lag_series(feature: pd.Series, lag: int = 1) -> pd.Series:
    return feature.shift(lag)
The best I've been able to come up with is to enumerate all possible parameterizations, and do the random selection when building the list of output nodes, i.e.:
lag_parameterization = {f"feature_lag_{lag}": {"lag": value(lag)} for lag in range(1, 100)}

@parameterize(**lag_parameterization)
def lag_series(feature: pd.Series, lag: int = 1) -> pd.Series:
    return feature.shift(lag)
This would still give a very long list of available features, however, which is less desirable.
Conor Digan - 12/01/2022, 12:30 PM
def some_function(a, b, c):
    return a + c, b + c

d, e = some_function(a, b, c)
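Since a dataflow node typically produces one named output, a common workaround for a multi-value function like the one above is to return named fields from a single function and split them downstream (Hamilton's @extract_fields decorator does that splitting; the sketch below is plain Python, and all names are illustrative):

```python
def some_function(a: int, b: int, c: int) -> dict:
    # Return named fields instead of a positional tuple, so each
    # output has a name that downstream nodes can depend on.
    return {"d": a + c, "e": b + c}

result = some_function(1, 2, 3)
d, e = result["d"], result["e"]
print(d, e)  # 4 5
```

The dict keys become the names the rest of the pipeline refers to, which is what makes the split unambiguous.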
Conor Digan - 12/01/2022, 1:48 PM
pip install sf-hamilton[visualization]:
Filip Piasevoli - 12/01/2022, 7:59 PM
ARG2S = {
    ('FOO', 'US Election 2016 Dummy'): 'a',
    ('BAR', 'Doc string for this thing'): 'b',
}

@parameterize_sources(
    FOO=dict(
        arg1='hello'
    ),
    BAR=dict(
        arg1='world'
    )
)
@parameterize_values(
    parameter='arg2', assigned_output=ARG2S
)
def fn(
    arg1: str,
    arg2: str,
) -> Object:
    <rest of function here>
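Whether those two decorators stack this way is exactly the open question. As a plain-Python illustration only, the hoped-for result would be a cross product of the two variant sets. The names and values below come from the snippet above, but treating the pinned sources as literal strings is a simplification, and the cross-product behavior itself is an assumption, not confirmed Hamilton behavior:

```python
from itertools import product

def fn(arg1: str, arg2: str) -> str:
    # Stand-in body for the decorated function above.
    return f"{arg1}/{arg2}"

arg1_variants = {"FOO": "hello", "BAR": "world"}  # what @parameterize_sources pins
arg2_variants = {"FOO": "a", "BAR": "b"}          # what @parameterize_values pins via ARG2S

# If stacking produced a cross product (assumed), 2 x 2 concrete
# outputs would result, one per (arg1, arg2) combination:
stacked = {
    (name1, name2): fn(v1, v2)
    for (name1, v1), (name2, v2) in product(arg1_variants.items(), arg2_variants.items())
}
print(len(stacked))  # 4
```

If stacking is not supported, the fallback is to enumerate the product yourself in a single parameterization dict.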
Peter Robinson - 12/05/2022, 2:47 PM
Is it possible to parameterize a single node over an n x m set of inputs?
My specific use case is the following:
I have three series of data, and I need to find maximums for given time scales (e.g. maximum averaged over 5s, max averaged over 10s, etc.) for each. I realise it's possible to do one node for each series, with a parameterised input of each maximum "time window", but it feels like it might be possible to do with a single node and an n x m set of parameterized inputs.
Seth Terrell - 12/05/2022, 4:46 PM
C:\Stuff\Source\hamilton\examples\dbt>dbt run
16:41:59 Running with dbt=1.3.0
16:42:00 Found 2 models, 0 tests, 0 snapshots, 0 analyses, 267 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
16:42:00
16:42:02 Concurrency: 4 threads (target='dev')
16:42:02
16:42:02 1 of 2 START sql table model HAMILTON.raw_passengers ........................... [RUN]
16:42:04 1 of 2 OK created sql table model HAMILTON.raw_passengers ...................... [SUCCESS 1 in 2.11s]
16:42:04 2 of 2 START python table model HAMILTON.train_and_infer ....................... [RUN]
16:42:04 2 of 2 ERROR creating python table model HAMILTON.train_and_infer .............. [ERROR in 0.00s]
16:42:04
16:42:04 Finished running 2 table models in 0 hours 0 minutes and 4.51 seconds (4.51s).
16:42:04
16:42:04 Completed with 1 error and 0 warnings:
16:42:04
16:42:04 Compilation Error in model train_and_infer (models\train_and_infer.py)
16:42:04 'py_script_postfix' is undefined. This can happen when calling a macro that does not exist. Check for typos and/or install package dependencies with "dbt deps".
16:42:04
16:42:04 Done. PASS=1 WARN=0 ERROR=1 SKIP=0 TOTAL=2
Thanks for any ideas!
Baldo Faieta - 12/06/2022, 10:29 PM
Baldo Faieta - 12/13/2022, 11:04 PM
James Marvin - 01/10/2023, 3:25 PM
ValueError: array length 618 does not match index length 1759
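No code accompanies the error, but pandas raises this message when an array of one length is combined with an index of another length. A minimal, hypothetical reproduction (the original code isn't shown, so the construction here is an assumption):

```python
import numpy as np
import pandas as pd

# 618 values cannot be conformed to an explicit 1759-row index.
try:
    pd.DataFrame({"col": np.arange(618)}, index=range(1759))
    error = None
except ValueError as e:
    error = e
print(error)
```

The usual fix is to align the array to the frame's index (or build it from the frame) before assignment, rather than constructing it at a different length.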
James Marvin - 01/10/2023, 4:21 PM