Slackbot
02/05/2024, 3:21 PM
Ryan Whitten
02/05/2024, 3:21 PM
`extract_fields` could handle DFs:
# data_source.py
class DataSource(BaseModel):
    customer_id: str
    some_int: int
    another_int: int

    @classmethod
    def all_fields_with_types(cls) -> dict[str, Any]:
        # get fields and their types from model (pydantic, sqlalchemy, etc)
        # essentially:
        return typing.get_type_hints(cls)

def extract_source_fields(source: Type[DataSource], exclude: Optional[list[str]] = None) -> Callable:
    source_fields = source.all_fields_with_types()
    included_fields = {field: Series[field_type] for field, field_type in source_fields.items() if field not in (exclude or [])}

    def decorator(func: Callable) -> Callable:
        parent_tag = tag(target_=func.__name__, node_type="data_source")
        return parent_tag(tag(node_type="source_feature")(extract_fields(fields=included_fields)(func)))

    return decorator

# data_loaders.py
@extract_source_fields(DataSource)
def extract_user() -> pd.DataFrame:
    # build & return df
    return pd.DataFrame(...)

# features.py
def add_ints(some_int: Series[int], another_int: Series[int]) -> Series[int]:
    return some_int + another_int
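For reference, the field-selection step in extract_source_fields boils down to something like the following. This is only a minimal stdlib sketch: it assumes a plain annotated class standing in for the pydantic model, and the helper name `included_fields_for` is made up for illustration (it is not part of Hamilton or pydantic).

```python
import typing
from typing import Optional


class DataSource:
    # Plain annotated class used as a stand-in for the pydantic model (assumption).
    customer_id: str
    some_int: int
    another_int: int


def included_fields_for(source: type, exclude: Optional[list[str]] = None) -> dict[str, type]:
    """Hypothetical helper: map field names to their annotated types, minus exclusions."""
    hints = typing.get_type_hints(source)
    skipped = set(exclude or [])
    return {name: tp for name, tp in hints.items() if name not in skipped}


fields = included_fields_for(DataSource, exclude=["customer_id"])
```

Here `fields` only contains `some_int` and `another_int`, which is the mapping that would then get wrapped in Series[...] and handed to extract_fields.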
Stefan Krawczyk
02/05/2024, 5:24 PM
`extract_fields` to anything “dict”-like I think (though that might be easier said than done)
Or yeah introduce a new decorator… hmm.
Or we could pull these extra annotations from the dataframe annotation? 🤔
Elijah Ben Izzy
02/05/2024, 5:32 PM
Stefan Krawczyk
02/05/2024, 5:47 PM
`plugins/h_pandas.py`, if we can make it generic for all dataframes, then we could have it in general.
Otherwise we’ve talked about using Pandera/Pydantic like you mention (and maybe automatically doing what `@check_output` does) - we just need to figure out the UI/UX for providing the schema. Either on the return type of the function that outputs the dataframe, or via the decorator.
Would you be up for prototyping something, at least on the UI/UX you want (the code here looks like pydantic? not pandera?), and we can figure out where to make it live?
Ryan Whitten
02/05/2024, 9:53 PM
`extract_fields` to work with a dataframe. I just had to update the `validate` call to also allow dataframes, and then modify the final `Node`s being built in `transform_node` to set `input_types={node_.name: pd.DataFrame},` instead of `input_types={node_.name: dict},`. Seems like a fairly small change even if that's all that was expanded without a change to the interface.
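To make the idea concrete outside of Hamilton, here is a rough pandas-only sketch of what "allow dataframes" could look like: check that the output is a DataFrame, check that the requested columns exist, then hand back one Series per field. The helper name `split_dataframe_fields` is hypothetical, and this is not the actual `validate` or `transform_node` code.

```python
import pandas as pd


def split_dataframe_fields(df: pd.DataFrame, fields: dict[str, type]) -> dict[str, pd.Series]:
    """Hypothetical helper: validate a DataFrame and return one Series per requested field."""
    # Mirrors the "also allow dataframes" check, but only for DataFrames here.
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"expected a pandas DataFrame, got {type(df).__name__}")
    # Every declared field must be present as a column.
    missing = [name for name in fields if name not in df.columns]
    if missing:
        raise KeyError(f"DataFrame is missing expected columns: {missing}")
    # Each extracted field is just the matching column.
    return {name: df[name] for name in fields}


df = pd.DataFrame({"customer_id": ["a", "b"], "some_int": [1, 2], "another_int": [10, 20]})
columns = split_dataframe_fields(df, {"some_int": int, "another_int": int})
total = columns["some_int"] + columns["another_int"]
```

Downstream functions like add_ints would then receive `columns["some_int"]` and `columns["another_int"]` as their Series inputs.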