Slackbot
07/04/2023, 7:59 AM
Amos
07/04/2023, 8:00 AM
@extract*
that pulls sets of columns using regex? I guess it isn't strictly necessary but might be nice to have.
Elijah Ben Izzy
07/04/2023, 4:57 PM
check_output_custom
. Opened up this ticket to track: https://github.com/DAGWorks-Inc/hamilton/issues/212. Otherwise the closest example we have is unit tests, but I’ve got a basic POC for what you want below! It’s pretty easy (you’re not the first), it just turns out a lot of people want what pandera offers.
https://gist.github.com/elijahbenizzy/ea3528e0d8b08f7205b0fe441b4b9cfb
Elijah Ben Izzy
07/04/2023, 4:58 PM
Elijah Ben Izzy
07/04/2023, 4:59 PM
Amos
07/05/2023, 12:20 AM
Elijah Ben Izzy
07/05/2023, 12:28 AM
extract_fields
and have a function return a dict of dataframes. Something like (all pseudocode so don’t just copy paste 🙂 )
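import re
from collections import defaultdict
from typing import Dict

import pandas as pd
from hamilton.function_modifiers import extract_fields

def all_data() -> pd.DataFrame:
    return ...

CLASSES = {...}  # placeholder: class name -> regex for that class's column names

@extract_fields(
    {key: pd.DataFrame for key in CLASSES}
)
def grouped_by_class(all_data: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    cols = defaultdict(list)  # class name -> names of the matching columns
    for column in all_data.columns:
        for class_name, regex in CLASSES.items():
            if re.match(regex, column):
                cols[class_name].append(column)
    # One dataframe per class (empty if no column matched that class)
    return {key: all_data[cols[key]] for key in CLASSES}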
This assumes you know the classes but not the column names (which you can then declare, mess around with, and do data quality checks downstream). You will be referring to them as dataframes though.
If you know the column names then it’s simpler: you can just construct/store them externally to the function and reference them in the decorator (although it seems like that’s not the case).
Amos
07/05/2023, 12:39 AM
PPM = r"^[A-Z][a-z]*_ppm$"
PCT = r"^[A-Z][a-z]*_pct$"
PPB = r"^[A-Z][a-z]*_ppb$"
OXD = r"^[A-Z][a-z]*\d*O\d?_pct$"
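As a quick check of what these four patterns pick up, here is a minimal sketch using only the standard library. The class names, the sample column names, and the helpers `classify_columns` and `to_ppm` are hypothetical and exist only for illustration; only the regexes come from the thread. The rescaling step assumes the unit is the suffix after the last underscore.

```python
import re

# The four patterns from the message above.
PPM = r"^[A-Z][a-z]*_ppm$"
PCT = r"^[A-Z][a-z]*_pct$"
PPB = r"^[A-Z][a-z]*_ppb$"
OXD = r"^[A-Z][a-z]*\d*O\d?_pct$"

# Hypothetical class names; only the regexes come from the thread.
CLASSES = {"ppm": PPM, "pct": PCT, "ppb": PPB, "oxide_pct": OXD}

# Multipliers to ppm: 1 pct = 10,000 ppm and 1 ppb = 0.001 ppm.
TO_PPM = {"ppm": 1.0, "pct": 10_000.0, "ppb": 0.001}


def classify_columns(columns):
    """Map each class name to the column names its regex matches."""
    groups = {name: [] for name in CLASSES}
    for column in columns:
        for name, pattern in CLASSES.items():
            if re.match(pattern, column):
                groups[name].append(column)
    return groups


def to_ppm(column, value):
    """Split e.g. 'Fe_pct' into its prefix and the value rescaled to ppm."""
    prefix, unit = column.rsplit("_", 1)
    return prefix, value * TO_PPM[unit]


# Made-up column names, for illustration only.
columns = ["Au_ppb", "Cu_ppm", "Fe_pct", "Fe2O3_pct", "SiO2_pct", "sample_id"]
groups = classify_columns(columns)
```

With these made-up names, `sample_id` matches no pattern, and the oxide columns land only in `oxide_pct` because the digits and the trailing `O` keep them out of the plain `_pct` pattern. Note that `to_ppm` only rescales units; turning an oxide percentage into an element percentage would need a stoichiometric factor, which this sketch does not attempt.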
Data for a given element or compound for a given sample may be missing, in one format or in several. You may want to treat the formats separately or to convert to a single format and sum the totals. Just an example…
Elijah Ben Izzy
07/05/2023, 12:40 AM
Amos
07/05/2023, 12:44 AM
Elijah Ben Izzy
07/05/2023, 4:01 PM