# hamilton-help
a
P.S. As a subsidiary Q, is there a flavour of `@extract*` that pulls sets of columns using regex? I guess it isn't strictly necessary but might be nice to have.
e
Happy 4th! OK, so, we need to add examples for `check_output_custom`. Opened up this ticket to track: https://github.com/DAGWorks-Inc/hamilton/issues/212. Otherwise the closest example we have is unit tests, but I’ve got a basic POC for what you want below! It’s pretty easy (you’re not the first); it just turns out a lot of people want what pandera offers. https://gist.github.com/elijahbenizzy/ea3528e0d8b08f7205b0fe441b4b9cfb
Now, the even easier way is to just use pandera custom checkers — it should be able to do exactly what we’re doing here 🙂 https://pandera.readthedocs.io/en/stable/extensions.html. Then you don’t have to mess around with custom data validators at all.
Re: regex, we don’t have that — it’s a little confusing to design. What would it look like ideally/what’s the use-case? The difficult part is referring to them downstream…
a
Thanks for the prompt reply @Elijah Ben Izzy. Your POC is helpful. I'll try Pandera custom checks first — may have missed a trick there. For the regex, the immediate use-case is wanting to operate differently on subsets of columns whose names follow particular patterns. For instance, chemical data may be supplied in various units and as compounds that need deconvolving via stoichiometric conversion.
e
Ahh got it. Do you know the column sets beforehand? One trick might be to use `extract_fields` and have a function return a dict of dataframes. Something like (all pseudocode so don’t just copy paste 🙂 )
import re
from collections import defaultdict
from typing import Dict

import pandas as pd

from hamilton.function_modifiers import extract_fields


def all_data() -> pd.DataFrame:
    return ...

CLASSES = {...}

@extract_fields(
    {key: pd.DataFrame for key in CLASSES}
)
def grouped_by_class(all_data: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    # Bucket column names by every class regex they match
    cols = defaultdict(list)
    for column in all_data.columns:
        for class_name, regex in CLASSES.items():
            if re.match(regex, column):
                cols[class_name].append(column)
    # Slice the original frame into one dataframe per class
    return {key: all_data[columns] for key, columns in cols.items()}
This assumes you know the classes but not the column names (which you can then declare, mess around with, and run data quality checks on downstream). You will be referring to them as dataframes, though. If you know the column names then it’s simpler — you can just construct/store them externally to the function and reference them in the decorator (although it seems like that’s not the case here).
a
Generally you know the form but not the membership beforehand. A super simple example might be:
PPM = r"^[A-Z][a-z]*_ppm$"
PCT = r"^[A-Z][a-z]*_pct$"
PPB = r"^[A-Z][a-z]*_ppb$"
OXD = r"^[A-Z][a-z]*\d*O\d?_pct$"
Data for a given element or compound for a given sample may be missing, in one format or in several. You may want to treat the formats separately or to convert to a single format and sum the totals. Just an example…
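For what it’s worth, a quick sketch of how those patterns would bucket column names (the column names are invented for illustration):

```python
import re

# The regex classes from above
CLASSES = {
    "ppm": r"^[A-Z][a-z]*_ppm$",
    "pct": r"^[A-Z][a-z]*_pct$",
    "ppb": r"^[A-Z][a-z]*_ppb$",
    "oxd": r"^[A-Z][a-z]*\d*O\d?_pct$",
}


def classify(columns):
    """Map each class name to the columns matching its regex."""
    return {
        name: [c for c in columns if re.match(pattern, c)]
        for name, pattern in CLASSES.items()
    }


# Hypothetical column names: Al2O3_pct lands in the oxide class,
# sample_id matches nothing.
print(classify(["Fe_ppm", "Cu_pct", "Zn_ppb", "Al2O3_pct", "sample_id"]))
```

Note a column could in principle match more than one class, so the patterns need to be mutually exclusive (as these are) or you need a tie-breaking rule.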
e
Yeah, so I think this is a good case for grouping into dataframes? One node per dataframe, then you can treat each one how you want?
a
Yes, manipulate them across a common spine and then rejoin later. I'll play with your example. I guess what I had in mind was some subtle further abstraction along the lines of named regex classes supplied to a decorator automatically becoming nodes that could be referred to directly in subsequent transforms without needing to first unpack a dictionary. But perhaps that's unnecessary.
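A minimal pandas sketch of that split/rejoin-on-a-common-spine pattern (the column names and the ppm→pct conversion are just illustrative):

```python
import pandas as pd

# Sample ids act as the common spine (the index)
df = pd.DataFrame(
    {"Fe_ppm": [100.0, 200.0], "Cu_pct": [0.5, 0.7]},
    index=["s1", "s2"],
)

# Manipulate each subset separately: convert ppm -> pct (1 pct = 10,000 ppm)
as_pct = (df[["Fe_ppm"]] / 1e4).rename(columns={"Fe_ppm": "Fe_pct"})

# Then rejoin on the shared index
rejoined = as_pct.join(df[["Cu_pct"]])
```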
e
Hmm, so currently we don’t have the ability to produce them dynamically (although that’s something we are thinking through, see https://github.com/DAGWorks-Inc/hamilton/issues/49), but partially that’s by design. It starts getting confusing if you need to refer to things that might or might not exist depending on what was produced upstream. If you know the superset of possible nodes, though, you could wrap the decorator in one that searches through the list of possible columns and creates a node per matching regex class…
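A stdlib-only sketch of that “known superset” idea (the possible columns and class names are hypothetical; in real code you’d turn the result into a `{name: pd.DataFrame}` dict to feed `@extract_fields`):

```python
import re

# Hypothetical superset of columns that could appear upstream
POSSIBLE = ["Fe_ppm", "Fe_pct", "Cu_ppm", "Al2O3_pct", "SiO2_pct"]

CLASSES = {
    "ppm": r"^[A-Z][a-z]*_ppm$",
    "pct": r"^[A-Z][a-z]*_pct$",
    "oxd": r"^[A-Z][a-z]*\d*O\d?_pct$",
}


def extractable_fields(possible, classes):
    """Class names with at least one matching possible column --
    the keys a wrapping decorator would create nodes for."""
    return {
        name for name, pattern in classes.items()
        if any(re.match(pattern, col) for col in possible)
    }
```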
👍 1