# hamilton-help
e
Hey! So here's what I'm thinking (as we discussed a bit in DM; thanks for posting publicly). You can do this without validating the columns individually, and instead validate the dataframe as a whole. If it has to be configuration-driven, the tool for that is `resolve`.
So, specifically:
1. Pandera can create a dataframe schema covering multiple columns.
2. If you do want `extract_columns`, you can use the `target_` parameter to point the check at the dataframe.
3. `resolve` will make it config-driven. That said, if you validate a superset, you may just want optional columns and to validate all of them.
4. If you do use `resolve`, I'd consider wrapping it in your own decorator that delegates to `resolve` (really just a function that calls `resolve`).
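To make (1) concrete, here's a minimal sketch of a multi-column Pandera schema; the column names and types are hypothetical, not from this thread:
```python
import pandera as pa

# One schema covering several columns at once; optional columns
# use required=False (see point 3 above).
schema = pa.DataFrameSchema(
    {
        "a": pa.Column(int),
        "b": pa.Column(float, nullable=True),
        "c": pa.Column(str, required=False),  # optional column
    }
)
```
OK, here's what (2) looks like: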
```python
import pandas as pd

from hamilton.function_modifiers import check_output, extract_columns

@check_output(schema=..., target_="foo")  # target_="foo" checks the dataframe itself, not the extracted columns
@extract_columns(...)
def foo() -> pd.DataFrame:
    ...
```
With `resolve`, you would do something like:
```python
from hamilton.function_modifiers import resolve, ResolveAt

@resolve(
    when=ResolveAt.CONFIG_AVAILABLE,
    resolve=lambda columns_to_resolve: check_output(..., target_="foo"),
)
def foo() -> pd.DataFrame:
    ...
```
Then, if you wanted to, you could define a custom decorator:
```python
@check_columns(columns=..., target_="foo")  # check_columns is your own decorator
def foo() -> pd.DataFrame:
    ...
```
This would just delegate to `resolve`.
a
Yeah, thanks for keeping the conversation going. I want to validate the dataframe all at once using a schema with multiple columns and `strict="filter"`, so anything not recognised (required or optional) gets dropped at that point (easy to rejoin the cruft later if necessary). The issue is that the names of a subset of the required columns can't be known in advance, so I need to pass them in and adjust the schema to pick them up. I think the solution is something like `resolve` with a custom decorator, as you suggest. Just have to wrap my head around how to do that. Maybe a coffee first.
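For context, `strict="filter"` tells Pandera to drop any columns not declared in the schema rather than raising. A minimal sketch of that behaviour (hypothetical column names, not from this thread):
```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "keep_me": pa.Column(int),
        "maybe": pa.Column(float, required=False),
    },
    strict="filter",  # undeclared columns are dropped, not rejected
)

df = pd.DataFrame({"keep_me": [1, 2], "cruft": ["x", "y"]})
validated = schema.validate(df)  # only "keep_me" survives
```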
e
Yeah! I think that’s a clean idea. It would look something like:
```python
from hamilton.function_modifiers import check_output, resolve, ResolveAt

def check_columns_with_my_schema(df_name: str):  # maybe take in other params?
    return resolve(
        when=ResolveAt.CONFIG_AVAILABLE,
        # `columns` is supplied through the driver config at DAG-construction time
        resolve=lambda columns: check_output(
            schema=_build_schema_from_columns_and_other_params(columns),  # implement this
            target_=df_name,
        ),
    )
```
Then you just have to pass in the `columns` param at configuration time.
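From the driver side, passing `columns` at configuration time would look roughly like this (module and column names are made up for illustration):
```python
from hamilton import driver

import my_dataflow_module  # hypothetical module with the decorated functions

# `resolve` pulls `columns` out of the config by parameter name when
# the DAG is built. Newer Hamilton versions also require
# {"hamilton.enable_power_user_mode": True} in the config to use @resolve.
dr = driver.Driver(
    {"columns": ["dynamic_col_1", "dynamic_col_2"]},
    my_dataflow_module,
)
```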
a
Nice, a concrete version of the vague concept in my head. Thanks. Will give it a go.
e
Yep! Note that one thing we don't do now (but could potentially) is wire other parameters through to custom result checkers, e.g. the outputs of other nodes. That way, if it's determined at runtime (as opposed to DAG-creation time), you could pass in, say, the columns from another dataframe. See https://github.com/DAGWorks-Inc/hamilton/issues/190; not implemented yet, but definitely doable.