# hamilton-help
e
Hey! So here's what I'm thinking (as we discussed a bit in DM; thanks for posting publicly). You can do this without validating the columns individually, and instead validate the dataframe as a whole. If it has to be configuration-driven, the tool for that is `resolve`.
So, specifically:
1. Pandera can create a dataframe schema covering multiple columns.
2. If you do want `extract_columns`, you can use the `target_` parameter to point the check at the dataframe.
3. `resolve` will make it config-driven. That said, if you validate a superset, you may just want optional columns and to validate all of them.
4. If you do use `resolve`, I'd consider wrapping it in your own decorator that delegates to `resolve` (really just a function that calls `resolve`).
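To make (1) concrete, here's a minimal sketch of a multi-column Pandera schema; the column names and types are hypothetical, not from this thread:
```python
import pandera as pa

# One schema covering several columns at once; optional columns
# use required=False (see point 3 above).
schema = pa.DataFrameSchema(
    {
        "a": pa.Column(int),
        "b": pa.Column(float, nullable=True),
        "c": pa.Column(str, required=False),  # optional column
    }
)
```
OK, here's what (2) looks like: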
```python
import pandas as pd

from hamilton.function_modifiers import check_output, extract_columns

@check_output(schema=..., target_="foo")  # target_="foo" checks the dataframe itself, not the extracted columns
@extract_columns(...)
def foo() -> pd.DataFrame:
    ...
```
With `resolve`, you would do something like:
```python
from hamilton.function_modifiers import resolve, ResolveAt

@resolve(
    when=ResolveAt.CONFIG_AVAILABLE,
    resolve=lambda columns_to_resolve: check_output(..., target_="foo"),
)
def foo() -> pd.DataFrame:
    ...
```
Then, if you wanted to, you could define a custom decorator:
```python
@check_columns(columns=..., target_="foo")  # check_columns is your own decorator
def foo() -> pd.DataFrame:
    ...
```
This would just delegate to `resolve`.
a
Yeah, thanks for keeping the conversation going. I want to validate the dataframe all at once using a schema with multiple columns and `strict="filter"`, so anything not recognised (required or optional) gets dropped at that point (easy to rejoin the cruft later if necessary). The issue is that the names of a subset of the required columns can't be known in advance, so I need to pass them in and adjust the schema to pick them up. I think the solution is something like `resolve` with a custom decorator, as you suggest. Just have to wrap my head around how to do that. Maybe a coffee first.
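For context, `strict="filter"` tells Pandera to drop any columns not declared in the schema rather than raising. A minimal sketch of that behaviour (hypothetical column names, not from this thread):
```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "keep_me": pa.Column(int),
        "maybe": pa.Column(float, required=False),
    },
    strict="filter",  # undeclared columns are dropped, not rejected
)

df = pd.DataFrame({"keep_me": [1, 2], "cruft": ["x", "y"]})
validated = schema.validate(df)  # only "keep_me" survives
```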
e
Yeah! I think that’s a clean idea. It would look something like:
```python
from hamilton.function_modifiers import check_output, resolve, ResolveAt

def check_columns_with_my_schema(df_name: str):  # maybe take in other params?
    return resolve(
        when=ResolveAt.CONFIG_AVAILABLE,
        # `columns` is supplied through the driver config at DAG-construction time
        resolve=lambda columns: check_output(
            schema=_build_schema_from_columns_and_other_params(columns),  # implement this
            target_=df_name,
        ),
    )
```
Then you just have to pass in the `columns` param at configuration time.
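From the driver side, passing `columns` at configuration time would look roughly like this (module and column names are made up for illustration):
```python
from hamilton import driver

import my_dataflow_module  # hypothetical module with the decorated functions

# `resolve` pulls `columns` out of the config by parameter name when
# the DAG is built. Newer Hamilton versions also require
# {"hamilton.enable_power_user_mode": True} in the config to use @resolve.
dr = driver.Driver(
    {"columns": ["dynamic_col_1", "dynamic_col_2"]},
    my_dataflow_module,
)
```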
a
Nice, a concrete version of the vague concept in my head. Thanks. Will give it a go.
e
Yep! Note that one thing we don't do now (but could potentially) is wire other parameters through to custom result checkers, e.g. the outputs of other nodes. That way, if it's determined at runtime (as opposed to DAG-creation time), you could pass in, say, the columns from another dataframe. See https://github.com/DAGWorks-Inc/hamilton/issues/190; not implemented yet, but definitely doable.