# pandera-support
  • a

    average-finland-92144

    02/27/2025, 6:24 PM
    has renamed the channel from "pandera-users" to "pandera-support"
  • b

    broad-monitor-993

    02/28/2025, 1:48 AM
    Welcome @better-carpet-53403 👋
  • b

    broad-monitor-993

    03/06/2025, 1:44 PM
    Hi <!here>, just wanted to get some early feedback on this PR: https://github.com/unionai-oss/pandera/pull/1926 It basically removes the pandas and numpy dependencies from pandera so that folks who want to use it with polars or pyspark don’t have to install pandas as well. It does introduce a breaking change in the way some users might install pandera: users who counted on pandera to also install pandas (which is not recommended anyway) will have to explicitly install pandas in their environment. Would appreciate any thoughts/concerns here. The plan is to add this to the 0.24.0 release and make sure that the docs and changelog provide guidance on the breaking change.
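    For anyone planning around the breaking change described above, a sketch of the install commands it implies; the extra names are assumptions on my part, so verify them against the 0.24.0 changelog:

```shell
# Before the change, installing pandera pulled in pandas and numpy
# transitively. After it, declare pandas explicitly alongside pandera:
pip install pandera pandas

# Or, if a backend extra is published for your dataframe library
# (assumed naming; check the release notes):
pip install 'pandera[pandas]'
pip install 'pandera[polars]'   # no pandas/numpy needed for polars-only use
```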
  • b

    broad-monitor-993

    03/13/2025, 2:24 AM
    👋 welcome @helpful-lighter-24518!
  • g

    gentle-toddler-32466

    03/14/2025, 6:15 PM
    I'm digging into the test setup, trying to understand the local and GitHub Actions testing. I have a few questions:
    • Is there a developer's guide that covers testing, other than the contributing section of the docs?
    ◦ Would you like me to add one?
    • Is running make nox-tests locally intended to be equivalent to running the ci-tests on GitHub?
    ◦ It looks like this is the case, but the testing matrices in noxfile.py and tests.yaml need to be kept in sync manually.
    • Docstring tests...
    ◦ What is the proper way to run them locally? pytest --doctest-modules pandera?
    ◦ It looks like they are currently disabled in CI; the "- name: Check Docs" step is commented out.
    ◦ Is there already a mechanism in place to skip doctests for optional extras which are not installed?
    • Running make nox-tests failed when I created a new pandera-dev environment in conda with the error ValueError("No backends present, looked for ('uv',)."). Installing uv with pip install uv fixed the problem. Does it need to be added to the dev or test dependencies?
  • b

    broad-monitor-993

    03/14/2025, 9:01 PM
    > Is there a developer’s guide that covers testing, other than the contributing section of the docs?
    That’s the only guide, can you add improvements to that same doc?
    > It looks like this is the case, but testing matrices in noxfile.py and tests.yaml need to be kept in sync manually.
    Correct. Any improvements to this welcome!
    > What is the proper way to run them locally? pytest --doctest-modules pandera?
    I typically do make docs
    > It looks like they are currently disabled in CI; the "- name: Check Docs" step is commented out.
    Yeah, I forget what exact errors I saw on CI, need to uncomment those and see what happens
    > Is there already a mechanism in place to skip doctests for optional extras which are not installed?
    Nope, improvements there welcome!
    > Installing uv with pip install uv fixed the problem. Does it need to be added to the dev or test dependencies?
    Yeah, let’s add it
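    Collecting the local-testing commands discussed in this thread into one place (a sketch; the make targets and paths assume the pandera repo layout described above):

```shell
# one-time setup in a fresh conda env (nox needs the uv backend)
pip install uv nox

# run the test matrix locally (mirrors tests.yaml; kept in sync manually)
make nox-tests

# run doctests directly
pytest --doctest-modules pandera

# build the docs, which also exercises the doc examples
make docs
```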
  • q

    quick-bird-504

    03/18/2025, 1:22 PM
    Hi, I hope this is a good place to ask this question. I know that the default DateType in Pandera is datetime64, is there any way to validate the dataframe and coerce datetime64 to datetime.datetime? I need to do it because in BigQuery I will get errors because of the TIMESTAMP. I was trying the following simple solution but it is not working 🙂 any suggestions?
    import datetime

    import numpy as np
    import pandas as pd
    import pandera as pa
    from pandera import Column, Index, dtypes
    from pandera.engines import pandas_engine

    @pandas_engine.Engine.register_dtype
    @dtypes.immutable
    class PythonDatetime(pandas_engine.DateTime):
        def coerce(self, series):
            # parse to datetime64, then convert to python datetime objects
            return pd.to_datetime(series, errors='coerce').dt.to_pydatetime()

    COERCE_TO_DATETIME = pa.DataFrameSchema(
        {
            'date': Column(PythonDatetime(), nullable=True),
        },
        index=Index(int),
        strict=True,
        coerce=True,
    )
    data = pd.DataFrame({'date': [np.datetime64('2025-01-01'), np.datetime64('NaT')]})

    try:
        print('Pandera validation started')
        validated_data = COERCE_TO_DATETIME.validate(data)
        print(validated_data)
    except pa.errors.SchemaErrors as e:
        print(e)
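    A pandera-free sketch of the conversion step the coerce method above needs: datetime64 values only stay as python datetime.datetime objects if the result is kept as an object-dtype Series, otherwise pandas casts straight back to datetime64. The helper name is mine, not a pandera API:

```python
import datetime

import numpy as np
import pandas as pd

def coerce_to_pydatetime(series: pd.Series) -> pd.Series:
    # Parse anything datetime-like, turning failures into NaT.
    converted = pd.to_datetime(series, errors="coerce")
    # Rebuild as object dtype so pandas keeps datetime.datetime values
    # instead of silently casting back to datetime64.
    return pd.Series(
        [ts.to_pydatetime() if ts is not pd.NaT else None for ts in converted],
        index=series.index,
        dtype=object,
    )

s = pd.Series([np.datetime64("2025-01-01"), np.datetime64("NaT")])
out = coerce_to_pydatetime(s)
print(type(out.iloc[0]).__name__)  # datetime
```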
  • b

    broad-monitor-993

    03/20/2025, 1:22 PM
    👋 welcome @wonderful-salesclerk-43106!
  • w

    wonderful-piano-5966

    03/21/2025, 8:22 PM
    I wish this were a clearer repro case, but I’ll ask anyway. Are there any recommendations on working with optional fields in nested structs (this case uses Polars, but I’m interested in any library)? Running this repro case in the debugger, it looks like the typing.Optional on column.col2 is discarded in polars Struct, so I’m not sure there’s enough information available to Pandera to handle optional cases. I am considering using custom checks to deal with this situation if there’s not an easier option.
    import pandera.polars as pla
    import polars as pl
    from typing import Optional
    from pandera.engines.polars_engine import Struct
    
    nested_struct = {
        "col1": pl.Utf8,
        "col2": Optional[pl.Utf8]
    }
    
    top_level_struct = {
        "column": pl.Struct(nested_struct)
    }
    
    
    class ReproModel(pla.DataFrameModel):
        column: Optional[Struct] = pla.Field(
            nullable=True,
            dtype_kwargs={ "fields": nested_struct }
        )
    
    df = (
        pl.DataFrame()
        .with_columns(
            pl.struct(
                pl.lit('some string').alias("col1"),
                pl.lit('some other string').alias("col2")
            ).alias("column")
        )
    )
    
    df2 = (
        pl.DataFrame()
        .with_columns(
            pl.struct(
                pl.lit('some string').alias("col1")
            ).alias("column")
        )
    )
    
    # these both print correctly
    print(df)
    print(df2)
    
    print(top_level_struct)
    
    ReproModel.validate(df)   # ok
    ReproModel.validate(df2)  # should be ok, but throws exception
  • a

    agreeable-school-21279

    04/25/2025, 8:20 PM
    Hey @broad-monitor-993, what is the expected usage pattern for instantiating examples with Arrow datatypes? With the code:
    from pandera import Field
    from pandera.typing import Series
    import pandera as pn
    import pyarrow as pa
    
    class Position(pn.DataFrameModel):
        x: Series[pa.float32] = Field(default=0.0) # m
        y: Series[pa.float32] = Field(default=0.0) # m
        z: Series[pa.float32] = Field(default=0.0) # m
    
    Position.example(size=10)
    I am getting the following error.
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    File ~/anaconda3/lib/python3.11/site-packages/pandera/engines/pandas_engine.py:288, in Engine.numpy_dtype(cls, pandera_dtype)
        287 try:
    --> 288     return np.dtype(alias)
        289 except TypeError as err:
    
    TypeError: data type 'float[pyarrow]' not understood
    
    The above exception was the direct cause of the following exception:
    
    TypeError                                 Traceback (most recent call last)
    File ~/anaconda3/lib/python3.11/site-packages/pandera/strategies/pandas_strategies.py:350, in to_numpy_dtype(pandera_dtype)
        349 try:
    --> 350     np_dtype = pandas_engine.Engine.numpy_dtype(pandera_dtype)
        351 except TypeError as err:
    
    File ~/anaconda3/lib/python3.11/site-packages/pandera/engines/pandas_engine.py:290, in Engine.numpy_dtype(cls, pandera_dtype)
        289 except TypeError as err:
    --> 290     raise TypeError(
        291         f"Data type '{pandera_dtype}' cannot be cast to a numpy dtype."
        292     ) from err
    
    TypeError: Data type 'float[pyarrow]' cannot be cast to a numpy dtype.
    ...
        358     ) from err
        360 if np_dtype == np.dtype("object") or str(pandera_dtype) == "str":
        361     np_dtype = np.dtype(str)
    
    TypeError: Data generation for the 'float[pyarrow]' data type is currently unsupported.
  • r

    rhythmic-boots-31361

    05/01/2025, 3:32 PM
    Hi folks, I just wanted to get a sanity check on whether my planned usage of pandera makes sense. I'm a data analyst and I use python (among other tools), but others in my team are not comfortable with any kind of coding and use no-code tools like Alteryx. Currently we do basically no automated data validation.
    My plan is to make a tool where members of the team can write up a config for their data outputs (usually csv files at the moment), and my tool will suck up all the configs and perform the specified checks. The configs will basically be a yaml schema to feed into pandera, but with an extra line at the top for them to specify the path to the csv. The tool will then read the file at that path into a dataframe, read the rest of the file as a schema to pass to pandera, and validate. I'll output the results of all the validations, plus some other details like last modified, to a log. This will run once per day, and I'll pick up the log in a dashboard so the team can see at a glance whether the ETL data jobs have run, and if so, whether they produced any warnings or errors.
    Does this make sense? My goal is to make data validation as simple as possible for the less technically minded members of the team, to make it more likely that they will actually start doing some validation. I'm a bit worried about those members struggling with the yaml (particularly the significant whitespace), but that should be doable to overcome. Any feedback would be much appreciated.
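    The per-dataset config described above might look something like this. This is only a sketch: the path key is the proposed tool-specific extra line, and the remaining field names follow the general shape of pandera's yaml schema serialization, but should be checked against actual pandera.io output:

```yaml
# tool-specific extra line: where to find the data
csv_path: /shared/etl-outputs/sales.csv

# remainder: a pandera DataFrameSchema in yaml form
schema_type: dataframe
version: 0.25.0
columns:
  order_id:
    dtype: int64
    nullable: false
  start_date:
    dtype: datetime64[ns]
    nullable: true
coerce: true
strict: true
```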
  • r

    rhythmic-boots-31361

    05/01/2025, 3:46 PM
    Also, as far as I can tell from the docs there aren't built-in checks for comparing columns (like column "start" < column "end"), but I could implement these as custom checks inside my tool and then let users specify them in the yaml files?
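    A dataframe-wide custom check of that kind can be a plain function returning a boolean Series, which is the shape pandera accepts for dataframe-level checks (wrapped in pa.Check). A pandera-free sketch, with names of my own choosing:

```python
import pandas as pd

def start_before_end(df: pd.DataFrame) -> pd.Series:
    # True where the rule holds, False where it is violated;
    # pandera would flag the False rows if this were wrapped
    # as pa.Check(start_before_end) at the schema level.
    return df["start"] < df["end"]

df = pd.DataFrame({
    "start": pd.to_datetime(["2025-01-01", "2025-03-01"]),
    "end": pd.to_datetime(["2025-02-01", "2025-02-15"]),
})
result = start_before_end(df)
print(result.tolist())  # [True, False]
```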
  • p

    powerful-horse-58724

    05/28/2025, 9:19 PM
    set the channel description: Pandera is a Union.ai open source project that provides a flexible and expressive API for performing data validation on dataframe-like objects. https://github.com/unionai-oss/pandera
  • c

    cool-nest-98527

    06/25/2025, 8:00 PM
    @cool-nest-98527 has left the channel
  • b

    broad-monitor-993

    07/08/2025, 8:28 PM
    📢 Pandera v0.25.0 is out! 🔎📊✅ And just in time for Scipy 2025 🙃 The main highlight here is that Pandera now supports 🦩 Ibis table validation 🎉 What does this mean? It means that you can now perform data validation on #duckdb, #snowflake, #bigquery, #awsathena, #sqlite, #postgres, and all of the other backends that Ibis supports! Huge shoutout to @enough-evening-77193 on the herculean effort building this integration into Pandera. 📖 Docs: https://pandera.readthedocs.io/en/stable/ibis.html 📝 Full changelog: https://github.com/unionai-oss/pandera/releases/tag/v0.25.0
    ❤️ 3
    🎉 4
  • n

    nutritious-piano-11388

    07/14/2025, 3:47 PM
    @nutritious-piano-11388 has left the channel
  • a

    average-finland-92144

    08/01/2025, 3:07 PM
    x-posting: https://flyte-org.slack.com/archives/C02JMT8KTEE/p1754060814611289
    🙏 1
  • f

    few-electrician-9464

    08/01/2025, 5:02 PM
    @few-electrician-9464 has left the channel
  • v

    victorious-cpu-10033

    08/05/2025, 6:34 PM
    Hi everyone! I’m building a data validation and cleaning tool. Users upload a dataset.csv and validation_rules.csv. I tried using LLMs, but hit rate limits due to large data and free plan limits. Now I’m exploring Pandera for static validation in Python. Looking forward to hearing from you and working together friends!
    👍 1
  • b

    broad-monitor-993

    08/05/2025, 7:06 PM
    Welcome @victorious-cpu-10033! You can certainly use an LLM to auto-generate pandera schemas if you want to try that 🙂
    👍 1
  • v

    victorious-cpu-10033

    08/05/2025, 7:26 PM
    Okay @broad-monitor-993 sure I'll try this. Thanks!
  • v

    victorious-cpu-10033

    08/05/2025, 7:27 PM
    Niels, do you have any blog for reference?
  • b

    broad-monitor-993

    08/06/2025, 5:23 PM
    welcome @big-processor-87979! 👋