# pandera-support
  • a

    average-finland-92144

    02/27/2025, 6:24 PM
    has renamed the channel from "pandera-users" to "pandera-support"
  • b

    broad-monitor-993

    02/28/2025, 1:48 AM
    Welcome @better-carpet-53403 👋
  • b

    broad-monitor-993

    03/06/2025, 1:44 PM
    Hi <!here>, just wanted to get some early feedback on this PR: https://github.com/unionai-oss/pandera/pull/1926 It basically removes the pandas and numpy dependencies from pandera so that folks who want to use it with polars or pyspark don’t have to install pandas as well. It does introduce a breaking change in the way some users might install pandera: users who counted on pandera to also install pandas (which is not recommended anyway) will have to explicitly install pandas in their environment. Would appreciate any thoughts/concerns here. The plan is to add this to the 0.24.0 release and make sure that the docs and changelog provide guidance on the breaking change.
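    For anyone planning around the breaking change described above, a sketch of the install commands it implies; the extra names are assumptions on my part, so verify them against the 0.24.0 changelog:

```shell
# Before the change, installing pandera pulled in pandas and numpy
# transitively. After it, declare pandas explicitly alongside pandera:
pip install pandera pandas

# Or, if a backend extra is published for your dataframe library
# (assumed naming; check the release notes):
pip install 'pandera[pandas]'
pip install 'pandera[polars]'   # no pandas/numpy needed for polars-only use
```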
  • b

    broad-monitor-993

    03/13/2025, 2:24 AM
    👋 welcome @helpful-lighter-24518!
  • g

    gentle-toddler-32466

    03/14/2025, 6:15 PM
    I'm digging into the test setup, trying to understand the local and GitHub Actions testing. I have a few questions:
    • Is there a developer's guide that covers testing, other than the contributing section of the docs?
    ◦ Would you like me to add one?
    • Is running make nox-tests locally intended to be equivalent to running the ci-tests on GitHub?
    ◦ It looks like this is the case, but the testing matrices in noxfile.py and tests.yaml need to be kept in sync manually.
    • Docstring tests...
    ◦ What is the proper way to run them locally? pytest --doctest-modules pandera?
    ◦ It looks like they are currently disabled in CI; the "- name: Check Docs" step is commented out.
    ◦ Is there already a mechanism in place to skip doctests for optional extras which are not installed?
    • Running make nox-tests failed when I created a new pandera-dev environment in conda with the error ValueError("No backends present, looked for ('uv',)."). Installing uv with pip install uv fixed the problem. Does it need to be added to the dev or test dependencies?
  • b

    broad-monitor-993

    03/14/2025, 9:01 PM
    > Is there a developer’s guide that covers testing, other than the contributing section of the docs?
    That’s the only guide, can you add improvements to that same doc?
    > It looks like this is the case, but testing matrices in noxfile.py and tests.yaml need to be kept in sync manually.
    Correct. Any improvements to this welcome!
    > What is the proper way to run them locally? pytest --doctest-modules pandera?
    I typically do make docs
    > It looks like they are currently disabled in CI; the "- name: Check Docs" step is commented out.
    Yeah, I forget what exact errors I saw on CI, need to uncomment those and see what happens
    > Is there already a mechanism in place to skip doctests for optional extras which are not installed?
    Nope, improvements there welcome!
    > Installing uv with pip install uv fixed the problem. Does it need to be added to the dev or test dependencies?
    Yeah, let’s add it
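    Collecting the local-testing commands discussed in this thread into one place (a sketch; the make targets and paths assume the pandera repo layout described above):

```shell
# one-time setup in a fresh conda env (nox needs the uv backend)
pip install uv nox

# run the test matrix locally (mirrors tests.yaml; kept in sync manually)
make nox-tests

# run doctests directly
pytest --doctest-modules pandera

# build the docs, which also exercises the doc examples
make docs
```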
  • q

    quick-bird-504

    03/18/2025, 1:22 PM
    Hi, I hope this is a good place to ask this question. I know that the default DateType in Pandera is datetime64, is there any way to validate the dataframe and coerce datetime64 to datetime.datetime? I need to do it because in BigQuery I will get errors because of the TIMESTAMP. I was trying the following simple solution but it is not working 🙂 any suggestions?
    import datetime

    import numpy as np
    import pandas as pd
    import pandera as pa
    from pandera import Column, Index, dtypes
    from pandera.engines import pandas_engine

    @pandas_engine.Engine.register_dtype
    @dtypes.immutable
    class PythonDatetime(pandas_engine.DateTime):
        def coerce(self, series):
            # parse to datetime64, then convert to python datetime objects
            return pd.to_datetime(series, errors='coerce').dt.to_pydatetime()

    COERCE_TO_DATETIME = pa.DataFrameSchema(
        {
            'date': Column(PythonDatetime(), nullable=True),
        },
        index=Index(int),
        strict=True,
        coerce=True,
    )
    data = pd.DataFrame({'date': [np.datetime64('2025-01-01'), np.datetime64('NaT')]})

    try:
        print('Pandera validation started')
        validated_data = COERCE_TO_DATETIME.validate(data)
        print(validated_data)
    except pa.errors.SchemaErrors as e:
        print(e)
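    A pandera-free sketch of the conversion step the coerce method above needs: datetime64 values only stay as python datetime.datetime objects if the result is kept as an object-dtype Series, otherwise pandas casts straight back to datetime64. The helper name is mine, not a pandera API:

```python
import datetime

import numpy as np
import pandas as pd

def coerce_to_pydatetime(series: pd.Series) -> pd.Series:
    # Parse anything datetime-like, turning failures into NaT.
    converted = pd.to_datetime(series, errors="coerce")
    # Rebuild as object dtype so pandas keeps datetime.datetime values
    # instead of silently casting back to datetime64.
    return pd.Series(
        [ts.to_pydatetime() if ts is not pd.NaT else None for ts in converted],
        index=series.index,
        dtype=object,
    )

s = pd.Series([np.datetime64("2025-01-01"), np.datetime64("NaT")])
out = coerce_to_pydatetime(s)
print(type(out.iloc[0]).__name__)  # datetime
```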
  • b

    broad-monitor-993

    03/20/2025, 1:22 PM
    👋 welcome @wonderful-salesclerk-43106!
  • w

    wonderful-piano-5966

    03/21/2025, 8:22 PM
    I wish this were a clearer repro case, but I’ll ask anyway. Are there any recommendations on working with optional fields in nested structs (this case uses Polars, but I’m interested in any library)? Running this repro case in the debugger, it looks like the typing.Optional on column.col2 is discarded in polars Struct, so I’m not sure there’s enough information available to Pandera to handle optional cases. I am considering using custom checks to deal with this situation if there’s not an easier option.
    import pandera.polars as pla
    import polars as pl
    from typing import Optional
    from pandera.engines.polars_engine import Struct
    
    nested_struct = {
        "col1": pl.Utf8,
        "col2": Optional[pl.Utf8]
    }
    
    top_level_struct = {
        "column": pl.Struct(nested_struct)
    }
    
    
    class ReproModel(pla.DataFrameModel):
        column: Optional[Struct] = pla.Field(
            nullable=True,
            dtype_kwargs={ "fields": nested_struct }
        )
    
    df = (
        pl.DataFrame()
        .with_columns(
            pl.struct(
                pl.lit('some string').alias("col1"),
                pl.lit('some other string').alias("col2")
            ).alias("column")
        )
    )
    
    df2 = (
        pl.DataFrame()
        .with_columns(
            pl.struct(
                pl.lit('some string').alias("col1")
            ).alias("column")
        )
    )
    
    # these both print correctly
    print(df)
    print(df2)
    
    print(top_level_struct)
    
    ReproModel.validate(df)   # ok
    ReproModel.validate(df2)  # should be ok, but throws exception
  • a

    agreeable-school-21279

    04/25/2025, 8:20 PM
    Hey @broad-monitor-993, what is the expected usage pattern for instantiating examples with Arrow datatypes? With the code:
    from pandera import Field
    from pandera.typing import Series
    import pandera as pn
    import pyarrow as pa
    
    class Position(pn.DataFrameModel):
        x: Series[pa.float32] = Field(default=0.0) # m
        y: Series[pa.float32] = Field(default=0.0) # m
        z: Series[pa.float32] = Field(default=0.0) # m
    
    Position.example(size=10)
    I am getting the following error.
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    File ~/anaconda3/lib/python3.11/site-packages/pandera/engines/pandas_engine.py:288, in Engine.numpy_dtype(cls, pandera_dtype)
        287 try:
    --> 288     return np.dtype(alias)
        289 except TypeError as err:
    
    TypeError: data type 'float[pyarrow]' not understood
    
    The above exception was the direct cause of the following exception:
    
    TypeError                                 Traceback (most recent call last)
    File ~/anaconda3/lib/python3.11/site-packages/pandera/strategies/pandas_strategies.py:350, in to_numpy_dtype(pandera_dtype)
        349 try:
    --> 350     np_dtype = pandas_engine.Engine.numpy_dtype(pandera_dtype)
        351 except TypeError as err:
    
    File ~/anaconda3/lib/python3.11/site-packages/pandera/engines/pandas_engine.py:290, in Engine.numpy_dtype(cls, pandera_dtype)
        289 except TypeError as err:
    --> 290     raise TypeError(
        291         f"Data type '{pandera_dtype}' cannot be cast to a numpy dtype."
        292     ) from err
    
    TypeError: Data type 'float[pyarrow]' cannot be cast to a numpy dtype.
    ...
        358     ) from err
        360 if np_dtype == np.dtype("object") or str(pandera_dtype) == "str":
        361     np_dtype = np.dtype(str)
    
    TypeError: Data generation for the 'float[pyarrow]' data type is currently unsupported.
  • r

    rhythmic-boots-31361

    05/01/2025, 3:32 PM
    Hi folks, I just wanted to get a sanity check on whether my planned usage of pandera makes sense. I'm a data analyst and I use python (among other tools), but others in my team are not comfortable with any kind of coding and use no-code tools like Alteryx. Currently we do basically no automated data validation.
    My plan is to make a tool where members of the team can write up a config for their data outputs (usually csv files at the moment), and my tool will suck up all the configs and perform the specified checks. The configs will basically be a yaml schema to feed into pandera, but with an extra line at the top for them to specify the path to the csv. The tool will then read the file at that path into a dataframe, read the rest of the file as a schema to pass to pandera, and validate. I'll output the results of all the validations, plus some other details like last modified, to a log. This will run once per day, and I'll pick up the log in a dashboard so the team can see at a glance whether the ETL data jobs have run, and if so, whether they produced any warnings or errors.
    Does this make sense? My goal is to make data validation as simple as possible for the less technically minded members of the team, to make it more likely that they will actually start doing some validation. I'm a bit worried about those members struggling with the yaml (particularly the significant whitespace), but that should be doable to overcome. Any feedback would be much appreciated.
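    The per-dataset config described above might look something like this. This is only a sketch: the path key is the proposed tool-specific extra line, and the remaining field names follow the general shape of pandera's yaml schema serialization, but should be checked against actual pandera.io output:

```yaml
# tool-specific extra line: where to find the data
csv_path: /shared/etl-outputs/sales.csv

# remainder: a pandera DataFrameSchema in yaml form
schema_type: dataframe
version: 0.25.0
columns:
  order_id:
    dtype: int64
    nullable: false
  start_date:
    dtype: datetime64[ns]
    nullable: true
coerce: true
strict: true
```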
  • r

    rhythmic-boots-31361

    05/01/2025, 3:46 PM
    Also, as far as I can tell from the docs there aren't built-in checks for comparing columns (like column "start" < column "end"), but I could implement these as custom checks inside my tool and then let users specify them in the yaml files?
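    A dataframe-wide custom check of that kind can be a plain function returning a boolean Series, which is the shape pandera accepts for dataframe-level checks (wrapped in pa.Check). A pandera-free sketch, with names of my own choosing:

```python
import pandas as pd

def start_before_end(df: pd.DataFrame) -> pd.Series:
    # True where the rule holds, False where it is violated;
    # pandera would flag the False rows if this were wrapped
    # as pa.Check(start_before_end) at the schema level.
    return df["start"] < df["end"]

df = pd.DataFrame({
    "start": pd.to_datetime(["2025-01-01", "2025-03-01"]),
    "end": pd.to_datetime(["2025-02-01", "2025-02-15"]),
})
result = start_before_end(df)
print(result.tolist())  # [True, False]
```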
  • p

    powerful-horse-58724

    05/28/2025, 9:19 PM
    set the channel description: Pandera is a Union.ai open source project that provides a flexible and expressive API for performing data validation on dataframe-like objects. https://github.com/unionai-oss/pandera
  • c

    cool-nest-98527

    06/25/2025, 8:00 PM
    @cool-nest-98527 has left the channel
  • b

    broad-monitor-993

    07/08/2025, 8:28 PM
    📢 Pandera v0.25.0 is out! 🔎📊✅ And just in time for Scipy 2025 🙃 The main highlight here is that Pandera now supports 🦩 Ibis table validation 🎉 What does this mean? It means that you can now perform data validation on #duckdb, #snowflake, #bigquery, #awsathena, #sqlite, #postgres, and all of the other backends that Ibis supports! Huge shoutout to @enough-evening-77193 on the herculean effort building this integration into Pandera. 📖 Docs: https://pandera.readthedocs.io/en/stable/ibis.html 📝 Full changelog: https://github.com/unionai-oss/pandera/releases/tag/v0.25.0
    ❤️ 3
    🎉 4
  • n

    nutritious-piano-11388

    07/14/2025, 3:47 PM
    @nutritious-piano-11388 has left the channel
  • a

    average-finland-92144

    08/01/2025, 3:07 PM
    x-posting: https://flyte-org.slack.com/archives/C02JMT8KTEE/p1754060814611289
    🙏 1
  • f

    few-electrician-9464

    08/01/2025, 5:02 PM
    @few-electrician-9464 has left the channel
  • v

    victorious-cpu-10033

    08/05/2025, 6:34 PM
    Hi everyone! I’m building a data validation and cleaning tool. Users upload a dataset.csv and validation_rules.csv. I tried using LLMs, but hit rate limits due to large data and free plan limits. Now I’m exploring Pandera for static validation in Python. Looking forward to hearing from you and working together friends!
    👍 1
  • b

    broad-monitor-993

    08/05/2025, 7:06 PM
    Welcome @victorious-cpu-10033! You can certainly use an LLM to auto-generate pandera schemas if you want to try that 🙂
    👍 1
  • v

    victorious-cpu-10033

    08/05/2025, 7:26 PM
    Okay @broad-monitor-993 sure I'll try this. Thanks!
  • v

    victorious-cpu-10033

    08/05/2025, 7:27 PM
    Niels, do you have any blog for reference?
  • b

    broad-monitor-993

    08/06/2025, 5:23 PM
    welcome @big-processor-87979! 👋