Pandera-Schema Model-Showing incorrect results for Integer column validations #833

learnnk · 2022-04-04T02:42:45Z

learnnk
Apr 4, 2022

Hi Team,
I'm using the Pandera Schema model to validate my file which as Integer column.

My Schema Model
class Retaildata(pa.SchemaModel):
orderid: Series[int] = pa.Field(coerce=True,nullable=False)
orderdate: Series[pa.DateTime] = pa.Field(coerce=True)
orderamount: Series[int] = pa.Field(coerce=True)
orderstatus: Series[str] = pa.Field(coerce=True, nullable=False)

Data
orderid, orderdate, orderamount, orderstatus
0 1, 2013-07-25 00:00:00.0,11599, PENDING_PAYMENT
1 -1, 2013-07-25 00:00:00.0, 256, PENDING_PAYMENT
2 abc, 2013-07-25 00:00:00.0, 12111, COMPLETE
output
schema_context column ... failure_case index
0 Column orderid ... abc 2

[1 rows x 6 columns]

To test this column

we are using the String value for one row
we are using the (Negative) value for another row

Ouput- It always shows the failure case as String value-index number(which is point 1) not showing the Index value for (point 2) from above

Can you please suggest/guide how to capture the index of the failure case for negative row which ( is index 1 from data)

cosmicBboy · 2022-04-04T02:51:07Z

cosmicBboy
Apr 4, 2022
Maintainer

Hi @learnnk, can you please provide a minimally reproducible code snippet with your (i) schema and (ii) data... basically what you've done above except in an easily copy-pasteable form (and the unexpected error/output you're seeing). This will help us help you 🤝

0 replies

learnnk · 2022-04-04T03:38:03Z

learnnk
Apr 4, 2022
Author

Thanks for your response
Here is the code snippet

import pandera as pa
import pandas as pd
from pandera.typing import Index, DataFrame, Series
from pandera import Check, Column, DataFrameSchema

class Retaildata(pa.SchemaModel):
orderid: Series[int] = pa.Field(coerce=True,nullable=False,gt=0)
orderdate: Series[pa.DateTime] = pa.Field(coerce=True)
orderamount: Series[int] = pa.Field(coerce=True)
orderstatus: Series[str] = pa.Field(coerce=True, nullable=False)

df = pd.DataFrame({
"orderid": ["1", "-1", "abc"],
"orderdate": ["2013-07-25", "2013-07-27", "2013-07-28"],
"orderamount": ["1234", "156", "365"],
"orderstatus":["PENDING_PAYMENT","COMPLETE","CLOSED"]
})

try:
#schema.validate(data_frame, lazy=True)
Retaildata.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
print(err.failure_cases, '\n')

output getting as below
schema_context column ... failure_case index
0 Column orderid ... abc 2

[1 rows x 6 columns]

As i said i'm not getting failure case for negative value in orderid for index 1

0 replies

cosmicBboy · 2022-04-04T03:40:23Z

cosmicBboy
Apr 4, 2022
Maintainer

FYI you can do triple backticks ``` to preserve indentation:

import pandera as pa
import pandas as pd
from pandera.typing import Index, DataFrame, Series
from pandera import Check, Column, DataFrameSchema

class Retaildata(pa.SchemaModel):
    orderid: Series[int] = pa.Field(coerce=True,nullable=False,gt=0)
    orderdate: Series[pa.DateTime] = pa.Field(coerce=True)
    orderamount: Series[int] = pa.Field(coerce=True)
    orderstatus: Series[str] = pa.Field(coerce=True, nullable=False)

this makes it easier to copy-paste into (in this case) my editor to help debug

0 replies

cosmicBboy · 2022-04-04T03:51:27Z

cosmicBboy
Apr 4, 2022
Maintainer

ah, you're seeing just the coercion error (failure to coerce "abc" to an int)

  schema_context   column                  check check_number failure_case  index
0         Column  orderid  coerce_dtype('int64')         None          abc      2

because pandera doesn't apply the gt=0 constraint check for the following reason: currently, failure to coerce a column to the specified dtype will short-circuit the rest of the property checks... the original reasoning for this is the following: if a column cannot even be coerced to the correct type, then it would be fair to assume that we cannot apply the check constraint (which assumes the column to be of a certain type) in the first place.

This might be too strict an assumption, but I'd be open to changing this behavior if you and others find this unintuitive.

In any case, it would be worth adding an explanation of this behavior in the lazy validation docs... https://pandera.readthedocs.io/en/stable/lazy_validation.html

I'm curious: what is your specific use case, and what are the ways you're using pandera/pandas to clean/validate data? E.g. is this a programmatic data pipeline, or are you using pandera in the context of a GUI where a human updates the state of the data with multiple rounds of validation?

0 replies

learnnk · 2022-04-04T04:13:05Z

learnnk
Apr 4, 2022
Author

Thanks for quick update
Heres is my response to your question

I'm curious: what is your specific use case, and what are the ways you're using pandera/pandas to clean/validate data? E.g. is this a programmatic data pipeline, or are you using pandera in the context of a GUI where a human updates the state of the data with multiple rounds of validation?

[learnnk]-We are using Pandera for DataQuality Validation. We built a in house framework which automatically generates the schema model based on my schema(Flat File/DB) and we will validate the schema using Pandera in built functions
so as part of this we are using to validate daily 100+ files with 200-300 millions of records with 200+ columns
Automatically these data we will feed to GUI/Generating the reports

It would be better if this use case suffice with out having the dependency of column-coerece type
I'm expecting -failure case should throw both(index 1 and index 2) for my data

Thanks!!

0 replies

cosmicBboy · 2022-04-04T04:25:52Z

cosmicBboy
Apr 4, 2022
Maintainer

cool, so if I understand correctly, this is a summary of your situation:

you have a flat file/DB with raw data represented as strings
you want to validate numeric properties of certain columns, but first you need to convert the strings to ints
you want to know (a) which values cannot be converted to ints (b) which values fulfill certain numeric properties (e.g. gt=0)

my next question is: do you care that the resulting valid data are of the expected dtypes?

0 replies

learnnk · 2022-04-06T01:10:09Z

learnnk
Apr 6, 2022
Author

Here is my response
1.you have a flat file/DB with raw data represented as strings
[learnnk]-Its my bad. My file will not have entire string data.it’s mixture of string and int data
My ask - for integer column we want to make sure it has only integer data. When we use this schema model and validated against the data
Scenario
If integer column has alphanumeric value and another value which does not fall for the given range. It throws
Error only for alphanumeric not for the range which I have given

output getting as below
schema_context column ... failure_case index
0 Column orderid ... abc 2

Thanks!

0 replies

cosmicBboy · 2022-04-06T15:09:00Z

cosmicBboy
Apr 6, 2022
Maintainer

for integer column we want to make sure it has only integer data. When we use this schema model and validated against the data
If integer column has alphanumeric value and another value which does not fall for the given range. It throws Error only for alphanumeric not for the range which I have given

Gotcha, so as I mentioned, pandera can't do both of these things at the same time for reasons stated here #821 (comment)

Recommendation 1: Custom Check with no Dtype, or `Any` dtype.

You can also forego specifying the integer datatype and create a custom check that (a) finds uncoercible values and (b) applies the range constraint validation. This would mean that you can't coerce non-integer values to integers, but you'd be able to catch all the cases in one go, for example:

class Retaildata(pa.SchemaModel):
    orderid: Series[Any] = pa.Field(nullable=False)
    orderdate: Series[pa.DateTime] = pa.Field(coerce=True)
    orderamount: Series[int] = pa.Field(coerce=True)
    orderstatus: Series[str] = pa.Field(coerce=True, nullable=False)

    @pa.check("orderid")
    def check_is_int(cls, series):
        """Check that values are integers"""
        return series.map(lambda x: isinstance(x, int))

    @pa.check("orderid")
    def check_positive(cls, series):
        """Check that integer values are positive, ignoring non-integers"""
        mask = series.map(lambda x: not isinstance(x, int))
        return (series.mask(mask) > 0) | mask

This'll give you the error report you want:

  schema_context   column           check  check_number failure_case  index
0         Column  orderid    check_is_int             0          abc      2
1         Column  orderid  check_positive             1           -1      1

However, it will not coerce the datatype of orderid.

Recommendation 2: Custom Dtype

If you care about the coercion to integers, you can also implement a custom data type that implements a coerce method to handle values of the wrong type before applying the range constriants, see here for more details on implementing custom dtypes.

I also offer consulting services for these kinds of custom use-cases, I'd be happy to work with you and your company to help you implement these solutions! Please reach out at [email protected] if you're interested.

0 replies

cosmicBboy · 2022-04-17T14:19:01Z

cosmicBboy
Apr 17, 2022
Maintainer

converting this to a discussion, @learnnk feel free to mark continue the discussion there!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pandera-Schema Model-Showing incorrect results for Integer column validations #833

{{title}}

Replies: 9 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Pandera-Schema Model-Showing incorrect results for Integer column validations #833

learnnk Apr 4, 2022

Replies: 9 comments

cosmicBboy Apr 4, 2022 Maintainer

learnnk Apr 4, 2022 Author

cosmicBboy Apr 4, 2022 Maintainer

cosmicBboy Apr 4, 2022 Maintainer

learnnk Apr 4, 2022 Author

cosmicBboy Apr 4, 2022 Maintainer

learnnk Apr 6, 2022 Author

cosmicBboy Apr 6, 2022 Maintainer

Recommendation 1: Custom Check with no Dtype, or Any dtype.

Recommendation 2: Custom Dtype

cosmicBboy Apr 17, 2022 Maintainer

learnnk
Apr 4, 2022

cosmicBboy
Apr 4, 2022
Maintainer

learnnk
Apr 4, 2022
Author

cosmicBboy
Apr 4, 2022
Maintainer

cosmicBboy
Apr 4, 2022
Maintainer

learnnk
Apr 4, 2022
Author

cosmicBboy
Apr 4, 2022
Maintainer

learnnk
Apr 6, 2022
Author

cosmicBboy
Apr 6, 2022
Maintainer

Recommendation 1: Custom Check with no Dtype, or `Any` dtype.

cosmicBboy
Apr 17, 2022
Maintainer