Type hints specified in DataFrameModel are lost in the DataFrame[MyDataFrameModel] type, causing type checkers to fail on all pandera code. #1900

Open
2 of 3 tasks
dolfandringa opened this issue Jan 23, 2025 · 4 comments
Labels
bug Something isn't working

Comments

dolfandringa (Contributor) commented Jan 23, 2025

Describe the bug

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

When creating a very basic example and using pyright for type checking in my library, type checking fails because of multiple Unknown types. I think there are two main underlying issues: some function argument types are not properly specified in pandera, and typing information is lost somewhere along the way.

When running the very basic example below, pyright reports multiple issues of two main types. Here I am reporting only one so the other can be reported in a separate issue:

  pandera_test/main.py:17:11 - error: Type of "field1" is partially unknown
    Type of "field1" is "Series[Unknown]" (reportUnknownMemberType)
  pandera_test/main.py:17:11 - error: Argument type is partially unknown
    Argument corresponds to parameter "values" in function "print"
    Argument type is "Series[Unknown]" (reportUnknownArgumentType)

Code Sample

The MRE, including dependencies and the list of issues reported for the reproduction, can be found at https://github.com/dolfandringa/pandera_test

import pandera as pa
from pandera.typing import DataFrame
from pandera.typing import Series


class MyModel(pa.DataFrameModel):
    field1: Series[int] = pa.Field(coerce=True)
    field2: Series[str] = pa.Field(coerce=True)


class MyModel2(pa.DataFrameModel):
    field1: int = pa.Field(coerce=True)
    field2: str = pa.Field(coerce=True)


def process_data(data: DataFrame[MyModel]):
    print(data.field1)
    print(data.field2)


def process_data2(data: DataFrame[MyModel2]):
    print(data.field1)
    print(data.field2)

Run
pyright main.py

Expected behavior

  • The types of data.field1 and data.field2 are correctly reported as the types specified in the DataFrameModel (Series[int] and Series[str] respectively, or even int and str) instead of Series[Unknown].

Desktop (please complete the following information):

  • OS: Fedora
  • Browser: n/a
  • Version: 40

Additional context

Workaround:
As far as I am aware, the only way to currently work around this issue (which does lose all type checking on the pandera object) is to change:

def process_data(data: DataFrame[MyModel]):
    print(data.field1)
    print(data.field2)

to

from typing import Any


def process_data(data: Any):
    print(data.field1)
    print(data.field2)

or to add type: ignore / pyright: ignore comments on every field access, which quickly becomes very cumbersome:

def process_data2(data: DataFrame[MyModel2]):
    print(
        data.field1  # pyright: ignore [reportUnknownMemberType, reportUnknownArgumentType]
    )
    print(
        data.field2  # pyright: ignore [reportUnknownMemberType, reportUnknownArgumentType]
    )

Underlying issue:

  • For some reason in the pandera.typing.DataFrame(Generic[T]) the types of the individual fields of T are lost and they are just reported as Series without any type. It doesn't matter if the type specification in the DataFrameModel specifies Series[int] (MyModel above) or directly int (MyModel2 above). The DataFrame type always reports their attributes as Series[Unknown].

Why this is relevant:

The recent (Python ^3.11) type hinting additions are a huge improvement in developer experience: they improve the IDE's ability to autocomplete code accurately, and they improve the speed and quality of development in larger applications and public libraries by preventing bugs caused by type incompatibilities.

Allowing developers to at least use static type checking in their code without having to manually add type: ignore comments everywhere they use pandera is the minimum I would expect. Specifying typing.Any instead of leaving the type unspecified (which makes type checkers report it as Unknown) would already go a long way.
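A minimal sketch of the difference between an Unknown and an explicit Any. The classes are hypothetical stand-ins, not pandera's or pandas' code, and the sketch assumes the Unknown originates from a dynamic attribute hook annotated with a bare generic; that assumption has not been verified against the library.

```python
from typing import Any, Generic, TypeVar

S = TypeVar("S")
T = TypeVar("T")


class Series(Generic[S]):
    """Hypothetical stand-in for a typed column (not pandera's Series)."""

    def __init__(self, values: list[S]) -> None:
        self.values = values


class _FrameBase(Generic[T]):
    """Hypothetical stand-in for a frame that stores its columns by name."""

    def __init__(self, columns: dict[str, Any]) -> None:
        self._columns = columns

    def _lookup(self, name: str) -> Any:
        # Read through __dict__ so a missing _columns cannot recurse
        # back into __getattr__.
        columns = self.__dict__.get("_columns", {})
        try:
            return columns[name]
        except KeyError:
            raise AttributeError(name) from None


class BareFrame(_FrameBase[T]):
    # Assumption: a bare `Series` annotation is read as Series[Unknown],
    # which strict-mode pyright is expected to flag on every column access.
    def __getattr__(self, name: str) -> Series:
        return self._lookup(name)


class AnyFrame(_FrameBase[T]):
    # An explicit Any is a deliberate opt-out rather than an Unknown.
    def __getattr__(self, name: str) -> Any:
        return self._lookup(name)


class Model:
    """Hypothetical stand-in for a DataFrameModel subclass."""

    field1: Series[int]


frame = AnyFrame[Model]({"field1": Series([1, 2, 3])})
print(frame.field1.values)
```

Both variants behave identically at runtime; only the return annotation of the hook changes what a type checker reports. The Any variant gives up checking of column types entirely, so it is the floor described above, not the goal.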

Even better would be to make use of the type hints users already provide in the DataFrameModel and make sure those are reported correctly, so type checkers can also validate whether a value from a column is compatible with a specific function argument somewhere. pydantic (which I use a lot) does this very well, so I was expecting pandera to do the same.
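As a small illustration that the declared column types already exist as ordinary class metadata, here is a sketch using only the standard library. Series and MyModel are hypothetical stand-ins for the pandera classes, and column_dtypes is an illustrative name.

```python
from typing import Generic, TypeVar, get_args, get_type_hints

S = TypeVar("S")


class Series(Generic[S]):
    """Hypothetical stand-in for pandera.typing.Series."""


class MyModel:
    """Hypothetical stand-in for a DataFrameModel subclass."""

    field1: Series[int]
    field2: Series[str]


# The annotations can be read back at runtime like any other class metadata.
hints = get_type_hints(MyModel)

# Extract the type argument of each Series[...] annotation.
column_dtypes = {name: get_args(hint)[0] for name, hint in hints.items()}
print(column_dtypes)
```

A static type checker does not execute this code, so reading the hints at runtime does not by itself change what pyright reports for an attribute access on DataFrame[MyModel]. Surfacing them statically would need some other mechanism, such as a checker plugin.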

dolfandringa added the bug (Something isn't working) label Jan 23, 2025

AndBoyS commented Jan 24, 2025

Same thing, I was very excited to use pandas dataframes as typed objects when looking into this library


AndBoyS commented Jan 26, 2025

I looked into this and made a simple example that achieves the desired behaviour with a mypy plugin:

# mypy plugin definition
from collections.abc import Callable

from mypy.plugin import FunctionContext, Plugin
from mypy.types import Instance, Type


def _transform_dataframe_type(ctx: FunctionContext) -> Type:
    frame_type = ctx.default_return_type
    if isinstance(frame_type, Instance):
        schema = frame_type.args[0]

        if isinstance(schema, Instance) and str(schema) == "src.plugin_test.Schema":
            frame_type = frame_type.copy_modified()
            for attr, symbol_table in schema.type.names.items():
                attr_type = symbol_table.type
                if attr_type is not None and str(attr_type).startswith("src.plugin_test.Series"):
                    frame_type.type.names[attr] = symbol_table
            return frame_type

    return ctx.default_return_type


class GenericTypePlugin(Plugin):
    def get_function_hook(self, fullname) -> Callable[[FunctionContext], Type] | None:
        if "dataframe" in fullname.lower():
            return _transform_dataframe_type
        return None


def plugin(version: str) -> type[GenericTypePlugin]:
    return GenericTypePlugin


# src.plugin_test
import typing

T = typing.TypeVar("T")


class Series(typing.Generic[T]):
    pass


class Schema:
    some_col: Series[str]


class DataFrame(typing.Generic[T]):
    def __getattribute__(self, name):
        return super().__getattribute__(name)


a = DataFrame[Schema]()
typing.reveal_type(a.some_col)  # Series[str] is shown

This method is not perfect because pyright won't see the attribute. I thought I could solve this the same way typing.NamedTuple does: you can write A = typing.NamedTuple("ClassName", [("some_attr", int)]) and pyright/mypy will see some_attr and its type. However, even if I copy the whole typing module and use NamedTuple from the copy, this behaviour isn't preserved, so I guess mypy/pyright hardcode it.
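For reference, a minimal runnable version of the NamedTuple functional form mentioned above. Row is a hypothetical name, chosen to match the string argument. As far as I know, mypy and pyright recognize this call through dedicated support for typing.NamedTuple rather than by evaluating it, which would be consistent with a copied module not getting the same treatment.

```python
from typing import NamedTuple

# Functional form: field names and types are passed as plain data.
Row = NamedTuple("Row", [("some_attr", int)])

row = Row(some_attr=1)
print(row.some_attr)
```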

I will check whether the logic of my example makes sense within the pandera infrastructure and try to implement it.


AndBoyS commented Jan 26, 2025

Also, a better workaround for now is to use typing.cast:

from typing import cast


def process_data(data: DataFrame[MyModel]):
    field = cast(Series[int], data.field1)
    print(field)


AndBoyS commented Feb 18, 2025

I feel like fixing this would be problematic because dataframes are usually used in a dynamic manner, with columns added and removed based on runtime values (though I guess it would be possible to handle adding and removing columns whose names are literals).
