Type hints specified in DataFrameModel are lost in the DataFrame[MyDataFrameModel]
type, causing type checkers to fail on all pandera code.
#1900
Comments
Same thing. I was very excited to use pandas dataframes as typed objects when looking into this library.
I looked into this and made a simple example that achieves the wanted behaviour with a mypy plugin:

# mypy plugin definition
from collections.abc import Callable

from mypy.plugin import FunctionContext, Plugin
from mypy.types import Instance, Type


def _transform_dataframe_type(ctx: FunctionContext) -> Type:
    frame_type = ctx.default_return_type
    if isinstance(frame_type, Instance):
        # The first type argument of DataFrame[...] is the schema class.
        schema = frame_type.args[0]
        if isinstance(schema, Instance) and str(schema) == "src.plugin_test.Schema":
            frame_type = frame_type.copy_modified()
            # Copy every Series-typed attribute of the schema onto the frame type.
            for attr, symbol_table in schema.type.names.items():
                attr_type = symbol_table.type
                if attr_type is not None and str(attr_type).startswith("src.plugin_test.Series"):
                    frame_type.type.names[attr] = symbol_table
            return frame_type
    return ctx.default_return_type


class GenericTypePlugin(Plugin):
    def get_function_hook(self, fullname) -> Callable[[FunctionContext], Type] | None:
        if "dataframe" in fullname.lower():
            return _transform_dataframe_type
        return None


def plugin(version: str) -> type[GenericTypePlugin]:
    return GenericTypePlugin

# src.plugin_test
import typing

T = typing.TypeVar("T")


class Series(typing.Generic[T]):
    pass


class Schema:
    some_col: Series[str]


class DataFrame(typing.Generic[T]):
    def __getattribute__(self, name):
        return super().__getattribute__(name)


a = DataFrame[Schema]()
typing.reveal_type(a.some_col)  # Series[str] is shown

This method is not perfect, because pyright won't see this attribute. I had thought I would be able to solve this the same way. I will check whether the logic of my example makes sense within the pandera infrastructure and try to implement it.
Also, a better workaround for now is to use cast:

from typing import cast


def process_data(data: DataFrame[MyModel]):
    field = cast(Series[int], data.field1)
    print(field)
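For anyone who wants to try the cast workaround without installing pandera, here is a minimal sketch. The `Series`, `DataFrame`, and `MyModel` classes below are hypothetical stand-ins written for this example, not pandera's real classes, and the `-> int` return value is my own addition so the result can be checked.

```python
from typing import Any, Generic, TypeVar, cast

T = TypeVar("T")


class Series(Generic[T]):
    """Hypothetical stand-in for a typed column (not pandera's Series)."""

    def __init__(self, values: list[T]) -> None:
        self.values = values


class MyModel:
    """Hypothetical stand-in for a DataFrameModel subclass."""

    field1: Series[int]


class DataFrame(Generic[T]):
    """Hypothetical stand-in for a frame that is generic over its model."""

    def __init__(self, **columns: Series[Any]) -> None:
        self._columns = columns

    def __getattr__(self, name: str) -> Any:
        # Only called when normal attribute lookup fails.
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name) from None


def process_data(data: DataFrame[MyModel]) -> int:
    # cast() returns its argument unchanged at runtime; it only tells the
    # type checker to treat the value as Series[int] from here on.
    field = cast(Series[int], data.field1)
    return sum(field.values)


result = process_data(DataFrame(field1=Series([1, 2, 3])))
```

Keep in mind that `cast` is never verified, so if the model annotation and the cast ever disagree, the type checker will not notice.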
I feel like fixing this would be problematic because dataframes are usually used in a dynamic manner, adding and removing columns with runtime values (I guess it would be possible to handle adding and removing literal columns).
Describe the bug
When creating a very basic example and using pyright for type checking in my library, type checking fails due to multiple Unknown data types. I think there are two main issues underlying this: some function argument types are not properly specified in pandera, and for some reason typing information is lost somewhere along the way.
When running the very basic example below, pyright reports multiple issues of two main types. Here I am reporting only one so the other can be reported in a separate issue:
Code Sample
The MRE, including dependencies and the list of issues in the reproduction, can be found at https://github.com/dolfandringa/pandera_test
Run
pyright main.py
Expected behavior
The types of data.field1 and data.field2 are correctly reported as the types specified in the DataFrameModel (Series[int] and Series[str] respectively, or even int and str) instead of Series[Unknown].
Desktop (please complete the following information):
Additional context
Workaround:
As far as I am aware, the only ways to currently work around this issue are to change (which does lose all type checking on the pandera object):
to
or to add type:ignore[] statements on every field, which becomes very cumbersome quickly:
Underlying issue:
In pandera.typing.DataFrame(Generic[T]), the types of the individual fields of T are lost and they are just reported as Series without any type. It doesn't matter if the type specification in the DataFrameModel specifies Series[int] (MyModel above) or directly int (MyModel2 above). The DataFrame type always reports their attributes as Series[Unknown].
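To make the description above concrete, here is a rough sketch of one way this kind of loss can happen. It is an assumption on my part, not pandera's actual implementation: all classes are hypothetical stand-ins, and the bare `Series` return annotation on `__getattr__` is only my guess at the sort of declaration that produces this result.

```python
from typing import Any, Generic, TypeVar

T = TypeVar("T")


class Series(Generic[T]):
    """Hypothetical stand-in for a typed column."""

    def __init__(self, values: list[T]) -> None:
        self.values = values


class MyModel:
    """Hypothetical stand-in for a DataFrameModel subclass."""

    field1: Series[int]
    field2: Series[str]


class DataFrame(Generic[T]):
    """Hypothetical stand-in for a frame that is generic over its model."""

    def __init__(self, columns: dict[str, Series[Any]]) -> None:
        self._columns = columns

    # Attribute lookup is resolved against DataFrame itself. T is only a type
    # argument, so the annotations on MyModel play no part in typing
    # `data.field1`; the checker only sees this unparameterized return type.
    def __getattr__(self, name: str) -> Series:  # deliberately bare Series
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name) from None


data: DataFrame[MyModel] = DataFrame(
    {"field1": Series([1, 2]), "field2": Series(["a"])}
)
# At runtime this works; statically, a strict checker has no element type here.
column = data.field1
```

The point of the sketch is that nothing connects the type argument `T` to attribute access on the frame, so the annotations on the model are never consulted.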
Why this is relevant:
The recent (python ^3.11) type hinting additions are a huge improvement in terms of developer experience because they improve the IDE's ability to auto-complete code accurately, and they improve the speed and quality of development in larger applications and public libraries by preventing bugs due to type incompatibilities.
Allowing developers to at least use static type checking in their code without having to manually add type:ignore statements everywhere they use pandera is the minimal behaviour I would expect. So specifying typing.Any instead of not specifying any type (resulting in type checkers reporting it as Unknown) would go a long way.
Even better would be to make use of the type hints users already provide in the DataFrameModel anyway and make sure those are reported correctly, so type checkers can also validate whether a value from a column is compatible with a specific function argument somewhere. pydantic (which I use a lot) does this very well, so I was expecting pandera to do the same.
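As an illustration of the minimal behaviour described above, the sketch below shows what an explicit `typing.Any` buys. The `DataFrame` class and the `total` function are hypothetical stand-ins for this example only, not pandera code.

```python
from typing import Any, Generic, TypeVar

T = TypeVar("T")


class DataFrame(Generic[T]):
    """Hypothetical stand-in for a frame that is generic over its model."""

    def __init__(self, columns: dict[str, list[Any]]) -> None:
        self._columns = columns

    # An explicit Any is a declared type: the checker accepts any use of the
    # result instead of flagging it as an unknown type in strict mode.
    def __getattr__(self, name: str) -> Any:
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name) from None


def total(values: list[int]) -> int:
    """Hypothetical consumer that expects a precise type."""
    return sum(values)


data: DataFrame[object] = DataFrame({"field1": [1, 2]})
# Accepted without type:ignore, because Any is compatible with list[int].
answer = total(data.field1)
```

This silences the checker but verifies nothing, which is why reporting the types from the model would still be the better outcome.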