Modify Schema in the class-based API? #854

dantheand · 2022-05-11T19:12:47Z

Hello!

I love the SchemaModel API and how it lets me exploit inheritance in a Pydantic-style way. One use case I'm having trouble finding an elegant solution for: a pipeline function transforms an InputSchema in a way that keeps only a subset of the upstream columns (and likely adds a few more). I currently have to copy and paste the columns that appear in both the InputSchema and the OutputSchema.

I'd love to have OutputSchema inherit from InputSchema and then apply modifications such as removing columns. Here is my current implementation of the input and output schemas for a function that drops a column:

import pandera as pa
from pandera.typing import DataFrame, Series


class InputSchema(pa.SchemaModel):
    keep_col: Series[float] = pa.Field()
    toss_col: Series[str] = pa.Field()


class OutputSchema(pa.SchemaModel):
    keep_col: Series[float] = pa.Field()  # duplicated from InputSchema
    new_col: Series[int] = pa.Field()


@pa.check_types
def drop_col(input_df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    output_df = input_df.drop(columns=['toss_col'])
    # new_col is declared as Series[int], so assign an integer value
    output_df['new_col'] = 0
    return output_df

It's clear from this implementation that keep_col is duplicated exactly across both schemas, so it would be nice to find a way to deduplicate that code.

I know that I could use SchemaModel.to_schema() and then the DataFrameSchema transformation methods, but that would require me to mix and match the two syntactic representations of a schema, which isn't ideal.

Has anyone else run into a similar problem and found an elegant solution?
