Validation result is sometimes incorrect when using group rules #38

@borchero

At yesterday's PyData meetup in Zurich, a question made me realize that we handle the interaction between group rules and row-level rules incorrectly: if a row-level rule removes a row, and that removal causes a group rule to fail for the remaining rows, we do not notice. For example:

import dataframely as dy
import polars as pl

class DiagnosisSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    diagnosis = dy.String(primary_key=True, regex="^[A-Z]{3}$")
    is_main = dy.Bool(nullable=False)

    # Group rule: evaluated once per invoice rather than per row.
    @dy.rule(group_by=["invoice_id"])
    def exactly_one_main_diagnosis() -> pl.Expr:
        return pl.col("is_main").sum() == 1

df = pl.DataFrame(
    {
        "invoice_id": ["A", "A", "A"],
        "diagnosis": ["ABC", "ABD", "123"],
        "is_main": [False, False, True],
    }
)
good, _ = DiagnosisSchema.filter(df)
print(good)

results in

shape: (2, 3)
┌────────────┬───────────┬─────────┐
│ invoice_id ┆ diagnosis ┆ is_main │
│ ---        ┆ ---       ┆ ---     │
│ str        ┆ str       ┆ bool    │
╞════════════╪═══════════╪═════════╡
│ A          ┆ ABC       ┆ false   │
│ A          ┆ ABD       ┆ false   │
└────────────┴───────────┴─────────┘

which clearly violates the schema since we don't have a main diagnosis for the group.
