RFC: Deprecate @dy.rule(group_by=) parameter in favor of .over() #316
Summary
This RFC proposes deprecating the @dy.rule(group_by=) parameter in favor of native Polars window functions (.over()), which provide a strictly more expressive and precise abstraction for defining group-aware validation rules. This change unifies the API and reduces internal implementation complexity.
Motivation
Although users can already use native window functions to define group-aware custom rules, the officially endorsed group_by approach degrades the API by promoting a redundant and inferior abstraction. Furthermore, Dataframely takes on unnecessary internal complexity by re-implementing a pattern that Polars already solves natively.
Overloads Semantics
The user's mental model of what a @dy.rule function is/does must shift significantly when the group_by parameter is used.
- Standard Rules: Must return a row-wise predicate expression.
- Group Rules: Must "take care to use some kind of 'aggregate function' in order to produce exactly one value per group: in group rules, the 'input' that the expression is evaluated on is a set of rows".
Using native window functions to express group rules unifies the @dy.rule abstraction: every rule must return a row-wise predicate expression, regardless of whether the validation logic involves grouping.
Fractures Logic
The group_by parameter extracts the grouping context out of the function body and into the decorator; the resulting rule expression is not self-contained and cannot be understood in isolation.
Stringly-Typed
It is impossible to pass actual column references to the group_by parameter. By requiring magic strings, the group-rules API breaks editor support (e.g., refactor safety). With window functions, the grouping context is expressed within the rule function, enabling strongly-typed column references (e.g., .over(cls.zip_code.col)).
Less Expressive
Polars window functions (.over()) support grouping by expressions and provide an order_by parameter. The group_by parameter only accepts column-name strings.
Less Precise
With the group_by parameter, validation necessarily has group-level semantics — the entire group either passes or fails. But some group-aware rules are better served by surgical, row-level validation.
import dataframely as dy
import polars as pl

class Employee(dy.Schema):
    employee_id = dy.String(primary_key=True)
    department = dy.Categorical()
    salary = dy.Float64()

    @dy.rule(group_by=["department"])
    def reasonable_salary_for_department(cls) -> pl.Expr:
        MAX_MULTIPLE = 3.0
        upper_bound = cls.salary.col.mean() * MAX_MULTIPLE
        # Must wrap in `.all()` to return exactly 1 boolean per group
        return (cls.salary.col <= upper_bound).all()

In the example above, an entire department is considered invalid if a single employee (row) in that department (group) fails the rule. For such a rule, this large "blast radius" is probably not desirable.
- Imprecise Diagnostics: The user must manually hunt for the actual violation(s).
- Aggressive Filtering: Valid data is needlessly dropped in production pipelines.
Window functions enable surgical group-aware validation while also trivially supporting group-level validation; the user chooses which behavior is appropriate for their rule.
    # ... Employee schema ...

    @dy.rule()
    def reasonable_salary_for_department(cls) -> pl.Expr:
        MAX_MULTIPLE = 3.0
        upper_bound = cls.salary.col.mean() * MAX_MULTIPLE
        # User chooses whether to include `.all()` based on requirements
        return (cls.salary.col <= upper_bound).over(cls.department.col)

In the version above, only the specific employees (rows) that have "unreasonable" salaries for their department (group) are considered invalid. Group-level (all-or-nothing) validation semantics are achieved by calling .all() prior to .over().
Note
Polars window functions (.over()) perform aggregations on groups while preserving the grain of the context they're evaluated in. This is what enables "surgical" group-aware validation.
Unnecessary Complexity
Rules that use the group_by parameter require a distinct internal representation (GroupRule) and a separate execution path (_with_group_rules). When group rules are expressed via window functions, all rules share a uniform internal representation and are evaluated within a single .with_columns() context.
Consequently, GroupRule and _with_group_rules can be deleted entirely. Dataframely can defer to Polars' native query optimizer to efficiently evaluate overlapping .over() windows.
Proposed API
The @dy.rule(group_by=) parameter is deprecated. Custom rules that involve grouping are expressed via native Polars window functions within the rule expression.
# From User Guide > Quickstart
class HouseSchema(dy.Schema):
    zip_code = dy.String(min_length=3)
    ...

    # BEFORE
    @dy.rule(group_by=["zip_code"])
    def minimum_zip_code_count(cls) -> pl.Expr:
        return pl.len() >= 2

    # AFTER
    @dy.rule()
    def minimum_zip_code_count(cls) -> pl.Expr:
        return pl.len().over(cls.zip_code.col) >= 2

# From User Guide > Examples > Real-world example…
class DiagnosisSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    diagnosis_code = dy.String(primary_key=True)
    is_main = dy.Bool()

    # BEFORE
    @dy.rule(group_by=["invoice_id"])
    def exactly_one_main_diagnosis(cls) -> pl.Expr:
        return cls.is_main.col.sum() == 1

    # AFTER
    @dy.rule()
    def exactly_one_main_diagnosis(cls) -> pl.Expr:
        return cls.is_main.col.sum().over(cls.invoice_id.col) == 1

The example below highlights the expressive power of window functions for defining group-aware custom rules. The signal_is_active rule does two things that are impossible with the group_by parameter: it specifies an order_by parameter, and it defines surgical, row-wise validation semantics.
class SensorReading(dy.Schema):
    sensor_id = dy.String(primary_key=True)
    observed_at = dy.Datetime(primary_key=True)
    signal_value = dy.Float64()

    @dy.rule()
    def signal_is_active(cls) -> pl.Expr:
        NOISE_FLOOR = 0.001
        std_rolling = cls.signal_value.col.std().rolling(
            cls.observed_at.col,
            period="500ms",
        )
        return (std_rolling > NOISE_FLOOR).over(
            cls.sensor_id.col,
            order_by=cls.observed_at.col,
        )

The user has total control over the validation semantics.
# Inactive period invalidates all subsequent readings
return (std_rolling > NOISE_FLOOR).cum_min().over(...)

# Inactive period invalidates entire time series
return (std_rolling > NOISE_FLOOR).all().over(...)

Note
The group_by="primary_key" functionality (which enables dynamic resolution of the schema's primary key columns at class-creation time) is achieved without the use of magic strings via .over(cls.primary_key()).
Proposed Migration Path
- Update Documentation: Revise the User Guide to establish Polars window functions (.over()) as the idiomatic way to define group-aware custom rules.
- Issue FutureWarning: In the next minor release, trigger a FutureWarning in @dy.rule if the group_by parameter is passed.
- Remove: In the next major release, remove the group_by parameter and its associated internal implementation (GroupRule, _with_group_rules).
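The warning step could look roughly like the sketch below. The `rule` function here is a hypothetical stand-in written for this example, not Dataframely's decorator; the warning text and the choice of `stacklevel` are likewise assumptions.

```python
import warnings
from typing import Callable, Optional, Sequence, Union


def rule(
    *, group_by: Optional[Union[str, Sequence[str]]] = None
) -> Callable[[Callable], Callable]:
    """Hypothetical stand-in for a rule decorator; not Dataframely's code."""
    if group_by is not None:
        warnings.warn(
            "The `group_by` parameter is deprecated; express grouping with "
            "`.over()` inside the rule expression instead.",
            FutureWarning,
            stacklevel=2,  # attribute the warning to the line applying the decorator
        )

    def decorator(fn: Callable) -> Callable:
        return fn  # a real decorator would register the rule here

    return decorator


# Record warnings to show that only the legacy form triggers one.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    @rule(group_by=["zip_code"])
    def legacy_rule() -> None: ...

    @rule()
    def migrated_rule() -> None: ...

deprecations = [w for w in caught if issubclass(w.category, FutureWarning)]
print(len(deprecations))
```

FutureWarning is shown to end users by default, unlike DeprecationWarning, which Python hides outside of `__main__` and test runners.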
Note
Migrating existing group rules requires no changes to the underlying logic and can be performed purely mechanically:
# Legacy: group_by=["a", "b"]
return (expr).over(["a", "b"])

# Legacy: group_by="primary_key"
return (expr).over(cls.primary_key())

Because legacy group rules must produce exactly one value per group, group-level validation semantics are guaranteed.
Alternatives Considered
Retain group_by as syntactic sugar: The @dy.rule(group_by=) parameter could be retained as syntactic sugar. Under the hood, the decorator would intercept the returned rule expression, wrap it in parentheses, and call .over() on it, passing the "by" columns. This would enable the deletion of GroupRule and _with_group_rules without forcing users to migrate their code.
Why it was rejected: This alternative was rejected because it violates the principle of having "one obvious way" to do things. While it resolves the "Unnecessary Complexity" motivation for maintainers, it leaves all the user-facing motivations outstanding.