RFC: Deprecate @dy.rule(group_by=) parameter in favor of .over() #316

@haydena7

Description

Summary

This RFC proposes deprecating the @dy.rule(group_by=) parameter in favor of native Polars window functions (.over()), which provide a strictly more expressive and precise abstraction for defining group-aware validation rules. This change unifies the API and reduces internal implementation complexity.

Motivation

Although users can already use native window functions to define group-aware custom rules, the officially endorsed group_by approach degrades the API by promoting a redundant and inferior abstraction. Furthermore, Dataframely takes on unnecessary internal complexity by re-implementing a pattern that Polars already solves natively.

Overloads Semantics

The user's mental model of what a @dy.rule function is/does must shift significantly when the group_by parameter is used.

  • Standard Rules: Must return a row-wise predicate expression.
  • Group Rules: Must "take care to use some kind of 'aggregate function' in order to produce exactly one value per group: in group rules, the 'input' that the expression is evaluated on is a set of rows".

Using native window functions to express group rules unifies the @dy.rule abstraction: every rule must return a row-wise predicate expression, regardless of whether the validation logic involves grouping.

Fractures Logic

The group_by parameter extracts the grouping context out of the function body and into the decorator; the resulting rule expression is not self-contained and cannot be understood in isolation.

Stringly-Typed

It is impossible to pass actual column references to the group_by parameter. By requiring magic strings, the group-rules API breaks editor support (e.g., refactor safety). With window functions, the grouping context is expressed within the rule function, enabling strongly-typed column references (e.g., .over(cls.zip_code.col)).

Less Expressive

Polars window functions (.over()) support grouping by expressions and provide an order_by parameter. The group_by parameter only accepts column-name strings.

Less Precise

With the group_by parameter, validation necessarily has group-level semantics — the entire group either passes or fails. But some group-aware rules are better served by surgical, row-level validation.

class Employee(dy.Schema):
	employee_id = dy.String(primary_key=True)
	department = dy.Categorical()
	salary = dy.Float64()
	
	@dy.rule(group_by=["department"])
	def reasonable_salary_for_department(cls) -> pl.Expr:
		MAX_MULTIPLE = 3.0
		upper_bound = cls.salary.col.mean() * MAX_MULTIPLE
		# Must wrap in `.all()` to return exactly 1 boolean per group
		return (cls.salary.col <= upper_bound).all()

In the example above, an entire department is considered invalid if a single employee (row) in that department (group) fails the rule. For such a rule, this large "blast radius" is probably not desirable.

  • Imprecise Diagnostics: The user must manually hunt for the actual violation(s).
  • Aggressive Filtering: Valid data is needlessly dropped in production pipelines.

Window functions enable surgical group-aware validation while also trivially supporting group-level validation; the user chooses which behavior is appropriate for their rule.

# ... Employee schema ...

	@dy.rule()
	def reasonable_salary_for_department(cls) -> pl.Expr:
		MAX_MULTIPLE = 3.0
		upper_bound = cls.salary.col.mean() * MAX_MULTIPLE
		# User chooses whether to include `.all()` based on requirements
		return (cls.salary.col <= upper_bound).over(cls.department.col)

In the version above, only the specific employees (rows) that have "unreasonable" salaries for their department (group) are considered invalid. Group-level (all-or-nothing) validation semantics are achieved by calling .all() prior to .over().

Note

Polars window functions (.over()) perform aggregations on groups while preserving the grain of the context they're evaluated in. This is what enables "surgical" group-aware validation.

Unnecessary Complexity

Rules that use the group_by parameter require a distinct internal representation (GroupRule) and a separate execution path (_with_group_rules). When group rules are expressed via window functions, all rules share a uniform internal representation and are evaluated within a single .with_columns() context.

Consequently, GroupRule and _with_group_rules can be deleted entirely. Dataframely can defer to Polars' native query optimizer to efficiently evaluate overlapping .over() windows.

Proposed API

The @dy.rule(group_by=) parameter is deprecated. Custom rules that involve grouping are expressed via native Polars window functions within the rule expression.

# From User Guide > Quickstart

class HouseSchema(dy.Schema):
    zip_code = dy.String(min_length=3)
    ...
	
    # BEFORE
    @dy.rule(group_by=["zip_code"])
    def minimum_zip_code_count(cls) -> pl.Expr:
        return pl.len() >= 2
    
    # AFTER
    @dy.rule()
    def minimum_zip_code_count(cls) -> pl.Expr:
        return pl.len().over(cls.zip_code.col) >= 2

# From User Guide > Examples > Real-world example…

class DiagnosisSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    diagnosis_code = dy.String(primary_key=True)
    is_main = dy.Bool()

    # BEFORE
    @dy.rule(group_by=["invoice_id"])
    def exactly_one_main_diagnosis(cls) -> pl.Expr:
        return cls.is_main.col.sum() == 1
    
    # AFTER
    @dy.rule()
    def exactly_one_main_diagnosis(cls) -> pl.Expr:
        return cls.is_main.col.sum().over(cls.invoice_id.col) == 1

The example below highlights the expressive power of window functions for defining group-aware custom rules. The signal_is_active rule does two things that are impossible with the group_by parameter: it specifies an order_by parameter, and it defines surgical, row-wise validation semantics.

class SensorReading(dy.Schema):
	sensor_id = dy.String(primary_key=True)
	observed_at = dy.Datetime(primary_key=True)
	signal_value = dy.Float64()
	
	@dy.rule()
	def signal_is_active(cls) -> pl.Expr:
		NOISE_FLOOR = 0.001
		std_rolling = cls.signal_value.col.std().rolling(
			cls.observed_at.col,
			period="500ms",
		)
		return (std_rolling > NOISE_FLOOR).over(
			cls.sensor_id.col,
			order_by=cls.observed_at.col,
		)

The user has total control over the validation semantics.

# Inactive period invalidates all subsequent readings
return (std_rolling > NOISE_FLOOR).cum_min().over(...)

# Inactive period invalidates entire time series
return (std_rolling > NOISE_FLOOR).all().over(...)

Note

The group_by="primary_key" functionality (which enables dynamic resolution of the schema's primary key columns at class-creation time) is achieved without the use of magic strings via .over(cls.primary_key()).

Proposed Migration Path

  1. Update Documentation: Revise the User Guide to establish Polars window functions (.over()) as the idiomatic way to define group-aware custom rules.
  2. Issue FutureWarning: In the next minor release, trigger a FutureWarning in @dy.rule if the group_by parameter is passed.
  3. Remove: In the next major release, remove the group_by parameter and its associated internal implementation (GroupRule, _with_group_rules).

Note

Migrating existing group rules requires no changes to the underlying logic and can be performed purely mechanically:

# Legacy: group_by=["a", "b"]
return (expr).over(["a", "b"])

# Legacy: group_by="primary_key"
return (expr).over(cls.primary_key())

Because legacy group rules must produce exactly one value per group, group-level validation semantics are guaranteed.

Alternatives Considered

Retain group_by as syntactic sugar: The @dy.rule(group_by=) parameter could be retained as syntactic sugar. Under the hood, the decorator would intercept the returned rule expression, wrap it in parentheses, and call .over() on it, passing the "by" columns. This would enable the deletion of GroupRule and _with_group_rules without forcing users to migrate their code.

Why it was rejected: This alternative was rejected because it violates the principle of having "one obvious way" to do things. While it resolves the "Unnecessary Complexity" motivation for maintainers, it leaves all the user-facing motivations outstanding.
