RFC: Deprecate `@dy.rule(group_by=)` parameter in favor of `.over()`

# Summary

This RFC proposes deprecating the `@dy.rule(group_by=)` parameter in favor of native Polars [window functions](https://docs.pola.rs/user-guide/expressions/window-functions/) ([`.over()`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)), which provide a strictly more expressive and precise abstraction for defining group-aware validation rules. This change unifies the API and reduces internal implementation complexity.

# Motivation

Although users can already use native window functions to define group-aware custom rules, the officially endorsed `group_by` approach degrades the API by promoting a redundant and inferior abstraction. Furthermore, Dataframely takes on unnecessary internal complexity by re-implementing a pattern that Polars already solves natively.

## Overloads Semantics

The user's mental model of what a `@dy.rule` function does must shift significantly when the `group_by` parameter is used.

- **Standard Rules:** Must return a row-wise predicate expression.
- **Group Rules:** Must "take care to use some kind of 'aggregate function' in order to produce exactly one value per group: in group rules, the 'input' that the expression is evaluated on is a set of rows".

Using native window functions to express group rules unifies the `@dy.rule` abstraction: every rule must return a row-wise predicate expression, regardless of whether the validation logic involves grouping.

## Fractures Logic

The `group_by` parameter extracts the grouping context out of the function body and into the decorator; the resulting rule expression is not self-contained and cannot be understood in isolation.

## Stringly-Typed

It is impossible to pass actual column references to the `group_by` parameter. By requiring magic strings, the group-rules API breaks editor support (e.g., refactor safety). With window functions, the grouping context is expressed within the rule function, enabling strongly-typed column references (e.g., `.over(cls.zip_code.col)`).

## Less Expressive

Polars window functions (`.over()`) support grouping by expressions and provide an `order_by` parameter. The `group_by` parameter only accepts column-name strings.

## Less Precise

With the `group_by` parameter, validation necessarily has group-level semantics: the *entire group* either passes or fails. But some group-aware rules are better served by surgical, row-level validation.

```python
import dataframely as dy
import polars as pl


class Employee(dy.Schema):
    employee_id = dy.String(primary_key=True)
    department = dy.Categorical()
    salary = dy.Float64()

    @dy.rule(group_by=["department"])
    def reasonable_salary_for_department(cls) -> pl.Expr:
        MAX_MULTIPLE = 3.0
        upper_bound = cls.salary.col.mean() * MAX_MULTIPLE
        # Must wrap in `.all()` to return exactly 1 boolean per group
        return (cls.salary.col <= upper_bound).all()
```

In the example above, an entire department is considered invalid if a single employee (row) in that department (group) fails the rule. For such a rule, this large "blast radius" is probably not desirable.

- **Imprecise Diagnostics:** The user must manually hunt for the actual violation(s).
- **Aggressive Filtering:** Valid data is needlessly dropped in production pipelines.

Window functions enable surgical group-*aware* validation while also trivially supporting group-*level* validation; the user chooses which behavior is appropriate for their rule.

```python
# ... Employee schema ...

	@dy.rule()
	def reasonable_salary_for_department(cls) -> pl.Expr:
		MAX_MULTIPLE = 3.0
		upper_bound = cls.salary.col.mean() * MAX_MULTIPLE
		# User chooses whether to include `.all()` based on requirements
		return (cls.salary.col <= upper_bound).over(cls.department.col)
```

In the version above, only the specific employees (rows) that have "unreasonable" salaries for their department (group) are considered invalid. Group-level (all-or-nothing) validation semantics are achieved by calling `.all()` on the predicate before `.over()`.

> [!NOTE]
> Polars window functions (`.over()`) perform aggregations on groups while **preserving the grain** of the context they're evaluated in. This is what enables "surgical" group-aware validation.

## Unnecessary Complexity

Rules that use the `group_by` parameter require a distinct internal representation (`GroupRule`) and a separate execution path (`_with_group_rules`). When group rules are expressed via window functions, all rules share a uniform internal representation and are evaluated within a single `.with_columns()` context.

Consequently, `GroupRule` and `_with_group_rules` can be deleted entirely. Dataframely can defer to Polars' native query optimizer to efficiently evaluate overlapping `.over()` windows.

# Proposed API

The `@dy.rule(group_by=)` parameter is deprecated. Custom rules that involve grouping are expressed via native Polars window functions within the rule expression.

```python
# From User Guide > Quickstart

class HouseSchema(dy.Schema):
    zip_code = dy.String(min_length=3)
    ...
	
    # BEFORE
    @dy.rule(group_by=["zip_code"])
    def minimum_zip_code_count(cls) -> pl.Expr:
        return pl.len() >= 2
    
    # AFTER
    @dy.rule()
    def minimum_zip_code_count(cls) -> pl.Expr:
        return pl.len().over(cls.zip_code.col) >= 2
```

```python
# From User Guide > Examples > Real-world example…

class DiagnosisSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    diagnosis_code = dy.String(primary_key=True)
    is_main = dy.Bool()

    # BEFORE
    @dy.rule(group_by=["invoice_id"])
    def exactly_one_main_diagnosis(cls) -> pl.Expr:
        return cls.is_main.col.sum() == 1
    
    # AFTER
    @dy.rule()
    def exactly_one_main_diagnosis(cls) -> pl.Expr:
        return cls.is_main.col.sum().over(cls.invoice_id.col) == 1
```

The example below highlights the expressive power of window functions for defining group-aware custom rules. The `signal_is_active` rule does two things that are impossible with the `group_by` parameter: it specifies an `order_by` parameter, and it defines surgical, row-wise validation semantics.

```python
class SensorReading(dy.Schema):
	sensor_id = dy.String(primary_key=True)
	observed_at = dy.Datetime(primary_key=True)
	signal_value = dy.Float64()
	
	@dy.rule()
	def signal_is_active(cls) -> pl.Expr:
		NOISE_FLOOR = 0.001
		std_rolling = cls.signal_value.col.std().rolling(
			cls.observed_at.col,
			period="500ms",
		)
		return (std_rolling > NOISE_FLOOR).over(
			cls.sensor_id.col,
			order_by=cls.observed_at.col,
		)
```

The user has total control over the validation semantics.

```python
# Inactive period invalidates all subsequent readings
return (std_rolling > NOISE_FLOOR).cum_min().over(...)

# Inactive period invalidates entire time series
return (std_rolling > NOISE_FLOOR).all().over(...)
```

> [!NOTE]
> The `group_by="primary_key"` functionality (which enables dynamic resolution of the schema's primary key columns at class-creation time) is achieved without the use of magic strings via `.over(cls.primary_key())`.

# Proposed Migration Path

1. **Update Documentation:** Revise the User Guide to establish Polars window functions (`.over()`) as the idiomatic way to define group-aware custom rules.
2. **Issue `FutureWarning`:** In the next minor release, trigger a `FutureWarning` in `@dy.rule` if the `group_by` parameter is passed.
3. **Remove:** In the next major release, remove the `group_by` parameter and its associated internal implementation (`GroupRule`, `_with_group_rules`).
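
Step 2 could look roughly like the sketch below. The `rule` function here is a hypothetical stand-in for `@dy.rule`, written with the standard library only; the warning text and the pass-through decorator body are assumptions, not Dataframely's code.

```python
import warnings
from typing import Any, Callable


def rule(*, group_by: Any = None) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Hypothetical stand-in for `@dy.rule`; not Dataframely's actual code."""
    if group_by is not None:
        warnings.warn(
            "`group_by` is deprecated; express grouping with `.over()` "
            "inside the rule expression instead.",
            FutureWarning,
            stacklevel=2,  # Point the warning at the caller's decorator line.
        )

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        # The real decorator would build a rule object; the sketch returns
        # the function unchanged to keep the focus on the warning.
        return fn

    return decorator


# Usage: capture the warning emitted by a legacy-style rule definition.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    @rule(group_by=["zip_code"])
    def minimum_zip_code_count(cls):
        ...

print([str(w.message) for w in caught])
```

`FutureWarning` is shown to end users by default, whereas `DeprecationWarning` is hidden in many contexts, which makes it a natural fit for a user-facing API change.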

> [!NOTE]
> Migrating existing group rules requires no changes to the underlying logic and can be performed purely mechanically:
> ```python
> # Legacy: group_by=["a", "b"]
> return (expr).over(["a", "b"])
> 
> # Legacy: group_by="primary_key"
> return (expr).over(cls.primary_key())
> ```
> 
> Because legacy group rules must produce exactly one value per group, `.over()` broadcasts that single value to every row of the group, so group-level validation semantics are preserved.

# Alternatives Considered

**Retain `group_by` as syntactic sugar:** The `@dy.rule(group_by=)` parameter could be retained as syntactic sugar. Under the hood, the decorator would intercept the returned rule expression and call `.over()` on it, passing the `group_by` columns. This would enable the deletion of `GroupRule` and `_with_group_rules` without forcing users to migrate their code.

**Why it was rejected:** This alternative was rejected because it violates the principle of having "one obvious way" to do things. While it resolves the "Unnecessary Complexity" motivation for maintainers, it leaves all the user-facing motivations outstanding.
