diff --git a/CLAUDE.md b/CLAUDE.md index 38c87a73..34b02522 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -119,13 +119,20 @@ pytest tests/test_rust_backend.py -v - Integrated with `TwoWayFixedEffects.decompose()` method - **`diff_diff/linalg.py`** - Unified linear algebra backend (v1.4.0+): - - `solve_ols()` - OLS solver using scipy's gelsy LAPACK driver (QR-based, faster than SVD) + - `solve_ols()` - OLS solver with R-style rank deficiency handling + - `_detect_rank_deficiency()` - Detect linearly dependent columns via pivoted QR - `compute_robust_vcov()` - Vectorized HC1 and cluster-robust variance-covariance estimation - `compute_r_squared()` - R-squared and adjusted R-squared computation - `LinearRegression` - High-level OLS helper class with unified coefficient extraction and inference - `InferenceResult` - Dataclass container for coefficient-level inference (SE, t-stat, p-value, CI) - Single optimization point for all estimators (reduces code duplication) - Cluster-robust SEs use pandas groupby instead of O(n × clusters) loop + - **Rank deficiency handling** (R-style): + - Detects rank-deficient matrices using pivoted QR decomposition + - `rank_deficient_action` parameter: "warn" (default), "error", or "silent" + - Dropped columns have NaN coefficients (like R's `lm()`) + - VCoV matrix has NaN for rows/cols of dropped coefficients + - Warnings include column names when provided - **`diff_diff/_backend.py`** - Backend detection and configuration (v2.0.0): - Detects optional Rust backend availability @@ -240,7 +247,7 @@ diff-diff achieved significant performance improvements in v1.4.0, now **faster All estimators use a single optimized OLS/SE implementation: -- **scipy.linalg.lstsq with 'gelsy' driver**: QR-based solving, faster than NumPy's default SVD-based solver +- **R-style rank deficiency handling**: Uses pivoted QR to detect linearly dependent columns, drops them, sets NaN for their coefficients, and emits informative warnings (following R's `lm()` approach) 
- **Vectorized cluster-robust SE**: Uses pandas groupby aggregation instead of O(n × clusters) Python loop - **Single optimization point**: Changes to `linalg.py` benefit all estimators @@ -248,8 +255,17 @@ All estimators use a single optimized OLS/SE implementation: # All estimators import from linalg.py from diff_diff.linalg import solve_ols, compute_robust_vcov -# Example usage +# Example usage (warns on rank deficiency, sets NaN for dropped coefficients) coefficients, residuals, vcov = solve_ols(X, y, cluster_ids=cluster_ids) + +# Suppress warning or raise error: +coefficients, residuals, vcov = solve_ols(X, y, rank_deficient_action="silent") # no warning +coefficients, residuals, vcov = solve_ols(X, y, rank_deficient_action="error") # raises ValueError + +# At estimator level (DifferenceInDifferences, MultiPeriodDiD): +from diff_diff import DifferenceInDifferences +did = DifferenceInDifferences(rank_deficient_action="error") # raises on collinear data +did = DifferenceInDifferences(rank_deficient_action="silent") # no warning ``` #### CallawaySantAnna Optimizations (`staggered.py`) diff --git a/TODO.md b/TODO.md index 96ebf679..fd6316a3 100644 --- a/TODO.md +++ b/TODO.md @@ -15,6 +15,21 @@ Current limitations that may affect users: | MultiPeriodDiD wild bootstrap not supported | `estimators.py:1068-1074` | Low | Edge case | | `predict()` raises NotImplementedError | `estimators.py:532-554` | Low | Rarely needed | +### ~~NaN Standard Errors for Rank-Deficient Matrices~~ (RESOLVED) + +**Status**: Resolved in v2.2.0 with R-style rank deficiency handling. 
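The resolved behavior can be sketched in standalone NumPy (an illustrative sketch, not the package internals): solve the reduced full-rank system, then expand the coefficient vector and vcov matrix with NaN for the dropped column, mirroring R's `lm()`. The choice of which column is "dropped" here is hard-coded for illustration; in the library it comes from pivoted QR.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 3))
X[:, 2] = 2.0 * X[:, 1]              # column 2 is linearly dependent on column 1
y = 1.0 + 0.5 * X[:, 0] + rng.standard_normal(200)

kept = np.array([0, 1])              # pretend rank detection dropped column 2
X_r = X[:, kept]

# Solve the reduced (full-rank) system and compute homoskedastic vcov on it
coef_r, *_ = np.linalg.lstsq(X_r, y, rcond=None)
resid = y - X_r @ coef_r
mse = resid @ resid / (len(y) - X_r.shape[1])
vcov_r = mse * np.linalg.inv(X_r.T @ X_r)

# Expand to full size with NaN for the dropped column (R-style NA)
coef = np.full(3, np.nan)
coef[kept] = coef_r
vcov = np.full((3, 3), np.nan)
vcov[np.ix_(kept, kept)] = vcov_r
```

The identified coefficients keep valid finite SEs (`vcov[0, 0]`, `vcov[1, 1]`), while every row/column touching the dropped coefficient is NaN.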
+ +**Solution**: The OLS solver now detects rank-deficient design matrices using pivoted QR decomposition and handles them following R's `lm()` approach: +- Warns users about dropped columns +- Sets NaN for coefficients of linearly dependent columns +- Computes valid SEs for identified (non-dropped) coefficients only +- Expands vcov matrix with NaN for dropped rows/columns + +This is controlled by the `rank_deficient_action` parameter in `solve_ols()`: +- `"warn"` (default): Emit warning, set NA for dropped coefficients +- `"error"`: Raise ValueError +- `"silent"`: No warning, but still set NA for dropped coefficients + --- ## Code Quality @@ -143,3 +158,37 @@ Potential future optimizations: - [ ] JIT compilation for bootstrap loops (numba) - [ ] Sparse matrix handling for large fixed effects +### QR+SVD Redundancy in Rank Detection + +**Background**: The current `solve_ols()` implementation performs both QR (for rank detection) and SVD (for solving) decompositions on rank-deficient matrices. This is technically redundant since SVD can determine rank directly. + +**Current approach** (R-style, chosen for robustness): +1. QR with pivoting for rank detection (`_detect_rank_deficiency()`) +2. scipy's `lstsq` with 'gelsd' driver (SVD-based) for solving + +**Why we use QR for rank detection**: +- QR with pivoting provides the canonical ordering of linearly dependent columns +- R's `lm()` uses this approach for consistent dropped-column reporting +- Ensures consistent column dropping across runs (SVD column selection can vary) + +**Potential optimization** (future work): +- Skip QR when `rank_deficient_action="silent"` since we don't need column names +- Use SVD rank directly in the Rust backend (already implemented) +- Add `skip_rank_check` parameter for hot paths where matrix is known to be full-rank (implemented in v2.2.0) + +**Priority**: Low - the QR overhead is minimal compared to SVD solve, and correctness is more important than micro-optimization. 
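The two decompositions discussed above can be compared in a short standalone sketch (illustrative code, not `_detect_rank_deficiency()` itself): pivoted QR yields both the numerical rank and a deterministic list of dependent columns, while SVD-based `np.linalg.matrix_rank` confirms the rank but cannot say which columns to drop.

```python
import numpy as np
from scipy.linalg import qr

def qr_rank_and_dropped(X, rcond=1e-7):
    # Pivoted QR: X[:, pivot] = Q @ R, with |diag(R)| non-increasing
    _, R, pivot = qr(X, mode='economic', pivoting=True)
    r_diag = np.abs(np.diag(R))
    rank = int(np.sum(r_diag > rcond * r_diag[0])) if r_diag[0] > 0 else 0
    # Pivot positions beyond the rank are the linearly dependent columns
    dropped = np.sort(pivot[rank:])
    return rank, dropped

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
X[:, 2] = X[:, 0] + X[:, 1]          # introduce perfect collinearity

rank, dropped = qr_rank_and_dropped(X)
svd_rank = np.linalg.matrix_rank(X)  # SVD finds the rank but not the columns
```

Both approaches agree the rank is 2; only the pivoted QR identifies a specific column to drop (note the pivot may prefer the large-norm combined column, so which of the three is reported as dependent is data-dependent).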
+ +### Incomplete `check_finite` Bypass + +**Background**: The `solve_ols()` function accepts a `check_finite=False` parameter intended to skip NaN/Inf validation for performance in hot paths where data is known to be clean. + +**Current limitation**: When `check_finite=False`, our explicit validation is skipped, but scipy's internal QR decomposition in `_detect_rank_deficiency()` still validates finite values. This means callers cannot fully bypass all finite checks. + +**Impact**: Minimal - the scipy check is fast and only affects edge cases where users explicitly pass `check_finite=False` with non-finite data (which would be a bug in their code anyway). + +**Potential fix** (future work): +- Pass `check_finite=False` through to scipy's QR call (requires scipy >= 1.9.0) +- Or skip `_detect_rank_deficiency()` entirely when `check_finite=False` and `_skip_rank_check=True` + +**Priority**: Low - this is an edge case optimization that doesn't affect correctness. + diff --git a/diff_diff/estimators.py b/diff_diff/estimators.py index 7fa4c805..83fc72b0 100644 --- a/diff_diff/estimators.py +++ b/diff_diff/estimators.py @@ -64,6 +64,11 @@ class DifferenceInDifferences: seed : int, optional Random seed for reproducibility when using bootstrap inference. If None (default), results will vary between runs. 
+ rank_deficient_action : str, default "warn" + Action when design matrix is rank-deficient (linearly dependent columns): + - "warn": Issue warning and drop linearly dependent columns (default) + - "error": Raise ValueError + - "silent": Drop columns silently without warning Attributes ---------- @@ -120,7 +125,8 @@ def __init__( inference: str = "analytical", n_bootstrap: int = 999, bootstrap_weights: str = "rademacher", - seed: Optional[int] = None + seed: Optional[int] = None, + rank_deficient_action: str = "warn", ): self.robust = robust self.cluster = cluster @@ -129,6 +135,7 @@ def __init__( self.n_bootstrap = n_bootstrap self.bootstrap_weights = bootstrap_weights self.seed = seed + self.rank_deficient_action = rank_deficient_action self.is_fitted_ = False self.results_ = None @@ -283,6 +290,7 @@ def fit( robust=self.robust, cluster_ids=cluster_ids if self.inference != "wild_bootstrap" else None, alpha=self.alpha, + rank_deficient_action=self.rank_deficient_action, ).fit(X, y, df_adjustment=n_absorbed_effects) coefficients = reg.coefficients_ @@ -596,6 +604,7 @@ def get_params(self) -> Dict[str, Any]: "n_bootstrap": self.n_bootstrap, "bootstrap_weights": self.bootstrap_weights, "seed": self.seed, + "rank_deficient_action": self.rank_deficient_action, } def set_params(self, **params) -> "DifferenceInDifferences": @@ -873,29 +882,43 @@ def fit( # type: ignore[override] var_names.append(col) # Fit OLS using unified backend - coefficients, residuals, fitted, _ = solve_ols( - X, y, return_fitted=True, return_vcov=False + # Pass cluster_ids to solve_ols for proper vcov computation + # This handles rank-deficient matrices by returning NaN for dropped columns + cluster_ids = data[self.cluster].values if self.cluster is not None else None + + # Note: Wild bootstrap for multi-period effects is complex (multiple coefficients) + # For now, we use analytical inference even if inference="wild_bootstrap" + coefficients, residuals, fitted, vcov = solve_ols( + X, y, + 
return_fitted=True, + return_vcov=True, + cluster_ids=cluster_ids, + column_names=var_names, + rank_deficient_action=self.rank_deficient_action, ) r_squared = compute_r_squared(y, residuals) - # Degrees of freedom - df = len(y) - X.shape[1] - n_absorbed_effects + # Degrees of freedom using effective rank (non-NaN coefficients) + k_effective = int(np.sum(~np.isnan(coefficients))) + df = len(y) - k_effective - n_absorbed_effects - # Compute standard errors - # Note: Wild bootstrap for multi-period effects is complex (multiple coefficients) - # For now, we use analytical inference even if inference="wild_bootstrap" - if self.cluster is not None: - cluster_ids = data[self.cluster].values - vcov = compute_robust_vcov(X, residuals, cluster_ids) - elif self.robust: - vcov = compute_robust_vcov(X, residuals) - else: + # For non-robust, non-clustered case, we need homoskedastic vcov + # solve_ols returns HC1 by default, so compute homoskedastic if needed + if not self.robust and self.cluster is None: n = len(y) - k = X.shape[1] - mse = np.sum(residuals**2) / (n - k) + mse = np.sum(residuals**2) / (n - k_effective) # Use solve() instead of inv() for numerical stability - # solve(A, B) computes X where AX=B, so this yields (X'X)^{-1} * mse - vcov = np.linalg.solve(X.T @ X, mse * np.eye(k)) + # Only compute for identified columns (non-NaN coefficients) + identified_mask = ~np.isnan(coefficients) + if np.all(identified_mask): + vcov = np.linalg.solve(X.T @ X, mse * np.eye(X.shape[1])) + else: + # For rank-deficient case, compute vcov on reduced matrix then expand + X_reduced = X[:, identified_mask] + vcov_reduced = np.linalg.solve(X_reduced.T @ X_reduced, mse * np.eye(X_reduced.shape[1])) + # Expand to full size with NaN for dropped columns + vcov = np.full((X.shape[1], X.shape[1]), np.nan) + vcov[np.ix_(identified_mask, identified_mask)] = vcov_reduced # Extract period-specific treatment effects period_effects = {} @@ -922,19 +945,43 @@ def fit( # type: ignore[override] 
effect_indices.append(idx) # Compute average treatment effect - # Average ATT = mean of period-specific effects - avg_att = np.mean(effect_values) - - # Standard error of average: need to account for covariance - # Var(avg) = (1/n^2) * sum of all elements in the sub-covariance matrix - n_post = len(post_periods) - sub_vcov = vcov[np.ix_(effect_indices, effect_indices)] - avg_var = np.sum(sub_vcov) / (n_post ** 2) - avg_se = np.sqrt(avg_var) - - avg_t_stat = avg_att / avg_se if avg_se > 0 else 0.0 - avg_p_value = compute_p_value(avg_t_stat, df=df) - avg_conf_int = compute_confidence_interval(avg_att, avg_se, self.alpha, df=df) + # R-style NA propagation: if ANY period effect is NaN, average is undefined + effect_arr = np.array(effect_values) + + if np.any(np.isnan(effect_arr)): + # Some period effects are NaN (unidentified) - cannot compute valid average + # This follows R's default behavior where mean(c(1, 2, NA)) returns NA + avg_att = np.nan + avg_se = np.nan + avg_t_stat = np.nan + avg_p_value = np.nan + avg_conf_int = (np.nan, np.nan) + else: + # All effects identified - compute average normally + avg_att = float(np.mean(effect_arr)) + + # Standard error of average: need to account for covariance + n_post = len(post_periods) + sub_vcov = vcov[np.ix_(effect_indices, effect_indices)] + avg_var = np.sum(sub_vcov) / (n_post ** 2) + + if np.isnan(avg_var) or avg_var < 0: + # Vcov has NaN (dropped columns) - propagate NaN + avg_se = np.nan + avg_t_stat = np.nan + avg_p_value = np.nan + avg_conf_int = (np.nan, np.nan) + else: + avg_se = float(np.sqrt(avg_var)) + if avg_se > 0: + avg_t_stat = avg_att / avg_se + avg_p_value = compute_p_value(avg_t_stat, df=df) + avg_conf_int = compute_confidence_interval(avg_att, avg_se, self.alpha, df=df) + else: + # Zero SE (degenerate case) + avg_t_stat = np.nan + avg_p_value = np.nan + avg_conf_int = (np.nan, np.nan) # Count observations n_treated = int(np.sum(d)) diff --git a/diff_diff/linalg.py b/diff_diff/linalg.py index 
0caf4190..c833a885 100644 --- a/diff_diff/linalg.py +++ b/diff_diff/linalg.py @@ -5,15 +5,34 @@ Rust backend for maximum performance. The key optimizations are: -1. scipy.linalg.lstsq with 'gelsy' driver (QR-based, faster than SVD) +1. scipy.linalg.lstsq with 'gelsd' driver (SVD-based, handles rank-deficient matrices) 2. Vectorized cluster-robust SE via groupby (eliminates O(n*clusters) loop) 3. Single interface for all estimators (reduces code duplication) 4. Optional Rust backend for additional speedup (when available) +5. R-style rank deficiency handling: detect, warn, and set NA for dropped columns The Rust backend is automatically used when available, with transparent fallback to NumPy/SciPy implementations. + +Rank Deficiency Handling +------------------------ +When a design matrix is rank-deficient (has linearly dependent columns), the OLS +solution is not unique. This module follows R's `lm()` approach: + +1. Detect rank deficiency using pivoted QR decomposition +2. Identify which columns are linearly dependent +3. Drop redundant columns from the solve +4. Set NA (NaN) for coefficients of dropped columns +5. Warn with clear message listing dropped columns +6. 
Compute valid SEs for remaining (identified) coefficients + +This is controlled by the `rank_deficient_action` parameter: +- "warn" (default): Emit warning, set NA for dropped coefficients +- "error": Raise ValueError with dropped column information +- "silent": No warning, but still set NA for dropped coefficients """ +import warnings from dataclasses import dataclass from typing import Dict, List, Optional, Tuple, Union @@ -21,6 +40,7 @@ import pandas as pd from scipy import stats from scipy.linalg import lstsq as scipy_lstsq +from scipy.linalg import qr # Import Rust backend if available (from _backend to avoid circular imports) from diff_diff._backend import ( @@ -30,6 +50,276 @@ ) +# ============================================================================= +# Utility Functions +# ============================================================================= + + +def _factorize_cluster_ids(cluster_ids: np.ndarray) -> np.ndarray: + """ + Convert cluster IDs to contiguous integer codes for Rust backend. + + Handles string, categorical, or non-contiguous integer cluster IDs by + mapping them to contiguous integers starting from 0. + + Parameters + ---------- + cluster_ids : np.ndarray + Cluster identifiers (can be strings, integers, or categorical). + + Returns + ------- + np.ndarray + Integer cluster codes (dtype int64) suitable for Rust backend. + """ + # Use pandas factorize for efficient conversion of any dtype + codes, _ = pd.factorize(cluster_ids) + return codes.astype(np.int64) + + +# ============================================================================= +# Rank Deficiency Detection and Handling +# ============================================================================= + + +def _detect_rank_deficiency( + X: np.ndarray, + rcond: Optional[float] = None, +) -> Tuple[int, np.ndarray, np.ndarray]: + """ + Detect rank deficiency using pivoted QR decomposition. 
+
+    This follows R's lm() approach of using pivoted QR to detect which columns
+    are linearly dependent. The pivoting ensures we drop the "least important"
+    columns (those with smallest contribution to the column space).
+
+    Parameters
+    ----------
+    X : ndarray of shape (n, k)
+        Design matrix.
+    rcond : float, optional
+        Relative condition number threshold for determining rank.
+        Diagonal elements of R smaller than rcond * max(|R_ii|) are treated
+        as zero. If None, uses 1e-07 to match R's qr() default tolerance.
+
+    Returns
+    -------
+    rank : int
+        Numerical rank of the matrix.
+    dropped_cols : ndarray of int
+        Indices of columns that are linearly dependent (should be dropped).
+        Empty if matrix is full rank.
+    pivot : ndarray of int
+        Column permutation from QR decomposition.
+    """
+    n, k = X.shape
+
+    # Compute pivoted QR decomposition: X @ P = Q @ R
+    # P is a permutation matrix, represented as pivot indices
+    Q, R, pivot = qr(X, mode='economic', pivoting=True)
+
+    # Determine rank tolerance
+    # R's qr() defaults to tol = 1e-07 (a larger threshold than sqrt(eps) ≈ 1.49e-08)
+    # We use 1e-07 to match R's lm() behavior for consistency
+    if rcond is None:
+        rcond = 1e-07
+
+    # The diagonal of R contains information about linear independence
+    # After pivoting, |R[i,i]| is decreasing
+    r_diag = np.abs(np.diag(R))
+
+    # Find numerical rank: count diagonal elements of R above the threshold,
+    # which is relative to the largest diagonal element
+    if r_diag[0] == 0:
+        rank = 0
+    else:
+        tol = rcond * r_diag[0]
+        rank = int(np.sum(r_diag > tol))
+
+    # Columns after the rank position (in pivot order) are linearly dependent;
+    # map them back to original column indices
+    if rank < k:
+        dropped_cols = np.sort(pivot[rank:])
+    else:
+        dropped_cols = np.array([], dtype=int)
+
+    return rank, dropped_cols, pivot
+
+
+def _format_dropped_columns(
+    dropped_cols: np.ndarray,
+    column_names: Optional[List[str]] = None,
+) -> str:
+    """
+    Format dropped column information for error/warning
messages. + + Parameters + ---------- + dropped_cols : ndarray of int + Indices of dropped columns. + column_names : list of str, optional + Names for the columns. If None, uses indices. + + Returns + ------- + str + Formatted string describing dropped columns. + """ + if len(dropped_cols) == 0: + return "" + + if column_names is not None: + names = [column_names[i] if i < len(column_names) else f"column {i}" + for i in dropped_cols] + if len(names) == 1: + return f"'{names[0]}'" + elif len(names) <= 5: + return ", ".join(f"'{n}'" for n in names) + else: + shown = ", ".join(f"'{n}'" for n in names[:5]) + return f"{shown}, ... and {len(names) - 5} more" + else: + if len(dropped_cols) == 1: + return f"column {dropped_cols[0]}" + elif len(dropped_cols) <= 5: + return ", ".join(f"column {i}" for i in dropped_cols) + else: + shown = ", ".join(f"column {i}" for i in dropped_cols[:5]) + return f"{shown}, ... and {len(dropped_cols) - 5} more" + + +def _expand_coefficients_with_nan( + coef_reduced: np.ndarray, + k_full: int, + kept_cols: np.ndarray, +) -> np.ndarray: + """ + Expand reduced coefficients to full size, filling dropped columns with NaN. + + Parameters + ---------- + coef_reduced : ndarray of shape (rank,) + Coefficients for kept columns only. + k_full : int + Total number of columns in original design matrix. + kept_cols : ndarray of int + Indices of columns that were kept. + + Returns + ------- + ndarray of shape (k_full,) + Full coefficient vector with NaN for dropped columns. + """ + coef_full = np.full(k_full, np.nan) + coef_full[kept_cols] = coef_reduced + return coef_full + + +def _expand_vcov_with_nan( + vcov_reduced: np.ndarray, + k_full: int, + kept_cols: np.ndarray, +) -> np.ndarray: + """ + Expand reduced vcov matrix to full size, filling dropped entries with NaN. + + Parameters + ---------- + vcov_reduced : ndarray of shape (rank, rank) + Variance-covariance matrix for kept columns only. 
+ k_full : int + Total number of columns in original design matrix. + kept_cols : ndarray of int + Indices of columns that were kept. + + Returns + ------- + ndarray of shape (k_full, k_full) + Full vcov matrix with NaN for dropped rows/columns. + """ + vcov_full = np.full((k_full, k_full), np.nan) + # Use advanced indexing to fill in the kept entries + ix = np.ix_(kept_cols, kept_cols) + vcov_full[ix] = vcov_reduced + return vcov_full + + +def _solve_ols_rust( + X: np.ndarray, + y: np.ndarray, + *, + cluster_ids: Optional[np.ndarray] = None, + return_vcov: bool = True, + return_fitted: bool = False, +) -> Union[ + Tuple[np.ndarray, np.ndarray, Optional[np.ndarray]], + Tuple[np.ndarray, np.ndarray, np.ndarray, Optional[np.ndarray]], +]: + """ + Rust backend implementation of solve_ols for full-rank matrices. + + This is only called when: + 1. The Rust backend is available + 2. The design matrix is full rank (no rank deficiency handling needed) + + For rank-deficient matrices, the Python backend is used instead to + properly handle R-style NA coefficients for dropped columns. + + Why the backends differ (by design): + - Rust uses SVD-based solve (minimum-norm solution for rank-deficient) + - Python uses pivoted QR to identify and drop linearly dependent columns + - ndarray-linalg doesn't support QR with pivoting, so Rust can't identify + which specific columns to drop + - For full-rank matrices, both approaches give identical results + - For rank-deficient matrices, only Python can provide R-style NA handling + + Parameters + ---------- + X : np.ndarray + Design matrix of shape (n, k), must be full rank. + y : np.ndarray + Response vector of shape (n,). + cluster_ids : np.ndarray, optional + Cluster identifiers for cluster-robust SEs. + return_vcov : bool + Whether to compute variance-covariance matrix. + return_fitted : bool + Whether to return fitted values. + + Returns + ------- + coefficients : np.ndarray + OLS coefficients of shape (k,). 
+ residuals : np.ndarray + Residuals of shape (n,). + fitted : np.ndarray, optional + Fitted values if return_fitted=True. + vcov : np.ndarray, optional + Variance-covariance matrix if return_vcov=True. + """ + # Convert cluster_ids to int64 for Rust (handles string/categorical IDs) + if cluster_ids is not None: + cluster_ids = _factorize_cluster_ids(cluster_ids) + + # Call Rust backend + coefficients, residuals, vcov = _rust_solve_ols( + X, y, cluster_ids=cluster_ids, return_vcov=return_vcov + ) + + # Convert to numpy arrays + coefficients = np.asarray(coefficients) + residuals = np.asarray(residuals) + if vcov is not None: + vcov = np.asarray(vcov) + + # Return with optional fitted values + if return_fitted: + fitted = X @ coefficients + return coefficients, residuals, fitted, vcov + else: + return coefficients, residuals, vcov + + def solve_ols( X: np.ndarray, y: np.ndarray, @@ -38,6 +328,9 @@ def solve_ols( return_vcov: bool = True, return_fitted: bool = False, check_finite: bool = True, + rank_deficient_action: str = "warn", + column_names: Optional[List[str]] = None, + skip_rank_check: bool = False, ) -> Union[ Tuple[np.ndarray, np.ndarray, Optional[np.ndarray]], Tuple[np.ndarray, np.ndarray, np.ndarray, Optional[np.ndarray]], @@ -65,24 +358,48 @@ def solve_ols( check_finite : bool, default True Whether to check that X and y contain only finite values (no NaN/Inf). Set to False for faster computation if you are certain your data is clean. + rank_deficient_action : str, default "warn" + How to handle rank-deficient design matrices: + - "warn": Emit warning and set NaN for dropped coefficients (R-style) + - "error": Raise ValueError with dropped column information + - "silent": No warning, but still set NaN for dropped coefficients + column_names : list of str, optional + Names for the columns (used in warning/error messages). + If None, columns are referred to by their indices. 
+ skip_rank_check : bool, default False + If True, skip the pivoted QR rank check and use Rust backend directly + (when available). This saves O(nk²) computation but will not detect + rank-deficient matrices. Use only when you know the design matrix is + full rank. If the matrix is actually rank-deficient, results may be + incorrect (minimum-norm solution instead of R-style NA handling). Returns ------- coefficients : ndarray of shape (k,) - OLS coefficient estimates. + OLS coefficient estimates. For rank-deficient matrices, coefficients + of linearly dependent columns are set to NaN. residuals : ndarray of shape (n,) - Residuals (y - X @ coefficients). + Residuals (y - fitted). For rank-deficient matrices, uses only + identified coefficients to compute fitted values. fitted : ndarray of shape (n,), optional - Fitted values (X @ coefficients). Only returned if return_fitted=True. + Fitted values. For full-rank matrices, this is X @ coefficients. + For rank-deficient matrices, uses only identified coefficients + (X_reduced @ coefficients_reduced). Only returned if return_fitted=True. vcov : ndarray of shape (k, k) or None Variance-covariance matrix (HC1 or cluster-robust). - None if return_vcov=False. + For rank-deficient matrices, rows/columns for dropped coefficients + are filled with NaN. None if return_vcov=False. Notes ----- - This function uses scipy.linalg.lstsq with the 'gelsy' driver, which is - QR-based and typically faster than NumPy's default SVD-based solver for - well-conditioned matrices. + This function detects rank-deficient matrices using pivoted QR decomposition + and handles them following R's lm() approach: + + 1. Detect linearly dependent columns via pivoted QR + 2. Drop redundant columns and solve the reduced system + 3. Set NaN for coefficients of dropped columns + 4. Compute valid SEs for identified coefficients only + 5. 
Expand vcov matrix with NaN for dropped rows/columns The cluster-robust standard errors use the sandwich estimator with the standard small-sample adjustment: (G/(G-1)) * ((n-1)/(n-k)). @@ -95,6 +412,15 @@ def solve_ols( >>> y = 2 + 3 * X[:, 1] + np.random.randn(100) >>> coef, resid, vcov = solve_ols(X, y) >>> print(f"Intercept: {coef[0]:.2f}, Slope: {coef[1]:.2f}") + + For rank-deficient matrices with collinear columns: + + >>> X = np.random.randn(100, 3) + >>> X[:, 2] = X[:, 0] + X[:, 1] # Perfect collinearity + >>> y = np.random.randn(100) + >>> coef, resid, vcov = solve_ols(X, y) # Emits warning + >>> print(np.isnan(coef[2])) # Dropped column has NaN coefficient + True """ # Validate inputs X = np.asarray(X, dtype=np.float64) @@ -117,6 +443,14 @@ def solve_ols( "Cannot solve underdetermined system." ) + # Validate rank_deficient_action + valid_actions = {"warn", "error", "silent"} + if rank_deficient_action not in valid_actions: + raise ValueError( + f"rank_deficient_action must be one of {valid_actions}, " + f"got '{rank_deficient_action}'" + ) + # Check for NaN/Inf values if requested if check_finite: if not np.isfinite(X).all(): @@ -130,43 +464,87 @@ def solve_ols( "Clean your data or set check_finite=False to skip this check." 
) - # Use Rust backend if available - # Note: Fall back to NumPy if check_finite=False since Rust's LAPACK - # doesn't support non-finite values - if HAS_RUST_BACKEND and check_finite: - # Ensure contiguous arrays for Rust - X = np.ascontiguousarray(X, dtype=np.float64) - y = np.ascontiguousarray(y, dtype=np.float64) + # Fast path: skip rank check and use Rust directly when requested + # This saves O(nk²) QR overhead but won't detect rank-deficient matrices + if skip_rank_check: + if HAS_RUST_BACKEND and _rust_solve_ols is not None: + return _solve_ols_rust( + X, y, + cluster_ids=cluster_ids, + return_vcov=return_vcov, + return_fitted=return_fitted, + ) + # Fall through to Python without rank check (user guarantees full rank) + return _solve_ols_numpy( + X, y, + cluster_ids=cluster_ids, + return_vcov=return_vcov, + return_fitted=return_fitted, + rank_deficient_action=rank_deficient_action, + column_names=column_names, + _skip_rank_check=True, + ) - # Convert cluster_ids to int64 for Rust (if provided) - cluster_ids_int = None - if cluster_ids is not None: - cluster_ids_int = pd.factorize(cluster_ids)[0].astype(np.int64) + # Check for rank deficiency using fast pivoted QR decomposition. + # This adds O(nk²) overhead but is necessary for: + # 1. Detecting which columns to drop (R-style NA handling) + # 2. Routing rank-deficient cases to Python (Rust doesn't support pivoted QR) + # + # Trade-off: ~2x compute cost for full-rank matrices in exchange for proper + # rank deficiency handling. For maximum performance on known full-rank data, + # set skip_rank_check=True. 
+ rank, dropped_cols, pivot = _detect_rank_deficiency(X) + is_rank_deficient = len(dropped_cols) > 0 + + # Routing strategy: + # - Full-rank + Rust available → fast Rust backend (SVD-based solve) + # - Rank-deficient → Python backend (proper NA handling, valid SEs) + # - No Rust → Python backend (works for all cases) + if HAS_RUST_BACKEND and _rust_solve_ols is not None and not is_rank_deficient: + result = _solve_ols_rust( + X, y, + cluster_ids=cluster_ids, + return_vcov=return_vcov, + return_fitted=return_fitted, + ) - try: - coefficients, residuals, vcov = _rust_solve_ols( - X, y, cluster_ids_int, return_vcov + # Check for NaN vcov: Rust SVD may detect rank-deficiency that QR missed + # for ill-conditioned matrices (QR and SVD have different numerical properties). + # When this happens, fall back to Python's R-style handling. + vcov = result[-1] # vcov is always the last element + if return_vcov and vcov is not None and np.any(np.isnan(vcov)): + warnings.warn( + "Rust backend detected ill-conditioned matrix (NaN in variance-covariance). " + "Re-running with Python backend for proper rank detection.", + UserWarning, + stacklevel=2, + ) + # Force fresh rank detection - don't pass cached info since QR + # and SVD disagreed about rank. Python's QR will re-detect and + # apply R-style NaN handling for dropped columns. + return _solve_ols_numpy( + X, y, + cluster_ids=cluster_ids, + return_vcov=return_vcov, + return_fitted=return_fitted, + rank_deficient_action=rank_deficient_action, + column_names=column_names, + _precomputed_rank_info=None, # Force re-detection ) - except ValueError as e: - # Translate Rust LAPACK errors to consistent Python error messages - error_msg = str(e) - if "Matrix inversion failed" in error_msg or "Least squares failed" in error_msg: - raise ValueError( - "Design matrix is rank-deficient (singular X'X matrix). " - "This indicates perfect multicollinearity. Check your fixed effects " - "and covariates for linear dependencies." 
- ) from e - raise - - if return_fitted: - fitted = X @ coefficients - return coefficients, residuals, fitted, vcov else: - return coefficients, residuals, vcov + return result - # Fallback to NumPy/SciPy implementation + # Use NumPy implementation for rank-deficient cases (R-style NA handling) + # or when Rust backend is not available return _solve_ols_numpy( - X, y, cluster_ids=cluster_ids, return_vcov=return_vcov, return_fitted=return_fitted + X, y, + cluster_ids=cluster_ids, + return_vcov=return_vcov, + return_fitted=return_fitted, + rank_deficient_action=rank_deficient_action, + column_names=column_names, + # Pass pre-computed rank info to avoid redundant computation + _precomputed_rank_info=(rank, dropped_cols, pivot), ) @@ -177,18 +555,20 @@ def _solve_ols_numpy( cluster_ids: Optional[np.ndarray] = None, return_vcov: bool = True, return_fitted: bool = False, + rank_deficient_action: str = "warn", + column_names: Optional[List[str]] = None, + _precomputed_rank_info: Optional[Tuple[int, np.ndarray, np.ndarray]] = None, + _skip_rank_check: bool = False, ) -> Union[ Tuple[np.ndarray, np.ndarray, Optional[np.ndarray]], Tuple[np.ndarray, np.ndarray, np.ndarray, Optional[np.ndarray]], ]: """ - NumPy/SciPy fallback implementation of solve_ols. + NumPy/SciPy implementation of solve_ols with R-style rank deficiency handling. - Uses scipy.linalg.lstsq with 'gelsy' driver (QR with column pivoting) - for numerically stable least squares solving. QR decomposition is preferred - over normal equations because it doesn't square the condition number of X, - making it more robust for ill-conditioned matrices common in DiD designs - (e.g., many unit/time fixed effects). + Detects rank-deficient matrices using pivoted QR decomposition and handles + them following R's lm() approach: drop redundant columns, set NA (NaN) for + their coefficients, and compute valid SEs for identified coefficients only. 
Parameters ---------- @@ -202,32 +582,100 @@ def _solve_ols_numpy( Whether to compute variance-covariance matrix. return_fitted : bool Whether to return fitted values. + rank_deficient_action : str + How to handle rank deficiency: "warn", "error", or "silent". + column_names : list of str, optional + Names for the columns (used in warning/error messages). + _precomputed_rank_info : tuple, optional + Pre-computed (rank, dropped_cols, pivot) from _detect_rank_deficiency. + Used internally to avoid redundant computation when called from solve_ols. + _skip_rank_check : bool, default False + If True, skip rank detection entirely and assume full rank. + Used when caller has already determined matrix is full rank. Returns ------- coefficients : np.ndarray - OLS coefficients of shape (k,). + OLS coefficients of shape (k,). NaN for dropped columns. residuals : np.ndarray Residuals of shape (n,). fitted : np.ndarray, optional Fitted values if return_fitted=True. vcov : np.ndarray, optional - Variance-covariance matrix if return_vcov=True. + Variance-covariance matrix if return_vcov=True. NaN for dropped rows/cols. 
""" - # Solve OLS using QR decomposition via scipy's optimized LAPACK routines - # 'gelsy' uses QR with column pivoting, which is numerically stable even - # for ill-conditioned matrices (doesn't square the condition number like - # normal equations would) - coefficients = scipy_lstsq(X, y, lapack_driver="gelsy", check_finite=False)[0] + n, k = X.shape + + # Determine rank deficiency status + if _skip_rank_check: + # Caller guarantees full rank - skip expensive QR decomposition + is_rank_deficient = False + dropped_cols = np.array([], dtype=int) + elif _precomputed_rank_info is not None: + # Use pre-computed rank info + rank, dropped_cols, pivot = _precomputed_rank_info + is_rank_deficient = len(dropped_cols) > 0 + else: + # Compute rank via pivoted QR + rank, dropped_cols, pivot = _detect_rank_deficiency(X) + is_rank_deficient = len(dropped_cols) > 0 + + if is_rank_deficient: + # Format dropped column information for messages + dropped_str = _format_dropped_columns(dropped_cols, column_names) + + if rank_deficient_action == "error": + raise ValueError( + f"Design matrix is rank-deficient. {k - rank} of {k} columns are " + f"linearly dependent and cannot be uniquely estimated: {dropped_str}. " + "This indicates multicollinearity in your model specification." + ) + elif rank_deficient_action == "warn": + warnings.warn( + f"Rank-deficient design matrix: dropping {k - rank} of {k} columns " + f"({dropped_str}). Coefficients for these columns are set to NA. 
" + "This may indicate multicollinearity in your model specification.", + UserWarning, + stacklevel=3, # Point to user code that called solve_ols + ) + # else: "silent" - no warning + + # Extract kept columns for the reduced solve + kept_cols = np.array([i for i in range(k) if i not in dropped_cols]) + X_reduced = X[:, kept_cols] + + # Solve the reduced system (now full-rank) + # Use cond=1e-07 for consistency with Rust backend and QR rank tolerance + coefficients_reduced = scipy_lstsq( + X_reduced, y, lapack_driver="gelsd", check_finite=False, cond=1e-07 + )[0] + + # Expand coefficients to full size with NaN for dropped columns + coefficients = _expand_coefficients_with_nan(coefficients_reduced, k, kept_cols) + + # Compute residuals using only the identified coefficients + # Note: Dropped coefficients are NaN, so we use the reduced form + fitted = X_reduced @ coefficients_reduced + residuals = y - fitted + + # Compute variance-covariance matrix for reduced system, then expand + vcov = None + if return_vcov: + vcov_reduced = _compute_robust_vcov_numpy(X_reduced, residuals, cluster_ids) + vcov = _expand_vcov_with_nan(vcov_reduced, k, kept_cols) + else: + # Full-rank case: proceed normally + # Use cond=1e-07 for consistency with Rust backend and QR rank tolerance + coefficients = scipy_lstsq(X, y, lapack_driver="gelsd", check_finite=False, cond=1e-07)[0] - # Compute residuals and fitted values - fitted = X @ coefficients - residuals = y - fitted + # Compute residuals and fitted values + fitted = X @ coefficients + residuals = y - fitted - # Compute variance-covariance matrix if requested - vcov = None - if return_vcov: - vcov = _compute_robust_vcov_numpy(X, residuals, cluster_ids) + # Compute variance-covariance matrix if requested + vcov = None + if return_vcov: + vcov = _compute_robust_vcov_numpy(X, residuals, cluster_ids) if return_fitted: return coefficients, residuals, fitted, vcov @@ -475,12 +923,22 @@ class InferenceResult: alpha: float = 0.05 def 
is_significant(self, alpha: Optional[float] = None) -> bool: - """Check if the coefficient is statistically significant.""" + """Check if the coefficient is statistically significant. + + Returns False for NaN p-values (unidentified coefficients). + """ + if np.isnan(self.p_value): + return False threshold = alpha if alpha is not None else self.alpha return self.p_value < threshold def significance_stars(self) -> str: - """Return significance stars based on p-value.""" + """Return significance stars based on p-value. + + Returns empty string for NaN p-values (unidentified coefficients). + """ + if np.isnan(self.p_value): + return "" if self.p_value < 0.001: return "***" elif self.p_value < 0.01: @@ -526,6 +984,11 @@ class LinearRegression: Overrides the `robust` parameter if provided. alpha : float, default 0.05 Significance level for confidence intervals. + rank_deficient_action : str, default "warn" + Action when design matrix is rank-deficient (linearly dependent columns): + - "warn": Issue warning and drop linearly dependent columns (default) + - "error": Raise ValueError + - "silent": Drop columns silently without warning Attributes ---------- @@ -541,8 +1004,11 @@ class LinearRegression: Number of observations (available after fit). n_params_ : int Number of parameters including intercept (available after fit). + n_params_effective_ : int + Effective number of parameters after dropping linearly dependent columns. + Equals n_params_ for full-rank matrices (available after fit). df_ : int - Degrees of freedom (n - k) (available after fit). + Degrees of freedom (n - n_params_effective) (available after fit). 
Examples -------- @@ -577,11 +1043,13 @@ def __init__( robust: bool = True, cluster_ids: Optional[np.ndarray] = None, alpha: float = 0.05, + rank_deficient_action: str = "warn", ): self.include_intercept = include_intercept self.robust = robust self.cluster_ids = cluster_ids self.alpha = alpha + self.rank_deficient_action = rank_deficient_action # Fitted attributes (set by fit()) self.coefficients_: Optional[np.ndarray] = None @@ -592,6 +1060,7 @@ def __init__( self._X: Optional[np.ndarray] = None self.n_obs_: Optional[int] = None self.n_params_: Optional[int] = None + self.n_params_effective_: Optional[int] = None self.df_: Optional[int] = None def fit( @@ -643,6 +1112,7 @@ def fit( cluster_ids=effective_cluster_ids, return_fitted=True, return_vcov=compute_vcov, + rank_deficient_action=self.rank_deficient_action, ) else: # Classical OLS - compute vcov separately @@ -650,15 +1120,37 @@ def fit( X, y, return_fitted=True, return_vcov=False, + rank_deficient_action=self.rank_deficient_action, ) # Compute classical OLS variance-covariance matrix + # Handle rank-deficient case: use effective rank for df n, k = X.shape - mse = np.sum(residuals**2) / (n - k) - try: - vcov = np.linalg.solve(X.T @ X, mse * np.eye(k)) - except np.linalg.LinAlgError: - # Fall back to pseudo-inverse for singular matrices - vcov = np.linalg.pinv(X.T @ X) * mse + nan_mask = np.isnan(coefficients) + k_effective = k - np.sum(nan_mask) # Number of identified coefficients + + if k_effective == 0: + # All coefficients dropped - no valid inference + vcov = np.full((k, k), np.nan) + elif np.any(nan_mask): + # Rank-deficient: compute vcov for identified coefficients only + kept_cols = np.where(~nan_mask)[0] + X_reduced = X[:, kept_cols] + mse = np.sum(residuals**2) / (n - k_effective) + try: + vcov_reduced = np.linalg.solve( + X_reduced.T @ X_reduced, mse * np.eye(k_effective) + ) + except np.linalg.LinAlgError: + vcov_reduced = np.linalg.pinv(X_reduced.T @ X_reduced) * mse + # Expand to full size with 
NaN for dropped columns + vcov = _expand_vcov_with_nan(vcov_reduced, k, kept_cols) + else: + # Full rank: standard computation + mse = np.sum(residuals**2) / (n - k) + try: + vcov = np.linalg.solve(X.T @ X, mse * np.eye(k)) + except np.linalg.LinAlgError: + vcov = np.linalg.pinv(X.T @ X) * mse # Store fitted attributes self.coefficients_ = coefficients @@ -669,7 +1161,12 @@ def fit( self._X = X self.n_obs_ = X.shape[0] self.n_params_ = X.shape[1] - self.df_ = self.n_obs_ - self.n_params_ - df_adjustment + + # Compute effective number of parameters (excluding dropped columns) + # This is needed for correct degrees of freedom in inference + nan_mask = np.isnan(coefficients) + self.n_params_effective_ = int(self.n_params_ - np.sum(nan_mask)) + self.df_ = self.n_obs_ - self.n_params_effective_ - df_adjustment return self @@ -876,10 +1373,18 @@ def r_squared(self, adjusted: bool = False) -> float: ------- float R-squared value. + + Notes + ----- + For rank-deficient fits, adjusted R² uses the effective number of + parameters (excluding dropped columns) for consistency with the + corrected degrees of freedom. """ self._check_fitted() + # Use effective params for adjusted R² to match df correction + n_params = self.n_params_effective_ if adjusted else self.n_params_ return compute_r_squared( - self._y, self.residuals_, adjusted=adjusted, n_params=self.n_params_ + self._y, self.residuals_, adjusted=adjusted, n_params=n_params ) def predict(self, X: np.ndarray) -> np.ndarray: @@ -896,6 +1401,12 @@ def predict(self, X: np.ndarray) -> np.ndarray: ------- ndarray Predicted values. + + Notes + ----- + For rank-deficient fits where some coefficients are NaN, predictions + use only the identified (non-NaN) coefficients. This is equivalent to + treating dropped columns as having zero coefficients. 
""" self._check_fitted() X = np.asarray(X, dtype=np.float64) @@ -903,7 +1414,12 @@ def predict(self, X: np.ndarray) -> np.ndarray: if self.include_intercept: X = np.column_stack([np.ones(X.shape[0]), X]) - return X @ self.coefficients_ + # Handle rank-deficient case: use only identified coefficients + # Replace NaN with 0 so they don't contribute to prediction + coef = self.coefficients_.copy() + coef[np.isnan(coef)] = 0.0 + + return X @ coef # ============================================================================= diff --git a/diff_diff/results.py b/diff_diff/results.py index b99c5170..213d09fd 100644 --- a/diff_diff/results.py +++ b/diff_diff/results.py @@ -181,7 +181,14 @@ def significance_stars(self) -> str: def _get_significance_stars(p_value: float) -> str: - """Return significance stars based on p-value.""" + """Return significance stars based on p-value. + + Returns empty string for NaN p-values (unidentified coefficients from + rank-deficient matrices). + """ + import numpy as np + if np.isnan(p_value): + return "" if p_value < 0.001: return "***" elif p_value < 0.01: diff --git a/diff_diff/staggered.py b/diff_diff/staggered.py index aa1fce1e..3b1bcb41 100644 --- a/diff_diff/staggered.py +++ b/diff_diff/staggered.py @@ -109,6 +109,7 @@ def gradient(beta: np.ndarray) -> np.ndarray: def _linear_regression( X: np.ndarray, y: np.ndarray, + rank_deficient_action: str = "warn", ) -> Tuple[np.ndarray, np.ndarray]: """ Fit OLS regression. @@ -119,6 +120,11 @@ def _linear_regression( Feature matrix (n_samples, n_features). Intercept added automatically. y : np.ndarray Outcome variable. 
+ rank_deficient_action : str, default "warn" + Action when design matrix is rank-deficient: + - "warn": Issue warning and drop linearly dependent columns (default) + - "error": Raise ValueError + - "silent": Drop columns silently without warning Returns ------- @@ -132,7 +138,10 @@ def _linear_regression( X_with_intercept = np.column_stack([np.ones(n), X]) # Use unified OLS backend (no vcov needed) - beta, residuals, _ = solve_ols(X_with_intercept, y, return_vcov=False) + beta, residuals, _ = solve_ols( + X_with_intercept, y, return_vcov=False, + rank_deficient_action=rank_deficient_action, + ) return beta, residuals @@ -195,6 +204,11 @@ class CallawaySantAnna( Use ``bootstrap_weights`` instead. Will be removed in v2.0. seed : int, optional Random seed for reproducibility. + rank_deficient_action : str, default="warn" + Action when design matrix is rank-deficient (linearly dependent columns): + - "warn": Issue warning and drop linearly dependent columns (default) + - "error": Raise ValueError + - "silent": Drop columns silently without warning Attributes ---------- @@ -277,6 +291,7 @@ def __init__( bootstrap_weights: Optional[str] = None, bootstrap_weight_type: Optional[str] = None, seed: Optional[int] = None, + rank_deficient_action: str = "warn", ): import warnings @@ -312,6 +327,12 @@ def __init__( f"got '{bootstrap_weights}'" ) + if rank_deficient_action not in ["warn", "error", "silent"]: + raise ValueError( + f"rank_deficient_action must be 'warn', 'error', or 'silent', " + f"got '{rank_deficient_action}'" + ) + self.control_group = control_group self.anticipation = anticipation self.estimation_method = estimation_method @@ -322,6 +343,7 @@ def __init__( # Keep bootstrap_weight_type for backward compatibility self.bootstrap_weight_type = bootstrap_weights self.seed = seed + self.rank_deficient_action = rank_deficient_action self.is_fitted_ = False self.results_: Optional[CallawaySantAnnaResults] = None @@ -778,7 +800,10 @@ def _outcome_regression( if 
X_treated is not None and X_control is not None and X_treated.shape[1] > 0: # Covariate-adjusted outcome regression # Fit regression on control units: E[Delta Y | X, D=0] - beta, residuals = _linear_regression(X_control, control_change) + beta, residuals = _linear_regression( + X_control, control_change, + rank_deficient_action=self.rank_deficient_action, + ) # Predict counterfactual for treated units X_treated_with_intercept = np.column_stack([np.ones(n_t), X_treated]) @@ -938,7 +963,10 @@ def _doubly_robust( if X_treated is not None and X_control is not None and X_treated.shape[1] > 0: # Doubly robust estimation with covariates # Step 1: Outcome regression - fit E[Delta Y | X] on control - beta, _ = _linear_regression(X_control, control_change) + beta, _ = _linear_regression( + X_control, control_change, + rank_deficient_action=self.rank_deficient_action, + ) # Predict counterfactual for both treated and control X_treated_with_intercept = np.column_stack([np.ones(n_t), X_treated]) @@ -1014,6 +1042,7 @@ def get_params(self) -> Dict[str, Any]: # Deprecated but kept for backward compatibility "bootstrap_weight_type": self.bootstrap_weight_type, "seed": self.seed, + "rank_deficient_action": self.rank_deficient_action, } def set_params(self, **params) -> "CallawaySantAnna": diff --git a/diff_diff/sun_abraham.py b/diff_diff/sun_abraham.py index 018b618d..cb7313f9 100644 --- a/diff_diff/sun_abraham.py +++ b/diff_diff/sun_abraham.py @@ -319,6 +319,11 @@ class SunAbraham: If 0, uses analytical cluster-robust standard errors. seed : int, optional Random seed for reproducibility. 
+ rank_deficient_action : str, default="warn" + Action when design matrix is rank-deficient (linearly dependent columns): + - "warn": Issue warning and drop linearly dependent columns (default) + - "error": Raise ValueError + - "silent": Drop columns silently without warning Attributes ---------- @@ -395,6 +400,7 @@ def __init__( cluster: Optional[str] = None, n_bootstrap: int = 0, seed: Optional[int] = None, + rank_deficient_action: str = "warn", ): if control_group not in ["never_treated", "not_yet_treated"]: raise ValueError( @@ -402,12 +408,19 @@ def __init__( f"got '{control_group}'" ) + if rank_deficient_action not in ["warn", "error", "silent"]: + raise ValueError( + f"rank_deficient_action must be 'warn', 'error', or 'silent', " + f"got '{rank_deficient_action}'" + ) + self.control_group = control_group self.anticipation = anticipation self.alpha = alpha self.cluster = cluster self.n_bootstrap = n_bootstrap self.seed = seed + self.rank_deficient_action = rank_deficient_action self.is_fitted_ = False self.results_: Optional[SunAbrahamResults] = None @@ -756,6 +769,7 @@ def _fit_saturated_regression( include_intercept=False, # Already demeaned, no intercept needed robust=True, cluster_ids=cluster_ids, + rank_deficient_action=self.rank_deficient_action, ).fit(X, y) coefficients = reg.coefficients_ @@ -1153,6 +1167,7 @@ def get_params(self) -> Dict[str, Any]: "cluster": self.cluster, "n_bootstrap": self.n_bootstrap, "seed": self.seed, + "rank_deficient_action": self.rank_deficient_action, } def set_params(self, **params) -> "SunAbraham": diff --git a/diff_diff/triple_diff.py b/diff_diff/triple_diff.py index 1c82fc05..126d30b0 100644 --- a/diff_diff/triple_diff.py +++ b/diff_diff/triple_diff.py @@ -330,6 +330,7 @@ def gradient(beta: np.ndarray) -> np.ndarray: def _linear_regression( X: np.ndarray, y: np.ndarray, + rank_deficient_action: str = "warn", ) -> Tuple[np.ndarray, np.ndarray, float]: """ Fit OLS regression. 
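The `rank_deficient_action` parameter threaded through these estimators controls the R-style handling described above. As a reviewer's aid, here is a minimal standalone sketch of the mechanism — pivoted-QR rank detection followed by a reduced solve with NaN (NA) for dropped columns. This is an illustration only, not the patch's `_detect_rank_deficiency` implementation; the `tol=1e-7` default mirrors R's `qr()` tolerance as documented in this PR.

```python
import numpy as np
from scipy.linalg import qr, lstsq

def detect_rank_deficiency(X, tol=1e-7):
    # Pivoted QR orders |diag(R)| non-increasingly, so comparing each diagonal
    # entry to the first (largest) reveals the numerical rank, as R's qr() does.
    _, R, pivot = qr(X, mode="economic", pivoting=True)
    d = np.abs(np.diag(R))
    rank = int(np.sum(d > tol * d[0]))
    return rank, np.sort(pivot[rank:])  # dropped column indices

rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([np.ones(100), x, 2.0 * x])   # third column duplicates second
y = 1.0 + 3.0 * x + rng.normal(scale=0.1, size=100)

rank, dropped = detect_rank_deficiency(X)          # rank == 2, one column dropped
kept = np.setdiff1d(np.arange(X.shape[1]), dropped)
beta = np.full(X.shape[1], np.nan)                 # NA for the dropped column
beta[kept] = lstsq(X[:, kept], y)[0]               # solve the full-rank reduced system
```

With `rank_deficient_action="warn"` the library emits a warning at this point; `"error"` raises instead, and `"silent"` drops the columns quietly.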
@@ -340,6 +341,11 @@ def _linear_regression( Feature matrix (n_samples, n_features). Intercept added automatically. y : np.ndarray Outcome variable. + rank_deficient_action : str, default "warn" + Action when design matrix is rank-deficient: + - "warn": Issue warning and drop linearly dependent columns (default) + - "error": Raise ValueError + - "silent": Drop columns silently without warning Returns ------- @@ -355,7 +361,8 @@ def _linear_regression( # Use unified OLS backend beta, residuals, fitted, _ = solve_ols( - X_with_intercept, y, return_fitted=True, return_vcov=False + X_with_intercept, y, return_fitted=True, return_vcov=False, + rank_deficient_action=rank_deficient_action, ) # Compute R-squared @@ -400,6 +407,11 @@ class TripleDifference: pscore_trim : float, default=0.01 Trimming threshold for propensity scores. Scores below this value or above (1 - pscore_trim) are clipped to avoid extreme weights. + rank_deficient_action : str, default="warn" + Action when design matrix is rank-deficient (linearly dependent columns): + - "warn": Issue warning and drop linearly dependent columns (default) + - "error": Raise ValueError + - "silent": Drop columns silently without warning Attributes ---------- @@ -478,17 +490,24 @@ def __init__( cluster: Optional[str] = None, alpha: float = 0.05, pscore_trim: float = 0.01, + rank_deficient_action: str = "warn", ): if estimation_method not in ("dr", "reg", "ipw"): raise ValueError( f"estimation_method must be 'dr', 'reg', or 'ipw', " f"got '{estimation_method}'" ) + if rank_deficient_action not in ["warn", "error", "silent"]: + raise ValueError( + f"rank_deficient_action must be 'warn', 'error', or 'silent', " + f"got '{rank_deficient_action}'" + ) self.estimation_method = estimation_method self.robust = robust self.cluster = cluster self.alpha = alpha self.pscore_trim = pscore_trim + self.rank_deficient_action = rank_deficient_action self.is_fitted_ = False self.results_: Optional[TripleDifferenceResults] = None @@ -744,6 
+763,7 @@ def _regression_adjustment( include_intercept=False, # Intercept already in design_matrix robust=self.robust, alpha=self.alpha, + rank_deficient_action=self.rank_deficient_action, ).fit(design_matrix, y) # ATT is the coefficient on G*P*T (index 7) @@ -937,7 +957,10 @@ def _doubly_robust( if np.sum(mask) > 1: X_cell = np.column_stack([X[mask], T[mask]]) try: - _, fitted, _ = _linear_regression(X_cell, y[mask]) + _, fitted, _ = _linear_regression( + X_cell, y[mask], + rank_deficient_action=self.rank_deficient_action, + ) mu_fitted[mask] = fitted except Exception: mu_fitted[mask] = np.mean(y[mask]) @@ -1166,6 +1189,7 @@ def get_params(self) -> Dict[str, Any]: "cluster": self.cluster, "alpha": self.alpha, "pscore_trim": self.pscore_trim, + "rank_deficient_action": self.rank_deficient_action, } def set_params(self, **params) -> "TripleDifference": @@ -1223,6 +1247,7 @@ def triple_difference( robust: bool = True, cluster: Optional[str] = None, alpha: float = 0.05, + rank_deficient_action: str = "warn", ) -> TripleDifferenceResults: """ Estimate Triple Difference (DDD) treatment effect. @@ -1256,6 +1281,11 @@ def triple_difference( Column name for cluster-robust standard errors. alpha : float, default=0.05 Significance level for confidence intervals. 
+ rank_deficient_action : str, default="warn" + Action when design matrix is rank-deficient: + - "warn": Issue warning and drop linearly dependent columns (default) + - "error": Raise ValueError + - "silent": Drop columns silently without warning Returns ------- @@ -1280,6 +1310,7 @@ def triple_difference( robust=robust, cluster=cluster, alpha=alpha, + rank_deficient_action=rank_deficient_action, ) return estimator.fit( data=data, diff --git a/diff_diff/twfe.py b/diff_diff/twfe.py index f3167ec0..45b08d64 100644 --- a/diff_diff/twfe.py +++ b/diff_diff/twfe.py @@ -127,12 +127,30 @@ def fit( # type: ignore[override] # Always use LinearRegression for initial fit (unified code path) # For wild bootstrap, we don't need cluster SEs from the initial fit cluster_ids = data[cluster_var].values - reg = LinearRegression( - include_intercept=False, # Intercept already in X - robust=True, # TWFE always uses robust/cluster SEs - cluster_ids=cluster_ids if self.inference != "wild_bootstrap" else None, - alpha=self.alpha, - ).fit(X, y, df_adjustment=df_adjustment) + + # Pass rank_deficient_action to LinearRegression + # If "error", let LinearRegression raise immediately + # If "warn" or "silent", suppress generic warning and use TWFE's context-specific + # error/warning messages (more informative for panel data) + if self.rank_deficient_action == "error": + reg = LinearRegression( + include_intercept=False, + robust=True, + cluster_ids=cluster_ids if self.inference != "wild_bootstrap" else None, + alpha=self.alpha, + rank_deficient_action="error", + ).fit(X, y, df_adjustment=df_adjustment) + else: + # Suppress generic warning, TWFE provides context-specific messages below + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", message="Rank-deficient design matrix") + reg = LinearRegression( + include_intercept=False, + robust=True, + cluster_ids=cluster_ids if self.inference != "wild_bootstrap" else None, + alpha=self.alpha, + rank_deficient_action="silent", + 
).fit(X, y, df_adjustment=df_adjustment) coefficients = reg.coefficients_ residuals = reg.residuals_ @@ -140,6 +158,41 @@ def fit( # type: ignore[override] r_squared = reg.r_squared() att = coefficients[att_idx] + # Check for unidentified coefficients (collinearity) + # Build column names for informative error messages + column_names = ["intercept", "treatment×post"] + if covariates: + column_names.extend(covariates) + + nan_mask = np.isnan(coefficients) + if np.any(nan_mask): + dropped_indices = np.where(nan_mask)[0] + dropped_names = [column_names[i] if i < len(column_names) + else f"column {i}" for i in dropped_indices] + + # Determine the source of collinearity for better error message + if att_idx in dropped_indices: + # Treatment coefficient is unidentified + raise ValueError( + f"Treatment effect cannot be identified due to collinearity. " + f"Dropped columns: {', '.join(dropped_names)}. " + "This can happen when: (1) treatment is perfectly collinear with " + "unit/time fixed effects, (2) all treated units are treated in all " + "periods, or (3) a covariate is collinear with the treatment indicator. " + "Check your data structure and model specification." + ) + else: + # Only covariates are dropped - this is a warning, not an error + # The ATT can still be estimated + # Respect rank_deficient_action setting for warning + if self.rank_deficient_action == "warn": + warnings.warn( + f"Some covariates are collinear and were dropped: " + f"{', '.join(dropped_names)}. The treatment effect is still identified.", + UserWarning, + stacklevel=2, + ) + # Get inference - either from bootstrap or analytical if self.inference == "wild_bootstrap": # Override with wild cluster bootstrap inference diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 36da4796..80e46129 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -58,7 +58,9 @@ where τ is the ATT. 
*Edge cases:* - Empty cells (e.g., no treated-pre observations) raise ValueError - Singleton groups in clustering are dropped with warning -- Perfect collinearity in covariates raises LinAlgError +- Rank-deficient design matrix (collinearity): warns and sets NA for dropped coefficients (R-style, matches `lm()`) + - Tolerance: `1e-07` (matches R's `qr()` default), relative to largest diagonal element of R in QR decomposition + - Controllable via `rank_deficient_action` parameter: "warn" (default), "error", or "silent" **Reference implementation(s):** - R: `fixest::feols()` with interaction term @@ -98,6 +100,8 @@ where E_i is treatment time for unit i, and δ_e are event-study coefficients. - Unbalanced panels: only uses observations where event-time is defined - Never-treated units: event-time indicators are all zero - Endpoint binning: distant event times can be binned +- Rank-deficient design matrix (collinearity): warns and sets NA for dropped coefficients (R-style, matches `lm()`) +- Average ATT (`avg_att`) is NA if any post-period effect is unidentified (R-style NA propagation) **Reference implementation(s):** - R: `fixest::feols()` with `i(event_time, ref=-1)` @@ -139,7 +143,9 @@ where tildes denote demeaned variables. 
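The "sets NA for dropped coefficients" behavior in these registry entries can be sketched in a few lines of NumPy. This is an illustrative stand-in for the patch's `_expand_coefficients_with_nan` / `_expand_vcov_with_nan` helpers, not their exact code: estimates from the reduced (full-rank) system are re-embedded into full-size arrays, with NaN marking unidentified slots.

```python
import numpy as np

def expand_with_nan(beta_reduced, vcov_reduced, k, kept):
    """Re-embed reduced estimates into full-size arrays; NaN marks dropped slots."""
    beta = np.full(k, np.nan)
    beta[kept] = beta_reduced
    vcov = np.full((k, k), np.nan)
    vcov[np.ix_(kept, kept)] = vcov_reduced  # fill only kept rows/cols
    return beta, vcov

# Toy example: 3 columns, column 1 was dropped as linearly dependent
beta, vcov = expand_with_nan(np.array([1.0, 2.0]), np.eye(2),
                             k=3, kept=np.array([0, 2]))
```

Downstream inference then reports NA for the dropped coefficient while SEs for the identified coefficients remain valid, matching R's `lm()` output.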
*Edge cases:* - Singleton units/periods are automatically dropped -- Treatment perfectly collinear with FE raises error +- Treatment perfectly collinear with FE raises error with informative message listing dropped columns +- Covariate collinearity emits warning but estimation continues (ATT still identified) +- Rank-deficient design matrix: warns and sets NA for dropped coefficients (R-style, matches `lm()`) - Unbalanced panels handled via proper demeaning **Reference implementation(s):** @@ -194,6 +200,10 @@ Aggregations: - Groups with single observation: included but may have high variance - Missing group-time cells: ATT(g,t) set to NaN - Anticipation: `anticipation` parameter shifts reference period +- Rank-deficient design matrix (covariate collinearity): + - Detection: Pivoted QR decomposition with tolerance `1e-07` (R's `qr()` default) + - Handling: Warns and drops linearly dependent columns, sets NA for dropped coefficients (R-style, matches `lm()`) + - Parameter: `rank_deficient_action` controls behavior: "warn" (default), "error", or "silent" **Reference implementation(s):** - R: `did::att_gt()` (Callaway & Sant'Anna's official package) @@ -242,6 +252,10 @@ where weights ŵ_{g,e} = n_{g,e} / Σ_g n_{g,e} (sample share of cohort g at eve - Single cohort: reduces to standard event study - Cohorts with no observations at some event-times: weighted appropriately - Extrapolation beyond observed event-times: not estimated +- Rank-deficient design matrix (covariate collinearity): + - Detection: Pivoted QR decomposition with tolerance `1e-07` (R's `qr()` default) + - Handling: Warns and drops linearly dependent columns, sets NA for dropped coefficients (R-style, matches `lm()`) + - Parameter: `rank_deficient_action` controls behavior: "warn" (default), "error", or "silent" **Reference implementation(s):** - R: `fixest::sunab()` (Laurent Bergé's implementation) diff --git a/docs/tutorials/02_staggered_did.ipynb b/docs/tutorials/02_staggered_did.ipynb index 
ff75190d..f15f6c6a 100644 --- a/docs/tutorials/02_staggered_did.ipynb +++ b/docs/tutorials/02_staggered_did.ipynb @@ -29,10 +29,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:53.563913Z", - "iopub.status.busy": "2026-01-18T18:23:53.563787Z", - "iopub.status.idle": "2026-01-18T18:23:54.146380Z", - "shell.execute_reply": "2026-01-18T18:23:54.146080Z" + "iopub.execute_input": "2026-01-20T14:37:22.343991Z", + "iopub.status.busy": "2026-01-20T14:37:22.343878Z", + "iopub.status.idle": "2026-01-20T14:37:23.009082Z", + "shell.execute_reply": "2026-01-20T14:37:23.008730Z" } }, "outputs": [], @@ -66,24 +66,48 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.147667Z", - "iopub.status.busy": "2026-01-18T18:23:54.147584Z", - "iopub.status.idle": "2026-01-18T18:23:54.153830Z", - "shell.execute_reply": "2026-01-18T18:23:54.153627Z" + "iopub.execute_input": "2026-01-20T14:37:23.010593Z", + "iopub.status.busy": "2026-01-20T14:37:23.010493Z", + "iopub.status.idle": "2026-01-20T14:37:23.019860Z", + "shell.execute_reply": "2026-01-20T14:37:23.019620Z" } }, "outputs": [], - "source": "# Generate staggered adoption data using the library function\nfrom diff_diff import generate_staggered_data\n\n# Generate data with 100 units, 8 periods, two treatment cohorts (periods 3 and 5),\n# and 40% never-treated\ndf = generate_staggered_data(\n n_units=100,\n n_periods=8,\n cohort_periods=[3, 5], # Treatment cohorts at periods 3 and 5\n never_treated_frac=0.4,\n treatment_effect=2.0,\n dynamic_effects=True,\n effect_growth=0.5, # Effect grows 0.5 per period\n unit_fe_sd=2.0,\n noise_sd=0.5,\n seed=42\n)\n\n# Add a 'cohort' column that matches the old format (first_treat is already there)\ndf['cohort'] = df['first_treat']\n\nprint(f\"Dataset: {len(df)} observations, {df['unit'].nunique()} units, {df['period'].nunique()} periods\")\ndf.head(10)" + "source": [ + "# Generate staggered 
adoption data using the library function\n", + "from diff_diff import generate_staggered_data\n", + "\n", + "# Generate data with 100 units, 8 periods, two treatment cohorts (periods 3 and 5),\n", + "# and 40% never-treated\n", + "df = generate_staggered_data(\n", + " n_units=100,\n", + " n_periods=8,\n", + " cohort_periods=[3, 5], # Treatment cohorts at periods 3 and 5\n", + " never_treated_frac=0.4,\n", + " treatment_effect=2.0,\n", + " dynamic_effects=True,\n", + " effect_growth=0.5, # Effect grows 0.5 per period\n", + " unit_fe_sd=2.0,\n", + " noise_sd=0.5,\n", + " seed=42\n", + ")\n", + "\n", + "# Add a 'cohort' column that matches the old format (first_treat is already there)\n", + "df['cohort'] = df['first_treat']\n", + "\n", + "print(f\"Dataset: {len(df)} observations, {df['unit'].nunique()} units, {df['period'].nunique()} periods\")\n", + "df.head(10)" + ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.165680Z", - "iopub.status.busy": "2026-01-18T18:23:54.165600Z", - "iopub.status.idle": "2026-01-18T18:23:54.168804Z", - "shell.execute_reply": "2026-01-18T18:23:54.168561Z" + "iopub.execute_input": "2026-01-20T14:37:23.034321Z", + "iopub.status.busy": "2026-01-20T14:37:23.034210Z", + "iopub.status.idle": "2026-01-20T14:37:23.037701Z", + "shell.execute_reply": "2026-01-20T14:37:23.037433Z" } }, "outputs": [], @@ -115,10 +139,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.169846Z", - "iopub.status.busy": "2026-01-18T18:23:54.169789Z", - "iopub.status.idle": "2026-01-18T18:23:54.173444Z", - "shell.execute_reply": "2026-01-18T18:23:54.173244Z" + "iopub.execute_input": "2026-01-20T14:37:23.038882Z", + "iopub.status.busy": "2026-01-20T14:37:23.038806Z", + "iopub.status.idle": "2026-01-20T14:37:23.046891Z", + "shell.execute_reply": "2026-01-20T14:37:23.046635Z" } }, "outputs": [], @@ -160,10 +184,10 @@ "execution_count": null, 
"metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.174452Z", - "iopub.status.busy": "2026-01-18T18:23:54.174400Z", - "iopub.status.idle": "2026-01-18T18:23:54.180139Z", - "shell.execute_reply": "2026-01-18T18:23:54.179950Z" + "iopub.execute_input": "2026-01-20T14:37:23.048342Z", + "iopub.status.busy": "2026-01-20T14:37:23.048265Z", + "iopub.status.idle": "2026-01-20T14:37:23.056285Z", + "shell.execute_reply": "2026-01-20T14:37:23.056047Z" } }, "outputs": [], @@ -188,10 +212,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.181019Z", - "iopub.status.busy": "2026-01-18T18:23:54.180969Z", - "iopub.status.idle": "2026-01-18T18:23:54.283154Z", - "shell.execute_reply": "2026-01-18T18:23:54.282934Z" + "iopub.execute_input": "2026-01-20T14:37:23.057419Z", + "iopub.status.busy": "2026-01-20T14:37:23.057355Z", + "iopub.status.idle": "2026-01-20T14:37:23.190283Z", + "shell.execute_reply": "2026-01-20T14:37:23.189957Z" } }, "outputs": [], @@ -233,10 +257,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.284227Z", - "iopub.status.busy": "2026-01-18T18:23:54.284141Z", - "iopub.status.idle": "2026-01-18T18:23:54.289412Z", - "shell.execute_reply": "2026-01-18T18:23:54.289217Z" + "iopub.execute_input": "2026-01-20T14:37:23.191535Z", + "iopub.status.busy": "2026-01-20T14:37:23.191440Z", + "iopub.status.idle": "2026-01-20T14:37:23.218970Z", + "shell.execute_reply": "2026-01-20T14:37:23.218711Z" } }, "outputs": [], @@ -275,10 +299,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.290406Z", - "iopub.status.busy": "2026-01-18T18:23:54.290354Z", - "iopub.status.idle": "2026-01-18T18:23:54.291918Z", - "shell.execute_reply": "2026-01-18T18:23:54.291715Z" + "iopub.execute_input": "2026-01-20T14:37:23.220213Z", + "iopub.status.busy": "2026-01-20T14:37:23.220143Z", + "iopub.status.idle": 
"2026-01-20T14:37:23.222201Z", + "shell.execute_reply": "2026-01-20T14:37:23.221951Z" } }, "outputs": [], @@ -298,10 +322,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.292699Z", - "iopub.status.busy": "2026-01-18T18:23:54.292649Z", - "iopub.status.idle": "2026-01-18T18:23:54.295835Z", - "shell.execute_reply": "2026-01-18T18:23:54.295677Z" + "iopub.execute_input": "2026-01-20T14:37:23.223255Z", + "iopub.status.busy": "2026-01-20T14:37:23.223189Z", + "iopub.status.idle": "2026-01-20T14:37:23.227098Z", + "shell.execute_reply": "2026-01-20T14:37:23.226868Z" } }, "outputs": [], @@ -326,10 +350,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.296754Z", - "iopub.status.busy": "2026-01-18T18:23:54.296696Z", - "iopub.status.idle": "2026-01-18T18:23:54.298177Z", - "shell.execute_reply": "2026-01-18T18:23:54.297978Z" + "iopub.execute_input": "2026-01-20T14:37:23.228183Z", + "iopub.status.busy": "2026-01-20T14:37:23.228109Z", + "iopub.status.idle": "2026-01-20T14:37:23.229832Z", + "shell.execute_reply": "2026-01-20T14:37:23.229620Z" } }, "outputs": [], @@ -347,10 +371,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.299040Z", - "iopub.status.busy": "2026-01-18T18:23:54.298981Z", - "iopub.status.idle": "2026-01-18T18:23:54.300379Z", - "shell.execute_reply": "2026-01-18T18:23:54.300207Z" + "iopub.execute_input": "2026-01-20T14:37:23.230821Z", + "iopub.status.busy": "2026-01-20T14:37:23.230751Z", + "iopub.status.idle": "2026-01-20T14:37:23.232362Z", + "shell.execute_reply": "2026-01-20T14:37:23.232135Z" } }, "outputs": [], @@ -367,10 +391,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.301184Z", - "iopub.status.busy": "2026-01-18T18:23:54.301120Z", - "iopub.status.idle": "2026-01-18T18:23:54.302604Z", - "shell.execute_reply": 
"2026-01-18T18:23:54.302425Z" + "iopub.execute_input": "2026-01-20T14:37:23.233366Z", + "iopub.status.busy": "2026-01-20T14:37:23.233302Z", + "iopub.status.idle": "2026-01-20T14:37:23.235186Z", + "shell.execute_reply": "2026-01-20T14:37:23.234965Z" } }, "outputs": [], @@ -412,10 +436,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.303485Z", - "iopub.status.busy": "2026-01-18T18:23:54.303436Z", - "iopub.status.idle": "2026-01-18T18:23:54.309925Z", - "shell.execute_reply": "2026-01-18T18:23:54.309739Z" + "iopub.execute_input": "2026-01-20T14:37:23.236227Z", + "iopub.status.busy": "2026-01-20T14:37:23.236160Z", + "iopub.status.idle": "2026-01-20T14:37:23.246416Z", + "shell.execute_reply": "2026-01-20T14:37:23.246159Z" } }, "outputs": [], @@ -452,10 +476,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.310776Z", - "iopub.status.busy": "2026-01-18T18:23:54.310721Z", - "iopub.status.idle": "2026-01-18T18:23:54.312619Z", - "shell.execute_reply": "2026-01-18T18:23:54.312433Z" + "iopub.execute_input": "2026-01-20T14:37:23.247555Z", + "iopub.status.busy": "2026-01-20T14:37:23.247475Z", + "iopub.status.idle": "2026-01-20T14:37:23.249720Z", + "shell.execute_reply": "2026-01-20T14:37:23.249504Z" } }, "outputs": [], @@ -492,10 +516,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.313659Z", - "iopub.status.busy": "2026-01-18T18:23:54.313599Z", - "iopub.status.idle": "2026-01-18T18:23:54.343103Z", - "shell.execute_reply": "2026-01-18T18:23:54.342886Z" + "iopub.execute_input": "2026-01-20T14:37:23.250809Z", + "iopub.status.busy": "2026-01-20T14:37:23.250729Z", + "iopub.status.idle": "2026-01-20T14:37:23.288942Z", + "shell.execute_reply": "2026-01-20T14:37:23.288685Z" } }, "outputs": [], @@ -521,10 +545,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": 
"2026-01-18T18:23:54.344073Z", - "iopub.status.busy": "2026-01-18T18:23:54.344016Z", - "iopub.status.idle": "2026-01-18T18:23:54.382161Z", - "shell.execute_reply": "2026-01-18T18:23:54.381956Z" + "iopub.execute_input": "2026-01-20T14:37:23.290133Z", + "iopub.status.busy": "2026-01-20T14:37:23.290051Z", + "iopub.status.idle": "2026-01-20T14:37:23.342093Z", + "shell.execute_reply": "2026-01-20T14:37:23.341852Z" } }, "outputs": [], @@ -557,10 +581,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.383226Z", - "iopub.status.busy": "2026-01-18T18:23:54.383149Z", - "iopub.status.idle": "2026-01-18T18:23:54.388355Z", - "shell.execute_reply": "2026-01-18T18:23:54.388156Z" + "iopub.execute_input": "2026-01-20T14:37:23.343322Z", + "iopub.status.busy": "2026-01-20T14:37:23.343236Z", + "iopub.status.idle": "2026-01-20T14:37:23.349197Z", + "shell.execute_reply": "2026-01-20T14:37:23.348958Z" } }, "outputs": [], @@ -600,10 +624,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.389338Z", - "iopub.status.busy": "2026-01-18T18:23:54.389272Z", - "iopub.status.idle": "2026-01-18T18:23:54.393588Z", - "shell.execute_reply": "2026-01-18T18:23:54.393388Z" + "iopub.execute_input": "2026-01-20T14:37:23.350293Z", + "iopub.status.busy": "2026-01-20T14:37:23.350229Z", + "iopub.status.idle": "2026-01-20T14:37:23.355209Z", + "shell.execute_reply": "2026-01-20T14:37:23.354990Z" } }, "outputs": [], @@ -639,10 +663,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.394514Z", - "iopub.status.busy": "2026-01-18T18:23:54.394463Z", - "iopub.status.idle": "2026-01-18T18:23:54.404705Z", - "shell.execute_reply": "2026-01-18T18:23:54.404487Z" + "iopub.execute_input": "2026-01-20T14:37:23.356328Z", + "iopub.status.busy": "2026-01-20T14:37:23.356250Z", + "iopub.status.idle": "2026-01-20T14:37:23.378022Z", + "shell.execute_reply": 
"2026-01-20T14:37:23.377751Z" } }, "outputs": [], @@ -674,7 +698,11 @@ "source": [ "## 11. Comparing with MultiPeriodDiD\n", "\n", - "For comparison, here's how you would use `MultiPeriodDiD` which estimates period-specific effects but doesn't handle staggered adoption as carefully." + "For comparison, here's how you would use `MultiPeriodDiD` which estimates period-specific effects. \n", + "\n", + "**Important**: `MultiPeriodDiD` assumes **simultaneous treatment timing** (all treated units get treated at the same time). For staggered adoption, always use `CallawaySantAnna` or `SunAbraham` instead.\n", + "\n", + "To demonstrate `MultiPeriodDiD` properly, we'll create a simple dataset where all treated units receive treatment at the same time." ] }, { @@ -682,24 +710,38 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.405752Z", - "iopub.status.busy": "2026-01-18T18:23:54.405696Z", - "iopub.status.idle": "2026-01-18T18:23:54.408664Z", - "shell.execute_reply": "2026-01-18T18:23:54.408454Z" + "iopub.execute_input": "2026-01-20T14:37:23.379194Z", + "iopub.status.busy": "2026-01-20T14:37:23.379121Z", + "iopub.status.idle": "2026-01-20T14:37:23.388215Z", + "shell.execute_reply": "2026-01-20T14:37:23.387986Z" } }, "outputs": [], "source": [ - "# For this comparison, let's use data from cohort 3 only (single treatment timing)\n", - "cohort3_df = df[df['cohort'].isin([0, 3])].copy()\n", + "# Create a simple dataset with simultaneous treatment timing\n", + "# This is the appropriate data structure for MultiPeriodDiD\n", + "from diff_diff import generate_did_data\n", + "\n", + "# Generate data with simultaneous treatment at period 4\n", + "mp_data = generate_did_data(\n", + " n_units=100,\n", + " n_periods=8,\n", + " treatment_period=4, # All treated units get treatment at period 4\n", + " treatment_fraction=0.5,\n", + " treatment_effect=2.5,\n", + " seed=42\n", + ")\n", + "\n", + "print(f\"MultiPeriodDiD dataset: 
{len(mp_data)} obs\")\n", + "print(f\"Treatment starts at period 4 for all treated units\")\n", "\n", "mp_did = MultiPeriodDiD()\n", "results_mp = mp_did.fit(\n", - " cohort3_df,\n", + " mp_data,\n", " outcome=\"outcome\",\n", " treatment=\"treated\",\n", " time=\"period\",\n", - " post_periods=[3, 4, 5, 6, 7]\n", + " post_periods=[4, 5, 6, 7]\n", ")\n", "\n", "print(results_mp.summary())" @@ -710,10 +752,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.409636Z", - "iopub.status.busy": "2026-01-18T18:23:54.409582Z", - "iopub.status.idle": "2026-01-18T18:23:54.411076Z", - "shell.execute_reply": "2026-01-18T18:23:54.410866Z" + "iopub.execute_input": "2026-01-20T14:37:23.389404Z", + "iopub.status.busy": "2026-01-20T14:37:23.389320Z", + "iopub.status.idle": "2026-01-20T14:37:23.391035Z", + "shell.execute_reply": "2026-01-20T14:37:23.390778Z" } }, "outputs": [], @@ -749,10 +791,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.412012Z", - "iopub.status.busy": "2026-01-18T18:23:54.411958Z", - "iopub.status.idle": "2026-01-18T18:23:54.422080Z", - "shell.execute_reply": "2026-01-18T18:23:54.421878Z" + "iopub.execute_input": "2026-01-20T14:37:23.392190Z", + "iopub.status.busy": "2026-01-20T14:37:23.392111Z", + "iopub.status.idle": "2026-01-20T14:37:23.410168Z", + "shell.execute_reply": "2026-01-20T14:37:23.409921Z" } }, "outputs": [], @@ -780,10 +822,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.423081Z", - "iopub.status.busy": "2026-01-18T18:23:54.423018Z", - "iopub.status.idle": "2026-01-18T18:23:54.425058Z", - "shell.execute_reply": "2026-01-18T18:23:54.424845Z" + "iopub.execute_input": "2026-01-20T14:37:23.411166Z", + "iopub.status.busy": "2026-01-20T14:37:23.411095Z", + "iopub.status.idle": "2026-01-20T14:37:23.413556Z", + "shell.execute_reply": "2026-01-20T14:37:23.413310Z" } }, "outputs": [], @@ 
-819,10 +861,10 @@ "execution_count": null, "metadata": { "execution": { - "iopub.execute_input": "2026-01-18T18:23:54.426005Z", - "iopub.status.busy": "2026-01-18T18:23:54.425952Z", - "iopub.status.idle": "2026-01-18T18:23:54.427888Z", - "shell.execute_reply": "2026-01-18T18:23:54.427679Z" + "iopub.execute_input": "2026-01-20T14:37:23.414763Z", + "iopub.status.busy": "2026-01-20T14:37:23.414691Z", + "iopub.status.idle": "2026-01-20T14:37:23.417345Z", + "shell.execute_reply": "2026-01-20T14:37:23.417101Z" } }, "outputs": [], @@ -907,7 +949,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.4" + "version": "3.9.6" } }, "nbformat": 4, diff --git a/rust/src/linalg.rs b/rust/src/linalg.rs index de7dace6..79c5f2e2 100644 --- a/rust/src/linalg.rs +++ b/rust/src/linalg.rs @@ -6,13 +6,27 @@ //! - Cluster-robust variance-covariance estimation use ndarray::{Array1, Array2, ArrayView1, ArrayView2, Axis}; -use ndarray_linalg::{FactorizeC, LeastSquaresSvd, Solve, SolveC, UPLO}; +use ndarray_linalg::{FactorizeC, Solve, SolveC, SVD, UPLO}; use numpy::{IntoPyArray, PyArray1, PyArray2, PyReadonlyArray1, PyReadonlyArray2}; use pyo3::prelude::*; use std::collections::HashMap; /// Solve OLS regression: β = (X'X)^{-1} X'y /// +/// Uses SVD with truncation for rank-deficient matrices: +/// - Computes SVD: X = U * S * V^T +/// - Truncates singular values below rcond * max(S) +/// - Computes solution: β = V * S^{-1}_truncated * U^T * y +/// +/// This matches scipy's 'gelsd' driver behavior for handling rank-deficient +/// design matrices that can occur in DiD estimation (e.g., MultiPeriodDiD +/// with redundant period dummies + treatment interactions). +/// +/// For rank-deficient matrices (rank < k), the vcov matrix is filled with NaN +/// since the sandwich estimator requires inverting the singular X'X matrix. +/// The Python wrapper should use the full R-style handling with QR pivoting +/// for proper rank-deficiency support. 
+/// /// # Arguments /// * `x` - Design matrix (n, k) /// * `y` - Response vector (n,) @@ -20,7 +34,8 @@ use std::collections::HashMap; /// * `return_vcov` - Whether to compute and return variance-covariance matrix /// /// # Returns -/// Tuple of (coefficients, residuals, vcov) where vcov is None if return_vcov=False +/// Tuple of (coefficients, residuals, vcov) where vcov is None if return_vcov=False, +/// or NaN-filled matrix if rank-deficient #[pyfunction] #[pyo3(signature = (x, y, cluster_ids=None, return_vcov=true))] pub fn solve_ols<'py>( @@ -37,25 +52,69 @@ let x_arr = x.as_array(); let y_arr = y.as_array(); - // Solve least squares using SVD (more stable than normal equations) + let n = x_arr.nrows(); + let k = x_arr.ncols(); + + // Solve using SVD with truncation for rank-deficient matrices + // This matches scipy's 'gelsd' behavior let x_owned = x_arr.to_owned(); let y_owned = y_arr.to_owned(); - let result = x_owned - .least_squares(&y_owned) - .map_err(|e| PyErr::new::(format!("Least squares failed: {}", e)))?; + // Compute SVD: X = U * S * V^T + let (u_opt, s, vt_opt) = x_owned + .svd(true, true) + .map_err(|e| PyErr::new::(format!("SVD failed: {}", e)))?; + + let u = u_opt.ok_or_else(|| { + PyErr::new::("SVD did not return U matrix") + })?; + let vt = vt_opt.ok_or_else(|| { + PyErr::new::("SVD did not return V^T matrix") + })?; + + // Compute rcond threshold to match R's lm() behavior + // R's qr() uses tol = 1e-07 by default (note: larger than sqrt(eps) ≈ 1.49e-08) + // We use 1e-07 for consistency with the Python backend and R + let rcond = 1e-07_f64; + let s_max = s.iter().cloned().fold(0.0_f64, f64::max); + let threshold = s_max * rcond; + + // Compute truncated pseudoinverse solution: β = V * S^{-1} * U^T * y + // Singular values below threshold are treated as zero (truncated) + let uty = u.t().dot(&y_owned); // length n; only the first min(n, k) entries pair with singular values + + // Build S^{-1} with truncation and count effective rank + let mut s_inv_uty = Array1::<f64>::zeros(k); + let mut
rank = 0usize; + for i in 0..s.len().min(k) { + if s[i] > threshold { + s_inv_uty[i] = uty[i] / s[i]; + rank += 1; + } + // else: leave as 0 (truncate this singular value) + } - let coefficients = result.solution; + // Compute coefficients: β = V * (S^{-1} * U^T * y) + let coefficients = vt.t().dot(&s_inv_uty); // Compute fitted values and residuals let fitted = x_arr.dot(&coefficients); let residuals = &y_arr - &fitted; // Compute variance-covariance if requested + // For rank-deficient matrices, return NaN vcov since X'X is singular let vcov = if return_vcov { - let cluster_arr = cluster_ids.as_ref().map(|c| c.as_array().to_owned()); - let vcov_arr = compute_robust_vcov_internal(&x_arr, &residuals.view(), cluster_arr.as_ref())?; - Some(vcov_arr.into_pyarray(py)) + if rank < k { + // Rank-deficient: cannot compute valid vcov, return NaN matrix + let mut nan_vcov = Array2::<f64>::zeros((k, k)); + nan_vcov.fill(f64::NAN); + Some(nan_vcov.into_pyarray(py)) + } else { + // Full rank: compute robust vcov normally + let cluster_arr = cluster_ids.as_ref().map(|c| c.as_array().to_owned()); + let vcov_arr = compute_robust_vcov_internal(&x_arr, &residuals.view(), cluster_arr.as_ref())?; + Some(vcov_arr.into_pyarray(py)) + } } else { None }; @@ -224,7 +283,9 @@ fn invert_symmetric(a: &Array2<f64>) -> PyResult<Array2<f64>> { let col = a.solve(&e_i).map_err(|e| { PyErr::new::(format!( - "Matrix inversion failed: {}", + "Matrix inversion failed (likely rank-deficient X'X): {}. 
\ + If the design matrix is rank-deficient, use solve_ols without \ + skip_rank_check=True to enable R-style handling.", e )) })?; diff --git a/tests/test_estimators.py b/tests/test_estimators.py index a2e383f9..141bd4e7 100644 --- a/tests/test_estimators.py +++ b/tests/test_estimators.py @@ -211,6 +211,79 @@ def test_unfitted_model_error(self): with pytest.raises(RuntimeError, match="fitted"): did.summary() + def test_rank_deficient_action_error_raises(self, simple_2x2_data): + """Test that rank_deficient_action='error' raises ValueError on collinear data.""" + # Add a covariate that is perfectly collinear with treatment + data = simple_2x2_data.copy() + data["collinear_cov"] = data["treated"].copy() + + did = DifferenceInDifferences(rank_deficient_action="error") + with pytest.raises(ValueError, match="rank-deficient"): + did.fit( + data, + outcome="outcome", + treatment="treated", + time="post", + covariates=["collinear_cov"] + ) + + def test_rank_deficient_action_silent_no_warning(self, simple_2x2_data): + """Test that rank_deficient_action='silent' produces no warning.""" + import warnings + + # Add a covariate that is perfectly collinear with treatment + data = simple_2x2_data.copy() + data["collinear_cov"] = data["treated"].copy() + + did = DifferenceInDifferences(rank_deficient_action="silent") + + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + results = did.fit( + data, + outcome="outcome", + treatment="treated", + time="post", + covariates=["collinear_cov"] + ) + + # No warnings about rank deficiency should be emitted + rank_warnings = [x for x in w if "Rank-deficient" in str(x.message) + or "rank-deficient" in str(x.message).lower()] + assert len(rank_warnings) == 0, f"Expected no rank warnings, got {rank_warnings}" + + # Should still have NaN for dropped coefficient + assert "collinear_cov" in results.coefficients + # Either collinear_cov or treated will be NaN + has_nan = 
(np.isnan(results.coefficients.get("collinear_cov", 0)) or + np.isnan(results.coefficients.get("treated", 0))) + assert has_nan, "Expected NaN for one of the collinear coefficients" + + def test_rank_deficient_action_warn_default(self, simple_2x2_data): + """Test that rank_deficient_action='warn' (default) emits warning.""" + import warnings + + # Add a covariate that is perfectly collinear with treatment + data = simple_2x2_data.copy() + data["collinear_cov"] = data["treated"].copy() + + did = DifferenceInDifferences() # Default is "warn" + + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + results = did.fit( + data, + outcome="outcome", + treatment="treated", + time="post", + covariates=["collinear_cov"] + ) + + # Should have a warning about rank deficiency + rank_warnings = [x for x in w if "Rank-deficient" in str(x.message) + or "rank-deficient" in str(x.message).lower()] + assert len(rank_warnings) > 0, "Expected warning about rank deficiency" + class TestDiDResults: """Tests for DiDResults class.""" @@ -704,7 +777,9 @@ class TestEdgeCases: """Tests for edge cases and robustness.""" def test_multicollinearity_detection(self): - """Test that perfect multicollinearity is detected.""" + """Test that perfect multicollinearity is detected and warning is emitted.""" + import warnings + # Create data where a covariate is perfectly correlated with treatment data = pd.DataFrame({ "outcome": [10, 11, 15, 18, 9, 10, 12, 13], @@ -714,14 +789,23 @@ def test_multicollinearity_detection(self): }) did = DifferenceInDifferences() - with pytest.raises(ValueError, match="rank-deficient"): - did.fit( + + # With R-style rank deficiency handling, a warning is emitted + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + result = did.fit( data, outcome="outcome", treatment="treated", time="post", covariates=["duplicate_treated"] ) + # Should emit a warning about rank deficiency + rank_warnings = [x for x in w if 
"Rank-deficient" in str(x.message)] + assert len(rank_warnings) > 0, "Expected warning about rank deficiency" + + # ATT should still be finite + assert np.isfinite(result.att) def test_wasserstein_custom_threshold(self): """Test that custom Wasserstein threshold is respected.""" @@ -930,6 +1014,102 @@ def test_twfe_clusters_at_unit_level(self, twfe_panel_data): # But the results should still reflect cluster-robust SEs were computed correctly assert results.se > 0 + def test_twfe_treatment_collinearity_raises_error(self): + """Test that TWFE raises informative error when treatment is collinear.""" + from diff_diff.estimators import TwoWayFixedEffects + + # Create data where treatment is perfectly collinear with fixed effects + # (all treated units are treated in all periods) + data = [] + for unit in range(10): + is_treated = unit < 5 + for period in range(4): + data.append({ + "unit": unit, + "period": period, + "treated": int(is_treated), # Same for all periods + "post": 1 if period >= 2 else 0, + "outcome": 10.0 + unit * 0.5 + period * 0.3 + np.random.normal(0, 0.1), + }) + df = pd.DataFrame(data) + + # Make treatment_post constant for treated units (collinear) + # by making treatment only occur in post periods + df_collinear = df.copy() + # This creates perfect collinearity: treatment is perfectly predicted by unit FE + # since treated units always have treated=1 and control units always have treated=0 + + twfe = TwoWayFixedEffects() + + # Should raise or warn about collinearity - depends on what columns get dropped + # The key is that it should NOT silently produce misleading results + try: + results = twfe.fit( + df_collinear, + outcome="outcome", + treatment="treated", + time="post", + unit="unit" + ) + # If we get here without error, the ATT should still be computed + # (this means only covariates were dropped, not the treatment) + assert results is not None + except ValueError as e: + # If treatment column is dropped, should get informative error + assert 
"collinear" in str(e).lower() or "Treatment effect cannot be identified" in str(e) + + def test_rank_deficient_action_error_raises(self, twfe_panel_data): + """Test that rank_deficient_action='error' raises ValueError on collinear data.""" + from diff_diff.estimators import TwoWayFixedEffects + + # Add a covariate that is perfectly collinear with post + twfe_panel_data = twfe_panel_data.copy() + twfe_panel_data["collinear_cov"] = twfe_panel_data["post"].copy() + + twfe = TwoWayFixedEffects(rank_deficient_action="error") + with pytest.raises(ValueError, match="rank-deficient"): + twfe.fit( + twfe_panel_data, + outcome="outcome", + treatment="treated", + time="post", + unit="unit", + covariates=["collinear_cov"] + ) + + def test_rank_deficient_action_silent_no_warning(self, twfe_panel_data): + """Test that rank_deficient_action='silent' produces no warning.""" + import warnings + from diff_diff.estimators import TwoWayFixedEffects + + # Add a covariate that is perfectly collinear with another + twfe_panel_data = twfe_panel_data.copy() + twfe_panel_data["size"] = np.random.normal(100, 10, len(twfe_panel_data)) + twfe_panel_data["size_dup"] = twfe_panel_data["size"].copy() # Perfect collinearity + + twfe = TwoWayFixedEffects(rank_deficient_action="silent") + + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + results = twfe.fit( + twfe_panel_data, + outcome="outcome", + treatment="treated", + time="post", + unit="unit", + covariates=["size", "size_dup"] + ) + + # No warnings about rank deficiency or collinearity should be emitted + rank_warnings = [x for x in w if "Rank-deficient" in str(x.message) + or "rank-deficient" in str(x.message).lower() + or "collinear" in str(x.message).lower()] + assert len(rank_warnings) == 0, f"Expected no rank warnings, got {rank_warnings}" + + # Should still get valid results + assert results is not None + assert twfe.is_fitted_ + class TestClusterRobustSE: """Tests for cluster-robust standard errors.""" 
@@ -1569,6 +1749,137 @@ def test_coefficients_dict(self, multi_period_data): # Treatment interactions assert any("treated:period_" in k for k in results.coefficients) + def test_rank_deficient_design_warns_and_sets_nan(self, multi_period_data): + """Test that rank-deficient design matrix warns and sets NaN for dropped columns.""" + import warnings + + # Add a covariate that is perfectly collinear with an existing column + # Use exact duplicate to ensure perfect collinearity is detected + multi_period_data = multi_period_data.copy() + multi_period_data["collinear_cov"] = multi_period_data["treated"].copy() + + did = MultiPeriodDiD() + + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + results = did.fit( + multi_period_data, + outcome="outcome", + treatment="treated", + time="period", + post_periods=[3, 4, 5], + covariates=["collinear_cov"] + ) + + # Should have warning about rank deficiency + rank_warnings = [x for x in w if "Rank-deficient" in str(x.message) + or "collinear" in str(x.message).lower()] + assert len(rank_warnings) > 0, "Expected warning about rank deficiency" + + # The collinear covariate should have NaN coefficient + assert "collinear_cov" in results.coefficients + assert np.isnan(results.coefficients["collinear_cov"]), \ + "Collinear covariate coefficient should be NaN" + + # Treatment effects should still be identified (not NaN) + for period in [3, 4, 5]: + pe = results.period_effects[period] + assert not np.isnan(pe.effect), f"Period {period} effect should be identified" + assert not np.isnan(pe.se), f"Period {period} SE should be valid" + assert pe.se > 0, f"Period {period} SE should be positive" + + # Vcov should have NaN for the collinear column + assert results.vcov is not None + assert np.any(np.isnan(results.vcov)), "Vcov should have NaN for dropped column" + + # avg_att should still be computed because all period effects are identified + assert not np.isnan(results.avg_att), "avg_att should be valid when 
all period effects are identified" + + def test_avg_att_nan_when_period_effect_nan(self, multi_period_data): + """Test that avg_att is NaN if any period effect is NaN (R-style NA propagation).""" + import warnings + + # Remove all treated observations in period 3 to make that interaction + # unidentified (column of zeros) + data_no_treated_period3 = multi_period_data[ + ~((multi_period_data["treated"] == 1) & (multi_period_data["period"] == 3)) + ].copy() + + did = MultiPeriodDiD() + + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + results = did.fit( + data_no_treated_period3, + outcome="outcome", + treatment="treated", + time="period", + post_periods=[3, 4, 5] + ) + + # Should have warning about rank deficiency (treated:period_3 is all zeros) + rank_warnings = [x for x in w if "Rank-deficient" in str(x.message) + or "collinear" in str(x.message).lower()] + assert len(rank_warnings) > 0, "Expected warning about rank deficiency" + + # The treated×period_3 interaction should have NaN coefficient (unidentified) + pe_3 = results.period_effects[3] + assert np.isnan(pe_3.effect), "Period 3 effect should be NaN (unidentified)" + + # avg_att should be NaN because one period effect is NaN (R-style NA propagation) + assert np.isnan(results.avg_att), "avg_att should be NaN when any period effect is NaN" + assert np.isnan(results.avg_se), "avg_se should be NaN when avg_att is NaN" + assert np.isnan(results.avg_t_stat), "avg_t_stat should be NaN when avg_att is NaN" + assert np.isnan(results.avg_p_value), "avg_p_value should be NaN when avg_att is NaN" + + def test_rank_deficient_action_error_raises(self, multi_period_data): + """Test that rank_deficient_action='error' raises ValueError on collinear data.""" + # Add a covariate that is perfectly collinear with treatment + multi_period_data = multi_period_data.copy() + multi_period_data["collinear_cov"] = multi_period_data["treated"].copy() + + did = 
MultiPeriodDiD(rank_deficient_action="error") + with pytest.raises(ValueError, match="rank-deficient"): + did.fit( + multi_period_data, + outcome="outcome", + treatment="treated", + time="period", + post_periods=[3, 4, 5], + covariates=["collinear_cov"] + ) + + def test_rank_deficient_action_silent_no_warning(self, multi_period_data): + """Test that rank_deficient_action='silent' produces no warning.""" + import warnings + + # Add a covariate that is perfectly collinear with treatment + multi_period_data = multi_period_data.copy() + multi_period_data["collinear_cov"] = multi_period_data["treated"].copy() + + did = MultiPeriodDiD(rank_deficient_action="silent") + + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + results = did.fit( + multi_period_data, + outcome="outcome", + treatment="treated", + time="period", + post_periods=[3, 4, 5], + covariates=["collinear_cov"] + ) + + # No warnings about rank deficiency should be emitted + rank_warnings = [x for x in w if "Rank-deficient" in str(x.message) + or "rank-deficient" in str(x.message).lower()] + assert len(rank_warnings) == 0, f"Expected no rank warnings, got {rank_warnings}" + + # Should still have NaN for dropped coefficient + assert "collinear_cov" in results.coefficients + assert np.isnan(results.coefficients["collinear_cov"]), \ + "Collinear covariate coefficient should be NaN" + class TestSyntheticDiD: """Tests for SyntheticDiD estimator.""" @@ -2595,14 +2906,15 @@ def test_sdid_single_treated_unit(self): class TestCollinearityDetection: """Tests for handling perfect or near collinearity.""" - def test_did_with_redundant_covariate_handles_gracefully(self): - """Test DiD handles perfectly collinear covariates gracefully. + def test_did_with_redundant_covariate_emits_warning(self): + """Test DiD emits warning for perfectly collinear covariates. - Note: scipy's gelsy solver handles rank-deficiency gracefully via - QR with column pivoting. 
Results may be numerically unstable but - the computation completes. This is acceptable for performance reasons - as the alternative (upfront rank check) adds O(k^3) overhead. + Following R's lm() approach, rank-deficient matrices emit a warning + and set NaN for coefficients of dropped columns. The ATT should still + be identified if the treatment interaction is not in the dropped set. """ + import warnings + np.random.seed(42) data = pd.DataFrame({ "outcome": np.random.normal(10, 1, 100), @@ -2615,20 +2927,32 @@ def test_did_with_redundant_covariate_handles_gracefully(self): did = DifferenceInDifferences() - # With scipy's gelsy solver, collinear covariates are handled gracefully - # (may produce numerically unstable but finite results) - result = did.fit( - data, - outcome="outcome", - treatment="treated", - time="post", - covariates=["x1", "x2"] - ) - # Result should be finite (even if numerically unstable) + # With R-style rank deficiency handling, a warning is emitted + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + result = did.fit( + data, + outcome="outcome", + treatment="treated", + time="post", + covariates=["x1", "x2"] + ) + # Should emit a warning about rank deficiency + rank_warnings = [x for x in w if "Rank-deficient" in str(x.message)] + assert len(rank_warnings) > 0, "Expected warning about rank deficiency" + + # ATT should still be finite (the interaction term is identified) assert np.isfinite(result.att) - def test_did_with_constant_covariate_raises_error(self): - """Test DiD raises clear error for constant covariates.""" + def test_did_with_constant_covariate_emits_warning(self): + """Test DiD emits warning for constant covariates. + + Constant covariates are collinear with the intercept and are dropped. + Following R's lm() approach, a warning is emitted and the coefficient + for the constant covariate is set to NaN. 
+ """ + import warnings + np.random.seed(42) data = pd.DataFrame({ "outcome": np.random.normal(10, 1, 100), @@ -2640,15 +2964,22 @@ def test_did_with_constant_covariate_raises_error(self): did = DifferenceInDifferences() # Constant covariate is collinear with intercept - # Should raise clear error - with pytest.raises(ValueError, match="rank-deficient"): - did.fit( + # Should emit warning about rank deficiency + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + result = did.fit( data, outcome="outcome", treatment="treated", time="post", covariates=["constant_x"] ) + # Should emit a warning about rank deficiency + rank_warnings = [x for x in w if "Rank-deficient" in str(x.message)] + assert len(rank_warnings) > 0, "Expected warning about rank deficiency" + + # ATT should still be finite + assert np.isfinite(result.att) def test_did_with_near_collinear_covariates(self): """Test DiD handles near-collinear covariates (not perfectly collinear).""" diff --git a/tests/test_linalg.py b/tests/test_linalg.py index 7fab517b..33b3e5f4 100644 --- a/tests/test_linalg.py +++ b/tests/test_linalg.py @@ -186,34 +186,230 @@ def test_inf_in_x_raises_error(self): solve_ols(X, y) def test_check_finite_false_skips_validation(self): - """Test that check_finite=False skips NaN/Inf validation.""" + """Test that check_finite=False skips the upfront NaN/Inf validation. + + Note: With the 'gelsd' driver, LAPACK may still error on NaN values + during computation, which is actually safer than producing garbage. + """ X = np.random.randn(100, 2) X[50, 0] = np.nan y = np.random.randn(100) - # Should not raise, but will return garbage results - coef, resid, vcov = solve_ols(X, y, check_finite=False) - # Coefficients will contain NaN due to bad input - assert np.isnan(coef).any() or np.isinf(coef).any() + # The gelsd driver may raise an error when encountering NaN during + # computation, or produce garbage results. 
Either is acceptable + # (the key is that we don't raise the "X contains NaN" user-friendly error) + try: + coef, resid, vcov = solve_ols(X, y, check_finite=False) + # If it completed, coefficients should contain NaN/Inf due to bad input + assert np.isnan(coef).any() or np.isinf(coef).any() + except ValueError as e: + # LAPACK may raise an error on NaN values (gelsd behavior) + # This is acceptable - the key is we skipped our own validation + assert "X contains NaN" not in str(e) and "y contains NaN" not in str(e) + + def test_rank_deficient_produces_nan_for_dropped_columns(self): + """Test that rank-deficient matrix returns NaN for dropped columns. + + Following R's lm() approach, coefficients for linearly dependent columns + are set to NaN while identified coefficients are computed normally. + """ + import warnings - def test_rank_deficient_still_solves(self): - """Test that rank-deficient matrix still returns a solution. + np.random.seed(42) + X = np.random.randn(100, 3) + X[:, 2] = X[:, 0] + X[:, 1] # Perfect collinearity: col 2 = col 0 + col 1 + y = np.random.randn(100) - Note: The gelsy driver doesn't always detect rank deficiency, - but it still returns a valid least-squares solution. 
- """ + # Should emit warning about rank deficiency + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + coef, resid, vcov = solve_ols(X, y) + assert len(w) == 1 + assert "Rank-deficient" in str(w[0].message) + + assert coef.shape == (3,) + assert resid.shape == (100,) + + # Exactly one coefficient should be NaN (the dropped one) + nan_mask = np.isnan(coef) + assert np.sum(nan_mask) == 1, f"Expected 1 NaN coefficient, got {np.sum(nan_mask)}: {coef}" + + # Non-NaN coefficients should be finite and reasonable + finite_coef = coef[~nan_mask] + assert np.all(np.isfinite(finite_coef)), f"Finite coefficients contain non-finite values: {finite_coef}" + assert np.all(np.abs(finite_coef) < 1e6), f"Finite coefficients are unreasonably large: {finite_coef}" + + # VCoV should have NaN for dropped column's row and column + assert vcov is not None + dropped_idx = np.where(nan_mask)[0][0] + assert np.all(np.isnan(vcov[dropped_idx, :])), "VCoV row for dropped column should be NaN" + assert np.all(np.isnan(vcov[:, dropped_idx])), "VCoV column for dropped column should be NaN" + + # VCoV for identified coefficients should be finite + kept_idx = np.where(~nan_mask)[0] + vcov_kept = vcov[np.ix_(kept_idx, kept_idx)] + assert np.all(np.isfinite(vcov_kept)), "VCoV for kept coefficients should be finite" + + # Residuals should be finite (computed using only identified coefficients) + assert np.all(np.isfinite(resid)), f"Residuals contain non-finite values" + + def test_rank_deficient_error_mode(self): + """Test that rank_deficient_action='error' raises ValueError.""" np.random.seed(42) X = np.random.randn(100, 3) X[:, 2] = X[:, 0] + X[:, 1] # Perfect collinearity y = np.random.randn(100) - # Should still complete and return valid output - coef, resid, vcov = solve_ols(X, y) + with pytest.raises(ValueError, match="rank-deficient"): + solve_ols(X, y, rank_deficient_action="error") + + def test_rank_deficient_silent_mode(self): + """Test that 
rank_deficient_action='silent' produces no warning.""" + import warnings + np.random.seed(42) + X = np.random.randn(100, 3) + X[:, 2] = X[:, 0] + X[:, 1] # Perfect collinearity + y = np.random.randn(100) + + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + coef, resid, vcov = solve_ols(X, y, rank_deficient_action="silent") + # No warnings should be emitted + assert len(w) == 0, f"Expected no warnings, got {len(w)}: {[str(x.message) for x in w]}" + + # Should still produce NaN for dropped column + assert np.sum(np.isnan(coef)) == 1 + + def test_rank_deficient_column_names_in_warning(self): + """Test that column names appear in warning message.""" + import warnings + + np.random.seed(42) + X = np.random.randn(100, 3) + X[:, 2] = X[:, 0] + X[:, 1] # Perfect collinearity + y = np.random.randn(100) + + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + coef, resid, vcov = solve_ols( + X, y, + column_names=["intercept", "x1", "x2_collinear"] + ) + assert len(w) == 1 + # Column name should appear in warning (not just index) + assert "x2_collinear" in str(w[0].message) or "intercept" in str(w[0].message) or "x1" in str(w[0].message) + + def test_skip_rank_check_bypasses_qr_decomposition(self): + """Test that skip_rank_check=True skips QR rank detection. + + When skip_rank_check=True, the function should skip QR decomposition + and go directly to SVD solving, even in Python backend. 
+ """ + import warnings + import os + + np.random.seed(42) + n = 100 + X = np.random.randn(n, 3) + y = np.random.randn(n) + + # Force Python backend for this test + old_backend = os.environ.get("DIFF_DIFF_BACKEND") + os.environ["DIFF_DIFF_BACKEND"] = "python" + + try: + # With skip_rank_check=True, should not emit any warnings + # (even if we make X rank-deficient, since we're skipping the check) + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + coef, resid, vcov = solve_ols(X, y, skip_rank_check=True) + # No rank-deficiency warning should be emitted + rank_warnings = [x for x in w if "Rank-deficient" in str(x.message)] + assert len(rank_warnings) == 0 + finally: + if old_backend is not None: + os.environ["DIFF_DIFF_BACKEND"] = old_backend + elif "DIFF_DIFF_BACKEND" in os.environ: + del os.environ["DIFF_DIFF_BACKEND"] + + # Should produce valid coefficients assert coef.shape == (3,) - assert resid.shape == (100,) - # Residuals should still be valid (y - X @ coef) - np.testing.assert_allclose(resid, y - X @ coef, rtol=1e-10) + assert np.all(np.isfinite(coef)) + + def test_multiperiod_like_design_full_rank(self): + """Test that MultiPeriodDiD-like design matrices work when full-rank. + + This test creates a properly specified MultiPeriodDiD-like design that + is NOT rank-deficient to verify correct coefficient recovery. 
+ """ + import warnings + + np.random.seed(42) + n = 200 + n_periods = 5 + + # Create a design matrix similar to MultiPeriodDiD: + # [intercept, period_1, period_2, ..., period_k, treated*post] + + # Intercept + intercept = np.ones(n) + + # Period dummies (one-hot encoding for periods 1 to n_periods-1) + # Period 0 is the reference + period_assignment = np.random.randint(0, n_periods, n) + period_dummies = np.zeros((n, n_periods - 1)) + for i in range(1, n_periods): + period_dummies[:, i - 1] = (period_assignment == i).astype(float) + + # Treatment indicator (varies within periods to ensure identification) + treated = np.random.binomial(1, 0.5, n) + + # Post indicator (periods >= 3 are post) + post = (period_assignment >= 3).astype(float) + + # Treatment × post interaction + treat_post = treated * post + + # Build design matrix + X = np.column_stack([intercept, period_dummies, treat_post]) + + # True effect + true_effect = 2.5 + y = ( + 1.0 # intercept effect + + 0.5 * period_dummies[:, 0] # period 1 effect + + 0.3 * period_dummies[:, 1] # period 2 effect + + 0.7 * period_dummies[:, 2] # period 3 effect + + 0.9 * period_dummies[:, 3] # period 4 effect + + true_effect * treat_post # treatment effect + + np.random.randn(n) * 0.5 # noise + ) + + # Fit with solve_ols - should NOT produce warning if full-rank + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + coef, resid, vcov = solve_ols(X, y) + # Check if any rank deficiency warnings (may or may not occur) + rank_warnings = [x for x in w if "Rank-deficient" in str(x.message)] + + # If no rank deficiency, all coefficients should be finite + if len(rank_warnings) == 0: + assert np.all(np.isfinite(coef)), f"Full-rank matrix: coefficients should be finite" + assert np.all(np.abs(coef) < 1e6), f"Coefficients are unreasonably large: {coef}" + # The treatment effect coefficient (last one) should be close to true effect + assert abs(coef[-1] - true_effect) < 2.0, ( + f"Treatment effect 
{coef[-1]} is too far from true {true_effect}" + ) + else: + # If rank-deficient, check that identified coefficients are valid + finite_coef = coef[~np.isnan(coef)] + assert np.all(np.isfinite(finite_coef)), f"Identified coefficients should be finite" + # If treatment effect is identified, check it + if not np.isnan(coef[-1]): + assert abs(coef[-1] - true_effect) < 2.0, ( + f"Treatment effect {coef[-1]} is too far from true {true_effect}" + ) def test_single_cluster_error(self): """Test that single cluster raises error.""" @@ -824,6 +1020,239 @@ def test_matches_solve_ols(self, simple_data): np.testing.assert_allclose(reg.fitted_values_, fitted, rtol=1e-10) np.testing.assert_allclose(reg.vcov_, vcov, rtol=1e-10) + def test_rank_deficient_degrees_of_freedom(self): + """Test that degrees of freedom are computed correctly when columns are dropped. + + When a design matrix is rank-deficient, the effective number of parameters + is the rank, not the number of columns. The df should be n - rank. 
+ """ + import warnings + + np.random.seed(42) + n = 100 + # Create rank-deficient matrix: 4 columns but rank 3 + X = np.random.randn(n, 3) + X = np.column_stack([X, X[:, 0] + X[:, 1]]) # Column 3 = Column 0 + Column 1 + + y = np.random.randn(n) + + with warnings.catch_warnings(record=True): + warnings.simplefilter("always") + reg = LinearRegression(include_intercept=False).fit(X, y) + + # n_params_ should be total columns (4) + assert reg.n_params_ == 4 + + # n_params_effective_ should be the rank (3) + assert reg.n_params_effective_ == 3 + + # df_ should be n - effective_params = 100 - 3 = 97 + assert reg.df_ == n - 3 + + # Verify one coefficient is NaN (the dropped one) + assert np.sum(np.isnan(reg.coefficients_)) == 1 + + def test_rank_deficient_inference_uses_correct_df(self): + """Test that p-values and CIs use the correct df for rank-deficient matrices.""" + import warnings + from scipy import stats + + np.random.seed(42) + n = 100 + # Create rank-deficient matrix + X = np.random.randn(n, 3) + X = np.column_stack([X, X[:, 0] + X[:, 1]]) # Perfect collinearity + + # True coefficients for the first 3 columns only + y = 2 * X[:, 0] + 3 * X[:, 1] + np.random.randn(n) * 0.5 + + with warnings.catch_warnings(record=True): + warnings.simplefilter("always") + reg = LinearRegression(include_intercept=False).fit(X, y) + + # Get inference for an identified coefficient + nan_mask = np.isnan(reg.coefficients_) + kept_idx = np.where(~nan_mask)[0][0] # First non-NaN coefficient + result = reg.get_inference(kept_idx) + + # Check that df is correct (should be n - rank = 97) + assert result.df == n - 3, f"Expected df={n-3}, got {result.df}" + + # Manually compute expected values using correct df + coef = result.coefficient + se = result.se + t_stat_expected = coef / se + p_value_expected = 2 * (1 - stats.t.cdf(abs(t_stat_expected), df=n - 3)) + + # Verify t-stat + np.testing.assert_allclose(result.t_stat, t_stat_expected, rtol=1e-10) + + # Verify p-value uses correct df (use 
atol for very small p-values) + np.testing.assert_allclose(result.p_value, p_value_expected, atol=1e-10) + + # Verify CI uses correct df + t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 3) + ci_expected = (coef - t_crit * se, coef + t_crit * se) + np.testing.assert_allclose(result.conf_int, ci_expected, rtol=1e-6) + + def test_rank_deficient_inference_nan_for_dropped_coef(self): + """Test that inference for dropped coefficients returns NaN values.""" + import warnings + + np.random.seed(42) + n = 100 + X = np.random.randn(n, 3) + X = np.column_stack([X, X[:, 0] + X[:, 1]]) # Column 3 is dropped + y = np.random.randn(n) + + with warnings.catch_warnings(record=True): + warnings.simplefilter("always") + reg = LinearRegression(include_intercept=False).fit(X, y) + + # Find the dropped coefficient index + nan_mask = np.isnan(reg.coefficients_) + dropped_idx = np.where(nan_mask)[0][0] + + # Get inference for dropped coefficient + result = reg.get_inference(dropped_idx) + + # All inference values should be NaN + assert np.isnan(result.coefficient) + assert np.isnan(result.se) + assert np.isnan(result.t_stat) + assert np.isnan(result.p_value) + assert np.isnan(result.conf_int[0]) + assert np.isnan(result.conf_int[1]) + + def test_rank_deficient_predict_uses_identified_coefficients(self): + """Test that predict() works correctly with rank-deficient fits. + + Predictions should use only identified coefficients (treating dropped + coefficients as zero), not produce all-NaN predictions. 
+ """ + import warnings + + np.random.seed(42) + n = 100 + # Create rank-deficient matrix + X = np.random.randn(n, 3) + X = np.column_stack([X, X[:, 0] + X[:, 1]]) # Column 3 is collinear + + # True model uses only first 3 columns + y = 2 * X[:, 0] + 3 * X[:, 1] - 1 * X[:, 2] + np.random.randn(n) * 0.5 + + with warnings.catch_warnings(record=True): + warnings.simplefilter("always") + reg = LinearRegression(include_intercept=False).fit(X, y) + + # Verify one coefficient is NaN + assert np.sum(np.isnan(reg.coefficients_)) == 1 + + # Predictions should NOT be all NaN + X_new = np.random.randn(10, 4) + y_pred = reg.predict(X_new) + + assert y_pred.shape == (10,) + assert np.all(np.isfinite(y_pred)), "Predictions should be finite, not NaN" + + # Verify predictions match fitted values on training data + y_fitted = reg.predict(X) + np.testing.assert_allclose(y_fitted, reg.fitted_values_, rtol=1e-10) + + def test_rank_deficient_adjusted_r_squared_uses_effective_params(self): + """Test that adjusted R² uses effective params, not total params. + + For rank-deficient fits, adjusted R² should use n_params_effective_ + for consistency with the corrected degrees of freedom. 
+ """ + import warnings + + np.random.seed(42) + n = 100 + # Create rank-deficient matrix: 4 columns but rank 3 + X = np.random.randn(n, 3) + X = np.column_stack([X, X[:, 0] + X[:, 1]]) # Column 3 = Column 0 + Column 1 + + y = 2 * X[:, 0] + np.random.randn(n) * 0.5 + + with warnings.catch_warnings(record=True): + warnings.simplefilter("always") + reg = LinearRegression(include_intercept=False).fit(X, y) + + # Verify setup + assert reg.n_params_ == 4 + assert reg.n_params_effective_ == 3 + + # Get R² values + r2 = reg.r_squared() + r2_adj = reg.r_squared(adjusted=True) + + # Both should be valid numbers + assert 0 <= r2 <= 1 + assert r2_adj < r2 # Adjusted should be smaller + + # Manually compute adjusted R² using effective params + # r²_adj = 1 - (1 - r²) * (n - 1) / (n - k_effective) + r2_adj_expected = 1 - (1 - r2) * (n - 1) / (n - 3) + np.testing.assert_allclose(r2_adj, r2_adj_expected, rtol=1e-10) + + # Verify it's NOT using total params (which would give different result) + r2_adj_wrong = 1 - (1 - r2) * (n - 1) / (n - 4) + assert r2_adj != r2_adj_wrong, "Should use effective params, not total params" + + def test_rank_deficient_action_error_raises(self): + """Test that LinearRegression with rank_deficient_action='error' raises on collinear data.""" + np.random.seed(42) + n = 100 + X = np.random.randn(n, 3) + X = np.column_stack([X, X[:, 0] + X[:, 1]]) # Perfect collinearity + y = np.random.randn(n) + + reg = LinearRegression(include_intercept=False, rank_deficient_action="error") + with pytest.raises(ValueError, match="rank-deficient"): + reg.fit(X, y) + + def test_rank_deficient_action_silent_no_warning(self): + """Test that LinearRegression with rank_deficient_action='silent' produces no warning.""" + import warnings + + np.random.seed(42) + n = 100 + X = np.random.randn(n, 3) + X = np.column_stack([X, X[:, 0] + X[:, 1]]) # Perfect collinearity + y = np.random.randn(n) + + reg = LinearRegression(include_intercept=False, rank_deficient_action="silent") + + 
with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + reg.fit(X, y) + # No warnings should be emitted + assert len(w) == 0, f"Expected no warnings, got {len(w)}: {[str(x.message) for x in w]}" + + # Should still produce NaN for dropped column + assert np.sum(np.isnan(reg.coefficients_)) == 1 + + def test_rank_deficient_action_warn_default(self): + """Test that LinearRegression with rank_deficient_action='warn' (default) emits warning.""" + import warnings + + np.random.seed(42) + n = 100 + X = np.random.randn(n, 3) + X = np.column_stack([X, X[:, 0] + X[:, 1]]) # Perfect collinearity + y = np.random.randn(n) + + reg = LinearRegression(include_intercept=False) # Default is "warn" + + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + reg.fit(X, y) + # Should have a warning about rank deficiency + assert len(w) > 0, "Expected warning about rank deficiency" + assert any("Rank-deficient" in str(x.message) or "rank-deficient" in str(x.message).lower() + for x in w), f"Expected rank-deficient warning, got: {[str(x.message) for x in w]}" + class TestNumericalStability: """Tests for numerical stability with ill-conditioned matrices.""" @@ -833,15 +1262,18 @@ def test_near_singular_matrix_stability(self): np.random.seed(42) n = 100 - # Create near-collinear design (high condition number) + # Create near-collinear design (high condition number but above rank tolerance) + # The rank detection tolerance is 1e-07 (matching R's qr()), so we use noise + # of 1e-5 which is clearly above the tolerance and provides a distinguishable + # signal. With noise < 1e-07, the column would be considered linearly dependent. 
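The 1e-07 tolerance described in the comment above can be checked with a standalone pivoted-QR rank sketch. This is plain numpy/scipy for illustration only: `qr_rank` is a hypothetical helper in the spirit of R's `qr()`, not the library's `_detect_rank_deficiency`.

```python
import numpy as np
from scipy import linalg

def qr_rank(M, tol=1e-7):
    # Pivoted QR: with pivoting, |diag(R)| is non-increasing, so comparing
    # each entry to tol * |R[0, 0]| counts the numerically independent columns.
    _, R, _ = linalg.qr(M, mode="economic", pivoting=True)
    d = np.abs(np.diag(R))
    return int(np.sum(d > tol * d[0]))

np.random.seed(0)
n = 100
base = np.random.randn(n, 3)

# Noise well above the tolerance: column 2 stays distinguishable (full rank).
X_above = base.copy()
X_above[:, 2] = base[:, 0] + base[:, 1] + np.random.randn(n) * 1e-5

# Noise far below the tolerance: column 2 is treated as linearly dependent.
X_below = base.copy()
X_below[:, 2] = base[:, 0] + base[:, 1] + np.random.randn(n) * 1e-12

assert qr_rank(X_above) == 3
assert qr_rank(X_below) == 2
```

This is why the tests above use 1e-5 noise: it sits comfortably on the full-rank side of the cutoff while still producing a badly conditioned matrix.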
X = np.random.randn(n, 3) - X[:, 2] = X[:, 0] + X[:, 1] + np.random.randn(n) * 1e-8 # Near-perfect collinearity + X[:, 2] = X[:, 0] + X[:, 1] + np.random.randn(n) * 1e-5 # Near but not perfect collinearity y = X[:, 0] + np.random.randn(n) * 0.1 reg = LinearRegression(include_intercept=True).fit(X, y) - # Should still produce finite coefficients + # Should still produce finite coefficients (noise is above tolerance) assert np.all(np.isfinite(reg.coefficients_)) # Compare with numpy's lstsq (gold standard for stability) @@ -849,7 +1281,7 @@ def test_near_singular_matrix_stability(self): expected, _, _, _ = np.linalg.lstsq(X_full, y, rcond=None) # Should be close (within reasonable tolerance for ill-conditioned problem) - np.testing.assert_allclose(reg.coefficients_, expected, rtol=1e-6) + np.testing.assert_allclose(reg.coefficients_, expected, rtol=1e-4) def test_high_condition_number_matrix(self): """Test that high condition number matrices don't lose precision.""" diff --git a/tests/test_rust_backend.py b/tests/test_rust_backend.py index ad42fcff..5362cb7f 100644 --- a/tests/test_rust_backend.py +++ b/tests/test_rust_backend.py @@ -337,6 +337,108 @@ def test_near_singular_with_clusters(self): assert np.all(np.isfinite(residuals)), "Residuals should be finite" assert vcov.shape == (3, 3), "VCoV should have correct shape" + # ========================================================================= + # Rank-Deficient Matrix Tests (Critical for MultiPeriodDiD) + # ========================================================================= + + def test_rank_deficient_matrix_produces_valid_coefficients(self): + """Test that rank-deficient matrices produce finite, reasonable coefficients. + + This test verifies the fix for the MultiPeriodDiD bug where rank-deficient + design matrices (with redundant columns) produced astronomically wrong + estimates (trillions instead of single digits). 
+ + The SVD-based solver should truncate small singular values and produce + a valid minimum-norm solution. + """ + from diff_diff._rust_backend import solve_ols + + np.random.seed(42) + n = 100 + + # Create perfectly collinear design matrix (rank-deficient) + # This mimics what can happen in MultiPeriodDiD with period dummies + X = np.random.randn(n, 3) + X[:, 2] = X[:, 0] + X[:, 1] # Column 3 = Column 1 + Column 2 + + y = X[:, 0] + np.random.randn(n) * 0.1 + + # Rust backend should handle this gracefully via SVD truncation + coeffs, residuals, vcov = solve_ols(X, y, None, True) + + # Coefficients must be finite (not NaN or Inf) + assert np.all(np.isfinite(coeffs)), f"Coefficients should be finite, got {coeffs}" + + # Coefficients should be reasonable (not astronomically large like 1e12) + assert np.all(np.abs(coeffs) < 1e6), f"Coefficients are unreasonably large: {coeffs}" + + # Residuals should be correct given coefficients + expected_residuals = y - X @ coeffs + np.testing.assert_array_almost_equal( + residuals, expected_residuals, decimal=8, + err_msg="Residuals should match y - X @ coeffs" + ) + + def test_multiperiod_did_like_design_matrix(self): + """Test design matrix structure similar to MultiPeriodDiD. + + MultiPeriodDiD creates design matrices with: + - Intercept + - Period dummies (one-hot encoded) + - Treatment × post interaction terms + + These can create rank-deficient matrices when period dummies and + interaction terms are not all linearly independent. 
+ """ + from diff_diff._rust_backend import solve_ols + + np.random.seed(42) + n = 200 + n_periods = 5 + + # Create MultiPeriodDiD-like design matrix + intercept = np.ones(n) + + # Period dummies (periods 1-4, period 0 is reference) + period_assignment = np.random.randint(0, n_periods, n) + period_dummies = np.zeros((n, n_periods - 1)) + for i in range(1, n_periods): + period_dummies[:, i - 1] = (period_assignment == i).astype(float) + + # Treatment indicator and post indicator + treated = np.random.binomial(1, 0.5, n) + post = (period_assignment >= 3).astype(float) + treat_post = treated * post + + # Build design matrix (potentially rank-deficient) + X = np.column_stack([intercept, period_dummies, treat_post]) + + # True effect + true_effect = 2.5 + y = ( + 1.0 + + 0.5 * period_dummies[:, 0] + + 0.3 * period_dummies[:, 1] + + 0.7 * period_dummies[:, 2] + + 0.9 * period_dummies[:, 3] + + true_effect * treat_post + + np.random.randn(n) * 0.5 + ) + + # Fit with Rust backend + coeffs, residuals, vcov = solve_ols(X, y, None, True) + + # Coefficients must be finite + assert np.all(np.isfinite(coeffs)), f"Coefficients should be finite, got {coeffs}" + + # Coefficients should be reasonable (not trillions) + assert np.all(np.abs(coeffs) < 1e6), f"Coefficients are unreasonably large: {coeffs}" + + # Treatment effect (last coefficient) should be close to true effect + assert abs(coeffs[-1] - true_effect) < 2.0, ( + f"Treatment effect {coeffs[-1]} is too far from true effect {true_effect}" + ) + @pytest.mark.skipif(not HAS_RUST_BACKEND, reason="Rust backend not available") class TestRustVsNumpy: @@ -393,6 +495,124 @@ def test_solve_ols_with_clusters_match(self): err_msg="Clustered OLS VCoV should match" ) + def test_rank_deficient_ols_residuals_match(self): + """Test Rust and NumPy produce matching residuals for rank-deficient matrices. + + The Rust backend uses SVD truncation while NumPy uses R-style NaN handling. 
+ Despite different approaches, both should produce equivalent residuals. + + Note: The coefficient representations differ: + - Rust: All finite (SVD minimum-norm solution) + - NumPy: NaN for dropped columns (R-style) + But both produce the same fitted values and residuals. + """ + import warnings + from diff_diff._rust_backend import solve_ols as rust_fn + from diff_diff.linalg import _solve_ols_numpy as numpy_fn + + np.random.seed(42) + n = 100 + + # Create rank-deficient design matrix (perfect collinearity) + X = np.random.randn(n, 3) + X[:, 2] = X[:, 0] + X[:, 1] # Column 3 = Column 1 + Column 2 + + y = X[:, 0] + 2 * X[:, 1] + np.random.randn(n) * 0.1 + + # Rust backend produces finite coefficients via SVD truncation + rust_coeffs, rust_resid, _ = rust_fn(X, y, None, True) + + # NumPy backend produces NaN for dropped columns (R-style) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") # Suppress rank-deficient warning + numpy_coeffs, numpy_resid, _ = numpy_fn(X, y, cluster_ids=None) + + # Rust should produce finite coefficients + assert np.all(np.isfinite(rust_coeffs)), "Rust coefficients should be finite" + assert np.all(np.abs(rust_coeffs) < 1e6), "Rust coefficients should be reasonable" + + # NumPy should produce exactly one NaN coefficient (the dropped one) + assert np.sum(np.isnan(numpy_coeffs)) == 1, "NumPy should have one NaN coefficient" + + # Non-NaN NumPy coefficients should be reasonable + finite_numpy = numpy_coeffs[~np.isnan(numpy_coeffs)] + assert np.all(np.abs(finite_numpy) < 1e6), "NumPy finite coefficients should be reasonable" + + # Residuals should be very close (this is the key equivalence check) + # Both approaches should produce the same fitted values and residuals + np.testing.assert_array_almost_equal( + rust_resid, numpy_resid, decimal=5, + err_msg="Residuals should match despite different coefficient representations" + ) + + def test_multiperiod_did_design_residuals_equivalence(self): + """Test both backends produce 
equivalent residuals for MultiPeriodDiD-like matrices. + + For full-rank designs, both backends should produce identical results. + The design matrix in this test is typically full-rank. + """ + import warnings + from diff_diff._rust_backend import solve_ols as rust_fn + from diff_diff.linalg import _solve_ols_numpy as numpy_fn + + np.random.seed(42) + n = 200 + n_periods = 5 + + # Create MultiPeriodDiD-like design matrix + intercept = np.ones(n) + period_assignment = np.random.randint(0, n_periods, n) + period_dummies = np.zeros((n, n_periods - 1)) + for i in range(1, n_periods): + period_dummies[:, i - 1] = (period_assignment == i).astype(float) + + treated = np.random.binomial(1, 0.5, n) + post = (period_assignment >= 3).astype(float) + treat_post = treated * post + + X = np.column_stack([intercept, period_dummies, treat_post]) + + true_effect = 2.5 + y = ( + 1.0 + + 0.5 * period_dummies[:, 0] + + 0.3 * period_dummies[:, 1] + + 0.7 * period_dummies[:, 2] + + 0.9 * period_dummies[:, 3] + + true_effect * treat_post + + np.random.randn(n) * 0.5 + ) + + rust_coeffs, rust_resid, _ = rust_fn(X, y, None, True) + + with warnings.catch_warnings(): + warnings.simplefilter("ignore") # May or may not warn depending on rank + numpy_coeffs, numpy_resid, _ = numpy_fn(X, y, cluster_ids=None) + + # Rust should produce finite treatment effect + rust_effect = rust_coeffs[-1] + assert np.isfinite(rust_effect), "Rust treatment effect should be finite" + assert abs(rust_effect - true_effect) < 2.0, ( + f"Rust treatment effect {rust_effect} too far from true {true_effect}" + ) + + # NumPy treatment effect should be close (may be finite or NaN depending on rank) + numpy_effect = numpy_coeffs[-1] + if np.isfinite(numpy_effect): + assert abs(numpy_effect - true_effect) < 2.0, ( + f"NumPy treatment effect {numpy_effect} too far from true {true_effect}" + ) + # Effects should be close to each other + assert abs(rust_effect - numpy_effect) < 0.5, ( + f"Rust ({rust_effect}) and NumPy 
({numpy_effect}) effects should match" + ) + + # Residuals should be very close (key equivalence check) + np.testing.assert_array_almost_equal( + rust_resid, numpy_resid, decimal=5, + err_msg="Residuals should match for MultiPeriodDiD-like design" + ) + # ========================================================================= # Robust VCoV Equivalence # ========================================================================= @@ -563,6 +783,89 @@ def test_simplex_projection_match(self): err_msg=f"Simplex projection mismatch for input {v}" ) + def test_nan_vcov_fallback_to_python(self): + """Test that NaN vcov from Rust backend triggers fallback to Python. + + When Rust SVD detects rank-deficiency that Python QR missed (due to + different numerical properties), the vcov matrix may contain NaN values. + The high-level solve_ols should detect this and fall back to Python's + R-style handling, ensuring the user never receives silent NaN SEs. + + The key behavior being tested: + 1. When Rust returns NaN vcov, we emit a warning and re-run Python + 2. The Python re-run does fresh rank detection (not using cached info) + 3. R-style handling is applied: NaN coefficients for dropped columns + """ + import warnings + from diff_diff.linalg import solve_ols + + # Create an ill-conditioned matrix that might cause QR/SVD disagreement. + # The condition number is extremely high, which may cause the Rust SVD + # to detect numerical issues that QR doesn't catch. + np.random.seed(42) + n = 100 + + # Create a matrix with near-perfect but not exact collinearity. + # This is on the boundary where QR/SVD might disagree. 
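The residual-equivalence property these backend-comparison tests rely on can be sketched standalone: an SVD minimum-norm fit and an R-style drop-the-column fit produce different coefficient vectors, but identical fitted values, because both project y onto the same column space. Plain numpy, illustrative variable names only:

```python
import numpy as np

np.random.seed(0)
n = 100
X = np.random.randn(n, 3)
X[:, 2] = X[:, 0] + X[:, 1]                  # rank 2: columns span a plane
y = X[:, 0] + 2 * X[:, 1] + np.random.randn(n) * 0.1

# SVD minimum-norm solution: all three coefficients finite.
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-style: solve on the independent columns, NaN for the dropped one.
beta_drop = np.full(3, np.nan)
beta_drop[:2] = np.linalg.lstsq(X[:, :2], y, rcond=None)[0]

# Coefficients differ, but fitted values (hence residuals) agree.
fit_svd = X @ beta_svd
fit_drop = X[:, :2] @ beta_drop[:2]
assert np.allclose(fit_svd, fit_drop)
```

The same argument explains why the Rust (SVD truncation) and Python (R-style NaN) backends can disagree on coefficient representation yet still match on residuals to high precision.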
+ X = np.random.randn(n, 4) + # Make column 3 almost (but not exactly) a linear combination of 0-2 + X[:, 3] = X[:, 0] + X[:, 1] + X[:, 2] + np.random.randn(n) * 1e-12 + + y = np.random.randn(n) + + # Capture any warnings that might be emitted + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + coeffs, residuals, vcov = solve_ols(X, y) + + # Check if fallback warning was emitted + fallback_warning_emitted = any( + "Re-running with Python backend" in str(warning.message) + for warning in w + ) + + # Key invariants that must hold regardless of which backend is used: + # 1. Coefficients must be finite (either via Rust SVD or Python R-style) + finite_coeffs = coeffs[np.isfinite(coeffs)] + assert len(finite_coeffs) >= 3, \ + "At least 3 coefficients should be finite (identifiable)" + assert np.all(np.abs(finite_coeffs) < 1e10), \ + f"Finite coefficients should be reasonable, got {finite_coeffs}" + + # 2. If vcov has any finite values, they should correspond to finite coefficients + if vcov is not None: + finite_coef_mask = np.isfinite(coeffs) + for i in range(len(coeffs)): + if finite_coef_mask[i]: + # This coefficient's variance should be finite + var_i = vcov[i, i] + assert np.isfinite(var_i) or np.isnan(var_i), \ + f"Variance for finite coef {i} should be finite or NaN (dropped)" + + # 3. Residuals must always be finite + assert np.all(np.isfinite(residuals)), "Residuals should be finite" + + # 4. R-style consistency: NaN coefficients must have NaN vcov diagonal + if vcov is not None: + nan_coef_indices = set(np.where(np.isnan(coeffs))[0]) + nan_vcov_diag_indices = set(np.where(np.isnan(np.diag(vcov)))[0]) + + # NaN in vcov diagonal should correspond exactly to NaN coefficients + assert nan_vcov_diag_indices == nan_coef_indices, \ + f"NaN vcov diagonal {nan_vcov_diag_indices} should match " \ + f"NaN coefficients {nan_coef_indices}" + + # 5. 
If fallback warning was emitted, R-style handling MUST have occurred + # This verifies that the fallback actually applies R-style NaN handling + # (not minimum-norm solution which would have all finite coefficients) + if fallback_warning_emitted: + assert np.any(np.isnan(coeffs)), \ + "Fallback warning emitted but no NaN coefficients - " \ + "R-style handling was not applied" + assert vcov is not None and np.any(np.isnan(vcov)), \ + "Fallback warning emitted but vcov has no NaN - " \ + "R-style handling was not applied" + @pytest.mark.skipif(not HAS_RUST_BACKEND, reason="Rust backend not available") class TestTROPRustBackend: diff --git a/tests/test_staggered.py b/tests/test_staggered.py index f78f756a..39fda7fa 100644 --- a/tests/test_staggered.py +++ b/tests/test_staggered.py @@ -685,8 +685,12 @@ def test_near_collinear_covariates(self): """Test that near-collinear covariates are handled gracefully.""" data = generate_staggered_data_with_covariates(seed=42) - # Add a near-collinear covariate (x1 + small noise) - data['x1_copy'] = data['x1'] + np.random.randn(len(data)) * 1e-8 + # Add a near-collinear covariate (x1 + noise above rank detection tolerance) + # The rank detection tolerance is 1e-07 (matching R's qr()), so we use noise + # of 1e-5 which is above the tolerance but still creates high collinearity. + # With noise < 1e-07, the column would be considered linearly dependent. 
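As background for the duplicated-covariate cases below: with an exactly duplicated column, a minimum-norm solution splits the true coefficient evenly across the two copies, while the R-style approach keeps one copy at the full value and marks the other NaN — fitted values coincide either way. A plain-numpy sketch, not the library API:

```python
import numpy as np

np.random.seed(0)
n = 500
x1 = np.random.randn(n)
X = np.column_stack([np.ones(n), x1, x1])     # duplicated covariate
y = 1.0 + 3.0 * x1 + np.random.randn(n) * 0.1

# Minimum-norm least squares spreads the slope over both copies (~1.5 each).
beta_min, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-style: keep one copy (slope ~3.0), NaN for the duplicate.
beta_r = np.full(3, np.nan)
beta_r[:2] = np.linalg.lstsq(X[:, :2], y, rcond=None)[0]

# Same fitted values despite different coefficient representations.
assert np.allclose(X @ beta_min, X[:, :2] @ beta_r[:2])
```

The split coefficients are why silently returning a minimum-norm solution can mislead: each copy's coefficient is individually wrong even though predictions are fine, whereas the NaN makes the redundancy explicit.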
+ np.random.seed(42) + data['x1_copy'] = data['x1'] + np.random.randn(len(data)) * 1e-5 cs = CallawaySantAnna(estimation_method='reg') results = cs.fit( @@ -698,7 +702,7 @@ def test_near_collinear_covariates(self): covariates=['x1', 'x1_copy'] # Nearly collinear ) - # Should still produce valid results (lstsq handles this) + # Should still produce valid results (noise is above tolerance) assert results.overall_att is not None assert np.isfinite(results.overall_att) @@ -726,6 +730,61 @@ def test_missing_values_in_covariates_warning(self): assert results.overall_att is not None assert results.overall_se > 0 + def test_rank_deficient_action_error_raises(self): + """Test that rank_deficient_action='error' raises ValueError on collinear data.""" + data = generate_staggered_data_with_covariates(seed=42) + + # Add a covariate that is perfectly collinear with x1 + data["x1_dup"] = data["x1"].copy() + + cs = CallawaySantAnna( + estimation_method="reg", # Use regression method to test OLS path + rank_deficient_action="error" + ) + with pytest.raises(ValueError, match="rank-deficient"): + cs.fit( + data, + outcome='outcome', + unit='unit', + time='time', + first_treat='first_treat', + covariates=['x1', 'x1_dup'] + ) + + def test_rank_deficient_action_silent_no_warning(self): + """Test that rank_deficient_action='silent' produces no warning.""" + import warnings + + data = generate_staggered_data_with_covariates(seed=42) + + # Add a covariate that is perfectly collinear with x1 + data["x1_dup"] = data["x1"].copy() + + cs = CallawaySantAnna( + estimation_method="reg", # Use regression method to test OLS path + rank_deficient_action="silent" + ) + + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + results = cs.fit( + data, + outcome='outcome', + unit='unit', + time='time', + first_treat='first_treat', + covariates=['x1', 'x1_dup'] + ) + + # No warnings about rank deficiency should be emitted + rank_warnings = [x for x in w if "Rank-deficient" 
in str(x.message) + or "rank-deficient" in str(x.message).lower()] + assert len(rank_warnings) == 0, f"Expected no rank warnings, got {rank_warnings}" + + # Should still get valid results + assert results is not None + assert results.overall_att is not None + class TestCallawaySantAnnaBootstrap: """Tests for Callaway-Sant'Anna multiplier bootstrap inference.""" diff --git a/tests/test_sun_abraham.py b/tests/test_sun_abraham.py index 0096bf8a..c34ae26a 100644 --- a/tests/test_sun_abraham.py +++ b/tests/test_sun_abraham.py @@ -730,3 +730,56 @@ def test_cohort_level_dataframe(self): assert "relative_period" in df_cohort.columns assert "effect" in df_cohort.columns assert "weight" in df_cohort.columns + + def test_rank_deficient_action_error_raises(self): + """Test that rank_deficient_action='error' raises ValueError on collinear data.""" + data = generate_staggered_data(seed=42) + + # Add covariates that are perfectly collinear + np.random.seed(42) + data["cov1"] = np.random.randn(len(data)) + data["cov1_dup"] = data["cov1"].copy() + + sa = SunAbraham(rank_deficient_action="error") + with pytest.raises(ValueError, match="rank-deficient"): + sa.fit( + data, + outcome="outcome", + unit="unit", + time="time", + first_treat="first_treat", + covariates=["cov1", "cov1_dup"] + ) + + def test_rank_deficient_action_silent_no_warning(self): + """Test that rank_deficient_action='silent' produces no warning.""" + import warnings + + data = generate_staggered_data(seed=42) + + # Add covariates that are perfectly collinear + np.random.seed(42) + data["cov1"] = np.random.randn(len(data)) + data["cov1_dup"] = data["cov1"].copy() + + sa = SunAbraham(rank_deficient_action="silent") + + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + results = sa.fit( + data, + outcome="outcome", + unit="unit", + time="time", + first_treat="first_treat", + covariates=["cov1", "cov1_dup"] + ) + + # No warnings about rank deficiency should be emitted + rank_warnings = 
[x for x in w if "Rank-deficient" in str(x.message) + or "rank-deficient" in str(x.message).lower()] + assert len(rank_warnings) == 0, f"Expected no rank warnings, got {rank_warnings}" + + # Should still get valid results + assert results is not None + assert results.overall_att is not None diff --git a/tests/test_triple_diff.py b/tests/test_triple_diff.py index 4bbde03a..02d77b32 100644 --- a/tests/test_triple_diff.py +++ b/tests/test_triple_diff.py @@ -853,3 +853,108 @@ def test_repr_does_not_raise(self, simple_ddd_data): # Should not raise repr_str = repr(results) assert "TripleDifferenceResults" in repr_str + + +# ============================================================================= +# Rank Deficiency Tests +# ============================================================================= + + +class TestRankDeficientAction: + """Tests for rank_deficient_action parameter handling.""" + + @pytest.fixture + def ddd_data_with_covariates(self): + """Create DDD data with covariates for testing.""" + np.random.seed(42) + n = 400 + data = pd.DataFrame({ + "group": np.repeat([0, 1], n // 2), + "partition": np.tile(np.repeat([0, 1], n // 4), 2), + "time": np.tile([0, 1], n // 2), + "x1": np.random.randn(n), + }) + + # Generate outcome with effect + data["outcome"] = ( + 1.0 + + 0.5 * data["x1"] + + 0.5 * data["group"] + + 0.3 * data["partition"] + + 0.2 * data["time"] + + 2.0 * data["group"] * data["partition"] * data["time"] + + np.random.randn(n) * 0.5 + ) + + return data + + def test_rank_deficient_action_error_raises(self, ddd_data_with_covariates): + """Test that rank_deficient_action='error' raises ValueError on collinear data.""" + # Add a covariate that is perfectly collinear with x1 + ddd_data_with_covariates["x1_dup"] = ddd_data_with_covariates["x1"].copy() + + ddd = TripleDifference( + estimation_method="reg", # Use regression method to test OLS path + rank_deficient_action="error" + ) + with pytest.raises(ValueError, match="rank-deficient"): + ddd.fit( 
+ ddd_data_with_covariates, + outcome="outcome", + group="group", + partition="partition", + time="time", + covariates=["x1", "x1_dup"] + ) + + def test_rank_deficient_action_silent_no_warning(self, ddd_data_with_covariates): + """Test that rank_deficient_action='silent' produces no warning.""" + import warnings + + # Add a covariate that is perfectly collinear with x1 + ddd_data_with_covariates["x1_dup"] = ddd_data_with_covariates["x1"].copy() + + ddd = TripleDifference( + estimation_method="reg", # Use regression method to test OLS path + rank_deficient_action="silent" + ) + + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + results = ddd.fit( + ddd_data_with_covariates, + outcome="outcome", + group="group", + partition="partition", + time="time", + covariates=["x1", "x1_dup"] + ) + + # No warnings about rank deficiency should be emitted + rank_warnings = [x for x in w if "Rank-deficient" in str(x.message) + or "rank-deficient" in str(x.message).lower()] + assert len(rank_warnings) == 0, f"Expected no rank warnings, got {rank_warnings}" + + # Should still get valid results + assert results is not None + assert results.att is not None + + def test_convenience_function_passes_rank_deficient_action(self, ddd_data_with_covariates): + """Test that triple_difference() convenience function passes rank_deficient_action.""" + from diff_diff import triple_difference + + # Add a covariate that is perfectly collinear with x1 + ddd_data_with_covariates["x1_dup"] = ddd_data_with_covariates["x1"].copy() + + # Should raise with "error" action + with pytest.raises(ValueError, match="rank-deficient"): + triple_difference( + ddd_data_with_covariates, + outcome="outcome", + group="group", + partition="partition", + time="time", + estimation_method="reg", + covariates=["x1", "x1_dup"], + rank_deficient_action="error" + )
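
The behavior these tests pin down (pivoted-QR rank detection, NaN coefficients for dropped columns, and the `"warn"`/`"error"`/`"silent"` actions) can be sketched in plain NumPy/SciPy. This is an illustrative sketch, not diff-diff's actual `solve_ols()` internals; the function names, signatures, and the 1e-7 tolerance mirroring R's `qr()` default are assumptions for demonstration:

```python
import warnings

import numpy as np
from scipy.linalg import lstsq, qr


def detect_rank_deficiency(X, tol=1e-7):
    """Pivoted-QR rank detection (illustrative, analogous to R's qr()).

    Columns whose |R| diagonal falls below tol * max|diag(R)| are treated
    as linearly dependent. Returns (rank, kept_cols, dropped_cols).
    """
    _, R, piv = qr(X, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    rank = int(np.sum(diag > tol * diag.max())) if diag.size else 0
    return rank, np.sort(piv[:rank]), np.sort(piv[rank:])


def solve_ols_r_style(X, y, rank_deficient_action="warn", column_names=None):
    """OLS that drops dependent columns and leaves NaN coefficients, like lm()."""
    rank, kept, dropped = detect_rank_deficiency(X)
    if dropped.size:
        labels = ([column_names[j] for j in dropped]
                  if column_names is not None else dropped.tolist())
        msg = f"Design matrix is rank-deficient; dropping columns {labels}"
        if rank_deficient_action == "error":
            raise ValueError(msg)
        if rank_deficient_action == "warn":
            warnings.warn(msg)
        # "silent": fall through without emitting anything
    beta = np.full(X.shape[1], np.nan)
    beta[kept] = lstsq(X[:, kept], y)[0]  # solve only on independent columns
    return beta


# Perfectly collinear design: third column duplicates x1, as in the tests above.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(200)
X = np.column_stack([np.ones(200), x1, x1])
y = 1.0 + 2.0 * x1 + rng.standard_normal(200) * 0.1

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    beta = solve_ols_r_style(X, y, column_names=["const", "x1", "x1_dup"])
# exactly one of the two collinear columns ends up with a NaN coefficient
```

This is why the assertions above check `np.any(np.isnan(coeffs))` rather than finiteness: a minimum-norm fallback (e.g. SVD-based `lstsq` on the full matrix) would instead spread the effect across both collinear columns and return all-finite coefficients, silently masking the deficiency.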