28 changes: 23 additions & 5 deletions CLAUDE.md
@@ -103,22 +103,35 @@ cross-platform compilation - no OpenBLAS or Intel MKL installation required.

- **`diff_diff/imputation.py`** - Borusyak-Jaravel-Spiess imputation DiD estimator:
- `ImputationDiD` - Borusyak et al. (2024) efficient imputation estimator for staggered DiD
- `ImputationDiDResults` - Results with overall ATT, event study, group effects, pre-trend test
- `ImputationBootstrapResults` - Multiplier bootstrap inference results
- `imputation_did()` - Convenience function
- Steps: (1) OLS on untreated obs for unit+time FE, (2) impute counterfactual Y(0), (3) aggregate
- Conservative variance (Theorem 3) with `aux_partition` parameter for SE tightness
- Pre-trend test (Equation 9) via `results.pretrend_test()`
- Proposition 5: NaN for unidentified long-run horizons without never-treated units
- Re-exports result and bootstrap classes for backward compatibility
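The three imputation steps can be illustrated with a minimal, self-contained sketch (hypothetical column names `unit`, `time`, `y`, `treated`; the library's actual implementation uses sparse fixed-effects solvers and the conservative variance machinery, not a dense dummy regression):

```python
import numpy as np
import pandas as pd

def impute_att(df: pd.DataFrame) -> float:
    """Minimal BJS-style imputation sketch: fit unit+time fixed effects
    on untreated observations, impute Y(0) for treated ones, average gaps."""
    # Full-panel dummy design so the untreated fit and the treated
    # prediction share the same columns.
    D = pd.get_dummies(df[["unit", "time"]].astype(str), drop_first=True)
    X = np.column_stack([np.ones(len(df)), D.to_numpy(dtype=float)])
    y = df["y"].to_numpy(dtype=float)
    untreated = df["treated"].to_numpy() == 0

    # Step 1: OLS for unit + time FE on untreated observations only.
    beta, *_ = np.linalg.lstsq(X[untreated], y[untreated], rcond=None)

    # Step 2: impute the counterfactual Y(0) for treated observations.
    y0_hat = X[~untreated] @ beta

    # Step 3: aggregate per-observation effects into the overall ATT.
    return float(np.mean(y[~untreated] - y0_hat))
```

On a noiseless staggered panel with additive unit and time effects, this recovers the treatment effect exactly, which is the sense in which the imputation estimator is "efficient" under the TWFE model.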

- **`diff_diff/imputation_results.py`** - Result container classes:
- `ImputationBootstrapResults` - Multiplier bootstrap inference results
- `ImputationDiDResults` - Results with overall ATT, event study, group effects, pre-trend test

- **`diff_diff/imputation_bootstrap.py`** - Bootstrap inference:
- `ImputationDiDBootstrapMixin` - Mixin with multiplier bootstrap methods
- Cluster-level influence function sums (Theorem 3) with Rademacher weights

- **`diff_diff/two_stage.py`** - Gardner (2022) Two-Stage DiD estimator:
- `TwoStageDiD` - Two-stage estimator: (1) estimate unit+time FE on untreated obs, (2) regress residualized outcomes on treatment indicators
- `TwoStageDiDResults` - Results with overall ATT, event study, group effects, per-observation treatment effects
- `TwoStageBootstrapResults` - Multiplier bootstrap inference on GMM influence function
- `two_stage_did()` - Convenience function
- Point estimates identical to ImputationDiD; different variance estimator (GMM sandwich vs. conservative)
- Custom `_compute_gmm_variance()` — cannot reuse `compute_robust_vcov()` because correction term uses GLOBAL cross-moment
- No finite-sample adjustments (raw asymptotic sandwich, matching R `did2s`)
- Re-exports result and bootstrap classes for backward compatibility
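The two stages can be sketched in a few lines (same hypothetical column names as above; not the library implementation, which handles event-study indicators and GMM variance). Note how stage 2 on residualized outcomes yields the same point estimate as imputation:

```python
import numpy as np
import pandas as pd

def two_stage_att(df: pd.DataFrame) -> float:
    """Gardner-style two-stage sketch: (1) unit+time FE on untreated
    observations, (2) regress residualized outcomes on treatment."""
    D = pd.get_dummies(df[["unit", "time"]].astype(str), drop_first=True)
    X = np.column_stack([np.ones(len(df)), D.to_numpy(dtype=float)])
    y = df["y"].to_numpy(dtype=float)
    d = df["treated"].to_numpy(dtype=float)
    untreated = d == 0

    # Stage 1: estimate fixed effects on untreated observations only.
    beta, *_ = np.linalg.lstsq(X[untreated], y[untreated], rcond=None)

    # Stage 2: OLS of the residualized outcome on the treatment indicator.
    resid = y - X @ beta
    return float(resid @ d / (d @ d))
```

The variance differs from ImputationDiD because stage 2 must account for stage-1 estimation error through the GMM influence function, not because the point estimates differ.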

- **`diff_diff/two_stage_results.py`** - Result container classes:
- `TwoStageBootstrapResults` - Multiplier bootstrap inference on GMM influence function
- `TwoStageDiDResults` - Results with overall ATT, event study, group effects, per-observation treatment effects

- **`diff_diff/two_stage_bootstrap.py`** - Bootstrap inference:
- `TwoStageDiDBootstrapMixin` - Mixin with GMM influence function bootstrap methods

- **`diff_diff/triple_diff.py`** - Triple Difference (DDD) estimator:
- `TripleDifference` - Ortiz-Villavicencio & Sant'Anna (2025) estimator for DDD designs
@@ -129,14 +142,19 @@ cross-platform compilation - no OpenBLAS or Intel MKL installation required.

- **`diff_diff/trop.py`** - Triply Robust Panel (TROP) estimator (v2.1.0):
- `TROP` - Athey, Imbens, Qu & Viviano (2025) estimator with factor model adjustment
- `TROPResults` - Results with ATT, factors, loadings, unit/time weights
- `trop()` - Convenience function for quick estimation
- Three robustness components: factor adjustment, unit weights, time weights
- Two estimation methods via `method` parameter:
- `"twostep"` (default): Per-observation model fitting (Algorithm 2 of paper)
- `"joint"`: Weighted least squares with homogeneous treatment effect (faster)
- Automatic rank selection via cross-validation, information criterion, or elbow detection
- Bootstrap and placebo-based variance estimation
- Re-exports result classes for backward compatibility
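As a rough illustration of elbow detection (one of the three rank-selection rules listed above), one can pick the rank just before the sharpest relative drop in the singular-value spectrum. This is a generic sketch of the idea, not TROP's actual selection rule:

```python
import numpy as np

def elbow_rank(Y: np.ndarray, max_rank: int = 10) -> int:
    """Pick the factor-model rank at the elbow of the singular values:
    the rank just before the largest relative drop."""
    s = np.linalg.svd(Y, compute_uv=False)[:max_rank + 1]
    ratios = s[1:] / s[:-1]            # consecutive singular-value ratios
    return int(np.argmin(ratios)) + 1  # rank before the sharpest drop
```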

- **`diff_diff/trop_results.py`** - Result container classes:
- `_LAMBDA_INF` - Sentinel value for disabled factor model (λ_nn=∞)
- `_PrecomputedStructures` - TypedDict for cached matrices
- `TROPResults` - Results with ATT, factors, loadings, unit/time weights

- **`diff_diff/bacon.py`** - Goodman-Bacon decomposition for TWFE diagnostics:
- `BaconDecomposition` - Decompose TWFE into weighted 2x2 comparisons (Goodman-Bacon 2021)
80 changes: 16 additions & 64 deletions TODO.md
@@ -25,10 +25,10 @@ Target: < 1000 lines per module for maintainability.
| File | Lines | Status |
|------|-------|--------|
| ~~`staggered.py`~~ | ~~2301~~ 1120 | ✅ Split into staggered.py, staggered_bootstrap.py, staggered_aggregation.py, staggered_results.py |
| ~~`prep.py`~~ | ~~1993~~ 1241 | ✅ Split: DGP functions moved to `prep_dgp.py` (777 lines) |
| `trop.py` | 2904 | **Needs split** -- nearly 3x the 1000-line target |
| `imputation.py` | 2480 | **Needs split** -- results, bootstrap, aggregation like staggered |
| `two_stage.py` | 2209 | **Needs split** -- same pattern as imputation |
| `utils.py` | 1879 | **Needs split** -- legacy placebo functions could move to diagnostics |
| ~~`trop.py`~~ | ~~2904~~ ~2560 | ✅ Partially split: results extracted to `trop_results.py` (~340 lines) |
| ~~`imputation.py`~~ | ~~2480~~ ~1740 | ✅ Split into imputation.py, imputation_results.py, imputation_bootstrap.py |
| ~~`two_stage.py`~~ | ~~2209~~ ~1490 | ✅ Split into two_stage.py, two_stage_results.py, two_stage_bootstrap.py |
| `utils.py` | 1879 | Monitor -- legacy placebo functions stay to avoid circular imports |
| `visualization.py` | 1678 | Monitor -- growing but cohesive |
| `linalg.py` | 1537 | Monitor -- unified backend, splitting would hurt cohesion |
| `honest_did.py` | 1511 | Acceptable |
@@ -42,49 +42,11 @@ Target: < 1000 lines per module for maintainability.

All 7 t_stat locations fixed (diagnostics.py, sun_abraham.py, triple_diff.py) -- all now use `np.nan` or `np.isfinite()` guards. Fixed in PR #118 and follow-up PRs.

**Remaining nuance**: `diagnostics.py:785` still has `se = ... else 0.0` for the SE variable itself (not t_stat). The downstream t_stat line correctly returns `np.nan`, so inference is safe, but the SE value of 0.0 is technically incorrect for an undefined SE.
~~**Remaining nuance**: `diagnostics.py:785` SE = 0.0~~ — ✅ Fixed: SE now returns `np.nan` when undefined, and all downstream inference uses `safe_inference()`.

### Migrate Existing Inference Call Sites to `safe_inference()`
### ~~Migrate Existing Inference Call Sites to `safe_inference()`~~ -- DONE

`safe_inference()` was added to `diff_diff/utils.py` to compute t_stat, p_value, and CI together with a NaN gate at the top. It is now the prescribed pattern for all new code (see CLAUDE.md design pattern #7). However, ~26 existing inline inference computations across 12 files have **not** been migrated yet.

**Files with inline inference to migrate:**

| File | Approx. Locations |
|------|-------------------|
| `estimators.py` | 2 (lines 1038, 1089) |
| `sun_abraham.py` | 4 (lines 621, 644, 661, 905) |
| `staggered.py` | 6 (lines 696, 725, 763, 777, 792, 806) |
| `staggered_aggregation.py` | 2 (lines 407, 479) |
| `triple_diff.py` | 1 (line 601) |
| `imputation.py` | 2 (lines 1806, 1927) |
| `two_stage.py` | 2 (lines 1325, 1431) |
| `diagnostics.py` | 2 (lines 665, 786) |
| `synthetic_did.py` | 1 (line 426) |
| `trop.py` | 2 (lines 1474, 2054) |
| `utils.py` | 1 (line 641) |
| `linalg.py` | 1 (line 1310) |

**How to find them:**
```bash
grep -En "(t_stat|overall_t)\s*=\s*[^#]*/|\[.t_stat.\]\s*=\s*[^#]*/" diff_diff/*.py | grep -v "def safe_inference"
```

**Note**: This command has one false positive (`utils.py:178`, inside the `safe_inference()` body) and misses multi-line expressions (e.g., `sun_abraham.py:660-661`). The table above is the authoritative list.

**Migration pattern:**
```python
# Before (inline, error-prone)
t_stat = effect / se if se > 0 else 0.0
p_value = compute_p_value(t_stat)
ci = compute_confidence_interval(effect, se, alpha)

# After (NaN-safe, consistent)
from diff_diff.utils import safe_inference
t_stat, p_value, ci = safe_inference(effect, se, alpha=alpha, df=df)
```

**Priority**: Medium — the NaN-handling table above covers the worst cases (those using `0.0`). The remaining sites may use partial guards but should still be migrated for consistency and to prevent regressions.
✅ All ~32 inline inference call sites migrated to `safe_inference()` across 11 source files: `estimators.py`, `sun_abraham.py`, `staggered.py`, `staggered_aggregation.py`, `triple_diff.py`, `imputation.py`, `two_stage.py`, `diagnostics.py`, `synthetic_did.py`, `trop.py`, `utils.py`. Two sites left as-is with comments: `diagnostics.py:665` (permutation-based p_value) and `linalg.py:1310` (deliberately uses ±inf for zero-SE).
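The NaN gate that `safe_inference()` provides can be sketched as follows (a simplified illustration of the pattern, not the exact `diff_diff` implementation; the real helper lives in `diff_diff/utils.py`):

```python
import numpy as np
from scipy import stats

def safe_inference_sketch(effect, se, alpha=0.05, df=None):
    """Return (t_stat, p_value, (ci_lower, ci_upper)) with a single
    NaN gate up front: an undefined SE yields NaN inference, never 0.0."""
    if not (np.isfinite(effect) and np.isfinite(se) and se > 0):
        return np.nan, np.nan, (np.nan, np.nan)
    t_stat = effect / se
    if df is None:  # normal reference distribution
        p_value = 2 * stats.norm.sf(abs(t_stat))
        crit = stats.norm.ppf(1 - alpha / 2)
    else:           # t reference distribution with df degrees of freedom
        p_value = 2 * stats.t.sf(abs(t_stat), df)
        crit = stats.t.ppf(1 - alpha / 2, df)
    return t_stat, p_value, (effect - crit * se, effect + crit * se)
```

Centralizing the gate is what makes the migration safe: every call site inherits identical NaN semantics instead of re-deriving its own guard.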

---

@@ -96,17 +58,17 @@ Deferred items from PR reviews that were not addressed before merge.

| Issue | Location | PR | Priority |
|-------|----------|----|----------|
| TwoStageDiD & ImputationDiD bootstrap hardcodes Rademacher only; no `bootstrap_weights` parameter unlike CallawaySantAnna | `two_stage.py:1860`, `imputation.py:2363` | #156, #141 | Medium |
| TwoStageDiD GMM score logic duplicated between analytic/bootstrap with inconsistent NaN/overflow handling | `two_stage.py:1454-1784` | #156 | Medium |
| ImputationDiD weight construction duplicated between aggregation and bootstrap (drift risk) -- has explicit code comment acknowledging duplication | `imputation.py:1777-1786`, `imputation.py:2216-2221` | #141 | Medium |
| ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py:1564` | #141 | Medium |
| TwoStageDiD & ImputationDiD bootstrap hardcodes Rademacher only; no `bootstrap_weights` parameter unlike CallawaySantAnna | `two_stage_bootstrap.py`, `imputation_bootstrap.py` | #156, #141 | Medium |
| TwoStageDiD GMM score logic duplicated between analytic/bootstrap with inconsistent NaN/overflow handling | `two_stage.py`, `two_stage_bootstrap.py` | #156 | Medium |
| ImputationDiD weight construction duplicated between aggregation and bootstrap (drift risk) -- has explicit code comment acknowledging duplication | `imputation.py`, `imputation_bootstrap.py` | #141 | Medium |
| ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py` | #141 | Medium |
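For the first row above, a hedged sketch of what a `bootstrap_weights` option could look like (the parameter name mirrors CallawaySantAnna; this is a hypothetical interface, not committed API):

```python
import numpy as np

def multiplier_weights(n, kind="rademacher", rng=None):
    """Draw n multiplier-bootstrap weights of the requested kind."""
    rng = rng or np.random.default_rng()
    if kind == "rademacher":
        # +/-1 with equal probability: mean 0, variance 1.
        return rng.choice([-1.0, 1.0], size=n)
    if kind == "mammen":
        # Mammen (1993) two-point weights: mean 0, variance 1, third moment 1.
        phi = (1 + np.sqrt(5)) / 2
        p = phi / np.sqrt(5)
        return np.where(rng.random(n) < p, 1 - phi, phi)
    raise ValueError(f"unknown bootstrap weight kind: {kind!r}")
```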

#### Performance

| Issue | Location | PR | Priority |
|-------|----------|----|----------|
| TwoStageDiD per-column `.toarray()` in loop for cluster scores | `two_stage.py:1766-1767` | #156 | Medium |
| ImputationDiD event-study SEs recompute full conservative variance per horizon (should cache A0/A1 factorization) | `imputation.py:1772-1804` | #141 | Low |
| TwoStageDiD per-column `.toarray()` in loop for cluster scores | `two_stage_bootstrap.py` | #156 | Medium |
| ImputationDiD event-study SEs recompute full conservative variance per horizon (should cache A0/A1 factorization) | `imputation.py` | #141 | Low |
| Legacy `compute_placebo_effects` uses deprecated projected-gradient weights (marked deprecated, users directed to `SyntheticDiD`) | `utils.py:1689-1691` | #145 | Low |
| Rust faer SVD ndarray-to-faer conversion overhead (minimal vs SVD cost) | `rust/src/linalg.rs:67` | #115 | Low |

@@ -115,7 +77,7 @@ Deferred items from PR reviews that were not addressed before merge.
| Issue | Location | PR | Priority |
|-------|----------|----|----------|
| Tutorial notebooks not executed in CI | `docs/tutorials/*.ipynb` | #159 | Low |
| TwoStageDiD `test_nan_propagation` is a no-op (only `pass`) | `tests/test_two_stage.py:643-652` | #156 | Low |
| ~~TwoStageDiD `test_nan_propagation` is a no-op~~ | ~~`tests/test_two_stage.py:643-652`~~ | ~~#156~~ | ✅ Fixed |
| ImputationDiD bootstrap + covariate path untested | `tests/test_imputation.py` | #141 | Low |
| TROP `n_bootstrap >= 2` validation missing (can yield 0/NaN SE silently) | `trop.py:462` | #124 | Low |
| SunAbraham deprecated `min_pre_periods`/`min_post_periods` still in `fit()` docstring | `sun_abraham.py:458-487` | #153 | Low |
@@ -209,19 +171,9 @@ Spurious RuntimeWarnings ("divide by zero", "overflow", "invalid value") are emitted
- Occurs in IPW and DR estimation methods with covariates
- Related to logistic regression overflow in edge cases (separate from BLAS bug)

### Fix Plan (follow-up PR)

Replace `@` operator with `np.dot()` at affected call sites. `np.dot()` bypasses the ufunc FPE dispatch layer and produces identical results with zero spurious warnings on M4.

**Affected files and lines:**
- `linalg.py`: 332, 690, 704, 829, 1463
- `staggered.py`: 78, 85, 102, 860, 1024-1025
- `triple_diff.py`: 301, 307, 323
- `utils.py`: 515
- `imputation.py`: 1253, 1301, 1602, 1662
- `trop.py`: 1098
### ~~Fix Plan (follow-up PR)~~ -- DONE

**Regression test:** Assert no RuntimeWarnings from `solve_ols()` with n ≥ 500 on all platforms.
✅ Replaced `@` operator with `np.dot()` at all 19 affected call sites across 6 files: `linalg.py` (5), `staggered.py` (5), `triple_diff.py` (3), `utils.py` (1), `imputation.py` (4), `trop.py` (1). Regression test added in `test_linalg.py::TestNoDotRuntimeWarnings`.

**Long-term:** Revert to `@` operator when numpy ≥ 2.3 becomes the minimum supported version.
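The substitution and the regression check can be sketched as follows (illustrative only; the real test lives in `test_linalg.py::TestNoDotRuntimeWarnings`):

```python
import warnings
import numpy as np

def matmul_no_fpe(a, b):
    """np.dot() bypasses the ufunc FPE dispatch layer that can make
    `a @ b` emit spurious RuntimeWarnings on affected platforms."""
    return np.dot(a, b)

# Regression-style check: escalate RuntimeWarning to an error and
# verify a large product completes cleanly with identical results.
a = np.random.default_rng(0).normal(size=(500, 500))
with warnings.catch_warnings():
    warnings.simplefilter("error", RuntimeWarning)
    out = matmul_no_fpe(a, a)
assert np.allclose(out, a @ a)
```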

14 changes: 4 additions & 10 deletions diff_diff/diagnostics.py
@@ -19,7 +19,7 @@

from diff_diff.estimators import DifferenceInDifferences
from diff_diff.results import _get_significance_stars
from diff_diff.utils import compute_confidence_interval, compute_p_value
from diff_diff.utils import safe_inference


@dataclass
@@ -661,7 +661,7 @@ def permutation_test(
ci_lower = np.percentile(valid_effects, alpha / 2 * 100)
ci_upper = np.percentile(valid_effects, (1 - alpha / 2) * 100)

# T-stat from original estimate
# NOTE: Not using safe_inference — p_value is permutation-based, CI is percentile-based.
t_stat = original_att / se if np.isfinite(se) and se > 0 else np.nan

return PlaceboTestResults(
@@ -782,15 +782,9 @@ def leave_one_out_test(

# Statistics of LOO distribution
mean_effect = np.mean(valid_effects)
se = np.std(valid_effects, ddof=1) if len(valid_effects) > 1 else 0.0
t_stat = mean_effect / se if np.isfinite(se) and se > 0 else np.nan

# Use t-distribution for p-value
se = np.std(valid_effects, ddof=1) if len(valid_effects) > 1 else np.nan
df = len(valid_effects) - 1 if len(valid_effects) > 1 else 1
p_value = compute_p_value(t_stat, df=df)

# CI
conf_int = compute_confidence_interval(mean_effect, se, alpha, df=df) if np.isfinite(se) and se > 0 else (np.nan, np.nan)
t_stat, p_value, conf_int = safe_inference(mean_effect, se, alpha=alpha, df=df)

return PlaceboTestResults(
test_type="leave_one_out",
24 changes: 5 additions & 19 deletions diff_diff/estimators.py
Expand Up @@ -27,9 +27,8 @@
from diff_diff.results import DiDResults, MultiPeriodDiDResults, PeriodEffect
from diff_diff.utils import (
WildBootstrapResults,
compute_confidence_interval,
compute_p_value,
demean_by_group,
safe_inference,
validate_binary,
wild_bootstrap_se,
)
@@ -1034,14 +1033,7 @@ def fit(  # type: ignore[override]
idx = interaction_indices[period]
effect = coefficients[idx]
se = np.sqrt(vcov[idx, idx])
if np.isfinite(se) and se > 0:
t_stat = effect / se
p_value = compute_p_value(t_stat, df=df)
conf_int = compute_confidence_interval(effect, se, self.alpha, df=df)
else:
t_stat = np.nan
p_value = np.nan
conf_int = (np.nan, np.nan)
t_stat, p_value, conf_int = safe_inference(effect, se, alpha=self.alpha, df=df)

period_effects[period] = PeriodEffect(
period=period,
@@ -1085,15 +1077,9 @@ def fit(  # type: ignore[override]
avg_conf_int = (np.nan, np.nan)
else:
avg_se = float(np.sqrt(avg_var))
if np.isfinite(avg_se) and avg_se > 0:
avg_t_stat = avg_att / avg_se
avg_p_value = compute_p_value(avg_t_stat, df=df)
avg_conf_int = compute_confidence_interval(avg_att, avg_se, self.alpha, df=df)
else:
# Zero SE (degenerate case)
avg_t_stat = np.nan
avg_p_value = np.nan
avg_conf_int = (np.nan, np.nan)
avg_t_stat, avg_p_value, avg_conf_int = safe_inference(
avg_att, avg_se, alpha=self.alpha, df=df
)

# Count observations
n_treated = int(np.sum(d))