From 519a3eb00580a2f8e6335f1d2531dca6fd2c6867 Mon Sep 17 00:00:00 2001 From: igerber Date: Mon, 16 Feb 2026 17:32:23 -0500 Subject: [PATCH 1/2] Update TODO.md and ROADMAP.md for accuracy post-v2.4.0 Fix stale line references, correct module line counts, mark completed NaN handling as done, expand safe_inference migration table to 25 sites across 11 files, add tech debt section from PR reviews, and bump ROADMAP.md version from v2.3.0 to v2.4.0. Co-Authored-By: Claude Opus 4.6 --- ROADMAP.md | 2 +- TODO.md | 107 +++++++++++++++++++++++++++++++++-------------------- 2 files changed, 68 insertions(+), 41 deletions(-) diff --git a/ROADMAP.md b/ROADMAP.md index 47e1d2f5..e6290039 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -8,7 +8,7 @@ For past changes and release history, see [CHANGELOG.md](CHANGELOG.md). ## Current Status -diff-diff v2.3.0 is a **production-ready** DiD library with feature parity with R's `did` + `HonestDiD` + `synthdid` ecosystem for core DiD analysis: +diff-diff v2.4.0 is a **production-ready** DiD library with feature parity with R's `did` + `HonestDiD` + `synthdid` ecosystem for core DiD analysis: - **Core estimators**: Basic DiD, TWFE, MultiPeriod, Callaway-Sant'Anna, Sun-Abraham, Borusyak-Jaravel-Spiess Imputation, Synthetic DiD, Triple Difference (DDD), TROP, Two-Stage DiD (Gardner 2022) - **Valid inference**: Robust SEs, cluster SEs, wild bootstrap, multiplier bootstrap, placebo-based variance diff --git a/TODO.md b/TODO.md index 408716ed..fb4da081 100644 --- a/TODO.md +++ b/TODO.md @@ -12,8 +12,8 @@ Current limitations that may affect users: | Issue | Location | Priority | Notes | |-------|----------|----------|-------| -| MultiPeriodDiD wild bootstrap not supported | `estimators.py:1068-1074` | Low | Edge case | -| `predict()` raises NotImplementedError | `estimators.py:532-554` | Low | Rarely needed | +| MultiPeriodDiD wild bootstrap not supported | `estimators.py:779-785` | Low | Edge case | +| `predict()` raises NotImplementedError | `estimators.py:568-587` | Low | Rarely needed | ## Code Quality @@ -23,59 +23,50 @@ Target: < 1000 lines per module for maintainability. | File | Lines | Action | |------|-------|--------| -| ~~`staggered.py`~~ | ~~2301~~ 1066 | ✅ Split into staggered.py, staggered_bootstrap.py, staggered_aggregation.py, staggered_results.py | +| ~~`staggered.py`~~ | ~~2301~~ 1120 | ✅ Split into staggered.py, staggered_bootstrap.py, staggered_aggregation.py, staggered_results.py | | ~~`prep.py`~~ | ~~1993~~ 1241 | ✅ Split: DGP functions moved to `prep_dgp.py` (777 lines) | -| `trop.py` | 1703 | Monitor size | -| `visualization.py` | 1627 | Acceptable but growing | -| `honest_did.py` | 1493 | Acceptable | -| `utils.py` | 1481 | Acceptable | +| `trop.py` | 2904 | **Needs split** -- nearly 3x the 1000-line target | +| `imputation.py` | 2480 | **Needs split** -- results, bootstrap, aggregation like staggered | +| `two_stage.py` | 2209 | **Needs split** -- same pattern as imputation | +| `utils.py` | 1879 | **Needs split** -- legacy placebo functions could move to diagnostics | +| `visualization.py` | 1678 | Monitor -- growing but cohesive | +| `linalg.py` | 1537 | Monitor -- unified backend, splitting would hurt cohesion | +| `honest_did.py` | 1511 | Acceptable | | `power.py` | 1350 | Acceptable | -| `triple_diff.py` | 1291 | Acceptable | -| `sun_abraham.py` | 1176 | Acceptable | -| `pretrends.py` | 1160 | Acceptable | -| `bacon.py` | 1027 | OK | +| `triple_diff.py` | 1322 | Acceptable | +| `sun_abraham.py` | 1227 | Acceptable | +| `estimators.py` | 1161 | Acceptable | +| `pretrends.py` | 1104 | Acceptable | -### NaN Handling for Undefined t-statistics +### ~~NaN Handling for Undefined t-statistics~~ -- DONE -Several estimators return `0.0` for t-statistic when SE is 0 or undefined. This is incorrect—a t-stat of 0 implies a null effect, whereas `np.nan` correctly indicates undefined inference. +All 7 t_stat locations fixed (diagnostics.py, sun_abraham.py, triple_diff.py) -- all now use `np.nan` or `np.isfinite()` guards. Fixed in PR #118 and follow-up PRs. -**Pattern to fix**: `t_stat = effect / se if se > 0 else 0.0` → `t_stat = effect / se if se > 0 else np.nan` - -| Location | Line | Current Code | -|----------|------|--------------| -| `diagnostics.py` | 665 | `t_stat = original_att / se if se > 0 else 0.0` | -| `diagnostics.py` | 786 | `t_stat = mean_effect / se if se > 0 else 0.0` | -| `sun_abraham.py` | 603 | `overall_t = overall_att / overall_se if overall_se > 0 else 0.0` | -| `sun_abraham.py` | 626 | `overall_t = overall_att / overall_se if overall_se > 0 else 0.0` | -| `sun_abraham.py` | 643 | `eff_val / se_val if se_val > 0 else 0.0` | -| `sun_abraham.py` | 881 | `t_stat = agg_effect / agg_se if agg_se > 0 else 0.0` | -| `triple_diff.py` | 601 | `t_stat = att / se if se > 0 else 0.0` | - -**Priority**: Medium - affects inference reporting in edge cases. - -**Note**: CallawaySantAnna was fixed in PR #97 to use `np.nan`. These other estimators should follow the same pattern. +**Remaining nuance**: `diagnostics.py:785` still has `se = ... else 0.0` for the SE variable itself (not t_stat). The downstream t_stat line correctly returns `np.nan`, so inference is safe, but the SE value of 0.0 is technically incorrect for an undefined SE. ### Migrate Existing Inference Call Sites to `safe_inference()` -`safe_inference()` was added to `diff_diff/utils.py` to compute t_stat, p_value, and CI together with a NaN gate at the top. It is now the prescribed pattern for all new code (see CLAUDE.md design pattern #7). However, ~20 existing inline inference computations across 12 files have **not** been migrated yet. +`safe_inference()` was added to `diff_diff/utils.py` to compute t_stat, p_value, and CI together with a NaN gate at the top. It is now the prescribed pattern for all new code (see CLAUDE.md design pattern #7). However, ~25 existing inline inference computations across 11 files have **not** been migrated yet. **Files with inline inference to migrate:** | File | Approx. Locations | |------|-------------------| -| `estimators.py` | 1 | -| `sun_abraham.py` | 4 | -| `staggered.py` | 4 | -| `staggered_aggregation.py` | 2 | -| `triple_diff.py` | 1 | -| `imputation.py` | 2 | -| `diagnostics.py` | 2 | -| `synthetic_did.py` | 1 | -| `trop.py` | 2 | +| `estimators.py` | 2 (lines 1038, 1089) | +| `sun_abraham.py` | 4 (lines 621, 644, 661, 905) | +| `staggered.py` | 6 (lines 696, 725, 763, 777, 792, 806) | +| `staggered_aggregation.py` | 2 (lines 407, 479) | +| `triple_diff.py` | 1 (line 601) | +| `imputation.py` | 2 (lines 1806, 1927) | +| `two_stage.py` | 2 (lines 1325, 1431) | +| `diagnostics.py` | 2 (lines 665, 786) | +| `synthetic_did.py` | 1 (line 426) | +| `trop.py` | 2 (lines 1474, 2054) | +| `linalg.py` | 1 (line 1310) | **How to find them:** ```bash -grep -n "t_stat[[:space:]]*=[[:space:]]*[^#]*/ *se" diff_diff/*.py | grep -v "safe_inference" +grep -n "t_stat\s*=\s*[^#]*/\|overall_t\s*=\s*[^#]*/" diff_diff/*.py | grep -v safe_inference | grep -v "^.*:#" ``` **Migration pattern:** @@ -94,6 +85,42 @@ t_stat, p_value, ci = safe_inference(effect, se, alpha=alpha, df=df) --- +### Tech Debt from Code Reviews + +Deferred items from PR reviews that were not addressed before merge. + +#### Methodology/Correctness + +| Issue | Location | PR | Priority | +|-------|----------|----|----------| +| TwoStageDiD & ImputationDiD bootstrap hardcodes Rademacher only; docs/Registry claim Mammen/Webb | `two_stage.py:1860`, `imputation.py:2363` | #156, #141 | Medium | +| TwoStageDiD GMM score logic duplicated between analytic/bootstrap with inconsistent NaN/overflow handling | `two_stage.py:1454-1784` | #156 | Medium | +| ImputationDiD weight construction duplicated between aggregation and bootstrap (drift risk) -- has explicit code comment acknowledging duplication | `imputation.py:1777-1786`, `imputation.py:2216-2221` | #141 | Medium | +| ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py:1564` | #141 | Medium | + +#### Performance + +| Issue | Location | PR | Priority | +|-------|----------|----|----------| +| TwoStageDiD per-column `.toarray()` in loop for cluster scores | `two_stage.py:1766-1767` | #156 | Medium | +| ImputationDiD event-study SEs recompute full conservative variance per horizon (should cache A0/A1 factorization) | `imputation.py:1772-1804` | #141 | Low | +| Legacy `compute_placebo_effects` uses deprecated projected-gradient weights (marked deprecated, users directed to `SyntheticDiD`) | `utils.py:1689-1691` | #145 | Low | +| Rust faer SVD ndarray-to-faer conversion overhead (minimal vs SVD cost) | `rust/src/linalg.rs:67` | #115 | Low | + +#### Testing/Docs + +| Issue | Location | PR | Priority | +|-------|----------|----|----------| +| Tutorial notebooks not executed in CI | `docs/tutorials/*.ipynb` | #159 | Low | +| TwoStageDiD `test_nan_propagation` is a no-op (only `pass`) | `tests/test_two_stage.py:643-652` | #156 | Low | +| ImputationDiD bootstrap + covariate path untested | `tests/test_imputation.py` | #141 | Low | +| TROP `n_bootstrap >= 2` validation missing (can yield 0/NaN SE silently) | `trop.py:462` | #124 | Low | +| SunAbraham deprecated `min_pre_periods`/`min_post_periods` still in `fit()` docstring | `sun_abraham.py:458-487` | #153 | Low | +| R comparison tests spawn separate `Rscript` per test (slow CI) | `tests/test_methodology_twfe.py:294` | #139 | Low | +| Rust TROP bootstrap SE returns 0.0 instead of NaN for <2 samples | `rust/src/trop.rs:1038-1054` | #115 | Low | + +--- + ### Standard Error Consistency Different estimators compute SEs differently. Consider unified interface. @@ -187,7 +214,7 @@ Replace `@` operator with `np.dot()` at affected call sites. `np.dot()` bypasses - `linalg.py`: 332, 690, 704, 829, 1463 - `staggered.py`: 78, 85, 102, 860, 1024-1025 - `triple_diff.py`: 301, 307, 323 -- `utils.py`: 481 +- `utils.py`: 515 - `imputation.py`: 1253, 1301, 1602, 1662 - `trop.py`: 1098 From 00c4616f45f6c588a1b18e9666f288b195efe26e Mon Sep 17 00:00:00 2001 From: igerber Date: Mon, 16 Feb 2026 17:57:54 -0500 Subject: [PATCH 2/2] Fix TODO.md documentation accuracy per PR #160 review feedback Address two P2 findings: correct Rademacher/Mammen/Webb claim in tech debt table (Registry only documents Rademacher), and improve safe_inference grep command to catch dict-based patterns with documented limitations. Add missing utils.py:641 call site to migration table. Co-Authored-By: Claude Opus 4.6 --- TODO.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/TODO.md b/TODO.md index fb4da081..1d436bef 100644 --- a/TODO.md +++ b/TODO.md @@ -46,7 +46,7 @@ All 7 t_stat locations fixed (diagnostics.py, sun_abraham.py, triple_diff.py) -- ### Migrate Existing Inference Call Sites to `safe_inference()` -`safe_inference()` was added to `diff_diff/utils.py` to compute t_stat, p_value, and CI together with a NaN gate at the top. It is now the prescribed pattern for all new code (see CLAUDE.md design pattern #7). However, ~25 existing inline inference computations across 11 files have **not** been migrated yet. +`safe_inference()` was added to `diff_diff/utils.py` to compute t_stat, p_value, and CI together with a NaN gate at the top. It is now the prescribed pattern for all new code (see CLAUDE.md design pattern #7). However, ~26 existing inline inference computations across 12 files have **not** been migrated yet. **Files with inline inference to migrate:** @@ -62,13 +62,16 @@ All 7 t_stat locations fixed (diagnostics.py, sun_abraham.py, triple_diff.py) -- | `diagnostics.py` | 2 (lines 665, 786) | | `synthetic_did.py` | 1 (line 426) | | `trop.py` | 2 (lines 1474, 2054) | +| `utils.py` | 1 (line 641) | | `linalg.py` | 1 (line 1310) | **How to find them:** ```bash -grep -n "t_stat\s*=\s*[^#]*/\|overall_t\s*=\s*[^#]*/" diff_diff/*.py | grep -v safe_inference | grep -v "^.*:#" +grep -En "(t_stat|overall_t)\s*=\s*[^#]*/|\[.t_stat.\]\s*=\s*[^#]*/" diff_diff/*.py | grep -v "def safe_inference" ``` +**Note**: This command has one false positive (`utils.py:178`, inside the `safe_inference()` body) and misses multi-line expressions (e.g., `sun_abraham.py:660-661`). The table above is the authoritative list. + **Migration pattern:** ```python # Before (inline, error-prone) @@ -93,7 +96,7 @@ Deferred items from PR reviews that were not addressed before merge. | Issue | Location | PR | Priority | |-------|----------|----|----------| -| TwoStageDiD & ImputationDiD bootstrap hardcodes Rademacher only; docs/Registry claim Mammen/Webb | `two_stage.py:1860`, `imputation.py:2363` | #156, #141 | Medium | +| TwoStageDiD & ImputationDiD bootstrap hardcodes Rademacher only; no `bootstrap_weights` parameter unlike CallawaySantAnna | `two_stage.py:1860`, `imputation.py:2363` | #156, #141 | Medium | | TwoStageDiD GMM score logic duplicated between analytic/bootstrap with inconsistent NaN/overflow handling | `two_stage.py:1454-1784` | #156 | Medium | | ImputationDiD weight construction duplicated between aggregation and bootstrap (drift risk) -- has explicit code comment acknowledging duplication | `imputation.py:1777-1786`, `imputation.py:2216-2221` | #141 | Medium | | ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py:1564` | #141 | Medium |