Add Two-Stage DiD estimator (Gardner 2022) by igerber · Pull Request #156 · igerber/diff-diff

igerber · 2026-02-16T17:29:16Z

Summary

- Add TwoStageDiD estimator implementing Gardner (2022) two-stage differences-in-differences
- Stage 1: estimate unit+time FE on untreated observations only; Stage 2: regress residualized outcomes on treatment indicators
- Point estimates identical to ImputationDiD (verified to machine precision, ~1e-16)
- Custom GMM sandwich variance (Newey & McFadden 1994, Theorem 6.1) accounting for first-stage estimation error; cannot reuse compute_robust_vcov because the correction term uses a GLOBAL cross-moment
- No finite-sample adjustments (raw asymptotic sandwich, matching R did2s)
- Multiplier bootstrap on the GMM influence function (consistent with the library; R uses block bootstrap)
- Static, event study, and group aggregation modes
- 51 tests across 7 test classes, including equivalence tests with ImputationDiD
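
The two-stage procedure summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code in diff_diff/two_stage.py: the function name `two_stage_att` and its arguments are hypothetical, and it assumes a static specification with no covariates, no weights, and dense dummy matrices.

```python
import numpy as np

def two_stage_att(y, unit, time, treated):
    """Static two-stage DiD point estimate (hypothetical helper, sketch only)."""
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    _, u_idx = np.unique(unit, return_inverse=True)
    _, t_idx = np.unique(time, return_inverse=True)
    n, n_u, n_t = len(y), u_idx.max() + 1, t_idx.max() + 1

    # First-stage design: unit dummies plus time dummies (first period dropped).
    X1 = np.zeros((n, n_u + n_t - 1))
    X1[np.arange(n), u_idx] = 1.0
    later = t_idx > 0
    X1[np.flatnonzero(later), n_u + t_idx[later] - 1] = 1.0

    # Stage 1: fit unit and time fixed effects on untreated observations only.
    gamma, *_ = np.linalg.lstsq(X1[~treated], y[~treated], rcond=None)

    # Stage 2: residualize every observation, then regress on the treatment dummy.
    y_tilde = y - X1 @ gamma
    D = treated.astype(float)
    return float(D @ y_tilde / (D @ D))

# Noise-free staggered panel with a true effect of 2.0 (illustrative data).
unit = np.repeat(np.arange(4), 5)
time = np.tile(np.arange(5), 4)
first_treat = np.array([np.inf, np.inf, 2, 3])[unit]
treated = time >= first_treat
y = 0.5 * unit + 0.3 * time + 2.0 * treated
att = two_stage_att(y, unit, time, treated)
```

Because the fixed effects are fitted only where treatment is absent, the stage-2 regression sees residuals that equal the treatment effect plus noise, which is why the point estimate coincides with an imputation-style estimator.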

Methodology references (required if estimator / math changes)

Method name(s): Two-Stage DiD (Gardner 2022), GMM sandwich variance (Butts & Gardner 2022)
Paper / source link(s):
- Gardner, J. (2022). "Two-stage differences in differences." arXiv:2207.05943. https://arxiv.org/abs/2207.05943
- Butts, K., & Gardner, J. (2022). "did2s: Two-Stage Difference-in-Differences." The R Journal, 14(1), 162-173. https://doi.org/10.32614/RJ-2022-048
- Newey, W.K., & McFadden, D. (1994). "Large Sample Estimation and Hypothesis Testing." Handbook of Econometrics, Vol. 4.
Reference implementation: R did2s::did2s() (Kyle Butts & John Gardner)
Any intentional deviations from the source (and why):
- Uses multiplier bootstrap (library convention) instead of R's default block bootstrap — both asymptotically valid
- Uses GLOBAL (X'_{10} X_{10})^{-1} in GMM variance (matching R source code), not per-cluster inverse from paper Eq. 6
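
A rough sketch of the clustered GMM sandwich with the global inverse is given below. It is an assumption-laden illustration rather than the PR's implementation: the function name `gmm_sandwich_vcov` is hypothetical, matrices are dense, and the first-stage design is assumed to have full column rank on untreated rows. The names `gamma_hat` and `bread` follow the terms used in the review comments.

```python
import numpy as np

def gmm_sandwich_vcov(X1, X2, untreated, eps10, eps2, cluster):
    """Cluster-robust two-stage vcov for stage-2 coefficients (sketch only).

    X1: (n, p) first-stage design; X2: (n, k) second-stage design;
    untreated: (n,) bool mask of stage-1 rows; eps10: first-stage residuals
    (used on untreated rows only); eps2: second-stage residuals.
    """
    X10 = X1 * untreated[:, None]                 # treated rows zeroed out
    e10 = np.where(untreated, eps10, 0.0)
    # Global cross-moment: a single (X10'X10)^{-1} X1'X2 shared by all clusters.
    gamma_hat = np.linalg.solve(X10.T @ X10, X1.T @ X2)            # (p, k)
    # Per-observation scores: stage-2 score minus first-stage correction.
    scores = X2 * eps2[:, None] - (X10 * e10[:, None]) @ gamma_hat  # (n, k)
    _, g_idx = np.unique(cluster, return_inverse=True)
    S = np.zeros((g_idx.max() + 1, X2.shape[1]))
    np.add.at(S, g_idx, scores)                   # sum scores within cluster
    bread = np.linalg.inv(X2.T @ X2)
    return bread @ (S.T @ S) @ bread              # no finite-sample adjustment

# Illustrative data: 6 units, 4 periods, units 3-5 treated from period 2.
rng = np.random.default_rng(0)
n_u, n_t = 6, 4
unit = np.repeat(np.arange(n_u), n_t)
time = np.tile(np.arange(n_t), n_u)
treated = (unit >= 3) & (time >= 2)
X1 = np.column_stack([unit[:, None] == np.arange(n_u),
                      time[:, None] == np.arange(1, n_t)]).astype(float)
X2 = treated.astype(float)[:, None]
y = 0.5 * unit + 0.3 * time + 2.0 * treated + rng.normal(size=unit.size)
gamma, *_ = np.linalg.lstsq(X1[~treated], y[~treated], rcond=None)
y_tilde = y - X1 @ gamma
beta, *_ = np.linalg.lstsq(X2, y_tilde, rcond=None)
eps2 = y_tilde - X2 @ beta
vcov = gmm_sandwich_vcov(X1, X2, ~treated, y_tilde, eps2, unit)
```

The correction term is what separates this from an ordinary cluster-robust sandwich on the stage-2 regression: it carries the first-stage estimation error into the stage-2 scores.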

Validation

Tests added/updated:
- tests/test_two_stage.py — 51 tests in 7 classes (basic, equivalence, variance, edge cases, parameters, bootstrap, convenience)
- Key validation: point estimates match ImputationDiD within ~1e-16; GMM SEs differ from conservative SEs (ratio ~1.27)
Full test suite: 1258 passed, 0 failures (31 pre-existing skips)
Backtest / simulation / notebook evidence: smoke test with generate_staggered_data(n_units=200, n_periods=10) produces expected results

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Implement the Two-Stage DiD estimator from Gardner (2022), matching the R `did2s` package. Stage 1 estimates unit+time fixed effects on untreated observations; Stage 2 regresses residualized outcomes on treatment indicators. Point estimates are identical to ImputationDiD; the key contribution is a GMM sandwich variance estimator (Newey & McFadden 1994) that accounts for first-stage estimation error.

- TwoStageDiD class with static, event study, and group aggregation
- Custom GMM sandwich variance (cannot reuse compute_robust_vcov)
- Multiplier bootstrap on GMM influence function
- 51 tests including equivalence tests with ImputationDiD
- Full documentation: README, REGISTRY, CLAUDE.md, ROADMAP

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-02-16T17:36:41Z

Overall assessment: ⚠️ Needs changes

Executive summary:

TwoStageDiD broadly matches the two‑stage procedure and global GMM variance described in the Methodology Registry, but a few registry‑required behaviors are missing.
Always‑treated units are excluded without listing unit IDs, which conflicts with the registry requirement.
Bootstrap inference fails in rank‑deficient/unidentified FE cases because NaN y_tilde isn’t cleaned before OLS, so inference is silently skipped.
balance_e is ignored when no cohort satisfies the balance requirement, producing unbalanced event‑study results despite the parameter.

Methodology

P1: The always-treated exclusion warning does not list affected unit IDs, as the Methodology Registry requires. Impact: registry mismatch, and users cannot identify excluded units. Fix: include a (possibly truncated) list of unit IDs in the warning, or update the registry text to match behavior. Ref: diff_diff/two_stage.py:L593-L608, docs/methodology/REGISTRY.md:L616-L617
P1: The bootstrap path does not apply the documented NaN y_tilde exclusion; solve_ols will error when rank-deficient FE create NaNs, causing bootstrap inference to be skipped. Impact: missing inference exactly in the edge cases the registry claims are handled. Fix: apply the same nan_mask handling as stage 2 (set NaN y_tilde to 0 and zero the corresponding rows in X_2) before solve_ols and score computation. Ref: diff_diff/two_stage.py:L1741-L1749, docs/methodology/REGISTRY.md:L618-L619
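
The nan_mask handling suggested in the second item can be illustrated as follows. This is a sketch only: the helper name `mask_nan_rows` is hypothetical, and `np.linalg.lstsq` stands in for the library's solve_ols, whose signature is not shown in this thread.

```python
import numpy as np

def mask_nan_rows(y_tilde, X_2):
    """Neutralize observations whose residualized outcome is NaN (sketch only)."""
    nan_mask = np.isnan(y_tilde)
    y_clean = np.where(nan_mask, 0.0, y_tilde)   # NaN outcome -> 0
    X_clean = X_2.copy()
    X_clean[nan_mask] = 0.0                      # matching design row -> 0
    return y_clean, X_clean, nan_mask

y_tilde = np.array([1.0, np.nan, 3.0, 2.0])
X_2 = np.array([[1.0], [1.0], [1.0], [0.0]])
y_clean, X_clean, nan_mask = mask_nan_rows(y_tilde, X_2)
beta, *_ = np.linalg.lstsq(X_clean, y_clean, rcond=None)
```

A zeroed row adds nothing to X'X or X'y, so the fit matches what dropping the row would give while array lengths stay aligned with the cluster index used for the scores.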

Code Quality

P1: balance_e is effectively ignored when no cohorts satisfy the required range (it falls back to np.ones(...)). Impact: event-study estimates are computed on unbalanced cohorts even when the user requests balancing, violating the parameter's contract. Fix: if balanced_cohorts is empty, return empty/NaN results or raise/warn; apply the same logic in bootstrap. Ref: diff_diff/two_stage.py:L1176-L1190, diff_diff/two_stage.py:L1805-L1818

Performance

P3: Per-column .toarray() loops for c_by_cluster densify large sparse matrices and scale poorly with many FEs. Impact: high memory/time on large panels. Fix: aggregate with sparse operations (e.g., use weighted_X10.tocoo() and np.add.at on .data with row indices) to avoid dense per-column materialization. Ref: diff_diff/two_stage.py:L1477-L1487

Maintainability

P3: GMM score construction is duplicated between _compute_gmm_variance and _compute_cluster_S_scores. Impact: higher drift risk if one path changes. Fix: factor shared gamma/score computation into a helper used by both. Ref: diff_diff/two_stage.py:L1373-L1589, diff_diff/two_stage.py:L1601-L1702

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

P2: Tests do not cover balance_e empty-cohort handling or bootstrap with rank-deficient/missing FE (NaN y_tilde). Impact: regressions in these edge cases likely go undetected. Fix: add targeted tests for both scenarios. Ref: tests/test_two_stage.py:L1-L955

Tests not run (review only).

- Always-treated warning now lists affected unit IDs (truncated at 10)
- Bootstrap handles NaN y_tilde: masks NaN obs in static, event study, and group bootstrap paths; returns None when all treated obs are NaN
- balance_e warns when no cohorts qualify instead of silently falling back
- Add 3 edge case tests and REGISTRY.md update

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber · 2026-02-16T18:20:56Z

/ai-review

github-actions · 2026-02-16T18:26:16Z

🔁 AI review rerun (requested by @igerber)

Head SHA: bc2eb77fb41838772feda30b8352ff2092779e42

Overall Assessment: ⚠️ Needs changes

Executive Summary

P1: Event-study and group aggregation can return horizons/cohorts with zero effective observations after balance_e or NaN y_tilde, yielding NaN effects but se=0.0, which conflicts with the Methodology Registry’s “unidentified horizons → NaN” requirement and can mislead inference. Ref: diff_diff/two_stage.py:L1169-L1296, diff_diff/two_stage.py:L1325-L1379
P2: Methodology Registry claims multiplier bootstrap supports Rademacher/Mammen/Webb weights, but implementation hardcodes Rademacher only. This is a method spec mismatch. Ref: docs/methodology/REGISTRY.md:L611-L613, diff_diff/two_stage.py:L1720-L1795
P2: Tests cover balance_e with no qualifying cohorts but not partial balance_e/NaN-horizon cases that produce empty columns; add targeted tests to prevent regressions. Ref: tests/test_two_stage.py:L756-L795

Methodology

Severity: P1 | Impact: When balance_e filtering (or NaN y_tilde from rank issues) leaves some horizons/cohorts with zero observations, those columns remain in X_2, leading to rank-deficient estimation where coefficients become NaN but se is computed as 0.0. This violates the Registry’s requirement that unidentified horizons produce NaN inference and can falsely signal “zero variance.” | Fix: After constructing X_2, compute n_obs per column; drop zero‑count columns before solve_ols and _compute_gmm_variance, then map results back with effect/se/t_stat/p_value/conf_int = NaN and n_obs=0. Alternatively recompute est_horizons after applying balance_mask and nan_mask. | Ref: diff_diff/two_stage.py:L1169-L1296, diff_diff/two_stage.py:L1325-L1379
Severity: P2 | Impact: The Methodology Registry states the multiplier bootstrap uses Rademacher/Mammen/Webb weights, but TwoStageDiD hardcodes Rademacher. This is a documented method deviation. | Fix: Add a weight_type parameter (plumbed into _generate_bootstrap_weights_batch) or update the Registry (and README, if needed) to specify Rademacher-only support. | Ref: docs/methodology/REGISTRY.md:L611-L613, diff_diff/two_stage.py:L1720-L1795
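
For reference, a Rademacher multiplier bootstrap on cluster-level influence-function rows might look like the sketch below. The function name `multiplier_bootstrap_se` and its parameters are hypothetical and are not the library's _generate_bootstrap_weights_batch. It assumes each row of `psi` is one cluster's contribution to the coefficient vector (bread times cluster score), and it reports a bootstrap standard deviation, which may differ from how the library summarizes its draws.

```python
import numpy as np

def multiplier_bootstrap_se(psi, n_boot=999, seed=0):
    """Bootstrap SEs from cluster influence-function rows psi of shape (G, k)."""
    rng = np.random.default_rng(seed)
    G = psi.shape[0]
    # Rademacher weights: +1 or -1 with equal probability, one per cluster.
    w = rng.choice([-1.0, 1.0], size=(n_boot, G))
    draws = w @ psi                  # (n_boot, k) perturbations around beta_hat
    return draws.std(axis=0, ddof=1)

# Illustrative influence rows for 50 clusters and 2 coefficients.
rng = np.random.default_rng(1)
psi = rng.normal(size=(50, 2)) / 50
se_boot = multiplier_bootstrap_se(psi)
se_analytic = np.sqrt((psi ** 2).sum(axis=0))   # sandwich SE implied by psi
```

Since the weights have mean zero and unit variance, the variance of each draw equals the sum of outer products of the influence rows, so the bootstrap and analytic standard errors should agree up to simulation error.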

Code Quality

No issues found.

Performance

No issues found.

Maintainability

No issues found.

Tech Debt

No issues found.

Security

No issues found.

Documentation/Tests

Severity: P2 | Impact: Missing tests for partial balance_e filtering or rank-deficiency scenarios where some horizons/cohorts have zero effective observations, which is the regime where NaN inference should be enforced. | Fix: Add tests that construct data where balance_e removes some (but not all) horizons and verify those horizons return effect/se/t_stat/p_value/CI = NaN and n_obs=0; add a similar test for group effects when all treated observations in a cohort have NaN y_tilde. | Ref: tests/test_two_stage.py:L756-L795

Address PR review round 2:

- [P1] Add n_obs==0 early check in _stage2_event_study and _stage2_group to produce NaN inference for zero-observation horizons/cohorts
- [P2] Fix REGISTRY.md bootstrap weight type (Rademacher only, not Rademacher/Mammen/Webb)
- [P2] Add tests for zero-observation event study horizons and group effects from NaN y_tilde filtering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
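
The zero-observation handling can be pictured with a small sketch. It is illustrative only: `ols_with_empty_columns` is a hypothetical name, `np.linalg.lstsq` stands in for solve_ols, and only point estimates are shown (in the PR, se, t_stat, p_value, and conf_int are also set to NaN for these horizons).

```python
import numpy as np

def ols_with_empty_columns(X_2, y_tilde):
    """Fit stage 2 on non-empty indicator columns; empty ones get NaN (sketch)."""
    n_obs = (X_2 != 0).sum(axis=0)          # observations behind each column
    keep = n_obs > 0
    effects = np.full(X_2.shape[1], np.nan)
    if keep.any():
        coef, *_ = np.linalg.lstsq(X_2[:, keep], y_tilde, rcond=None)
        effects[keep] = coef                # map estimates back to full layout
    return effects, n_obs

# Middle horizon has no observations after filtering (illustrative data).
X_2 = np.array([[1.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0],
                [0.0, 0.0, 0.0]])
y_tilde = np.array([1.0, 3.0, 5.0, 7.0])
effects, n_obs = ols_with_empty_columns(X_2, y_tilde)
```

Reporting NaN together with n_obs=0 keeps an empty horizon from being read as a precisely estimated zero.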

igerber · 2026-02-16T18:53:46Z

/ai-review

github-actions · 2026-02-16T19:01:54Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 81878683398675a39ec174d6d951e14f06474da6

Overall assessment: ⚠️ Needs changes

Executive Summary

P1 methodology gap: Event‑study handling for no never‑treated units doesn’t emit the required Proposition 5 warning and collapses identified‑but‑undefined horizons into “zero‑obs,” diverging from the Methodology Registry.
Tests are extensive but don’t cover the Proposition 5 warning/behavior mismatch.
Performance risk: cluster‑score aggregation densifies each column; could be a bottleneck on large FE designs.

Methodology

Severity: P1 | Impact: The Methodology Registry requires a Proposition‑5/no‑never‑treated warning and distinguishes “unidentified with observations” from “zero‑obs after filtering,” but _stage2_event_study currently drops NaN y_tilde rows and produces n_obs=0 with no warning, so users can’t tell identification failure vs empty filtering. | Concrete fix: Add a has_never_treated/h_bar check (as in ImputationDiD) to mark horizons h >= h_bar as NaN while preserving the treated‑obs count, and emit a warning listing affected horizons; mirror this in bootstrap handling. (diff_diff/two_stage.py:L1169-L1311, docs/methodology/REGISTRY.md:L615-L623)

Code Quality

No issues found.

Performance

Severity: P2 | Impact: Per‑column densification of X_10 and X_2 for cluster aggregation is O(p·n) with heavy memory churn; this can dominate runtime for large unit/time FE matrices. | Concrete fix: Replace per‑column getcol().toarray() loops with sparse matrix aggregation using a cluster indicator matrix or csr_matrix row grouping to compute X_10' * eps_10 and X_2' * eps_2 by cluster in bulk. (diff_diff/two_stage.py:L1514-L1532, diff_diff/two_stage.py:L1716-L1730)
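
One way to realize the cluster indicator matrix suggestion is sketched below with scipy.sparse. The function name `cluster_scores_sparse` is hypothetical and the variable names are illustrative; the final result is returned as a dense (G, p) array, which is an assumption that may need revisiting when both the cluster count and the number of columns are large.

```python
import numpy as np
from scipy import sparse

def cluster_scores_sparse(X, eps, cluster):
    """Return the (G, p) array whose g-th row is X_g' eps_g (sketch only)."""
    X = sparse.csr_matrix(X)
    _, g_idx = np.unique(cluster, return_inverse=True)
    n, G = X.shape[0], int(g_idx.max()) + 1
    # (G, n) indicator: row g has ones at the observations in cluster g.
    C = sparse.csr_matrix((np.ones(n), (g_idx, np.arange(n))), shape=(G, n))
    weighted = sparse.diags(eps) @ X     # scale each row of X by its residual
    return (C @ weighted).toarray()      # one sparse product, no column loop

X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0]])
eps = np.array([1.0, -1.0, 2.0, 0.5])
cluster = np.array([0, 0, 1, 1])
out = cluster_scores_sparse(X, eps, cluster)
```

The single sparse product replaces the per-column getcol().toarray() loop, so the work scales with the number of nonzeros rather than with p times n.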

Maintainability

Severity: P3 | Impact: _compute_gmm_variance and _compute_cluster_S_scores duplicate substantial GMM/cluster‑score logic; fixes will need to be applied in two places. | Concrete fix: Factor out shared steps (gamma_hat, c_g, S_g, bread) into a helper that both functions call. (diff_diff/two_stage.py:L1412-L1741)

Tech Debt

No issues found.

Security

No issues found.

Documentation/Tests

Severity: P2 | Impact: No test asserts the Proposition‑5 warning or the “identified‑but‑undefined” horizon behavior promised by the registry, so regressions in identification reporting can slip in. | Concrete fix: Add a test for no‑never‑treated cohorts that expects a warning and n_obs>0 with NaN inference for long‑run horizons. (tests/test_two_stage.py, docs/methodology/REGISTRY.md:L615-L623)
Tests not run (review only).

Suggested next steps

Add the Proposition‑5 warning + horizon handling and corresponding tests.

Distinguish Prop 5 horizons (treated obs exist, counterfactual unidentified) from zero-observation horizons. Prop 5 horizons get n_obs > 0 with NaN inference and a warning, matching ImputationDiD behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber · 2026-02-16T19:27:04Z

/ai-review

github-actions · 2026-02-16T19:38:52Z

🔁 AI review rerun (requested by @igerber)

Head SHA: e0286d2149b7280594ba3b085f04689dcfbfd592

Overall assessment: ✅ Looks good

Executive summary:

Methodology for TwoStageDiD matches docs/methodology/REGISTRY.md: untreated FE stage 1, residual regression stage 2, global GMM sandwich variance, and no finite-sample adjustment.
No P0/P1 methodology mismatches found against the registry and cited sources.
Minor test coverage gaps for rank-deficiency error handling and NaN propagation.
Small maintainability risk: duplicated GMM score logic with inconsistent NaN/overflow handling between analytic variance and bootstrap.

Methodology

No issues found. Methods affected: TwoStageDiD estimator, GMM sandwich variance, multiplier bootstrap (diff_diff/two_stage.py).

Code Quality

No issues found.

Performance

No issues found.

Maintainability

P3 | Impact: _compute_gmm_variance and _compute_cluster_S_scores duplicate gamma/cluster-score logic, but only the analytic path sanitizes NaN/Inf. This can create analytic-vs-bootstrap inconsistencies in rare rank-deficient/overflow cases. | Fix: factor out shared helper for gamma_hat, c_by_cluster, and s2_by_cluster, and apply the same nan_to_num handling in both paths (diff_diff/two_stage.py:1500-1784).

Tech Debt

No issues found.

Security

No issues found.

Documentation/Tests

P3 | Impact: test_rank_deficiency_error doesn’t exercise the rank-deficient error path; it only triggers “No treated observations,” leaving rank_deficient_action="error" untested. | Fix: build a dataset where some treated units have no untreated periods due to anticipation>0, while other units are never-treated, then expect a rank-condition ValueError (tests/test_two_stage.py:620-641).
P3 | Impact: test_nan_propagation is effectively a no-op; it doesn’t assert NaN propagation for SE/t/p/CI. | Fix: assert NaNs using a Prop. 5 setup (no never-treated, multiple cohorts) or a zero-observation horizon; alternatively remove the test if redundant with Prop. 5 checks (tests/test_two_stage.py:643-656).

Tests not run (review only).

igerber merged commit c7bd27c into main Feb 16, 2026
10 checks passed

igerber deleted the two-stage-did branch February 16, 2026 20:23

igerber mentioned this pull request Feb 16, 2026

Update TODO.md and ROADMAP.md for accuracy post-v2.4.0 #160

Merged

igerber merged 4 commits into main from two-stage-did
