
Add Two-Stage DiD estimator (Gardner 2022) #156

Merged
igerber merged 4 commits into main from two-stage-did on Feb 16, 2026

Conversation

@igerber
Owner

@igerber igerber commented Feb 16, 2026

Summary

  • Add TwoStageDiD estimator implementing Gardner (2022) two-stage differences-in-differences
  • Stage 1: estimate unit+time FE on untreated observations only; Stage 2: regress residualized outcomes on treatment indicators
  • Point estimates identical to ImputationDiD (verified to machine precision ~1e-16)
  • Custom GMM sandwich variance (Newey & McFadden 1994 Theorem 6.1) accounting for first-stage estimation error — cannot reuse compute_robust_vcov because correction term uses GLOBAL cross-moment
  • No finite-sample adjustments (raw asymptotic sandwich, matching R did2s)
  • Multiplier bootstrap on GMM influence function (consistent with library; R uses block bootstrap)
  • Static, event study, and group aggregation modes
  • 51 tests across 7 test classes including equivalence tests with ImputationDiD
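
For readers new to the method, the two stages described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the `TwoStageDiD` implementation: the function name `two_stage_did` and its arguments are hypothetical, it uses dense dummy matrices, covers only the static (single coefficient) case, and assumes every unit has at least one untreated period and every period has at least one untreated observation.

```python
import numpy as np


def two_stage_did(y, unit, time, treated):
    """Toy two-stage DiD; returns the static ATT point estimate."""
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=bool)

    # Dense unit and time dummies; the first time dummy is dropped so the
    # unit and time effects are not perfectly collinear.
    units, u_idx = np.unique(np.asarray(unit), return_inverse=True)
    times, t_idx = np.unique(np.asarray(time), return_inverse=True)
    d_unit = np.eye(len(units))[u_idx]
    d_time = np.eye(len(times))[t_idx][:, 1:]
    x1 = np.hstack([d_unit, d_time])

    # Stage 1: fit unit + time fixed effects on untreated observations only.
    gamma, *_ = np.linalg.lstsq(x1[~treated], y[~treated], rcond=None)

    # Residualize every observation with the stage-1 fit.
    y_tilde = y - x1 @ gamma

    # Stage 2: regress the residualized outcome on the treatment indicator
    # (no intercept), which is the mean of y_tilde over treated rows.
    d = treated.astype(float)
    return float(d @ y_tilde / (d @ d))
```

Because stage 2 here is a no-intercept regression on a single indicator, the estimate equals the average residual among treated observations, which is why the point estimate coincides with an imputation-style estimator.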

Methodology references (required if estimator / math changes)

  • Method name(s): Two-Stage DiD (Gardner 2022), GMM sandwich variance (Butts & Gardner 2022)
  • Paper / source link(s):
    • Gardner, J. (2022). "Two-stage differences in differences." arXiv:2207.05943. https://arxiv.org/abs/2207.05943
    • Butts, K., & Gardner, J. (2022). "did2s: Two-Stage Difference-in-Differences." The R Journal, 14(1), 162-173. https://doi.org/10.32614/RJ-2022-048
    • Newey, W.K., & McFadden, D. (1994). "Large Sample Estimation and Hypothesis Testing." Handbook of Econometrics, Vol. 4.
  • Reference implementation: R did2s::did2s() (Kyle Butts & John Gardner)
  • Any intentional deviations from the source (and why):
    • Uses multiplier bootstrap (library convention) instead of R's default block bootstrap — both asymptotically valid
    • Uses GLOBAL (X'_{10} X_{10})^{-1} in GMM variance (matching R source code), not per-cluster inverse from paper Eq. 6
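
To make the bootstrap deviation concrete, here is a rough sketch of a multiplier bootstrap with Rademacher weights applied to cluster-level influence-function contributions. The function name `multiplier_bootstrap_se` and its signature are hypothetical and are not the library's API; the sketch assumes the per-cluster contributions have already been computed and that they sum to approximately the estimation error.

```python
import numpy as np


def multiplier_bootstrap_se(cluster_scores, n_boot=999, seed=0):
    """Bootstrap SEs from cluster-level influence-function contributions.

    cluster_scores has shape (G,) or (G, k); row g is cluster g's contribution.
    """
    psi = np.asarray(cluster_scores, dtype=float)
    if psi.ndim == 1:
        psi = psi[:, None]
    rng = np.random.default_rng(seed)
    # Rademacher weights: +1 or -1 with equal probability, one per cluster.
    weights = rng.choice([-1.0, 1.0], size=(n_boot, psi.shape[0]))
    # Each row of `draws` is one perturbed estimation error, shape (n_boot, k).
    draws = weights @ psi
    return draws.std(axis=0, ddof=1)
```

Unlike a block bootstrap, nothing is re-estimated per draw: only the signs of the cluster contributions change, so the cost is one matrix product.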

Validation

  • Tests added/updated:
    • tests/test_two_stage.py — 51 tests in 7 classes (basic, equivalence, variance, edge cases, parameters, bootstrap, convenience)
    • Key validation: point estimates match ImputationDiD within ~1e-16; GMM SEs differ from conservative SEs (ratio ~1.27)
  • Full test suite: 1258 passed, 0 failures (31 pre-existing skips)
  • Backtest / simulation / notebook evidence: smoke test with generate_staggered_data(n_units=200, n_periods=10) produces expected results

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Implement the Two-Stage DiD estimator from Gardner (2022), matching the
R `did2s` package. Stage 1 estimates unit+time fixed effects on untreated
observations; Stage 2 regresses residualized outcomes on treatment
indicators. Point estimates are identical to ImputationDiD; the key
contribution is a GMM sandwich variance estimator (Newey & McFadden 1994)
that accounts for first-stage estimation error.

- TwoStageDiD class with static, event study, and group aggregation
- Custom GMM sandwich variance (cannot reuse compute_robust_vcov)
- Multiplier bootstrap on GMM influence function
- 51 tests including equivalence tests with ImputationDiD
- Full documentation: README, REGISTRY, CLAUDE.md, ROADMAP

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Overall assessment: ⚠️ Needs changes

Executive summary:

  • TwoStageDiD broadly matches the two‑stage procedure and global GMM variance described in the Methodology Registry, but a few registry‑required behaviors are missing.
  • Always‑treated units are excluded without listing unit IDs, which conflicts with the registry requirement.
  • Bootstrap inference fails in rank‑deficient/unidentified FE cases because NaN y_tilde isn’t cleaned before OLS, so inference is silently skipped.
  • balance_e is ignored when no cohort satisfies the balance requirement, producing unbalanced event‑study results despite the parameter.

Methodology

  • P1 — Always‑treated exclusion warning does not list affected unit IDs as required by the Methodology Registry. Impact: registry mismatch and users cannot identify excluded units. Fix: include a (possibly truncated) list of unit IDs in the warning or update the registry text to match behavior. diff_diff/two_stage.py:L593-L608 docs/methodology/REGISTRY.md:L616-L617
  • P1 — Bootstrap path does not apply the documented NaN y_tilde exclusion; solve_ols will error when rank‑deficient FE create NaNs, causing bootstrap inference to be skipped. Impact: missing inference exactly in edge cases the registry claims are handled. Fix: apply the same nan_mask handling as stage‑2 (set NaN y_tilde to 0 and zero corresponding rows in X_2) before solve_ols and score computation. diff_diff/two_stage.py:L1741-L1749 docs/methodology/REGISTRY.md:L618-L619
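
A sketch of the NaN handling suggested in the second finding, assuming dense arrays. The helper name `mask_nan_residuals` is hypothetical; only `y_tilde`, `X_2`, and `nan_mask` come from the review text.

```python
import numpy as np


def mask_nan_residuals(y_tilde, X_2):
    """Neutralize rows whose residualized outcome is NaN before OLS."""
    y_tilde = np.asarray(y_tilde, dtype=float)
    X_2 = np.asarray(X_2, dtype=float)
    nan_mask = np.isnan(y_tilde)
    # Set NaN outcomes to 0 and zero the matching design rows, so these
    # observations contribute nothing to X'X or X'y.
    y_clean = np.where(nan_mask, 0.0, y_tilde)
    X_clean = X_2.copy()
    X_clean[nan_mask] = 0.0
    return y_clean, X_clean, nan_mask
```

Zeroing rows instead of dropping them keeps array shapes aligned with the cluster index, which matters when the same rows are reused for score computation.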

Code Quality

  • P1 — balance_e is effectively ignored when no cohorts satisfy the required range (falls back to np.ones(...)). Impact: event‑study estimates are computed on unbalanced cohorts even when the user requests balancing, violating the parameter’s contract. Fix: if balanced_cohorts is empty, return empty/NaN results or raise/warn; apply the same logic in bootstrap. diff_diff/two_stage.py:L1176-L1190 diff_diff/two_stage.py:L1805-L1818
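
One possible shape for the fix, as a sketch only. The helper `select_balanced_cohorts` is hypothetical, and the qualification rule used here (a cohort first treated in period `g` qualifies when `g + balance_e` does not exceed the last observed period) is an assumption about what `balance_e` means, not a statement of the library's rule.

```python
import warnings


def select_balanced_cohorts(cohorts, last_period, balance_e):
    """Return cohorts observed through event time balance_e; warn if none."""
    # Assumed rule: cohort g qualifies if g + balance_e <= last_period.
    balanced = [g for g in cohorts if g + balance_e <= last_period]
    if not balanced:
        warnings.warn(
            f"No cohort is observed through event time {balance_e}; "
            "returning no balanced cohorts instead of falling back to all.",
            UserWarning,
            stacklevel=2,
        )
    return balanced
```

The caller can then treat an empty list as "no estimable horizons" and return NaN results, instead of silently estimating on unbalanced cohorts.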

Performance

  • P3 — Per‑column .toarray() loops for c_by_cluster densify large sparse matrices and scale poorly with many FEs. Impact: high memory/time on large panels. Fix: aggregate with sparse operations (e.g., use weighted_X10.tocoo() and np.add.at on .data with row indices) to avoid dense per‑column materialization. diff_diff/two_stage.py:L1477-L1487
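
The `np.add.at` idea can be sketched with plain COO triplets. In SciPy these would come from the `.row`, `.col`, and `.data` attributes of a matrix converted with `.tocoo()`; the function name `cluster_sums_from_coo` and the integer cluster coding are hypothetical.

```python
import numpy as np


def cluster_sums_from_coo(row, col, data, cluster_of_row, n_clusters, n_cols):
    """Sum the rows of a COO-format sparse matrix within clusters."""
    row = np.asarray(row)
    col = np.asarray(col)
    data = np.asarray(data, dtype=float)
    cluster_of_row = np.asarray(cluster_of_row)
    out = np.zeros((n_clusters, n_cols))
    # Each stored entry (row[i], col[i]) adds data[i] to the column total of
    # the cluster that owns row[i]; np.add.at accumulates repeated indices.
    np.add.at(out, (cluster_of_row[row], col), data)
    return out
```

Only the stored nonzeros are touched, so the work scales with the number of nonzeros, not with rows times columns. The output is still a dense clusters-by-columns array.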

Maintainability

  • P3 — GMM score construction is duplicated between _compute_gmm_variance and _compute_cluster_S_scores. Impact: higher drift risk if one path changes. Fix: factor shared gamma/score computation into a helper used by both. diff_diff/two_stage.py:L1373-L1589 diff_diff/two_stage.py:L1601-L1702

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

  • P2 — Tests do not cover balance_e empty‑cohort handling or bootstrap with rank‑deficient/missing FE (NaN y_tilde). Impact: regressions in these edge cases likely go undetected. Fix: add targeted tests for both scenarios. tests/test_two_stage.py:L1-L955

Tests not run (review only).

- Always-treated warning now lists affected unit IDs (truncated at 10)
- Bootstrap handles NaN y_tilde: masks NaN obs in static, event study,
  and group bootstrap paths; returns None when all treated obs are NaN
- balance_e warns when no cohorts qualify instead of silently falling back
- Add 3 edge case tests and REGISTRY.md update

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bc2eb77fb41838772feda30b8352ff2092779e42


Overall Assessment: ⚠️ Needs changes

Executive Summary

  • P1: Event-study and group aggregation can return horizons/cohorts with zero effective observations after balance_e or NaN y_tilde, yielding NaN effects but se=0.0, which conflicts with the Methodology Registry’s “unidentified horizons → NaN” requirement and can mislead inference. Ref: diff_diff/two_stage.py:L1169-L1296, diff_diff/two_stage.py:L1325-L1379
  • P2: Methodology Registry claims multiplier bootstrap supports Rademacher/Mammen/Webb weights, but implementation hardcodes Rademacher only. This is a method spec mismatch. Ref: docs/methodology/REGISTRY.md:L611-L613, diff_diff/two_stage.py:L1720-L1795
  • P2: Tests cover balance_e with no qualifying cohorts but not partial balance_e/NaN-horizon cases that produce empty columns; add targeted tests to prevent regressions. Ref: tests/test_two_stage.py:L756-L795

Methodology

  • Severity: P1 | Impact: When balance_e filtering (or NaN y_tilde from rank issues) leaves some horizons/cohorts with zero observations, those columns remain in X_2, leading to rank-deficient estimation where coefficients become NaN but se is computed as 0.0. This violates the Registry’s requirement that unidentified horizons produce NaN inference and can falsely signal “zero variance.” | Fix: After constructing X_2, compute n_obs per column; drop zero‑count columns before solve_ols and _compute_gmm_variance, then map results back with effect/se/t_stat/p_value/conf_int = NaN and n_obs=0. Alternatively recompute est_horizons after applying balance_mask and nan_mask. | Ref: diff_diff/two_stage.py:L1169-L1296, diff_diff/two_stage.py:L1325-L1379
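
A sketch of the drop-then-map-back approach from the P1 fix, shown for point estimates only (standard errors, t-statistics, p-values, and confidence intervals would be filled with NaN the same way). The function name `ols_with_empty_columns` is hypothetical, and the design matrix is assumed dense.

```python
import numpy as np


def ols_with_empty_columns(X_2, y_tilde):
    """OLS that reports NaN for columns with zero effective observations."""
    X_2 = np.asarray(X_2, dtype=float)
    y_tilde = np.asarray(y_tilde, dtype=float)
    # Effective observation count per column (nonzero entries).
    n_obs = (X_2 != 0).sum(axis=0)
    keep = n_obs > 0
    coef = np.full(X_2.shape[1], np.nan)
    if keep.any():
        # Estimate only on identified columns, then map back into place.
        beta, *_ = np.linalg.lstsq(X_2[:, keep], y_tilde, rcond=None)
        coef[keep] = beta
    return coef, n_obs
```

Dropping the empty columns before solving avoids the rank-deficient fit that produced NaN coefficients paired with a standard error of 0.0.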
  • Severity: P2 | Impact: The Methodology Registry states the multiplier bootstrap uses Rademacher/Mammen/Webb weights, but TwoStageDiD hardcodes Rademacher. This is a documented method deviation. | Fix: Add a weight_type parameter (plumbed into _generate_bootstrap_weights_batch) or update the Registry (and README, if needed) to specify Rademacher-only support. | Ref: docs/methodology/REGISTRY.md:L611-L613, diff_diff/two_stage.py:L1720-L1795

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • Severity: P2 | Impact: Missing tests for partial balance_e filtering or rank-deficiency scenarios where some horizons/cohorts have zero effective observations, which is the regime where NaN inference should be enforced. | Fix: Add tests that construct data where balance_e removes some (but not all) horizons and verify those horizons return effect/se/t_stat/p_value/CI = NaN and n_obs=0; add a similar test for group effects when all treated observations in a cohort have NaN y_tilde. | Ref: tests/test_two_stage.py:L756-L795

Address PR review round 2:
- [P1] Add n_obs==0 early check in _stage2_event_study and _stage2_group
  to produce NaN inference for zero-observation horizons/cohorts
- [P2] Fix REGISTRY.md bootstrap weight type (Rademacher only, not
  Rademacher/Mammen/Webb)
- [P2] Add tests for zero-observation event study horizons and group
  effects from NaN y_tilde filtering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 81878683398675a39ec174d6d951e14f06474da6


Overall assessment: ⚠️ Needs changes

Executive Summary

  • P1 methodology gap: Event‑study handling for no never‑treated units doesn’t emit the required Proposition 5 warning and collapses identified‑but‑undefined horizons into “zero‑obs,” diverging from the Methodology Registry.
  • Tests are extensive but don’t cover the Proposition 5 warning/behavior mismatch.
  • Performance risk: cluster‑score aggregation densifies each column; could be a bottleneck on large FE designs.

Methodology

  • Severity: P1 | Impact: The Methodology Registry requires a Proposition‑5/no‑never‑treated warning and distinguishes “unidentified with observations” from “zero‑obs after filtering,” but _stage2_event_study currently drops NaN y_tilde rows and produces n_obs=0 with no warning, so users can’t tell identification failure vs empty filtering. | Concrete fix: Add a has_never_treated/h_bar check (as in ImputationDiD) to mark horizons h >= h_bar as NaN while preserving the treated‑obs count, and emit a warning listing affected horizons; mirror this in bootstrap handling. (diff_diff/two_stage.py:L1169-L1311, docs/methodology/REGISTRY.md:L615-L623)
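
A rough sketch of the horizon check. The helper `unidentified_horizons` is hypothetical, and it assumes `h_bar` is defined as the gap between the latest and earliest treatment cohorts, which is how the no-never-treated case is usually stated; the actual definition used by `ImputationDiD` should be confirmed against its source.

```python
import warnings


def unidentified_horizons(cohorts, horizons, has_never_treated):
    """List event-study horizons at or beyond h_bar when no unit is never treated."""
    cohorts = set(cohorts)
    if has_never_treated or not cohorts:
        return []
    # Assumed definition: once the last cohort is treated, no untreated
    # observations remain to pin down later period effects.
    h_bar = max(cohorts) - min(cohorts)
    affected = sorted(h for h in horizons if h >= h_bar)
    if affected:
        warnings.warn(
            f"No never-treated units: horizons {affected} (h >= {h_bar}) are "
            "not identified and will be reported as NaN.",
            UserWarning,
            stacklevel=2,
        )
    return affected
```

Horizons returned here would keep their treated-observation count (`n_obs > 0`) while reporting NaN inference, which separates them from horizons that are empty after filtering.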

Code Quality

  • No issues found.

Performance

  • Severity: P2 | Impact: Per‑column densification of X_10 and X_2 for cluster aggregation is O(p·n) with heavy memory churn; this can dominate runtime for large unit/time FE matrices. | Concrete fix: Replace per‑column getcol().toarray() loops with sparse matrix aggregation using a cluster indicator matrix or csr_matrix row grouping to compute X_10' * eps_10 and X_2' * eps_2 by cluster in bulk. (diff_diff/two_stage.py:L1514-L1532, diff_diff/two_stage.py:L1716-L1730)

Maintainability

  • Severity: P3 | Impact: _compute_gmm_variance and _compute_cluster_S_scores duplicate substantial GMM/cluster‑score logic; fixes will need to be applied in two places. | Concrete fix: Factor out shared steps (gamma_hat, c_g, S_g, bread) into a helper that both functions call. (diff_diff/two_stage.py:L1412-L1741)

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • Severity: P2 | Impact: No test asserts the Proposition‑5 warning or the “identified‑but‑undefined” horizon behavior promised by the registry, so regressions in identification reporting can slip in. | Concrete fix: Add a test for no‑never‑treated cohorts that expects a warning and n_obs>0 with NaN inference for long‑run horizons. (tests/test_two_stage.py, docs/methodology/REGISTRY.md:L615-L623)
  • Tests not run (review only).

Suggested next steps

  1. Add the Proposition‑5 warning + horizon handling and corresponding tests.

Distinguish Prop 5 horizons (treated obs exist, counterfactual unidentified)
from zero-observation horizons. Prop 5 horizons get n_obs > 0 with NaN
inference and a warning, matching ImputationDiD behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: e0286d2149b7280594ba3b085f04689dcfbfd592


Overall assessment: ✅ Looks good

Executive summary:

  • Methodology for TwoStageDiD matches docs/methodology/REGISTRY.md: untreated FE stage 1, residual regression stage 2, global GMM sandwich variance, and no finite-sample adjustment.
  • No P0/P1 methodology mismatches found against the registry and cited sources.
  • Minor test coverage gaps for rank-deficiency error handling and NaN propagation.
  • Small maintainability risk: duplicated GMM score logic with inconsistent NaN/overflow handling between analytic variance and bootstrap.

Methodology

  • No issues found. Methods affected: TwoStageDiD estimator, GMM sandwich variance, multiplier bootstrap (diff_diff/two_stage.py).

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • P3 | Impact: _compute_gmm_variance and _compute_cluster_S_scores duplicate gamma/cluster-score logic, but only the analytic path sanitizes NaN/Inf. This can create analytic-vs-bootstrap inconsistencies in rare rank-deficient/overflow cases. | Fix: factor out shared helper for gamma_hat, c_by_cluster, and s2_by_cluster, and apply the same nan_to_num handling in both paths (diff_diff/two_stage.py:1500-1784).

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P3 | Impact: test_rank_deficiency_error doesn’t exercise the rank-deficient error path; it only triggers “No treated observations,” leaving rank_deficient_action="error" untested. | Fix: build a dataset where some treated units have no untreated periods due to anticipation>0, while other units are never-treated, then expect a rank-condition ValueError (tests/test_two_stage.py:620-641).
  • P3 | Impact: test_nan_propagation is effectively a no-op; it doesn’t assert NaN propagation for SE/t/p/CI. | Fix: assert NaNs using a Prop. 5 setup (no never-treated, multiple cohorts) or a zero-observation horizon; alternatively remove the test if redundant with Prop. 5 checks (tests/test_two_stage.py:643-656).

Tests not run (review only).

@igerber igerber merged commit c7bd27c into main Feb 16, 2026
10 checks passed
@igerber igerber deleted the two-stage-did branch February 16, 2026 20:23