Skip to content

Optimize CallawaySantAnna covariate path (5-19x speedup)#194

Merged
igerber merged 9 commits intomainfrom
speed-review
Mar 15, 2026
Merged

Optimize CallawaySantAnna covariate path (5-19x speedup)#194
igerber merged 9 commits intomainfrom
speed-review

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Mar 8, 2026

Summary

  • Cache Cholesky factorizations of X^T X across (g,t) pairs sharing the same control design matrix, eliminating redundant OLS solves in the covariate regression path
  • Cache propensity score logistic regression coefficients for IPW/DR methods, avoiding redundant BFGS optimization
  • Add _compute_all_att_gt_covariate_reg() batched method for the reg+covariates path, mirroring the existing _compute_all_att_gt_vectorized() pattern
  • Extend benchmark and validation suites with 10-covariate, DR, and IPW scenarios

Speedups at 5K units:

Scenario Before After Speedup
2 covariates, reg 0.158s 0.028s 5.6x
10 covariates, reg 0.535s 0.042s 12.7x
2 covariates, DR 0.373s 0.059s 6.3x
10 cov, reg, 50K units 4.430s 0.230s 19.3x

Methodology references (required if estimator / math changes)

  • Method name(s): Callaway & Sant'Anna (2021) ATT(g,t) estimation
  • Paper / source link(s): Callaway, B., & Sant'Anna, P. H. (2021). Difference-in-Differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.
  • Any intentional deviations from the source (and why): None — Cholesky factorization caching is a numerical optimization that produces identical results to lstsq (validated to 1e-10 tolerance). Falls back to lstsq on singular/ill-conditioned matrices.

Validation

  • Tests added/updated: benchmarks/speed_review/bench_callaway.py, benchmarks/speed_review/validate_results.py
  • 96/96 staggered tests pass, 1489/1489 full suite tests pass
  • 5-scenario numerical validation (reg/DR/IPW × 0/2/10 covariates) matches baselines saved before optimization

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 8, 2026

Overall assessment: ⛔ Blocker

Executive summary

  • This PR affects CallawaySantAnna’s covariate-adjusted ATT(g,t) paths, especially estimation_method="dr" and estimation_method="reg".
  • P0: the new DR Cholesky cache key is too coarse for control_group="not_yet_treated", so cached outcome regressions can be reused across different cohort-specific control sets.
  • P1: rank-deficient covariate handling no longer matches the Methodology Registry / public rank_deficient_action contract; the fast path skips QR rank checks and bypasses solve_ols.
  • The new benchmark validator would not catch either issue: it never exercises control_group="not_yet_treated" or rank-deficient covariates, and it barely checks inference outputs.
  • No direct security issues stand out.

Methodology

  • Severity: P0 | Impact: The DR cache breaks the documented not_yet_treated control definition. cho_key is set to (base_period_val, t) for all non-never_treated cases, but under control_group="not_yet_treated" the control design also depends on cohort g because cohort g must be excluded from controls for each ATT(g,t). That means the cached Cholesky factor can be reused across different cohorts that share the same (base, t) under the default base_period="varying", giving the wrong outcome-regression fit for at least the lead / pre-treatment comparisons. This violates the registry’s control-mask rule and the paper’s per-ATT(g,t) group-time setup. docs/methodology/REGISTRY.md:395, diff_diff/staggered.py:315, diff_diff/staggered.py:567, diff_diff/staggered.py:1137, diff_diff/staggered.py:1529. The source paper explicitly frames this estimator around group-time ATT(g,t) objects with OR/IPW/DR estimation, so cache reuse is only valid when the comparison design is unchanged. citeturn2view0
    Concrete fix: include g in cho_key whenever control membership depends on cohort (at minimum all control_group="not_yet_treated" cases), or disable DR Cholesky caching for that mode; add a regression test comparing cached vs uncached DR outputs for control_group="not_yet_treated".

  • Severity: P1 | Impact: Rank-deficient handling no longer matches the documented methodology/default behavior. The registry and public docstring say rank_deficient_action="warn" should detect collinearity via pivoted QR and return R-style dropped-column / NA behavior, but _linear_regression() now sets skip_rank_check=True for every non-"error" case, and the new covariate-reg fast path bypasses solve_ols entirely via Cholesky / raw lstsq. That silently changes estimator behavior for collinear or near-collinear covariates instead of honoring the documented contract. docs/methodology/REGISTRY.md:355, diff_diff/staggered.py:210, diff_diff/staggered.py:138, diff_diff/staggered.py:933, diff_diff/linalg.py:384.
    Concrete fix: preserve QR/SVD rank detection for warn and silent, and only take the Cholesky fast path after a verified full-rank check at the same 1e-07 tolerance; otherwise fall back to solve_ols with the requested rank_deficient_action.

Code Quality

  • Severity: P2 | Impact: _compute_all_att_gt_covariate_reg() silently drops problematic (g,t) cells with continue when beta, predictions, or residuals become non-finite. That mutates the set of estimated cells and therefore downstream aggregation weights instead of surfacing an explicit warning / NaN result. diff_diff/staggered.py:951, diff_diff/staggered.py:964, diff_diff/staggered.py:1180.
    Concrete fix: preserve the cell with effect=np.nan, se=np.nan, and NaN inference fields (plus a warning), or fall back to the scalar path rather than silently removing it.

Performance

  • Severity: P2 | Impact: The speedup currently relies on unchecked invariants rather than proving them. The bad DR cache key and the skipped rank checks mean part of the benchmark win comes from dropping correctness safeguards, so the performance numbers are not decision-useful yet for all supported modes. diff_diff/staggered.py:138, diff_diff/staggered.py:567, diff_diff/staggered.py:933.
    Concrete fix: keep the optimization, but gate cache reuse on explicit design-equivalence checks and preserve the old solver semantics before advertising the speed path as methodology-preserving.

Maintainability

  • Severity: P2 | Impact: The new _compute_all_att_gt_covariate_reg() duplicates large parts of _outcome_regression() / _linear_regression() instead of sharing a single cached-solver abstraction, and it has already drifted on rank handling and failure behavior. diff_diff/staggered.py:785, diff_diff/staggered.py:1324.
    Concrete fix: factor a shared OLS helper that accepts an optional cached factorization but still centralizes rank checks, fallbacks, warnings, and influence-function construction.

Tech Debt

  • Severity: P2 | Impact: The migration from unit IDs to index arrays is incomplete. Producers now emit treated_idx / control_idx, but fallback branches still reference treated_units / control_units, leaving two incompatible internal schemas. diff_diff/staggered.py:598, diff_diff/staggered_aggregation.py:210, diff_diff/staggered_bootstrap.py:180.
    Concrete fix: finish the migration in the fallback branches, or explicitly require precomputed in those paths so dead compatibility code cannot silently rot.

Security

  • No material findings.

Documentation/Tests

  • Severity: P2 | Impact: The added validation is benchmark-oriented, not methodology-oriented. It only covers five scenarios, all with the default control-group path; it never exercises control_group="not_yet_treated" or rank-deficient covariates, and it does not compare overall_p_value, overall_ci, event-study outputs, or group aggregations. The 0.01 SE tolerance is also loose enough to miss meaningful inference drift. benchmarks/speed_review/validate_results.py:99, benchmarks/speed_review/validate_results.py:144, benchmarks/speed_review/validate_results.py:169.
    Concrete fix: add real regression tests for not_yet_treated + DR + covariates, rank_deficient_action parity (warn / silent / error), and NaN/Inf bootstrap edge cases; extend snapshot checks to all inference fields and all aggregation objects.

igerber added a commit that referenced this pull request Mar 8, 2026
Address PR #194 review feedback:
- Include cohort g in cho_key and control_key for not_yet_treated
  control group, preventing stale Cholesky/matrix reuse across groups
- Add UserWarning when group-time cells are dropped due to non-finite
  regression results in covariate reg path
- Add not_yet_treated validation scenarios (dr_2cov_nyt, reg_2cov_nyt)
- Add regression test for DR/reg + covariates + not_yet_treated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 8, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 01688aeb711828128409c4bb7f92177e90ea6142


Overall assessment: ⚠️ Needs changes

Executive summary

  • Affected methods: CallawaySantAnna ATT(g,t) estimation in the covariate-adjusted reg path, the DR first-stage outcome regression, and the influence-function bootstrap/aggregation plumbing.
  • I do not see a direct estimand-level conflict with Callaway-Sant’Anna or the official did::att_gt() options themselves; the problem is that the optimization changes documented edge-case behavior around those methods, especially rank-deficient covariate designs. citeturn1search1turn1search3
  • P1: the new default covariate path bypasses the library’s required pivoted-QR rank-deficiency handling, so collinear covariates can now silently change ATT(g,t) behavior instead of warning and dropping redundant columns.
  • P1: the migration from unit-id IF payloads to index arrays is incomplete; fallback consumers still read removed keys and can now fail if they are reached.
  • P2: the new validation coverage is too weak for the changed code paths; it does not check most of the new aggregation/bootstrap/inference outputs and misses the default rank_deficient_action="warn" mode.

Methodology
Affected methods are the covariate-adjusted Callaway-Sant’Anna paths (reg, dr, ipw) and the control_group / base_period variants exposed by did::att_gt(). The paper/docs support those options; the issue here is not the caching idea, but the implementation departing from the registry’s required handling for rank-deficient covariate designs. citeturn1search1turn1search3

  • Severity: P1. Impact: the optimized covariate reg branch selected in diff_diff/staggered.py#L1136-L1143 bypasses solve_ols() and uses raw Cholesky / scipy.linalg.lstsq in diff_diff/staggered.py#L934-L955, while the shared helper _linear_regression() now sets skip_rank_check=True for every non-"error" mode in diff_diff/staggered.py#L138-L147. That contradicts the documented contract in docs/methodology/REGISTRY.md#L355-L358 and diff_diff/linalg.py#L384-L389: default rank_deficient_action="warn" is supposed to detect collinearity via pivoted QR and apply R-style NA/warning handling, not silently return minimum-norm or numerically unstable coefficients. The fast path can also drop whole ATT(g,t) cells on non-finite regression output in diff_diff/staggered.py#L957-L1013, which changes downstream aggregation weights. Concrete fix: do not skip rank checks in warn/silent modes; only cache/reuse factorizations after a one-time full-rank confirmation for that control design, and otherwise fall back to the existing solve_ols() / _outcome_regression() path.

Code Quality

  • No additional findings beyond the methodology issue above.

Performance

  • No blocking performance findings. The vectorization/caching direction is reasonable once the rank-deficiency safeguards are restored.

Maintainability

Tech Debt

  • No additional findings.

Security

  • No findings in the diff reviewed. I did not see secrets or obvious security-sensitive changes.

Documentation/Tests

  • Severity: P2. Impact: the new validation harness records overall_p_value, overall_ci, and event-study effects in benchmarks/speed_review/validate_results.py#L70-L95, but check_results() ignores them and only compares overall ATT/SE plus per-cell effect/SE, with a very loose 0.01 absolute tolerance for group-time SEs in benchmarks/speed_review/validate_results.py#L149-L175. Existing collinearity tests in tests/test_staggered.py#L928-L980 cover only "error" and "silent", not the default "warn" mode that this PR changes. Concrete fix: add explicit warn-mode regression tests for reg and dr with duplicated/near-collinear covariates, and extend validation to compare event-study/group aggregates plus bootstrap p-values/CIs/cband outputs with tighter tolerances.

Verification note: I could not execute the test suite in this sandbox because the available Python environment does not have numpy or pytest installed.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 8, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 27bcc9f51ee0bc9485fd993f149f508249e25704


Overall assessment: ⚠️ Needs changes

Executive Summary

  • The PR changes the Callaway-Sant’Anna ATT(g,t) covariate-adjusted reg and dr paths, so the main review bar is parity with the registry’s documented estimator/inference contract.
  • The new batched reg path detects QR rank deficiency but then solves the full matrix with bare scipy.linalg.lstsq, so it no longer follows the documented R-style dropped-column behavior for near-collinear designs.
  • The new cached dr path tries Cholesky before any QR/tolerance check, so rank_deficient_action can be bypassed entirely when the design is numerically near-singular but still Cholesky-factorable.
  • The new batch bootstrap helper preserves NaN outputs, but the added validation/tests do not cover the new IPW not_yet_treated cache path, DR near-rank-deficiency, or degenerate per-effect bootstrap warning behavior.
  • Review is diff-based; I did not rerun the suite in this environment.

Methodology
Affected method: CallawaySantAnna ATT(g,t) with covariate-adjusted reg and dr, cross-checked against docs/methodology/REGISTRY.md:291 and the rank-deficiency contract at docs/methodology/REGISTRY.md:355.

  • Severity: P1. Impact: the optimized reg path warns on QR-detected rank deficiency, but then falls back to full-matrix scipy_linalg.lstsq rather than the shared QR-reduced solve. That is a methodological mismatch with the registry / shared linear algebra contract, which says rank-deficient designs are handled by dropping dependent columns at tolerance 1e-07 in R-style fashion. Near-collinear covariates can therefore change ATT(g,t) and SEs relative to the documented implementation, even though the warning implies the columns were dropped. Concrete fix: when _detect_rank_deficiency() flags dropped columns, solve the reduced kept_cols system (or delegate to solve_ols / _linear_regression with precomputed rank info) instead of calling bare lstsq on the full design. References: diff_diff/staggered.py:869, diff_diff/staggered.py:972, diff_diff/linalg.py:655.

  • Severity: P1. Impact: the new cached dr outcome-regression path attempts Cholesky before any QR-based rank check, and only falls back to _linear_regression() if Cholesky fails numerically. For designs that are rank-deficient under the library’s documented 1e-07 tolerance but still positive-definite enough for Cholesky, rank_deficient_action is never honored, warnings are skipped, and the optimized DR estimator silently deviates from the registry contract. Concrete fix: gate the cached Cholesky path behind the same rank-deficiency detection used by the shared OLS path, and only cache/use Cholesky when that check passes. References: diff_diff/staggered.py:1566, diff_diff/staggered.py:1586, docs/methodology/REGISTRY.md:355.

Code Quality
No separate findings beyond the solver-path drift already called out under Methodology.

Performance
No findings. The caching/vectorization direction looks reasonable once correctness is brought back in line with the shared solver behavior.

Maintainability
No separate findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

  • Severity: P2. Impact: the new high-risk paths are still under-tested. The added validation matrix covers dr_2cov_nyt and reg_2cov_nyt, but not the new IPW not_yet_treated cache path, and the new rank-deficiency tests only exercise exact duplicate columns on the reg fast path. There is also no test around the batched bootstrap helper’s degenerate / <50% valid per-effect warning behavior. Concrete fix: add ipw + covariates + not_yet_treated, DR rank_deficient_action in {"warn","silent"} with near-collinear (not exact-duplicate) covariates, and per-effect bootstrap tests asserting both NaN outputs and warning emission for invalid distributions. References: benchmarks/speed_review/validate_results.py:100, tests/test_staggered.py:888, tests/test_staggered.py:983, diff_diff/bootstrap_utils.py:327.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 8, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 2cf0973f48a2814c2d6808fac0e529f62daecc49


Overall assessment: ⚠️ Needs changes

Executive Summary

  • Affected method: CallawaySantAnna ATT(g,t) estimation on the covariate regression path, plus the multiplier-bootstrap inference helpers. The cache keys themselves are consistent with the registry’s base-period, anticipation, and control-group rules.
  • P1: the optimized reg + covariates path no longer preserves the documented per-cell rank-deficiency / invalid-cell behavior; on changing control samples it can silently solve under a different rule or omit ATT(g,t) cells entirely, which changes downstream aggregation weights.
  • P1: fit(..., covariates=[]) now routes into the covariate fast path and will dereference covariate_by_period=None instead of behaving like the no-covariate estimator.
  • P2: the new batched bootstrap stats helper drops the registry-mandated warning path when per-effect bootstrap validity falls below 50%.
  • P2: tests and the new validator do not cover the new estimation_method="reg" + covariates + n_bootstrap>0 path or several changed inference outputs.
  • Static review only: the sandbox Python here does not have numpy / pandas / scipy, so I could not execute the suite.

Methodology
Cross-check target: Callaway & Sant’Anna (2021) ATT(g,t) plus the registry’s edge-case requirements in docs/methodology/REGISTRY.md:L293-L364.

  • Severity P1. Impact: the new batched reg + covariates path checks rank only on the cached base control design, then later (g,t)-specific control designs can go through raw Cholesky / scipy_linalg.lstsq, and non-finite solves are dropped with continue. On unbalanced panels or control_group="not_yet_treated", that means the default rank_deficient_action="warn" / R-style NA behavior is no longer guaranteed per ATT(g,t), and failed cells can disappear from simple/event-study/group aggregation instead of surfacing as NaN. That is a methodological deviation from the documented ATT(g,t) handling and from the scalar regression path. Concrete fix: only reuse Cholesky for designs proven identical and full-rank; for every pair-specific pair_X_ctrl, rerun _detect_rank_deficiency() or call _linear_regression / solve_ols, and if the solve is invalid keep the (g,t) cell with NaN inference instead of omitting it. diff_diff/staggered.py:L869-L1057, diff_diff/staggered.py:L1372-L1437, docs/methodology/REGISTRY.md:L347-L358
  • Severity P2. Impact: compute_effect_bootstrap_stats_batch() returns all-NaN outputs when an effect fails the <50% valid bootstrap draws rule, but unlike the scalar helper it emits no warning/context. Since staggered_bootstrap.py now routes group-time, event-study, and group effects through this helper, invalid bootstrap inference on the changed paths can fail silently, which diverges from the registry’s documented “warn + NaN” behavior. Concrete fix: preserve the scalar helper’s warning semantics in batch mode, at least by warning once with the affected indices/labels before returning NaNs. diff_diff/bootstrap_utils.py:L317-L330, diff_diff/staggered_bootstrap.py:L339-L396, docs/methodology/REGISTRY.md:L359-L364

Code Quality

  • Severity P1. Impact: fit() now dispatches to _compute_all_att_gt_covariate_reg() whenever covariates is not None, but _precompute_structures() only builds covariate_by_period when covariates is truthy. A caller passing covariates=[] will therefore hit the covariate fast path with covariate_by_period=None and crash, whereas the pre-PR scalar path treated an empty list like “no covariates”. Concrete fix: normalize empty iterables to None before dispatch, or change the fast-path guard to if covariates:. diff_diff/staggered.py:L427-L435, diff_diff/staggered.py:L780-L833, diff_diff/staggered.py:L1173-L1187

Performance

  • No new blocking performance finding. The caching/factorization approach is reasonable once the behavior regressions above are fixed.

Maintainability

  • No separate maintainability finding beyond the scalar/batch semantic drift already noted above.

Tech Debt

  • No new tech-debt finding beyond the validation gap below.

Security

  • No security findings in the changed library code or the added benchmark artifacts.

Documentation/Tests

  • Severity P2. Impact: the added tests cover reg / dr not-yet-treated and rank-deficiency only with n_bootstrap=0, while the only bootstrap-with-covariates test still exercises the default DR path. The new validator also serializes only effect/SE outputs, not the changed t_stat / p_value / CI fields or group aggregation results. That leaves the optimized estimation_method="reg" path and the new batched inference helpers largely unguarded. Concrete fix: add targeted tests for estimation_method="reg", covariates, n_bootstrap>0 under both control groups and an unbalanced panel, plus direct scalar-vs-batch equivalence tests for safe_inference_batch() / compute_effect_bootstrap_stats_batch(), and extend the validator to compare group aggregation plus per-effect inference fields. tests/test_staggered.py:L888-L1090, tests/test_staggered.py:L1365-L1458, benchmarks/speed_review/validate_results.py:L52-L97, benchmarks/speed_review/validate_results.py:L165-L196

igerber added a commit that referenced this pull request Mar 14, 2026
Address five failure modes observed after gpt-5.4 upgrade (PRs #192, #194,
#195): documented deviations flagged as blockers, deferred work not accepted,
moving goalposts on re-review, undefined approval criteria, and valid
implementation choices treated as methodology errors.

Changes to pr_review.md:
- Exempt REGISTRY.md-documented deviations from P0/P1 (classify as P3)
- Add implementation choice exception for valid numerical approaches
- Add Deferred Work Acceptance section honoring TODO.md tracking
- Add Assessment Criteria with objective verdicts and mitigation rules
- Add Re-review Scope rules to prevent oscillation between rounds

Changes to ai_pr_review.yml:
- Add step to fetch previous AI review comment for re-review context
- Inject prior review findings into compiled prompt on /ai-review reruns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 14, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 2cf0973f48a2814c2d6808fac0e529f62daecc49


Overall assessment

⚠️ Needs changes

Executive summary

  • Affected methods: Callaway & Sant’Anna ATT(g,t) estimation on the optimized estimation_method="reg" covariate path, plus the multiplier-bootstrap inference helpers. I cross-checked against docs/methodology/REGISTRY.md:L293-L364 and the changed estimator docstrings.
  • The Cholesky/logit caching itself is a valid numerical optimization, and the earlier not_yet_treated cache-key concern looks addressed by the new keying and regression tests.
  • P1: fit(..., covariates=[]) still routes into the covariate fast path and dereferences covariate_by_period=None, so the PR still introduces a reproducible crash on an empty-list input.
  • P1: the optimized covariate-regression path still drops numerically invalid ATT(g,t) cells entirely instead of preserving them as invalid cells, which changes downstream aggregation weights and conflicts with the registry’s missing-cell rule.
  • P2: the new batch bootstrap stats helper still suppresses the scalar warning path for invalid per-effect bootstrap inference, so ATT(g,t)/event-study/group bootstrap outputs can now return all-NaN silently.
  • Static review only: this environment does not have numpy, pandas, scipy, or pytest, so I could not execute the suite.

Methodology

  • Severity P1. Impact: the previous control-design/cache-key issue appears fixed, but _compute_all_att_gt_covariate_reg() still increments n_dropped_cells and continues when the per-pair regression solve, prediction, or residuals become non-finite. That removes the (g,t) cell from group_time_effects, so simple/event-study/group aggregation reweights over a smaller set instead of preserving the invalid cell. That conflicts with the registry rule that missing group-time cells should be represented as invalid (NaN), not silently omitted. Concrete fix: on those branches, fall back to the scalar pairwise path for that (g,t); if it is still invalid, materialize the cell with NaN inference instead of omitting it. References: diff_diff/staggered.py:L980-L1057, docs/methodology/REGISTRY.md:L347-L350
  • Severity P2. Impact: compute_effect_bootstrap_stats_batch() returns all-NaN outputs for <50% valid draws or zero/non-finite bootstrap SE, but it does not emit the warning path required by the registry. Since ATT(g,t), event-study, and group bootstrap stats now use that helper, these changed paths can fail silently while overall ATT still warns. Concrete fix: add batch warning support (for example via an optional contexts list) or route invalid subsets through compute_effect_bootstrap_stats(). References: diff_diff/bootstrap_utils.py:L283-L377, diff_diff/staggered_bootstrap.py:L339-L396, docs/methodology/REGISTRY.md:L359-L364

Code Quality

  • Severity P1. Impact: fit() dispatches to the optimized covariate regression branch when covariates is not None, but _precompute_structures() only builds covariate_by_period when covariates is truthy. Passing covariates=[] therefore reaches _compute_all_att_gt_covariate_reg() with covariate_by_period=None and crashes. Concrete fix: normalize empty iterables to None before preprocessing/dispatch, or change the optimized branch guard to if covariates and self.estimation_method == "reg". References: diff_diff/staggered.py:L427-L435, diff_diff/staggered.py:L754-L833, diff_diff/staggered.py:L1173-L1187

Performance

  • No findings. The Cholesky/logistic caching is a reasonable implementation choice once the correctness issues above are fixed.

Maintainability

  • No separate findings beyond the scalar/batch semantic drift already noted in Methodology.

Tech Debt

  • No separate findings. The remaining lower-severity coverage gap below is not tracked in TODO.md.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new speed-review validator materially improves coverage and does exercise reg/dr/ipw with bootstrap, including reg + covariates, but it still serializes only effect/SE fields for group-time and event-study outputs and omits group_effects plus per-effect t_stat / p_value / CI, which are also changed by this PR’s batched inference helpers. Concrete fix: extend the validator to compare those outputs and add one direct scalar-vs-batch regression test for compute_effect_bootstrap_stats_batch(), especially the warning/NaN edge cases. References: benchmarks/speed_review/validate_results.py:L52-L97, benchmarks/speed_review/validate_results.py:L165-L196

Path to Approval

  1. Normalize empty covariates inputs before fast-path dispatch so covariates=[] behaves like the no-covariate estimator instead of crashing.
  2. In _compute_all_att_gt_covariate_reg(), stop dropping numerically invalid (g,t) cells via continue; fall back to the scalar pairwise path or preserve the cell as invalid so aggregations do not silently reweight around it.

igerber added a commit that referenced this pull request Mar 14, 2026
…es, batch warnings, validator fields

- Normalize covariates=[] to None to prevent crash in covariate dispatch
- Preserve invalid regression cells as NaN instead of dropping (REGISTRY.md compliance)
- Filter NaN effects from aggregation in simple, event study, and group methods
- Add summary warnings to batch bootstrap stats for insufficient samples and zero SE
- Extend validator field coverage to include t_stat, p_value, conf_int, group_effects
- Re-save baselines with extended schema

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 14, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 4d54cd78760c803b9a3b714a02acfd72d96d642d


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Affected methods: Callaway-Sant’Anna ATT(g,t) estimation on the optimized estimation_method="reg" covariate path, plus multiplier-bootstrap inference for overall ATT, event-study, and group aggregations.
  • The prior re-review blockers appear resolved: covariates=[] is normalized to None, numerically invalid ATT(g,t) cells are now preserved instead of dropped, and the new batch bootstrap helper emits warnings for insufficient valid draws / zero SE cases.
  • Severity P1 [Newly identified]: once the optimized regression path preserves an invalid ATT(g,t) cell as NaN, the bootstrap aggregation code still builds overall/event-study/group bootstrap targets from the unfiltered cell set. That contaminates aggregate bootstrap SE/CI/p-values even though the analytical aggregators now correctly exclude those NaN cells.
  • I did not find a methodology defect in the Cholesky/logit caching itself. The not_yet_treated cache-keying concern from the previous review looks addressed.
  • Static review only: I could not execute pytest or a targeted repro in this environment because pytest is not installed and numpy is unavailable.

Methodology

Affected methods: Callaway-Sant’Anna ATT(g,t) with optimized covariate regression, and bootstrap aggregation of ATT(g,t) into overall/event-study/group effects. Cross-checking against the registry, the caching is a valid numerical optimization; the remaining issue is the bootstrap treatment of preserved NaN cells.

Code Quality

No separate findings. The earlier covariates=[] crash appears fixed at diff_diff/staggered.py:L1129.

Performance

No findings. The Cholesky/logistic-coefficient caching is a reasonable implementation choice and does not look like a methodology defect on its own.

Maintainability

No separate findings beyond the bootstrap/analytical aggregation drift above.

Tech Debt

No separate findings. TODO.md does not contain an entry that mitigates the P1 above, so it remains approval-blocking. See TODO.md:L38.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the new coverage does not exercise the remaining failing interaction. The added preserved-NaN regression test stops at n_bootstrap=0, and the speed validator still omits group-level t_stat / p_value / CI fields, so this PR’s changed aggregate-bootstrap paths can still regress without detection. Concrete fix: add one bootstrap regression test that reuses the poisoned-lstsq setup with n_bootstrap>0 and aggregate="all" and asserts finite aggregates keep finite bootstrap inference after excluding invalid cells; extend the validator to serialize and compare group-effect inference fields as well. References: tests/test_staggered.py:L1049, benchmarks/speed_review/validate_results.py:L104, benchmarks/speed_review/validate_results.py:L229

Path to Approval

  1. Filter non-finite ATT(g,t) cells out of the bootstrap overall-ATT aggregation before computing overall_weights_post, original_overall, and the combined IF, mirroring [diff_diff/staggered_aggregation.py:L90](/home/runner/work/diff-diff/diff-diff/diff_diff/staggered_aggregation.py#L90).
  2. Apply the same finite-cell filtering and weight renormalization inside _prepare_event_study_aggregation() and _prepare_group_aggregation() so bootstrap event-study/group inference is based on the same cell set as the reported analytical aggregates.
  3. Add a regression test that forces one optimized covariate-regression ATT(g,t) cell to NaN under n_bootstrap>0, aggregate="all" and asserts overall/event-study/group bootstrap SE/CI/p-values are computed from the remaining finite cells rather than all becoming NaN.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 14, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 909cf05b2a187fa67ffadd5fc09244b80194915d


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Affected method: Callaway-Sant'Anna ATT(g,t) multiplier-bootstrap aggregation for overall_att in the optimized covariate-regression path.
  • The prior re-review concerns around control_group="not_yet_treated" cache keying and NaN filtering for event-study/group bootstrap reaggregation look addressed.
  • Severity P1: the overall bootstrap path still mixes the filtered post-treatment index set with the stale unfiltered post_treatment_mask, so a NaN post-treatment ATT(g,t) can either crash bootstrap reaggregation or poison overall bootstrap inference instead of excluding that cell.
  • The Cholesky/logit caching itself looks like a valid numerical optimization, not a methodology deviation from Callaway-Sant'Anna or the Methodology Registry.
  • Static review only: the environment lacks numpy, so I could not run pytest or a targeted repro.

Methodology

Affected method: Callaway-Sant'Anna ATT(g,t) aggregation into overall_att with multiplier bootstrap.

  • Severity P1. Impact: _run_multiplier_bootstrap() now filters non-finite post-treatment cells into post_treatment_indices, but original_overall is still computed from original_atts[post_treatment_mask]. If any post-treatment ATT(g,t) is NaN, the analytical path excludes it per the registry, but the bootstrap overall path no longer uses the same estimand. With multiple surviving post-treatment cells this will raise a shape-mismatch error; with exactly one surviving cell, NumPy broadcasts the scalar weight and contaminates original_overall to NaN, which then collapses overall bootstrap SE/CI/p-value even though a valid aggregate exists. That is an unmitigated inference bug in the changed path, and it is not a documented deviation in the registry. Concrete fix: after finite_post filtering, compute post_effects = original_atts[post_treatment_indices] once and use that same filtered array for original_overall, overall_weights_post, and _compute_combined_influence_function(). References: diff_diff/staggered_bootstrap.py:L207, diff_diff/staggered_bootstrap.py:L236, diff_diff/staggered_bootstrap.py:L307, docs/methodology/REGISTRY.md:L349, docs/methodology/REGISTRY.md:L359

No other methodology findings. The new Cholesky/logit caches are implementation choices between valid numerical approaches.

Code Quality

No separate findings.

Performance

No findings. The factorization/logit caches are reasonable performance optimizations on their own.

Maintainability

No separate findings beyond the filtered-set drift in _run_multiplier_bootstrap() described above.

Tech Debt

No separate findings. The blocking bootstrap issue above is not currently tracked under the code-review tech-debt list, so it remains unmitigated. Reference: TODO.md:L38

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the new regression test for NaN-cell bootstrap aggregation only poisons the first lstsq solve; it does not ensure that the poisoned ATT(g,t) is post-treatment, so it can miss the still-broken overall_att bootstrap branch. Concrete fix: force a specific post-treatment (g,t) cell to become NaN, or at minimum assert that one poisoned cell satisfies t >= g - anticipation before checking overall bootstrap inference. References: tests/test_staggered.py:L1110, diff_diff/staggered.py:L792
  • Severity P3. Impact: validate_results.py will silently pass a NaN vs finite mismatch because abs(np.nan - x) stays NaN, which never exceeds the tolerance. That weakens the added validation harness for exactly the NaN-handling paths this PR is changing. Concrete fix: add an explicit one-sided-NaN check before computing diff. References: benchmarks/speed_review/validate_results.py:L177

Path to Approval

  1. In _run_multiplier_bootstrap(), use the filtered post-treatment set consistently when computing original_overall; do not mix post_treatment_indices with the stale post_treatment_mask.
  2. Add a targeted regression test that forces a post-treatment ATT(g,t) cell to NaN under n_bootstrap > 0, aggregate="all" and asserts finite overall bootstrap inference from the remaining finite post-treatment cells.

igerber added a commit that referenced this pull request Mar 14, 2026
Address five failure modes observed after gpt-5.4 upgrade (PRs #192, #194,
#195): documented deviations flagged as blockers, deferred work not accepted,
moving goalposts on re-review, undefined approval criteria, and valid
implementation choices treated as methodology errors.

Changes to pr_review.md:
- Exempt REGISTRY.md-documented deviations from P0/P1 (classify as P3)
- Add implementation choice exception for valid numerical approaches
- Add Deferred Work Acceptance section honoring TODO.md tracking
- Add Assessment Criteria with objective verdicts and mitigation rules
- Add Re-review Scope rules to prevent oscillation between rounds

Changes to ai_pr_review.yml:
- Add step to fetch previous AI review comment for re-review context
- Inject prior review findings into compiled prompt on /ai-review reruns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 14, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 6356a9d3fcf37c264c44f25234391569fcc776d9


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review scope is the Callaway-Sant’Anna optimization/validation diff. The prior P1 around original_overall bootstrap aggregation is fixed in diff_diff/staggered_bootstrap.py:L207-L236.
  • The Cholesky / cached-logit changes are valid numerical optimizations, not methodology deviations from Callaway-Sant’Anna (2021).
  • Severity P1 [Newly identified]: balance_e event-study aggregation still keeps cohorts whose anchor-horizon ATT(g,t) is NaN, so the “balanced” event study can silently use different cohort sets across horizons.
  • The prior validator gap for one-sided NaN mismatches appears fixed in benchmarks/speed_review/validate_results.py:L177-L183.
  • Static review only: I could not run targeted tests in this sandbox because numpy is unavailable (ModuleNotFoundError).

Methodology

  • Severity P1 [Newly identified]. Impact: affected method is Callaway-Sant’Anna event-study aggregation with balance_e. The public docstring says balance_e “ensures all groups contribute to each relative period,” and the registry defines event-study aggregation as a weighted average over ATT(g,g+e) cells, with missing group-time cells represented as NaN. After this PR, the optimized covariate-regression path now preserves non-finite ATT(g,t) cells as NaN, but both the analytical and bootstrap balance_e helpers still build groups_at_e from every anchor-horizon cell without checking whether that anchor effect is finite. They only drop NaN later within each horizon. That means a cohort with NaN at e=balance_e can still contribute at other horizons, so the event study is no longer actually balanced and the estimand changes silently. Concrete fix: require a finite anchor effect when building groups_at_e in both helpers, then add a regression test that forces a NaN anchor cell and verifies the cohort is excluded from all event-study horizons. References: diff_diff/staggered.py:L1057-L1067, diff_diff/staggered.py:L1115-L1117, diff_diff/staggered_aggregation.py:L433-L474, diff_diff/staggered_bootstrap.py:L488-L522, docs/methodology/REGISTRY.md:L306-L309, docs/methodology/REGISTRY.md:L347-L349.

  • No other methodology findings. The caching itself is an implementation choice, and the previous overall-bootstrap mismatch is resolved.

Code Quality

  • No findings.

Performance

  • No findings. The Cholesky and propensity-score caches are reasonable performance optimizations for the reviewed path.

Maintainability

  • No separate findings beyond the balance_e bug above.

Tech Debt

  • No separate findings. The remaining balance_e issue is a correctness problem and is not mitigated by TODO.md under the project rules.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new regression tests cover post-treatment NaN filtering for overall/bootstrap aggregation, and there is an existing balanced-event-study bootstrap smoke test, but there is still no Callaway-Sant’Anna regression test for the specific balance_e + NaN-anchor interaction. The current suite would not catch the bug above. Concrete fix: add analytical and bootstrap tests with aggregate="event_study", balance_e=0 that poison a specific anchor-horizon cell to NaN and assert that cohort is absent from every returned event-study horizon. References: tests/test_staggered.py:L1110-L1167, tests/test_staggered.py:L1645-L1668.

Path to Approval

  1. In both event-study aggregation helpers, change groups_at_e construction to include only cohorts with a finite anchor-horizon effect at e=balance_e.
  2. Add a targeted regression test for Callaway-Sant’Anna that forces a NaN anchor cell under balance_e=0 and verifies that the affected cohort is excluded from all analytical and bootstrap event-study horizons.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 15, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 12e67edd1d090a01c130c0ca294676d2499e4da9


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: the small-cohort overlap warning gap and the missing public docs/tutorial for EfficientDiD are already tracked in TODO.md, so they should not block this PR under the stated review rules. Concrete fix: none required for approval; address via the existing TODO entries. References: TODO.md:L42-L49

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: tests/test_efficient_did.py covers unbalanced panels, duplicates, and absorbing-treatment validation, but it does not cover the row-complete-with-NaN-outcome case that currently slips through and causes the methodology issue above. Concrete fix: add a validation regression test that keeps the panel structurally balanced, injects one np.nan into the outcome column, and asserts that EfficientDiD.fit() raises a clear ValueError before weight / EIF computation. References: tests/test_efficient_did.py:L290-L331

Path to Approval

  1. Add explicit non-finite outcome validation in EfficientDiD.fit() before the wide pivot, and fail fast with a clear ValueError for any NaN/Inf outcomes.
  2. Add a regression test in tests/test_efficient_did.py that constructs a row-complete balanced panel with one missing outcome and asserts the estimator raises, rather than propagating non-finite values into Omega*/aggregation.

igerber and others added 8 commits March 15, 2026 10:14
…hing

Cache Cholesky factorizations of X^T X across (g,t) pairs that share the
same control design matrix, eliminating redundant OLS solves. Cache
propensity score coefficients for IPW/DR methods. Achieves 5-19x speedup
on covariate regression path depending on k.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address PR #194 review feedback:
- Include cohort g in cho_key and control_key for not_yet_treated
  control group, preventing stale Cholesky/matrix reuse across groups
- Add UserWarning when group-time cells are dropped due to non-finite
  regression results in covariate reg path
- Add not_yet_treated validation scenarios (dr_2cov_nyt, reg_2cov_nyt)
- Add regression test for DR/reg + covariates + not_yet_treated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gthen validation

- Revert _linear_regression skip_rank_check regression (always check rank)
- Add one-time rank check to batched covariate path with lstsq fallback
- Add dual-key IF payloads (treated_units/control_units + treated_idx/control_idx)
- Extend validate_results.py with p-value, CI, event-study, and tighter SE checks
- Add test_rank_deficient_action_warn_emits_warning test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e-07

Fix batched reg path: compute kept_cols on rank deficiency, skip per-pair
Cholesky when rank-deficient, solve reduced system with zero-fill for
dropped columns. Fix DR path: check rank before caching Cholesky factor,
use False sentinel for rank-deficient keys, zero NaN coefficients for
prediction. Add cond=1e-07 to all lstsq calls for solve_ols consistency.

Add ipw_2cov_nyt validation scenario, DR rank test, reg nyt rank test,
and bootstrap single-unit cohort edge case test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…es, batch warnings, validator fields

- Normalize covariates=[] to None to prevent crash in covariate dispatch
- Preserve invalid regression cells as NaN instead of dropping (REGISTRY.md compliance)
- Filter NaN effects from aggregation in simple, event study, and group methods
- Add summary warnings to batch bootstrap stats for insufficient samples and zero SE
- Extend validator field coverage to include t_stat, p_value, conf_int, group_effects
- Re-save baselines with extended schema

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…st, extend validator group fields

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… poison post-treatment cell, add one-sided NaN detection in validator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d regression test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reject NaN/Inf values in the outcome column before they can propagate
into Omega*/EIF calculations. Adds a regression test for the validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 15, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 3d30bc5f6abca4e06cc4c19dee0ecf5db1bfd5de


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

  • Severity P3. Impact: _aggregate_event_study() and _aggregate_by_group() correctly drop NaN ATT(g,t) cells before aggregation, but n_groups and n_periods still record the pre-filter counts. In poisoned-cell scenarios, the effect/SE are computed from the filtered set while the metadata overstates how many finite cells contributed. Concrete fix: derive n_groups/n_periods after finite_mask filtering and add a regression test that asserts the reported counts match the filtered contributors. diff_diff/staggered_aggregation.py:L465 diff_diff/staggered_aggregation.py:L569

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No new untracked tech-debt findings. The remaining EfficientDiD follow-ups in TODO.md are unchanged and already tracked, so they do not block approval. TODO.md:L42

Security

  • No findings.

Documentation/Tests

  • No blocking findings.
  • The added tests cover the resolved EfficientDiD validation gap and the new Callaway-Sant’Anna not_yet_treated, rank-deficiency, NaN-cell, and bootstrap edge paths.
  • I could not run the suite in this environment because the required Python scientific stack is not installed.

@igerber igerber merged commit 154639a into main Mar 15, 2026
10 checks passed
@igerber igerber deleted the speed-review branch March 15, 2026 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant