
Add R validation for 4 survey-enabled estimators#276

Merged
igerber merged 2 commits into main from survey-r-validation
Apr 6, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 6, 2026

Summary

  • Add R benchmark script (benchmark_survey_estimators.R) generating golden JSON for 4 survey-enabled estimators: ImputationDiD, StackedDiD, SunAbraham, TripleDifference
  • Add Python tests consuming golden values (5 tests across 4 scenarios)
  • Document validation results in docs/benchmarks.rst with observed gaps and reproduction instructions

Methodology references (required if estimator / math changes)

  • N/A — no methodology changes. Tests validate existing survey weight propagation against R's survey::svyglm().

Validation

  • Tests added: tests/test_survey_estimator_validation.py (5 tests)
    • S1: ImputationDiD control-only WLS regression (coef gap <1e-10, SE 0.00%)
    • S1: ImputationDiD end-to-end smoke test (finite ATT/SE)
  • S2: StackedDiD with Q-weight × survey-weight composition (coef gap <1e-10, SE 0.77%)
    • S3: SunAbraham IW-aggregated ATT (coef gap <1e-11, SE 0.00%)
    • S4: TripleDifference three-way interaction DDD (coef gap <1e-10, SE 0.36%)
  • R golden values generated by Rscript benchmarks/R/benchmark_survey_estimators.R
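The comparison pattern behind these tests can be sketched as follows. This is a minimal Python illustration of consuming an R-generated golden JSON fixture, not the actual test module; the fixture keys, helper name, and numbers here are illustrative, and the real schema emitted by `benchmark_survey_estimators.R` may differ:

```python
import json
import math

# Illustrative fragment of a golden-value fixture (hypothetical schema).
golden = json.loads('{"s1_imputation_did": {"att": 1.234, "se": 0.056}}')

def check_against_golden(py_att, py_se, r_att, r_se,
                         coef_tol=1e-10, se_rel_tol=0.015):
    # Coefficients must agree essentially exactly; SEs within 1.5% relative,
    # matching the tolerance described in the PR summary.
    assert math.isclose(py_att, r_att, abs_tol=coef_tol, rel_tol=0.0)
    assert abs(py_se - r_se) / abs(r_se) <= se_rel_tol

case = golden["s1_imputation_did"]
check_against_golden(1.234, 0.0563, case["att"], case["se"])
```

The asymmetric tolerances reflect that point estimates reduce to the same WLS solution, while standard errors can differ slightly across design-based variance implementations.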

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…kedDiD, SunAbraham, TripleDifference)

Expand survey R-validation coverage from 4 to 8 estimators. Each target
estimator reduces to a WLS regression under survey weights, validated
against R's survey::svyglm(). Point estimates match to <1e-10; SEs
within 0.77% (largest observed gap).
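The "reduces to a WLS regression under survey weights" claim can be sketched with a closed-form weighted least squares fit. This is an illustrative Python analogue (synthetic data, numpy only, not code from the PR): for a given weight vector, `survey::svyglm()` returns the same point estimates, and additionally supplies design-based standard errors:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])       # intercept + regressor
y = 2.0 - 0.5 * x + rng.normal(size=n)     # true beta = (2.0, -0.5)
w = rng.uniform(0.5, 2.0, size=n)          # survey weights

# Weighted least squares: beta = (X' W X)^{-1} X' W y.
XtW = X.T * w
beta_hat = np.linalg.solve(XtW @ X, XtW @ y)
```

Because the point estimates coincide exactly, coefficient gaps below 1e-10 are expected; only the standard errors carry implementation-dependent differences.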

Deliverables:
- R benchmark script generating golden JSON (4 scenarios, 2 DGPs)
- Python tests consuming golden values (5 tests, 1.5% SE tolerance)
- Documentation in docs/benchmarks.rst

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

github-actions bot commented Apr 6, 2026

Overall Assessment

⚠️ Needs changes

Executive Summary

  • Cross-checked the PR against docs/methodology/REGISTRY.md and the affected estimator implementations. I did not find an undocumented estimator-method mismatch in ImputationDiD, StackedDiD, SunAbraham, or TripleDifference.
  • The S2 StackedDiD SE gap is a disclosed benchmark-harness asymmetry: the R side drops FPC on stacked data, while the Python estimator keeps the re-resolved stacked survey design.
  • P1: the new R benchmark helper computes inference inline without the repo’s NaN-all-or-nothing guard, so a future zero or non-finite SE would serialize misleading benchmark inference values.
  • P3: several golden values generated by the benchmark are never asserted in Python, so the exercised coverage is narrower than the fixture/docs suggest.
  • I could not execute the new test module here because the review environment does not have pytest or pandas.

Methodology

Cross-check against the registry and estimator docstrings did not surface an undocumented deviation in the library estimators touched by this PR.

  • Severity: P1. Impact: benchmarks/R/benchmark_survey_estimators.R:L195-L211 introduces a new inline inference helper (t_stat <- att / se, manual CI construction). If a benchmark case ever yields se <= 0 or non-finite SE, this will emit misleading Inf or point-estimate inference instead of the project’s required all-or-nothing NaN behavior. Because tests/test_survey_estimator_validation.py:L304-L325 only checks S4 coefficient/SE, those bad inference fields would land silently in the committed golden file. Concrete fix: replace the helper with guarded inference generation that sets all derived inference fields to NaN whenever se is non-finite or <= 0, and then assert those fields in Python.
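The requested all-or-nothing guard can be sketched in Python (the actual fix belongs in the R helper; this is a hypothetical analogue, and the field names mirror the ones mentioned above rather than a confirmed schema):

```python
import math

def guarded_inference(att, se, df, crit=1.96):
    """All-or-nothing inference: if the SE is non-finite or <= 0, every
    derived field becomes NaN rather than Inf or a bare point estimate."""
    if not math.isfinite(se) or se <= 0:
        nan = float("nan")
        return {"t_stat": nan, "ci_lower": nan, "ci_upper": nan, "df": nan}
    return {"t_stat": att / se,
            "ci_lower": att - crit * se,
            "ci_upper": att + crit * se,
            "df": df}
```

Serializing NaN for all derived fields at once keeps the golden JSON unambiguous: a consumer can detect the degenerate case from any single field instead of having to cross-check `t_stat` against the CI bounds.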
  • Severity: P3. Impact: docs/benchmarks.rst:L802-L806 correctly discloses that S2 omits FPC on the R side while the Python estimator keeps it on the stacked design. I do not count this as a defect because it is documented and confined to the benchmark harness. Concrete fix: none required.

Code Quality

  • No additional findings beyond the P1 inference helper above.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No blocking tech-debt issue introduced in the library code. The coverage gap below is currently untracked in TODO.md.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Replace the inline S4 inference math in benchmarks/R/benchmark_survey_estimators.R:L195-L211 with guarded NaN propagation for zero or non-finite SEs.
  2. Extend tests/test_survey_estimator_validation.py:L304-L325 to assert the S4 inference fields generated by that helper (t_stat, CI, and df), so the benchmark cannot regress silently again.

- Guard extract_svyglm_results for non-finite/zero SE (P1)
- Assert S4 t_stat, CI lower/upper in Python test (P1)
- Assert S2 event-study SEs and S3 cohort-effect SEs (P3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 6, 2026

/ai-review

@github-actions

github-actions bot commented Apr 6, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 9d0e27e2bceb6a066a3c4452d2a867a3dd6b8012


Overall Assessment

✅ Looks good

All previously raised P1+ issues are resolved. I found one new P3 test-harness gap, but no unmitigated P0/P1 issues.

Executive Summary

  • The prior blocker is fixed: the new R helper now applies NaN-all-or-nothing gating for invalid SEs in benchmark_survey_estimators.R:195, and the S4 test now asserts t_stat and CI outputs in test_survey_estimator_validation.py:318.
  • Cross-checking the new benchmark math against the Methodology Registry and the current estimator implementations did not surface an undocumented methodology mismatch for ImputationDiD, StackedDiD, SunAbraham, or TripleDifference.
  • The S2 0.77% SE gap is an explained benchmark-harness asymmetry, not a library defect; benchmarks.rst:802 documents that the R stacked comparison omits FPC while Python re-resolves the stacked design.
  • Severity: P3. The generic _assert_close() helper in test_survey_estimator_validation.py:41 skips assertions whenever the golden value is NaN, so a future undefined-inference case from benchmark_survey_estimators.R:195 would not verify Python also returns NaN.
  • I could not execute pytest tests/test_survey_estimator_validation.py here because pytest is not installed in the review environment.

Methodology

No findings. The new R-side constructions match the documented estimator contracts: S1 uses Omega_0 untreated observations, S2 uses sample-share Q-weights with never-treated clean controls, S3 uses survey-weighted cohort masses in the IW aggregation, and S4 compares the regression DDD path to the three-way interaction coefficient under survey design.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining gap is a minor test-helper issue rather than untracked estimator debt.

Security

No findings. The committed JSON fixture appears synthetic; I did not see secrets or PII.

Documentation/Tests

  • Severity: P3. Impact: test_survey_estimator_validation.py:41 returns immediately when r_val is NaN. That means the new NaN guard in benchmark_survey_estimators.R:195 is not actually enforced by the Python assertions for future undefined-inference scenarios. Concrete fix: change _assert_close() to assert np.isnan(py_val) when r_val is NaN, and add one focused degenerate-inference case so the NaN path is exercised explicitly.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 6, 2026
@igerber igerber merged commit 8ad8c4e into main Apr 6, 2026
17 of 18 checks passed
@igerber igerber deleted the survey-r-validation branch April 6, 2026 13:30