
Add R validation for 4 survey-enabled estimators#276

Merged
igerber merged 2 commits into main from survey-r-validation
Apr 6, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 6, 2026

Summary

  • Add R benchmark script (benchmark_survey_estimators.R) generating golden JSON for 4 survey-enabled estimators: ImputationDiD, StackedDiD, SunAbraham, TripleDifference
  • Add Python tests consuming golden values (5 tests across 4 scenarios)
  • Document validation results in docs/benchmarks.rst with observed gaps and reproduction instructions

Methodology references (required if estimator / math changes)

  • N/A — no methodology changes. Tests validate existing survey weight propagation against R's survey::svyglm().

Validation

  • Tests added: tests/test_survey_estimator_validation.py (5 tests)
    • S1: ImputationDiD control-only WLS regression (coef gap <1e-10, SE 0.00%)
    • S1: ImputationDiD end-to-end smoke test (finite ATT/SE)
  • S2: StackedDiD with Q-weight × survey-weight composition (coef gap <1e-10, SE 0.77%)
    • S3: SunAbraham IW-aggregated ATT (coef gap <1e-11, SE 0.00%)
    • S4: TripleDifference three-way interaction DDD (coef gap <1e-10, SE 0.36%)
  • R golden values generated by Rscript benchmarks/R/benchmark_survey_estimators.R
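The comparison pattern behind these tests can be sketched as follows. This is a minimal Python illustration of consuming an R-generated golden JSON fixture, not the actual test module; the fixture keys, helper name, and numbers here are illustrative, and the real schema emitted by `benchmark_survey_estimators.R` may differ:

```python
import json
import math

# Illustrative fragment of a golden-value fixture (hypothetical schema).
golden = json.loads('{"s1_imputation_did": {"att": 1.234, "se": 0.056}}')

def check_against_golden(py_att, py_se, r_att, r_se,
                         coef_tol=1e-10, se_rel_tol=0.015):
    # Coefficients must agree essentially exactly; SEs within 1.5% relative,
    # matching the tolerance described in the PR summary.
    assert math.isclose(py_att, r_att, abs_tol=coef_tol, rel_tol=0.0)
    assert abs(py_se - r_se) / abs(r_se) <= se_rel_tol

case = golden["s1_imputation_did"]
check_against_golden(1.234, 0.0563, case["att"], case["se"])
```

The asymmetric tolerances reflect that point estimates reduce to the same WLS solution, while standard errors can differ slightly across design-based variance implementations.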

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…kedDiD, SunAbraham, TripleDifference)

Expand survey R-validation coverage from 4 to 8 estimators. Each target
estimator reduces to a WLS regression under survey weights, validated
against R's survey::svyglm(). Point estimates match to <1e-10; SEs
within 0.77% (largest observed gap).
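The "reduces to a WLS regression under survey weights" claim can be sketched with a closed-form weighted least squares fit. This is an illustrative Python analogue (synthetic data, numpy only, not code from the PR): for a given weight vector, `survey::svyglm()` returns the same point estimates, and additionally supplies design-based standard errors:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])       # intercept + regressor
y = 2.0 - 0.5 * x + rng.normal(size=n)     # true beta = (2.0, -0.5)
w = rng.uniform(0.5, 2.0, size=n)          # survey weights

# Weighted least squares: beta = (X' W X)^{-1} X' W y.
XtW = X.T * w
beta_hat = np.linalg.solve(XtW @ X, XtW @ y)
```

Because the point estimates coincide exactly, coefficient gaps below 1e-10 are expected; only the standard errors carry implementation-dependent differences.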

Deliverables:
- R benchmark script generating golden JSON (4 scenarios, 2 DGPs)
- Python tests consuming golden values (5 tests, 1.5% SE tolerance)
- Documentation in docs/benchmarks.rst

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

github-actions bot commented Apr 6, 2026

Overall Assessment

⚠️ Needs changes

Executive Summary

  • Cross-checked the PR against docs/methodology/REGISTRY.md and the affected estimator implementations. I did not find an undocumented estimator-method mismatch in ImputationDiD, StackedDiD, SunAbraham, or TripleDifference.
  • The S2 StackedDiD SE gap is a disclosed benchmark-harness asymmetry: the R side drops FPC on stacked data, while the Python estimator keeps the re-resolved stacked survey design.
  • P1: the new R benchmark helper computes inference inline without the repo’s NaN-all-or-nothing guard, so a future zero or non-finite SE would serialize misleading benchmark inference values.
  • P3: several golden values generated by the benchmark are never asserted in Python, so the exercised coverage is narrower than the fixture/docs suggest.
  • I could not execute the new test module here because the review environment does not have pytest or pandas.

Methodology

Cross-check against the registry and estimator docstrings did not surface an undocumented deviation in the library estimators touched by this PR.

  • Severity: P1. Impact: benchmarks/R/benchmark_survey_estimators.R:L195-L211 introduces a new inline inference helper (t_stat <- att / se, manual CI construction). If a benchmark case ever yields se <= 0 or non-finite SE, this will emit misleading Inf or point-estimate inference instead of the project’s required all-or-nothing NaN behavior. Because tests/test_survey_estimator_validation.py:L304-L325 only checks S4 coefficient/SE, those bad inference fields would land silently in the committed golden file. Concrete fix: replace the helper with guarded inference generation that sets all derived inference fields to NaN whenever se is non-finite or <= 0, and then assert those fields in Python.
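The requested all-or-nothing guard can be sketched in Python (the actual fix belongs in the R helper; this is a hypothetical analogue, and the field names mirror the ones mentioned above rather than a confirmed schema):

```python
import math

def guarded_inference(att, se, df, crit=1.96):
    """All-or-nothing inference: if the SE is non-finite or <= 0, every
    derived field becomes NaN rather than Inf or a bare point estimate."""
    if not math.isfinite(se) or se <= 0:
        nan = float("nan")
        return {"t_stat": nan, "ci_lower": nan, "ci_upper": nan, "df": nan}
    return {"t_stat": att / se,
            "ci_lower": att - crit * se,
            "ci_upper": att + crit * se,
            "df": df}
```

Serializing NaN for all derived fields at once keeps the golden JSON unambiguous: a consumer can detect the degenerate case from any single field instead of having to cross-check `t_stat` against the CI bounds.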
  • Severity: P3. Impact: docs/benchmarks.rst:L802-L806 correctly discloses that S2 omits FPC on the R side while the Python estimator keeps it on the stacked design. I do not count this as a defect because it is documented and confined to the benchmark harness. Concrete fix: none required.

Code Quality

  • No additional findings beyond the P1 inference helper above.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No blocking tech-debt issue introduced in the library code. The coverage gap below is currently untracked in TODO.md.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Replace the inline S4 inference math in benchmarks/R/benchmark_survey_estimators.R:L195-L211 with guarded NaN propagation for zero or non-finite SEs.
  2. Extend tests/test_survey_estimator_validation.py:L304-L325 to assert the S4 inference fields generated by that helper (t_stat, CI, and df), so the benchmark cannot regress silently again.

- Guard extract_svyglm_results for non-finite/zero SE (P1)
- Assert S4 t_stat, CI lower/upper in Python test (P1)
- Assert S2 event-study SEs and S3 cohort-effect SEs (P3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 6, 2026

/ai-review

@github-actions

github-actions bot commented Apr 6, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 9d0e27e2bceb6a066a3c4452d2a867a3dd6b8012


Overall Assessment

✅ Looks good

All previously raised P1+ issues are resolved. I found one new P3 test-harness gap, but no unmitigated P0/P1 issues.

Executive Summary

  • The prior blocker is fixed: the new R helper now applies NaN-all-or-nothing gating for invalid SEs in benchmark_survey_estimators.R:195, and the S4 test now asserts t_stat and CI outputs in test_survey_estimator_validation.py:318.
  • Cross-checking the new benchmark math against the Methodology Registry and the current estimator implementations did not surface an undocumented methodology mismatch for ImputationDiD, StackedDiD, SunAbraham, or TripleDifference.
  • The S2 0.77% SE gap is an explained benchmark-harness asymmetry, not a library defect; benchmarks.rst:802 documents that the R stacked comparison omits FPC while Python re-resolves the stacked design.
  • Severity: P3. The generic _assert_close() helper in test_survey_estimator_validation.py:41 skips assertions whenever the golden value is NaN, so a future undefined-inference case from benchmark_survey_estimators.R:195 would not verify Python also returns NaN.
  • I could not execute pytest tests/test_survey_estimator_validation.py here because pytest is not installed in the review environment.

Methodology

No findings. The new R-side constructions match the documented estimator contracts: S1 uses Omega_0 untreated observations, S2 uses sample-share Q-weights with never-treated clean controls, S3 uses survey-weighted cohort masses in the IW aggregation, and S4 compares the regression DDD path to the three-way interaction coefficient under survey design.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining gap is a minor test-helper issue rather than untracked estimator debt.

Security

No findings. The committed JSON fixture appears synthetic; I did not see secrets or PII.

Documentation/Tests

  • Severity: P3. Impact: test_survey_estimator_validation.py:41 returns immediately when r_val is NaN. That means the new NaN guard in benchmark_survey_estimators.R:195 is not actually enforced by the Python assertions for future undefined-inference scenarios. Concrete fix: change _assert_close() to assert np.isnan(py_val) when r_val is NaN, and add one focused degenerate-inference case so the NaN path is exercised explicitly.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 6, 2026
@igerber igerber merged commit 8ad8c4e into main Apr 6, 2026
17 of 18 checks passed
@igerber igerber deleted the survey-r-validation branch April 6, 2026 13:30