Conversation
Create tests/test_methodology_callaway.py with 46 tests covering:
- Phase 1: Equation verification (hand-calculated ATT formula match)
- Phase 2: R benchmark comparison (did::att_gt() alignment)
- Phase 3: Edge case tests (all REGISTRY.md documented cases)
- Phase 4: SE formula verification (analytical vs bootstrap)

Update METHODOLOGY_REVIEW.md to mark CallawaySantAnna as Complete with detailed verification results and documented deviations from the R package.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
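The Phase 1 hand calculation can be illustrated with made-up numbers (not taken from the test suite): for a single group-time cell with a never-treated comparison group, the Callaway & Sant'Anna ATT(g,t) reduces to the familiar difference-in-differences of group means.

```python
# Illustrative hand calculation in the spirit of Phase 1; the means below
# are invented for the example, not actual test-suite values.
y_treat_pre, y_treat_post = 10.0, 15.0   # treated-group outcome means
y_ctrl_pre, y_ctrl_post = 8.0, 11.0      # never-treated comparison means

# ATT(g,t) = (treated post - treated pre) - (control post - control pre)
att = (y_treat_post - y_treat_pre) - (y_ctrl_post - y_ctrl_pre)
print(att)  # -> 2.0
```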
Overall assessment:
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Fix Rust Webb bootstrap weights to match NumPy implementation:
- Correct values: ±√(3/2), ±1, ±√(1/2) (was using wrong values)
- Correct probabilities: [1,2,3,3,2,1]/12 (was uniform)
- Add 3 Rust unit tests for Webb weight verification
- Both backends now produce variance ≈ 0.833

Add lazy R availability fixture to avoid import-time latency:
- New tests/conftest.py with session-scoped r_available fixture
- Support DIFF_DIFF_R=skip environment variable
- Test collection now completes in <1s (was ~2s with subprocess)

Improve test performance:
- Add @pytest.mark.slow marker for thorough bootstrap tests
- Reduce bootstrap iterations from 199 to 99 where sufficient
- Add slow marker definition to pyproject.toml

Documentation updates:
- METHODOLOGY_REVIEW.md: Correct Webb variance to 0.833
- TODO.md: Log 7 pre-existing NaN handling issues
- CLAUDE.md: Document Rust test troubleshooting (PyO3 linking)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
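The variance ≈ 0.833 figure follows directly from the moments of the six-point distribution described in this commit. A minimal sketch (using the weighted [1,2,3,3,2,1]/12 probabilities of this commit, which later commits replace with equal 1/6 probabilities):

```python
import numpy as np

# Six-point Webb values +/-sqrt(3/2), +/-1, +/-sqrt(1/2), paired with the
# weighted probabilities [1,2,3,3,2,1]/12 described in the commit above.
values = np.array([-np.sqrt(1.5), -1.0, -np.sqrt(0.5),
                   np.sqrt(0.5), 1.0, np.sqrt(1.5)])
probs = np.array([1, 2, 3, 3, 2, 1]) / 12

mean = probs @ values          # 0 by symmetry
var = probs @ values ** 2      # (1.5 + 2 + 1.5 + 1.5 + 2 + 1.5) / 12 = 10/12
print(round(var, 3))           # -> 0.833
```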
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Tests not run (review only).
…ements

- Fix Webb weights to use equal probabilities (1/6 each) matching R's did package
  - Python: removed weighted probs from staggered_bootstrap.py
  - Rust: changed to uniform selection in bootstrap.rs
  - Now Var(w) = 1.0 instead of 0.833, fixing ~8.5% SE underestimation
- Add suppressMessages() to R benchmark tests to prevent JSON parse errors
- Exclude slow tests by default via pytest addopts (-m 'not slow')
- Update METHODOLOGY_REVIEW.md to reflect Webb alignment with R
- Update test expectations for Webb variance (1.0 instead of 0.833)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
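The switch to equal 1/6 probabilities and the size of the resulting SE correction can be checked analytically; a small sketch:

```python
import numpy as np

# Same six Webb values as before; only the probabilities change.
values = np.array([-np.sqrt(1.5), -1.0, -np.sqrt(0.5),
                   np.sqrt(0.5), 1.0, np.sqrt(1.5)])

var_uniform = np.mean(values ** 2)  # equal 1/6 probs: (1.5+1+0.5)*2/6 = 1.0
var_weighted = (np.array([1, 2, 3, 3, 2, 1]) / 12) @ values ** 2  # 10/12

# Bootstrap SEs scale with sqrt(Var(w)): sqrt(10/12) ~ 0.913, i.e. the
# weighted scheme understated SEs by roughly 8-9%.
print(round(var_uniform, 3), round(np.sqrt(var_weighted), 3))
```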
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
…coverage

- Add detailed Webb distribution documentation to REGISTRY.md with citation
  - Table of all bootstrap weight distributions (Rademacher, Mammen, Webb)
  - Explicit values, probabilities, and properties
  - Reference to Webb (2023) working paper
- Fix R test path handling for cross-platform compatibility
  - Convert backslashes to forward slashes before embedding in R script
  - Use normalizePath() in R for proper path resolution
- Document how to run slow tests in pyproject.toml
  - Add comment explaining pytest -m slow and pytest -m '' options
  - Update marker description for clarity

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
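The three multiplier distributions tabulated in REGISTRY.md all satisfy the moment conditions E[w] = 0 and E[w²] = 1; a sketch checking this analytically (the value/probability pairs below are the standard textbook definitions, not copied from the registry):

```python
import numpy as np

s5 = np.sqrt(5.0)
# (values, probabilities) for each multiplier distribution
dists = {
    "rademacher": (np.array([-1.0, 1.0]),
                   np.array([0.5, 0.5])),
    "mammen": (np.array([-(s5 - 1) / 2, (s5 + 1) / 2]),
               np.array([(s5 + 1) / (2 * s5), (s5 - 1) / (2 * s5)])),
    "webb": (np.array([-np.sqrt(1.5), -1.0, -np.sqrt(0.5),
                       np.sqrt(0.5), 1.0, np.sqrt(1.5)]),
             np.full(6, 1 / 6)),
}
for name, (v, p) in dists.items():
    # first and second moments: all should print 0.0 and 1.0
    print(name, round(p @ v, 6), round(p @ v ** 2, 6))
```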
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
…mparison tests

- Fix Webb weights in utils.py to use equal probabilities (1/6 each), matching staggered_bootstrap.py, Rust backend, and R's did package. This gives Var(w) = 1.0 for consistency across all weight distributions.
- Add MPDTA-based strict R comparison tests (TestMPDTARComparison):
  - Export R's mpdta data to ensure identical input for Python and R
  - Test overall ATT, SE, and group-time effects with <1% tolerance
  - Validates methodology alignment with R's did::att_gt()

Note: pyproject.toml, test R suppressMessages(), and METHODOLOGY_REVIEW.md were already correct from previous commits.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
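The "<1% tolerance" criterion amounts to a relative-difference check; a hedged sketch (the estimate values and the helper name are illustrative, not taken from TestMPDTARComparison):

```python
# Hypothetical stand-ins: py_att from the Python estimator, r_att exported
# from R's did::att_gt() via jsonlite. These numbers are invented.
py_att, r_att = -0.0400, -0.0398

def within_rel_tol(a: float, b: float, tol: float = 0.01) -> bool:
    """The '<1% tolerance' criterion as a relative check: |a-b| <= tol*|b|."""
    return abs(a - b) <= tol * abs(b)

print(within_rel_tol(py_att, r_att))  # -> True for these illustrative values
```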
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Address PR #105 round 4 feedback:

Add verification note referencing fwildclusterboot's C++ source code as evidence that our Webb weights implementation (±√1.5, ±1, ±√0.5 with equal probabilities) matches established R packages. Also clarify why some documentation shows simplified values (±1.5, ±1, ±0.5) vs the actual sqrt values used in implementations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
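The distinction between the simplified documentation values and the sqrt values used in implementations is not cosmetic: only the sqrt values satisfy Var(w) = 1. A quick check:

```python
import numpy as np

# "Simplified" values sometimes shown in docs vs the actual implementation
# values; only the latter give unit variance under equal 1/6 probabilities.
simplified = np.array([-1.5, -1.0, -0.5, 0.5, 1.0, 1.5])
actual = np.array([-np.sqrt(1.5), -1.0, -np.sqrt(0.5),
                   np.sqrt(0.5), 1.0, np.sqrt(1.5)])

print(round(np.mean(simplified ** 2), 4))  # -> 1.1667  (7/6, too large)
print(round(np.mean(actual ** 2), 4))      # -> 1.0
```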
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Remove `-m 'not slow'` filter from pytest addopts so slow tests (like the analytical-vs-bootstrap SE convergence test) run by default. Developers can still use `pytest -m 'not slow'` for faster local runs. Addresses PR #105 feedback about CI test coverage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
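A minimal sketch of the resulting pyproject.toml marker configuration (illustrative; the file's actual contents are not shown in this thread):

```toml
[tool.pytest.ini_options]
# No "-m 'not slow'" in addopts, so slow tests run by default (e.g. in CI).
# Run only slow tests with: pytest -m slow
# Skip them locally with:   pytest -m 'not slow'
markers = [
    "slow: thorough bootstrap tests; deselect with -m 'not slow'",
]
```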
- Add fwildclusterboot verification to Webb weights section in REGISTRY.md (authoritative methodology documentation, not just METHODOLOGY_REVIEW.md)
- Update R dependency check in conftest.py to also verify jsonlite package (used by methodology tests for JSON output from R)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
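The lazy R availability check with the jsonlite verification could look roughly like this (the fixture name `r_available` and the `DIFF_DIFF_R` variable come from earlier commits; the body is an assumption, not the repository's actual conftest.py):

```python
import os
import shutil
import subprocess

import pytest

@pytest.fixture(scope="session")
def r_available():
    """Skip R-dependent tests unless Rscript and required packages exist."""
    if os.environ.get("DIFF_DIFF_R") == "skip":
        pytest.skip("R tests disabled via DIFF_DIFF_R=skip")
    if shutil.which("Rscript") is None:
        pytest.skip("Rscript not found on PATH")
    # One subprocess checks both packages, including jsonlite
    check = ('quit(status = !all(c("did", "jsonlite") '
             '%in% rownames(installed.packages())))')
    if subprocess.run(["Rscript", "-e", check]).returncode != 0:
        pytest.skip("R packages did/jsonlite not installed")
    return True
```

Because the subprocess only runs when a test actually requests the fixture, test collection stays fast.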
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Summary
Methodology references (required if estimator / math changes)
Validation
tests/test_methodology_callaway.py (new - 46 tests)

Security / privacy
Generated with Claude Code