
Sun-Abraham methodology review: fix 5 issues, add R benchmarks#153

Merged
igerber merged 6 commits into main from sa-method-review on Feb 16, 2026

Conversation


@igerber igerber commented Feb 16, 2026

Summary

  • Complete methodology review of the Sun-Abraham interaction-weighted estimator against R's fixest::sunab()
  • Fix 5 implementation issues: df_adjustment for absorbed FE, NaN return for no post-treatment effects, deprecation warnings for unused params, remove hardcoded event-time truncation, add variance fallback warning
  • Add R and Python benchmark scripts with orchestrator integration; ATT matches R within machine precision, SE within 0.3%
  • Add 7 new methodology verification tests (43/43 SA tests pass, full suite clean)

Methodology references (required if estimator / math changes)

  • Method name(s): Sun & Abraham (2021) interaction-weighted estimator
  • Paper / source link(s): Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2), 175-199.
  • Any intentional deviations from the source (and why): Aggregated event study and overall ATT p-values use normal distribution (via compute_p_value()) rather than t-distribution. This is asymptotically equivalent and standard for delta-method-aggregated quantities. R's fixest uses t-distribution at all levels, so aggregated p-values may differ slightly for small samples.
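The deviation above can be sketched as follows. This is an illustrative comparison, not the codebase's actual `compute_p_value()` implementation; function names here are hypothetical:

```python
from scipy import stats

def p_value_normal(estimate: float, se: float) -> float:
    """Two-sided p-value from the standard normal, as used here for
    delta-method-aggregated quantities (asymptotically valid)."""
    z = abs(estimate / se)
    return 2 * (1 - stats.norm.cdf(z))

def p_value_t(estimate: float, se: float, df: int) -> float:
    """Two-sided p-value from the t-distribution, as fixest reports
    at every aggregation level."""
    t = abs(estimate / se)
    return 2 * (1 - stats.t.cdf(t, df))
```

For the same test statistic the t-based p-value is slightly larger at small df and converges to the normal one as df grows, which is why aggregated p-values can differ slightly from R in small samples.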

Validation

  • Tests added/updated: tests/test_sun_abraham.py — 7 new methodology tests in TestSunAbrahamMethodology class
  • Backtest / simulation / notebook evidence: R benchmark comparison at small (200 units) and 1k (1000 units) scales:
    • ATT relative difference: <1e-11 (machine precision)
    • SE relative difference: 0.1-0.3%
    • Event study correlation: 1.000000
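The parity figures above are relative differences against the R reference values, presumably computed along these lines (a sketch, assuming the benchmark scripts compare this way):

```python
def rel_diff(python_val: float, r_val: float) -> float:
    """Relative difference of a Python estimate against the R reference."""
    return abs(python_val - r_val) / abs(r_val)

# ATT parity means rel_diff(att_py, att_r) < 1e-11 (machine precision);
# SE parity means rel_diff(se_py, se_r) < 3e-3 (0.3%).
```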

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Complete methodology review of the Sun-Abraham interaction-weighted estimator
against R's fixest::sunab(). ATT matches within machine precision, SE within 0.3%.

Fixes:
- Add df_adjustment for absorbed unit/time fixed effects in saturated regression
- Return (NaN, NaN) instead of (0, 0) when no post-treatment effects exist
- Add FutureWarning for unimplemented min_pre/post_periods parameters
- Remove hardcoded [-20, 20] event-time truncation to match R behavior
- Add UserWarning for variance fallback path using simplified variance
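A minimal sketch of the second fix (hypothetical helper; the real aggregation also carries delta-method standard errors):

```python
import numpy as np

def overall_att(post_effects: dict, post_weights: dict):
    """Return (NaN, NaN) rather than (0, 0) when no post-treatment
    event times exist, so callers see 'undefined' instead of a
    spuriously precise zero effect."""
    if not post_effects:
        return (np.nan, np.nan)
    att = sum(post_weights[e] * post_effects[e] for e in post_effects)
    return (att, np.nan)  # SE computation omitted in this sketch
```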

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Overall assessment: ⚠️ Needs changes

Executive summary

  • SunAbraham changes largely align with docs/methodology/REGISTRY.md (no event-time cap, NaN overall ATT when no post effects, warnings, variance fallback).
  • P1: df adjustment for absorbed FE appears off by one given the no-intercept saturated regression, making cohort-level p-values/CIs slightly too liberal.
  • Tests add coverage for new behaviors, but the df-adjustment test does not actually assert the df change.

Methodology
Method affected: SunAbraham (Sun & Abraham 2021). Registry cross-check: event-time range, NaN overall ATT with no post effects, deprecation warnings, and variance fallback match the documented edge cases.

  • Severity: P1. Impact: Cohort-level inference uses LinearRegression.df_; with no intercept in the saturated regression, df_adjustment = n_units + n_times - 2 overstates df by 1, leading to slightly smaller p-values and narrower CIs. Location: diff_diff/sun_abraham.py:L785-L795. Concrete fix: use df_adj = n_units_fe + n_times_fe - 1, or add an explicit intercept column (then keep -2) and document the choice.
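The off-by-one can be seen by counting absorbed parameters (a sketch; names are illustrative, not the module's API):

```python
def residual_df(n_obs: int, n_coefs: int, n_units: int, n_times: int,
                has_intercept: bool) -> int:
    """Residual degrees of freedom after absorbing unit and time
    fixed effects. With an intercept, one unit dummy and one time
    dummy are both redundant, so the FE absorb n_units + n_times - 2
    extra df; without an intercept only the unit/time collinearity
    remains, so they absorb n_units + n_times - 1."""
    fe_df = n_units + n_times - (2 if has_intercept else 1)
    return n_obs - n_coefs - fe_df
```

Either convention yields the same residual df once the intercept is counted consistently, which is the reviewer's point: the no-intercept saturated regression was using the with-intercept adjustment.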

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: test_df_adjustment_sets_regression_df doesn’t validate the df change, so regressions in df adjustment could slip. Location: tests/test_sun_abraham.py:L1192-L1223. Concrete fix: assert df directly (e.g., expose regression df in results or monkeypatch LinearRegression.fit to capture df_), or compare inference against a known df-adjusted baseline.

…rtion

- Fix df_adjustment from n_units + n_times - 2 to n_units + n_times - 1
  (SunAbraham has no intercept column unlike TWFE)
- Rewrite test to monkeypatch LinearRegression.fit and assert exact df values
- Update METHODOLOGY_REVIEW.md with corrected formula and rationale

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 62564e420142d30107cd2fa7b2efa517f0fa4f5e


Overall assessment: ⚠️ Needs changes

Executive summary:

  • SunAbraham estimator changes (df adjustment for absorbed FE, no event-time cap, NaN overall ATT/SE when no post effects, variance fallback warning) generally align with the Methodology Registry.
  • P1: IW weights are computed/tested as cohort-size shares, but the Registry specifies event-time sample shares; this is a methodology mismatch for unbalanced panels.
  • Docstring still describes min_pre_periods/min_post_periods as active filters even though they’re now deprecated and ignored.
  • Tests not run (not requested).

Methodology

  • P1 — Impact: Interaction weights are calculated from cohort sizes (n_g) rather than event-time sample shares (n_{g,e}) defined in the Methodology Registry; this mis-weights event-study and overall ATT when panels are unbalanced, and the new test now codifies the cohort-size behavior. Fix: compute weights using counts at each event time (n_{g,e}) from df[(first_treat==g) & (_rel_time==e)], or explicitly require balanced panels and update the Registry/test to match that assumption. References: diff_diff/sun_abraham.py:858, diff_diff/sun_abraham.py:872, tests/test_sun_abraham.py:1284, docs/methodology/REGISTRY.md:412.
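The Registry formula under the reviewer's suggested fix can be sketched with pandas, assuming the column names `first_treat` and `_rel_time` used in the diff:

```python
import pandas as pd

def iw_weights_at(df: pd.DataFrame, event_time: int) -> pd.Series:
    """IW weight of cohort g at event time e: the share of observations
    at that event time, n_{g,e} / sum_g' n_{g',e}. This equals the
    cohort-size share n_g / sum_g' n_g' only when the panel is balanced."""
    at_e = df[(df["first_treat"] > 0) & (df["_rel_time"] == event_time)]
    n_ge = at_e.groupby("first_treat").size()  # n_{g,e}
    return n_ge / n_ge.sum()
```

On an unbalanced panel where cohort 3 drops out at event time 2, the two weighting schemes visibly diverge, which is what the new test codifies.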

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P3 — Impact: fit() docstring still presents min_pre_periods/min_post_periods as functional filters, but the implementation now ignores them and emits FutureWarning, which can mislead users and downstream docs. Fix: update the docstring (and any public API docs) to mark both parameters deprecated/ignored or remove them at the next breaking release. References: diff_diff/sun_abraham.py:458, diff_diff/sun_abraham.py:473.
  • Tests not run (not requested).

…tring

- Fix IW weights to use n_{g,e} (per-event-time observation counts) instead
  of n_g (total cohort sizes), matching REGISTRY formula for unbalanced panels
- Add unbalanced panel test verifying n_{g,e} != n_g distinction
- Update existing IW weights test to verify against event-time counts
- Mark min_pre_periods/min_post_periods as deprecated in fit() docstring
- Update METHODOLOGY_REVIEW.md verified-component line and corrections

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Feb 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 89e8e72672bbc258738856173fc2c82a475dc610


Overall assessment: ⛔ Blocker

Executive Summary

  • P0: Sun-Abraham now treats first_treat=np.inf never-treated units as a treated cohort when the event-time cap is removed, producing -inf event times and invalid weights.
  • Methodology cross-check: Sun-Abraham IW weights and no-cap event-time range align with docs/methodology/REGISTRY.md, but the np.inf path violates the “never-treated as controls” requirement.
  • Tests added for new behaviors, but no coverage for the explicitly supported np.inf encoding of never-treated units.
  • Tests not run (not requested).

Methodology

  • P0: first_treat=np.inf (explicitly allowed) is included in treatment_groups and _rel_time (both use > 0), which becomes -inf and is now included after removing the event-time cap. This violates the control-group definition and can bias estimates or create nonsensical event-study coefficients. Impact: incorrect Sun-Abraham estimates or crashes for valid inputs. Fix: treat np.inf as never-treated everywhere you define treated cohorts and relative times: use np.isfinite(first_treat) & (first_treat > 0) for treatment_groups, _rel_time, event_time_counts, and post_weights, or normalize np.inf to 0 during data prep. Add a regression test. diff_diff/sun_abraham.py:L505-L564, diff_diff/sun_abraham.py:L859-L877, diff_diff/sun_abraham.py:L946-L949
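The `np.isfinite` filter the review proposes can be sketched as follows (hypothetical helpers, not the module's actual internals):

```python
import numpy as np
import pandas as pd

def treated_mask(first_treat: pd.Series) -> pd.Series:
    """A unit is treated only if its first-treatment period is finite
    and positive; np.inf (never-treated) fails the np.isfinite check
    and stays in the control group."""
    return np.isfinite(first_treat) & (first_treat > 0)

def treatment_groups(first_treat: pd.Series) -> list:
    """Treated cohorts: the distinct finite, positive first-treatment periods."""
    return sorted(first_treat[treated_mask(first_treat)].unique())
```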

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P2: Missing test for never-treated encoded as np.inf, which is documented as valid input and now interacts with uncapped event-time range. Impact: regression can slip in undetected. Fix: add a test in TestSunAbrahamMethodology that sets never-treated to np.inf, asserts no ±inf event times, and ensures no inf cohort appears in weights. tests/test_sun_abraham.py:L1070-L1185
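The suggested regression test might look like this standalone sketch; the real test would exercise `SunAbraham.fit()` rather than the inline normalization shown here:

```python
import numpy as np
import pandas as pd

def test_inf_never_treated_has_no_inf_event_times():
    df = pd.DataFrame({
        "unit": [1, 1, 2, 2],
        "time": [1, 2, 1, 2],
        "first_treat": [2.0, 2.0, np.inf, np.inf],
    })
    # Normalize the np.inf encoding before deriving relative times.
    df["first_treat"] = df["first_treat"].replace(np.inf, 0.0)
    treated = df["first_treat"] > 0
    rel = df.loc[treated, "time"] - df.loc[treated, "first_treat"]
    assert not np.isinf(rel).any()               # no ±inf event times
    assert np.inf not in set(df["first_treat"])  # no inf cohort
```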

first_treat=np.inf (documented as valid for never-treated units) passed
all `> 0` checks in sun_abraham.py and staggered.py, causing np.inf to
be treated as a cohort with -inf relative times. Fixed by normalizing
np.inf to 0 immediately after computing _never_treated in both estimators.
Added regression tests verifying equivalence between encodings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Feb 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 93e9cdbd9ab5aeac0dd4a1c085d18dc4daa3be3a


Overall assessment: ⚠️ Needs changes

Executive summary

  • P1: np.inf normalization happens after treatment_groups is derived, so never-treated units can still be treated as cohorts, which can bypass the “no treated units” guard and/or leak controls into results.groups.
  • Methodology changes otherwise align with REGISTRY (event-time weights n_{g,e}, no truncation, NaN overall ATT when no post effects).
  • Tests cover most new behavior, but miss the np.inf-only/no-treated case and results.groups contamination.

Methodology

  • Severity: P1. Impact: If never-treated units are encoded as np.inf, they can still be included in treatment_groups because normalization happens afterward. In SunAbraham this can skip the “no treated units” error and later hit min(all_rel_times_sorted) on an empty set; in both estimators it can leave np.inf in results.groups, violating the control/treated cohort definition. Fix: normalize first_treat (set np.inf → 0) before computing treatment_groups, or recompute and explicitly filter finite > 0 values after normalization. diff_diff/sun_abraham.py:L505-L556, diff_diff/staggered.py:L643-L650.
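The ordering bug and its fix, sketched with illustrative names:

```python
import numpy as np
import pandas as pd

def derive_treatment_groups(first_treat: pd.Series) -> list:
    """Normalization must come first: if np.inf -> 0 runs after the
    `g > 0` filter, inf has already leaked into the cohort list and
    can mask the 'No treated units found' guard."""
    ft = first_treat.replace(np.inf, 0)               # normalize FIRST
    groups = sorted(g for g in ft.unique() if g > 0)  # then filter
    if not groups:
        raise ValueError("No treated units found")
    return groups
```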

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

  • Severity: P2. Impact: Current np.inf encoding tests don’t assert that results.groups excludes np.inf, nor do they cover the all-never-treated np.inf case that should trigger “No treated units found,” so the regression above wouldn’t be caught. Fix: extend the existing tests to assert np.inf is absent from results.groups and add a SunAbraham test for all-never-treated np.inf input raising ValueError. tests/test_sun_abraham.py:L1370-L1404, tests/test_staggered.py:L105-L130.

…_groups

The np.inf → 0 normalization was placed after treatment_groups was
computed, so np.inf passed the `g > 0` filter and leaked into treatment
cohorts. Reorder in both sun_abraham.py and staggered.py so
normalization precedes treatment_groups derivation. Add results.groups
assertion and all-never-treated ValueError test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 049a6ef95aa647a8ac4b4a28f7a7a77bf09d7dac


Overall assessment: ⚠️ Needs changes

Executive Summary

  • P0: Bootstrap inference can emit non-NaN overall p-values/CI even when overall ATT is NaN (no post-treatment effects), violating the Methodology Registry and the NaN propagation edge-case rule.
  • Methodology alignment improvements look correct: IW weights now use event-time sample shares and event-time range is no longer truncated, matching the registry and source expectations.
  • Tests cover the no-post-effects analytic path but miss the bootstrap path, leaving the regression above unguarded.

Methodology

  • Severity: P0. Impact: When there are no post-treatment effects, overall_att is NaN, but if n_bootstrap > 0 the bootstrap path overwrites overall_p_value/overall_conf_int with non-NaN values (e.g., 1/(B+1)), which is misleading and contradicts the registry’s “all inference fields NaN” rule for this edge case. Concrete fix: Short-circuit overall bootstrap inference when overall_att is not finite (or when post_effects is empty), or make _compute_bootstrap_pvalue/percentile CI return NaN when the original effect or bootstrap distribution is not finite; keep event-study bootstrap if desired. diff_diff/sun_abraham.py:L610-L646, diff_diff/sun_abraham.py:L920-L1159, diff_diff/sun_abraham.py:L1176-L1194
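The short-circuit fix, sketched with a hypothetical signature (the real code lives in `_run_bootstrap` and `_compute_bootstrap_pvalue`):

```python
import numpy as np

def bootstrap_overall(overall_att: float, boot_draws: np.ndarray,
                      alpha: float = 0.05):
    """When the point estimate is not finite (no post-treatment
    effects), every inference field must be NaN; otherwise compute
    the usual centered-bootstrap p-value and percentile CI."""
    if not np.isfinite(overall_att):
        return np.nan, (np.nan, np.nan)
    centered = np.abs(boot_draws - boot_draws.mean())
    p = (np.sum(centered >= abs(overall_att)) + 1) / (len(boot_draws) + 1)
    lo, hi = np.percentile(boot_draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return p, (lo, hi)
```

Without the `np.isfinite` guard, the add-one p-value formula would return 1/(B+1) even for a NaN estimate, which is exactly the misleading output the review flags.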

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • Severity: P2. Impact: The new no-post-effects test only exercises n_bootstrap=0, so the bootstrap NaN propagation bug can slip through. Concrete fix: Add a test variant with n_bootstrap > 0 asserting overall_att, overall_se, overall_p_value, and CI are NaN when there are no post-treatment effects. tests/test_sun_abraham.py:L1070-L1124

…as compat

- Guard overall bootstrap stats in _run_bootstrap() when ATT is NaN,
  preventing _compute_bootstrap_pvalue from returning 1/(B+1) instead of NaN
- Add test_no_post_effects_bootstrap_returns_nan for the bootstrap NaN path
- Cast first_treat to float before assigning np.inf in tests for pandas compat

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 90264644e97e9139d55899ae670ea16626fa62be


Overall assessment: ✅ Looks good

Executive Summary

  • Affected methods: SunAbraham (weights, event-time range, overall ATT/SE, df adjustment, variance fallback) and CallawaySantAnna (never-treated encoding); cross-checked against docs/methodology/REGISTRY.md and in-code docstrings with no mismatches found.
  • Edge-case handling improved (no-post-treatment NaNs, np.inf normalization) and tests cover these paths.
  • Benchmarks for R/Python parity are integrated cleanly into the runner.
  • Potential performance regression from unbounded event-time range; consider a warning/optional cap.

Methodology
No issues found. Affected methods: SunAbraham and CallawaySantAnna; registry/docstring alignment verified.

Code Quality
No issues found.

Performance

  • Severity: P3 | Impact: Removing the event-time truncation can materially increase the number of cohort×relative-time dummies for long panels, raising memory/time costs versus prior behavior. | Concrete fix: Consider an optional max_rel_time/event_time_cap parameter or a warning when (max_rel - min_rel) is large, while keeping the default uncapped for methodology parity. Location: diff_diff/sun_abraham.py:L541-L566.
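The suggested guard could be sketched as follows; the parameter names and the `threshold` value are hypothetical, not part of the library:

```python
import warnings

def warn_if_wide_event_window(min_rel: int, max_rel: int,
                              threshold: int = 50) -> None:
    """Keep the default uncapped for methodology parity, but flag
    panels whose event-time span implies many cohort-by-event-time
    dummies in the saturated regression."""
    span = max_rel - min_rel
    if span > threshold:
        warnings.warn(
            f"Event-time range spans {span} periods; the saturated "
            "regression will be large. Consider restricting event "
            "times if memory or runtime becomes a concern.",
            UserWarning,
        )
```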

Maintainability
No issues found.

Tech Debt
No issues found.

Security
No issues found.

Documentation/Tests
No issues found. Tests not run (not requested).

@igerber igerber merged commit fc7888b into main Feb 16, 2026
7 checks passed
@igerber igerber deleted the sa-method-review branch February 16, 2026 16:48