
Sun-Abraham methodology review: fix 5 issues, add R benchmarks#153

Merged
igerber merged 6 commits into main from sa-method-review on Feb 16, 2026

Conversation


@igerber igerber commented Feb 16, 2026

Summary

  • Complete methodology review of the Sun-Abraham interaction-weighted estimator against R's fixest::sunab()
  • Fix 5 implementation issues: df_adjustment for absorbed FE, NaN return for no post-treatment effects, deprecation warnings for unused params, remove hardcoded event-time truncation, add variance fallback warning
  • Add R and Python benchmark scripts with orchestrator integration; ATT matches R within machine precision, SE within 0.3%
  • Add 7 new methodology verification tests (43/43 SA tests pass, full suite clean)

Methodology references (required if estimator / math changes)

  • Method name(s): Sun & Abraham (2021) interaction-weighted estimator
  • Paper / source link(s): Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2), 175-199.
  • Any intentional deviations from the source (and why): Aggregated event study and overall ATT p-values use normal distribution (via compute_p_value()) rather than t-distribution. This is asymptotically equivalent and standard for delta-method-aggregated quantities. R's fixest uses t-distribution at all levels, so aggregated p-values may differ slightly for small samples.
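The deviation above can be sketched as follows. This is an illustrative comparison, not the codebase's actual `compute_p_value()` implementation; function names here are hypothetical:

```python
from scipy import stats

def p_value_normal(estimate: float, se: float) -> float:
    """Two-sided p-value from the standard normal, as used here for
    delta-method-aggregated quantities (asymptotically valid)."""
    z = abs(estimate / se)
    return 2 * (1 - stats.norm.cdf(z))

def p_value_t(estimate: float, se: float, df: int) -> float:
    """Two-sided p-value from the t-distribution, as fixest reports
    at every aggregation level."""
    t = abs(estimate / se)
    return 2 * (1 - stats.t.cdf(t, df))
```

For the same test statistic the t-based p-value is slightly larger at small df and converges to the normal one as df grows, which is why aggregated p-values can differ slightly from R in small samples.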

Validation

  • Tests added/updated: tests/test_sun_abraham.py — 7 new methodology tests in TestSunAbrahamMethodology class
  • Backtest / simulation / notebook evidence: R benchmark comparison at small (200 units) and 1k (1000 units) scales:
    • ATT relative difference: <1e-11 (machine precision)
    • SE relative difference: 0.1-0.3%
    • Event study correlation: 1.000000
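The parity figures above are relative differences against the R reference values, presumably computed along these lines (a sketch, assuming the benchmark scripts compare this way):

```python
def rel_diff(python_val: float, r_val: float) -> float:
    """Relative difference of a Python estimate against the R reference."""
    return abs(python_val - r_val) / abs(r_val)

# ATT parity means rel_diff(att_py, att_r) < 1e-11 (machine precision);
# SE parity means rel_diff(se_py, se_r) < 3e-3 (0.3%).
```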

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Complete methodology review of the Sun-Abraham interaction-weighted estimator
against R's fixest::sunab(). ATT matches within machine precision, SE within 0.3%.

Fixes:
- Add df_adjustment for absorbed unit/time fixed effects in saturated regression
- Return (NaN, NaN) instead of (0, 0) when no post-treatment effects exist
- Add FutureWarning for unimplemented min_pre/post_periods parameters
- Remove hardcoded [-20, 20] event-time truncation to match R behavior
- Add UserWarning for variance fallback path using simplified variance
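A minimal sketch of the second fix (hypothetical helper; the real aggregation also carries delta-method standard errors):

```python
import numpy as np

def overall_att(post_effects: dict, post_weights: dict):
    """Return (NaN, NaN) rather than (0, 0) when no post-treatment
    event times exist, so callers see 'undefined' instead of a
    spuriously precise zero effect."""
    if not post_effects:
        return (np.nan, np.nan)
    att = sum(post_weights[e] * post_effects[e] for e in post_effects)
    return (att, np.nan)  # SE computation omitted in this sketch
```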

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Overall assessment: ⚠️ Needs changes

Executive summary

  • SunAbraham changes largely align with docs/methodology/REGISTRY.md (no event-time cap, NaN overall ATT when no post effects, warnings, variance fallback).
  • P1: df adjustment for absorbed FE appears off by one given the no-intercept saturated regression, making cohort-level p-values/CIs slightly too liberal.
  • Tests add coverage for new behaviors, but the df-adjustment test does not actually assert the df change.

Methodology
Method affected: SunAbraham (Sun & Abraham 2021). Registry cross-check: event-time range, NaN overall ATT with no post effects, deprecation warnings, and variance fallback match the documented edge cases.

  • Severity: P1. Impact: Cohort-level inference uses LinearRegression.df_; with no intercept in the saturated regression, df_adjustment = n_units + n_times - 2 overstates df by 1, leading to slightly smaller p-values and narrower CIs. Location: diff_diff/sun_abraham.py:L785-L795. Concrete fix: use df_adj = n_units_fe + n_times_fe - 1, or add an explicit intercept column (then keep -2) and document the choice.
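The off-by-one can be seen by counting absorbed parameters (a sketch; names are illustrative, not the module's API):

```python
def residual_df(n_obs: int, n_coefs: int, n_units: int, n_times: int,
                has_intercept: bool) -> int:
    """Residual degrees of freedom after absorbing unit and time
    fixed effects. With an intercept, one unit dummy and one time
    dummy are both redundant, so the FE absorb n_units + n_times - 2
    extra df; without an intercept only the unit/time collinearity
    remains, so they absorb n_units + n_times - 1."""
    fe_df = n_units + n_times - (2 if has_intercept else 1)
    return n_obs - n_coefs - fe_df
```

Either convention yields the same residual df once the intercept is counted consistently, which is the reviewer's point: the no-intercept saturated regression was using the with-intercept adjustment.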

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: test_df_adjustment_sets_regression_df doesn’t validate the df change, so regressions in df adjustment could slip. Location: tests/test_sun_abraham.py:L1192-L1223. Concrete fix: assert df directly (e.g., expose regression df in results or monkeypatch LinearRegression.fit to capture df_), or compare inference against a known df-adjusted baseline.

…rtion

- Fix df_adjustment from n_units + n_times - 2 to n_units + n_times - 1
  (SunAbraham has no intercept column unlike TWFE)
- Rewrite test to monkeypatch LinearRegression.fit and assert exact df values
- Update METHODOLOGY_REVIEW.md with corrected formula and rationale

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 62564e420142d30107cd2fa7b2efa517f0fa4f5e


Overall assessment: ⚠️ Needs changes

Executive summary:

  • SunAbraham estimator changes (df adjustment for absorbed FE, no event-time cap, NaN overall ATT/SE when no post effects, variance fallback warning) generally align with the Methodology Registry.
  • P1: IW weights are computed/tested as cohort-size shares, but the Registry specifies event-time sample shares; this is a methodology mismatch for unbalanced panels.
  • Docstring still describes min_pre_periods/min_post_periods as active filters even though they’re now deprecated and ignored.
  • Tests not run (not requested).

Methodology

  • P1 — Impact: Interaction weights are calculated from cohort sizes (n_g) rather than event-time sample shares (n_{g,e}) defined in the Methodology Registry; this mis-weights event-study and overall ATT when panels are unbalanced, and the new test now codifies the cohort-size behavior. Fix: compute weights using counts at each event time (n_{g,e}) from df[(first_treat==g) & (_rel_time==e)], or explicitly require balanced panels and update the Registry/test to match that assumption. References: diff_diff/sun_abraham.py:858, diff_diff/sun_abraham.py:872, tests/test_sun_abraham.py:1284, docs/methodology/REGISTRY.md:412.
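The Registry formula under the reviewer's suggested fix can be sketched with pandas, assuming the column names `first_treat` and `_rel_time` used in the diff:

```python
import pandas as pd

def iw_weights_at(df: pd.DataFrame, event_time: int) -> pd.Series:
    """IW weight of cohort g at event time e: the share of observations
    at that event time, n_{g,e} / sum_g' n_{g',e}. This equals the
    cohort-size share n_g / sum_g' n_g' only when the panel is balanced."""
    at_e = df[(df["first_treat"] > 0) & (df["_rel_time"] == event_time)]
    n_ge = at_e.groupby("first_treat").size()  # n_{g,e}
    return n_ge / n_ge.sum()
```

On an unbalanced panel where cohort 3 drops out at event time 2, the two weighting schemes visibly diverge, which is what the new test codifies.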

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P3 — Impact: fit() docstring still presents min_pre_periods/min_post_periods as functional filters, but the implementation now ignores them and emits FutureWarning, which can mislead users and downstream docs. Fix: update the docstring (and any public API docs) to mark both parameters deprecated/ignored or remove them at the next breaking release. References: diff_diff/sun_abraham.py:458, diff_diff/sun_abraham.py:473.
  • Tests not run (not requested).

…tring

- Fix IW weights to use n_{g,e} (per-event-time observation counts) instead
  of n_g (total cohort sizes), matching REGISTRY formula for unbalanced panels
- Add unbalanced panel test verifying n_{g,e} != n_g distinction
- Update existing IW weights test to verify against event-time counts
- Mark min_pre_periods/min_post_periods as deprecated in fit() docstring
- Update METHODOLOGY_REVIEW.md verified-component line and corrections

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Feb 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 89e8e72672bbc258738856173fc2c82a475dc610


Overall assessment: ⛔ Blocker

Executive Summary

  • P0: Sun-Abraham now treats first_treat=np.inf never-treated units as a treated cohort when the event-time cap is removed, producing -inf event times and invalid weights.
  • Methodology cross-check: Sun-Abraham IW weights and no-cap event-time range align with docs/methodology/REGISTRY.md, but the np.inf path violates the “never-treated as controls” requirement.
  • Tests added for new behaviors, but no coverage for the explicitly supported np.inf encoding of never-treated units.
  • Tests not run (not requested).

Methodology

  • P0: first_treat=np.inf (explicitly allowed) is included in treatment_groups and _rel_time (both use > 0), which becomes -inf and is now included after removing the event-time cap. This violates the control-group definition and can bias estimates or create nonsensical event-study coefficients. Impact: incorrect Sun-Abraham estimates or crashes for valid inputs. Fix: treat np.inf as never-treated everywhere you define treated cohorts and relative times: use np.isfinite(first_treat) & (first_treat > 0) for treatment_groups, _rel_time, event_time_counts, and post_weights, or normalize np.inf to 0 during data prep. Add a regression test. diff_diff/sun_abraham.py:L505-L564, diff_diff/sun_abraham.py:L859-L877, diff_diff/sun_abraham.py:L946-L949
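The `np.isfinite` filter the review proposes can be sketched as follows (hypothetical helpers, not the module's actual internals):

```python
import numpy as np
import pandas as pd

def treated_mask(first_treat: pd.Series) -> pd.Series:
    """A unit is treated only if its first-treatment period is finite
    and positive; np.inf (never-treated) fails the np.isfinite check
    and stays in the control group."""
    return np.isfinite(first_treat) & (first_treat > 0)

def treatment_groups(first_treat: pd.Series) -> list:
    """Treated cohorts: the distinct finite, positive first-treatment periods."""
    return sorted(first_treat[treated_mask(first_treat)].unique())
```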

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P2: Missing test for never-treated encoded as np.inf, which is documented as valid input and now interacts with uncapped event-time range. Impact: regression can slip in undetected. Fix: add a test in TestSunAbrahamMethodology that sets never-treated to np.inf, asserts no ±inf event times, and ensures no inf cohort appears in weights. tests/test_sun_abraham.py:L1070-L1185
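The suggested regression test might look like this standalone sketch; the real test would exercise `SunAbraham.fit()` rather than the inline normalization shown here:

```python
import numpy as np
import pandas as pd

def test_inf_never_treated_has_no_inf_event_times():
    df = pd.DataFrame({
        "unit": [1, 1, 2, 2],
        "time": [1, 2, 1, 2],
        "first_treat": [2.0, 2.0, np.inf, np.inf],
    })
    # Normalize the np.inf encoding before deriving relative times.
    df["first_treat"] = df["first_treat"].replace(np.inf, 0.0)
    treated = df["first_treat"] > 0
    rel = df.loc[treated, "time"] - df.loc[treated, "first_treat"]
    assert not np.isinf(rel).any()               # no ±inf event times
    assert np.inf not in set(df["first_treat"])  # no inf cohort
```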

first_treat=np.inf (documented as valid for never-treated units) passed
all `> 0` checks in sun_abraham.py and staggered.py, causing np.inf to
be treated as a cohort with -inf relative times. Fixed by normalizing
np.inf to 0 immediately after computing _never_treated in both estimators.
Added regression tests verifying equivalence between encodings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Feb 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 93e9cdbd9ab5aeac0dd4a1c085d18dc4daa3be3a


Overall assessment: ⚠️ Needs changes

Executive summary

  • P1: np.inf normalization happens after treatment_groups is derived, so never-treated units can still be treated as cohorts, which can bypass the “no treated units” guard and/or leak controls into results.groups.
  • Methodology changes otherwise align with REGISTRY (event-time weights n_{g,e}, no truncation, NaN overall ATT when no post effects).
  • Tests cover most new behavior, but miss the np.inf-only/no-treated case and results.groups contamination.

Methodology

  • Severity: P1. Impact: If never-treated units are encoded as np.inf, they can still be included in treatment_groups because normalization happens afterward. In SunAbraham this can skip the “no treated units” error and later hit min(all_rel_times_sorted) on an empty set; in both estimators it can leave np.inf in results.groups, violating the control/treated cohort definition. Fix: normalize first_treat (set np.inf → 0) before computing treatment_groups, or recompute and explicitly filter finite > 0 values after normalization. diff_diff/sun_abraham.py:L505-L556, diff_diff/staggered.py:L643-L650.
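The ordering bug and its fix, sketched with illustrative names:

```python
import numpy as np
import pandas as pd

def derive_treatment_groups(first_treat: pd.Series) -> list:
    """Normalization must come first: if np.inf -> 0 runs after the
    `g > 0` filter, inf has already leaked into the cohort list and
    can mask the 'No treated units found' guard."""
    ft = first_treat.replace(np.inf, 0)               # normalize FIRST
    groups = sorted(g for g in ft.unique() if g > 0)  # then filter
    if not groups:
        raise ValueError("No treated units found")
    return groups
```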

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

  • Severity: P2. Impact: Current np.inf encoding tests don’t assert that results.groups excludes np.inf, nor do they cover the all-never-treated np.inf case that should trigger “No treated units found,” so the regression above wouldn’t be caught. Fix: extend the existing tests to assert np.inf is absent from results.groups and add a SunAbraham test for all-never-treated np.inf input raising ValueError. tests/test_sun_abraham.py:L1370-L1404, tests/test_staggered.py:L105-L130.

…_groups

The np.inf → 0 normalization was placed after treatment_groups was
computed, so np.inf passed the `g > 0` filter and leaked into treatment
cohorts. Reorder in both sun_abraham.py and staggered.py so
normalization precedes treatment_groups derivation. Add results.groups
assertion and all-never-treated ValueError test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 049a6ef95aa647a8ac4b4a28f7a7a77bf09d7dac


Overall assessment: ⚠️ Needs changes

Executive Summary

  • P0: Bootstrap inference can emit non-NaN overall p-values/CI even when overall ATT is NaN (no post-treatment effects), violating the Methodology Registry and the NaN propagation edge-case rule.
  • Methodology alignment improvements look correct: IW weights now use event-time sample shares and event-time range is no longer truncated, matching the registry and source expectations.
  • Tests cover the no-post-effects analytic path but miss the bootstrap path, leaving the regression above unguarded.

Methodology

  • Severity: P0. Impact: When there are no post-treatment effects, overall_att is NaN, but if n_bootstrap > 0 the bootstrap path overwrites overall_p_value/overall_conf_int with non-NaN values (e.g., 1/(B+1)), which is misleading and contradicts the registry’s “all inference fields NaN” rule for this edge case. Concrete fix: Short-circuit overall bootstrap inference when overall_att is not finite (or when post_effects is empty), or make _compute_bootstrap_pvalue/percentile CI return NaN when the original effect or bootstrap distribution is not finite; keep event-study bootstrap if desired. diff_diff/sun_abraham.py:L610-L646, diff_diff/sun_abraham.py:L920-L1159, diff_diff/sun_abraham.py:L1176-L1194
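The short-circuit fix, sketched with a hypothetical signature (the real code lives in `_run_bootstrap` and `_compute_bootstrap_pvalue`):

```python
import numpy as np

def bootstrap_overall(overall_att: float, boot_draws: np.ndarray,
                      alpha: float = 0.05):
    """When the point estimate is not finite (no post-treatment
    effects), every inference field must be NaN; otherwise compute
    the usual centered-bootstrap p-value and percentile CI."""
    if not np.isfinite(overall_att):
        return np.nan, (np.nan, np.nan)
    centered = np.abs(boot_draws - boot_draws.mean())
    p = (np.sum(centered >= abs(overall_att)) + 1) / (len(boot_draws) + 1)
    lo, hi = np.percentile(boot_draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return p, (lo, hi)
```

Without the `np.isfinite` guard, the add-one p-value formula would return 1/(B+1) even for a NaN estimate, which is exactly the misleading output the review flags.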

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • Severity: P2. Impact: The new no-post-effects test only exercises n_bootstrap=0, so the bootstrap NaN propagation bug can slip through. Concrete fix: Add a test variant with n_bootstrap > 0 asserting overall_att, overall_se, overall_p_value, and CI are NaN when there are no post-treatment effects. tests/test_sun_abraham.py:L1070-L1124

…as compat

- Guard overall bootstrap stats in _run_bootstrap() when ATT is NaN,
  preventing _compute_bootstrap_pvalue from returning 1/(B+1) instead of NaN
- Add test_no_post_effects_bootstrap_returns_nan for the bootstrap NaN path
- Cast first_treat to float before assigning np.inf in tests for pandas compat

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 90264644e97e9139d55899ae670ea16626fa62be


Overall assessment: ✅ Looks good

Executive Summary

  • Affected methods: SunAbraham (weights, event-time range, overall ATT/SE, df adjustment, variance fallback) and CallawaySantAnna (never-treated encoding); cross-checked against docs/methodology/REGISTRY.md and in-code docstrings with no mismatches found.
  • Edge-case handling improved (no-post-treatment NaNs, np.inf normalization) and tests cover these paths.
  • Benchmarks for R/Python parity are integrated cleanly into the runner.
  • Potential performance regression from unbounded event-time range; consider a warning/optional cap.

Methodology
No issues found. Affected methods: SunAbraham and CallawaySantAnna; registry/docstring alignment verified.

Code Quality
No issues found.

Performance

  • Severity: P3 | Impact: Removing the event-time truncation can materially increase the number of cohort×relative-time dummies for long panels, raising memory/time costs versus prior behavior. | Concrete fix: Consider an optional max_rel_time/event_time_cap parameter or a warning when (max_rel - min_rel) is large, while keeping the default uncapped for methodology parity. Location: diff_diff/sun_abraham.py:L541-L566.
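The suggested guard could be sketched as follows; the parameter names and the `threshold` value are hypothetical, not part of the library:

```python
import warnings

def warn_if_wide_event_window(min_rel: int, max_rel: int,
                              threshold: int = 50) -> None:
    """Keep the default uncapped for methodology parity, but flag
    panels whose event-time span implies many cohort-by-event-time
    dummies in the saturated regression."""
    span = max_rel - min_rel
    if span > threshold:
        warnings.warn(
            f"Event-time range spans {span} periods; the saturated "
            "regression will be large. Consider restricting event "
            "times if memory or runtime becomes a concern.",
            UserWarning,
        )
```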

Maintainability
No issues found.

Tech Debt
No issues found.

Security
No issues found.

Documentation/Tests
No issues found. Tests not run (not requested).

@igerber igerber merged commit fc7888b into main Feb 16, 2026
7 checks passed
@igerber igerber deleted the sa-method-review branch February 16, 2026 16:48