
Revise survey tutorial to show the cost of ignoring survey design#277

Merged
igerber merged 3 commits into main from survey-tutorial-pain on Apr 6, 2026
Conversation

@igerber
Owner

@igerber igerber commented Apr 6, 2026

Summary

  • Enhance Section 1 of 16_survey_did.ipynb with evidence-based motivation using the research-grade DGP (informative_sampling, heterogeneous_te_by_strata, return_true_population_att)
  • Add ground truth comparison showing naive ATT is ~22% biased vs known population ATT, while survey-weighted ATT is ~10% biased
  • Add DEFF diagnostic cell connecting SE underestimation (~1.8x) to effective sample size (~474 vs 1,600 nominal)
  • Add 200-iteration MC simulation showing that naive 95% CIs cover the truth only ~66% of the time (vs ~89% for survey) and falsely detect pre-trends in ~67% of draws (vs ~17% for survey)
  • Rewrite Cell 7 markdown and add event study truth line for visual grounding
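
The coverage arithmetic behind the Monte Carlo bullet can be sketched with a toy simulation (pure numpy, not the notebook's actual survey DGP): when the reported SE understates the estimator's true sampling variability, nominal 95% intervals cover the truth well below 95%.

```python
import numpy as np

# Toy illustration of the coverage computation (NOT the notebook's survey
# DGP): an understated SE makes nominal 95% CIs cover the truth far less often.
rng = np.random.default_rng(0)
true_att = 2.0
n_draws = 200
true_sd = 0.5      # actual sampling SD of the estimator
se_hat = 0.3       # understated SE, as when survey design is ignored

covered = 0
for _ in range(n_draws):
    att_hat = true_att + rng.normal(0.0, true_sd)
    lo, hi = att_hat - 1.96 * se_hat, att_hat + 1.96 * se_hat
    covered += lo <= true_att <= hi

coverage = covered / n_draws
print(f"empirical coverage of nominal 95% CIs: {coverage:.2f}")
```

With the SE understated by a factor of 0.3/0.5, the theoretical coverage drops to roughly 76%, which is the same mechanism behind the ~66% naive coverage the notebook reports.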

Methodology references (required if estimator / math changes)

  • N/A - no methodology changes, tutorial content only

Validation

  • Tests added/updated: No test changes (tutorial notebook only)
  • Notebook executes end-to-end via jupyter nbconvert --execute without errors
  • Quantitative checks verified: bias > 15%, SE ratio > 1.5, naive coverage < 75%, survey coverage > 82%, naive false pre-trend > 50%, survey false pre-trend < 30%
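
The quantitative checks read naturally as assertions. A hypothetical sketch, where the variable values merely stand in for the notebook's computed results and the thresholds mirror the validation checklist above:

```python
# Hypothetical values standing in for the notebook's computed results;
# the thresholds mirror the PR's validation checklist.
naive_bias_pct = 22.0        # |naive ATT - true ATT| / |true ATT| * 100
se_ratio = 1.8               # survey SE / naive SE
naive_coverage = 0.66
survey_coverage = 0.89
naive_false_pretrend = 0.67
survey_false_pretrend = 0.17

assert naive_bias_pct > 15
assert se_ratio > 1.5
assert naive_coverage < 0.75
assert survey_coverage > 0.82
assert naive_false_pretrend > 0.50
assert survey_false_pretrend < 0.30
print("all quantitative checks pass")
```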

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Enhance Section 1 of 16_survey_did.ipynb with evidence-based motivation
using the research-grade DGP (informative_sampling, heterogeneous_te_by_strata,
return_true_population_att). Adds 5 new cells showing:

- Ground truth comparison: naive ATT is ~22% biased vs known population ATT
- DEFF diagnostics connecting SE underestimation to effective sample size
- 200-iteration MC simulation proving naive 95% CIs cover truth only ~66%
  and falsely detect pre-trends in ~67% of draws

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 6, 2026

Overall Assessment

⚠️ Needs changes

Executive Summary

  • P1 The new Section 1 event-study figure plots df.attrs["dgp_truth"]["population_att"] as a horizontal “truth” line, but that value is the overall population ATT, not the event-study estimand ATT(e). That is methodologically wrong for an event-study plot and is visibly wrong on pre-treatment horizons.
  • The new DGP switches themselves (informative_sampling, heterogeneous_te_by_strata, return_true_population_att) are documented in the Methodology Registry and match the implementation.
  • No estimator code, variance code, defaults, or security-sensitive code changed in this PR.
  • I reviewed this source-only. I could not re-run the notebook in this sandbox because the available Python lacks the notebook dependencies (numpy/pandas).

Methodology

  • P1 Impact: The new truth overlay in Section 1 conflates the library’s simple overall ATT with its event-study ATT by horizon. The registry explicitly separates simple aggregation from event-study aggregation, so this is not a documented deviation. population_att is stored as the overall treated-observation-weighted ATT, while event-study aggregation targets ATT(e) by relative time. The notebook then draws that overall value as a flat line across all event times, including pre-treatment periods where true_effect is zero by construction. That can make correct horizon-specific variation look like estimator bias and teaches the wrong estimand. Concrete fix: replace the horizontal line with a horizon-specific truth series consistent with ATT(e) (0 for e < 0; for e >= 0, the survey-weighted mean of true_effect among observations with period - first_treat == e), or remove the truth line and keep the overall-truth comparison only in the scalar ATT block. Location: docs/tutorials/16_survey_did.ipynb:L215-L239, docs/methodology/REGISTRY.md:L308-L310, diff_diff/prep_dgp.py:L1287-L1293, diff_diff/prep_dgp.py:L1614-L1625, diff_diff/prep_dgp.py:L1670-L1715, diff_diff/staggered_aggregation.py:L543-L653.
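
The concrete fix described above can be sketched as a small pandas helper. Column names (`period`, `first_treat`, `true_effect`, `weight`) are assumptions mirroring the DGP output, not a documented API:

```python
import numpy as np
import pandas as pd

def event_study_truth(df, weight_col="weight"):
    """Per-horizon truth series ATT(e): 0 for pre-treatment horizons, and
    the survey-weighted mean of true_effect among treated observations at
    each post-treatment horizon. Column names are assumptions mirroring
    the DGP output, not a documented API."""
    treated = df[df["first_treat"] > 0].copy()
    treated["e"] = treated["period"] - treated["first_treat"]
    # Post-treatment: survey-weighted mean of the true effect at each horizon.
    att_e = {
        int(e): np.average(g["true_effect"], weights=g[weight_col])
        for e, g in treated[treated["e"] >= 0].groupby("e")
    }
    # Pre-treatment: ATT(e) is zero by construction.
    pre = {int(e): 0.0 for e in treated.loc[treated["e"] < 0, "e"].unique()}
    return pd.Series({**pre, **att_e}).sort_index()
```

Plotting this series in place of the flat population_att line makes the overlay match the event-study estimand, including the zero truth on pre-treatment horizons.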

Code Quality

  • No findings in the changed code.

Performance

  • No findings. The added 200-draw Monte Carlo block is notebook-only and the claimed runtime is reasonable for docs.

Maintainability

  • No findings in the changed code path.

Tech Debt

  • No findings. I did not see a new untracked debt item that should block this tutorial-only PR.

Security

  • No findings. PR scope is limited to a notebook edit.

Documentation/Tests

  • No blocking findings.
  • Note: I could not independently execute the notebook in this sandbox because the available Python environment is missing notebook dependencies.

Path to Approval

  1. Fix or remove the Section 1 event-study truth overlay so the plotted “truth” matches the event-study estimand ATT(e), not the overall population_att.

Replace the flat horizontal population_att line with a per-horizon
truth series: ATT(e)=0 for pre-treatment, and the survey-weighted
mean of true_effect among treated observations at each horizon for
post-treatment. The truth now correctly varies by horizon due to
cohort composition changes (e.g., 2.62 when both cohorts contribute
vs 1.95 when only cohort 3 remains at longer horizons).

Addresses P1 from AI review on PR #277.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 6, 2026

/ai-review

@github-actions

github-actions bot commented Apr 6, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: ee96e3664e592043a718904de48f73b2e0af907d


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

Affected methods: tutorial exposition for CallawaySantAnna overall/event-study output and survey DEFF diagnostics.

  • P1 Impact: The new subsection labels (results_survey.overall_se / results_naive.overall_se)^2 as “DEFF” and derives n / DEFF from it, but results_naive is the unweighted Callaway-Sant’Anna fit, not the SRS variance of the same weighted estimand. That conflicts with the library’s own DEFF definition in the registry and with the later compute_deff() section, so the tutorial now teaches an undocumented methodology deviation and overstates the meaning of the reported effective sample size. Concrete fix: either relabel this subsection as an informal “SE inflation from ignoring survey design” comparison and remove the DEFF/effective-n interpretation, or compute a genuine SRS baseline for the same weighted ATT estimator before calling it DEFF. Location: docs/tutorials/16_survey_did.ipynb:L281-L320, docs/tutorials/16_survey_did.ipynb:L925-L936, docs/methodology/REGISTRY.md:L2437-L2447, diff_diff/survey.py:L775-L855
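
For context, the survey literature does offer a weights-only baseline: Kish's approximate design effect, deff = n·Σw² / (Σw)², computable from the weights alone. A sketch of this standard formula (no claim is made that this matches the library's compute_deff()):

```python
import numpy as np

def kish_deff(weights):
    """Kish's approximate design effect from unequal weighting:
    deff = n * sum(w^2) / (sum w)^2, equivalently 1 + CV(w)^2.
    A standard survey-statistics approximation; no claim is made
    about the library's compute_deff() implementation."""
    w = np.asarray(weights, dtype=float)
    return w.size * np.sum(w**2) / np.sum(w) ** 2

w = np.array([1.0, 1.0, 4.0, 4.0])
deff = kish_deff(w)
n_eff = len(w) / deff
print(f"deff = {deff:.2f}, effective n = {n_eff:.2f}")  # deff = 1.36
```

Unlike the se_ratio comparison flagged above, this baseline holds the estimand fixed and asks only how much the unequal weights inflate variance relative to SRS.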

Code Quality

  • No findings in the changed cells.

Performance

  • No findings in the changed cells.

Maintainability

  • No findings in the changed cells.

Tech Debt

  • No findings. I did not find a TODO.md entry that mitigates the DEFF terminology issue. TODO.md:L51-L94

Security

  • No findings.

Documentation/Tests

  • No blocking findings beyond the methodology issue above.
  • Note: I could not independently execute the notebook in this sandbox because the available Python environment is missing notebook dependencies.

Path to Approval

  1. In Section 1, stop calling the se_ratio comparison “DEFF” unless the denominator is changed to a true SRS baseline for the same weighted ATT estimator. The simplest fix is to rename the subsection/result to “SE inflation from ignoring survey design” and remove the n / DEFF effective-sample-size interpretation.

The SE ratio (survey / naive) compares two different estimators
(weighted vs unweighted), which is not a formal design effect.
Rename subsection to "SE inflation", remove the DEFF formula and
effective-sample-size interpretation, and add a note distinguishing
this from the library's proper compute_deff() API in Section 7.

Addresses P1 from AI review rerun on PR #277.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 6, 2026

/ai-review

@github-actions

github-actions bot commented Apr 6, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 10ad885866258fbfd1a978d743698e72ce45ceb1


Overall Assessment

✅ Looks good

The prior P1 on DEFF terminology is resolved, and I did not find any new unmitigated P0/P1 issues in the notebook-only diff. I have one non-blocking methodology/documentation finding (P2) on how the new pre-trend simulation benchmark is labeled/interpreted.

Executive Summary

Methodology

  • P2 Impact: The Monte Carlo subsection defines “False pre-trend” as any pre-period coefficient having p < 0.05, then prints target_any = 1 - 0.95**n_pre as the target. That is not the library’s documented pre-trends test, which is a joint Wald/F-style test, and the displayed target assumes independent lead estimates, which event-study coefficients generally are not. This can overstate how far the survey-aware path remains from nominal and blurs “any significant lead” with a formal pre-trends test. Concrete fix: either relabel this row and the follow-on prose as “Any individual pre-period lead significant,” dropping the target or calling it an “independence benchmark,” or replace it with a documented joint pre-trends test / simultaneous-band metric. Location: docs/tutorials/16_survey_did.ipynb#L383, docs/tutorials/16_survey_did.ipynb#L416, diff_diff/pretrends.py#L424, diff_diff/pretrends.py#L679
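
The distinction the P2 draws can be made concrete: the printed target assumes independent leads, while a joint Wald-style test evaluates all leads at once. A toy sketch (numpy/scipy, illustrative diagonal covariance, not the library's pretrends implementation):

```python
import numpy as np
from scipy import stats

n_pre = 6  # hypothetical number of pre-treatment leads

# Benchmark printed by the notebook: the chance that ANY of n_pre leads is
# individually significant IF the leads were independent. Event-study lead
# estimates are generally correlated, so this is only an "independence
# benchmark", not the rejection rate of a formal pre-trends test.
target_any = 1 - 0.95 ** n_pre

# A joint Wald-style pre-trends test instead evaluates all leads at once:
# under H0, beta' V^{-1} beta is chi-square with n_pre degrees of freedom.
rng = np.random.default_rng(1)
V = np.diag(np.full(n_pre, 0.04))                   # toy covariance of leads
beta = rng.multivariate_normal(np.zeros(n_pre), V)  # one draw under H0
wald = beta @ np.linalg.solve(V, beta)
p_joint = stats.chi2.sf(wald, df=n_pre)

print(f"'any lead significant' independence benchmark: {target_any:.3f}")
print(f"joint Wald p-value under H0: {p_joint:.3f}")
```

The joint test controls size by construction, which is why "any individual lead significant" and "pre-trends test rejects" should not be labeled interchangeably.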

Code Quality

  • No findings in the changed cells.

Performance

  • No findings in the changed cells. The new 200-draw Monte Carlo cell is heavier, but nothing in the source suggests a correctness or scaling problem for a tutorial notebook.

Maintainability

  • No findings in the changed cells.

Tech Debt

  • No findings. I did not see a relevant TODO.md entry that changes the assessment of the P2 benchmark-labeling issue.

Security

  • No findings.

Documentation/Tests

  • No blocking findings beyond the P2 methodology-labeling issue above.
  • Source-only review: I could not rerun the notebook here because the sandbox Python is missing numpy/pandas.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 6, 2026
@igerber igerber merged commit f11e169 into main Apr 6, 2026
3 of 4 checks passed
@igerber igerber deleted the survey-tutorial-pain branch April 6, 2026 16:27
