
Revise survey tutorial to show the cost of ignoring survey design#277

Merged
igerber merged 3 commits into main from survey-tutorial-pain on Apr 6, 2026
Conversation

@igerber
Owner

@igerber igerber commented Apr 6, 2026

Summary

  • Enhance Section 1 of 16_survey_did.ipynb with evidence-based motivation using the research-grade DGP (informative_sampling, heterogeneous_te_by_strata, return_true_population_att)
  • Add ground truth comparison showing naive ATT is ~22% biased vs known population ATT, while survey-weighted ATT is ~10% biased
  • Add DEFF diagnostic cell connecting SE underestimation (~1.8x) to effective sample size (~474 vs 1,600 nominal)
  • Add 200-iteration MC simulation showing that naive 95% CIs cover the truth only ~66% of the time (vs ~89% for survey) and falsely detect pre-trends in ~67% of draws (vs ~17% for survey)
  • Rewrite Cell 7 markdown and add event study truth line for visual grounding
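
The coverage arithmetic behind the Monte Carlo bullet can be sketched with a toy simulation (pure numpy, not the notebook's actual survey DGP): when the reported SE understates the estimator's true sampling variability, nominal 95% intervals cover the truth well below 95%.

```python
import numpy as np

# Toy illustration of the coverage computation (NOT the notebook's survey
# DGP): an understated SE makes nominal 95% CIs cover the truth far less often.
rng = np.random.default_rng(0)
true_att = 2.0
n_draws = 200
true_sd = 0.5      # actual sampling SD of the estimator
se_hat = 0.3       # understated SE, as when survey design is ignored

covered = 0
for _ in range(n_draws):
    att_hat = true_att + rng.normal(0.0, true_sd)
    lo, hi = att_hat - 1.96 * se_hat, att_hat + 1.96 * se_hat
    covered += lo <= true_att <= hi

coverage = covered / n_draws
print(f"empirical coverage of nominal 95% CIs: {coverage:.2f}")
```

With the SE understated by a factor of 0.3/0.5, the theoretical coverage drops to roughly 76%, which is the same mechanism behind the ~66% naive coverage the notebook reports.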

Methodology references (required if estimator / math changes)

  • N/A - no methodology changes, tutorial content only

Validation

  • Tests added/updated: No test changes (tutorial notebook only)
  • Notebook executes end-to-end via jupyter nbconvert --execute without errors
  • Quantitative checks verified: bias > 15%, SE ratio > 1.5, naive coverage < 75%, survey coverage > 82%, naive false pre-trend > 50%, survey false pre-trend < 30%
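
The quantitative checks read naturally as assertions. A hypothetical sketch, where the variable values merely stand in for the notebook's computed results and the thresholds mirror the validation checklist above:

```python
# Hypothetical values standing in for the notebook's computed results;
# the thresholds mirror the PR's validation checklist.
naive_bias_pct = 22.0        # |naive ATT - true ATT| / |true ATT| * 100
se_ratio = 1.8               # survey SE / naive SE
naive_coverage = 0.66
survey_coverage = 0.89
naive_false_pretrend = 0.67
survey_false_pretrend = 0.17

assert naive_bias_pct > 15
assert se_ratio > 1.5
assert naive_coverage < 0.75
assert survey_coverage > 0.82
assert naive_false_pretrend > 0.50
assert survey_false_pretrend < 0.30
print("all quantitative checks pass")
```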

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Enhance Section 1 of 16_survey_did.ipynb with evidence-based motivation
using the research-grade DGP (informative_sampling, heterogeneous_te_by_strata,
return_true_population_att). Adds 5 new cells showing:

- Ground truth comparison: naive ATT is ~22% biased vs known population ATT
- DEFF diagnostics connecting SE underestimation to effective sample size
- 200-iteration MC simulation proving naive 95% CIs cover truth only ~66%
  and falsely detect pre-trends in ~67% of draws

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 6, 2026

Overall Assessment

⚠️ Needs changes

Executive Summary

  • P1 The new Section 1 event-study figure plots df.attrs["dgp_truth"]["population_att"] as a horizontal “truth” line, but that value is the overall population ATT, not the event-study estimand ATT(e). That is methodologically wrong for an event-study plot and is visibly wrong on pre-treatment horizons.
  • The new DGP switches themselves (informative_sampling, heterogeneous_te_by_strata, return_true_population_att) are documented in the Methodology Registry and match the implementation.
  • No estimator code, variance code, defaults, or security-sensitive code changed in this PR.
  • I reviewed this source-only. I could not re-run the notebook in this sandbox because the available Python lacks the notebook dependencies (numpy/pandas).

Methodology

  • P1 Impact: The new truth overlay in Section 1 conflates the library’s simple overall ATT with its event-study ATT by horizon. The registry explicitly separates simple aggregation from event-study aggregation, so this is not a documented deviation. population_att is stored as the overall treated-observation-weighted ATT, while event-study aggregation targets ATT(e) by relative time. The notebook then draws that overall value as a flat line across all event times, including pre-treatment periods where true_effect is zero by construction. That can make correct horizon-specific variation look like estimator bias and teaches the wrong estimand. Concrete fix: replace the horizontal line with a horizon-specific truth series consistent with ATT(e) (0 for e < 0; for e >= 0, the survey-weighted mean of true_effect among observations with period - first_treat == e), or remove the truth line and keep the overall-truth comparison only in the scalar ATT block. Location: docs/tutorials/16_survey_did.ipynb:L215-L239, docs/methodology/REGISTRY.md:L308-L310, diff_diff/prep_dgp.py:L1287-L1293, diff_diff/prep_dgp.py:L1614-L1625, diff_diff/prep_dgp.py:L1670-L1715, diff_diff/staggered_aggregation.py:L543-L653.
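
The concrete fix described above can be sketched as a small pandas helper. Column names (`period`, `first_treat`, `true_effect`, `weight`) are assumptions mirroring the DGP output, not a documented API:

```python
import numpy as np
import pandas as pd

def event_study_truth(df, weight_col="weight"):
    """Per-horizon truth series ATT(e): 0 for pre-treatment horizons, and
    the survey-weighted mean of true_effect among treated observations at
    each post-treatment horizon. Column names are assumptions mirroring
    the DGP output, not a documented API."""
    treated = df[df["first_treat"] > 0].copy()
    treated["e"] = treated["period"] - treated["first_treat"]
    # Post-treatment: survey-weighted mean of the true effect at each horizon.
    att_e = {
        int(e): np.average(g["true_effect"], weights=g[weight_col])
        for e, g in treated[treated["e"] >= 0].groupby("e")
    }
    # Pre-treatment: ATT(e) is zero by construction.
    pre = {int(e): 0.0 for e in treated.loc[treated["e"] < 0, "e"].unique()}
    return pd.Series({**pre, **att_e}).sort_index()
```

Plotting this series in place of the flat population_att line makes the overlay match the event-study estimand, including the zero truth on pre-treatment horizons.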

Code Quality

  • No findings in the changed code.

Performance

  • No findings. The added 200-draw Monte Carlo block is notebook-only and the claimed runtime is reasonable for docs.

Maintainability

  • No findings in the changed code path.

Tech Debt

  • No findings. I did not see a new untracked debt item that should block this tutorial-only PR.

Security

  • No findings. PR scope is limited to a notebook edit.

Documentation/Tests

  • No blocking findings.
  • Note: I could not independently execute the notebook in this sandbox because the available Python environment is missing notebook dependencies.

Path to Approval

  1. Fix or remove the Section 1 event-study truth overlay so the plotted “truth” matches the event-study estimand ATT(e), not the overall population_att.

Replace the flat horizontal population_att line with a per-horizon
truth series: ATT(e)=0 for pre-treatment, and the survey-weighted
mean of true_effect among treated observations at each horizon for
post-treatment. The truth now correctly varies by horizon due to
cohort composition changes (e.g., 2.62 when both cohorts contribute
vs 1.95 when only cohort 3 remains at longer horizons).

Addresses P1 from AI review on PR #277.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 6, 2026

/ai-review

@github-actions

github-actions bot commented Apr 6, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: ee96e3664e592043a718904de48f73b2e0af907d


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

Affected methods: tutorial exposition for CallawaySantAnna overall/event-study output and survey DEFF diagnostics.

  • P1 Impact: The new subsection labels (results_survey.overall_se / results_naive.overall_se)^2 as “DEFF” and derives n / DEFF from it, but results_naive is the unweighted Callaway-Sant’Anna fit, not the SRS variance of the same weighted estimand. That conflicts with the library’s own DEFF definition in the registry and with the later compute_deff() section, so the tutorial now teaches an undocumented methodology deviation and overstates the meaning of the reported effective sample size. Concrete fix: either relabel this subsection as an informal “SE inflation from ignoring survey design” comparison and remove the DEFF/effective-n interpretation, or compute a genuine SRS baseline for the same weighted ATT estimator before calling it DEFF. Location: docs/tutorials/16_survey_did.ipynb:L281-L320, docs/tutorials/16_survey_did.ipynb:L925-L936, docs/methodology/REGISTRY.md:L2437-L2447, diff_diff/survey.py:L775-L855
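
For context, the survey literature does offer a weights-only baseline: Kish's approximate design effect, deff = n·Σw² / (Σw)², computable from the weights alone. A sketch of this standard formula (no claim is made that this matches the library's compute_deff()):

```python
import numpy as np

def kish_deff(weights):
    """Kish's approximate design effect from unequal weighting:
    deff = n * sum(w^2) / (sum w)^2, equivalently 1 + CV(w)^2.
    A standard survey-statistics approximation; no claim is made
    about the library's compute_deff() implementation."""
    w = np.asarray(weights, dtype=float)
    return w.size * np.sum(w**2) / np.sum(w) ** 2

w = np.array([1.0, 1.0, 4.0, 4.0])
deff = kish_deff(w)
n_eff = len(w) / deff
print(f"deff = {deff:.2f}, effective n = {n_eff:.2f}")  # deff = 1.36
```

Unlike the se_ratio comparison flagged above, this baseline holds the estimand fixed and asks only how much the unequal weights inflate variance relative to SRS.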

Code Quality

  • No findings in the changed cells.

Performance

  • No findings in the changed cells.

Maintainability

  • No findings in the changed cells.

Tech Debt

  • No findings. I did not find a TODO.md entry that mitigates the DEFF terminology issue. TODO.md:L51-L94

Security

  • No findings.

Documentation/Tests

  • No blocking findings beyond the methodology issue above.
  • Note: I could not independently execute the notebook in this sandbox because the available Python environment is missing notebook dependencies.

Path to Approval

  1. In Section 1, stop calling the se_ratio comparison “DEFF” unless the denominator is changed to a true SRS baseline for the same weighted ATT estimator. The simplest fix is to rename the subsection/result to “SE inflation from ignoring survey design” and remove the n / DEFF effective-sample-size interpretation.

The SE ratio (survey / naive) compares two different estimators
(weighted vs unweighted), which is not a formal design effect.
Rename subsection to "SE inflation", remove the DEFF formula and
effective-sample-size interpretation, and add a note distinguishing
this from the library's proper compute_deff() API in Section 7.

Addresses P1 from AI review rerun on PR #277.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 6, 2026

/ai-review

@github-actions

github-actions bot commented Apr 6, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 10ad885866258fbfd1a978d743698e72ce45ceb1


Overall Assessment

✅ Looks good

The prior P1 on DEFF terminology is resolved, and I did not find any new unmitigated P0/P1 issues in the notebook-only diff. I have one non-blocking methodology/documentation finding (P2) on how the new pre-trend simulation benchmark is labeled/interpreted.

Executive Summary

Methodology

  • P2 Impact: The Monte Carlo subsection defines “False pre-trend” as any pre-period coefficient having p < 0.05, then prints target_any = 1 - 0.95**n_pre as the target. That is not the library’s documented pre-trends test, which is a joint Wald/F-style test, and the displayed target assumes independent lead estimates, which event-study coefficients generally are not. This can overstate how far the survey-aware path remains from nominal and blurs “any significant lead” with a formal pre-trends test. Concrete fix: either relabel this row and the follow-on prose as “Any individual pre-period lead significant,” dropping the target or calling it an “independence benchmark,” or replace it with a documented joint pre-trends test / simultaneous-band metric. Location: docs/tutorials/16_survey_did.ipynb#L383, docs/tutorials/16_survey_did.ipynb#L416, diff_diff/pretrends.py#L424, diff_diff/pretrends.py#L679
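
The distinction the P2 draws can be made concrete: the printed target assumes independent leads, while a joint Wald-style test evaluates all leads at once. A toy sketch (numpy/scipy, illustrative diagonal covariance, not the library's pretrends implementation):

```python
import numpy as np
from scipy import stats

n_pre = 6  # hypothetical number of pre-treatment leads

# Benchmark printed by the notebook: the chance that ANY of n_pre leads is
# individually significant IF the leads were independent. Event-study lead
# estimates are generally correlated, so this is only an "independence
# benchmark", not the rejection rate of a formal pre-trends test.
target_any = 1 - 0.95 ** n_pre

# A joint Wald-style pre-trends test instead evaluates all leads at once:
# under H0, beta' V^{-1} beta is chi-square with n_pre degrees of freedom.
rng = np.random.default_rng(1)
V = np.diag(np.full(n_pre, 0.04))                   # toy covariance of leads
beta = rng.multivariate_normal(np.zeros(n_pre), V)  # one draw under H0
wald = beta @ np.linalg.solve(V, beta)
p_joint = stats.chi2.sf(wald, df=n_pre)

print(f"'any lead significant' independence benchmark: {target_any:.3f}")
print(f"joint Wald p-value under H0: {p_joint:.3f}")
```

The joint test controls size by construction, which is why "any individual lead significant" and "pre-trends test rejects" should not be labeled interchangeably.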

Code Quality

  • No findings in the changed cells.

Performance

  • No findings in the changed cells. The new 200-draw Monte Carlo cell is heavier, but nothing in the source suggests a correctness or scaling problem for a tutorial notebook.

Maintainability

  • No findings in the changed cells.

Tech Debt

  • No findings. I did not see a relevant TODO.md entry that changes the assessment of the P2 benchmark-labeling issue.

Security

  • No findings.

Documentation/Tests

  • No blocking findings beyond the P2 methodology-labeling issue above.
  • Source-only review: I could not rerun the notebook here because the sandbox Python is missing numpy/pandas.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 6, 2026
@igerber igerber merged commit f11e169 into main Apr 6, 2026
3 of 4 checks passed
@igerber igerber deleted the survey-tutorial-pain branch April 6, 2026 16:27
