
Add research-grade DGP parameters for survey tutorials #274

Merged
igerber merged 10 commits into main from survey-dgp
Apr 5, 2026

Conversation


igerber (Owner) commented Apr 5, 2026

Summary

  • Add 6 new optional parameters to generate_survey_did_data(): icc, weight_cv, informative_sampling, heterogeneous_te_by_strata, strata_sizes, return_true_population_att
  • Enable DGPs where survey-design methods provably outperform naive estimators (bias reduction via informative sampling + heterogeneous TE; correct coverage via ICC-driven DEFF)
  • Add _rank_pair_weights() helper for informative sampling with panel and cross-section support
  • All parameters default to current behavior — backward compatible

Methodology references (required if estimator / math changes)

  • Method name(s): ICC variance decomposition, Kish DEFF, informative sampling via rank-pairing
  • Paper / source link(s): N/A — DGP utility, not estimator. ICC formula follows standard one-way ANOVA decomposition including PSU-period interaction variance.
  • Any intentional deviations from the source (and why): None
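
For reference, the Kish DEFF named above has a closed form: DEFF = n · Σw² / (Σw)², which equals 1 + CV(w)² where CV uses the population variance. A minimal stdlib sketch (the function name is illustrative, not from this library):

```python
import statistics

def kish_deff(weights):
    """Kish design effect: n * sum(w^2) / (sum(w))^2.

    Scale-invariant, and equal to 1 + CV(w)^2 with the
    population (ddof=0) coefficient of variation.
    """
    n = len(weights)
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return n * s2 / (s1 * s1)

# Equal weights give DEFF = 1; variable weights inflate it by 1 + cv^2.
equal = [2.0] * 10
varied = [1.0, 3.0] * 5
print(kish_deff(equal))  # 1.0
cv2 = statistics.pvariance(varied) / statistics.mean(varied) ** 2
print(kish_deff(varied), 1 + cv2)  # both 1.25
```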

Validation

  • Tests added/updated: tests/test_prep.py — 13 new tests in TestSurveyDGPResearchGrade (ICC targeting, weight CV, informative sampling panel+cross-section, heterogeneous TE, strata sizes, backward compatibility)
  • Backtest / simulation / notebook evidence (if applicable): N/A — tutorial notebooks in subsequent PRs

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Add 6 new optional parameters (icc, weight_cv, informative_sampling,
heterogeneous_te_by_strata, strata_sizes, return_true_population_att)
that enable coverage studies and tutorials demonstrating when survey-
design methods outperform naive estimators. All default to current
behavior for backward compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot commented Apr 5, 2026

Overall Assessment

⚠️ Needs changes. The highest unmitigated severity is P1: the new informative-sampling path reverses the library’s documented pweight semantics, and the new icc_realized “truth” diagnostic is not actually the ICC definition used elsewhere.

Executive Summary

  • Scope is DGP-only: no estimator or survey-SE implementation changed; the review focused on the new survey-DGP weighting and diagnostic math.
  • P1: informative_sampling assigns smaller weights to “under-covered” high-outcome units, which contradicts the project’s documented inverse-selection-probability weight semantics.
  • P1: df.attrs["dgp_truth"]["icc_realized"] is mislabeled; it is computed as Var(PSU means)/Var(outcome), not the ANOVA/random-effects ICC used to calibrate and test icc.
  • P2: weight_cv lacks finite-value validation, so NaN/Inf can propagate into generated weights instead of being rejected.
  • P2: the new tests reinforce the reversed informative-sampling sign and only sanity-check icc_realized, so the methodology issues above would slip through CI.
  • No TODO.md item or REGISTRY.md note mitigates the P1 findings.

Methodology

  • Severity: P1. Impact: The new informative-sampling path in diff_diff/prep_dgp.py:L1138-L1159 and diff_diff/prep_dgp.py:L1265-L1271 gives the lightest weights to the highest Y(0) units, and the tests in tests/test_prep.py:L1571-L1601 expect that negative association. But the same function says survey weights are inverse selection probabilities in diff_diff/prep_dgp.py:L1199-L1201, matching the project’s survey methodology note w_i = 1 / pi_i in docs/methodology/survey-theory.md:L147-L152. Under that contract, under-covered high-outcome units should carry larger weights, not smaller ones. As written, the DGP implements oversampling of high-outcome units while documenting under-coverage, so weighted-vs-unweighted bias demonstrations can point in the wrong direction. Concrete fix: reverse the rank-pairing so larger Y(0) gets larger weight, or rename/document the feature as oversampling and update tests/tutorial claims to match.
  • Severity: P1. Impact: The new public icc_realized diagnostic in diff_diff/prep_dgp.py:L1594-L1628 is computed as var(psu_means) / var(outcome), but the icc parameter and its regression test use the standard ANOVA/random-effects ICC in diff_diff/prep_dgp.py:L1255-L1260 and tests/test_prep.py:L1512-L1532. Those are not the same statistic; in the canonical equal-cluster case the current ratio exceeds ICC because PSU means still contain within-PSU variance, and here it is further contaminated by pooling all periods, including treated periods. Users asking the DGP for “truth” will get a mislabeled value. Concrete fix: compute icc_realized with the same ANOVA ICC formula used in the test on untreated data (period 1 is the cleanest choice), or rename the field to the statistic actually returned.
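
The gap between the two statistics in the second finding can be shown directly. A self-contained simulation sketch (stdlib only, not the library's code) compares the naive Var(PSU means)/Var(y) ratio with the balanced one-way ANOVA ICC; under equal PSU and within-PSU variance the true ICC is 0.5 but the naive ratio centers near 0.6:

```python
import random
import statistics

random.seed(0)

# Balanced one-way layout: m PSUs of k units, y = PSU effect (sd 1) + noise (sd 1),
# so the true ICC is 1 / (1 + 1) = 0.5.
m, k = 400, 5
groups = []
for _ in range(m):
    b = random.gauss(0, 1)
    groups.append([b + random.gauss(0, 1) for _ in range(k)])

all_y = [y for g in groups for y in g]
means = [statistics.mean(g) for g in groups]

# Naive ratio: PSU means still carry within-PSU variance (sigma_w^2 / k),
# so Var(means) / Var(y) is biased upward, here toward (1 + 1/5) / 2 = 0.6.
naive = statistics.pvariance(means) / statistics.pvariance(all_y)

# One-way ANOVA / random-effects ICC for the balanced design.
msb = k * statistics.variance(means)                           # between mean square
msw = statistics.mean(statistics.variance(g) for g in groups)  # within mean square
anova_icc = (msb - msw) / (msb + (k - 1) * msw)

print(round(anova_icc, 2), round(naive, 2))  # anova near 0.5, naive near 0.6
```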

Code Quality

  • Severity: P2. Impact: weight_cv validation in diff_diff/prep_dgp.py:L1367-L1374 rejects only <= 0; unlike the adjacent psu_period_factor guard, it does not reject NaN/Inf. Those values flow straight into the lognormal parameterization in diff_diff/prep_dgp.py:L1422-L1426, so the function can emit non-finite weights instead of failing fast with a clean ValueError. Concrete fix: require np.isfinite(weight_cv) before using it and add explicit tests for NaN/Inf rejection.
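
The suggested fail-fast guard can be sketched as follows. `draw_pweights` is a hypothetical stand-in for the DGP's weight draw, using the standard mean-1 lognormal parameterization for a target CV (CV² = exp(σ²) − 1, μ = −σ²/2):

```python
import math
import random

def draw_pweights(n, weight_cv, rng=random):
    """Mean-1 lognormal weights with a target coefficient of variation.

    Illustrative helper: the point is rejecting NaN/Inf up front,
    mirroring the adjacent psu_period_factor guard, instead of
    letting non-finite values reach the lognormal parameterization.
    """
    if not (isinstance(weight_cv, (int, float))
            and math.isfinite(weight_cv) and weight_cv > 0):
        raise ValueError(f"weight_cv must be a finite positive number, got {weight_cv!r}")
    sigma = math.sqrt(math.log(1.0 + weight_cv ** 2))
    mu = -0.5 * sigma ** 2  # mean of the lognormal is exp(mu + sigma^2/2) = 1
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]

random.seed(1)
w = draw_pweights(50_000, 0.5)
print(round(sum(w) / len(w), 2))  # close to 1.0
for bad in (float("nan"), float("inf"), -0.2):
    try:
        draw_pweights(10, bad)
    except ValueError:
        pass  # rejected cleanly rather than emitting non-finite weights
```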

Performance

No material findings in the changed code.

Maintainability

No material findings beyond the documentation/test gaps below.

Tech Debt

No existing TODO.md entry under “Tech Debt from Code Reviews” covers these issues, so none of the findings above are mitigated by tracking.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests do not protect the intended methodology. tests/test_prep.py:L1571-L1601 codifies the negative weight-outcome correlation that only passes if the reversed pweight semantics remain in place, and tests/test_prep.py:L1636-L1651 checks icc_realized only for existence/non-negativity rather than against the ANOVA ICC already implemented in tests/test_prep.py:L1512-L1532. That means both P1 issues can ship with green tests. Concrete fix: update the informative-sampling tests to assert the pweight-consistent direction of correction, and compare dgp_truth["icc_realized"] to the same period-1 ANOVA ICC within tolerance.
  • Verification note: I could not execute the test suite locally because the sandbox Python environment does not have numpy installed.

Path to Approval

  1. Reverse the informative-sampling weight assignment so higher Y(0) units get larger inverse-probability weights, or explicitly rename the feature to oversampling and align the docstrings/tests/tutorial claims with that behavior.
  2. Replace dgp_truth["icc_realized"] with the same ANOVA ICC definition used for icc calibration, computed on untreated data, and add a regression test that it tracks the target icc.
  3. Add finite-input validation for weight_cv and tests for np.nan/np.inf rejection.

…alidation

- Reverse _rank_pair_weights() so higher Y(0) → heavier weight (w=1/pi)
- Replace icc_realized with ANOVA ICC on period-1 data (matches icc solver)
- Add np.isfinite() guard on weight_cv to reject NaN/Inf
- Fix test assertions: positive correlation, ICC regression check, NaN/Inf test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Apr 5, 2026

/ai-review


github-actions bot commented Apr 5, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 808d7aa2d7aa21bfae7972eab68fceef2bc84fc5


Overall Assessment

⚠️ Needs changes. This re-review fixes the prior code-level P1s on informative-sampling sign, icc_realized computation, and weight_cv finite-value validation, but one new unmitigated P1 remains in the informative_sampling × weight_variation interaction.

Executive Summary

  • Affected methodology: informative-sampling pweight construction, ICC-targeted cluster variance calibration, and DGP truth diagnostics.
  • The previous P1s are largely addressed in code: high-Y(0) units now get heavier pweights, icc_realized now uses a period-1 ANOVA ICC, and weight_cv now rejects NaN/Inf.
  • P1: when informative_sampling=True and weight_cv is omitted, the helper silently wipes out the configured stratum-level weight_variation because it renormalizes each stratum to mean 1 whenever within-stratum weights are constant.
  • P2: the public generate_survey_did_data() docstring still describes the old opposite informative-sampling sign and the old icc_realized definition.
  • P2: the new tests only cover informative sampling with weight_cv=0.5, so the broken default interaction is untested; the panel test also does not assert the sign directly.
  • P2: strata_sizes is documented as list[int], but non-integer values can pass validation and then fail later with an opaque slice-index error.

Methodology

  • Severity: P1. Impact: informative_sampling silently violates the semantic contract of composing within-stratum informative sampling with the existing stratum-level pweights. Base weights from weight_variation are constant within each stratum in diff_diff/prep_dgp.py:L1425-L1435. _rank_pair_weights() treats that as “no weight variation” and replaces those weights with mean-1 ranks in diff_diff/prep_dgp.py:L1150-L1155. Both informative-sampling call sites use that helper in diff_diff/prep_dgp.py:L1472-L1481 and diff_diff/prep_dgp.py:L1513-L1522. So informative_sampling=True plus the default weight_variation="moderate" no longer preserves the stratum-level inverse-selection-probability weights, even though pweights are defined as inverse inclusion probabilities in docs/methodology/survey-theory.md:L147-L152 and diff_diff/survey.py:L45-L47. Concrete fix: preserve each stratum’s baseline weight level when generating rank weights, e.g. scale the ranks by w_vals.mean(), or restrict the mean-1 fallback to the explicit weight_variation="none"/no-weight_cv case only; then add regression tests for both panel and repeated cross-section on that branch.

Code Quality

  • Severity: P2. Impact: strata_sizes is documented as list[int], but validation only checks length, minimum, and sum in diff_diff/prep_dgp.py:L1379-L1392. Non-integer inputs such as floats can pass validation and then fail later when used as slice bounds in diff_diff/prep_dgp.py:L1417-L1423. Concrete fix: validate each element with the same integer/non-bool guard used for cohort_periods, and add a regression test for non-integer strata_sizes.
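
The suggested element-wise guard is a small check per entry (note that `bool` must be excluded explicitly, since it subclasses `int`). A sketch with illustrative names:

```python
def validate_strata_sizes(strata_sizes, n_units):
    """Reject non-integer entries (including bool) before they reach
    slice arithmetic, so callers get a clean ValueError up front."""
    if len(strata_sizes) == 0:
        raise ValueError("strata_sizes must be non-empty")
    for s in strata_sizes:
        if isinstance(s, bool) or not isinstance(s, int):
            raise ValueError(f"strata_sizes entries must be int, got {s!r}")
        if s < 1:
            raise ValueError("strata_sizes entries must be >= 1")
    if sum(strata_sizes) != n_units:
        raise ValueError("strata_sizes must sum to n_units")

validate_strata_sizes([40, 60], 100)  # OK
for bad in ([40.0, 60.0], [True, 99], [0, 100]):
    try:
        validate_strata_sizes(bad, 100)
    except ValueError:
        print("rejected")  # each malformed input fails fast
```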

Performance

No material findings.

Maintainability

No material findings beyond the test/documentation gaps below.

Tech Debt

  • Severity: P3. Impact: there is no matching mitigation entry under TODO.md:L51-L76, and I found no REGISTRY.md Note/Deviation documenting the P1 above, so it remains unmitigated. Concrete fix: fix the methodology issue directly; this class of P1 is not deferrable via TODO tracking under the review policy.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the public generate_survey_did_data() docstring still describes the pre-fix methodology. It says high-outcome units “receive lower weights” and that the no-variation fallback creates “inverse-rank weights” in diff_diff/prep_dgp.py:L1266-L1272, while the helper now assigns heavier weights to higher Y(0) in diff_diff/prep_dgp.py:L1138-L1143. It also still describes icc_realized as a “between/total variance ratio” in diff_diff/prep_dgp.py:L1281-L1286, even though the implementation now computes a period-1 ANOVA ICC in diff_diff/prep_dgp.py:L1621-L1633. Concrete fix: update the public docstring to match the corrected weighting direction and ANOVA-based ICC diagnostic.
  • Severity: P2. Impact: the new informative-sampling tests only exercise weight_cv=0.5 in tests/test_prep.py:L1580-L1614, which preserves the weight distribution by construction and never touches the broken default weight_variation interaction. The panel test in tests/test_prep.py:L1580-L1594 also checks only that weighted and unweighted means differ, not that the correlation has the correct positive sign. Concrete fix: add panel and cross-section tests with informative_sampling=True and no weight_cv that assert preserved stratum mean weights and positive within-stratum weight-outcome association.
  • Verification note: I could not run pytest here because the sandbox does not have pytest, and the available Python environment also lacks numpy.

Path to Approval

  1. Fix the informative_sampling × weight_variation composition bug so default stratum-level pweights are preserved when weight_cv is omitted, and add panel/repeated-cross-section regression tests for that branch.
  2. Update the public generate_survey_did_data() docstring so informative_sampling and icc_realized describe the corrected methodology actually implemented.

…lidation

- Scale rank weights by stratum mean to preserve weight_variation structure
- Update informative_sampling and icc_realized docstrings to match code
- Add integer type validation for strata_sizes
- Add tests: default weight_variation path, positive correlation, float rejection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Apr 5, 2026

/ai-review


github-actions bot commented Apr 5, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 5c3798a69d0e924a49f9bac0b701828cdc2ba3ce


Overall Assessment

⚠️ Needs changes. The prior re-review blocker on informative-sampling/default weights appears fixed, but one unmitigated P1 remains in the new research-grade DGP parameters’ interaction with add_covariates=True.

Executive Summary

Methodology

  • Severity: P1 [Newly identified]. Impact: the new research-grade knobs are not fully propagated through the existing add_covariates=True path. The docstring says icc is a target intra-class correlation coefficient at diff_diff/prep_dgp.py:L1256-L1260, but the calibration only uses unit_fe_sd^2 + noise_sd^2 at diff_diff/prep_dgp.py:L1400-L1405 and then later adds 0.5*x1 + 0.3*x2 into the untreated outcome at diff_diff/prep_dgp.py:L1530-L1548, so the realized raw-outcome ICC is lower by construction when covariates are enabled. The same propagation gap affects informative_sampling: the ranking score is built before x1/x2 exist at diff_diff/prep_dgp.py:L1478-L1528, so under add_covariates=True it is not rank-pairing on the full untreated outcome described at diff_diff/prep_dgp.py:L1266-L1273. The only add_covariates coverage is the old column smoke test at tests/test_prep.py:L1251-L1257; none of the new methodology tests cover these combinations at tests/test_prep.py:L1512-L1693. I found no matching Note/Deviation in docs/methodology/REGISTRY.md. Concrete fix: either propagate the covariate contribution into both the ICC calibration and the informative-sampling y0 score, or explicitly reject/document add_covariates=True with these new parameters until supported; add panel and repeated-cross-section regression tests for both combinations.

Code Quality

  • No material findings beyond the parameter-propagation issue above.

Performance

  • No material findings.

Maintainability

  • No material findings.

Tech Debt

  • No material findings. There is no matching deferral entry for the P1 above in TODO.md:L51-L75, and this class of methodology/correctness issue is not deferrable under the review policy.

Security

  • No material findings.

Documentation/Tests

  • Severity: P2. Impact: the earlier informative-sampling/default-weight regression is now directly covered only for the panel path at tests/test_prep.py:L1599-L1620. The repeated-cross-section test still exercises only weight_cv=0.5 at tests/test_prep.py:L1622-L1640, so the panel=False + default weight_variation branch that now depends on _base_weight.copy() and _rank_pair_weights() at diff_diff/prep_dgp.py:L1490-L1491 and diff_diff/prep_dgp.py:L1520-L1528 remains unguarded. Concrete fix: add a panel=False, informative_sampling=True regression test with default weights that asserts preserved stratum mean weights and nonzero within-stratum variation.
  • Verification note: I could not execute the test suite here because the sandbox Python environment does not have numpy.

Path to Approval

  1. Make the new research-grade parameters consistent with add_covariates=True: either include the generator’s covariate contribution in the icc calibration and informative-sampling y0 ranking, or reject/document those parameter combinations as unsupported.
  2. Add regression tests for icc + add_covariates and informative_sampling + add_covariates in both panel=True and panel=False modes, asserting the documented behavior on the emitted outcomes.

- Include covariate variance (0.2725) in ICC calibration when add_covariates=True
- Document informative_sampling structural Y(0) ranking in docstring and REGISTRY.md
- Add cross-section default-weights test and ICC + covariates regression test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Apr 5, 2026

/ai-review


github-actions bot commented Apr 5, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 7c6b16002f7fe34623548f2fca81f411384a161f


Overall Assessment

✅ Looks good

Executive Summary

  • Re-review status: the prior icc + add_covariates P1 appears fixed. psu_re_sd now includes covariate variance when add_covariates=True in diff_diff/prep_dgp.py:L1402, with direct coverage in tests/test_prep.py:L1659.
  • Re-review status: the earlier informative_sampling vs add_covariates concern is now explicitly documented as an intentional methodology note in docs/methodology/REGISTRY.md:L2524, so under the review rubric it is mitigated rather than a defect.
  • The earlier default-weight informative-sampling coverage gap is also resolved; both panel and repeated-cross-section branches now have dedicated tests in tests/test_prep.py:L1599 and tests/test_prep.py:L1642.
  • I did not find any unmitigated P0/P1 issues in the changed lines. The only remaining items are P3 docs/tests follow-ups.
  • Static review only: I could not run pytest here because the sandbox lacks numpy and pytest.

Methodology

  • Severity: P3-informational. Impact: The affected methodology is the survey DGP’s ICC calibration and informative-sampling rank-pair weighting in diff_diff/prep_dgp.py:L1163. The icc calibration now matches the new registry note for add_covariates=True in diff_diff/prep_dgp.py:L1402, and the informative_sampling exclusion of add_covariates is explicitly documented in docs/methodology/REGISTRY.md:L2524. Under the stated rubric, that documented deviation is not a defect. Concrete fix: none required.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. I did not see a new blocker-class limitation that needs a TODO.md entry.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: The in-code docstring still lags the final methodology description. The icc docstring shows the older formula without the covariate-variance term, and the informative_sampling text is less precise than the new registry note, so a reader relying only on the function docstring can get an incomplete description. See diff_diff/prep_dgp.py:L1256, diff_diff/prep_dgp.py:L1266, diff_diff/prep_dgp.py:L1402, and docs/methodology/REGISTRY.md:L2520. Concrete fix: update generate_survey_did_data()’s docstring to mirror the registry wording, including the optional covariate-variance term and the + time trend description of structural Y(0).
  • Severity: P3. Impact: Current tests cover icc + add_covariates and both default-weight informative-sampling branches, but there is still no direct regression test for the documented informative_sampling=True, add_covariates=True contract in either panel=True or panel=False. See tests/test_prep.py:L1580 and tests/test_prep.py:L1659. Concrete fix: add one panel and one repeated-cross-section test asserting that rank-pairing ignores the covariate contribution while preserving the intended within-stratum weight behavior.

- Pre-draw covariates before ranking step for both panel and cross-section
  paths when add_covariates=True, include in Y(0) for rank-pairing
- Update docstrings: icc formula includes covariate variance, informative
  sampling includes covariates in ranking
- Update REGISTRY.md: remove structural-only deviation, document full
  covariate inclusion
- Add panel + cross-section tests for informative_sampling + add_covariates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Apr 5, 2026

/ai-review


github-actions bot commented Apr 5, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 11d85015e332e2905241285bd987a578c51de56b


Overall Assessment

Needs changes. One unmitigated P1 remains in the new survey DGP methodology path.

Executive Summary

  • Re-review status: the prior icc + add_covariates documentation mismatch appears resolved. The implementation, docstring, and registry note now align in diff_diff/prep_dgp.py:L1256-L1276, diff_diff/prep_dgp.py:L1403-L1410, and docs/methodology/REGISTRY.md:L2518-L2527.
  • P1: the new icc calibration has no feasibility check for the zero non-PSU variance case. With unit_fe_sd=0, noise_sd=0, and add_covariates=False, the code silently computes psu_re_sd=0, so the generated data cannot satisfy any requested icc in (0, 1).
  • The new informative_sampling + add_covariates tests are a useful addition, but they are still indirect and do not actually pin the documented “covariates enter the Y(0) ranking” contract.
  • I did not find additional material issues in the changed code for code quality, performance, maintainability, tech debt, or security.
  • Static review only: I could not run pytest here because the sandbox is missing numpy and pytest.

Methodology

Cross-checking the changed survey DGP code against the registry and docstring shows the documented icc-with-covariates and informative-sampling-with-covariates behavior is now consistent in diff_diff/prep_dgp.py:L1256-L1289 and docs/methodology/REGISTRY.md:L2518-L2527.

  • Severity: P1. Impact: The new ICC calibration path in diff_diff/prep_dgp.py:L1363-L1410 is missing an assumption check for the case sigma2_unit + sigma2_noise + sigma2_cov == 0. In that case, the solver at diff_diff/prep_dgp.py:L1403-L1410 collapses to psu_re_sd = 0, so a call like generate_survey_did_data(icc=0.3, unit_fe_sd=0, noise_sd=0, add_covariates=False, ...) silently returns degenerate data with the wrong clustering structure instead of rejecting an unattainable target ICC. Concrete fix: before solving for psu_re_sd, compute the non-PSU variance term and raise ValueError when it is numerically zero for any icc in (0, 1); add a regression test for that parameter combination.
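
The feasibility guard amounts to solving the one-way ICC identity for the PSU variance and rejecting the degenerate case. A hypothetical sketch (`solve_psu_re_sd` is an illustrative name, not the library's internal helper):

```python
import math

def solve_psu_re_sd(icc, unit_fe_sd, noise_sd, cov_var=0.0):
    """Solve icc = s_psu^2 / (s_psu^2 + s_rest^2) for s_psu.

    Rearranging gives s_psu^2 = icc / (1 - icc) * s_rest^2, so when the
    non-PSU variance is zero no finite s_psu attains an icc in (0, 1):
    fail fast instead of silently returning psu_re_sd = 0.
    """
    if not 0.0 < icc < 1.0:
        raise ValueError("icc must be in (0, 1)")
    s_rest2 = unit_fe_sd ** 2 + noise_sd ** 2 + cov_var
    if s_rest2 <= 0.0:
        raise ValueError(
            "target icc is unattainable: non-PSU variance is zero "
            "(unit_fe_sd=0, noise_sd=0, no covariate contribution)")
    return math.sqrt(icc / (1.0 - icc) * s_rest2)

print(solve_psu_re_sd(0.5, 1.0, 0.0))  # 1.0: equal PSU and non-PSU variance
try:
    solve_psu_re_sd(0.3, 0.0, 0.0)
except ValueError as e:
    print("rejected:", e)
```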

Code Quality

  • No findings in the changed lines.

Performance

  • No findings in the changed lines.

Maintainability

  • No findings in the changed lines.

Tech Debt

  • No findings. I did not see a new deferrable item that needs tracking in TODO.md.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: The new panel and repeated-cross-section tests for informative_sampling=True, add_covariates=True at tests/test_prep.py:L1681-L1713 only assert positive weight/outcome correlation and column presence. They would still pass if the ranking step dropped the covariate term entirely, because the unit/PSU components already induce positive correlation. That leaves the new registry note in docs/methodology/REGISTRY.md:L2524-L2527 only indirectly covered. Concrete fix: add a seed-controlled regression test where covariates are the only source of Y(0) variation, then assert the assigned weights change when the covariate contribution is included in the ranking.

Static review only: I could not execute tests/test_prep.py in this sandbox because numpy and pytest are unavailable.

Path to Approval

  1. Add an icc feasibility guard in generate_survey_did_data() so unattainable targets with zero non-PSU variance raise ValueError instead of silently producing psu_re_sd = 0.
  2. Add a regression test for icc=0.3, unit_fe_sd=0, noise_sd=0, add_covariates=False asserting the new guard fires.

Raise ValueError when icc is set but unit_fe_sd=0, noise_sd=0, and
add_covariates=False, since the target ICC is unattainable with zero
non-PSU variance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Apr 5, 2026

/ai-review


github-actions bot commented Apr 5, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: a2c235e5dccaaa2fa70dc71f00302de84e86f363


Overall Assessment

Needs changes (highest unmitigated severity: P1).

Executive Summary

  • Re-review status: the prior ICC feasibility bug is fixed. The new guard in diff_diff/prep_dgp.py:L1404 and regression test in tests/test_prep.py:L1534 now reject unattainable icc targets with zero non-PSU variance.
  • Methodology cross-check: the updated icc calibration and the documented covariate-inclusive informative-sampling ranking now align between code and registry in diff_diff/prep_dgp.py:L1256, diff_diff/prep_dgp.py:L1491, and docs/methodology/REGISTRY.md:L2518.
  • P1: the new return_true_population_att path returns finite values for undefined diagnostics. population_att becomes 0.0 when there are no treated units, and icc_realized can become 1.0 or 0.0 when period 1 has no within-PSU replication.
  • P3: the new informative_sampling + add_covariates tests are still indirect and do not actually pin the documented “covariates enter the Y(0) ranking” contract.
  • Static review only: I could not run pytest here because the sandbox is missing numpy/pytest.

Methodology

Affected methods: survey DGP ICC calibration, informative sampling via rank-pairing, stratum-specific treatment-effect DGP, and the new dgp_truth diagnostics.

Cross-check result: I did not find an undocumented mismatch in the icc formula or the informative-sampling-with-covariates behavior; those now match the docstring and registry in diff_diff/prep_dgp.py:L1256, diff_diff/prep_dgp.py:L1491, and docs/methodology/REGISTRY.md:L2518.

  • Severity: P1. Impact: In diff_diff/prep_dgp.py:L1633, population_att is forced to 0.0 when treated_mask is empty, even though the function documents it as a weighted average over treated true effects in diff_diff/prep_dgp.py:L1285. In diff_diff/prep_dgp.py:L1656, icc_realized uses max(..., 1) denominator guards, so period-1 designs with no within-PSU replication emit a finite ANOVA ICC instead of surfacing that the statistic is undefined. That is inconsistent with the library’s documented NaN-for-undefined convention in docs/methodology/REGISTRY.md:L185 and with the survey methodology note that singleton-PSU variance is undefined without an explicit fallback in docs/methodology/survey-theory.md:L479. Concrete fix: when return_true_population_att=True, return np.nan or raise ValueError for population_att on empty treated sets and for icc_realized when period 1 has fewer than 2 populated PSUs or no within-PSU replication; add regression tests for both cases.
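
The NaN-for-undefined convention the review invokes can be sketched with two illustrative stdlib helpers (names and signatures hypothetical, not the library's):

```python
import math
import statistics

def population_att(true_effects, weights, treated):
    """Weighted mean of true effects over treated units.

    Returns NaN when the treated set is empty, rather than a 0.0
    sentinel that is indistinguishable from a genuine zero effect.
    """
    num = sum(w * t for t, w, d in zip(true_effects, weights, treated) if d)
    den = sum(w for w, d in zip(weights, treated) if d)
    return num / den if den > 0 else math.nan

def anova_icc_or_nan(groups):
    """Balanced-ish one-way ANOVA ICC on period-1 PSU groups.

    Returns NaN when there are fewer than 2 populated PSUs or no
    within-PSU replication, where the statistic is undefined.
    """
    groups = [g for g in groups if g]
    if len(groups) < 2 or all(len(g) < 2 for g in groups):
        return math.nan
    k = statistics.mean(len(g) for g in groups)
    means = [statistics.mean(g) for g in groups]
    msb = k * statistics.variance(means)
    msw = statistics.mean(statistics.variance(g) for g in groups if len(g) > 1)
    return (msb - msw) / (msb + (k - 1) * msw)

print(math.isnan(population_att([1.0, 2.0], [1.0, 1.0], [False, False])))  # True
print(math.isnan(anova_icc_or_nan([[1.0], [2.0], [3.0]])))                 # True
```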

Code Quality

No findings in the changed lines.

Performance

No findings in the changed lines.

Maintainability

No findings in the changed lines.

Tech Debt

No separate tech-debt finding. TODO.md does not currently track the P1 above, so it remains unmitigated.

Security

No findings in the changed lines.

Documentation/Tests

  • Severity: P3. Impact: The new tests in tests/test_prep.py:L1690 and tests/test_prep.py:L1708 only check positive weight/outcome correlation and column presence. They would still pass if the covariate term were accidentally dropped from the ranking step, because the unit and PSU components already induce positive correlation. That leaves the documented note in docs/methodology/REGISTRY.md:L2524 only indirectly covered. Concrete fix: add a seed-controlled regression where covariates are the only source of Y(0) variation, then assert the assigned weights or ranking change when the covariate contribution is included.

Path to Approval

  1. Change return_true_population_att so undefined truth diagnostics do not return finite sentinels: use np.nan or raise on empty treated sets and on period-1 designs without estimable ANOVA ICC.
  2. Add regression tests for never_treated_frac=1.0, return_true_population_att=True and for a singleton-PSU period-1 design such as n_units=5, n_strata=5, psu_per_stratum=8, return_true_population_att=True.

- population_att returns NaN when no treated units (was 0.0)
- icc_realized returns NaN when period-1 has < 2 groups or no
  within-PSU replication (was finite sentinel)
- Matches library NaN-for-undefined convention

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Apr 5, 2026

/ai-review


github-actions bot commented Apr 5, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 8be6fad354c480a6620a7663eda0b314eed6764a


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Methodology

No findings. Affected methods were survey-DGP ICC calibration, informative sampling via within-stratum rank-pairing, stratum-specific treatment effects, and the new dgp_truth diagnostics. The changed implementation is consistent with the added registry notes, and the previous undefined-diagnostic methodology issue is fixed in the changed code.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not find a relevant pre-existing TODO.md entry that changes the disposition of this PR.

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: the registry note that covariate contributions enter the informative-sampling Y(0) ranking is not directly pinned by the new tests. The current assertions at tests/test_prep.py:L1690 and tests/test_prep.py:L1708 only check positive weight/outcome correlation and column presence, so they would likely still pass if 0.5*x1 + 0.3*x2 were accidentally dropped from the ranking step while the unit/PSU components remained. Concrete fix: add a seed-controlled regression where covariates are the only or dominant source of Y(0) dispersion used for ranking, then assert that the assigned weights change when the covariate term is included.

igerber added the ready-for-ci label (Triggers CI test workflows) on Apr 5, 2026
- covariate_effects=(beta1, beta2) replaces hardcoded (0.5, 0.3)
  coefficients for x1/x2 in outcome, ICC calibration, and informative
  sampling ranking
- te_covariate_interaction adds TE_i = base_TE + gamma * x1_i for
  unit-level treatment effect heterogeneity (requires add_covariates)
- Both default to current behavior for backward compatibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Apr 5, 2026

/ai-review


github-actions bot commented Apr 5, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 3f065976e0e90410e6570e1c7b01469e74c649d8


Overall Assessment

⛔ Blocker — one unmitigated P0 and one unmitigated P1 remain in the changed code.

Executive Summary

Path to Approval

  1. Add upfront validation that covariate_effects is a length-2 iterable of finite real numbers and that te_covariate_interaction is finite; add explicit NaN/Inf rejection tests for both.
  2. Make dgp_truth["stratum_effects"] truthful under dynamic_effects and te_covariate_interaction, either by returning a richer structure or by renaming/omitting the field when a single stratum-level TE is not well-defined; add regression tests for both combinations with return_true_population_att=True.

Methodology

  • Severity: P1. Impact: return_true_population_att=True documents dgp_truth["stratum_effects"] as “dict mapping stratum index to TE,” but the actual effect generation later applies te_covariate_interaction * x1_i and optional dynamic growth on top of the base stratum value. That makes the returned “truth” payload silently incorrect when dynamic_effects=True or te_covariate_interaction != 0, even though both are supported settings in the same public API. Concrete fix: either rename the field to something explicitly base-only such as base_stratum_effects, plus separate metadata for dynamic/interaction modifiers, or compute a summary that remains correct under those settings. Relevant locations: diff_diff/prep_dgp.py:L1287-L1302, diff_diff/prep_dgp.py:L1601-L1609, diff_diff/prep_dgp.py:L1664-L1700, tests/test_prep.py:L1757-L1775.
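One way to satisfy the "richer structure" option in this fix is to expose the base per-stratum values explicitly and record the active modifiers alongside them. A minimal sketch with hypothetical field and function names, not the actual diff_diff structure:

```python
def build_dgp_truth(base_stratum_effects, dynamic_effects=False,
                    te_covariate_interaction=0.0):
    """Truth payload that stays honest when TE modifiers are active.

    Hypothetical shape: the base per-stratum effects are labeled as
    base-only, and the modifiers are recorded so consumers can tell
    when a single stratum-level TE is not well-defined.
    """
    return {
        "base_stratum_effects": dict(base_stratum_effects),
        "dynamic_effects": bool(dynamic_effects),
        "te_covariate_interaction": float(te_covariate_interaction),
        # A single per-stratum TE only exists when no modifier is active.
        "stratum_te_well_defined": (
            not dynamic_effects and te_covariate_interaction == 0.0
        ),
    }
```

A downstream test can then assert on `stratum_te_well_defined` instead of silently trusting a base-only number under dynamic effects.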

No other unmitigated methodology mismatch stood out in the changed math itself; the new ICC calibration and the documented inclusion of covariates in the informative-sampling ranking look aligned between diff_diff/prep_dgp.py:L1258-L1278 and docs/methodology/REGISTRY.md:L2520-L2527.

Code Quality

  • Severity: P0. Impact: the PR validates weight_cv and psu_period_factor, but it never validates covariate_effects or te_covariate_interaction before using them. As written, malformed or non-finite values can propagate directly into cov_var, psu_re_sd, outcome, and true_effect, producing silently corrupted DGP output instead of a clean ValueError. That is a data-corruption bug in a new public API surface. Concrete fix: validate covariate_effects length/type/finiteness and validate te_covariate_interaction finiteness before any downstream math, then add negative tests alongside the current happy-path coverage. Relevant locations: diff_diff/prep_dgp.py:L1415-L1437, diff_diff/prep_dgp.py:L1596-L1607, tests/test_prep.py:L1841-L1889.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not find a matching mitigation entry in TODO.md:L51-L76 for the P0/P1 issues above, so they remain unmitigated.

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: the new covariate-ranking tests only assert positive weight/outcome correlation and that covariate columns exist, so they would likely still pass if _beta1 * x1 + _beta2 * x2 were accidentally dropped from the ranking step while unit/PSU components remained. That leaves the newly documented registry contract only indirectly tested. Concrete fix: add a seed-controlled case where covariates are the dominant source of Y(0) dispersion, then assert that the assigned weights or within-stratum rank ordering change when the covariate term is included versus zeroed out. Relevant locations: docs/methodology/REGISTRY.md:L2524-L2527, diff_diff/prep_dgp.py:L1510-L1523, diff_diff/prep_dgp.py:L1555-L1570, tests/test_prep.py:L1690-L1722.

- Add type/length/finiteness validation for covariate_effects and
  te_covariate_interaction
- Rename stratum_effects to base_stratum_effects in dgp_truth to
  clarify these are base TEs before dynamic/covariate modifiers
- Add validation rejection tests for malformed inputs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 5, 2026

/ai-review

@github-actions

github-actions bot commented Apr 5, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 21c25f10eb1c251eeaa89568184a71ce8de620e6


Overall Assessment

Looks good: no unmitigated P0 or P1 findings remain in the changed files. The previous P0 around non-finite input propagation and the previous P1 around misleading stratum-level truth metadata both appear resolved in substance in diff_diff/prep_dgp.py:L1417-L1436 and diff_diff/prep_dgp.py:L1679-L1715, with regression coverage in tests/test_prep.py:L1757-L1775 and tests/test_prep.py:L1892-L1915.

Executive Summary

Methodology

No unmitigated P0 or P1 methodology defects remain. The two methodology-sensitive changes in scope, ICC calibration and informative sampling with covariates, match the code paths described in the new registry entries.

  • Severity: P3. Impact: the new registry note for icc documents only the default covariate contribution Var(0.5*x1) + Var(0.3*x2) = 0.2725, but this PR also adds covariate_effects, and the implementation correctly calibrates with Var(beta1*x1) + Var(beta2*x2) when custom coefficients are supplied. That makes the methodology note narrower than the actual supported DGP. Concrete fix: rewrite the note to use the general formula and mention 0.2725 only as the default (0.5, 0.3) case. References: docs/methodology/REGISTRY.md:L2520-L2523, diff_diff/prep_dgp.py:L1294-L1298, diff_diff/prep_dgp.py:L1440-L1443.
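The generalized calibration this fix asks for amounts to solving the one-way variance decomposition for the PSU variance. A minimal sketch (function name hypothetical; the real code also folds in PSU-period interaction variance, omitted here for brevity):

```python
import math

def psu_re_sd_for_target_icc(icc, unit_sd, beta1, beta2, var_x1, var_x2):
    """Solve ICC = psu_var / (psu_var + unit_var + cov_var) for the PSU
    random-effect SD, with the general covariate contribution
    cov_var = beta1**2 * var_x1 + beta2**2 * var_x2.
    The registry's 0.2725 is just the default (beta1, beta2) = (0.5, 0.3)
    instance of this formula.
    """
    if not 0.0 <= icc < 1.0:
        raise ValueError("icc must be in [0, 1)")
    cov_var = beta1 ** 2 * var_x1 + beta2 ** 2 * var_x2
    other_var = unit_sd ** 2 + cov_var
    # psu_var = icc / (1 - icc) * (unit_var + cov_var)
    return math.sqrt(icc / (1.0 - icc) * other_var)
```

Round-tripping the returned SD through the ICC definition recovers the target, which is the property an ICC-targeting unit test should pin.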

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not find a matching mitigation entry in TODO.md, but the remaining items are P3 documentation and test gaps and do not affect approval.

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: the return_true_population_att docstring still advertises dgp_truth["stratum_effects"], while the implementation and tests expose dgp_truth["base_stratum_effects"]. Users following the docstring will look up the wrong key on this new API surface. Concrete fix: update the docstring key list to base_stratum_effects and keep the existing "base stratum TEs before dynamic/covariate modifiers" explanation. References: diff_diff/prep_dgp.py:L1288-L1292, diff_diff/prep_dgp.py:L1711-L1715, tests/test_prep.py:L1767-L1771.
  • Severity: P3. Impact: the informative-sampling-with-covariates tests still check only positive weight/outcome correlation and covariate column presence, so they would not catch a regression that accidentally drops the covariate term from the Y(0) ranker while leaving the PSU and unit components intact. This was the previous review's remaining P3 and it still applies. Concrete fix: add a seed-controlled case where covariates dominate Y(0) dispersion, then assert that the weight assignment or within-stratum rank order changes when covariate_effects is zeroed out versus nonzero. References: diff_diff/prep_dgp.py:L1534-L1538, diff_diff/prep_dgp.py:L1572-L1585, tests/test_prep.py:L1690-L1722.
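The proposed seed-controlled check could look like this self-contained toy; the real test would call generate_survey_did_data, and y0_rank_order here is a hypothetical stand-in for the Y(0) ranking step inside _rank_pair_weights():

```python
import random

def y0_rank_order(betas, seed=0, n=40):
    """Within-stratum rank order of a toy Y(0) index.

    The covariate term beta1*x1 + beta2*x2 sits on top of deliberately
    tiny unit noise, so covariates dominate Y(0) dispersion whenever
    the betas are nonzero.
    """
    rng = random.Random(seed)
    draws = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 0.01))
             for _ in range(n)]
    y0 = [betas[0] * x1 + betas[1] * x2 + eps for x1, x2, eps in draws]
    return sorted(range(n), key=lambda i: y0[i])

# Zeroing the covariate effects must reshuffle the ranking; a regression
# that silently drops the covariate term would make the two orders equal
# and fail this assertion.
assert y0_rank_order((0.5, 0.3)) != y0_rank_order((0.0, 0.0))
```

Because the seed is fixed, the assertion is deterministic rather than flaky, which is what distinguishes it from the correlation-only checks currently in the suite.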

…ing test

- REGISTRY.md: generalize ICC covariate variance to beta1²/beta2² formula
- Docstring: fix stratum_effects → base_stratum_effects key name
- Add direct covariate-ranking test: covariates dominate Y(0), verify
  weight assignment changes with nonzero vs zero covariate effects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 5, 2026

/ai-review

@github-actions

github-actions bot commented Apr 5, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: bdfb172f581d3e7105ac3c2101b9ad3ec2ccbd4b


Overall Assessment

✅ Looks good — no unmitigated P0 or P1 findings remain in the changed files.

Executive Summary

  • The prior review’s substantive P0/P1 issues appear resolved: covariate inputs are now validated before they feed ICC calibration or treatment-effect construction in diff_diff/prep_dgp.py:1417, and dgp_truth now consistently exposes base_stratum_effects in both docs and implementation at diff_diff/prep_dgp.py:1287 and diff_diff/prep_dgp.py:1711.
  • The methodology-sensitive additions that are documented in the registry now align with the code: generalized covariate-aware ICC calibration and covariate reuse during informative-sampling ranking match between docs/methodology/REGISTRY.md:2518, diff_diff/prep_dgp.py:1438, and diff_diff/prep_dgp.py:1525.
  • The previous P3 about verifying covariate-sensitive informative-sampling ranking is now covered by the new direct test at tests/test_prep.py:1724.
  • I did not find a new silent correctness bug or undocumented methodology deviation in the changed DGP paths.
  • P3: test_backward_compatibility is mislabeled; it only re-checks determinism and does not verify historical compatibility, while similar determinism coverage already exists at tests/test_prep.py:1267 and tests/test_prep.py:1868.
  • Static review only: I could not execute the tests here because pytest is unavailable and the package cannot be imported, since numpy is not installed in this sandbox.

Methodology

Affected methods reviewed: generate_survey_did_data ICC calibration, Kish DEFF diagnostic, informative sampling via rank-pairing, heterogeneous treatment effects by stratum, and covariate-driven outcome / treatment-effect modifiers in diff_diff/prep_dgp.py:1163 and docs/methodology/REGISTRY.md:2518.

No findings.
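For reference, the Kish weighting design effect named among the reviewed methods reduces to 1 + CV(w)²; a minimal sketch of the diagnostic:

```python
def kish_deff(weights):
    """Kish design effect from unequal weights:
    deff = n * sum(w_i**2) / (sum(w_i))**2, which equals 1 + CV(w)**2.
    Equal weights give deff = 1; more weight variation inflates it.
    """
    n = len(weights)
    total = sum(weights)
    return n * sum(w * w for w in weights) / (total * total)
```

Because the formula is scale-invariant, a weight_cv of 0.5 implies deff = 1.25 regardless of how the weights are normalized, which is what makes it a convenient target for the DGP's weight_cv parameter.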

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: test_backward_compatibility at tests/test_prep.py:1868 calls the current implementation twice with the same seed, so it proves determinism, not backward compatibility. The suite already has the same determinism assertion at tests/test_prep.py:1267. A future silent default-path regression would still pass both tests. Concrete fix: replace it with a true regression check against a captured pre-change output/fixture/hash for a fixed seed and parameter set, or rename the test so it no longer claims backward compatibility.
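A true regression check along these lines could hash a fixed-seed draw and compare it against a value captured before the change; the function name, rounding scheme, and CAPTURED_HASH constant below are hypothetical:

```python
import hashlib

def dataset_fingerprint(rows):
    """Stable fingerprint of generated data for fixture-based regression
    tests. Values are formatted with %.10g before hashing so the check
    tolerates benign float-formatting changes but still catches any
    default-path drift. rows is an iterable of numeric tuples
    (hypothetical DGP output).
    """
    h = hashlib.sha256()
    for row in rows:
        h.update(",".join(f"{v:.10g}" for v in row).encode())
        h.update(b"\n")
    return h.hexdigest()

# In the real test, CAPTURED_HASH would be recorded once from the
# pre-change implementation at a fixed seed, then pinned in the file:
# assert dataset_fingerprint(rows_from_dgp) == CAPTURED_HASH
```

Unlike a same-seed determinism check, this fails the moment the default code path produces different numbers than the pinned pre-change output.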

Execution note: test execution was not possible in this environment because pytest and numpy are unavailable.

@igerber igerber merged commit e123338 into main Apr 5, 2026
16 checks passed
@igerber igerber deleted the survey-dgp branch April 5, 2026 23:03