
Extend power analysis to all estimators + simulation-based MDE/sample size#208

Merged
igerber merged 17 commits into main from power-analysis
Mar 20, 2026
Conversation

@igerber
Owner

@igerber igerber commented Mar 18, 2026

Summary

  • Add estimator registry mapping all 12 built-in estimators (excluding BaconDecomposition) to their appropriate DGP, fit kwargs, and result extraction profiles
  • Refactor simulate_power() from hardcoded 4-estimator if/elif chain to registry-based lookup supporting all estimators automatically
  • Add simulate_mde() — bisection search over effect sizes to find the minimum detectable effect for any estimator
  • Add simulate_sample_size() — bisection search over n_units to find the required sample size for any estimator
  • Add SimulationMDEResults and SimulationSampleSizeResults dataclasses with summary(), to_dict(), to_dataframe() methods
  • 34 new tests: registry unit tests, per-estimator coverage tests, MDE search tests, sample size search tests
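The registry pattern above can be sketched roughly as follows. This is an illustrative toy, not the actual diff_diff API: the names EstimatorProfile, REGISTRY, and lookup() are hypothetical stand-ins for whatever the real module defines.

```python
# Hypothetical sketch of the estimator-registry pattern: each estimator
# maps to a profile bundling its DGP kwargs, fit kwargs, and a result
# extractor. Names are illustrative, not the library's real symbols.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class EstimatorProfile:
    dgp_kwargs: dict      # how to generate simulated panels for this estimator
    fit_kwargs: dict      # how to call estimator.fit() on the simulated data
    extract: Callable[[Any], float]  # pull the scalar estimate off the result

REGISTRY = {
    "DifferenceInDifferences": EstimatorProfile(
        dgp_kwargs={"n_periods": 2},
        fit_kwargs={"treatment": "treated", "time": "post"},
        extract=lambda result: result.att,
    ),
}

def lookup(estimator_name: str) -> EstimatorProfile:
    # A registry lookup replaces the old hardcoded if/elif chain:
    # any registered estimator is supported automatically.
    try:
        return REGISTRY[estimator_name]
    except KeyError:
        raise ValueError(f"{estimator_name} is not in registry") from None
```

The point of the refactor is that adding estimator support becomes a one-entry data change instead of another branch in simulate_power().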

Methodology references (required if estimator / math changes)

  • Method name(s): N/A — no methodology changes, this extends power analysis infrastructure only
  • Paper / source link(s): N/A
  • Any intentional deviations from the source (and why): None

Validation

  • Tests added/updated: tests/test_power.py — 34 new tests across 4 test classes (TestEstimatorRegistry, TestEstimatorCoverage, TestSimulateMDE, TestSimulateSampleSize)
  • All 76 tests pass (4 skipped: matplotlib, 2 deselected: slow-marked TROP/SDiD)
  • black, ruff check, and mypy all pass clean
  • Backtest / simulation / notebook evidence: N/A

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions

Overall Assessment

⛔ Blocker

Highest unmitigated severity is P0.

Executive Summary

  • The new ContinuousDiD registry profile is methodologically wrong for power analysis: it varies the DGP slope, but simulate_power() treats that scalar as the estimator’s scalar truth, even though the estimator returns overall_att. With defaults, treatment_effect=0 is still a non-null design.
  • The new simulate_mde() / simulate_sample_size() explicit-range paths can return unchecked upper bounds and power_at_* = 0.0 without ever evaluating the endpoint, because the supplied bracket is neither validated nor fully evaluated.
  • The added tests are mostly smoke tests. They do not cover ContinuousDiD null calibration or the explicit effect_range / n_range branches, so both defects can merge undetected.
  • I did not find a mitigating Methodology Registry note or TODO.md entry for either issue.

Path to Approval

  1. Remove ContinuousDiD from the automatic power registry for this PR, or redesign that profile so the simulated DGP, extracted estimate, and reported true_effect all refer to the same documented scalar estimand.
  2. In simulate_mde() and simulate_sample_size(), always evaluate user-supplied bracket endpoints, validate that the target power is actually bracketed, and never default power_at_* to 0.0 for unevaluated points.
  3. Add regression tests for ContinuousDiD under a null effect and for explicit effect_range / n_range edge cases, including narrow and unbracketed ranges.

Methodology

  • Severity: P0. Impact: The new ContinuousDiD power profile does not simulate the same estimand it evaluates. _continuous_dgp_kwargs() maps treatment_effect into the DGP’s att_slope only, while the generator keeps att_intercept=1.0 by default and defines linear treatment effects as att_intercept + att_slope * d. But _extract_continuous() pulls result.overall_att, and simulate_power() stores the input scalar as true_effect for bias/coverage. The Methodology Registry defines ContinuousDiD targets as ATT(d), ATT^{glob}, and ACRT(d), not the slope parameter. So this profile is wrong in two ways: treatment_effect=0 is still non-null by default, and even with att_intercept=0, the reported truth is still not the returned estimand. Concrete fix: either drop ContinuousDiD from _get_registry() for now, or add a ContinuousDiD-specific power API that targets a documented scalar quantity and computes the truth from the generated dose-response, not from the input slope. Evidence: diff_diff/power.py:127, diff_diff/power.py:273, diff_diff/power.py:1538, diff_diff/prep_dgp.py:780, diff_diff/prep_dgp.py:892, docs/methodology/REGISTRY.md:459
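The mismatch is easy to see numerically. The toy below (not library code) uses the linear-effect form the finding quotes, effect(d) = att_intercept + att_slope * d, with the stated default att_intercept = 1.0; the dose values are arbitrary illustrations.

```python
# Toy illustration of the estimand mismatch: the DGP's linear treatment
# effect is att_intercept + att_slope * d, with att_intercept = 1.0 by
# default. Doses below are arbitrary example values.
def linear_effect(d, att_intercept=1.0, att_slope=0.0):
    return att_intercept + att_slope * d

doses = [0.25, 0.5, 1.0]

# Mapping treatment_effect onto att_slope alone means "zero effect"
# still produces a constant effect of 1.0 at every dose: not a null.
null_effects = [linear_effect(d, att_slope=0.0) for d in doses]

# And overall_att is a dose-weighted average of effect(d), which differs
# from the input slope, so the reported truth and the extracted estimate
# refer to different scalars.
overall_att = sum(linear_effect(d, att_slope=2.0) for d in doses) / len(doses)
```

With a slope of 2.0 the average effect is (1.5 + 2.0 + 3.0) / 3 ≈ 2.17, not 2.0, which is exactly the truth-vs-estimand gap the finding describes.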

Code Quality

  • Severity: P1. Impact: The new bisection helpers do not correctly handle explicit search ranges. When effect_range / n_range is provided, the code just assigns lo, hi and skips endpoint evaluation. best_power is then initialized from an empty search_path to 0.0, and simulate_sample_size() can even exit before simulating anything when hi - lo <= 2. That means these functions can return an unchecked upper bound labeled as mde / required_n, together with fabricated power_at_mde / power_at_n = 0.0. Concrete fix: evaluate both endpoints whenever a bracket is supplied, require or explicitly warn/error if the target is not bracketed, and only report power_at_* for values that were actually simulated. Evidence: diff_diff/power.py:1842, diff_diff/power.py:1879, diff_diff/power.py:2016, diff_diff/power.py:2033, diff_diff/power.py:2036
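The suggested fix amounts to a standard bracketed bisection. A minimal sketch, with a toy power curve standing in for the Monte Carlo simulation (simulate_power_stub and simulate_mde_sketch are hypothetical names, not the library's API):

```python
# Sketch of bisection over effect sizes with explicit bracket validation.
# The power curve here is a toy logistic stand-in for the real simulation.
import math

def simulate_power_stub(effect, n_units):
    # Toy monotone power curve: rises smoothly with effect * sqrt(n).
    return 1.0 / (1.0 + math.exp(-8.0 * (effect * math.sqrt(n_units) / 10.0 - 1.0)))

def simulate_mde_sketch(power=0.8, n_units=100, lo=0.0, hi=2.0, tol=1e-3):
    """Find the smallest effect whose simulated power reaches the target."""
    # Always evaluate the supplied endpoints: never return an unchecked
    # upper bound, and fail loudly if the target power is not bracketed.
    if simulate_power_stub(hi, n_units) < power:
        raise ValueError("upper bound of effect_range does not reach target power")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if simulate_power_stub(mid, n_units) >= power:
            hi = mid   # mid is detectable: shrink the bracket from above
        else:
            lo = mid   # mid is not detectable: raise the floor
    return hi          # every returned value was actually simulated

mde = simulate_mde_sketch()
```

The invariant power(lo) < target <= power(hi) guarantees the returned value has an evaluated power at or above target, so a fabricated power_at_mde = 0.0 can never be reported.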

Performance

No material findings.

Maintainability

No material findings beyond the correctness issues above.

Tech Debt

No separate tech-debt-only findings. The issues above are not tracked in TODO.md:38, so they remain unmitigated for assessment purposes.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The added tests do not exercise the new failure modes. There is an existing DID null-effect calibration test, but the new ContinuousDiD coverage only checks that a result object is finite, and all new MDE/sample-size tests are DID-only. There is also no test for explicit effect_range / n_range inputs. Concrete fix: add tests that assert ContinuousDiD null behavior is calibrated or explicitly unsupported, plus tests that explicit brackets are evaluated/validated and that unbracketed ranges fail loudly. Evidence: tests/test_power.py:390, tests/test_power.py:980, tests/test_power.py:1001, tests/test_power.py:1106

Review note: I could not execute the test suite in this environment because the local Python environment is missing numpy, so this is a static diff review.

igerber added a commit that referenced this pull request Mar 18, 2026
…bisection brackets

P0: Remove ContinuousDiD from power analysis registry — the DGP simulated
att_slope as treatment_effect but extracted overall_att (a dose-weighted
average), so the simulated truth and evaluated estimand didn't match.

P1: Add endpoint evaluation for user-supplied effect_range/n_range in
simulate_mde() and simulate_sample_size(). Previously these paths skipped
evaluation entirely, leaving search_path empty and best_power at 0.0.
Now warns when brackets don't contain the target power.

P2: Add regression tests for all three fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 11539e7a01a24a990ff517c75c04099557578681


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

  • The prior ContinuousDiD blocker is resolved: it is no longer auto-registered for simulation power, and there is now a regression test for that path.
  • The new automatic TwoWayFixedEffects registry entry is still methodologically wrong for the default multi-period simulation path. It passes time="period" into an estimator that explicitly warns this does not estimate the standard DiD ATT.
  • The explicit-endpoint fix is only partial. simulate_sample_size() still never uses the lower-endpoint power, so it can return an oversized required_n when the lower bound already achieves target power.
  • The new tests are mostly smoke tests. They do not lock down either the TWFE methodology issue or the lower-bound sample-size bug.

Methodology

  • Severity: P1. Impact: the new TWFE registry profile is not aligned with the estimator’s documented estimand. _twfe_fit_kwargs() wires automatic power to time="period", but TwoWayFixedEffects.fit() explicitly warns that multi-valued time inputs estimate treated * period_number, not the standard treated * post ATT. That means the new “all built-in estimators are supported automatically” claim is false for TWFE under the default n_periods=4, and the resulting simulate_power(), simulate_mde(), and simulate_sample_size() outputs for TWFE are not methodologically valid. Concrete fix: do not auto-register TWFE for multi-period power simulation in its current form. Either restrict automatic TWFE support to validated two-period / binary-post designs, or change the estimator/power interface so time fixed effects and the post indicator are passed separately before claiming automatic support. Evidence: diff_diff/power.py:L139, diff_diff/power.py:L267, diff_diff/power.py:L1224, diff_diff/twfe.py:L96, docs/methodology/REGISTRY.md:L238, docs/methodology/REGISTRY.md:L253

Code Quality

  • Severity: P1. Impact: simulate_sample_size() still violates the basic bisection invariant because it evaluates the lower endpoint only for logging and never checks whether that lower bound already meets target power. If lo already satisfies the target, the function still bisects above it and can silently return a too-large required_n; the same bug exists in the auto-bracketing path when min_n is already sufficient. Concrete fix: evaluate and store power_lo, return lo immediately when power_lo >= power, and only enter bisection once the invariant power(lo) < target <= power(hi) holds. Evidence: diff_diff/power.py:L1988, diff_diff/power.py:L1999, diff_diff/power.py:L2016
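The lower-endpoint short-circuit described in the fix can be sketched as follows, again with a toy power curve in place of the real simulation (power_at_n and required_sample_size are illustrative names):

```python
# Sketch of the sample-size bisection with both endpoints evaluated and
# an early return when the lower bound already meets target power.
def power_at_n(n):
    # Toy monotone power curve in the sample size.
    return min(1.0, n / 500.0)

def required_sample_size(target=0.8, lo=100, hi=2000):
    power_lo = power_at_n(lo)
    if power_lo >= target:
        return lo, power_lo   # lower bound already sufficient: stop here
    power_hi = power_at_n(hi)
    if power_hi < target:
        raise ValueError("upper bound of n_range does not reach target power")
    # Invariant from here on: power(lo) < target <= power(hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if power_at_n(mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi, power_at_n(hi)
```

Without the power_lo check, a caller passing an n_range whose lower bound already achieves the target would get back a strictly larger n than necessary, which is the silent oversizing the finding flags.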

Performance

No material findings in the changed code.

Maintainability

No material findings beyond the correctness issues above.

Tech Debt

No separate deferred-work findings. These issues are not tracked in the code-review debt section of TODO.md:L38, so they remain unmitigated for assessment.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests would not catch either P1. test_twfe() only asserts that the call returns a finite-looking result, not that the TWFE registry path is methodologically valid, and the sample-size tests never cover the case where the lower search bound already achieves target power. The simulate_power() docstring also overstates support by saying all built-in estimators are automatically supported. Concrete fix: add regression tests that either reject unsupported multi-period TWFE power simulation or validate a corrected two-period TWFE path, plus tests where simulate_sample_size() must return the lower endpoint for both explicit n_range and auto-bracketed strong-effect cases; narrow the docstring claim until that coverage exists. Evidence: tests/test_power.py:L844, tests/test_power.py:L1177, tests/test_power.py:L1201, diff_diff/power.py:L1224

Path to Approval

  1. Restrict or redesign automatic TWFE power support so it does not run the current multi-period time="period" path and report it as standard ATT power.
  2. Fix simulate_sample_size() to use both bracket endpoints and short-circuit when the lower bound already meets target power.
  3. Add regression tests for the two cases above, including an explicit lower-bound-sufficient sample-size search.

Static diff review only. I could not execute the test suite in this environment because the local Python environment is missing numpy.

igerber added a commit that referenced this pull request Mar 18, 2026
…short-circuit

- Remove TwoWayFixedEffects from power analysis registry (time="period"
  produces treated*period_number, not standard ATT)
- Add early return in simulate_sample_size() when lower bound already
  achieves target power (both explicit n_range and auto-bracket paths)
- Narrow docstring from "All" to "Most" built-in estimators
- Add regression tests for TWFE exclusion and lo-sufficient scenarios

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d79b042bd80e8d8a1afa453ae8397f4fb2e95b37


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

The two prior blockers from the last review appear resolved: TwoWayFixedEffects is no longer auto-registered for simulation power, and simulate_sample_size() now short-circuits when the lower bound already meets target power, with regression coverage for both cases (diff_diff/power.py:L249-L331, tests/test_power.py:L803-L813, diff_diff/power.py:L1972-L2003, tests/test_power.py:L1229-L1256). One new P1 remains.

Executive Summary

Methodology

Code Quality

No material findings beyond the methodology issue above.

Performance

No material findings in the changed code.

Maintainability

No material findings.

Tech Debt

No separate deferred-work findings. The remaining P1 is not tracked under the code-review debt section of TODO.md:L38-L64, so it remains unmitigated for assessment.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new test coverage does not exercise the public default SyntheticDiD auto-support path. test_synthetic_did() only uses treatment_fraction=0.3, and the new MDE/sample-size tests only cover DifferenceInDifferences, so the current default-path regression would not be caught by the suite. Concrete fix: add regression tests for simulate_power(SyntheticDiD(), progress=False) at defaults, plus corresponding simulate_mde() / simulate_sample_size() coverage, and assert either valid finite inference or an explicit fail-fast contract. References: tests/test_power.py:L959-L970, tests/test_power.py:L978-L1103, tests/test_power.py:L1111-L1256.

Path to Approval

  1. Make the automatic SyntheticDiD simulation path valid or explicitly unsupported. Concretely: detect the default placebo-variance case where n_control <= n_treated and either raise with a clear message, or switch the supported automatic configuration to one that yields valid inference.
  2. Add regression tests covering the public default SyntheticDiD path in simulate_power(), and the same configuration through simulate_mde() and simulate_sample_size().
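A fail-fast check along the lines of item 1 might look like this. The function name and message wording are hypothetical; only the n_control > n_treated requirement comes from the review.

```python
# Illustrative fail-fast validation for the placebo-variance design:
# reject n_control <= n_treated up front instead of silently altering
# the requested design. Names are hypothetical sketches.
def check_placebo_design(n_units, treatment_fraction):
    n_treated = int(round(n_units * treatment_fraction))
    n_control = n_units - n_treated
    if n_control <= n_treated:
        raise ValueError(
            "SyntheticDiD placebo variance requires n_control > n_treated "
            f"(got {n_control} controls, {n_treated} treated); lower "
            "treatment_fraction or use bootstrap variance."
        )
    return n_treated, n_control
```

Raising here, rather than rewriting treatment_fraction, keeps the simulated design identical to what the caller asked for.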

Static review only. I could not run the test suite in this environment because pytest is unavailable, and local Python imports fail because numpy is not installed.

igerber added a commit that referenced this pull request Mar 18, 2026
…ro-power bug

Add _sdid_dgp_kwargs that caps treatment_fraction at 0.4 so the placebo
variance method has enough pseudo-controls (n_control > n_treated).
Add regression test for the default SyntheticDiD path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 32f2b2b52bb09b1d35bcbe6a9c5b5f8dab5e3ee8


Overall Assessment

⛔ Blocker

Highest unmitigated severity: P0.

Executive Summary

Methodology

Code Quality

No material findings beyond the methodology blocker above.

Performance

No material findings in the changed code.

Maintainability

No material findings beyond the test gap below.

Tech Debt

No relevant entry under TODO.md:L38-L64 mitigates the blocker above, so it remains unmitigated for assessment.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests validate that the default SyntheticDiD path no longer returns zero power, but they do not verify that the requested design was preserved, and they do not exercise SyntheticDiD through simulate_mde() or simulate_sample_size(). That is why the silent treatment_fraction rewrite passes the suite. Concrete fix: add public-API regression tests that assert either (a) the realized treated share matches the requested treatment_fraction, or (b) unsupported placebo designs raise/warn explicitly; run the same contract through simulate_power(), simulate_mde(), and simulate_sample_size() for SyntheticDiD. References: tests/test_power.py:L697-L729, tests/test_power.py:L975-L987, tests/test_power.py:L995-L1269.

Path to Approval

  1. Remove the unconditional 40% cap from _sdid_dgp_kwargs() so the simulation respects the caller’s treatment_fraction.
  2. Add an explicit contract for unsupported SyntheticDiD placebo designs: if n_control <= n_treated, raise with a clear message or require an explicit bootstrap configuration, rather than silently changing the design.
  3. Add regression coverage for SyntheticDiD across simulate_power(), simulate_mde(), and simulate_sample_size() that checks the contract above.

Static review only. I could not run pytest here because pytest is unavailable, and local imports fail because numpy is not installed.

igerber added a commit that referenced this pull request Mar 18, 2026
…n cap with fail-fast validation

Remove _sdid_dgp_kwargs that silently capped treatment_fraction at 0.4.
Add upfront ValueError in simulate_power() when SyntheticDiD placebo
variance has n_control <= n_treated, with actionable error message
suggesting lowering treatment_fraction or using bootstrap variance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4fe4a7ab2714d25bd64c6742359b04631c77e2e1


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

  • The prior SyntheticDiD blocker is resolved: the code now fails fast for placebo designs with n_control <= n_treated, the behavior is documented in the Methodology Registry, and tests cover simulate_power(), simulate_mde(), and simulate_sample_size().
  • [Newly identified] The registry refactor leaves TwoWayFixedEffects outside the automatic power-analysis path, so the default simulation APIs now error for a first-class public estimator. The new tests explicitly lock that unsupported state in.
  • The new "provide a custom data_generator and estimator_kwargs" fallback is not actually generic. Unregistered estimators with non-basic fit() signatures, such as ContinuousDiD, still receive hard-coded basic-DiD kwargs and cannot use the advertised escape hatch.
  • Test coverage is broad for registered estimators and the SyntheticDiD fail-fast path, but it adds no positive regression coverage for TwoWayFixedEffects and no test of the advertised custom fallback for non-basic estimators.

Methodology

Code Quality

  • Severity: P1. Impact: TwoWayFixedEffects is still exported as a core public estimator, but the new registry omits it and simulate_power() now raises "not in registry"; simulate_mde() and simulate_sample_size() inherit the same failure. The suite then codifies that unsupported state. This is an undocumented contraction of the default power-analysis API introduced by the refactor. Concrete fix: add a dedicated TwoWayFixedEffects profile to _get_registry() and cover all three public simulation entry points. When you add it, use a binary post indicator in the fit kwargs (time="post", unit="unit"), not the old time="period" pattern, because the registry explicitly says multi-valued time does not estimate the standard TWFE ATT. References: diff_diff/__init__.py:L37-L42, diff_diff/__init__.py:L188-L200, diff_diff/power.py:L249-L331, diff_diff/power.py:L1298-L1307, docs/methodology/REGISTRY.md:L253-L257, tests/test_power.py:L803-L814

Performance

No material findings in the changed code.

Maintainability

  • Severity: P2. Impact: The new fallback contract for unregistered estimators is incomplete. The error path tells users to "Provide a custom data_generator and estimator_kwargs," but when profile is None the code still injects treatment="treated" and time="post" unconditionally. Estimators with non-basic fit() signatures, such as ContinuousDiD.fit(data, outcome, unit, time, first_treat, dose, ...), cannot actually use the advertised escape hatch. Concrete fix: either add a truly generic fit_kwargs/fit_kwargs_builder override that lets callers fully control the fit signature, or raise a targeted error for non-basic unregistered estimators instead of suggesting a path that still fails. References: diff_diff/power.py:L1302-L1307, diff_diff/power.py:L1408-L1418, diff_diff/continuous_did.py:L153-L161

Tech Debt

No relevant TODO.md entry mitigates the P1 or P2 findings above. Reference: TODO.md:L38-L64

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests institutionalize TwoWayFixedEffects as unsupported instead of guarding against the regression, and they do not exercise the advertised custom-generator fallback for an unregistered non-basic estimator. That leaves both API-contract issues above unguarded. Concrete fix: replace the negative TwoWayFixedEffects test with positive coverage for simulate_power(), simulate_mde(), and simulate_sample_size(), and add one regression test that either proves the custom fallback works for a non-basic estimator or asserts a targeted "unsupported fit signature" error. References: tests/test_power.py:L803-L814, tests/test_power.py:L834-L1067, tests/test_power.py:L1075-L1352

Path to Approval

  1. Add a TwoWayFixedEffects registry entry so the default simulation APIs cover that public estimator.
  2. Make the unregistered-estimator fallback honest: either support fully caller-specified fit kwargs or fail with a targeted error instead of hard-coding basic-DiD kwargs.
  3. Add regression coverage for TwoWayFixedEffects across simulate_power(), simulate_mde(), and simulate_sample_size(), plus one test covering the intended behavior for an unregistered non-basic estimator.

Static review only. I did not execute the test suite in this environment.

igerber added a commit that referenced this pull request Mar 18, 2026
…tor fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 5d2fdaf892bbe20fd99ce97b96ba339bc9121ff3


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Static review only. I could not execute the test suite in this environment because pytest and the scientific Python deps are not installed.

Executive Summary

  • The prior SyntheticDiD blocker is resolved: the simulation APIs now fail fast when placebo variance would violate n_control > n_treated, that deviation is documented in the Methodology Registry, and tests cover simulate_power(), simulate_mde(), and simulate_sample_size().
  • The prior TwoWayFixedEffects blocker is resolved: TWFE is back in the estimator registry, and the registry now uses a binary post indicator, which matches the Registry guidance for standard TWFE ATT.
  • The prior “custom fallback” concern is only partially resolved: caller-controlled estimator_kwargs fixed the old fit-signature problem, but unregistered estimators with nonstandard result objects still cannot use the advertised escape hatch.
  • The power docs remain stale and unguarded: docs/api/power.rst still documents nonexistent PowerAnalysis.compute_* methods and old simulate_power() arguments, and the new snippet smoke test never exercises that file.
  • The new TODO.md entry correctly tracks the remaining blanket-NameError limitation in doc-snippet smoke tests, so that item is informational only.

Methodology

Code Quality

  • No material findings.

Performance

  • No material findings.

Maintainability

  • Severity: P2. Impact: The advertised unregistered-estimator fallback is still not truly generic. When profile is None, simulate_power() now lets callers fully control estimator.fit() kwargs, but result extraction still assumes either .att/.se/.p_value/.conf_int or .avg_*. Public unregistered estimators with different result schemas, such as ContinuousDiD (overall_att, overall_att_se, overall_att_p_value, overall_att_conf_int), will still fail even if the caller supplies a correct custom DGP wrapper and fit kwargs. That means the previous “generic fallback” concern is only partially resolved. Concrete fix: add a caller-supplied result_extractor hook, or extend the fallback extractor to handle overall_* result objects; then add a regression test using ContinuousDiD with explicit data_generator and estimator_kwargs. References: diff_diff/power.py:L1318-L1446, diff_diff/continuous_did.py:L153-L161, diff_diff/continuous_did_results.py:L112-L118, tests/test_power.py:L1097-L1138
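The suggested result_extractor hook can be sketched as below. The default field names (.att, .se, .p_value, .conf_int) and the overall_* fields match those quoted in the finding; the function names and hook signature are illustrative.

```python
# Sketch of a caller-supplied result_extractor hook: the extractor maps
# any result object to the (estimate, se, p_value, conf_int) tuple the
# simulation loop expects. Names are illustrative.
from types import SimpleNamespace

def default_extractor(result):
    # Fallback for standard result schemas.
    return result.att, result.se, result.p_value, result.conf_int

def extract(result, result_extractor=None):
    extractor = result_extractor or default_extractor
    return extractor(result)

# A ContinuousDiD-style result exposes overall_* fields instead, so the
# caller supplies a custom hook rather than failing in the fallback:
cont_result = SimpleNamespace(
    overall_att=1.2, overall_att_se=0.3,
    overall_att_p_value=0.01, overall_att_conf_int=(0.6, 1.8),
)
est, se, p, ci = extract(
    cont_result,
    result_extractor=lambda r: (
        r.overall_att, r.overall_att_se,
        r.overall_att_p_value, r.overall_att_conf_int,
    ),
)
```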

Tech Debt

  • Severity: P3. Impact: tests/test_doc_snippets.py still treats any NameError as a pass, which can mask standalone snippet breakage, but this PR explicitly tracks that limitation in TODO.md, so it is mitigated and non-blocking under the review policy. Concrete fix: none required for approval; future work is to replace the blanket NameError catch with a per-snippet allowlist or explicit fixture setup. References: tests/test_doc_snippets.py:L357-L363, TODO.md:L62-L65

Security

  • No material findings.

Documentation/Tests

  • Severity: P2. Impact: The power documentation is still out of sync with the implementation, and the new smoke-test coverage does not protect it. docs/api/power.rst still advertises nonexistent PowerAnalysis.compute_* methods and stale simulate_power() parameters, while the newly exported simulate_mde, simulate_sample_size, SimulationMDEResults, and SimulationSampleSizeResults are absent from the API index. Because tests/test_doc_snippets.py does not include docs/api/power.rst, this broken documentation path remains untested. Concrete fix: update docs/api/power.rst to the current API (power, mde, sample_size, simulate_mde, simulate_sample_size, and both new result classes), add the new symbols to docs/api/index.rst and autosummary, and include api/power.rst in RST_FILES. References: docs/api/power.rst:L33-L65, docs/api/power.rst:L145-L160, docs/api/index.rst:L148-L155, tests/test_doc_snippets.py:L25-L40, diff_diff/__init__.py:L53-L64, diff_diff/__init__.py:L295-L306

@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

igerber and others added 7 commits March 18, 2026 15:00
…sample size

- Add estimator registry mapping all 12 estimators to appropriate DGP,
  fit kwargs, and result extraction profiles
- Refactor simulate_power() to use registry lookup instead of hardcoded
  if/elif chain (was 4 estimators, now 12 + custom fallback)
- Add simulate_mde() for bisection search over effect sizes
- Add simulate_sample_size() for bisection search over n_units
- Add SimulationMDEResults and SimulationSampleSizeResults dataclasses
- 34 new tests: registry, estimator coverage, MDE search, sample size search

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… API symbols

- Add `result_extractor` parameter to simulate_power, simulate_mde, and
  simulate_sample_size for unregistered estimators with non-standard schemas
- Fix power.rst: correct PowerAnalysis method names, example code, and add
  SimulationMDEResults/SimulationSampleSizeResults/simulate_mde/simulate_sample_size
- Add 4 missing symbols to docs/api/index.rst autosummary
- Add api/power.rst to doc snippet smoke tests
- Add tests for custom result_extractor and MDE forwarding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a15ce5a4d7a4a318b9edb024152b61cf8307004c


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. Static review against the provided diff and surrounding source context; I did not execute the test suite in this environment.

Executive Summary

  • Prior blockers from the last review appear addressed: TwoWayFixedEffects is back in the power registry with a binary post path, and the SyntheticDiD placebo fail-fast behavior is now documented in the Methodology Registry.
  • One unmitigated P1 remains in the new automatic power-analysis path: the staggered-estimator registry hardcodes a single-cohort + never-treated DGP and does not adapt to estimator-level settings like control_group, clean_control, or anticipation.
  • That means simulate_power(), and by extension simulate_mde() / simulate_sample_size(), can silently report power for a different design than the configured estimator object the caller passed.
  • The new power tests mostly cover default estimator instances, so those parameter-dependent staggered paths are not exercised.
  • The doc-snippet NameError masking limitation is now tracked in TODO.md, so it is informational only.

Methodology

Code Quality

  • No material findings in the changed code beyond the methodology issue above.

Performance

  • No material findings.

Maintainability

  • No material findings.

Tech Debt

  • Severity: P3. Impact: the new doc-snippet smoke test still treats any NameError as a pass, which can mask standalone example breakage, but this PR also adds that limitation to TODO.md, so it is properly tracked and non-blocking under the review policy. Concrete fix: none required for approval; later, replace blanket NameError suppression with per-snippet setup or an allowlist. References: tests/test_doc_snippets.py:L357-L363, TODO.md:L60-L65

Security

  • No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new estimator-coverage tests only instantiate default estimator objects, so the regression suite never exercises the new auto-DGP path for control_group="not_yet_treated", clean_control="strict"/"not_yet_treated", or anticipation>0. That leaves the methodology mismatch above unguarded. Concrete fix: add explicit simulation tests for those configurations across simulate_power(), simulate_mde(), and simulate_sample_size(). References: tests/test_power.py:L828-L1095

Path to Approval

  1. Update the staggered power registry so the default DGP matches estimator settings that change identification/comparison design (control_group, clean_control, cohort structure, anticipation), or refuse those configurations unless callers supply matching data_generator_kwargs / data_generator.
  2. Add regression tests for at least CallawaySantAnna(control_group="not_yet_treated") with data_generator_kwargs={"never_treated_frac": 0.0, "cohort_periods": [2, 4]}, StackedDiD(clean_control="strict", kappa_pre=1, kappa_post=1) with later-treated clean controls, and one anticipation>0 staggered estimator case where the DGP matches the shifted effect onset/reference period.
  3. If any automatic-simulation limitation remains intentional, document it in docs/methodology/REGISTRY.md with a Note or Deviation label and emit a runtime warning when the registry default DGP is not faithful to the configured estimator.

Add _check_staggered_dgp_compat() that detects control_group="not_yet_treated",
clean_control="strict", and anticipation>0 on staggered estimators using the
automatic single-cohort DGP, and emits UserWarning with specific guidance on
supplying data_generator_kwargs or a custom data_generator.

- 5 warning/no-warning unit tests + 2 slow regression tests with matching DGP
- Document limitation in REGISTRY.md PowerAnalysis section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 800d37f224f76b2218d164098a3fdec8d3797e6d


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. Static review of the provided diff and surrounding source; I did not execute the test suite in this environment because the package import fails without numpy.

Executive Summary

  • The prior staggered-registry blocker is mitigated: the single-cohort auto-DGP limitation is now documented in the Methodology Registry, surfaced as UserWarnings, and covered by regression tests.
  • One P1 remains in the new SyntheticDiD power path: the placebo-variance fail-fast ignores data_generator_kwargs overrides that change the effective treated/control split.
  • That allows simulate_power(), simulate_mde(), and simulate_sample_size() to return numeric output for an infeasible placebo design instead of raising the documented ValueError.
  • The new SDID tests only cover the treatment_fraction boundary, so the override-based regression is not exercised.
  • I did not find additional material issues in code quality, performance, maintainability, security, or docs beyond that gap.

Methodology

Code Quality

No material findings in the changed code beyond the methodology issue above.

Performance

No material findings.

Maintainability

No material findings.

Tech Debt

No material findings newly introduced by this diff.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new SDID regression coverage only checks the placebo boundary through treatment_fraction; it never exercises the supported data_generator_kwargs override path that changes the actual treated count, so the P1 above shipped untested. Concrete fix: add simulate_power(), simulate_mde(), and simulate_sample_size() tests using SyntheticDiD(), n_units=50, treatment_fraction=0.3, and data_generator_kwargs={"n_treated": 30} (or any override that makes n_control <= n_treated) and assert the same ValueError as the high-treatment_fraction path. References: tests/test_power.py:L968-L1030, diff_diff/power.py:L1409-L1500.

Path to Approval

  1. Move the SyntheticDiD placebo-feasibility validation to the effective DGP parameters after data_generator_kwargs are merged, so registry-based overrides of n_treated cannot bypass the documented ValueError.
  2. Add regression tests for all three simulation helpers covering the override case, not just the treatment_fraction case.
  3. If the fail-fast guarantee is intentionally limited to the default factor DGP, narrow the REGISTRY.md note/error text accordingly; otherwise keep the current note and make the validation match it.
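Item 1 above — validating on the effective parameters only after the kwargs merge — might look like the following sketch. The helper name, defaults, and message wording are assumptions for illustration, not the actual diff_diff code; the point is that an `n_treated` override in `data_generator_kwargs` must be merged in before the feasibility check runs.

```python
def validate_sdid_placebo_design(n_units, treatment_fraction,
                                 data_generator_kwargs=None):
    """Fail fast when the *effective* treated/control split makes the
    placebo variance estimator infeasible (n_control <= n_treated).

    Sketch only: the key point is merging data_generator_kwargs overrides
    such as n_treated *before* checking feasibility.
    """
    merged = dict(data_generator_kwargs or {})
    n_treated = merged.get("n_treated", round(n_units * treatment_fraction))
    n_control = n_units - n_treated
    if n_control <= n_treated:
        raise ValueError(
            f"placebo variance requires n_control > n_treated; "
            f"got {n_control} control vs {n_treated} treated"
        )
    return n_treated, n_control
```

With this ordering, `validate_sdid_placebo_design(50, 0.3)` passes (15 treated vs 35 control), while the same call with `data_generator_kwargs={"n_treated": 30}` raises instead of silently simulating an infeasible design.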

… override

Move the SyntheticDiD placebo-variance fail-fast after kwargs merge so that
data_generator_kwargs={"n_treated": N} overrides are caught. Add tests for
the override path across simulate_power, simulate_mde, and simulate_sample_size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 9e18a3ea05f874da0a9bc63f1f66fc0eb5f2f343


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. Static review only; I could not execute the test suite in this environment because pytest and numpy are unavailable.

Executive Summary

Methodology

Code Quality

No material findings beyond the methodology issue above.

Performance

No material findings.

Maintainability

No material findings beyond the need to make the TripleDifference adapter semantics explicit.

Tech Debt

No separate tech-debt-only findings. The TripleDifference adapter issue above is not tracked in TODO.md, so it remains unmitigated for approval purposes. See TODO.md:L38.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new test coverage does not exercise the DDD adapter’s parameter translation edge cases. TripleDifference is only tested in simulate_power() with n_units=80, which happens to map cleanly to 10 observations per cell, and there are no DDD-specific simulate_mde() or simulate_sample_size() cases. That leaves the P1 above unguarded. Concrete fix: add TripleDifference regression tests for simulate_power(), simulate_mde(), and simulate_sample_size() with values like n_units=10 and n_units=65, and assert the intended behavior for unsupported n_periods/treatment_period/treatment_fraction and for effective sample-size handling. See tests/test_power.py:L930, tests/test_power.py:L1353, tests/test_power.py:L1486.

Path to Approval

  1. Make the TripleDifference registry path enforce a one-to-one mapping between public simulation inputs and the actual DDD DGP: validate or reject unsupported n_periods, treatment_period, and treatment_fraction, and stop silently rounding n_units without surfacing the effective design.
  2. Add DDD-specific regression tests for simulate_power(), simulate_mde(), and simulate_sample_size() covering non-multiples of 8 and unsupported generic design knobs.
  3. If any DDD-specific rounding or fixed-2x2x2 behavior is intentional, document it in docs/methodology/REGISTRY.md with a **Note:** and mirror that behavior in the public docstrings/results so the simulated design is explicit to callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 0e912bbed172b3958c4e87088e35665c9a0b17c5


Overall Assessment

⛔ Blocker — Highest unmitigated severity: P0.

Static review only; I could not execute pytest in this environment because both pytest and numpy are unavailable.

Executive Summary

Methodology

  • Severity: P0. Impact: simulate_sample_size() and SimulationSampleSizeResults.required_n are documented as returning the required/smallest sample size, but the auto path seeds lo from an internal registry floor and returns immediately if that floor already reaches target power. Because that branch never evaluates any n < min_n, it can silently report a non-minimal required_n. This is a wrong statistical output, not just a conservative implementation choice. Concrete fix: use min_n only as an initial guess for bracketing, then search downward to the estimator’s true feasibility floor before returning; if you intend to return an upper bound instead, rename/document the result accordingly and emit a warning. See diff_diff/power.py:L1774-L1777, diff_diff/power.py:L2072-L2075, diff_diff/power.py:L2132-L2204, tests/test_power.py:L1747-L1758.
  • Severity: P1. Impact: _check_ddd_dgp_compat() suppresses all ignored-parameter warnings whenever n_per_cell is present in data_generator_kwargs, even though overriding n_per_cell does not make n_periods, treatment_period, or treatment_fraction meaningful to generate_ddd_data(). That leaves a silent design-mismatch path for DDD simulations and conflicts with the registry note promising a warning whenever inputs differ from the effective design. Concrete fix: let n_per_cell suppress only the effective-N rounding/clamping warning; keep warning on ignored n_periods, treatment_period, and treatment_fraction unless the caller supplies a fully custom data_generator. See diff_diff/power.py:L313-L359, docs/methodology/REGISTRY.md:L1721, tests/test_power.py:L1023-L1035.

Code Quality

No material findings beyond the correctness issues above.

Performance

No material findings.

Maintainability

No material findings beyond the need to separate “initial search guess” from “true minimum supported N” in the sample-size search path.

Tech Debt

  • Severity: P3. Impact: The earlier TripleDifference power-analysis limitation is now properly tracked in TODO.md and documented in REGISTRY.md, so that specific deviation is mitigated under the project’s review policy. Concrete fix: none for this PR. See TODO.md:L42-L50, docs/methodology/REGISTRY.md:L1720-L1721.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests lock in the two behaviors above instead of the documented contract: test_lo_already_sufficient_auto asserts the auto path should return the registry floor as required_n, and test_ddd_no_warn_n_per_cell_override asserts no warning in a still-ignored-parameter DDD path. That will hide the fixes behind failing tests. Concrete fix: update those assertions to match the corrected behavior, and add a DDD regression where n_per_cell is overridden but mismatched n_periods/treatment_fraction still warn. See tests/test_power.py:L1023-L1035, tests/test_power.py:L1747-L1758.

Path to Approval

  1. Fix simulate_sample_size() so the auto path does not return min_n as required_n without ruling out smaller sample sizes; either search downward after the first successful bracket or explicitly convert the API/result semantics to “upper bound” with a warning.
  2. Fix _check_ddd_dgp_compat() so n_per_cell only suppresses the effective-sample-size warning, not the warnings for ignored n_periods, treatment_period, and treatment_fraction.
  3. Update the new tests to reflect the corrected contracts, especially tests/test_power.py:L1023-L1035 and tests/test_power.py:L1747-L1758.

… warning

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: fa93beae35118378807aacbcecad8c3522477b21


Overall Assessment

⛔ Blocker — Highest unmitigated severity: P0.

Static review only. I could not run pytest or import the package in this environment because pytest and numpy are unavailable.

Executive Summary

Methodology

  • Severity: P0. Impact: On the registered-estimator path, dgp_kwargs.update(data_gen_kwargs) silently lets data_generator_kwargs override builder-supplied core simulation inputs. A conflicting key such as treatment_effect will make simulate_power() report one true_effect while generating data with another, which in turn corrupts bias, coverage, and any simulate_mde() / simulate_sample_size() search that relies on simulate_power() as its objective. Before this refactor, duplicate kwargs on the explicit call path would have raised immediately; now they silently change the estimand/DGP. Concrete fix: reject conflicting overrides on the registry path for core/search-driving inputs instead of blindly update()-ing, with a narrow allowlist only for documented, intentional overrides such as DDD n_per_cell. Add regression tests for collisions on treatment_effect and the searched quantity. Location: diff_diff/power.py:L1554-L1564, diff_diff/power.py:L1645-L1664, diff_diff/power.py:L1933-L1952, diff_diff/power.py:L2137-L2156
  • Severity: P1. Impact: simulate_sample_size() still claims to find the “smallest N” / “required sample size,” but in auto mode it returns the registry floor as soon as that floor reaches target power, without evaluating any smaller n. The added warning helps, but the returned required_n semantics remain incorrect and undocumented in the registry. Concrete fix: either actually search below the registry floor until infeasibility is established, or rename/document the auto-path result everywhere as an upper bound/search floor rather than required_n. Location: diff_diff/power.py:L1774-L1776, diff_diff/power.py:L1818-L1828, diff_diff/power.py:L2072-L2075, diff_diff/power.py:L2191-L2209
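The collision guard suggested for the P0 — rejecting conflicting overrides instead of blindly `update()`-ing — might take this shape. The key set and allowlist mechanics are assumptions for the sketch; the review's point is only that registry-managed inputs must not be silently replaced by `data_generator_kwargs`.

```python
# Assumed registry-managed inputs; the real set would live in diff_diff/power.py.
PROTECTED_DGP_KEYS = frozenset({
    "n_units", "n_periods", "treatment_period",
    "treatment_fraction", "treatment_effect", "noise_sd",
})

def merge_dgp_kwargs(builder_kwargs, overrides, allow=frozenset()):
    """Merge caller overrides into builder-supplied DGP kwargs, raising on
    collisions with registry-managed keys unless explicitly allowlisted."""
    clashes = (set(overrides) & PROTECTED_DGP_KEYS) - set(allow)
    if clashes:
        raise ValueError(
            "data_generator_kwargs would override registry-managed inputs "
            f"{sorted(clashes)}; set them via the public simulate_* API"
        )
    merged = dict(builder_kwargs)
    merged.update(overrides)
    return merged
```

Non-protected knobs (for example an extra noise parameter) still pass through, while a collision on `treatment_effect` raises instead of desynchronizing `true_effect` from the generated data; a narrow allowlist covers documented, intentional overrides such as DDD `n_per_cell`.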

Code Quality

No material findings beyond the correctness issues above.

Performance

No material findings.

Maintainability

No material findings beyond the unresolved required_n contract mismatch in simulate_sample_size().

Tech Debt

  • Severity: P3. Impact: The TripleDifference power-analysis limitation is now properly mitigated under project policy: the fixed 2×2×2 DGP is documented in the Methodology Registry and tracked in TODO.md, so that deviation is informational rather than blocking. Concrete fix: none for this PR. Location: docs/methodology/REGISTRY.md:L1720-L1721, TODO.md:L46-L50

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests now lock in the auto-floor behavior by asserting that simulate_sample_size() should return 20 in the “registry floor already sufficient” case, and there is no regression coverage for the new duplicate-core-kwarg collision path on registered estimators. That means the remaining correctness issues will look intentional to the suite. Concrete fix: update the auto-floor expectation once simulate_sample_size() semantics are corrected, and add explicit tests that conflicting registry-path overrides such as data_generator_kwargs={"treatment_effect": ...} raise instead of silently changing the simulated DGP. Location: tests/test_power.py:L1762-L1774, diff_diff/power.py:L1554-L1564

Path to Approval

  1. Prevent silent core-DGP overrides on the registry path. Reject conflicting data_generator_kwargs that would replace builder-supplied simulation knobs or search variables, and add regression tests covering simulate_power(), simulate_mde(), and simulate_sample_size().
  2. Make simulate_sample_size() consistent with its contract: either search below the registry floor until the true minimum is identified, or explicitly change the API/result/docs to return a documented upper bound/search floor instead of required_n.
  3. Update the tests to match the corrected sample-size semantics and to cover the duplicate-kwarg regression path.

igerber and others added 2 commits March 20, 2026 08:09
…arch to grid

Addresses AI review P1/P2: TripleDifference results now report the
effective sample size when DDD grid rounding occurs, and
simulate_sample_size() bisects on multiples of 8 so required_n is
always a realizable DDD design.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 5d234c519972e4d8f2516ca38b4a1b177fd01a11


Overall Assessment

⛔ Blocker — Highest unmitigated severity: P0.

Static review only. I could not run the package or pytest in this environment because numpy is unavailable.

Executive Summary

Methodology

  • Severity: P0. Impact: the new registry path still does dgp_kwargs.update(data_gen_kwargs) before calling the generator, so data_generator_kwargs can silently overwrite registry-managed public inputs. At minimum, collisions on keys like treatment_effect or n_units let simulate_power() report one true_effect / searched sample size while generating a different DGP; because simulate_mde() and simulate_sample_size() use simulate_power() as their objective, the same bug corrupts both search routines. Concrete fix: reject collisions where data_generator_kwargs overwrites registry-managed public/search inputs, and either reject or explicitly re-synchronize any derived timing overrides before fitting. Add regression tests for simulate_power(), simulate_mde(), and simulate_sample_size() covering conflicting treatment_effect and n_units inputs. diff_diff/power.py:L1578-L1588, diff_diff/power.py:L1985-L2004, diff_diff/power.py:L2204-L2223
  • Severity: P1. Impact: simulate_sample_size() still claims to find the required/smallest sample size, but in auto mode it returns the registry heuristic floor (min_n) as soon as that floor reaches target power, without evaluating any smaller n. The warning helps, but the return value semantics are still wrong relative to the result object and docstring contract. Concrete fix: either actually search below the floor until infeasibility is established, or rename/document the auto-path result everywhere as an upper bound/search floor instead of required_n. diff_diff/power.py:L393-L480, diff_diff/power.py:L1810-L1812, diff_diff/power.py:L2127-L2130, diff_diff/power.py:L2260-L2279
  • Severity: P1 [Newly identified]. Impact: _check_staggered_dgp_compat() treats the mere presence of data_generator_kwargs["cohort_periods"] as sufficient to suppress the mismatch warning. A single-cohort override such as cohort_periods=[2] still does not satisfy the documented multi-cohort design needed for control_group="not_yet_treated" or clean_control="strict", so this new path can still run under the wrong control design with no warning. Concrete fix: only suppress the warning when the override actually creates at least two distinct treatment cohorts; add regression tests for [2] vs [2, 4] under Callaway-Sant’Anna / Sun-Abraham and StackedDiD. diff_diff/power.py:L266-L300, docs/methodology/REGISTRY.md:L1720-L1720

Code Quality

No material findings beyond the correctness/validation issues above.

Performance

No material findings.

Maintainability

No material findings separate from the misleading simulate_sample_size() contract above.

Tech Debt

  • Severity: P3. Impact: the TripleDifference power-analysis limitation is now properly documented in the Methodology Registry and tracked in TODO.md, so under project policy it is informational rather than blocking. Concrete fix: none required for this PR. docs/methodology/REGISTRY.md:L1721-L1721, TODO.md:L42-L50

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests now codify the auto-floor sample-size behavior (required_n == 20 after the warning) and still do not protect against the remaining registry-collision / single-cohort warning-suppression regressions, so the current suite makes the outstanding correctness issues look intentional. Concrete fix: update the auto-floor expectation once simulate_sample_size() semantics are corrected, and add explicit regression tests for conflicting registry-path overrides plus cohort_periods=[2] continuing to warn. tests/test_power.py:L1508-L1521, tests/test_power.py:L1827-L1840

Path to Approval

  1. Block registry-path collisions where data_generator_kwargs overwrites registry-managed public/search inputs, and add regression tests covering treatment_effect and n_units conflicts through all three simulation entry points.
  2. Make simulate_sample_size() match its advertised contract: either search below the registry floor until minimality is shown, or rename/document the returned quantity everywhere as an upper bound/search floor instead of required_n.
  3. Tighten _check_staggered_dgp_compat() so control_group="not_yet_treated" and clean_control="strict" only suppress warnings when the override actually creates at least two distinct treatment cohorts; add tests for single-cohort and multi-cohort overrides.

…red compat

Address AI review P0/P1/P2 findings:
- P0: Reject data_generator_kwargs that override registry-managed keys
  (treatment_effect, noise_sd, n_units, etc.) to prevent silent desync
- P1: simulate_sample_size() now searches below the registry floor when
  the floor already achieves target power, finding the true minimum N
- P1: _check_staggered_dgp_compat() checks len(set(cohort_periods)) >= 2
  instead of key existence, so single-cohort overrides like [2] still warn
- P2: Add regression tests for all three fixes
- Fix best_power initialization in bisection after downward search

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
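The cohort-multiplicity guard described in this commit — checking `len(set(cohort_periods)) >= 2` rather than key existence — can be sketched as follows. Argument shapes, defaults, and the warning text are assumptions for illustration, not the diff_diff implementation.

```python
import warnings

def check_staggered_dgp_compat(estimator_opts, dgp_kwargs):
    """Warn when staggered-estimator settings need a multi-cohort design
    that the default single-cohort auto-DGP cannot produce.

    Illustrative sketch only.
    """
    needs_multi_cohort = (
        estimator_opts.get("control_group") == "not_yet_treated"
        or estimator_opts.get("clean_control") == "strict"
        or estimator_opts.get("anticipation", 0) > 0
    )
    # Key existence is not enough: require >= 2 *distinct* cohorts,
    # so a single-cohort override like [2] still warns.
    cohorts = dgp_kwargs.get("cohort_periods", [2])
    if needs_multi_cohort and len(set(cohorts)) < 2:
        warnings.warn(
            "default single-cohort DGP does not match these estimator "
            "settings; supply multi-cohort 'cohort_periods' in "
            "data_generator_kwargs or a custom data_generator",
            UserWarning,
        )
```

Under this check, `cohort_periods=[2, 4]` suppresses the warning while `cohort_periods=[2]` does not, which matches the regression cases called out in the review.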
@igerber (Owner, Author) commented Mar 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 94fa3d4be34eba430a80d914d002e1a5b24c1fe5


Overall Assessment

⛔ Blocker — Highest unmitigated severity: P0.

Static review only. I could not run the test suite here because this environment does not have numpy or pytest.

Executive Summary

  • The prior exact-name collision bug is only partially fixed. Exact public-key collisions are now rejected, but the same desynchronization class still exists through derived DGP keys.
  • The prior single-cohort staggered-warning issue looks fixed: the warning is now suppressed only for true multi-cohort overrides.
  • The prior sample-size-floor issue also looks largely fixed: auto mode now probes below the registry floor instead of returning it unconditionally.
  • New blocker: SyntheticDiD can still get silently wrong power/MDE/sample-size results if data_generator_kwargs overrides factor-DGP timing (n_pre/n_post), because the wrapper still builds post_periods from the old treatment_period.
  • The same incomplete guard also leaves TripleDifference sample-size search vulnerable to n_per_cell overrides that decouple the searched n_units from the DGP’s actual sample size.
  • The fixed 2×2×2 DDD power limitation is now properly documented in the Methodology Registry and tracked in TODO.md, so that limitation itself is non-blocking under repo policy.

Methodology

  • Severity P0. Impact: the new collision guard only protects six exact public parameter names, but the registry still lets derived DGP knobs redefine the simulated design without re-synchronizing the estimator/search. For factor-model power, _factor_dgp_kwargs() maps the public design to n_pre/n_post, generate_factor_data() applies treatment starting at t >= n_pre, and _sdid_fit_kwargs() still constructs SyntheticDiD’s post_periods from the original treatment_period. A caller can therefore get silent wrong power output by passing data_generator_kwargs={"n_pre": ...} or {"n_post": ...} on the auto-registry path. The same hole lets DDD n_per_cell overrides disconnect simulate_sample_size()’s searched n_units from the DGP’s actual sample size. This is the same registry/DGP desync class the prior review flagged, just through derived keys instead of exact-name collisions. Concrete fix: treat these derived knobs as registry-managed too. At minimum, reject n_pre/n_post overrides for auto-registry SyntheticDiD unless post_periods is rebuilt from the effective DGP, and reject n_per_cell in DDD simulate_sample_size() or search over effective N instead of requested n_units. Add regression tests for both cases. diff_diff/power.py:L92-L109 diff_diff/power.py:L189-L203 diff_diff/power.py:L1522-L1542 diff_diff/power.py:L1601-L1646 diff_diff/power.py:L2259-L2395 diff_diff/prep_dgp.py:L270-L405 docs/methodology/REGISTRY.md:L1016-L1123 docs/methodology/REGISTRY.md:L1720-L1722

  • No other unmitigated methodology defects stood out in the changed estimator formulas, weighting, or SE paths.
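One way to read the "re-synchronize" option in the concrete fix above: rebuild the fitted post window from the effective factor-DGP timing after overrides are merged, rather than from the original treatment_period. The `n_pre`/`n_post` key names mirror the mapping described in the finding, but the body below is a hypothetical sketch, not the library's API.

```python
def build_sdid_post_periods(n_periods, treatment_period, dgp_overrides):
    """Derive post_periods from the *effective* factor-DGP timing, so an
    n_pre/n_post override cannot desync the fitted post window.

    Hypothetical sketch under assumed key names.
    """
    n_pre = dgp_overrides.get("n_pre", treatment_period)
    n_post = dgp_overrides.get("n_post", n_periods - treatment_period)
    # Treatment is applied from t >= n_pre in the factor DGP (per the review),
    # so the post window must start at the effective n_pre.
    return list(range(n_pre, n_pre + n_post))
```

With no overrides this reproduces the old behavior (`treatment_period=3`, `n_periods=6` gives `[3, 4, 5]`), while an `n_pre=1` override shifts the post window to `[1, 2, 3]` instead of silently keeping `[3, 4, 5]`.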

Code Quality

  • No material findings beyond the incomplete registry/DGP guard above.

Performance

  • No material findings.

Maintainability

  • No separate finding beyond the estimator-specific guard gap above.

Tech Debt

  • Severity P3. Impact: the TripleDifference power-analysis limitation is now explicitly documented in the Methodology Registry and tracked in TODO.md, so under project policy it is informational rather than blocking. Concrete fix: none required for this PR. docs/methodology/REGISTRY.md:L1722-L1722 TODO.md:L42-L50

Security

  • No material findings.

Documentation/Tests

  • Severity P2. Impact: the new tests/docs cover the aligned happy path for n_per_cell and explicitly bless it as a generally allowed non-collision, but they do not cover the remaining derived-key desync cases that still break correctness: no regression protects SyntheticDiD with n_pre/n_post overrides, and no test exercises simulate_sample_size(TripleDifference(), data_generator_kwargs={"n_per_cell": ...}). The Registry note also overstates the safety of n_per_cell overrides. Concrete fix: add regression tests for those cases and narrow the docs/tests so they only treat the aligned n_units == 8 * n_per_cell case as benign, or reject the unsafe overrides outright. tests/test_power.py:L1023-L1050 tests/test_power.py:L1117-L1146 tests/test_power.py:L1896-L1905 docs/methodology/REGISTRY.md:L1722-L1722

Path to Approval

  1. Close the remaining registry/DGP desync hole for derived keys: reject or re-synchronize n_pre/n_post on auto-registry SyntheticDiD, and reject n_per_cell for DDD sample-size search or change the search to operate on effective N.
  2. Add a regression test showing that simulate_power(SyntheticDiD(), treatment_period=3, data_generator_kwargs={"n_pre": 1}) either raises or rebuilds post_periods from the generated data instead of silently using [3].
  3. Add a regression test showing that simulate_sample_size(TripleDifference(), data_generator_kwargs={"n_per_cell": 10}) either raises immediately or reports/searches over the fixed effective sample size of 80 rather than arbitrary requested n_units.

…lysis

Add n_pre and n_post to _PROTECTED_DGP_KEYS to prevent SyntheticDiD/TROP
DGP desync when overriding timing-derived keys. Reject n_per_cell in
simulate_sample_size() for TripleDifference since it freezes effective
sample size across bisection. Add 4 regression tests and update REGISTRY.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b83d531ee49e0da9f1e0ba2d90ba5ebd965849f9


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not execute the test suite in this sandbox because numpy and pytest are unavailable.

Executive Summary

Methodology

Affected methods: simulate_power, simulate_mde, and simulate_sample_size, with estimator-specific methodology checks for SyntheticDiD, TripleDifference, and the staggered-estimator registry adapters.

  • Severity P1. Impact: SyntheticDiD’s placebo variance assumption (n_control > n_treated) is only enforced when simulate_power() is using the registry DGP (and not use_custom_dgp). On the supported custom-DGP path, the wrapper will still fit SyntheticDiD, which falls through to SE=0.0 plus empty placebo draws; safe_inference() then converts inference to NaN, and simulate_power() counts those as non-rejections. That means simulate_power(), simulate_mde(), and simulate_sample_size() can return power-search outputs for a methodology-invalid placebo design instead of the Registry-documented hard failure. Concrete fix: move the placebo-feasibility check to the post-generation path so it uses the realized generated panel on both registry and custom-DGP flows, then let simulate_mde()/simulate_sample_size() inherit that behavior through simulate_power(). diff_diff/power.py:L1496-L1518 diff_diff/power.py:L1603-L1668 diff_diff/power.py:L2041-L2048 diff_diff/power.py:L2270-L2277 diff_diff/synthetic_did.py:L720-L730 docs/methodology/REGISTRY.md:L1113-L1114
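The post-generation check proposed in the concrete fix could be sketched on the realized panel like this. The data shape here (a mapping from unit id to an ever-treated flag) is a simplifying assumption; the real check would operate on the generated DataFrame in diff_diff/power.py.

```python
def check_realized_placebo_feasibility(ever_treated_by_unit):
    """Validate n_control > n_treated on the *generated* panel, so registry
    and custom-DGP paths share one fail-fast instead of degenerating to
    SE=0.0 and NaN inference. Illustrative sketch only."""
    n_treated = sum(1 for t in ever_treated_by_unit.values() if t)
    n_control = len(ever_treated_by_unit) - n_treated
    if n_control <= n_treated:
        raise ValueError(
            f"placebo variance infeasible on realized data: "
            f"{n_control} control vs {n_treated} treated units"
        )
```

Because it inspects the realized data rather than the registry inputs, the same ValueError fires whether the infeasible split came from registry parameters or from a custom data_generator.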

Code Quality

  • No material findings beyond the P1 methodology gap above.

Performance

  • No material findings in the changed code.

Maintainability

  • No material findings in the changed code.

Tech Debt

  • Severity P3. Impact: the TripleDifference power-analysis limitation is now explicitly documented and tracked, so it is informational rather than blocking under the repo’s deferred-work policy. Concrete fix: none required for this PR. docs/methodology/REGISTRY.md:L1723-L1723 TODO.md:L50-L50

Security

  • No material findings.

Documentation/Tests

  • Severity P2. Impact: the new tests cover the registry-path SDID guard and the DDD desync fixes, but they do not cover the remaining supported combination of SyntheticDiD + custom data_generator, so the P1 above is currently unprotected by regression coverage. Concrete fix: add one custom factor-DGP wrapper test each for simulate_power(), simulate_mde(), and simulate_sample_size() that generates n_treated >= n_control and asserts ValueError rather than returning results. tests/test_power.py:L1218-L1277 tests/test_power.py:L1359-L1440

Path to Approval

  1. Enforce the SyntheticDiD(variance_method="placebo") feasibility check on the custom-DGP path inside simulate_power() using the realized generated data, not only the registry-input parameters.
  2. Add regression tests for simulate_power(), simulate_mde(), and simulate_sample_size() with a custom factor-model generator that yields n_control <= n_treated, and assert ValueError.
  3. Keep the Methodology Registry note aligned with the implemented behavior once the custom-DGP path is fixed.

The existing pre-generation check only ran on the registry path. Custom
DGPs could silently produce infeasible placebo designs (n_control <= n_treated),
yielding SE=0.0 and degenerate power results. Add a post-generation check
on the realized data that covers both paths, with regression tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: f1112acff540055b810214507e61c37355f06873


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not execute the test suite in this sandbox because pytest and numpy are unavailable.

Executive Summary

Methodology

Affected methods: simulation-based sample-size search, specifically the TripleDifference registry adapter.

  • Severity P1. Impact: explicit TripleDifference n_range inputs are snapped with _snap_n()'s default floor of profile.min_n (64), not the documented DDD realizable floor of 16. This is an undocumented mismatch between the new search behavior and the Registry note, and it can return an unnecessarily large required sample size for valid DDD designs below 64. Concrete fix: in the DDD branch, snap explicit n_range using the actual DGP floor (16, or a derived 2 * 8) rather than profile.min_n, or else make 64 the documented hard lower bound everywhere including the auto-search path. diff_diff/power.py:L2266 diff_diff/power.py:L2288 diff_diff/power.py:L2320 diff_diff/power.py:L2354 docs/methodology/REGISTRY.md:L1723
  • Severity P3-informational. Impact: the prior SyntheticDiD custom-DGP placebo-feasibility gap appears resolved for the deterministic custom-DGP case reviewed previously, and the implemented behavior is now documented in the Registry. Concrete fix: none required for this PR. diff_diff/power.py:L1665 docs/methodology/REGISTRY.md:L1114 tests/test_power.py:L2022

Code Quality

  • No material findings in the changed code beyond the P1 search-floor bug above.

Performance

  • No material findings in the changed code.

Maintainability

  • No material findings in the changed code.

Tech Debt

  • Severity P3-informational. Impact: the known limitation that TripleDifference power currently uses a fixed 2×2×2 cross-sectional DGP is now explicitly tracked in TODO.md, so it is properly deferred rather than blocking this PR. Concrete fix: none required for this PR. TODO.md:L55

Security

  • No material findings.

Documentation/Tests

  • Severity P2. Impact: the new tests validate DDD sample-size search only for n_range values starting at 64, which misses the explicit low-range path where _snap_n() collapses valid DDD brackets upward. Concrete fix: add a regression covering simulate_sample_size(TripleDifference(), n_range=(16, 56), ...) and assert that the snapped search remains inside the realizable 16..56 grid rather than collapsing to 64. tests/test_power.py:L1118 tests/test_power.py:L1133
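The suggested regression reduces to a grid-membership invariant on the returned required_n. The helper below is a hypothetical illustration of what such a test would assert, not code from the package; the bracket (16, 56) and step of 8 come from the review text:

```python
def in_realizable_grid(n: int, low: int, high: int, step: int = 8) -> bool:
    """Property the suggested regression would assert: the returned
    required_n stays inside the requested bracket and on the step grid."""
    return low <= n <= high and n % step == 0

# For simulate_sample_size(TripleDifference(), n_range=(16, 56), ...),
# the regression would assert this on the result when power is bracketed:
assert in_realizable_grid(24, 16, 56)
assert not in_realizable_grid(64, 16, 56)  # the collapsed-to-64 failure mode
```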

Path to Approval

  1. Change the DDD explicit-n_range path in simulate_sample_size() so that grid snapping uses the actual DDD minimum realizable sample size (16) instead of profile.min_n=64. diff_diff/power.py:L2285 diff_diff/power.py:L2320
  2. Add a regression test for TripleDifference with an explicit low bracket such as n_range=(16, 56) and assert that the returned required_n is a multiple of 8 within that bracket when power is bracketed. tests/test_power.py:L1118
  3. Keep the Methodology Registry aligned with the implemented lower bound after the fix; if 64 is actually intended, document it explicitly and remove the conflicting 16 floor from the auto-search path. docs/methodology/REGISTRY.md:L1723 diff_diff/power.py:L2354

Hoist abs_min before the bracket branch so explicit n_range snaps use the
DDD realizable floor (16) instead of the registry profile floor (64).
Also pass floor=abs_min to the bisection midpoint snap, and add a regression
test for low-range DDD sample-size search.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
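The hoisting pattern this commit describes can be sketched minimally; `resolve_search_bracket`, its signature, and the numeric floors (DGP floor 16, registry floor 64) are assumptions drawn from the review discussion, not the package's real API:

```python
from typing import Optional, Tuple

def resolve_search_bracket(
    n_range: Optional[Tuple[int, int]],
    profile_min_n: int,
    dgp_floor: int,
    step: int = 8,
) -> Tuple[int, int]:
    """Hoist the absolute realizable floor before the bracket branch, so the
    explicit-bracket path and the auto-search path snap against the same floor."""
    abs_min = dgp_floor  # e.g. 16 for the 2x2x2 DDD DGP, not profile.min_n=64

    def snap(n: int) -> int:
        snapped = max(n, abs_min)
        return snapped + (-snapped) % step  # round up to a multiple of step

    if n_range is not None:
        low, high = n_range
        return snap(low), snap(high)
    return snap(abs_min), snap(profile_min_n * 4)  # illustrative auto-bracket

# A low explicit DDD bracket now survives snapping instead of collapsing to 64:
assert resolve_search_bracket((16, 56), 64, 16) == (16, 56)
```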

igerber commented Mar 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d1dadfeaf1070edbdf3a04115faafc5062a0b0fc


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Static review only. I could not execute the test suite in this sandbox because pytest and numpy are unavailable.

Executive Summary

Methodology

Affected methods: simulate_power, simulate_mde, simulate_sample_size, and the new estimator-to-DGP registry adapters. No estimator formula, weighting formula, or variance estimator was changed.

Code Quality

  • No material findings in the changed code.

Performance

  • No material findings in the changed code.

Maintainability

  • No material findings in the changed code.

Tech Debt

  • Severity P3-informational. Impact: The fixed-2×2×2 TripleDifference power-DGP limitation is explicitly tracked in TODO.md, so it is properly deferred rather than blocking this PR. Concrete fix: none for this PR. TODO.md:L44-L55

Security

  • No material findings.

Documentation/Tests

Review limitation only: I could not run the added tests in this sandbox because pytest and numpy are not installed.

@igerber igerber merged commit fe3d79a into main Mar 20, 2026
13 checks passed
@igerber igerber deleted the power-analysis branch March 20, 2026 20:20
igerber added a commit that referenced this pull request Mar 21, 2026
Expand the tutorial to cover the full simulation power analysis API
added in PR #208: multi-estimator support (CallawaySantAnna, SyntheticDiD,
TripleDifference), simulate_mde() bisection search, simulate_sample_size()
bisection search, custom data generators, and analytical vs simulation
guidance. Includes reference table of all 12 supported estimators.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
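The bisection search behind simulate_mde() can be sketched generically, assuming simulated power is monotone in the effect size; `bisect_mde`, its signature, and the toy power curve below are illustrative stand-ins, not the package API covered by the tutorial:

```python
def bisect_mde(power_fn, target_power: float = 0.8,
               lo: float = 0.0, hi: float = 2.0, tol: float = 1e-3) -> float:
    """Find the smallest effect size whose simulated power reaches
    target_power, assuming power_fn is monotone in the effect size."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if power_fn(mid) >= target_power:
            hi = mid   # detectable at mid: shrink the bracket from above
        else:
            lo = mid   # not detectable: grow the bracket from below
    return hi

# Toy monotone power curve: power rises linearly and saturates at 1.
mde = bisect_mde(lambda effect: min(1.0, effect))
assert abs(mde - 0.8) < 1e-2
```

simulate_sample_size() follows the same shape, bisecting over n_units on a snapped grid instead of over a continuous effect size.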