
Extend power analysis to all estimators + simulation-based MDE/sample size#208

Merged
igerber merged 17 commits into main from power-analysis
Mar 20, 2026
Conversation

@igerber
Owner

@igerber igerber commented Mar 18, 2026

Summary

  • Add estimator registry mapping all 12 built-in estimators (excluding BaconDecomposition) to their appropriate DGP, fit kwargs, and result extraction profiles
  • Refactor simulate_power() from hardcoded 4-estimator if/elif chain to registry-based lookup supporting all estimators automatically
  • Add simulate_mde() — bisection search over effect sizes to find the minimum detectable effect for any estimator
  • Add simulate_sample_size() — bisection search over n_units to find the required sample size for any estimator
  • Add SimulationMDEResults and SimulationSampleSizeResults dataclasses with summary(), to_dict(), to_dataframe() methods
  • 34 new tests: registry unit tests, per-estimator coverage tests, MDE search tests, sample size search tests
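The registry pattern above can be sketched roughly as follows. This is an illustrative toy, not the actual diff_diff API: the names EstimatorProfile, REGISTRY, and lookup() are hypothetical stand-ins for whatever the real module defines.

```python
# Hypothetical sketch of the estimator-registry pattern: each estimator
# maps to a profile bundling its DGP kwargs, fit kwargs, and a result
# extractor. Names are illustrative, not the library's real symbols.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class EstimatorProfile:
    dgp_kwargs: dict      # how to generate simulated panels for this estimator
    fit_kwargs: dict      # how to call estimator.fit() on the simulated data
    extract: Callable[[Any], float]  # pull the scalar estimate off the result

REGISTRY = {
    "DifferenceInDifferences": EstimatorProfile(
        dgp_kwargs={"n_periods": 2},
        fit_kwargs={"treatment": "treated", "time": "post"},
        extract=lambda result: result.att,
    ),
}

def lookup(estimator_name: str) -> EstimatorProfile:
    # A registry lookup replaces the old hardcoded if/elif chain:
    # any registered estimator is supported automatically.
    try:
        return REGISTRY[estimator_name]
    except KeyError:
        raise ValueError(f"{estimator_name} is not in registry") from None
```

The point of the refactor is that adding estimator support becomes a one-entry data change instead of another branch in simulate_power().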

Methodology references (required if estimator / math changes)

  • Method name(s): N/A — no methodology changes, this extends power analysis infrastructure only
  • Paper / source link(s): N/A
  • Any intentional deviations from the source (and why): None

Validation

  • Tests added/updated: tests/test_power.py — 34 new tests across 4 test classes (TestEstimatorRegistry, TestEstimatorCoverage, TestSimulateMDE, TestSimulateSampleSize)
  • All 76 tests pass (4 skipped: matplotlib, 2 deselected: slow-marked TROP/SDiD)
  • black, ruff check, and mypy all pass clean
  • Backtest / simulation / notebook evidence: N/A

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions

Overall Assessment

⛔ Blocker

Highest unmitigated severity is P0.

Executive Summary

  • The new ContinuousDiD registry profile is methodologically wrong for power analysis: it varies the DGP slope, but simulate_power() treats that scalar as the estimator’s scalar truth, even though the estimator returns overall_att. With defaults, treatment_effect=0 is still a non-null design.
  • The new simulate_mde() / simulate_sample_size() explicit-range paths can return unchecked upper bounds and power_at_* = 0.0 without ever evaluating the endpoint, because the supplied bracket is neither validated nor fully evaluated.
  • The added tests are mostly smoke tests. They do not cover ContinuousDiD null calibration or the explicit effect_range / n_range branches, so both defects can merge undetected.
  • I did not find a mitigating Methodology Registry note or TODO.md entry for either issue.

Path to Approval

  1. Remove ContinuousDiD from the automatic power registry for this PR, or redesign that profile so the simulated DGP, extracted estimate, and reported true_effect all refer to the same documented scalar estimand.
  2. In simulate_mde() and simulate_sample_size(), always evaluate user-supplied bracket endpoints, validate that the target power is actually bracketed, and never default power_at_* to 0.0 for unevaluated points.
  3. Add regression tests for ContinuousDiD under a null effect and for explicit effect_range / n_range edge cases, including narrow and unbracketed ranges.

Methodology

  • Severity: P0. Impact: The new ContinuousDiD power profile does not simulate the same estimand it evaluates. _continuous_dgp_kwargs() maps treatment_effect into the DGP’s att_slope only, while the generator keeps att_intercept=1.0 by default and defines linear treatment effects as att_intercept + att_slope * d. But _extract_continuous() pulls result.overall_att, and simulate_power() stores the input scalar as true_effect for bias/coverage. The Methodology Registry defines ContinuousDiD targets as ATT(d), ATT^{glob}, and ACRT(d), not the slope parameter. So this profile is wrong in two ways: treatment_effect=0 is still non-null by default, and even with att_intercept=0, the reported truth is still not the returned estimand. Concrete fix: either drop ContinuousDiD from _get_registry() for now, or add a ContinuousDiD-specific power API that targets a documented scalar quantity and computes the truth from the generated dose-response, not from the input slope. Evidence: diff_diff/power.py:127, diff_diff/power.py:273, diff_diff/power.py:1538, diff_diff/prep_dgp.py:780, diff_diff/prep_dgp.py:892, docs/methodology/REGISTRY.md:459
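The mismatch is easy to see numerically. The toy below (not library code) uses the linear-effect form the finding quotes, effect(d) = att_intercept + att_slope * d, with the stated default att_intercept = 1.0; the dose values are arbitrary illustrations.

```python
# Toy illustration of the estimand mismatch: the DGP's linear treatment
# effect is att_intercept + att_slope * d, with att_intercept = 1.0 by
# default. Doses below are arbitrary example values.
def linear_effect(d, att_intercept=1.0, att_slope=0.0):
    return att_intercept + att_slope * d

doses = [0.25, 0.5, 1.0]

# Mapping treatment_effect onto att_slope alone means "zero effect"
# still produces a constant effect of 1.0 at every dose: not a null.
null_effects = [linear_effect(d, att_slope=0.0) for d in doses]

# And overall_att is a dose-weighted average of effect(d), which differs
# from the input slope, so the reported truth and the extracted estimate
# refer to different scalars.
overall_att = sum(linear_effect(d, att_slope=2.0) for d in doses) / len(doses)
```

With a slope of 2.0 the average effect is (1.5 + 2.0 + 3.0) / 3 ≈ 2.17, not 2.0, which is exactly the truth-vs-estimand gap the finding describes.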

Code Quality

  • Severity: P1. Impact: The new bisection helpers do not correctly handle explicit search ranges. When effect_range / n_range is provided, the code just assigns lo, hi and skips endpoint evaluation. best_power is then initialized from an empty search_path to 0.0, and simulate_sample_size() can even exit before simulating anything when hi - lo <= 2. That means these functions can return an unchecked upper bound labeled as mde / required_n, together with fabricated power_at_mde / power_at_n = 0.0. Concrete fix: evaluate both endpoints whenever a bracket is supplied, require or explicitly warn/error if the target is not bracketed, and only report power_at_* for values that were actually simulated. Evidence: diff_diff/power.py:1842, diff_diff/power.py:1879, diff_diff/power.py:2016, diff_diff/power.py:2033, diff_diff/power.py:2036
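The suggested fix amounts to a standard bracketed bisection. A minimal sketch, with a toy power curve standing in for the Monte Carlo simulation (simulate_power_stub and simulate_mde_sketch are hypothetical names, not the library's API):

```python
# Sketch of bisection over effect sizes with explicit bracket validation.
# The power curve here is a toy logistic stand-in for the real simulation.
import math

def simulate_power_stub(effect, n_units):
    # Toy monotone power curve: rises smoothly with effect * sqrt(n).
    return 1.0 / (1.0 + math.exp(-8.0 * (effect * math.sqrt(n_units) / 10.0 - 1.0)))

def simulate_mde_sketch(power=0.8, n_units=100, lo=0.0, hi=2.0, tol=1e-3):
    """Find the smallest effect whose simulated power reaches the target."""
    # Always evaluate the supplied endpoints: never return an unchecked
    # upper bound, and fail loudly if the target power is not bracketed.
    if simulate_power_stub(hi, n_units) < power:
        raise ValueError("upper bound of effect_range does not reach target power")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if simulate_power_stub(mid, n_units) >= power:
            hi = mid   # mid is detectable: shrink the bracket from above
        else:
            lo = mid   # mid is not detectable: raise the floor
    return hi          # every returned value was actually simulated

mde = simulate_mde_sketch()
```

The invariant power(lo) < target <= power(hi) guarantees the returned value has an evaluated power at or above target, so a fabricated power_at_mde = 0.0 can never be reported.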

Performance

No material findings.

Maintainability

No material findings beyond the correctness issues above.

Tech Debt

No separate tech-debt-only findings. The issues above are not tracked in TODO.md:38, so they remain unmitigated for assessment purposes.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The added tests do not exercise the new failure modes. There is an existing DID null-effect calibration test, but the new ContinuousDiD coverage only checks that a result object is finite, and all new MDE/sample-size tests are DID-only. There is also no test for explicit effect_range / n_range inputs. Concrete fix: add tests that assert ContinuousDiD null behavior is calibrated or explicitly unsupported, plus tests that explicit brackets are evaluated/validated and that unbracketed ranges fail loudly. Evidence: tests/test_power.py:390, tests/test_power.py:980, tests/test_power.py:1001, tests/test_power.py:1106

Review note: I could not execute the test suite in this environment because the local Python environment is missing numpy, so this is a static diff review.

igerber added a commit that referenced this pull request Mar 18, 2026
…bisection brackets

P0: Remove ContinuousDiD from power analysis registry — the DGP simulated
att_slope as treatment_effect but extracted overall_att (a dose-weighted
average), so the simulated truth and evaluated estimand didn't match.

P1: Add endpoint evaluation for user-supplied effect_range/n_range in
simulate_mde() and simulate_sample_size(). Previously these paths skipped
evaluation entirely, leaving search_path empty and best_power at 0.0.
Now warns when brackets don't contain the target power.

P2: Add regression tests for all three fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 11539e7a01a24a990ff517c75c04099557578681


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

  • The prior ContinuousDiD blocker is resolved: it is no longer auto-registered for simulation power, and there is now a regression test for that path.
  • The new automatic TwoWayFixedEffects registry entry is still methodologically wrong for the default multi-period simulation path. It passes time="period" into an estimator that explicitly warns this does not estimate the standard DiD ATT.
  • The explicit-endpoint fix is only partial. simulate_sample_size() still never uses the lower-endpoint power, so it can return an oversized required_n when the lower bound already achieves target power.
  • The new tests are mostly smoke tests. They do not lock down either the TWFE methodology issue or the lower-bound sample-size bug.

Methodology

  • Severity: P1. Impact: the new TWFE registry profile is not aligned with the estimator’s documented estimand. _twfe_fit_kwargs() wires automatic power to time="period", but TwoWayFixedEffects.fit() explicitly warns that multi-valued time inputs estimate treated * period_number, not the standard treated * post ATT. That means the new “all built-in estimators are supported automatically” claim is false for TWFE under the default n_periods=4, and the resulting simulate_power(), simulate_mde(), and simulate_sample_size() outputs for TWFE are not methodologically valid. Concrete fix: do not auto-register TWFE for multi-period power simulation in its current form. Either restrict automatic TWFE support to validated two-period / binary-post designs, or change the estimator/power interface so time fixed effects and the post indicator are passed separately before claiming automatic support. Evidence: diff_diff/power.py:L139, diff_diff/power.py:L267, diff_diff/power.py:L1224, diff_diff/twfe.py:L96, docs/methodology/REGISTRY.md:L238, docs/methodology/REGISTRY.md:L253

Code Quality

  • Severity: P1. Impact: simulate_sample_size() still violates the basic bisection invariant because it evaluates the lower endpoint only for logging and never checks whether that lower bound already meets target power. If lo already satisfies the target, the function still bisects above it and can silently return a too-large required_n; the same bug exists in the auto-bracketing path when min_n is already sufficient. Concrete fix: evaluate and store power_lo, return lo immediately when power_lo >= power, and only enter bisection once the invariant power(lo) < target <= power(hi) holds. Evidence: diff_diff/power.py:L1988, diff_diff/power.py:L1999, diff_diff/power.py:L2016
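The lower-endpoint short-circuit described in the fix can be sketched as follows, again with a toy power curve in place of the real simulation (power_at_n and required_sample_size are illustrative names):

```python
# Sketch of the sample-size bisection with both endpoints evaluated and
# an early return when the lower bound already meets target power.
def power_at_n(n):
    # Toy monotone power curve in the sample size.
    return min(1.0, n / 500.0)

def required_sample_size(target=0.8, lo=100, hi=2000):
    power_lo = power_at_n(lo)
    if power_lo >= target:
        return lo, power_lo   # lower bound already sufficient: stop here
    power_hi = power_at_n(hi)
    if power_hi < target:
        raise ValueError("upper bound of n_range does not reach target power")
    # Invariant from here on: power(lo) < target <= power(hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if power_at_n(mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi, power_at_n(hi)
```

Without the power_lo check, a caller passing an n_range whose lower bound already achieves the target would get back a strictly larger n than necessary, which is the silent oversizing the finding flags.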

Performance

No material findings in the changed code.

Maintainability

No material findings beyond the correctness issues above.

Tech Debt

No separate deferred-work findings. These issues are not tracked in the code-review debt section of TODO.md:L38, so they remain unmitigated for assessment.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests would not catch either P1. test_twfe() only asserts that the call returns a finite-looking result, not that the TWFE registry path is methodologically valid, and the sample-size tests never cover the case where the lower search bound already achieves target power. The simulate_power() docstring also overstates support by saying all built-in estimators are automatically supported. Concrete fix: add regression tests that either reject unsupported multi-period TWFE power simulation or validate a corrected two-period TWFE path, plus tests where simulate_sample_size() must return the lower endpoint for both explicit n_range and auto-bracketed strong-effect cases; narrow the docstring claim until that coverage exists. Evidence: tests/test_power.py:L844, tests/test_power.py:L1177, tests/test_power.py:L1201, diff_diff/power.py:L1224

Path to Approval

  1. Restrict or redesign automatic TWFE power support so it does not run the current multi-period time="period" path and report it as standard ATT power.
  2. Fix simulate_sample_size() to use both bracket endpoints and short-circuit when the lower bound already meets target power.
  3. Add regression tests for the two cases above, including an explicit lower-bound-sufficient sample-size search.

Static diff review only. I could not execute the test suite in this environment because the local Python environment is missing numpy.

igerber added a commit that referenced this pull request Mar 18, 2026
…short-circuit

- Remove TwoWayFixedEffects from power analysis registry (time="period"
  produces treated*period_number, not standard ATT)
- Add early return in simulate_sample_size() when lower bound already
  achieves target power (both explicit n_range and auto-bracket paths)
- Narrow docstring from "All" to "Most" built-in estimators
- Add regression tests for TWFE exclusion and lo-sufficient scenarios

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d79b042bd80e8d8a1afa453ae8397f4fb2e95b37


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

The two prior blockers from the last review appear resolved: TwoWayFixedEffects is no longer auto-registered for simulation power, and simulate_sample_size() now short-circuits when the lower bound already meets target power, with regression coverage for both cases (diff_diff/power.py:L249-L331, tests/test_power.py:L803-L813, diff_diff/power.py:L1972-L2003, tests/test_power.py:L1229-L1256). One new P1 remains.

Executive Summary

Methodology

Code Quality

No material findings beyond the methodology issue above.

Performance

No material findings in the changed code.

Maintainability

No material findings.

Tech Debt

No separate deferred-work findings. The remaining P1 is not tracked under the code-review debt section of TODO.md:L38-L64, so it remains unmitigated for assessment.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new test coverage does not exercise the public default SyntheticDiD auto-support path. test_synthetic_did() only uses treatment_fraction=0.3, and the new MDE/sample-size tests only cover DifferenceInDifferences, so the current default-path regression would not be caught by the suite. Concrete fix: add regression tests for simulate_power(SyntheticDiD(), progress=False) at defaults, plus corresponding simulate_mde() / simulate_sample_size() coverage, and assert either valid finite inference or an explicit fail-fast contract. References: tests/test_power.py:L959-L970, tests/test_power.py:L978-L1103, tests/test_power.py:L1111-L1256.

Path to Approval

  1. Make the automatic SyntheticDiD simulation path valid or explicitly unsupported. Concretely: detect the default placebo-variance case where n_control <= n_treated and either raise with a clear message, or switch the supported automatic configuration to one that yields valid inference.
  2. Add regression tests covering the public default SyntheticDiD path in simulate_power(), and the same configuration through simulate_mde() and simulate_sample_size().
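A fail-fast check along the lines of item 1 might look like this. The function name and message wording are hypothetical; only the n_control > n_treated requirement comes from the review.

```python
# Illustrative fail-fast validation for the placebo-variance design:
# reject n_control <= n_treated up front instead of silently altering
# the requested design. Names are hypothetical sketches.
def check_placebo_design(n_units, treatment_fraction):
    n_treated = int(round(n_units * treatment_fraction))
    n_control = n_units - n_treated
    if n_control <= n_treated:
        raise ValueError(
            "SyntheticDiD placebo variance requires n_control > n_treated "
            f"(got {n_control} controls, {n_treated} treated); lower "
            "treatment_fraction or use bootstrap variance."
        )
    return n_treated, n_control
```

Raising here, rather than rewriting treatment_fraction, keeps the simulated design identical to what the caller asked for.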

Static review only. I could not run the test suite in this environment because pytest is unavailable, and local Python imports fail because numpy is not installed.

igerber added a commit that referenced this pull request Mar 18, 2026
…ro-power bug

Add _sdid_dgp_kwargs that caps treatment_fraction at 0.4 so the placebo
variance method has enough pseudo-controls (n_control > n_treated).
Add regression test for the default SyntheticDiD path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 32f2b2b52bb09b1d35bcbe6a9c5b5f8dab5e3ee8


Overall Assessment

⛔ Blocker

Highest unmitigated severity: P0.

Executive Summary

Methodology

Code Quality

No material findings beyond the methodology blocker above.

Performance

No material findings in the changed code.

Maintainability

No material findings beyond the test gap below.

Tech Debt

No relevant entry under TODO.md:L38-L64 mitigates the blocker above, so it remains unmitigated for assessment.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests validate that the default SyntheticDiD path no longer returns zero power, but they do not verify that the requested design was preserved, and they do not exercise SyntheticDiD through simulate_mde() or simulate_sample_size(). That is why the silent treatment_fraction rewrite passes the suite. Concrete fix: add public-API regression tests that assert either (a) the realized treated share matches the requested treatment_fraction, or (b) unsupported placebo designs raise/warn explicitly; run the same contract through simulate_power(), simulate_mde(), and simulate_sample_size() for SyntheticDiD. References: tests/test_power.py:L697-L729, tests/test_power.py:L975-L987, tests/test_power.py:L995-L1269.

Path to Approval

  1. Remove the unconditional 40% cap from _sdid_dgp_kwargs() so the simulation respects the caller’s treatment_fraction.
  2. Add an explicit contract for unsupported SyntheticDiD placebo designs: if n_control <= n_treated, raise with a clear message or require an explicit bootstrap configuration, rather than silently changing the design.
  3. Add regression coverage for SyntheticDiD across simulate_power(), simulate_mde(), and simulate_sample_size() that checks the contract above.

Static review only. I could not run pytest here because pytest is unavailable, and local imports fail because numpy is not installed.

igerber added a commit that referenced this pull request Mar 18, 2026
…n cap with fail-fast validation

Remove _sdid_dgp_kwargs that silently capped treatment_fraction at 0.4.
Add upfront ValueError in simulate_power() when SyntheticDiD placebo
variance has n_control <= n_treated, with actionable error message
suggesting lowering treatment_fraction or using bootstrap variance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4fe4a7ab2714d25bd64c6742359b04631c77e2e1


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

  • The prior SyntheticDiD blocker is resolved: the code now fails fast for placebo designs with n_control <= n_treated, the behavior is documented in the Methodology Registry, and tests cover simulate_power(), simulate_mde(), and simulate_sample_size().
  • [Newly identified] The registry refactor leaves TwoWayFixedEffects outside the automatic power-analysis path, so the default simulation APIs now error for a first-class public estimator. The new tests explicitly lock that unsupported state in.
  • The new "provide a custom data_generator and estimator_kwargs" fallback is not actually generic. Unregistered estimators with non-basic fit() signatures, such as ContinuousDiD, still receive hard-coded basic-DiD kwargs and cannot use the advertised escape hatch.
  • Test coverage is broad for registered estimators and the SyntheticDiD fail-fast path, but it adds no positive regression coverage for TwoWayFixedEffects and no test of the advertised custom fallback for non-basic estimators.

Methodology

Code Quality

  • Severity: P1. Impact: TwoWayFixedEffects is still exported as a core public estimator, but the new registry omits it and simulate_power() now raises "not in registry"; simulate_mde() and simulate_sample_size() inherit the same failure. The suite then codifies that unsupported state. This is an undocumented contraction of the default power-analysis API introduced by the refactor. Concrete fix: add a dedicated TwoWayFixedEffects profile to _get_registry() and cover all three public simulation entry points. When you add it, use a binary post indicator in the fit kwargs (time="post", unit="unit"), not the old time="period" pattern, because the registry explicitly says multi-valued time does not estimate the standard TWFE ATT. References: diff_diff/__init__.py:L37-L42, diff_diff/__init__.py:L188-L200, diff_diff/power.py:L249-L331, diff_diff/power.py:L1298-L1307, docs/methodology/REGISTRY.md:L253-L257, tests/test_power.py:L803-L814

Performance

No material findings in the changed code.

Maintainability

  • Severity: P2. Impact: The new fallback contract for unregistered estimators is incomplete. The error path tells users to "Provide a custom data_generator and estimator_kwargs," but when profile is None the code still injects treatment="treated" and time="post" unconditionally. Estimators with non-basic fit() signatures, such as ContinuousDiD.fit(data, outcome, unit, time, first_treat, dose, ...), cannot actually use the advertised escape hatch. Concrete fix: either add a truly generic fit_kwargs/fit_kwargs_builder override that lets callers fully control the fit signature, or raise a targeted error for non-basic unregistered estimators instead of suggesting a path that still fails. References: diff_diff/power.py:L1302-L1307, diff_diff/power.py:L1408-L1418, diff_diff/continuous_did.py:L153-L161

Tech Debt

No relevant TODO.md entry mitigates the P1 or P2 findings above. Reference: TODO.md:L38-L64

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests institutionalize TwoWayFixedEffects as unsupported instead of guarding against the regression, and they do not exercise the advertised custom-generator fallback for an unregistered non-basic estimator. That leaves both API-contract issues above unguarded. Concrete fix: replace the negative TwoWayFixedEffects test with positive coverage for simulate_power(), simulate_mde(), and simulate_sample_size(), and add one regression test that either proves the custom fallback works for a non-basic estimator or asserts a targeted "unsupported fit signature" error. References: tests/test_power.py:L803-L814, tests/test_power.py:L834-L1067, tests/test_power.py:L1075-L1352

Path to Approval

  1. Add a TwoWayFixedEffects registry entry so the default simulation APIs cover that public estimator.
  2. Make the unregistered-estimator fallback honest: either support fully caller-specified fit kwargs or fail with a targeted error instead of hard-coding basic-DiD kwargs.
  3. Add regression coverage for TwoWayFixedEffects across simulate_power(), simulate_mde(), and simulate_sample_size(), plus one test covering the intended behavior for an unregistered non-basic estimator.

Static review only. I did not execute the test suite in this environment.

igerber added a commit that referenced this pull request Mar 18, 2026
…tor fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 5d2fdaf892bbe20fd99ce97b96ba339bc9121ff3


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Static review only. I could not execute the test suite in this environment because pytest and the scientific Python deps are not installed.

Executive Summary

  • The prior SyntheticDiD blocker is resolved: the simulation APIs now fail fast when placebo variance would violate n_control > n_treated, that deviation is documented in the Methodology Registry, and tests cover simulate_power(), simulate_mde(), and simulate_sample_size().
  • The prior TwoWayFixedEffects blocker is resolved: TWFE is back in the estimator registry, and the registry now uses a binary post indicator, which matches the Registry guidance for standard TWFE ATT.
  • The prior “custom fallback” concern is only partially resolved: caller-controlled estimator_kwargs fixed the old fit-signature problem, but unregistered estimators with nonstandard result objects still cannot use the advertised escape hatch.
  • The power docs remain stale and unguarded: docs/api/power.rst still documents nonexistent PowerAnalysis.compute_* methods and old simulate_power() arguments, and the new snippet smoke test never exercises that file.
  • The new TODO.md entry correctly tracks the remaining blanket-NameError limitation in doc-snippet smoke tests, so that item is informational only.

Methodology

Code Quality

  • No material findings.

Performance

  • No material findings.

Maintainability

  • Severity: P2. Impact: The advertised unregistered-estimator fallback is still not truly generic. When profile is None, simulate_power() now lets callers fully control estimator.fit() kwargs, but result extraction still assumes either .att/.se/.p_value/.conf_int or .avg_*. Public unregistered estimators with different result schemas, such as ContinuousDiD (overall_att, overall_att_se, overall_att_p_value, overall_att_conf_int), will still fail even if the caller supplies a correct custom DGP wrapper and fit kwargs. That means the previous “generic fallback” concern is only partially resolved. Concrete fix: add a caller-supplied result_extractor hook, or extend the fallback extractor to handle overall_* result objects; then add a regression test using ContinuousDiD with explicit data_generator and estimator_kwargs. References: diff_diff/power.py:L1318-L1446, diff_diff/continuous_did.py:L153-L161, diff_diff/continuous_did_results.py:L112-L118, tests/test_power.py:L1097-L1138
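The suggested result_extractor hook can be sketched as below. The default field names (.att, .se, .p_value, .conf_int) and the overall_* fields match those quoted in the finding; the function names and hook signature are illustrative.

```python
# Sketch of a caller-supplied result_extractor hook: the extractor maps
# any result object to the (estimate, se, p_value, conf_int) tuple the
# simulation loop expects. Names are illustrative.
from types import SimpleNamespace

def default_extractor(result):
    # Fallback for standard result schemas.
    return result.att, result.se, result.p_value, result.conf_int

def extract(result, result_extractor=None):
    extractor = result_extractor or default_extractor
    return extractor(result)

# A ContinuousDiD-style result exposes overall_* fields instead, so the
# caller supplies a custom hook rather than failing in the fallback:
cont_result = SimpleNamespace(
    overall_att=1.2, overall_att_se=0.3,
    overall_att_p_value=0.01, overall_att_conf_int=(0.6, 1.8),
)
est, se, p, ci = extract(
    cont_result,
    result_extractor=lambda r: (
        r.overall_att, r.overall_att_se,
        r.overall_att_p_value, r.overall_att_conf_int,
    ),
)
```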

Tech Debt

  • Severity: P3. Impact: tests/test_doc_snippets.py still treats any NameError as a pass, which can mask standalone snippet breakage, but this PR explicitly tracks that limitation in TODO.md, so it is mitigated and non-blocking under the review policy. Concrete fix: none required for approval; future work is to replace the blanket NameError catch with a per-snippet allowlist or explicit fixture setup. References: tests/test_doc_snippets.py:L357-L363, TODO.md:L62-L65

Security

  • No material findings.

Documentation/Tests

  • Severity: P2. Impact: The power documentation is still out of sync with the implementation, and the new smoke-test coverage does not protect it. docs/api/power.rst still advertises nonexistent PowerAnalysis.compute_* methods and stale simulate_power() parameters, while the newly exported simulate_mde, simulate_sample_size, SimulationMDEResults, and SimulationSampleSizeResults are absent from the API index. Because tests/test_doc_snippets.py does not include docs/api/power.rst, this broken documentation path remains untested. Concrete fix: update docs/api/power.rst to the current API (power, mde, sample_size, simulate_mde, simulate_sample_size, and both new result classes), add the new symbols to docs/api/index.rst and autosummary, and include api/power.rst in RST_FILES. References: docs/api/power.rst:L33-L65, docs/api/power.rst:L145-L160, docs/api/index.rst:L148-L155, tests/test_doc_snippets.py:L25-L40, diff_diff/__init__.py:L53-L64, diff_diff/__init__.py:L295-L306

@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

igerber and others added 7 commits March 18, 2026 15:00
…sample size

- Add estimator registry mapping all 12 estimators to appropriate DGP,
  fit kwargs, and result extraction profiles
- Refactor simulate_power() to use registry lookup instead of hardcoded
  if/elif chain (was 4 estimators, now 12 + custom fallback)
- Add simulate_mde() for bisection search over effect sizes
- Add simulate_sample_size() for bisection search over n_units
- Add SimulationMDEResults and SimulationSampleSizeResults dataclasses
- 34 new tests: registry, estimator coverage, MDE search, sample size search

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… API symbols

- Add `result_extractor` parameter to simulate_power, simulate_mde, and
  simulate_sample_size for unregistered estimators with non-standard schemas
- Fix power.rst: correct PowerAnalysis method names, example code, and add
  SimulationMDEResults/SimulationSampleSizeResults/simulate_mde/simulate_sample_size
- Add 4 missing symbols to docs/api/index.rst autosummary
- Add api/power.rst to doc snippet smoke tests
- Add tests for custom result_extractor and MDE forwarding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a15ce5a4d7a4a318b9edb024152b61cf8307004c


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. Static review against the provided diff and surrounding source context; I did not execute the test suite in this environment.

Executive Summary

  • Prior blockers from the last review appear addressed: TwoWayFixedEffects is back in the power registry with a binary post path, and the SyntheticDiD placebo fail-fast behavior is now documented in the Methodology Registry.
  • One unmitigated P1 remains in the new automatic power-analysis path: the staggered-estimator registry hardcodes a single-cohort + never-treated DGP and does not adapt to estimator-level settings like control_group, clean_control, or anticipation.
  • That means simulate_power(), and by extension simulate_mde() / simulate_sample_size(), can silently report power for a different design than the configured estimator object the caller passed.
  • The new power tests mostly cover default estimator instances, so those parameter-dependent staggered paths are not exercised.
  • The doc-snippet NameError masking limitation is now tracked in TODO.md, so it is informational only.

Methodology

Code Quality

  • No material findings in the changed code beyond the methodology issue above.

Performance

  • No material findings.

Maintainability

  • No material findings.

Tech Debt

  • Severity: P3. Impact: the new doc-snippet smoke test still treats any NameError as a pass, which can mask standalone example breakage, but this PR also adds that limitation to TODO.md, so it is properly tracked and non-blocking under the review policy. Concrete fix: none required for approval; later, replace blanket NameError suppression with per-snippet setup or an allowlist. References: tests/test_doc_snippets.py:L357-L363, TODO.md:L60-L65

Security

  • No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new estimator-coverage tests only instantiate default estimator objects, so the regression suite never exercises the new auto-DGP path for control_group="not_yet_treated", clean_control="strict"/"not_yet_treated", or anticipation>0. That leaves the methodology mismatch above unguarded. Concrete fix: add explicit simulation tests for those configurations across simulate_power(), simulate_mde(), and simulate_sample_size(). References: tests/test_power.py:L828-L1095

Path to Approval

  1. Update the staggered power registry so the default DGP matches estimator settings that change identification/comparison design (control_group, clean_control, cohort structure, anticipation), or refuse those configurations unless callers supply matching data_generator_kwargs / data_generator.
  2. Add regression tests for at least CallawaySantAnna(control_group="not_yet_treated") with data_generator_kwargs={"never_treated_frac": 0.0, "cohort_periods": [2, 4]}, StackedDiD(clean_control="strict", kappa_pre=1, kappa_post=1) with later-treated clean controls, and one anticipation>0 staggered estimator case where the DGP matches the shifted effect onset/reference period.
  3. If any automatic-simulation limitation remains intentional, document it in docs/methodology/REGISTRY.md with a Note or Deviation label and emit a runtime warning when the registry default DGP is not faithful to the configured estimator.

Add _check_staggered_dgp_compat() that detects control_group="not_yet_treated",
clean_control="strict", and anticipation>0 on staggered estimators using the
automatic single-cohort DGP, and emits UserWarning with specific guidance on
supplying data_generator_kwargs or a custom data_generator.

- 5 warning/no-warning unit tests + 2 slow regression tests with matching DGP
- Document limitation in REGISTRY.md PowerAnalysis section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 800d37f224f76b2218d164098a3fdec8d3797e6d


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. Static review of the provided diff and surrounding source; I did not execute the test suite in this environment because the package import fails without numpy.

Executive Summary

  • The prior staggered-registry blocker is mitigated: the single-cohort auto-DGP limitation is now documented in the Methodology Registry, surfaced as UserWarnings, and covered by regression tests.
  • One P1 remains in the new SyntheticDiD power path: the placebo-variance fail-fast ignores data_generator_kwargs overrides that change the effective treated/control split.
  • That allows simulate_power(), simulate_mde(), and simulate_sample_size() to return numeric output for an infeasible placebo design instead of raising the documented ValueError.
  • The new SDID tests only cover the treatment_fraction boundary, so the override-based regression is not exercised.
  • I did not find additional material issues in code quality, performance, maintainability, security, or docs beyond that gap.

Methodology

Code Quality

No material findings in the changed code beyond the methodology issue above.

Performance

No material findings.

Maintainability

No material findings.

Tech Debt

No material findings newly introduced by this diff.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new SDID regression coverage only checks the placebo boundary through treatment_fraction; it never exercises the supported data_generator_kwargs override path that changes the actual treated count, so the P1 above shipped untested. Concrete fix: add simulate_power(), simulate_mde(), and simulate_sample_size() tests using SyntheticDiD(), n_units=50, treatment_fraction=0.3, and data_generator_kwargs={"n_treated": 30} (or any override that makes n_control <= n_treated) and assert the same ValueError as the high-treatment_fraction path. References: tests/test_power.py:L968-L1030, diff_diff/power.py:L1409-L1500.

Path to Approval

  1. Move the SyntheticDiD placebo-feasibility validation to the effective DGP parameters after data_generator_kwargs are merged, so registry-based overrides of n_treated cannot bypass the documented ValueError.
  2. Add regression tests for all three simulation helpers covering the override case, not just the treatment_fraction case.
  3. If the fail-fast guarantee is intentionally limited to the default factor DGP, narrow the REGISTRY.md note/error text accordingly; otherwise keep the current note and make the validation match it.
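Item 1 above — validating on the effective parameters only after the kwargs merge — might look like the following sketch. The helper name, defaults, and message wording are assumptions for illustration, not the actual diff_diff code; the point is that an `n_treated` override in `data_generator_kwargs` must be merged in before the feasibility check runs.

```python
def validate_sdid_placebo_design(n_units, treatment_fraction,
                                 data_generator_kwargs=None):
    """Fail fast when the *effective* treated/control split makes the
    placebo variance estimator infeasible (n_control <= n_treated).

    Sketch only: the key point is merging data_generator_kwargs overrides
    such as n_treated *before* checking feasibility.
    """
    merged = dict(data_generator_kwargs or {})
    n_treated = merged.get("n_treated", round(n_units * treatment_fraction))
    n_control = n_units - n_treated
    if n_control <= n_treated:
        raise ValueError(
            f"placebo variance requires n_control > n_treated; "
            f"got {n_control} control vs {n_treated} treated"
        )
    return n_treated, n_control
```

With this ordering, `validate_sdid_placebo_design(50, 0.3)` passes (15 treated vs 35 control), while the same call with `data_generator_kwargs={"n_treated": 30}` raises instead of silently simulating an infeasible design.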

… override

Move the SyntheticDiD placebo-variance fail-fast after kwargs merge so that
data_generator_kwargs={"n_treated": N} overrides are caught. Add tests for
the override path across simulate_power, simulate_mde, and simulate_sample_size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 9e18a3ea05f874da0a9bc63f1f66fc0eb5f2f343


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. Static review only; I could not execute the test suite in this environment because pytest and numpy are unavailable.

Executive Summary

Methodology

Code Quality

No material findings beyond the methodology issue above.

Performance

No material findings.

Maintainability

No material findings beyond the need to make the TripleDifference adapter semantics explicit.

Tech Debt

No separate tech-debt-only findings. The TripleDifference adapter issue above is not tracked in TODO.md, so it remains unmitigated for approval purposes. See TODO.md:L38.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new test coverage does not exercise the DDD adapter’s parameter translation edge cases. TripleDifference is only tested in simulate_power() with n_units=80, which happens to map cleanly to 10 observations per cell, and there are no DDD-specific simulate_mde() or simulate_sample_size() cases. That leaves the P1 above unguarded. Concrete fix: add TripleDifference regression tests for simulate_power(), simulate_mde(), and simulate_sample_size() with values like n_units=10 and n_units=65, and assert the intended behavior for unsupported n_periods/treatment_period/treatment_fraction and for effective sample-size handling. See tests/test_power.py:L930, tests/test_power.py:L1353, tests/test_power.py:L1486.

Path to Approval

  1. Make the TripleDifference registry path enforce a one-to-one mapping between public simulation inputs and the actual DDD DGP: validate or reject unsupported n_periods, treatment_period, and treatment_fraction, and stop silently rounding n_units without surfacing the effective design.
  2. Add DDD-specific regression tests for simulate_power(), simulate_mde(), and simulate_sample_size() covering non-multiples of 8 and unsupported generic design knobs.
  3. If any DDD-specific rounding or fixed-2x2x2 behavior is intentional, document it in docs/methodology/REGISTRY.md with a **Note:** and mirror that behavior in the public docstrings/results so the simulated design is explicit to callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 0e912bbed172b3958c4e87088e35665c9a0b17c5


Overall Assessment

⛔ Blocker — Highest unmitigated severity: P0.

Static review only; I could not execute pytest in this environment because both pytest and numpy are unavailable.

Executive Summary

Methodology

  • Severity: P0. Impact: simulate_sample_size() and SimulationSampleSizeResults.required_n are documented as returning the required/smallest sample size, but the auto path seeds lo from an internal registry floor and returns immediately if that floor already reaches target power. Because that branch never evaluates any n < min_n, it can silently report a non-minimal required_n. This is a wrong statistical output, not just a conservative implementation choice. Concrete fix: use min_n only as an initial guess for bracketing, then search downward to the estimator’s true feasibility floor before returning; if you intend to return an upper bound instead, rename/document the result accordingly and emit a warning. See diff_diff/power.py:L1774-L1777, diff_diff/power.py:L2072-L2075, diff_diff/power.py:L2132-L2204, tests/test_power.py:L1747-L1758.
  • Severity: P1. Impact: _check_ddd_dgp_compat() suppresses all ignored-parameter warnings whenever n_per_cell is present in data_generator_kwargs, even though overriding n_per_cell does not make n_periods, treatment_period, or treatment_fraction meaningful to generate_ddd_data(). That leaves a silent design-mismatch path for DDD simulations and conflicts with the registry note promising a warning whenever inputs differ from the effective design. Concrete fix: let n_per_cell suppress only the effective-N rounding/clamping warning; keep warning on ignored n_periods, treatment_period, and treatment_fraction unless the caller supplies a fully custom data_generator. See diff_diff/power.py:L313-L359, docs/methodology/REGISTRY.md:L1721, tests/test_power.py:L1023-L1035.

Code Quality

No material findings beyond the correctness issues above.

Performance

No material findings.

Maintainability

No material findings beyond the need to separate “initial search guess” from “true minimum supported N” in the sample-size search path.

Tech Debt

  • Severity: P3. Impact: The earlier TripleDifference power-analysis limitation is now properly tracked in TODO.md and documented in REGISTRY.md, so that specific deviation is mitigated under the project’s review policy. Concrete fix: none for this PR. See TODO.md:L42-L50, docs/methodology/REGISTRY.md:L1720-L1721.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests lock in the two behaviors above instead of the documented contract: test_lo_already_sufficient_auto asserts the auto path should return the registry floor as required_n, and test_ddd_no_warn_n_per_cell_override asserts no warning in a still-ignored-parameter DDD path. That will hide the fixes behind failing tests. Concrete fix: update those assertions to match the corrected behavior, and add a DDD regression where n_per_cell is overridden but mismatched n_periods/treatment_fraction still warn. See tests/test_power.py:L1023-L1035, tests/test_power.py:L1747-L1758.

Path to Approval

  1. Fix simulate_sample_size() so the auto path does not return min_n as required_n without ruling out smaller sample sizes; either search downward after the first successful bracket or explicitly convert the API/result semantics to “upper bound” with a warning.
  2. Fix _check_ddd_dgp_compat() so n_per_cell only suppresses the effective-sample-size warning, not the warnings for ignored n_periods, treatment_period, and treatment_fraction.
  3. Update the new tests to reflect the corrected contracts, especially tests/test_power.py:L1023-L1035 and tests/test_power.py:L1747-L1758.

… warning

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: fa93beae35118378807aacbcecad8c3522477b21


Overall Assessment

⛔ Blocker — Highest unmitigated severity: P0.

Static review only. I could not run pytest or import the package in this environment because pytest and numpy are unavailable.

Executive Summary

Methodology

  • Severity: P0. Impact: On the registered-estimator path, dgp_kwargs.update(data_gen_kwargs) silently lets data_generator_kwargs override builder-supplied core simulation inputs. A conflicting key such as treatment_effect will make simulate_power() report one true_effect while generating data with another, which in turn corrupts bias, coverage, and any simulate_mde() / simulate_sample_size() search that relies on simulate_power() as its objective. Before this refactor, duplicate kwargs on the explicit call path would have raised immediately; now they silently change the estimand/DGP. Concrete fix: reject conflicting overrides on the registry path for core/search-driving inputs instead of blindly update()-ing, with a narrow allowlist only for documented, intentional overrides such as DDD n_per_cell. Add regression tests for collisions on treatment_effect and the searched quantity. Location: diff_diff/power.py:L1554-L1564, diff_diff/power.py:L1645-L1664, diff_diff/power.py:L1933-L1952, diff_diff/power.py:L2137-L2156
  • Severity: P1. Impact: simulate_sample_size() still claims to find the “smallest N” / “required sample size,” but in auto mode it returns the registry floor as soon as that floor reaches target power, without evaluating any smaller n. The added warning helps, but the returned required_n semantics remain incorrect and undocumented in the registry. Concrete fix: either actually search below the registry floor until infeasibility is established, or rename/document the auto-path result everywhere as an upper bound/search floor rather than required_n. Location: diff_diff/power.py:L1774-L1776, diff_diff/power.py:L1818-L1828, diff_diff/power.py:L2072-L2075, diff_diff/power.py:L2191-L2209
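The collision guard suggested for the P0 — rejecting conflicting overrides instead of blindly `update()`-ing — might take this shape. The key set and allowlist mechanics are assumptions for the sketch; the review's point is only that registry-managed inputs must not be silently replaced by `data_generator_kwargs`.

```python
# Assumed registry-managed inputs; the real set would live in diff_diff/power.py.
PROTECTED_DGP_KEYS = frozenset({
    "n_units", "n_periods", "treatment_period",
    "treatment_fraction", "treatment_effect", "noise_sd",
})

def merge_dgp_kwargs(builder_kwargs, overrides, allow=frozenset()):
    """Merge caller overrides into builder-supplied DGP kwargs, raising on
    collisions with registry-managed keys unless explicitly allowlisted."""
    clashes = (set(overrides) & PROTECTED_DGP_KEYS) - set(allow)
    if clashes:
        raise ValueError(
            "data_generator_kwargs would override registry-managed inputs "
            f"{sorted(clashes)}; set them via the public simulate_* API"
        )
    merged = dict(builder_kwargs)
    merged.update(overrides)
    return merged
```

Non-protected knobs (for example an extra noise parameter) still pass through, while a collision on `treatment_effect` raises instead of desynchronizing `true_effect` from the generated data; a narrow allowlist covers documented, intentional overrides such as DDD `n_per_cell`.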

Code Quality

No material findings beyond the correctness issues above.

Performance

No material findings.

Maintainability

No material findings beyond the unresolved required_n contract mismatch in simulate_sample_size().

Tech Debt

  • Severity: P3. Impact: The TripleDifference power-analysis limitation is now properly mitigated under project policy: the fixed 2×2×2 DGP is documented in the Methodology Registry and tracked in TODO.md, so that deviation is informational rather than blocking. Concrete fix: none for this PR. Location: docs/methodology/REGISTRY.md:L1720-L1721, TODO.md:L46-L50

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests now lock in the auto-floor behavior by asserting that simulate_sample_size() should return 20 in the “registry floor already sufficient” case, and there is no regression coverage for the new duplicate-core-kwarg collision path on registered estimators. That means the remaining correctness issues will look intentional to the suite. Concrete fix: update the auto-floor expectation once simulate_sample_size() semantics are corrected, and add explicit tests that conflicting registry-path overrides such as data_generator_kwargs={"treatment_effect": ...} raise instead of silently changing the simulated DGP. Location: tests/test_power.py:L1762-L1774, diff_diff/power.py:L1554-L1564

Path to Approval

  1. Prevent silent core-DGP overrides on the registry path. Reject conflicting data_generator_kwargs that would replace builder-supplied simulation knobs or search variables, and add regression tests covering simulate_power(), simulate_mde(), and simulate_sample_size().
  2. Make simulate_sample_size() consistent with its contract: either search below the registry floor until the true minimum is identified, or explicitly change the API/result/docs to return a documented upper bound/search floor instead of required_n.
  3. Update the tests to match the corrected sample-size semantics and to cover the duplicate-kwarg regression path.

igerber and others added 2 commits March 20, 2026 08:09
…arch to grid

Addresses AI review P1/P2: TripleDifference results now report the
effective sample size when DDD grid rounding occurs, and
simulate_sample_size() bisects on multiples of 8 so required_n is
always a realizable DDD design.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 5d234c519972e4d8f2516ca38b4a1b177fd01a11


Overall Assessment

⛔ Blocker — Highest unmitigated severity: P0.

Static review only. I could not run the package or pytest in this environment because numpy is unavailable.

Executive Summary

Methodology

  • Severity: P0. Impact: the new registry path still does dgp_kwargs.update(data_gen_kwargs) before calling the generator, so data_generator_kwargs can silently overwrite registry-managed public inputs. At minimum, collisions on keys like treatment_effect or n_units let simulate_power() report one true_effect / searched sample size while generating a different DGP; because simulate_mde() and simulate_sample_size() use simulate_power() as their objective, the same bug corrupts both search routines. Concrete fix: reject collisions where data_generator_kwargs overwrites registry-managed public/search inputs, and either reject or explicitly re-synchronize any derived timing overrides before fitting. Add regression tests for simulate_power(), simulate_mde(), and simulate_sample_size() covering conflicting treatment_effect and n_units inputs. diff_diff/power.py:L1578-L1588, diff_diff/power.py:L1985-L2004, diff_diff/power.py:L2204-L2223
  • Severity: P1. Impact: simulate_sample_size() still claims to find the required/smallest sample size, but in auto mode it returns the registry heuristic floor (min_n) as soon as that floor reaches target power, without evaluating any smaller n. The warning helps, but the return value semantics are still wrong relative to the result object and docstring contract. Concrete fix: either actually search below the floor until infeasibility is established, or rename/document the auto-path result everywhere as an upper bound/search floor instead of required_n. diff_diff/power.py:L393-L480, diff_diff/power.py:L1810-L1812, diff_diff/power.py:L2127-L2130, diff_diff/power.py:L2260-L2279
  • Severity: P1 [Newly identified]. Impact: _check_staggered_dgp_compat() treats the mere presence of data_generator_kwargs["cohort_periods"] as sufficient to suppress the mismatch warning. A single-cohort override such as cohort_periods=[2] still does not satisfy the documented multi-cohort design needed for control_group="not_yet_treated" or clean_control="strict", so this new path can still run under the wrong control design with no warning. Concrete fix: only suppress the warning when the override actually creates at least two distinct treatment cohorts; add regression tests for [2] vs [2, 4] under Callaway-Sant’Anna / Sun-Abraham and StackedDiD. diff_diff/power.py:L266-L300, docs/methodology/REGISTRY.md:L1720-L1720

Code Quality

No material findings beyond the correctness/validation issues above.

Performance

No material findings.

Maintainability

No material findings separate from the misleading simulate_sample_size() contract above.

Tech Debt

  • Severity: P3. Impact: the TripleDifference power-analysis limitation is now properly documented in the Methodology Registry and tracked in TODO.md, so under project policy it is informational rather than blocking. Concrete fix: none required for this PR. docs/methodology/REGISTRY.md:L1721-L1721, TODO.md:L42-L50

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests now codify the auto-floor sample-size behavior (required_n == 20 after the warning) and still do not protect against the remaining registry-collision / single-cohort warning-suppression regressions, so the current suite makes the outstanding correctness issues look intentional. Concrete fix: update the auto-floor expectation once simulate_sample_size() semantics are corrected, and add explicit regression tests for conflicting registry-path overrides plus cohort_periods=[2] continuing to warn. tests/test_power.py:L1508-L1521, tests/test_power.py:L1827-L1840

Path to Approval

  1. Block registry-path collisions where data_generator_kwargs overwrites registry-managed public/search inputs, and add regression tests covering treatment_effect and n_units conflicts through all three simulation entry points.
  2. Make simulate_sample_size() match its advertised contract: either search below the registry floor until minimality is shown, or rename/document the returned quantity everywhere as an upper bound/search floor instead of required_n.
  3. Tighten _check_staggered_dgp_compat() so control_group="not_yet_treated" and clean_control="strict" only suppress warnings when the override actually creates at least two distinct treatment cohorts; add tests for single-cohort and multi-cohort overrides.

…red compat

Address AI review P0/P1/P2 findings:
- P0: Reject data_generator_kwargs that override registry-managed keys
  (treatment_effect, noise_sd, n_units, etc.) to prevent silent desync
- P1: simulate_sample_size() now searches below the registry floor when
  the floor already achieves target power, finding the true minimum N
- P1: _check_staggered_dgp_compat() checks len(set(cohort_periods)) >= 2
  instead of key existence, so single-cohort overrides like [2] still warn
- P2: Add regression tests for all three fixes
- Fix best_power initialization in bisection after downward search

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
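The cohort-multiplicity guard described in this commit — checking `len(set(cohort_periods)) >= 2` rather than key existence — can be sketched as follows. Argument shapes, defaults, and the warning text are assumptions for illustration, not the diff_diff implementation.

```python
import warnings

def check_staggered_dgp_compat(estimator_opts, dgp_kwargs):
    """Warn when staggered-estimator settings need a multi-cohort design
    that the default single-cohort auto-DGP cannot produce.

    Illustrative sketch only.
    """
    needs_multi_cohort = (
        estimator_opts.get("control_group") == "not_yet_treated"
        or estimator_opts.get("clean_control") == "strict"
        or estimator_opts.get("anticipation", 0) > 0
    )
    # Key existence is not enough: require >= 2 *distinct* cohorts,
    # so a single-cohort override like [2] still warns.
    cohorts = dgp_kwargs.get("cohort_periods", [2])
    if needs_multi_cohort and len(set(cohorts)) < 2:
        warnings.warn(
            "default single-cohort DGP does not match these estimator "
            "settings; supply multi-cohort 'cohort_periods' in "
            "data_generator_kwargs or a custom data_generator",
            UserWarning,
        )
```

Under this check, `cohort_periods=[2, 4]` suppresses the warning while `cohort_periods=[2]` does not, which matches the regression cases called out in the review.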
@igerber (Owner, Author) commented Mar 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 94fa3d4be34eba430a80d914d002e1a5b24c1fe5


Overall Assessment

⛔ Blocker — Highest unmitigated severity: P0.

Static review only. I could not run the test suite here because this environment does not have numpy or pytest.

Executive Summary

  • The prior exact-name collision bug is only partially fixed. Exact public-key collisions are now rejected, but the same desynchronization class still exists through derived DGP keys.
  • The prior single-cohort staggered-warning issue looks fixed: the warning is now suppressed only for true multi-cohort overrides.
  • The prior sample-size-floor issue also looks largely fixed: auto mode now probes below the registry floor instead of returning it unconditionally.
  • New blocker: SyntheticDiD can still get silently wrong power/MDE/sample-size results if data_generator_kwargs overrides factor-DGP timing (n_pre/n_post), because the wrapper still builds post_periods from the old treatment_period.
  • The same incomplete guard also leaves TripleDifference sample-size search vulnerable to n_per_cell overrides that decouple the searched n_units from the DGP’s actual sample size.
  • The fixed 2×2×2 DDD power limitation is now properly documented in the Methodology Registry and tracked in TODO.md, so that limitation itself is non-blocking under repo policy.

Methodology

  • Severity P0. Impact: the new collision guard only protects six exact public parameter names, but the registry still lets derived DGP knobs redefine the simulated design without re-synchronizing the estimator/search. For factor-model power, _factor_dgp_kwargs() maps the public design to n_pre/n_post, generate_factor_data() applies treatment starting at t >= n_pre, and _sdid_fit_kwargs() still constructs SyntheticDiD’s post_periods from the original treatment_period. A caller can therefore get silent wrong power output by passing data_generator_kwargs={"n_pre": ...} or {"n_post": ...} on the auto-registry path. The same hole lets DDD n_per_cell overrides disconnect simulate_sample_size()’s searched n_units from the DGP’s actual sample size. This is the same registry/DGP desync class the prior review flagged, just through derived keys instead of exact-name collisions. Concrete fix: treat these derived knobs as registry-managed too. At minimum, reject n_pre/n_post overrides for auto-registry SyntheticDiD unless post_periods is rebuilt from the effective DGP, and reject n_per_cell in DDD simulate_sample_size() or search over effective N instead of requested n_units. Add regression tests for both cases. diff_diff/power.py:L92-L109 diff_diff/power.py:L189-L203 diff_diff/power.py:L1522-L1542 diff_diff/power.py:L1601-L1646 diff_diff/power.py:L2259-L2395 diff_diff/prep_dgp.py:L270-L405 docs/methodology/REGISTRY.md:L1016-L1123 docs/methodology/REGISTRY.md:L1720-L1722

  • No other unmitigated methodology defects stood out in the changed estimator formulas, weighting, or SE paths.
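One way to read the "re-synchronize" option in the concrete fix above: rebuild the fitted post window from the effective factor-DGP timing after overrides are merged, rather than from the original treatment_period. The `n_pre`/`n_post` key names mirror the mapping described in the finding, but the body below is a hypothetical sketch, not the library's API.

```python
def build_sdid_post_periods(n_periods, treatment_period, dgp_overrides):
    """Derive post_periods from the *effective* factor-DGP timing, so an
    n_pre/n_post override cannot desync the fitted post window.

    Hypothetical sketch under assumed key names.
    """
    n_pre = dgp_overrides.get("n_pre", treatment_period)
    n_post = dgp_overrides.get("n_post", n_periods - treatment_period)
    # Treatment is applied from t >= n_pre in the factor DGP (per the review),
    # so the post window must start at the effective n_pre.
    return list(range(n_pre, n_pre + n_post))
```

With no overrides this reproduces the old behavior (`treatment_period=3`, `n_periods=6` gives `[3, 4, 5]`), while an `n_pre=1` override shifts the post window to `[1, 2, 3]` instead of silently keeping `[3, 4, 5]`.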

Code Quality

  • No material findings beyond the incomplete registry/DGP guard above.

Performance

  • No material findings.

Maintainability

  • No separate finding beyond the estimator-specific guard gap above.

Tech Debt

  • Severity P3. Impact: the TripleDifference power-analysis limitation is now explicitly documented in the Methodology Registry and tracked in TODO.md, so under project policy it is informational rather than blocking. Concrete fix: none required for this PR. docs/methodology/REGISTRY.md:L1722-L1722 TODO.md:L42-L50

Security

  • No material findings.

Documentation/Tests

  • Severity P2. Impact: the new tests/docs cover the aligned happy path for n_per_cell and explicitly bless it as a generally allowed non-collision, but they do not cover the remaining derived-key desync cases that still break correctness: no regression protects SyntheticDiD with n_pre/n_post overrides, and no test exercises simulate_sample_size(TripleDifference(), data_generator_kwargs={"n_per_cell": ...}). The Registry note also overstates the safety of n_per_cell overrides. Concrete fix: add regression tests for those cases and narrow the docs/tests so they only treat the aligned n_units == 8 * n_per_cell case as benign, or reject the unsafe overrides outright. tests/test_power.py:L1023-L1050 tests/test_power.py:L1117-L1146 tests/test_power.py:L1896-L1905 docs/methodology/REGISTRY.md:L1722-L1722

Path to Approval

  1. Close the remaining registry/DGP desync hole for derived keys: reject or re-synchronize n_pre/n_post on auto-registry SyntheticDiD, and reject n_per_cell for DDD sample-size search or change the search to operate on effective N.
  2. Add a regression test showing that simulate_power(SyntheticDiD(), treatment_period=3, data_generator_kwargs={"n_pre": 1}) either raises or rebuilds post_periods from the generated data instead of silently using [3].
  3. Add a regression test showing that simulate_sample_size(TripleDifference(), data_generator_kwargs={"n_per_cell": 10}) either raises immediately or reports/searches over the fixed effective sample size of 80 rather than arbitrary requested n_units.

…lysis

Add n_pre and n_post to _PROTECTED_DGP_KEYS to prevent SyntheticDiD/TROP
DGP desync when overriding timing-derived keys. Reject n_per_cell in
simulate_sample_size() for TripleDifference since it freezes effective
sample size across bisection. Add 4 regression tests and update REGISTRY.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b83d531ee49e0da9f1e0ba2d90ba5ebd965849f9


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not execute the test suite in this sandbox because numpy and pytest are unavailable.

Executive Summary

Methodology

Affected methods: simulate_power, simulate_mde, and simulate_sample_size, with estimator-specific methodology checks for SyntheticDiD, TripleDifference, and the staggered-estimator registry adapters.

  • Severity P1. Impact: SyntheticDiD’s placebo variance assumption (n_control > n_treated) is only enforced when simulate_power() is using the registry DGP (and not use_custom_dgp). On the supported custom-DGP path, the wrapper will still fit SyntheticDiD, which falls through to SE=0.0 plus empty placebo draws; safe_inference() then converts inference to NaN, and simulate_power() counts those as non-rejections. That means simulate_power(), simulate_mde(), and simulate_sample_size() can return power-search outputs for a methodology-invalid placebo design instead of the Registry-documented hard failure. Concrete fix: move the placebo-feasibility check to the post-generation path so it uses the realized generated panel on both registry and custom-DGP flows, then let simulate_mde()/simulate_sample_size() inherit that behavior through simulate_power(). diff_diff/power.py:L1496-L1518 diff_diff/power.py:L1603-L1668 diff_diff/power.py:L2041-L2048 diff_diff/power.py:L2270-L2277 diff_diff/synthetic_did.py:L720-L730 docs/methodology/REGISTRY.md:L1113-L1114
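The post-generation check proposed in the concrete fix could be sketched on the realized panel like this. The data shape here (a mapping from unit id to an ever-treated flag) is a simplifying assumption; the real check would operate on the generated DataFrame in diff_diff/power.py.

```python
def check_realized_placebo_feasibility(ever_treated_by_unit):
    """Validate n_control > n_treated on the *generated* panel, so registry
    and custom-DGP paths share one fail-fast instead of degenerating to
    SE=0.0 and NaN inference. Illustrative sketch only."""
    n_treated = sum(1 for t in ever_treated_by_unit.values() if t)
    n_control = len(ever_treated_by_unit) - n_treated
    if n_control <= n_treated:
        raise ValueError(
            f"placebo variance infeasible on realized data: "
            f"{n_control} control vs {n_treated} treated units"
        )
```

Because it inspects the realized data rather than the registry inputs, the same ValueError fires whether the infeasible split came from registry parameters or from a custom data_generator.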

Code Quality

  • No material findings beyond the P1 methodology gap above.

Performance

  • No material findings in the changed code.

Maintainability

  • No material findings in the changed code.

Tech Debt

  • Severity P3. Impact: the TripleDifference power-analysis limitation is now explicitly documented and tracked, so it is informational rather than blocking under the repo’s deferred-work policy. Concrete fix: none required for this PR. docs/methodology/REGISTRY.md:L1723-L1723 TODO.md:L50-L50

Security

  • No material findings.

Documentation/Tests

  • Severity P2. Impact: the new tests cover the registry-path SDID guard and the DDD desync fixes, but they do not cover the remaining supported combination of SyntheticDiD + custom data_generator, so the P1 above is currently unprotected by regression coverage. Concrete fix: add one custom factor-DGP wrapper test each for simulate_power(), simulate_mde(), and simulate_sample_size() that generates n_treated >= n_control and asserts ValueError rather than returning results. tests/test_power.py:L1218-L1277 tests/test_power.py:L1359-L1440

Path to Approval

  1. Enforce the SyntheticDiD(variance_method="placebo") feasibility check on the custom-DGP path inside simulate_power() using the realized generated data, not only the registry-input parameters.
  2. Add regression tests for simulate_power(), simulate_mde(), and simulate_sample_size() with a custom factor-model generator that yields n_control <= n_treated, and assert ValueError.
  3. Keep the Methodology Registry note aligned with the implemented behavior once the custom-DGP path is fixed.

The existing pre-generation check only ran on the registry path. Custom
DGPs could silently produce infeasible placebo designs (n_control <= n_treated),
yielding SE=0.0 and degenerate power results. Add a post-generation check
on the realized data that covers both paths, with regression tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: f1112acff540055b810214507e61c37355f06873


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not execute the test suite in this sandbox because pytest and numpy are unavailable.

Executive Summary

Methodology

Affected methods: simulation-based sample-size search, specifically the TripleDifference registry adapter.

  • Severity P1. Impact: explicit TripleDifference n_range inputs are snapped with _snap_n()'s default floor of profile.min_n (64), not the documented DDD realizable floor of 16. This is an undocumented mismatch between the new search behavior and the Registry note, and it can return an unnecessarily large required sample size for valid DDD designs below 64. Concrete fix: in the DDD branch, snap explicit n_range using the actual DGP floor (16, or a derived 2 * 8) rather than profile.min_n, or else make 64 the documented hard lower bound everywhere including the auto-search path. diff_diff/power.py:L2266 diff_diff/power.py:L2288 diff_diff/power.py:L2320 diff_diff/power.py:L2354 docs/methodology/REGISTRY.md:L1723
  • Severity P3-informational. Impact: the prior SyntheticDiD custom-DGP placebo-feasibility gap appears resolved for the deterministic custom-DGP case reviewed previously, and the implemented behavior is now documented in the Registry. Concrete fix: none required for this PR. diff_diff/power.py:L1665 docs/methodology/REGISTRY.md:L1114 tests/test_power.py:L2022

Code Quality

  • No material findings in the changed code beyond the P1 search-floor bug above.

Performance

  • No material findings in the changed code.

Maintainability

  • No material findings in the changed code.

Tech Debt

  • Severity P3-informational. Impact: the known limitation that TripleDifference power currently uses a fixed 2×2×2 cross-sectional DGP is now explicitly tracked in TODO.md, so it is properly deferred rather than blocking this PR. Concrete fix: none required for this PR. TODO.md:L55

Security

  • No material findings.

Documentation/Tests

  • Severity P2. Impact: the new tests validate DDD sample-size search only for n_range values starting at 64, which misses the explicit low-range path where _snap_n() collapses valid DDD brackets upward. Concrete fix: add a regression covering simulate_sample_size(TripleDifference(), n_range=(16, 56), ...) and assert that the snapped search remains inside the realizable 16..56 grid rather than collapsing to 64. tests/test_power.py:L1118 tests/test_power.py:L1133
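The suggested regression reduces to a grid-membership invariant on the returned required_n. The helper below is a hypothetical illustration of what such a test would assert, not code from the package; the bracket (16, 56) and step of 8 come from the review text:

```python
def in_realizable_grid(n: int, low: int, high: int, step: int = 8) -> bool:
    """Property the suggested regression would assert: the returned
    required_n stays inside the requested bracket and on the step grid."""
    return low <= n <= high and n % step == 0

# For simulate_sample_size(TripleDifference(), n_range=(16, 56), ...),
# the regression would assert this on the result when power is bracketed:
assert in_realizable_grid(24, 16, 56)
assert not in_realizable_grid(64, 16, 56)  # the collapsed-to-64 failure mode
```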

Path to Approval

  1. Change the DDD explicit-n_range path in simulate_sample_size() so that grid snapping uses the actual DDD minimum realizable sample size (16) instead of profile.min_n=64. diff_diff/power.py:L2285 diff_diff/power.py:L2320
  2. Add a regression test for TripleDifference with an explicit low bracket such as n_range=(16, 56) and assert that the returned required_n is a multiple of 8 within that bracket when power is bracketed. tests/test_power.py:L1118
  3. Keep the Methodology Registry aligned with the implemented lower bound after the fix; if 64 is actually intended, document it explicitly and remove the conflicting 16 floor from the auto-search path. docs/methodology/REGISTRY.md:L1723 diff_diff/power.py:L2354

Hoist abs_min before the bracket branch so explicit n_range snaps use the
DDD realizable floor (16) instead of the registry profile floor (64).
Also pass floor=abs_min to the bisection midpoint snap, and add a regression
test for low-range DDD sample-size search.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
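The hoisting pattern this commit describes can be sketched minimally; `resolve_search_bracket`, its signature, and the numeric floors (DGP floor 16, registry floor 64) are assumptions drawn from the review discussion, not the package's real API:

```python
from typing import Optional, Tuple

def resolve_search_bracket(
    n_range: Optional[Tuple[int, int]],
    profile_min_n: int,
    dgp_floor: int,
    step: int = 8,
) -> Tuple[int, int]:
    """Hoist the absolute realizable floor before the bracket branch, so the
    explicit-bracket path and the auto-search path snap against the same floor."""
    abs_min = dgp_floor  # e.g. 16 for the 2x2x2 DDD DGP, not profile.min_n=64

    def snap(n: int) -> int:
        snapped = max(n, abs_min)
        return snapped + (-snapped) % step  # round up to a multiple of step

    if n_range is not None:
        low, high = n_range
        return snap(low), snap(high)
    return snap(abs_min), snap(profile_min_n * 4)  # illustrative auto-bracket

# A low explicit DDD bracket now survives snapping instead of collapsing to 64:
assert resolve_search_bracket((16, 56), 64, 16) == (16, 56)
```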

igerber commented Mar 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d1dadfeaf1070edbdf3a04115faafc5062a0b0fc


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Static review only. I could not execute the test suite in this sandbox because pytest and numpy are unavailable.

Executive Summary

Methodology

Affected methods: simulate_power, simulate_mde, simulate_sample_size, and the new estimator-to-DGP registry adapters. No estimator formula, weighting formula, or variance estimator was changed.

Code Quality

  • No material findings in the changed code.

Performance

  • No material findings in the changed code.

Maintainability

  • No material findings in the changed code.

Tech Debt

  • Severity P3-informational. Impact: The fixed-2×2×2 TripleDifference power-DGP limitation is explicitly tracked in TODO.md, so it is properly deferred rather than blocking this PR. Concrete fix: none for this PR. TODO.md:L44-L55

Security

  • No material findings.

Documentation/Tests

Review limitation only: I could not run the added tests in this sandbox because pytest and numpy are not installed.

@igerber igerber merged commit fe3d79a into main Mar 20, 2026
13 checks passed
@igerber igerber deleted the power-analysis branch March 20, 2026 20:20
igerber added a commit that referenced this pull request Mar 21, 2026
Expand the tutorial to cover the full simulation power analysis API
added in PR #208: multi-estimator support (CallawaySantAnna, SyntheticDiD,
TripleDifference), simulate_mde() bisection search, simulate_sample_size()
bisection search, custom data generators, and analytical vs simulation
guidance. Includes reference table of all 12 supported estimators.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
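The bisection search behind simulate_mde() can be sketched generically, assuming simulated power is monotone in the effect size; `bisect_mde`, its signature, and the toy power curve below are illustrative stand-ins, not the package API covered by the tutorial:

```python
def bisect_mde(power_fn, target_power: float = 0.8,
               lo: float = 0.0, hi: float = 2.0, tol: float = 1e-3) -> float:
    """Find the smallest effect size whose simulated power reaches
    target_power, assuming power_fn is monotone in the effect size."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if power_fn(mid) >= target_power:
            hi = mid   # detectable at mid: shrink the bracket from above
        else:
            lo = mid   # not detectable: grow the bracket from below
    return hi

# Toy monotone power curve: power rises linearly and saturates at 1.
mde = bisect_mde(lambda effect: min(1.0, effect))
assert abs(mde - 0.8) < 1e-2
```

simulate_sample_size() follows the same shape, bisecting over n_units on a snapped grid instead of over a continuous effect size.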