Skip to content

Align TROP with paper methodology (Equation 5, D matrix semantics)#110

Merged
igerber merged 13 commits intomainfrom
fix/trop-methodology-alignment
Jan 25, 2026
Merged

Align TROP with paper methodology (Equation 5, D matrix semantics)#110
igerber merged 13 commits intomainfrom
fix/trop-methodology-alignment

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Jan 25, 2026

Summary

  • Fix CV scoring to use SUM instead of average per paper's Equation 5
  • Remove post_periods parameter from fit() and trop() (D matrix is sole input for treatment timing)
  • Replace pre_periods/post_periods lists in TROPResults with n_pre_periods/n_post_periods counts
  • Add warning when >10% of LOOCV fits fail (helps diagnose numerical instability)
  • Fix NaN handling for undefined inference fields (t_stat, p_value now return np.nan when SE is 0)
  • Update REGISTRY.md with D matrix semantics and comprehensive requirements checklist
  • Add API verification tests for v2.1.8 changes

Methodology references (required if estimator / math changes)

  • Method name(s): TROP (Triply Robust Panel Estimators)
  • Paper / source link(s): Athey, Imbens, Qu & Viviano (2025)
  • Any intentional deviations from the source (and why):
    • CV scoring was using average; fixed to use SUM per Equation 5: Q(λ) = Σ_{j,s: D_js=0} [τ̂_js^loocv(λ)]²
    • Removed post_periods parameter since paper uses D matrix as sole treatment timing input (no separate post_periods concept)

Validation

  • Tests added/updated: tests/test_trop.py (new TestAPIChangesV2_1_8 class with 5 tests)
  • All 43 TROP tests pass
  • Backtest / simulation / notebook evidence (if applicable): N/A

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

- Fix CV scoring to use SUM instead of average per Equation 5
- Remove post_periods parameter (D matrix is sole input for treatment timing)
- Add LOOCV warning when >10% of fits fail
- Fix NaN handling for undefined inference fields (t_stat, p_value)
- Update REGISTRY.md with D matrix semantics and requirements checklist
- Add API verification tests for v2.1.8 changes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall assessment: ⛔ Blocker

Executive Summary

  • P0: CI is still computed when se<=0, so inference remains finite and misleading.
  • P1: LOOCV failure-rate warning required by the registry is missing in the Rust backend (and skipped when all fits fail).
  • P2: TROP docs/examples still show post_periods, which now errors.
  • P2: LOOCV warning test doesn’t assert emission, so regressions can slip through.

Methodology

  • P0 Impact: When se<=0, conf_int is still computed (often a zero-width CI) even though t_stat/p_value are NaN, violating the NaN-propagation rule and misleading inference. Fix: guard compute_confidence_interval and/or set conf_int=(np.nan, np.nan) when se is non-positive/non-finite before constructing TROPResults. References: diff_diff/trop.py:900, diff_diff/utils.py:81.
  • P1 Impact: The registry requires a warning when >10% LOOCV fits fail, but the Rust backend never warns and the Python path skips the warning when n_valid==0, so diagnostics diverge from the methodology requirements. Fix: propagate failure counts from Rust and emit the same warning across backends (including the all-fail case). References: docs/methodology/REGISTRY.md:451, rust/src/trop.rs:279, diff_diff/trop.py:1438.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • P2 Impact: TROP examples and API docs still pass post_periods, which now raises TypeError and contradicts the updated D-matrix semantics, leading to broken copy-paste usage. Fix: update TROP docs/examples to remove post_periods and note the absorbing D requirement. References: README.md:1229, README.md:1282, docs/api/trop.rst:133, docs/tutorials/10_trop.ipynb:174.
  • P2 Impact: The new LOOCV warning test doesn’t assert that a warning was emitted, so regressions in warning behavior (especially with the Rust backend) can pass silently. Fix: assert on captured warning contents or create a deterministic failure fixture/mocking to guarantee warning emission. Reference: tests/test_trop.py:1589.

P0: Fix confidence interval NaN propagation
- Move conf_int computation inside SE guard
- When SE <= 0, conf_int is now (np.nan, np.nan) instead of finite values

P1: Fix LOOCV warning consistency
- Add warning before early return when ALL LOOCV fits fail (n_valid == 0)
- Add warning after Rust LOOCV call when best_score is infinite
- Ensures warnings are emitted regardless of backend used

P2a: Update TROP documentation
- Remove outdated post_periods parameter from TROP examples in README.md
- Update docs/api/trop.rst examples
- Update docs/tutorials/10_trop.ipynb (6 cells with TROP fit() calls)
- TROP now infers treatment periods from the treatment indicator column

P2b: Add test assertions
- Update test_loocv_warning_on_many_failures with proper warning capture
- Add test_nan_propagation_when_se_zero to verify P0 fix

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Jan 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 31b10d0487bf541f4a797548ff3c9d70c606c432


Overall assessment: ⚠️ Needs changes

Executive summary:

  • P1: TROP LOOCV warnings diverge by backend; Rust path omits the >10% failure warning required by the Methodology Registry.
  • P2: README still shows post_periods in a TROP example, which will now raise a TypeError.
  • P3: New LOOCV warning test doesn’t assert warnings (and doesn’t verify SUM vs AVG), leaving regression gaps.

Methodology

  • Severity: P1; Impact: The registry requires a warning when >10% of LOOCV fits fail, but the Rust LOOCV backend never emits this (only Python does), so TROP’s behavior depends on backend and can silently miss numerical-instability warnings; Concrete fix: return n_valid/n_attempted from the Rust grid search or emit the warning in Rust and surface it to Python so the >10% failure warning is enforced in all paths (docs/methodology/REGISTRY.md:451, rust/src/trop.rs:281, diff_diff/trop.py:780, diff_diff/trop.py:1455).

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity: P2; Impact: README’s TROP example still uses removed post_periods, so the documented call now fails; Concrete fix: drop post_periods from the example and add the absorbing‑state note like other sections (README.md:1147).
  • Severity: P3; Impact: The LOOCV warning test doesn’t assert that a warning was actually raised (and doesn’t check the SUM‑of‑squares behavior), so regressions can pass unnoticed; Concrete fix: make the test deterministic (e.g., force >10% failures) and assert loocv_warnings is non‑empty, plus add a small unit test for _loocv_score_obs_specific using a known dataset to confirm SUM vs AVG (tests/test_trop.py:1629).

igerber and others added 2 commits January 25, 2026 09:17
…rminism

P1: Rust backend warning consistency
- Modified Rust loocv_grid_search to return (score, n_valid, n_attempted)
- Python now emits >10% failure warning after Rust call, matching Python path
- Both backends now emit consistent LOOCV warnings

P2: Fixed last missed post_periods documentation example
- Updated TROP example in main README section (line 1147)

P3: Added deterministic test for LOOCV warnings
- New test_loocv_warning_deterministic_with_mock mocks _estimate_model
- Forces 20% failure rate to trigger >10% warning threshold
- Patches Rust backend to ensure Python path is tested

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update test_loocv_grid_search_returns_valid_params to expect 6 values
(score, n_valid, n_attempted) instead of 4 to match the updated Rust
function signature.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Jan 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 458b0ee57881271c500ac8c37c3ebc367e51a148


Methodology

  • P1: LOOCV scoring drops failed control fits from the sum, so Q(λ) no longer reflects all D==0 cells as defined in Eq. 5 (docs/methodology/REGISTRY.md:505). Impact: when failures occur, λ selection can be biased toward unstable parameter sets because missing terms effectively reduce the objective, potentially changing ATT/SE; applies to both Python and Rust paths. Fix: penalize failures explicitly (e.g., return inf when any failures occur, or add a large penalty per failed fit) so the score represents all attempted control observations; implement in diff_diff/trop.py:1428 and rust/src/trop.rs:300.

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • No issues found.

Open Questions

  • None.

Overall Assessment

  • ⚠️ Needs changes

Executive Summary

  • P1: LOOCV objective drops failed fits, which can bias λ selection away from the Eq. 5 definition when failures occur (diff_diff/trop.py:1428, rust/src/trop.rs:300).
  • TROP D-matrix semantics and removal of post_periods align with the Methodology Registry and cited paper expectations.
  • NaN propagation for zero SE is correctly handled in inference outputs (diff_diff/trop.py:915).

Per PR #110 feedback: Equation 5 specifies Q(λ) as a sum over ALL D==0
cells. The previous implementation skipped failed fits and only warned
if >10% failed, which could bias λ selection toward unstable parameter
combinations.

Changes:
- Python/Rust: Return inf immediately on first fit failure
- Add warning mentioning Equation 5 compliance
- Remove obsolete >10% failure threshold warning
- Update methodology docs to reflect strict failure handling
- Update test to verify new behavior

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Jan 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: faa2f11bf5e9f1b9edc669c8601d9384f9fc90ed


Overall assessment: ⚠️ Needs changes

Executive summary:

  • P1: TROP now relies solely on the D matrix; there’s no runtime check that D is an absorbing state, which can silently bias ATT if users pass event‑style indicators.
  • P2: Rust LOOCV failure warnings can under-report or mischaracterize failures because the score exits on the first failure yet still uses n_valid/n_attempted for warnings.
  • LOOCV scoring now matches Equation 5 (sum, inf on failure) and NaN propagation for undefined inference is fixed, aligning with the Methodology Registry.
  • Docs/tests cover the API change, but assumption-validation tests are still missing.

Methodology

  • P1: Missing absorbing‑state validation for the treatment indicator after removing post_periods. Impact: event‑style D (only first treated period marked) will be interpreted as treatment “turning off,” causing post‑treatment cells to be treated as controls and violating the registry’s D‑matrix semantics (Eq. 1/3), biasing ATT. Fix: add a per‑unit monotonic non‑decreasing D check (warn or raise, with override flag) before first_treat_period inference in TROP.fit. diff_diff/trop.py:L747-L770

Code Quality

  • P2: Rust LOOCV warnings use n_valid values computed before early exit on the first failure; this can suppress warnings (<10% failures) or emit misleading “all fits failed” messages. Impact: users may miss numerical‑instability signals or get inaccurate diagnostics. Fix: have Rust return any_failure/n_failed by continuing to count failures (even if score is already ), or update Python to warn on n_valid < n_attempted and avoid “all failed” language when early exit occurred. rust/src/trop.rs:L326-L342, diff_diff/trop.py:L786-L815

Performance

  • P3: No issues found. Impact: None. Fix: None.

Maintainability

  • P3: No issues found. Impact: None. Fix: None.

Tech Debt

  • P3: No issues found. Impact: None. Fix: None.

Security

  • P3: No issues found. Impact: None. Fix: None.

Documentation/Tests

  • P2: No tests enforce the absorbing‑state assumption; once validation is added, there’s no guard against regressions. Impact: future changes could re‑introduce silent bias. Fix: add a unit test with a non‑absorbing D sequence (e.g., 0→1→0 within a unit) that asserts a warning/error. tests/test_trop.py

- Implement paper's footnote 2 LOOCV approach: two-stage coarse-to-fine
  cycling (univariate searches then coordinate descent) instead of
  exhaustive grid search
- Add D matrix absorbing-state validation (P1): check monotonic
  non-decreasing treatment indicator, raise ValueError with helpful
  message for event-style D (0→1→0)
- Update Rust backend with matching cycling LOOCV implementation
- Update REGISTRY.md with two-stage LOOCV procedure documentation
- Add tests for D matrix validation and cycling search convergence

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Jan 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 88206e023bd822d38f6f70ecc3b870969d4bbc70


Overall assessment: ⚠️ Needs changes

Executive summary:

  • TROP LOOCV selection can accept Q(λ)=∞ and skip the registry-required fallback, so invalid tuning parameters can be used.
  • The Stage‑1 LOOCV “disabled” λ_nn path maps to 0, which conflicts with the stated “L=0/disabled” semantics and likely the paper’s footnote.
  • Rust LOOCV warning counts can be misleading due to early‑exit counting.
  • Tests don’t currently assert fallback‑to‑defaults when LOOCV fails for all parameters.

Methodology:

  • P1 Impact: Q(λ)=∞ is treated as a selectable outcome because best_lambda is set even when the score is infinite, so the “all combinations failed → fall back to defaults” registry rule isn’t enforced; this can yield invalid tuning and non‑conforming estimates. Fix: treat best_score == np.inf/not np.isfinite(best_score) or n_valid==0 as a hard failure and leave best_lambda=None to trigger fallback; also skip assigning best_lambda when _loocv_score_obs_specific returns np.inf. (diff_diff/trop.py:L969-L1055, rust/src/trop.rs:L365-L371)
  • P1 Impact: The Stage‑1 “disabled” λ_nn uses as a sentinel but converts it to 0, which is “no regularization” in _estimate_model and not the “L=0/disabled” behavior described in the docstring/registry, so the two‑stage search does not match footnote‑2 semantics and can bias tuning. Fix: implement explicit handling for lambda_nn == inf to force L=0 (or an equivalent large penalty) in both Python and Rust, or update the registry/docstrings to define “disabled” as 0 and remove the sentinel. (diff_diff/trop.py:L665-L730, diff_diff/trop.py:L1408-L1439, rust/src/trop.rs:L149-L199)

Code Quality:

  • P3 Impact: Rust LOOCV warnings compute n_failed = n_attempted - n_valid, but n_valid stops at the first failure (early return), so the warning can overstate failure rates and miss single‑failure diagnostics. Fix: track attempted/failed counts through all observations or warn based on best_score == inf with “≥1 failure; Q=∞” wording. (rust/src/trop.rs:L430-L465, diff_diff/trop.py:L976-L990)

Performance:

  • P3 Impact: None identified. Fix: None.

Maintainability:

  • P3 Impact: None identified. Fix: None.

Tech Debt:

  • P3 Impact: None identified. Fix: None.

Security:

  • P3 Impact: None identified. Fix: None.

Documentation/Tests:

  • P2 Impact: No regression test asserts fallback to defaults when all LOOCV fits fail (both Python and Rust paths), so the new failure semantics can regress silently. Fix: add a deterministic test that forces LOOCV to return inf and asserts default λ selection plus warning. (tests/test_trop.py:L1722-L1805)

…tics

- Fix infinite score check: only assign best_lambda when score is finite
  (both Python and Rust code paths)
- Fix λ_nn=∞ semantics: use 1e10 instead of 0.0 per paper Equations 2-3
  - λ_time/λ_unit=∞ → 0.0 (uniform weights) is correct
  - λ_nn=∞ → 1e10 (large penalty → L≈0, factor model disabled)
  - Note: λ_nn=0 means NO regularization, opposite of "disabled"
- Add regression tests for infinite score fallback to defaults
- Document λ=∞ semantics in REGISTRY.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Jan 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 61cdd46590b789018a9c7bb00e149a9b6baa3f20


Overall assessment: ⚠️ Needs changes

Executive Summary

  • P1 (TROP): Rust LOOCV path doesn’t emit the registry-mandated first-failure warning (with (t,i), λ), so diagnostics diverge from Python and required methodology checks are silent (rust/src/trop.rs:434, diff_diff/trop.py:985, docs/methodology/REGISTRY.md:523).
  • P1 (TROP): λ=∞ “disabled” semantics aren’t enforced consistently—Rust univariate search ignores inf grid values and Python final estimation uses inf directly, risking NaN weights if inf appears in grids (rust/src/trop.rs:190, diff_diff/trop.py:1066, docs/methodology/REGISTRY.md:516).
  • P2: Absorbing-state validation runs after fillna(0), which can falsely reject unbalanced panels with missing treated rows (diff_diff/trop.py:896).

Methodology

  • Severity: P1 | Impact: With Rust backend, LOOCV failures can be silent or only reported as aggregate counts for the chosen λ, missing the registry-required “first failure (t,i,λ)” warning and masking instability across other λ values. | Fix: return first-failure metadata from Rust and emit the same warning as Python (or emit the warning inside Rust), and align failure reporting with the early-return behavior (rust/src/trop.rs:434, diff_diff/trop.py:985, docs/methodology/REGISTRY.md:523).
  • Severity: P1 | Impact: The registry defines λ=∞ as “disabled” (time/unit → uniform weights, nn → L≈0), but Rust does not normalize inf grid values and Python final estimation uses best_lambda directly; if np.inf is in a grid, weights can become NaN and behavior diverges from documented semantics. | Fix: normalize any inf λ values in Rust (including value) and in Python before final estimation; add tests for inf grids (rust/src/trop.rs:190, diff_diff/trop.py:1066, docs/methodology/REGISTRY.md:516).

Code Quality

  • Severity: P2 | Impact: Monotonic D validation is applied after fillna(0), so missing treatment rows in unbalanced panels appear as 1→0 and raise ValueError, potentially breaking previously tolerated missingness. | Fix: document a balanced-panel requirement or forward-fill D per unit (or ignore missing entries) before np.diff (diff_diff/trop.py:896).

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity: P3 | Impact: No tests exercise λ=∞ grids or Rust warning parity, so the registry semantics can regress silently. | Fix: add tests near the LOOCV/parameter search coverage to exercise inf grid handling and Rust warning propagation (skip when Rust backend unavailable) (tests/test_trop.py:1724).

Add infinity conversion in fit() after unpacking best_lambda to ensure
final estimation uses the same effective parameters that LOOCV evaluated.

Changes:
- Convert λ_time/λ_unit=∞→0 and λ_nn=∞→1e10 before final estimation
- Add test for infinity-valued parameter grids
- Clarify REGISTRY.md that conversion applies to both LOOCV and final estimation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Jan 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 34d647f43c4de44980f87e5de60c1588d3228123


Overall Assessment

  • ⚠️ Needs changes

Executive Summary

  • TROP LOOCV in Rust treats λ=∞ as literal infinity for the searched parameter, diverging from REGISTRY/Python conversion and the final estimator.
  • TROP variance estimation can use raw λ=∞ while the point estimate uses converted values, making SE/CI inconsistent when infinity is in the grid.
  • Python LOOCV returns a zero score when the control set is empty after subsampling, violating the empty-result-set guardrails and risking invalid tuning.

Methodology

  • P1 | Impact: Rust LOOCV does not convert λ=∞ when it is the parameter being searched, so tuning can optimize a different objective than the documented “disabled” semantics and the Python path. | Fix: In univariate_loocv_search, normalize value when it is infinite (λ_time/λ_unit0.0, λ_nn1e10) before calling loocv_score_for_params, or pre-process grids to an “effective lambda” array. | Ref: rust/src/trop.rs:L187-L203
  • P1 | Impact: If LOOCV selects λ=∞, the point estimate uses the converted effective lambda but bootstrap/jackknife use the raw best_lambda, so SE/CI are for a different estimator than the reported ATT. | Fix: Convert to an “effective_lambda” tuple once and pass it to _bootstrap_variance, _jackknife_variance, and _fit_with_fixed_lambda, or add conversion inside those functions. | Ref: diff_diff/trop.py:L1066-L1140, diff_diff/trop.py:L1947-L2013
  • P1 | Impact: _loocv_score_obs_specific returns 0.0 if control_obs becomes empty (e.g., max_loocv_samples=0 or all control outcomes are NaN), so invalid λ can win by default; this violates the empty-result-set handling requirement. | Fix: After subsampling, if control_obs is empty, return np.inf (and warn) or enforce max_loocv_samples >= 1. | Ref: diff_diff/trop.py:L1641-L1692

Code Quality

  • None noted.

Performance

  • None noted.

Maintainability

  • None noted.

Tech Debt

  • None noted.

Security

  • None noted.

Documentation/Tests

  • P2 | Impact: No targeted test asserts Rust λ=∞ conversion behavior or the empty-control_obs LOOCV path, so the above regressions could slip. | Fix: Add coverage in tests/test_rust_backend.py for inf-grid conversion and in tests/test_trop.py for max_loocv_samples=0/all-NaN control outcomes.

Open Questions

  • Should max_loocv_samples=0 be a supported “skip LOOCV” mode, or should it hard-error/clamp to 1?
  • Do you intend to allow λ=∞ in user grids with the Rust backend? If yes, should Rust mirror Python’s effective-parameter conversion exactly?

1. Rust LOOCV Parameter Conversion (rust/src/trop.rs):
   - Convert grid values before LOOCV evaluation to match Python behavior
   - λ_time=∞→0.0, λ_unit=∞→0.0, λ_nn=∞→1e10
   - Ensures Rust and Python evaluate identical objective functions

2. Variance Estimation Consistency (diff_diff/trop.py):
   - Create effective_lambda with converted values after infinity conversion
   - Pass effective_lambda to both bootstrap and jackknife variance methods
   - Store original grid values in TROPResults for user visibility
   - Point estimation and SE now use identical parameters

3. Empty Control Set Handling (diff_diff/trop.py):
   - Add check for empty control_obs after subsampling in _loocv_score_obs_specific
   - Return np.inf with warning if control observations are empty
   - Prevents empty control sets from "winning" with score 0.0

Tests added for all three fixes. Documentation updated in TROPResults
docstring, TROP.fit() docstring, and REGISTRY.md edge cases.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Jan 25, 2026

/ai-review

1 similar comment
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Jan 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: f52b1001fae4311add0d982dbe89ab641b18ee9b


Overall assessment: ⚠️ Needs changes

Executive Summary

  • P1: LOOCV final scoring uses raw inf values (Python fallback + Rust), contradicting the registry’s λ=∞ conversion and risking wrong Q(λ) or default fallback when np.inf is in grids.
  • P2: Registry promises a first-failure LOOCV warning with (t, i), but the Rust path only emits aggregate warnings, so the documented diagnostics aren’t met.
  • P2: Absorbing-state validation treats missing D as 0; in unbalanced panels this can falsely flag valid data or misclassify missing post-treatment rows.

Methodology

  • Severity: P1; Impact: When grids include np.inf, the final LOOCV score is computed with unconverted values, so Q(λ) can be inconsistent with the “disabled” semantics and can incorrectly trigger fallback to defaults or misreport loocv_score; Fix: Convert λ values to effective values (inf→0.0 / 1e10) before computing the final LOOCV score, in both Python and Rust, while still storing original grid values for results (diff_diff/trop.py:L1050-L1056, rust/src/trop.rs:L386-L391, docs/methodology/REGISTRY.md:L538-L542).
  • Severity: P2; Impact: Rust backend warnings don’t include the first failing (t, i) observation as required by the registry, so diagnostics are incomplete vs documented behavior; Fix: Return the first failed (t, i) from Rust (or emit a structured error) and include it in the Python warning, or update the registry to explicitly scope this warning to the Python path (docs/methodology/REGISTRY.md:L523-L526, diff_diff/trop.py:L970-L1006).

Code Quality

  • Severity: P2; Impact: D is fillna(0) before the monotonicity check, so unbalanced panels with missing post‑treatment rows can raise false “not absorbing” errors or silently treat missing as untreated; Fix: Validate monotonicity only on observed D values per unit (mask missing), or explicitly require balanced panels and raise a clear error when D is missing (diff_diff/trop.py:L903-L918).

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • No issues found.

Open Questions / Assumptions

  • Do we want to formally require balanced panels for TROP? If not, I’d recommend masking missing D values in the absorbing-state check rather than fillna(0).

Issue 1: Final LOOCV score infinity conversion
- Convert inf values before calling loocv_score_for_params in Rust
- Ensures final score uses same converted values that LOOCV evaluated
- λ_time/λ_unit=∞ → 0.0, λ_nn=∞ → 1e10

Issue 2: Rust LOOCV failed observation metadata
- Extend loocv_score_for_params to return Option<(usize, usize)>
- Track first failed observation (t, i) for informative warnings
- Python now includes coordinates in LOOCV failure warnings

Issue 3: D matrix validation for unbalanced panels
- Track missing values before fillna(0) with missing_mask
- Only validate monotonicity between observed periods
- Missing data no longer triggers false absorbing-state violations

Tests: 4 new tests in TestPR110FeedbackRound8 class
Docs: Updated REGISTRY.md with unbalanced panel support

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Jan 25, 2026

/ai-review

Update test to unpack 7 values from loocv_grid_search (added first_failed_obs
in round 8) and add assertion for the new return value.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Jan 25, 2026

/ai-review

1 similar comment
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Jan 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 835ade2942cede0ea09ffac1e5943112b266b5b2


Overall assessment: ⚠️ Needs changes

Executive summary:

  • TROP’s LOOCV sum/“D is absorbing” changes align with the Methodology Registry, but the new D-validation can miss real 1→0 violations when missing periods create gaps (P1).
  • NaN/Inf propagation is fixed in TROP only; other estimators still emit t_stat=0.0/p_value=1.0 for se<=0, which remains misleading and inconsistent (P1).
  • New Rust fallback test will fail because the mocked return tuple is shorter than the updated Rust API (P2).
  • Rust LOOCV failure-rate warnings can be misleading because early-exit makes the “n_failed/n_attempted” math inaccurate (P2).

Methodology

  • P1 — Impact: Absorbing-state validation can miss true violations when a unit has D=1, then later observed D=0 with missing periods in between; invalid D can pass and bias ATT, contrary to the registry’s monotonic requirement. Location: diff_diff/trop.py:913. Concrete fix: Validate monotonicity across observed periods per unit (e.g., drop NaNs from D_raw[:, i] and ensure non-decreasing; flag any 1→0 even if non-adjacent).
  • P1 — Impact: NaN/Inf propagation policy is only applied to TROP; other estimators still return finite t-stats/p-values when se<=0, which can mislead inference and violates the edge-case checklist. Locations: diff_diff/diagnostics.py:665, diff_diff/sun_abraham.py:603, diff_diff/triple_diff.py:601, diff_diff/synthetic_did.py:339, diff_diff/linalg.py:1267. Concrete fix: Replace else 0.0/1.0 with np.nan and set CI to (np.nan, np.nan) when se<=0 across these call sites to match TROP behavior.

Code Quality

  • P2 — Impact: Rust LOOCV warnings report n_failed/n_attempted using n_valid from an early-exit loop, which can drastically overstate failures and mislead diagnostics. Locations: diff_diff/trop.py:1012, rust/src/trop.rs:500. Concrete fix: Either remove the >10% failure-rate warning when early-exit is used, or compute counts by continuing through all control_obs, or return a simple “failed_any” flag and warn without counts.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P2 — Impact: test_rust_infinite_score_triggers_fallback mocks the Rust LOOCV return tuple with 6 values, but TROP.fit now expects 7, so the test will fail before reaching the fallback logic. Location: tests/test_trop.py:2085. Concrete fix: Update the mock to return 7 values (append None for first_failed_obs).

Update mock return value for _rust_loocv_grid_search to include the
7th value (first_failed_obs) that was added in round 8. The mock was
returning 6 values but the actual Rust function returns 7.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Jan 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: ebd9cb2ce9a4f4094c616baa7a563c0791a76ab0


Overall Assessment

  • ⚠️ Needs changes

Executive Summary

  • Method affected: TROP (Triply Robust Panel); LOOCV sum objective + D‑matrix semantics align with REGISTRY/Athey‑Imbens‑Qu‑Viviano (2025) as updated.
  • P1: Absorbing‑state validation can miss 1→0 drops across missing periods, allowing invalid D in unbalanced panels.
  • P3: n_post_periods metadata definition doesn’t match how it’s computed under missing treated periods.

Methodology

  • Severity: P1; Impact: The absorbing‑state check only inspects adjacent observed periods, so a 1→0 transition with a missing gap is not flagged, violating the REGISTRY/paper assumption and potentially biasing ATT; Fix: per unit, drop missing D and verify the observed sequence is non‑decreasing (or forward‑fill then check), plus add a test for a 1→0 gap case. diff_diff/trop.py:904

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • Severity: P3; Impact: n_post_periods is documented as “periods with D=1 observations” but is computed as n_periods - first_treat_period, which can overcount in unbalanced panels; Fix: compute via np.any(D == 1, axis=1) (using observed D) or update the docstring/README to match the current definition; add a regression test with a missing treated period. diff_diff/trop.py:136, diff_diff/trop.py:962

…n_post_periods fix

P1: Fix absorbing-state validation to catch 1→0 violations across missing period gaps.
    The old vectorized check only looked at adjacent periods, missing violations like
    D[2]=1, missing [3,4], D[5]=0. Now checks each unit's observed D sequence.

P3: Fix n_post_periods to count periods with actual D=1 observations, matching the
    docstring claim, rather than calendar periods from first treatment.

Also updates methodology registry documentation for both changes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Jan 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 3a708fccdca15733c13fd3012d7cf5cfbabcc67a


Overall Assessment
✅ Looks good

Methodology

  • No issues found.

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P3 – Impact: LOOCV sum vs average regressions could slip without a deterministic assertion. Fix: add a small fixture that computes a known Q(λ) and asserts sum (not mean), e.g., near TestAPIChangesV2_1_8 in tests/test_trop.py:1724.

Executive Summary

  • TROP estimator and Rust LOOCV tuning updates align with the REGISTRY and cited paper (absorbing-state D semantics, Equation 5 sum, λ=∞ conversion, SE NaN handling).
  • No methodological or implementation regressions found in the reviewed diff.
  • One low-severity test gap: add a deterministic LOOCV sum assertion to lock in the Equation 5 behavior.

@igerber igerber merged commit 525ac13 into main Jan 25, 2026
4 checks passed
@igerber igerber deleted the fix/trop-methodology-alignment branch January 25, 2026 21:49
@igerber igerber mentioned this pull request Jan 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant