feat(AC-322): multi-objective advancement contract for generation gating #451

Merged
jayscambler merged 2 commits into main from feat/ac-322-advancement-contract
Mar 18, 2026
@jayscambler
Contributor

Summary

  • AC-322: Canonical multi-objective advancement contract that gates generation progression on composite metrics instead of only local best-score delta

New Module

| Module | Purpose |
| --- | --- |
| harness/pipeline/advancement.py | AdvancementMetrics, AdvancementRationale, evaluate_advancement() |

Key Design Decisions (per comments)

  • Confidence and uncertainty are first-class inputs: confidence and sample_agreement are explicit metric fields, with risk flagging below thresholds
  • Separate proxy from truth: search_proxy_score is informational; resolved_truth_score binds when present and overrides the delta check
  • Binding vs proxy: the binding_checks list shows what drove the decision; proxy_signals shows what was informational
  • Error rate veto: an error rate above 20% vetoes advance regardless of score improvement
  • Regression → immediate rollback: a negative delta never retries
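
To illustrate the first two decisions, here is a minimal sketch of how a metrics record could carry confidence and the proxy/truth split as explicit fields. It is not the code in harness/pipeline/advancement.py: the class name `MetricsSketch`, the `best_score` and `previous_best` fields, the defaults, and the `to_dict`/`from_dict` helpers are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class MetricsSketch:
    """Hypothetical stand-in for AdvancementMetrics (subset of fields)."""

    best_score: float          # assumed name: score of the current generation
    previous_best: float       # assumed name: score of the prior generation
    error_rate: float = 0.0
    confidence: float = 1.0
    sample_agreement: float = 1.0
    # Proxy and truth are kept as separate, optional fields so that
    # downstream logic can tell which one is available.
    search_proxy_score: Optional[float] = None
    resolved_truth_score: Optional[float] = None

    @property
    def delta(self) -> float:
        # Derived from the two scores rather than stored (an assumption).
        return self.best_score - self.previous_best

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "MetricsSketch":
        return cls(**data)


metrics = MetricsSketch(best_score=0.75, previous_best=0.5, confidence=0.4)
restored = MetricsSketch.from_dict(metrics.to_dict())
```

Because `delta` is a property, it is excluded from `to_dict()` and recomputed after a roundtrip, so the serialized form cannot drift from the underlying scores.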

Evaluation Chain

1. Error rate > 20% → rollback (binding veto)
2. Confidence < 0.5 → risk flag
3. Variance > 0.04 → risk flag
4. Resolved truth score present → binds over proxy (truth delta check)
5. Score regression (delta < 0) → rollback
6. Delta >= min_delta → advance
7. Delta < min_delta, retries remaining → retry
8. Delta < min_delta, no retries → rollback
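
The chain above can be sketched as a single function. This is a hedged illustration rather than the merged `evaluate_advancement()`: the name `evaluate_sketch`, its keyword arguments, the `Decision` record, the default `min_delta`, and the choice to pass the truth comparison in as a precomputed `truth_delta` are assumptions. Only the thresholds (20%, 0.5, 0.04) and the ordering come from the description.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

ERROR_RATE_VETO = 0.20   # step 1
MIN_CONFIDENCE = 0.5     # step 2
MAX_VARIANCE = 0.04      # step 3


@dataclass
class Decision:
    """Hypothetical, trimmed-down rationale record."""

    decision: str                      # "advance", "retry", or "rollback"
    reason: str
    binding_checks: list[str] = field(default_factory=list)
    risk_flags: list[str] = field(default_factory=list)


def evaluate_sketch(
    *,
    delta: float,
    error_rate: float = 0.0,
    confidence: float = 1.0,
    variance: float = 0.0,
    truth_delta: Optional[float] = None,  # assumed: delta on resolved truth
    min_delta: float = 0.01,              # assumed default
    retries_remaining: int = 0,
) -> Decision:
    flags: list[str] = []

    # 1. Binding veto: high error rate rolls back regardless of score.
    if error_rate > ERROR_RATE_VETO:
        return Decision("rollback", "error rate veto", ["error_rate"], flags)

    # 2-3. Risk flags are recorded but do not decide the outcome.
    if confidence < MIN_CONFIDENCE:
        flags.append("low_confidence")
    if variance > MAX_VARIANCE:
        flags.append("high_variance")

    # 4. Resolved truth, when present, replaces the proxy delta.
    if truth_delta is not None:
        effective, source = truth_delta, "resolved_truth_delta"
    else:
        effective, source = delta, "score_delta"

    # 5. Regression never retries.
    if effective < 0:
        return Decision("rollback", "score regression", [source], flags)
    # 6. Sufficient improvement advances.
    if effective >= min_delta:
        return Decision("advance", "delta meets min_delta", [source], flags)
    # 7-8. Marginal improvement retries while budget remains.
    if retries_remaining > 0:
        return Decision("retry", "delta below min_delta", [source], flags)
    return Decision("rollback", "delta below min_delta, no retries", [source], flags)
```

For example, `evaluate_sketch(delta=0.5, truth_delta=-0.1)` rolls back even though the proxy delta is large, because the truth delta is the binding input in step 4.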

Test plan

  • 12 tests in test_advancement_contract.py
  • AdvancementMetrics: construction, delta computed, roundtrip
  • AdvancementRationale: construction, roundtrip
  • evaluate_advancement: advance, rollback regression, retry marginal, rollback high error, low confidence risk, truth overrides proxy, component scores
  • ruff clean, mypy clean, full suite 4415 passed

Defines the canonical metrics, rationale, and evaluation logic for
deciding whether a generation should advance, retry, or rollback:

- AdvancementMetrics: composite gate input with score delta, mean,
  variance, error_rate, crash_count, confidence, sample_agreement,
  search_proxy_score vs resolved_truth_score, generalization_gap,
  cost/tokens
- AdvancementRationale: operator-visible explanation with decision,
  reason, component_scores, binding_checks (what drove the decision),
  proxy_signals (informational), risk_flags (concerns)
- evaluate_advancement(): multi-objective evaluation:
  1. Error rate veto (binding: > 20% → rollback)
  2. Confidence/variance risk flags (proxy)
  3. Resolved truth score binds over search proxy when present
  4. Score regression always rolls back
  5. Delta < threshold → retry (if budget), else rollback

Key design per comments:
- Confidence and uncertainty are first-class inputs
- search_proxy_score and resolved_truth_score are separate —
  truth binds, proxy signals
- Component scores and risk flags are always in the rationale
  for auditability
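
As a rough illustration of the auditability point, the sketch below shows a rationale record whose component scores and risk flags always exist (empty by default) and survive a JSON roundtrip. The class name `RationaleSketch`, the field types, and the `to_json`/`from_json` helpers are assumptions; the field names follow the AdvancementRationale description above.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class RationaleSketch:
    """Hypothetical stand-in for AdvancementRationale."""

    decision: str
    reason: str
    # Default factories keep these keys present in every serialized
    # rationale, even when nothing was flagged.
    component_scores: dict[str, float] = field(default_factory=dict)
    binding_checks: list[str] = field(default_factory=list)
    proxy_signals: list[str] = field(default_factory=list)
    risk_flags: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "RationaleSketch":
        return cls(**json.loads(payload))


rationale = RationaleSketch(
    decision="retry",
    reason="delta below min_delta",
    component_scores={"score_delta": 0.002},
    binding_checks=["score_delta"],
    proxy_signals=["search_proxy_score"],
    risk_flags=["low_confidence"],
)
roundtripped = RationaleSketch.from_json(rationale.to_json())
```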

12 tests covering metrics (construction, delta, roundtrip), rationale
(construction, roundtrip), evaluation (advance, rollback on regression,
retry on marginal, rollback on high error, risk flag on low confidence,
truth overrides proxy, component scores present).
linear bot commented Mar 18, 2026

@jayscambler jayscambler merged commit 94ee9ee into main Mar 18, 2026
4 checks passed