feat(AC-322): multi-objective advancement contract for generation gating #451
Merged
jayscambler merged 2 commits into main on Mar 18, 2026
Defines the canonical metrics, rationale, and evaluation logic for deciding whether a generation should advance, retry, or rollback:

- AdvancementMetrics: composite gate input with score delta, mean, variance, error_rate, crash_count, confidence, sample_agreement, search_proxy_score vs resolved_truth_score, generalization_gap, cost/tokens
- AdvancementRationale: operator-visible explanation with decision, reason, component_scores, binding_checks (what drove the decision), proxy_signals (informational), risk_flags (concerns)
- evaluate_advancement(): multi-objective evaluation:
  1. Error rate veto (binding: > 20% → rollback)
  2. Confidence/variance risk flags (proxy)
  3. Resolved truth score binds over search proxy when present
  4. Score regression always rolls back
  5. Delta < threshold → retry (if budget), else rollback

Key design per comments:

- Confidence and uncertainty are first-class inputs
- search_proxy_score and resolved_truth_score are separate: truth binds, proxy signals
- Component scores and risk flags are always in the rationale for auditability

12 tests covering metrics (construction, delta, roundtrip), rationale (construction, roundtrip), evaluation (advance, rollback on regression, retry on marginal, rollback on high error, risk flag on low confidence, truth overrides proxy, component scores present).
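The five-step evaluation chain described above can be sketched roughly as follows. This is an illustration only, not the code in harness/pipeline/advancement.py. The type and function names come from the description, but the field subset, the `retries_left` parameter, every threshold except the 20% error-rate veto, and the interpretation of "truth binds" (comparing resolved_truth_score against the previous best in place of the search score) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_ERROR_RATE = 0.20   # stated in the PR description
MIN_CONFIDENCE = 0.5    # assumed threshold
MAX_VARIANCE = 0.1      # assumed threshold
MIN_DELTA = 0.01        # assumed threshold


@dataclass
class AdvancementMetrics:
    # Reduced, hypothetical subset of the fields listed in the description.
    best_score: float
    previous_best: float
    error_rate: float = 0.0
    confidence: float = 1.0
    variance: float = 0.0
    search_proxy_score: Optional[float] = None
    resolved_truth_score: Optional[float] = None

    @property
    def delta(self) -> float:
        return self.best_score - self.previous_best


@dataclass
class AdvancementRationale:
    decision: str = ""
    reason: str = ""
    component_scores: dict = field(default_factory=dict)
    binding_checks: list = field(default_factory=list)
    proxy_signals: list = field(default_factory=list)
    risk_flags: list = field(default_factory=list)


def evaluate_advancement(m: AdvancementMetrics, retries_left: int = 0) -> AdvancementRationale:
    # Component scores are always recorded so the decision can be audited.
    r = AdvancementRationale(component_scores={
        "delta": m.delta, "error_rate": m.error_rate, "confidence": m.confidence,
    })

    def decide(decision: str, reason: str) -> AdvancementRationale:
        r.decision, r.reason = decision, reason
        return r

    # 1. Error rate veto (binding).
    if m.error_rate > MAX_ERROR_RATE:
        r.binding_checks.append("error_rate")
        return decide("rollback", "error rate above 20%")

    # 2. Confidence/variance produce risk flags but never decide on their own.
    if m.confidence < MIN_CONFIDENCE:
        r.risk_flags.append("low_confidence")
    if m.variance > MAX_VARIANCE:
        r.risk_flags.append("high_variance")

    # 3. Truth binds over proxy when present (assumed interpretation).
    if m.search_proxy_score is not None:
        r.proxy_signals.append("search_proxy_score")
    if m.resolved_truth_score is not None:
        effective_delta = m.resolved_truth_score - m.previous_best
        r.binding_checks.append("resolved_truth_score")
    else:
        effective_delta = m.delta
        r.binding_checks.append("score_delta")

    # 4. Regression always rolls back.
    if effective_delta < 0:
        return decide("rollback", "score regressed")

    # 5. Marginal gain: retry while budget remains, otherwise roll back.
    if effective_delta < MIN_DELTA:
        if retries_left > 0:
            return decide("retry", "delta below threshold")
        return decide("rollback", "delta below threshold, no retries left")

    return decide("advance", "delta meets threshold")
```

In this sketch a low-confidence generation can still advance; the concern is surfaced through `risk_flags` rather than changing the decision, which matches the "proxy" label on step 2.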
Summary
New Module
harness/pipeline/advancement.py
Key Design Decisions (per comments)
- confidence, sample_agreement as explicit metric fields with risk flagging below thresholds
- search_proxy_score is informational; resolved_truth_score binds when present and overrides the delta check
- binding_checks list shows what drove the decision; proxy_signals shows what was informational
Evaluation Chain
Test plan
test_advancement_contract.py
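A roundtrip test of the kind the description mentions might look like the sketch below. The field names come from the PR description; the `to_dict`/`from_dict` method names and the stand-in class are assumptions made to keep the example self-contained, and the real tests in test_advancement_contract.py may be structured differently.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class AdvancementRationale:
    # Stand-in for the real class; serialization method names are hypothetical.
    decision: str
    reason: str
    component_scores: dict = field(default_factory=dict)
    binding_checks: list = field(default_factory=list)
    proxy_signals: list = field(default_factory=list)
    risk_flags: list = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict deep-copies nested containers, so the result is detached.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "AdvancementRationale":
        return cls(**data)


def test_rationale_roundtrip() -> None:
    original = AdvancementRationale(
        decision="retry",
        reason="delta below threshold",
        component_scores={"delta": 0.005},
        binding_checks=["score_delta"],
        proxy_signals=["search_proxy_score"],
        risk_flags=["low_confidence"],
    )
    payload = original.to_dict()
    restored = AdvancementRationale.from_dict(payload)
    assert restored == original

    # Mutating the serialized payload must not leak back into the original.
    payload["risk_flags"].append("high_variance")
    assert original.risk_flags == ["low_confidence"]
```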