feat(AC-323): holdout evaluation before advancing a generation #452
Merged
jayscambler merged 4 commits into main on Mar 19, 2026
Adds holdout verification that blocks advancement when candidates fail to generalize beyond in-sample performance:

- HoldoutPolicy: configurable holdout_seeds, min_holdout_score, max_generalization_gap, seed_offset, enabled flag
- HoldoutResult: holdout mean/scores, in-sample score, generalization gap, pass/fail with reason
- holdout_check(): pure function that verifies holdout scores against policy thresholds (minimum score and max gap)
- HoldoutVerifier: runs evaluation on held-out seeds via pluggable evaluate_fn, auto-passes when policy is disabled

A generation can win the main tournament (in-sample) and still be blocked if holdout evaluation shows:

- Mean holdout score below min_holdout_score threshold
- Generalization gap (|in_sample - holdout|) exceeds max_generalization_gap

Holdout seeds use a configurable offset (default 10000) to ensure they don't overlap with tournament seeds.

14 tests covering policy (defaults, custom, disabled, roundtrip), result (construction, roundtrip), holdout_check (passes, fails below threshold, fails gap too large, passes zero gap, empty scores), and verifier (with evaluator, disabled auto-pass, unique seeds).
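The two threshold checks described above can be sketched roughly as follows. This is an illustration, not the code in harness/pipeline/holdout.py: the field names come from the commit message, but the field types, every default other than seed_offset=10000, the holdout_check signature, and the choice to fail on empty scores are assumptions.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class HoldoutPolicy:
    holdout_seeds: int = 5                # assumed: number of held-out seeds
    min_holdout_score: float = 0.5        # assumed default
    max_generalization_gap: float = 0.2   # assumed default
    seed_offset: int = 10000              # default stated in the PR
    enabled: bool = True


@dataclass(frozen=True)
class HoldoutResult:
    holdout_mean: float
    holdout_scores: list[float]
    in_sample_score: float
    generalization_gap: float
    passed: bool
    reason: str


def holdout_check(policy: HoldoutPolicy, in_sample_score: float,
                  holdout_scores: list[float]) -> HoldoutResult:
    """Pure check: compare holdout scores against the policy thresholds."""
    if not holdout_scores:
        # Assumption: with no holdout evidence, the candidate is blocked.
        return HoldoutResult(0.0, [], in_sample_score, abs(in_sample_score),
                             False, "no holdout scores")

    mean = fmean(holdout_scores)
    gap = abs(in_sample_score - mean)  # |in_sample - holdout|

    if mean < policy.min_holdout_score:
        passed, reason = False, "holdout mean below min_holdout_score"
    elif gap > policy.max_generalization_gap:
        passed, reason = False, "generalization gap above max_generalization_gap"
    else:
        passed, reason = True, "holdout passed"

    return HoldoutResult(mean, list(holdout_scores), in_sample_score,
                         gap, passed, reason)
```

Under these assumed thresholds, a candidate with an in-sample score of 0.95 and holdout scores of 0.6 clears the minimum score but is still blocked, because the gap of 0.35 exceeds 0.2.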
Summary

New Module

harness/pipeline/holdout.py

Key Design

A generation can win the main tournament and still be blocked:

- min_holdout_score: holdout mean below threshold → fail
- max_generalization_gap: |in_sample - holdout| too large → fail
- enabled: False: auto-pass (for scenarios that don't need holdout)

HoldoutVerifier accepts a pluggable evaluate_fn(strategy, seed) -> float for testability without live execution. Seeds use configurable offset (default 10000) to avoid overlap with tournament seeds.

Feeds into AC-322's advancement contract via resolved_truth_score and generalization_gap.

Test plan

test_holdout_evaluation.py
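To show how the pluggable evaluate_fn(strategy, seed) -> float keeps tests free of live execution, here is a minimal sketch of a verifier driven by a stub evaluator. Only the evaluate_fn shape, the offset default of 10000, and the auto-pass on a disabled policy come from the description above. The seed derivation (seed_offset + i), the verify and holdout_seed_list method names, the tuple return value, and the threshold defaults are hypothetical.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Any, Callable


@dataclass(frozen=True)
class HoldoutPolicy:
    holdout_seeds: int = 5                # assumed: number of held-out seeds
    min_holdout_score: float = 0.5        # assumed default
    max_generalization_gap: float = 0.2   # assumed default
    seed_offset: int = 10000              # default stated in the PR
    enabled: bool = True


class HoldoutVerifier:
    def __init__(self, policy: HoldoutPolicy,
                 evaluate_fn: Callable[[Any, int], float]):
        self.policy = policy
        self.evaluate_fn = evaluate_fn

    def holdout_seed_list(self) -> list[int]:
        # Assumption: seeds are offset + 0, offset + 1, ... so they stay
        # clear of tournament seeds as long as those are below the offset.
        p = self.policy
        return [p.seed_offset + i for i in range(p.holdout_seeds)]

    def verify(self, strategy: Any,
               in_sample_score: float) -> tuple[bool, str, list[float]]:
        if not self.policy.enabled:
            # Disabled policy auto-passes without running any evaluation.
            return True, "holdout disabled", []

        scores = [self.evaluate_fn(strategy, s) for s in self.holdout_seed_list()]
        if not scores:
            return False, "no holdout scores", scores  # assumed behavior

        mean = fmean(scores)
        if mean < self.policy.min_holdout_score:
            return False, "holdout mean below min_holdout_score", scores
        if abs(in_sample_score - mean) > self.policy.max_generalization_gap:
            return False, "generalization gap above max_generalization_gap", scores
        return True, "holdout passed", scores


# Usage: a stub evaluator stands in for live tournament execution.
calls: list[int] = []

def fake_eval(strategy: Any, seed: int) -> float:
    calls.append(seed)   # record which seeds were requested
    return 0.7           # fixed score, for illustration only

passed, reason, scores = HoldoutVerifier(HoldoutPolicy(), fake_eval).verify("s", 0.8)
```

Because the evaluator is injected, a test can assert on the seeds that were requested and on the pass/fail decision without running a real match.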