feat(AC-323): holdout evaluation before advancing a generation #452

Merged: jayscambler merged 4 commits into main from feat/ac-323-holdout-evaluation on Mar 19, 2026
@jayscambler

Summary

  • AC-323: Holdout verification that blocks generation advancement when candidates fail to generalize beyond in-sample performance

New Module

Module: harness/pipeline/holdout.py
Purpose: HoldoutPolicy, HoldoutResult, holdout_check(), HoldoutVerifier

Key Design

A generation can win the main tournament and still be blocked:

  • min_holdout_score: holdout mean below threshold → fail
  • max_generalization_gap: |in_sample - holdout| too large → fail
  • enabled=False: auto-pass (for scenarios that don't need holdout evaluation)
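
As a rough illustration of the three rules above, here is a minimal sketch of what a pure check function could look like. The field names come from this description, but the default values, the `HoldoutResult` field names, the argument order, and the choice to fail on empty scores are all assumptions; this is not the code in harness/pipeline/holdout.py.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence


@dataclass(frozen=True)
class HoldoutPolicy:
    # Field names follow the PR description; default values are assumed.
    min_holdout_score: float = 0.5
    max_generalization_gap: float = 0.2
    enabled: bool = True


@dataclass(frozen=True)
class HoldoutResult:
    # Hypothetical field names mirroring "mean/scores, in-sample score,
    # generalization gap, pass/fail with reason".
    passed: bool
    reason: str
    in_sample_score: float
    holdout_scores: List[float]
    holdout_mean: float
    generalization_gap: float


def holdout_check(
    policy: HoldoutPolicy,
    in_sample_score: float,
    holdout_scores: Sequence[float],
) -> HoldoutResult:
    """Compare holdout scores against the policy thresholds."""
    scores = list(holdout_scores)
    if not policy.enabled:
        # Disabled policy: auto-pass without looking at the scores.
        return HoldoutResult(True, "holdout disabled", in_sample_score, scores, 0.0, 0.0)
    if not scores:
        # Assumption: with nothing to verify against, the check fails.
        return HoldoutResult(False, "no holdout scores", in_sample_score, scores, 0.0, 0.0)

    holdout_mean = mean(scores)
    gap = abs(in_sample_score - holdout_mean)

    if holdout_mean < policy.min_holdout_score:
        reason, passed = "holdout mean below min_holdout_score", False
    elif gap > policy.max_generalization_gap:
        reason, passed = "generalization gap above max_generalization_gap", False
    else:
        reason, passed = "ok", True
    return HoldoutResult(passed, reason, in_sample_score, scores, holdout_mean, gap)


policy = HoldoutPolicy()
good = holdout_check(policy, in_sample_score=0.7, holdout_scores=[0.6, 0.7])
overfit = holdout_check(policy, in_sample_score=0.9, holdout_scores=[0.6, 0.6])
weak = holdout_check(policy, in_sample_score=0.4, holdout_scores=[0.3, 0.4])
```

In this sketch the minimum-score rule is tested before the gap rule, so a candidate that violates both reports the score failure; the ordering in the merged module may differ.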

HoldoutVerifier accepts a pluggable evaluate_fn(strategy, seed) -> float, so it can be tested without live execution. Holdout seeds use a configurable offset (default 10000) to avoid overlap with tournament seeds.
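
The pluggable evaluator and seed offset might be wired together roughly as follows. This is a sketch only: `holdout_seed_list`, `run_holdout`, and the `num_seeds` parameter are hypothetical names, the strategy is treated as an opaque value, and the no-overlap property assumes tournament seeds stay below the offset.

```python
from typing import Any, Callable, List

# Signature taken from the PR description: evaluate_fn(strategy, seed) -> float
EvaluateFn = Callable[[Any, int], float]


def holdout_seed_list(num_seeds: int, seed_offset: int = 10000) -> List[int]:
    """Return distinct holdout seeds starting at the offset (hypothetical helper)."""
    return [seed_offset + i for i in range(num_seeds)]


def run_holdout(
    strategy: Any,
    evaluate_fn: EvaluateFn,
    num_seeds: int = 5,
    seed_offset: int = 10000,
) -> List[float]:
    """Score the strategy once per holdout seed using the injected evaluator."""
    seeds = holdout_seed_list(num_seeds, seed_offset)
    return [evaluate_fn(strategy, seed) for seed in seeds]


# A fake evaluator stands in for live execution, which is what makes
# the verifier testable in isolation.
seen_seeds: List[int] = []


def fake_evaluate(strategy: Any, seed: int) -> float:
    seen_seeds.append(seed)
    return 0.5


scores = run_holdout("candidate-strategy", fake_evaluate, num_seeds=3)
```

Injecting the evaluator keeps seed selection and threshold logic separate from whatever actually runs a match, so tests can assert on the seeds requested without executing a scenario.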

Feeds into AC-322's advancement contract via resolved_truth_score and generalization_gap.

Test plan

  • 14 tests in test_holdout_evaluation.py
  • HoldoutPolicy: defaults, custom, disabled, roundtrip
  • HoldoutResult: construction, roundtrip
  • holdout_check: passes, fails below threshold, fails gap too large, zero gap, empty scores
  • HoldoutVerifier: with evaluator, disabled auto-pass, unique seeds
  • ruff clean, mypy clean, full suite 4417 passed

linear bot commented Mar 18, 2026

jayscambler force-pushed the feat/ac-323-holdout-evaluation branch from 120c45b to b091a53 on March 18, 2026 16:02
Adds holdout verification that blocks advancement when candidates
fail to generalize beyond in-sample performance:

- HoldoutPolicy: configurable holdout_seeds, min_holdout_score,
  max_generalization_gap, seed_offset, enabled flag
- HoldoutResult: holdout mean/scores, in-sample score,
  generalization gap, pass/fail with reason
- holdout_check(): pure function that verifies holdout scores
  against policy thresholds (minimum score and max gap)
- HoldoutVerifier: runs evaluation on held-out seeds via pluggable
  evaluate_fn, auto-passes when policy is disabled

A generation can win the main tournament (in-sample) and still be
blocked if holdout evaluation shows:
- Mean holdout score below min_holdout_score threshold
- Generalization gap (|in_sample - holdout|) exceeds max_generalization_gap

Holdout seeds use a configurable offset (default 10000) to ensure
they don't overlap with tournament seeds.

14 tests covering policy (defaults, custom, disabled, roundtrip),
result (construction, roundtrip), holdout_check (passes, fails below
threshold, fails gap too large, passes zero gap, empty scores), and
verifier (with evaluator, disabled auto-pass, unique seeds).

jayscambler force-pushed the feat/ac-323-holdout-evaluation branch from 46ecfb8 to f0ebaa9 on March 18, 2026 22:41
jayscambler merged commit 616ca46 into main on Mar 19, 2026
4 checks passed