From 549cfb13a612077a61e6bb3393e4e539c3695740 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Fri, 6 Feb 2026 05:23:29 +0000 Subject: [PATCH 1/9] docs: add evaluation architecture overview --- README.md | 3 ++ docs/evaluation-architecture.md | 58 +++++++++++++++++++++++++++++++++ 2 files changed, 61 insertions(+) create mode 100644 docs/evaluation-architecture.md diff --git a/README.md b/README.md index 16476f0c..f2e6c263 100644 --- a/README.md +++ b/README.md @@ -168,6 +168,9 @@ print(f"Score (bounded): {result.score}") print(f"Score (unbounded): {result.score_unbounded}") ``` +See `docs/evaluation-architecture.md` for an overview of the evaluation stack +and runner mapping. + ### Batch Evaluation For testing your solutions at scale with public test cases. diff --git a/docs/evaluation-architecture.md b/docs/evaluation-architecture.md new file mode 100644 index 00000000..dbd635a8 --- /dev/null +++ b/docs/evaluation-architecture.md @@ -0,0 +1,58 @@ +# Evaluation Architecture + +This document summarizes how evaluation flows through the codebase and names the +key components introduced in the recent refactors. + +## Components + +### SingleEvaluator +`SingleEvaluator` is the unified API for **single-problem evaluation**. It: + +- Chooses a runner based on track and backend. +- Registers cleanup hooks for SkyPilot clusters (SIGINT/atexit). + +### BatchEvaluator +`BatchEvaluator` orchestrates **batch evaluation** with parallel workers and +SkyPilot cluster pools. It handles: + +- Work queues, resumable state, and result aggregation. +- Cluster pool lifecycle (create/cleanup). + +### Runners +Runners execute the actual evaluation. 
The mapping is: + +- **Research + docker** → `ResearchDockerRunner` +- **Research + skypilot** → `ResearchSkyPilotRunner` +- **Algorithmic + docker** → `AlgorithmicLocalRunner` +- **Algorithmic + skypilot** → `AlgorithmicSkyPilotRunner` + +## Runner Flow (Research) + +Both research runners share the same input validation and config parsing: + +- Validate solution file and `.FAILED` marker. +- Ensure problem path exists. +- Load `config.yaml` and runtime settings. +- Build uv install command if `uv_project` is provided. + +The execution path diverges only at the backend: + +- Docker runner launches a local container. +- SkyPilot runner provisions and executes on cloud resources. + +## Cleanup Behavior + +- `ResearchSkyPilotRunner` always downs the evaluation cluster unless + `keep_cluster=True`. +- Active clusters are tracked in a registry so `SingleEvaluator` can clean up on + SIGINT/atexit. +- `BatchEvaluator` uses its own cluster pool cleanup (independent of the + registry). + +## CI Mapping + +- **Validate Problems** uses `SingleEvaluator` through + `scripts/validate_problems.py`. +- **Weekly Batch Evaluation** uses `BatchEvaluator` via `scripts/run_eval.sh` + and typically runs on SkyPilot (GCP by default). + From edb51982fc9cee989b87cfa076011cad34b05118 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Fri, 6 Feb 2026 05:37:48 +0000 Subject: [PATCH 2/9] docs: clarify evaluation design decisions --- docs/evaluation-architecture.md | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/docs/evaluation-architecture.md b/docs/evaluation-architecture.md index dbd635a8..e1d81790 100644 --- a/docs/evaluation-architecture.md +++ b/docs/evaluation-architecture.md @@ -26,6 +26,20 @@ Runners execute the actual evaluation. 
The mapping is: - **Algorithmic + docker** → `AlgorithmicLocalRunner` - **Algorithmic + skypilot** → `AlgorithmicSkyPilotRunner` +## Design Decisions + +- **Single vs Batch**: `SingleEvaluator` stays focused on one-off evaluation + (simple API + cleanup hooks), while `BatchEvaluator` owns scheduling, + resumable state, and cluster pools. This keeps single-run paths lightweight + and batch runs scalable. +- **Shared research helpers**: input validation and config parsing are shared + in `ResearchRunner` to avoid drift between Docker and SkyPilot backends. +- **Cleanup strategy**: research SkyPilot evaluations down clusters by default + (cost/safety), with `keep_cluster` as the opt-out. Batch uses its own pool + cleanup because cluster lifecycle is managed at the scheduler level. +- **Naming**: runner class names are explicit about track + backend + (e.g., `ResearchDockerRunner`) to remove ambiguity in logs and docs. + ## Runner Flow (Research) Both research runners share the same input validation and config parsing: @@ -55,4 +69,3 @@ The execution path diverges only at the backend: `scripts/validate_problems.py`. - **Weekly Batch Evaluation** uses `BatchEvaluator` via `scripts/run_eval.sh` and typically runs on SkyPilot (GCP by default). - From 268a19cb3ee88de805745f0fb3deceb1e282d7cd Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Fri, 6 Feb 2026 05:38:43 +0000 Subject: [PATCH 3/9] docs: simplify cleanup rationale --- docs/evaluation-architecture.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/evaluation-architecture.md b/docs/evaluation-architecture.md index e1d81790..7b7d949b 100644 --- a/docs/evaluation-architecture.md +++ b/docs/evaluation-architecture.md @@ -34,9 +34,8 @@ Runners execute the actual evaluation. The mapping is: and batch runs scalable. - **Shared research helpers**: input validation and config parsing are shared in `ResearchRunner` to avoid drift between Docker and SkyPilot backends. 
-- **Cleanup strategy**: research SkyPilot evaluations down clusters by default - (cost/safety), with `keep_cluster` as the opt-out. Batch uses its own pool - cleanup because cluster lifecycle is managed at the scheduler level. +- **Cleanup strategy**: research evaluations down clusters by default unless + `keep_cluster` is set; batch handles its own pool cleanup. - **Naming**: runner class names are explicit about track + backend (e.g., `ResearchDockerRunner`) to remove ambiguity in logs and docs. From 2973560fe9b3f5c9738912a0e8f413abce181324 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Fri, 6 Feb 2026 05:40:29 +0000 Subject: [PATCH 4/9] docs: expand evaluation architecture overview --- docs/evaluation-architecture.md | 48 +++++++++++++++++++++------------ 1 file changed, 31 insertions(+), 17 deletions(-) diff --git a/docs/evaluation-architecture.md b/docs/evaluation-architecture.md index 7b7d949b..2b4bcf65 100644 --- a/docs/evaluation-architecture.md +++ b/docs/evaluation-architecture.md @@ -1,18 +1,39 @@ # Evaluation Architecture -This document summarizes how evaluation flows through the codebase and names the -key components introduced in the recent refactors. +This document summarizes how evaluation flows through the codebase, who it is +for, and the guiding design choices behind the current structure. + +## Audience + +- Contributors changing evaluation behavior, runners, or CI workflows. +- Maintainers debugging evaluation failures or infra cleanup issues. + +## Goals + +- Clear separation between single-problem and batch evaluation. +- Shared validation/config parsing across research backends. +- Predictable cleanup to avoid orphaned cloud resources. +- Explicit naming to avoid backend ambiguity. 
+ +## Architecture at a Glance + +- **CLI**: + - `frontier eval` → `SingleEvaluator` + - `frontier batch` → `BatchEvaluator` +- **CI**: + - Validate Problems → `scripts/validate_problems.py` → `SingleEvaluator` + - Weekly Batch Evaluation → `scripts/run_eval.sh` → `BatchEvaluator` ## Components ### SingleEvaluator -`SingleEvaluator` is the unified API for **single-problem evaluation**. It: +Unified API for **single-problem evaluation**. It: - Chooses a runner based on track and backend. - Registers cleanup hooks for SkyPilot clusters (SIGINT/atexit). ### BatchEvaluator -`BatchEvaluator` orchestrates **batch evaluation** with parallel workers and +Orchestrates **batch evaluation** with parallel workers and SkyPilot cluster pools. It handles: - Work queues, resumable state, and result aggregation. @@ -53,18 +74,11 @@ The execution path diverges only at the backend: - Docker runner launches a local container. - SkyPilot runner provisions and executes on cloud resources. -## Cleanup Behavior - -- `ResearchSkyPilotRunner` always downs the evaluation cluster unless - `keep_cluster=True`. -- Active clusters are tracked in a registry so `SingleEvaluator` can clean up on - SIGINT/atexit. -- `BatchEvaluator` uses its own cluster pool cleanup (independent of the - registry). +## Operations (Cleanup + CI) -## CI Mapping +- **Cleanup**: research evaluations down clusters by default unless + `keep_cluster=True`; `SingleEvaluator` also cleans up on SIGINT/atexit using an + active-cluster registry. `BatchEvaluator` owns its cluster pool lifecycle. -- **Validate Problems** uses `SingleEvaluator` through - `scripts/validate_problems.py`. -- **Weekly Batch Evaluation** uses `BatchEvaluator` via `scripts/run_eval.sh` - and typically runs on SkyPilot (GCP by default). +- **CI**: Validate Problems runs single evals; Weekly Batch Evaluation runs + batch evals (typically SkyPilot on GCP). 
From 0f0e49c5678a0173103f5f29eee20341065dcaf1 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Fri, 6 Feb 2026 05:41:19 +0000 Subject: [PATCH 5/9] docs: add ARCHITECTURE overview --- docs/evaluation-architecture.md => ARCHITECTURE.md | 0 README.md | 2 +- 2 files changed, 1 insertion(+), 1 deletion(-) rename docs/evaluation-architecture.md => ARCHITECTURE.md (100%) diff --git a/docs/evaluation-architecture.md b/ARCHITECTURE.md similarity index 100% rename from docs/evaluation-architecture.md rename to ARCHITECTURE.md diff --git a/README.md b/README.md index f2e6c263..515b7bd2 100644 --- a/README.md +++ b/README.md @@ -168,7 +168,7 @@ print(f"Score (bounded): {result.score}") print(f"Score (unbounded): {result.score_unbounded}") ``` -See `docs/evaluation-architecture.md` for an overview of the evaluation stack +See `ARCHITECTURE.md` for an overview of the evaluation stack and runner mapping. ### Batch Evaluation From 26efaca59eb51576c4cd3ecbe30fd5cdfc384df7 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Fri, 6 Feb 2026 05:41:42 +0000 Subject: [PATCH 6/9] docs: clarify architecture audience --- ARCHITECTURE.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 2b4bcf65..fc9f6c0b 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -5,8 +5,8 @@ for, and the guiding design choices behind the current structure. ## Audience -- Contributors changing evaluation behavior, runners, or CI workflows. -- Maintainers debugging evaluation failures or infra cleanup issues. +- Researchers and engineers using Frontier-CS to evaluate solutions. +- Contributors extending problems, runners, or evaluation workflows. 
## Goals From b4afb495dfe1f48581739f6fe2a3a11094aedd7d Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Fri, 6 Feb 2026 05:42:12 +0000 Subject: [PATCH 7/9] docs: align architecture audience with project roles --- ARCHITECTURE.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index fc9f6c0b..a2b430cc 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -5,8 +5,9 @@ for, and the guiding design choices behind the current structure. ## Audience -- Researchers and engineers using Frontier-CS to evaluate solutions. -- Contributors extending problems, runners, or evaluation workflows. +- **Problem contributors** (see `CONTRIBUTING.md`) +- **Model submitters** (see `SUBMIT.md`) +- **General researchers** using Frontier-CS to evaluate solutions ## Goals From 4d204ce0f9354f58076afbb9a62512f86d7e8d43 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Fri, 6 Feb 2026 05:43:11 +0000 Subject: [PATCH 8/9] docs: capture evaluation design choices --- ARCHITECTURE.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index a2b430cc..103d9081 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -60,6 +60,18 @@ Runners execute the actual evaluation. The mapping is: `keep_cluster` is set; batch handles its own pool cleanup. - **Naming**: runner class names are explicit about track + backend (e.g., `ResearchDockerRunner`) to remove ambiguity in logs and docs. +- **Score semantics**: a score of **0** can still mean the evaluator ran + successfully; failures are reported via status/metadata rather than relying + solely on score. +- **Reference solutions**: problems include `reference.cpp`/`reference.py` so + CI can verify end-to-end evaluation without requiring model submissions. +- **Results separation**: evaluation outputs are pushed to a dedicated results + repository to keep the main repo lean and auditable. 
+- **Internal vs public**: internal test cases and tooling live in a private + repo, with public artifacts kept minimal but compatible. +- **Weekly vs local**: weekly CI uses `scripts/run_eval.sh` with batch + scheduling, while local runs can use the same script or `frontier eval` for + quick iteration. ## Runner Flow (Research) From cebb6ae2f0f31255bbec5d57deadaf554777d27d Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Tue, 10 Feb 2026 17:22:36 +0000 Subject: [PATCH 9/9] docs: polish --- ARCHITECTURE.md | 122 ++++++++++++++++++++++++++++++------------------ 1 file changed, 76 insertions(+), 46 deletions(-) diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 103d9081..47cfe6fa 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -1,7 +1,7 @@ # Evaluation Architecture -This document summarizes how evaluation flows through the codebase, who it is -for, and the guiding design choices behind the current structure. +This document describes how evaluation flows through the codebase and the +design choices behind the current structure. ## Audience @@ -12,7 +12,7 @@ for, and the guiding design choices behind the current structure. ## Goals - Clear separation between single-problem and batch evaluation. -- Shared validation/config parsing across research backends. +- Shared validation and config parsing across research backends. - Predictable cleanup to avoid orphaned cloud resources. - Explicit naming to avoid backend ambiguity. @@ -21,6 +21,7 @@ for, and the guiding design choices behind the current structure. - **CLI**: - `frontier eval` → `SingleEvaluator` - `frontier batch` → `BatchEvaluator` + - `frontier list` / `frontier show` → problem discovery - **CI**: - Validate Problems → `scripts/validate_problems.py` → `SingleEvaluator` - Weekly Batch Evaluation → `scripts/run_eval.sh` → `BatchEvaluator` @@ -28,70 +29,99 @@ for, and the guiding design choices behind the current structure. ## Components ### SingleEvaluator -Unified API for **single-problem evaluation**. 
It: +Unified API for **single-problem evaluation**: -- Chooses a runner based on track and backend. -- Registers cleanup hooks for SkyPilot clusters (SIGINT/atexit). +- Selects a runner based on track and backend. +- Registers SIGINT/atexit hooks to tear down SkyPilot clusters on exit. ### BatchEvaluator -Orchestrates **batch evaluation** with parallel workers and -SkyPilot cluster pools. It handles: +Orchestrates **batch evaluation** with parallel workers and SkyPilot +cluster pools: -- Work queues, resumable state, and result aggregation. -- Cluster pool lifecycle (create/cleanup). +- Work queues with resumable state and result aggregation. +- Resource-grouped cluster pools — pairs are grouped by + `ResourceSignature` (cloud, accelerators, instance type) so that + CPU-only and GPU problems run on separate pools. +- Hash-based resume — each result stores solution/problem hashes so stale + results are automatically re-evaluated when source changes. ### Runners -Runners execute the actual evaluation. The mapping is: - -- **Research + docker** → `ResearchDockerRunner` -- **Research + skypilot** → `ResearchSkyPilotRunner` -- **Algorithmic + docker** → `AlgorithmicLocalRunner` -- **Algorithmic + skypilot** → `AlgorithmicSkyPilotRunner` +Runners execute evaluation. The class hierarchy: + +``` +Runner (ABC) +├── ResearchRunner # shared: problem validation, config loading, uv install +│ ├── ResearchDockerRunner # local container +│ └── ResearchSkyPilotRunner # cloud via SkyPilot +├── AlgorithmicLocalRunner # go-judge via HTTP +└── AlgorithmicSkyPilotRunner # go-judge on SkyPilot +``` + +### Solution Generation (`gen/`) +The `gen/` module generates solutions by calling LLMs. It is independent of +the evaluation pipeline — `frontier eval` and `frontier batch` do not +depend on it. It provides an LLM interface, an API key pool for concurrent +requests, and solution file formatting. 
## Design Decisions -- **Single vs Batch**: `SingleEvaluator` stays focused on one-off evaluation +- **Single vs Batch**: `SingleEvaluator` stays focused on one-off runs (simple API + cleanup hooks), while `BatchEvaluator` owns scheduling, - resumable state, and cluster pools. This keeps single-run paths lightweight - and batch runs scalable. -- **Shared research helpers**: input validation and config parsing are shared - in `ResearchRunner` to avoid drift between Docker and SkyPilot backends. -- **Cleanup strategy**: research evaluations down clusters by default unless - `keep_cluster` is set; batch handles its own pool cleanup. -- **Naming**: runner class names are explicit about track + backend + resumable state, and cluster pools. This keeps single-run paths + lightweight and batch runs scalable. +- **Shared research helpers**: input validation and config parsing live in + the `ResearchRunner` base class so Docker and SkyPilot backends stay in + sync. +- **Cleanup strategy**: research evaluations tear down clusters by default + unless `keep_cluster` is set. `SingleEvaluator` cleans up via an + active-cluster registry; `BatchEvaluator` manages its own pool lifecycle. +- **Naming**: runner class names encode track + backend (e.g., `ResearchDockerRunner`) to remove ambiguity in logs and docs. -- **Score semantics**: a score of **0** can still mean the evaluator ran - successfully; failures are reported via status/metadata rather than relying - solely on score. -- **Reference solutions**: problems include `reference.cpp`/`reference.py` so - CI can verify end-to-end evaluation without requiring model submissions. -- **Results separation**: evaluation outputs are pushed to a dedicated results +- **Score semantics**: a score of **0** can mean the evaluator ran + successfully; failures are reported via status/metadata rather than score + alone. 
+- **Reference solutions**: problems ship with `reference.cpp`/`reference.py` + so CI can verify end-to-end evaluation without model submissions. +- **Results separation**: evaluation outputs go to a dedicated results repository to keep the main repo lean and auditable. - **Internal vs public**: internal test cases and tooling live in a private - repo, with public artifacts kept minimal but compatible. + repo; public artifacts are kept minimal but compatible. - **Weekly vs local**: weekly CI uses `scripts/run_eval.sh` with batch - scheduling, while local runs can use the same script or `frontier eval` for - quick iteration. + scheduling; local runs use the same script or `frontier eval` for quick + iteration. +- **Resource-grouped cluster pools**: `BatchEvaluator` groups pairs by + `ResourceSignature` (cloud × accelerators × instance type) and creates a + separate pool per group, avoiding the waste of running CPU-only problems + on GPU clusters. +- **Hash-based resume**: resuming a batch compares solution/problem hashes + against stored results. Changed inputs are re-evaluated even when a prior + result exists, preventing silently stale scores. +- **Generation vs evaluation**: solution generation (`gen/`) is fully + decoupled from evaluation. Generated files are plain source files with no + special metadata; the evaluator has no dependency on the generation + pipeline. ## Runner Flow (Research) -Both research runners share the same input validation and config parsing: +Both research runners share the same pre-evaluation steps (via +`ResearchRunner`): -- Validate solution file and `.FAILED` marker. -- Ensure problem path exists. -- Load `config.yaml` and runtime settings. -- Build uv install command if `uv_project` is provided. +1. Validate solution file and `.FAILED` marker. +2. Verify the problem path exists. +3. Load `config.yaml` and runtime settings. +4. Build uv install command if `uv_project` is specified. 
-The execution path diverges only at the backend: +Execution diverges at the backend: -- Docker runner launches a local container. -- SkyPilot runner provisions and executes on cloud resources. +- **Docker** — launches a local container. +- **SkyPilot** — provisions a cloud VM and runs remotely. ## Operations (Cleanup + CI) -- **Cleanup**: research evaluations down clusters by default unless - `keep_cluster=True`; `SingleEvaluator` also cleans up on SIGINT/atexit using an - active-cluster registry. `BatchEvaluator` owns its cluster pool lifecycle. +- **Cleanup**: research evaluations tear down clusters by default unless + `keep_cluster=True`. `SingleEvaluator` uses an active-cluster registry to + clean up on SIGINT/atexit; `BatchEvaluator` manages its own cluster pool + lifecycle. -- **CI**: Validate Problems runs single evals; Weekly Batch Evaluation runs - batch evals (typically SkyPilot on GCP). +- **CI**: problem validation runs single evals; the weekly batch job runs + full evaluations on SkyPilot (typically GCP).
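
Note on the "Hash-based resume" bullet added in PATCH 9/9: the rule it describes (re-evaluate when the solution or problem hash no longer matches the stored result) can be sketched as below. This is a minimal illustration, not the `BatchEvaluator` implementation. `StoredResult`, `content_hash`, and `needs_evaluation` are hypothetical names, and the choice of a truncated SHA-256 over the file text is an assumption; the patch does not say how hashes are computed or stored.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def content_hash(text: str) -> str:
    """Return a short, stable digest of the given source text (assumed scheme)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class StoredResult:
    # Hypothetical record shape; the real result schema is not shown in this patch.
    score: float
    solution_hash: str
    problem_hash: str


def needs_evaluation(
    stored: Optional[StoredResult],
    solution_src: str,
    problem_src: str,
) -> bool:
    """Decide whether a (solution, problem) pair must be evaluated again.

    A pair is evaluated when no prior result exists, or when either the
    solution or the problem content differs from what the result recorded.
    The stored score is deliberately ignored: a score of 0 is still a valid,
    fresh result as long as both hashes match.
    """
    if stored is None:
        return True
    return (
        stored.solution_hash != content_hash(solution_src)
        or stored.problem_hash != content_hash(problem_src)
    )
```

Under these assumptions, a stored result with score 0 and matching hashes is skipped on resume, while changing either input forces a re-run. That matches the "Score semantics" bullet, which separates a legitimate zero score from a failure.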