From 549cfb13a612077a61e6bb3393e4e539c3695740 Mon Sep 17 00:00:00 2001
From: Andy Lee <andylizf@outlook.com>
Date: Fri, 6 Feb 2026 05:23:29 +0000
Subject: [PATCH 1/9] docs: add evaluation architecture overview

---
 README.md                       |  3 ++
 docs/evaluation-architecture.md | 58 +++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)
 create mode 100644 docs/evaluation-architecture.md

diff --git a/README.md b/README.md
index 16476f0c..f2e6c263 100644
--- a/README.md
+++ b/README.md
@@ -168,6 +168,9 @@ print(f"Score (bounded): {result.score}")
 print(f"Score (unbounded): {result.score_unbounded}")
 ```
 
+See `docs/evaluation-architecture.md` for an overview of the evaluation stack
+and runner mapping.
+
 ### Batch Evaluation
 
 For testing your solutions at scale with public test cases.
diff --git a/docs/evaluation-architecture.md b/docs/evaluation-architecture.md
new file mode 100644
index 00000000..dbd635a8
--- /dev/null
+++ b/docs/evaluation-architecture.md
@@ -0,0 +1,58 @@
+# Evaluation Architecture
+
+This document summarizes how evaluation flows through the codebase and names the
+key components introduced in the recent refactors.
+
+## Components
+
+### SingleEvaluator
+`SingleEvaluator` is the unified API for **single-problem evaluation**. It:
+
+- Chooses a runner based on track and backend.
+- Registers cleanup hooks for SkyPilot clusters (SIGINT/atexit).
+
+### BatchEvaluator
+`BatchEvaluator` orchestrates **batch evaluation** with parallel workers and
+SkyPilot cluster pools. It handles:
+
+- Work queues, resumable state, and result aggregation.
+- Cluster pool lifecycle (create/cleanup).
+
+### Runners
+Runners execute the actual evaluation. The mapping is:
+
+- **Research + docker** → `ResearchDockerRunner`
+- **Research + skypilot** → `ResearchSkyPilotRunner`
+- **Algorithmic + docker** → `AlgorithmicLocalRunner`
+- **Algorithmic + skypilot** → `AlgorithmicSkyPilotRunner`
+
+## Runner Flow (Research)
+
+Both research runners share the same input validation and config parsing:
+
+- Validate solution file and `.FAILED` marker.
+- Ensure problem path exists.
+- Load `config.yaml` and runtime settings.
+- Build uv install command if `uv_project` is provided.
+
+The execution path diverges only at the backend:
+
+- Docker runner launches a local container.
+- SkyPilot runner provisions and executes on cloud resources.
+
+## Cleanup Behavior
+
+- `ResearchSkyPilotRunner` always downs the evaluation cluster unless
+  `keep_cluster=True`.
+- Active clusters are tracked in a registry so `SingleEvaluator` can clean up on
+  SIGINT/atexit.
+- `BatchEvaluator` uses its own cluster pool cleanup (independent of the
+  registry).
+
+## CI Mapping
+
+- **Validate Problems** uses `SingleEvaluator` through
+  `scripts/validate_problems.py`.
+- **Weekly Batch Evaluation** uses `BatchEvaluator` via `scripts/run_eval.sh`
+  and typically runs on SkyPilot (GCP by default).
+
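Note on the runner mapping documented in this patch: the selection step can be read as a plain lookup keyed by track and backend. The sketch below is illustrative only. The `Track` and `Backend` enums, the `RUNNER_NAMES` table, and `select_runner_name` are hypothetical names, not the project's actual API; only the four class names come from the document.

```python
from enum import Enum


class Track(str, Enum):
    RESEARCH = "research"
    ALGORITHMIC = "algorithmic"


class Backend(str, Enum):
    DOCKER = "docker"
    SKYPILOT = "skypilot"


# Mapping copied from the "Runners" section. The values are class names as
# strings because this sketch does not import the real runner classes.
RUNNER_NAMES = {
    (Track.RESEARCH, Backend.DOCKER): "ResearchDockerRunner",
    (Track.RESEARCH, Backend.SKYPILOT): "ResearchSkyPilotRunner",
    (Track.ALGORITHMIC, Backend.DOCKER): "AlgorithmicLocalRunner",
    (Track.ALGORITHMIC, Backend.SKYPILOT): "AlgorithmicSkyPilotRunner",
}


def select_runner_name(track: str, backend: str) -> str:
    """Return the runner class name for a track/backend pair.

    Hypothetical helper: unknown tracks or backends raise ValueError from
    the enum constructors instead of falling back silently.
    """
    return RUNNER_NAMES[(Track(track), Backend(backend))]
```

Keeping the table explicit makes the one irregular entry visible: the algorithmic track with the docker backend maps to `AlgorithmicLocalRunner`, not to a class with "Docker" in its name.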

From edb51982fc9cee989b87cfa076011cad34b05118 Mon Sep 17 00:00:00 2001
From: Andy Lee <andylizf@outlook.com>
Date: Fri, 6 Feb 2026 05:37:48 +0000
Subject: [PATCH 2/9] docs: clarify evaluation design decisions

---
 docs/evaluation-architecture.md | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/docs/evaluation-architecture.md b/docs/evaluation-architecture.md
index dbd635a8..e1d81790 100644
--- a/docs/evaluation-architecture.md
+++ b/docs/evaluation-architecture.md
@@ -26,6 +26,20 @@ Runners execute the actual evaluation. The mapping is:
 - **Algorithmic + docker** → `AlgorithmicLocalRunner`
 - **Algorithmic + skypilot** → `AlgorithmicSkyPilotRunner`
 
+## Design Decisions
+
+- **Single vs Batch**: `SingleEvaluator` stays focused on one-off evaluation
+  (simple API + cleanup hooks), while `BatchEvaluator` owns scheduling,
+  resumable state, and cluster pools. This keeps single-run paths lightweight
+  and batch runs scalable.
+- **Shared research helpers**: input validation and config parsing are shared
+  in `ResearchRunner` to avoid drift between Docker and SkyPilot backends.
+- **Cleanup strategy**: research SkyPilot evaluations down clusters by default
+  (cost/safety), with `keep_cluster` as the opt-out. Batch uses its own pool
+  cleanup because cluster lifecycle is managed at the scheduler level.
+- **Naming**: runner class names are explicit about track + backend
+  (e.g., `ResearchDockerRunner`) to remove ambiguity in logs and docs.
+
 ## Runner Flow (Research)
 
 Both research runners share the same input validation and config parsing:
@@ -55,4 +69,3 @@ The execution path diverges only at the backend:
   `scripts/validate_problems.py`.
 - **Weekly Batch Evaluation** uses `BatchEvaluator` via `scripts/run_eval.sh`
   and typically runs on SkyPilot (GCP by default).
-

From 268a19cb3ee88de805745f0fb3deceb1e282d7cd Mon Sep 17 00:00:00 2001
From: Andy Lee <andylizf@outlook.com>
Date: Fri, 6 Feb 2026 05:38:43 +0000
Subject: [PATCH 3/9] docs: simplify cleanup rationale

---
 docs/evaluation-architecture.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/docs/evaluation-architecture.md b/docs/evaluation-architecture.md
index e1d81790..7b7d949b 100644
--- a/docs/evaluation-architecture.md
+++ b/docs/evaluation-architecture.md
@@ -34,9 +34,8 @@ Runners execute the actual evaluation. The mapping is:
   and batch runs scalable.
 - **Shared research helpers**: input validation and config parsing are shared
   in `ResearchRunner` to avoid drift between Docker and SkyPilot backends.
-- **Cleanup strategy**: research SkyPilot evaluations down clusters by default
-  (cost/safety), with `keep_cluster` as the opt-out. Batch uses its own pool
-  cleanup because cluster lifecycle is managed at the scheduler level.
+- **Cleanup strategy**: research evaluations down clusters by default unless
+  `keep_cluster` is set; batch handles its own pool cleanup.
 - **Naming**: runner class names are explicit about track + backend
   (e.g., `ResearchDockerRunner`) to remove ambiguity in logs and docs.
 

From 2973560fe9b3f5c9738912a0e8f413abce181324 Mon Sep 17 00:00:00 2001
From: Andy Lee <andylizf@outlook.com>
Date: Fri, 6 Feb 2026 05:40:29 +0000
Subject: [PATCH 4/9] docs: expand evaluation architecture overview

---
 docs/evaluation-architecture.md | 48 +++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 17 deletions(-)

diff --git a/docs/evaluation-architecture.md b/docs/evaluation-architecture.md
index 7b7d949b..2b4bcf65 100644
--- a/docs/evaluation-architecture.md
+++ b/docs/evaluation-architecture.md
@@ -1,18 +1,39 @@
 # Evaluation Architecture
 
-This document summarizes how evaluation flows through the codebase and names the
-key components introduced in the recent refactors.
+This document summarizes how evaluation flows through the codebase, who it is
+for, and the guiding design choices behind the current structure.
+
+## Audience
+
+- Contributors changing evaluation behavior, runners, or CI workflows.
+- Maintainers debugging evaluation failures or infra cleanup issues.
+
+## Goals
+
+- Clear separation between single-problem and batch evaluation.
+- Shared validation/config parsing across research backends.
+- Predictable cleanup to avoid orphaned cloud resources.
+- Explicit naming to avoid backend ambiguity.
+
+## Architecture at a Glance
+
+- **CLI**:
+  - `frontier eval` → `SingleEvaluator`
+  - `frontier batch` → `BatchEvaluator`
+- **CI**:
+  - Validate Problems → `scripts/validate_problems.py` → `SingleEvaluator`
+  - Weekly Batch Evaluation → `scripts/run_eval.sh` → `BatchEvaluator`
 
 ## Components
 
 ### SingleEvaluator
-`SingleEvaluator` is the unified API for **single-problem evaluation**. It:
+Unified API for **single-problem evaluation**. It:
 
 - Chooses a runner based on track and backend.
 - Registers cleanup hooks for SkyPilot clusters (SIGINT/atexit).
 
 ### BatchEvaluator
-`BatchEvaluator` orchestrates **batch evaluation** with parallel workers and
+Orchestrates **batch evaluation** with parallel workers and
 SkyPilot cluster pools. It handles:
 
 - Work queues, resumable state, and result aggregation.
@@ -53,18 +74,11 @@ The execution path diverges only at the backend:
 - Docker runner launches a local container.
 - SkyPilot runner provisions and executes on cloud resources.
 
-## Cleanup Behavior
-
-- `ResearchSkyPilotRunner` always downs the evaluation cluster unless
-  `keep_cluster=True`.
-- Active clusters are tracked in a registry so `SingleEvaluator` can clean up on
-  SIGINT/atexit.
-- `BatchEvaluator` uses its own cluster pool cleanup (independent of the
-  registry).
+## Operations (Cleanup + CI)
 
-## CI Mapping
+- **Cleanup**: research evaluations down clusters by default unless
+  `keep_cluster=True`; `SingleEvaluator` also cleans up on SIGINT/atexit using an
+  active-cluster registry. `BatchEvaluator` owns its cluster pool lifecycle.
 
-- **Validate Problems** uses `SingleEvaluator` through
-  `scripts/validate_problems.py`.
-- **Weekly Batch Evaluation** uses `BatchEvaluator` via `scripts/run_eval.sh`
-  and typically runs on SkyPilot (GCP by default).
+- **CI**: Validate Problems runs single evals; Weekly Batch Evaluation runs
+  batch evals (typically SkyPilot on GCP).
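Note on the cleanup behavior summarized in this patch: an active-cluster registry with SIGINT/atexit hooks could look roughly like the sketch below. Everything here is an assumption made for illustration. `ClusterRegistry`, `install_hooks`, `run_with_cleanup`, and the `down_fn` callable are hypothetical; only the `keep_cluster` flag and the SIGINT/atexit behavior are taken from the document.

```python
import atexit
import signal
import threading


class ClusterRegistry:
    """Tracks active cluster names so they can be torn down on exit."""

    def __init__(self, down_fn):
        self._down_fn = down_fn  # hypothetical callable that downs one cluster
        self._active = set()
        self._lock = threading.Lock()

    def register(self, name):
        with self._lock:
            self._active.add(name)

    def unregister(self, name):
        with self._lock:
            self._active.discard(name)

    def down(self, name):
        """Tear down one cluster if it is still tracked."""
        with self._lock:
            was_active = name in self._active
            self._active.discard(name)
        if was_active:
            self._down_fn(name)

    def cleanup(self):
        """Tear down every tracked cluster; safe to call more than once."""
        with self._lock:
            names = sorted(self._active)
            self._active.clear()
        for name in names:
            self._down_fn(name)


def install_hooks(registry):
    """Run cleanup at interpreter exit and on Ctrl-C (main thread only)."""
    atexit.register(registry.cleanup)

    def _on_sigint(signum, frame):
        registry.cleanup()
        raise KeyboardInterrupt

    signal.signal(signal.SIGINT, _on_sigint)


def run_with_cleanup(registry, name, work, keep_cluster=False):
    """Run `work()` on a tracked cluster and down it unless keep_cluster."""
    registry.register(name)
    try:
        return work()
    finally:
        if keep_cluster:
            registry.unregister(name)  # leave the cluster up, stop tracking it
        else:
            registry.down(name)
```

Because `cleanup` clears the set before tearing anything down, the atexit hook becomes a no-op when the SIGINT handler has already run.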

From 0f0e49c5678a0173103f5f29eee20341065dcaf1 Mon Sep 17 00:00:00 2001
From: Andy Lee <andylizf@outlook.com>
Date: Fri, 6 Feb 2026 05:41:19 +0000
Subject: [PATCH 5/9] docs: move architecture overview to ARCHITECTURE.md

---
 docs/evaluation-architecture.md => ARCHITECTURE.md | 0
 README.md                                          | 2 +-
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename docs/evaluation-architecture.md => ARCHITECTURE.md (100%)

diff --git a/docs/evaluation-architecture.md b/ARCHITECTURE.md
similarity index 100%
rename from docs/evaluation-architecture.md
rename to ARCHITECTURE.md
diff --git a/README.md b/README.md
index f2e6c263..515b7bd2 100644
--- a/README.md
+++ b/README.md
@@ -168,7 +168,7 @@ print(f"Score (bounded): {result.score}")
 print(f"Score (unbounded): {result.score_unbounded}")
 ```
 
-See `docs/evaluation-architecture.md` for an overview of the evaluation stack
+See `ARCHITECTURE.md` for an overview of the evaluation stack
 and runner mapping.
 
 ### Batch Evaluation

From 26efaca59eb51576c4cd3ecbe30fd5cdfc384df7 Mon Sep 17 00:00:00 2001
From: Andy Lee <andylizf@outlook.com>
Date: Fri, 6 Feb 2026 05:41:42 +0000
Subject: [PATCH 6/9] docs: clarify architecture audience

---
 ARCHITECTURE.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index 2b4bcf65..fc9f6c0b 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -5,8 +5,8 @@ for, and the guiding design choices behind the current structure.
 
 ## Audience
 
-- Contributors changing evaluation behavior, runners, or CI workflows.
-- Maintainers debugging evaluation failures or infra cleanup issues.
+- Researchers and engineers using Frontier-CS to evaluate solutions.
+- Contributors extending problems, runners, or evaluation workflows.
 
 ## Goals
 

From b4afb495dfe1f48581739f6fe2a3a11094aedd7d Mon Sep 17 00:00:00 2001
From: Andy Lee <andylizf@outlook.com>
Date: Fri, 6 Feb 2026 05:42:12 +0000
Subject: [PATCH 7/9] docs: align architecture audience with project roles

---
 ARCHITECTURE.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index fc9f6c0b..a2b430cc 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -5,8 +5,9 @@ for, and the guiding design choices behind the current structure.
 
 ## Audience
 
-- Researchers and engineers using Frontier-CS to evaluate solutions.
-- Contributors extending problems, runners, or evaluation workflows.
+- **Problem contributors** (see `CONTRIBUTING.md`)
+- **Model submitters** (see `SUBMIT.md`)
+- **General researchers** using Frontier-CS to evaluate solutions
 
 ## Goals
 

From 4d204ce0f9354f58076afbb9a62512f86d7e8d43 Mon Sep 17 00:00:00 2001
From: Andy Lee <andylizf@outlook.com>
Date: Fri, 6 Feb 2026 05:43:11 +0000
Subject: [PATCH 8/9] docs: capture evaluation design choices

---
 ARCHITECTURE.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index a2b430cc..103d9081 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -60,6 +60,18 @@ Runners execute the actual evaluation. The mapping is:
   `keep_cluster` is set; batch handles its own pool cleanup.
 - **Naming**: runner class names are explicit about track + backend
   (e.g., `ResearchDockerRunner`) to remove ambiguity in logs and docs.
+- **Score semantics**: a score of **0** can still mean the evaluator ran
+  successfully; failures are reported via status/metadata rather than relying
+  solely on score.
+- **Reference solutions**: problems include `reference.cpp`/`reference.py` so
+  CI can verify end-to-end evaluation without requiring model submissions.
+- **Results separation**: evaluation outputs are pushed to a dedicated results
+  repository to keep the main repo lean and auditable.
+- **Internal vs public**: internal test cases and tooling live in a private
+  repo, with public artifacts kept minimal but compatible.
+- **Weekly vs local**: weekly CI uses `scripts/run_eval.sh` with batch
+  scheduling, while local runs can use the same script or `frontier eval` for
+  quick iteration.
 
 ## Runner Flow (Research)
 
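Note on the score semantics added in this patch: the distinction between "ran and scored 0" and "failed to run" can be illustrated with a result type that carries status separately from score. This is a sketch under stated assumptions. `Status`, `EvalResult`, and `summarize` are hypothetical and may not match the project's real result object, which the README only shows exposing `score` and `score_unbounded`.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    ERROR = "error"


@dataclass
class EvalResult:
    """Hypothetical result record: status is independent of score."""

    status: Status
    score: float = 0.0
    metadata: dict = field(default_factory=dict)


def summarize(result: EvalResult) -> str:
    """Decide success from status, never from the score value."""
    if result.status is not Status.SUCCESS:
        reason = result.metadata.get("error", "unknown error")
        return f"failed: {reason}"
    return f"ok: score={result.score}"
```

A caller that tested `result.score == 0` to detect failure would conflate the two cases; checking status first keeps a legitimate zero score distinct from an infrastructure error.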

From cebb6ae2f0f31255bbec5d57deadaf554777d27d Mon Sep 17 00:00:00 2001
From: Andy Lee <andylizf@outlook.com>
Date: Tue, 10 Feb 2026 17:22:36 +0000
Subject: [PATCH 9/9] docs: polish overview; document runner hierarchy, pools, resume, and gen/

---
 ARCHITECTURE.md | 122 ++++++++++++++++++++++++++++++------------------
 1 file changed, 76 insertions(+), 46 deletions(-)

diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index 103d9081..47cfe6fa 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -1,7 +1,7 @@
 # Evaluation Architecture
 
-This document summarizes how evaluation flows through the codebase, who it is
-for, and the guiding design choices behind the current structure.
+This document describes how evaluation flows through the codebase and the
+design choices behind the current structure.
 
 ## Audience
 
@@ -12,7 +12,7 @@ for, and the guiding design choices behind the current structure.
 ## Goals
 
 - Clear separation between single-problem and batch evaluation.
-- Shared validation/config parsing across research backends.
+- Shared validation and config parsing across research backends.
 - Predictable cleanup to avoid orphaned cloud resources.
 - Explicit naming to avoid backend ambiguity.
 
@@ -21,6 +21,7 @@ for, and the guiding design choices behind the current structure.
 - **CLI**:
   - `frontier eval` → `SingleEvaluator`
   - `frontier batch` → `BatchEvaluator`
+  - `frontier list` / `frontier show` → problem discovery
 - **CI**:
   - Validate Problems → `scripts/validate_problems.py` → `SingleEvaluator`
   - Weekly Batch Evaluation → `scripts/run_eval.sh` → `BatchEvaluator`
@@ -28,70 +29,99 @@ for, and the guiding design choices behind the current structure.
 ## Components
 
 ### SingleEvaluator
-Unified API for **single-problem evaluation**. It:
+Unified API for **single-problem evaluation**:
 
-- Chooses a runner based on track and backend.
-- Registers cleanup hooks for SkyPilot clusters (SIGINT/atexit).
+- Selects a runner based on track and backend.
+- Registers SIGINT/atexit hooks to tear down SkyPilot clusters on exit.
 
 ### BatchEvaluator
-Orchestrates **batch evaluation** with parallel workers and
-SkyPilot cluster pools. It handles:
+Orchestrates **batch evaluation** with parallel workers and SkyPilot
+cluster pools:
 
-- Work queues, resumable state, and result aggregation.
-- Cluster pool lifecycle (create/cleanup).
+- Work queues with resumable state and result aggregation.
+- Resource-grouped cluster pools — pairs are grouped by
+  `ResourceSignature` (cloud, accelerators, instance type) so that
+  CPU-only and GPU problems run on separate pools.
+- Hash-based resume — each result stores solution/problem hashes so stale
+  results are automatically re-evaluated when source changes.
 
 ### Runners
-Runners execute the actual evaluation. The mapping is:
-
-- **Research + docker** → `ResearchDockerRunner`
-- **Research + skypilot** → `ResearchSkyPilotRunner`
-- **Algorithmic + docker** → `AlgorithmicLocalRunner`
-- **Algorithmic + skypilot** → `AlgorithmicSkyPilotRunner`
+Runners execute evaluation. The class hierarchy:
+
+```
+Runner (ABC)
+├── ResearchRunner              # shared: problem validation, config loading, uv install
+│   ├── ResearchDockerRunner    # local container
+│   └── ResearchSkyPilotRunner  # cloud via SkyPilot
+├── AlgorithmicLocalRunner      # go-judge via HTTP
+└── AlgorithmicSkyPilotRunner   # go-judge on SkyPilot
+```
+
+### Solution Generation (`gen/`)
+The `gen/` module generates solutions by calling LLMs. It is independent of
+the evaluation pipeline — `frontier eval` and `frontier batch` do not
+depend on it. It provides an LLM interface, an API key pool for concurrent
+requests, and solution file formatting.
 
 ## Design Decisions
 
-- **Single vs Batch**: `SingleEvaluator` stays focused on one-off evaluation
+- **Single vs Batch**: `SingleEvaluator` stays focused on one-off runs
   (simple API + cleanup hooks), while `BatchEvaluator` owns scheduling,
-  resumable state, and cluster pools. This keeps single-run paths lightweight
-  and batch runs scalable.
-- **Shared research helpers**: input validation and config parsing are shared
-  in `ResearchRunner` to avoid drift between Docker and SkyPilot backends.
-- **Cleanup strategy**: research evaluations down clusters by default unless
-  `keep_cluster` is set; batch handles its own pool cleanup.
-- **Naming**: runner class names are explicit about track + backend
+  resumable state, and cluster pools. This keeps single-run paths
+  lightweight and batch runs scalable.
+- **Shared research helpers**: input validation and config parsing live in
+  the `ResearchRunner` base class so Docker and SkyPilot backends stay in
+  sync.
+- **Cleanup strategy**: research evaluations tear down clusters by default
+  unless `keep_cluster` is set. `SingleEvaluator` cleans up via an
+  active-cluster registry; `BatchEvaluator` manages its own pool lifecycle.
+- **Naming**: runner class names encode track + backend
   (e.g., `ResearchDockerRunner`) to remove ambiguity in logs and docs.
-- **Score semantics**: a score of **0** can still mean the evaluator ran
-  successfully; failures are reported via status/metadata rather than relying
-  solely on score.
-- **Reference solutions**: problems include `reference.cpp`/`reference.py` so
-  CI can verify end-to-end evaluation without requiring model submissions.
-- **Results separation**: evaluation outputs are pushed to a dedicated results
+- **Score semantics**: a score of **0** does not by itself indicate
+  failure, since the evaluator may have run successfully; failures are
+  reported via status/metadata rather than score alone.
+- **Reference solutions**: problems ship with `reference.cpp`/`reference.py`
+  so CI can verify end-to-end evaluation without model submissions.
+- **Results separation**: evaluation outputs go to a dedicated results
   repository to keep the main repo lean and auditable.
 - **Internal vs public**: internal test cases and tooling live in a private
-  repo, with public artifacts kept minimal but compatible.
+  repo; public artifacts are kept minimal but compatible.
 - **Weekly vs local**: weekly CI uses `scripts/run_eval.sh` with batch
-  scheduling, while local runs can use the same script or `frontier eval` for
-  quick iteration.
+  scheduling; local runs use the same script or `frontier eval` for quick
+  iteration.
+- **Resource-grouped cluster pools**: `BatchEvaluator` groups pairs by
+  `ResourceSignature` (cloud × accelerators × instance type) and creates a
+  separate pool per group, avoiding the waste of running CPU-only problems
+  on GPU clusters.
+- **Hash-based resume**: resuming a batch compares solution/problem hashes
+  against stored results. Changed inputs are re-evaluated even when a prior
+  result exists, preventing silently stale scores.
+- **Generation vs evaluation**: solution generation (`gen/`) is fully
+  decoupled from evaluation. Generated files are plain source files with no
+  special metadata; the evaluator has no dependency on the generation
+  pipeline.
 
 ## Runner Flow (Research)
 
-Both research runners share the same input validation and config parsing:
+Both research runners share the same pre-evaluation steps (via
+`ResearchRunner`):
 
-- Validate solution file and `.FAILED` marker.
-- Ensure problem path exists.
-- Load `config.yaml` and runtime settings.
-- Build uv install command if `uv_project` is provided.
+1. Validate the solution file and check for a `.FAILED` marker.
+2. Verify the problem path exists.
+3. Load `config.yaml` and runtime settings.
+4. Build the `uv` install command if `uv_project` is specified.
 
-The execution path diverges only at the backend:
+Execution diverges at the backend:
 
-- Docker runner launches a local container.
-- SkyPilot runner provisions and executes on cloud resources.
+- **Docker** — launches a local container.
+- **SkyPilot** — provisions a cloud VM and runs remotely.
 
 ## Operations (Cleanup + CI)
 
-- **Cleanup**: research evaluations down clusters by default unless
-  `keep_cluster=True`; `SingleEvaluator` also cleans up on SIGINT/atexit using an
-  active-cluster registry. `BatchEvaluator` owns its cluster pool lifecycle.
+- **Cleanup**: research evaluations tear down clusters by default unless
+  `keep_cluster=True`. `SingleEvaluator` uses an active-cluster registry to
+  clean up on SIGINT/atexit; `BatchEvaluator` manages its own cluster pool
+  lifecycle.
 
-- **CI**: Validate Problems runs single evals; Weekly Batch Evaluation runs
-  batch evals (typically SkyPilot on GCP).
+- **CI**: problem validation runs single evals; the weekly batch job runs
+  full evaluations on SkyPilot (typically GCP).
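Note on the hash-based resume described in this patch: the decision to re-evaluate can be sketched as a comparison of stored hashes against hashes of the current inputs. The code is a minimal illustration, not the project's implementation. The function names, the `solution_hash`/`problem_hash` keys, and the choice of SHA-256 are all assumptions.

```python
import hashlib
from typing import Optional


def content_hash(data: bytes) -> str:
    """Hash raw file contents (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def make_record(score: float, solution: bytes, problem: bytes) -> dict:
    """Build a stored result that remembers which inputs produced it."""
    return {
        "score": score,
        "solution_hash": content_hash(solution),
        "problem_hash": content_hash(problem),
    }


def needs_evaluation(stored: Optional[dict], solution: bytes, problem: bytes) -> bool:
    """Return True when there is no prior result or either input changed."""
    if stored is None:
        return True
    return (
        stored.get("solution_hash") != content_hash(solution)
        or stored.get("problem_hash") != content_hash(problem)
    )
```

On resume, a pair is skipped only when both hashes match, so editing either the solution or the problem invalidates the old score even though a result entry already exists.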