diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
new file mode 100644
index 00000000..47cfe6fa
--- /dev/null
+++ b/ARCHITECTURE.md
@@ -0,0 +1,127 @@
+# Evaluation Architecture
+
+This document describes how evaluation flows through the codebase and the
+design choices behind the current structure.
+
+## Audience
+
+- **Problem contributors** (see `CONTRIBUTING.md`)
+- **Model submitters** (see `SUBMIT.md`)
+- **General researchers** using Frontier-CS to evaluate solutions
+
+## Goals
+
+- Clear separation between single-problem and batch evaluation.
+- Shared validation and config parsing across research backends.
+- Predictable cleanup to avoid orphaned cloud resources.
+- Explicit naming to avoid backend ambiguity.
+
+## Architecture at a Glance
+
+- **CLI**:
+  - `frontier eval` → `SingleEvaluator`
+  - `frontier batch` → `BatchEvaluator`
+  - `frontier list` / `frontier show` → problem discovery
+- **CI**:
+  - Validate Problems → `scripts/validate_problems.py` → `SingleEvaluator`
+  - Weekly Batch Evaluation → `scripts/run_eval.sh` → `BatchEvaluator`
+
+## Components
+
+### SingleEvaluator
+Unified API for **single-problem evaluation**:
+
+- Selects a runner based on track and backend.
+- Registers SIGINT/atexit hooks to tear down SkyPilot clusters on exit.
+
+### BatchEvaluator
+Orchestrates **batch evaluation** with parallel workers and SkyPilot
+cluster pools:
+
+- Work queues with resumable state and result aggregation.
+- Resource-grouped cluster pools: problem/solution pairs are grouped by
+  `ResourceSignature` (cloud, accelerators, instance type) so that
+  CPU-only and GPU problems run on separate pools.
+- Hash-based resume: each result stores solution/problem hashes, so stale
+  results are automatically re-evaluated when either source changes.
+
+### Runners
+Runners execute evaluation. The class hierarchy:
+
+```
+Runner (ABC)
+├── ResearchRunner                # shared: problem validation, config loading, uv install
+│   ├── ResearchDockerRunner      # local container
+│   └── ResearchSkyPilotRunner    # cloud via SkyPilot
+├── AlgorithmicLocalRunner        # go-judge via HTTP
+└── AlgorithmicSkyPilotRunner     # go-judge on SkyPilot
+```
+
+### Solution Generation (`gen/`)
+The `gen/` module generates solutions by calling LLMs. It is independent of
+the evaluation pipeline: `frontier eval` and `frontier batch` do not
+depend on it. It provides an LLM interface, an API key pool for concurrent
+requests, and solution file formatting.
+
+## Design Decisions
+
+- **Single vs Batch**: `SingleEvaluator` stays focused on one-off runs
+  (simple API + cleanup hooks), while `BatchEvaluator` owns scheduling,
+  resumable state, and cluster pools. This keeps single-run paths
+  lightweight and batch runs scalable.
+- **Shared research helpers**: input validation and config parsing live in
+  the `ResearchRunner` base class so the Docker and SkyPilot backends stay
+  in sync.
+- **Cleanup strategy**: research evaluations tear down clusters by default
+  unless `keep_cluster` is set. `SingleEvaluator` cleans up via an
+  active-cluster registry; `BatchEvaluator` manages its own pool lifecycle.
+- **Naming**: runner class names encode track + backend
+  (e.g., `ResearchDockerRunner`) to remove ambiguity in logs and docs.
+- **Score semantics**: a score of **0** does not by itself indicate
+  failure; the evaluator may have run successfully and scored the solution
+  at zero. Failures are reported via status/metadata rather than by score
+  alone.
+- **Reference solutions**: problems ship with `reference.cpp`/`reference.py`
+  so CI can verify end-to-end evaluation without model submissions.
+- **Results separation**: evaluation outputs go to a dedicated results
+  repository to keep the main repo lean and auditable.
+- **Internal vs public**: internal test cases and tooling live in a private
+  repo; public artifacts are kept minimal but compatible.
+- **Weekly vs local**: weekly CI uses `scripts/run_eval.sh` with batch
+  scheduling; local runs use the same script or `frontier eval` for quick
+  iteration.
+- **Resource-grouped cluster pools**: `BatchEvaluator` groups
+  problem/solution pairs by `ResourceSignature` (cloud × accelerators ×
+  instance type) and creates a separate pool per group, avoiding the waste
+  of running CPU-only problems on GPU clusters (see the first sketch after
+  this list).
+- **Hash-based resume**: resuming a batch compares solution/problem hashes
+  against stored results. Changed inputs are re-evaluated even when a prior
+  result exists, preventing silently stale scores (see the second sketch
+  after this list).
+- **Generation vs evaluation**: solution generation (`gen/`) is fully
+  decoupled from evaluation. Generated files are plain source files with no
+  special metadata; the evaluator has no dependency on the generation
+  pipeline.
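+
+As a rough illustration of the resource-grouped pooling above, the sketch
+below keys pairs by a frozen dataclass. `ResourceSignature` and its
+cloud/accelerators/instance-type fields come from this document; the exact
+field types, `group_by_signature`, and the `problem.resource_signature`
+attribute are hypothetical.
+
+```python
+from collections import defaultdict
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)  # frozen (with eq) makes the signature usable as a dict key
+class ResourceSignature:
+    cloud: str                 # e.g. "gcp" (hypothetical value)
+    accelerators: str | None   # e.g. "A100:1"; None for CPU-only problems
+    instance_type: str | None
+
+
+def group_by_signature(pairs):
+    """Group (problem, solution) pairs so each group shares one cluster pool."""
+    pools = defaultdict(list)
+    for problem, solution in pairs:
+        # problem.resource_signature is an assumed attribute on the problem object
+        pools[problem.resource_signature].append((problem, solution))
+    return pools  # one SkyPilot cluster pool per distinct signature
+```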
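+
+The resume check can likewise be sketched in a few lines. Only the idea
+that each stored result carries solution/problem hashes comes from this
+document; `file_hash`, `tree_hash`, `needs_rerun`, and the shape of the
+stored result are hypothetical.
+
+```python
+import hashlib
+from pathlib import Path
+
+
+def file_hash(path: Path) -> str:
+    """Content hash of a single solution file."""
+    return hashlib.sha256(path.read_bytes()).hexdigest()
+
+
+def tree_hash(root: Path) -> str:
+    """Stable hash over a problem directory: relative paths plus contents."""
+    digest = hashlib.sha256()
+    for p in sorted(root.rglob("*")):
+        if p.is_file():
+            digest.update(str(p.relative_to(root)).encode())
+            digest.update(p.read_bytes())
+    return digest.hexdigest()
+
+
+def needs_rerun(stored: dict | None, solution: Path, problem_dir: Path) -> bool:
+    """Re-evaluate when no prior result exists or when either input changed."""
+    return (stored is None
+            or stored["solution_hash"] != file_hash(solution)
+            or stored["problem_hash"] != tree_hash(problem_dir))
+```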
+
+## Runner Flow (Research)
+
+Both research runners share the same pre-evaluation steps (via
+`ResearchRunner`):
+
+1. Validate the solution file and check for a `.FAILED` marker.
+2. Verify that the problem path exists.
+3. Load `config.yaml` and runtime settings.
+4. Build the uv install command if `uv_project` is specified.
+
+Execution diverges at the backend:
+
+- **Docker**: launches a local container.
+- **SkyPilot**: provisions a cloud VM and runs remotely.
+
+## Operations (Cleanup + CI)
+
+- **Cleanup**: research evaluations tear down clusters by default unless
+  `keep_cluster=True`. `SingleEvaluator` uses an active-cluster registry to
+  clean up on SIGINT/atexit; `BatchEvaluator` manages its own cluster pool
+  lifecycle.
+
+- **CI**: problem validation runs single evals; the weekly batch job runs
+  full evaluations on SkyPilot (typically GCP).
diff --git a/README.md b/README.md
index 16476f0c..515b7bd2 100644
--- a/README.md
+++ b/README.md
@@ -168,6 +168,9 @@
 print(f"Score (bounded): {result.score}")
 print(f"Score (unbounded): {result.score_unbounded}")
 ```
+See `ARCHITECTURE.md` for an overview of the evaluation stack
+and runner mapping.
+
 ### Batch Evaluation
 
 For testing your solutions at scale with public test cases.