127 changes: 127 additions & 0 deletions ARCHITECTURE.md
@@ -0,0 +1,127 @@
# Evaluation Architecture

This document describes how evaluation flows through the codebase and the
design choices behind the current structure.

## Audience

- **Problem contributors** (see `CONTRIBUTING.md`)
- **Model submitters** (see `SUBMIT.md`)
- **General researchers** using Frontier-CS to evaluate solutions

## Goals

- Clear separation between single-problem and batch evaluation.
- Shared validation and config parsing across research backends.
- Predictable cleanup to avoid orphaned cloud resources.
- Explicit naming to avoid backend ambiguity.

## Architecture at a Glance

- **CLI**:
- `frontier eval` → `SingleEvaluator`
- `frontier batch` → `BatchEvaluator`
- `frontier list` / `frontier show` → problem discovery
- **CI**:
- Validate Problems → `scripts/validate_problems.py` → `SingleEvaluator`
- Weekly Batch Evaluation → `scripts/run_eval.sh` → `BatchEvaluator`
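
The mapping above can be pictured as a thin dispatch layer in the CLI entry point. The sketch below is illustrative only: the command names come from this list, but the `argparse` layout and the `evaluate()`/`run()` method names are assumptions, and the evaluator classes are stubbed in place of the real ones.

```python
# Illustrative-only sketch of the CLI -> evaluator mapping described above.
# The class names match this document; everything else is hypothetical.
import argparse


class SingleEvaluator:  # stand-in for the real class
    def evaluate(self, problem: str) -> None:
        print(f"single eval of {problem}")


class BatchEvaluator:  # stand-in for the real class
    def run(self) -> None:
        print("batch eval over all queued work")


def main() -> None:
    parser = argparse.ArgumentParser(prog="frontier")
    sub = parser.add_subparsers(dest="command", required=True)
    eval_parser = sub.add_parser("eval")
    eval_parser.add_argument("problem")
    sub.add_parser("batch")
    args = parser.parse_args()

    if args.command == "eval":
        SingleEvaluator().evaluate(args.problem)  # one-off run + cleanup hooks
    else:
        BatchEvaluator().run()                    # pools, resume, aggregation


if __name__ == "__main__":
    main()
```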

## Components

### SingleEvaluator
Unified API for **single-problem evaluation**:

- Selects a runner based on track and backend.
- Registers SIGINT/atexit hooks to tear down SkyPilot clusters on exit.
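
A minimal sketch of the cleanup idea, assuming an active-cluster registry that is drained on normal exit and on Ctrl-C. The function names and the registry shape are hypothetical; only the SIGINT/atexit pattern comes from this document.

```python
# Hypothetical sketch of cleanup hooks backed by an active-cluster registry.
import atexit
import signal
import sys

_active_clusters: set[str] = set()


def register_cluster(name: str) -> None:
    """Record a launched cluster so the exit hooks can find it later."""
    _active_clusters.add(name)


def _teardown_all() -> None:
    for name in list(_active_clusters):
        print(f"tearing down cluster {name}")  # e.g. shell out to `sky down <name>`
        _active_clusters.discard(name)


def _on_sigint(signum, frame):
    _teardown_all()
    sys.exit(130)  # conventional exit code for SIGINT


atexit.register(_teardown_all)
signal.signal(signal.SIGINT, _on_sigint)
```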

### BatchEvaluator
Orchestrates **batch evaluation** with parallel workers and SkyPilot
cluster pools:

- Work queues with resumable state and result aggregation.
- Resource-grouped cluster pools — solution/problem pairs are grouped by
  `ResourceSignature` (cloud, accelerators, instance type) so that
  CPU-only and GPU problems run on separate pools.
- Hash-based resume — each result stores solution/problem hashes so stale
results are automatically re-evaluated when source changes.
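
A sketch of those two batch mechanisms follows. The dataclass fields mirror the (cloud, accelerators, instance type) signature described above; the helper names, the stored-result shape, and treating the problem as a single hashable file are assumptions made for brevity.

```python
# Hypothetical sketch of resource grouping and hash-based resume.
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ResourceSignature:
    cloud: str
    accelerators: str | None = None   # None for CPU-only problems
    instance_type: str | None = None


def group_by_resources(pairs):
    """Group (solution, problem, signature) work items into per-signature pools."""
    pools = defaultdict(list)
    for solution, problem, signature in pairs:
        pools[signature].append((solution, problem))
    return pools


def file_hash(path: Path) -> str:
    # For brevity this hashes a single file; a directory would need a tree hash.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_rerun(solution: Path, problem: Path, stored: dict | None) -> bool:
    """Re-evaluate when there is no prior result or either input has changed."""
    if stored is None:
        return True
    return (stored.get("solution_hash") != file_hash(solution)
            or stored.get("problem_hash") != file_hash(problem))
```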

### Runners
Runners execute evaluation. The class hierarchy:

```
Runner (ABC)
├── ResearchRunner # shared: problem validation, config loading, uv install
│ ├── ResearchDockerRunner # local container
│ └── ResearchSkyPilotRunner # cloud via SkyPilot
├── AlgorithmicLocalRunner # go-judge via HTTP
└── AlgorithmicSkyPilotRunner # go-judge on SkyPilot
```
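
A skeletal rendering of the research side of that hierarchy, assuming a single `run()` entry point. Only the class names come from this document; the method names, signatures, and return shape are illustrative.

```python
# Skeleton of the hierarchy shown above; method names are assumptions.
from abc import ABC, abstractmethod


class Runner(ABC):
    @abstractmethod
    def run(self, solution_path: str, problem_path: str) -> dict:
        """Evaluate one solution against one problem and return a result record."""


class ResearchRunner(Runner):
    def _prepare(self, solution_path: str, problem_path: str) -> None:
        # shared steps: validate the solution, verify the problem path,
        # load config.yaml, and build the uv install command
        pass


class ResearchDockerRunner(ResearchRunner):
    def run(self, solution_path: str, problem_path: str) -> dict:
        self._prepare(solution_path, problem_path)
        return {"backend": "docker"}    # would launch a local container here


class ResearchSkyPilotRunner(ResearchRunner):
    def run(self, solution_path: str, problem_path: str) -> dict:
        self._prepare(solution_path, problem_path)
        return {"backend": "skypilot"}  # would provision a cloud VM here
```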

### Solution Generation (`gen/`)
The `gen/` module generates solutions by calling LLMs. It is independent of
the evaluation pipeline — `frontier eval` and `frontier batch` do not
depend on it. It provides an LLM interface, an API key pool for concurrent
requests, and solution file formatting.
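
As one way to picture the key pool, here is a minimal round-robin sketch. The class and its interface are assumptions for illustration, not the actual `gen/` API.

```python
# Minimal, thread-safe round-robin key pool sketch; not the real gen/ API.
import itertools
import threading


class ApiKeyPool:
    """Hand out API keys in round-robin order so concurrent requests spread load."""

    def __init__(self, keys: list[str]) -> None:
        self._cycle = itertools.cycle(keys)
        self._lock = threading.Lock()

    def next_key(self) -> str:
        with self._lock:
            return next(self._cycle)


pool = ApiKeyPool(["key-a", "key-b", "key-c"])  # placeholder keys
print(pool.next_key())
```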

## Design Decisions

- **Single vs Batch**: `SingleEvaluator` stays focused on one-off runs
(simple API + cleanup hooks), while `BatchEvaluator` owns scheduling,
resumable state, and cluster pools. This keeps single-run paths
lightweight and batch runs scalable.
- **Shared research helpers**: input validation and config parsing live in
the `ResearchRunner` base class so Docker and SkyPilot backends stay in
sync.
- **Cleanup strategy**: research evaluations tear down clusters by default
unless `keep_cluster` is set. `SingleEvaluator` cleans up via an
active-cluster registry; `BatchEvaluator` manages its own pool lifecycle.
- **Naming**: runner class names encode track + backend
(e.g., `ResearchDockerRunner`) to remove ambiguity in logs and docs.
- **Score semantics**: a score of **0** does not by itself indicate failure;
  the evaluator may have run successfully and scored the solution at zero.
  Failures are reported via status/metadata rather than the score alone (see
  the sketch after this list).
- **Reference solutions**: problems ship with `reference.cpp`/`reference.py`
so CI can verify end-to-end evaluation without model submissions.
- **Results separation**: evaluation outputs go to a dedicated results
repository to keep the main repo lean and auditable.
- **Internal vs public**: internal test cases and tooling live in a private
repo; public artifacts are kept minimal but compatible.
- **Weekly vs local**: weekly CI uses `scripts/run_eval.sh` with batch
scheduling; local runs use the same script or `frontier eval` for quick
iteration.
- **Resource-grouped cluster pools**: `BatchEvaluator` groups solution/problem
  pairs by `ResourceSignature` (cloud × accelerators × instance type) and
  creates a separate pool per group, avoiding the waste of running CPU-only
  problems on GPU clusters.
- **Hash-based resume**: resuming a batch compares solution/problem hashes
against stored results. Changed inputs are re-evaluated even when a prior
result exists, preventing silently stale scores.
- **Generation vs evaluation**: solution generation (`gen/`) is fully
decoupled from evaluation. Generated files are plain source files with no
special metadata; the evaluator has no dependency on the generation
pipeline.
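
To make the score-semantics point concrete, here is a hedged illustration of how a caller might distinguish a legitimate zero from a failed run. The field names (`status`, `score`, `error`) are assumed for the example and are not the exact result schema.

```python
# Illustration only: a zero score with a successful status is a valid result,
# while failure is signaled through status/metadata. Field names are assumed.
def is_failure(result: dict) -> bool:
    return result.get("status") != "success"


ran_but_scored_zero = {"status": "success", "score": 0.0}
crashed = {"status": "error", "score": 0.0, "error": "container exited 137"}

assert not is_failure(ran_but_scored_zero)  # a real score of zero, not a failure
assert is_failure(crashed)                  # same score, but reported as a failure
```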

## Runner Flow (Research)

Both research runners share the same pre-evaluation steps (via
`ResearchRunner`):

1. Validate the solution file and check for a `.FAILED` marker.
2. Verify the problem path exists.
3. Load `config.yaml` and runtime settings.
4. Build uv install command if `uv_project` is specified.

Execution diverges at the backend:

- **Docker** — launches a local container.
- **SkyPilot** — provisions a cloud VM and runs remotely.
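
A condensed sketch of those shared steps and the point of divergence. The helper names, the sibling `.FAILED` marker convention, and the placeholder `uv sync` install command are assumptions; only the four-step outline comes from this document.

```python
# Hypothetical sketch of the shared pre-evaluation steps (1-4) and the backend split.
from pathlib import Path

import yaml  # assumes PyYAML is available


def prepare(solution: Path, problem: Path) -> dict:
    # 1. Validate the solution file; treat a sibling .FAILED marker as fatal.
    if not solution.is_file() or solution.with_suffix(".FAILED").exists():
        raise ValueError(f"invalid or failed solution: {solution}")
    # 2. Verify the problem path exists.
    if not problem.is_dir():
        raise FileNotFoundError(f"problem path missing: {problem}")
    # 3. Load config.yaml and runtime settings.
    config = yaml.safe_load((problem / "config.yaml").read_text())
    # 4. Build a placeholder install command when uv_project is specified.
    install_cmd = f"cd {config['uv_project']} && uv sync" if config.get("uv_project") else ""
    return {"config": config, "install_cmd": install_cmd}


def evaluate(solution: Path, problem: Path, backend: str) -> None:
    context = prepare(solution, problem)
    if backend == "docker":
        print("would launch a local container using", context["install_cmd"])
    else:
        print("would provision a SkyPilot cluster using", context["install_cmd"])
```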

## Operations (Cleanup + CI)

- **Cleanup**: research evaluations tear down clusters by default unless
`keep_cluster=True`. `SingleEvaluator` uses an active-cluster registry to
clean up on SIGINT/atexit; `BatchEvaluator` manages its own cluster pool
lifecycle.

- **CI**: problem validation runs single evals; the weekly batch job runs
full evaluations on SkyPilot (typically GCP).
3 changes: 3 additions & 0 deletions README.md
@@ -168,6 +168,9 @@ print(f"Score (bounded): {result.score}")
print(f"Score (unbounded): {result.score_unbounded}")
```

See `ARCHITECTURE.md` for an overview of the evaluation stack
and runner mapping.

### Batch Evaluation

For testing your solutions at scale with public test cases.