Merged
50 commits
de6e992 docs: standardize on frontier batch, simplify redundant sections (andylizf, Feb 3, 2026)
b3544a8 docs: remove vague local testing line in algorithmic README (andylizf, Feb 4, 2026)
ad52289 docs: remove redundant pip install skypilot-nightly (andylizf, Feb 4, 2026)
cf935ce docs: explain why research defaults to SkyPilot (andylizf, Feb 4, 2026)
24dc087 docs: add SUBMIT.md reference to algorithmic batch eval (andylizf, Feb 4, 2026)
072670c docs: fix typos and inconsistencies (andylizf, Feb 4, 2026)
7cb6dc6 docs: add info boxes to clarify document purpose (andylizf, Feb 4, 2026)
6ac274e docs: move C++17 note to CLI section (andylizf, Feb 4, 2026)
1c4e5cc docs: add Solution Requirements section in algorithmic README (andylizf, Feb 4, 2026)
ec77f01 docs: fix inconsistencies and add Solution Requirements to research (andylizf, Feb 4, 2026)
8dbd715 docs: restore problem detection note in research README (andylizf, Feb 4, 2026)
bbd91b6 docs: fix Python API backend documentation in research README (andylizf, Feb 4, 2026)
efa161b feat: auto-detect backend by track in Python API (research->skypilot,… (andylizf, Feb 4, 2026)
8b41564 docs: update docstrings to reflect track-based backend defaults (andylizf, Feb 4, 2026)
23fd532 fix: replace deprecated --skypilot references with --backend (andylizf, Feb 4, 2026)
d246c4b fix: use correct -j flag in CI and remove redundant --skypilot (andylizf, Feb 4, 2026)
ec608c5 feat: add multi-language support for research track problems (andylizf, Feb 4, 2026)
a62efc8 docs: document language field for research track problems (andylizf, Feb 4, 2026)
8f84fa6 docs: update PR template with SUBMIT.md link and multi-language support (andylizf, Feb 4, 2026)
ea6e605 feat: support multi-language reference solutions in CI validation (andylizf, Feb 4, 2026)
48ced5c chore: add reference.cpp for nbody_simulation problems (andylizf, Feb 4, 2026)
fbe06d5 refactor: rename frontier-eval to frontier in scripts (andylizf, Feb 4, 2026)
b0615b0 feat: add per-provider concurrency control based on RPM limits (andylizf, Feb 4, 2026)
fb30427 fix: update validate_problems.py for positional track argument (andylizf, Feb 4, 2026)
2fb0a77 fix: support multi-language solutions in docker/skypilot runners (andylizf, Feb 5, 2026)
2d84681 fix: optimize nbody reference and fix multi-language runner support (andylizf, Feb 5, 2026)
9189cbb docs: clarify reference solution must score > 0 (beat baseline) (andylizf, Feb 5, 2026)
253fcb7 docs: explain why reference must score > 0 (andylizf, Feb 5, 2026)
74ec7b4 fix: use docker backend for CI validation (andylizf, Feb 5, 2026)
7f6404e Revert "fix: use docker backend for CI validation" (andylizf, Feb 5, 2026)
452e02a fix: activate GCP service account for SkyPilot in CI (andylizf, Feb 5, 2026)
7587bd6 docs: add project rules for Claude (andylizf, Feb 5, 2026)
fd70693 fix: show full error output in validation (andylizf, Feb 5, 2026)
830d49a fix: improve CI debugging and SSH key setup for SkyPilot (andylizf, Feb 5, 2026)
9262040 fix: remove hardcoded AWS cloud from nbody config for CI compatibility (andylizf, Feb 5, 2026)
0986915 Revert "fix: remove hardcoded AWS cloud from nbody config for CI comp… (andylizf, Feb 5, 2026)
b151619 fix: configure AWS credentials for CI SkyPilot (andylizf, Feb 5, 2026)
711c1cd fix: configure both AWS and GCP credentials for CI (andylizf, Feb 5, 2026)
4eace96 fix: add skypilot AWS and GCP extras (andylizf, Feb 5, 2026)
3ce4f56 fix: parse score with regex when JSON is masked by CI (andylizf, Feb 5, 2026)
d97d690 fix: use evaluator API in validate_problems and keep JSON output clean (andylizf, Feb 5, 2026)
811d9cd fix: always down skypilot clusters unless kept (andylizf, Feb 5, 2026)
f7662b6 fix: add eval/validate cleanup hooks for skypilot (andylizf, Feb 5, 2026)
98baf2f refactor: centralize skypilot cleanup registry (andylizf, Feb 5, 2026)
6d1d753 refactor: register cleanup hooks in evaluator (andylizf, Feb 5, 2026)
2461490 refactor: share research runner validation helpers (andylizf, Feb 5, 2026)
9b74119 refactor: centralize research runtime config loading (andylizf, Feb 5, 2026)
fe4b15b refactor: share uv install script for research runners (andylizf, Feb 5, 2026)
a57ab81 refactor: share timeout prefix helper for research runners (andylizf, Feb 5, 2026)
32f4a43 refactor: rename evaluator and runner classes (andylizf, Feb 5, 2026)
10 changes: 10 additions & 0 deletions .claude/CLAUDE.md
@@ -0,0 +1,10 @@
# Project Rules for Frontier-CS

## Backend Selection

**NEVER change the backend due to missing credentials or CI configuration issues.**

- Research track: always uses SkyPilot (cloud VMs)
- Algorithmic track: always uses Docker (local)

If CI fails due to credentials/permissions, fix the credentials - do NOT change the code to use a different backend. The backend choice is intentional for each track's evaluation requirements.
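Commit efa161b in this PR auto-detects the backend from the track, matching the rule above. A minimal sketch of that mapping follows — the function and dictionary names here are illustrative assumptions, not the actual `frontier_cs` API:

```python
# Sketch only: illustrates the track -> backend rule stated above.
# DEFAULT_BACKENDS and resolve_backend are hypothetical names.
DEFAULT_BACKENDS = {
    "research": "skypilot",   # research evals run on cloud VMs
    "algorithmic": "docker",  # algorithmic evals run locally
}

def resolve_backend(track, backend=None):
    """Return an explicitly requested backend, else the track's default."""
    if backend is not None:
        return backend  # explicit --backend override wins
    if track not in DEFAULT_BACKENDS:
        raise ValueError(f"unknown track: {track}")
    return DEFAULT_BACKENDS[track]

print(resolve_backend("research"))            # skypilot
print(resolve_backend("research", "docker"))  # docker (explicit override)
```

An explicit `--backend` flag still overrides the default, which is why the rule above targets CI fixes rather than code changes.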
3 changes: 2 additions & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
@@ -1,6 +1,7 @@
## Summary
<!-- Brief description of changes -->

> Please read [CONTRIBUTING.md](../CONTRIBUTING.md) before submitting.

## Type of Change
- [ ] New research problem
@@ -21,4 +22,4 @@
## CI Validation (for new problems)
> When adding new problems, CI will automatically validate that your reference solution achieves score > 0.
> - Algorithmic problems: Include `reference.cpp` in your problem directory
> - Research problems: Include `reference.py` in your problem directory
> - Research problems: Include `reference.py` (or `reference.cpp` if `language: cpp` in config.yaml)
6 changes: 3 additions & 3 deletions .github/PULL_REQUEST_TEMPLATE/research_problem.md
@@ -28,7 +28,7 @@ labels: research-problem
- [ ] `evaluate.sh` - Evaluation entry point
- [ ] `evaluator.py` - Scoring logic (outputs 0-100 score)
- [ ] `resources/` - Problem-specific code/data
- [ ] `reference.py` - Reference solution **(required for CI)**
- [ ] `reference.{py,cpp}` - Reference solution **(required for CI, extension matches `language` in config.yaml)**

### Problem Structure
```
@@ -38,15 +38,15 @@ research/{problem_name}/
├── set_up_env.sh
├── evaluate.sh
├── evaluator.py
├── reference.py # Required: CI will validate this achieves score > 0
├── reference.{py,cpp} # Required: CI validates score > 0 (extension per language)
└── resources/
└── ...
```

### Testing
- [ ] Verified `set_up_env.sh` runs successfully
- [ ] Verified `evaluate.sh` runs and outputs a numeric score
- [ ] **Reference solution (`reference.py`) achieves score > 0**
- [ ] **Reference solution achieves score > 0**

**Test Results** (if available):
```
36 changes: 32 additions & 4 deletions .github/workflows/validate-problems.yml
@@ -78,25 +78,53 @@ jobs:
- name: Install dependencies
run: uv sync

- name: Setup AWS credentials
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
run: |
mkdir -p ~/.aws
cat > ~/.aws/credentials << EOF
[default]
aws_access_key_id = $AWS_ACCESS_KEY_ID
aws_secret_access_key = $AWS_SECRET_ACCESS_KEY
EOF
cat > ~/.aws/config << EOF
[default]
region = us-east-1
EOF
echo "AWS credentials configured"

- name: Setup GCP credentials
env:
GCP_CREDS: ${{ secrets.GCP_CREDENTIALS }}
run: |
if [ -n "$GCP_CREDS" ]; then
echo "$GCP_CREDS" > /tmp/gcp-key.json
echo "GOOGLE_APPLICATION_CREDENTIALS=/tmp/gcp-key.json" >> $GITHUB_ENV
gcloud auth activate-service-account --key-file=/tmp/gcp-key.json
gcloud config set project ${{ secrets.GCP_PROJECT_ID }}
echo "GCP credentials configured"
else
echo "No GCP credentials available, skipping..."
fi

- name: Generate SSH key for SkyPilot
run: |
mkdir -p ~/.ssh
if [ ! -f ~/.ssh/sky-key ]; then
ssh-keygen -t rsa -b 4096 -f ~/.ssh/sky-key -N "" -C "sky-ci"
echo "Generated SSH key for SkyPilot"
fi

- name: Setup SkyPilot
run: |
uv run sky check || echo "SkyPilot check failed, continuing..."
uv run sky check aws gcp || echo "SkyPilot check failed, continuing..."

- name: Validate problems
timeout-minutes: 30
run: |
echo "Validating research problems: ${{ needs.detect-changes.outputs.research }}"
uv run python scripts/validate_problems.py \
--track research \
--problems ${{ needs.detect-changes.outputs.research }}
--timeout 1200 \
--problems ${{ needs.detect-changes.outputs.research }} \
--verbose
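Commit 3ce4f56 in this PR falls back to a regex when GitHub Actions masks the evaluator's JSON output. A rough sketch of that idea — the function name and fallback pattern are assumptions, not the actual `validate_problems.py` code:

```python
import re

def parse_score(output):
    """Pull a numeric score out of evaluator output, falling back to a
    looser regex when the JSON payload is masked (e.g. by CI secret
    masking). Returns None if no score can be found."""
    # Preferred path: a well-formed JSON "score" field.
    m = re.search(r'"score"\s*:\s*([0-9]+(?:\.[0-9]+)?)', output)
    if m is None:
        # Masked/garbled JSON: accept a bare "score: <num>" anywhere.
        m = re.search(r"score\s*[:=]\s*([0-9]+(?:\.[0-9]+)?)", output,
                      re.IGNORECASE)
    return float(m.group(1)) if m else None

print(parse_score('{"score": 87.5, "problem": "flash_attn"}'))  # 87.5
print(parse_score("*** score: 42 ***"))                         # 42.0
```

Keeping the strict JSON pattern first means clean runs are parsed exactly, and the looser pattern only engages when CI has rewritten the output.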
7 changes: 2 additions & 5 deletions .github/workflows/weekly-eval.yml
@@ -100,9 +100,7 @@ jobs:
--track research \
--internal-dir internal \
--results-repo results-repo \
--workers $WORKERS \
--clusters $CLUSTERS \
--skypilot \
-j $CLUSTERS \
--push

- name: Run algorithmic evaluation
@@ -116,8 +114,7 @@
--track algorithmic \
--internal-dir internal \
--results-repo results-repo \
--workers $WORKERS \
--skypilot \
-j $WORKERS \
--push

- name: Upload results artifact
16 changes: 10 additions & 6 deletions CONTRIBUTING.md
@@ -1,6 +1,8 @@
# Contributing to Frontier-CS

Frontier-CS is currently an **invitation-only** project for new problems.
> **For Problem Contributors**: Guidelines for creating and submitting new problems to Frontier-CS.

Frontier-CS is currently an **invitation-only** project for new problems.
Please create a GitHub pull request (PR) with your proposed problem following the guidelines below. After your PR is reviewed and merged, please send any hidden test data and reference solutions to the contact email provided at the end of this document.


@@ -130,11 +132,11 @@ research/{problem_name}/
├── evaluate.sh # Evaluation entry point
├── evaluator.py # Scoring logic
├── readme # Problem description
├── reference.py # Reference solution (required for CI validation)
├── reference.{py,cpp} # Reference solution (required for CI, extension per language)
└── resources/ # Problem-specific code/data
```

> **Note**: The `reference.py` is required for CI validation. When you submit a PR, the CI will automatically run your reference solution and verify it achieves score > 0.
> **Note**: A reference solution is required for CI validation. Use `reference.py` for Python problems or `reference.cpp` if `language: cpp` in config.yaml. The CI will automatically run your reference solution and verify it achieves score > 0.

### Solution Interface

@@ -331,10 +333,12 @@ When you submit a PR that adds or modifies problems, CI will automatically valid
| Track | File | Location |
|-------|------|----------|
| Algorithmic | `reference.cpp` | `algorithmic/problems/{id}/reference.cpp` |
| Research | `reference.py` | `research/problems/{name}/reference.py` |
| Research | `reference.{py,cpp}` | `research/problems/{name}/reference.{ext}` (extension per `language` in config.yaml) |

If the reference solution is missing or scores 0, the PR will be blocked from merging.

> **Important**: The reference solution must achieve score > 0. This is a design choice to ensure the evaluator is working correctly - a score > 0 proves that the evaluation pipeline can successfully compile/run the solution and produce a valid score. If the reference only scores 0, we cannot distinguish between "evaluator error" and "valid solution with no improvement". For problems that measure speedup against a baseline, the reference must be **faster than the baseline**, not just a copy of it.

### Local Testing

Before submitting a PR, test your reference solution locally:
@@ -343,8 +347,8 @@ Before submitting a PR, test your reference solution locally:
# Algorithmic
frontier eval algorithmic {id} algorithmic/problems/{id}/reference.cpp

# Research
frontier eval research {name} research/problems/{name}/reference.py
# Research (use .py or .cpp based on problem's language config)
frontier eval research {name} research/problems/{name}/reference.{ext}
```
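Since the reference extension now follows the problem's `language` field in `config.yaml`, the right filename can be derived mechanically. A small illustrative helper — an assumption about how such tooling could look, not actual repo code (a real loader would use a YAML parser):

```python
# Hypothetical helper: map a problem's config.yaml `language` field to
# the expected reference-solution filename. Default is Python, per the
# contributing guidelines above.
EXT_BY_LANGUAGE = {"python": "py", "cpp": "cpp"}

def reference_filename(config_text):
    """Return reference.{ext} given the raw text of a config.yaml."""
    language = "python"  # default when no `language` field is present
    for line in config_text.splitlines():
        if line.strip().startswith("language:"):
            language = line.split(":", 1)[1].strip()
    return f"reference.{EXT_BY_LANGUAGE.get(language, 'py')}"

print(reference_filename("name: nbody_simulation\nlanguage: cpp\n"))  # reference.cpp
print(reference_filename("name: flash_attn\n"))                       # reference.py
```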

## Contact
22 changes: 9 additions & 13 deletions README.md
@@ -150,9 +150,9 @@ frontier eval algorithmic 1 <your_solution.cpp> --unbounded
### Python API

```python
from frontier_cs import FrontierCSEvaluator
from frontier_cs import SingleEvaluator

evaluator = FrontierCSEvaluator()
evaluator = SingleEvaluator()

# Evaluate a research problem
result = evaluator.evaluate("research", problem_id="flash_attn", code=my_code)
@@ -195,28 +195,24 @@ research/solutions/

```bash
# Evaluate all research solutions (uses SkyPilot by default)
uv run frontier-eval batch research
frontier batch research

# Evaluate all algorithmic solutions (uses Docker by default)
uv run frontier-eval batch algorithmic
frontier batch algorithmic

# Filter by model or problem
uv run frontier-eval batch research --model gpt5.1
uv run frontier-eval batch research --problem flash_attn
uv run frontier-eval batch research --model gpt5.1 --problem flash_attn
frontier batch research --model gpt5.1
frontier batch research --problem flash_attn

# Override default backend
uv run frontier-eval batch research --backend docker
uv run frontier-eval batch algorithmic --backend skypilot
frontier batch research --backend docker
frontier batch algorithmic --backend skypilot
```

**Custom solutions directory:** You can test solutions from a custom directory with the same structure:

```bash
# Your custom directory should have the same structure:
# my_solutions/{problem}/{model}.py

uv run frontier-eval batch research --solutions-dir ./my_solutions
frontier batch research --solutions-dir ./my_solutions
```

Results are saved to `./results/batch/{track}/` by default. The state file tracks which (solution, problem) pairs have been evaluated, so you can:
36 changes: 19 additions & 17 deletions SUBMIT.md
@@ -1,6 +1,6 @@
# Evaluating Your Model

Complete workflow for benchmarking your model on Frontier-CS and submitting results to the leaderboard.
> **For Model Providers**: Complete workflow for benchmarking your model on Frontier-CS and submitting results to the leaderboard.

## Step 1: Prepare Solutions

@@ -19,7 +19,7 @@ research/solutions/gemm_optimization/squares/my_model.py
algorithmic/solutions/1/my_model.cpp
```

- **Research track**: Python (`.py`)
- **Research track**: Python (`.py`) by default, or C++ (`.cpp`) if problem specifies `language: cpp` in config.yaml
- **Algorithmic track**: C++17 (`.cpp`)
- We recommend generating **5 variants per model** to compute Score@5

@@ -36,7 +36,7 @@ research/solutions/
└── ...
```
```bash
frontier-eval batch research --model my_model
frontier batch research --model my_model
```

**2. Use your own directory**
@@ -48,7 +48,7 @@ frontier-eval batch research --model my_model
└── ...
```
```bash
frontier-eval batch research --solutions-dir ./my_solutions
frontier batch research --solutions-dir ./my_solutions
```

**3. Explicit pairs file**
@@ -59,39 +59,39 @@ frontier-eval batch research --solutions-dir ./my_solutions
./my_solutions/cross_entropy/my_model.py:cross_entropy
```
```bash
frontier-eval batch research --pairs-file pairs.txt
frontier batch research --pairs-file pairs.txt
```

### Backend Options

```bash
# Research defaults to SkyPilot, algorithmic defaults to Docker
frontier-eval batch research --backend docker
frontier-eval batch algorithmic --backend skypilot
frontier batch research --backend docker
frontier batch algorithmic --backend skypilot

# Parallelism
frontier-eval batch research --workers 20 --clusters 4
frontier batch research --workers 20 --clusters 4
```

### Result Storage

```bash
# Local (default): results saved to ./results/batch/{track}/
frontier-eval batch research
frontier batch research

# Cloud bucket (requires --backend skypilot): results written directly to S3/GCS
frontier-eval batch research --bucket-url s3://my-bucket/results
frontier batch research --bucket-url s3://my-bucket/results

# Sync from bucket to local
frontier-eval batch research --bucket-url s3://my-bucket/results --sync-bucket
frontier batch research --bucket-url s3://my-bucket/results --sync-bucket
```

### Control Options

```bash
frontier-eval batch research --status # Check status
frontier-eval batch research --no-resume # Force re-evaluate all
frontier-eval batch research --retry-failed # Retry failed (including score=0)
frontier batch research --status # Check status
frontier batch research --no-resume # Force re-evaluate all
frontier batch research --retry-failed # Retry failed (including score=0)
```

- Incremental evaluation with hash-based caching (solution/problem changes trigger re-evaluation)
Expand All @@ -114,7 +114,7 @@ We welcome submissions from all models and agent frameworks. To have your result

### Algorithmic Problems

We currently release **1 -- 3 public test case** per problem for local testing and debugging. Full evaluation (with all test cases) is performed on our servers.
We currently release **1-3 public test cases** per problem for local testing and debugging. Full evaluation (with all test cases) is performed on our servers.

#### What to Submit

@@ -174,7 +174,7 @@ Problem (e.g., gemm_optimization, poc_generation)

Each variant has a unique **Problem ID** based on its path under `research/`.

The full list of all evaluatable variants is in [`research/problems.txt`](research/problems.txt) (109 variants total, aggregated into ~50 categories for reporting).
The full list of all evaluatable variants is in [`research/scripts/problems.txt`](research/scripts/problems.txt).

| Type | Example Path | Problem ID |
|------|-------------|------------|
@@ -309,7 +309,9 @@ export GOOGLE_API_KEY=...

### Generate Solutions

#### Research Track (Python)
#### Research Track

Most research problems are Python, but some (e.g., `nbody_simulation`) require C++. The language is configured per-problem via `language` field in `config.yaml`.

```bash
# Generate one solution