Frontier-CS is an unsolved, open-ended, verifiable, and diverse benchmark for evaluating AI on challenging computer science problems.
Think of it as an "exam" for AI, but instead of easy textbook questions, we give problems that are genuinely difficult: ones that researchers struggle with, that have no known optimal solutions, or that require deep expertise to even attempt.
Current benchmarks are becoming too easy. Models score 90%+ on many existing coding benchmarks, but that doesn't mean they can actually do useful research or solve real-world engineering challenges.
Frontier-CS is different:
| | Traditional Benchmarks | Frontier-CS |
|---|---|---|
| Difficulty | Often saturated as intelligence evolves | Unsolved: no solution has achieved a perfect score |
| Problems | Textbook-style, known solutions | Open-ended research & optimization challenges |
| Evaluation | Binary pass-or-fail | Verifiable continuous scoring, always room to improve |
| Scope | Usually one domain | Diverse: systems, ML, algorithms, security, and more |
Score@k = best of k runs; Avg@k = average over k runs; Elo is fit with a Bradley–Terry model on single-attempt, difficulty-normalized performance. A small sketch of how these metrics aggregate per-run scores follows the leaderboards.
| Rank | Model | Score@1 | Avg@5 | Score@5 | Elo |
|---|---|---|---|---|---|
| 🥇 | Gemini 3.0 Pro | 33.12 | 34.58 | 56.09 | 1265 |
| 🥈 | GPT 5.2 Thinking | 32.40 | 33.11 | 47.19 | 1242 |
| 🥉 | GPT 5 Thinking | 23.10 | 22.58 | 39.73 | 1196 |
| 4 | DeepSeek 3.2 | 24.83 | 23.89 | 41.44 | 1193 |
| 5 | Grok 4 | 24.04 | 22.98 | 36.81 | 1174 |
| 6 | Gemini 2.5 Pro | 20.34 | 19.32 | 36.65 | 1167 |
| 7 | GPT 5.1 Thinking | 20.64 | 21.49 | 34.76 | 1164 |
Human reference: 86.99 (Score@1).
| Rank | Model | Score@1 | Avg@5 | Score@5 | Elo |
|---|---|---|---|---|---|
| 🥇 | Gemini 3.0 Pro | 46.55 | 43.14 | 59.22 | 1283 |
| 🥈 | GPT 5 Thinking | 30.91 | 34.94 | 55.25 | 1218 |
| 🥉 | GPT 5.1 Thinking | 32.12 | 33.70 | 56.79 | 1214 |
| 4 | GPT 5.2 Thinking | 30.29 | 34.09 | 58.90 | 1210 |
| 5 | Gemini 2.5 Pro | 21.66 | 25.74 | 51.57 | 1180 |
| 6 | Grok 4 | 26.75 | 24.01 | 48.15 | 1149 |
| 7 | DeepSeek 3.2 | 21.51 | 21.76 | 44.41 | 1146 |
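To make the metric columns concrete, here is a minimal sketch, not the official scoring code, of how Score@k and Avg@k aggregate a model's per-run scores on a single problem. The run scores below are hypothetical, and the Elo column is fit separately via Bradley–Terry over single-attempt comparisons.

```python
# Minimal sketch of the leaderboard aggregation; hypothetical run scores,
# not the official scoring script.
from statistics import mean

run_scores = [31.0, 28.5, 35.2, 30.1, 33.7]  # one problem, k = 5 independent runs

score_at_1 = run_scores[0]    # Score@1: a single attempt
avg_at_5 = mean(run_scores)   # Avg@5: average over the k runs
score_at_5 = max(run_scores)  # Score@5: best of the k runs

print(f"Score@1 = {score_at_1}, Avg@5 = {avg_at_5:.2f}, Score@5 = {score_at_5}")
```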
Requirements: Python 3.11+, Docker 24+ (for local evaluation)
```bash
git clone https://github.com/FrontierCS/Frontier-CS.git
cd Frontier-CS

# Install dependencies (using uv, recommended)
uv sync

# Or with pip:
pip install -e .
```

Here's Algorithmic Problem 0 - try to beat GPT-5!
```bash
# Run the example solution (Human Expert Solution)
frontier eval algorithmic 0 algorithmic/problems/0/examples/reference.cpp

# Run the example solution (GPT-5 Thinking Solution)
frontier eval algorithmic 0 algorithmic/problems/0/examples/gpt5.cpp

# Try your own solution!
frontier eval algorithmic 0 <your_solution.cpp>
```
```bash
# List all problems
frontier list research

# Evaluate a generated solution locally for the flash_attn problem (requires Docker)
frontier eval research flash_attn <your_solution.py>

# Evaluate on cloud (requires SkyPilot)
frontier eval research flash_attn <your_solution.py> --skypilot
```

See research/README.md for full documentation.
```bash
# Evaluate a solution locally (requires Docker)
frontier eval algorithmic 1 <your_solution.cpp>

# Evaluate on cloud (requires SkyPilot)
frontier eval algorithmic 1 <your_solution.cpp> --skypilot
```

See algorithmic/README.md for full documentation.
Frontier-CS supports unbounded scoring, enabling open-ended evaluation compatible with algorithm evolution frameworks such as OpenEvolve.
```bash
# Get unbounded score (without clipping to 100)
frontier eval research flash_attn <your_solution.py> --unbounded
frontier eval algorithmic 1 <your_solution.cpp> --unbounded
```

Python API:

```python
from frontier_cs import FrontierCSEvaluator

evaluator = FrontierCSEvaluator()
# Evaluate a research problem
result = evaluator.evaluate("research", problem_id="flash_attn", code=my_code)
print(f"Score: {result.score}")
# Evaluate an algorithmic problem
result = evaluator.evaluate("algorithmic", problem_id=1, code=cpp_code)
print(f"Score: {result.score}")
# Get unbounded score for algorithmic problems
result = evaluator.evaluate("algorithmic", problem_id=1, code=cpp_code, unbounded=True)
print(f"Score (bounded): {result.score}")
print(f"Score (unbounded): {result.score_unbounded}")For testing your solutions at scale with public test cases.
Batch evaluation lets you test your solutions at scale with the public test cases.

Solution directory structure:
```
{track}/solutions/
  {problem}/
    {model}.py      # variant 0
    {model}_1.py    # variant 1
    {model}_2.py    # variant 2
```
Example for research track:
```
research/solutions/
  flash_attn/
    gpt5.py
    claude4.5sonnet.py
  cross_entropy/
    gpt5.py
```
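If you script around this layout, the variant suffix can be parsed as sketched below; this is an illustrative helper, not the batch runner's actual implementation.

```python
# Minimal sketch (not the batch runner's code) of how the
# {model}.py / {model}_1.py / {model}_2.py variant convention can be parsed.
import re
from pathlib import Path

def parse_variant(path: Path) -> tuple[str, int]:
    """Return (model_name, variant_index) for a solution file."""
    m = re.fullmatch(r"(?P<model>.+?)(?:_(?P<idx>\d+))?", path.stem)
    model, idx = m.group("model"), m.group("idx")
    return model, int(idx) if idx else 0

print(parse_variant(Path("research/solutions/flash_attn/gpt5.py")))    # ('gpt5', 0)
print(parse_variant(Path("research/solutions/flash_attn/gpt5_2.py")))  # ('gpt5', 2)
```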
Basic usage:
```bash
# Evaluate all research solutions (uses SkyPilot by default)
uv run frontier-eval batch research

# Evaluate all algorithmic solutions (uses Docker by default)
uv run frontier-eval batch algorithmic

# Filter by model or problem
uv run frontier-eval batch research --model gpt5.1
uv run frontier-eval batch research --problem flash_attn
uv run frontier-eval batch research --model gpt5.1 --problem flash_attn

# Override default backend
uv run frontier-eval batch research --backend docker
uv run frontier-eval batch algorithmic --backend skypilot
```

Custom solutions directory: You can test solutions from a custom directory with the same structure:
```bash
# Your custom directory should have the same structure:
# my_solutions/{problem}/{model}.py
uv run frontier-eval batch research --solutions-dir ./my_solutions
```

Results are saved to `./results/batch/{track}/` by default. The state file tracks which (solution, problem) pairs have been evaluated, so you can:
- Resume interrupted evaluations automatically
- Run multiple times with different `--solutions-dir` values; results accumulate
See `--help` for all options.
Note: For maintainers, `./scripts/run_eval.sh` is used for full evaluation with private test cases.
Reference solutions and full test cases are withheld. We release partial test cases so you can develop and debug locally. For full evaluation and leaderboard inclusion, please follow the instructions in SUBMIT.md and submit your solutions to qmang@berkeley.edu, wenhao.chai@princeton.edu, huanzhimao@berkeley.edu, or zhifei.li@berkeley.edu.
Questions? Join our Discord
Some problems are adapted from ALE-bench and AI-Driven Research for Systems (ADRS).
If you use Frontier-CS in your research, please cite:
```bibtex
@misc{mang2025frontiercsevolvingchallengesevolving,
  title={FrontierCS: Evolving Challenges for Evolving Intelligence},
  author={Qiuyang Mang and Wenhao Chai and Zhifei Li and Huanzhi Mao and
          Shang Zhou and Alexander Du and Hanchen Li and Shu Liu and
          Edwin Chen and Yichuan Wang and Xieting Chu and Zerui Cheng and
          Yuan Xu and Tian Xia and Zirui Wang and Tianneng Shi and
          Jianzhu Yao and Yilong Zhao and Qizheng Zhang and Charlie Ruan and
          Zeyu Shen and Kaiyuan Liu and Runyuan He and Dong Xing and
          Zerui Li and Zirong Zeng and Yige Jiang and Lufeng Cheng and
          Ziyi Zhao and Youran Sun and Wesley Zheng and Meiyuwang Zhang and
          Ruyi Ji and Xuechang Tu and Zihan Zheng and Zexing Chen and
          Kangyang Zhou and Zhaozi Wang and Jingbang Chen and
          Aleksandra Korolova and Peter Henderson and Pramod Viswanath and
          Vijay Ganesh and Saining Xie and Zhuang Liu and Dawn Song and
          Sewon Min and Ion Stoica and Joseph E. Gonzalez and
          Jingbo Shang and Alvin Cheung},
  year={2025},
  eprint={2512.15699},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2512.15699},
}
```
