Merged
50 commits
de6e992 docs: standardize on frontier batch, simplify redundant sections (andylizf, Feb 3, 2026)
b3544a8 docs: remove vague local testing line in algorithmic README (andylizf, Feb 4, 2026)
ad52289 docs: remove redundant pip install skypilot-nightly (andylizf, Feb 4, 2026)
cf935ce docs: explain why research defaults to SkyPilot (andylizf, Feb 4, 2026)
24dc087 docs: add SUBMIT.md reference to algorithmic batch eval (andylizf, Feb 4, 2026)
072670c docs: fix typos and inconsistencies (andylizf, Feb 4, 2026)
7cb6dc6 docs: add info boxes to clarify document purpose (andylizf, Feb 4, 2026)
6ac274e docs: move C++17 note to CLI section (andylizf, Feb 4, 2026)
1c4e5cc docs: add Solution Requirements section in algorithmic README (andylizf, Feb 4, 2026)
ec77f01 docs: fix inconsistencies and add Solution Requirements to research (andylizf, Feb 4, 2026)
8dbd715 docs: restore problem detection note in research README (andylizf, Feb 4, 2026)
bbd91b6 docs: fix Python API backend documentation in research README (andylizf, Feb 4, 2026)
efa161b feat: auto-detect backend by track in Python API (research->skypilot,… (andylizf, Feb 4, 2026)
8b41564 docs: update docstrings to reflect track-based backend defaults (andylizf, Feb 4, 2026)
23fd532 fix: replace deprecated --skypilot references with --backend (andylizf, Feb 4, 2026)
d246c4b fix: use correct -j flag in CI and remove redundant --skypilot (andylizf, Feb 4, 2026)
ec608c5 feat: add multi-language support for research track problems (andylizf, Feb 4, 2026)
a62efc8 docs: document language field for research track problems (andylizf, Feb 4, 2026)
8f84fa6 docs: update PR template with SUBMIT.md link and multi-language support (andylizf, Feb 4, 2026)
ea6e605 feat: support multi-language reference solutions in CI validation (andylizf, Feb 4, 2026)
48ced5c chore: add reference.cpp for nbody_simulation problems (andylizf, Feb 4, 2026)
fbe06d5 refactor: rename frontier-eval to frontier in scripts (andylizf, Feb 4, 2026)
b0615b0 feat: add per-provider concurrency control based on RPM limits (andylizf, Feb 4, 2026)
fb30427 fix: update validate_problems.py for positional track argument (andylizf, Feb 4, 2026)
2fb0a77 fix: support multi-language solutions in docker/skypilot runners (andylizf, Feb 5, 2026)
2d84681 fix: optimize nbody reference and fix multi-language runner support (andylizf, Feb 5, 2026)
9189cbb docs: clarify reference solution must score > 0 (beat baseline) (andylizf, Feb 5, 2026)
253fcb7 docs: explain why reference must score > 0 (andylizf, Feb 5, 2026)
74ec7b4 fix: use docker backend for CI validation (andylizf, Feb 5, 2026)
7f6404e Revert "fix: use docker backend for CI validation" (andylizf, Feb 5, 2026)
452e02a fix: activate GCP service account for SkyPilot in CI (andylizf, Feb 5, 2026)
7587bd6 docs: add project rules for Claude (andylizf, Feb 5, 2026)
fd70693 fix: show full error output in validation (andylizf, Feb 5, 2026)
830d49a fix: improve CI debugging and SSH key setup for SkyPilot (andylizf, Feb 5, 2026)
9262040 fix: remove hardcoded AWS cloud from nbody config for CI compatibility (andylizf, Feb 5, 2026)
0986915 Revert "fix: remove hardcoded AWS cloud from nbody config for CI comp… (andylizf, Feb 5, 2026)
b151619 fix: configure AWS credentials for CI SkyPilot (andylizf, Feb 5, 2026)
711c1cd fix: configure both AWS and GCP credentials for CI (andylizf, Feb 5, 2026)
4eace96 fix: add skypilot AWS and GCP extras (andylizf, Feb 5, 2026)
3ce4f56 fix: parse score with regex when JSON is masked by CI (andylizf, Feb 5, 2026)
d97d690 fix: use evaluator API in validate_problems and keep JSON output clean (andylizf, Feb 5, 2026)
811d9cd fix: always down skypilot clusters unless kept (andylizf, Feb 5, 2026)
f7662b6 fix: add eval/validate cleanup hooks for skypilot (andylizf, Feb 5, 2026)
98baf2f refactor: centralize skypilot cleanup registry (andylizf, Feb 5, 2026)
6d1d753 refactor: register cleanup hooks in evaluator (andylizf, Feb 5, 2026)
2461490 refactor: share research runner validation helpers (andylizf, Feb 5, 2026)
9b74119 refactor: centralize research runtime config loading (andylizf, Feb 5, 2026)
fe4b15b refactor: share uv install script for research runners (andylizf, Feb 5, 2026)
a57ab81 refactor: share timeout prefix helper for research runners (andylizf, Feb 5, 2026)
32f4a43 refactor: rename evaluator and runner classes (andylizf, Feb 5, 2026)
10 changes: 10 additions & 0 deletions .claude/CLAUDE.md
@@ -0,0 +1,10 @@
# Project Rules for Frontier-CS

## Backend Selection

**NEVER change the backend due to missing credentials or CI configuration issues.**

- Research track: always uses SkyPilot (cloud VMs)
- Algorithmic track: always uses Docker (local)

If CI fails due to credentials/permissions, fix the credentials - do NOT change the code to use a different backend. The backend choice is intentional for each track's evaluation requirements.
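Commit efa161b in this PR auto-detects the backend from the track, matching the rule above. A minimal sketch of that mapping follows — the function and dictionary names here are illustrative assumptions, not the actual `frontier_cs` API:

```python
# Sketch only: illustrates the track -> backend rule stated above.
# DEFAULT_BACKENDS and resolve_backend are hypothetical names.
DEFAULT_BACKENDS = {
    "research": "skypilot",   # research evals run on cloud VMs
    "algorithmic": "docker",  # algorithmic evals run locally
}

def resolve_backend(track, backend=None):
    """Return an explicitly requested backend, else the track's default."""
    if backend is not None:
        return backend  # explicit --backend override wins
    if track not in DEFAULT_BACKENDS:
        raise ValueError(f"unknown track: {track}")
    return DEFAULT_BACKENDS[track]

print(resolve_backend("research"))            # skypilot
print(resolve_backend("research", "docker"))  # docker (explicit override)
```

An explicit `--backend` flag still overrides the default, which is why the rule above targets CI fixes rather than code changes.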
3 changes: 2 additions & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
@@ -1,6 +1,7 @@
## Summary
<!-- Brief description of changes -->

> Please read [CONTRIBUTING.md](../CONTRIBUTING.md) before submitting.

## Type of Change
- [ ] New research problem
@@ -21,4 +22,4 @@
## CI Validation (for new problems)
> When adding new problems, CI will automatically validate that your reference solution achieves score > 0.
> - Algorithmic problems: Include `reference.cpp` in your problem directory
> - Research problems: Include `reference.py` in your problem directory
> - Research problems: Include `reference.py` (or `reference.cpp` if `language: cpp` in config.yaml)
6 changes: 3 additions & 3 deletions .github/PULL_REQUEST_TEMPLATE/research_problem.md
@@ -28,7 +28,7 @@ labels: research-problem
- [ ] `evaluate.sh` - Evaluation entry point
- [ ] `evaluator.py` - Scoring logic (outputs 0-100 score)
- [ ] `resources/` - Problem-specific code/data
- [ ] `reference.py` - Reference solution **(required for CI)**
- [ ] `reference.{py,cpp}` - Reference solution **(required for CI, extension matches `language` in config.yaml)**

### Problem Structure
```
@@ -38,15 +38,15 @@ research/{problem_name}/
├── set_up_env.sh
├── evaluate.sh
├── evaluator.py
├── reference.py # Required: CI will validate this achieves score > 0
├── reference.{py,cpp} # Required: CI validates score > 0 (extension per language)
└── resources/
└── ...
```

### Testing
- [ ] Verified `set_up_env.sh` runs successfully
- [ ] Verified `evaluate.sh` runs and outputs a numeric score
- [ ] **Reference solution (`reference.py`) achieves score > 0**
- [ ] **Reference solution achieves score > 0**

**Test Results** (if available):
```
36 changes: 32 additions & 4 deletions .github/workflows/validate-problems.yml
@@ -78,25 +78,53 @@ jobs:
- name: Install dependencies
run: uv sync

- name: Setup AWS credentials
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
run: |
mkdir -p ~/.aws
cat > ~/.aws/credentials << EOF
[default]
aws_access_key_id = $AWS_ACCESS_KEY_ID
aws_secret_access_key = $AWS_SECRET_ACCESS_KEY
EOF
cat > ~/.aws/config << EOF
[default]
region = us-east-1
EOF
echo "AWS credentials configured"

- name: Setup GCP credentials
env:
GCP_CREDS: ${{ secrets.GCP_CREDENTIALS }}
run: |
if [ -n "$GCP_CREDS" ]; then
echo "$GCP_CREDS" > /tmp/gcp-key.json
echo "GOOGLE_APPLICATION_CREDENTIALS=/tmp/gcp-key.json" >> $GITHUB_ENV
gcloud auth activate-service-account --key-file=/tmp/gcp-key.json
gcloud config set project ${{ secrets.GCP_PROJECT_ID }}
echo "GCP credentials configured"
else
echo "No GCP credentials available, skipping..."
fi

- name: Generate SSH key for SkyPilot
run: |
mkdir -p ~/.ssh
if [ ! -f ~/.ssh/sky-key ]; then
ssh-keygen -t rsa -b 4096 -f ~/.ssh/sky-key -N "" -C "sky-ci"
echo "Generated SSH key for SkyPilot"
fi

- name: Setup SkyPilot
run: |
uv run sky check || echo "SkyPilot check failed, continuing..."
uv run sky check aws gcp || echo "SkyPilot check failed, continuing..."

- name: Validate problems
timeout-minutes: 30
run: |
echo "Validating research problems: ${{ needs.detect-changes.outputs.research }}"
uv run python scripts/validate_problems.py \
--track research \
--problems ${{ needs.detect-changes.outputs.research }}
--timeout 1200 \
--problems ${{ needs.detect-changes.outputs.research }} \
--verbose
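Commit 3ce4f56 in this PR falls back to a regex when GitHub Actions masks the evaluator's JSON output. A rough sketch of that idea — the function name and fallback pattern are assumptions, not the actual `validate_problems.py` code:

```python
import re

def parse_score(output):
    """Pull a numeric score out of evaluator output, falling back to a
    looser regex when the JSON payload is masked (e.g. by CI secret
    masking). Returns None if no score can be found."""
    # Preferred path: a well-formed JSON "score" field.
    m = re.search(r'"score"\s*:\s*([0-9]+(?:\.[0-9]+)?)', output)
    if m is None:
        # Masked/garbled JSON: accept a bare "score: <num>" anywhere.
        m = re.search(r"score\s*[:=]\s*([0-9]+(?:\.[0-9]+)?)", output,
                      re.IGNORECASE)
    return float(m.group(1)) if m else None

print(parse_score('{"score": 87.5, "problem": "flash_attn"}'))  # 87.5
print(parse_score("*** score: 42 ***"))                         # 42.0
```

Keeping the strict JSON pattern first means clean runs are parsed exactly, and the looser pattern only engages when CI has rewritten the output.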
7 changes: 2 additions & 5 deletions .github/workflows/weekly-eval.yml
@@ -100,9 +100,7 @@ jobs:
--track research \
--internal-dir internal \
--results-repo results-repo \
--workers $WORKERS \
--clusters $CLUSTERS \
--skypilot \
-j $CLUSTERS \
--push

- name: Run algorithmic evaluation
@@ -116,8 +114,7 @@
--track algorithmic \
--internal-dir internal \
--results-repo results-repo \
--workers $WORKERS \
--skypilot \
-j $WORKERS \
--push

- name: Upload results artifact
16 changes: 10 additions & 6 deletions CONTRIBUTING.md
@@ -1,6 +1,8 @@
# Contributing to Frontier-CS

Frontier-CS is currently an **invitation-only** project for new problems.
> **For Problem Contributors**: Guidelines for creating and submitting new problems to Frontier-CS.

Frontier-CS is currently an **invitation-only** project for new problems.
Please create a GitHub pull request (PR) with your proposed problem following the guidelines below. After your PR is reviewed and merged, please send any hidden test data and reference solutions to the contact email provided at the end of this document.


@@ -130,11 +132,11 @@ research/{problem_name}/
├── evaluate.sh # Evaluation entry point
├── evaluator.py # Scoring logic
├── readme # Problem description
├── reference.py # Reference solution (required for CI validation)
├── reference.{py,cpp} # Reference solution (required for CI, extension per language)
└── resources/ # Problem-specific code/data
```

> **Note**: The `reference.py` is required for CI validation. When you submit a PR, the CI will automatically run your reference solution and verify it achieves score > 0.
> **Note**: A reference solution is required for CI validation. Use `reference.py` for Python problems or `reference.cpp` if `language: cpp` in config.yaml. The CI will automatically run your reference solution and verify it achieves score > 0.

### Solution Interface

@@ -331,10 +333,12 @@ When you submit a PR that adds or modifies problems, CI will automatically valid
| Track | File | Location |
|-------|------|----------|
| Algorithmic | `reference.cpp` | `algorithmic/problems/{id}/reference.cpp` |
| Research | `reference.py` | `research/problems/{name}/reference.py` |
| Research | `reference.{py,cpp}` | `research/problems/{name}/reference.{ext}` (extension per `language` in config.yaml) |

If the reference solution is missing or scores 0, the PR will be blocked from merging.

> **Important**: The reference solution must achieve score > 0. This is a design choice to ensure the evaluator is working correctly - a score > 0 proves that the evaluation pipeline can successfully compile/run the solution and produce a valid score. If the reference only scores 0, we cannot distinguish between "evaluator error" and "valid solution with no improvement". For problems that measure speedup against a baseline, the reference must be **faster than the baseline**, not just a copy of it.

### Local Testing

Before submitting a PR, test your reference solution locally:
@@ -343,8 +347,8 @@ Before submitting a PR, test your reference solution locally:
# Algorithmic
frontier eval algorithmic {id} algorithmic/problems/{id}/reference.cpp

# Research
frontier eval research {name} research/problems/{name}/reference.py
# Research (use .py or .cpp based on problem's language config)
frontier eval research {name} research/problems/{name}/reference.{ext}
```
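Since the reference extension now follows the problem's `language` field in `config.yaml`, the right filename can be derived mechanically. A small illustrative helper — an assumption about how such tooling could look, not actual repo code (a real loader would use a YAML parser):

```python
# Hypothetical helper: map a problem's config.yaml `language` field to
# the expected reference-solution filename. Default is Python, per the
# contributing guidelines above.
EXT_BY_LANGUAGE = {"python": "py", "cpp": "cpp"}

def reference_filename(config_text):
    """Return reference.{ext} given the raw text of a config.yaml."""
    language = "python"  # default when no `language` field is present
    for line in config_text.splitlines():
        if line.strip().startswith("language:"):
            language = line.split(":", 1)[1].strip()
    return f"reference.{EXT_BY_LANGUAGE.get(language, 'py')}"

print(reference_filename("name: nbody_simulation\nlanguage: cpp\n"))  # reference.cpp
print(reference_filename("name: flash_attn\n"))                       # reference.py
```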

## Contact
22 changes: 9 additions & 13 deletions README.md
@@ -150,9 +150,9 @@ frontier eval algorithmic 1 <your_solution.cpp> --unbounded
### Python API

```python
from frontier_cs import FrontierCSEvaluator
from frontier_cs import SingleEvaluator

evaluator = FrontierCSEvaluator()
evaluator = SingleEvaluator()

# Evaluate a research problem
result = evaluator.evaluate("research", problem_id="flash_attn", code=my_code)
@@ -195,28 +195,24 @@ research/solutions/

```bash
# Evaluate all research solutions (uses SkyPilot by default)
uv run frontier-eval batch research
frontier batch research

# Evaluate all algorithmic solutions (uses Docker by default)
uv run frontier-eval batch algorithmic
frontier batch algorithmic

# Filter by model or problem
uv run frontier-eval batch research --model gpt5.1
uv run frontier-eval batch research --problem flash_attn
uv run frontier-eval batch research --model gpt5.1 --problem flash_attn
frontier batch research --model gpt5.1
frontier batch research --problem flash_attn

# Override default backend
uv run frontier-eval batch research --backend docker
uv run frontier-eval batch algorithmic --backend skypilot
frontier batch research --backend docker
frontier batch algorithmic --backend skypilot
```

**Custom solutions directory:** You can test solutions from a custom directory with the same structure:

```bash
# Your custom directory should have the same structure:
# my_solutions/{problem}/{model}.py

uv run frontier-eval batch research --solutions-dir ./my_solutions
frontier batch research --solutions-dir ./my_solutions
```

Results are saved to `./results/batch/{track}/` by default. The state file tracks which (solution, problem) pairs have been evaluated, so you can:
36 changes: 19 additions & 17 deletions SUBMIT.md
@@ -1,6 +1,6 @@
# Evaluating Your Model

Complete workflow for benchmarking your model on Frontier-CS and submitting results to the leaderboard.
> **For Model Providers**: Complete workflow for benchmarking your model on Frontier-CS and submitting results to the leaderboard.

## Step 1: Prepare Solutions

@@ -19,7 +19,7 @@ research/solutions/gemm_optimization/squares/my_model.py
algorithmic/solutions/1/my_model.cpp
```

- **Research track**: Python (`.py`)
- **Research track**: Python (`.py`) by default, or C++ (`.cpp`) if problem specifies `language: cpp` in config.yaml
- **Algorithmic track**: C++17 (`.cpp`)
- We recommend generating **5 variants per model** to compute Score@5

@@ -36,7 +36,7 @@ research/solutions/
└── ...
```
```bash
frontier-eval batch research --model my_model
frontier batch research --model my_model
```

**2. Use your own directory**
@@ -48,7 +48,7 @@ frontier-eval batch research --model my_model
└── ...
```
```bash
frontier-eval batch research --solutions-dir ./my_solutions
frontier batch research --solutions-dir ./my_solutions
```

**3. Explicit pairs file**
@@ -59,39 +59,39 @@ frontier-eval batch research --solutions-dir ./my_solutions
./my_solutions/cross_entropy/my_model.py:cross_entropy
```
```bash
frontier-eval batch research --pairs-file pairs.txt
frontier batch research --pairs-file pairs.txt
```

### Backend Options

```bash
# Research defaults to SkyPilot, algorithmic defaults to Docker
frontier-eval batch research --backend docker
frontier-eval batch algorithmic --backend skypilot
frontier batch research --backend docker
frontier batch algorithmic --backend skypilot

# Parallelism
frontier-eval batch research --workers 20 --clusters 4
frontier batch research --workers 20 --clusters 4
```

### Result Storage

```bash
# Local (default): results saved to ./results/batch/{track}/
frontier-eval batch research
frontier batch research

# Cloud bucket (requires --backend skypilot): results written directly to S3/GCS
frontier-eval batch research --bucket-url s3://my-bucket/results
frontier batch research --bucket-url s3://my-bucket/results

# Sync from bucket to local
frontier-eval batch research --bucket-url s3://my-bucket/results --sync-bucket
frontier batch research --bucket-url s3://my-bucket/results --sync-bucket
```

### Control Options

```bash
frontier-eval batch research --status # Check status
frontier-eval batch research --no-resume # Force re-evaluate all
frontier-eval batch research --retry-failed # Retry failed (including score=0)
frontier batch research --status # Check status
frontier batch research --no-resume # Force re-evaluate all
frontier batch research --retry-failed # Retry failed (including score=0)
```

- Incremental evaluation with hash-based caching (solution/problem changes trigger re-evaluation)
Expand All @@ -114,7 +114,7 @@ We welcome submissions from all models and agent frameworks. To have your result

### Algorithmic Problems

We currently release **1 -- 3 public test case** per problem for local testing and debugging. Full evaluation (with all test cases) is performed on our servers.
We currently release **1-3 public test cases** per problem for local testing and debugging. Full evaluation (with all test cases) is performed on our servers.

#### What to Submit

@@ -174,7 +174,7 @@ Problem (e.g., gemm_optimization, poc_generation)

Each variant has a unique **Problem ID** based on its path under `research/`.

The full list of all evaluatable variants is in [`research/problems.txt`](research/problems.txt) (109 variants total, aggregated into ~50 categories for reporting).
The full list of all evaluatable variants is in [`research/scripts/problems.txt`](research/scripts/problems.txt).

| Type | Example Path | Problem ID |
|------|-------------|------------|
@@ -309,7 +309,9 @@ export GOOGLE_API_KEY=...

### Generate Solutions

#### Research Track (Python)
#### Research Track

Most research problems are Python, but some (e.g., `nbody_simulation`) require C++. The language is configured per-problem via `language` field in `config.yaml`.

```bash
# Generate one solution