Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
72 changes: 72 additions & 0 deletions records/track_10min_16mb/2026-03-24_PROTEUS_v9/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
# PROTEUS v9 — Parameter Golf Submission

**Built with [PROTEUS](https://lightspeedup.com) by LightSpeedUp**

## Result

**Mean val_bpb: 1.1526** (3 seeds, std: 0.0004)

| Seed | Post-Quant BPB | TTT BPB (1 epoch) | Steps | Step Avg |
|------|---------------|-------------------|-------|----------|
| 42 | 1.1804 | 1.1527 | 6989 | 85.7ms |
| 1337 | 1.1749 | 1.1529 | 6997 | 85.8ms |
| 2024 | 1.1771 | 1.1522 | 7093 | 84.6ms |

## TTT Legality — Single Epoch, Score-Then-Train

This submission uses **single-epoch TTT** (`TTT_EPOCHS=1`), addressing the ruling on PR #568 ([comment](https://github.com/openai/parameter-golf/pull/568#issuecomment-4119903415)) where multi-epoch TTT was correctly identified as training on eval data.

**Why single-epoch is different:**

In single-epoch, each chunk is processed left-to-right within the document:
1. Forward pass on chunk → **score** (accumulate loss for BPB)
2. **Train** LoRA on that chunk's loss (backward-looking)
3. Advance to next chunk

Each token is scored **exactly once**, **before** being trained on. The LoRA adapts to the document's distribution but never scores tokens it has already trained on. This is the same score-then-train pattern as PR #77 (merged), applied once per document.

**What changed from v7/v8:** `TTT_EPOCHS` reduced from 2-5 to 1. No other code changes.

## Architecture

- 11 transformer layers, dim=512, 8 heads / 4 KV heads (GQA)
- MLP 3x expansion (1536 hidden), relu² activation
- SmearGate + BigramHash(2048, dim=128) + OrthoInit
- Depth-scaled residual: `1/sqrt(layer_idx + 1)` per block
- U-Net skip connections, tied embeddings
- RoPE base 50K with NTK-aware eval scaling
- XSA on last 4 layers
- 26.8M parameters

## Training

- Muon optimizer (matrix_lr=0.025, WD=0.04, momentum=0.99)
- AdamW for embeddings/scalars (WD=0.04)
- Batch size: 786,432 tokens
- Warmdown: 3000 iterations, wallclock-based
- EMA 0.997 every step
- 3% magnitude pruning before export
- Gradient clipping: 0.3

## Quantization

- **INT6 GPTQ-lite** — 5 clip percentiles per row, pick lowest MSE
- FP16 for tied embeddings
- FP32 for control tensors (scales, mixes, gains)
- zstd-22 compression
- Artifact: ~15.4 MB (96.3% of 16MB budget)
- Quant gap: 0.006 BPB

## Test-Time Training (TTT)

- **Single epoch** — each token scored once before training
- LoRA rank 8, targets: Q + V projections + LM head
- Optimizer: Adam (lr=0.01, betas 0.9/0.95)
- Batch: 64 documents (independent LoRA per document)
- Min document length: 512 tokens
- Eval time: ~110-115s (within 600s budget)
- TTT gain: ~0.025 BPB over post-quant baseline

## Platform

Trained on RunPod 8xH100 SXM, PyTorch 2.8.0+cu128.
18 changes: 18 additions & 0 deletions records/track_10min_16mb/2026-03-24_PROTEUS_v9/submission.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
{
"author": "Mato (LightSpeedUp)",
"github_id": "MatoTeziTanka",
"name": "PROTEUS v9",
"blurb": "11L, INT6 GPTQ-lite, depth-scaled residual, single-epoch score-then-train LoRA TTT (batch=64). Built with PROTEUS by LightSpeedUp — lightspeedup.com",
"date": "2026-03-24T12:00:00Z",
"val_loss": 1.9461,
"val_bpb": 1.1526,
"bytes_total": 15408603,
"bytes_code": 71033,
"seeds": {
"42": {"val_bpb": 1.1527, "ttt_epochs": 1, "ttt_min_doc": 512},
"1337": {"val_bpb": 1.1529, "ttt_epochs": 1, "ttt_min_doc": 512},
"2024": {"val_bpb": 1.1522, "ttt_epochs": 1, "ttt_min_doc": 512}
},
"mean_val_bpb": 1.1526,
"std_val_bpb": 0.0004
}
Loading