Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1214 (3-seed mean) #473
abaybektursun wants to merge 3 commits into openai:main from
Conversation
Force-pushed from 5aa9ee5 to 40de278
Parameter banking is smart; batching the NS ortho steps like that is clean. Curious about the legal TTT protocol too: scoring before updating makes far more sense than what most people have been doing.
Force-pushed from 40de278 to a33f075
Force-pushed from a33f075 to f337da2
Our v1 base (1.1232 pre-TTT) + legal TTT should give ~1.1211. PR openai#473 gets 1.1213 from a 1.1234 base — our base is 0.0002 better. TTT recipe: 32K-token chunks, score-first (inference_mode), then train (SGD lr=0.002, momentum=0.9, 3 epochs, freeze blocks 0-1, cosine LR decay across chunks, grad clip 1.0). Removed TTT burst (replaced by legal TTT eval). 1499 lines (under 1500 limit). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Based on PR openai#414's exact train_gpt.py (11L, EMA, XSA, PartialRoPE, LNScale, VE128, GPTQ-lite, QAT@0.15, warmdown=3500, int6+zstd-22). Added legal score-first TTT from the PR openai#461/openai#473 protocol:
- SGD + momentum 0.9, lr=0.002 with cosine decay
- 3 epochs per 32K-token chunk
- Freeze blocks 0-1
- Score each chunk BEFORE training on it (inference_mode)
- Expected ~0.002 bpb improvement over base

Strategy shift: reproduce the proven frontier instead of iterating on our custom stack. PR openai#414 achieves 1.1233 on 3 seeds; adding legal TTT should push to ~1.121.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
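The recipe above can be sketched as a short loop. This is a hypothetical illustration, not code from train_gpt.py: `model`, `chunks`, and the loss-returning `model(x, y)` signature are stand-ins.

```python
import math
import torch
import torch.nn as nn

def ttt_eval(model, chunks, n_epochs=3, base_lr=2e-3, freeze_blocks=2):
    """Score-first TTT: each chunk is scored before the model trains on it."""
    # Freeze the first `freeze_blocks` transformer blocks (blocks 0-1 in the PR).
    for block in list(model.blocks)[:freeze_blocks]:
        block.requires_grad_(False)
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=base_lr, momentum=0.9)

    total_loss, total_items = 0.0, 0
    for i, (x, y) in enumerate(chunks):
        # 1) Score the chunk BEFORE any update has seen it (the "legal" part).
        with torch.inference_mode():
            chunk_loss = model(x, y).item()
        total_loss += chunk_loss * x.size(0)
        total_items += x.size(0)

        # 2) Cosine LR decay across chunks, then train on the just-scored chunk.
        lr = base_lr * 0.5 * (1.0 + math.cos(math.pi * i / max(1, len(chunks) - 1)))
        for group in opt.param_groups:
            group["lr"] = lr
        for _ in range(n_epochs):
            opt.zero_grad(set_to_none=True)
            model(x, y).backward()
            nn.utils.clip_grad_norm_(params, 1.0)  # grad clip 1.0
            opt.step()
    return total_loss / total_items
```

Because each chunk contributes its pre-update loss, no chunk's score ever benefits from training on that same chunk; training only helps subsequent chunks.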
Force-pushed from f337da2 to f54a596
1. SGD+momentum is the right optimizer for legal TTT
AdamW (lr=0.0005 and lr=0.0001) performed dramatically worse: 1.1598 vs the 1.1220 baseline. AdamW overfits to each chunk's distribution instead of generalizing to the next. SGD+momentum's implicit regularization is critical for the score-first constraint.

2. More epochs monotonically improve (but hit the eval time limit)
Saturates around 1.1202. But the 10-minute eval limit means only 3-epoch configs (~400s TTT + ~120s standard eval) are compliant.

3. Freeze depth and epochs interact
At low epochs, no freezing is best (more params adapt per chunk). At high epochs, freezing 2 blocks prevents catastrophic forgetting.

4. Smaller chunks help
16K chunks consistently beat 32K at the same epoch count (~0.0003 BPB). More frequent adaptation cycles improve generalization.

5. Bigger BigramHash is a major win
BIGRAM=3072 gives -0.0005 and fits under 16MB. BIGRAM=4096 gives -0.0018 but exceeds the limit by 103KB.

6. XSA on all layers doesn't help
XSA_LAST_N=11 gave worse pre-TTT results (1.1227 vs 1.1234) due to fewer steps from the overhead. TTT partially compensated, but the net result was the same.

Best timing-compliant config (within 10 min eval)
BIGRAM=3072, 3ep, freeze=0: result pending (running now). Expected ~1.1208 based on interpolation.

Best unconstrained config
BIGRAM=3072, 10ep, chunk=16K, freeze=2: 1.1192 (15.99 MB), but requires ~27 min eval.

LR doesn't matter much
lr=0.002 and lr=0.003 gave identical results (1.1205). Cosine LR decay across chunks normalizes the effective learning rate.
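The timing constraint in finding 2 can be checked with back-of-envelope arithmetic. The per-epoch TTT cost below is an assumption: a linear extrapolation from the ~400 s / 3-epoch figure.

```python
# Eval-time budget check for TTT configs (all figures from the findings above).
EVAL_BUDGET_S = 600          # 10-minute eval limit
STANDARD_EVAL_S = 120        # ~120 s standard sliding-window eval
TTT_S_PER_EPOCH = 400 / 3    # ~133 s/epoch, extrapolated from the 3-epoch run

def fits_budget(epochs: int) -> bool:
    """True if TTT at this epoch count plus the standard eval fits in 600 s."""
    return epochs * TTT_S_PER_EPOCH + STANDARD_EVAL_S <= EVAL_BUDGET_S
```

Under these assumptions, 3 epochs (~520 s total) fits the budget while 5, 10, and 20 epochs do not, which matches the claim that only 3-epoch configs are compliant.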
…d mean)

Legal score-first TTT (PR openai#461 recipe) + BigramHash(3072) + freeze=0 on the openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).

3-seed results (BIGRAM=3072, 3ep, freeze=0, SGD+mom=0.9):
- Seed 1337: 1.1204 bpb, 413s TTT, 15.98 MB
- Seed 42: 1.1216 bpb, 406s TTT, 15.99 MB
- Seed 2025: 1.1221 bpb, 405s TTT, 15.99 MB
- Mean: 1.1214 (std 0.0009)

All artifacts under 16MB. All eval times under 600s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from ffe7c40 to 93e90ed
…RIDE

P1: TTT was running on the pre-quantization base_model instead of the int6 round-tripped eval_model. This overstated TTT gains, since the artifact model has quantization noise. Now matches PR openai#473's approach.

P2: TTT hardcoded stride=64 instead of using args.eval_stride. Now honors the configured stride, so TTT results stay consistent with the sliding-window eval path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
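A minimal sketch of the P1 fix: round-trip the weights through int6 before running TTT, so TTT gains are measured on the model the artifact actually contains. The symmetric per-tensor scheme below (absmax scale, range [-31, 31]) is an assumption for illustration; the GPTQ-lite/int6 pipeline in train_gpt.py is more involved.

```python
import copy
import torch

def int6_roundtrip_(model):
    """Replace each parameter with its quantize-dequantize image, in place.

    Assumed scheme: symmetric per-tensor int6, absmax scale, levels [-31, 31].
    """
    with torch.no_grad():
        for p in model.parameters():
            scale = p.abs().max().clamp(min=1e-8) / 31.0
            q = torch.round(p / scale).clamp_(-31, 31)
            p.copy_(q * scale)
    return model

# Usage (hypothetical names): build the eval model from a quantized copy,
# then run the score-first TTT loop on it, not on base_model.
#   eval_model = int6_roundtrip_(copy.deepcopy(base_model))
```

The round-trip error per tensor is bounded by half a quantization step, which is exactly the "quantization noise" the commit message says TTT must be evaluated under.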
This looks valid, but it does not clear the SOTA at the time (PR #374) by the required 0.005 nats threshold. Good try, though!
Record: Legal TTT + Parallel Muon + Parameter Banking — val_bpb 1.1214
val_bpb = 1.1214 (legal score-first TTT, 3-seed mean, std 0.0009) | ~16.0 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

Seed 1337: 1.1204 bpb | 413s TTT | 15.98 MB
Seed 42: 1.1216 bpb | 406s TTT | 15.99 MB
Seed 2025: 1.1221 bpb | 405s TTT | 15.99 MB
Mean: 1.1214 bpb (std 0.0009)
Timing Budget

~400s TTT + ~120s standard eval per seed, under the 600s (10-minute) eval limit.
TTT Legality
Backward-looking, score-first TTT (PR #461 framework):
- Each chunk is scored under torch.inference_mode() (no gradients, no weight mutation) before the model trains on it

Key improvements over initial submission
Based on 17 overnight experiments testing epochs (3/5/10/20), freeze depth (0/1/2), chunk size (16K/32K), LR (0.001/0.002/0.003), and optimizer (SGD vs AdamW).
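The axes of that sweep can be written down explicitly. Which 17 points of the full product were actually run is not stated (and AdamW was also tried at its own learning rates), so this only enumerates the search space:

```python
# Hyperparameter axes of the overnight TTT sweep described above.
from itertools import product

GRID = {
    "epochs": [3, 5, 10, 20],
    "freeze": [0, 1, 2],
    "chunk": [16_384, 32_768],       # 16K vs 32K tokens
    "lr": [1e-3, 2e-3, 3e-3],
    "optimizer": ["sgd", "adamw"],
}
configs = [dict(zip(GRID, vals)) for vals in product(*GRID.values())]
# Full product: 4 * 3 * 2 * 3 * 2 = 144 configs; the sweep sampled 17 of them.
```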
Training Architecture
PR #414 stack + Parameter Banking + Parallel Muon (PR #399):
Credits
🤖 Generated with Claude Code