
Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1214 (3-seed mean)#473

Closed
abaybektursun wants to merge 3 commits into openai:main from abaybektursun:submission/legal-ttt-1.1213

Conversation

@abaybektursun
Contributor

@abaybektursun abaybektursun commented Mar 22, 2026

Record: Legal TTT + Parallel Muon + Parameter Banking — val_bpb 1.1214

val_bpb = 1.1214 (legal score-first TTT, 3-seed mean, std 0.0009) | ~16.0 MB | 8×H100 SXM

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | Steps | Pre-TTT bpb | Post-TTT bpb | TTT gain | TTT eval time | Artifact (bytes) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1337 | 82.5 ms | 7,273 | 1.1224 | 1.1204 | -0.0020 | 413 s | 15,979,188 |
| 42 | 82.5 ms | 7,278 | 1.1237 | 1.1216 | -0.0021 | 406 s | 15,987,108 |
| 2025 | 82.5 ms | 7,270 | 1.1240 | 1.1221 | -0.0019 | 405 s | 15,988,312 |
| Mean | 82.5 ms | 7,274 | 1.1234 | 1.1214 (std 0.0009) | -0.0020 | ~408 s | |

Timing Budget

| Phase | Time |
| --- | --- |
| Training | 600 s (≤10 min on 8×H100) |
| Standard eval | ~120 s |
| Legal TTT eval | ~408 s |
| Total eval | ~528 s (<10 min) |

TTT Legality

Backward-looking, score-first TTT (PR #461 framework):

  1. Val tokens are split into 1,893 non-overlapping 32K-token chunks
  2. For each chunk:
    • SCORE: sliding-window eval under torch.inference_mode() (no gradients, no weight mutation)
    • TRAIN: SGD on the already-scored chunk (3 epochs, all blocks unfrozen, cosine LR, grad clip 1.0)
  3. The last chunk is scored but never trained on
  4. Chunk N is therefore scored by a model adapted only on chunks 0..N-1
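The causality argument in steps 2-4 can be checked with a minimal, framework-free sketch (the real loop runs PyTorch eval/train passes; `score_first_ttt` and its event log are illustrative names, not the PR's code):

```python
# Minimal sketch of the score-first TTT ordering: every chunk is scored
# strictly before any weight update on it, so chunk n is always evaluated
# by a model adapted only on chunks 0..n-1, and the last chunk is scored
# but never trained on.

def score_first_ttt(num_chunks):
    """Return a log of (phase, chunk, chunks_adapted_on_so_far) events."""
    adapted_on = []   # chunks the model has been trained on so far
    log = []
    for n in range(num_chunks):
        # SCORE first: the model has only seen chunks 0..n-1 at this point.
        log.append(("score", n, tuple(adapted_on)))
        # TRAIN on the chunk we just scored -- except the final chunk.
        if n < num_chunks - 1:
            adapted_on.append(n)
            log.append(("train", n, tuple(adapted_on)))
    return log

for event in score_first_ttt(4):
    print(event)
```

Every `score` event for chunk n carries exactly `(0, ..., n-1)` as the adapted set, which is what makes the protocol backward-looking and legal.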

Key improvements over initial submission

| Change | BPB impact |
| --- | --- |
| BigramHash 2048 → 3072 | -0.0005 |
| TTT freeze=2 → freeze=0 | -0.0004 |
| Combined | -0.0006 |

Based on 17 overnight experiments testing epochs (3/5/10/20), freeze depth (0/1/2), chunk size (16K/32K), LR (0.001/0.002/0.003), and optimizer (SGD vs AdamW).

Training Architecture

PR #414 stack + Parameter Banking + Parallel Muon (PR #399):

  • 11L, 512d, 8H/4KV, MLP 3× (relu²)
  • XSA on last 4 layers, Partial RoPE (16/64), LN Scale
  • SmearGate, BigramHash(3072), VE128 on layers 9-10
  • EMA(0.997) + Tight SWA(every 50), GPTQ-lite int6 + lzma
  • Parameter Banking + Parallel Muon (82.5ms/step)
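For reference, the bullet list above collected as a plain config dict (field names are my own shorthand and may not match the actual train_gpt.py flags):

```python
# Shorthand config mirroring the architecture bullets above; key names
# are hypothetical, the values are the ones stated in the PR.
config = {
    "n_layers": 11,
    "d_model": 512,
    "n_heads": 8,
    "n_kv_heads": 4,
    "mlp_mult": 3,           # relu^2 MLP, 3x width
    "xsa_last_n": 4,         # XSA on last 4 layers
    "rope_dims": 16,         # partial RoPE: 16 of 64 head dims
    "bigram_hash": 3072,
    "ve128_layers": (9, 10),
    "ema_decay": 0.997,
    "swa_every": 50,         # tight SWA interval
}
print(config["n_layers"], config["bigram_hash"])
```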

Credits

🤖 Generated with Claude Code

@mohosy

mohosy commented Mar 23, 2026

parameter banking is smart, batching the ns ortho steps like that is clean. curious about the legal ttt protocol too, scoring before updating makes way more sense than what most people have been doing

@abaybektursun abaybektursun force-pushed the submission/legal-ttt-1.1213 branch from 40de278 to a33f075 Compare March 23, 2026 00:25
@abaybektursun abaybektursun changed the title Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1213 Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1218 (2-seed mean) Mar 23, 2026
@abaybektursun abaybektursun force-pushed the submission/legal-ttt-1.1213 branch from a33f075 to f337da2 Compare March 23, 2026 00:46
@abaybektursun abaybektursun changed the title Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1218 (2-seed mean) Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1219 (3-seed mean) Mar 23, 2026
newjordan referenced this pull request in newjordan/parameter-golf Mar 23, 2026
Our v1 base (1.1232 pre-TTT) + legal TTT should give ~1.1211.
PR openai#473 gets 1.1213 from a 1.1234 base — our base is 0.0002 better.

TTT recipe: 32K-token chunks, score-first (inference_mode), then
train (SGD lr=0.002, momentum=0.9, 3 epochs, freeze blocks 0-1,
cosine LR decay across chunks, grad clip 1.0).

Removed TTT burst (replaced by legal TTT eval).
1499 lines (under 1500 limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Mar 23, 2026
Based on PR openai#414's exact train_gpt.py (11L, EMA, XSA, PartialRoPE,
LNScale, VE128, GPTQ-lite, QAT@0.15, warmdown=3500, int6+zstd-22).

Added legal score-first TTT from PR openai#461/openai#473 protocol:
- SGD + momentum 0.9, lr=0.002 with cosine decay
- 3 epochs per 32K token chunk
- Freeze blocks 0-1
- Score each chunk BEFORE training on it (inference_mode)
- Expected ~0.002 bpb improvement over base

Strategy shift: reproduce proven frontier instead of iterating on
our custom stack. PR openai#414 achieves 1.1233 on 3 seeds; adding
legal TTT should push to ~1.121.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun force-pushed the submission/legal-ttt-1.1213 branch from f337da2 to f54a596 Compare March 23, 2026 01:59
@abaybektursun abaybektursun changed the title Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1219 (3-seed mean) Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1220 (3-seed mean) Mar 23, 2026
@abaybektursun
Contributor Author

1. SGD+momentum is the right optimizer for legal TTT

AdamW (lr=0.0005 and lr=0.0001) performed dramatically worse — 1.1598 vs 1.1220 baseline. AdamW overfits to each chunk's distribution instead of generalizing to the next. SGD+momentum's implicit regularization is critical for the score-first constraint.
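To make the inner-loop recipe concrete, here is the plain SGD+momentum update rule in scalar form (lr=0.002 and mu=0.9 are the PR's values; the helper itself is just an illustration, the real run uses torch.optim.SGD on tensors):

```python
# One SGD+momentum step: v <- mu*v + g, then w <- w - lr*v.
# This is the whole inner optimizer of the legal TTT loop; there is no
# per-parameter adaptive scaling, which is the point of the comparison
# with AdamW above.

def sgd_momentum_step(w, g, v, lr=0.002, mu=0.9):
    v = mu * v + g
    w = w - lr * v
    return w, v

w, v = 1.0, 0.0
for g in [0.5, 0.5, 0.5]:   # three identical gradients accumulate momentum
    w, v = sgd_momentum_step(w, g, v)
print(round(w, 6))
```

Note how the velocity term keeps growing under repeated same-sign gradients; with mu=0.9 the effective step on a chunk can reach up to 10x the raw lr, without AdamW's per-coordinate rescaling.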

2. More epochs monotonically improve (but hit eval time limit)

| Epochs (chunk=32K, freeze=2) | BPB | TTT time |
| --- | --- | --- |
| 3 | 1.1220 | 400 s |
| 5 | 1.1213 | 551 s |
| 10 | 1.1205 | 925 s |
| 20 | 1.1202 | 1643 s |

BPB saturates around 1.1202, but the 10-minute eval limit means only 3-epoch configs (~400 s TTT + ~120 s standard eval) are compliant.

3. Freeze depth and epochs interact

| Epochs | freeze=0 | freeze=1 | freeze=2 |
| --- | --- | --- | --- |
| 3 | 1.1213 | 1.1216 | 1.1220 |
| 10 | 1.1210 | 1.1207 | 1.1205 |

At low epochs: no freezing is best (more params adapt per chunk). At high epochs: freezing 2 blocks prevents catastrophic forgetting.

4. Smaller chunks help

16K chunks consistently beat 32K at the same epoch count (~0.0003 BPB). More frequent adaptation cycles improve generalization.

5. Bigger BigramHash is a major win

| BIGRAM | BPB (10ep) | Artifact |
| --- | --- | --- |
| 2048 | 1.1205 | 15.84 MB |
| 3072 | 1.1200 | 15.99 MB |
| 4096 | 1.1187 | 16.10 MB (over the limit) |

BIGRAM=3072 gives -0.0005 and fits under 16MB. BIGRAM=4096 gives -0.0018 but exceeds the limit by 103KB.
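For readers unfamiliar with the component: a hashed bigram table plausibly buckets each (prev_token, token) pair into a fixed-size embedding table, so capacity scales with the table size being swept here. This is a hedged sketch of that idea, not the PR's implementation, and the hash function is my own choice:

```python
# Hypothetical sketch of a BigramHash lookup: hash the ordered token pair
# into [0, table_size) and use the bucket as an embedding index. Raising
# table_size (2048 -> 3072 -> 4096) reduces bucket collisions at the cost
# of a larger artifact, matching the size/BPB trade-off in the table above.

def bigram_bucket(prev_tok, tok, table_size=3072):
    # Simple multiplicative hash of the ordered pair.
    return (prev_tok * 1000003 + tok) % table_size

print(bigram_bucket(17, 42))
```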

6. XSA on all layers doesn't help

XSA_LAST_N=11 gave worse pre-TTT (1.1227 vs 1.1234) due to fewer steps from overhead. TTT partially compensated but net result was the same.

Best timing-compliant config (within 10 min eval)

BIGRAM=3072, 3ep, freeze=0: result pending (running now). Expected ~1.1208 based on interpolation.

Best unconstrained config

BIGRAM=3072, 10ep chunk=16K, freeze=2: 1.1192 (15.99 MB) — but requires ~27 min eval.

LR doesn't matter much

lr=0.002 and lr=0.003 gave identical results (1.1205). Cosine LR decay across chunks normalizes the effective learning rate.
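The normalization claim follows from the schedule shape. A minimal sketch of cosine LR decay across the chunk sequence (the exact schedule in the PR may differ; `chunk_lr` is an illustrative helper):

```python
import math

# Cosine LR decay across chunks: the per-chunk LR falls from lr0 at the
# first chunk to ~0 at the last. Because most of the adaptation budget is
# spent in the high-LR early chunks regardless of lr0, nearby choices of
# lr0 (0.002 vs 0.003) end up nearly equivalent after decay.

def chunk_lr(chunk_idx, num_chunks, lr0=0.002):
    return lr0 * 0.5 * (1.0 + math.cos(math.pi * chunk_idx / num_chunks))

print(round(chunk_lr(0, 1893), 6))   # first chunk runs at the full base LR
```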

…d mean)

Legal score-first TTT (PR openai#461 recipe) + BigramHash(3072) + freeze=0
on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).

3-seed results (BIGRAM=3072, 3ep, freeze=0, SGD+mom=0.9):
  Seed 1337: 1.1204 bpb, 413s TTT, 15.98 MB
  Seed 42:   1.1216 bpb, 406s TTT, 15.99 MB
  Seed 2025: 1.1221 bpb, 405s TTT, 15.99 MB
  Mean:      1.1214 (std 0.0009)

All artifacts under 16MB. All eval times under 600s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun force-pushed the submission/legal-ttt-1.1213 branch from ffe7c40 to 93e90ed Compare March 23, 2026 12:59
@abaybektursun abaybektursun changed the title Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1220 (3-seed mean) Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1214 (3-seed mean) Mar 23, 2026
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Mar 23, 2026
…RIDE

P1: TTT was running on the pre-quantization base_model instead of the
int6 round-tripped eval_model. This overstated TTT gains since the
artifact model has quantization noise. Now matches PR openai#473's approach.

P2: TTT hardcoded stride=64 instead of using args.eval_stride. Now
honors the configured stride so TTT results stay consistent with
the sliding window eval path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
Our v1 base (1.1232 pre-TTT) + legal TTT should give ~1.1211.
PR openai#473 gets 1.1213 from a 1.1234 base — our base is 0.0002 better.

TTT recipe: 32K-token chunks, score-first (inference_mode), then
train (SGD lr=0.002, momentum=0.9, 3 epochs, freeze blocks 0-1,
cosine LR decay across chunks, grad clip 1.0).

Removed TTT burst (replaced by legal TTT eval).
1499 lines (under 1500 limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@valerio-oai
Contributor

This looks valid, but it does not clear the SOTA at the time (PR #374) by the required 0.005 nats threshold. Good try, though!
