Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1214 (3-seed mean) #473
abaybektursun wants to merge 3 commits into openai:main from
Conversation
Force-pushed from 5aa9ee5 to 40de278
Parameter banking is smart; batching the NS ortho steps like that is clean. Curious about the legal TTT protocol too: scoring before updating makes far more sense than what most people have been doing.
Force-pushed from 40de278 to a33f075
Force-pushed from a33f075 to f337da2
Our v1 base (1.1232 pre-TTT) + legal TTT should give ~1.1211. PR openai#473 gets 1.1213 from a 1.1234 base — our base is 0.0002 better. TTT recipe: 32K-token chunks, score-first (inference_mode), then train (SGD lr=0.002, momentum=0.9, 3 epochs, freeze blocks 0-1, cosine LR decay across chunks, grad clip 1.0). Removed TTT burst (replaced by legal TTT eval). 1499 lines (under 1500 limit). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Based on PR openai#414's exact train_gpt.py (11L, EMA, XSA, PartialRoPE, LNScale, VE128, GPTQ-lite, QAT@0.15, warmdown=3500, int6+zstd-22). Added legal score-first TTT from the PR openai#461/openai#473 protocol:
- SGD + momentum 0.9, lr=0.002 with cosine decay
- 3 epochs per 32K-token chunk
- Freeze blocks 0-1
- Score each chunk BEFORE training on it (inference_mode)
- Expected ~0.002 bpb improvement over base

Strategy shift: reproduce the proven frontier instead of iterating on our custom stack. PR openai#414 achieves 1.1233 on 3 seeds; adding legal TTT should push to ~1.121.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
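The recipe above can be sketched as a short loop. This is a hypothetical illustration, not code from train_gpt.py: `model`, `chunks`, and the loss-returning `model(x, y)` signature are stand-ins.

```python
import math
import torch
import torch.nn as nn

def ttt_eval(model, chunks, n_epochs=3, base_lr=2e-3, freeze_blocks=2):
    """Score-first TTT: each chunk is scored before the model trains on it."""
    # Freeze the first `freeze_blocks` transformer blocks (blocks 0-1 in the PR).
    for block in list(model.blocks)[:freeze_blocks]:
        block.requires_grad_(False)
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=base_lr, momentum=0.9)

    total_loss, total_items = 0.0, 0
    for i, (x, y) in enumerate(chunks):
        # 1) Score the chunk BEFORE any update has seen it (the "legal" part).
        with torch.inference_mode():
            chunk_loss = model(x, y).item()
        total_loss += chunk_loss * x.size(0)
        total_items += x.size(0)

        # 2) Cosine LR decay across chunks, then train on the just-scored chunk.
        lr = base_lr * 0.5 * (1.0 + math.cos(math.pi * i / max(1, len(chunks) - 1)))
        for group in opt.param_groups:
            group["lr"] = lr
        for _ in range(n_epochs):
            opt.zero_grad(set_to_none=True)
            model(x, y).backward()
            nn.utils.clip_grad_norm_(params, 1.0)  # grad clip 1.0
            opt.step()
    return total_loss / total_items
```

Because each chunk contributes its pre-update loss, no chunk's score ever benefits from training on that same chunk; training only helps subsequent chunks.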
Force-pushed from f337da2 to f54a596
1. SGD+momentum is the right optimizer for legal TTT
AdamW (lr=0.0005 and lr=0.0001) performed dramatically worse: 1.1598 vs the 1.1220 baseline. AdamW overfits to each chunk's distribution instead of generalizing to the next. SGD+momentum's implicit regularization is critical for the score-first constraint.

2. More epochs monotonically improve (but hit the eval time limit)
Saturates around 1.1202. But the 10-minute eval limit means only 3-epoch configs (~400s TTT + ~120s standard eval) are compliant.

3. Freeze depth and epochs interact
At low epochs, no freezing is best (more params adapt per chunk). At high epochs, freezing 2 blocks prevents catastrophic forgetting.

4. Smaller chunks help
16K chunks consistently beat 32K at the same epoch count (~0.0003 BPB). More frequent adaptation cycles improve generalization.

5. Bigger BigramHash is a major win
BIGRAM=3072 gives -0.0005 and fits under 16MB. BIGRAM=4096 gives -0.0018 but exceeds the limit by 103KB.

6. XSA on all layers doesn't help
XSA_LAST_N=11 gave worse pre-TTT results (1.1227 vs 1.1234) due to fewer steps from the overhead. TTT partially compensated, but the net result was the same.

Best timing-compliant config (within 10 min eval)
BIGRAM=3072, 3ep, freeze=0: result pending (running now). Expected ~1.1208 based on interpolation.

Best unconstrained config
BIGRAM=3072, 10ep, chunk=16K, freeze=2: 1.1192 (15.99 MB), but requires ~27 min eval.

LR doesn't matter much
lr=0.002 and lr=0.003 gave identical results (1.1205). Cosine LR decay across chunks normalizes the effective learning rate.
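The timing constraint in finding 2 can be checked with back-of-envelope arithmetic. The per-epoch TTT cost below is an assumption: a linear extrapolation from the ~400 s / 3-epoch figure.

```python
# Eval-time budget check for TTT configs (all figures from the findings above).
EVAL_BUDGET_S = 600          # 10-minute eval limit
STANDARD_EVAL_S = 120        # ~120 s standard sliding-window eval
TTT_S_PER_EPOCH = 400 / 3    # ~133 s/epoch, extrapolated from the 3-epoch run

def fits_budget(epochs: int) -> bool:
    """True if TTT at this epoch count plus the standard eval fits in 600 s."""
    return epochs * TTT_S_PER_EPOCH + STANDARD_EVAL_S <= EVAL_BUDGET_S
```

Under these assumptions, 3 epochs (~520 s total) fits the budget while 5, 10, and 20 epochs do not, which matches the claim that only 3-epoch configs are compliant.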
…d mean)

Legal score-first TTT (PR openai#461 recipe) + BigramHash(3072) + freeze=0 on the openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).

3-seed results (BIGRAM=3072, 3ep, freeze=0, SGD+mom=0.9):
- Seed 1337: 1.1204 bpb, 413s TTT, 15.98 MB
- Seed 42: 1.1216 bpb, 406s TTT, 15.99 MB
- Seed 2025: 1.1221 bpb, 405s TTT, 15.99 MB
- Mean: 1.1214 (std 0.0009)

All artifacts under 16MB. All eval times under 600s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from ffe7c40 to 93e90ed
…RIDE

P1: TTT was running on the pre-quantization base_model instead of the int6 round-tripped eval_model. This overstated TTT gains, since the artifact model has quantization noise. Now matches PR openai#473's approach.

P2: TTT hardcoded stride=64 instead of using args.eval_stride. Now honors the configured stride, so TTT results stay consistent with the sliding-window eval path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
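A minimal sketch of the P1 fix: round-trip the weights through int6 before running TTT, so TTT gains are measured on the model the artifact actually contains. The symmetric per-tensor scheme below (absmax scale, range [-31, 31]) is an assumption for illustration; the GPTQ-lite/int6 pipeline in train_gpt.py is more involved.

```python
import copy
import torch

def int6_roundtrip_(model):
    """Replace each parameter with its quantize-dequantize image, in place.

    Assumed scheme: symmetric per-tensor int6, absmax scale, levels [-31, 31].
    """
    with torch.no_grad():
        for p in model.parameters():
            scale = p.abs().max().clamp(min=1e-8) / 31.0
            q = torch.round(p / scale).clamp_(-31, 31)
            p.copy_(q * scale)
    return model

# Usage (hypothetical names): build the eval model from a quantized copy,
# then run the score-first TTT loop on it, not on base_model.
#   eval_model = int6_roundtrip_(copy.deepcopy(base_model))
```

The round-trip error per tensor is bounded by half a quantization step, which is exactly the "quantization noise" the commit message says TTT must be evaluated under.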
This looks valid, but it does not clear the SOTA at the time (PR #374) by the required 0.005 nats threshold. Good try, though!
Record: Legal TTT + Parallel Muon + Parameter Banking — val_bpb 1.1214
val_bpb = 1.1214 (legal score-first TTT, 3-seed mean, std 0.0009) | ~16.0 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

Seed 1337: 1.1204 bpb | 413s TTT | 15.98 MB
Seed 42: 1.1216 bpb | 406s TTT | 15.99 MB
Seed 2025: 1.1221 bpb | 405s TTT | 15.99 MB
Mean: 1.1214 bpb (std 0.0009)
Timing Budget

~400s TTT + ~120s standard eval per seed, under the 600s (10-minute) eval limit.
TTT Legality
Backward-looking, score-first TTT (PR #461 framework):
- Each chunk is scored under torch.inference_mode() (no gradients, no weight mutation) before the model trains on it

Key improvements over initial submission
Based on 17 overnight experiments testing epochs (3/5/10/20), freeze depth (0/1/2), chunk size (16K/32K), LR (0.001/0.002/0.003), and optimizer (SGD vs AdamW).
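The axes of that sweep can be written down explicitly. Which 17 points of the full product were actually run is not stated (and AdamW was also tried at its own learning rates), so this only enumerates the search space:

```python
# Hyperparameter axes of the overnight TTT sweep described above.
from itertools import product

GRID = {
    "epochs": [3, 5, 10, 20],
    "freeze": [0, 1, 2],
    "chunk": [16_384, 32_768],       # 16K vs 32K tokens
    "lr": [1e-3, 2e-3, 3e-3],
    "optimizer": ["sgd", "adamw"],
}
configs = [dict(zip(GRID, vals)) for vals in product(*GRID.values())]
# Full product: 4 * 3 * 2 * 3 * 2 = 144 configs; the sweep sampled 17 of them.
```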
Training Architecture
PR #414 stack + Parameter Banking + Parallel Muon (PR #399):
Credits
🤖 Generated with Claude Code