
Non-record: 1.1354 BPB — 10L TTT 22ep AdamW Cosine + LeakyReLU(0.5)² + TrigramHash #562

Open
bigbag wants to merge 1 commit into openai:main from bigbag:submission/10L-ttt22-leaky-relu-trigram-1.1354

Conversation


@bigbag bigbag commented Mar 23, 2026

Summary

  • val_bpb: 1.1354 (single seed, 8xH100 SXM)
  • Artifact: 15.35 MB (under 16MB limit)
  • Training: 600s | Eval (TTT + sliding window): 603s

Techniques

Architecture (10L, 512d, GQA 8/4)

  • Value Residual (ResFormer-style layer-0 V mixing, -0.015 BPB)
  • Gated Attention (per-head sigmoid gates, -0.003 BPB)
  • XSA on last 4 layers (-0.005 BPB)
  • LeakyReLU(0.5)² activation — preserves negative gradient flow (-0.003 BPB vs ReLU²)
  • TrigramHash — extends BigramHash to 3-token context via XOR hashing into shared embedding table
  • SmearGate, LN Scale (depth-scaled residuals), U-Net skip connections
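Two of the less standard pieces above can be sketched in PyTorch. This is a sketch, not the PR's code: the sign-preserving square in `LeakyReLUSquared` and the multiplicative mixing constants in `TrigramHash` are assumptions — the PR states only the 0.5 slope and that 3-token contexts are XOR-hashed into the shared BigramHash embedding table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeakyReLUSquared(nn.Module):
    """LeakyReLU(0.5)² sketch: square the magnitude but keep the sign, so
    negative inputs retain gradient flow. Sign preservation is an assumption;
    the PR only gives the 0.5 slope."""
    def forward(self, x):
        y = F.leaky_relu(x, negative_slope=0.5)
        return y * y.abs()

class TrigramHash(nn.Module):
    """Hash the current token and its two predecessors into a shared
    embedding table (assumes seq len >= 2). The mixing constants are
    illustrative; the PR specifies only XOR hashing into the BigramHash
    table."""
    def __init__(self, table: nn.Embedding):
        super().__init__()
        self.table = table  # shared with BigramHash => zero extra parameters

    def forward(self, ids):  # ids: (batch, seq) int64 token ids
        pad = torch.zeros_like(ids[:, :1])        # pad context with token id 0
        prev1 = torch.cat([pad, ids[:, :-1]], dim=1)
        prev2 = torch.cat([pad, pad, ids[:, :-2]], dim=1)
        # XOR-mix the trigram into one index, folded into the table size
        h = (ids * 0x9E3779B1) ^ (prev1 * 0x85EBCA77) ^ (prev2 * 0xC2B2AE3D)
        return self.table(h % self.table.num_embeddings)
```

Because `TrigramHash` indexes the same `nn.Embedding` as the bigram path, it adds lookup cost but no parameters, which matters under the 16MB artifact limit.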

Training

Quantization

  • Mixed int5 (MLP) / int6 (attention) + zstd-22
  • 3% magnitude pruning, FP16 passthrough for embeddings
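A minimal sketch of the weight-compression path above: magnitude pruning followed by symmetric integer quantization. Per-tensor scaling is an assumption (the PR does not say whether scales are per-tensor or per-channel); zstd-22 compression and FP16 embedding passthrough happen at serialization and are omitted here.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int, prune_frac: float = 0.03):
    """Zero the smallest-magnitude `prune_frac` of weights, then quantize
    symmetrically to `bits`-bit signed levels with one per-tensor scale.
    Values are stored in int8 here; a real artifact would bit-pack them
    before zstd compression."""
    w = w.copy()
    k = int(prune_frac * w.size)
    if k:
        thresh = np.partition(np.abs(w).ravel(), k)[k]
        w[np.abs(w) < thresh] = 0.0          # 3% magnitude pruning
    qmax = 2 ** (bits - 1) - 1               # 15 for int5, 31 for int6
    scale = (float(np.abs(w).max()) / qmax) or 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# Mixed precision per the PR: bits=5 for MLP weights, bits=6 for attention.
```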

Test-Time Training (22 epochs, within eval budget)

  • AdamW (lr=0.0005, wd=0.0) with per-step cosine LR decay to 0
  • Per-layer LR groups: 3x for output projections, 0.5x for input projections
  • Batched 32 sequences per GPU, distributed gradient sync via all_reduce
  • Gradient clipping at 1.0
  • TTT time: 406s, eval time: 197s
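The TTT recipe above can be sketched as a single loop. The parameter-name patterns used for the LR groups (`out_proj`, `in_proj`) are illustrative — the PR does not show its grouping keys — and the multi-GPU gradient all_reduce appears only as a comment so the sketch runs single-process.

```python
import math
import torch
import torch.nn.functional as F

def ttt_finetune(model, batches, epochs=22, base_lr=5e-4):
    """Test-time training sketch: AdamW (wd=0), per-layer LR multipliers
    (3x output projections, 0.5x input projections), per-step cosine LR
    decay to 0, gradient clipping at 1.0."""
    named = list(model.named_parameters())
    groups = [
        {"params": [p for n, p in named if "out_proj" in n], "lr": 3.0 * base_lr},
        {"params": [p for n, p in named if "in_proj" in n], "lr": 0.5 * base_lr},
        {"params": [p for n, p in named
                    if "out_proj" not in n and "in_proj" not in n],
         "lr": base_lr},
    ]
    opt = torch.optim.AdamW(groups, weight_decay=0.0)
    init_lrs = [g["lr"] for g in opt.param_groups]
    total_steps = epochs * len(batches)
    step = 0
    for _ in range(epochs):
        for x, y in batches:
            loss = F.cross_entropy(model(x).flatten(0, 1), y.flatten())
            opt.zero_grad()
            loss.backward()
            # Multi-GPU TTT would all_reduce (average) grads HERE, every
            # step -- syncing parameters only at the end diverges.
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            step += 1  # per-STEP (not per-epoch) cosine decay to 0
            decay = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
            for g, lr0 in zip(opt.param_groups, init_lrs):
                g["lr"] = lr0 * decay
            opt.step()
    return model
```

Scaling every group's LR by the same cosine factor preserves the 3x/0.5x ratios throughout the decay.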

Key Findings

  1. Batched TTT (32 seqs/GPU) is ~500x faster than chunk-based (1 seq × 256 tokens)
  2. Per-step cosine decay (not per-epoch) prevents overfitting at high epoch counts
  3. Gradient sync per step (all_reduce on grads) is critical for multi-GPU TTT — syncing parameters at the end causes divergence
  4. Per-layer LR groups compensate for uneven quantization damage (output projections are most affected)
  5. LeakyReLU(0.5)² gives consistent -0.003 BPB improvement
  6. TrigramHash reuses BigramHash's embedding table for zero extra parameters
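Finding 3 amounts to averaging gradients across ranks after every backward pass, before the optimizer step. A minimal sketch (assuming a process group is already initialized in the multi-GPU case; it degrades to a no-op single-process):

```python
import torch
import torch.distributed as dist

def sync_grads(model):
    """Average gradients across all ranks once per optimizer step.
    Called between backward() and step(); with no process group
    initialized (single process) it does nothing."""
    if not (dist.is_available() and dist.is_initialized()):
        return
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
```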

@bigbag bigbag changed the title from "Record: 1.1354 BPB — 10L TTT 22ep AdamW Cosine + LeakyReLU(0.5)² + TrigramHash" to "Non-record: 1.1354 BPB — 10L TTT 22ep AdamW Cosine + LeakyReLU(0.5)² + TrigramHash" on Mar 23, 2026
@bigbag bigbag force-pushed the submission/10L-ttt22-leaky-relu-trigram-1.1354 branch 2 times, most recently from 76c6512 to 5dca5af on March 23, 2026 18:12
…igramHash

Architecture: 10L 512d GQA 8/4, Value Residual, Gated Attention, XSA4,
LN Scale, LeakyReLU(0.5)², TrigramHash + BigramHash (shared table),
SmearGate, SWA 27ckpts, Late QAT 0.5, int5-MLP/int6-attn + zstd-22.

TTT: 22-epoch AdamW with per-step cosine LR decay, per-layer LR groups
(3x proj, 0.5x fc), batched 32 seqs/GPU, grad sync + clip 1.0.

Result: val_bpb=1.1354, artifact=15.35MB, train=600s, eval=603s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bigbag bigbag force-pushed the submission/10L-ttt22-leaky-relu-trigram-1.1354 branch from 5dca5af to c7a96b3 on March 23, 2026 18:13
