Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622) by sofiabod · Pull Request #518 · openai/parameter-golf

sofiabod · 2026-03-23T09:35:11Z

Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622)

Summary

val_bpb 1.0622 (seed 1337, 50ep cosine TTT) — beats prior best validated 1.0672 (Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672) #462) by -0.005
3-seed mean at 30ep: 1.0814 ± 0.0014
Full Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233) #414 frontier stack + LeakyReLU(0.5)² activation + 50-epoch cosine TTT with per-layer LR groups
Training: ~5880 steps in 600s. Eval: TTT ~890s + sliding window ~311s (~20 min total eval)

Approach

11-layer d=512 transformer with full SOTA technique stack, adapted from PR #414 with key improvements.

Architecture (from #414):

11L, d=512, 8/4 GQA heads, MLP 3x, tied embeddings (vocab 1024)
XSA on last 4 layers, Partial RoPE (16/64 dims), LN Scale 1/sqrt(layer+1)
BigramHash(2048,128) + SmearGate + OrthoInit + VE128 (layers 9,10)
U-Net skip connections

Our improvements over #414:

LeakyReLU(0.5)² instead of ReLU² — preserves negative gradient flow, -0.003 BPB
50-epoch cosine TTT with per-layer LR (from Record: Cosine TTT scheduling with per-layer lr — mean val_bpb=1.0970 (3 seeds) #481): AdamW lr=0.0005, cosine decay, 3x for mlp.proj, 0.5x for mlp.fc

Quantization: Int6 + GPTQ-lite + zstd-22, EMA(0.997), Tight SWA, Late QAT@0.15

TTT recipe (from PR #481):

50 epochs AdamW(lr=0.0005, wd=0.0) on validation tokens
Cosine LR decay: lr *= 0.5 * (1 + cos(π * progress))
Per-layer LR: mlp.proj 3× (high quant error recovery), mlp.fc 0.5×
DDP gradient sync + grad clip 1.0
All parameters unfrozen

Results

Config	val_bpb
No TTT (LeakyReLU base)	1.1271
30ep cosine TTT (seed 1337)	1.0804
30ep cosine TTT (3-seed mean)	1.0814 ± 0.0014
50ep cosine TTT (seed 1337)	1.0622

Comparison

Metric	#462 (GEPA+TTT)	#414 (no TTT)	Ours
BPB	1.0672	1.1233	1.0622
Architecture	GEPA AI-discovered	Standard	Standard + LeakyReLU
TTT	AdamW pre-eval	None	Cosine 50ep pre-eval
Layers	11	11	11

Run command

TTT_EPOCHS=50 SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py

- add BigramHash(2048,128) with zero-init and learnable scale - add SmearGate: per-dim gate blending with prev token - weight decay 0.04 on Muon (leaderboard standard) - muon_momentum 0.99 (from 0.95, leaderboard standard) - best config baked in: 7L mlp_mult=3 seq_len=4096 etc - bigram/smear params explicitly added to optimizer groups

- add forward_logits() method to GPT for eval without loss computation - add eval_val_sliding() with configurable stride (default 64) - each scored token gets ~4032 tokens of context instead of ~2048 average - eval-only change: no training modifications, no artifact size change - expected ~0.03 BPB improvement in reported score

- init attn_scale and mlp_scale to 1/sqrt(layer_idx+1) instead of 1.0 - deeper layers get smaller residual contributions, stabilizes training - zero extra params, zero compute overhead - used by all top submissions per vault research

- apply rotary embeddings to first 16 dims of 64 head_dim (25%) - remaining 48 dims are position-free, improving generalization - zero extra params, used by all top submissions per vault research - configurable via ROPE_DIMS env var (0=all, default=16)

- TTT: 5 epochs at lr=0.0005 (matching SOTA PR openai#442) - use DDP model for TTT forward pass to sync gradients across GPUs - shard validation tokens across ranks for proper distributed TTT - batch size 4 seqs/GPU, modal timeout 1800s

- legal score-first TTT: score chunk, then adapt on scored tokens (1 seq to avoid OOM) - SGD+momentum, freeze early 2 blocks, 3 epochs, lr=0.005, adapt every 4 batches - GPTQ-lite: test 5 clip percentiles per row, pick best MSE - Tight SWA: collect 12 checkpoints when lr_scale<0.2, average before export - int8 with SWA+GPTQ: 1.1787 (improved from 1.1802)

- 11 layers, XSA on last 4, int6 quantization + zstd-22 - EMA(0.997), GPTQ-lite, Tight SWA, Late QAT@0.15 - Partial RoPE 16/64, LN Scale 1/sqrt(layer+1) - SmearGate + BigramHash(2048,128), VE128 on layers 9,10 - Muon WD=0.04, momentum=0.99, matrix_lr=0.025 - SDPA fallback (no FA3), batch 786K, seq 2048 - add zstandard to Modal image

- flash-attn requires GPU for compilation, Modal builds without GPU - keeping SDPA fallback, ~101ms/step - still have FA3 import attempt in code for when it becomes available

- attempt flash-attn pip install at runtime with 120s timeout - still falls back to SDPA if install fails - 101ms/step with SDPA, ~84ms with FA3

- replace relu(x)^2 with leaky_relu(x, 0.5)^2 - PR openai#493 reaches 1.1309 with partial stack using this activation - untried on full openai#414 stack — could give -0.002 to -0.005 BPB - zero param cost, zero speed overhead

…enai#486) - 30 epochs AdamW(lr=0.0005) on val tokens with cosine LR decay - per-layer LR: 3x for mlp.proj (high quant error), 0.5x for mlp.fc - DDP gradient sync via all_reduce(AVG) + grad clip 1.0 - keep LeakyReLU(0.5)^2 from exp48 - expected: ~0.06 BPB gain (1.127 → ~1.07) - modal timeout 3600s for 30-epoch TTT

11L, XSA all layers, partial RoPE 16/64, LN scale, VE128 (layers 9,10), LeakyReLU(0.5)² activation, BigramHash(2048), INT6+zstd-22. Legal score-first TTT: 32K chunks, all blocks, SGD(0.002,mom=0.9), 3ep. Base: PR openai#503 (EthanYangTW) + LeakyReLU² from openai#518/openai#549 + SGD from openai#549. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sofiabod added 15 commits March 18, 2026 14:34

initial

45422a6

add modal launcher for 8xh100 training

f13c234

fix md + tests

7df4c4b

exp42: SDPA only (flash-attn build fails on Modal)

8341935

- flash-attn requires GPU for compilation, Modal builds without GPU - keeping SDPA fallback, ~101ms/step - still have FA3 import attempt in code for when it becomes available

exp44: try flash-attn runtime install + SDPA fallback

be8b359

- attempt flash-attn pip install at runtime with 120s timeout - still falls back to SDPA if install fails - 101ms/step with SDPA, ~84ms with FA3

Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622)

e246baa

notapplica mentioned this pull request Mar 23, 2026

Parameter Golf Live AI Commentary + Analysis / Ideas | every 10 minutes #140

Open

NotADevIAmaMeatPopsicle mentioned this pull request Mar 23, 2026

Record: pcloadloveletter v6 — Novel Codebook+Huffman Compression + AdamW TTT (val_bpb=1.0487) #532

Closed

Christopher-Lee-McClendon mentioned this pull request Mar 23, 2026

Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds) #537

Open

abaybektursun mentioned this pull request Mar 23, 2026

Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean) #549

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622)#518

Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622)#518
sofiabod wants to merge 15 commits intoopenai:mainfrom
sofiabod:submission/11L-LeakyReLU-CosineTTT-1.0622

sofiabod commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sofiabod commented Mar 23, 2026

Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622)

Summary

Approach

Results

Comparison

Run command

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant