Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean) by abaybektursun · Pull Request #549 · openai/parameter-golf

abaybektursun · 2026-03-23T16:08:26Z

Record: LeakyReLU² + Legal TTT + Parallel Muon — val_bpb 1.1194

val_bpb = 1.1194 (3-seed mean, std 0.0006) | ~15.95 MB | 8×H100 SXM

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

Seed	step_avg	steps	Pre-TTT bpb	Post-TTT bpb	TTT gain (bpb)	TTT time	Artifact (bytes)
1337	83.3ms	7,179	1.1217	1.1192	-0.0025	410s	15,977,386
42	83.4ms	7,182	1.1227	1.1200	-0.0027	408s	15,876,510
2025	83.4ms	7,193	1.1212	1.1189	-0.0023	408s	15,990,006
Mean	83.4ms	7,185	1.1218	1.1194 (std 0.0006)	-0.0025	~409s
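As a quick sanity check on the summary row, the post-TTT mean and spread can be recomputed from the three per-seed values in the table. This is only a sketch: the variable names are illustrative, and using the sample standard deviation (n - 1 denominator) is an assumption, although it is the form that rounds to the reported 0.0006.

```python
from statistics import mean, stdev

# Post-TTT val_bpb per seed, copied from the results table above.
post_ttt_bpb = {1337: 1.1192, 42: 1.1200, 2025: 1.1189}

values = list(post_ttt_bpb.values())
mean_bpb = mean(values)

# stdev() is the sample standard deviation (n - 1 denominator).
# Assumption: the reported std uses this form rather than the population form.
std_bpb = stdev(values)

print(f"mean={mean_bpb:.4f} std={std_bpb:.4f}")
```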

Key Innovation: LeakyReLU(0.5)²

One-line activation change delivering -0.003 BPB vs standard relu²:

# relu² (standard)
x = torch.relu(self.fc(x)).square()
# leaky relu² (this submission)
x = F.leaky_relu(self.fc(x), negative_slope=0.5).square()

The leaky slope keeps a nonzero gradient for negative pre-activations in the MLP, where relu² passes none. Source: PR #493 by @parinzee (ablated at -0.003 BPB), PR #518 by @sofiabod.
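To make the gradient argument concrete, here is a scalar sketch of both activations and their derivatives in plain Python. The function names are illustrative and not taken from the submission; the submission itself uses `torch.relu` and `F.leaky_relu` as shown above.

```python
def relu_sq(x: float) -> float:
    """relu(x)**2: zero output for every negative input."""
    return max(x, 0.0) ** 2


def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    """leaky_relu(x)**2: negative inputs are scaled by the slope, then squared."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y


def relu_sq_grad(x: float) -> float:
    """d/dx relu(x)**2: exactly zero on the negative side."""
    return 2.0 * x if x > 0.0 else 0.0


def leaky_relu_sq_grad(x: float, negative_slope: float = 0.5) -> float:
    """d/dx leaky_relu(x)**2: 2 * slope**2 * x on the negative side."""
    return 2.0 * x if x >= 0.0 else 2.0 * negative_slope ** 2 * x


for x in (-2.0, 3.0):
    print(x, relu_sq(x), relu_sq_grad(x), leaky_relu_sq(x), leaky_relu_sq_grad(x))
```

With slope 0.5, a negative input contributes a quarter of the squared magnitude that a positive input of the same size would, and its gradient is nonzero, so units with negative pre-activations still receive updates.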

Legal TTT (Score-First, PR #461 Framework)

Every token is scored BEFORE any weight update. Scoring runs under torch.inference_mode(), which disables gradient tracking during that phase:

for each 32K-token chunk:
    Phase 1 — SCORE: sliding window eval (inference_mode)
    Phase 2 — TRAIN: SGD(lr=0.002, mom=0.9), 3 epochs, all blocks unfrozen

Adapted from PR #461 by @Christopher-Lee-McClendon (changed freeze=2 → freeze=0 based on our ablation showing unfreezing all blocks is optimal at 3 epochs).
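The score-then-train ordering can be illustrated with a deliberately tiny stand-in model. The sketch below is not the submission's code: it replaces the transformer with a single-logit Bernoulli predictor, uses plain gradient descent without momentum, and the names `score_first_ttt`, `chunk_size`, `lr` and `epochs` are hypothetical. What it does preserve is the legality property: each chunk's loss is accumulated with the weights as they were before that chunk was trained on.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score_first_ttt(tokens, chunk_size, lr=0.5, epochs=3):
    """Toy score-first TTT over a 0/1 token stream.

    Returns (mean bits per token, final logit). The single logit stands in
    for the model weights; this is an illustrative assumption.
    """
    logit = 0.0          # "weights" before any adaptation: p(token == 1) = 0.5
    total_bits = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]

        # Phase 1, SCORE: weights are read-only here, nothing is updated.
        p = sigmoid(logit)
        for t in chunk:
            total_bits += -math.log2(p if t == 1 else 1.0 - p)

        # Phase 2, TRAIN: adapt on the chunk that has already been scored.
        for _ in range(epochs):
            p = sigmoid(logit)
            grad = sum(p - t for t in chunk) / len(chunk)  # d(mean NLL)/d(logit)
            logit -= lr * grad

    return total_bits / len(tokens), logit


stream = [1] * 64
print(score_first_ttt(stream, chunk_size=64))  # one chunk: scored before any update
print(score_first_ttt(stream, chunk_size=16))  # later chunks benefit from earlier ones
```

With a single chunk the reported loss is the unadapted loss, since training only happens after scoring. With several chunks, only the later chunks benefit from adaptation on earlier ones.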

Total eval: ~530s (120s standard + 409s TTT) — within 10 min limit.
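The budget arithmetic can be checked per seed from the TTT times in the results table. This is a sketch under two assumptions: the ~120s standard evaluation applies equally to every seed, and the limit is exactly 600 seconds. The constant names are illustrative.

```python
EVAL_LIMIT_S = 10 * 60      # 10 minute evaluation limit
STANDARD_EVAL_S = 120       # assumption: same standard-eval cost for every seed

# TTT time per seed in seconds, copied from the results table.
ttt_seconds = {1337: 410, 42: 408, 2025: 408}

totals = {seed: STANDARD_EVAL_S + t for seed, t in ttt_seconds.items()}
headroom = {seed: EVAL_LIMIT_S - total for seed, total in totals.items()}

for seed, total in totals.items():
    print(f"seed {seed}: {total}s total, {headroom[seed]}s headroom")
```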

Training Architecture

PR #414 stack + Parameter Banking + Parallel Muon (PR #399):

11L, 512d, 8H/4KV, LeakyReLU(0.5)² MLP 3×
BigramHash(1536), XSA4, Partial RoPE, LN Scale, VE128
EMA(0.997) + Tight SWA, GPTQ-lite int6 + lzma
Parameter Banking + Parallel Muon (83.4ms/step)
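For readers unfamiliar with the shorthand, the first line of the stack can be unpacked into derived shapes. The sketch below reads "11L, 512d, 8H/4KV, MLP 3×" as 11 layers, model width 512, 8 query heads sharing 4 key/value heads, and an MLP hidden width of 3× the model width. That reading, the class name and the field names are all assumptions rather than the submission's actual config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical container for the shape shorthand; field names are illustrative."""
    n_layers: int = 11
    d_model: int = 512
    n_heads: int = 8       # query heads
    n_kv_heads: int = 4    # key/value heads (grouped-query attention)
    mlp_mult: int = 3      # MLP hidden width as a multiple of d_model

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads

    @property
    def queries_per_kv(self) -> int:
        return self.n_heads // self.n_kv_heads

    @property
    def mlp_hidden(self) -> int:
        return self.mlp_mult * self.d_model


shape = ModelShape()
print(shape.head_dim, shape.queries_per_kv, shape.mlp_hidden)
```

Under this reading each head has dimension 64, two query heads share each key/value head, and the MLP hidden width is 1536.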

Credits

LeakyReLU²: PR #493 by @parinzee, PR #518 by @sofiabod
Optimizer: PR #399 by @abaybektursun
TTT recipe: PR #461 by @Christopher-Lee-McClendon (adapted: freeze=0)
Base model: PR #414 by @signalrush

🤖 Generated with Claude Code

…ed mean)

LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).

3-seed results:
Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB
Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB
Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB
Mean: 1.1194 (std 0.0006)

All artifacts under 16MB. All eval under 10 min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

11L, XSA all layers, partial RoPE 16/64, LN scale, VE128 (layers 9,10), LeakyReLU(0.5)² activation, BigramHash(2048), INT6+zstd-22. Legal score-first TTT: 32K chunks, all blocks, SGD(0.002,mom=0.9), 3ep. Base: PR openai#503 (EthanYangTW) + LeakyReLU² from openai#518/openai#549 + SGD from openai#549. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


Multiple top PRs (openai#535, openai#549, openai#569) demonstrate -0.0015 to -0.003 bpb from this change. LeakyReLU preserves gradient flow through negative pre-activations while maintaining the sparsity/gating benefits of squaring. At 22M params, dead neurons from hard ReLU are expensive. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

abaybektursun force-pushed the submission/leaky-relu-legal-ttt-1.1183 branch from f6a0b0d to 8ff3e0e on March 23, 2026 16:27

abaybektursun changed the title ~~Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1195 (3-seed mean)~~ Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean) Mar 23, 2026

abaybektursun added 2 commits March 23, 2026 12:13

Fix pre-TTT BPB, TTT gains, and steps to match logs exactly

139d573

Fix author attributions: PR openai#493 @parinzee, PR openai#461 @Christopher-Lee-McClendon

b08d72a

notapplica mentioned this pull request Mar 23, 2026

Parameter Golf Live AI Commentary + Analysis / Ideas | every 10 minutes #140

Open

abaybektursun wants to merge 3 commits into openai:main from abaybektursun:submission/leaky-relu-legal-ttt-1.1183
