
Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean)#399

Open
abaybektursun wants to merge 4 commits into openai:main from abaybektursun:submission/parallel-muon-82ms

Conversation

abaybektursun (Contributor) commented Mar 22, 2026

Novel Contribution: Parameter Banking + Parallel Muon

This submission introduces Parameter Banking, a weight layout restructuring that enables batched optimizer operations, combined with an adapted Parallel Muon communication strategy. Together, these provide a 3.4% training throughput improvement that is architecture-agnostic and composes with any Muon-based training stack. The approach has since been adopted by subsequent competition submissions (e.g., PR #549).

Pure systems optimization — model architecture and hyperparameters are unchanged.

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128, 600s)

| Seed | step_avg | steps | int6 sliding val_bpb | artifact |
| --- | --- | --- | --- | --- |
| 1337 | 81.86 ms | 7,331 | 1.1241 | 15,830,960 bytes |
| 42 | 81.88 ms | 7,328 | 1.1253 | 15,819,728 bytes |
| 2025 | 81.86 ms | 7,330 | 1.1247 | 15,796,052 bytes |
| Mean | 81.87 ms | 7,330 | 1.1247 (std 0.0006) | ~15.8 MB |

Technical Approach

1. Parameter Banking (novel)

We restructure 66 separate nn.Linear weight matrices into 4 contiguous 3D nn.Parameter tensors, grouped by shape:

  • qo_bank: (22, 512, 512) — Q + Out projections
  • kv_bank: (22, 256, 512) — K + V projections
  • mlp_up_bank: (11, 1536, 512) — MLP up
  • mlp_down_bank: (11, 512, 1536) — MLP down

Forward pass uses F.linear(x, bank[layer_idx]) — compiles identically to nn.Linear under torch.compile. Verified: banked forward+backward = 72.33ms vs baseline 72.59ms.
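As a minimal sketch of this layout (the name `qo_bank` and the init scale are illustrative, not the PR's actual identifiers), indexing a 3D bank recovers exactly what a per-layer `nn.Linear` would compute:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One contiguous bank replaces 22 separate (512, 512) nn.Linear weights.
qo_bank = nn.Parameter(torch.randn(22, 512, 512) * 0.02)

def banked_linear(x: torch.Tensor, layer_idx: int) -> torch.Tensor:
    # qo_bank[layer_idx] is a (512, 512) view; F.linear computes x @ W.T,
    # matching nn.Linear(512, 512, bias=False) with weight W.
    return F.linear(x, qo_bank[layer_idx])

x = torch.randn(4, 512)
y = banked_linear(x, 0)  # same result as x @ qo_bank[0].T
```

Because each layer's weight is just a slice of the same tensor, `torch.compile` sees the same matmul it would for a standalone `nn.Linear`.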

The key benefit: Newton-Schulz orthogonalization (used by Muon) becomes a single torch.bmm over the batch dimension, replacing 66 sequential small GEMMs. This reduces optimizer time from 19.7ms to 1.3ms (15× faster).
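A sketch of the batched orthogonalization, assuming the quintic Newton-Schulz coefficients used by Muon in modded-nanogpt (the PR's exact constants and step count may differ); the leading bank dimension turns the per-layer GEMMs into one batched matmul per iteration:

```python
import torch

@torch.no_grad()
def newton_schulz_banked(bank: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic-iteration coefficients (assumed, not copied from this PR);
    # drives singular values toward ~1 without computing an exact SVD.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = bank / (bank.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    for _ in range(steps):
        A = X @ X.transpose(-2, -1)  # batched: one bmm covers every matrix in the bank
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X
```

Each `@` here dispatches a single batched GEMM over the bank's leading dimension, which is where the sequential-GEMM overhead disappears.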

2. Parallel Muon (adapted from arXiv:2511.07464)

Standard DDP is incompatible with parameter banking: each bank's gradient aggregates across all 11 layers and only becomes available at the end of the backward pass, destroying compute-communication overlap (a +4ms regression).

Our solution removes DDP for banked parameters and schedules communication explicitly:

  1. Launch async reduce_scatter for all banks (biggest first)
  2. all_reduce + Adam step on small replicated params (while bank RS is in-flight)
  3. Wait for RS, local batched NS on each GPU's shard, async all_gather

This follows the DDP-free communication pattern from modded-nanogpt, adapted to work with our banking structure.
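The three phases above can be sketched roughly as follows (a non-authoritative outline: `adam_step` and `ns_orthogonalize` are placeholder callables, and it assumes each bank's gradient splits evenly across ranks, whereas a real implementation would pad or flatten):

```python
import torch
import torch.distributed as dist

def biggest_first(banks):
    # Launch the largest reduce_scatter first so the longest transfer
    # overlaps with the most subsequent compute.
    return sorted(banks, key=lambda b: b.numel(), reverse=True)

def banked_optimizer_step(banks, small_params, adam_step, ns_orthogonalize,
                          world_size):
    handles = []
    for bank in biggest_first(banks):          # 1. async reduce_scatter per bank
        shard = torch.empty_like(bank.grad.chunk(world_size)[0])
        h = dist.reduce_scatter_tensor(shard, bank.grad, async_op=True)
        handles.append((bank, shard, h))
    for p in small_params:                     # 2. overlap: small replicated params
        dist.all_reduce(p.grad)
    adam_step(small_params)
    for bank, shard, h in handles:             # 3. wait, local batched NS, gather
        h.wait()
        dist.all_gather_into_tensor(bank.grad, ns_orthogonalize(shard))
```

The key property is that step 2's small-parameter work executes while the bank reduce_scatters are still in flight.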

Engineering notes

| Approach | Result | Lesson |
| --- | --- | --- |
| Non-surgery batching (keep 66 params, batch in optimizer) | 85.73 ms | Gather/scatter kernel overhead offsets the speedup |
| DDP with banks | 88.8 ms (+4 ms) | Bank grads only available at the end of backward |
| Polar Express (arXiv:2505.16932) | 82 ms, 16.2 MB | PE weights compress ~190 KB worse than NS |
| Parameter Banking + Parallel Muon | 81.87 ms, 15.8 MB | Architecture-agnostic, composable |

Compatibility analysis

| Base PR | Speed | Score | Finding |
| --- | --- | --- | --- |
| #315 (EMA only) | -3.4% | -0.0006 BPB | Extra steps improve the EMA monotonically |
| #374 (Tight SWA) | -3.5% | +0.001 | SWA averages warmdown weights; extra steps don't enter the window |
| #401 (EMA+SWA) | -2.8% | +0.0005 | Same SWA dilution |
| #398 (TTT) | -2.3% | +0.004 | A more-converged model has less room for TTT adaptation |

Key finding: The throughput advantage translates to quality gains exclusively for EMA-based models, where every additional step monotonically refines the exponential moving average.

Credits

🤖 Generated with Claude Code

Systems optimization built on PR openai#315 by @jfprincz (11L XSA4+EMA, 1.1248 bpb).
Same architecture, same hyperparameters, only optimizer changed.

82.14ms/step vs 84.76ms baseline = 7,306 steps vs 7,079 in 600s.
Pre-quant val_bpb 1.1421 (identical to baseline).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun and others added 2 commits March 22, 2026 00:13
…1.1248)

Unbank state dict before quantization so int6 per-row scales match baseline.
Rebank after dequantization for roundtrip eval.

Results: 82.13ms/step, 7,306 steps, int6 sliding window val_bpb 1.1238.
Artifact: 16.06MB (int6+zstd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
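The unbank/rebank round-trip described in this commit might look like the following sketch (the key names `qo_bank` and `qo.{i}.weight` are hypothetical):

```python
import torch

def unbank(state_dict, bank_key="qo_bank", prefix="qo"):
    # Split a (L, out, in) bank into L per-layer 2D weights so that
    # per-row int6 quantization scales match the unbanked baseline.
    bank = state_dict.pop(bank_key)
    for i, w in enumerate(bank.unbind(0)):
        state_dict[f"{prefix}.{i}.weight"] = w.contiguous()
    return state_dict

def rebank(state_dict, n, bank_key="qo_bank", prefix="qo"):
    # Restack the per-layer weights after dequantization for roundtrip eval.
    ws = [state_dict.pop(f"{prefix}.{i}.weight") for i in range(n)]
    state_dict[bank_key] = torch.stack(ws, dim=0)
    return state_dict
```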
Seeds 42, 1337, 2025: mean 82.08ms/step, val_bpb 1.1239 (std 0.0001).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun changed the title from "Record: Parallel Muon + Parameter Banking — 82.14ms/step (3.1% faster than PR #315)" to "Record: Parallel Muon + Parameter Banking — 82.08ms/step (3.2% faster than PR #315)" Mar 22, 2026
@abaybektursun abaybektursun force-pushed the submission/parallel-muon-82ms branch from 5f4d141 to 4db0057 Compare March 22, 2026 15:24
@abaybektursun changed the title from "Record: Parallel Muon + Parameter Banking — 82.08ms/step (3.2% faster than PR #315)" to "Record: Parallel Muon + Parameter Banking + Polar Express — 82.14ms/step (3.1% faster than PR #315)" Mar 22, 2026
Replaced Polar Express with standard Newton-Schulz + switched to lzma compression.
3-seed results: 81.87ms/step mean, 1.1247 sliding bpb mean, all artifacts ~15.8MB.

Seed 1337: 7331 steps, 1.1241 bpb, 15,830,960 bytes
Seed 42:   7328 steps, 1.1253 bpb, 15,819,728 bytes
Seed 2025: 7330 steps, 1.1247 bpb, 15,796,052 bytes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun changed the title from "Record: Parallel Muon + Parameter Banking + Polar Express — 82.14ms/step (3.1% faster than PR #315)" to "Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean)" Mar 22, 2026
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 22, 2026
Legal score-first TTT (PR openai#461 recipe) applied to openai#414 stack with
Parameter Banking + Parallel Muon (first introduced in PR openai#399).

Pre-TTT: 1.1234, post-TTT: 1.1213 (-0.0021). TTT eval: 400s.
Artifact: 15.84 MB. Seed 1337, 8×H100 SXM, PyTorch 2.9.1+cu128.

Every token scored BEFORE model adapts (inference_mode enforced).
SGD+momentum(0.9), 3 epochs/32K chunk, freeze first 2 blocks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
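A minimal sketch of the score-first protocol described above (model, loss, and chunking are placeholders, and the block-freezing step is omitted for brevity): each chunk is scored under `inference_mode` before the model is allowed to adapt to it.

```python
import torch

def score_first_ttt(model, chunks, loss_fn, epochs=3, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scores = []
    for x, y in chunks:
        with torch.inference_mode():   # score BEFORE the model adapts
            scores.append(loss_fn(model(x), y).item())
        for _ in range(epochs):        # then adapt on the same chunk
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return scores
```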
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 23, 2026
…d mean)

Legal score-first TTT (PR openai#461 recipe) + BigramHash(3072) + freeze=0
on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).

3-seed results (BIGRAM=3072, 3ep, freeze=0, SGD+mom=0.9):
  Seed 1337: 1.1204 bpb, 413s TTT, 15.98 MB
  Seed 42:   1.1216 bpb, 406s TTT, 15.99 MB
  Seed 2025: 1.1221 bpb, 405s TTT, 15.99 MB
  Mean:      1.1214 (std 0.0009)

All artifacts under 16MB. All eval times under 600s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 23, 2026
…ed mean)

LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT
(PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on
openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).

3-seed results:
  Seed 42:   1.1200 bpb, 408s TTT, 15.88 MB
  Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB
  Seed 1337: pending (log will be added)
  Mean:      1.1195 (std 0.0008)

All artifacts under 16MB. All eval under 10 min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 23, 2026
…ed mean)

LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT
(PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on
openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).

3-seed results:
  Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB
  Seed 42:   1.1200 bpb, 408s TTT, 15.88 MB
  Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB
  Mean:      1.1194 (std 0.0006)

All artifacts under 16MB. All eval under 10 min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mistobaan pushed a commit to Mistobaan/parameter-golf that referenced this pull request Mar 25, 2026
TimS-ml referenced this pull request in TimS-ml/parameter-golf-autoresearch Mar 26, 2026
nedcut pushed a commit to nedcut/parameter-golf that referenced this pull request Mar 26, 2026
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026