
Record Submission: 1.1091 BPB - Turbo-Muon + EngramLite + ParamBanking + XSA (11L 512d)#1089

Open
mikeapedia wants to merge 15 commits into openai:main from mikeapedia:submission/turbo-muon-engram-banking

Conversation


@mikeapedia mikeapedia commented Mar 29, 2026

Record Submission: 1.1091 BPB — Turbo-Muon + EngramLite + Parameter Banking + GPTQ Mixed-Precision

val_bpb: 1.1091 (3-seed mean, std 0.0005) | ~15.3 MB | 8×H100 SXM

Results (8×H100 80GB SXM)

| Seed | step_avg | steps | val_bpb (SW s64) | val_bpb (roundtrip) | Artifact bytes |
|---|---|---|---|---|---|
| 42 | 93.26ms | 6284 | 1.1086 | 1.1324 | 15,992,528 |
| 1337 | 93.11ms | 6295 | 1.1090 | 1.1328 | 15,993,413 |
| 2025 | 93.11ms | 6294 | 1.1096 | 1.1335 | 15,993,904 |
| **Mean** | 93.16ms | 6291 | 1.1091 | 1.1329 | |

Summary

11-layer GPT (512d, 8H, 4KV GQA) combining eight key innovations over the PR #609 baseline:

  • Turbo-Muon Optimizer — AOL preconditioning + Polar Express coefficients + row_col post-NS normalization (4 NS iterations instead of 5)
  • EngramLite — Multi-head prime-based hash embeddings (bigram + trigram, 2 heads, 8192 buckets)
  • Parameter Banking — 3D bank tensors enabling batched Newton-Schulz via torch.bmm
  • ASQU v3 Per-Layer Slopes — Fixed per-layer LeakyReLU negative slopes from 3 rounds of adaptive tuning: layer 0 near-ReLU² (−0.014) → layer 10 (0.468)
  • U-Net Skip Connections — Learned sigmoid-gated encoder/decoder skip paths
  • ValueEmbedding — Token identity reinjection at deep layers (9, 10)
  • SmearGate — Causal shift blending with predecessor token
  • XSA (all 11 layers) — Efficient cross-sequence attention via GQA-aware reshape
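For readers skimming the list above, the Parameter Banking idea (batching Newton-Schulz orthogonalization over a 3D bank of weight matrices) can be sketched in a dependency-light form. This uses the classic cubic Newton-Schulz iteration rather than the submission's tuned AOL + Polar Express coefficients (which get away with only 4 steps), and NumPy's batched `@` in place of `torch.bmm`; the function name and step count are illustrative:

```python
import numpy as np

def banked_newton_schulz(bank, steps=40):
    """Orthogonalize a 3D bank of matrices (B, m, n) in one batched loop.

    Classic cubic Newton-Schulz shown for clarity; the PR uses tuned
    Polar Express coefficients with AOL preconditioning to converge in
    4 steps. The banking idea is the same either way: each iteration is
    one matmul over the whole bank (torch.bmm in the submission).
    """
    # Frobenius norm bounds the spectral norm, so this guarantees convergence
    X = bank / np.linalg.norm(bank, axis=(1, 2), keepdims=True)
    for _ in range(steps):
        A = X @ X.transpose(0, 2, 1)          # (B, m, m), batched Gram matrices
        X = 1.5 * X - 0.5 * (A @ X)           # cubic NS step, batched
    return X

rng = np.random.default_rng(0)
bank = rng.standard_normal((3, 8, 8))         # 3 weight matrices banked together
Q = banked_newton_schulz(bank)
err = np.abs(Q @ Q.transpose(0, 2, 1) - np.eye(8)).max()
```

The payoff of banking is that B separate (m×m)(m×n) matmuls become one batched kernel launch per iteration.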

Architecture

Partial RoPE (16/64 dims), LN Scale (1/√(layer+1)), logit softcap 30.0, tied embeddings, per-head QK gain (init 1.5), LeakyReLU(ASQU v3 per-layer)² MLP at 3.5× width.
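The MLP activation named here, per a later commit it is literally `F.leaky_relu(x, slope).square()`, reduces to a two-line scalar form. Only the endpoint slopes (−0.014 and 0.468) are quoted in this PR; the intermediate per-layer values come from the tuning rounds and are not listed:

```python
def leaky_relu_sq(x: float, slope: float) -> float:
    """Scalar form of the MLP activation: F.leaky_relu(x, slope) ** 2.

    With slope -0.014 (layer 0) this is near ReLU-squared; with slope
    0.468 (layer 10) most negative signal passes through before squaring.
    """
    y = x if x >= 0.0 else slope * x
    return y * y

# Endpoint slopes quoted in the PR; intermediate layers use tuned values
# not listed in this description.
SLOPE_LAYER_0, SLOPE_LAYER_10 = -0.014, 0.468
```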

Compression

  • GPTQ mixed-precision — int5 baseline with Hessian-sensitivity-based selective promotion to int6/int7. Hessian collection (64 calibration batches) runs within the 600s training budget via gptq_reserve_ms=9000.
  • Late QAT — Soft-round sigmoid activated at LR scale < 15%, α ramp 1→16
  • SWA (float32, every 50 steps) + EMA (decay=0.997)
  • Brotli + byte-shuffle compression
  • Code shrinking — AST dead-code removal + pyminify + LZMA self-extracting wrapper (145 KB → 28 KB), freeing ~117 KB artifact budget for model weights
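The byte-shuffle + compressor pairing above can be sketched with the stdlib alone, using `lzma` (the PR's documented fallback) in place of brotli. Stride 2 matches a later commit's description; the helper names are illustrative, not the submission's:

```python
import lzma

def byte_shuffle(data: bytes, stride: int = 2) -> bytes:
    """Group bytes by position within each `stride`-byte word.

    For a stream of 2-byte quantized values this puts all low bytes
    together and all high bytes together, which generic compressors
    exploit much better than the interleaved original.
    """
    return b"".join(data[i::stride] for i in range(stride))

def byte_unshuffle(data: bytes, stride: int = 2) -> bytes:
    """Exact inverse of byte_shuffle, handling lengths not divisible by stride."""
    n, rem = divmod(len(data), stride)
    lanes, off = [], 0
    for i in range(stride):
        size = n + (1 if i < rem else 0)       # first `rem` lanes carry one extra byte
        lanes.append(data[off:off + size])
        off += size
    body = bytes(b for word in zip(*lanes) for b in word)
    tail = b"".join(lane[n:] for lane in lanes)
    return body + tail

payload = bytes(range(256)) * 64               # stand-in for quantized weights
packed = lzma.compress(byte_shuffle(payload))
```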

Key Hyperparameters

| Param | Value |
|---|---|
| Layers | 11 (512d) |
| MLP | 3.5× with LeakyReLU(ASQU v3 per-layer)² |
| XSA | All 11 layers |
| Muon | LR 0.025, momentum=0.99, WD=0.04 |
| SWA | every 50 steps after 20% |

Credits

Built on PR #609 (1.1154 bpb). Techniques from PRs #198, #265, #287, #399, #493, #518, #634.

Extra Dependencies

brotli>=1.1 (falls back to lzma if missing). torch>=2.11, Python>=3.12.

Full details in the submission README. Human-readable source in train_gpt_human.py. Training logs for all 3 seeds included.

mikeapedia and others added 3 commits March 29, 2026 09:41
…xed-Precision

11L/512d GPT with Turbo-Muon (AOL+Polar Express+row_col), EngramLite hash
embeddings, U-Net skip connections, Parameter Banking, GPTQ mixed-precision
int6/int7 with Hessian sensitivity, brotli compression.

Dev-run benchmark: 1.1119 val_bpb (sliding window, 1xH100).
Awaiting 3-seed validation on 8xH100 before opening PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Syncs version requirements with pyproject.toml. torch>=2.11 is needed for
torch.compile fullgraph improvements and CUDA 13.0 support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Seeds: 42, 1337, 2025
Mean val_bpb (SW s64): 1.1086
Max artifact bytes: 15997089

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mikeapedia and others added 4 commits March 29, 2026 13:01
… budget

Removed periodic eval_val during training loop (fired at step 0 and 4000)
and the diagnostic post-EMA eval. These burned ~10-15s of wallclock on
evals that don't affect the final score — the real evaluation happens in
the post-quantization sweep. Reclaimed time yields ~100-150 extra training
steps at steady-state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AST dead-code removal + pyminify + LZMA/base85 self-extracting wrapper.
Reduces train_gpt.py code_bytes, freeing artifact budget for model weights.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d architecture details

- Added XSA (Cross-Sequence Attention) all 11 layers as key innovation
- Fixed quantization description: int5 baseline with selective promotion to int6/int7
- Clarified GPTQ Hessian collection runs within training budget (14s reserved)
- Added architecture details: Partial RoPE, LN Scale, logit softcap, GQA, tied embeddings, QK gain

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changes from previous run:
- Removed mid-training and diagnostic eval_val calls (reclaims ~10-15s for training)
- Shrunk train_gpt.py via AST pruning + pyminify + LZMA (125KB -> 24KB, frees ~99KB code budget)
- Human-readable source preserved as train_gpt_human.py

3-seed mean val_bpb (SW s64): 1.1091
Max artifact bytes: 15993904

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mikeapedia mikeapedia changed the title Record Submission: 1.1086 BPB - Turbo-Muon + EngramLite + ParamBanking (11L 512d) Record Submission: 1.1091 BPB - Turbo-Muon + EngramLite + ParamBanking + XSA (11L 512d) Mar 29, 2026
Observed GPTQ times across 3 seeds: 7.13s, 7.16s, 7.25s (max 7.25s).
9s reserve gives 1.75s safety margin (24% headroom) while freeing ~5s
of training budget (~53 extra steps at 93ms/step).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
demouo added a commit to demouo/parameter-golf that referenced this pull request Mar 30, 2026
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 30, 2026
Both from top submissions, zero code risk:
  MUON_BACKEND_STEPS=4 (PR openai#1089): 4 NS iterations vs 5
    Saves ~1-2ms/step, proven at 1.1086 BPB
  BIGRAM_VOCAB_SIZE=4096 (PR openai#1072): larger hash table
    More n-gram patterns, proven at 1.117 BPB

MLP 3.5x investigated but doesn't fit 16MB budget (+2.2MB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 30, 2026
Decoded from base85+LZMA compressed submission. Key innovations:
- EngramLite: multi-head bigram+trigram hash (8192 buckets, 2 heads)
- LeakyReLU(0.3)² (not 0.5)
- MLP 3.5× (fits via mixed int6/int7 quantization)
- AOL Polar Express Muon coefficients (4 NS steps)
- Sigmoid-gated skip connections
- Brotli + byte-shuffle compression
- Hessian sensitivity-based bit allocation (int6/int7 mixed)
- Soft-round QAT
- LR floor 0.05 (warmdown doesn't reach zero)

Requires: brotli package

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 30, 2026
Ported from openai#1 submission (1.1086 BPB) into our merged stack:

1. EngramLite: multi-head bigram+trigram hash (8192 buckets, 2 heads,
   2 orders, 32 dim/head) with learned sigmoid gate. Replaces BigramHash.
2. Sigmoid-gated skip connections: gate = sigmoid(skip_gates[i]),
   x = lerp(skip_weight*skip, x, gate). More expressive than additive.
3. LeakyReLU(0.3)² (was 0.5). Fused kernel disabled for non-0.5 slopes.
4. Muon 4 NS steps (was 5). AOL Polar coefficients from PR openai#1089.
5. LR floor 0.05 (warmdown doesn't reach zero).

NOT ported (diminishing returns vs complexity):
- Brotli compression (keep LZMA)
- Mixed int6/int7 (keep uniform int6)
- Soft-round QAT

Expected: close to 1.1086 with FA3, possibly ~1.109-1.111.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
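The gated-skip merge quoted in this commit (`gate = sigmoid(skip_gates[i])`, `x = lerp(skip_weight*skip, x, gate)`) is easy to sanity-check in scalar form, recalling that `torch.lerp(a, b, w) = a + w * (b - a)`; the function name below is illustrative:

```python
import math

def gated_skip(x: float, skip: float, gate_logit: float, skip_weight: float) -> float:
    """Sigmoid-gated U-Net skip merge: torch.lerp(skip_weight*skip, x, g).

    g near 1 keeps the decoder stream x; g near 0 falls back to the
    scaled encoder skip. This is what makes it more expressive than a
    plain additive skip: the gate can fully suppress either path.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))    # sigmoid(skip_gates[i])
    a = skip_weight * skip
    return a + g * (x - a)                     # lerp(a, x, g)
```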
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 30, 2026
@AnirudhRahul

https://github.com/openai/parameter-golf/pull/1126/changes#diff-5ac0e90191b85d767bb3c0b9777c0871a9a0326e02bb2dd9269378ffa46af26e
^ Struggling to reproduce these results; the step times I'm getting are ~10% slower than those posted in these logs.

ahmettrkck added a commit to ahmettrkck/parameter-golf that referenced this pull request Mar 30, 2026
Bortlesboat added a commit to Bortlesboat/parameter-golf that referenced this pull request Mar 30, 2026
…ed mean)

3-seed results: 1.1131/1.1119/1.1133 (mean 1.1128, std 0.0008)
Built on PR openai#1089 with GPTQ reserve optimization (14s→9s) and
forward-only fused Triton MLP kernel architecture (currently disabled
pending torch.compile compat, falls back gracefully).

mikeapedia commented Mar 30, 2026

@AnirudhRahul - Just looked at the logs, and the only difference that jumps out at me is that I was running PyTorch 2.11.0+cu130, not cu126. For cu126, flash attention 3 might be better than SDPA.

mikeapedia and others added 5 commits March 30, 2026 11:27
Port score-first legal TTT from training-base: every token is scored
BEFORE any gradient update touches it. Last chunk scored but never
trained on. Includes Polyak EMA weight averaging, entropy-adaptive
epochs, multi-GPU support, and 600s eval budget guard.

Also reduces gptq_reserve_ms from 14s to 9s (observed max 7.25s,
24% headroom) to reclaim ~53 extra training steps per run.

TTT is opt-in via TTT_ENABLED=1. Default eval path unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
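The "score-first" legality invariant described in this commit (every chunk is scored with weights that have never seen it; the last chunk is scored but never trained on) fits in a short loop. The `score`/`update` callables stand in for the model's loss evaluation and optimizer step; the Polyak EMA, entropy-adaptive epochs, and budget guard are omitted:

```python
def score_first_ttt(chunks, score, update, state):
    """Score-first test-time training skeleton.

    Each chunk is scored BEFORE any gradient update touches it, so the
    reported loss is legal; the final chunk contributes to the score
    but is never trained on.
    """
    total = 0.0
    for i, chunk in enumerate(chunks):
        total += score(state, chunk)           # score with pre-update weights
        if i < len(chunks) - 1:                # last chunk: scored, never trained
            state = update(state, chunk)
    return total, state

# Toy check: log which "weight version" scored each chunk.
log = []
total, final = score_first_ttt(
    [10, 20, 30],
    score=lambda st, c: (log.append((st, c)), 0.0)[1],
    update=lambda st, c: st + 1,
    state=0,
)
```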
- Add Test-Time Training section describing score-first legal TTT
  with Polyak EMA, entropy-adaptive epochs, and budget guard
- Add TTT run command example
- Fix Hessian collection reference: 14s → 9s (matches gptq_reserve_ms)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move t_eval_phase before GPTQ quantization so the 600s eval budget
  correctly includes quantization + compression + decompression time.
  Previously it was set after model load, so TTT's budget guard didn't
  account for ~25-40s of GPTQ overhead and could exceed the budget.
- Add _eval_phase_elapsed helper and log GPTQ+compression load time.
- Add cumulative eval phase summary with 580s warning threshold.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three bugs fixed:
1. Dead code remover used "A." prefix but code uses "args." — remover
   was a complete no-op, never matching any branches. Fixed to exact
   "args.load_snapshot" / "args.snapshot_post_hessian" matching.
2. Used exact equality (not substring) to avoid matching negated forms
   like "not args.load_snapshot" which guards the 440-line training
   loop — substring matching would have deleted all training code.
3. __main__ block ignored sys.argv and had a destructive rename workflow
   that would overwrite train_gpt_human.py with the old shrunk version.
   Now accepts: python shrink.py <input> <output>
   Legacy no-args mode has safety check against clobbering existing files.

Dead code removal now saves ~15.7K chars (10.9%) by stripping snapshot
save/restore branches that are never used during competition runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
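The exact-match dead-code removal described in this commit, matching `args.load_snapshot` exactly so that negated guards like `not args.load_snapshot` survive, maps directly onto a stdlib `ast.NodeTransformer`. This is a sketch of the technique, not the submission's `shrink.py`:

```python
import ast

DEAD_TESTS = {"args.load_snapshot", "args.snapshot_post_hessian"}

class DeadIfRemover(ast.NodeTransformer):
    """Drop `if args.<flag>:` blocks whose test EXACTLY matches a dead flag.

    Exact comparison (not substring) means `not args.load_snapshot`,
    which guards the whole training loop, is left alone. An `if` with
    an else clause is also skipped rather than silently losing its body.
    """
    def visit_If(self, node):
        self.generic_visit(node)
        if ast.unparse(node.test) in DEAD_TESTS and not node.orelse:
            return None                        # remove the whole branch
        return node

src = (
    "x = 1\n"
    "if args.load_snapshot:\n    restore()\n"
    "if not args.load_snapshot:\n    train()\n"
)
shrunk = ast.unparse(DeadIfRemover().visit(ast.parse(src)))
```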
shrink.py:
- Add orelse guard: if a dead-code target `if` ever gains an `else`
  clause, skip removal instead of silently deleting the else body
- Catch FileNotFoundError when uvx is not on PATH
- Show pyminify stderr on failure (was silently discarded)
- Wrap post-pyminify processing in try/finally for reliable temp
  file cleanup on all error paths

train_gpt_human.py:
- Gate _byte_unshuffle on _BYTE_SHUFFLE for symmetry with the
  conditional _byte_shuffle call (benign due to magic-header check
  but was an unnecessary asymmetry)
- Document that TTT progress log bpb is a rank-0 local estimate
  (1/world_size of data on multi-GPU); final returned value is
  globally correct after all_reduce

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 30, 2026
Ported from PR openai#1089:
- Brotli (quality=11) + byte-shuffle (stride=2) replaces LZMA
  Expected: ~5-10% smaller artifacts, freeing bytes for higher precision
- Mixed int5/int6/int7 per-layer based on Hessian sensitivity
  Most sensitive layers get int7, least get int5
- MIXED_PRECISION=1 enabled by default
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request Mar 30, 2026
Bug 1 (CRITICAL): A matrix never rescaled after Gershgorin scaling.
  Must do: A = s * A * s (diagonal similarity transform).
  Without this, NS iterations use inconsistent eigenvalue estimates.

Bug 2 (CRITICAL): Wrong Polar Express coefficients for iters 3-5.
  Our coefficients were from the Frobenius-init table.
  PR openai#1089 uses refined AOL-specific coefficients.

Bug 3: Loop recomputed A on iter 0 instead of reusing AOL's A.
  AOL's preconditioned A should be used for the first NS step.

Root cause of 3x slower convergence: all three bugs compound.
Implementation now matches PR openai#1089 lines 163-219 exactly.

Co-Authored-By: Kevin Tan <kft@lightarchitects.io>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
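Bug 1 in this commit is the classic consistency trap with cached Gram matrices: once AOL's Gershgorin-derived scales `s` are applied to the rows of `X`, the cached `A = X X^T` must be rescaled as `s * A * s` (a diagonal similarity transform, `D A D` with `D = diag(s)`) or it no longer describes the scaled matrix. A small NumPy illustration (the scale formula follows the AOL recipe; it is not a copy of this PR's lines 163-219):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))
A = X @ X.T                                    # cached Gram matrix

# AOL-style scales from Gershgorin row sums of |A|
s = 1.0 / np.sqrt(np.abs(A).sum(axis=1))

X_scaled = s[:, None] * X                      # rows of X rescaled
A_rescaled = (s[:, None] * A) * s[None, :]     # A <- s * A * s  (the fix)

# Without the rescale, A would describe the OLD X; with it, the
# identity A_rescaled == X_scaled @ X_scaled.T holds exactly.
spec = np.linalg.norm(X_scaled, ord=2)         # AOL bound: spectral norm <= 1
```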
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 31, 2026
Critical realization: our ported innovations (EngramLite, gated skips,
LeakyReLU(0.3)², Turbo-Muon) HURT by 0.003 BPB vs the PR openai#1060 baseline.
PR openai#1060 gets 1.1122, our merged version gets 1.1151. The partial port
of PR openai#1089 innovations doesn't capture their interactions.

Clean path to record: run PR openai#1060's exact code with FA3 + GPTQ_RESERVE=9s.
Expected: 1.1115-1.1122 BPB (well below 1.1144 threshold).
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 31, 2026
Complete pipeline to beat openai#1 (1.0806 BPB):
- train_gpt_scylla_stack.py: PR openai#1060 + metadata-based tokenizer loading
- retokenize.py: TokenMonster retokenization of FineWeb
- deploy_scylla.sh: two-phase deploy (retokenize once, train many)

Strategy: PR openai#1143 used old stack. We use PR openai#1060's modern stack
(GPTQ, XSA-all, coprime loader) on the same Scylla tokenizer.
Expected: ~1.070-1.080 BPB (beating both openai#1143 and openai#1089).
Regenerated compressed submission from updated train_gpt_human.py.
24,615 bytes → 27,402 bytes (+2.8KB from ~300 lines of TTT code).
Dead code removal now active: strips snapshot save/restore branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bortlesboat added a commit to Bortlesboat/parameter-golf that referenced this pull request Mar 31, 2026
….1126

3-seed results: 1.1126/1.1123/1.1129 (mean 1.1126, std 0.0003)
Built on PR openai#1089 with GPTQ reserve optimization (14s to 9s).
Includes experimental fused Triton MLP kernel (hard-disabled).
- ASQU v3: Hard-coded per-layer LeakyReLU slopes from 3 rounds of adaptive
  tuning [-0.014..0.468], threaded through MLP→Block→GPT and Hessian copies
- Fused leaky_relu²: torch.where replaces F.leaky_relu().square(), avoiding
  intermediate tensor materialization (~120MB less HBM traffic/fwd)
- QAT dual-compile: Pre-cache both non-QAT and QAT compiled graphs during
  warmup, eliminating 5-30s mid-training recompile stall
- foreach EMA: torch._foreach_lerp_ for fp32 params with dtype-safe fallback
  for bf16 embeddings, reducing kernel launch overhead
- README: Add ASQU v3 section, update architecture table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
wsylvest added a commit to wsylvest/parameter-golf that referenced this pull request Mar 31, 2026
CRITICAL BUG FIX:
- Removed DDP wrapping entirely (was causing double gradient reduction
  on multi-GPU: DDP all-reduce + Muon reduce-scatter = 2× division)
- Added manual coalesced all-reduce for non-Muon replicated params
  (embeddings, scalars, head) — matches PR openai#1089 architecture
- Removed DDP import and all require_backward_grad_sync references
- Warmup loop no longer uses DDP sync

Added:
- AUDIT.md: comprehensive 19-point audit comparing all algorithms
  against PR openai#1089, documenting every design choice and difference
- 5 novel algorithm proposals for further improvement

Verified:
- 19/19 automated audit checks pass
- Polar Express coefficients match to <1e-7 relative error
- All hash primes match exactly
- All hyperparameter defaults match frontier values
- Syntax clean (ast.parse passes)
wsylvest added a commit to wsylvest/parameter-golf that referenced this pull request Mar 31, 2026
…all SDP backends, remove duplicates

Critical fixes:
- U-Net skip connections: changed from additive (x + g*sw*skip) to lerp
  interpolation (torch.lerp(scaled_skip, x, g)) matching PR openai#1089 exactly
- Polar Express coefficients: added missing iter 6 entry, upgraded all
  coefficients to full 15+ digit precision matching reference
- SDP backends: enabled mem_efficient and math backends (was disabled)
- Removed duplicate INT8_KEEP_FLOAT_MAX_NUMEL definition (line 726)
- Removed duplicate quant_raw_bytes assignment (line 1819)

Audit V2 document with independent findings added.