
Record Submission: 1.1091 BPB - Turbo-Muon + EngramLite + ParamBanking + XSA (11L 512d)#1089

Open
mikeapedia wants to merge 15 commits into openai:main from mikeapedia:submission/turbo-muon-engram-banking

Conversation


@mikeapedia mikeapedia commented Mar 29, 2026

Record Submission: 1.1091 BPB — Turbo-Muon + EngramLite + Parameter Banking + GPTQ Mixed-Precision

val_bpb: 1.1091 (3-seed mean, std 0.0005) | ~15.3 MB | 8×H100 SXM

Results (8×H100 80GB SXM)

| Seed | step_avg | steps | val_bpb (SW s64) | val_bpb (roundtrip) | Artifact bytes |
|---|---|---|---|---|---|
| 42 | 93.26ms | 6284 | 1.1086 | 1.1324 | 15,992,528 |
| 1337 | 93.11ms | 6295 | 1.1090 | 1.1328 | 15,993,413 |
| 2025 | 93.11ms | 6294 | 1.1096 | 1.1335 | 15,993,904 |
| **Mean** | 93.16ms | 6291 | 1.1091 | 1.1329 | |

Summary

11-layer GPT (512d, 8H, 4KV GQA) combining eight key innovations over the PR #609 baseline:

  • Turbo-Muon Optimizer — AOL preconditioning + Polar Express coefficients + row_col post-NS normalization (4 NS iterations instead of 5)
  • EngramLite — Multi-head prime-based hash embeddings (bigram + trigram, 2 heads, 8192 buckets)
  • Parameter Banking — 3D bank tensors enabling batched Newton-Schulz via torch.bmm
  • ASQU v3 Per-Layer Slopes — Fixed per-layer LeakyReLU negative slopes from 3 rounds of adaptive tuning: layer 0 near-ReLU² (−0.014) → layer 10 (0.468)
  • U-Net Skip Connections — Learned sigmoid-gated encoder/decoder skip paths
  • ValueEmbedding — Token identity reinjection at deep layers (9, 10)
  • SmearGate — Causal shift blending with predecessor token
  • XSA (all 11 layers) — Efficient cross-sequence attention via GQA-aware reshape
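For readers skimming the list above, the Parameter Banking idea (batching Newton-Schulz orthogonalization over a 3D bank of weight matrices) can be sketched in a dependency-light form. This uses the classic cubic Newton-Schulz iteration rather than the submission's tuned AOL + Polar Express coefficients (which get away with only 4 steps), and NumPy's batched `@` in place of `torch.bmm`; the function name and step count are illustrative:

```python
import numpy as np

def banked_newton_schulz(bank, steps=40):
    """Orthogonalize a 3D bank of matrices (B, m, n) in one batched loop.

    Classic cubic Newton-Schulz shown for clarity; the PR uses tuned
    Polar Express coefficients with AOL preconditioning to converge in
    4 steps. The banking idea is the same either way: each iteration is
    one matmul over the whole bank (torch.bmm in the submission).
    """
    # Frobenius norm bounds the spectral norm, so this guarantees convergence
    X = bank / np.linalg.norm(bank, axis=(1, 2), keepdims=True)
    for _ in range(steps):
        A = X @ X.transpose(0, 2, 1)          # (B, m, m), batched Gram matrices
        X = 1.5 * X - 0.5 * (A @ X)           # cubic NS step, batched
    return X

rng = np.random.default_rng(0)
bank = rng.standard_normal((3, 8, 8))         # 3 weight matrices banked together
Q = banked_newton_schulz(bank)
err = np.abs(Q @ Q.transpose(0, 2, 1) - np.eye(8)).max()
```

The payoff of banking is that B separate (m×m)(m×n) matmuls become one batched kernel launch per iteration.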

Architecture

Partial RoPE (16/64 dims), LN Scale (1/√(layer+1)), logit softcap 30.0, tied embeddings, per-head QK gain (init 1.5), LeakyReLU(ASQU v3 per-layer)² MLP at 3.5× width.
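The MLP activation named here, per a later commit it is literally `F.leaky_relu(x, slope).square()`, reduces to a two-line scalar form. Only the endpoint slopes (−0.014 and 0.468) are quoted in this PR; the intermediate per-layer values come from the tuning rounds and are not listed:

```python
def leaky_relu_sq(x: float, slope: float) -> float:
    """Scalar form of the MLP activation: F.leaky_relu(x, slope) ** 2.

    With slope -0.014 (layer 0) this is near ReLU-squared; with slope
    0.468 (layer 10) most negative signal passes through before squaring.
    """
    y = x if x >= 0.0 else slope * x
    return y * y

# Endpoint slopes quoted in the PR; intermediate layers use tuned values
# not listed in this description.
SLOPE_LAYER_0, SLOPE_LAYER_10 = -0.014, 0.468
```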

Compression

  • GPTQ mixed-precision — int5 baseline with Hessian-sensitivity-based selective promotion to int6/int7. Hessian collection (64 calibration batches) runs within the 600s training budget via gptq_reserve_ms=9000.
  • Late QAT — Soft-round sigmoid activated at LR scale < 15%, α ramp 1→16
  • SWA (float32, every 50 steps) + EMA (decay=0.997)
  • Brotli + byte-shuffle compression
  • Code shrinking — AST dead-code removal + pyminify + LZMA self-extracting wrapper (145 KB → 28 KB), freeing ~117 KB artifact budget for model weights
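The byte-shuffle + compressor pairing above can be sketched with the stdlib alone, using `lzma` (the PR's documented fallback) in place of brotli. Stride 2 matches a later commit's description; the helper names are illustrative, not the submission's:

```python
import lzma

def byte_shuffle(data: bytes, stride: int = 2) -> bytes:
    """Group bytes by position within each `stride`-byte word.

    For a stream of 2-byte quantized values this puts all low bytes
    together and all high bytes together, which generic compressors
    exploit much better than the interleaved original.
    """
    return b"".join(data[i::stride] for i in range(stride))

def byte_unshuffle(data: bytes, stride: int = 2) -> bytes:
    """Exact inverse of byte_shuffle, handling lengths not divisible by stride."""
    n, rem = divmod(len(data), stride)
    lanes, off = [], 0
    for i in range(stride):
        size = n + (1 if i < rem else 0)       # first `rem` lanes carry one extra byte
        lanes.append(data[off:off + size])
        off += size
    body = bytes(b for word in zip(*lanes) for b in word)
    tail = b"".join(lane[n:] for lane in lanes)
    return body + tail

payload = bytes(range(256)) * 64               # stand-in for quantized weights
packed = lzma.compress(byte_shuffle(payload))
```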

Key Hyperparameters

| Param | Value |
|---|---|
| Layers | 11 (512d) |
| MLP | 3.5× with LeakyReLU(ASQU v3 per-layer)² |
| XSA | All 11 layers |
| Muon | LR 0.025, momentum=0.99, WD=0.04 |
| SWA | every 50 steps after 20% |

Credits

Built on PR #609 (1.1154 bpb). Techniques from PRs #198, #265, #287, #399, #493, #518, #634.

Extra Dependencies

brotli>=1.1 (falls back to lzma if missing). torch>=2.11, Python>=3.12.

Full details in the submission README. Human-readable source in train_gpt_human.py. Training logs for all 3 seeds included.

mikeapedia and others added 3 commits March 29, 2026 09:41
…xed-Precision

11L/512d GPT with Turbo-Muon (AOL+Polar Express+row_col), EngramLite hash
embeddings, U-Net skip connections, Parameter Banking, GPTQ mixed-precision
int6/int7 with Hessian sensitivity, brotli compression.

Dev-run benchmark: 1.1119 val_bpb (sliding window, 1xH100).
Awaiting 3-seed validation on 8xH100 before opening PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Syncs version requirements with pyproject.toml. torch>=2.11 is needed for
torch.compile fullgraph improvements and CUDA 13.0 support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Seeds: 42, 1337, 2025
Mean val_bpb (SW s64): 1.1086
Max artifact bytes: 15997089

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mikeapedia and others added 4 commits March 29, 2026 13:01
… budget

Removed periodic eval_val during training loop (fired at step 0 and 4000)
and the diagnostic post-EMA eval. These burned ~10-15s of wallclock on
evals that don't affect the final score — the real evaluation happens in
the post-quantization sweep. Reclaimed time yields ~100-150 extra training
steps at steady-state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AST dead-code removal + pyminify + LZMA/base85 self-extracting wrapper.
Reduces train_gpt.py code_bytes, freeing artifact budget for model weights.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d architecture details

- Added XSA (Cross-Sequence Attention) all 11 layers as key innovation
- Fixed quantization description: int5 baseline with selective promotion to int6/int7
- Clarified GPTQ Hessian collection runs within training budget (14s reserved)
- Added architecture details: Partial RoPE, LN Scale, logit softcap, GQA, tied embeddings, QK gain

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changes from previous run:
- Removed mid-training and diagnostic eval_val calls (reclaims ~10-15s for training)
- Shrunk train_gpt.py via AST pruning + pyminify + LZMA (125KB -> 24KB, frees ~99KB code budget)
- Human-readable source preserved as train_gpt_human.py

3-seed mean val_bpb (SW s64): 1.1091
Max artifact bytes: 15993904

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mikeapedia mikeapedia changed the title Record Submission: 1.1086 BPB - Turbo-Muon + EngramLite + ParamBanking (11L 512d) Record Submission: 1.1091 BPB - Turbo-Muon + EngramLite + ParamBanking + XSA (11L 512d) Mar 29, 2026
Observed GPTQ times across 3 seeds: 7.13s, 7.16s, 7.25s (max 7.25s).
9s reserve gives 1.75s safety margin (24% headroom) while freeing ~5s
of training budget (~53 extra steps at 93ms/step).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
demouo added a commit to demouo/parameter-golf that referenced this pull request Mar 30, 2026
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 30, 2026
Both from top submissions, zero code risk:
  MUON_BACKEND_STEPS=4 (PR openai#1089): 4 NS iterations vs 5
    Saves ~1-2ms/step, proven at 1.1086 BPB
  BIGRAM_VOCAB_SIZE=4096 (PR openai#1072): larger hash table
    More n-gram patterns, proven at 1.117 BPB

MLP 3.5x investigated but doesn't fit 16MB budget (+2.2MB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 30, 2026
Decoded from base85+LZMA compressed submission. Key innovations:
- EngramLite: multi-head bigram+trigram hash (8192 buckets, 2 heads)
- LeakyReLU(0.3)² (not 0.5)
- MLP 3.5× (fits via mixed int6/int7 quantization)
- AOL Polar Express Muon coefficients (4 NS steps)
- Sigmoid-gated skip connections
- Brotli + byte-shuffle compression
- Hessian sensitivity-based bit allocation (int6/int7 mixed)
- Soft-round QAT
- LR floor 0.05 (warmdown doesn't reach zero)

Requires: brotli package

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 30, 2026
Ported from openai#1 submission (1.1086 BPB) into our merged stack:

1. EngramLite: multi-head bigram+trigram hash (8192 buckets, 2 heads,
   2 orders, 32 dim/head) with learned sigmoid gate. Replaces BigramHash.
2. Sigmoid-gated skip connections: gate = sigmoid(skip_gates[i]),
   x = lerp(skip_weight*skip, x, gate). More expressive than additive.
3. LeakyReLU(0.3)² (was 0.5). Fused kernel disabled for non-0.5 slopes.
4. Muon 4 NS steps (was 5). AOL Polar coefficients from PR openai#1089.
5. LR floor 0.05 (warmdown doesn't reach zero).

NOT ported (diminishing returns vs complexity):
- Brotli compression (keep LZMA)
- Mixed int6/int7 (keep uniform int6)
- Soft-round QAT

Expected: close to 1.1086 with FA3, possibly ~1.109-1.111.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
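The gated-skip merge quoted in this commit (`gate = sigmoid(skip_gates[i])`, `x = lerp(skip_weight*skip, x, gate)`) is easy to sanity-check in scalar form, recalling that `torch.lerp(a, b, w) = a + w * (b - a)`; the function name below is illustrative:

```python
import math

def gated_skip(x: float, skip: float, gate_logit: float, skip_weight: float) -> float:
    """Sigmoid-gated U-Net skip merge: torch.lerp(skip_weight*skip, x, g).

    g near 1 keeps the decoder stream x; g near 0 falls back to the
    scaled encoder skip. This is what makes it more expressive than a
    plain additive skip: the gate can fully suppress either path.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))    # sigmoid(skip_gates[i])
    a = skip_weight * skip
    return a + g * (x - a)                     # lerp(a, x, g)
```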
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 30, 2026
@AnirudhRahul

https://github.com/openai/parameter-golf/pull/1126/changes#diff-5ac0e90191b85d767bb3c0b9777c0871a9a0326e02bb2dd9269378ffa46af26e
^ Struggling to reproduce these results; the step times I'm getting are ~10% slower than those posted in these logs.

ahmettrkck added a commit to ahmettrkck/parameter-golf that referenced this pull request Mar 30, 2026
Bortlesboat added a commit to Bortlesboat/parameter-golf that referenced this pull request Mar 30, 2026
…ed mean)

3-seed results: 1.1131/1.1119/1.1133 (mean 1.1128, std 0.0008)
Built on PR openai#1089 with GPTQ reserve optimization (14s→9s) and
forward-only fused Triton MLP kernel architecture (currently disabled
pending torch.compile compat, falls back gracefully).

mikeapedia commented Mar 30, 2026

@AnirudhRahul - Just looked at the logs, and the only difference that jumps out at me is that I was running PyTorch 2.11.0+cu130, not cu126. For cu126, flash attention 3 might be better than SDPA.

mikeapedia and others added 5 commits March 30, 2026 11:27
Port score-first legal TTT from training-base: every token is scored
BEFORE any gradient update touches it. Last chunk scored but never
trained on. Includes Polyak EMA weight averaging, entropy-adaptive
epochs, multi-GPU support, and 600s eval budget guard.

Also reduces gptq_reserve_ms from 14s to 9s (observed max 7.25s,
24% headroom) to reclaim ~53 extra training steps per run.

TTT is opt-in via TTT_ENABLED=1. Default eval path unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
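The "score-first" legality invariant described in this commit (every chunk is scored with weights that have never seen it; the last chunk is scored but never trained on) fits in a short loop. The `score`/`update` callables stand in for the model's loss evaluation and optimizer step; the Polyak EMA, entropy-adaptive epochs, and budget guard are omitted:

```python
def score_first_ttt(chunks, score, update, state):
    """Score-first test-time training skeleton.

    Each chunk is scored BEFORE any gradient update touches it, so the
    reported loss is legal; the final chunk contributes to the score
    but is never trained on.
    """
    total = 0.0
    for i, chunk in enumerate(chunks):
        total += score(state, chunk)           # score with pre-update weights
        if i < len(chunks) - 1:                # last chunk: scored, never trained
            state = update(state, chunk)
    return total, state

# Toy check: log which "weight version" scored each chunk.
log = []
total, final = score_first_ttt(
    [10, 20, 30],
    score=lambda st, c: (log.append((st, c)), 0.0)[1],
    update=lambda st, c: st + 1,
    state=0,
)
```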
- Add Test-Time Training section describing score-first legal TTT
  with Polyak EMA, entropy-adaptive epochs, and budget guard
- Add TTT run command example
- Fix Hessian collection reference: 14s → 9s (matches gptq_reserve_ms)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move t_eval_phase before GPTQ quantization so the 600s eval budget
  correctly includes quantization + compression + decompression time.
  Previously it was set after model load, so TTT's budget guard didn't
  account for ~25-40s of GPTQ overhead and could exceed the budget.
- Add _eval_phase_elapsed helper and log GPTQ+compression load time.
- Add cumulative eval phase summary with 580s warning threshold.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three bugs fixed:
1. Dead code remover used "A." prefix but code uses "args." — remover
   was a complete no-op, never matching any branches. Fixed to exact
   "args.load_snapshot" / "args.snapshot_post_hessian" matching.
2. Used exact equality (not substring) to avoid matching negated forms
   like "not args.load_snapshot" which guards the 440-line training
   loop — substring matching would have deleted all training code.
3. __main__ block ignored sys.argv and had a destructive rename workflow
   that would overwrite train_gpt_human.py with the old shrunk version.
   Now accepts: python shrink.py <input> <output>
   Legacy no-args mode has safety check against clobbering existing files.

Dead code removal now saves ~15.7K chars (10.9%) by stripping snapshot
save/restore branches that are never used during competition runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
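The exact-match dead-code removal described in this commit, matching `args.load_snapshot` exactly so that negated guards like `not args.load_snapshot` survive, maps directly onto a stdlib `ast.NodeTransformer`. This is a sketch of the technique, not the submission's `shrink.py`:

```python
import ast

DEAD_TESTS = {"args.load_snapshot", "args.snapshot_post_hessian"}

class DeadIfRemover(ast.NodeTransformer):
    """Drop `if args.<flag>:` blocks whose test EXACTLY matches a dead flag.

    Exact comparison (not substring) means `not args.load_snapshot`,
    which guards the whole training loop, is left alone. An `if` with
    an else clause is also skipped rather than silently losing its body.
    """
    def visit_If(self, node):
        self.generic_visit(node)
        if ast.unparse(node.test) in DEAD_TESTS and not node.orelse:
            return None                        # remove the whole branch
        return node

src = (
    "x = 1\n"
    "if args.load_snapshot:\n    restore()\n"
    "if not args.load_snapshot:\n    train()\n"
)
shrunk = ast.unparse(DeadIfRemover().visit(ast.parse(src)))
```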
shrink.py:
- Add orelse guard: if a dead-code target `if` ever gains an `else`
  clause, skip removal instead of silently deleting the else body
- Catch FileNotFoundError when uvx is not on PATH
- Show pyminify stderr on failure (was silently discarded)
- Wrap post-pyminify processing in try/finally for reliable temp
  file cleanup on all error paths

train_gpt_human.py:
- Gate _byte_unshuffle on _BYTE_SHUFFLE for symmetry with the
  conditional _byte_shuffle call (benign due to magic-header check
  but was an unnecessary asymmetry)
- Document that TTT progress log bpb is a rank-0 local estimate
  (1/world_size of data on multi-GPU); final returned value is
  globally correct after all_reduce

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 30, 2026
Ported from PR openai#1089:
- Brotli (quality=11) + byte-shuffle (stride=2) replaces LZMA
  Expected: ~5-10% smaller artifacts, freeing bytes for higher precision
- Mixed int5/int6/int7 per-layer based on Hessian sensitivity
  Most sensitive layers get int7, least get int5
- MIXED_PRECISION=1 enabled by default
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request Mar 30, 2026
Bug 1 (CRITICAL): A matrix never rescaled after Gershgorin scaling.
  Must do: A = s * A * s (diagonal similarity transform).
  Without this, NS iterations use inconsistent eigenvalue estimates.

Bug 2 (CRITICAL): Wrong Polar Express coefficients for iters 3-5.
  Our coefficients were from the Frobenius-init table.
  PR openai#1089 uses refined AOL-specific coefficients.

Bug 3: Loop recomputed A on iter 0 instead of reusing AOL's A.
  AOL's preconditioned A should be used for the first NS step.

Root cause of 3x slower convergence: all three bugs compound.
Implementation now matches PR openai#1089 lines 163-219 exactly.

Co-Authored-By: Kevin Tan <kft@lightarchitects.io>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
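Bug 1 in this commit is the classic consistency trap with cached Gram matrices: once AOL's Gershgorin-derived scales `s` are applied to the rows of `X`, the cached `A = X X^T` must be rescaled as `s * A * s` (a diagonal similarity transform, `D A D` with `D = diag(s)`) or it no longer describes the scaled matrix. A small NumPy illustration (the scale formula follows the AOL recipe; it is not a copy of this PR's lines 163-219):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))
A = X @ X.T                                    # cached Gram matrix

# AOL-style scales from Gershgorin row sums of |A|
s = 1.0 / np.sqrt(np.abs(A).sum(axis=1))

X_scaled = s[:, None] * X                      # rows of X rescaled
A_rescaled = (s[:, None] * A) * s[None, :]     # A <- s * A * s  (the fix)

# Without the rescale, A would describe the OLD X; with it, the
# identity A_rescaled == X_scaled @ X_scaled.T holds exactly.
spec = np.linalg.norm(X_scaled, ord=2)         # AOL bound: spectral norm <= 1
```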
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 31, 2026
Critical realization: our ported innovations (EngramLite, gated skips,
LeakyReLU(0.3)², Turbo-Muon) HURT by 0.003 BPB vs the PR openai#1060 baseline.
PR openai#1060 gets 1.1122, our merged version gets 1.1151. The partial port
of PR openai#1089 innovations doesn't capture their interactions.

Clean path to record: run PR openai#1060's exact code with FA3 + GPTQ_RESERVE=9s.
Expected: 1.1115-1.1122 BPB (well below 1.1144 threshold).
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 31, 2026
Complete pipeline to beat openai#1 (1.0806 BPB):
- train_gpt_scylla_stack.py: PR openai#1060 + metadata-based tokenizer loading
- retokenize.py: TokenMonster retokenization of FineWeb
- deploy_scylla.sh: two-phase deploy (retokenize once, train many)

Strategy: PR openai#1143 used old stack. We use PR openai#1060's modern stack
(GPTQ, XSA-all, coprime loader) on the same Scylla tokenizer.
Expected: ~1.070-1.080 BPB (beating both openai#1143 and openai#1089).
Regenerated compressed submission from updated train_gpt_human.py.
24,615 bytes → 27,402 bytes (+2.8KB from ~300 lines of TTT code).
Dead code removal now active: strips snapshot save/restore branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bortlesboat added a commit to Bortlesboat/parameter-golf that referenced this pull request Mar 31, 2026
….1126

3-seed results: 1.1126/1.1123/1.1129 (mean 1.1126, std 0.0003)
Built on PR openai#1089 with GPTQ reserve optimization (14s to 9s).
Includes experimental fused Triton MLP kernel (hard-disabled).
- ASQU v3: Hard-coded per-layer LeakyReLU slopes from 3 rounds of adaptive
  tuning [-0.014..0.468], threaded through MLP→Block→GPT and Hessian copies
- Fused leaky_relu²: torch.where replaces F.leaky_relu().square(), avoiding
  intermediate tensor materialization (~120MB less HBM traffic/fwd)
- QAT dual-compile: Pre-cache both non-QAT and QAT compiled graphs during
  warmup, eliminating 5-30s mid-training recompile stall
- foreach EMA: torch._foreach_lerp_ for fp32 params with dtype-safe fallback
  for bf16 embeddings, reducing kernel launch overhead
- README: Add ASQU v3 section, update architecture table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
wsylvest added a commit to wsylvest/parameter-golf that referenced this pull request Mar 31, 2026
CRITICAL BUG FIX:
- Removed DDP wrapping entirely (was causing double gradient reduction
  on multi-GPU: DDP all-reduce + Muon reduce-scatter = 2× division)
- Added manual coalesced all-reduce for non-Muon replicated params
  (embeddings, scalars, head) — matches PR openai#1089 architecture
- Removed DDP import and all require_backward_grad_sync references
- Warmup loop no longer uses DDP sync

Added:
- AUDIT.md: comprehensive 19-point audit comparing all algorithms
  against PR openai#1089, documenting every design choice and difference
- 5 novel algorithm proposals for further improvement

Verified:
- 19/19 automated audit checks pass
- Polar Express coefficients match to <1e-7 relative error
- All hash primes match exactly
- All hyperparameter defaults match frontier values
- Syntax clean (ast.parse passes)
wsylvest added a commit to wsylvest/parameter-golf that referenced this pull request Mar 31, 2026
…all SDP backends, remove duplicates

Critical fixes:
- U-Net skip connections: changed from additive (x + g*sw*skip) to lerp
  interpolation (torch.lerp(scaled_skip, x, g)) matching PR openai#1089 exactly
- Polar Express coefficients: added missing iter 6 entry, upgraded all
  coefficients to full 15+ digit precision matching reference
- SDP backends: enabled mem_efficient and math backends (was disabled)
- Removed duplicate INT8_KEEP_FLOAT_MAX_NUMEL definition (line 726)
- Removed duplicate quant_raw_bytes assignment (line 1819)

Audit V2 document with independent findings added.