Record Submission: 1.1091 BPB - Turbo-Muon + EngramLite + ParamBanking + XSA (11L 512d) #1089
Open
mikeapedia wants to merge 15 commits into openai:main
Conversation
Mixed-Precision 11L/512d GPT with Turbo-Muon (AOL + Polar Express + row_col), EngramLite hash embeddings, U-Net skip connections, Parameter Banking, GPTQ mixed-precision int6/int7 with Hessian sensitivity, and brotli compression. Dev-run benchmark: 1.1119 val_bpb (sliding window, 1×H100). Awaiting 3-seed validation on 8×H100 before opening the PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Syncs version requirements with pyproject.toml. torch>=2.11 is needed for torch.compile fullgraph improvements and CUDA 13.0 support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Seeds: 42, 1337, 2025 Mean val_bpb (SW s64): 1.1086 Max artifact bytes: 15997089 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… budget: removed the periodic eval_val calls in the training loop (fired at steps 0 and 4000) and the diagnostic post-EMA eval. These burned ~10-15 s of wallclock on evals that don't affect the final score — the real evaluation happens in the post-quantization sweep. The reclaimed time yields ~100-150 extra training steps at steady state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AST dead-code removal + pyminify + LZMA/base85 self-extracting wrapper. Reduces train_gpt.py code_bytes, freeing artifact budget for model weights. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
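The self-extracting wrapper described above can be sketched with the stdlib alone. This is a minimal illustration of the LZMA + base85 idea, not the submission's actual shrink.py output; the function name and stub text are assumptions, and the real pipeline also runs AST pruning and pyminify first.

```python
import base64
import lzma


def make_self_extracting(source: str) -> str:
    """Wrap Python source in a tiny stub that decompresses and execs it.

    Sketch of an LZMA + base85 self-extracting wrapper: the payload is
    compressed, base85-encoded, and embedded in a two-line loader.
    """
    payload = base64.b85encode(lzma.compress(source.encode())).decode()
    return (
        "import base64,lzma\n"
        f"exec(lzma.decompress(base64.b85decode({payload!r})).decode())\n"
    )


original = "X = 40 + 2\n"
stub = make_self_extracting(original)

# Running the stub reproduces the original module's effects.
ns = {}
exec(compile(stub, "<stub>", "exec"), ns)
assert ns["X"] == 42
```

For tiny inputs like this the stub is larger than the source; the technique only pays off once the compressed payload is much smaller than the original file, as with the ~125 KB training script described above.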
…d architecture details:
- Added XSA (Cross-Sequence Attention) in all 11 layers as a key innovation
- Fixed the quantization description: int5 baseline with selective promotion to int6/int7
- Clarified that GPTQ Hessian collection runs within the training budget (14 s reserved)
- Added architecture details: Partial RoPE, LN Scale, logit softcap, GQA, tied embeddings, QK gain
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changes from the previous run:
- Removed mid-training and diagnostic eval_val calls (reclaims ~10-15 s for training)
- Shrunk train_gpt.py via AST pruning + pyminify + LZMA (125 KB -> 24 KB, frees ~99 KB of code budget)
- Human-readable source preserved as train_gpt_human.py
3-seed mean val_bpb (SW s64): 1.1091
Max artifact bytes: 15993904
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Observed GPTQ times across 3 seeds: 7.13s, 7.16s, 7.25s (max 7.25s). 9s reserve gives 1.75s safety margin (24% headroom) while freeing ~5s of training budget (~53 extra steps at 93ms/step). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
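The budget arithmetic above can be checked directly (values copied from the commit message; the ~93 ms/step rate is treated as constant):

```python
# Sanity-check the GPTQ reserve arithmetic from the commit message.
observed_max = 7.25   # seconds, slowest GPTQ pass across the 3 seeds
old_reserve = 14.0    # previous reserve, seconds
new_reserve = 9.0     # proposed reserve, seconds
step_time = 0.093     # ~93 ms per training step at steady state

headroom = (new_reserve - observed_max) / observed_max   # safety margin
extra_steps = (old_reserve - new_reserve) / step_time    # reclaimed steps

assert abs(headroom - 0.24) < 0.01      # ~24% headroom over the slowest run
assert 53 <= extra_steps <= 54          # ~53 extra steps from the freed 5 s
```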
demouo added a commit to demouo/parameter-golf that referenced this pull request (Mar 30, 2026)
icryo added a commit to icryo/parameter-golf that referenced this pull request (Mar 30, 2026)
Both from top submissions, zero code risk: MUON_BACKEND_STEPS=4 (PR openai#1089): 4 NS iterations vs 5 Saves ~1-2ms/step, proven at 1.1086 BPB BIGRAM_VOCAB_SIZE=4096 (PR openai#1072): larger hash table More n-gram patterns, proven at 1.117 BPB MLP 3.5x investigated but doesn't fit 16MB budget (+2.2MB). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
icryo added a commit to icryo/parameter-golf that referenced this pull request (Mar 30, 2026)
Decoded from base85+LZMA compressed submission. Key innovations: - EngramLite: multi-head bigram+trigram hash (8192 buckets, 2 heads) - LeakyReLU(0.3)² (not 0.5) - MLP 3.5× (fits via mixed int6/int7 quantization) - AOL Polar Express Muon coefficients (4 NS steps) - Sigmoid-gated skip connections - Brotli + byte-shuffle compression - Hessian sensitivity-based bit allocation (int6/int7 mixed) - Soft-round QAT - LR floor 0.05 (warmdown doesn't reach zero) Requires: brotli package Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
icryo added a commit to icryo/parameter-golf that referenced this pull request (Mar 30, 2026)
Ported from openai#1 submission (1.1086 BPB) into our merged stack: 1. EngramLite: multi-head bigram+trigram hash (8192 buckets, 2 heads, 2 orders, 32 dim/head) with learned sigmoid gate. Replaces BigramHash. 2. Sigmoid-gated skip connections: gate = sigmoid(skip_gates[i]), x = lerp(skip_weight*skip, x, gate). More expressive than additive. 3. LeakyReLU(0.3)² (was 0.5). Fused kernel disabled for non-0.5 slopes. 4. Muon 4 NS steps (was 5). AOL Polar coefficients from PR openai#1089. 5. LR floor 0.05 (warmdown doesn't reach zero). NOT ported (diminishing returns vs complexity): - Brotli compression (keep LZMA) - Mixed int6/int7 (keep uniform int6) - Soft-round QAT Expected: close to 1.1086 with FA3, possibly ~1.109-1.111. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
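The EngramLite lookup described in item 1 amounts to hashing each position's trailing bigram and trigram into a fixed bucket table, one table per head, before a learned sigmoid gate mixes the gathered embeddings in. A minimal sketch of the bucketing step (the mixing constants and per-head salt here are illustrative assumptions, not the submission's actual primes, and the gate/embedding math is omitted):

```python
# Illustrative EngramLite-style n-gram hash bucketing.
NUM_BUCKETS = 8192  # bucket count from the description above


def ngram_bucket(tokens, head: int) -> int:
    """Hash a bigram or trigram of token ids into one of NUM_BUCKETS.

    Polynomial rolling hash with a per-head salt; the real constants
    in the submission differ.
    """
    h = 0x9E3779B1 * (head + 1)             # per-head salt (assumed)
    for t in tokens:
        h = (h * 1000003 + t) & 0xFFFFFFFF  # 32-bit rolling mix
    return h % NUM_BUCKETS


seq = [17, 99, 4, 256]
# Per position: bucket for the trailing bigram (head 0) and trigram (head 1).
buckets = [
    (ngram_bucket(seq[i - 1 : i + 1], 0), ngram_bucket(seq[i - 2 : i + 1], 1))
    for i in range(2, len(seq))
]
assert all(0 <= b < NUM_BUCKETS for pair in buckets for b in pair)
```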
icryo added a commit to icryo/parameter-golf that referenced this pull request (Mar 30, 2026)
https://github.com/openai/parameter-golf/pull/1126/changes#diff-5ac0e90191b85d767bb3c0b9777c0871a9a0326e02bb2dd9269378ffa46af26e
ahmettrkck added a commit to ahmettrkck/parameter-golf that referenced this pull request (Mar 30, 2026)
Bortlesboat added a commit to Bortlesboat/parameter-golf that referenced this pull request (Mar 30, 2026)
…ed mean) 3-seed results: 1.1131/1.1119/1.1133 (mean 1.1128, std 0.0008) Built on PR openai#1089 with GPTQ reserve optimization (14s→9s) and forward-only fused Triton MLP kernel architecture (currently disabled pending torch.compile compat, falls back gracefully).
Author
@AnirudhRahul - Just looked at the logs, and the only difference that jumps out at me is that I was running PyTorch 2.11.0+cu130, not cu126. For cu126, Flash Attention 3 might be better than SDPA.
Port score-first legal TTT from training-base: every token is scored BEFORE any gradient update touches it; the last chunk is scored but never trained on. Includes Polyak EMA weight averaging, entropy-adaptive epochs, multi-GPU support, and a 600 s eval budget guard. Also reduces gptq_reserve_ms from 14 s to 9 s (observed max 7.25 s, 24% headroom) to reclaim ~53 extra training steps per run. TTT is opt-in via TTT_ENABLED=1; the default eval path is unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
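The score-first invariant reduces to an ordering constraint in the eval loop: score a chunk with the current weights, then update on it, and never update on the final chunk (there is nothing left for that update to score). A toy sketch of just that ordering, with stand-in `score`/`update` callables in place of the real eval and gradient-step functions (EMA, adaptive epochs, and the budget guard are omitted):

```python
def score_first_ttt(chunks, score, update, params):
    """Score-first test-time training loop (ordering sketch).

    Each chunk is scored with the *current* parameters before any
    update derived from it is applied, so no token is ever scored by
    a model that has already trained on it. The last chunk is scored
    but never trained on.
    """
    losses = []
    for i, chunk in enumerate(chunks):
        losses.append(score(params, chunk))  # score BEFORE updating
        if i < len(chunks) - 1:              # skip the useless final update
            params = update(params, chunk)
    return losses, params


chunks = [[1, 2], [3], [4, 5]]
losses, p = score_first_ttt(
    chunks,
    score=lambda p, c: p,       # toy: "loss" is just the current params
    update=lambda p, c: p + 1,  # toy: each update bumps params by 1
    params=0,
)
assert losses == [0, 1, 2]  # chunk i is scored after exactly i updates
assert p == 2               # the last chunk produced no update
```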
- Add a Test-Time Training section describing score-first legal TTT with Polyak EMA, entropy-adaptive epochs, and the budget guard
- Add a TTT run command example
- Fix the Hessian collection reference: 14 s → 9 s (matches gptq_reserve_ms)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move t_eval_phase before GPTQ quantization so the 600 s eval budget correctly includes quantization + compression + decompression time. Previously it was set after model load, so TTT's budget guard didn't account for ~25-40 s of GPTQ overhead and could exceed the budget.
- Add an _eval_phase_elapsed helper and log GPTQ + compression load time.
- Add a cumulative eval-phase summary with a 580 s warning threshold.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three bugs fixed:
1. The dead-code remover used an "A." prefix but the code uses "args." — the remover was a complete no-op, never matching any branch. Fixed to exact "args.load_snapshot" / "args.snapshot_post_hessian" matching.
2. Used exact equality (not substring matching) to avoid matching negated forms like "not args.load_snapshot", which guards the 440-line training loop — substring matching would have deleted all the training code.
3. The __main__ block ignored sys.argv and had a destructive rename workflow that would overwrite train_gpt_human.py with the old shrunk version. It now accepts: python shrink.py <input> <output>. The legacy no-args mode has a safety check against clobbering existing files.
Dead-code removal now saves ~15.7K chars (10.9%) by stripping snapshot save/restore branches that are never used during competition runs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
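The exact-match fix can be illustrated with a minimal `ast.NodeTransformer`. Only the two flag strings come from the commit message; the class name and structure are a hypothetical sketch, which also includes the orelse guard described in the follow-up commit:

```python
import ast
import textwrap

# Branches guarded by these exact tests never run in competition.
DEAD_FLAGS = {"args.load_snapshot", "args.snapshot_post_hessian"}


class DeadBranchRemover(ast.NodeTransformer):
    """Drop `if args.<flag>:` branches that are dead in competition runs.

    Exact unparse matching keeps negated guards like
    `if not args.load_snapshot:` (which protect live training code),
    and any `if` with an else clause is skipped rather than silently
    losing its else body.
    """

    def visit_If(self, node: ast.If):
        self.generic_visit(node)
        if ast.unparse(node.test) in DEAD_FLAGS and not node.orelse:
            return None  # remove the whole dead branch
        return node


src = textwrap.dedent("""
    if args.load_snapshot:
        restore()
    if not args.load_snapshot:
        train()
""")
tree = DeadBranchRemover().visit(ast.parse(src))
out = ast.unparse(ast.fix_missing_locations(tree))
assert "restore" not in out  # dead branch stripped
assert "train" in out        # negated guard survives
```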
shrink.py:
- Add orelse guard: if a dead-code target `if` ever gains an `else` clause, skip removal instead of silently deleting the else body
- Catch FileNotFoundError when uvx is not on PATH
- Show pyminify stderr on failure (was silently discarded)
- Wrap post-pyminify processing in try/finally for reliable temp-file cleanup on all error paths
train_gpt_human.py:
- Gate _byte_unshuffle on _BYTE_SHUFFLE for symmetry with the conditional _byte_shuffle call (benign due to the magic-header check, but an unnecessary asymmetry)
- Document that the TTT progress-log bpb is a rank-0 local estimate (1/world_size of the data on multi-GPU); the final returned value is globally correct after all_reduce
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
icryo added a commit to icryo/parameter-golf that referenced this pull request (Mar 30, 2026)
Ported from PR openai#1089:
- Brotli (quality=11) + byte-shuffle (stride=2) replaces LZMA. Expected: ~5-10% smaller artifacts, freeing bytes for higher precision.
- Mixed int5/int6/int7 per layer based on Hessian sensitivity: the most sensitive layers get int7, the least sensitive get int5.
- MIXED_PRECISION=1 enabled by default.
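A minimal sketch of the byte-shuffle + compression round trip, assuming stride=2 and falling back to stdlib lzma when brotli is absent (the helper names are illustrative; the submission's `_byte_shuffle`/`_byte_unshuffle` may differ in detail):

```python
import lzma


def byte_shuffle(data: bytes, stride: int = 2) -> bytes:
    """Group every stride-th byte together (e.g. low/high bytes of
    2-byte words), which tends to compress better."""
    return b"".join(data[i::stride] for i in range(stride))


def byte_unshuffle(data: bytes, stride: int = 2) -> bytes:
    """Invert byte_shuffle, handling lengths not divisible by stride."""
    n, rem = divmod(len(data), stride)
    out, pos = bytearray(len(data)), 0
    for i in range(stride):
        size = n + (1 if i < rem else 0)  # plane i holds this many bytes
        out[i::stride] = data[pos : pos + size]
        pos += size
    return bytes(out)


raw = bytes(range(10)) * 3
try:
    import brotli  # preferred codec, as in the port above
    packed = brotli.compress(byte_shuffle(raw), quality=11)
    unpacked = byte_unshuffle(brotli.decompress(packed))
except ImportError:
    packed = lzma.compress(byte_shuffle(raw))  # stdlib fallback
    unpacked = byte_unshuffle(lzma.decompress(packed))
assert unpacked == raw  # lossless round trip
```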
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request (Mar 30, 2026)
Bug 1 (CRITICAL): The A matrix was never rescaled after Gershgorin scaling. Must do A = s * A * s (diagonal similarity transform). Without this, NS iterations use inconsistent eigenvalue estimates.
Bug 2 (CRITICAL): Wrong Polar Express coefficients for iters 3-5. Our coefficients were from the Frobenius-init table; PR openai#1089 uses refined AOL-specific coefficients.
Bug 3: The loop recomputed A on iter 0 instead of reusing AOL's A. AOL's preconditioned A should be used for the first NS step.
Root cause of the 3× slower convergence: all three bugs compound. The implementation now matches PR openai#1089 lines 163-219 exactly.
Co-Authored-By: Kevin Tan <kft@lightarchitects.io> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
icryo added a commit to icryo/parameter-golf that referenced this pull request (Mar 31, 2026)
Critical realization: our ported innovations (EngramLite, gated skips, LeakyReLU(0.3)², Turbo-Muon) HURT by 0.003 BPB vs the PR openai#1060 baseline. PR openai#1060 gets 1.1122, our merged version gets 1.1151. The partial port of PR openai#1089 innovations doesn't capture their interactions. Clean path to record: run PR openai#1060's exact code with FA3 + GPTQ_RESERVE=9s. Expected: 1.1115-1.1122 BPB (well below 1.1144 threshold).
icryo added a commit to icryo/parameter-golf that referenced this pull request (Mar 31, 2026)
Complete pipeline to beat openai#1 (1.0806 BPB):
- train_gpt_scylla_stack.py: PR openai#1060 + metadata-based tokenizer loading
- retokenize.py: TokenMonster retokenization of FineWeb
- deploy_scylla.sh: two-phase deploy (retokenize once, train many)
Strategy: PR openai#1143 used the old stack. We use PR openai#1060's modern stack (GPTQ, XSA-all, coprime loader) on the same Scylla tokenizer. Expected: ~1.070-1.080 BPB (beating both openai#1143 and openai#1089).
Regenerated compressed submission from updated train_gpt_human.py. 24,615 bytes → 27,402 bytes (+2.8KB from ~300 lines of TTT code). Dead code removal now active: strips snapshot save/restore branches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bortlesboat added a commit to Bortlesboat/parameter-golf that referenced this pull request (Mar 31, 2026)
….1126 3-seed results: 1.1126/1.1123/1.1129 (mean 1.1126, std 0.0003) Built on PR openai#1089 with GPTQ reserve optimization (14s to 9s). Includes experimental fused Triton MLP kernel (hard-disabled).
- ASQU v3: hard-coded per-layer LeakyReLU slopes from 3 rounds of adaptive tuning [-0.014..0.468], threaded through MLP→Block→GPT and the Hessian copies
- Fused leaky_relu²: torch.where replaces F.leaky_relu().square(), avoiding intermediate tensor materialization (~120 MB less HBM traffic per forward)
- QAT dual-compile: pre-cache both non-QAT and QAT compiled graphs during warmup, eliminating the 5-30 s mid-training recompile stall
- foreach EMA: torch._foreach_lerp_ for fp32 params with a dtype-safe fallback for bf16 embeddings, reducing kernel-launch overhead
- README: add ASQU v3 section, update architecture table
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
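The fused activation is a single elementwise select. A NumPy sketch of the identity it relies on (the real kernel uses torch.where with the per-layer ASQU slopes; the slope 0.3 below is just an example value):

```python
import numpy as np


def leaky_relu_sq(x: np.ndarray, slope: float) -> np.ndarray:
    """Fused leaky_relu(x, slope)**2 in one elementwise pass.

    np.where stands in for the torch.where form described above; the
    point is that no intermediate leaky_relu tensor is materialized
    before squaring.
    """
    return np.where(x > 0, x * x, (slope * x) ** 2)


x = np.array([-2.0, -0.5, 0.0, 1.5])
# Reference: leaky_relu(x, 0.3) computed the unfused way, then squared.
ref = np.maximum(x, 0) + np.minimum(x, 0) * 0.3
assert np.allclose(leaky_relu_sq(x, 0.3), ref**2)
```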
wsylvest added a commit to wsylvest/parameter-golf that referenced this pull request (Mar 31, 2026)
CRITICAL BUG FIX:
- Removed DDP wrapping entirely (it was causing double gradient reduction on multi-GPU: DDP all-reduce + Muon reduce-scatter = 2× division)
- Added manual coalesced all-reduce for non-Muon replicated params (embeddings, scalars, head) — matches the PR openai#1089 architecture
- Removed the DDP import and all require_backward_grad_sync references
- Warmup loop no longer uses DDP sync
Added:
- AUDIT.md: comprehensive 19-point audit comparing all algorithms against PR openai#1089, documenting every design choice and difference
- 5 novel algorithm proposals for further improvement
Verified:
- 19/19 automated audit checks pass
- Polar Express coefficients match to <1e-7 relative error
- All hash primes match exactly
- All hyperparameter defaults match frontier values
- Syntax clean (ast.parse passes)
wsylvest added a commit to wsylvest/parameter-golf that referenced this pull request (Mar 31, 2026)
…all SDP backends, remove duplicates
Critical fixes:
- U-Net skip connections: changed from additive (x + g*sw*skip) to lerp interpolation (torch.lerp(scaled_skip, x, g)), matching PR openai#1089 exactly
- Polar Express coefficients: added the missing iter 6 entry, upgraded all coefficients to full 15+ digit precision matching the reference
- SDP backends: enabled the mem_efficient and math backends (previously disabled)
- Removed duplicate INT8_KEEP_FLOAT_MAX_NUMEL definition (line 726)
- Removed duplicate quant_raw_bytes assignment (line 1819)
Audit V2 document with independent findings added.
Record Submission: 1.1091 BPB — Turbo-Muon + EngramLite + Parameter Banking + GPTQ Mixed-Precision
val_bpb: 1.1091 (3-seed mean, std 0.0005) | ~15.3 MB | 8×H100 SXM
Results (8×H100 80GB SXM)
Summary
11-layer GPT (512d, 8H, 4KV GQA) combining eight key innovations over the PR #609 baseline:
Architecture
Partial RoPE (16/64 dims), LN Scale (1/√(layer+1)), logit softcap 30.0, tied embeddings, per-head QK gain (init 1.5), LeakyReLU(ASQU v3 per-layer)² MLP at 3.5× width.
Compression
gptq_reserve_ms=9000.
Key Hyperparameters
Credits
Built on PR #609 (1.1154 bpb). Techniques from PRs #198, #265, #287, #399, #493, #518, #634.
Extra Dependencies
brotli>=1.1 (falls back to lzma if missing). torch>=2.11, Python>=3.12.
Full details in the submission README. Human-readable source in train_gpt_human.py. Training logs for all 3 seeds included.