Record: EMA-GPU + Multi-Order N-gram Backoff + PE Confidence (val_bpb=0.9393)#187
Closed
Idan3011 wants to merge 73 commits into openai:main
Conversation
Two CastedLinear(512,512) layers applied to token embeddings before entering the residual stream. No activation between them. Weights optimized via Muon alongside block matrix params. Also updates .gitignore for venv and build artifacts.
Tests whether true nonlinearity improves over the linear-only factorization that scored val_bpb 1.4188.
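The factorization under test can be sketched as follows (a minimal stand-in: plain nn.Linear replaces CastedLinear, whose definition lives in the training script; the `nonlinear` flag marks the GELU variant being compared against the linear-only baseline):

```python
import torch
import torch.nn as nn

class EmbedProjection(nn.Module):
    """Two 512x512 linear layers applied to token embeddings before the
    residual stream. With nonlinear=False there is no activation between
    them, so the pair is still an overall linear map; nonlinear=True
    inserts a GELU, the 'true nonlinearity' being tested."""
    def __init__(self, dim=512, nonlinear=False):
        super().__init__()
        # plain nn.Linear stands in for CastedLinear here
        self.proj1 = nn.Linear(dim, dim, bias=False)
        self.act = nn.GELU() if nonlinear else nn.Identity()
        self.proj2 = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        return self.proj2(self.act(self.proj1(x)))

emb = torch.randn(4, 16, 512)   # (batch, seq, dim)
out = EmbedProjection()(emb)
```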
Encoder blocks 0-3 run twice with RMS norm between passes. Decoder runs once using skip connections from the refined second encoder pass. 13 effective layers from 9 physical blocks, zero extra parameters.
All 9 blocks run twice with RMS norm between passes. 18 effective layers from 9 physical blocks, zero extra params. Replaces encoder-only recurrence from previous commit.
17 effective layers from 9 physical blocks. RMS norm between each encoder pass. Testing if 3x beats 2x encoder recurrence.
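The recurrence pattern in these commits can be sketched as below (toy blocks stand in for the real transformer layers; the parameter-free RMS norm between passes is from the commits):

```python
import torch
import torch.nn as nn

def rms_norm(x, eps=1e-6):
    # parameter-free RMS norm applied between recurrence passes
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

def run_encoder(blocks, x, passes=2):
    # Run the same physical blocks `passes` times: extra effective
    # depth with zero extra parameters.
    for p in range(passes):
        for blk in blocks:
            x = blk(x)
        if p < passes - 1:
            x = rms_norm(x)
    return x

blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])  # toy stand-ins
y = run_encoder(blocks, torch.randn(2, 8), passes=2)
```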
…imit) 3x encoder recurrence exceeds A100 SM shared memory (168096 > 166912). 2x encoder recurrence remains our best: val_bpb 1.4235.
Allows overriding the default 50/50 split to put more blocks in the encoder for deeper recurrence. Default behavior unchanged.
Best config: 4+5 split with 2x encoder recurrence. 6+3 split tested and was worse (1.4267 vs 1.4235).
After encoder passes, compute a prediction loss from the encoder output, weight it at 0.1x, and add it to the final loss. Gives encoder blocks a direct learning signal instead of learning only through decoder backprop.
Auxiliary loss was inflating val_bpb metric during evaluation. Now uses weight=0.1 during training, 0.0 during eval.
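A sketch of the train-vs-eval weighting fix, with hypothetical logit and target names:

```python
import torch
import torch.nn.functional as F

def total_loss(dec_logits, enc_logits, targets, training):
    # Auxiliary encoder loss only while training (weight 0.1); weight 0.0
    # at eval so it cannot inflate the reported val_bpb.
    aux_w = 0.1 if training else 0.0
    loss = F.cross_entropy(dec_logits, targets)
    if aux_w > 0:
        loss = loss + aux_w * F.cross_entropy(enc_logits, targets)
    return loss

dec = torch.randn(8, 32)     # hypothetical decoder logits
enc = torch.randn(8, 32)     # hypothetical encoder-head logits
tgt = torch.randint(0, 32, (8,))
train_l = total_loss(dec, enc, tgt, training=True)
eval_l = total_loss(dec, enc, tgt, training=False)
```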
Second encoder pass runs blocks in reverse order (3→2→1→0) for bidirectional refinement. Auxiliary encoder loss reverted — it hurt performance (1.4135 vs 1.4077 without it).
Novel architecture (ours):
- GELU pre-enrichment before transformer blocks
- 2x encoder recurrence with RMS norm between passes

Proven techniques adopted:
- Overtone init (power-law SVD embedding initialization)
- FP16 embedding passthrough (avoids int8 compound error)
- Muon decoupled weight decay (0.02)
- Sliding window eval (stride=64, ~960 tokens context per token)

Run with: NUM_LAYERS=10 TIED_EMBED_LR=0.1 WARMDOWN_ITERS=2500 MATRIX_LR=0.06 torchrun --standalone --nproc_per_node=8 train_gpt.py
Sliding window with stride=64 is too slow unbatched on single GPU (~30 min). Falls back to regular eval on single GPU for testing. Multi-GPU distributes windows across ranks.
1. Batched sliding window eval (stride=64, batch=256) with proper per-token scoring via the forward_logits method
2. Reverted FP16 embedding passthrough to fit the 16MB cap
3. Encoder recurrence behind an ENCODER_RECURRENCE=1 env var for A/B testing recurrence vs no-recurrence
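The windowing behind the batched eval can be sketched as follows (stride=64 is from the PR; the 1024-token window size is an assumption consistent with the "~960 tokens context per token" figure):

```python
import torch

def sliding_windows(tokens, window=1024, stride=64):
    # Overlapping eval windows, stacked so they can be batched. Only the
    # last `stride` positions of each window are scored, so every token
    # gets at least (window - stride) tokens of left context.
    starts = range(0, len(tokens) - window + 1, stride)
    return torch.stack([tokens[s:s + window] for s in starts])

toks = torch.arange(1024 + 64 * 3)
w = sliding_windows(toks)    # 4 windows of 1024 tokens each
```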
…2L config Phase-transition sigmoid init for resid_mix (from rank 1). Late-K: last 2 layers c_k.weight kept fp16 during quantization. GRAD_CLIP_NORM=1.0 default. RUN_CONFIG=C: 12L MLP 2x (18 effective layers with recurrence).
Process 16K tokens per batch with numpy, not 64 per window.
Only transfer target token log probs (2MB) not full vocab (2GB per batch).
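The bandwidth fix boils down to a gather before the device copy; a minimal sketch:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 8, 50304)    # (batch, seq, vocab) toy batch
targets = torch.randint(0, 50304, (2, 8))

logp = F.log_softmax(logits, dim=-1)
# gather reduces (B, T, V) to (B, T) BEFORE the transfer: megabytes of
# target-token log-probs instead of gigabytes of full-vocab log-probs.
tok_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
tok_logp_cpu = tok_logp.cpu()
```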
Precompute all hashes upfront (6 numpy passes). Clamp ng_prob to [0,1] to prevent hash collision artifacts. Progress logging.
All n-gram operations on GPU — hash precomputation, lookups, scoring, cache updates via scatter_add_. No numpy bottleneck.
Single 5-gram order, fixed alpha=0.20. No backoff loop, no entropy, no log_softmax for the n-gram. Three torch ops per batch: lookup, blend, scatter_add.
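A highly simplified sketch of that three-op loop (the hash function, table size, and count-to-probability clamp here are illustrative assumptions; the clamp mirrors the earlier commit's guard against hash-collision artifacts):

```python
import torch

TABLE = 1 << 16              # hash-table size (assumption)
cache = torch.zeros(TABLE)   # (context-hash, token) -> count

def hash_pair(ctx_hash, tok):
    # hypothetical mixing; the real hashes are precomputed upfront on GPU
    return (ctx_hash * 1000003 + tok) % TABLE

def blend_step(ctx_hash, targets, model_p, alpha=0.20):
    keys = hash_pair(ctx_hash, targets)
    ng_prob = cache[keys].clamp(0, 1)            # 1) lookup (+ clamp)
    p = (1 - alpha) * model_p + alpha * ng_prob  # 2) blend
    cache.scatter_add_(0, keys, torch.ones_like(ng_prob))  # 3) update
    return p

ctx = torch.tensor([12345])
tgt = torch.tensor([7])
p1 = blend_step(ctx, tgt, torch.tensor([0.5]))  # unseen: model-dominated
p2 = blend_step(ctx, tgt, torch.tensor([0.5]))  # seen once: n-gram kicks in
```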
EMA-GPU + Multi-Order N-gram Backoff + Pre-Enrichment Confidence + XSA
val_bpb: 0.9393 (multi-order n-gram backoff 2-11, entropy-adaptive alpha + pre-enrichment confidence) | 1.1478 (sliding window) | 14.94 MB | 8xH100 SXM, 600s
Progress
Key Contributions
EMA on GPU (37% faster training)
EMA state is kept on GPU during training instead of making a synchronous GPU→CPU copy every step; it is moved to CPU only once, at the end, for serialization.
Step time: 64.7ms (vs 101ms before). Enables 9,268 steps in 600s vs ~5,900, i.e. 57% more gradient updates.
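The GPU-resident EMA can be sketched as:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, params, decay=0.999):
    # EMA lives on the same device as the live params: no per-step
    # GPU->CPU copy inside the training loop.
    for e, p in zip(ema_params, params):
        e.mul_(decay).add_(p, alpha=1 - decay)

params = [torch.ones(4)]
ema = [torch.zeros(4)]
ema_update(ema, params, decay=0.9)
# move to CPU once, only when serializing the checkpoint:
state = [e.cpu() for e in ema]
```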
Multi-Order N-gram Backoff (score-first, backward-looking)
Multi-order n-gram backoff with entropy-adaptive alpha during sliding window eval.
Improvement: 1.1478 → 0.9408 = -0.207 BPB
Pre-Enrichment Confidence Modulation
Uses the pre-enrichment layer's transformation magnitude as a confidence signal: a high delta means the model is uncertain about this context, so trust the n-gram more; a low delta means the model is confident, so trust the model more. Modulates the entropy-adaptive alpha by (0.5 + 1.0 * pe_conf).
Additional improvement: 0.9408 → 0.9393 = -0.0015 BPB
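A sketch of the modulation. The (0.5 + 1.0 * pe_conf) factor is from the PR; the linear entropy-to-alpha mapping, the base alpha, and the entropy ceiling shown here are assumptions, and pe_conf is taken to be the pre-enrichment delta already normalized to [0, 1]:

```python
import torch

def blend_alpha(entropy, pe_conf, base=0.20, max_ent=5.545):
    # Entropy-adaptive base alpha (assumed linear form), then modulated
    # by pre-enrichment confidence: high pe_conf = model uncertain here,
    # lean on the n-gram; low pe_conf = trust the model.
    ent_alpha = base * (entropy / max_ent).clamp(0.0, 1.0)
    return ent_alpha * (0.5 + 1.0 * pe_conf)

ent = torch.tensor([5.545])                   # high-entropy position
a_lo = blend_alpha(ent, torch.tensor([0.0]))  # confident model
a_hi = blend_alpha(ent, torch.tensor([1.0]))  # uncertain model
```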
GELU Pre-Enrichment (512→768→512)
Wider nonlinear transformation before the residual stream: embedding → BigramHash add → SmearGate → Linear(512→768) → GELU → Linear(768→512) → RMS Norm → transformer blocks
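A minimal sketch of the pre-enrichment MLP itself (the BigramHash, SmearGate, and RMS-norm stages around it are omitted):

```python
import torch
import torch.nn as nn

class PreEnrichment(nn.Module):
    """512 -> 768 -> 512 GELU MLP run once before the transformer
    blocks. Per the PR, the magnitude of its transformation also
    doubles as the pe_conf uncertainty signal at eval time."""
    def __init__(self, dim=512, hidden=768):
        super().__init__()
        self.up = nn.Linear(dim, hidden, bias=False)
        self.act = nn.GELU()
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(self.act(self.up(x)))

x = torch.randn(2, 5, 512)
out = PreEnrichment()(x)
pe_delta = (out - x).norm(dim=-1)  # per-token transformation magnitude
```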
XSA (Exclusive Self Attention) on Last 4 Layers
Removes self-value bias via orthogonal projection (arXiv:2603.09078, GQA-aware PR #265 @unnir). Zero parameters.
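One way to read "removes self-value bias via orthogonal projection" is projecting each token's attention output orthogonal to that token's own value vector; a hedged sketch of that interpretation only (the real XSA is GQA-aware and may differ in detail):

```python
import torch

def xsa_project(attn_out, v_self, eps=1e-6):
    # Subtract the component of each token's attention output that lies
    # along that token's own value vector: a parameter-free orthogonal
    # projection removing the self-value direction.
    coef = (attn_out * v_self).sum(-1, keepdim=True) / (
        v_self.pow(2).sum(-1, keepdim=True) + eps)
    return attn_out - coef * v_self

a = torch.randn(2, 4, 8)   # (batch, seq, head_dim) toy shapes
v = torch.randn(2, 4, 8)
out = xsa_project(a, v)
```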
Additional Techniques
Architecture: 10L, 512d, 8H/4KV GQA, MLP 3x, tied embeddings, U-Net skip connections. Training: Muon+AdamW, WD=0.04, matrix_lr=0.025, warmdown=3500, batch=524K, seq=2048.
What Didn't Work
Reproduction
All defaults baked in. No env vars needed.
8xH100 SXM, 600s training + ~188s eval.
Key Metrics
Credits
Entropy-adaptive alpha adapted from PR #727, "Record: First Legal Sub-1.0 BPB — Multi-order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.9674, 3-seed)" (@Asukabot0)
Update Log