Record: EMA-GPU + Multi-Order N-gram Backoff + PE Confidence (val_bpb=0.9393)#187

Closed
Idan3011 wants to merge 73 commits into openai:main from Idan3011:submission

Conversation


@Idan3011 Idan3011 commented Mar 20, 2026

EMA-GPU + Multi-Order N-gram Backoff + Pre-Enrichment Confidence + XSA

val_bpb: 0.9393 (multi-order n-gram backoff 2-11, entropy-adaptive alpha + pre-enrichment confidence) |
1.1478 (sliding window) | 14.94 MB | 8xH100 SXM, 600s


Progress

| | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 (this) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| val_bpb | 1.1855 | 1.1709 | 1.1668 | 1.1629 | 1.0689 | 0.9784 | 0.9408 | 0.9393 |
| Eval method | sliding | sliding | sliding | sliding | 5-gram | multi-order 2-7 | multi-order 2-11 | 2-11 + PE conf |
| Params | 19.4M | 24.7M | 25.2M | 25.2M | 25.3M | 25.3M | 25.3M | 25.3M |
| Artifact | 15.75 MB | 15.57 MB | 15.02 MB | 15.05 MB | 14.95 MB | 14.94 MB | 14.94 MB | 14.94 MB |
| Steps (600s) | 8,004 | 6,423 | 5,373 | 5,636 | 9,312 | 9,268 | 9,268 | 9,268 |
| Step time | 75ms | 93ms | 112ms | 106ms | 64ms | 65ms | 65ms | 65ms |

Key Contributions

EMA on GPU (37% faster training)

The EMA state is kept on the GPU during training instead of being copied synchronously to the CPU every step; it is moved to the CPU only once, at the end, for serialization.

Step time: 64.7ms (vs 101ms before). Enables 9,268 steps in 600s vs ~5,900 — 57% more gradient updates.
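The pattern above can be sketched as follows (a minimal illustration with hypothetical helper names, not the actual training loop):

```python
import torch

@torch.no_grad()
def ema_update_(ema_params, model_params, decay=0.997):
    """In-place EMA update. Tensors stay on whatever device they already
    live on (the GPU during training), so there is no per-step GPU->CPU
    transfer; the copy happens once at serialization time."""
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)

# Only at the end of training, move EMA state to CPU once for serialization:
# cpu_state = [e.cpu() for e in ema_params]
```

The win is purely about avoiding a synchronous device transfer on the hot path; the arithmetic is the standard exponential moving average.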

Multi-Order N-gram Backoff (score-first, backward-looking)

Multi-order n-gram backoff with entropy-adaptive alpha during sliding window eval.

  • Multi-order backoff: orders 11→10→9→8→7→6→5→4→3→2, first hit with count≥2 wins
  • Entropy-adaptive alpha: alpha = 0.05 + 0.55 * sigmoid(3 * (H - 3.5))
  • Cache built from already-scored tokens only (backward-looking)
  • Score-first: cache updated AFTER segment scoring
  • Dual-array hash scheme: separate context count and pair count arrays per order (4M buckets each)
  • Per-GPU independent cache, no cross-GPU sync
  • Hash tables precomputed for all orders in single pass
  • Integrated into sliding window eval (single pass)

Improvement: 1.1478 → 0.9408 = -0.207 BPB
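The backoff and alpha rules above can be sketched in plain Python. Dicts stand in for the per-order dual hash arrays, and `ctx_counts`/`pair_counts`/`backoff_prob` are illustrative names, not the submission's identifiers:

```python
import math

def entropy_adaptive_alpha(H):
    # alpha = 0.05 + 0.55 * sigmoid(3 * (H - 3.5)): blend in more n-gram
    # probability mass where the model's predictive entropy H is high.
    return 0.05 + 0.55 / (1.0 + math.exp(-3.0 * (H - 3.5)))

def backoff_prob(ctx_counts, pair_counts, history, token, orders=range(11, 1, -1)):
    """Highest-order n-gram estimate with a count>=2 threshold.
    ctx_counts[n][ctx] and pair_counts[n][(ctx, tok)] play the role of the
    dual context/pair count arrays per order (plain dicts for clarity)."""
    for n in orders:                      # try order 11 first, back off to 2
        ctx = tuple(history[-(n - 1):])   # last n-1 tokens as context
        if len(ctx) < n - 1:
            continue                      # history shorter than this order
        c = ctx_counts[n].get(ctx, 0)
        if c >= 2:                        # first sufficiently-seen order wins
            return pair_counts[n].get((ctx, token), 0) / c
    return None                           # no order fired: use model prob alone

# Linear mixing with the model probability (log-odds mixing was a regression):
# p = (1 - alpha) * p_model + alpha * p_ngram
```

The real implementation runs entirely on GPU with fixed-size hash tables (4M buckets per order) and `scatter_add_` updates; the dict version only shows the control flow.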

Pre-Enrichment Confidence Modulation

Uses the pre-enrichment layer's transformation magnitude as a confidence signal: a high delta means the model is uncertain about this context, so the n-gram is trusted more; a low delta means the model is confident, so the model is trusted more. The entropy-adaptive alpha is modulated by (0.5 + 1.0 * pe_conf).

Additional improvement: 0.9408 → 0.9393 = -0.0015 BPB
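A minimal sketch of the modulation. How the raw delta is normalized into a [0, 1] confidence (min-max over the batch here) is an assumption for illustration; only the `(0.5 + 1.0 * pe_conf)` multiplier is stated in the description:

```python
import torch

def modulated_alpha(alpha, pe_delta):
    """Scale the entropy-adaptive alpha by pre-enrichment confidence.
    pe_delta: per-token transformation magnitude of the pre-enrichment
    layer (e.g. ||output - input||). Min-max normalization is assumed."""
    lo, hi = pe_delta.min(), pe_delta.max()
    pe_conf = (pe_delta - lo) / (hi - lo + 1e-8)   # high delta -> trust n-gram more
    return alpha * (0.5 + 1.0 * pe_conf)           # multiplier ranges over [0.5, 1.5]
```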

GELU Pre-Enrichment (512→768→512)

Wider nonlinear transformation before the residual stream: embedding → BigramHash add → SmearGate →
Linear(512→768) → GELU → Linear(768→512) → RMS Norm → transformer blocks
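The Linear → GELU → Linear → RMS Norm tail of that pipeline can be sketched as a small module (SmearGate and BigramHash upstream are omitted; `PreEnrichment` is an illustrative name):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    # Parameter-free RMS normalization over the feature dimension.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

class PreEnrichment(nn.Module):
    """Wider nonlinear transform before the residual stream:
    Linear(512->768) -> GELU -> Linear(768->512) -> RMS norm."""
    def __init__(self, dim=512, hidden=768):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return rms_norm(self.down(F.gelu(self.up(x))))
```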

XSA (Exclusive Self Attention) on Last 4 Layers

Removes self-value bias via orthogonal projection (arXiv:2603.09078, GQA-aware PR #265 @unnir). Zero
parameters.
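One parameter-free reading of "removes self-value bias via orthogonal projection" is projecting each token's attention output onto the orthogonal complement of its own value vector. Whether XSA applies exactly this projection is our assumption from the one-line description; see the cited paper and PR for the real method:

```python
import torch

def remove_self_value(attn_out, values, eps=1e-8):
    """Zero-parameter sketch: subtract the component of each token's
    attention output that lies along its own (unit-normalized) value
    vector, leaving the orthogonal part."""
    v = values / (values.norm(dim=-1, keepdim=True) + eps)  # unit self-values
    coeff = (attn_out * v).sum(-1, keepdim=True)            # component along v_i
    return attn_out - coeff * v                             # orthogonal remainder
```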


Additional Techniques

  • SmearGate: Per-dim gate blending each token with previous token. F.pad for efficiency.
  • BigramHash (2048x128): Hash-table embedding for token bigrams.
  • EMA (decay=0.997): Quant gap 0.004.
  • Int6 QAT + lzma: 14.94 MB artifact.
  • Value Residual + Gated Attention: Toggleable (default OFF, not used in this submission).
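The SmearGate bullet above can be sketched as follows; `gate_logits` stands in for the learned per-dimension parameter, and the exact blend form is an assumption from the one-line description:

```python
import torch
import torch.nn.functional as F

def smear_gate(x, gate_logits):
    """Per-dimension gate blending each token with the previous token.
    F.pad prepends one zero timestep so position t sees position t-1
    without an explicit roll/copy."""
    prev = F.pad(x, (0, 0, 1, 0))[:, :-1, :]   # x shifted right along seq dim
    g = torch.sigmoid(gate_logits)             # per-dim blend weights in (0, 1)
    return (1 - g) * x + g * prev
```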

Architecture: 10L, 512d, 8H/4KV GQA, MLP 3x, tied embeddings, U-Net skip connections. Training: Muon+AdamW,
WD=0.04, matrix_lr=0.025, warmdown=3500, batch=524K, seq=2048.


What Didn't Work

  • Log-odds mixing: n-gram probabilities near zero create catastrophic logits. Linear mixing is correct.
  • SSE post-correction: Online bias learning always pushes predictions toward 1.0. Broken by design.
  • BigramHash confidence signal: Embedding norm didn't correlate with prediction accuracy. Regression.
  • Orders 12-13: No improvement over 2-11. Diminishing returns plateau.
  • Frontier stack (LeakyReLU², Partial RoPE, LN Scale, Value Embedding): Stacked together = regression.
  • Encoder recurrence: 900x quant error amplification. Removed.
  • TTT (Test-Time Training): SGD TTT hurt GPTQ models. Replaced by n-gram cache.
  • 12L MLP 2x: Width beats depth at this scale.
  • Grad clip 0.3: Hurt per-step BPB vs no clipping.

Reproduction

All defaults baked in. No env vars needed.

  python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
  torchrun --standalone --nproc_per_node=8 train_gpt.py

8xH100 SXM, 600s training + ~188s eval.


Key Metrics

| Metric | Value |
| --- | --- |
| val_bpb (n-gram + PE confidence) | 0.9393 |
| Sliding window val_bpb | 1.1478 |
| Post-quant val_bpb (standard) | 1.1690 |
| Pre-quant val_bpb | 1.1646 |
| Quant gap | 0.004 |
| Training time | 600,031ms (9,268 steps at 64.7ms) |
| Peak memory | 13,058 MiB |
| Artifact size | 14,942,971 bytes |
| Model parameters | 25,254,992 |

Credits


Update Log

  • v1 (1.1855): int8+zlib, MLP 2x, seq 1024
  • v2 (1.1709): int6 QAT + lzma, MLP 3x, SWA, seq 2048
  • v3 (1.1668): + SmearGate + BigramHash + EMA + wider pre-enrichment
  • v4 (1.1629): + XSA on last 4 layers
  • v5 (1.0689): + EMA on GPU (64ms/step) + 5-gram eval cache
  • v6 (0.9784): + multi-order backoff 2-7 + entropy-adaptive alpha
  • v7 (0.9408): + extended to orders 2-11 + steeper alpha (3.0, threshold 3.5)
  • v8 (0.9393): + pre-enrichment confidence modulation

Idan3011 added 17 commits March 18, 2026 18:40
Two CastedLinear(512,512) layers applied to token embeddings before
entering the residual stream. No activation between them. Weights
optimized via Muon alongside block matrix params.
Also updates .gitignore for venv and build artifacts.
  Tests whether true nonlinearity improves over the linear-only
  factorization that scored val_bpb 1.4188.
  Encoder blocks 0-3 run twice with RMS norm between passes.
  Decoder runs once using skip connections from the refined second
  encoder pass. 13 effective layers from 9 physical blocks, zero
  extra parameters.
  All 9 blocks run twice with RMS norm between passes.
  18 effective layers from 9 physical blocks, zero extra params.
  Replaces encoder-only recurrence from previous commit.
  17 effective layers from 9 physical blocks. RMS norm between
  each encoder pass. Testing if 3x beats 2x encoder recurrence.
…imit)

  3x encoder recurrence exceeds A100 SM shared memory (168096 > 166912).
  2x encoder recurrence remains our best: val_bpb 1.4235.
  Allows overriding the default 50/50 split to put more blocks in
  the encoder for deeper recurrence. Default behavior unchanged.
  Best config: 4+5 split with 2x encoder recurrence.
  6+3 split tested and was worse (1.4267 vs 1.4235).
   signal

  After encoder passes, compute prediction loss from encoder output
  weighted at 0.1x and add to final loss. Gives encoder blocks direct
  learning signal instead of only through decoder backprop.
  Auxiliary loss was inflating val_bpb metric during evaluation.
  Now uses weight=0.1 during training, 0.0 during eval.
  Second encoder pass runs blocks in reverse order (3→2→1→0) for
  bidirectional refinement. Auxiliary encoder loss reverted — it
  hurt performance (1.4135 vs 1.4077 without it).
  Novel architecture (ours):
  - GELU pre-enrichment before transformer blocks
  - 2x encoder recurrence with RMS norm between passes

  Proven techniques adopted:
  - Overtone init (power-law SVD embedding initialization)
  - FP16 embedding passthrough (avoids int8 compound error)
  - Muon decoupled weight decay (0.02)
  - Sliding window eval (stride=64, ~960 tokens context per token)

  Run with: NUM_LAYERS=10 TIED_EMBED_LR=0.1 WARMDOWN_ITERS=2500 MATRIX_LR=0.06
  torchrun --standalone --nproc_per_node=8 train_gpt.py
  Sliding window with stride=64 is too slow unbatched on single GPU
  (~30 min). Falls back to regular eval on single GPU for testing.
  Multi-GPU distributes windows across ranks.
  1. Batched sliding window eval (stride=64, batch=256) with proper
     per-token scoring via forward_logits method
  2. Reverted FP16 embedding passthrough to fit 16MB cap
  3. Encoder recurrence behind ENCODER_RECURRENCE=1 env var
     for A/B testing recurrence vs no-recurrence
@Idan3011 Idan3011 changed the title Record: Pre-Enrichment + Encoder Recurrence (val_bpb=1.1855) Record: Pre-Enrichment + Encoder Recurrence (val_bpb=1.1709) Mar 20, 2026
@Idan3011 Idan3011 changed the title Record: Pre-Enrichment + Encoder Recurrence (val_bpb=1.1709) Record: Pre-Enrichment + Encoder Recurrence + SmearGate + BigramHash (val_bpb=1.1668) Mar 21, 2026
  Process 16K tokens per batch with numpy, not 64 per window.
  Only transfer target token log probs (2MB) not full vocab (2GB per batch).
  Precompute all hashes upfront (6 numpy passes). Clamp ng_prob to [0,1]
  to prevent hash collision artifacts. Progress logging.
  All n-gram operations on GPU — hash precomputation, lookups,
  scoring, cache updates via scatter_add_. No numpy bottleneck.
  Single 5-gram order, fixed alpha=0.20
  No backoff loop, no entropy, no log_softmax for n-gram.
  Three torch ops per batch: lookup, blend, scatter_add.
@Idan3011 Idan3011 changed the title Record: Pre-Enrichment + Encoder Recurrence + XSA + SmearGate + BigramHash (val_bpb=1.1629) Record: EMA-GPU + 5-gram eval cache (val_bpb=1.0689) Mar 25, 2026
@Idan3011 Idan3011 changed the title Record: EMA-GPU + 5-gram eval cache (val_bpb=1.0689) Record: EMA-GPU + Multi-Order N-gram Backoff (val_bpb=0.9784) Mar 26, 2026
@Idan3011 Idan3011 changed the title Record: EMA-GPU + Multi-Order N-gram Backoff (val_bpb=0.9784) Record: EMA-GPU + Multi-Order N-gram Backoff (val_bpb=0.9408) Mar 26, 2026
@Idan3011 Idan3011 changed the title Record: EMA-GPU + Multi-Order N-gram Backoff (val_bpb=0.9408) Record: EMA-GPU + Multi-Order N-gram Backoff + PE Confidence (val_bpb=0.9393) Mar 26, 2026
@Idan3011 Idan3011 closed this Mar 26, 2026
@Idan3011 Idan3011 deleted the submission branch March 26, 2026 04:37
@Idan3011 Idan3011 restored the submission branch March 26, 2026 04:37