Record: EMA-GPU + Multi-Order N-gram Backoff + PE Confidence (val_bpb=0.9393)#187
Closed
Idan3011 wants to merge 73 commits into openai:main
Conversation
Two CastedLinear(512,512) layers applied to token embeddings before entering the residual stream. No activation between them. Weights optimized via Muon alongside block matrix params. Also updates .gitignore for venv and build artifacts.
Tests whether true nonlinearity improves over the linear-only factorization that scored val_bpb 1.4188.
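The factorization under test can be sketched as follows (a minimal stand-in: plain nn.Linear replaces CastedLinear, whose definition lives in the training script; the `nonlinear` flag marks the GELU variant being compared against the linear-only baseline):

```python
import torch
import torch.nn as nn

class EmbedProjection(nn.Module):
    """Two 512x512 linear layers applied to token embeddings before the
    residual stream. With nonlinear=False there is no activation between
    them, so the pair is still an overall linear map; nonlinear=True
    inserts a GELU, the 'true nonlinearity' being tested."""
    def __init__(self, dim=512, nonlinear=False):
        super().__init__()
        # plain nn.Linear stands in for CastedLinear here
        self.proj1 = nn.Linear(dim, dim, bias=False)
        self.act = nn.GELU() if nonlinear else nn.Identity()
        self.proj2 = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        return self.proj2(self.act(self.proj1(x)))

emb = torch.randn(4, 16, 512)   # (batch, seq, dim)
out = EmbedProjection()(emb)
```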
Encoder blocks 0-3 run twice with RMS norm between passes. Decoder runs once using skip connections from the refined second encoder pass. 13 effective layers from 9 physical blocks, zero extra parameters.
All 9 blocks run twice with RMS norm between passes. 18 effective layers from 9 physical blocks, zero extra params. Replaces encoder-only recurrence from previous commit.
17 effective layers from 9 physical blocks. RMS norm between each encoder pass. Testing if 3x beats 2x encoder recurrence.
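The recurrence pattern in these commits can be sketched as below (toy blocks stand in for the real transformer layers; the parameter-free RMS norm between passes is from the commits):

```python
import torch
import torch.nn as nn

def rms_norm(x, eps=1e-6):
    # parameter-free RMS norm applied between recurrence passes
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

def run_encoder(blocks, x, passes=2):
    # Run the same physical blocks `passes` times: extra effective
    # depth with zero extra parameters.
    for p in range(passes):
        for blk in blocks:
            x = blk(x)
        if p < passes - 1:
            x = rms_norm(x)
    return x

blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])  # toy stand-ins
y = run_encoder(blocks, torch.randn(2, 8), passes=2)
```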
…imit) 3x encoder recurrence exceeds A100 SM shared memory (168096 > 166912). 2x encoder recurrence remains our best: val_bpb 1.4235.
Allows overriding the default 50/50 split to put more blocks in the encoder for deeper recurrence. Default behavior unchanged.
Best config: 4+5 split with 2x encoder recurrence. 6+3 split tested and was worse (1.4267 vs 1.4235).
After encoder passes, compute a prediction loss from the encoder output, weight it at 0.1x, and add it to the final loss. Gives encoder blocks a direct learning signal instead of learning only through decoder backprop.
Auxiliary loss was inflating val_bpb metric during evaluation. Now uses weight=0.1 during training, 0.0 during eval.
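A sketch of the train-vs-eval weighting fix, with hypothetical logit and target names:

```python
import torch
import torch.nn.functional as F

def total_loss(dec_logits, enc_logits, targets, training):
    # Auxiliary encoder loss only while training (weight 0.1); weight 0.0
    # at eval so it cannot inflate the reported val_bpb.
    aux_w = 0.1 if training else 0.0
    loss = F.cross_entropy(dec_logits, targets)
    if aux_w > 0:
        loss = loss + aux_w * F.cross_entropy(enc_logits, targets)
    return loss

dec = torch.randn(8, 32)     # hypothetical decoder logits
enc = torch.randn(8, 32)     # hypothetical encoder-head logits
tgt = torch.randint(0, 32, (8,))
train_l = total_loss(dec, enc, tgt, training=True)
eval_l = total_loss(dec, enc, tgt, training=False)
```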
Second encoder pass runs blocks in reverse order (3→2→1→0) for bidirectional refinement. Auxiliary encoder loss reverted — it hurt performance (1.4135 vs 1.4077 without it).
Novel architecture (ours):
- GELU pre-enrichment before transformer blocks
- 2x encoder recurrence with RMS norm between passes

Proven techniques adopted:
- Overtone init (power-law SVD embedding initialization)
- FP16 embedding passthrough (avoids int8 compound error)
- Muon decoupled weight decay (0.02)
- Sliding window eval (stride=64, ~960 tokens context per token)

Run with: NUM_LAYERS=10 TIED_EMBED_LR=0.1 WARMDOWN_ITERS=2500 MATRIX_LR=0.06 torchrun --standalone --nproc_per_node=8 train_gpt.py
Sliding window with stride=64 is too slow unbatched on single GPU (~30 min). Falls back to regular eval on single GPU for testing. Multi-GPU distributes windows across ranks.
1. Batched sliding window eval (stride=64, batch=256) with proper per-token scoring via the forward_logits method
2. Reverted FP16 embedding passthrough to fit the 16MB cap
3. Encoder recurrence behind an ENCODER_RECURRENCE=1 env var for A/B testing recurrence vs no-recurrence
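The windowing behind the batched eval can be sketched as follows (stride=64 is from the PR; the 1024-token window size is an assumption consistent with the "~960 tokens context per token" figure):

```python
import torch

def sliding_windows(tokens, window=1024, stride=64):
    # Overlapping eval windows, stacked so they can be batched. Only the
    # last `stride` positions of each window are scored, so every token
    # gets at least (window - stride) tokens of left context.
    starts = range(0, len(tokens) - window + 1, stride)
    return torch.stack([tokens[s:s + window] for s in starts])

toks = torch.arange(1024 + 64 * 3)
w = sliding_windows(toks)    # 4 windows of 1024 tokens each
```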
…2L config Phase-transition sigmoid init for resid_mix (from rank 1). Late-K: last 2 layers c_k.weight kept fp16 during quantization. GRAD_CLIP_NORM=1.0 default. RUN_CONFIG=C: 12L MLP 2x (18 effective layers with recurrence).
Process 16K tokens per batch with numpy, not 64 per window.
Only transfer target token log probs (2MB) not full vocab (2GB per batch).
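The bandwidth fix boils down to a gather before the device copy; a minimal sketch:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 8, 50304)    # (batch, seq, vocab) toy batch
targets = torch.randint(0, 50304, (2, 8))

logp = F.log_softmax(logits, dim=-1)
# gather reduces (B, T, V) to (B, T) BEFORE the transfer: megabytes of
# target-token log-probs instead of gigabytes of full-vocab log-probs.
tok_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
tok_logp_cpu = tok_logp.cpu()
```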
Precompute all hashes upfront (6 numpy passes). Clamp ng_prob to [0,1] to prevent hash collision artifacts. Progress logging.
All n-gram operations on GPU — hash precomputation, lookups, scoring, cache updates via scatter_add_. No numpy bottleneck.
Single 5-gram order, fixed alpha=0.20. No backoff loop, no entropy, no log_softmax for the n-gram. Three torch ops per batch: lookup, blend, scatter_add.
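A highly simplified sketch of that three-op loop (the hash function, table size, and count-to-probability clamp here are illustrative assumptions; the clamp mirrors the earlier commit's guard against hash-collision artifacts):

```python
import torch

TABLE = 1 << 16              # hash-table size (assumption)
cache = torch.zeros(TABLE)   # (context-hash, token) -> count

def hash_pair(ctx_hash, tok):
    # hypothetical mixing; the real hashes are precomputed upfront on GPU
    return (ctx_hash * 1000003 + tok) % TABLE

def blend_step(ctx_hash, targets, model_p, alpha=0.20):
    keys = hash_pair(ctx_hash, targets)
    ng_prob = cache[keys].clamp(0, 1)            # 1) lookup (+ clamp)
    p = (1 - alpha) * model_p + alpha * ng_prob  # 2) blend
    cache.scatter_add_(0, keys, torch.ones_like(ng_prob))  # 3) update
    return p

ctx = torch.tensor([12345])
tgt = torch.tensor([7])
p1 = blend_step(ctx, tgt, torch.tensor([0.5]))  # unseen: model-dominated
p2 = blend_step(ctx, tgt, torch.tensor([0.5]))  # seen once: n-gram kicks in
```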
EMA-GPU + Multi-Order N-gram Backoff + Pre-Enrichment Confidence + XSA
val_bpb: 0.9393 (multi-order n-gram backoff 2-11, entropy-adaptive alpha + pre-enrichment confidence) | 1.1478 (sliding window) | 14.94 MB | 8xH100 SXM, 600s
Progress
Key Contributions
EMA on GPU (37% faster training)
EMA state is kept on GPU during training instead of making a synchronous GPU→CPU copy every step; it is moved to CPU only once, at the end, for serialization.
Step time: 64.7ms (vs 101ms before). Enables 9,268 steps in 600s vs ~5,900, i.e. 57% more gradient updates.
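The GPU-resident EMA can be sketched as:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, params, decay=0.999):
    # EMA lives on the same device as the live params: no per-step
    # GPU->CPU copy inside the training loop.
    for e, p in zip(ema_params, params):
        e.mul_(decay).add_(p, alpha=1 - decay)

params = [torch.ones(4)]
ema = [torch.zeros(4)]
ema_update(ema, params, decay=0.9)
# move to CPU once, only when serializing the checkpoint:
state = [e.cpu() for e in ema]
```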
Multi-Order N-gram Backoff (score-first, backward-looking)
Multi-order n-gram backoff with entropy-adaptive alpha during sliding window eval.
Improvement: 1.1478 → 0.9408 = -0.207 BPB
Pre-Enrichment Confidence Modulation
Uses the pre-enrichment layer's transformation magnitude as a confidence signal: a high delta means the model is uncertain about this context, so trust the n-gram more; a low delta means the model is confident, so trust the model more. Modulates the entropy-adaptive alpha by (0.5 + 1.0 * pe_conf).
Additional improvement: 0.9408 → 0.9393 = -0.0015 BPB
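A sketch of the modulation. The (0.5 + 1.0 * pe_conf) factor is from the PR; the linear entropy-to-alpha mapping, the base alpha, and the entropy ceiling shown here are assumptions, and pe_conf is taken to be the pre-enrichment delta already normalized to [0, 1]:

```python
import torch

def blend_alpha(entropy, pe_conf, base=0.20, max_ent=5.545):
    # Entropy-adaptive base alpha (assumed linear form), then modulated
    # by pre-enrichment confidence: high pe_conf = model uncertain here,
    # lean on the n-gram; low pe_conf = trust the model.
    ent_alpha = base * (entropy / max_ent).clamp(0.0, 1.0)
    return ent_alpha * (0.5 + 1.0 * pe_conf)

ent = torch.tensor([5.545])                   # high-entropy position
a_lo = blend_alpha(ent, torch.tensor([0.0]))  # confident model
a_hi = blend_alpha(ent, torch.tensor([1.0]))  # uncertain model
```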
GELU Pre-Enrichment (512→768→512)
Wider nonlinear transformation before the residual stream: embedding → BigramHash add → SmearGate → Linear(512→768) → GELU → Linear(768→512) → RMS Norm → transformer blocks
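A minimal sketch of the pre-enrichment MLP itself (the BigramHash, SmearGate, and RMS-norm stages around it are omitted):

```python
import torch
import torch.nn as nn

class PreEnrichment(nn.Module):
    """512 -> 768 -> 512 GELU MLP run once before the transformer
    blocks. Per the PR, the magnitude of its transformation also
    doubles as the pe_conf uncertainty signal at eval time."""
    def __init__(self, dim=512, hidden=768):
        super().__init__()
        self.up = nn.Linear(dim, hidden, bias=False)
        self.act = nn.GELU()
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(self.act(self.up(x)))

x = torch.randn(2, 5, 512)
out = PreEnrichment()(x)
pe_delta = (out - x).norm(dim=-1)  # per-token transformation magnitude
```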
XSA (Exclusive Self Attention) on Last 4 Layers
Removes self-value bias via orthogonal projection (arXiv:2603.09078, GQA-aware PR #265 @unnir). Zero parameters.
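One way to read "removes self-value bias via orthogonal projection" is projecting each token's attention output orthogonal to that token's own value vector; a hedged sketch of that interpretation only (the real XSA is GQA-aware and may differ in detail):

```python
import torch

def xsa_project(attn_out, v_self, eps=1e-6):
    # Subtract the component of each token's attention output that lies
    # along that token's own value vector: a parameter-free orthogonal
    # projection removing the self-value direction.
    coef = (attn_out * v_self).sum(-1, keepdim=True) / (
        v_self.pow(2).sum(-1, keepdim=True) + eps)
    return attn_out - coef * v_self

a = torch.randn(2, 4, 8)   # (batch, seq, head_dim) toy shapes
v = torch.randn(2, 4, 8)
out = xsa_project(a, v)
```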
Additional Techniques
Architecture: 10L, 512d, 8H/4KV GQA, MLP 3x, tied embeddings, U-Net skip connections. Training: Muon+AdamW, WD=0.04, matrix_lr=0.025, warmdown=3500, batch=524K, seq=2048.
What Didn't Work
Reproduction
All defaults baked in. No env vars needed.
8xH100 SXM, 600s training + ~188s eval.
Key Metrics
Credits
Entropy-adaptive alpha adapted from PR #727, "Record: First Legal Sub-1.0 BPB — Multi-order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.9674, 3-seed)" (@Asukabot0)
Update Log