Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed mean)#887
anthony-maio wants to merge 28 commits into openai:main
Conversation
Takes the proven SOTA script exactly (seq2048, MLP 3x, SmearGate, BigramHash, int6+zstd, SWA, Muon WD 0.02, OrthoInit) and adds TTT LoRA evaluation. TTT passes base_model directly (compiled). If TTT works on this architecture: expected ~1.11-1.12 bpb (new record). If TTT fails (SmearGate/BigramHash incompatibility): 1.1483 baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key changes from PR openai#162 base:
- 11 layers (from 9) — enabled by int6 compression headroom
- Full-weight SGD TTT (not LoRA): lr=0.002, momentum=0.9, 3 epochs over val data, freeze first 2 blocks for stability
- NTK-RoPE base=50000 (from 10000) for long-context extrapolation
- matrix_lr=0.025, scalar_lr=0.025, tied_embed_lr=0.035
- weight_decay=0.04 (from 0.01)
- BigramHash 2048 buckets (from 4096)
- TTT_ENABLED=1 env var toggle

Target: match FarnsworthEngine's 1.1303 bpb or beat it.
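The NTK-RoPE base change above can be sketched in isolation (a minimal illustration; the function name and shapes are assumptions, not the script's actual API):

```python
def rope_inv_freqs(head_dim, base=50000.0):
    # RoPE inverse frequencies; raising the base from 10000 to 50000
    # (NTK-style) slows the rotation of low-frequency dimensions,
    # which helps long-context extrapolation.
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
```

With the larger base, the lowest-frequency dimensions rotate more slowly, so positions far beyond the training length still map to distinguishable angles.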
At ~5700 steps on our pods, warmdown=3000 means 53% of training is in the LR decay phase. Reducing to 1500 doubles full-LR training time. Council identified this as a free 0.005+ bpb improvement.

NameError crashed after TTT epoch 3 completed successfully. eval_stride/eval_sl were local variables from the pre-TTT eval section, not visible in the TTT section. Use args.eval_stride and args.train_seq_len directly.

9 layers (valid artifact under 16MB), full SOTA stack: MLP 3x, SmearGate, BigramHash 2048, int6+zstd-22, SWA, Muon WD=0.04, NTK-RoPE 50k, OrthoInit, sliding window stride=64. Trained 4,782 steps at ~125ms/step on 8xH100 SXM. Custom kernel integration in progress for next submission. TTT disabled (does not improve results on this architecture). Set NUM_LAYERS=11 for 11L variant (requires tighter compression).
Makora-generated persistent-CTA kernel fuses relu² + second matmul into a single Triton launch during eval. First matmul stays on cuBLAS. Active only during eval (not self.training) to preserve autograd. Called 9x per forward pass during sliding window eval. Expected ~10% eval time reduction (190s → ~170s), freeing eval budget. Falls back to PyTorch when Triton unavailable.
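The eval-only dispatch described above reduces to a small piece of logic (names are illustrative; the real check lives inside the module's forward):

```python
def choose_mlp_path(training: bool, triton_available: bool) -> str:
    # The fused relu^2 + second-matmul kernel runs only during eval
    # (not self.training), so autograd never sees it; if Triton is
    # missing, every call takes the plain PyTorch path.
    if not training and triton_available:
        return "fused_eval_kernel"
    return "pytorch"
```

Keeping the fused path out of training avoids writing a custom backward for it while still capturing the eval-time speedup.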
FA3 (flash_attn_func) uses Hopper-native TMA/WGMMA for 75-85% GPU utilization vs FA2's ~60%. Expected to cut step time from ~108ms to ~85ms, yielding ~7,000 steps in 10 min. Falls back to F.scaled_dot_product_attention when FA3 unavailable. Also includes fused ReLU² MLP Triton kernel (1.26x during eval).
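The FA3-with-fallback selection amounts to an import probe (a sketch only; `flash_attn` is assumed to be the package that ships `flash_attn_func`):

```python
import importlib.util

def pick_attention_backend() -> str:
    # Prefer FlashAttention's flash_attn_func when the package is
    # importable; otherwise fall back to PyTorch's
    # F.scaled_dot_product_attention.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn_func"
    return "torch_sdpa"
```

Probing once at import time (rather than wrapping every call in try/except) keeps the hot path branch-free.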
int5 for MLP weights (largest tensors, ~60% of params), int6 for attention weights. Expected compression: 19.1MB × (5/6 for MLP portion) ≈ 15.9MB. Also restores NUM_LAYERS=11 as default.

Previous int5-MLP + int6-attention produced 16.56MB (556KB over). Switching all large matrices to int5 should save ~700KB more.

int5-all was 16.27MB (340KB over). MLP is ~60% of params. int4 MLP + int5 attention should save ~500KB more. Expected: ~15.8MB artifact.

Int4 MLP was too aggressive (0.028 bpb penalty). Int5-all on 11L was 340KB over. 10L at int5 should be ~14.8MB — safe margin. 10L is also faster (~100ms vs 115ms per step), so the extra steps compensate for one fewer layer.

Int5 was penalizing bpb by ~0.015-0.026. 9L with int6 fits at 15.9MB. QUANT_BITS env var allows int5 for 11L when needed.
Research finding: setting warmdown higher than total steps makes LR decay from step 1, compacting weight magnitudes continuously. This reduces int6 quant penalty from ~0.014 to ~0.005 bpb. Our 1.1401 result used warmdown=3000 on ~4800 steps (63% warmdown) while our 1.1518 used warmdown=1500 on ~7400 steps (20% warmdown) — the higher warmdown fraction gave better post-quant quality.
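A minimal linear-warmdown multiplier makes the finding concrete: when `warmdown` exceeds `total_steps`, the multiplier is already below 1.0 at step 0 and decays for the entire run (a hypothetical helper, not the script's exact schedule):

```python
def lr_mult(step: int, total_steps: int, warmdown: int) -> float:
    # Constant LR until the final `warmdown` steps, then linear decay
    # to zero. If warmdown >= total_steps, decay starts immediately.
    steps_left = total_steps - step
    if steps_left >= warmdown:
        return 1.0
    return steps_left / warmdown
```

For example, warmdown=6000 on a 4800-step run starts at a 0.8 multiplier and shrinks from there, which is what compacts weight magnitudes throughout training.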
From arXiv:2603.09078. Projects out the self-value component from attention output, forcing the network to use contextual information. Applied via GQA-aware zero-alloc view reshape on last 4 of 11 layers. Both top unmerged submissions (PR openai#374 at 1.1246 and PR openai#379 at 1.1260) use XSA as a key technique. Full next-gen stack now includes: 11L, XSA, Partial RoPE 16/64, Late QAT STE, Tight SWA, GPTQ-lite, LN Scale, FA3, SmearGate, BigramHash, int6+zstd, Muon WD, OrthoInit.
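Projecting out the self-value component can be sketched on flat vectors (a pure-Python illustration; the paper's exact formulation and the GQA view handling are not reproduced here):

```python
def remove_self_component(attn_out, v_self, eps=1e-8):
    # Subtract the projection of the attention output onto the token's
    # own value vector, leaving only the contextual component.
    dot = sum(a * b for a, b in zip(attn_out, v_self))
    norm_sq = sum(b * b for b in v_self) + eps
    coef = dot / norm_sq
    return [a - coef * b for a, b in zip(attn_out, v_self)]
```

After the projection, the output is (numerically) orthogonal to the self-value direction, so the head cannot simply copy its own value through.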
Previous int6 produced 19.0MB on 11L. Int5 should give ~15.8MB. Late QAT STE clusters weights near the int5 grid during training, so the quality penalty should be much smaller than without QAT. val_bpb=1.1309 achieved with int6 (artifact too big). Int5+QAT should preserve most of that while fitting under 16MB.
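Late QAT with a straight-through estimator rounds weights onto the int grid in the forward pass while backward treats the rounding as identity; a scalar sketch of the forward half (the real version is tensorized and wrapped for autograd):

```python
def quantize_ste_forward(w: float, bits: int = 5, scale: float = 1.0) -> float:
    # Forward half of the STE: clamp + round onto a signed int grid.
    # During QAT, backward passes gradients through unchanged, so
    # weights are pulled toward representable grid points.
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(w / scale)))
    return q * scale
```

Because training sees the quantized values, the weight distribution clusters near the grid, which is also what makes the artifact compress better.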
Full stack: 11 layers, XSA on last 4, Partial RoPE 16/64, Late QAT STE, Tight SWA (scale<0.2), GPTQ-lite clip search, LN Scale 1/sqrt(i+1), FA3, MLP3x, SmearGate, BigramHash 2048, int5+zstd, Muon WD=0.04, NTK-RoPE 50k, OrthoInit, sliding window stride=64. 4,832 steps at 117ms/step on slow pod. On 80ms pod: 1.1309 (invalid artifact). With fast pod + int5: expected ~1.13 valid.

5,660 steps at 101ms/step. Full stack: 11L, XSA, Partial RoPE, Late QAT STE, Tight SWA (7 checkpoints), GPTQ-lite, LN Scale, FA3, MLP3x, SmearGate, BigramHash, int5+zstd, Muon WD, OrthoInit. #1 on the merged leaderboard. Beats thwu1 (1.1428) by 0.003. On faster pods (80ms): 1.1309 achieved (invalid artifact with int6).

- README: updated with actual 1.1399 results, removed TTT/PENDING claims
- submission.json: aligned to repo schema (name, blurb, bytes_total)
- train_gpt.py: fixed docstring line count claim, renamed artifact file, fixed misleading int8+zlib log string to reflect actual int5+compressor
- Addresses all 5 Copilot review comments
1. Fused RMSNorm (fwd+bwd): replaces F.rms_norm in Block.forward for both attn_norm and mlp_norm. Saves rstd for backward. Called 22x per step (2 per block × 11 blocks).
2. Fused ReLU² MLP backward: fuses (grad_out @ proj_weight) * relu_deriv into a single Triton kernel, eliminating the [M, 1536] HBM intermediate. Called 11x per step in the backward pass.

Both fall back to PyTorch when Triton is unavailable. Expected: 13-15ms/step savings on a 100ms baseline, a 13-15% speedup.
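The `relu_deriv` factor the fused backward multiplies in is just 2·relu(x); a scalar check (illustrative names, not the kernel's code):

```python
def relu2(x: float) -> float:
    # Forward: squared ReLU activation.
    return max(x, 0.0) ** 2

def relu2_backward(x: float, grad_out: float) -> float:
    # d/dx relu(x)^2 = 2 * relu(x); the fused kernel applies this
    # factor elementwise to (grad_out @ proj_weight) in one launch,
    # avoiding a separate HBM round-trip for the intermediate.
    return grad_out * 2.0 * max(x, 0.0)
```

A central finite difference confirms the derivative on the positive branch and the hard zero on the negative branch.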
1. Detached .to(x.dtype) copies broke the gradient chain to fp32 params. Fix: pass raw fp32 params to the Function, cast inside forward, return .float() gradients in backward.
2. The grid was capped at num_sms*4 but the kernel isn't persistent — tiles beyond the cap were never computed, leaving grad_h uninitialized. Fix: launch all tiles (remove the min cap).

Both kernels re-enabled.

Custom Triton kernels add 38ms/step overhead vs the torch.compile baseline. The Inductor compiler already fuses RMSNorm and MLP operations effectively on H100. Custom kernels remain in the codebase for future optimization but are disabled for the competition submission. The kernel code is correct (no NaN after the bug fixes) but slower than compiled.

Full reproducibility log showing the end-to-end training + eval pipeline. 5,205 steps at 108ms/step. Note: this particular run's artifact was 16.46MB (462KB over limit) due to pod variance in SWA averaging. Our submitted score of 1.1399 comes from a run with a valid 15.79MB artifact on a faster pod (101ms/step, 5,660 steps).
Custom binary packing stores 4 int6 values in 3 bytes (6 bits each) instead of wasting 2 bits per value with int8 storage. This reduces raw artifact size by 25%, which combined with zstd-22 compression should fit 11L models under 16MB with int6 precision. Int6 has ~0.015 bpb less quantization penalty than int5, so this change should improve our score from ~1.14 to ~1.125 while keeping artifacts under the 16MB limit. Also switches QUANT_BITS default from 5 back to 6 since packed format eliminates the size constraint that forced int5.
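The 4-values-in-3-bytes layout can be sketched directly (illustrative helpers; big-endian bit order is an assumption):

```python
def pack_int6(vals):
    # Pack groups of four 6-bit codes (0..63) into 3 bytes: 4*6 = 24
    # bits, versus 4*8 = 32 bits with naive int8 storage (25% saved).
    assert len(vals) % 4 == 0
    out = bytearray()
    for i in range(0, len(vals), 4):
        a, b, c, d = vals[i:i + 4]
        bits = (a << 18) | (b << 12) | (c << 6) | d
        out += bits.to_bytes(3, "big")
    return bytes(out)

def unpack_int6(data):
    # Inverse: expand each 3-byte group back into four 6-bit codes.
    vals = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "big")
        vals += [(bits >> 18) & 63, (bits >> 12) & 63,
                 (bits >> 6) & 63, bits & 63]
    return vals
```

The round trip is lossless for codes in 0..63; the 25% raw saving is exact, while the post-zstd saving depends on the entropy of the codes.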
Packed int6 + zstd-22 produces 20.2MB artifacts — still over 16MB. The extra entropy per int6 value (64 states vs 32 for int5) doesn't compress away. The competition's int6 fits via aggressive QAT that clusters weights near grid points, reducing entropy. Our QAT isn't aggressive enough yet. Keep int5 as default (15.79MB, valid). Packed int6 code is preserved for future use when QAT improves.

- Removed unused eval_val_ttt_lora function and all TTT helper functions (_reset_ttt_optimizer, _build_ttt_optimizer, _find_docs, _compute_chunk_window, _accumulate_bpb) — none were called in the scored config
- Removed broken full-weight SGD TTT block that used undefined variables (use_compile, val_tokens_eval) — Copilot flagged this as a runtime crash
- TTT work continues on the separate submission/reproduce-414 branch
- Scored config unchanged: 11L, int5+zstd, 1.1399 bpb, 15.79MB artifact
Sub-1.0 bpb! Multi-order n-gram backoff (2-7gram) with entropy-adaptive alpha mixing on top of our 1.1229 neural base. 3-seed mean 0.9642, std 0.0002. All artifacts under 16MB. Seed 1337: 0.9640 | Seed 42: 0.9641 | Seed 2025: 0.9644
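The backoff lookup itself is small; a toy sketch under assumed names (`tables[order]` maps hashed context buckets to next-token counts; in the real cache these tables are built causally from already-scored tokens):

```python
NUM_BUCKETS = 4 * 1024 * 1024  # 4M hash buckets per order

def context_key(context, order):
    # Bucket the last (order - 1) tokens for the order-n table.
    return hash(tuple(context[-(order - 1):])) % NUM_BUCKETS

def backoff_predict(tables, context, min_count=2):
    # Highest matching order wins: try 7-gram down to 2-gram, gate on
    # min_count, and return raw count ratios (no smoothing).
    for order in range(7, 1, -1):
        counts = tables[order].get(context_key(context, order))
        if counts:
            total = sum(counts.values())
            if total >= min_count:
                return {tok: c / total for tok, c in counts.items()}
    return None  # no match at any order: use the neural model alone
```

When `backoff_predict` returns a distribution, it is mixed with the neural model's distribution using the entropy-adaptive alpha; when it returns None, the neural prediction is used unchanged.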
Pull request overview
This PR adds a set of “triton-kernels” skill documents (duplicated across multiple agent/tooling directories) plus an MCP server configuration, but it does not include any of the model/architecture changes described in the PR title/description (N-gram backoff cache, VRL, LeakyReLU², etc.).
Changes:
- Add Triton kernel optimization “skill” documentation (core SKILL + specialized guides for fused norms/epilogues, memory efficiency, and quantized block-scaled GEMM) across multiple agent directories.
- Add MCP server configuration for `colab-proxy-mcp`.
Reviewed changes
Copilot reviewed 148 out of 338 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| .mcpjam/skills/triton-kernels/SKILL.md | Adds Triton kernel “skill” overview and references to specialized guides. |
| .mcp.json | Adds MCP server config to launch colab-proxy-mcp via uvx from a Git URL. |
| .kode/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .kode/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Adds quantized block-scaled GEMM guide. |
| .kode/skills/triton-kernels/triton-memory-efficient-patterns.md | Adds memory-efficient Triton patterns guide. |
| .kode/skills/triton-kernels/triton-fused-normalizations.md | Adds fused normalization kernels guide. |
| .kode/skills/triton-kernels/triton-fused-epilogue-kernels.md | Adds fused epilogue (attention/GEMM) patterns guide. |
| .kiro/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .kiro/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .kiro/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .kiro/skills/triton-kernels/triton-fused-normalizations.md | Duplicates fused normalization kernels guide. |
| .kiro/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .kilocode/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .kilocode/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .kilocode/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .kilocode/skills/triton-kernels/triton-fused-normalizations.md | Duplicates fused normalization kernels guide. |
| .kilocode/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .junie/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .junie/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .junie/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .junie/skills/triton-kernels/triton-fused-normalizations.md | Duplicates fused normalization kernels guide. |
| .junie/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .goose/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .goose/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .goose/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .goose/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .factory/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .factory/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .factory/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .factory/skills/triton-kernels/triton-fused-normalizations.md | Duplicates fused normalization kernels guide. |
| .factory/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .crush/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .crush/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .crush/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .crush/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .continue/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .continue/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .continue/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .continue/skills/triton-kernels/triton-fused-normalizations.md | Duplicates fused normalization kernels guide. |
| .continue/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .commandcode/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .commandcode/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .commandcode/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .commandcode/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .codebuddy/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .codebuddy/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .codebuddy/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .codebuddy/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .claude/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .claude/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .claude/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .claude/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .agents/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .agents/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .agents/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .agents/skills/triton-kernels/triton-fused-normalizations.md | Duplicates fused normalization kernels guide. |
| .agents/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .agent/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .agent/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .agent/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .agent/skills/triton-kernels/triton-fused-normalizations.md | Duplicates fused normalization kernels guide. |
| .agent/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
Comments suppressed due to low confidence (1)
.kode/skills/triton-kernels/SKILL.md:1
- This PR introduces the same 'triton-kernels' skill content replicated across many directories (.kode/.kiro/.kilocode/.junie/.goose/.factory/.crush/.continue/.commandcode/.codebuddy/.claude/.agent/.agents/.mcpjam). Keeping many identical copies will be brittle (fixes like typos/API updates must be applied everywhere). If these are meant to stay in sync, consider centralizing the source (single canonical doc + symlinks or a generation step) so updates are made once.
```json
{
  "mcpServers": {
    "colab-proxy-mcp": {
      "command": "uvx",
      "args": ["git+https://github.com/googlecolab/colab-mcp"],
      "timeout": 30000
    }
  }
}
```
The MCP server is installed/executed directly from a moving Git URL, which is a supply-chain risk and can make builds non-reproducible. Pin the dependency to a specific tag/commit SHA (e.g., git+...@<sha>) and consider documenting/validating the allowed source to avoid unexpected remote code changes.
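A pinned version of the same config might look like this (`<commit-sha>` is a placeholder, not a real commit):

```json
{
  "mcpServers": {
    "colab-proxy-mcp": {
      "command": "uvx",
      "args": ["git+https://github.com/googlecolab/colab-mcp@<commit-sha>"],
      "timeout": 30000
    }
  }
}
```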
Cleaning up — too many files leaked into the diff.
Summary
val_bpb = 0.9642 (3-seed mean, std 0.0002) | ~15.95 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
All artifacts under 16,000,000 bytes. All 3 train logs attached.
Key Innovation: Multi-Order N-gram Backoff Cache
Backward-looking n-gram cache built causally from already-scored tokens. Zero artifact cost.
Entropy-Adaptive Alpha: `alpha = 0.05 + 0.55 * sigmoid(2*(H-4))`. Low alpha when the neural model is confident, high alpha when it is uncertain.

Multi-Order Backoff (2-7gram): Highest matching order wins. 4M hash buckets per order. min_count=2 gate. Raw count ratios, no smoothing.
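The mixing weight can be checked numerically (pure-Python sketch of the stated formula; H is entropy in bits):

```python
import math

def adaptive_alpha(entropy_bits: float) -> float:
    # alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4)): stays near 0.05 when
    # the neural model is confident (low entropy H) and rises toward
    # 0.60 when it is uncertain.
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (entropy_bits - 4.0)))
```

The midpoint H = 4 bits gives alpha = 0.325, so the n-gram cache only dominates on genuinely high-entropy positions.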
Compliance: Score-first — every token is scored under `torch.inference_mode()` before any table update. N-gram tables are built from already-scored tokens only. No training data access during eval. No oracle selection.

Training Architecture
PR #414 base + LeakyReLU² + VRL + lzma:
11L, 512d, 8H/4KV GQA, LeakyReLU(0.5)² MLP 3×, VRL, VE128, BigramHash(2048), XSA4, Partial RoPE 16/64, LN Scale, SmearGate, U-Net skips, EMA(0.997) + Tight SWA, Late QAT, GPTQ-lite int6 + lzma, FA3 Hopper, Muon WD=0.04
Credits
Test plan