
Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed mean)#887

Closed
anthony-maio wants to merge 28 commits into openai:main from anthony-maio:submission/match-sota-plus-ttt

Conversation

@anthony-maio

Summary

val_bpb = 0.9642 (3-seed mean, std 0.0002) | ~15.95 MB | 8×H100 SXM

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | Pre-ngram bpb | Post-ngram bpb | ng_helped | Artifact (bytes) |
|------|----------|-------|---------------|----------------|-----------|------------------|
| 1337 | 88.7 ms | 6,765 | 1.1225 | 0.9640 | 38.5% | 15,981,848 |
| 42 | 88.6 ms | 6,772 | 1.1224 | 0.9641 | 38.6% | 15,904,632 |
| 2025 | 88.6 ms | 6,776 | 1.1231 | 0.9644 | 38.6% | 15,974,308 |
| Mean | 88.6 ms | 6,771 | 1.1227 | 0.9642 (std 0.0002) | 38.6% | |

All artifacts under 16,000,000 bytes. All 3 train logs attached.

Key Innovation: Multi-Order N-gram Backoff Cache

Backward-looking n-gram cache built causally from already-scored tokens. Zero artifact cost.

Entropy-Adaptive Alpha: alpha = 0.05 + 0.55 * sigmoid(2*(H-4)). Low alpha when neural model is confident, high alpha when uncertain.
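The alpha formula above can be sketched in a few lines of plain Python (`adaptive_alpha` is an illustrative name, not from the PR's code):

```python
import math

def adaptive_alpha(entropy_bits: float) -> float:
    """Entropy-adaptive mixing weight as stated above:
    alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4)).
    Low alpha (trust the neural model) when its predictive entropy H is low;
    high alpha (lean on the n-gram cache) when H is high."""
    sigmoid = 1.0 / (1.0 + math.exp(-2.0 * (entropy_bits - 4.0)))
    return 0.05 + 0.55 * sigmoid
```

At H = 4 bits the sigmoid is exactly 0.5, giving alpha = 0.325; the weight is bounded in (0.05, 0.60).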

Multi-Order Backoff (2-7gram): Highest matching order wins. 4M hash buckets per order. min_count=2 gate. Raw count ratios, no smoothing.

Compliance: Score-first — every token scored under torch.inference_mode() before any table update. N-gram tables built from already-scored tokens only. No training data access during eval. No oracle selection.
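As a concrete sketch of the mechanism described above (highest-order match wins, hash buckets, min_count gate, raw count ratios, and update only after scoring), here is a toy version in plain Python. Class and method names are hypothetical, not the PR's actual code:

```python
from collections import defaultdict

class NGramBackoff:
    """Toy multi-order (2-7gram) backoff cache, per the description above."""
    def __init__(self, orders=range(2, 8), num_buckets=4_000_000, min_count=2):
        self.orders = sorted(orders, reverse=True)  # highest matching order wins
        self.num_buckets = num_buckets
        self.min_count = min_count
        # per order: context-bucket -> total count, and (bucket, token) -> count
        self.ctx = {n: defaultdict(int) for n in self.orders}
        self.tok = {n: defaultdict(int) for n in self.orders}

    def _bucket(self, context):
        return hash(context) % self.num_buckets

    def predict(self, history, token):
        """Raw count ratio count(ctx, token) / count(ctx) from the highest
        order whose context clears the min_count gate; None if no match."""
        for n in self.orders:
            if len(history) < n - 1:
                continue
            b = self._bucket(tuple(history[-(n - 1):]))
            if self.ctx[n][b] >= self.min_count:
                return self.tok[n][(b, token)] / self.ctx[n][b]
        return None

    def update(self, history, token):
        """Called only AFTER `token` has been scored (score-first rule)."""
        for n in self.orders:
            if len(history) >= n - 1:
                b = self._bucket(tuple(history[-(n - 1):]))
                self.ctx[n][b] += 1
                self.tok[n][(b, token)] += 1
```

Because tables are rebuilt causally from already-scored tokens, nothing needs to be shipped in the artifact, matching the "zero artifact cost" claim.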

Training Architecture

PR #414 base + LeakyReLU² + VRL + lzma:
11L, 512d, 8H/4KV GQA, LeakyReLU(0.5)² MLP 3×, VRL, VE128, BigramHash(2048), XSA4, Partial RoPE 16/64, LN Scale, SmearGate, U-Net skips, EMA(0.997) + Tight SWA, Late QAT, GPTQ-lite int6 + lzma, FA3 Hopper, Muon WD=0.04
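The LeakyReLU(0.5)² activation named in the stack above can be written out as follows (a scalar reference sketch, not the fused kernel):

```python
def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU(0.5) followed by squaring. The square makes outputs
    non-negative even for negative inputs, while the pre-square leak
    keeps gradients nonzero on the negative side."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```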

Credits

Test plan

  • Seed 1337: 0.9640 bpb, 15.98MB valid
  • Seed 42: 0.9641 bpb, 15.90MB valid
  • Seed 2025: 0.9644 bpb, 15.97MB valid
  • 3-seed mean: 0.9642, std 0.0002
  • All train logs attached
  • All artifacts under 16,000,000 bytes
  • Score-first compliance verified

anthony-maio and others added 28 commits March 21, 2026 02:21
Takes the proven SOTA script exactly (seq2048, MLP 3x, SmearGate,
BigramHash, int6+zstd, SWA, Muon WD 0.02, OrthoInit) and adds
TTT LoRA evaluation. TTT passes base_model directly (compiled).

If TTT works on this architecture: expected ~1.11-1.12 bpb (new record).
If TTT fails (SmearGate/BigramHash incompatibility): 1.1483 baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key changes from PR openai#162 base:
- 11 layers (from 9) — enabled by int6 compression headroom
- Full-weight SGD TTT (not LoRA): lr=0.002, momentum=0.9, 3 epochs
  over val data, freeze first 2 blocks for stability
- NTK-RoPE base=50000 (from 10000) for long-context extrapolation
- matrix_lr=0.025, scalar_lr=0.025, tied_embed_lr=0.035
- weight_decay=0.04 (from 0.01)
- BigramHash 2048 buckets (from 4096)
- TTT_ENABLED=1 env var toggle

Target: match FarnsworthEngine's 1.1303 bpb or beat it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
At ~5700 steps on our pods, warmdown=3000 means 53% of training is in
the LR decay phase. Reducing to 1500 doubles full-LR training time.
Council identified this as a free 0.005+ bpb improvement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NameError crashed after TTT epoch 3 completed successfully.
eval_stride/eval_sl were local variables from the pre-TTT eval
section, not visible in the TTT section. Use args.eval_stride
and args.train_seq_len directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9 layers (valid artifact under 16MB), full SOTA stack:
MLP 3x, SmearGate, BigramHash 2048, int6+zstd-22, SWA,
Muon WD=0.04, NTK-RoPE 50k, OrthoInit, sliding window stride=64.

Trained 4,782 steps at ~125ms/step on 8xH100 SXM.
Custom kernel integration in progress for next submission.

TTT disabled (does not improve results on this architecture).
Set NUM_LAYERS=11 for 11L variant (requires tighter compression).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Makora-generated persistent-CTA kernel fuses relu² + second matmul
into a single Triton launch during eval. First matmul stays on cuBLAS.
Active only during eval (not self.training) to preserve autograd.

Called 9x per forward pass during sliding window eval. Expected
~10% eval time reduction (190s → ~170s), freeing eval budget.

Falls back to PyTorch when Triton unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FA3 (flash_attn_func) uses Hopper-native TMA/WGMMA for 75-85%
GPU utilization vs FA2's ~60%. Expected to cut step time from
~108ms to ~85ms, yielding ~7,000 steps in 10 min.

Falls back to F.scaled_dot_product_attention when FA3 unavailable.

Also includes fused ReLU² MLP Triton kernel (1.26x during eval).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
int5 for MLP weights (largest tensors, ~60% of params),
int6 for attention weights. Expected compression:
19.1MB × (5/6 for MLP portion) ≈ 15.9MB.

Also restores NUM_LAYERS=11 as default.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous int5-MLP + int6-attention produced 16.56MB (556KB over).
Switching all large matrices to int5 should save ~700KB more.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
int5-all was 16.27MB (340KB over). MLP is ~60% of params.
int4 MLP + int5 attention should save ~500KB more.
Expected: ~15.8MB artifact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Int4 MLP was too aggressive (0.028 bpb penalty). Int5-all on 11L
was 340KB over. 10L at int5 should be ~14.8MB — safe margin.
10L is faster (~100ms vs 115ms) = more steps = compensates for
one fewer layer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Int5 was penalizing bpb by ~0.015-0.026. 9L with int6 fits at 15.9MB.
QUANT_BITS env var allows int5 for 11L when needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Research finding: setting warmdown higher than total steps makes LR
decay from step 1, compacting weight magnitudes continuously. This
reduces int6 quant penalty from ~0.014 to ~0.005 bpb. Our 1.1401
result used warmdown=3000 on ~4800 steps (63% warmdown) while our
1.1518 used warmdown=1500 on ~7400 steps (20% warmdown) — the
higher warmdown fraction gave better post-quant quality.
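The schedule interaction described in this commit can be sketched with a linear warmdown multiplier (an assumed form; the actual speedrun schedule may differ):

```python
def lr_scale(step: int, num_steps: int, warmdown: int) -> float:
    """Linear warmdown LR multiplier (assumed form). Full LR until the
    final `warmdown` steps, then linear decay toward 0. If warmdown is
    set larger than num_steps, the decay window covers the whole run,
    so the LR shrinks from step 1 -- the effect described above."""
    remaining = num_steps - step
    if remaining >= warmdown:
        return 1.0
    return remaining / warmdown
```

With num_steps=4800 and warmdown=6000, step 0 already sees a 0.8x multiplier, so weight magnitudes are compacted throughout training.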

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
From arXiv:2603.09078. Projects out the self-value component from
attention output, forcing the network to use contextual information.
Applied via GQA-aware zero-alloc view reshape on last 4 of 11 layers.

Both top unmerged submissions (PR openai#374 at 1.1246 and PR openai#379 at 1.1260)
use XSA as a key technique.

Full next-gen stack now includes: 11L, XSA, Partial RoPE 16/64,
Late QAT STE, Tight SWA, GPTQ-lite, LN Scale, FA3, SmearGate,
BigramHash, int6+zstd, Muon WD, OrthoInit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous int6 produced 19.0MB on 11L. Int5 should give ~15.8MB.
Late QAT STE clusters weights near the int5 grid during training,
so the quality penalty should be much smaller than without QAT.

val_bpb=1.1309 achieved with int6 (artifact too big). Int5+QAT
should preserve most of that while fitting under 16MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full stack: 11 layers, XSA on last 4, Partial RoPE 16/64, Late QAT STE,
Tight SWA (scale<0.2), GPTQ-lite clip search, LN Scale 1/sqrt(i+1),
FA3, MLP3x, SmearGate, BigramHash 2048, int5+zstd, Muon WD=0.04,
NTK-RoPE 50k, OrthoInit, sliding window stride=64.

4,832 steps at 117ms/step on slow pod. On 80ms pod: 1.1309 (invalid artifact).
With fast pod + int5: expected ~1.13 valid.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5,660 steps at 101ms/step. Full stack: 11L, XSA, Partial RoPE,
Late QAT STE, Tight SWA (7 checkpoints), GPTQ-lite, LN Scale,
FA3, MLP3x, SmearGate, BigramHash, int5+zstd, Muon WD, OrthoInit.

#1 on the merged leaderboard. Beats thwu1 (1.1428) by 0.003.
On faster pods (80ms): 1.1309 achieved (invalid artifact with int6).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- README: updated with actual 1.1399 results, removed TTT/PENDING claims
- submission.json: aligned to repo schema (name, blurb, bytes_total)
- train_gpt.py: fixed docstring line count claim, renamed artifact file,
  fixed misleading int8+zlib log string to reflect actual int5+compressor
- Addresses all 5 Copilot review comments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Fused RMSNorm (fwd+bwd): replaces F.rms_norm in Block.forward
   for both attn_norm and mlp_norm. Saves rstd for backward.
   Called 22x per step (2 per block × 11 blocks).

2. Fused ReLU² MLP backward: fuses (grad_out @ proj_weight) * relu_deriv
   into single Triton kernel, eliminating [M, 1536] HBM intermediate.
   Called 11x per step backward pass.

Both fall back to PyTorch when Triton unavailable.
Expected: 13-15ms/step savings on 100ms baseline = 13-15% speedup.
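The RMSNorm forward/backward pair this kernel fuses can be checked with a plain-Python scalar reference (not the Triton code; the saved `rstd` mirrors what the commit says the kernel stores for backward):

```python
import math

def rmsnorm_fwd(x, w, eps=1e-6):
    """Reference RMSNorm: y_i = w_i * x_i * rstd, with rstd saved for backward."""
    n = len(x)
    rstd = 1.0 / math.sqrt(sum(v * v for v in x) / n + eps)
    return [w[i] * x[i] * rstd for i in range(n)], rstd

def rmsnorm_bwd(x, w, dy, rstd):
    """Analytic input gradient:
    dx_k = rstd * w_k * dy_k - x_k * rstd**3 / n * sum_i(dy_i * w_i * x_i)."""
    n = len(x)
    s = sum(dy[i] * w[i] * x[i] for i in range(n))
    return [rstd * w[k] * dy[k] - x[k] * rstd ** 3 * s / n for k in range(n)]
```

A finite-difference check on the loss L = sum(dy * y) confirms the analytic gradient, which is the kind of correctness test worth running before trusting any fused backward kernel.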

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Detached .to(x.dtype) copies broke gradient chain to fp32 params.
   Fix: pass raw fp32 params to Function, cast inside forward,
   return .float() gradients in backward.

2. Grid capped at num_sms*4 but kernel isn't persistent — tiles
   beyond cap were never computed, leaving grad_h uninitialized.
   Fix: launch all tiles (remove min cap).

Both kernels re-enabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Custom Triton kernels add 38ms/step overhead vs torch.compile baseline.
The Inductor compiler already fuses RMSNorm and MLP operations effectively
on H100. Custom kernels remain in codebase for future optimization but
are disabled for the competition submission.

Kernel code is correct (no NaN after bug fixes) but slower than compiled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full reproducibility log showing end-to-end training + eval pipeline.
5,205 steps at 108ms/step. Note: this particular run's artifact was
16.46MB (462KB over limit) due to pod variance in SWA averaging.
Our submitted score of 1.1399 comes from a run with valid 15.79MB
artifact on a faster pod (101ms/step, 5,660 steps).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Custom binary packing stores 4 int6 values in 3 bytes (6 bits each)
instead of wasting 2 bits per value with int8 storage. This reduces
raw artifact size by 25%, which combined with zstd-22 compression
should fit 11L models under 16MB with int6 precision.
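The 4-values-in-3-bytes layout described above (24 bits = 4 × 6 bits) can be sketched in plain Python; the PR's actual packing code may order the bits differently:

```python
def pack_int6(values):
    """Pack 6-bit values (0..63) four at a time into 3 bytes.
    Length of `values` must be a multiple of 4."""
    out = bytearray()
    for i in range(0, len(values), 4):
        a, b, c, d = values[i:i + 4]
        bits = (a << 18) | (b << 12) | (c << 6) | d
        out += bytes(((bits >> 16) & 0xFF, (bits >> 8) & 0xFF, bits & 0xFF))
    return bytes(out)

def unpack_int6(data):
    """Inverse of pack_int6: 3 bytes back into four 6-bit values."""
    out = []
    for i in range(0, len(data), 3):
        bits = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2]
        out += [(bits >> 18) & 0x3F, (bits >> 12) & 0x3F,
                (bits >> 6) & 0x3F, bits & 0x3F]
    return out
```

Eight int6 values occupy 6 bytes instead of 8, the 25% raw-size reduction the commit cites.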

Int6 has ~0.015 bpb less quantization penalty than int5, so this
change should improve our score from ~1.14 to ~1.125 while keeping
artifacts under the 16MB limit.

Also switches QUANT_BITS default from 5 back to 6 since packed
format eliminates the size constraint that forced int5.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Packed int6 + zstd-22 produces 20.2MB artifacts — still over 16MB.
The extra entropy per int6 value (64 states vs 32 for int5) doesn't
compress away. The competition's int6 fits via aggressive QAT that
clusters weights near grid points, reducing entropy. Our QAT isn't
aggressive enough yet. Keep int5 as default (15.79MB, valid).

Packed int6 code is preserved for future use when QAT improves.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Removed unused eval_val_ttt_lora function and all TTT helper functions
  (_reset_ttt_optimizer, _build_ttt_optimizer, _find_docs,
  _compute_chunk_window, _accumulate_bpb) — none were called in the
  scored config
- Removed broken full-weight SGD TTT block that used undefined variables
  (use_compile, val_tokens_eval) — Copilot flagged this as a runtime crash
- TTT work continues on the separate submission/reproduce-414 branch
- Scored config unchanged: 11L, int5+zstd, 1.1399 bpb, 15.79MB artifact

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sub-1.0 bpb! Multi-order n-gram backoff (2-7gram) with entropy-adaptive
alpha mixing on top of our 1.1229 neural base. 3-seed mean 0.9642,
std 0.0002. All artifacts under 16MB.

Seed 1337: 0.9640 | Seed 42: 0.9641 | Seed 2025: 0.9644

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 26, 2026 19:08
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a set of “triton-kernels” skill documents (duplicated across multiple agent/tooling directories) plus an MCP server configuration, but it does not include any of the model/architecture changes described in the PR title/description (N-gram backoff cache, VRL, LeakyReLU², etc.).

Changes:

  • Add Triton kernel optimization “skill” documentation (core SKILL + specialized guides for fused norms/epilogues, memory efficiency, and quantized block-scaled GEMM) across multiple agent directories.
  • Add MCP server configuration for colab-proxy-mcp.

Reviewed changes

Copilot reviewed 148 out of 338 changed files in this pull request and generated 4 comments.

File Description
.mcpjam/skills/triton-kernels/SKILL.md Adds Triton kernel “skill” overview and references to specialized guides.
.mcp.json Adds MCP server config to launch colab-proxy-mcp via uvx from a Git URL.
.kode/skills/triton-kernels/SKILL.md Duplicates the Triton kernel “skill” overview.
.kode/skills/triton-kernels/triton-quantized-block-scaled-gemm.md Adds quantized block-scaled GEMM guide.
.kode/skills/triton-kernels/triton-memory-efficient-patterns.md Adds memory-efficient Triton patterns guide.
.kode/skills/triton-kernels/triton-fused-normalizations.md Adds fused normalization kernels guide.
.kode/skills/triton-kernels/triton-fused-epilogue-kernels.md Adds fused epilogue (attention/GEMM) patterns guide.
.kiro/skills/triton-kernels/SKILL.md Duplicates the Triton kernel “skill” overview.
.kiro/skills/triton-kernels/triton-quantized-block-scaled-gemm.md Duplicates quantized block-scaled GEMM guide.
.kiro/skills/triton-kernels/triton-memory-efficient-patterns.md Duplicates memory-efficient Triton patterns guide.
.kiro/skills/triton-kernels/triton-fused-normalizations.md Duplicates fused normalization kernels guide.
.kiro/skills/triton-kernels/triton-fused-epilogue-kernels.md Duplicates fused epilogue patterns guide.
.kilocode/skills/triton-kernels/SKILL.md Duplicates the Triton kernel “skill” overview.
.kilocode/skills/triton-kernels/triton-quantized-block-scaled-gemm.md Duplicates quantized block-scaled GEMM guide.
.kilocode/skills/triton-kernels/triton-memory-efficient-patterns.md Duplicates memory-efficient Triton patterns guide.
.kilocode/skills/triton-kernels/triton-fused-normalizations.md Duplicates fused normalization kernels guide.
.kilocode/skills/triton-kernels/triton-fused-epilogue-kernels.md Duplicates fused epilogue patterns guide.
.junie/skills/triton-kernels/SKILL.md Duplicates the Triton kernel “skill” overview.
.junie/skills/triton-kernels/triton-quantized-block-scaled-gemm.md Duplicates quantized block-scaled GEMM guide.
.junie/skills/triton-kernels/triton-memory-efficient-patterns.md Duplicates memory-efficient Triton patterns guide.
.junie/skills/triton-kernels/triton-fused-normalizations.md Duplicates fused normalization kernels guide.
.junie/skills/triton-kernels/triton-fused-epilogue-kernels.md Duplicates fused epilogue patterns guide.
.goose/skills/triton-kernels/SKILL.md Duplicates the Triton kernel “skill” overview.
.goose/skills/triton-kernels/triton-quantized-block-scaled-gemm.md Duplicates quantized block-scaled GEMM guide.
.goose/skills/triton-kernels/triton-memory-efficient-patterns.md Duplicates memory-efficient Triton patterns guide.
.goose/skills/triton-kernels/triton-fused-epilogue-kernels.md Duplicates fused epilogue patterns guide.
.factory/skills/triton-kernels/SKILL.md Duplicates the Triton kernel “skill” overview.
.factory/skills/triton-kernels/triton-quantized-block-scaled-gemm.md Duplicates quantized block-scaled GEMM guide.
.factory/skills/triton-kernels/triton-memory-efficient-patterns.md Duplicates memory-efficient Triton patterns guide.
.factory/skills/triton-kernels/triton-fused-normalizations.md Duplicates fused normalization kernels guide.
.factory/skills/triton-kernels/triton-fused-epilogue-kernels.md Duplicates fused epilogue patterns guide.
.crush/skills/triton-kernels/SKILL.md Duplicates the Triton kernel “skill” overview.
.crush/skills/triton-kernels/triton-quantized-block-scaled-gemm.md Duplicates quantized block-scaled GEMM guide.
.crush/skills/triton-kernels/triton-memory-efficient-patterns.md Duplicates memory-efficient Triton patterns guide.
.crush/skills/triton-kernels/triton-fused-epilogue-kernels.md Duplicates fused epilogue patterns guide.
.continue/skills/triton-kernels/SKILL.md Duplicates the Triton kernel “skill” overview.
.continue/skills/triton-kernels/triton-quantized-block-scaled-gemm.md Duplicates quantized block-scaled GEMM guide.
.continue/skills/triton-kernels/triton-memory-efficient-patterns.md Duplicates memory-efficient Triton patterns guide.
.continue/skills/triton-kernels/triton-fused-normalizations.md Duplicates fused normalization kernels guide.
.continue/skills/triton-kernels/triton-fused-epilogue-kernels.md Duplicates fused epilogue patterns guide.
.commandcode/skills/triton-kernels/SKILL.md Duplicates the Triton kernel “skill” overview.
.commandcode/skills/triton-kernels/triton-quantized-block-scaled-gemm.md Duplicates quantized block-scaled GEMM guide.
.commandcode/skills/triton-kernels/triton-memory-efficient-patterns.md Duplicates memory-efficient Triton patterns guide.
.commandcode/skills/triton-kernels/triton-fused-epilogue-kernels.md Duplicates fused epilogue patterns guide.
.codebuddy/skills/triton-kernels/SKILL.md Duplicates the Triton kernel “skill” overview.
.codebuddy/skills/triton-kernels/triton-quantized-block-scaled-gemm.md Duplicates quantized block-scaled GEMM guide.
.codebuddy/skills/triton-kernels/triton-memory-efficient-patterns.md Duplicates memory-efficient Triton patterns guide.
.codebuddy/skills/triton-kernels/triton-fused-epilogue-kernels.md Duplicates fused epilogue patterns guide.
.claude/skills/triton-kernels/SKILL.md Duplicates the Triton kernel “skill” overview.
.claude/skills/triton-kernels/triton-quantized-block-scaled-gemm.md Duplicates quantized block-scaled GEMM guide.
.claude/skills/triton-kernels/triton-memory-efficient-patterns.md Duplicates memory-efficient Triton patterns guide.
.claude/skills/triton-kernels/triton-fused-epilogue-kernels.md Duplicates fused epilogue patterns guide.
.agents/skills/triton-kernels/SKILL.md Duplicates the Triton kernel “skill” overview.
.agents/skills/triton-kernels/triton-quantized-block-scaled-gemm.md Duplicates quantized block-scaled GEMM guide.
.agents/skills/triton-kernels/triton-memory-efficient-patterns.md Duplicates memory-efficient Triton patterns guide.
.agents/skills/triton-kernels/triton-fused-normalizations.md Duplicates fused normalization kernels guide.
.agents/skills/triton-kernels/triton-fused-epilogue-kernels.md Duplicates fused epilogue patterns guide.
.agent/skills/triton-kernels/SKILL.md Duplicates the Triton kernel “skill” overview.
.agent/skills/triton-kernels/triton-quantized-block-scaled-gemm.md Duplicates quantized block-scaled GEMM guide.
.agent/skills/triton-kernels/triton-memory-efficient-patterns.md Duplicates memory-efficient Triton patterns guide.
.agent/skills/triton-kernels/triton-fused-normalizations.md Duplicates fused normalization kernels guide.
.agent/skills/triton-kernels/triton-fused-epilogue-kernels.md Duplicates fused epilogue patterns guide.
Comments suppressed due to low confidence (1)

.kode/skills/triton-kernels/SKILL.md:1

  • This PR introduces the same 'triton-kernels' skill content replicated across many directories (.kode/.kiro/.kilocode/.junie/.goose/.factory/.crush/.continue/.commandcode/.codebuddy/.claude/.agent/.agents/.mcpjam). Keeping many identical copies will be brittle (fixes like typos/API updates must be applied everywhere). If these are meant to stay in sync, consider centralizing the source (single canonical doc + symlinks or a generation step) so updates are made once.


Comment on lines +1 to +9
{
  "mcpServers": {
    "colab-proxy-mcp": {
      "command": "uvx",
      "args": ["git+https://github.com/googlecolab/colab-mcp"],
      "timeout": 30000
    }
  }
}

Copilot AI Mar 26, 2026


The MCP server is installed/executed directly from a moving Git URL, which is a supply-chain risk and can make builds non-reproducible. Pin the dependency to a specific tag/commit SHA (e.g., git+...@<sha>) and consider documenting/validating the allowed source to avoid unexpected remote code changes.

@anthony-maio
Author

Cleaning up — too many files leaked into diff.

@anthony-maio anthony-maio deleted the submission/match-sota-plus-ttt branch March 26, 2026 19:14
