Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed mean)#887
anthony-maio wants to merge 28 commits into openai:main
Conversation
Takes the proven SOTA script exactly (seq2048, MLP 3x, SmearGate, BigramHash, int6+zstd, SWA, Muon WD 0.02, OrthoInit) and adds TTT LoRA evaluation. TTT passes base_model directly (compiled). If TTT works on this architecture: expected ~1.11-1.12 bpb (new record). If TTT fails (SmearGate/BigramHash incompatibility): 1.1483 baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key changes from PR openai#162 base:
- 11 layers (from 9) — enabled by int6 compression headroom
- Full-weight SGD TTT (not LoRA): lr=0.002, momentum=0.9, 3 epochs over val data, freeze first 2 blocks for stability
- NTK-RoPE base=50000 (from 10000) for long-context extrapolation
- matrix_lr=0.025, scalar_lr=0.025, tied_embed_lr=0.035
- weight_decay=0.04 (from 0.01)
- BigramHash 2048 buckets (from 4096)
- TTT_ENABLED=1 env var toggle

Target: match FarnsworthEngine's 1.1303 bpb or beat it.
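The NTK-RoPE base change above can be sketched in isolation (a minimal illustration; the function name and shapes are assumptions, not the script's actual API):

```python
def rope_inv_freqs(head_dim, base=50000.0):
    # RoPE inverse frequencies; raising the base from 10000 to 50000
    # (NTK-style) slows the rotation of low-frequency dimensions,
    # which helps long-context extrapolation.
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
```

With the larger base, the lowest-frequency dimensions rotate more slowly, so positions far beyond the training length still map to distinguishable angles.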
At ~5700 steps on our pods, warmdown=3000 means 53% of training is in the LR decay phase. Reducing to 1500 doubles full-LR training time. Council identified this as a free 0.005+ bpb improvement.

NameError crashed after TTT epoch 3 completed successfully. eval_stride/eval_sl were local variables from the pre-TTT eval section, not visible in the TTT section. Use args.eval_stride and args.train_seq_len directly.

9 layers (valid artifact under 16MB), full SOTA stack: MLP 3x, SmearGate, BigramHash 2048, int6+zstd-22, SWA, Muon WD=0.04, NTK-RoPE 50k, OrthoInit, sliding window stride=64. Trained 4,782 steps at ~125ms/step on 8xH100 SXM. Custom kernel integration in progress for next submission. TTT disabled (does not improve results on this architecture). Set NUM_LAYERS=11 for 11L variant (requires tighter compression).
Makora-generated persistent-CTA kernel fuses relu² + second matmul into a single Triton launch during eval. First matmul stays on cuBLAS. Active only during eval (not self.training) to preserve autograd. Called 9x per forward pass during sliding window eval. Expected ~10% eval time reduction (190s → ~170s), freeing eval budget. Falls back to PyTorch when Triton unavailable.
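The eval-only dispatch described above reduces to a small piece of logic (names are illustrative; the real check lives inside the module's forward):

```python
def choose_mlp_path(training: bool, triton_available: bool) -> str:
    # The fused relu^2 + second-matmul kernel runs only during eval
    # (not self.training), so autograd never sees it; if Triton is
    # missing, every call takes the plain PyTorch path.
    if not training and triton_available:
        return "fused_eval_kernel"
    return "pytorch"
```

Keeping the fused path out of training avoids writing a custom backward for it while still capturing the eval-time speedup.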
FA3 (flash_attn_func) uses Hopper-native TMA/WGMMA for 75-85% GPU utilization vs FA2's ~60%. Expected to cut step time from ~108ms to ~85ms, yielding ~7,000 steps in 10 min. Falls back to F.scaled_dot_product_attention when FA3 unavailable. Also includes fused ReLU² MLP Triton kernel (1.26x during eval).
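The FA3-with-fallback selection amounts to an import probe (a sketch only; `flash_attn` is assumed to be the package that ships `flash_attn_func`):

```python
import importlib.util

def pick_attention_backend() -> str:
    # Prefer FlashAttention's flash_attn_func when the package is
    # importable; otherwise fall back to PyTorch's
    # F.scaled_dot_product_attention.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn_func"
    return "torch_sdpa"
```

Probing once at import time (rather than wrapping every call in try/except) keeps the hot path branch-free.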
int5 for MLP weights (largest tensors, ~60% of params), int6 for attention weights. Expected compression: 19.1MB × (5/6 for MLP portion) ≈ 15.9MB. Also restores NUM_LAYERS=11 as default.

Previous int5-MLP + int6-attention produced 16.56MB (556KB over). Switching all large matrices to int5 should save ~700KB more.

int5-all was 16.27MB (340KB over). MLP is ~60% of params. int4 MLP + int5 attention should save ~500KB more. Expected: ~15.8MB artifact.

Int4 MLP was too aggressive (0.028 bpb penalty). Int5-all on 11L was 340KB over. 10L at int5 should be ~14.8MB — safe margin. 10L is also faster (~100ms vs 115ms per step), so the extra steps compensate for one fewer layer.

Int5 was penalizing bpb by ~0.015-0.026. 9L with int6 fits at 15.9MB. QUANT_BITS env var allows int5 for 11L when needed.
Research finding: setting warmdown higher than total steps makes LR decay from step 1, compacting weight magnitudes continuously. This reduces int6 quant penalty from ~0.014 to ~0.005 bpb. Our 1.1401 result used warmdown=3000 on ~4800 steps (63% warmdown) while our 1.1518 used warmdown=1500 on ~7400 steps (20% warmdown) — the higher warmdown fraction gave better post-quant quality.
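A minimal linear-warmdown multiplier makes the finding concrete: when `warmdown` exceeds `total_steps`, the multiplier is already below 1.0 at step 0 and decays for the entire run (a hypothetical helper, not the script's exact schedule):

```python
def lr_mult(step: int, total_steps: int, warmdown: int) -> float:
    # Constant LR until the final `warmdown` steps, then linear decay
    # to zero. If warmdown >= total_steps, decay starts immediately.
    steps_left = total_steps - step
    if steps_left >= warmdown:
        return 1.0
    return steps_left / warmdown
```

For example, warmdown=6000 on a 4800-step run starts at a 0.8 multiplier and shrinks from there, which is what compacts weight magnitudes throughout training.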
From arXiv:2603.09078. Projects out the self-value component from attention output, forcing the network to use contextual information. Applied via GQA-aware zero-alloc view reshape on last 4 of 11 layers. Both top unmerged submissions (PR openai#374 at 1.1246 and PR openai#379 at 1.1260) use XSA as a key technique. Full next-gen stack now includes: 11L, XSA, Partial RoPE 16/64, Late QAT STE, Tight SWA, GPTQ-lite, LN Scale, FA3, SmearGate, BigramHash, int6+zstd, Muon WD, OrthoInit.
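Projecting out the self-value component can be sketched on flat vectors (a pure-Python illustration; the paper's exact formulation and the GQA view handling are not reproduced here):

```python
def remove_self_component(attn_out, v_self, eps=1e-8):
    # Subtract the projection of the attention output onto the token's
    # own value vector, leaving only the contextual component.
    dot = sum(a * b for a, b in zip(attn_out, v_self))
    norm_sq = sum(b * b for b in v_self) + eps
    coef = dot / norm_sq
    return [a - coef * b for a, b in zip(attn_out, v_self)]
```

After the projection, the output is (numerically) orthogonal to the self-value direction, so the head cannot simply copy its own value through.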
Previous int6 produced 19.0MB on 11L. Int5 should give ~15.8MB. Late QAT STE clusters weights near the int5 grid during training, so the quality penalty should be much smaller than without QAT. val_bpb=1.1309 achieved with int6 (artifact too big). Int5+QAT should preserve most of that while fitting under 16MB.
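Late QAT with a straight-through estimator rounds weights onto the int grid in the forward pass while backward treats the rounding as identity; a scalar sketch of the forward half (the real version is tensorized and wrapped for autograd):

```python
def quantize_ste_forward(w: float, bits: int = 5, scale: float = 1.0) -> float:
    # Forward half of the STE: clamp + round onto a signed int grid.
    # During QAT, backward passes gradients through unchanged, so
    # weights are pulled toward representable grid points.
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(w / scale)))
    return q * scale
```

Because training sees the quantized values, the weight distribution clusters near the grid, which is also what makes the artifact compress better.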
Full stack: 11 layers, XSA on last 4, Partial RoPE 16/64, Late QAT STE, Tight SWA (scale<0.2), GPTQ-lite clip search, LN Scale 1/sqrt(i+1), FA3, MLP3x, SmearGate, BigramHash 2048, int5+zstd, Muon WD=0.04, NTK-RoPE 50k, OrthoInit, sliding window stride=64. 4,832 steps at 117ms/step on slow pod. On 80ms pod: 1.1309 (invalid artifact). With fast pod + int5: expected ~1.13 valid.

5,660 steps at 101ms/step. Full stack: 11L, XSA, Partial RoPE, Late QAT STE, Tight SWA (7 checkpoints), GPTQ-lite, LN Scale, FA3, MLP3x, SmearGate, BigramHash, int5+zstd, Muon WD, OrthoInit. #1 on the merged leaderboard. Beats thwu1 (1.1428) by 0.003. On faster pods (80ms): 1.1309 achieved (invalid artifact with int6).

- README: updated with actual 1.1399 results, removed TTT/PENDING claims
- submission.json: aligned to repo schema (name, blurb, bytes_total)
- train_gpt.py: fixed docstring line count claim, renamed artifact file, fixed misleading int8+zlib log string to reflect actual int5+compressor
- Addresses all 5 Copilot review comments
1. Fused RMSNorm (fwd+bwd): replaces F.rms_norm in Block.forward for both attn_norm and mlp_norm. Saves rstd for backward. Called 22x per step (2 per block × 11 blocks).
2. Fused ReLU² MLP backward: fuses (grad_out @ proj_weight) * relu_deriv into a single Triton kernel, eliminating the [M, 1536] HBM intermediate. Called 11x per step in the backward pass.

Both fall back to PyTorch when Triton is unavailable. Expected: 13-15ms/step savings on a 100ms baseline, a 13-15% speedup.
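The `relu_deriv` factor the fused backward multiplies in is just 2·relu(x); a scalar check (illustrative names, not the kernel's code):

```python
def relu2(x: float) -> float:
    # Forward: squared ReLU activation.
    return max(x, 0.0) ** 2

def relu2_backward(x: float, grad_out: float) -> float:
    # d/dx relu(x)^2 = 2 * relu(x); the fused kernel applies this
    # factor elementwise to (grad_out @ proj_weight) in one launch,
    # avoiding a separate HBM round-trip for the intermediate.
    return grad_out * 2.0 * max(x, 0.0)
```

A central finite difference confirms the derivative on the positive branch and the hard zero on the negative branch.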
1. Detached .to(x.dtype) copies broke the gradient chain to fp32 params. Fix: pass raw fp32 params to the Function, cast inside forward, return .float() gradients in backward.
2. The grid was capped at num_sms*4 but the kernel isn't persistent — tiles beyond the cap were never computed, leaving grad_h uninitialized. Fix: launch all tiles (remove the min cap).

Both kernels re-enabled.

Custom Triton kernels add 38ms/step overhead vs the torch.compile baseline. The Inductor compiler already fuses RMSNorm and MLP operations effectively on H100. Custom kernels remain in the codebase for future optimization but are disabled for the competition submission. The kernel code is correct (no NaN after the bug fixes) but slower than compiled.

Full reproducibility log showing the end-to-end training + eval pipeline. 5,205 steps at 108ms/step. Note: this particular run's artifact was 16.46MB (462KB over limit) due to pod variance in SWA averaging. Our submitted score of 1.1399 comes from a run with a valid 15.79MB artifact on a faster pod (101ms/step, 5,660 steps).
Custom binary packing stores 4 int6 values in 3 bytes (6 bits each) instead of wasting 2 bits per value with int8 storage. This reduces raw artifact size by 25%, which combined with zstd-22 compression should fit 11L models under 16MB with int6 precision. Int6 has ~0.015 bpb less quantization penalty than int5, so this change should improve our score from ~1.14 to ~1.125 while keeping artifacts under the 16MB limit. Also switches QUANT_BITS default from 5 back to 6 since packed format eliminates the size constraint that forced int5.
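The 4-values-in-3-bytes layout can be sketched directly (illustrative helpers; big-endian bit order is an assumption):

```python
def pack_int6(vals):
    # Pack groups of four 6-bit codes (0..63) into 3 bytes: 4*6 = 24
    # bits, versus 4*8 = 32 bits with naive int8 storage (25% saved).
    assert len(vals) % 4 == 0
    out = bytearray()
    for i in range(0, len(vals), 4):
        a, b, c, d = vals[i:i + 4]
        bits = (a << 18) | (b << 12) | (c << 6) | d
        out += bits.to_bytes(3, "big")
    return bytes(out)

def unpack_int6(data):
    # Inverse: expand each 3-byte group back into four 6-bit codes.
    vals = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "big")
        vals += [(bits >> 18) & 63, (bits >> 12) & 63,
                 (bits >> 6) & 63, bits & 63]
    return vals
```

The round trip is lossless for codes in 0..63; the 25% raw saving is exact, while the post-zstd saving depends on the entropy of the codes.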
Packed int6 + zstd-22 produces 20.2MB artifacts — still over 16MB. The extra entropy per int6 value (64 states vs 32 for int5) doesn't compress away. The competition's int6 fits via aggressive QAT that clusters weights near grid points, reducing entropy. Our QAT isn't aggressive enough yet. Keep int5 as default (15.79MB, valid). Packed int6 code is preserved for future use when QAT improves.

- Removed unused eval_val_ttt_lora function and all TTT helper functions (_reset_ttt_optimizer, _build_ttt_optimizer, _find_docs, _compute_chunk_window, _accumulate_bpb) — none were called in the scored config
- Removed broken full-weight SGD TTT block that used undefined variables (use_compile, val_tokens_eval) — Copilot flagged this as a runtime crash
- TTT work continues on the separate submission/reproduce-414 branch
- Scored config unchanged: 11L, int5+zstd, 1.1399 bpb, 15.79MB artifact
Sub-1.0 bpb! Multi-order n-gram backoff (2-7gram) with entropy-adaptive alpha mixing on top of our 1.1229 neural base. 3-seed mean 0.9642, std 0.0002. All artifacts under 16MB. Seed 1337: 0.9640 | Seed 42: 0.9641 | Seed 2025: 0.9644
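The backoff lookup itself is small; a toy sketch under assumed names (`tables[order]` maps hashed context buckets to next-token counts; in the real cache these tables are built causally from already-scored tokens):

```python
NUM_BUCKETS = 4 * 1024 * 1024  # 4M hash buckets per order

def context_key(context, order):
    # Bucket the last (order - 1) tokens for the order-n table.
    return hash(tuple(context[-(order - 1):])) % NUM_BUCKETS

def backoff_predict(tables, context, min_count=2):
    # Highest matching order wins: try 7-gram down to 2-gram, gate on
    # min_count, and return raw count ratios (no smoothing).
    for order in range(7, 1, -1):
        counts = tables[order].get(context_key(context, order))
        if counts:
            total = sum(counts.values())
            if total >= min_count:
                return {tok: c / total for tok, c in counts.items()}
    return None  # no match at any order: use the neural model alone
```

When `backoff_predict` returns a distribution, it is mixed with the neural model's distribution using the entropy-adaptive alpha; when it returns None, the neural prediction is used unchanged.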
Pull request overview
This PR adds a set of “triton-kernels” skill documents (duplicated across multiple agent/tooling directories) plus an MCP server configuration, but it does not include any of the model/architecture changes described in the PR title/description (N-gram backoff cache, VRL, LeakyReLU², etc.).
Changes:
- Add Triton kernel optimization “skill” documentation (core SKILL + specialized guides for fused norms/epilogues, memory efficiency, and quantized block-scaled GEMM) across multiple agent directories.
- Add MCP server configuration for `colab-proxy-mcp`.
Reviewed changes
Copilot reviewed 148 out of 338 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| .mcpjam/skills/triton-kernels/SKILL.md | Adds Triton kernel “skill” overview and references to specialized guides. |
| .mcp.json | Adds MCP server config to launch colab-proxy-mcp via uvx from a Git URL. |
| .kode/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .kode/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Adds quantized block-scaled GEMM guide. |
| .kode/skills/triton-kernels/triton-memory-efficient-patterns.md | Adds memory-efficient Triton patterns guide. |
| .kode/skills/triton-kernels/triton-fused-normalizations.md | Adds fused normalization kernels guide. |
| .kode/skills/triton-kernels/triton-fused-epilogue-kernels.md | Adds fused epilogue (attention/GEMM) patterns guide. |
| .kiro/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .kiro/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .kiro/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .kiro/skills/triton-kernels/triton-fused-normalizations.md | Duplicates fused normalization kernels guide. |
| .kiro/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .kilocode/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .kilocode/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .kilocode/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .kilocode/skills/triton-kernels/triton-fused-normalizations.md | Duplicates fused normalization kernels guide. |
| .kilocode/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .junie/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .junie/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .junie/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .junie/skills/triton-kernels/triton-fused-normalizations.md | Duplicates fused normalization kernels guide. |
| .junie/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .goose/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .goose/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .goose/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .goose/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .factory/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .factory/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .factory/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .factory/skills/triton-kernels/triton-fused-normalizations.md | Duplicates fused normalization kernels guide. |
| .factory/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .crush/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .crush/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .crush/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .crush/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .continue/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .continue/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .continue/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .continue/skills/triton-kernels/triton-fused-normalizations.md | Duplicates fused normalization kernels guide. |
| .continue/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .commandcode/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .commandcode/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .commandcode/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .commandcode/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .codebuddy/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .codebuddy/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .codebuddy/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .codebuddy/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .claude/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .claude/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .claude/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .claude/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .agents/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .agents/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .agents/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .agents/skills/triton-kernels/triton-fused-normalizations.md | Duplicates fused normalization kernels guide. |
| .agents/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
| .agent/skills/triton-kernels/SKILL.md | Duplicates the Triton kernel “skill” overview. |
| .agent/skills/triton-kernels/triton-quantized-block-scaled-gemm.md | Duplicates quantized block-scaled GEMM guide. |
| .agent/skills/triton-kernels/triton-memory-efficient-patterns.md | Duplicates memory-efficient Triton patterns guide. |
| .agent/skills/triton-kernels/triton-fused-normalizations.md | Duplicates fused normalization kernels guide. |
| .agent/skills/triton-kernels/triton-fused-epilogue-kernels.md | Duplicates fused epilogue patterns guide. |
Comments suppressed due to low confidence (1)
.kode/skills/triton-kernels/SKILL.md:1
- This PR introduces the same 'triton-kernels' skill content replicated across many directories (.kode/.kiro/.kilocode/.junie/.goose/.factory/.crush/.continue/.commandcode/.codebuddy/.claude/.agent/.agents/.mcpjam). Keeping many identical copies will be brittle (fixes like typos/API updates must be applied everywhere). If these are meant to stay in sync, consider centralizing the source (single canonical doc + symlinks or a generation step) so updates are made once.
```json
{
  "mcpServers": {
    "colab-proxy-mcp": {
      "command": "uvx",
      "args": ["git+https://github.com/googlecolab/colab-mcp"],
      "timeout": 30000
    }
  }
}
```
The MCP server is installed/executed directly from a moving Git URL, which is a supply-chain risk and can make builds non-reproducible. Pin the dependency to a specific tag/commit SHA (e.g., git+...@<sha>) and consider documenting/validating the allowed source to avoid unexpected remote code changes.
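A pinned version of the same config might look like this (`<commit-sha>` is a placeholder, not a real commit):

```json
{
  "mcpServers": {
    "colab-proxy-mcp": {
      "command": "uvx",
      "args": ["git+https://github.com/googlecolab/colab-mcp@<commit-sha>"],
      "timeout": 30000
    }
  }
}
```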
Cleaning up — too many files leaked into the diff.
Summary
val_bpb = 0.9642 (3-seed mean, std 0.0002) | ~15.95 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
All artifacts under 16,000,000 bytes. All 3 train logs attached.
Key Innovation: Multi-Order N-gram Backoff Cache
Backward-looking n-gram cache built causally from already-scored tokens. Zero artifact cost.
Entropy-Adaptive Alpha: `alpha = 0.05 + 0.55 * sigmoid(2*(H-4))`. Low alpha when the neural model is confident, high alpha when it is uncertain.

Multi-Order Backoff (2-7gram): Highest matching order wins. 4M hash buckets per order. min_count=2 gate. Raw count ratios, no smoothing.
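The mixing weight can be checked numerically (pure-Python sketch of the stated formula; H is entropy in bits):

```python
import math

def adaptive_alpha(entropy_bits: float) -> float:
    # alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4)): stays near 0.05 when
    # the neural model is confident (low entropy H) and rises toward
    # 0.60 when it is uncertain.
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (entropy_bits - 4.0)))
```

The midpoint H = 4 bits gives alpha = 0.325, so the n-gram cache only dominates on genuinely high-entropy positions.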
Compliance: Score-first — every token is scored under `torch.inference_mode()` before any table update. N-gram tables are built from already-scored tokens only. No training data access during eval. No oracle selection.

Training Architecture
PR #414 base + LeakyReLU² + VRL + lzma:
11L, 512d, 8H/4KV GQA, LeakyReLU(0.5)² MLP 3×, VRL, VE128, BigramHash(2048), XSA4, Partial RoPE 16/64, LN Scale, SmearGate, U-Net skips, EMA(0.997) + Tight SWA, Late QAT, GPTQ-lite int6 + lzma, FA3 Hopper, Muon WD=0.04
Credits
Test plan