From 2f278eb6239af20c896f5a15c59e4fb659fbfd0c Mon Sep 17 00:00:00 2001 From: "Xiaoan (Sean) Liu" Date: Sat, 21 Mar 2026 00:35:11 -0600 Subject: [PATCH 1/2] Add Neural Cache research proposal: cross-window KV caching for extended eval context Non-record research submission. Proposes caching K/V pairs across sliding windows to extend effective context from 2K to 50K+ tokens at eval time. Backward-looking, zero artifact cost, rule-compliant. Implementation provided but untested due to compute constraints. Base: PR #287 reproduction at 1.1284 BPB. --- .../2026-03-21_NeuralCache_Research/README.md | 127 ++++++++++ .../eval_neural_cache.py | 222 ++++++++++++++++++ .../submission.json | 15 ++ 3 files changed, 364 insertions(+) create mode 100644 records/track_10min_16mb/2026-03-21_NeuralCache_Research/README.md create mode 100644 records/track_10min_16mb/2026-03-21_NeuralCache_Research/eval_neural_cache.py create mode 100644 records/track_10min_16mb/2026-03-21_NeuralCache_Research/submission.json diff --git a/records/track_10min_16mb/2026-03-21_NeuralCache_Research/README.md b/records/track_10min_16mb/2026-03-21_NeuralCache_Research/README.md new file mode 100644 index 000000000..00637e389 --- /dev/null +++ b/records/track_10min_16mb/2026-03-21_NeuralCache_Research/README.md @@ -0,0 +1,127 @@ +# Neural Cache: Cross-Window KV Cache for Extended Context at Eval Time + +**Research proposal (no record claim)** | Base model: PR #287 reproduction (1.1284 BPB) | 8xH100 SXM + +## The Idea + +Standard sliding window evaluation processes each window independently. A window at position 10,000 has no memory of what happened at position 5,000 — even though those tokens were already evaluated. Neural Cache fixes this by **caching K/V pairs across windows**, extending effective context from 2,048 tokens to 50K+ tokens at zero artifact cost. 
+ +``` +Standard sliding window (stride=64, seq=2048): + Window 1: [tokens 0-2047] -> score tokens 0-2047 (context: 2048) + Window 2: [tokens 64-2111] -> score tokens 2048-2111 (context: 2048) + Window 3: [tokens 128-2175] -> score tokens 2112-2175 (context: 2048) + ...each window is INDEPENDENT. Token 2048 cannot see token 0. + +Neural Cache (stride=64, seq=2048, cache=8192): + Window 1: [tokens 0-2047] -> score, cache K/V for stride tokens + Window 2: [tokens 64-2111] -> attend to cached K/V + current window + Window 3: [tokens 128-2175] -> attend to growing cache + current window + ...token 8000 can attend to token 0 through the cache. Effective context: 10K+ +``` + +## Why This Should Work + +1. **More context = better prediction.** This is proven: seq2048 > seq1024 > seq512 (PR #136: -0.014 BPB from longer context). Neural Cache extends this principle beyond the training sequence length. + +2. **Flash Attention natively supports it.** When `seqlen_k > seqlen_q`, FA3 treats the extra K/V as "earlier" context — exactly the KV-cache pattern used in LLM inference. No custom kernels needed. + +3. **Backward-looking only.** The cache contains K/V from already-evaluated tokens. No future information leaks. This is the same principle as backward-looking TTT (PR #267, confirmed rule-compliant) but lighter weight — no gradient computation, just cached hidden states. + +4. **Zero artifact cost.** No extra parameters, no model changes. Pure eval-time technique. ~50 lines of code. + +## Implementation + +The core idea: modify the attention forward pass to accept and prepend cached K/V. 
```python
def attn_forward_with_cache(attn_module, x, kv_cache=None, cache_seqlen=0):
    # Compute Q, K, V for current window
    q, k, v = compute_qkv(attn_module, x)
    seqlen = x.size(1)

    # Apply RoPE with position offset (critical for correctness)
    cos, sin = attn_module.rotary(cache_seqlen + seqlen, x.device, x.dtype)
    q = apply_rotary_emb(q, cos[cache_seqlen:], sin[cache_seqlen:])
    k = apply_rotary_emb(k, cos[cache_seqlen:], sin[cache_seqlen:])

    # Save the current window's K/V before the cache is prepended
    new_k, new_v = k, v

    # Prepend cached K/V from previous windows
    if kv_cache is not None:
        k = torch.cat([kv_cache[0], k], dim=1)  # [B, cache+seq, H, D]
        v = torch.cat([kv_cache[1], v], dim=1)

    # Flash Attention handles seqlen_k > seqlen_q natively
    y = flash_attn_func(q, k, v, causal=True)
    return y, (new_k, new_v)  # Return current K/V for future caching
```

The eval loop maintains a per-layer cache, storing only the `stride` newest tokens per window to avoid redundancy:

```python
layer_caches = [None] * num_layers
for window in sliding_windows:
    logits, new_caches = forward_with_cache(model, window, layer_caches)
    for layer_idx in range(num_layers):
        # Only cache the NEW tokens (stride=64), not the full 2048 window
        new_k = new_caches[layer_idx][0][:, -stride:]
        new_v = new_caches[layer_idx][1][:, -stride:]
        # Append to the existing cache, trim to max_cache_tokens
        layer_caches[layer_idx] = concat_and_trim(
            layer_caches[layer_idx], new_k, new_v, max_tokens=8192)
    score_tokens(logits, window)
```

## RoPE Considerations

The model was trained with `train_seq_len=1024` and uses NTK-aware RoPE scaling (auto-scales base frequency for longer sequences). For cache positions beyond the training length, RoPE quality degrades gradually. This is a known limitation — the same issue affects any long-context evaluation. 
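The position-offset bookkeeping itself is easy to check in isolation. A minimal pure-Python sketch (the `head_dim` and `base` values are illustrative, not the model's actual configuration) confirms that rotating the current window at offset positions is exactly equivalent to slicing the tail of a full-length RoPE table, which is what the `cos[cache_seqlen:]` / `sin[cache_seqlen:]` slicing above relies on:

```python
import math

def rope_angles(positions, head_dim=64, base=10000.0):
    # One rotation angle per (position, frequency) pair, as in standard RoPE
    freqs = [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    return [[p * f for f in freqs] for p in positions]

cache_seqlen, seqlen = 8192, 64  # cached context + one stride of new tokens

# Angles for the new window, computed with a position offset...
offset = rope_angles(range(cache_seqlen, cache_seqlen + seqlen))
# ...equal the tail of a table built for the full extended length.
full = rope_angles(range(cache_seqlen + seqlen))
assert offset == full[cache_seqlen:]
```

The equality is exact (same integer positions, same frequencies), so cached keys rotated in earlier windows remain on a consistent absolute-position grid with freshly rotated queries.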
+ +Potential mitigations: +- **Cache only last N layers** (e.g., last 4 with XSA) — earlier layers handle local patterns that don't need extended context +- **Limit cache to 4096 tokens** — stays within 4x of training length where NTK scaling is still effective +- **Use RoPE base 50000** (as in PR #254) — extends the effective RoPE range + +## Rule Compliance + +Per the organizer ruling on TTT (Mar 20): +> "You can't train on the validation tokens before you evaluate on those same tokens." + +Neural Cache does NOT train on anything. It caches intermediate hidden states (K/V pairs) from **already-evaluated** tokens and uses them as additional context for future tokens. This is: +- **No weight modification** (unlike TTT) +- **Backward-looking only** (only uses K/V from scored tokens) +- **Equivalent to a longer context window** — evaluation methods are explicitly unrestricted + +## Status: Untested Due to Compute Constraints + +We implemented the full Neural Cache eval but encountered a bug in the model state after `torch.compile` — the custom forward path produced invalid results when called on the compiled `base_model`. The fix (using a fresh `eval_model` loaded from saved weights) was identified but we ran out of compute budget before re-running. + +**The code is provided below for anyone to test.** Expected cost: one 8xH100 run (~$5) to train + eval with Neural Cache. + +## Estimated Impact + +- **Conservative:** 0.005-0.01 BPB (from context extension alone) +- **Optimistic:** 0.01-0.03 BPB (if the model effectively leverages 10K+ context) +- **Risk:** RoPE degradation beyond training length could limit gains + +For reference, sliding window eval (extending context via overlap) gave -0.034 BPB (PR #77). Neural Cache extends context further via a complementary mechanism. 
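For completeness, the `concat_and_trim` helper referenced in the eval-loop pseudocode above can be sketched as follows. This is an untested sketch, not the validated implementation: it assumes `[batch, tokens, heads, head_dim]` tensors and caches both K and V, matching the inline cache update in `eval_neural_cache.py`:

```python
import torch

def concat_and_trim(old_cache, new_k, new_v, max_tokens=8192):
    """Append the newest tokens' K/V to a layer cache, keeping at most max_tokens."""
    if old_cache is None:
        k, v = new_k, new_v
    else:
        # Concatenate along the token axis (dim=1)
        k = torch.cat([old_cache[0], new_k], dim=1)
        v = torch.cat([old_cache[1], new_v], dim=1)
    if k.size(1) > max_tokens:
        # Drop the oldest tokens once the cache budget is exceeded
        k, v = k[:, -max_tokens:], v[:, -max_tokens:]
    return k, v
```

Trimming from the front keeps the most recent context, so the cache behaves as a sliding long-range memory rather than a fixed prefix.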
+ +## Reproduction + +Base model: PR #287's recipe (XSA + EMA + 11L + SmearGate + BigramHash) + +```bash +NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 XSA_LAST_N=4 \ +EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=0 \ +MUON_WD=0.04 ADAM_WD=0.04 \ +MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \ +MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \ +MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \ +ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \ +torchrun --standalone --nproc_per_node=8 train_gpt.py +``` + +Our reproduction: 7,009 steps @ 85.6ms/step, **1.1284 BPB** sliding window (vs PR #287's 1.1271). + +## Hardware + +8x NVIDIA H100 80GB SXM, RunPod. Training: 600s. Standard eval: ~30s. Sliding window: ~85s. Neural Cache eval (estimated): ~300s for 1M token subset. + +## Author + +Xiaoan Liu | NYU | GitHub: @sseanliu diff --git a/records/track_10min_16mb/2026-03-21_NeuralCache_Research/eval_neural_cache.py b/records/track_10min_16mb/2026-03-21_NeuralCache_Research/eval_neural_cache.py new file mode 100644 index 000000000..d8626344d --- /dev/null +++ b/records/track_10min_16mb/2026-03-21_NeuralCache_Research/eval_neural_cache.py @@ -0,0 +1,222 @@ +"""Neural Cache Evaluation: Cross-window KV caching for extended context. + +Usage: Add this to the end of the training script's main() function, +AFTER the int6 sliding window eval creates `eval_model`. + + # --- NEURAL CACHE EVAL --- + if master_process: + for cache_size in [0, 2048, 4096]: + nc_loss, nc_bpb = eval_neural_cache( + eval_model, rank, device, val_tokens, base_bytes_lut, + has_leading_space_lut, is_boundary_token_lut, + seq_len=args.train_seq_len, stride=64, + max_cache_tokens=cache_size, max_eval_tokens=1000000) + print(f"neural_cache cache={cache_size} bpb={nc_bpb:.6f}") + +IMPORTANT: Use `eval_model` (fresh model loaded from saved weights), +NOT `base_model` (which has torch.compile applied and produces invalid results). 
+""" + +import math +import time +import torch +import torch.nn.functional as F +from flash_attn_interface import flash_attn_func as flash_attn_3_func + + +def attn_forward_with_cache(attn_module, x, kv_cache=None, cache_seqlen=0): + """Attention forward with KV cache prepended for extended context. + + Args: + attn_module: CausalSelfAttention module + x: input [bsz, seqlen, dim] (already through attn_norm) + kv_cache: tuple (cached_k, cached_v) or None + cache_seqlen: number of tokens in cache (for RoPE position offset) + + Returns: + output: [bsz, seqlen, dim] + new_kv: tuple (k, v) for current window + """ + # Import apply_rotary_emb from the training script + from train_gpt import apply_rotary_emb + + bsz, seqlen, dim = x.shape + q = attn_module.c_q(x).reshape(bsz, seqlen, attn_module.num_heads, attn_module.head_dim) + k = attn_module.c_k(x).reshape(bsz, seqlen, attn_module.num_kv_heads, attn_module.head_dim) + v = attn_module.c_v(x).reshape(bsz, seqlen, attn_module.num_kv_heads, attn_module.head_dim) + + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + + # RoPE with position offset for cached context + total_len = cache_seqlen + seqlen + cos, sin = attn_module.rotary(total_len, x.device, q.dtype) + q = apply_rotary_emb(q, cos[cache_seqlen:total_len], sin[cache_seqlen:total_len]) + k = apply_rotary_emb(k, cos[cache_seqlen:total_len], sin[cache_seqlen:total_len]) + + q = q * attn_module.q_gain.to(dtype=q.dtype)[None, None, :, None] + + # Save current K/V before cache concatenation + new_k, new_v = k.clone(), v.clone() + + # Prepend cached K/V from previous windows + if kv_cache is not None: + k = torch.cat([kv_cache[0], k], dim=1) + v = torch.cat([kv_cache[1], v], dim=1) + + # flash_attn handles seqlen_k > seqlen_q with causal=True correctly: + # queries attend to all cached tokens + causal portion of current window + y = flash_attn_3_func(q, k, v, causal=True) + + if attn_module.use_xsa: + y = attn_module._xsa_efficient(y, new_v) + + y = 
y.reshape(bsz, seqlen, dim) + return attn_module.proj(y), (new_k, new_v) + + +def forward_logits_cached(model, input_ids, layer_caches=None, cache_seqlen=0): + """Full forward pass with per-layer KV caches.""" + x = model.tok_emb(input_ids) + if model.bigram is not None: + x = x + model.bigram(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + x = model.smear(x) + x0 = x + + new_caches = [] + skips = [] + layer_idx = 0 + + for i in range(model.num_encoder_layers): + block = model.blocks[i] + mix = block.resid_mix.to(dtype=x.dtype) + x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + + lc = layer_caches[layer_idx] if layer_caches else None + attn_out, new_kv = attn_forward_with_cache( + block.attn, block.attn_norm(x), kv_cache=lc, cache_seqlen=cache_seqlen) + new_caches.append(new_kv) + layer_idx += 1 + + x = x + block.attn_scale.to(dtype=x.dtype)[None, None, :] * attn_out + x = x + block.mlp_scale.to(dtype=x.dtype)[None, None, :] * block.mlp(block.mlp_norm(x)) + skips.append(x) + + for i in range(model.num_decoder_layers): + if skips: + x = x + model.skip_weights[i].to(dtype=x.dtype)[None, None, :] * skips.pop() + block = model.blocks[model.num_encoder_layers + i] + mix = block.resid_mix.to(dtype=x.dtype) + x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + + lc = layer_caches[layer_idx] if layer_caches else None + attn_out, new_kv = attn_forward_with_cache( + block.attn, block.attn_norm(x), kv_cache=lc, cache_seqlen=cache_seqlen) + new_caches.append(new_kv) + layer_idx += 1 + + x = x + block.attn_scale.to(dtype=x.dtype)[None, None, :] * attn_out + x = x + block.mlp_scale.to(dtype=x.dtype)[None, None, :] * block.mlp(block.mlp_norm(x)) + + x = model.final_norm(x) + if model.tie_embeddings: + logits_proj = F.linear(x, model.tok_emb.weight) + else: + logits_proj = model.lm_head(x) + logits = model.logit_softcap * torch.tanh(logits_proj / model.logit_softcap) + return logits, new_caches + + +def eval_neural_cache( + model, rank, device, val_tokens, + 
base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + seq_len=2048, stride=64, max_cache_tokens=4096, max_eval_tokens=1000000, +): + """Sliding window eval with cross-window KV caching. + + Args: + model: GPT model (use eval_model, NOT base_model after torch.compile) + rank: distributed rank (only rank 0 runs this) + device: CUDA device + val_tokens: validation token tensor + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut: BPB lookup tables + seq_len: window size (default 2048) + stride: scoring stride (default 64) + max_cache_tokens: maximum cached K/V tokens per layer (0 = no caching) + max_eval_tokens: subset size for quick testing + + Returns: + (val_loss, val_bpb) tuple + """ + if rank != 0: + return 0.0, 0.0 + + total_tokens = min(val_tokens.numel() - 1, max_eval_tokens) + num_layers = len(model.blocks) + + loss_sum = 0.0 + token_count = 0 + byte_count = 0.0 + layer_caches = [None] * num_layers + cache_seqlen = 0 + + model.eval() + t0 = time.perf_counter() + + with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16): + for ws in range(0, total_tokens, stride): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + if wlen < 1: + break + + chunk = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device) + x_in = chunk[:-1].unsqueeze(0) + y_tgt = chunk[1:].unsqueeze(0) + + logits, new_caches = forward_logits_cached( + model, x_in, layer_caches=layer_caches, cache_seqlen=cache_seqlen) + + # Update per-layer caches: only store the stride-worth of NEW tokens + for li in range(num_layers): + if max_cache_tokens == 0: + layer_caches[li] = None + continue + new_k, new_v = new_caches[li] + cache_k = new_k[:, -stride:] + cache_v = new_v[:, -stride:] + if layer_caches[li] is not None: + old_k, old_v = layer_caches[li] + cache_k = torch.cat([old_k, cache_k], dim=1) + cache_v = torch.cat([old_v, cache_v], dim=1) + if cache_k.size(1) > max_cache_tokens: + cache_k = cache_k[:, -max_cache_tokens:] + cache_v = 
cache_v[:, -max_cache_tokens:] + layer_caches[li] = (cache_k, cache_v) + + cache_seqlen = min(ws + wlen, max_cache_tokens) if max_cache_tokens > 0 else 0 + + # Score only the NEW tokens + nll = F.cross_entropy(logits[0].float(), y_tgt[0], reduction="none") + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_nll = nll[s:wlen].to(torch.float64) + loss_sum += scored_nll.sum().item() + token_count += wlen - s + + tgt = y_tgt[0, s:wlen] + prev = x_in[0, s:wlen] + tb = base_bytes_lut[tgt].to(torch.float64) + tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum().item() + + if ws % (stride * 500) == 0 and ws > 0: + elapsed = time.perf_counter() - t0 + running_bpb = (loss_sum / token_count / math.log(2.0)) * (token_count / byte_count) + print(f" ncache pos={ws}/{total_tokens} bpb={running_bpb:.4f} " + f"cache={cache_seqlen} elapsed={elapsed:.0f}s") + + elapsed = time.perf_counter() - t0 + val_loss = loss_sum / token_count + bpb = (val_loss / math.log(2.0)) * (token_count / byte_count) + return val_loss, bpb diff --git a/records/track_10min_16mb/2026-03-21_NeuralCache_Research/submission.json b/records/track_10min_16mb/2026-03-21_NeuralCache_Research/submission.json new file mode 100644 index 000000000..bd0d70aa6 --- /dev/null +++ b/records/track_10min_16mb/2026-03-21_NeuralCache_Research/submission.json @@ -0,0 +1,15 @@ +{ + "track": "10min_16mb", + "date": "2026-03-21", + "name": "Neural Cache: Cross-Window KV Cache for Extended Eval Context (research proposal)", + "author": "Xiaoan Liu", + "github_id": "sseanliu", + "blurb": "Research proposal for extending effective eval context from 2K to 50K+ tokens by caching K/V pairs across sliding windows. Backward-looking, zero artifact cost, rule-compliant. Implementation provided but untested due to compute constraints. 
Base: PR #287 reproduction at 1.1284 BPB.", + "seed_results": { + "1337": {"val_loss": 1.90519942, "val_bpb": 1.12836940, "steps": 7009, "ms_per_step": 85.62} + }, + "mean_val_bpb": 1.1284, + "artifact_bytes": 15532039, + "code_bytes": 71412, + "notes": "Non-record research submission. Neural Cache eval not yet validated — torch.compile interaction bug prevented valid results. Base reproduction of PR #287 confirms 1.1284 BPB (vs original 1.1271). FA3 + 8xH100 SXM." +} From b11bcdb509cee46c3256e76e7f5b3e50aa112045 Mon Sep 17 00:00:00 2001 From: "Xiaoan (Sean) Liu" Date: Thu, 26 Mar 2026 02:01:32 -0600 Subject: [PATCH 2/2] =?UTF-8?q?Add=20research:=20Why=20Novel=20Architectur?= =?UTF-8?q?es=20Fail=20at=2016MB=20=E2=80=94=206=20experiments=20on=20thro?= =?UTF-8?q?ughput-quantization=20co-optimization?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../README.md | 122 ++++++++++++++++++ 1 file changed, 122 insertions(+) create mode 100644 records/track_10min_16mb/2026-03-26_WhyNovelArchitecturesFail/README.md diff --git a/records/track_10min_16mb/2026-03-26_WhyNovelArchitecturesFail/README.md b/records/track_10min_16mb/2026-03-26_WhyNovelArchitecturesFail/README.md new file mode 100644 index 000000000..c0cd8ef09 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_WhyNovelArchitecturesFail/README.md @@ -0,0 +1,122 @@ +# Why Novel Architectures Fail at 16MB: Throughput-Quantization Co-optimization in Parameter Golf + +**Non-record research submission** | 6 experiments on 8xH100 SXM | Base: PR #549 (1.1194 BPB) + +## Summary + +We systematically evaluated 6 architectural innovations from recent papers (March 2026) on the PR #549 SOTA stack. All failed. 
The unified finding: **at 16MB/600s, the binding constraint is not model quality but throughput-quantization co-optimization.** The SOTA stack is a local optimum where every component (Parallel Muon, torch.compile, int6 per-row quantization, parameter banks) is co-designed for H100 tensor core throughput. Any modification — even theoretically superior — breaks this pipeline and loses more from overhead than it gains in quality.

## The Throughput Tax

At 83ms/step (PR #549's speed), a 600s wallclock budget buys ~7,200 training steps, so each millisecond of per-step overhead costs roughly 85 of them. Near convergence, each step improves BPB by roughly 1e-5 (consistent with experiment 5 below, where ~1,100 lost steps cost ~0.010 BPB). **Therefore, any technique must improve BPB by roughly 0.001 per millisecond of overhead it adds.** No technique we tested clears this bar.

## Experiments

### 1. MUD Optimizer (arXiv:2603.17970) -- Negative Result

**Hypothesis:** Replace Newton-Schulz iteration with Cholesky whitening for 10-50% faster optimizer steps.

**Result:** 5% SLOWER (87.9ms vs 83ms). NaN divergence requiring diagonal regularization (1e-6). Final BPB: 1.1581 (vs 1.1194 SOTA).

**Why it failed:** `torch.linalg.solve_triangular` doesn't support bf16 on CUDA — it requires a float32 cast. H100 tensor cores run batched matrix multiply (NS5) at 989 TFLOPS; triangular solve is memory-bandwidth-bound at ~200 GB/s. The paper's FLOP advantage (12x fewer ops) is irrelevant when the bottleneck is memory bandwidth, not compute.

**Insight:** Optimizer innovations must match the batched-GEMM-on-tensor-cores paradigm. Sequential operations (triangular solve, scan, recurrence) cannot compete on current hardware.

### 2. Information-Maximizing Architecture -- Mixed Result

**Hypothesis:** LeakyReLU(0.9)^2 + XSA-all(11) + Partial RoPE 12/64 + Progressive LN Scale improve quality by preserving information flow.

**Result:** 1.1261 BPB (89ms/step, 6,737 steps). Better than PR #287 (1.1271) but 0.0067 behind SOTA. 
+ +**Why it didn't beat SOTA:** XSA-all adds ~6ms/step (XSA on 7 additional layers), costing ~400 steps. The -0.002 BPB gain from XSA-all doesn't compensate. Progressive LN Scale and Partial RoPE 12/64 were approximately neutral. + +**Insight:** XSA follows diminishing returns — the deepest 4 layers capture most of the self-value bias. Extending to all layers trades throughput for marginal quality. + +### 3. Hourglass FFN (arXiv:2602.06471) -- Negative Result + +**Hypothesis:** Split MLP into K=2 sub-blocks with residual connections for deeper per-layer computation at same parameter count. + +**Result:** Pre-quant 1.1539, int6 roundtrip **1.4811** (+0.33 BPB quantization gap). Catastrophic. + +**Why it failed:** Splitting the MLP weight bank into two sub-blocks creates weight distributions that int6 per-row quantization cannot handle. Standard MLP weights have heterogeneous row magnitudes — int6's per-row scaling naturally adapts. Split sub-block weights have more uniform, smaller magnitudes — the quantization grid becomes too coarse relative to the weight variance. + +**Insight:** MLP shape affects quantizability. Any architectural change must preserve the weight distribution characteristics that int6+lzma is optimized for. + +### 4. nGPT Hypersphere Normalization (arXiv:2410.01131) -- Negative Result + +**Hypothesis:** Normalize all vectors to unit norm on a hypersphere for 4-20x faster convergence, eliminating LayerNorm. + +**Result:** Pre-quant 1.3632, int6 roundtrip **1.7134** (+0.35 quant gap). 122ms/step (46% slower). Artifact only 8.38MB (weights compress well under lzma but quality destroyed). + +**Why it failed:** +1. **Quantization incompatibility:** Unit-norm weights concentrate all values in a narrow range (+-0.044 for d=512). Int6's 64 levels can't resolve the angular relationships the model relies on. Small quantization errors destroy the precise geometry of the hypersphere. +2. 
**Throughput:** F.normalize called 44 times per forward pass + post-step weight normalization = 46% overhead.
3. **Convergence:** The 4-20x speedup claim (tested at 0.5-1B scale) doesn't transfer to 27M scale with SmearGate/XSA/BigramHash, all designed for unnormalized residual streams.

**Insight:** Weight normalization and low-bit quantization are fundamentally incompatible. Normalized weights need angular-aware quantization, not per-row uniform quantization.

### 5. TrigramHash Embedding -- Marginal Negative Result

**Hypothesis:** Extend BigramHash to 3-gram context for -0.008 BPB at ~221KB artifact cost.

**Result:** 1.1298 BPB (98ms/step, 6,098 steps). Quant gap healthy (+0.009).

**Why it didn't help net:** Hash computation + embedding lookup + projection adds ~15ms/step overhead, costing ~1,100 steps. The -0.008 BPB gain is eaten by the +0.010 BPB lost to fewer steps.

**Insight:** Even cheap operations (hash + lookup) fail the throughput tax at 83ms/step. The bar for "zero overhead" is extremely high — only changes to constants (activation slopes, initialization values) truly qualify.

### 6. SSM-Transformer Hybrid (GatedDeltaNet, ICLR 2025) -- Negative Result

**Hypothesis:** Replace middle 4 transformer layers with GatedDeltaNet (linear recurrence) for long-range context without quadratic attention.

**Result:** 1.2516 BPB (282ms/step, 2,126 steps). Artifact 17.78MB (over budget).

**Why it failed:**
1. **No torch.compile:** GatedDeltaNet's Triton kernels break `torch.compile(fullgraph=True)`. Without compile, step time explodes 3.4x.
2. **Over budget:** GatedDeltaNet adds ~6x hidden_size^2 params per layer (1.58M/layer), pushing past 16MB.
3. **Memory-bound:** Recurrent scan operations can't use H100 tensor cores.

**Positive finding:** Per-step loss quality matches transformers (loss 2.25 at step 1000 for both). GatedDeltaNet learns at equivalent rate per gradient update — it's purely a throughput problem. 
+ +**Insight:** SSM-transformer hybrids need torch.compile support and hardware-native scan kernels to become competitive. The FLA library's Triton kernels are fast but not compile-compatible. + +## The Unified Finding + +The PR #549 SOTA stack is not just a good architecture — it's a **co-optimized system** where: + +1. **Parallel Muon** packs all weights into 4 contiguous 3D banks for batched Newton-Schulz +2. **torch.compile(fullgraph=True)** fuses the entire forward pass into optimized CUDA kernels +3. **Int6 per-row quantization** is calibrated for the specific weight distributions produced by this architecture + optimizer combination +4. **H100 tensor cores** run the batched GEMM operations at peak throughput + +Breaking any one of these four pillars cascades into the others: +- New optimizer → breaks batched bank structure → loses Parallel Muon speedup +- New layer type → breaks torch.compile → loses fusion speedup +- New weight distribution → breaks int6 calibration → catastrophic quantization +- New operation type (scan, solve) → can't use tensor cores → memory-bandwidth-bound + +**To genuinely beat this SOTA, you need to co-optimize ALL FOUR simultaneously.** The ternary submission (PR #640) succeeded exactly because it did this: different quantization (ternary) + different optimizer (NeoMuon) + different architecture (768d, 8192 BPE) + 250 experiments to co-optimize everything. + +## Implications for Small Model Design + +These findings transfer beyond this competition: + +1. **Throughput-aware architecture search:** At constrained compute, evaluate architectures by BPB-per-second, not BPB-per-step +2. **Quantization-aware architecture design:** Novel MLP shapes and normalization schemes must be validated under target quantization BEFORE committing to them +3. **Co-optimization is mandatory:** At small scale, architecture-optimizer-quantization-hardware form a tightly coupled system. Optimizing one in isolation is insufficient. +4. 
**The "throughput tax" formula:** for a technique adding T ms/step of overhead on an S ms/step baseline within a B = 600,000 ms wallclock budget, the steps lost are roughly B*T/S^2, so it must improve BPB by at least (B*T/S^2) * step_bpb_rate to break even, where step_bpb_rate is the per-step BPB improvement near convergence

## Hardware

8x NVIDIA H100 80GB SXM, RunPod. PyTorch 2.8.0+cu128. FA3 via windreamer wheels. Total compute: ~$150 across 7 runs.

## Related Work

- PR #296: Reptile meta-learned TTT (our earlier submission, cited by Issue #140)
- PR #303: XSA + TTT negative interaction study
- PR #318: Neural Cache concept (cross-window KV caching)
- PR #640: Ternary quantization (the paradigm-shift submission that succeeded by co-optimizing everything)

## Author

Xiaoan Liu | NYU | GitHub: @sseanliu