From 2f278eb6239af20c896f5a15c59e4fb659fbfd0c Mon Sep 17 00:00:00 2001 From: "Xiaoan (Sean) Liu" Date: Sat, 21 Mar 2026 00:35:11 -0600 Subject: [PATCH 1/2] Add Neural Cache research proposal: cross-window KV caching for extended eval context Non-record research submission. Proposes caching K/V pairs across sliding windows to extend effective context from 2K to 50K+ tokens at eval time. Backward-looking, zero artifact cost, rule-compliant. Implementation provided but untested due to compute constraints. Base: PR #287 reproduction at 1.1284 BPB. --- .../2026-03-21_NeuralCache_Research/README.md | 127 ++++++++++ .../eval_neural_cache.py | 222 ++++++++++++++++++ .../submission.json | 15 ++ 3 files changed, 364 insertions(+) create mode 100644 records/track_10min_16mb/2026-03-21_NeuralCache_Research/README.md create mode 100644 records/track_10min_16mb/2026-03-21_NeuralCache_Research/eval_neural_cache.py create mode 100644 records/track_10min_16mb/2026-03-21_NeuralCache_Research/submission.json diff --git a/records/track_10min_16mb/2026-03-21_NeuralCache_Research/README.md b/records/track_10min_16mb/2026-03-21_NeuralCache_Research/README.md new file mode 100644 index 000000000..00637e389 --- /dev/null +++ b/records/track_10min_16mb/2026-03-21_NeuralCache_Research/README.md @@ -0,0 +1,127 @@ +# Neural Cache: Cross-Window KV Cache for Extended Context at Eval Time + +**Research proposal (no record claim)** | Base model: PR #287 reproduction (1.1284 BPB) | 8xH100 SXM + +## The Idea + +Standard sliding window evaluation processes each window independently. A window at position 10,000 has no memory of what happened at position 5,000 — even though those tokens were already evaluated. Neural Cache fixes this by **caching K/V pairs across windows**, extending effective context from 2,048 tokens to 50K+ tokens at zero artifact cost. 
+ +``` +Standard sliding window (stride=64, seq=2048): + Window 1: [tokens 0-2047] -> score tokens 0-2047 (context: 2048) + Window 2: [tokens 64-2111] -> score tokens 2048-2111 (context: 2048) + Window 3: [tokens 128-2175] -> score tokens 2112-2175 (context: 2048) + ...each window is INDEPENDENT. Token 2048 cannot see token 0. + +Neural Cache (stride=64, seq=2048, cache=8192): + Window 1: [tokens 0-2047] -> score, cache K/V for stride tokens + Window 2: [tokens 64-2111] -> attend to cached K/V + current window + Window 3: [tokens 128-2175] -> attend to growing cache + current window + ...token 8000 can attend to token 0 through the cache. Effective context: 10K+ +``` + +## Why This Should Work + +1. **More context = better prediction.** This is proven: seq2048 > seq1024 > seq512 (PR #136: -0.014 BPB from longer context). Neural Cache extends this principle beyond the training sequence length. + +2. **Flash Attention natively supports it.** When `seqlen_k > seqlen_q`, FA3 treats the extra K/V as "earlier" context — exactly the KV-cache pattern used in LLM inference. No custom kernels needed. + +3. **Backward-looking only.** The cache contains K/V from already-evaluated tokens. No future information leaks. This is the same principle as backward-looking TTT (PR #267, confirmed rule-compliant) but lighter weight — no gradient computation, just cached hidden states. + +4. **Zero artifact cost.** No extra parameters, no model changes. Pure eval-time technique. ~50 lines of code. + +## Implementation + +The core idea: modify the attention forward pass to accept and prepend cached K/V. 
```python
def attn_forward_with_cache(attn_module, x, kv_cache=None, cache_seqlen=0):
    # Compute Q, K, V for current window
    q, k, v = compute_qkv(attn_module, x)
    seqlen = x.size(1)

    # Apply RoPE with position offset (critical for correctness)
    cos, sin = attn_module.rotary(cache_seqlen + seqlen, x.device, x.dtype)
    q = apply_rotary_emb(q, cos[cache_seqlen:], sin[cache_seqlen:])
    k = apply_rotary_emb(k, cos[cache_seqlen:], sin[cache_seqlen:])

    # Save the current window's K/V before the cache is prepended
    new_k, new_v = k, v

    # Prepend cached K/V from previous windows
    if kv_cache is not None:
        k = torch.cat([kv_cache[0], k], dim=1)  # [B, cache+seq, H, D]
        v = torch.cat([kv_cache[1], v], dim=1)

    # Flash Attention handles seqlen_k > seqlen_q natively
    y = flash_attn_func(q, k, v, causal=True)
    return y, (new_k, new_v)  # Return current K/V for future caching
```

The eval loop maintains a per-layer cache, storing only the `stride` newest tokens per window to avoid redundancy:

```python
layer_caches = [None] * num_layers
for window in sliding_windows:
    logits, new_caches = forward_with_cache(model, window, layer_caches)
    for layer_idx in range(num_layers):
        # Only cache the NEW tokens (stride=64), not the full 2048 window
        new_k = new_caches[layer_idx][0][:, -stride:]
        new_v = new_caches[layer_idx][1][:, -stride:]
        # Append to the existing cache, trim to max_cache_tokens
        layer_caches[layer_idx] = concat_and_trim(
            layer_caches[layer_idx], new_k, new_v, max_tokens=8192)
    score_tokens(logits, window)
```

## RoPE Considerations

The model was trained with `train_seq_len=1024` and uses NTK-aware RoPE scaling (auto-scales base frequency for longer sequences). For cache positions beyond the training length, RoPE quality degrades gradually. This is a known limitation — the same issue affects any long-context evaluation. 
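The position-offset bookkeeping itself is easy to check in isolation. A minimal pure-Python sketch (the `head_dim` and `base` values are illustrative, not the model's actual configuration) confirms that rotating the current window at offset positions is exactly equivalent to slicing the tail of a full-length RoPE table, which is what the `cos[cache_seqlen:]` / `sin[cache_seqlen:]` slicing above relies on:

```python
import math

def rope_angles(positions, head_dim=64, base=10000.0):
    # One rotation angle per (position, frequency) pair, as in standard RoPE
    freqs = [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    return [[p * f for f in freqs] for p in positions]

cache_seqlen, seqlen = 8192, 64  # cached context + one stride of new tokens

# Angles for the new window, computed with a position offset...
offset = rope_angles(range(cache_seqlen, cache_seqlen + seqlen))
# ...equal the tail of a table built for the full extended length.
full = rope_angles(range(cache_seqlen + seqlen))
assert offset == full[cache_seqlen:]
```

The equality is exact (same integer positions, same frequencies), so cached keys rotated in earlier windows remain on a consistent absolute-position grid with freshly rotated queries.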
+ +Potential mitigations: +- **Cache only last N layers** (e.g., last 4 with XSA) — earlier layers handle local patterns that don't need extended context +- **Limit cache to 4096 tokens** — stays within 4x of training length where NTK scaling is still effective +- **Use RoPE base 50000** (as in PR #254) — extends the effective RoPE range + +## Rule Compliance + +Per the organizer ruling on TTT (Mar 20): +> "You can't train on the validation tokens before you evaluate on those same tokens." + +Neural Cache does NOT train on anything. It caches intermediate hidden states (K/V pairs) from **already-evaluated** tokens and uses them as additional context for future tokens. This is: +- **No weight modification** (unlike TTT) +- **Backward-looking only** (only uses K/V from scored tokens) +- **Equivalent to a longer context window** — evaluation methods are explicitly unrestricted + +## Status: Untested Due to Compute Constraints + +We implemented the full Neural Cache eval but encountered a bug in the model state after `torch.compile` — the custom forward path produced invalid results when called on the compiled `base_model`. The fix (using a fresh `eval_model` loaded from saved weights) was identified but we ran out of compute budget before re-running. + +**The code is provided below for anyone to test.** Expected cost: one 8xH100 run (~$5) to train + eval with Neural Cache. + +## Estimated Impact + +- **Conservative:** 0.005-0.01 BPB (from context extension alone) +- **Optimistic:** 0.01-0.03 BPB (if the model effectively leverages 10K+ context) +- **Risk:** RoPE degradation beyond training length could limit gains + +For reference, sliding window eval (extending context via overlap) gave -0.034 BPB (PR #77). Neural Cache extends context further via a complementary mechanism. 
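For completeness, the `concat_and_trim` helper referenced in the eval-loop pseudocode above can be sketched as follows. This is an untested sketch, not the validated implementation: it assumes `[batch, tokens, heads, head_dim]` tensors and caches both K and V, matching the inline cache update in `eval_neural_cache.py`:

```python
import torch

def concat_and_trim(old_cache, new_k, new_v, max_tokens=8192):
    """Append the newest tokens' K/V to a layer cache, keeping at most max_tokens."""
    if old_cache is None:
        k, v = new_k, new_v
    else:
        # Concatenate along the token axis (dim=1)
        k = torch.cat([old_cache[0], new_k], dim=1)
        v = torch.cat([old_cache[1], new_v], dim=1)
    if k.size(1) > max_tokens:
        # Drop the oldest tokens once the cache budget is exceeded
        k, v = k[:, -max_tokens:], v[:, -max_tokens:]
    return k, v
```

Trimming from the front keeps the most recent context, so the cache behaves as a sliding long-range memory rather than a fixed prefix.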
+ +## Reproduction + +Base model: PR #287's recipe (XSA + EMA + 11L + SmearGate + BigramHash) + +```bash +NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 XSA_LAST_N=4 \ +EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=0 \ +MUON_WD=0.04 ADAM_WD=0.04 \ +MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \ +MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \ +MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \ +ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \ +torchrun --standalone --nproc_per_node=8 train_gpt.py +``` + +Our reproduction: 7,009 steps @ 85.6ms/step, **1.1284 BPB** sliding window (vs PR #287's 1.1271). + +## Hardware + +8x NVIDIA H100 80GB SXM, RunPod. Training: 600s. Standard eval: ~30s. Sliding window: ~85s. Neural Cache eval (estimated): ~300s for 1M token subset. + +## Author + +Xiaoan Liu | NYU | GitHub: @sseanliu diff --git a/records/track_10min_16mb/2026-03-21_NeuralCache_Research/eval_neural_cache.py b/records/track_10min_16mb/2026-03-21_NeuralCache_Research/eval_neural_cache.py new file mode 100644 index 000000000..d8626344d --- /dev/null +++ b/records/track_10min_16mb/2026-03-21_NeuralCache_Research/eval_neural_cache.py @@ -0,0 +1,222 @@ +"""Neural Cache Evaluation: Cross-window KV caching for extended context. + +Usage: Add this to the end of the training script's main() function, +AFTER the int6 sliding window eval creates `eval_model`. + + # --- NEURAL CACHE EVAL --- + if master_process: + for cache_size in [0, 2048, 4096]: + nc_loss, nc_bpb = eval_neural_cache( + eval_model, rank, device, val_tokens, base_bytes_lut, + has_leading_space_lut, is_boundary_token_lut, + seq_len=args.train_seq_len, stride=64, + max_cache_tokens=cache_size, max_eval_tokens=1000000) + print(f"neural_cache cache={cache_size} bpb={nc_bpb:.6f}") + +IMPORTANT: Use `eval_model` (fresh model loaded from saved weights), +NOT `base_model` (which has torch.compile applied and produces invalid results). 
+""" + +import math +import time +import torch +import torch.nn.functional as F +from flash_attn_interface import flash_attn_func as flash_attn_3_func + + +def attn_forward_with_cache(attn_module, x, kv_cache=None, cache_seqlen=0): + """Attention forward with KV cache prepended for extended context. + + Args: + attn_module: CausalSelfAttention module + x: input [bsz, seqlen, dim] (already through attn_norm) + kv_cache: tuple (cached_k, cached_v) or None + cache_seqlen: number of tokens in cache (for RoPE position offset) + + Returns: + output: [bsz, seqlen, dim] + new_kv: tuple (k, v) for current window + """ + # Import apply_rotary_emb from the training script + from train_gpt import apply_rotary_emb + + bsz, seqlen, dim = x.shape + q = attn_module.c_q(x).reshape(bsz, seqlen, attn_module.num_heads, attn_module.head_dim) + k = attn_module.c_k(x).reshape(bsz, seqlen, attn_module.num_kv_heads, attn_module.head_dim) + v = attn_module.c_v(x).reshape(bsz, seqlen, attn_module.num_kv_heads, attn_module.head_dim) + + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + + # RoPE with position offset for cached context + total_len = cache_seqlen + seqlen + cos, sin = attn_module.rotary(total_len, x.device, q.dtype) + q = apply_rotary_emb(q, cos[cache_seqlen:total_len], sin[cache_seqlen:total_len]) + k = apply_rotary_emb(k, cos[cache_seqlen:total_len], sin[cache_seqlen:total_len]) + + q = q * attn_module.q_gain.to(dtype=q.dtype)[None, None, :, None] + + # Save current K/V before cache concatenation + new_k, new_v = k.clone(), v.clone() + + # Prepend cached K/V from previous windows + if kv_cache is not None: + k = torch.cat([kv_cache[0], k], dim=1) + v = torch.cat([kv_cache[1], v], dim=1) + + # flash_attn handles seqlen_k > seqlen_q with causal=True correctly: + # queries attend to all cached tokens + causal portion of current window + y = flash_attn_3_func(q, k, v, causal=True) + + if attn_module.use_xsa: + y = attn_module._xsa_efficient(y, new_v) + + y = 
y.reshape(bsz, seqlen, dim) + return attn_module.proj(y), (new_k, new_v) + + +def forward_logits_cached(model, input_ids, layer_caches=None, cache_seqlen=0): + """Full forward pass with per-layer KV caches.""" + x = model.tok_emb(input_ids) + if model.bigram is not None: + x = x + model.bigram(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + x = model.smear(x) + x0 = x + + new_caches = [] + skips = [] + layer_idx = 0 + + for i in range(model.num_encoder_layers): + block = model.blocks[i] + mix = block.resid_mix.to(dtype=x.dtype) + x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + + lc = layer_caches[layer_idx] if layer_caches else None + attn_out, new_kv = attn_forward_with_cache( + block.attn, block.attn_norm(x), kv_cache=lc, cache_seqlen=cache_seqlen) + new_caches.append(new_kv) + layer_idx += 1 + + x = x + block.attn_scale.to(dtype=x.dtype)[None, None, :] * attn_out + x = x + block.mlp_scale.to(dtype=x.dtype)[None, None, :] * block.mlp(block.mlp_norm(x)) + skips.append(x) + + for i in range(model.num_decoder_layers): + if skips: + x = x + model.skip_weights[i].to(dtype=x.dtype)[None, None, :] * skips.pop() + block = model.blocks[model.num_encoder_layers + i] + mix = block.resid_mix.to(dtype=x.dtype) + x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + + lc = layer_caches[layer_idx] if layer_caches else None + attn_out, new_kv = attn_forward_with_cache( + block.attn, block.attn_norm(x), kv_cache=lc, cache_seqlen=cache_seqlen) + new_caches.append(new_kv) + layer_idx += 1 + + x = x + block.attn_scale.to(dtype=x.dtype)[None, None, :] * attn_out + x = x + block.mlp_scale.to(dtype=x.dtype)[None, None, :] * block.mlp(block.mlp_norm(x)) + + x = model.final_norm(x) + if model.tie_embeddings: + logits_proj = F.linear(x, model.tok_emb.weight) + else: + logits_proj = model.lm_head(x) + logits = model.logit_softcap * torch.tanh(logits_proj / model.logit_softcap) + return logits, new_caches + + +def eval_neural_cache( + model, rank, device, val_tokens, + 
base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + seq_len=2048, stride=64, max_cache_tokens=4096, max_eval_tokens=1000000, +): + """Sliding window eval with cross-window KV caching. + + Args: + model: GPT model (use eval_model, NOT base_model after torch.compile) + rank: distributed rank (only rank 0 runs this) + device: CUDA device + val_tokens: validation token tensor + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut: BPB lookup tables + seq_len: window size (default 2048) + stride: scoring stride (default 64) + max_cache_tokens: maximum cached K/V tokens per layer (0 = no caching) + max_eval_tokens: subset size for quick testing + + Returns: + (val_loss, val_bpb) tuple + """ + if rank != 0: + return 0.0, 0.0 + + total_tokens = min(val_tokens.numel() - 1, max_eval_tokens) + num_layers = len(model.blocks) + + loss_sum = 0.0 + token_count = 0 + byte_count = 0.0 + layer_caches = [None] * num_layers + cache_seqlen = 0 + + model.eval() + t0 = time.perf_counter() + + with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16): + for ws in range(0, total_tokens, stride): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + if wlen < 1: + break + + chunk = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device) + x_in = chunk[:-1].unsqueeze(0) + y_tgt = chunk[1:].unsqueeze(0) + + logits, new_caches = forward_logits_cached( + model, x_in, layer_caches=layer_caches, cache_seqlen=cache_seqlen) + + # Update per-layer caches: only store the stride-worth of NEW tokens + for li in range(num_layers): + if max_cache_tokens == 0: + layer_caches[li] = None + continue + new_k, new_v = new_caches[li] + cache_k = new_k[:, -stride:] + cache_v = new_v[:, -stride:] + if layer_caches[li] is not None: + old_k, old_v = layer_caches[li] + cache_k = torch.cat([old_k, cache_k], dim=1) + cache_v = torch.cat([old_v, cache_v], dim=1) + if cache_k.size(1) > max_cache_tokens: + cache_k = cache_k[:, -max_cache_tokens:] + cache_v = 
cache_v[:, -max_cache_tokens:] + layer_caches[li] = (cache_k, cache_v) + + cache_seqlen = min(ws + wlen, max_cache_tokens) if max_cache_tokens > 0 else 0 + + # Score only the NEW tokens + nll = F.cross_entropy(logits[0].float(), y_tgt[0], reduction="none") + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_nll = nll[s:wlen].to(torch.float64) + loss_sum += scored_nll.sum().item() + token_count += wlen - s + + tgt = y_tgt[0, s:wlen] + prev = x_in[0, s:wlen] + tb = base_bytes_lut[tgt].to(torch.float64) + tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum().item() + + if ws % (stride * 500) == 0 and ws > 0: + elapsed = time.perf_counter() - t0 + running_bpb = (loss_sum / token_count / math.log(2.0)) * (token_count / byte_count) + print(f" ncache pos={ws}/{total_tokens} bpb={running_bpb:.4f} " + f"cache={cache_seqlen} elapsed={elapsed:.0f}s") + + elapsed = time.perf_counter() - t0 + val_loss = loss_sum / token_count + bpb = (val_loss / math.log(2.0)) * (token_count / byte_count) + return val_loss, bpb diff --git a/records/track_10min_16mb/2026-03-21_NeuralCache_Research/submission.json b/records/track_10min_16mb/2026-03-21_NeuralCache_Research/submission.json new file mode 100644 index 000000000..bd0d70aa6 --- /dev/null +++ b/records/track_10min_16mb/2026-03-21_NeuralCache_Research/submission.json @@ -0,0 +1,15 @@ +{ + "track": "10min_16mb", + "date": "2026-03-21", + "name": "Neural Cache: Cross-Window KV Cache for Extended Eval Context (research proposal)", + "author": "Xiaoan Liu", + "github_id": "sseanliu", + "blurb": "Research proposal for extending effective eval context from 2K to 50K+ tokens by caching K/V pairs across sliding windows. Backward-looking, zero artifact cost, rule-compliant. Implementation provided but untested due to compute constraints. 
Base: PR #287 reproduction at 1.1284 BPB.", + "seed_results": { + "1337": {"val_loss": 1.90519942, "val_bpb": 1.12836940, "steps": 7009, "ms_per_step": 85.62} + }, + "mean_val_bpb": 1.1284, + "artifact_bytes": 15532039, + "code_bytes": 71412, + "notes": "Non-record research submission. Neural Cache eval not yet validated — torch.compile interaction bug prevented valid results. Base reproduction of PR #287 confirms 1.1284 BPB (vs original 1.1271). FA3 + 8xH100 SXM." +} From b11bcdb509cee46c3256e76e7f5b3e50aa112045 Mon Sep 17 00:00:00 2001 From: "Xiaoan (Sean) Liu" Date: Thu, 26 Mar 2026 02:01:32 -0600 Subject: [PATCH 2/2] =?UTF-8?q?Add=20research:=20Why=20Novel=20Architectur?= =?UTF-8?q?es=20Fail=20at=2016MB=20=E2=80=94=206=20experiments=20on=20thro?= =?UTF-8?q?ughput-quantization=20co-optimization?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../README.md | 122 ++++++++++++++++++ 1 file changed, 122 insertions(+) create mode 100644 records/track_10min_16mb/2026-03-26_WhyNovelArchitecturesFail/README.md diff --git a/records/track_10min_16mb/2026-03-26_WhyNovelArchitecturesFail/README.md b/records/track_10min_16mb/2026-03-26_WhyNovelArchitecturesFail/README.md new file mode 100644 index 000000000..c0cd8ef09 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_WhyNovelArchitecturesFail/README.md @@ -0,0 +1,122 @@ +# Why Novel Architectures Fail at 16MB: Throughput-Quantization Co-optimization in Parameter Golf + +**Non-record research submission** | 6 experiments on 8xH100 SXM | Base: PR #549 (1.1194 BPB) + +## Summary + +We systematically evaluated 6 architectural innovations from recent papers (March 2026) on the PR #549 SOTA stack. All failed. 
The unified finding: **at 16MB/600s, the binding constraint is not model quality but throughput-quantization co-optimization.** The SOTA stack is a local optimum where every component (Parallel Muon, torch.compile, int6 per-row quantization, parameter banks) is co-designed for H100 tensor core throughput. Any modification — even theoretically superior — breaks this pipeline and loses more from overhead than it gains in quality.

## The Throughput Tax

At 83ms/step (PR #549's speed), a 600s wallclock budget buys ~7,200 training steps, so each millisecond of per-step overhead costs roughly 85 of them. Near convergence, each step improves BPB by roughly 1e-5 (consistent with experiment 5 below, where ~1,100 lost steps cost ~0.010 BPB). **Therefore, any technique must improve BPB by roughly 0.001 per millisecond of overhead it adds.** No technique we tested clears this bar.

## Experiments

### 1. MUD Optimizer (arXiv:2603.17970) -- Negative Result

**Hypothesis:** Replace Newton-Schulz iteration with Cholesky whitening for 10-50% faster optimizer steps.

**Result:** 5% SLOWER (87.9ms vs 83ms). NaN divergence requiring diagonal regularization (1e-6). Final BPB: 1.1581 (vs 1.1194 SOTA).

**Why it failed:** `torch.linalg.solve_triangular` doesn't support bf16 on CUDA — it requires a float32 cast. H100 tensor cores run batched matrix multiply (NS5) at 989 TFLOPS; triangular solve is memory-bandwidth-bound at ~200 GB/s. The paper's FLOP advantage (12x fewer ops) is irrelevant when the bottleneck is memory bandwidth, not compute.

**Insight:** Optimizer innovations must match the batched-GEMM-on-tensor-cores paradigm. Sequential operations (triangular solve, scan, recurrence) cannot compete on current hardware.

### 2. Information-Maximizing Architecture -- Mixed Result

**Hypothesis:** LeakyReLU(0.9)^2 + XSA-all(11) + Partial RoPE 12/64 + Progressive LN Scale improve quality by preserving information flow.

**Result:** 1.1261 BPB (89ms/step, 6,737 steps). Better than PR #287 (1.1271) but 0.0067 behind SOTA. 
+ +**Why it didn't beat SOTA:** XSA-all adds ~6ms/step (XSA on 7 additional layers), costing ~400 steps. The -0.002 BPB gain from XSA-all doesn't compensate. Progressive LN Scale and Partial RoPE 12/64 were approximately neutral. + +**Insight:** XSA follows diminishing returns — the deepest 4 layers capture most of the self-value bias. Extending to all layers trades throughput for marginal quality. + +### 3. Hourglass FFN (arXiv:2602.06471) -- Negative Result + +**Hypothesis:** Split MLP into K=2 sub-blocks with residual connections for deeper per-layer computation at same parameter count. + +**Result:** Pre-quant 1.1539, int6 roundtrip **1.4811** (+0.33 BPB quantization gap). Catastrophic. + +**Why it failed:** Splitting the MLP weight bank into two sub-blocks creates weight distributions that int6 per-row quantization cannot handle. Standard MLP weights have heterogeneous row magnitudes — int6's per-row scaling naturally adapts. Split sub-block weights have more uniform, smaller magnitudes — the quantization grid becomes too coarse relative to the weight variance. + +**Insight:** MLP shape affects quantizability. Any architectural change must preserve the weight distribution characteristics that int6+lzma is optimized for. + +### 4. nGPT Hypersphere Normalization (arXiv:2410.01131) -- Negative Result + +**Hypothesis:** Normalize all vectors to unit norm on a hypersphere for 4-20x faster convergence, eliminating LayerNorm. + +**Result:** Pre-quant 1.3632, int6 roundtrip **1.7134** (+0.35 quant gap). 122ms/step (46% slower). Artifact only 8.38MB (weights compress well under lzma but quality destroyed). + +**Why it failed:** +1. **Quantization incompatibility:** Unit-norm weights concentrate all values in a narrow range (+-0.044 for d=512). Int6's 64 levels can't resolve the angular relationships the model relies on. Small quantization errors destroy the precise geometry of the hypersphere. +2. 
**Throughput:** F.normalize called 44 times per forward pass + post-step weight normalization = 46% overhead.
3. **Convergence:** The 4-20x speedup claim (tested at 0.5-1B scale) doesn't transfer to 27M scale with SmearGate/XSA/BigramHash, all designed for unnormalized residual streams.

**Insight:** Weight normalization and low-bit quantization are fundamentally incompatible. Normalized weights need angular-aware quantization, not per-row uniform quantization.

### 5. TrigramHash Embedding -- Marginal Negative Result

**Hypothesis:** Extend BigramHash to 3-gram context for -0.008 BPB at ~221KB artifact cost.

**Result:** 1.1298 BPB (98ms/step, 6,098 steps). Quant gap healthy (+0.009).

**Why it didn't help net:** Hash computation + embedding lookup + projection adds ~15ms/step overhead, costing ~1,100 steps. The -0.008 BPB gain is eaten by the +0.010 BPB lost to fewer steps.

**Insight:** Even cheap operations (hash + lookup) fail the throughput tax at 83ms/step. The bar for "zero overhead" is extremely high — only changes to constants (activation slopes, initialization values) truly qualify.

### 6. SSM-Transformer Hybrid (GatedDeltaNet, ICLR 2025) -- Negative Result

**Hypothesis:** Replace middle 4 transformer layers with GatedDeltaNet (linear recurrence) for long-range context without quadratic attention.

**Result:** 1.2516 BPB (282ms/step, 2,126 steps). Artifact 17.78MB (over budget).

**Why it failed:**
1. **No torch.compile:** GatedDeltaNet's Triton kernels break `torch.compile(fullgraph=True)`. Without compile, step time explodes 3.4x.
2. **Over budget:** GatedDeltaNet adds ~6x hidden_size^2 params per layer (1.58M/layer), pushing past 16MB.
3. **Memory-bound:** Recurrent scan operations can't use H100 tensor cores.

**Positive finding:** Per-step loss quality matches transformers (loss 2.25 at step 1000 for both). GatedDeltaNet learns at equivalent rate per gradient update — it's purely a throughput problem. 
+ +**Insight:** SSM-transformer hybrids need torch.compile support and hardware-native scan kernels to become competitive. The FLA library's Triton kernels are fast but not compile-compatible. + +## The Unified Finding + +The PR #549 SOTA stack is not just a good architecture — it's a **co-optimized system** where: + +1. **Parallel Muon** packs all weights into 4 contiguous 3D banks for batched Newton-Schulz +2. **torch.compile(fullgraph=True)** fuses the entire forward pass into optimized CUDA kernels +3. **Int6 per-row quantization** is calibrated for the specific weight distributions produced by this architecture + optimizer combination +4. **H100 tensor cores** run the batched GEMM operations at peak throughput + +Breaking any one of these four pillars cascades into the others: +- New optimizer → breaks batched bank structure → loses Parallel Muon speedup +- New layer type → breaks torch.compile → loses fusion speedup +- New weight distribution → breaks int6 calibration → catastrophic quantization +- New operation type (scan, solve) → can't use tensor cores → memory-bandwidth-bound + +**To genuinely beat this SOTA, you need to co-optimize ALL FOUR simultaneously.** The ternary submission (PR #640) succeeded exactly because it did this: different quantization (ternary) + different optimizer (NeoMuon) + different architecture (768d, 8192 BPE) + 250 experiments to co-optimize everything. + +## Implications for Small Model Design + +These findings transfer beyond this competition: + +1. **Throughput-aware architecture search:** At constrained compute, evaluate architectures by BPB-per-second, not BPB-per-step +2. **Quantization-aware architecture design:** Novel MLP shapes and normalization schemes must be validated under target quantization BEFORE committing to them +3. **Co-optimization is mandatory:** At small scale, architecture-optimizer-quantization-hardware form a tightly coupled system. Optimizing one in isolation is insufficient. +4. 
**The "throughput tax" formula:** for a technique adding T ms/step of overhead on an S ms/step baseline within a B = 600,000 ms wallclock budget, the steps lost are roughly B*T/S^2, so it must improve BPB by at least (B*T/S^2) * step_bpb_rate to break even, where step_bpb_rate is the per-step BPB improvement near convergence

## Hardware

8x NVIDIA H100 80GB SXM, RunPod. PyTorch 2.8.0+cu128. FA3 via windreamer wheels. Total compute: ~$150 across 7 runs.

## Related Work

- PR #296: Reptile meta-learned TTT (our earlier submission, cited by Issue #140)
- PR #303: XSA + TTT negative interaction study
- PR #318: Neural Cache concept (cross-window KV caching)
- PR #640: Ternary quantization (the paradigm-shift submission that succeeded by co-optimizing everything)

## Author

Xiaoan Liu | NYU | GitHub: @sseanliu