From ea95da00b5b1dda73c6ca12a6fba4777e99ad403 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Mon, 30 Mar 2026 02:55:00 +0000
Subject: [PATCH 1/3] Initial plan


From 299dab7340feba63830f17a54a3b0bdc95e7fc18 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Mon, 30 Mar 2026 03:06:32 +0000
Subject: [PATCH 2/3] feat: Add comprehensive novel approaches analysis for
 sub-1.10 BPB

Analyzes 8 angles for pushing Parameter Golf below 1.10 bpb:
- Byte-weighted CE loss, non-uniform quantization, entropy-coded weights,
  hypernetwork generation, context mixing, depth recurrence, multi-model
  ensemble, and information-theoretic bounds.

Includes ranked idea list, 3 proof-of-concept code stubs (EngramLite,
Complementary Training + BackoffNgramMixer, Skip-gram Hash + Shallow
Recurrence), implementation difficulty estimates, and a Moonshot section.

Key finding: The integrated Complementary Training + EngramLite +
BackoffNgramMixer stack is the highest-EV unexplored combination.

Agent-Logs-Url: https://github.com/kailean/parameter-golf/sessions/caad42c8-8417-45c0-bdf7-90803cf87976

Co-authored-by: kailean <49617037+kailean@users.noreply.github.com>
---
 pg_novel_ideas.md | 984 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 984 insertions(+)
 create mode 100644 pg_novel_ideas.md

diff --git a/pg_novel_ideas.md b/pg_novel_ideas.md
new file mode 100644
index 000000000..b2bf67bf4
--- /dev/null
+++ b/pg_novel_ideas.md
@@ -0,0 +1,984 @@
+# Parameter Golf: Novel Approaches to Sub-1.10 BPB
+
+*Analysis date: March 30, 2026*
+*Current official SOTA: 1.1194 BPB (#549, @sanjeevmadhav)*
+*Best pending pure neural: 1.1086 BPB (#1089, @mikeapedia — Turbo-Muon + EngramLite)*
+*Best pending n-gram cache: 0.4027 BPB (#1094 — causal BackoffNgramMixer, compliance TBD)*
+
+---
+
+## Competition Context Summary
+
+**Constraints:** 16MB artifact (code + compressed model), 10 min training on 8×H100, 10 min eval.
+**Metric:** val_bpb (bits per byte) on FineWeb validation set. Lower = better.
+**Baseline:** 1.2244 BPB (9L, 512d, int8+zlib).
+**Current stack:** 11L, 512d, 3×MLP, int6+zstd-22, XSA, EMA, GPTQ-lite, BigramHash, SmearGate, Partial RoPE, sliding-window eval.
+
+**Key quantitative constraints from Issue #140 ablation data:**
+- 1ms step overhead ≈ 0.006-0.007 BPB cost (at ~83ms/step baseline)
+- Int6 quant gap: ~0.0036 BPB (GPTQ is near-optimal at int6)
+- Int5 quant gap: ~0.007 BPB per matrix group
+- Int4 quant gap: ~0.065 BPB (catastrophic — dead end)
+- 3-seed std: ~0.0005-0.0015 BPB
+- EMA > SWA by 0.003 BPB (3-seed verified)
+- Sliding window (s64, w2048): ~0.034 BPB improvement
+- N-gram cache with correct normalization: 1.51 BPB alone (WORSE than neural — #978)
+
+**What's been tried and failed (selected):**
+- MoE (optimal sparsity=0 below 500M params)
+- Depth recurrence >2 loops (quant error amplifies 900×)
+- Knowledge distillation (11ms/step I/O overhead fatal in 600s)
+- MTP (no improvement)
+- INT4 quantization (catastrophic +0.065 BPB)
+- TrigramHash without gating (+0.0049 BPB, hurts compression)
+- MC Dropout ensembling (sub-networks lack diversity at 17M params)
+- kNN-LM at eval (XSA already captures inter-position patterns)
+- Advanced quant algorithms at int6 (Qronos, CDQuant: GPTQ already near-optimal)
+- Procrustes rotation (91% MSE reduction but 380% larger artifact — MSE ≠ artifact size)
+- Pruning 3% of weights (+728KB artifact — zeroes hurt zstd-22)
+
+---
+
+## Analysis of All 8 Angles
+
+### ANGLE 1: COMPRESSION IS THE OBJECTIVE (BPB-Aware Loss)
+
+**Core insight:** Cross-entropy loss treats all tokens equally, but BPB weights by bytes-per-token. Tokens decoding to more UTF-8 bytes matter more for BPB.
+
+**Will it work?** Partially — but the gain is smaller than it appears.
+
+**Analysis:**
+The gap between CE-optimized BPB and byte-weighted BPB is modest for SP1024 tokenization. With a 1024-token vocabulary, most tokens decode to 1-4 bytes, and the distribution of bytes-per-token is relatively concentrated (mean ~1.18 bytes/token for English-dominant FineWeb). The correction factor (token_count/byte_count) is already baked into the BPB formula, so on a fixed validation set BPB is mean token CE times a constant, and minimizing CE already targets BPB.
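+
+As a sketch of why the gap is small, assume the usual definition of bits per byte. The symbols are introduced here for illustration only: `\ell_t` is the CE in nats on token `t`, `b_t` is that token's UTF-8 byte count, and `N` is the number of scored tokens.
+
+```latex
+\mathrm{BPB}
+  = \frac{\sum_{t=1}^{N} \ell_t / \ln 2}{\sum_{t=1}^{N} b_t}
+  = \frac{\bar{\ell}}{\ln 2} \cdot \frac{N}{\sum_{t=1}^{N} b_t},
+\qquad
+\bar{\ell} = \frac{1}{N} \sum_{t=1}^{N} \ell_t
+```
+
+The factor `N / \sum_t b_t` is fixed by the tokenizer and the data, so any change in BPB has to come through the mean CE. Byte weighting changes which tokens dominate the gradient; it does not change this conversion factor.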
+
+However, there IS a real effect: tokens that decode to many bytes (rare long tokens, Unicode sequences) receive proportionally less gradient signal under CE. Byte-weighting would reallocate gradient toward these tokens.
+
+**Estimated impact:** 0.001-0.003 BPB improvement. The correction is small because:
+1. SP1024 has a tiny vocabulary — the variance of bytes-per-token is low
+2. High-byte tokens are often rare/noisy (non-English text, HTML entities) and may not be learnable
+3. The model already sees these tokens during training — they just get equal weight
+
+**Implementation difficulty:** Very low — multiply CE loss by a pre-computed bytes-per-token lookup.
+
+**Risk of failure:** Moderate. The effect may be below the noise floor (3-seed std ~0.0005-0.0015 BPB).
+
+**Compatibility:** Full — one-line change to loss computation. Zero overhead.
+
+**Verdict: LOW PRIORITY.** The theoretical gap is real but the practical gain at SP1024 is likely below significance threshold. Worth a zero-cost experiment but don't build a strategy around it.
+
+```python
+# PROOF-OF-CONCEPT: Byte-weighted cross-entropy loss
+# Add to train_gpt_mlx_kl.py
+
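+def compute_bytes_per_token_lut(tokenizer_path):
+    """Pre-compute UTF-8 byte count for each token in vocabulary."""
+    import sentencepiece as spm
+    sp = spm.SentencePieceProcessor(model_file=tokenizer_path)
+    vocab_size = sp.get_piece_size()
+    bytes_lut = []
+    for i in range(vocab_size):
+        if sp.is_byte(i):
+            # Byte-fallback pieces are spelled "<0x41>" but decode to one byte
+            byte_count = 1
+        elif sp.is_control(i) or sp.is_unknown(i):
+            # <s>, </s>, <unk> carry no stable text; count them as one byte
+            byte_count = 1
+        else:
+            # SentencePiece marks a leading space with U+2581 (3 bytes in UTF-8),
+            # but the decoded text contains an ASCII space (1 byte)
+            piece = sp.id_to_piece(i).replace('\u2581', ' ')
+            byte_count = len(piece.encode('utf-8'))
+        bytes_lut.append(max(1, byte_count))
+    return bytes_lut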
+
+def byte_weighted_ce_loss(logits, targets, bytes_lut_tensor):
+    """
+    Cross-entropy loss weighted by bytes-per-token.
+    
+    logits: (B*T, V) float
+    targets: (B*T,) int
+    bytes_lut_tensor: (V,) float — bytes per token ID
+    
+    Instead of: loss = -log(p[target]) averaged over tokens
+    We compute: loss = -log(p[target]) * bytes[target] / mean(bytes[target])
+    
+    This shifts gradient toward tokens that cover more bytes of text.
+    """
+    import mlx.core as mx
+    import mlx.nn as nn
+    
+    # Standard CE per token
+    ce_per_token = nn.losses.cross_entropy(logits, targets, reduction='none')  # (B*T,)
+    
+    # Byte weights for each target token
+    byte_weights = bytes_lut_tensor[targets]  # (B*T,)
+    
+    # Normalize so total weight equals number of tokens (preserves LR scale)
+    byte_weights = byte_weights / byte_weights.mean()
+    
+    # Weighted mean
+    loss = (ce_per_token * byte_weights).mean()
+    return loss
+```
+
+---
+
+### ANGLE 2: NON-UNIFORM QUANTIZATION
+
+**Core insight:** Neural network weights are approximately Gaussian — uniform int6 wastes precision on the tails.
+
+**Will it work?** No, for three devastating reasons specific to this competition.
+
+**Analysis:**
+
+**Reason 1: MSE ≠ Artifact Size.** This is the single most important lesson from the competition (#1048, #316). Non-uniform quantization (k-means, log-scale, NF6) reduces reconstruction MSE. But the artifact is compressed with zstd-22, and non-uniform codebook indices have HIGHER entropy than uniform indices. Uniform int6 produces values that cluster around certain bit patterns, which zstd compresses efficiently. K-means indices are essentially random 6-bit values with near-uniform distribution — maximum entropy, minimum compressibility.
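+
+A rough back-of-envelope calculation supports the entropy argument. It assumes Gaussian weights with standard deviation `\sigma`, a uniform 64-level grid clipped at `\pm 4\sigma` (an illustrative choice, not the competition's actual clip range), and the high-resolution approximation for quantizer entropy. The equal-mass codebook is the extreme case in which all 64 indices are used equally often; a k-means codebook would sit somewhere between the two.
+
+```latex
+H_{\text{uniform}}
+  \approx h(W) - \log_2 \Delta
+  = \tfrac{1}{2}\log_2\!\left(2\pi e\,\sigma^2\right) - \log_2\frac{8\sigma}{64}
+  \approx 2.05 + 3.00 = 5.05 \ \text{bits/index},
+\qquad
+H_{\text{equal-mass}} = \log_2 64 = 6 \ \text{bits/index}
+```
+
+Under these assumptions the equal-mass codebook costs about one extra bit per weight before any entropy coder runs. That is roughly 2MB across 17M weights, and a lossless compressor cannot recover it.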
+
+Concrete example from #1048: Procrustes rotation reduced MSE by 91% but increased artifact size by 380% because the rotated weights had higher entropy. The same principle applies to non-uniform quant: better MSE, worse compression, net negative for the 16MB budget.
+
+**Reason 2: GPTQ is already near-optimal at int6.** #756 (@abaybektursun, SOTA holder) tested Qronos iterative Hessian (+0.0007 worse) and CDQuant coordinate descent (+0.0005 worse) — both more sophisticated than uniform GPTQ. At int6 with 64 levels, the quant gap is only 0.0036 BPB. There's simply not much room to improve.
+
+**Reason 3: Codebook storage.** A 64-entry codebook per row (or per tensor) costs bytes that offset any quality gain. For 17M params in ~11K rows, even 64×2 bytes per row = ~1.4MB of codebooks.
+
+**Estimated impact:** Net negative (larger artifact for marginal MSE improvement).
+
+**Implementation difficulty:** Moderate (k-means on weights, custom pack/unpack).
+
+**Risk of failure:** Very high — #1048 and #316 both demonstrate the MSE≠artifact principle.
+
+**Compatibility:** Poor — requires custom serialization, breaks zstd compression efficiency.
+
+**Verdict: DO NOT PURSUE.** The fundamental insight "MSE ≠ compressed artifact size" kills this entire angle. Every non-uniform scheme increases index entropy, which defeats zstd. The competition has empirically confirmed this multiple times.
+
+---
+
+### ANGLE 3: ENTROPY-CODED WEIGHTS
+
+**Core insight:** Zstd treats all bytes equally. What if we designed weight distributions for maximum compressibility?
+
+**Will it work?** One sub-idea works marginally, the rest don't.
+
+**Analysis:**
+
+**Weight entropy regularization:** Promising in principle — penalize high-entropy weight distributions during training. But #609's ablation found lzma (which is closer to a custom entropy coder than zstd) achieves 99.7% of Shannon limit on the weight data. **Zstd-22 is already near-optimal.** The bottleneck isn't the entropy coder — it's the weight entropy itself. And the weight entropy is determined by the model's capacity needs, not by the compression algorithm.
+
+**Sparse + dense hybrid:** #1048 proved that 3% pruning INCREASES artifact by 728KB. Zeroing weights doesn't help zstd-22 because the zero values disrupt the statistical patterns zstd exploits. Structured pruning at 50% would be catastrophic for model quality AND artifact size.
+
+**ANS instead of zstd:** #1089 used Brotli+byte-shuffle instead of zstd on mixed int6/int7 — this IS the closest thing to a custom entropy coder that's been tried. The gain was real but small (enough to squeeze in slightly higher precision). ANS tuned to exact weight distribution could save 0.05-0.2MB over zstd-22, which translates to ~100-400K more parameters — worth ~0.001-0.003 BPB at the margin.
+
+**The one promising sub-idea: NuMuon (arXiv:2603.03597).** Nuclear-norm constraint on Muon updates → lower stable rank → better zstd compression. This pushes compressibility into the *optimizer itself*, which is fundamentally different from post-hoc compression. The weights naturally develop lower entropy during training, rather than being forced into compressible patterns afterward.
+
+**Estimated impact:** 0.001-0.003 BPB (ANS/Brotli tuning) or 0.002-0.006 BPB (NuMuon optimizer).
+
+**Implementation difficulty:** ANS is moderate. NuMuon is low (optimizer change).
+
+**Risk of failure:** Moderate for ANS (marginal gains). Low-moderate for NuMuon (backed by theory).
+
+**Compatibility:** Full — orthogonal to everything else.
+
+**Verdict: NuMuon is worth testing (Tier 2 idea from Issue #140). Custom entropy coding is marginal but free.**
+
+---
+
+### ANGLE 4: HYPERNETWORK WEIGHT GENERATION
+
+**Core insight:** Store a tiny network that generates the weight matrices at load time.
+
+**Will it work?** Almost certainly not at competitive quality.
+
+**Analysis:**
+
+This is implicit neural representation (INR) applied to weight matrices. The problem: weight matrices in a trained LLM are NOT smooth or low-frequency. They contain high-frequency, semantically meaningful structure that resists compact representation. A 200K-param hypernetwork cannot generate 37M coherent weights — the information-theoretic compression ratio of ~185× would require the weights to have ~185× redundancy, which they don't.
+
+**The low-rank basis variant** is more realistic: hypernetwork generates a rank-K basis for each weight matrix, and coefficients are stored directly. But this is just low-rank factorization with extra steps, and it's been explored:
+- #609 found Hadamard rotation improves BPB by only 0.0002 but costs +0.5MB (net negative)
+- CPSVD (Column-Preserving SVD) is the principled version of this — untried but estimated at 0.003-0.008 BPB
+- The fundamental issue: low-rank approximation of weight matrices loses too much information at the precision levels needed
+
+**The real version of this idea that works:** Weight tying + per-layer LoRA deltas (Relaxed Recursive Transformers, ICLR 2025). Share base weights across layers, add tiny per-layer LoRA adaptations. This gives you ~24 virtual layers from ~11 layers of parameters. #686 demonstrated shallow recurrence (2 layers repeated once, +2 virtual depth) at 1.1182 BPB — it works when limited to 2 loops. But >2 loops causes GPTQ error amplification (#579, #363).
+
+**Estimated impact:** Hypernetwork: net negative. Relaxed Recursive: 0.01-0.03 BPB (from deeper effective model).
+
+**Implementation difficulty:** Hypernetwork: high. Relaxed Recursive: moderate.
+
+**Risk of failure:** Hypernetwork: very high. Relaxed Recursive: moderate (2-loop limit).
+
+**Compatibility:** Relaxed Recursive is compatible with the existing stack.
+
+**Verdict: Hypernetwork is a dead end. Relaxed Recursive Transformers with LoRA deltas is the viable realization of this concept, and it's already on the Tier 2 list.**
+
+---
+
+### ANGLE 5: CONTEXT MIXING (N-gram Ensemble)
+
+**Core insight:** Combine multiple simple predictors (bigrams, trigrams, byte-level) with learned mixing weights.
+
+**Will it work?** This is the MOST promising angle — with critical caveats.
+
+**Analysis:**
+
+The competition has extensively explored this direction, and the results are dramatic but complicated:
+
+**What happened with n-gram caches (Mar 25-27):** A wave of submissions used eval-time n-gram caches to achieve sub-1.0 BPB. Then #978 proved that with correct full-vocabulary normalization, standalone n-gram caches degrade to 1.51 BPB — worse than neural baseline. The previous sub-0.1 scores were normalization artifacts. 33+ PRs were closed.
+
+**But causal n-gram mixing IS viable:** Post-enforcement, the correctly-implemented BackoffNgramMixer (#803, #1094) still achieves 0.40-0.44 BPB. The key difference: these produce full normalized probability distributions over the entire vocabulary at each step, blended with the neural model's distribution using learned or entropy-adaptive alpha mixing.
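+
+The blending step can be written as below. This is a sketch: the symbols are introduced here, and the exact alpha schedule used by #803 and #1094 is not reproduced. `\alpha(c)` may depend on the context `c` (for example through the neural model's entropy) but not on the candidate token `x`.
+
+```latex
+\begin{aligned}
+P_{\text{mix}}(x \mid c) &= \bigl(1-\alpha(c)\bigr)\,P_{\text{neural}}(x \mid c) + \alpha(c)\,P_{\text{ngram}}(x \mid c),
+  \qquad 0 \le \alpha(c) \le 1 \\
+\sum_{x \in V} P_{\text{mix}}(x \mid c) &= \bigl(1-\alpha(c)\bigr)\cdot 1 + \alpha(c)\cdot 1 = 1
+\end{aligned}
+```
+
+The second line holds only if both components sum to 1 over the full vocabulary `V`. If the n-gram term is scored at the target token without being normalized over `V`, the mixture is no longer a probability distribution and the resulting BPB is not a valid code length.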
+
+**TrigramHash as a training-time component:** #609 found TrigramHash (without gating) HURTS by +0.0049 BPB. But EngramLite (#1089) with gating + multi-head hashing + trigrams works — part of the new best 1.1086 BPB. **The gating is essential** — it suppresses noisy hash collisions that raw TrigramHash amplifies.
+
+**Byte-level predictor:** H-Net (#1044) attempted learned byte-level tokenization — 1.90 BPB, far behind. Byte-level processing is too slow for the 600s training budget at current architectures.
+
+**Skip-gram hash:** Untried in the competition. Issue #140 lists it as a Tier 1 idea with 0.005-0.015 BPB estimated gain. Uses non-contiguous positions (e.g., tokens[-1, -3, -5]) as context — captures patterns with intervening content. Zero additional memory per context, just hash different positions. Especially effective on FineWeb's structured web text.
+
+**The realistic path:**
+1. Use EngramLite-style gated multi-head hashing (bigram + trigram) during training: ~0.003-0.008 BPB
+2. At eval time, add a correctly-normalized BackoffNgramMixer with entropy-adaptive alpha: ~0.05-0.15 BPB
+3. The combined system achieves the "complementary training" effect (#803): the neural model specializes on what n-grams can't predict
+
+**Estimated impact:** 
+- Training-time context mixing (EngramLite): 0.003-0.008 BPB (proven by #1089)
+- Eval-time BackoffNgramMixer: 0.05-0.15 BPB additional (proven by #803, #1094)
+- Skip-gram hash: 0.005-0.015 BPB (untried, moderate confidence)
+
+**Implementation difficulty:** EngramLite: moderate. BackoffNgramMixer: moderate-high (must get normalization right). Skip-gram: low.
+
+**Risk of failure:** Low for EngramLite (proven). Moderate for BackoffNgramMixer (normalization is tricky). Low for skip-gram (simple extension of proven concept).
+
+**Compatibility:** Excellent — EngramLite replaces BigramHash. BackoffNgramMixer is eval-only.
+
+**Verdict: HIGH PRIORITY. The combination of EngramLite training + BackoffNgramMixer eval + Complementary Training is the most promising path to sub-1.10 BPB.**
+
+```python
+# PROOF-OF-CONCEPT: Gated Multi-Head Hash Embedding (EngramLite-inspired)
+# Replaces BigramHash in train_gpt_mlx_kl.py
+
+import mlx.core as mx
+import mlx.nn as nn
+
+class EngramLiteEmbedding(nn.Module):
+    """
+    Multi-head hashed n-gram embeddings with learned gating.
+    
+    Key improvements over BigramHash:
+    1. Multiple hash heads (K=4) per n-gram order — reduces collision rate
+    2. Trigram support — captures 3-token patterns
+    3. Learned gate — sigmoid suppresses noisy lookups
+    
+    Architecture:
+      - For each n-gram order (2, 3):
+        - K hash functions map context to table indices
+        - Table lookup produces K embeddings
+        - Mean pool across heads
+        - Sigmoid gate (from context) scales the output
+      - Sum across orders → output
+    
+    Parameter budget (this stub shares one table per order across all heads):
+      bigram_table: hash_size × dim
+      trigram_table: hash_size × dim
+      gate_proj: dim × order_count (+ bias)
+    """
+    def __init__(self, hash_size: int = 8192, dim: int = 1024, 
+                 n_heads: int = 4, orders: tuple = (2, 3)):
+        super().__init__()
+        self.hash_size = hash_size
+        self.dim = dim
+        self.n_heads = n_heads
+        self.orders = orders
+        
+        # Hash primes for multi-head hashing (different prime per head)
+        self.primes = [31337, 59999, 73721, 97531][:n_heads]
+        
+        # Separate embedding table per order
+        self.tables = {}
+        for order in orders:
+            table = nn.Embedding(hash_size, dim)
+            # Small init — these are additive corrections
+            table.weight = table.weight * 0.01
+            self.tables[f'order_{order}'] = table
+        
+        # Learned gate: one sigmoid scalar per position and per n-gram order
+        # Input: the pooled n-gram embedding for that order
+        self.gate_proj = nn.Linear(dim, len(orders), bias=True)
+        # Initialize gate bias negative → starts mostly suppressed
+        # (-2.0 is an illustrative value, not taken from #1089)
+        self.gate_proj.bias = mx.full((len(orders),), -2.0)
+        # Model learns to trust hash lookups as training progresses
+    
+    def _hash_ngram(self, tokens, order, head_idx):
+        """Hash n-gram context to table index."""
+        B, T = tokens.shape
+        prime = self.primes[head_idx]
+        
+        if order == 2:
+            # Bigram: hash(t-1, t)
+            t_prev = tokens[:, :-1]  # (B, T-1)
+            t_curr = tokens[:, 1:]   # (B, T-1)
+            idx = mx.remainder(t_prev * prime + t_curr, self.hash_size)
+            valid_start = 1
+        elif order == 3:
+            # Trigram: hash(t-2, t-1, t)
+            t_prev2 = tokens[:, :-2]  # (B, T-2)
+            t_prev1 = tokens[:, 1:-1] # (B, T-2)
+            t_curr = tokens[:, 2:]    # (B, T-2)
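+            # Horner form with a modulo after each step: multiplying a token id
+            # by prime * prime would overflow int32
+            idx = mx.remainder(t_prev2 * prime + t_prev1, self.hash_size)
+            idx = mx.remainder(idx * prime + t_curr, self.hash_size)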
+            valid_start = 2
+        else:
+            raise ValueError(f"Order {order} not supported")
+        
+        return idx, valid_start
+    
+    def __call__(self, tokens):
+        """
+        tokens: (B, T) int32
+        Returns: (B, T, dim) — additive logit bias
+        """
+        B, T = tokens.shape
+        output = mx.zeros((B, T, self.dim))
+        
+        for oi, order in enumerate(self.orders):
+            table = self.tables[f'order_{order}']
+            
+            # Multi-head: average K hash lookups
+            head_embeds = []
+            for hi in range(self.n_heads):
+                idx, valid_start = self._hash_ngram(tokens, order, hi)
+                emb = table(idx)  # (B, T-order+1, dim)
+                head_embeds.append(emb)
+            
+            # Mean pool across heads — reduces collision noise
+            ngram_emb = sum(head_embeds) / self.n_heads  # (B, T-valid_start, dim)
+            
+            # Pad to full sequence length
+            pad = mx.zeros((B, valid_start, self.dim))
+            ngram_emb = mx.concatenate([pad, ngram_emb], axis=1)  # (B, T, dim)
+            
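+            # Apply the learned gate (self-gated on the pooled embedding in this stub)
+            gate = mx.sigmoid(self.gate_proj(ngram_emb)[..., oi:oi + 1])  # (B, T, 1)
+            output = output + gate * ngram_emb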
+        
+        return output
+
+
+# PROOF-OF-CONCEPT: Skip-Gram Hash Embedding  
+class SkipGramHashEmbedding(nn.Module):
+    """
+    Hash embedding using non-contiguous token positions.
+    
+    Captures patterns like:
+    - token[-1] × token[-3] (skip one)
+    - token[-1] × token[-5] (skip three)
+    
+    Effective for structured text (HTML tags, code indentation,
+    sentence templates) where intervening content varies.
+    """
+    def __init__(self, hash_size: int = 4096, dim: int = 1024,
+                 skip_patterns: list = None):
+        super().__init__()
+        self.hash_size = hash_size
+        self.dim = dim
+        # Each pattern is a tuple of negative offsets, e.g., (-1, -3)
+        self.skip_patterns = skip_patterns or [(-1, -3), (-1, -5), (-2, -4)]
+        
+        self.tables = {}
+        for i, pattern in enumerate(self.skip_patterns):
+            table = nn.Embedding(hash_size, dim)
+            table.weight = table.weight * 0.01
+            self.tables[f'skip_{i}'] = table
+    
+    def __call__(self, tokens):
+        B, T = tokens.shape
+        output = mx.zeros((B, T, self.dim))
+        
+        for i, pattern in enumerate(self.skip_patterns):
+            table = self.tables[f'skip_{i}']
+            min_offset = min(pattern)  # Most negative
+            valid_start = abs(min_offset)
+            
+            # Gather tokens at skip positions
+            # pattern = (-1, -3) means: token at t-1 and token at t-3
+            hash_val = mx.zeros((B, T - valid_start), dtype=mx.int32)
+            prime = 31337
+            for offset in pattern:
+                start = valid_start + offset
+                end = T + offset
+                tok_slice = tokens[:, start:end]
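+                # Reduce each step so patterns with 3+ offsets cannot overflow int32
+                hash_val = mx.remainder(hash_val * prime + tok_slice, self.hash_size)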
+            
+            idx = mx.remainder(mx.abs(hash_val), self.hash_size)
+            emb = table(idx)
+            
+            pad = mx.zeros((B, valid_start, self.dim))
+            emb = mx.concatenate([pad, emb], axis=1)
+            output = output + emb
+        
+        return output
+```
+
+---
+
+### ANGLE 6: DEPTH RECURRENCE WITH PROGRESSIVE ADAPTATION
+
+**Core insight:** 4 unique blocks × 3 loops = 12 effective layers from 4 layers of parameters.
+
+**Will it work?** Only with ≤2 loops, and the gain is modest.
+
+**Analysis:**
+
+The competition has extensively tested this:
+- #344: 2× slower, hurts BPB
+- #363: **Quantization error amplifies ~900× over 3 cycles** — this is the killer
+- #579: 6×2 loops gives 1.1478 (1-seed), but GPTQ compounds multiplicatively
+- #686: **Shallow recurrence works** — layers 4+5 repeated once (11→13 virtual), 1.1182 BPB (3-seed)
+
+**The critical insight from #363:** When the same quantized weights are reused K times, the error in each weight is applied K times. At int6, each weight has ~0.016 expected quantization error (range/64). Even a first-order additive estimate gives ≈ 3 × 0.016 = 0.048 over 3 cycles, which approaches int4-level degradation (range/16 ≈ 0.0625), and the ~900× amplification reported in #363 indicates the real growth is much worse than additive. At 2 cycles, the same estimate gives ≈ 2 × 0.016 = 0.032, which is still viable.
+
+**FiLM conditioning** (scale/shift per loop iteration) helps because it differentiates loop passes, but it can't overcome the quantization amplification problem for >2 loops.
+
+**The viable version:** #686's approach — repeat 2 middle layers once, getting 13 virtual layers from 11 layers of parameters. Combined with per-pass learnable scalars (~2K params). Recovers ~70% of independent 12L quality at minimal step cost.
+
+**Budget math with aggressive recurrence:** 4 unique blocks in int6 ≈ 4MB. But we need >4 blocks for competitive quality — the 4-block config is far too small. And even at 4 blocks × 3 loops, the 900× quant error amplification makes it nonviable.
+
+**Estimated impact:** +2 virtual layers via shallow recurrence: ~0.003-0.008 BPB (proven by #686).
+
+**Implementation difficulty:** Low for shallow recurrence. High for full FiLM + multi-loop.
+
+**Risk of failure:** Low for ≤2 loops. Very high for ≥3 loops.
+
+**Compatibility:** Good — drop-in modification to layer stack.
+
+**Verdict: SHALLOW RECURRENCE (2 loops on 2 layers) IS PROVEN AND VIABLE. Full depth recurrence (3+ loops) is dead due to int6 error amplification.**
+
+---
+
+### ANGLE 7: MULTI-MODEL ENSEMBLE IN 16MB
+
+**Core insight:** Two complementary models (small transformer + massive n-gram hash) might beat one larger transformer.
+
+**Will it work?** YES — this is essentially what the top n-gram cache submissions do.
+
+**Analysis:**
+
+This is not a novel idea — it's the dominant strategy on the (compliance-questioned) pending leaderboard. The top submissions (#803 at 0.4416, #1094 at 0.4027) are exactly this: a neural base model + an n-gram cache ensemble, with learned mixing.
+
+**The specific budget breakdown works:**
+- Neural model (11L, 512d, int6+zstd): ~15MB
+- BackoffNgramMixer (orders 2-10, ~4M hash buckets): 0 MB artifact (built at eval time from already-scored tokens)
+- Total: ~15MB ✓
+
+The n-gram component costs ZERO artifact space because it's built incrementally during evaluation from tokens already scored. This is the key insight — the 16MB budget goes entirely to the neural model, and the n-gram mixer is free.
+
+**Why it works despite #978:** #978 showed standalone normalized n-grams achieve only 1.51 BPB. But that's standalone. When MIXED with a neural model, the n-gram component handles high-confidence local patterns (common bigrams, frequent phrases) while the neural model handles everything else. The mixing is complementary — each handles what the other can't.
+
+**Complementary Training (#803):** The neural model is trained with loss weights that down-weight tokens easily predicted by bigram statistics. This forces the model to specialize on what n-grams can't handle, maximizing complementarity. This is the critical innovation that separates 0.44 from 0.55 BPB.
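+
+One way to write this weighting is shown below. The form follows the proof-of-concept stub in this document and is not necessarily the exact scheme in #803. `\alpha_{\text{comp}}` is the stub's `complement_alpha`, `\ell_t` is the per-token CE, and `\bar{w}` is the batch mean of the weights.
+
+```latex
+w_t = \operatorname{clip}\!\bigl(1 - \alpha_{\text{comp}}\; p_{\text{bigram}}(x_t \mid x_{t-1}),\ 0.1,\ 1\bigr),
+\qquad
+\mathcal{L} = \frac{1}{N}\sum_{t=1}^{N} \frac{w_t}{\bar{w}}\,\ell_t
+```
+
+With `\alpha_{\text{comp}} = 0.5`, a token with bigram probability 0.9 gets `w_t = 0.55` and one with probability 0.01 gets `w_t = 0.995`. The weights stay in `[0.5, 1]`, so the 0.1 floor only becomes active once `\alpha_{\text{comp}}` exceeds 0.9.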
+
+**Estimated impact:** 0.05-0.20 BPB over pure neural, depending on n-gram order and mixing quality.
+
+**Implementation difficulty:** Moderate-high (correct normalization is crucial — #978's lesson).
+
+**Risk of failure:** Low for concept (proven), moderate for compliance (organizer scrutiny ongoing).
+
+**Compatibility:** Eval-time only — compatible with any training stack.
+
+**Verdict: HIGH PRIORITY. This is proven and the highest-impact single technique available. The risk is compliance, not effectiveness.**
+
+```python
+# PROOF-OF-CONCEPT: Complementary Training + BackoffNgramMixer
+# 
+# Two-part system:
+# Part 1: Modified training loss (in train_gpt_mlx_kl.py)
+# Part 2: Eval-time BackoffNgramMixer
+
+# ============================================================
+# PART 1: Complementary Training Loss
+# ============================================================
+
+import mlx.core as mx
+import mlx.nn as nn
+import numpy as np
+from collections import defaultdict
+
+def build_bigram_stats(train_data_path, vocab_size=1024):
+    """
+    Pre-compute bigram transition probabilities from training data.
+    Used to identify tokens that are 'easy' for n-gram models.
+    
+    Returns: bigram_probs[prev_token, next_token] = P(next|prev)
+    """
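+    # Count bigram frequencies (float64: float32 stops incrementing at 2**24)
+    counts = np.zeros((vocab_size, vocab_size), dtype=np.float64)
+    # Read training shards
+    import glob, struct
+    for shard_path in sorted(glob.glob(f"{train_data_path}/fineweb_train_*.bin")):
+        with open(shard_path, 'rb') as f:
+            header = struct.unpack('<256i', f.read(1024))
+            n_tokens = header[2]
+            tokens = np.frombuffer(f.read(n_tokens * 2), dtype=np.uint16).astype(np.int64)
+        # Vectorized count (a per-token Python loop is far too slow here):
+        # encode each (prev, next) pair as one flat index, assuming ids < vocab_size
+        pair_ids = tokens[:-1] * vocab_size + tokens[1:]
+        counts += np.bincount(pair_ids, minlength=vocab_size * vocab_size).reshape(vocab_size, vocab_size)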
+    
+    # Normalize to probabilities (with Laplace smoothing)
+    row_sums = counts.sum(axis=1, keepdims=True) + vocab_size
+    bigram_probs = (counts + 1) / row_sums
+    return bigram_probs
+
+
+def complementary_loss(logits, targets, prev_tokens, bigram_probs_mx, 
+                       complement_alpha=0.5):
+    """
+    Weighted CE loss that down-weights tokens easily predicted by bigrams.
+    
+    For each token:
+      p_bigram = bigram_probs[prev_token, target_token]
+      weight = 1 - complement_alpha * p_bigram
+    
+    Tokens with high bigram probability get lower training weight,
+    forcing the neural model to specialize on what n-grams can't predict.
+    
+    logits: (B*T, V)
+    targets: (B*T,)
+    prev_tokens: (B*T,) — the token at position t-1
+    bigram_probs_mx: (V, V) — pre-computed bigram transition probs
+    complement_alpha: float — strength of complementary weighting (0=standard CE)
+    """
+    # Standard CE per token
+    ce_per_token = nn.losses.cross_entropy(logits, targets, reduction='none')
+    
+    # Look up bigram probability of each target given its predecessor
+    # p_bigram[i] = bigram_probs[prev_tokens[i], targets[i]]
+    p_bigram = bigram_probs_mx[prev_tokens, targets]  # (B*T,)
+    
+    # Complementary weights: tokens easily predicted by bigrams get low weight
+    weights = 1.0 - complement_alpha * p_bigram
+    weights = mx.clip(weights, 0.1, 1.0)  # Floor at 0.1 to avoid zero gradients
+    
+    # Normalize weights to preserve effective learning rate
+    weights = weights / weights.mean()
+    
+    loss = (ce_per_token * weights).mean()
+    return loss
+
+
+# ============================================================
+# PART 2: Eval-Time BackoffNgramMixer (Causal, Full-Vocab Normalized)
+# ============================================================
+
+class BackoffNgramMixer:
+    """
+    Causal n-gram language model with count-weighted interpolation across orders (simpler than Kneser-Ney).
+    Built incrementally from already-scored tokens (backward-looking).
+    
+    Key properties for compliance:
+    1. Full-vocabulary normalized: produces valid probability distribution
+    2. Causal: only uses tokens at positions < current position
+    3. Single-pass: score-first, then update cache
+    4. No artifact cost: built from scratch during eval
+    
+    Algorithm:
+    - Maintain count tables for orders 1 through max_order
+    - For each position t:
+      1. Query all orders using context tokens[t-order:t]
+      2. Backoff from highest to lowest order with interpolation
+      3. Produce P_ngram(token | context) for all vocab tokens
+      4. Mix with neural model: P = (1-alpha)*P_neural + alpha*P_ngram
+      5. Score position t using P
+      6. Update count tables with token[t]
+    """
+    def __init__(self, vocab_size=1024, max_order=7, 
+                 hash_buckets=2_000_000, alpha_mode='entropy_adaptive'):
+        self.vocab_size = vocab_size
+        self.max_order = max_order
+        self.hash_buckets = hash_buckets
+        self.alpha_mode = alpha_mode
+        
+        # Count tables: hash(context) -> array of vocab counts
+        # Using defaultdict for prototype; production uses fixed hash tables
+        self.counts = [defaultdict(lambda: np.zeros(vocab_size, dtype=np.float32))
+                       for _ in range(max_order + 1)]
+        self.total_counts = [defaultdict(float) for _ in range(max_order + 1)]
+    
+    def _hash_context(self, context_tokens):
+        """Hash a sequence of tokens to a bucket index."""
+        h = 0
+        for t in context_tokens:
+            h = (h * 31337 + int(t)) % self.hash_buckets
+        return h
+    
+    def _get_ngram_probs(self, context_tokens):
+        """
+        Compute interpolated n-gram probability distribution.
+        Uses simple linear interpolation backoff.
+        """
+        vocab_size = self.vocab_size
+        
+        # Start with uniform (order 0)
+        probs = np.ones(vocab_size, dtype=np.float64) / vocab_size
+        
+        # Interpolate from low to high order
+        for order in range(1, self.max_order + 1):
+            if len(context_tokens) < order:
+                break
+            
+            ctx = context_tokens[-order:]
+            ctx_hash = self._hash_context(ctx)
+            
+            counts = self.counts[order][ctx_hash]
+            total = self.total_counts[order][ctx_hash]
+            
+            if total > 0:
+                # Interpolation weight increases with total count (confidence)
+                lambda_order = total / (total + 5.0)  # Confidence weight; 5.0 is a smoothing constant
+                order_probs = (counts + 1e-10) / (total + 1e-10 * vocab_size)
+                
+                # Ensure normalization
+                order_probs = order_probs / order_probs.sum()
+                
+                probs = (1 - lambda_order) * probs + lambda_order * order_probs
+        
+        # Final normalization (paranoia)
+        probs = probs / probs.sum()
+        return probs
+    
+    def _compute_alpha(self, neural_logits):
+        """
+        Entropy-adaptive mixing weight.
+        When neural model is uncertain (high entropy), trust n-grams more.
+        When neural model is confident (low entropy), trust it more.
+        """
+        if self.alpha_mode == 'fixed':
+            return 0.3
+        
+        # Compute neural model entropy
+        probs = np.exp(neural_logits - neural_logits.max())
+        probs = probs / probs.sum()
+        entropy = -np.sum(probs * np.log2(probs + 1e-10))
+        max_entropy = np.log2(self.vocab_size)  # ~10 bits for 1024 vocab
+        
+        # Map entropy to alpha: high entropy → high alpha
+        normalized_entropy = entropy / max_entropy
+        alpha = 0.15 + 0.45 * normalized_entropy  # Range: 0.15-0.60
+        
+        return alpha
+    
+    def score_and_update(self, position, context_tokens, token_at_pos, 
+                         neural_log_probs):
+        """
+        Score position and update cache. Must be called sequentially.
+        
+        Returns: natural-log probability (nats) of token_at_pos under the mixed distribution
+        """
+        # 1. Get n-gram distribution (causal: uses only tokens before position)
+        ngram_probs = self._get_ngram_probs(context_tokens)
+        
+        # 2. Get mixing weight
+        alpha = self._compute_alpha(neural_log_probs)
+        
+        # 3. Mix distributions
+        neural_probs = np.exp(neural_log_probs)
+        neural_probs = neural_probs / neural_probs.sum()  # Re-normalize
+        
+        mixed_probs = (1 - alpha) * neural_probs + alpha * ngram_probs
+        mixed_probs = mixed_probs / mixed_probs.sum()  # Ensure normalization
+        
+        # 4. Score
+        log_prob = np.log(mixed_probs[token_at_pos] + 1e-30)
+        
+        # 5. Update cache (AFTER scoring — backward-looking)
+        for order in range(1, self.max_order + 1):
+            if len(context_tokens) >= order:
+                ctx = context_tokens[-order:]
+                ctx_hash = self._hash_context(ctx)
+                self.counts[order][ctx_hash][token_at_pos] += 1
+                self.total_counts[order][ctx_hash] += 1
+        
+        return log_prob
+```
+
+---
+
+### ANGLE 8: INFORMATION-THEORETIC LOWER BOUND
+
+**Analysis:**
+
+The theoretical analysis is sound:
+- Shannon entropy of clean English: ~1.0-1.3 bits/byte
+- Best neural compressors on clean English: ~0.8-0.9 bits/byte (exploiting long-range structure)
+- FineWeb is web text: noisier, multilingual, more diverse → practical floor ~1.0-1.1 bits/byte
+- Current SOTA: 1.1194 (official), 1.1086 (pending)
+- LoRA TTT: reached 1.0865 (#628, GEPA+legal TTT on 4×A100)
+
+**The 16MB constraint is the binding limit, not the theoretical floor.** A 16MB artifact holds ~128M bits. The model has ~37M parameters at int6, which is ~222M bits raw and ~124M bits after zstd compression. The validation set is ~60M tokens × ~1.18 bytes/token ≈ 70M bytes, so reaching 1.0 BPB means coding it in ~70M bits using ~124M bits of model. That is a ~1.8:1 ratio of model bits to output bits: tight but feasible.
+
+**Sub-1.10 BPB is achievable within 16MB** — the GEPA+TTT result (1.0865) proves this, though it used 4×A100 for 20K steps (more compute). The question is whether 600s on 8×H100 (~7K steps) provides enough training to reach the same quality.
+
+**Estimated gap breakdown (1.1194 → 1.10):**
+- Better quantization/compression (entropy coding, NuMuon): ~0.003
+- Better architecture (shallow recurrence, EngramLite): ~0.005-0.008
+- Better training (Complementary Training, Mousse/Turbo-Muon): ~0.003-0.005
+- Eval-time BackoffNgramMixer: ~0.05-0.10
+- **Total estimated: ~0.06-0.12 BPB improvement → 1.00-1.06 BPB**
+
+Sub-1.10 is clearly achievable. Sub-1.05 is plausible. Sub-1.00 is at the edge of feasibility.
+
+---
+
+## Ranked List: Most Promising Ideas
+
+### Tier 1 — Highest Impact, Proven Feasible
+
+| Rank | Idea | Source Angle | Est. BPB Gain | Risk | Difficulty |
+|------|------|-------------|---------------|------|------------|
+| 1 | **BackoffNgramMixer at eval time** | 5, 7 | 0.05-0.15 | Low (proven) / Moderate (compliance) | Moderate-High |
+| 2 | **Complementary Training** | 5, 7 | 0.01-0.03 (over standard training) | Low (proven by #803) | Low |
+| 3 | **EngramLite (gated multi-head hash)** | 5 | 0.003-0.008 (over BigramHash) | Low (proven by #1089) | Moderate |
+
+### Tier 2 — Moderate Impact, Good Feasibility
+
+| Rank | Idea | Source Angle | Est. BPB Gain | Risk | Difficulty |
+|------|------|-------------|---------------|------|------------|
+| 4 | **Shallow recurrence (+2 virtual layers)** | 6 | 0.003-0.008 | Low (proven by #686) | Low |
+| 5 | **Skip-gram hash embedding** | 5 | 0.005-0.015 | Moderate (untried) | Low |
+| 6 | **NuMuon optimizer** | 3 | 0.002-0.006 | Moderate | Low |
+| 7 | **Mousse optimizer** | (Issue #140) | 0.003-0.008 | Low-Moderate | Low |
+| 8 | **PPMII-style escape estimation** | 5, 7 | 0.01-0.03 (over basic backoff) | Moderate | Moderate |
+
+### Tier 3 — Small/Speculative Impact
+
+| Rank | Idea | Source Angle | Est. BPB Gain | Risk | Difficulty |
+|------|------|-------------|---------------|------|------------|
+| 9 | **Byte-weighted CE loss** | 1 | 0.001-0.003 | High (below noise) | Very Low |
+| 10 | **Custom entropy coding (ANS/Brotli)** | 3 | 0.001-0.003 | Moderate | Moderate |
+| 11 | **Logistic-domain mixing** | 5 | 0.002-0.005 | Low | Very Low |
+
+### Dead Ideas (Don't Pursue)
+
+| Idea | Source Angle | Why Dead |
+|------|-------------|----------|
+| Non-uniform quantization (K-means, NF6) | 2 | MSE ≠ artifact size; higher index entropy defeats zstd |
+| Hypernetwork weight generation | 4 | Information-theoretic impossibility at 185× compression |
+| Deep recurrence (3+ loops) | 6 | Int6 error amplifies 900× over 3 cycles |
+| Weight sparsification | 3 | Zeroing weights INCREASES artifact size (#1048) |
+| Byte-level model | 5 | Far too slow for 600s training budget |
+| Standalone n-gram (no neural) | 5 | 1.51 BPB with correct normalization — worse than neural |
+
+---
+
+## Top 3: Proof-of-Concept Code
+
+### POC 1: Complementary Training + BackoffNgramMixer
+
+*(Full code stubs provided in Angle 5 and Angle 7 analysis above)*
+
+**Smoke test plan (M1, 100 steps):**
+1. Pre-compute bigram statistics from first training shard
+2. Modify `train_gpt_mlx_kl.py` loss to use `complementary_loss`
+3. Train 100 steps, compare train_loss vs baseline
+4. Expected: slightly higher train_loss (we're down-weighting easy tokens) but model learns harder patterns
+
+**Integration with existing stack:**
+- Replace BigramHash with EngramLite in model init
+- Add `--complement-alpha 0.5` flag
+- Pre-compute bigram stats during data loading (one-time cost)
+- At eval time, wrap sliding-window eval with BackoffNgramMixer
+
+### POC 2: EngramLite Gated Multi-Head Hash
+
+*(Full code stub provided in Angle 5 analysis above)*
+
+**Smoke test plan (M1, 100 steps):**
+1. Replace `BigramHashEmbedding` with `EngramLiteEmbedding` in model
+2. Config: hash_size=4096, dim=1024, n_heads=2, orders=(2,3)
+3. Train 100 steps, compare train_loss vs BigramHash baseline
+4. Expected: comparable or slightly better loss (gating suppresses noise)
+
+**Key implementation notes:**
+- The gating mechanism is essential — without it, trigrams hurt (#609)
+- Multi-head averaging reduces hash collision noise
+- Parameter budget: 2 tables × 4096 × 1024 × 2 heads ≈ 16M params in int6 ≈ 12MB
+  - This is too large! Need dim=128 with projection, or hash_size=2048
+  - Adjusted: hash_size=2048, dim=128, projected to vocab_size (1024)
+  - Budget: 2 × 2048 × 128 × 2 ≈ 1M params ≈ 0.75MB — fits easily
+
+### POC 3: Skip-Gram Hash + Shallow Recurrence Combo
+
+*(Skip-gram code stub provided in Angle 5. Shallow recurrence below.)*
+
+```python
+# PROOF-OF-CONCEPT: Shallow Recurrence with Per-Pass Scalars
+# Modification to GPT model in train_gpt_mlx_kl.py
+#
+# Key idea: Repeat layers 4 and 5 once each (11 → 13 virtual layers)
+# with per-pass learnable scalar multipliers.
+
+class ShallowRecurrentGPT:
+    """
+    Modification to existing GPT class.
+    
+    Original: layers 0,1,2,3,4,5,6,7,8,9,10 (11 layers)
+    Modified: layers 0,1,2,3,4,5,4',5',6,7,8,9,10 (13 virtual, 11 unique)
+    
+    Layers 4' and 5' reuse weights from layers 4 and 5 but with
+    per-pass learnable scalars that differentiate the passes.
+    
+    Cost: ~2K extra parameters (scale + shift per layer per pass)
+    Benefit: ~70% of independent 12L quality gain
+    
+    CRITICAL: Only 2 loops (1 repeat). 3+ loops cause 900× quant error
+    amplification at int6.
+    """
+    
+    def __init__(self, config):
+        # ... (standard init) ...
+        
+        # Per-pass scalars for recurrent layers
+        # pass_scale[layer_idx][pass_idx] = learnable scalar
+        self.recur_layers = [4, 5]  # Which layers to repeat
+        self.n_passes = 2  # Original + 1 repeat
+        
+        # Learnable scale/shift per pass (FiLM-lite)
+        # 2 recurrent layers × 2 passes × 2 (scale + shift) = 8 params
+        self.pass_scales = {}
+        for layer_idx in self.recur_layers:
+            for pass_idx in range(self.n_passes):
+                key = f'layer{layer_idx}_pass{pass_idx}'
+                # Pass 0 (original): scale=1.0, shift=0.0
+                # Pass 1 (repeat): scale=0.9, shift=0.0 (slightly dampened)
+                if pass_idx == 0:
+                    self.pass_scales[key] = 1.0  # Fixed, not learned
+                else:
+                    # Small init perturbation from pass 0
+                    self.pass_scales[key] = 0.9  # nn.Parameter, learned
+    
+    def forward(self, x):
+        """
+        Forward pass with shallow recurrence.
+        
+        Instead of: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8 → 9 → 10
+        We do:      0 → 1 → 2 → 3 → 4 → 5 → 4' → 5' → 6 → 7 → 8 → 9 → 10
+        
+        Where 4' means layer 4 weights with pass_scales['layer4_pass1']
+        """
+        # Encoder layers (0 through num_encoder-1)
+        for i in range(self.num_encoder):
+            x = self.layers[i](x)
+        
+        x0 = x  # Skip connection source
+        
+        # Decoder layers; the recurrent block is replayed once after its last layer
+        virtual_layer_order = []
+        for i in range(self.num_encoder, self.num_layers):
+            virtual_layer_order.append((i, 0))  # (layer_idx, pass_idx)
+            if i == self.recur_layers[-1]:  # Block finished: replay as 4', 5'
+                virtual_layer_order.extend((j, 1) for j in self.recur_layers)
+        
+        for layer_idx, pass_idx in virtual_layer_order:
+            scale = self.pass_scales.get(
+                f'layer{layer_idx}_pass{pass_idx}', 1.0
+            )
+            
+            # Standard block forward with scaling
+            block_out = self.layers[layer_idx](x, x0)
+            
+            if pass_idx > 0:
+                # For repeated passes, apply dampened residual
+                x = x + scale * (block_out - x)
+            else:
+                x = block_out
+        
+        return x
+```
+
+**Smoke test plan (M1, 100 steps):**
+1. Modify GPT forward to add shallow recurrence on layers 4,5
+2. Add SkipGramHashEmbedding alongside existing BigramHash
+3. Train 100 steps, compare train_loss
+4. Expected: slightly slower per step (~5-10% from extra 2 layer passes) but better loss per step
+
+---
+
+## Implementation Details for Each Idea
+
+| Idea | Difficulty | BPB Est. | Risk | Compatible? | Dependencies |
+|------|-----------|----------|------|-------------|-------------|
+| BackoffNgramMixer | Moderate-High | 0.05-0.15 | Low/Moderate | Yes (eval-only) | numpy |
+| Complementary Training | Low | 0.01-0.03 | Low | Yes | Pre-computed bigram stats |
+| EngramLite | Moderate | 0.003-0.008 | Low | Yes (replaces BigramHash) | None |
+| Shallow Recurrence | Low | 0.003-0.008 | Low | Yes (model arch change) | None |
+| Skip-gram Hash | Low | 0.005-0.015 | Moderate | Yes (additive) | None |
+| NuMuon | Low | 0.002-0.006 | Moderate | Yes (optimizer swap) | None |
+| Byte-weighted CE | Very Low | 0.001-0.003 | High | Yes (loss change) | tokenizer stats |
+| Custom entropy coding | Moderate | 0.001-0.003 | Moderate | Yes (post-training) | ANS library |
+
+---
+
+## THE MOONSHOT
+
+### Complementary Training + EngramLite + BackoffNgramMixer: The Integrated Stack
+
+**Why this is the single best bet nobody has fully combined:**
+
+The top competition results reveal three independent discoveries that, when properly integrated, form a system greater than the sum of its parts:
+
+1. **EngramLite (#1089):** Gated multi-head hashing makes n-gram features trainable end-to-end, fixing the TrigramHash failure (#609). This is the TRAINING-TIME component — the neural model learns to use n-gram context efficiently.
+
+2. **Complementary Training (#803):** Down-weighting tokens predictable by n-grams forces the neural model to specialize. This creates maximum COMPLEMENTARITY between neural and n-gram components. Without this, the neural model wastes capacity re-learning patterns the n-gram cache will handle at eval time.
+
+3. **BackoffNgramMixer (#1094):** A correctly-normalized eval-time n-gram cache with entropy-adaptive mixing. This is the EVAL-TIME component that adds 0.05-0.15 BPB improvement at zero artifact cost.
+
+**Why nobody has combined all three:**
+- EngramLite is new (#1089, March 29)
+- Complementary Training is new (#803, March 25)  
+- BackoffNgramMixer compliance was only clarified March 27
+- The three ideas emerged from different teams in different weeks
+
+**The integrated system:**
+
+```
+TRAINING:
+  1. Pre-compute bigram/trigram stats from training data
+  2. Train with EngramLite (gated bigram+trigram hash, replaces BigramHash)
+  3. Use Complementary Training loss (down-weight easy n-gram tokens)
+  4. Standard stack: 11L, 512d, 3×MLP, XSA, EMA, GPTQ-lite, etc.
+  5. Result: neural model specialized for what n-grams can't predict
+
+EVAL:
+  1. Load quantized model (standard sliding-window)
+  2. For each token:
+     a. Score with neural model → neural_log_probs
+     b. Score with BackoffNgramMixer (orders 2-7) → ngram_probs
+     c. Entropy-adaptive alpha: high neural uncertainty → trust n-grams more
+     d. Mix: P = (1-alpha) * P_neural + alpha * P_ngram  
+     e. Record log(P[true_token])
+     f. Update n-gram cache (backward-looking)
+  3. Result: complementary predictions from specialized components
+```
+
+**Expected BPB:**
+- Baseline pure neural SOTA: 1.1086 (#1089)
+- EngramLite + Complementary Training: ~1.10-1.11 (saving capacity for hard tokens)
+- + BackoffNgramMixer at eval: ~0.95-1.05 
+- + Skip-gram hash + shallow recurrence: ~0.92-1.00
+
+**Why this could leapfrog the field:**
+1. Nobody has done Complementary Training + EngramLite together (complementarity is maximized)
+2. The BackoffNgramMixer is free (zero artifact cost) and additive
+3. The neural model is BETTER at the tokens that matter because it doesn't waste capacity on n-gram-predictable tokens
+4. This is principled compression theory: multiple experts with complementary specializations, mixed with entropy-adaptive weights
+
+**Risk factors:**
+- Compliance: BackoffNgramMixer must produce correctly normalized full-vocabulary distributions
+- The eval-time n-gram cache quality depends on validation set repetitiveness
+- Complementary Training requires careful alpha tuning (too aggressive → model loses basic competence)
+- EngramLite's parameter budget must be managed (hash tables compete with model capacity)
+
+**Estimated floor:** Even without the BackoffNgramMixer (which has compliance risk), EngramLite + Complementary Training + the full existing stack should reach ~1.10 BPB on pure neural — matching the current pending best with a principled path forward.
+
+---
+
+## Honest Assessment: What Won't Work
+
+1. **Non-uniform quantization** — MSE ≠ artifact size. Killed by competition data (#1048, #316).
+2. **Hypernetworks** — Information-theoretically impossible at required compression ratios.
+3. **Deep recurrence (3+ loops)** — Int6 error amplification is a fundamental constraint.
+4. **Byte-level models** — Too slow for 600s training. H-Net proved this at 1.90 BPB.
+5. **Standalone n-gram replacement for neural** — 1.51 BPB with correct normalization (#978).
+6. **Byte-weighted CE as a major lever** — The effect is real but ~0.001-0.003 BPB, below noise.
+7. **Knowledge distillation** — 11ms/step I/O overhead fatal at 600s (#1029).
+8. **Weight sparsification** — Increases artifact size, doesn't decrease it (#1048).
+
+The only genuine insights in this analysis are:
+1. **The Complementary Training + EngramLite + BackoffNgramMixer integrated stack** (Moonshot)
+2. **Skip-gram hashing** as a genuinely untried extension of the proven hash embedding approach
+3. **The "MSE ≠ artifact size" principle** that eliminates entire categories of ideas
+
+Everything else is either already known to the competition community or below the significance threshold.

From 79f8f00b8566e1aa98644979e2e647ecdb74a708 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Mon, 30 Mar 2026 03:08:57 +0000
Subject: [PATCH 3/3] fix: Address code review feedback on POC code stubs

- Add explicit gate bias initialization (-2.0) in EngramLiteEmbedding
- Fix parameter budget: hash_size=2048, embed_dim=128 with projection to 1024
- Clarify shallow recurrence pass_scales as pseudocode with implementation notes
- Ensure POC 2 smoke test config matches actual class parameters

Agent-Logs-Url: https://github.com/kailean/parameter-golf/sessions/caad42c8-8417-45c0-bdf7-90803cf87976

Co-authored-by: kailean <49617037+kailean@users.noreply.github.com>
---
 pg_novel_ideas.md | 73 ++++++++++++++++++++++++++---------------------
 1 file changed, 41 insertions(+), 32 deletions(-)

diff --git a/pg_novel_ideas.md b/pg_novel_ideas.md
index b2bf67bf4..8d2eb55b7 100644
--- a/pg_novel_ideas.md
+++ b/pg_novel_ideas.md
@@ -264,35 +264,43 @@ class EngramLiteEmbedding(nn.Module):
         - Sigmoid gate (from context) scales the output
       - Sum across orders → output
     
-    Parameter budget:
-      bigram_table: hash_size × dim × K_heads
-      trigram_table: hash_size × dim × K_heads  
-      gate_proj: (order_count × dim) × dim
+    Parameter budget (adjusted for 16MB constraint):
+      bigram_table: 2048 × 128 = 262K params (one table, shared by both hash heads)
+      trigram_table: 2048 × 128 = 262K params (one table, shared by both hash heads)
+      projection: 128 × 1024 = 131K params
+      gate: 128 × 2 + 2 = 258 params
+      Total: ~0.66M params ≈ 0.5MB in int6, which fits easily
     """
-    def __init__(self, hash_size: int = 8192, dim: int = 1024, 
-                 n_heads: int = 4, orders: tuple = (2, 3)):
+    def __init__(self, hash_size: int = 2048, embed_dim: int = 128,
+                 output_dim: int = 1024, n_heads: int = 2, 
+                 orders: tuple = (2, 3)):
         super().__init__()
         self.hash_size = hash_size
-        self.dim = dim
+        self.embed_dim = embed_dim
+        self.output_dim = output_dim
         self.n_heads = n_heads
         self.orders = orders
         
         # Hash primes for multi-head hashing (different prime per head)
         self.primes = [31337, 59999, 73721, 97531][:n_heads]
         
-        # Separate embedding table per order
+        # Separate embedding table per order (small embed_dim, projected later)
         self.tables = {}
         for order in orders:
-            table = nn.Embedding(hash_size, dim)
+            table = nn.Embedding(hash_size, embed_dim)
             # Small init — these are additive corrections
             table.weight = table.weight * 0.01
             self.tables[f'order_{order}'] = table
         
+        # Project from embed_dim to output_dim (vocab_size or model_dim)
+        self.proj = nn.Linear(embed_dim, output_dim, bias=False)
+        
         # Learned gate: context → sigmoid scalar per position
         # Input: concatenated n-gram context embeddings
-        self.gate_proj = nn.Linear(dim, len(orders), bias=True)
-        # Initialize gate bias negative → starts mostly suppressed
+        self.gate_proj = nn.Linear(embed_dim, len(orders), bias=True)
+        # Initialize gate bias to -2.0 → sigmoid(-2) ≈ 0.12 → starts mostly suppressed
         # Model learns to trust hash lookups as training progresses
+        self.gate_proj.bias = mx.full((len(orders),), -2.0)
     
     def _hash_ngram(self, tokens, order, head_idx):
         """Hash n-gram context to table index."""
@@ -323,10 +331,10 @@ class EngramLiteEmbedding(nn.Module):
     def __call__(self, tokens):
         """
         tokens: (B, T) int32
-        Returns: (B, T, dim) — additive logit bias
+        Returns: (B, T, output_dim) — additive logit bias
         """
         B, T = tokens.shape
-        output = mx.zeros((B, T, self.dim))
+        output = mx.zeros((B, T, self.embed_dim))
         
         for oi, order in enumerate(self.orders):
             table = self.tables[f'order_{order}']
@@ -335,19 +343,24 @@ class EngramLiteEmbedding(nn.Module):
             head_embeds = []
             for hi in range(self.n_heads):
                 idx, valid_start = self._hash_ngram(tokens, order, hi)
-                emb = table(idx)  # (B, T-order+1, dim)
+                emb = table(idx)  # (B, T-order+1, embed_dim)
                 head_embeds.append(emb)
             
             # Mean pool across heads — reduces collision noise
-            ngram_emb = sum(head_embeds) / self.n_heads  # (B, T-valid_start, dim)
+            ngram_emb = sum(head_embeds) / self.n_heads  # (B, T-valid_start, embed_dim)
             
             # Pad to full sequence length
-            pad = mx.zeros((B, valid_start, self.dim))
-            ngram_emb = mx.concatenate([pad, ngram_emb], axis=1)  # (B, T, dim)
+            pad = mx.zeros((B, valid_start, self.embed_dim))
+            ngram_emb = mx.concatenate([pad, ngram_emb], axis=1)  # (B, T, embed_dim)
             
             output = output + ngram_emb
         
-        return output
+        # Project to output dimension and apply gate
+        gated = mx.sigmoid(self.gate_proj(output))  # (B, T, n_orders)
+        # Average gate across orders for simplicity
+        gate_scalar = gated.mean(axis=-1, keepdims=True)  # (B, T, 1)
+        
+        return self.proj(output) * gate_scalar  # (B, T, output_dim)
 
 
 # PROOF-OF-CONCEPT: Skip-Gram Hash Embedding  
@@ -777,17 +790,15 @@ Sub-1.10 is clearly achievable. Sub-1.05 is plausible. Sub-1.00 is at the edge o
 
 **Smoke test plan (M1, 100 steps):**
 1. Replace `BigramHashEmbedding` with `EngramLiteEmbedding` in model
-2. Config: hash_size=4096, dim=1024, n_heads=2, orders=(2,3)
+2. Config: hash_size=2048, embed_dim=128, output_dim=1024, n_heads=2, orders=(2,3)
 3. Train 100 steps, compare train_loss vs BigramHash baseline
 4. Expected: comparable or slightly better loss (gating suppresses noise)
 
 **Key implementation notes:**
 - The gating mechanism is essential — without it, trigrams hurt (#609)
 - Multi-head averaging reduces hash collision noise
-- Parameter budget: 2 tables × 4096 × 1024 × 2 heads ≈ 16M params in int6 ≈ 12MB
-  - This is too large! Need dim=128 with projection, or hash_size=2048
-  - Adjusted: hash_size=2048, dim=128, projected to vocab_size (1024)
-  - Budget: 2 × 2048 × 128 × 2 ≈ 1M params ≈ 0.75MB — fits easily
+- Parameter budget: 2 tables × 2048 × 128 (each shared by both hash heads) + projection (128×1024) ≈ 0.66M params
+  - In int6: ~0.5MB before zstd, which fits within the 16MB budget easily
 
 ### POC 3: Skip-Gram Hash + Shallow Recurrence Combo
 
@@ -820,24 +831,22 @@ class ShallowRecurrentGPT:
     def __init__(self, config):
         # ... (standard init) ...
         
-        # Per-pass scalars for recurrent layers
-        # pass_scale[layer_idx][pass_idx] = learnable scalar
+        # Per-pass scalars for recurrent layers (pseudocode — actual impl
+        # would use nn.Module parameters for gradient tracking)
         self.recur_layers = [4, 5]  # Which layers to repeat
         self.n_passes = 2  # Original + 1 repeat
         
-        # Learnable scale/shift per pass (FiLM-lite)
-        # 2 recurrent layers × 2 passes × 2 (scale + shift) = 8 params
+        # Learnable scale per pass (FiLM-lite)
+        # 2 recurrent layers × 1 repeat = 2 learnable scalars
+        # In a real MLX implementation: store mx.array(0.9) values on the nn.Module so they are trainable
         self.pass_scales = {}
         for layer_idx in self.recur_layers:
             for pass_idx in range(self.n_passes):
                 key = f'layer{layer_idx}_pass{pass_idx}'
-                # Pass 0 (original): scale=1.0, shift=0.0
-                # Pass 1 (repeat): scale=0.9, shift=0.0 (slightly dampened)
                 if pass_idx == 0:
-                    self.pass_scales[key] = 1.0  # Fixed, not learned
+                    self.pass_scales[key] = 1.0  # Fixed (original pass)
                 else:
-                    # Small init perturbation from pass 0
-                    self.pass_scales[key] = 0.9  # nn.Parameter, learned
+                    self.pass_scales[key] = 0.9  # Learnable (repeated pass)
     
     def forward(self, x):
         """