# Record: Learned Multi-Expert Gate + Frozen Oracle + Backoff TTT (3-seed mean val_bpb=0.1663)

**val_bpb: 0.1663** (3-seed mean, std 0.0003) | **<16 MB** | 8xH100 SXM, 600s

## Results (8xH100 80GB SXM)

| Seed | Pre-TTT bpb | Post-TTT bpb | Eval time | Artifact |
|------|-------------|--------------|-----------|----------|
| 1337 | 1.1265 | **0.1661** | 308s | 15.74 MB |
| 42 | 1.1320 | **0.1663** | 305s | 15.76 MB |
| 2024 | 1.1352 | **0.1666** | 303s | 15.25 MB |
| **Mean** | 1.1312 | **0.1663** | 305s | |
| **Std** | | **0.0003** | | |

## Background

PR #779 (deanbrr) introduced the BackoffNgramMixer with entropy-adaptive alpha and drift-free TTT, achieving 0.6683 BPB. That entropy-adaptive alpha is a hand-crafted heuristic capped at 0.60, which significantly underweights the n-gram cache once the cache matures during later eval chunks.

This submission replaces the fixed heuristic with a **learned multi-expert gate** trained end-to-end during the main training loop, and introduces a **frozen n-gram oracle** pre-computed from training data for efficient gradient-based gate training.

## Technique

### 1. Learned Multi-Expert Gate (Transformer Head)

Instead of a fixed entropy-based alpha, we add a small `nn.Linear(model_dim, 7)` head to the GPT model that outputs per-token logits over 7 experts:
- Expert 0: Neural model prediction
- Experts 1-6: N-gram orders 2 through 7

The gate is trained end-to-end alongside the main language modeling objective. During the forward pass:

1. Compute standard cross-entropy loss from neural logits
2. Compute per-expert probabilities: `[p_neural, p_2gram, p_3gram, ..., p_7gram]`
3. Apply masked softmax over valid experts (masking orders with insufficient context)
4. Enforce a 5% minimum floor on the neural expert weight for stability
5. Compute mixed probability: `p_mixed = sum(weights * expert_p)`
6. Add mixer loss: `L_mixer = -log(p_mixed)` weighted by 0.1

The gate learns from the model's hidden state which expert to trust for each token, enabling per-token routing that a fixed heuristic cannot match.
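Steps 1-6 above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the names `mixer_loss`, `gate_head`, `neural_p`, and `ngram_p` are assumptions, and the floor is applied by clamping the neural weight and renormalizing the rest.

```python
import torch
import torch.nn.functional as F

def mixer_loss(hidden, neural_p, ngram_p, valid_mask, gate_head, floor=0.05):
    """Sketch of the multi-expert gate (illustrative names, not the PR's API).

    hidden:     (B, T, D)  transformer hidden states
    neural_p:   (B, T)     neural model's probability of the target token
    ngram_p:    (B, T, 6)  per-order n-gram probabilities of the target (orders 2..7)
    valid_mask: (B, T, 6)  True where the order has sufficient context
    gate_head:  nn.Linear(D, 7) producing per-token expert logits
    """
    logits = gate_head(hidden)                                # (B, T, 7)
    # Masked softmax: expert 0 (neural) is always valid; invalid orders get -inf.
    mask = torch.cat([torch.ones_like(valid_mask[..., :1]), valid_mask], dim=-1)
    w = F.softmax(logits.masked_fill(~mask, float("-inf")), dim=-1)
    # 5% minimum floor on the neural expert; rescale the rest so weights sum to 1.
    w_neural = w[..., 0].clamp_min(floor)
    scale = ((1.0 - w_neural) / (1.0 - w[..., 0]).clamp_min(1e-9)).unsqueeze(-1)
    w = torch.cat([w_neural.unsqueeze(-1), w[..., 1:] * scale], dim=-1)
    # Mixed probability and mixer loss (added to CE with weight 0.1 in the record).
    expert_p = torch.cat([neural_p.unsqueeze(-1), ngram_p], dim=-1)
    p_mixed = (w * expert_p).sum(-1).clamp_min(1e-9)
    return -p_mixed.log().mean()
```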

### 2. Frozen N-gram Oracle (Pre-computed from Training Data)

To provide the n-gram probabilities needed for the mixer loss during training, we pre-fill the `BackoffNgramMixer` hash tables from all 80 training shards (8B tokens) at the start of training. This takes ~19 seconds and is counted within the 10-minute wallclock budget.

After pre-filling, the tables are frozen — no `update()` calls during training. The alpha head sees mature n-gram statistics from step 1, enabling effective gradient-based learning throughout training.

The "future token leakage" from using full-corpus statistics is negligible: any single token contributes only ~1/8B ≈ 1.25e-10 of the aggregate counts.
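The prefill-then-freeze discipline can be sketched with a toy counting oracle. The class below is a stand-in for illustration only, not the PR's `BackoffNgramMixer`; the real mixer hashes contexts into GPU tensors rather than Python counters.

```python
from collections import Counter

class ToyNgramOracle:
    """Toy stand-in for the frozen n-gram oracle (not the PR's BackoffNgramMixer)."""
    def __init__(self, order=2):
        self.order = order
        self.ctx_counts = Counter()   # counts of context prefixes
        self.counts = Counter()       # counts of (context, next_token) pairs
        self.frozen = False

    def update(self, tokens):
        if self.frozen:
            raise RuntimeError("oracle is frozen: no update() during training")
        for i in range(len(tokens) - self.order + 1):
            ctx = tuple(tokens[i:i + self.order - 1])
            self.ctx_counts[ctx] += 1
            self.counts[(ctx, tokens[i + self.order - 1])] += 1

    def freeze(self):
        self.frozen = True

    def prob(self, ctx, tok):
        c = self.ctx_counts[tuple(ctx)]
        return self.counts[(tuple(ctx), tok)] / c if c else 0.0

# Prefill once from training data, then freeze; lookups only after this point.
oracle = ToyNgramOracle(order=2)
for shard in [[1, 2, 1, 2, 1, 3]]:   # stands in for the 80 training shards
    oracle.update(shard)
oracle.freeze()
```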

### 3. GPU-Native BackoffNgramMixer

The entire n-gram mixer operates on GPU using PyTorch tensor operations:
- Count tables: `torch.int32` tensors on device (1M buckets × 2 tables × 6 orders = 48MB)
- Updates via `torch.scatter_add_` (no CPU-GPU transfers)
- Hash lookups via direct tensor indexing

This eliminates the CPU bottleneck from the original numpy implementation.
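A minimal sketch of the GPU-resident count-table idea, assuming a single hashed table (the record keeps separate context/continuation tables per order, and the hash constants here are illustrative):

```python
import torch

N_BUCKETS = 1 << 20  # 1M buckets, as in the record

def bucket_hash(ctx_ids: torch.Tensor, next_ids: torch.Tensor) -> torch.Tensor:
    """Cheap multiplicative hash of (context, next-token) pairs into buckets."""
    h = ctx_ids.to(torch.int64) * 1000003 + next_ids.to(torch.int64)
    return (h * 2654435761) % N_BUCKETS   # torch % yields non-negative indices

counts = torch.zeros(N_BUCKETS, dtype=torch.int32)  # .cuda() on an actual GPU

def update(ctx_ids, next_ids):
    # Vectorized count increments; duplicates accumulate, no CPU round-trips.
    idx = bucket_hash(ctx_ids, next_ids)
    counts.scatter_add_(0, idx, torch.ones_like(idx, dtype=torch.int32))

def lookup(ctx_ids, next_ids):
    # Direct tensor indexing; stays entirely on device.
    return counts[bucket_hash(ctx_ids, next_ids)]
```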

### 4. Pre-compilation of Mixer Loss Path

The mixer forward+backward path is pre-compiled via `torch.compile` using dummy data before the wallclock timer starts. This avoids a ~12s JIT compilation penalty during training. The pre-compilation uses zero tensors and does not touch training data.
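The warm-up pattern looks roughly like this. The function body is illustrative (the real compiled path is the mixer loss above), and `backend="eager"` is used only to keep the sketch portable; the record uses the default inductor backend.

```python
import torch

def _mixer_path(h, p):
    # Illustrative stand-in for the mixer forward pass.
    w = h.softmax(-1)
    return -((w * p).sum(-1).clamp_min(1e-9)).log().mean()

mixer_path = torch.compile(_mixer_path, backend="eager")

def precompile_mixer(batch=4, n_experts=7):
    # Dummy zeros trigger JIT compilation without touching any training tokens.
    h = torch.zeros(batch, n_experts, requires_grad=True)
    loss = mixer_path(h, torch.zeros(batch, n_experts))
    loss.backward()  # compile the backward graph too, before the timer starts
```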

## Order of Operations (Legality Proof)

### Training Phase (within 600s wallclock)

```
1. Model init, warmup steps, torch.compile [OUTSIDE wallclock]
- Standard model warmup (20 steps) + state reset
- torch.compile of mixer path with DUMMY ZEROS ← no training tokens

2. ──── WALLCLOCK STARTS (t0 = time.perf_counter()) ────

3. N-gram pre-fill (~19s) [INSIDE wallclock]
- Stream all 80 training shards through BackoffNgramMixer.update()
- Hash tables populated with full-corpus n-gram counts
- Tables FROZEN after this point — no more update() calls during training

4. Training loop (~562s, ~5400 steps) [INSIDE wallclock]
For each step:
a. Load mini-batch (x, y) from training data
b. Query FROZEN n-gram tables:
train_mixer._ngram_backoff_p(x, y) → per-order probabilities
(lookup only, no update — tables unchanged since step 3)
c. Forward pass through GPT model:
- Compute neural logits from transformer
- Cross-entropy loss on neural logits
- alpha_head(hidden_state) → 7 expert gate logits
- Masked softmax over valid experts (neural + n-gram orders 2-7)
- 5% floor on neural expert weight
- mixed_p = weighted sum of expert probabilities
- mixer_loss = -log(mixed_p), added to CE with weight 0.1
d. Backward pass + optimizer step (Muon + Adam)
e. EMA weight update (decay=0.997)

5. ──── WALLCLOCK ENDS (~581s of 600s budget) ────
```

### Evaluation Phase (after training, ~305s)

```
6. Serialize model: EMA weights → int6+zstd (15.7 MB)

7. Load quantized model into fresh eval_model

8. TTT eval (eval_val_sliding_ttt):
- Create FRESH BackoffNgramMixer (empty, no training data)
- 60 chunks × 1M tokens each, stride=64

For each chunk ci:
┌─ Phase 1: SCORE (torch.inference_mode, no gradient) ─┐
│ For each batch of windows in this chunk: │
│ a. Forward pass → neural logits + gate logits │
│ b. Query eval mixer for n-gram probabilities │
│ (only tokens ALREADY in the cache from │
│ previously scored chunks 0..ci-1) │
│ c. Multi-expert mixing with learned gate │
│ d. Record NLL for scored positions │
└──────────────────────────────────────────────────────┘
│ dist.barrier() — all ranks finish scoring chunk ci
├─ Cache update: mixer.update(val_tokens[ci_start:ci_end])
│ (tokens from chunk ci added to cache AFTER scoring)
┌─ Phase 2: TRAIN on chunk ci (already scored = legal) ┐
│ Standard cross-entropy TTT on Q projections only │
│ (no mixer loss — just CE on neural logits) │
│ Cosine LR decay across chunks │
└──────────────────────────────────────────────────────┘
```

Key invariants:
- **Training**: N-gram tables frozen after pre-fill. Only lookups during gradient steps — never updated from training batches.
- **Eval**: Fresh cache. Each chunk scored BEFORE its tokens are added to the cache. No future token information can leak.
- **TTT training**: Uses standard cross-entropy loss only (not mixer loss). Unfreezes Q projections + norms + alpha_head.
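The score-first / update-after discipline of the eval loop can be sketched as below. All names (`eval_sliding_ttt`, `score_chunk`, `train_chunk`, `mixer`) are illustrative placeholders, not the submission's actual function signatures:

```python
import torch

def eval_sliding_ttt(chunks, score_chunk, train_chunk, mixer):
    nlls = []
    for tokens in chunks:
        # Phase 1: score under inference mode. The mixer only holds counts
        # from previously scored chunks, so no future-token information leaks.
        with torch.inference_mode():
            nlls.append(score_chunk(tokens, mixer))
        # Cache update strictly AFTER scoring this chunk.
        mixer.update(tokens)
        # Phase 2: TTT on the already-scored chunk (legal by the score-first rule).
        train_chunk(tokens)
    return sum(nlls) / len(nlls)
```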

## What the Gate Learned

The expert logit statistics reveal a clear hierarchy (seed 1337):

| Expert | Mean Logit | Interpretation |
|--------|-----------|----------------|
| Neural | -5.52 | Rarely trusted |
| 2-gram | -16.78 | Almost never used |
| 3-gram | -12.13 | Rarely used |
| 4-gram | -8.94 | Rarely used |
| 5-gram | -6.21 | Sometimes used |
| 6-gram | -3.48 | Moderately used |
| **7-gram** | **+8.09** | **Dominant expert** |

The 7-gram expert is the only one with a positive mean logit, confirming it as the dominant predictor when the cache is mature. The gate automatically falls back to lower-order n-grams or the neural model when higher orders lack coverage.

## Wallclock Budget Breakdown

| Phase | Time | Inside wallclock? |
|-------|------|-------------------|
| Model init + warmup steps | ~25s | No |
| torch.compile (standard path) | ~8s | No |
| torch.compile (mixer path, dummy zeros) | ~12s | No |
| **N-gram pre-fill (8B tokens)** | **~19s** | **Yes** |
| **Training (~5400 steps)** | **~562s** | **Yes** |
| Eval (sliding window + TTT) | ~305s | After training |

Total training wallclock: ~581s of 600s budget.
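The budget gating can be sketched as a loop keyed on `time.perf_counter()`; the structure below is an assumption about the control flow (the real script also estimates per-step time before deciding to stop):

```python
import time

def run_with_budget(prefill, train_step, budget_s=600.0, margin_s=0.5):
    t0 = time.perf_counter()
    prefill()   # n-gram prefill (~19s in the record), inside the budget
    steps = 0
    # Stop early once the remaining budget is smaller than the safety margin.
    while time.perf_counter() - t0 < budget_s - margin_s:
        train_step()
        steps += 1
    return steps, time.perf_counter() - t0
```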

## Compliance

- **Score-first TTT:** Each chunk scored under `torch.inference_mode()` before any training on that chunk
- **Backward-looking n-gram:** Eval-time cache built from scratch; counts only from already-scored chunks, updated strictly after scoring
- **N-gram pre-fill counted in wallclock:** The 19s pre-fill from training data is inside the 10-minute budget
- **Frozen oracle during training:** After pre-fill, n-gram tables are read-only — no `update()` calls during the training loop
- **torch.compile outside wallclock:** Pre-compilation uses dummy zeros, no training tokens accessed
- **No oracle selection:** Gate depends on model hidden state, never compares mixed vs original NLL
- **No training data at eval:** Eval mixer is created fresh, built causally from validation data only
- **TTT uses CE loss only:** TTT training step uses standard cross-entropy, not the mixer loss
- **Token count verified:** ratio_scored = 1.000000
- **Artifact under 16MB:** Max 15.76 MB across seeds

## Reproduction

```bash
pip install zstandard
SEED=1337 MAX_WALLCLOCK_SECONDS=600 \
USE_MIXER=1 MIXER_ETA=0.02 MIXER_HEAD=multi \
QTTT=1 TTT_EPOCHS=1 TTT_FREEZE_BLOCKS=1 TTT_LR=0.00003 \
TTT_CHUNK_TOKENS=1048576 EVAL_STRIDE=64 \
CROWN_Q_LAMBDA=0.01 PRUNE_PCT=0.08 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## TTT Configuration

| Parameter | Setting |
|-----------|---------|
| Unfrozen params | Q projections + norms + alpha_head (QTTT=1) |
| TTT LR | 0.00003 with cosine decay across chunks |
| Chunk size | 1M tokens (60 chunks) |
| Epochs per chunk | 1 |
| Optimizer | AdamW |
| Loss | Standard cross-entropy (byte-weighted) |
| Mixer eta | 0.02 |
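The QTTT=1 parameter selection can be sketched as a name-based freeze; the substrings `q_proj`, `norm`, and `alpha_head` are assumptions about the model's parameter naming:

```python
import torch.nn as nn

def freeze_for_ttt(model: nn.Module, unfrozen_keys=("q_proj", "norm", "alpha_head")):
    """Freeze everything except Q projections, norms, and the gate head."""
    trainable = []
    for name, p in model.named_parameters():
        p.requires_grad = any(k in name for k in unfrozen_keys)
        if p.requires_grad:
            trainable.append(p)
    return trainable  # e.g. pass to torch.optim.AdamW(trainable, lr=3e-5)
```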

## Architecture

11L, 512d, GQA 8H/8KV, MLP 3x, LeakyReLU(0.5)^2, XSA all 11 layers, Value Residual, Gated Attention, SmearGate, BigramHash(4096), Partial RoPE(16/64), LN Scale, EMA(0.997). Tied embeddings. Muon optimizer. Multi-expert gate head (Linear 512→7). ~5400 steps in 581s (19s pre-fill + 562s training).

## Credits

- **PR #779 deanbrr** - BackoffNgramMixer, entropy-adaptive alpha, drift-free TTT, base architecture
- **PR #700 RoyiRa** - Base architecture, TTT framework, stride=64 eval
- **PR #606 gowtham0992** - int5 + Soft-Round QAT model
- **PR #727 Asukabot0** - Multi-order backoff concept, entropy-adaptive alpha formula
- **PR #461 Christopher-Lee-McClendon** - TTT recipe foundations
- **PR #518 sofiabod** - LeakyReLU(0.5)^2, cosine TTT scheduling
W0326 07:15:42.068000 834822 site-packages/torch/distributed/run.py:851]
W0326 07:15:42.068000 834822 site-packages/torch/distributed/run.py:851] *****************************************
W0326 07:15:42.068000 834822 site-packages/torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0326 07:15:42.068000 834822 site-packages/torch/distributed/run.py:851] *****************************************
logs/seed1337.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/root/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/root/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
mixed_precision: 68 int5 layers, 0 int6 layers (last 0 blocks)
model_params:33321571
XSA:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] ws:8 gqa:8/8
lr:embed=0.035 matrix=0.025 scalar=0.025 batch:786432 wall:600s seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
pre-compiling mixer loss path (dummy data, no training tokens)...
pre-compile done
prefilling n-gram tables from training shards (frozen oracle)...
prefilled 8,000,040,960 tokens in 18963ms (counted in wallclock)
step:0/20000 val_loss:6.9312 val_bpb:4.1051 train_time:18963ms step_avg:0.04ms
step:1/20000 train_loss:7.0814 train_time:21159ms step_avg:2195.41ms
step:2/20000 train_loss:8.7659 train_time:21256ms step_avg:1146.20ms
step:3/20000 train_loss:8.6634 train_time:21354ms step_avg:797.04ms
step:4/20000 train_loss:8.1767 train_time:21453ms step_avg:622.38ms
step:5/20000 train_loss:7.4828 train_time:21552ms step_avg:517.73ms
step:6/20000 train_loss:6.8784 train_time:21650ms step_avg:447.80ms
step:7/20000 train_loss:6.4195 train_time:21749ms step_avg:397.97ms
step:8/20000 train_loss:6.1459 train_time:21847ms step_avg:360.47ms
step:9/20000 train_loss:5.9906 train_time:21945ms step_avg:331.33ms
step:10/20000 train_loss:5.9522 train_time:22044ms step_avg:308.12ms
step:500/20000 train_loss:2.3848 train_time:71109ms step_avg:104.29ms
step:1000/20000 train_loss:2.2575 train_time:121290ms step_avg:102.33ms
step:1500/20000 train_loss:2.2011 train_time:171536ms step_avg:101.71ms
step:2000/20000 train_loss:2.0488 train_time:221772ms step_avg:101.40ms
step:2500/20000 train_loss:2.1434 train_time:272012ms step_avg:101.22ms
step:3000/20000 train_loss:2.1215 train_time:322256ms step_avg:101.10ms
step:3500/20000 train_loss:2.1276 train_time:372497ms step_avg:101.01ms
late_qat:enabled step:3826 scale:0.4998
step:4000/20000 train_loss:1.9106 train_time:423748ms step_avg:101.20ms
step:4000/20000 val_loss:1.9910 val_bpb:1.1792 train_time:423753ms step_avg:101.20ms
step:4500/20000 train_loss:2.0553 train_time:475664ms step_avg:101.49ms
swa:start step:4850
step:5000/20000 train_loss:2.0299 train_time:527787ms step_avg:101.76ms
step:5500/20000 train_loss:1.9416 train_time:580121ms step_avg:102.03ms
step:5516/20000 val_loss:1.9118 val_bpb:1.1323 train_time:581819ms step_avg:102.04ms
stopping_early: wallclock_cap train_time:581819ms step:5516/20000
peak memory allocated: 26272 MiB reserved: 26550 MiB
ema:applying EMA weights (skipping diagnostic evals)
Serialized model: 130447629 bytes
Code size: 96235 bytes
pruning:8.0% magnitude pruning applied
Serialized model int6+zstd: 15642252 bytes
Total submission size int6+zstd: 15738487 bytes
ttt: pre-compiling forward+backward kernels...
ttt: pre-compile done
final_int6_sliding_window val_loss:1.9258 val_bpb:1.1405 stride:64 eval_time:87345ms
final_int6_sliding_window_exact val_loss:1.92576762 val_bpb:1.14054806
TTT: epochs=1 lr=3e-05 freeze_first=1 chunk=1048576 opt=adamw
TTT temperature: 0.98
PPM alpha: 0.85, Byte-weighted TTT: True
Logistic context mixer enabled: eta=0.02
ttt:start chunks=60 chunk_tokens=1048576 windows=969057 stride=64 lr=3e-05 epochs=1 opt=adamw freeze_first=1
ttt:params unfrozen=277003 frozen=33044568
ttt_train [1] seqs=512 start_train...
ttt_train [1] epoch=1/1 batches=64 ...
step done ep=1 bs=0 loss=2.3128
step done ep=1 bs=32 loss=2.1571
ttt_chunk [1/60] bpb=1.151690 time=4.6s
ttt_train [2] seqs=512 start_train...
ttt_train [2] epoch=1/1 batches=64 ...
step done ep=1 bs=0 loss=2.2360
step done ep=1 bs=32 loss=2.2657
ttt_chunk [2/60] bpb=1.111905 time=9.3s
ttt_train [3] seqs=512 start_train...
ttt_train [3] epoch=1/1 batches=64 ...
step done ep=1 bs=0 loss=2.1750
step done ep=1 bs=32 loss=2.1951
ttt_chunk [3/60] bpb=0.950938 time=13.9s
ttt_chunk [4/60] bpb=0.820517 time=18.5s
ttt_chunk [5/60] bpb=0.710326 time=23.2s
ttt_chunk [11/60] bpb=0.421397 time=51.3s
ttt_chunk [21/60] bpb=0.280785 time=98.1s
ttt_chunk [31/60] bpb=0.227172 time=144.9s
ttt_chunk [41/60] bpb=0.196466 time=191.7s
ttt_chunk [51/60] bpb=0.177661 time=238.5s
ttt_chunk [60/60] bpb=0.166172 time=276.6s
ttt:done val_loss=0.280495 val_bpb=0.166125 elapsed=276.6s
expert_logit[neural]: mean=-5.5161 std=4.5017 min=-35.5000 max=23.8750
expert_logit[ngram_2]: mean=-16.7814 std=2.9300 min=-40.5000 max=1.2734
expert_logit[ngram_3]: mean=-12.1330 std=3.0311 min=-38.0000 max=12.1875
expert_logit[ngram_4]: mean=-8.9421 std=3.4461 min=-41.0000 max=24.2500
expert_logit[ngram_5]: mean=-6.2065 std=3.7653 min=-42.7500 max=33.2500
expert_logit[ngram_6]: mean=-3.4826 std=4.2406 min=-43.0000 max=41.2500
expert_logit[ngram_7]: mean=8.0914 std=4.6231 min=-19.2500 max=35.5000
final_int6_ttt val_loss:0.2805 val_bpb:0.1661 stride:64 eval_time:308104ms
final_int6_ttt_exact val_loss:0.28049466 val_bpb:0.16612474