# Record: 0.1584 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03

**val_bpb = 0.1584** (3-seed mean, std 0.0008) | **15.72-15.76 MB** | 8xH100 SXM | **No TTT**

## Results

| Seed | Steps | ms/step | Sliding BPB | **Mixer BPB** | Artifact (bytes) |
|------|-------|---------|-------------|---------------|----------|
| 42 | 4,940 | 114 | 1.1362 | **0.1575** | 15,758,015 |
| 1337 | 4,930 | 114 | 1.1353 | **0.1585** | 15,723,194 |
| 2024 | 4,937 | 114 | 1.1366 | **0.1591** | 15,724,500 |
| **Mean** | | | | **0.1584 ± 0.0008** | |
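
The aggregate row can be checked directly from the per-seed BPBs (sample standard deviation over the three seeds):

```python
import statistics

# Per-seed mixer BPB values for seeds 42, 1337, 2024 (from the metadata JSON)
seed_bpb = [0.157519, 0.158454, 0.159131]

mean = statistics.mean(seed_bpb)
std = statistics.stdev(seed_bpb)  # sample std (n-1 denominator)
print(round(mean, 4), round(std, 4))  # 0.1584 0.0008
```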

## Two Key Changes from PR #834

1. **MATRIX_LR=0.03** (was 0.025) — discovered through systematic screening of 79+ experiments
2. **TTT_EPOCHS=0** — removes test-time training entirely; evaluation performs zero gradient updates on validation data

Despite removing TTT, our 3-seed mean (0.1584) **beats** PR #834's original result (0.1663 with TTT enabled). The higher matrix LR produces a better-trained model that the learned mixing head can leverage more effectively.

## Architecture (from PR #834)

- 11L, 512d, MHA 8/8, MLP 3.5x, LeakyReLU(0.5)²
- **Learned mixer head**: `Linear(512 → 7)` predicts per-token mixing weights for neural model + n-gram orders 2-7
- **Frozen n-gram oracle**: bigram/trigram/...7-gram tables precomputed from training data, used as lookup during training
- Mixed int5/int6 quantization + GPTQ + zstd, EMA(0.997), CROWN-Q penalty
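
A minimal sketch of what the mixer head above could look like: a linear gate over the hidden state producing softmax weights for 7 experts (the neural model plus n-gram orders 2-7). Names and shapes here are illustrative assumptions, not the submission's actual code:

```python
import torch
import torch.nn as nn


class MixerHead(nn.Module):
    """Predicts per-token mixing weights over 7 experts:
    expert 0 = neural LM, experts 1-6 = frozen n-gram orders 2-7."""

    def __init__(self, d_model: int = 512, n_experts: int = 7):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)  # Linear(512 -> 7)

    def forward(self, hidden: torch.Tensor, expert_probs: torch.Tensor) -> torch.Tensor:
        # hidden:       (B, T, d_model)     transformer hidden states
        # expert_probs: (B, T, 7, V)        per-expert next-token distributions
        weights = torch.softmax(self.gate(hidden), dim=-1)          # (B, T, 7)
        mixed = (weights.unsqueeze(-1) * expert_probs).sum(dim=2)   # (B, T, V)
        return mixed  # convex combination, so still a valid distribution
```

Because the gate weights are a softmax and each expert distribution is normalized, the mixed output is itself a proper distribution over the vocabulary.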

## Eval: Learned Multi-Expert Mixing (NO TTT)

- Score-first backward-looking n-gram cache (orders 2-7)
- Model-predicted mixing weights (not fixed alpha — learned during training)
- Each token gets its own expert weights based on transformer hidden state
- **515s eval time** (within 600s budget, no TTT overhead)
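
The "score-first" property in the first bullet can be illustrated with a toy bigram cache: each token is scored using counts gathered from strictly earlier tokens, and only then added to the cache. This is a hypothetical sketch (function name, smoothing, and API are assumptions), not the submission's implementation:

```python
from collections import defaultdict


def score_first_ngram_stream(tokens, order=2, vocab_size=256, alpha=1.0):
    """Backward-looking n-gram cache (order-1 tokens of context).

    Each token is scored against the cache *before* it is inserted,
    so no token's probability ever depends on itself or the future.
    Add-alpha smoothing keeps probabilities well-defined for unseen
    contexts."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    probs = []
    for i, tok in enumerate(tokens):
        ctx = tuple(tokens[max(0, i - order + 1):i])
        # 1) score with the current (past-only) cache
        p = (counts[ctx][tok] + alpha) / (totals[ctx] + alpha * vocab_size)
        probs.append(p)
        # 2) only now add this token to the cache
        counts[ctx][tok] += 1
        totals[ctx] += 1
    return probs
```

On a repetitive stream like `[1, 2, 1, 2, 1, 2]` the cached probability of the repeated continuation rises over time (1/3, then 1/2, then 3/5 with `vocab_size=3`), which is the mechanism the mixer head learns to exploit without ever peeking ahead.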

## Reproduction

```bash
MATRIX_LR=0.03 TTT_EPOCHS=0 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Legality

- No TTT (zero gradient updates on validation data)
- N-gram cache is backward-looking (score-first, cache updated after scoring)
- Learned mixing head trained on training data only (frozen oracle)
- Single-pass evaluation

## Based On

- PR #834: Learned Multi-Expert Gate + Frozen Oracle architecture
- Our systematic hyperparameter screening (79+ experiments, MATRIX_LR=0.03 discovery)

## Record Metadata

```json
{
  "track": "10min_16mb",
  "date": "2026-03-26",
  "name": "Record: 0.1584 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03",
  "author": "bigbag",
  "github": "bigbag",
  "seed_results": {
    "42": {"val_loss": 0.265964, "val_bpb": 0.157519, "artifact_bytes": 15758015},
    "1337": {"val_loss": 0.267544, "val_bpb": 0.158454, "artifact_bytes": 15723194},
    "2024": {"val_loss": 0.268686, "val_bpb": 0.159131, "artifact_bytes": 15724500}
  },
  "mean_val_loss": 0.267398,
  "mean_val_bpb": 0.158368,
  "std_val_bpb": 0.0008,
  "code_bytes": 92093
}
```