openai · danialht · Mar 24, 2026
diff --git a/...rds/track_10min_16mb/2026-03-23_MixResiduals_Int6_GPTQ_BigramHash8192/README.md b/...rds/track_10min_16mb/2026-03-23_MixResiduals_Int6_GPTQ_BigramHash8192/README.md
@@ -0,0 +1,34 @@
+## Record: Residual Input Mixing + mixed int6 GPTQ + grouped TTT + MLP 3.5x
+
+**val_bpb: 1.1169** (mean over 3 seeds with TTT evaluation, stride=64)
+
+**artifact: 15.6 MB** (mean over 3 seeds)
+
+## TL;DR Changes
+
+- Changed TTT from a flat optimizer to grouped AdamW with stronger matrix/head adaptation, while restoring standard clipping and removing the per-chunk warmup.
+
+- Changed architecture to make residual connections denser: block input formation now gives each transformer block a learned mix of the current stream, earlier block outputs, and the original x0, instead of only the simpler local x/x0 residual mix. This gives the model a denser residual path and lets each block directly reuse longer-range intermediate features.
+
+## Results
+
+| Seed | Steps | final val_loss | final val_bpb | Artifact |
+|------|-------|----------|-------------------|----------|
+| 1337 | 6106 | 1.8859  | 1.1169 | 15.88 MB |
+| 42 | 6092 | 1.8855 | 1.1167 | 15.33 MB |
+| 2024 | 6091 | 1.8864 | 1.1172 | 15.73 MB |
+
+**val_bpb mean: 1.1169**
+
+**val_bpb std: 0.0003**
+
+**val_loss mean: 1.8859**
+
+
+## More Details
+
+- Architecture: 11L, 512d, each layer mixes residuals from the 2 previous layers, MHA 8/8, MLP 3.5x (1792), BigramHash 8192, XSA on all layers
+
+- Quantization: mixed int6 per-row GPTQ (clip_range=15) + Early QAT (threshold 0.5) + EMA 0.997
+
+- TTT: Legal score-first AdamW, chunk=131072, last 2 blocks plus control params unfrozen
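
Reviewer note on the README above: the residual input mixing it describes can be illustrated with a minimal sketch. This is not the submission's code, which this diff does not include. The function name `mix_block_input`, the use of one scalar weight per source, and the plain weighted sum are all assumptions made for illustration; only the set of sources (current stream, the 2 previous block outputs, and x0) comes from the README.

```python
from typing import List, Sequence

Vector = List[float]


def mix_block_input(
    stream: Vector,
    prev_outputs: Sequence[Vector],
    x0: Vector,
    weights: Sequence[float],
    history: int = 2,
) -> Vector:
    """Form a block's input as a weighted sum of several residual sources.

    Hypothetical helper: sources are the current stream, up to `history`
    of the most recent earlier block outputs, and the original input x0.
    In a real model the weights would be learned parameters; here they
    are plain floats supplied by the caller.
    """
    # Keep only the most recent `history` block outputs (possibly fewer
    # for the first blocks, possibly none when history is 0).
    start = max(0, len(prev_outputs) - history)
    recent = list(prev_outputs)[start:]

    sources = [stream] + recent + [x0]
    if len(weights) != len(sources):
        raise ValueError(
            f"expected {len(sources)} weights, got {len(weights)}"
        )
    if any(len(src) != len(stream) for src in sources):
        raise ValueError("all sources must have the same dimension")

    # Elementwise weighted sum across sources.
    return [
        sum(w * src[i] for w, src in zip(weights, sources))
        for i in range(len(stream))
    ]


# Usage: three earlier block outputs exist, but only the last two are mixed.
example = mix_block_input(
    stream=[1.0, 2.0],
    prev_outputs=[[10.0, 20.0], [100.0, 200.0], [1000.0, 2000.0]],
    x0=[0.5, 0.5],
    weights=[1.0, 0.5, 0.25, 2.0],  # stream, two recent outputs, x0
)
```

With a weight of 1 on the stream and 0 elsewhere, this reduces to passing the stream through unchanged, so the extra sources can be viewed as optional additions to a standard residual path.
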
diff --git a/records/track_10min_16mb/2026-03-23_MixResiduals_Int6_GPTQ_BigramHash8192/submission.json b/records/track_10min_16mb/2026-03-23_MixResiduals_Int6_GPTQ_BigramHash8192/submission.json
@@ -0,0 +1,11 @@
+{
+  "author": "Danial Hosseintabar",
+  "github_id": "danialht",
+  "name": "Mix Residuals + int6 GPTQ + TTT score first AdamW",
+  "blurb": "11-layer 512-dim MHA 8/8 model with denser residual input mixing, 3.5x MLP (1792), BigramHash-8192, and XSA on all layers. Mixed int6 per-row GPTQ with early QAT, EMA/SWA, 2% pruning, and legal score-first grouped-AdamW TTT on 131072-token chunks with the last 2 blocks plus control params adapted at eval.",
+  "date": "2026-03-23T01:00:00Z",
+  "val_loss": "1.8855",
+  "val_bpb": "1.1167",
+  "bytes_total": "15330834",
+  "bytes_code": "76849"
+}
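
Reviewer note: the summary figures in the README and the fields in submission.json can be cross-checked against the per-seed table with a short script. The sketch below copies the rounded values from the Results table, so it can only confirm consistency at the reported precision. It assumes the reported std is the sample standard deviation and that "MB" means 10^6 bytes; both assumptions reproduce the published numbers, but neither is stated in the PR.

```python
from statistics import mean, stdev

# (seed, final val_loss, final val_bpb, artifact size in MB), as reported
# in the README Results table.
ROWS = [
    (1337, 1.8859, 1.1169, 15.88),
    (42, 1.8855, 1.1167, 15.33),
    (2024, 1.8864, 1.1172, 15.73),
]

# Values copied from submission.json.
SUBMISSION = {
    "val_loss": "1.8855",
    "val_bpb": "1.1167",
    "bytes_total": "15330834",
}


def summarize(rows):
    """Recompute the README summary statistics from the per-seed rows."""
    losses = [r[1] for r in rows]
    bpbs = [r[2] for r in rows]
    sizes = [r[3] for r in rows]
    return {
        "val_loss_mean": round(mean(losses), 4),
        "val_bpb_mean": round(mean(bpbs), 4),
        # Assumption: sample (n - 1) standard deviation.
        "val_bpb_std": round(stdev(bpbs), 4),
        "artifact_mean_mb": round(mean(sizes), 1),
    }


summary = summarize(ROWS)

# The submission.json metrics match the seed with the lowest val_bpb.
best = min(ROWS, key=lambda r: r[2])
best_seed = best[0]
submission_matches_best = (
    float(SUBMISSION["val_loss"]) == best[1]
    and float(SUBMISSION["val_bpb"]) == best[2]
)

# Assumption: MB means 10**6 bytes.
submission_mb = round(int(SUBMISSION["bytes_total"]) / 1e6, 2)
```

Under these assumptions the recomputed means and std agree with the README, and the submission.json values correspond to the seed 42 row rather than to the 3-seed mean.
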