From e16f7aac737d8a8e960f740c0828c61b3bdc524b Mon Sep 17 00:00:00 2001
From: Abay Bektursun <abaybektursun@gmail.com>
Date: Wed, 25 Mar 2026 13:17:43 -0500
Subject: [PATCH] =?UTF-8?q?Non-record:=20Negative=20results=20=E2=80=94=20?=
 =?UTF-8?q?quantization=20algorithms=20&=20TTT=20on=20val-GPTQ=20stack?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

6 experiments on the current SOTA stack (1.1142 BPB), all negative:
- Qronos iterative Hessian (3 iters): +0.0007 worse
- CDQuant coordinate descent (3 passes): +0.0005 worse
- Full TTT (all params): +0.0001 worse
- MLP-down-only TTT: +0.0001 neutral
- MLP-all TTT: +0.0001 neutral

Key finding: At int6, GPTQ algorithm is near-optimal. Remaining quant headroom
is in the grid (what values to quantize to), not the algorithm (how to assign).
TTT is dead on this stack — 25 total failed attempts across two stacks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 .../README.md                                 | 53 +++++++++++++++++++
 1 file changed, 53 insertions(+)
 create mode 100644 records/track_10min_16mb/2026-03-25_Negative_Results_Quant_Algorithm_TTT/README.md

diff --git a/records/track_10min_16mb/2026-03-25_Negative_Results_Quant_Algorithm_TTT/README.md b/records/track_10min_16mb/2026-03-25_Negative_Results_Quant_Algorithm_TTT/README.md
new file mode 100644
index 000000000..699a72fcd
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-25_Negative_Results_Quant_Algorithm_TTT/README.md
@@ -0,0 +1,53 @@
+# Non-Record: Negative Results — Quantization Algorithms & TTT on Val-GPTQ Stack
+
+6 experiments on the current SOTA stack (1.1142 BPB, val-calibrated GPTQ + XSA-all + BigramHash 3072), all negative or neutral.
+
+**Base:** Val-Calibrated GPTQ + XSA-all + BigramHash 3072 — 1.1142 BPB (3-seed mean), 8×H100 SXM, 600s
+
+---
+
+## Quantization Algorithm Experiments (All Negative)
+
+Tested whether the GPTQ algorithm itself could be improved. Both methods use the same trained weights and int6 grid — they only change how rounding decisions are made.
+
+| Approach | Sliding BPB | vs Baseline | Why It Failed |
+|----------|-------------|-------------|---------------|
+| Baseline GPTQ (control) | 1.1139 | — | Standard column-wise GPTQ with val-calibrated Hessians |
+| Qronos iterative Hessian (3 iters) | 1.1146 | +0.0007 (worse) | Re-collects H = X^T X after each layer is quantized, so later layers see quantized activations. At int6 the per-layer error is so small (~0.0003 BPB) that iterating doesn't help — the updated Hessians are nearly identical to the original ones. |
+| CDQuant coordinate descent (3 passes) | 1.1144 | +0.0005 (worse) | After GPTQ, revisits each weight and tries flipping its rounding direction. At int6 with 63 levels, the grid spacing is ~0.06 scale units — most weights are already at their optimal grid point. The coordinate descent finds almost nothing to flip. |
+
+**Conclusion:** At int6, the quantization gap is only +0.0036 BPB. Column-wise GPTQ is already near-optimal at this bit-width. Iterative Hessian correction (Qronos) and post-hoc rounding refinement (CDQuant) are designed for aggressive 2-4 bit quantization where cross-layer error compounds. At 6-bit, the error per layer is too small for these methods to improve on.
+
+---
+
+## Test-Time Training Experiments (All Negative)
+
+Legal score-first TTT: score each chunk under `inference_mode()`, THEN train on it. Every token is graded before any adaptation. This is the same protocol that was legal in [PR #549](https://github.com/openai/parameter-golf/pull/549).
+
+| Approach | Params Unfrozen | TTT BPB | Non-TTT Baseline | Delta | Eval Time |
+|----------|-----------------|---------|-------------------|-------|-----------|
+| Full TTT (all params) | 27.1M (100%) | 1.1146 | 1.1145 | +0.0001 (worse) | 445s |
+| MLP-down-only | 8.7M (32%) | 1.1145 | 1.1144 | +0.0001 (neutral) | 424s |
+| MLP-all (up + down) | 17.3M (64%) | 1.1144 | 1.1143 | +0.0001 (neutral) | 422s |
+
+TTT hyperparameters: lr=0.002, epochs=3, chunk_tokens=32768, stride=64.
+
+**Conclusion:** TTT does not help on the val-calibrated GPTQ + XSA-all + BigramHash 3072 stack. This is now **25 total failed TTT attempts** across two stacks:
+- 22 on PR #593 stack (1.1171 BPB) — documented in [PR #670](https://github.com/openai/parameter-golf/pull/670)
+- 3 on val-GPTQ stack (1.1142 BPB) — this work
+
+**Why TTT keeps failing:**
+1. **Score-first constraint:** The model must score each chunk before adapting. Early tokens in each chunk get zero benefit from adaptation.
+2. **Val-calibrated GPTQ interaction:** GPTQ rounding decisions are optimized for val activation patterns. TTT gradient updates shift the dequantized weights away from those optimized rounding points, potentially undoing the val-calibration advantage.
+3. **Catastrophic forgetting at chunk boundaries:** Each 32K-token chunk resets to the base model. The 3-epoch training per chunk overfits to local patterns that don't generalize to the scoring window.
+4. **The base model is already good:** At 1.1142 BPB, the model's predictions are strong enough that test-time adaptation introduces more noise than signal.
+
+---
+
+## Meta-Lessons
+
+1. **GPTQ algorithm is near-optimal at int6.** The remaining quant headroom is in WHAT you quantize to (the grid), not HOW you assign values to grid points (the algorithm).
+
+2. **TTT is dead on this stack.** 25 experiments, zero positive results. The val-calibrated GPTQ + XSA-all combination leaves no room for eval-time weight adaptation.
+
+3. **Seed variance (~0.0003-0.0007 BPB) dominates.** Most "improvements" from quant algorithm tweaks or TTT are within noise. Only techniques that move BPB by >0.001 are distinguishable from random variation.