From e16f7aac737d8a8e960f740c0828c61b3bdc524b Mon Sep 17 00:00:00 2001
From: Abay Bektursun
Date: Wed, 25 Mar 2026 13:17:43 -0500
Subject: [PATCH] =?UTF-8?q?Non-record:=20Negative=20results=20=E2=80=94=20?=
 =?UTF-8?q?quantization=20algorithms=20&=20TTT=20on=20val-GPTQ=20stack?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

6 experiments on the current SOTA stack (1.1142 BPB), all negative:
- Qronos iterative Hessian (3 iters): +0.0007 worse
- CDQuant coordinate descent (3 passes): +0.0005 worse
- Full TTT (all params): +0.0001 worse
- MLP-down-only TTT: +0.0001 neutral
- MLP-all TTT: +0.0001 neutral

Key finding: At int6, GPTQ algorithm is near-optimal. Remaining quant
headroom is in the grid (what values to quantize to), not the algorithm
(how to assign).

TTT is dead on this stack — 25 total failed attempts across two stacks.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../README.md | 53 +++++++++++++++++++
 1 file changed, 53 insertions(+)
 create mode 100644 records/track_10min_16mb/2026-03-25_Negative_Results_Quant_Algorithm_TTT/README.md

diff --git a/records/track_10min_16mb/2026-03-25_Negative_Results_Quant_Algorithm_TTT/README.md b/records/track_10min_16mb/2026-03-25_Negative_Results_Quant_Algorithm_TTT/README.md
new file mode 100644
index 000000000..699a72fcd
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-25_Negative_Results_Quant_Algorithm_TTT/README.md
@@ -0,0 +1,53 @@
+# Non-Record: Negative Results — Quantization Algorithms & TTT on Val-GPTQ Stack
+
+6 experiments on the current SOTA stack (1.1142 BPB, val-calibrated GPTQ + XSA-all + BigramHash 3072), all negative or neutral.
+
+**Base:** Val-Calibrated GPTQ + XSA-all + BigramHash 3072 — 1.1142 BPB (3-seed mean), 8×H100 SXM, 600s
+
+---
+
+## Quantization Algorithm Experiments (All Negative)
+
+Tested whether the GPTQ algorithm itself could be improved. Both methods use the same trained weights and int6 grid — they only change how rounding decisions are made.
+
+| Approach | Sliding BPB | vs Baseline | Why It Failed |
+|----------|-------------|-------------|---------------|
+| Baseline GPTQ (control) | 1.1139 | — | Standard column-wise GPTQ with val-calibrated Hessians |
+| Qronos iterative Hessian (3 iters) | 1.1146 | +0.0007 (worse) | Re-collects H = X^T X after each layer is quantized, so later layers see quantized activations. At int6 the per-layer error is so small (~0.0003 BPB) that iterating doesn't help — the updated Hessians are nearly identical to the original ones. |
+| CDQuant coordinate descent (3 passes) | 1.1144 | +0.0005 (worse) | After GPTQ, revisits each weight and tries flipping its rounding direction. At int6 with 63 levels, the grid spacing is ~0.06 scale units — most weights are already at their optimal grid point. The coordinate descent finds almost nothing to flip. |
+
+**Conclusion:** At int6, the quantization gap is only +0.0036 BPB. Column-wise GPTQ is already near-optimal at this bit-width. Iterative Hessian correction (Qronos) and post-hoc rounding refinement (CDQuant) are designed for aggressive 2-4 bit quantization where cross-layer error compounds. At 6-bit, the error per layer is too small for these methods to improve on.
+
+---
+
+## Test-Time Training Experiments (All Negative)
+
+Legal score-first TTT: score each chunk under `inference_mode()`, THEN train on it. Every token is graded before any adaptation. This is the same protocol that was legal in [PR #549](https://github.com/openai/parameter-golf/pull/549).
+
+| Approach | Params Unfrozen | TTT BPB | Non-TTT Baseline | Delta | Eval Time |
+|----------|-----------------|---------|-------------------|-------|-----------|
+| Full TTT (all params) | 27.1M (100%) | 1.1146 | 1.1145 | +0.0001 (worse) | 445s |
+| MLP-down-only | 8.7M (32%) | 1.1145 | 1.1144 | +0.0001 (neutral) | 424s |
+| MLP-all (up + down) | 17.3M (64%) | 1.1144 | 1.1143 | +0.0001 (neutral) | 422s |
+
+TTT hyperparameters: lr=0.002, epochs=3, chunk_tokens=32768, stride=64.
+
+**Conclusion:** TTT does not help on the val-calibrated GPTQ + XSA-all + BigramHash 3072 stack. This is now **25 total failed TTT attempts** across two stacks:
+- 22 on PR #593 stack (1.1171 BPB) — documented in [PR #670](https://github.com/openai/parameter-golf/pull/670)
+- 3 on val-GPTQ stack (1.1142 BPB) — this work
+
+**Why TTT keeps failing:**
+1. **Score-first constraint:** The model must score each chunk before adapting. Early tokens in each chunk get zero benefit from adaptation.
+2. **Val-calibrated GPTQ interaction:** GPTQ rounding decisions are optimized for val activation patterns. TTT gradient updates shift the dequantized weights away from those optimized rounding points, potentially undoing the val-calibration advantage.
+3. **Catastrophic forgetting at chunk boundaries:** Each 32K-token chunk resets to the base model. The 3-epoch training per chunk overfits to local patterns that don't generalize to the scoring window.
+4. **The base model is already good:** At 1.1142 BPB, the model's predictions are strong enough that test-time adaptation introduces more noise than signal.
+
+---
+
+## Meta-Lessons
+
+1. **GPTQ algorithm is near-optimal at int6.** The remaining quant headroom is in WHAT you quantize to (the grid), not HOW you assign values to grid points (the algorithm).
+
+2. **TTT is dead on this stack.** 25 experiments, zero positive results. The val-calibrated GPTQ + XSA-all combination leaves no room for eval-time weight adaptation.
+
+3. **Seed variance (~0.0003-0.0007 BPB) dominates.** Most "improvements" from quant algorithm tweaks or TTT are within noise. Only techniques that move BPB by >0.001 are distinguishable from random variation.
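
The noise-floor rule in the last meta-lesson can be applied mechanically to the deltas reported in the two tables. The sketch below is illustrative only and is not part of the patch: `NOISE_FLOOR_BPB`, `is_distinguishable`, and `classify` are hypothetical names, the 0.001 threshold is taken from the meta-lesson, and the deltas are copied from the tables.

```python
# Illustrative sketch, not part of the patch. All names here are hypothetical.
NOISE_FLOOR_BPB = 0.001  # threshold quoted in the seed-variance meta-lesson

# (approach, delta vs. baseline in BPB), copied from the README tables
DELTAS = [
    ("Qronos iterative Hessian (3 iters)", 0.0007),
    ("CDQuant coordinate descent (3 passes)", 0.0005),
    ("Full TTT (all params)", 0.0001),
    ("MLP-down-only TTT", 0.0001),
    ("MLP-all TTT", 0.0001),
]


def is_distinguishable(delta_bpb: float, noise_floor: float = NOISE_FLOOR_BPB) -> bool:
    """Return True only when |delta| strictly exceeds the noise floor."""
    return abs(delta_bpb) > noise_floor


def classify(deltas, noise_floor: float = NOISE_FLOOR_BPB) -> dict:
    """Map each approach name to 'signal' or 'within noise'."""
    return {
        name: "signal" if is_distinguishable(delta, noise_floor) else "within noise"
        for name, delta in deltas
    }


if __name__ == "__main__":
    for name, verdict in classify(DELTAS).items():
        print(f"{name}: {verdict}")
```

Under this rule every reported delta falls inside the noise band, so the "worse" and "neutral" labels in the tables may not reflect a real difference in either direction.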