Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 (val_bpb 1.11473, 3-seed mean) #1019
Open
abaybektursun wants to merge 1 commit into openai:main
…11473 (3-seed mean) AR self-generated calibration (no val/train data during quantization). Recreated from PR openai#728 at @valerio-oai's request for clarity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Nice job, looks clean/valid to my non-expert eyes!
resouer pushed a commit to resouer/parameter-golf that referenced this pull request (Mar 28, 2026):
- AdamW TTT adapts full-precision EMA weights before GPTQ
- Score-first approach (inference_mode then train) for compliance
- Hyperparams: lr=0.0005, epochs=3, chunk=32768, cosine decay
- 3-stage timing: TTT / AR self-gen+GPTQ / final eval
- Uses _HessianGPT (non-banked) for TTT, rebanks for AR self-gen
- Kill criteria: seed=1337 must reach <= 1.1156 BPB
resouer pushed a commit to resouer/parameter-golf that referenced this pull request (Mar 28, 2026):
- copy the old openai#1019 execution path into the experiment branch
- add score-first AdamW TTT on the dequantized int6 eval model
- default TTT on with 1 epoch for the narrow smoke path
- instrument ar_selfgen_gptq/final_eval/post_quant_ttt timing
Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112
val_bpb: 1.1147 (3-seed mean) | ~15.91 MB | 8×H100 SXM, 600s | No TTT
This submission uses only AR (autoregressive) self-generated calibration data. After training, the model autoregressively generates its own calibration tokens. No val data and no train data are accessed during quantization. The calibration study below is provided separately to help the community understand GPTQ calibration — it is not part of this submission.
SOTA (from our PR #549, 3-seed mean): 1.89002 nats. This run: 1.88218 nats. Delta: −0.0078 nats. Clears the 0.005-nat threshold.
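The delta above can be reproduced with a few lines of arithmetic. This is only a sketch: the helper name `clears_threshold` is hypothetical, and it assumes the rule is "improvement of at least 0.005 nats over the prior 3-seed mean".

```python
def clears_threshold(prior_nats: float, new_nats: float, threshold: float = 0.005) -> bool:
    """Return True when the improvement (prior - new) is at least `threshold` nats.

    Assumption: the record rule compares 3-seed means with a >= comparison.
    """
    return (prior_nats - new_nats) >= threshold


prior = 1.89002  # PR #549, 3-seed mean (from the text)
new = 1.88218    # this run, 3-seed mean (from the text)
delta = new - prior  # negative means the new run is better

print(f"delta = {delta:+.5f} nats, clears = {clears_threshold(prior, new)}")
```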
Results (3-seed)
Changes from Prior SOTA (our PR #549)
PR #549 scores 1.1194 BPB using GPTQ-lite + Legal TTT + Parallel Muon + BigramHash(1536) + XSA on last 4 layers. This submission makes three changes and drops TTT:
1. AR Self-Generated Full Hessian GPTQ
PR #549 used GPTQ-lite (diagonal Hessian approximation). We use Full Hessian GPTQ with Cholesky error compensation and column reordering — a strictly better quantizer.
The calibration problem: prior Full Hessian GPTQ implementations (PRs #535, #569, #593, #609) calibrated on training data after the 600s training window, which was ruled illegal. We solve this by having the model generate its own calibration data. After training completes, the model autoregressively generates 64 sequences of 2048 tokens (temperature=0.8, fixed seed). Hessians H = X^T X are collected from these self-generated sequences. No val data and no train data are accessed during quantization.
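The two steps described above (temperature sampling, then accumulating H = X^T X) can be sketched in plain Python. Everything here is a toy stand-in: `toy_next_logits` replaces the trained model, one-hot token vectors replace real layer inputs, and the sizes are far smaller than the 64 × 2048 tokens used in the submission.

```python
import math
import random


def sample_next(logits, temperature, rng):
    """Sample a token id from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(next_logits, seq_len, temperature=0.8, seed=0, bos=0):
    """Autoregressively build one sequence; next_logits(prefix) -> list of logits."""
    rng = random.Random(seed)  # fixed seed makes the calibration set reproducible
    tokens = [bos]
    while len(tokens) < seq_len:
        tokens.append(sample_next(next_logits(tokens), temperature, rng))
    return tokens


def accumulate_hessian(H, X):
    """Add X^T X to H in place; X is a list of activation rows (one per token)."""
    for row in X:
        for i, xi in enumerate(row):
            for j, xj in enumerate(row):
                H[i][j] += xi * xj
    return H


VOCAB = 8


def toy_next_logits(prefix):
    """Hypothetical model: mildly prefers the token after the last one."""
    last = prefix[-1]
    return [1.0 if t == (last + 1) % VOCAB else 0.0 for t in range(VOCAB)]


seqs = [generate(toy_next_logits, seq_len=16, seed=s) for s in range(4)]

H = [[0.0] * VOCAB for _ in range(VOCAB)]
for seq in seqs:
    # Stand-in "activations": one-hot rows; a real run would hook each layer's input.
    X = [[1.0 if t == tok else 0.0 for t in range(VOCAB)] for tok in seq]
    accumulate_hessian(H, X)
```

In a real pipeline one Hessian would be accumulated per quantized linear layer, using that layer's actual inputs while the generated sequences are run through the model.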
2. BigramHash 3072 × 112 (up from 1536)
Lineage: our PR #549 (1536) → PR #609 (2048) → this run (3072 × dim=112). Fits under 16MB; widening the table further cost more in artifact size than it gained in BPB.
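As a rough illustration of what a 3072 × 112 hashed bigram table looks like, the sketch below maps each (previous token, current token) pair to one of 3072 rows of width 112. The hash function, the helper names, and the reading of "3072" as a bucket count are assumptions; the PR's actual hashing scheme is not described here.

```python
NUM_BUCKETS = 3072  # rows in the hashed table (from the text)
DIM = 112           # embedding width per row (from the text)


def bigram_bucket(prev_tok: int, tok: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Hypothetical hash: mix the two token ids, then fold into the table size."""
    return (prev_tok * 1_000_003 + tok) % num_buckets


def bigram_rows(tokens, num_buckets: int = NUM_BUCKETS):
    """Row index for every position after the first (each needs a previous token)."""
    return [bigram_bucket(p, t, num_buckets) for p, t in zip(tokens, tokens[1:])]


table_params = NUM_BUCKETS * DIM  # parameters in the table itself
rows = bigram_rows([17, 4021, 88, 88, 9])

print(f"table parameters: {table_params}, example rows: {rows}")
```

The parameter count (3072 × 112 = 344,064) is what trades off against the 16MB artifact limit as the table grows.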
3. XSA on all 11 layers (up from last 4)
PR #549 applied XSA to the last 4 layers. Extending to all 11 layers forces cross-position information mixing from layer 0 at zero parameter cost. Source: PR #478 by @gowtham0992.
Dropped: TTT
PR #549 used Legal Score-First TTT for −0.0025 BPB. On this stack, TTT is neutral or negative (25 failed attempts across two stacks — see our PR #756). XSA-all already captures the inter-document context patterns that TTT was adapting to. The Full Hessian GPTQ improvement more than compensates for dropping TTT.
Quantization Pipeline
Architecture
Run Command
Community Reference: GPTQ Calibration Study
This section is not part of the submission. It documents our investigation into what calibration data GPTQ actually needs — shared here to help the community, since GPTQ calibration legality has been a recurring question in this competition (PRs #535, #569, #593, #609, #639).
The question
GPTQ calibration was the source of a legality dispute in this competition. PRs #593 and #609 used training data for calibration and were rejected or flagged. We initially used val data instead, which raised its own question: is val-data calibration legal? To answer this definitively, we investigated whether the model can calibrate itself with no external data at all — which is what the submission above does.
Single-checkpoint ablation
Same trained weights (seed 314), 5 calibration methods, no retraining. This ablation isolates calibration source on a single checkpoint.
Confirmed on a second checkpoint (BigramHash 2048×128, 8×H100) with consistent relative gaps: val 1.11626, AR 1.11657, random 1.11816.
Val-calibrated 3-seed results (not submitted, reference only)
For comparison, the same stack with val-data GPTQ calibration instead of AR self-gen:
AR self-gen is 0.0006 BPB worse than val-calibrated. Both clear the SOTA threshold.
Full quantization pipeline comparison
Findings
Autoregressive self-generation closes 84% of the val-vs-random gap (0.0017 of 0.0020 BPB). The gap between val-calibrated and random-token calibration is predominantly natural language vs random noise. Coherent text from the model's own distribution produces Hessians nearly identical to val data.
The remaining 0.0003 BPB is P_model vs P_data divergence. The model's output distribution is a 27M-parameter approximation of the FineWeb data distribution. This small residual gap measures how far the model's internal activation patterns have drifted from those of real text. It is negligible.
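The gap-closure arithmetic can be checked against the second-checkpoint numbers quoted earlier (val 1.11626, AR 1.11657, random 1.11816). This sketch uses only those three figures; the variable names are illustrative.

```python
val_bpb = 1.11626     # val-data calibration (second checkpoint, from the text)
ar_bpb = 1.11657      # AR self-generated calibration
random_bpb = 1.11816  # random-token calibration

gap = random_bpb - val_bpb     # total val-vs-random gap
closed = random_bpb - ar_bpb   # part of the gap recovered by AR self-gen
residual = ar_bpb - val_bpb    # what remains between AR and val
fraction_closed = closed / gap

print(
    f"gap={gap:.5f} closed={closed:.5f} "
    f"residual={residual:.5f} fraction={fraction_closed:.0%}"
)
```

On these figures the closed fraction comes out at about 84% with a residual of roughly 0.0003 BPB, in line with the findings above.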
Gibbs refinement does not help (1.1166 vs 1.1165 for plain random). Gibbs replaces tokens in-place conditioned on still-mostly-random neighbors — it does not produce coherent text. Autoregressive generation builds coherent sequences left-to-right, which is what produces natural-language-like activations.
More random tokens do not help. 131K and 25M tokens give identical BPB (1.1165). The Hessian converges quickly at int6 — it mainly needs to identify dead columns and relative importance, which are properties of the model's weights, not input statistics.
Every calibration method beats SOTA. Even the worst (random tokens, 1.1165) beats the previous SOTA (our PR #549, 1.1194) by 0.003 BPB.
See our PR #756 for additional negative results (Qronos, CDQuant, TTT, Spectral Init, SLOT) on this stack.
🤖 Generated with Claude Code