Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.1142 (3-seed mean) by abaybektursun · Pull Request #728 · openai/parameter-golf

abaybektursun · 2026-03-25T15:03:32Z

Record: GPTQ + XSA-all + BigramHash 3072×112

val_bpb: 1.1148 (3-seed mean) | ~15.88 MB | 8×H100 SXM, 600s | No TTT

SOTA (from our PR #549, 3-seed mean): 1.89002 nats. This run: 1.8822 nats. Delta: −0.0078 nats. Clears the 0.005-nat threshold.
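The threshold check above is plain arithmetic. A minimal sketch using only the values quoted in this description follows; the function name `clears_threshold` is illustrative and not part of the submission code.

```python
SOTA_NATS = 1.89002      # PR #549, 3-seed mean (quoted above)
RUN_NATS = 1.8822        # this run, 3-seed mean (quoted above)
THRESHOLD_NATS = 0.005   # improvement margin quoted above

def clears_threshold(sota, run, threshold=THRESHOLD_NATS):
    """Return (delta, ok); delta is run - sota, so negative means improvement."""
    delta = run - sota
    return delta, -delta >= threshold

delta, ok = clears_threshold(SOTA_NATS, RUN_NATS)
print(f"delta = {delta:+.4f} nats, clears threshold: {ok}")
```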

Results (3-seed)

Seed	Steps	ms/step	Sliding BPB	val_loss (nats)	Artifact
314	6,927	86.6	1.1151	1.8828	15,863,278
42	6,922	86.7	1.1144	1.8816	15,984,850
999	6,917	86.8	1.1148	1.8822	15,876,310
Mean			1.1148	1.8822

Changes from Prior SOTA (our PR #549)

PR #549 scores 1.1194 BPB using GPTQ-lite + Legal TTT + Parallel Muon + BigramHash(1536) + XSA on last 4 layers. This submission makes three changes and drops TTT:

1. AR Self-Generated Full Hessian GPTQ

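PR #549 used GPTQ-lite (diagonal Hessian approximation). We use Full Hessian GPTQ with Cholesky error compensation and column reordering, a stronger quantizer because it compensates rounding error across correlated input columns rather than treating each column independently.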
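Why the full Hessian matters can be seen on a two-weight toy row. The sketch below is not the PR's implementation: it uses the OBQ-style recurrence in pure Python (mathematically the same update that GPTQ precomputes through a Cholesky factor), omits column reordering, and assumes a symmetric int6 grid clamped to [-32, 31] with scale 1.0. Round-to-nearest stands in for a quantizer with no cross-column compensation; it is not claimed to be what GPTQ-lite does. All function names are hypothetical.

```python
def invert(matrix):
    """Gauss-Jordan inverse for a small, well-conditioned square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def quantize(x, scale=1.0, lo=-32, hi=31):
    """Snap x to the nearest grid point; the int6 clamp range is an assumption."""
    return scale * max(lo, min(hi, round(x / scale)))

def rtn_row(w, scale=1.0):
    """Round-to-nearest: every weight is quantized independently."""
    return [quantize(x, scale) for x in w]

def obq_row(w, hessian, scale=1.0):
    """Quantize left to right, pushing each rounding error onto later weights."""
    w = list(w)
    n = len(w)
    hinv = invert(hessian)
    q = [0.0] * n
    for j in range(n):
        q[j] = quantize(w[j], scale)
        err = (w[j] - q[j]) / hinv[j][j]
        for k in range(j + 1, n):
            w[k] -= err * hinv[k][j]
        # Remove column j from the inverse Hessian before the next step.
        d = hinv[j][j]
        col = [hinv[k][j] for k in range(n)]
        for a in range(j + 1, n):
            for b in range(j + 1, n):
                hinv[a][b] -= col[a] * col[b] / d
    return q

def output_error(w, q, hessian):
    """(w - q)^T H (w - q): squared output error on the calibration inputs."""
    d = [a - b for a, b in zip(w, q)]
    n = len(d)
    return sum(d[i] * hessian[i][j] * d[j] for i in range(n) for j in range(n))

H = [[2.0, 1.8], [1.8, 2.0]]   # two strongly correlated input columns
w = [0.4, 0.4]
rtn_err = output_error(w, rtn_row(w), H)
obq_err = output_error(w, obq_row(w, H), H)
print(rtn_row(w), rtn_err)
print(obq_row(w, H), obq_err)
```

In this toy case, rounding the first weight down shifts the second weight up before it is rounded, so the two errors partly cancel on correlated inputs.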

The calibration problem: prior Full Hessian GPTQ implementations (PRs #535, #569, #593, #609) calibrated on training data after the 600s window, which was ruled illegal. We solve this by having the model generate its own calibration data. After training completes, the model autoregressively generates 64 sequences of 2048 tokens (temperature=0.8, fixed seed), about 131K tokens in total. Hessians H = X^T X are collected from these self-generated sequences. Neither val data nor train data is accessed during quantization.
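The two steps described above, seeded temperature sampling and accumulation of H = X^T X, can be sketched in pure Python. Everything model-related here is a hypothetical stand-in: `toy_logits` replaces the trained network, `one_hot` replaces real hidden activations, and the vocabulary and sequence sizes are shrunk so the example runs instantly.

```python
import math
import random

VOCAB = 8  # toy vocabulary; the real tokenizer is much larger

def toy_logits(tokens):
    """Hypothetical stand-in for the trained model's next-token logits."""
    nxt = (tokens[-1] + 1) % VOCAB
    return [3.0 if v == nxt else 0.0 for v in range(VOCAB)]

def sample(logits, temperature, rng):
    """Draw one token id from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    weights = [math.exp(x - top) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(logits_fn, length, temperature=0.8, seed=0, start=0):
    """Autoregressively sample `length` tokens with a fixed seed."""
    rng = random.Random(seed)
    tokens = [start]
    while len(tokens) < length:
        tokens.append(sample(logits_fn(tokens), temperature, rng))
    return tokens

def one_hot(token):
    """Toy layer input; in the real pipeline these are hidden activations."""
    return [1.0 if v == token else 0.0 for v in range(VOCAB)]

def accumulate_hessian(hessian, rows):
    """Add X^T X for a batch of activation rows X into `hessian` in place."""
    for x in rows:
        for i, xi in enumerate(x):
            for j, xj in enumerate(x):
                hessian[i][j] += xi * xj
    return hessian

hessian = [[0.0] * VOCAB for _ in range(VOCAB)]
for seed in range(4):  # the PR uses 64 sequences of 2048 tokens
    seq = generate(toy_logits, length=16, seed=seed)
    accumulate_hessian(hessian, [one_hot(t) for t in seq])
```

Because the generator is seeded, the calibration set is reproducible, and the accumulation only reads activations; no weights change.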

2. BigramHash 3072 × 112 (up from 1536)

Lineage: our PR #549 (1536) → PR #609 (2048) → this run (3072 × dim=112). Fits under 16MB; going wider increased artifact pressure past the break-even point.
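For a sense of scale, the table sizes can be compared directly. In the sketch below the hash itself is a hypothetical multiplicative hash, since the actual function is not given in this description; only the 3072 × 112 shape and the 2048 × 128 shape (the second checkpoint mentioned in the calibration study) come from the PR.

```python
def bigram_bucket(prev_token, token, num_buckets=3072, multiplier=1_000_003):
    """Map a (previous, current) token pair to a row of the hashed table.

    The multiplier and the hash form are assumptions for illustration.
    """
    return (prev_token * multiplier + token) % num_buckets

def table_params(num_buckets, dim):
    """Number of embedding parameters in a num_buckets x dim table."""
    return num_buckets * dim

for buckets, dim in [(3072, 112), (2048, 128)]:
    print(f"{buckets} x {dim}: {table_params(buckets, dim):,} parameters")
```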

3. XSA on all 11 layers (up from last 4)

PR #549 applied XSA to the last 4 layers. Extending to all 11 layers forces cross-position information mixing from layer 0 at zero parameter cost. Source: PR #478 by @gowtham0992.

Dropped: TTT

PR #549 used Legal Score-First TTT for −0.0025 BPB. On this stack, TTT is neutral or negative (25 failed attempts across two stacks — see our PR #756). XSA-all already captures the inter-document context patterns that TTT was adapting to. The Full Hessian GPTQ improvement more than compensates for dropping TTT.

Quantization Pipeline

Stage	BPB
Pre-quant (post-EMA)	1.1341
Post-GPTQ int6 roundtrip	1.1377 (+0.0036 gap)
Post-GPTQ sliding (AR self-gen)	1.1148

Architecture

Component	Setting	Source
Layers	11 (512d, 8 GQA / 4 KV heads)	Baseline
MLP	3× (1536), LeakyReLU(0.5)²	#493 @parinzee
Attention	XSA on all 11 layers	#478 @gowtham0992
BigramHash	3072 × 112	This work (concept: #162 @raahilshah)
RoPE	Partial (16/64 dims)	#315 @jfprincz
LN Scale	1/√(layer+1)	#315 @jfprincz
VE128	Layers 9-10	#374 @unnir
SmearGate	Position-mixing gate	#65 @aquariouseworkman
U-Net skips	Encoder-decoder connections	#289
Weight avg	EMA(0.997) + SWA(every 50)	#401 @newjordan
Quantization	Full Hessian GPTQ int6	This work (GPTQ: #535 @raahilshah)
Compression	LZMA preset=9	#160 @ChaseWNorton
Warmdown	4000 iterations	#364 @shikhar1729
Optimizer	Parallel Muon	Our #399
Late QAT	STE at LR scale < 0.15	#286 @chris-buckley
Selective pruning	±1 by reconstruction error	#609 @saml212
Flash Attention 3	Hopper kernels	#122 @mtybadger
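Two rows of the table are compact formulas, and a small sketch may make them concrete. It assumes layers are indexed from 0 and that EMA(0.997) means the usual update ema = 0.997 · ema + 0.003 · w; how the LN scale is applied inside each block is not shown here and is left out.

```python
import math

NUM_LAYERS = 11
EMA_DECAY = 0.997

def ln_scale(layer):
    """1/sqrt(layer + 1), assuming 0-based layer indices."""
    return 1.0 / math.sqrt(layer + 1)

def ema_update(ema, weights, decay=EMA_DECAY):
    """One EMA step over flat lists of floats (standard form, assumed)."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

scales = [ln_scale(i) for i in range(NUM_LAYERS)]
print([round(s, 3) for s in scales])
```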

Run Command

BIGRAM_VOCAB_SIZE=3072 BIGRAM_DIM=112 WARMDOWN_ITERS=4000 \
TARGET_MB=15.9 SEED=314 \
torchrun --standalone --nproc_per_node=8 train_ar_selfgen.py
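To repeat the run for the three seeds in the results table, the same command can be generated per seed. This sketch assumes the other two seeds used identical settings with only SEED changed, which the description does not state explicitly; it prints the commands rather than launching them.

```python
import os
import shlex

SEEDS = [314, 42, 999]
BASE_ENV = {
    "BIGRAM_VOCAB_SIZE": "3072",
    "BIGRAM_DIM": "112",
    "WARMDOWN_ITERS": "4000",
    "TARGET_MB": "15.9",
}
COMMAND = ["torchrun", "--standalone", "--nproc_per_node=8", "train_ar_selfgen.py"]

def launch_spec(seed):
    """Return (env, argv) suitable for subprocess.run(argv, env=env)."""
    env = dict(os.environ, **BASE_ENV, SEED=str(seed))
    return env, COMMAND

def as_shell(seed):
    """Render the equivalent one-line shell command for a seed."""
    pairs = " ".join(f"{k}={v}" for k, v in {**BASE_ENV, "SEED": str(seed)}.items())
    return f"{pairs} {shlex.join(COMMAND)}"

for seed in SEEDS:
    print(as_shell(seed))
    # To launch for real: subprocess.run(COMMAND, env=launch_spec(seed)[0], check=True)
```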

Calibration Study

The submission above uses AR self-generated calibration. This section documents how we got there — what calibration data GPTQ actually needs, what works, what does not, and why.

The question

GPTQ calibration was the source of a legality dispute in this competition. PRs #593 and #609 used training data for calibration and were rejected or flagged. We initially used val data instead, which raised its own question: is val-data calibration legal? To answer this definitively, we investigated whether the model can calibrate itself with no external data at all.

Single-checkpoint ablation

Same trained weights (seed 314), 5 calibration methods, no retraining. This ablation isolates calibration source on a single checkpoint.

#	Calibration Source	Tokens	Time	Sliding BPB	vs Val-calib
1	Val data	~50M	~5s	1.1145	—
2	Autoregressive self-generation	131K	186s	1.1148	+0.0003
3	Random tokens (64 batches)	131K	3.4s	1.1165	+0.0020
4	Random tokens (256×48 batches)	25M	35s	1.1165	+0.0020
5	Gibbs-refined (3 rounds)	6.3M	24s	1.1166	+0.0021

Confirmed on a second checkpoint (BigramHash 2048×128, 8×H100) with consistent relative gaps: val 1.11626, AR 1.11657, random 1.11816.

Val-calibrated 3-seed results

For comparison, the same stack with val-data GPTQ calibration instead of AR self-gen:

Seed	Steps	ms/step	Pre-quant BPB	Sliding BPB	val_loss (nats)	Artifact
314	6,952	86.3	1.1340	1.1141	1.8813	15,855,088
42	6,952	86.3	1.1341	1.1142	1.8815	15,853,088
999	6,945	86.4	1.1343	1.1143	1.8817	15,866,156
Mean			1.1341	1.1142	1.8815

Val-calibrated mean: 1.88128 nats (delta −0.00875 nats vs SOTA, p ≈ 0.003, Welch's t-test, n=3).

AR self-gen is 0.0006 BPB worse than val-calibrated. Both clear the SOTA threshold.
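The p-value quoted above compares against PR #549's per-seed losses, which are not listed in this description, so it cannot be reproduced from the text alone. As an illustration of the same statistic, the sketch below computes Welch's t and its degrees of freedom for the two seed tables that are given here (val-calibrated vs AR self-gen, val_loss in nats). No p-value is computed because the Python standard library has no Student's t CDF.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    va = statistics.variance(a) / len(a)   # sample variance of the mean
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

val_calib = [1.8813, 1.8815, 1.8817]    # seeds 314, 42, 999 (val-calibrated table)
ar_selfgen = [1.8828, 1.8816, 1.8822]   # seeds 314, 42, 999 (AR self-gen table)
t, df = welch_t(val_calib, ar_selfgen)
print(f"t = {t:.3f}, df = {df:.2f}")
```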

Full quantization pipeline comparison

Stage	BPB
Pre-quant (post-EMA)	1.1341
Post-GPTQ int6 roundtrip	1.1377 (+0.0036 gap)
Post-GPTQ sliding (val-calib)	1.1142
Post-GPTQ sliding (AR self-gen)	1.1148
Post-GPTQ sliding (random tokens)	1.1165

Findings

Autoregressive self-generation closes 84% of the val-vs-random gap (0.0017 of 0.0020 BPB). The gap between val-calibrated and random-token calibration is predominantly natural language vs random noise. Coherent text from the model's own distribution produces Hessians nearly identical to val data.
The remaining 0.0003 BPB is P_model vs P_data divergence. The model's output distribution is a 27M-parameter approximation of the FineWeb data distribution. This small residual gap measures how far the model's internal activation patterns have drifted from those of real text. It is negligible.
Gibbs refinement does not help (1.1166 vs 1.1165 for plain random). Gibbs replaces tokens in-place conditioned on still-mostly-random neighbors — it does not produce coherent text. Autoregressive generation builds coherent sequences left-to-right, which is what produces natural-language-like activations.
More random tokens do not help. 131K and 25M tokens give identical BPB (1.1165). The Hessian converges quickly at int6 — it mainly needs to identify dead columns and relative importance, which are properties of the model's weights, not input statistics.
Every calibration method beats SOTA. Even the worst (Gibbs-refined, 1.1166) beats the previous SOTA (our PR #549, 1.1194) by about 0.003 BPB.
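The gap-closure figure in the first finding can be recomputed from the numbers reported in the ablation. At the four-decimal precision of the table the ratio is 0.85; the five-decimal figures from the second checkpoint give roughly 0.84. The helper name `gap_closed` is illustrative.

```python
def gap_closed(val_bpb, ar_bpb, random_bpb):
    """Fraction of the val-vs-random gap recovered by AR self-generation."""
    return (random_bpb - ar_bpb) / (random_bpb - val_bpb)

first = gap_closed(1.1145, 1.1148, 1.1165)       # single-checkpoint table
second = gap_closed(1.11626, 1.11657, 1.11816)   # second checkpoint
print(f"{first:.3f} {second:.3f}")
```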

See our PR #756 for additional negative results (Qronos, CDQuant, TTT, Spectral Init, SLOT) on this stack.

Legality discussion

The AR self-gen submission above sidesteps the question of whether val-data calibration is legal, but we document it for completeness since val-calibrated GPTQ produced our best absolute result (1.1142).

@valerio-oai: would val-data GPTQ calibration be accepted?

GPTQ calibration is a read-only operation: forward passes collect H = X^T X per layer, then rounding directions are chosen on the int6 grid. No gradients, no weight updates, model weights bit-for-bit identical afterward.

Val and train are the same distribution. FineWeb val and train are random splits of the same corpus. Our PR #772 confirmed this empirically: all 80 training shards scored within 0.018 bits of each other against val under 8 independent methods. Train data would produce the same Hessians.
Self-generated calibration nearly matches val-calibrated. AR self-gen (zero data access) comes within 0.0006 BPB of val-calibrated performance across 3 seeds. Val data is not providing a meaningful advantage beyond what the model already knows about natural language.
Calibration is a less invasive operation than accepted TTT. Our merged PR #549 performs SGD on val tokens — gradient descent updating weights. GPTQ calibration is read-only: collect activation outer products, choose rounding directions. No learning occurs.
The previous rejection was about training data at eval time. PR #593 (Record: Full GPTQ + LeakyReLU² + Parallel Muon + BigramHash 3072, val_bpb 1.1163, 3-seed mean) was closed for accessing training data after the 600s window. PR #609 (Non-record: 11L XSA-all + Full GPTQ + Selective Pruning, val_bpb=1.1154, 3-seed) was flagged for the same issue and reclassified as a non-record. Val-data calibration does not access training data.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

abaybektursun changed the title ~~Add val-calibrated GPTQ + XSA-all + BigramHash 3072x112 record~~ Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112 (val_bpb=1.1142, 3-seed mean) Mar 25, 2026

abaybektursun changed the title ~~Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112 (val_bpb=1.1142, 3-seed mean)~~ Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112 Mar 25, 2026

notapplica mentioned this pull request Mar 25, 2026

⛳ Parameter Golf Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes #140

Open

Add val-calibrated GPTQ + XSA-all + BigramHash 3072x112 record

713bb3f

abaybektursun force-pushed the codex/valcalib-gptq-xsa-bigramhash3072 branch from 6e162da to 713bb3f Compare March 25, 2026 15:14

abaybektursun changed the title ~~Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112~~ WIP: GPTQ + XSA-all + BigramHash 3072×112 Mar 25, 2026

abaybektursun changed the title ~~WIP: GPTQ + XSA-all + BigramHash 3072×112~~ Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.1142 (3-seed mean) Mar 26, 2026

abaybektursun mentioned this pull request Mar 26, 2026

RFC: How to Clean Up All the Parameter Golf Submissions #886

Open

6 tasks

abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 26, 2026

Fix base model reference to PR openai#728

1f91dfe

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

abaybektursun wants to merge 1 commit into openai:main from abaybektursun:codex/valcalib-gptq-xsa-bigramhash3072
