Full GPTQ + XSA-all + SWA/EMA + Score-First TTT (compliant BPB=1.1175, 1-seed)#639
Open
Robby955 wants to merge 1 commit into openai:main from
Conversation
GPTQ/TTT interaction study with three key findings:

1. Full GPTQ halves the quantization gap (0.008 → 0.004 BPB)
2. AdamW TTT catastrophically destroys GPTQ-calibrated weights (+0.076 BPB)
3. SGD TTT preserves GPTQ quality; Born-rule SNR² provides conservative scaling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author
Compliance Update

After reviewing the organizer ruling on PR #606 (GPTQ calibration must fit within the training time budget), I've re-run with compliant timing: training ends at 560s and GPTQ calibration runs from 560s to 600s, within the 600s training budget.

Compliant results (560s training + 40s GPTQ calibration = 600s total):

TTT config (matching PR #606's recipe): AdamW lr=1e-4, 3 epochs, zero weight decay, freeze the first 9 of 11 blocks, 128K-token chunks.

The original 1.1158 score came from a 600s training run where GPTQ calibration ran outside the training budget, the same issue that affected PR #606. The compliant score with the same architecture is 1.1175. 3-seed validation of the compliant config is in progress; I will update the PR title and body once complete.
Results (Compliant)
Best compliant score: 1.1175 BPB (single seed 1337, 3-seed validation in progress)
Compliance
Training time budget: 600s total on 8×H100 SXM (560s training + 40s GPTQ calibration).
Test-time training (TTT):
Score-first protocol: the model scores each validation chunk before adapting on it. No token is ever re-scored after adaptation. This follows the causal/streaming TTT pattern confirmed legal by the organizers (Issues #402 and #677). Full-epoch TTT (training on all validation data before scoring) is NOT used.
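The ordering constraint can be sketched as a streaming loop. This is a minimal illustration with a toy model; the class and function names are assumptions, not taken from this PR's code — only the score-before-adapt ordering is the point.

```python
class ToyModel:
    """Stand-in for the language model; only the score/adapt ordering matters."""

    def __init__(self):
        self.bias = 0.0

    def score(self, chunk):
        # Stand-in for computing BPB on the chunk with the CURRENT weights.
        return abs(sum(chunk) / len(chunk) - self.bias)

    def adapt(self, chunk):
        # Stand-in for the TTT update; runs only AFTER the chunk is scored.
        self.bias += 0.5 * (sum(chunk) / len(chunk) - self.bias)


def score_first_ttt(model, chunks):
    """Score each chunk before adapting on it; no chunk is ever re-scored."""
    scores = []
    for chunk in chunks:
        scores.append(model.score(chunk))  # score first ...
        model.adapt(chunk)                 # ... then adapt; affects later chunks only
    return sum(scores) / len(scores)
```

Because adaptation happens strictly after scoring, each chunk's BPB is computed with weights that have never seen that chunk, which is what makes the protocol causal/streaming rather than full-epoch.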
Artifact: 15.87 MB (code: ~94 KB, compressed weights: ~15.78 MB), under the 16,000,000-byte limit.
Key Contributions
1. Full GPTQ halves the quantization gap (0.008 → 0.004 BPB)
Cholesky-based GPTQ with act-order column permutation and block-wise error compensation (block_size=128). Calibrated on 256 training batches with 1% diagonal damping.
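A simplified sketch of the GPTQ column loop in NumPy, under stated assumptions: one symmetric per-tensor scale for brevity, and per-column updates instead of the block-wise (block_size=128) batching the PR describes. The function name and signature are illustrative.

```python
import numpy as np

def gptq_quantize(W, X, nbits=4, damp=0.01):
    """Simplified GPTQ: act-order column permutation plus Cholesky-based
    error compensation. W: (rows, cols) weights; X: (cols, nsamples)
    calibration activations."""
    cols = W.shape[1]
    H = X @ X.T                                      # proxy Hessian
    H += damp * np.mean(np.diag(H)) * np.eye(cols)   # 1% diagonal damping
    perm = np.argsort(-np.diag(H))                   # act-order: big diagonals first
    W = W[:, perm].copy()
    H = H[np.ix_(perm, perm)]
    U = np.linalg.cholesky(np.linalg.inv(H)).T       # upper-triangular factor of H^-1
    qmax = 2 ** (nbits - 1) - 1
    scale = np.max(np.abs(W)) / qmax                 # one symmetric scale, for brevity
    Q = np.zeros_like(W)
    for i in range(cols):
        w = W[:, i]
        q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        Q[:, i] = q
        err = (w - q) / U[i, i]
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])  # compensate later columns
    return Q[:, np.argsort(perm)]                    # undo the permutation
```

The full version batches the compensation updates in blocks of 128 columns for efficiency; the math per column is the same.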
2. XSA on all 11 layers
Cross-Sequence Attention on all 11 transformer layers (vs. the last 4 in the baseline). Provides extended context beyond the training sequence length at eval time. Worth about -0.0013 BPB.
3. SWA/EMA weight blending
Stochastic Weight Averaging over the final warmdown phase (16 snapshots every 50 steps), blended 50/50 with EMA (decay=0.997). Smooths weight landscape before quantization.
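A minimal sketch of the snapshot averaging and the 50/50 blend, with plain dicts of floats standing in for state dicts; function names are illustrative, not from this PR's code.

```python
def ema_update(ema, weights, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return {k: decay * ema[k] + (1.0 - decay) * weights[k] for k in ema}

def swa_average(snapshots):
    """Uniform average of the warmdown snapshots (16 taken every 50 steps)."""
    n = len(snapshots)
    return {k: sum(s[k] for s in snapshots) / n for k in snapshots[0]}

def blend(swa, ema, alpha=0.5):
    """50/50 SWA/EMA blend applied to the weights before quantization."""
    return {k: alpha * swa[k] + (1.0 - alpha) * ema[k] for k in swa}
```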
4. Score-first TTT (legal)
Sequential online adaptation: score chunk, then train on it with AdamW (lr=1e-4, 3 epochs, freeze first 9/11 blocks, 128K token chunks). Improves sliding BPB by -0.0007.
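The 9-of-11-block freeze can be sketched as a name filter over model parameters. The `blocks.<i>.` naming convention and the choice to also freeze non-block parameters (embeddings, head) are assumptions for illustration, not taken from this PR's code.

```python
def trainable_param_names(param_names, n_frozen=9):
    """Return the parameter names left trainable for TTT: only those in
    transformer blocks with index >= n_frozen. Non-block parameters
    (e.g. embeddings, lm_head) are assumed frozen here as well."""
    trainable = set()
    for name in param_names:
        parts = name.split(".")
        if parts[0] == "blocks" and int(parts[1]) >= n_frozen:
            trainable.add(name)
    return trainable
```

The resulting name set would then be used to build the AdamW (lr=1e-4) parameter group for the per-chunk adaptation steps.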
Architecture & Training
Credits
🤖 Generated with Claude Code