Full GPTQ + XSA-all + SWA/EMA + Score-First TTT (compliant BPB=1.1175, 1-seed) #639

Open

Robby955 wants to merge 1 commit into openai:main from Robby955:gptq-ebttt

Conversation

@Robby955 Robby955 commented Mar 24, 2026

Results (Compliant)

| Config | Seed | Sliding BPB | Artifact | Train Time | Status |
|---|---|---|---|---|---|
| Score-first TTT (3 epochs) | 1337 | 1.1175 | 15.87 MB | 560s train + 40s GPTQ = 600s | Compliant |
| No TTT | 1337 | 1.1182 | 15.93 MB | 560s train + 40s GPTQ = 600s | Compliant |

Best compliant score: 1.1175 BPB (single seed 1337, 3-seed validation in progress)

Note: Previous headline numbers (1.1158, 3-seed mean 1.1163) were from runs where GPTQ calibration ran outside the 10-minute training budget. Those numbers have been retracted per the organizer ruling in Issue #677. All figures above use compliant timing where GPTQ calibration is counted within the training budget.


Compliance

Training time budget (600s total on 8×H100 SXM):

  • 560s: Main training loop (Muon + Adam, ~6,230 steps)
  • 40s: GPTQ int6 quantization (calibration on training data)
  • Total: 600s (within 10-minute budget)

Test-time training (TTT):
Score-first protocol — the model scores each validation chunk before adapting on it. No token is ever re-scored after adaptation. This follows the causal/streaming TTT pattern confirmed legal by organizers (Issue #402, #677). Full-epoch TTT (train on all val data before scoring) is NOT used.
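The score-first protocol above can be sketched as a streaming loop. This is a minimal stand-in, not the PR's code: `score_fn`, `adapt_fn`, and the toy scalar model are all illustrative.

```python
def score_first_ttt(model, chunks, score_fn, adapt_fn, epochs=3):
    """Score each chunk with the current weights, THEN adapt on it.

    No token is ever re-scored after adaptation, so each recorded loss
    is computed with the pre-adaptation weights for that chunk.
    """
    losses = []
    for chunk in chunks:
        losses.append(score_fn(model, chunk))  # score first...
        for _ in range(epochs):                # ...then adapt (3 epochs in the PR)
            adapt_fn(model, chunk)
    return losses


# Toy usage: a scalar "model" chasing a constant target.
model = {"w": 0.0}
losses = score_first_ttt(
    model,
    chunks=[1.0, 1.0, 1.0],
    score_fn=lambda m, x: (m["w"] - x) ** 2,
    adapt_fn=lambda m, x: m.__setitem__("w", m["w"] + 0.5 * (x - m["w"])),
)
```

Later chunks benefit from adaptation on earlier chunks, but each chunk's own score is always taken before the model sees it, matching the causal/streaming pattern.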

Artifact: 15.87 MB (code: ~94KB, compressed weights: ~15.78MB). Under 16,000,000 byte limit.


Key Contributions

1. Full GPTQ halves the quantization gap (0.008 → 0.004 BPB)

Cholesky-based GPTQ with act-order column permutation and block-wise error compensation (block_size=128). Calibrated on 256 training batches with 1% diagonal damping.

| Method | Quant Gap | Artifact Size |
|---|---|---|
| Naive int6 (grid search) | +0.0083 BPB | 15.81 MB |
| Full GPTQ (Cholesky) | +0.0039 BPB | 15.92 MB |
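A minimal sketch of the Cholesky-based GPTQ step: per-column symmetric quantization with error compensation on the remaining columns, and 1% diagonal damping as stated above. Everything else (the API, the simple per-column scale) is a simplified stand-in, and act-order permutation plus block_size=128 blocking are omitted for clarity.

```python
import numpy as np

def gptq_quantize(W, X, bits=6, damp_ratio=0.01):
    """W: (out, in) layer weights; X: (n_samples, in) calibration activations."""
    d = W.shape[1]
    H = X.T @ X
    H += damp_ratio * np.mean(np.diag(H)) * np.eye(d)   # 1% diagonal damping
    U = np.linalg.cholesky(np.linalg.inv(H)).T          # upper-triangular factor of H^-1
    W = W.astype(float).copy()
    Q = np.zeros_like(W)
    qmax = 2 ** (bits - 1) - 1                          # int6 -> levels in [-32, 31]
    for j in range(d):                                  # quantize column by column
        scale = max(np.max(np.abs(W[:, j])), 1e-12) / qmax
        Q[:, j] = np.clip(np.round(W[:, j] / scale), -qmax - 1, qmax) * scale
        err = (W[:, j] - Q[:, j]) / U[j, j]
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])     # compensate later columns
    return Q
```

The compensation step is what separates GPTQ from naive round-to-nearest: each column's rounding error is pushed onto the not-yet-quantized columns, weighted by the inverse-Hessian factor, so the layer output error stays small.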

2. XSA on all 11 layers

Cross-Sequence Attention applied to all 11 transformer layers (vs. the last 4 in the baseline). Provides extended context beyond the training sequence length at eval time. Worth roughly -0.0013 BPB.

3. SWA/EMA weight blending

Stochastic Weight Averaging over the final warmdown phase (16 snapshots every 50 steps), blended 50/50 with EMA (decay=0.997). Smooths weight landscape before quantization.
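The SWA/EMA blend can be sketched as follows. This is a toy, framework-free version assuming flat numpy weight vectors; the snapshot cadence, EMA decay, and 50/50 blend match the numbers above (16 snapshots at every 50 steps implies roughly an 800-step warmdown, which is inferred, not stated).

```python
import numpy as np

def swa_ema_blend(weight_stream, ema_decay=0.997, snap_every=50, blend=0.5):
    """Blend SWA snapshots with an EMA over the warmdown weight stream."""
    ema, snaps = None, []
    for step, w in enumerate(weight_stream, start=1):
        ema = w.copy() if ema is None else ema_decay * ema + (1 - ema_decay) * w
        if step % snap_every == 0:           # SWA snapshot every 50 steps
            snaps.append(w.copy())
    swa = np.mean(snaps, axis=0)             # average of the snapshots
    return blend * swa + (1 - blend) * ema   # 50/50 SWA/EMA blend
```

A quick fixed-point sanity check: a constant weight stream comes back unchanged, since both the SWA mean and the EMA converge to that constant.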

4. Score-first TTT (legal)

Sequential online adaptation: score chunk, then train on it with AdamW (lr=1e-4, 3 epochs, freeze first 9/11 blocks, 128K token chunks). Improves sliding BPB by -0.0007.
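The "freeze first 9/11 blocks" part of the TTT config can be illustrated with a small parameter-selection helper. This is a framework-free stand-in with dict-based dummy parameters; in a real torch model this would toggle `requires_grad` and hand only the unfrozen parameters to AdamW.

```python
def select_ttt_params(blocks, n_freeze=9):
    """Mark the first n_freeze blocks frozen; return only trainable params."""
    trainable = []
    for i, block in enumerate(blocks):
        for p in block:
            p["requires_grad"] = i >= n_freeze
            if p["requires_grad"]:
                trainable.append(p)
    return trainable


# 11 blocks of one dummy parameter each: only blocks 9 and 10 stay trainable.
blocks = [[{"name": f"block{i}.weight"}] for i in range(11)]
params = select_ttt_params(blocks)
```

Freezing the early blocks keeps the adaptation cheap and limits how far TTT can drift from the (quantization-calibrated) base weights.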

Architecture & Training

Credits

🤖 Generated with Claude Code

GPTQ/TTT interaction study with three key findings:
1. Full GPTQ halves quantization gap (0.008 → 0.004 BPB)
2. AdamW TTT catastrophically destroys GPTQ-calibrated weights (+0.076 BPB)
3. SGD TTT preserves GPTQ quality; Born-rule SNR² provides conservative scaling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Robby955 Robby955 changed the title Non-record: Full GPTQ + EB-TTT + SWA/EMA (val_bpb=1.1173) Non-record: Full GPTQ + XSA-all + SWA/EMA (val_bpb=1.1158) Mar 24, 2026
@Robby955 Robby955 changed the title Non-record: Full GPTQ + XSA-all + SWA/EMA (val_bpb=1.1158) Full GPTQ + XSA-all + SWA/EMA (val_bpb=1.1158, 3-seed mean=1.1163) Mar 24, 2026
@Robby955 Robby955 (Author) commented:

Compliance Update

After reviewing the organizer ruling on PR #606 (GPTQ calibration must fit within the training time budget), I've re-run with compliant timing:

Training ends at 560s and GPTQ calibration runs from 560s to 600s, so both fit within the 600s training budget.

Compliant Results (560s training + 40s GPTQ calibration = 600s total)

| Config | Seed | Pre-quant | Sliding BPB | Artifact |
|---|---|---|---|---|
| Compliant + TTT | 1337 | 1.1374 | 1.1175 | 15.87 MB |
| Compliant, no TTT | 1337 | 1.1375 | 1.1182 | 15.93 MB |

TTT config (matching PR #606's recipe): AdamW lr=1e-4, 3 epochs, zero weight decay, freeze first 9/11 blocks, 128K token chunks.

The original 1.1158 score was from a 600s training run where GPTQ calibration ran outside the training budget — the same issue that affected PR #606. The compliant score with the same architecture is 1.1175.

3-seed validation on the compliant config is in progress. Will update the PR title and body once complete.

@Robby955 Robby955 changed the title Full GPTQ + XSA-all + SWA/EMA (val_bpb=1.1158, 3-seed mean=1.1163) Full GPTQ + XSA-all + SWA/EMA + Score-First TTT (compliant BPB=1.1175, 1-seed) Mar 25, 2026