Full GPTQ + XSA-all + SWA/EMA + Score-First TTT (compliant BPB=1.1175, 1-seed)#639
Open
Robby955 wants to merge 1 commit into openai:main from
Conversation
GPTQ/TTT interaction study with three key findings:

1. Full GPTQ halves the quantization gap (0.008 → 0.004 BPB)
2. AdamW TTT catastrophically destroys GPTQ-calibrated weights (+0.076 BPB)
3. SGD TTT preserves GPTQ quality; Born-rule SNR² provides conservative scaling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author
Compliance Update

After reviewing the organizer ruling on PR #606 (GPTQ calibration must fit within the training time budget), I've re-run with compliant timing: training ends at 560s and GPTQ calibration runs from 560s to 600s, within the 600s training budget.

Compliant results (560s training + 40s GPTQ calibration = 600s total):

TTT config (matching PR #606's recipe): AdamW lr=1e-4, 3 epochs, zero weight decay, freeze the first 9 of 11 blocks, 128K-token chunks.

The original 1.1158 score came from a 600s training run where GPTQ calibration ran outside the training budget, the same issue that affected PR #606. The compliant score with the same architecture is 1.1175. 3-seed validation of the compliant config is in progress; I will update the PR title and body once complete.
Results (Compliant)
Best compliant score: 1.1175 BPB (single seed 1337, 3-seed validation in progress)
Compliance
Training time budget: 600s total on 8×H100 SXM (560s training + 40s GPTQ calibration).
Test-time training (TTT):
Score-first protocol: the model scores each validation chunk before adapting on it. No token is ever re-scored after adaptation. This follows the causal/streaming TTT pattern confirmed legal by the organizers (Issues #402 and #677). Full-epoch TTT (training on all validation data before scoring) is NOT used.
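The ordering constraint can be sketched as a streaming loop. This is a minimal illustration with a toy model; the class and function names are assumptions, not taken from this PR's code — only the score-before-adapt ordering is the point.

```python
class ToyModel:
    """Stand-in for the language model; only the score/adapt ordering matters."""

    def __init__(self):
        self.bias = 0.0

    def score(self, chunk):
        # Stand-in for computing BPB on the chunk with the CURRENT weights.
        return abs(sum(chunk) / len(chunk) - self.bias)

    def adapt(self, chunk):
        # Stand-in for the TTT update; runs only AFTER the chunk is scored.
        self.bias += 0.5 * (sum(chunk) / len(chunk) - self.bias)


def score_first_ttt(model, chunks):
    """Score each chunk before adapting on it; no chunk is ever re-scored."""
    scores = []
    for chunk in chunks:
        scores.append(model.score(chunk))  # score first ...
        model.adapt(chunk)                 # ... then adapt; affects later chunks only
    return sum(scores) / len(scores)
```

Because adaptation happens strictly after scoring, each chunk's BPB is computed with weights that have never seen that chunk, which is what makes the protocol causal/streaming rather than full-epoch.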
Artifact: 15.87 MB (code: ~94 KB, compressed weights: ~15.78 MB), under the 16,000,000-byte limit.
Key Contributions
1. Full GPTQ halves the quantization gap (0.008 → 0.004 BPB)
Cholesky-based GPTQ with act-order column permutation and block-wise error compensation (block_size=128). Calibrated on 256 training batches with 1% diagonal damping.
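A simplified sketch of the GPTQ column loop in NumPy, under stated assumptions: one symmetric per-tensor scale for brevity, and per-column updates instead of the block-wise (block_size=128) batching the PR describes. The function name and signature are illustrative.

```python
import numpy as np

def gptq_quantize(W, X, nbits=4, damp=0.01):
    """Simplified GPTQ: act-order column permutation plus Cholesky-based
    error compensation. W: (rows, cols) weights; X: (cols, nsamples)
    calibration activations."""
    cols = W.shape[1]
    H = X @ X.T                                      # proxy Hessian
    H += damp * np.mean(np.diag(H)) * np.eye(cols)   # 1% diagonal damping
    perm = np.argsort(-np.diag(H))                   # act-order: big diagonals first
    W = W[:, perm].copy()
    H = H[np.ix_(perm, perm)]
    U = np.linalg.cholesky(np.linalg.inv(H)).T       # upper-triangular factor of H^-1
    qmax = 2 ** (nbits - 1) - 1
    scale = np.max(np.abs(W)) / qmax                 # one symmetric scale, for brevity
    Q = np.zeros_like(W)
    for i in range(cols):
        w = W[:, i]
        q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        Q[:, i] = q
        err = (w - q) / U[i, i]
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])  # compensate later columns
    return Q[:, np.argsort(perm)]                    # undo the permutation
```

The full version batches the compensation updates in blocks of 128 columns for efficiency; the math per column is the same.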
2. XSA on all 11 layers
Cross-Sequence Attention on all 11 transformer layers (vs. the last 4 in the baseline). Provides extended context beyond the training sequence length at eval time. Worth about -0.0013 BPB.
3. SWA/EMA weight blending
Stochastic Weight Averaging over the final warmdown phase (16 snapshots every 50 steps), blended 50/50 with EMA (decay=0.997). Smooths weight landscape before quantization.
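A minimal sketch of the snapshot averaging and the 50/50 blend, with plain dicts of floats standing in for state dicts; function names are illustrative, not from this PR's code.

```python
def ema_update(ema, weights, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return {k: decay * ema[k] + (1.0 - decay) * weights[k] for k in ema}

def swa_average(snapshots):
    """Uniform average of the warmdown snapshots (16 taken every 50 steps)."""
    n = len(snapshots)
    return {k: sum(s[k] for s in snapshots) / n for k in snapshots[0]}

def blend(swa, ema, alpha=0.5):
    """50/50 SWA/EMA blend applied to the weights before quantization."""
    return {k: alpha * swa[k] + (1.0 - alpha) * ema[k] for k in swa}
```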
4. Score-first TTT (legal)
Sequential online adaptation: score chunk, then train on it with AdamW (lr=1e-4, 3 epochs, freeze first 9/11 blocks, 128K token chunks). Improves sliding BPB by -0.0007.
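The 9-of-11-block freeze can be sketched as a name filter over model parameters. The `blocks.<i>.` naming convention and the choice to also freeze non-block parameters (embeddings, head) are assumptions for illustration, not taken from this PR's code.

```python
def trainable_param_names(param_names, n_frozen=9):
    """Return the parameter names left trainable for TTT: only those in
    transformer blocks with index >= n_frozen. Non-block parameters
    (e.g. embeddings, lm_head) are assumed frozen here as well."""
    trainable = set()
    for name in param_names:
        parts = name.split(".")
        if parts[0] == "blocks" and int(parts[1]) >= n_frozen:
            trainable.add(name)
    return trainable
```

The resulting name set would then be used to build the AdamW (lr=1e-4) parameter group for the per-chunk adaptation steps.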
Architecture & Training
Credits
🤖 Generated with Claude Code