Record: 11L + XSA4 + EMA + Late QAT + GPTQ-lite (1.1325 BPB)#531
pragnyanramtha wants to merge 3 commits into openai:main from
Conversation
Pull request overview
This PR adds a new 10min/16MB “record” submission under records/track_10min_16mb/2026-03-22_XSA4_EMA_GPQ, including the training script, run logs for two seeds, and the accompanying submission metadata/README.
Changes:
- Add a record `train_gpt.py` implementing an 11-layer GPT variant with XSA applied to the last 4 layers and mixed post-training quantization (int6/int5/int8) with optional zstd compression.
- Add training logs for seeds 1337 and 42 capturing the run outputs and final roundtrip sliding-window eval.
- Add `submission.json` and `README.md` describing the approach and reporting results/artifact sizes.
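For readers unfamiliar with the export path: mixed post-training quantization of this kind typically snaps each weight tensor to a low-bit symmetric grid and then compresses the packed bytes. A rough int6 sketch of the idea (NumPy, with zlib standing in for zstd since the latter is not in the Python stdlib; all names here are illustrative, not the PR's actual code):

```python
import zlib
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor quantization to 6 bits (levels -31..31)."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def export_compressed(w: np.ndarray) -> bytes:
    """Pack scale + quantized weights and compress (zstd in the real script)."""
    q, scale = quantize_int6(w)
    header = np.float32(scale).tobytes()
    return header + zlib.compress(q.tobytes(), level=9)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
blob = export_compressed(w)

# Roundtrip check: dequantized weights stay within half a grid step
q, scale = quantize_int6(w)
err = np.abs(q * scale - w).max()
```

The int5/int8 variants would differ only in the number of levels; the 16MB budget is what forces most tensors down to 5-6 bits.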
Reviewed changes
Copilot reviewed 3 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-22_XSA4_EMA_GPQ/train_gpt.py | New record training script with XSA attention variant, Muon optimizer, and mixed quantization export + roundtrip eval. |
| records/track_10min_16mb/2026-03-22_XSA4_EMA_GPQ/train_seed1337.log | Added training/eval log for seed 1337. |
| records/track_10min_16mb/2026-03-22_XSA4_EMA_GPQ/train_seed42.log | Added training/eval log for seed 42. |
| records/track_10min_16mb/2026-03-22_XSA4_EMA_GPQ/submission.json | Added submission metadata (author, blurb, metrics, sizes). |
| records/track_10min_16mb/2026-03-22_XSA4_EMA_GPQ/README.md | Added narrative description, results table, and usage instructions. |
SOTA-class optimizations targeting the 10-minute, 16MB budget on FineWeb fineweb10B_sp1024.
Key Innovations
1. XSA4 (Cross-layer Shared Attention) — Zero New Parameters
- `CausalSelfAttentionXSA` for compile-friendly behavior
- `forward()` keeps `fullgraph=True` stable

2. 11 Layers (vs. 9)
- `NUM_LAYERS` raised from 9 to 11

3. EMA (Exponential Moving Average) — Full Training Duration
- Decay configurable (`EMA_DECAY` env var)

4. Late QAT (Quantization-Aware Training) — LR < 15% Only
- Toggled via `CastedLinear._qat_enabled`
- Straight-through estimator: `w + (w_q - w).detach()`

5. GPTQ-lite — 5-Percentile MSE Search
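The name suggests a per-tensor search over a handful of percentile-based clipping points, keeping whichever minimizes round-trip MSE. A NumPy sketch of that idea (the percentile grid and function names are assumptions, not taken from the script):

```python
import numpy as np

def best_clip_scale(w: np.ndarray, bits: int = 6,
                    percentiles=(99.0, 99.5, 99.9, 99.99, 100.0)):
    """Try a few percentile-based clip points and keep the one with the
    lowest quantization MSE (a GPTQ-lite-style search over 5 candidates)."""
    qmax = 2 ** (bits - 1) - 1
    flat = np.abs(w).ravel()
    best_mse, best_scale = np.inf, None
    for p in percentiles:
        clip = np.percentile(flat, p)
        if clip == 0:
            continue
        scale = clip / qmax
        q = np.clip(np.round(w / scale), -qmax, qmax)
        mse = float(np.mean((q * scale - w) ** 2))
        if mse < best_mse:
            best_mse, best_scale = mse, scale
    return best_scale

rng = np.random.default_rng(1337)
w = rng.standard_normal((128, 128)).astype(np.float32)
scale = best_clip_scale(w)
```

Clipping outliers often beats scaling to the raw max, because the low-bit grid then spends its resolution where most of the weight mass lives; including the 100th percentile as a candidate guarantees the search never does worse than max-scaling.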
6. Compile with `fullgraph=True`
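For intuition on section 1: cross-layer sharing means the last four layers reuse one attention module's projection weights, so the extra depth costs zero new parameters. A minimal single-head NumPy sketch of the pattern (shapes and names here are illustrative, not the PR's actual `CausalSelfAttentionXSA`):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attn(x, Wq, Wk, Wv):
    """Single-head causal self-attention (biases omitted)."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    return softmax(scores) @ v

# XSA4 pattern: the last 4 layers all call attention with ONE shared
# set of projection weights -- cross-layer sharing, zero new parameters.
rng = np.random.default_rng(0)
d, T = 32, 8
shared = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
x = rng.standard_normal((T, d))
for _ in range(4):  # four "layers" reuse the same weight set
    x = x + causal_attn(x, *shared)
```

Because every shared layer calls the same module with the same code path, the whole stack stays a single graph, which is what keeps `fullgraph=True` compilation stable.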
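Sections 3 and 4 combine naturally in the training step: the raw weights receive full-precision updates, an EMA copy is tracked for eval/export, and once QAT kicks in the forward pass sees quantized weights via the straight-through form `w + (w_q - w).detach()`. A hedged NumPy sketch (the detach is implied here by applying gradients to the raw `w`; all names illustrative):

```python
import numpy as np

EMA_DECAY = 0.999  # hypothetical default; the script reads an env var

def ema_update(ema_w, w, decay=EMA_DECAY):
    """Exponential moving average of the weights, kept for eval/export."""
    return decay * ema_w + (1.0 - decay) * w

def ste_forward(w, scale, qmax=31):
    """Straight-through estimator: the forward pass sees quantized
    weights, but the backward pass treats quantization as identity
    (torch form: w + (w_q - w).detach())."""
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(16)
ema_w = w.copy()
for step in range(10):
    grad = 2 * w                  # toy loss: ||w||^2
    w = w - 0.01 * grad           # raw weights get full-precision updates
    ema_w = ema_update(ema_w, w)
# In the real script, STE is enabled only late in training,
# once the learning rate has decayed below 15% of its peak.
w_eff = ste_forward(w, scale=np.abs(w).max() / 31)
```

Running QAT only in the low-LR tail lets the model adapt to the quantization grid without paying the noise penalty during the high-LR phase.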
Architecture
Training Hyperparameters
Results
Two runs with different seeds (fineweb10B_sp1024 validation set, 50k documents):
Final roundtrip validation: Decompressed int6/zstd weights re-evaluated on sliding-window eval (stride 64).
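Stride-64 sliding-window evaluation typically re-scores overlapping windows but counts loss only on each window's trailing 64 fresh tokens, so nearly every token is predicted with close-to-full context. A sketch of the windowing and the BPB conversion (the 1024 window size is an assumption from the `sp1024` tokenizer name; function names are illustrative):

```python
import math

def sliding_eval_spans(n_tokens: int, window: int = 1024, stride: int = 64):
    """Spans (start, end, n_scored) covering every token exactly once.
    Each window after the first is scored only on its trailing `stride`
    fresh tokens; earlier windows provide the context."""
    end = min(window, n_tokens)
    spans = [(0, end, end)]  # first window scores all of its tokens
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        start = max(0, new_end - window)
        spans.append((start, new_end, new_end - end))
        end = new_end
    return spans

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert summed NLL (in nats) over a corpus to bits-per-byte."""
    return total_nll_nats / (math.log(2) * n_bytes)

spans = sliding_eval_spans(5000)
```

Running this eval on the decompressed int6/zstd weights (rather than the float checkpoint) is what makes the reported BPB an honest roundtrip number for the 16MB artifact.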
BPB Progression (Seed 1337)
Optimization Confirmations
From training log (seed 1337):
✅ All optimizations confirmed active during training:
Iteration Process
Environment variables:
Files Included
- `train_gpt.py` — Main training script with all optimizations
- `train_seed1337.log` — Full training log (seed 1337)
- `train_seed42.log` — Full training log (seed 42)
- `submission.json` — Metadata for leaderboard
- `README.md` — This file

References
Inspired by:
Notes