Record: 11L XSA-all + Full GPTQ (Budget-Legal) + Parallel Muon + Selective Pruning (val_bpb: 1.1178, 3-seed mean)#634
Conversation
…ng (val_bpb=1.1171) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4315a016dd
```python
for opt, state in zip(optimizers, initial_optimizer_states, strict=True):
    opt.load_state_dict(state)
```
Clear Muon shard momentum when restoring from warmup
These lines restore each optimizer from state_dict(), but in distributed mode Muon keeps its active momentum in _bank_meta[*]['shard_mom'] (used in Muon.step) rather than in state_dict. Because warmup calls opt.step(), those buffers are updated and then silently carried into real training, so warmup_steps>0 changes optimization even after the supposed rollback. Rebuild or zero Muon’s internal bank buffers after warmup to make the reset correct and reproducible.
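One way to make the rollback complete is to explicitly zero those bank buffers after restoring the state dicts. A minimal sketch, assuming each Muon optimizer exposes a `_bank_meta` dict whose entries hold a `'shard_mom'` tensor as described above; the helper name `reset_muon_shard_momentum` is hypothetical:

```python
def reset_muon_shard_momentum(optimizers):
    """Zero Muon's distributed shard momentum after restoring state_dicts.

    Restoring via opt.load_state_dict() does not touch the
    _bank_meta[*]['shard_mom'] buffers that Muon.step() actually updates,
    so warmup steps would otherwise leak into real training.
    """
    for opt in optimizers:
        bank_meta = getattr(opt, "_bank_meta", None)
        if not bank_meta:
            continue  # non-Muon optimizers (or single-GPU mode) keep no banks
        for meta in bank_meta.values():
            mom = meta.get("shard_mom")
            if mom is not None:
                mom.zero_()  # in-place reset preserves buffer identity/device
```

Zeroing in place (rather than reallocating) keeps whatever device/dtype bookkeeping the bank machinery holds onto intact.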
```python
for name in hessians:
    hessians[name] /= num_batches
```
All-reduce GPTQ Hessians across ranks before averaging
In multi-GPU runs, calibration data is split per rank by DistributedTokenLoader, but Hessians are only divided by num_batches locally with no cross-rank reduction. Since the final artifact is written by rank 0, GPTQ is effectively calibrated from rank 0’s subset instead of the full global calibration set, making quantization quality dependent on rank/world size and underusing most calibration data. Add a distributed sum/average of each Hessian before damping and quantization.
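The fix amounts to summing the Hessian accumulators across ranks before dividing by the global batch count. A numpy sketch of why this matters, simulating four ranks in one process; in the real code the cross-rank sum would be a `torch.distributed.all_reduce` over each Hessian tensor:

```python
import numpy as np

def hessian_acc(X):
    # GPTQ's per-layer Hessian is proportional to sum_i x_i x_i^T over
    # calibration inputs; constant factors cancel for quantization, so we
    # track the raw accumulator X^T X.
    return X.T @ X

rng = np.random.default_rng(0)
# 4 "ranks", each seeing 2 calibration batches of shape (tokens, features)
batches = [[rng.normal(size=(8, 4)) for _ in range(2)] for _ in range(4)]

# Buggy: rank 0 averages only its own accumulator, and rank 0's result is
# what ends up in the saved artifact
local = sum(hessian_acc(b) for b in batches[0]) / len(batches[0])

# Fixed: sum accumulators across ranks (the all_reduce), then divide by the
# global batch count
global_sum = sum(hessian_acc(b) for rank in batches for b in rank)
fixed = global_sum / sum(len(rank) for rank in batches)

# The fixed estimate equals calibrating on the full dataset at once
all_data = np.concatenate([b for rank in batches for b in rank], axis=0)
reference = hessian_acc(all_data) / 8  # 8 batches total
assert np.allclose(fixed, reference)
assert not np.allclose(local, reference)  # rank-0-only estimate differs
```

Because the accumulator is a plain sum, summing across ranks and dividing once is exactly equivalent to calibrating on the full global set, independent of world size.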
Three new experiment scripts forked from PR openai#634 (1.1171 BPB frontier):

- 2026-03-25_11L_LateEMA_XSA-all_GPTQ: Late EMA fix only. Switch EMA decay 0.997→0.95 when QAT activates (~step 17K). Prevents EMA from averaging over the full QAT annealing curve, ensuring shadow weights track the final quantized α=16 state.
- 2026-03-25_11L_BOS-Reset_XSA-all_GPTQ: BOS-reset attention only. Use flash_attn_varlen_func with cu_seqlens from BOS token positions to block cross-document attention at zero parameter cost. Also zeros SmearGate blend at BOS positions (cross-doc bleed fix). 62.9% of val tokens have cross-doc context pollution; confirmed novel, no prior PR has exploited this.
- 2026-03-25_11L_BOS-Reset_LateEMA_XSA-all_GPTQ: both changes combined.

run.sh: replace FOMAML modes (confirmed negative) with baseline/frontier/late-ema/bos-reset/combined; tee logs to logs/<run_id>.log
setup.sh: smoke test now uses frontier record; add quick-start guidance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
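The BOS-reset scheme above boils down to building `cu_seqlens` (cumulative document boundaries) from BOS token positions in the packed batch. A numpy sketch; the helper name `bos_cu_seqlens` is hypothetical, and in the real code the result would be an int32 CUDA tensor passed as `cu_seqlens_q`/`cu_seqlens_k` to `flash_attn_varlen_func`:

```python
import numpy as np

def bos_cu_seqlens(tokens, bos_id):
    """Cumulative sequence lengths for varlen attention from BOS positions.

    tokens: 1-D array of token ids for a flattened (packed) batch.
    Each BOS token starts a new document, so attention never crosses a BOS
    boundary. Returns [0, d1, d1+d2, ..., total_len].
    """
    tokens = np.asarray(tokens)
    starts = np.flatnonzero(tokens == bos_id)
    if len(starts) == 0 or starts[0] != 0:
        # a packed batch may begin mid-document; treat position 0 as a start
        starts = np.concatenate(([0], starts))
    return np.concatenate((starts, [len(tokens)])).astype(np.int32)

# Example: BOS=1, three documents of lengths 3, 4, 2 packed together
cu = bos_cu_seqlens([1, 5, 6, 1, 7, 8, 9, 1, 2], bos_id=1)
assert cu.tolist() == [0, 3, 7, 9]
```

Since the boundaries are derived purely from existing BOS tokens, this blocks cross-document attention at zero parameter cost, as the commit message claims.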
- BIGRAM_VOCAB_SIZE 10240→2048 (saves ~1M params, matches community SOTA)
- Magnitude pruning at 3% (not 10%), matching PR openai#634's validated approach
- Remove unused collect_hessians/gptq_quantize_weight (dead code, ~155 lines)
- Clean up quantize_state_dict_int8 signature

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Even 3% pruning causes 0.18 BPB degradation because QAT optimizes weights for round-to-nearest. PR openai#634 uses pruning without QAT, which is a different regime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fork PR openai#634. Swap Full GPTQ for GPTQ-lite. Remove dead GPTQ code. int5-MLP + BigramHash(8192) + MATRIX_LR=0.03.
Previous version trained for the full 600s then ran GPTQ calibration for ~46s on top, exceeding the 600s artifact-production budget.

Fix: reserve 14s from the training budget for GPTQ calibration (gptq_reserve_ms = 14000.0). Training stops at ~586s, GPTQ takes ~10s, total ~596s, within budget.

Fresh 3-seed results on 8xH100 SXM:
- Seed 1337: val_bpb=1.1177, artifact=15,929,433 bytes, total=595,915ms
- Seed 42: val_bpb=1.1179, artifact=15,949,353 bytes, total=595,842ms
- Seed 7: val_bpb=1.1179, artifact=15,946,145 bytes, total=595,889ms
- Mean: 1.1178, Std: 0.0001
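The budget arithmetic behind the fix is simple deadline math. A sketch of the shape of the check, with hypothetical names (only gptq_reserve_ms = 14000.0 and the 600s budget come from this PR):

```python
GPTQ_RESERVE_MS = 14_000.0   # from this PR: gptq_reserve_ms = 14000.0
TOTAL_BUDGET_MS = 600_000.0  # 600s artifact-production budget

def train_deadline_ms(start_ms):
    """Latest time (ms) at which the training loop must stop so that GPTQ
    calibration and artifact writing still fit inside the 600s budget."""
    return start_ms + TOTAL_BUDGET_MS - GPTQ_RESERVE_MS

# With a zero clock origin, training stops at 586,000 ms (~586s),
# leaving ~14s for GPTQ calibration; observed totals land around ~596s.
assert train_deadline_ms(0.0) == 586_000.0
```

The reserve is deliberately larger than the observed ~10s GPTQ runtime, leaving slack for seed-to-seed variance in calibration time.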
Updated: GPTQ Budget Fix + Fresh 3-Seed Repro

This PR has been updated to fix the GPTQ budget violation identified in issue #677 (see this comment flagging PRs #535, #544, #545, #569, #585, #593 for "GPTQ/Hessian calibration uses fineweb_train_* during evaluation").

What was wrong: The previous version trained for the full 600s wallclock, then ran GPTQ Hessian calibration for ~46s on top, exceeding the artifact-production budget.

What was fixed: Added gptq_reserve_ms = 14000.0 to reserve 14s of the training budget for calibration, keeping total compute within 600s.

Fresh results: All code, logs, and results have been replaced with fresh 3-seed runs using the fixed code:
Note: the slight BPB increase vs the previous submission (1.1171 → 1.1178) is expected: we're training for ~400 fewer steps due to the GPTQ time reservation. The submission is now fully compliant with the 600s budget.
Record: 11L XSA-all + Full GPTQ (Budget-Legal) + Parallel Muon + Selective Pruning
val_bpb: 1.1178 (3-seed mean, std 0.0001) | 15.95 MB max artifact | 8xH100 SXM, ~596s total compute
Update (2026-03-26)
This PR was updated to fix a GPTQ budget violation identified in issue #677. The previous version trained for the full 600s, then ran GPTQ calibration for ~46s on top. The fix reserves 14s from the training budget (gptq_reserve_ms = 14000.0), ensuring total compute stays within 600s. See the update comment for details.

Results (3 seeds, 8xH100 SXM)
Mean: 1.1178 | Std: 0.0001
Key Techniques
Architecture
11L, 512d, 8H/4KV (GQA), MLP 3x LeakyReLU(0.5)², XSA-all-11, Partial RoPE 16/64, LN Scale, VE128, SmearGate, BigramHash(2048), U-Net skips, EMA(0.997), Tight SWA, Late QAT@0.15, Full GPTQ int6 + LZMA, Parallel Muon + Parameter Banking, FA3 Hopper.
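The EMA(0.997) entry above is the standard shadow-weight update; a minimal sketch, including the late-EMA decay switch (0.997→0.95 at QAT activation) discussed in the commit messages earlier. Function and parameter names here are hypothetical:

```python
def ema_update(shadow, weights, step, qat_start_step,
               decay=0.997, late_decay=0.95):
    """Exponential moving average of weights. Switch to a faster decay once
    QAT activates so the shadow tracks the final quantized state instead of
    averaging over the whole QAT annealing curve."""
    d = late_decay if step >= qat_start_step else decay
    return [d * s + (1.0 - d) * w for s, w in zip(shadow, weights)]

# Pre-QAT step: slow decay 0.997 dominates
shadow = ema_update([0.0], [1.0], step=100, qat_start_step=17_000)
assert abs(shadow[0] - 0.003) < 1e-12
```

A faster decay after QAT kicks in weights recent (quantization-aware) parameters far more heavily, which is the entire point of the late-EMA variant.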
Compliance
gptq:budget_checkin logs)