
Record: 11L XSA-all + Full GPTQ (Budget-Legal) + Parallel Muon + Selective Pruning (val_bpb: 1.1178, 3-seed mean)#634

Open
raahilshah wants to merge 2 commits into openai:main from raahilshah:submission/2026-03-24_11L_XSA-all_GPTQ_ParallelMuon_1.1171

Conversation

@raahilshah

@raahilshah raahilshah commented Mar 24, 2026

Record: 11L XSA-all + Full GPTQ (Budget-Legal) + Parallel Muon + Selective Pruning

val_bpb: 1.1178 (3-seed mean, std 0.0001) | 15.95 MB max artifact | 8xH100 SXM, ~596s total compute

Update (2026-03-26)

This PR was updated to fix a GPTQ budget violation identified in issue #677. The previous version trained for the full 600s, then ran GPTQ calibration for ~46s on top. The fix reserves 14s from the training budget (gptq_reserve_ms = 14000.0), ensuring total compute stays within 600s. See update comment for details.

Results (3 seeds, 8xH100 SXM)

| Seed | Sliding BPB (s64) | val_loss | Artifact | Train Time | GPTQ Time | Total |
|------|-------------------|----------|--------------|------------|-----------|------------|
| 1337 | 1.1177 | 1.8871 | 15,929,433 B | 586,128 ms | 9,786 ms | 595,915 ms |
| 42   | 1.1179 | 1.8875 | 15,949,353 B | 586,050 ms | 9,792 ms | 595,842 ms |
| 7    | 1.1179 | 1.8875 | 15,946,145 B | 586,066 ms | 9,823 ms | 595,889 ms |

Mean: 1.1178 | Std: 0.0001

Key Techniques

  1. XSA on all 11 layers — cross-position mixing from layer 0 (-0.0016 BPB vs XSA-4)
  2. Full Hessian GPTQ — 64-batch GPU Hessian, Cholesky error compensation, budget-legal (14s reserved)
  3. amax-aligned QAT — STE matches export quantizer, [-32, 31] range
  4. Parallel Muon with parameter banking — 3-phase overlapped optimizer, ~87ms/step
  5. Selective magnitude pruning — post-GPTQ, zero least-impactful ±1 values
  6. LZMA compression — preset 6, better than zstd on int6 weights
  7. LeakyReLU(0.5)² — prevents dead neurons
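Item 7 can be read literally as y = LeakyReLU(x, 0.5)². A minimal sketch of that reading (the function name is mine, not from the PR): squaring maps negative pre-activations to small positive outputs, while the 0.5 slope keeps the gradient nonzero there, which is the "prevents dead neurons" property claimed above.

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, negative_slope: float = 0.5) -> torch.Tensor:
    """Squared LeakyReLU: y = LeakyReLU(x, slope)**2.

    For x < 0 the output is (slope * x)**2, so the gradient slope * 2 * y'
    never vanishes the way it does for plain ReLU**2.
    """
    y = F.leaky_relu(x, negative_slope=negative_slope)
    return y * y
```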

Architecture

11L, 512d, 8H/4KV (GQA), MLP 3x LeakyReLU(0.5)², XSA-all-11, Partial RoPE 16/64, LN Scale, VE128, SmearGate, BigramHash(2048), U-Net skips, EMA(0.997), Tight SWA, Late QAT@0.15, Full GPTQ int6 + LZMA, Parallel Muon + Parameter Banking, FA3 Hopper.

Compliance

  • 3 seeds, all total compute ≤600s (max: 595,915ms, verified via gptq:budget_check in logs)
  • GPTQ calibration WITHIN training budget (14s reserved from 600s)
  • All artifacts ≤16,000,000 bytes (max: 15,949,353)
  • No TTT on validation data
  • No training data accessed during evaluation
  • No network calls during evaluation
  • Sliding window eval stride=64 (std=0.0001)

…ng (val_bpb=1.1171)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4315a016dd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1476 to +1477
for opt, state in zip(optimizers, initial_optimizer_states, strict=True):
    opt.load_state_dict(state)


P1 Badge Clear Muon shard momentum when restoring from warmup

These lines restore each optimizer from state_dict(), but in distributed mode Muon keeps its active momentum in _bank_meta[*]['shard_mom'] (used in Muon.step) rather than in state_dict. Because warmup calls opt.step(), those buffers are updated and then silently carried into real training, so warmup_steps>0 changes optimization even after the supposed rollback. Rebuild or zero Muon’s internal bank buffers after warmup to make the reset correct and reproducible.

Useful? React with 👍 / 👎.
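The suggested fix could look like the sketch below, assuming `_bank_meta[*]['shard_mom']` holds per-shard momentum tensors as the review describes. The helper name and the attribute-access pattern are assumptions about the real Muon implementation, not code from this PR.

```python
import torch

def reset_muon_bank_momentum(optimizer) -> None:
    """Zero Muon's banked shard momentum after rolling back from warmup.

    `_bank_meta` and 'shard_mom' are the buffer names quoted in the review;
    treat them as placeholders for whatever the real optimizer uses.
    """
    bank_meta = getattr(optimizer, "_bank_meta", None)
    if bank_meta is None:
        return  # non-distributed Muon keeps momentum in state_dict only
    for meta in bank_meta.values():
        mom = meta.get("shard_mom")
        if isinstance(mom, torch.Tensor):
            mom.zero_()  # discard momentum accumulated during warmup steps
```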

Comment on lines +1063 to +1064
for name in hessians:
    hessians[name] /= num_batches


P2 Badge All-reduce GPTQ Hessians across ranks before averaging

In multi-GPU runs, calibration data is split per rank by DistributedTokenLoader, but Hessians are only divided by num_batches locally with no cross-rank reduction. Since the final artifact is written by rank 0, GPTQ is effectively calibrated from rank 0’s subset instead of the full global calibration set, making quantization quality dependent on rank/world size and underusing most calibration data. Add a distributed sum/average of each Hessian before damping and quantization.

Useful? React with 👍 / 👎.
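The suggested reduction might be sketched as follows, assuming per-rank Hessian dicts and `torch.distributed`. The helper name is hypothetical; the real code may weight ranks differently if per-rank batch counts are unequal.

```python
import torch
import torch.distributed as dist

def reduce_and_average_hessians(hessians: dict, num_batches: int) -> None:
    """Sum GPTQ Hessians across ranks (if initialized), then average.

    Each rank accumulates over `num_batches` local batches, so the global
    average divides by num_batches * world_size after the all-reduce.
    """
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    for name in hessians:
        if world_size > 1:
            # Sum the per-rank Hessians so rank 0 quantizes against
            # the full global calibration set, not just its own shard.
            dist.all_reduce(hessians[name], op=dist.ReduceOp.SUM)
        hessians[name] /= num_batches * world_size
```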

akrausscs pushed a commit to akrausscs/parameter-golf that referenced this pull request Mar 24, 2026
Three new experiment scripts forked from PR openai#634 (1.1171 BPB frontier):

- 2026-03-25_11L_LateEMA_XSA-all_GPTQ: Late EMA fix only
  Switch EMA decay 0.997→0.95 when QAT activates (~step 17K).
  Prevents EMA from averaging over the full QAT annealing curve,
  ensuring shadow weights track the final quantized α=16 state.

- 2026-03-25_11L_BOS-Reset_XSA-all_GPTQ: BOS-reset attention only
  Use flash_attn_varlen_func with cu_seqlens from BOS token positions
  to block cross-document attention at zero parameter cost.
  Also zeros SmearGate blend at BOS positions (cross-doc bleed fix).
  62.9% of val tokens have cross-doc context pollution — confirmed novel,
  no prior PR has exploited this.

- 2026-03-25_11L_BOS-Reset_LateEMA_XSA-all_GPTQ: both changes combined

run.sh: replace FOMAML modes (confirmed negative) with baseline/frontier/
        late-ema/bos-reset/combined; log tee to logs/<run_id>.log
setup.sh: smoke test now uses frontier record; add quick-start guidance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
arbyte77 added a commit to aryanbhosale/parameter-golf that referenced this pull request Mar 25, 2026
- BIGRAM_VOCAB_SIZE 10240→2048 (saves ~1M params, matches community SOTA)
- Magnitude pruning at 3% (not 10%) matching PR openai#634's validated approach
- Remove unused collect_hessians/gptq_quantize_weight (dead code, ~155 lines)
- Clean up quantize_state_dict_int8 signature

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
arbyte77 added a commit to aryanbhosale/parameter-golf that referenced this pull request Mar 25, 2026
Even 3% pruning causes 0.18 BPB degradation because QAT optimizes weights
for round-to-nearest. PR openai#634 uses pruning without QAT — different regime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kasimte pushed a commit to kasimte/parameter-golf that referenced this pull request Mar 26, 2026
Fork PR openai#634. Swap Full GPTQ for GPTQ-lite. Remove dead GPTQ code.
int5-MLP + BigramHash(8192) + MATRIX_LR=0.03.
Previous version trained for the full 600s then ran GPTQ calibration
for ~46s on top, exceeding the 600s artifact-production budget.

Fix: reserve 14s from training budget for GPTQ calibration
(gptq_reserve_ms = 14000.0). Training stops at ~586s, GPTQ takes ~10s,
total ~596s — within budget.

Fresh 3-seed results on 8xH100 SXM:
- Seed 1337: val_bpb=1.1177, artifact=15,929,433 bytes, total=595,915ms
- Seed 42:   val_bpb=1.1179, artifact=15,949,353 bytes, total=595,842ms
- Seed 7:    val_bpb=1.1179, artifact=15,946,145 bytes, total=595,889ms
- Mean: 1.1178, Std: 0.0001
@raahilshah raahilshah changed the title from "Record: 11L XSA-all + Full GPTQ + Parallel Muon + Selective Pruning (val_bpb: 1.1171)" to "Record: 11L XSA-all + Full GPTQ (Budget-Legal) + Parallel Muon + Selective Pruning (val_bpb: 1.1178, 3-seed mean)" on Mar 26, 2026
@raahilshah
Author

Updated: GPTQ Budget Fix + Fresh 3-Seed Repro

This PR has been updated to fix the GPTQ budget violation identified in issue #677 (see this comment flagging PRs #535, #544, #545, #569, #585, #593 for "GPTQ/Hessian calibration uses fineweb_train_* during evaluation").

What was wrong

The previous version trained for the full 600s wallclock, then ran GPTQ Hessian calibration on fineweb_train_* for ~46s on top — total artifact-producing compute was ~646s, exceeding the 600s budget.

What was fixed

Added gptq_reserve_ms = 14000.0 — the training loop now reserves 14s from the 600s budget for GPTQ calibration. Training stops at ~586s, GPTQ calibration takes ~10s (64 batches instead of 256), total ~596s. The log explicitly verifies: gptq:budget_check train:586128ms + gptq:9786ms = 595915ms (budget:600000ms).

Fresh results

All code, logs, and results have been replaced with fresh 3-seed runs using the fixed code:

| Seed | val_bpb | Artifact | Total Compute |
|------|---------|--------------|---------------|
| 1337 | 1.1177 | 15,929,433 B | 595,915 ms |
| 42   | 1.1179 | 15,949,353 B | 595,842 ms |
| 7    | 1.1179 | 15,946,145 B | 595,889 ms |
| Mean | 1.1178 | | |

Note: the slight BPB increase vs the previous submission (1.1171 → 1.1178) is expected — we're training for ~400 fewer steps due to the GPTQ time reservation. The submission is now fully compliant with the 600s budget.

