Record: 11L XSA-all + Full GPTQ (Budget-Legal) + Parallel Muon + Selective Pruning (val_bpb: 1.1178, 3-seed mean)#634
Conversation
…ng (val_bpb=1.1171) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4315a016dd
```python
for opt, state in zip(optimizers, initial_optimizer_states, strict=True):
    opt.load_state_dict(state)
```
Clear Muon shard momentum when restoring from warmup
These lines restore each optimizer from state_dict(), but in distributed mode Muon keeps its active momentum in _bank_meta[*]['shard_mom'] (used in Muon.step) rather than in state_dict. Because warmup calls opt.step(), those buffers are updated and then silently carried into real training, so warmup_steps>0 changes optimization even after the supposed rollback. Rebuild or zero Muon’s internal bank buffers after warmup to make the reset correct and reproducible.
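One way to make the rollback complete is to explicitly zero those bank buffers after restoring the state dicts. A minimal sketch, assuming each Muon optimizer exposes a `_bank_meta` dict whose entries hold a `'shard_mom'` tensor as described above; the helper name `reset_muon_shard_momentum` is hypothetical:

```python
def reset_muon_shard_momentum(optimizers):
    """Zero Muon's distributed shard momentum after restoring state_dicts.

    Restoring via opt.load_state_dict() does not touch the
    _bank_meta[*]['shard_mom'] buffers that Muon.step() actually updates,
    so warmup steps would otherwise leak into real training.
    """
    for opt in optimizers:
        bank_meta = getattr(opt, "_bank_meta", None)
        if not bank_meta:
            continue  # non-Muon optimizers (or single-GPU mode) keep no banks
        for meta in bank_meta.values():
            mom = meta.get("shard_mom")
            if mom is not None:
                mom.zero_()  # in-place reset preserves buffer identity/device
```

Zeroing in place (rather than reallocating) keeps whatever device/dtype bookkeeping the bank machinery holds onto intact.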
```python
for name in hessians:
    hessians[name] /= num_batches
```
All-reduce GPTQ Hessians across ranks before averaging
In multi-GPU runs, calibration data is split per rank by DistributedTokenLoader, but Hessians are only divided by num_batches locally with no cross-rank reduction. Since the final artifact is written by rank 0, GPTQ is effectively calibrated from rank 0’s subset instead of the full global calibration set, making quantization quality dependent on rank/world size and underusing most calibration data. Add a distributed sum/average of each Hessian before damping and quantization.
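The fix amounts to summing the Hessian accumulators across ranks before dividing by the global batch count. A numpy sketch of why this matters, simulating four ranks in one process; in the real code the cross-rank sum would be a `torch.distributed.all_reduce` over each Hessian tensor:

```python
import numpy as np

def hessian_acc(X):
    # GPTQ's per-layer Hessian is proportional to sum_i x_i x_i^T over
    # calibration inputs; constant factors cancel for quantization, so we
    # track the raw accumulator X^T X.
    return X.T @ X

rng = np.random.default_rng(0)
# 4 "ranks", each seeing 2 calibration batches of shape (tokens, features)
batches = [[rng.normal(size=(8, 4)) for _ in range(2)] for _ in range(4)]

# Buggy: rank 0 averages only its own accumulator, and rank 0's result is
# what ends up in the saved artifact
local = sum(hessian_acc(b) for b in batches[0]) / len(batches[0])

# Fixed: sum accumulators across ranks (the all_reduce), then divide by the
# global batch count
global_sum = sum(hessian_acc(b) for rank in batches for b in rank)
fixed = global_sum / sum(len(rank) for rank in batches)

# The fixed estimate equals calibrating on the full dataset at once
all_data = np.concatenate([b for rank in batches for b in rank], axis=0)
reference = hessian_acc(all_data) / 8  # 8 batches total
assert np.allclose(fixed, reference)
assert not np.allclose(local, reference)  # rank-0-only estimate differs
```

Because the accumulator is a plain sum, summing across ranks and dividing once is exactly equivalent to calibrating on the full global set, independent of world size.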
Three new experiment scripts forked from PR openai#634 (1.1171 BPB frontier):

- 2026-03-25_11L_LateEMA_XSA-all_GPTQ: Late EMA fix only. Switch EMA decay 0.997→0.95 when QAT activates (~step 17K). Prevents EMA from averaging over the full QAT annealing curve, ensuring shadow weights track the final quantized α=16 state.
- 2026-03-25_11L_BOS-Reset_XSA-all_GPTQ: BOS-reset attention only. Use flash_attn_varlen_func with cu_seqlens from BOS token positions to block cross-document attention at zero parameter cost. Also zeros SmearGate blend at BOS positions (cross-doc bleed fix). 62.9% of val tokens have cross-doc context pollution; confirmed novel, no prior PR has exploited this.
- 2026-03-25_11L_BOS-Reset_LateEMA_XSA-all_GPTQ: both changes combined.

run.sh: replace FOMAML modes (confirmed negative) with baseline/frontier/late-ema/bos-reset/combined; tee logs to logs/<run_id>.log
setup.sh: smoke test now uses frontier record; add quick-start guidance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
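The BOS-reset scheme above boils down to building `cu_seqlens` (cumulative document boundaries) from BOS token positions in the packed batch. A numpy sketch; the helper name `bos_cu_seqlens` is hypothetical, and in the real code the result would be an int32 CUDA tensor passed as `cu_seqlens_q`/`cu_seqlens_k` to `flash_attn_varlen_func`:

```python
import numpy as np

def bos_cu_seqlens(tokens, bos_id):
    """Cumulative sequence lengths for varlen attention from BOS positions.

    tokens: 1-D array of token ids for a flattened (packed) batch.
    Each BOS token starts a new document, so attention never crosses a BOS
    boundary. Returns [0, d1, d1+d2, ..., total_len].
    """
    tokens = np.asarray(tokens)
    starts = np.flatnonzero(tokens == bos_id)
    if len(starts) == 0 or starts[0] != 0:
        # a packed batch may begin mid-document; treat position 0 as a start
        starts = np.concatenate(([0], starts))
    return np.concatenate((starts, [len(tokens)])).astype(np.int32)

# Example: BOS=1, three documents of lengths 3, 4, 2 packed together
cu = bos_cu_seqlens([1, 5, 6, 1, 7, 8, 9, 1, 2], bos_id=1)
assert cu.tolist() == [0, 3, 7, 9]
```

Since the boundaries are derived purely from existing BOS tokens, this blocks cross-document attention at zero parameter cost, as the commit message claims.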
- BIGRAM_VOCAB_SIZE 10240→2048 (saves ~1M params, matches community SOTA)
- Magnitude pruning at 3% (not 10%), matching PR openai#634's validated approach
- Remove unused collect_hessians/gptq_quantize_weight (dead code, ~155 lines)
- Clean up quantize_state_dict_int8 signature

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Even 3% pruning causes 0.18 BPB degradation because QAT optimizes weights for round-to-nearest. PR openai#634 uses pruning without QAT, which is a different regime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fork PR openai#634. Swap Full GPTQ for GPTQ-lite. Remove dead GPTQ code. int5-MLP + BigramHash(8192) + MATRIX_LR=0.03.
Previous version trained for the full 600s then ran GPTQ calibration for ~46s on top, exceeding the 600s artifact-production budget.

Fix: reserve 14s from the training budget for GPTQ calibration (gptq_reserve_ms = 14000.0). Training stops at ~586s, GPTQ takes ~10s, total ~596s, within budget.

Fresh 3-seed results on 8xH100 SXM:
- Seed 1337: val_bpb=1.1177, artifact=15,929,433 bytes, total=595,915ms
- Seed 42: val_bpb=1.1179, artifact=15,949,353 bytes, total=595,842ms
- Seed 7: val_bpb=1.1179, artifact=15,946,145 bytes, total=595,889ms
- Mean: 1.1178, Std: 0.0001
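The budget arithmetic behind the fix is simple deadline math. A sketch of the shape of the check, with hypothetical names (only gptq_reserve_ms = 14000.0 and the 600s budget come from this PR):

```python
GPTQ_RESERVE_MS = 14_000.0   # from this PR: gptq_reserve_ms = 14000.0
TOTAL_BUDGET_MS = 600_000.0  # 600s artifact-production budget

def train_deadline_ms(start_ms):
    """Latest time (ms) at which the training loop must stop so that GPTQ
    calibration and artifact writing still fit inside the 600s budget."""
    return start_ms + TOTAL_BUDGET_MS - GPTQ_RESERVE_MS

# With a zero clock origin, training stops at 586,000 ms (~586s),
# leaving ~14s for GPTQ calibration; observed totals land around ~596s.
assert train_deadline_ms(0.0) == 586_000.0
```

The reserve is deliberately larger than the observed ~10s GPTQ runtime, leaving slack for seed-to-seed variance in calibration time.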
Updated: GPTQ Budget Fix + Fresh 3-Seed Repro

This PR has been updated to fix the GPTQ budget violation identified in issue #677 (see this comment flagging PRs #535, #544, #545, #569, #585, #593 for "GPTQ/Hessian calibration uses fineweb_train_* during evaluation").

What was wrong: The previous version trained for the full 600s wallclock, then ran GPTQ Hessian calibration for ~46s on top, exceeding the artifact-production budget.

What was fixed: Added gptq_reserve_ms = 14000.0 to reserve 14s of the training budget for calibration, keeping total compute within 600s.

Fresh results: All code, logs, and results have been replaced with fresh 3-seed runs using the fixed code:
Note: the slight BPB increase vs the previous submission (1.1171 → 1.1178) is expected: we're training for ~400 fewer steps due to the GPTQ time reservation. The submission is now fully compliant with the 600s budget.
Record: 11L XSA-all + Full GPTQ (Budget-Legal) + Parallel Muon + Selective Pruning
val_bpb: 1.1178 (3-seed mean, std 0.0001) | 15.95 MB max artifact | 8xH100 SXM, ~596s total compute
Update (2026-03-26)
This PR was updated to fix a GPTQ budget violation identified in issue #677. The previous version trained for the full 600s, then ran GPTQ calibration for ~46s on top. The fix reserves 14s from the training budget (gptq_reserve_ms = 14000.0), ensuring total compute stays within 600s. See the update comment for details.

Results (3 seeds, 8xH100 SXM)
Mean: 1.1178 | Std: 0.0001
Key Techniques
Architecture
11L, 512d, 8H/4KV (GQA), MLP 3x LeakyReLU(0.5)², XSA-all-11, Partial RoPE 16/64, LN Scale, VE128, SmearGate, BigramHash(2048), U-Net skips, EMA(0.997), Tight SWA, Late QAT@0.15, Full GPTQ int6 + LZMA, Parallel Muon + Parameter Banking, FA3 Hopper.
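The EMA(0.997) entry above is the standard shadow-weight update; a minimal sketch, including the late-EMA decay switch (0.997→0.95 at QAT activation) discussed in the commit messages earlier. Function and parameter names here are hypothetical:

```python
def ema_update(shadow, weights, step, qat_start_step,
               decay=0.997, late_decay=0.95):
    """Exponential moving average of weights. Switch to a faster decay once
    QAT activates so the shadow tracks the final quantized state instead of
    averaging over the whole QAT annealing curve."""
    d = late_decay if step >= qat_start_step else decay
    return [d * s + (1.0 - d) * w for s, w in zip(shadow, weights)]

# Pre-QAT step: slow decay 0.997 dominates
shadow = ema_update([0.0], [1.0], step=100, qat_start_step=17_000)
assert abs(shadow[0] - 0.003) < 1e-12
```

A faster decay after QAT kicks in weights recent (quantization-aware) parameters far more heavily, which is the entire point of the late-EMA variant.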
Compliance
gptq:budget_checkin logs)