
Record: 1.1122 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-seed mean)#1060

Open
dexhunter wants to merge 5 commits into openai:main from dexhunter:submission/2026-03-29-loader-fullgptq-xsa11

Conversation


@dexhunter dexhunter commented Mar 29, 2026

Summary

What's New

  1. Coprime-stride multi-shard data pipeline (PR #726 style) — diverse batches from coprime-stride block sampling across shards
  2. Full Hessian GPTQ (PR #634 / PR #1019 style) — Cholesky error compensation replaces GPTQ-lite
  3. XSA on all 11 layers — extended from the last 4
  4. No TTT — sliding-only outperforms TTT on this stack (confirmed independently by PR #1019)
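The coprime-stride idea in item 1 can be illustrated with a small sketch. Stepping through a shard's blocks with a stride coprime to the block count yields a full permutation (every block visited exactly once) while decorrelating consecutive batches. The function name and stride-selection rule below are hypothetical, not the PR's actual code:

```python
import math
import random

def coprime_stride_order(n_blocks, seed=0):
    """Visit all blocks of a shard by stepping with a stride coprime to
    the block count. Because gcd(stride, n_blocks) == 1, the walk is a
    permutation: every block index appears exactly once."""
    rng = random.Random(seed)
    # pick a random stride coprime to n_blocks (illustrative selection rule)
    stride = rng.randrange(1, n_blocks)
    while math.gcd(stride, n_blocks) != 1:
        stride = rng.randrange(1, n_blocks)
    start = rng.randrange(n_blocks)
    return [(start + i * stride) % n_blocks for i in range(n_blocks)]
```

In a multi-shard setting, drawing each shard's blocks in its own coprime-stride order gives diverse batches without the memory cost of a global shuffle.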

3-Seed Results

| Seed | Sliding BPB | Artifact (bytes) |
|------|-------------|------------------|
| 1337 | 1.1118 | 15,973,962 |
| 42 | 1.1127 | 15,980,438 |
| 2025 | 1.1121 | 15,983,626 |
| **Mean** | **1.1122** | |

Compliance

  • 3-seed verification, all under budget
  • Standard F.cross_entropy scoring (no mixer, no cache)
  • Artifact < 16,000,000 bytes (all seeds)
  • Training < 600s, eval < 600s
  • No TTT — pure sliding window evaluation

See README.md for full details.
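The full-Hessian GPTQ in item 2 quantizes weight columns one at a time and pushes each column's quantization error onto the not-yet-quantized columns, scaled by the Cholesky factor of the inverse Hessian. A minimal numpy sketch of that idea (not this record's actual `train_gpt.py` code; `scale` and `damp` are illustrative):

```python
import numpy as np

def quantize(x, scale):
    """Round to the nearest point on a uniform grid of step `scale`."""
    return np.round(x / scale) * scale

def gptq_full_hessian(W, X, scale=0.05, damp=0.01):
    """Minimal GPTQ-style column-wise quantization with Cholesky-based
    error compensation. W: (rows, cols) weights; X: (cols, samples)
    calibration inputs."""
    W = W.copy()
    cols = W.shape[1]
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(cols)  # damping for stability
    Hinv = np.linalg.inv(H)
    # upper-triangular factor U with Hinv = U.T @ U
    U = np.linalg.cholesky(Hinv).T
    Q = np.zeros_like(W)
    for j in range(cols):
        q = quantize(W[:, j], scale)
        Q[:, j] = q
        err = (W[:, j] - q) / U[j, j]
        # spread this column's error onto the remaining columns
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])
    return Q
```

The "lite" variant this replaces would skip the compensation step and simply round each column, which ignores cross-column correlations in the Hessian.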

3-seed mean val_bpb: 1.1123 (std 0.0005)
All artifacts under 16MB, all eval under 600s.

Key changes from PR openai#549:
- Coprime-stride multi-shard data pipeline (PR openai#726 style)
- Full Hessian GPTQ with Cholesky error compensation
- XSA on all 11 layers
- BigramHash(2816×112)
- No TTT (sliding-only outperforms on this stack)

Built on PR openai#549 by @abaybektursun.
Seed logs now generated with the same 96,398-byte train_gpt.py that ships
in this record. Previous logs were from the pre-strip 111,130-byte version.

Updated results:
  Seed 1337: 1.1118 BPB, 15,973,962 bytes
  Seed 42:   1.1127 BPB, 15,980,438 bytes
  Seed 2025: 1.1121 BPB, 15,983,626 bytes
  Mean: 1.1122 ± 0.0004
@dexhunter changed the title from "Record: 1.1123 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-seed mean)" to "Record: 1.1122 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-seed mean)" on Mar 29, 2026
@dexhunter
Author

Updated: re-verified all 3 seeds with the stripped train_gpt.py (96,398 bytes) that ships in this record. Previous logs were generated with a pre-strip version (111,130 bytes) that included unused code paths. Scores are unchanged — 3-seed mean 1.1122 ± 0.0004, all artifacts under 16MB. Code size and logs are now fully consistent.

resouer pushed a commit to resouer/parameter-golf that referenced this pull request Mar 29, 2026
@dexhunter
Author

Follow-up cleanup for the stripped submission artifacts only.

What changed:

  • replaced the bundled train_seed1337.log short extract with the clean extract from the actual stripped-code run log
  • clarified in the record README that all 3 bundled seed results and the included train_gpt.py are from the stripped submission script (Code size: 96,398 bytes)
  • clarified reproduction from within the records folder and tightened the eval/rule-compliance wording

Why:

  • the previous train_seed1337.log extract accidentally included a launcher traceback / truncated preamble from an earlier invocation, which made the record bundle look inconsistent even though the underlying stripped run was valid
  • there is no model/code/score change here; all 3 seeds already match the stripped script, and the recorded metrics are unchanged

I re-ran the local rule checker on all 3 bundled logs after the cleanup and they pass cleanly.

icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 29, 2026
Competition moved while we were experimenting locally:
  PR openai#634: 1.1178 BPB (Full GPTQ + XSA-all + selective pruning)
  PR openai#1060: 1.1122 BPB (+ coprime loader + BigramHash 2816)

Our contribution: TTT periodic reset on the PR openai#1060 base.
PR openai#1060 found TTT unnecessary with Full GPTQ, but they
didn't test TTT with anti-drift reset. If TTT drift was the
reason it stopped helping, reset could unlock further gains.

Files:
  train_gpt_ours.py  — PR openai#1060 + TTT reset mechanism
  train_gpt_pr634.py — Full GPTQ reference (for study)
  train_gpt_pr1060.py — Original PR openai#1060 (for comparison)
  run_h100.sh — Train once, sweep 4 TTT configs

TTT configs tested:
  A: SOTA (lr=0.002, 3ep) — baseline TTT
  B: PR openai#1039 (lr=0.0025, 4ep) — tuned TTT
  C: B + reset/100 — anti-drift, moderate
  D: B + reset/50 — anti-drift, aggressive

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
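The periodic-reset mechanism behind configs C and D above can be sketched as a TTT loop that snaps the weights back to the trained checkpoint every `reset_every` updates, bounding how far test-time training can drift. This is a hypothetical illustration of the idea, not icryo's actual code:

```python
import numpy as np

def ttt_with_reset(w0, grads, lr=0.0025, reset_every=100):
    """Test-time training with anti-drift reset: apply SGD updates from a
    stream of gradients, restoring the original checkpoint w0 every
    `reset_every` steps so accumulated drift stays bounded."""
    w = w0.copy()
    for step, g in enumerate(grads, start=1):
        w -= lr * g
        if step % reset_every == 0:
            w = w0.copy()  # anti-drift reset to the trained checkpoint
    return w
```

If TTT stopped helping on the Full-GPTQ stack because of drift rather than because adaptation itself is useless, a small `reset_every` (config D) should recover some of the gain.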
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 29, 2026
…-gram invalidation

- PR openai#771 closed (rule violation: multi-epoch TTT re-scored same eval tokens)
- N-gram eval cache banned: 33+ PRs closed by @valerio-oai on 2026-03-27 due to
  normalization bug; correct n-gram achieves ~1.51 BPB (worse than baseline)
- Update merged SOTA to 1.1194 (PR openai#549, was 1.1228)
- New target: PR openai#1060 (1.1122) — Full Hessian GPTQ + XSA-all + Coprime-stride
- Add Lessons 17-20 and v8.0 strategy to CLAUDE.md
- Add 2026-03-29 daily research report to logs/daily_research.md

https://claude.ai/code/session_01GabEptdqRohHFtkKNZNL17
Bortlesboat added a commit to Bortlesboat/parameter-golf that referenced this pull request Mar 29, 2026
…(3-seed mean)

3-seed results: 1.1136/1.1133/1.1139 (mean 1.1136, std 0.0003)
Built on PR openai#549 + PR openai#1060 with optimized GPTQ reserve (10s vs 14s).
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 29, 2026
… reset

Combines the best of three approaches:
  PR openai#1060 (1.1122): coprime loader + Full GPTQ + XSA-all
  PR openai#1072 (1.117):  fused Triton MLP (matmul+activation, 70ms/step)
  Ours:              TTT periodic reset (anti-drift)

Expected: ~7900 steps (vs 6700) with PR openai#1060 quality innovations
= best training throughput + best quantization + best eval.

Fused MLP kernel from PR openai#1072 uses TMA TensorDescriptors (H100 only).
Falls back to standard path on non-Hopper GPUs.

TTT sweep tests 4 configs on the same trained checkpoint:
  sota_ttt, pr1039, reset/100, reset/50

Total H100 time: ~10min train + 4×7min TTT ≈ 40 min

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gusanidas added a commit to Gusanidas/parameter-golf that referenced this pull request Mar 30, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AnirudhRahul pushed a commit to AnirudhRahul/parameter-golf that referenced this pull request Mar 30, 2026
…Agreement

Package the validated three-seed rerun of the PR openai#1060-derived Loader FullGPTQ XSA11 stack with the online causal ngram agreement evaluator. Include the runnable record folder, benchmark log, and submission metadata for the under-10-minute eval path.

Made-with: Cursor
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 31, 2026
Critical realization: our ported innovations (EngramLite, gated skips,
LeakyReLU(0.3)², Turbo-Muon) HURT by 0.003 BPB vs the PR openai#1060 baseline.
PR openai#1060 gets 1.1122, our merged version gets 1.1151. The partial port
of PR openai#1089 innovations doesn't capture their interactions.

Clean path to record: run PR openai#1060's exact code with FA3 + GPTQ_RESERVE=9s.
Expected: 1.1115-1.1122 BPB (well below 1.1144 threshold).
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 31, 2026
Complete pipeline to beat openai#1 (1.0806 BPB):
- train_gpt_scylla_stack.py: PR openai#1060 + metadata-based tokenizer loading
- retokenize.py: TokenMonster retokenization of FineWeb
- deploy_scylla.sh: two-phase deploy (retokenize once, train many)

Strategy: PR openai#1143 used old stack. We use PR openai#1060's modern stack
(GPTQ, XSA-all, coprime loader) on the same Scylla tokenizer.
Expected: ~1.070-1.080 BPB (beating both openai#1143 and openai#1089).
