
Record: Late Soft-Round QAT + Score-First Backward-Looking TTT — val_bpb 1.1178#589

Closed
RoyiRa wants to merge 1 commit into openai:main from RoyiRa:submission-late-soft-round-qat

Conversation


@RoyiRa RoyiRa commented Mar 23, 2026

Summary

  • val_bpb: 1.1178 (3-seed mean, std 0.0005)
  • Artifact: ~15.75 MB (under 16 MB limit)
  • 8×H100 SXM, train ≤ 10 min, eval ≤ 10 min

Novel Contribution: Late Soft-Round QAT

Standard STE quantization-aware training uses hard rounding in the forward pass and an identity surrogate in the backward pass, which provides no bin-aware gradient signal near quantization boundaries. Late in training we swap the identity surrogate for a temperature-controlled soft-round surrogate, giving the optimizer a non-zero, bin-aware gradient that encourages weights to settle onto nearby int6 grid points just before EMA/SWA finalization.
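A minimal sketch of the idea in PyTorch, assuming a custom autograd function (the class name, the sigmoid-based soft-round form, and the temperature value are illustrative, not taken from the PR's code):

```python
import torch


class LateSoftRoundSTE(torch.autograd.Function):
    """Hard round in the forward pass; soft-round gradient in the
    backward pass. A sketch of the technique, not the PR's implementation."""

    @staticmethod
    def forward(ctx, x, temperature):
        ctx.save_for_backward(x)
        ctx.temperature = temperature
        return torch.round(x)  # forward stays the usual hard rounding

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        t = ctx.temperature
        frac = x - torch.floor(x)   # position within the bin, in [0, 1)
        z = (frac - 0.5) / t        # centered on the bin boundary
        s = torch.sigmoid(z)
        # Derivative of floor(x) + sigmoid((frac - 0.5) / t) w.r.t. x:
        # near zero at grid points, peaked at bin boundaries, so weights
        # already close to a grid point stop moving and "settle" there.
        return grad_output * s * (1.0 - s) / t, None
```

In use, the weight would first be scaled into the int6 grid (e.g. `LateSoftRoundSTE.apply(w / scale, 0.1).clamp(-32, 31) * scale`), with the soft-round backward enabled only for the final slice of training; as the temperature shrinks, the backward gradient approaches the hard-round limit.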

Score-First Backward-Looking TTT

Backward-looking adaptation follows PR #461: the validation tokens are split into ~1,893 non-overlapping 32K-token chunks. Each chunk is first scored under torch.inference_mode(), then trained on with SGD (cosine-decayed lr=0.002, momentum=0.9, 3 epochs), so chunk N is scored by a model adapted only on chunks 0..N-1.
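The score-then-adapt loop can be sketched as follows, assuming `model(chunk)` returns the mean loss for a chunk (the function name and the `Toy`-style model interface are assumptions; the lr, momentum, and epoch values are taken from the PR text):

```python
import torch


def backward_looking_ttt(model, chunks, lr=0.002, momentum=0.9, epochs=3):
    """Score-first backward-looking TTT sketch: chunk N is scored by a
    model adapted only on chunks 0..N-1, so no score ever sees its own
    chunk's adaptation."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=len(chunks) * epochs
    )
    losses = []
    for chunk in chunks:
        # 1) Score first, under inference mode: the model has only been
        #    adapted on earlier chunks at this point.
        with torch.inference_mode():
            losses.append(model(chunk).item())
        # 2) Then adapt on the same chunk, for the benefit of later chunks.
        for _ in range(epochs):
            opt.zero_grad()
            model(chunk).backward()
            opt.step()
            sched.step()
    return losses
```

Scoring before adapting is what keeps the procedure honest: the reported bpb on each chunk never includes any update computed from that chunk.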

Results

| Seed | Pre-TTT bpb | Post-TTT bpb | TTT gain | Artifact (bytes) |
|------|-------------|--------------|----------|------------------|
| 1337 | 1.1201 | 1.1176 | -0.0025 | 15,700,318 |
| 42 | 1.1209 | 1.1183 | -0.0026 | 15,850,153 |
| 7 | 1.1200 | 1.1174 | -0.0026 | 15,706,617 |
| Mean | 1.1203 | 1.1178 | -0.0026 | |

Credits

…bpb 1.1178

3-seed mean: 1.1178 BPB (std 0.0005), ~15.75 MB artifact, 8×H100 SXM.

Novel contribution: Late Soft-Round QAT — replaces STE identity surrogate
with sigmoid soft-round in the backward pass during the final 2% of training,
giving bin-aware gradients that settle weights onto int6 grid points.

Built on PR openai#414 (base model), PR openai#461 (TTT recipe), PR openai#493 (LeakyReLU²).
Robby955 added a commit to Robby955/parameter-golf that referenced this pull request Mar 24, 2026
- Base: PR openai#589 architecture (11L GEPA, VE128, XSA, SWA, Late QAT)
- New: Empirical Bayes Adaptive TTT (per-layer gradient SNR scaling)
- New: Embedding freeze during TTT
- Result: 1.1185 BPB on 8xH100 SXM (6909 steps, 15.81 MB artifact)
@valerio-oai (Contributor) commented:
This looks valid, but at the time of merging the chronological SOTA is PR #549, which this beats by less than the required 0.005 nats threshold.
