
Record: Late Soft-Round QAT + Score-First Backward-Looking TTT — val_bpb 1.1178#589

Closed
RoyiRa wants to merge 1 commit into openai:main from RoyiRa:submission-late-soft-round-qat

Conversation


@RoyiRa RoyiRa commented Mar 23, 2026

Summary

  • val_bpb: 1.1178 (3-seed mean, std 0.0005)
  • Artifact: ~15.75 MB (under 16 MB limit)
  • 8×H100 SXM, train ≤ 10 min, eval ≤ 10 min

Novel Contribution: Late Soft-Round QAT

Standard STE quantization-aware training uses hard rounding in the forward pass and an identity surrogate in the backward pass, which provides no bin-aware gradient signal near quantization boundaries. Late in training we swap the identity surrogate for a temperature-controlled soft-round surrogate, giving the optimizer a non-zero, bin-aware gradient that encourages weights to settle onto nearby int6 grid points just before EMA/SWA finalization.
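A minimal sketch of the idea in PyTorch, assuming a custom autograd function (the class name, the sigmoid-based soft-round form, and the temperature value are illustrative, not taken from the PR's code):

```python
import torch


class LateSoftRoundSTE(torch.autograd.Function):
    """Hard round in the forward pass; soft-round gradient in the
    backward pass. A sketch of the technique, not the PR's implementation."""

    @staticmethod
    def forward(ctx, x, temperature):
        ctx.save_for_backward(x)
        ctx.temperature = temperature
        return torch.round(x)  # forward stays the usual hard rounding

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        t = ctx.temperature
        frac = x - torch.floor(x)   # position within the bin, in [0, 1)
        z = (frac - 0.5) / t        # centered on the bin boundary
        s = torch.sigmoid(z)
        # Derivative of floor(x) + sigmoid((frac - 0.5) / t) w.r.t. x:
        # near zero at grid points, peaked at bin boundaries, so weights
        # already close to a grid point stop moving and "settle" there.
        return grad_output * s * (1.0 - s) / t, None
```

In use, the weight would first be scaled into the int6 grid (e.g. `LateSoftRoundSTE.apply(w / scale, 0.1).clamp(-32, 31) * scale`), with the soft-round backward enabled only for the final slice of training; as the temperature shrinks, the backward gradient approaches the hard-round limit.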

Score-First Backward-Looking TTT

Backward-looking adaptation follows PR #461: the validation tokens are split into ~1,893 non-overlapping 32K-token chunks. Each chunk is first scored under torch.inference_mode(), then trained on with SGD (cosine-decayed lr=0.002, momentum=0.9, 3 epochs), so chunk N is scored by a model adapted only on chunks 0..N-1.
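The score-then-adapt loop can be sketched as follows, assuming `model(chunk)` returns the mean loss for a chunk (the function name and the `Toy`-style model interface are assumptions; the lr, momentum, and epoch values are taken from the PR text):

```python
import torch


def backward_looking_ttt(model, chunks, lr=0.002, momentum=0.9, epochs=3):
    """Score-first backward-looking TTT sketch: chunk N is scored by a
    model adapted only on chunks 0..N-1, so no score ever sees its own
    chunk's adaptation."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=len(chunks) * epochs
    )
    losses = []
    for chunk in chunks:
        # 1) Score first, under inference mode: the model has only been
        #    adapted on earlier chunks at this point.
        with torch.inference_mode():
            losses.append(model(chunk).item())
        # 2) Then adapt on the same chunk, for the benefit of later chunks.
        for _ in range(epochs):
            opt.zero_grad()
            model(chunk).backward()
            opt.step()
            sched.step()
    return losses
```

Scoring before adapting is what keeps the procedure honest: the reported bpb on each chunk never includes any update computed from that chunk.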

Results

| Seed | Pre-TTT bpb | Post-TTT bpb | TTT gain | Artifact (bytes) |
|------|-------------|--------------|----------|------------------|
| 1337 | 1.1201 | 1.1176 | -0.0025 | 15,700,318 |
| 42 | 1.1209 | 1.1183 | -0.0026 | 15,850,153 |
| 7 | 1.1200 | 1.1174 | -0.0026 | 15,706,617 |
| Mean | 1.1203 | 1.1178 | -0.0026 | |

Credits

…bpb 1.1178

3-seed mean: 1.1178 BPB (std 0.0005), ~15.75 MB artifact, 8×H100 SXM.

Novel contribution: Late Soft-Round QAT — replaces STE identity surrogate
with sigmoid soft-round in the backward pass during the final 2% of training,
giving bin-aware gradients that settle weights onto int6 grid points.

Built on PR openai#414 (base model), PR openai#461 (TTT recipe), PR openai#493 (LeakyReLU²).
Robby955 added a commit to Robby955/parameter-golf that referenced this pull request Mar 24, 2026
- Base: PR openai#589 architecture (11L GEPA, VE128, XSA, SWA, Late QAT)
- New: Empirical Bayes Adaptive TTT (per-layer gradient SNR scaling)
- New: Embedding freeze during TTT
- Result: 1.1185 BPB on 8xH100 SXM (6909 steps, 15.81 MB artifact)
@valerio-oai (Contributor) commented:
This looks valid, but at the time of merging the chronological SOTA is PR #549, which this beats by less than the required 0.005 nats threshold.
