Non-record: Empirical Bayes Adaptive TTT (val_bpb=1.1185)#484
Open
Robby955 wants to merge 7 commits into openai:main from
Conversation
Non-record entry exploring learned layer-sharing patterns via James-Stein shrinkage estimators. Three shared blocks x 3 virtual layers with per-layer LoRA deviations gated by learned shrinkage gammas.

Key findings:
- MLP gammas converge to 0.0000 (fully shared) across all virtual layers
- Attention retains trace specialization (gamma ~0.004) in early layers only
- Quantization error amplifies multiplicatively in depth-recurrent architectures (0.19 BPB compiled-vs-eager gap from 15 passes through shared blocks)
- LoRA rank 8 forces full sharing; rank 16 permits mild deviation (0.01-0.05)

Pre-quant BPB (1.2105) beats the baseline (1.2244) despite fewer steps (4572 vs 13780). Post-quant BPB (1.3441) is limited by quantization error amplification in the recurrent architecture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the EBLS-only submission with a combined approach:
- SwiGLU MLP (mult=2.0) replacing ReLU-squared
- EMA (decay=0.9985) replacing SWA
- Eval-time AdamW TTT on MLP weights
- Mixed int5/int6 quantization with 5% pruning

Post-quant BPB: 1.1746 (H100 NVL, 3547 steps). Artifact: 15.9 MB (under the 16 MB limit).

Retains the EBLS findings: gamma convergence, MLP sharing asymmetry, quantization error amplification in depth-recurrent architectures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Improved from 1.1746 to 1.1679 BPB (post-quant sliding eval)
- Int5 quantization for all weight categories (was mixed int5/int6)
- 5116 steps on 8xH100 SXM at 110 ms/step (was 3521 on NVL)
- Artifact: 15.1 MB (down from 15.9 MB)
- Document a TTT negative result: per-window adaptation degrades quality (a batch-leak bug was found and fixed, but honest TTT does not help)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Sequential score-then-train TTT (3 epochs, batched 8 chunks)
- Report sliding-window BPB on adapted weights (1.0476), not TTT-loop BPB (1.1032)
- Full memorization analysis: 3 epochs = domain adaptation, 10 epochs = memorization
- Freeze embeddings during TTT; adapt attention + MLP only
- Artifact: 15.18 MB; eval: 91 s TTT + 233 s diagnostic

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
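The embedding freeze above can be sketched as a simple parameter partition. This is a minimal illustration, not the PR's code: the parameter-name prefixes (`tok_embeddings`, `bigram`, `lm_head`) are hypothetical stand-ins for whatever the model's modules are actually called.

```python
import os

# Hypothetical name prefixes; the real module names in the model may differ.
FROZEN_PREFIXES = ("tok_embeddings", "bigram", "lm_head")

def ttt_param_groups(named_params):
    """Split parameter names into (frozen, adaptable) groups for test-time
    training. With TTT_FREEZE_EMBEDDINGS=1, embeddings, the bigram hash,
    and lm_head stay fixed; attention/MLP weights are adapted."""
    freeze = os.environ.get("TTT_FREEZE_EMBEDDINGS", "1") == "1"
    frozen, adaptable = [], []
    for name, _param in named_params:
        if freeze and name.startswith(FROZEN_PREFIXES):
            frozen.append(name)
        else:
            adaptable.append(name)
    return frozen, adaptable
```

In a real TTT loop, only the adaptable group would be handed to the optimizer (and the frozen group would have `requires_grad` disabled).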
Key improvements:
- 5-epoch global cosine LR decay (a single curve across all epochs)
- Per-layer TTT LR multipliers (later layers adapt faster)
- Peak LR 7e-4 (up from 5e-4)

Results reproduced across two independent pods:
- Run 9: sliding 1.0022, TTT-loop 1.0101 (gap 0.008)
- Run 10: sliding 1.0028, TTT-loop 1.0106 (gap 0.008)

The memorization diagnostic confirms genuine adaptation: sliding BPB < TTT-loop BPB at 5 epochs with cosine decay.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
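The schedule described above can be sketched as follows. The global cosine curve (one decay spanning all epochs, peak 7e-4) is from the commit; the linear ramp of per-layer multipliers is an assumption for illustration, since the commit only says later layers adapt faster.

```python
import math

def ttt_lr(step, total_steps, layer_idx, n_layers, peak_lr=7e-4):
    """TTT learning rate: a single cosine decay across all epochs
    (not restarted per epoch), scaled by a per-layer multiplier so
    later layers adapt faster. The 0.5x..1.5x linear ramp is an
    assumed placeholder, not the PR's actual multipliers."""
    cosine = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    layer_mult = 0.5 + layer_idx / max(n_layers - 1, 1)  # 0.5x .. 1.5x
    return peak_lr * cosine * layer_mult
```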
…research

Multi-epoch TTT is invalid per the Issue openai#402 ruling. Our verified score is 1.1679 BPB from a standard sliding-window eval without TTT. The multi-epoch TTT experiments (reaching 1.00 BPB) are retained as a research contribution showing how to diagnose memorization in TTT submissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Base: PR openai#589 architecture (11L GEPA, VE128, XSA, SWA, Late QAT)
- New: Empirical Bayes Adaptive TTT (per-layer gradient SNR scaling)
- New: Embedding freeze during TTT
- Result: 1.1185 BPB on 8xH100 SXM (6909 steps, 15.81 MB artifact)
Summary
Verified val_bpb = 1.1185 (8xH100 SXM, 6909 steps at 86.8ms/step, artifact 15.81 MB).
Built on PR #589's frontier architecture (11L GEPA, VE128, XSA, SWA, Late Soft-Round QAT, Bigram, SmearGate). Our contribution is Empirical Bayes Adaptive Test-Time Training (EB-TTT): a per-layer gradient SNR scaling method that improves TTT by 0.0015 BPB over baseline SGD.
Results
Research Contribution: EB-Adaptive TTT
Standard TTT applies a uniform SGD learning rate to all layers. We observe that different layers have different gradient signal-to-noise ratios (SNR) when adapting to local validation chunks. EB-TTT scales each layer's gradient by its SNR.
Interpretation: High SNR (consistent gradient direction) means the layer has a clear adaptation signal — amplify it. Low SNR (noisy gradients) means the layer should stay closer to the trained prior — attenuate it.
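The amplify/attenuate rule above can be sketched as a per-layer scale computed from EMA gradient statistics. This is a hedged reconstruction of the idea, not the PR's implementation: the exact SNR estimator, EMA decay, and cap are assumptions.

```python
import numpy as np

def snr_scales(layer_grads, state=None, beta=0.9, eps=1e-8, cap=2.0):
    """Per-layer gradient SNR scales (a sketch of the EB-TTT idea).

    layer_grads: list of flat gradient arrays, one per layer.
    state: per-layer EMAs (m, v) of the gradient and squared gradient.
    Scale_i = mean|m_i| / std_i, capped at `cap`: consistent gradient
    direction (high SNR) amplifies the update; noisy gradients (low SNR)
    shrink the layer toward its trained prior.
    """
    if state is None:
        m = [g.copy() for g in layer_grads]
        v = [g * g for g in layer_grads]
    else:
        m, v = state
        for i, g in enumerate(layer_grads):
            m[i] = beta * m[i] + (1 - beta) * g
            v[i] = beta * v[i] + (1 - beta) * g * g
    scales = []
    for mi, vi in zip(m, v):
        noise = np.sqrt(np.clip(vi - mi ** 2, 0, None).mean())
        signal = np.abs(mi).mean()
        scales.append(min(signal / (noise + eps), cap))
    return scales, (m, v)
```

Each layer's SGD update would then be `lr * scale_i * grad_i`, so a layer with a stable adaptation signal moves further per step than one whose gradients mostly cancel.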
Learned Layer Scales (stable across runs)
Middle layers (blocks.5-8) adapt most aggressively, while boundary layers (blocks.0, 10) stay closer to the prior. This matches intuition: early/late layers handle general tokenization/prediction, while middle layers specialize to the local data distribution.
Additional TTT Improvements
Embedding freeze (TTT_FREEZE_EMBEDDINGS=1): prevents vocabulary-embedding distortion when adapting to local 32K-token chunks. Token embeddings, the bigram hash, and lm_head are frozen.

Ablation Results (all on 8xH100 SXM)
Negative Results
Architecture
11 layers, 512-dim, 8 heads, 4 KV heads. GEPA attention, VE128, XSA (last 4 layers), SWA, Late Soft-Round QAT, BigramHash, SmearGate. int6+zstd quantization. Single-pass score-first TTT with EB-adaptive per-layer scaling.
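The int6 stage of the artifact packing can be sketched as symmetric round-to-nearest quantization before zstd compression. This is an assumed minimal scheme: the PR's actual format (per-tensor vs per-channel scales, the zstd stream layout) is not shown here.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization onto levels -31..31
    (a sketch; the real scheme may use per-channel scales). Returns
    the integer codes and the float scale needed to dequantize."""
    scale = float(np.abs(w).max()) / 31.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale
```

The int8 buffer of 6-bit codes would then be bit-packed and zstd-compressed to produce the sub-16 MB artifact.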
Credits
@RoyiRa PR #589 (base architecture), @thwu1 PR #180, @unnir PR #162, @JoeProAI PR #462 (sequential TTT, SwiGLU), ResFormer (VRL concept), @ndokutovich PR #486 (per-layer LR). EB-TTT concept inspired by Empirical Bayes shrinkage estimation applied to gradient SNR.