
Non-record: Empirical Bayes Adaptive TTT (val_bpb=1.1185)#484

Open
Robby955 wants to merge 7 commits into openai:main from Robby955:ebls-learned-sharing

Conversation


@Robby955 Robby955 commented Mar 23, 2026

Summary

Verified val_bpb = 1.1185 (8xH100 SXM, 6909 steps at 86.8ms/step, artifact 15.81 MB).

Built on PR #589's frontier architecture (11L GEPA, VE128, XSA, SWA, Late Soft-Round QAT, Bigram, SmearGate). Our contribution is Empirical Bayes Adaptive Test-Time Training (EB-TTT): a per-layer gradient SNR scaling method that improves TTT by 0.0015 BPB over baseline SGD.

Results

| Metric | PR #589 Baseline (our repro) | + EB-TTT (this PR) | Delta |
|---|---|---|---|
| Training steps | 6801 | 6909 | +108 |
| Post-EMA BPB | 1.1375 | 1.1356 | -0.0019 |
| Post-quant BPB | 1.1458 | 1.1443 | -0.0015 |
| Post-TTT BPB | 1.1200 | 1.1185 | -0.0015 |
| Artifact size | 15.83 MB | 15.81 MB | -0.02 MB |

Research Contribution: EB-Adaptive TTT

Standard TTT applies a uniform SGD learning rate to all layers. We observe that layers differ in gradient signal-to-noise ratio (SNR) when adapting to local validation chunks. EB-TTT scales each layer's gradient by its SNR:

lambda_i = clip(|E[grad_i]| / std(grad_i), 0.3, 3.0)

Interpretation: High SNR (consistent gradient direction) means the layer has a clear adaptation signal — amplify it. Low SNR (noisy gradients) means the layer should stay closer to the trained prior — attenuate it.
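The per-layer scale above can be sketched in a few lines. This is a minimal illustration of the stated formula, assuming the mean and standard deviation are taken over the elements of a layer's flattened gradient; the function name and the std==0 fallback are our choices, not the PR's exact code.

```python
import math

def eb_scale(grad, lo=0.3, hi=3.0):
    """Empirical-Bayes style per-layer scale: |mean| / std of the
    layer's gradient elements, clipped to [lo, hi]."""
    n = len(grad)
    mean = sum(grad) / n
    var = sum((g - mean) ** 2 for g in grad) / n
    std = math.sqrt(var)
    if std == 0.0:
        return hi  # perfectly consistent gradient: amplify up to the cap
    return max(lo, min(hi, abs(mean) / std))

# A layer whose gradient elements all point the same way is amplified;
# a noisy, zero-mean gradient is attenuated toward the trained prior.
consistent = [0.9, 1.0, 1.1, 1.0]   # high SNR -> clipped at 3.0
noisy = [1.0, -1.1, 0.9, -0.8]      # near-zero mean -> clipped at 0.3
print(eb_scale(consistent), eb_scale(noisy))
```

The clip range [0.3, 3.0] matches the formula in the PR; the saturated ve_layer_scales value of 3.00 below is consistent with the upper cap binding.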

Learned Layer Scales (stable across runs)

blocks.0=0.54  blocks.1=0.61  blocks.2=0.61  blocks.3=0.61  blocks.4=0.63
blocks.5=0.65  blocks.6=0.66  blocks.7=0.68  blocks.8=0.65  blocks.9=0.58  blocks.10=0.51
ve_layer_scales: 3.00 (saturated at cap)

Middle layers (blocks.5-8) adapt most aggressively, while boundary layers (blocks.0 and blocks.10) stay closer to the prior. This matches intuition: early and late layers handle general tokenization and prediction, while middle layers specialize to the local data distribution.

Additional TTT Improvements

  • Embedding freeze during TTT (TTT_FREEZE_EMBEDDINGS=1): Prevents vocabulary embedding distortion when adapting to local 32K-token chunks. Token embeddings, bigram hash, and lm_head are frozen.
  • TTT burst with EMA: 2-epoch burst replay at 0.1x LR before sliding-window TTT.
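The embedding freeze can be sketched as a prefix filter over named parameters. This is a hypothetical sketch, not the repo's code: the parameter names (`embed`, `bigram_hash`, `lm_head`) are illustrative stand-ins for the frozen groups the PR lists, and plain dicts stand in for framework tensors so the sketch is self-contained.

```python
# Parameter-name prefixes to freeze during TTT (illustrative names).
FROZEN_PREFIXES = ("embed", "bigram_hash", "lm_head")

def freeze_for_ttt(named_params):
    """Mark embedding / bigram-hash / head params as non-trainable,
    so TTT only adapts the transformer blocks. Returns frozen names."""
    frozen = []
    for name, param in named_params:
        if name.startswith(FROZEN_PREFIXES):
            param["requires_grad"] = False
            frozen.append(name)
    return frozen

params = [
    ("embed.weight", {"requires_grad": True}),
    ("blocks.0.attn.w", {"requires_grad": True}),
    ("lm_head.weight", {"requires_grad": True}),
]
print(freeze_for_ttt(params))  # embed and lm_head frozen; blocks stay trainable
```

In a PyTorch-style codebase the same filter would set `param.requires_grad = False` on the matching entries of `model.named_parameters()`.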

Ablation Results (all on 8xH100 SXM)

| Run | Config | Final BPB | Notes |
|---|---|---|---|
| 1 | PR #589 base + VRL | 1.1200 | Previous best |
| 2 | + MTP + James-Stein shrinkage | 1.1312 | MTP + J-S hurt |
| 7 | EB-TTT (H100 NVL, 3540 steps) | 1.1526 | TTT delta -0.040 (54% better than SGD) |
| 8 | EB-TTT, no burst (SXM) | 1.1192 | Best absolute, artifact over 16 MB |
| 9 | EB-TTT + burst (SXM) | 1.1185 | Best valid submission |

Negative Results

  • VRL (Value Residual Learning) on GEPA: +0.002 worse, collides with VE128
  • MTP (Multi-Token Prediction): Dilutes gradient in compute-limited regime
  • James-Stein TTT shrinkage: Over-attenuates (TTT is underfitting, not overfitting)
  • AdamW for TTT: Diverges catastrophically
  • EVAL_SEQ_LEN=4096: RoPE NTK scaling doesn't generalize (1.53 BPB)
  • Output-aware quantization: Same BPB, bigger artifact

Architecture

11 layers, 512-dim, 8 heads, 4 KV heads. GEPA attention, VE128, XSA (last 4 layers), SWA, Late Soft-Round QAT, BigramHash, SmearGate. int6+zstd quantization. Single-pass score-first TTT with EB-adaptive per-layer scaling.

Credits

@RoyiRa PR #589 (base architecture), @thwu1 PR #180, @unnir PR #162, @JoeProAI PR #462 (sequential TTT, SwiGLU), ResFormer (VRL concept), @ndokutovich PR #486 (per-layer LR). EB-TTT concept inspired by Empirical Bayes shrinkage estimation applied to gradient SNR.

Non-record entry exploring learned layer-sharing patterns via James-Stein
shrinkage estimators. Three shared blocks x 3 virtual layers with per-layer
LoRA deviations gated by learned shrinkage gammas.

Key findings:
- MLP gammas converge to 0.0000 (fully shared) across all virtual layers
- Attention retains trace specialization (gamma ~0.004) in early layers only
- Quantization error amplifies multiplicatively in depth-recurrent architectures
  (0.19 BPB compiled-vs-eager gap from 15 passes through shared blocks)
- LoRA rank 8 forces full sharing; rank 16 permits mild deviation (0.01-0.05)

Pre-quant BPB (1.2105) beats baseline (1.2244) despite fewer steps (4572 vs 13780).
Post-quant BPB (1.3441) limited by quantization amplification in recurrent architecture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace EBLS-only submission with combined approach:
- SwiGLU MLP (mult=2.0) replacing ReLU-squared
- EMA (decay=0.9985) replacing SWA
- Eval-time AdamW TTT on MLP weights
- Mixed int5/int6 quantization with 5% pruning

Post-quant BPB: 1.1746 (H100 NVL, 3547 steps)
Artifact: 15.9MB (under 16MB limit)

Retains EBLS findings: gamma convergence, MLP sharing
asymmetry, quantization error amplification in depth-recurrent
architectures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Robby955 Robby955 changed the title Non-record: EBLS (Empirical Bayes Layer Sharing) — learned sharing patterns Non-record: SwiGLU + EMA + AdamW TTT + EBLS Findings (val_bpb=1.1746 on H100 NVL) Mar 23, 2026
Robby955 and others added 2 commits March 23, 2026 01:01
- Improved from 1.1746 to 1.1679 BPB (post-quant sliding eval)
- Int5 quantization for all weight categories (was mixed int5/int6)
- 5116 steps on 8xH100 SXM at 110ms/step (was 3521 on NVL)
- Artifact: 15.1MB (down from 15.9MB)
- Document TTT negative result: per-window adaptation degrades quality
  (batch leak bug found and fixed, but honest TTT doesn't help)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Sequential score-then-train TTT (3 epochs, batched 8 chunks)
- Report sliding-window BPB on adapted weights (1.0476) not TTT-loop BPB (1.1032)
- Full memorization analysis: 3 epochs = domain adaptation, 10 epochs = memorization
- Freeze embeddings during TTT, adapt attention + MLP only
- Artifact: 15.18 MB, eval: 91s TTT + 233s diagnostic

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Robby955 Robby955 changed the title Non-record: SwiGLU + EMA + AdamW TTT + EBLS Findings (val_bpb=1.1746 on H100 NVL) Non-record: Sequential TTT + Memorization Analysis (val_bpb=1.0476, 8xH100 SXM) Mar 23, 2026
@Robby955 Robby955 changed the title Non-record: Sequential TTT + Memorization Analysis (val_bpb=1.0476, 8xH100 SXM) Sequential TTT + Global Cosine Schedule + Memorization Analysis (val_bpb=1.0028, 8xH100 SXM) Mar 23, 2026
Robby955 and others added 2 commits March 23, 2026 14:16
Key improvements:
- 5-epoch global cosine LR decay (single curve across all epochs)
- Per-layer TTT LR multipliers (later layers adapt faster)
- Peak LR 7e-4 (up from 5e-4)
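The schedule described above (one cosine curve spanning all epochs, with later layers adapting faster) can be sketched as follows. The 7e-4 peak LR comes from the commit message; the linear per-layer ramp and the 0.5 floor multiplier are illustrative assumptions, not the PR's exact multipliers.

```python
import math

def ttt_lr(step, total_steps, layer_idx, n_layers,
           peak_lr=7e-4, min_mult=0.5):
    """Global cosine decay over all TTT epochs, scaled per layer.
    layer_idx=0 is the first block; the last block gets the full peak."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    layer_mult = min_mult + (1.0 - min_mult) * layer_idx / max(1, n_layers - 1)
    return peak_lr * cosine * layer_mult

# At step 0 the last layer sees the full peak LR, the first layer half of it;
# by the final step every layer has decayed to zero.
print(ttt_lr(0, 100, 10, 11), ttt_lr(0, 100, 0, 11), ttt_lr(100, 100, 5, 11))
```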

Results reproduced across two independent pods:
- Run 9: sliding 1.0022, TTT-loop 1.0101 (gap 0.008)
- Run 10: sliding 1.0028, TTT-loop 1.0106 (gap 0.008)

Memorization diagnostic confirms genuine adaptation:
sliding BPB < TTT-loop BPB at 5 epochs with cosine decay.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…research

Multi-epoch TTT is invalid per Issue openai#402 ruling. Our verified score is
1.1679 BPB from standard sliding-window eval without TTT. The multi-epoch
TTT experiments (reaching 1.00 BPB) are retained as a research contribution
showing how to diagnose memorization in TTT submissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Robby955 Robby955 changed the title Sequential TTT + Global Cosine Schedule + Memorization Analysis (val_bpb=1.0028, 8xH100 SXM) Non-record: SwiGLU + EMA + TTT Memorization Analysis (val_bpb=1.1679) Mar 23, 2026
@Robby955 Robby955 changed the title Non-record: SwiGLU + EMA + TTT Memorization Analysis (val_bpb=1.1679) Non-record: Empirical Bayes Adaptive TTT (val_bpb=1.1185) Mar 24, 2026
- Base: PR openai#589 architecture (11L GEPA, VE128, XSA, SWA, Late QAT)
- New: Empirical Bayes Adaptive TTT (per-layer gradient SNR scaling)
- New: Embedding freeze during TTT
- Result: 1.1185 BPB on 8xH100 SXM (6909 steps, 15.81 MB artifact)