Non-record: Empirical Bayes Adaptive TTT (val_bpb=1.1185)#484
Open
Robby955 wants to merge 7 commits into openai:main from
Conversation
Non-record entry exploring learned layer-sharing patterns via James-Stein shrinkage estimators. Three shared blocks x 3 virtual layers with per-layer LoRA deviations gated by learned shrinkage gammas.

Key findings:
- MLP gammas converge to 0.0000 (fully shared) across all virtual layers
- Attention retains trace specialization (gamma ~0.004) in early layers only
- Quantization error amplifies multiplicatively in depth-recurrent architectures (0.19 BPB compiled-vs-eager gap from 15 passes through shared blocks)
- LoRA rank 8 forces full sharing; rank 16 permits mild deviation (0.01-0.05)

Pre-quant BPB (1.2105) beats the baseline (1.2244) despite fewer steps (4572 vs 13780). Post-quant BPB (1.3441) is limited by quantization error amplification in the recurrent architecture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the EBLS-only submission with a combined approach:
- SwiGLU MLP (mult=2.0) replacing ReLU-squared
- EMA (decay=0.9985) replacing SWA
- Eval-time AdamW TTT on MLP weights
- Mixed int5/int6 quantization with 5% pruning

Post-quant BPB: 1.1746 (H100 NVL, 3547 steps). Artifact: 15.9 MB (under the 16 MB limit).

Retains the EBLS findings: gamma convergence, MLP sharing asymmetry, quantization error amplification in depth-recurrent architectures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Improved from 1.1746 to 1.1679 BPB (post-quant sliding eval)
- Int5 quantization for all weight categories (was mixed int5/int6)
- 5116 steps on 8xH100 SXM at 110 ms/step (was 3521 on NVL)
- Artifact: 15.1 MB (down from 15.9 MB)
- Document a TTT negative result: per-window adaptation degrades quality (a batch-leak bug was found and fixed, but honest TTT does not help)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Sequential score-then-train TTT (3 epochs, batched 8 chunks)
- Report sliding-window BPB on adapted weights (1.0476), not TTT-loop BPB (1.1032)
- Full memorization analysis: 3 epochs = domain adaptation, 10 epochs = memorization
- Freeze embeddings during TTT; adapt attention + MLP only
- Artifact: 15.18 MB; eval: 91 s TTT + 233 s diagnostic

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
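The embedding freeze above can be sketched as a simple parameter partition. This is a minimal illustration, not the PR's code: the parameter-name prefixes (`tok_embeddings`, `bigram`, `lm_head`) are hypothetical stand-ins for whatever the model's modules are actually called.

```python
import os

# Hypothetical name prefixes; the real module names in the model may differ.
FROZEN_PREFIXES = ("tok_embeddings", "bigram", "lm_head")

def ttt_param_groups(named_params):
    """Split parameter names into (frozen, adaptable) groups for test-time
    training. With TTT_FREEZE_EMBEDDINGS=1, embeddings, the bigram hash,
    and lm_head stay fixed; attention/MLP weights are adapted."""
    freeze = os.environ.get("TTT_FREEZE_EMBEDDINGS", "1") == "1"
    frozen, adaptable = [], []
    for name, _param in named_params:
        if freeze and name.startswith(FROZEN_PREFIXES):
            frozen.append(name)
        else:
            adaptable.append(name)
    return frozen, adaptable
```

In a real TTT loop, only the adaptable group would be handed to the optimizer (and the frozen group would have `requires_grad` disabled).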
Key improvements:
- 5-epoch global cosine LR decay (a single curve across all epochs)
- Per-layer TTT LR multipliers (later layers adapt faster)
- Peak LR 7e-4 (up from 5e-4)

Results reproduced across two independent pods:
- Run 9: sliding 1.0022, TTT-loop 1.0101 (gap 0.008)
- Run 10: sliding 1.0028, TTT-loop 1.0106 (gap 0.008)

The memorization diagnostic confirms genuine adaptation: sliding BPB < TTT-loop BPB at 5 epochs with cosine decay.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
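The schedule described above can be sketched as follows. The global cosine curve (one decay spanning all epochs, peak 7e-4) is from the commit; the linear ramp of per-layer multipliers is an assumption for illustration, since the commit only says later layers adapt faster.

```python
import math

def ttt_lr(step, total_steps, layer_idx, n_layers, peak_lr=7e-4):
    """TTT learning rate: a single cosine decay across all epochs
    (not restarted per epoch), scaled by a per-layer multiplier so
    later layers adapt faster. The 0.5x..1.5x linear ramp is an
    assumed placeholder, not the PR's actual multipliers."""
    cosine = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    layer_mult = 0.5 + layer_idx / max(n_layers - 1, 1)  # 0.5x .. 1.5x
    return peak_lr * cosine * layer_mult
```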
…research

Multi-epoch TTT is invalid per the Issue openai#402 ruling. Our verified score is 1.1679 BPB from a standard sliding-window eval without TTT. The multi-epoch TTT experiments (reaching 1.00 BPB) are retained as a research contribution showing how to diagnose memorization in TTT submissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Base: PR openai#589 architecture (11L GEPA, VE128, XSA, SWA, Late QAT)
- New: Empirical Bayes Adaptive TTT (per-layer gradient SNR scaling)
- New: Embedding freeze during TTT
- Result: 1.1185 BPB on 8xH100 SXM (6909 steps, 15.81 MB artifact)
Summary
Verified val_bpb = 1.1185 (8xH100 SXM, 6909 steps at 86.8ms/step, artifact 15.81 MB).
Built on PR #589's frontier architecture (11L GEPA, VE128, XSA, SWA, Late Soft-Round QAT, Bigram, SmearGate). Our contribution is Empirical Bayes Adaptive Test-Time Training (EB-TTT): a per-layer gradient SNR scaling method that improves TTT by 0.0015 BPB over baseline SGD.
Results
Research Contribution: EB-Adaptive TTT
Standard TTT applies a uniform SGD learning rate to all layers. We observe that different layers have different gradient signal-to-noise ratios (SNR) when adapting to local validation chunks. EB-TTT scales each layer's gradient by its SNR.
Interpretation: High SNR (consistent gradient direction) means the layer has a clear adaptation signal — amplify it. Low SNR (noisy gradients) means the layer should stay closer to the trained prior — attenuate it.
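The amplify/attenuate rule above can be sketched as a per-layer scale computed from EMA gradient statistics. This is a hedged reconstruction of the idea, not the PR's implementation: the exact SNR estimator, EMA decay, and cap are assumptions.

```python
import numpy as np

def snr_scales(layer_grads, state=None, beta=0.9, eps=1e-8, cap=2.0):
    """Per-layer gradient SNR scales (a sketch of the EB-TTT idea).

    layer_grads: list of flat gradient arrays, one per layer.
    state: per-layer EMAs (m, v) of the gradient and squared gradient.
    Scale_i = mean|m_i| / std_i, capped at `cap`: consistent gradient
    direction (high SNR) amplifies the update; noisy gradients (low SNR)
    shrink the layer toward its trained prior.
    """
    if state is None:
        m = [g.copy() for g in layer_grads]
        v = [g * g for g in layer_grads]
    else:
        m, v = state
        for i, g in enumerate(layer_grads):
            m[i] = beta * m[i] + (1 - beta) * g
            v[i] = beta * v[i] + (1 - beta) * g * g
    scales = []
    for mi, vi in zip(m, v):
        noise = np.sqrt(np.clip(vi - mi ** 2, 0, None).mean())
        signal = np.abs(mi).mean()
        scales.append(min(signal / (noise + eps), cap))
    return scales, (m, v)
```

Each layer's SGD update would then be `lr * scale_i * grad_i`, so a layer with a stable adaptation signal moves further per step than one whose gradients mostly cancel.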
Learned Layer Scales (stable across runs)
Middle layers (blocks.5-8) adapt most aggressively, while boundary layers (blocks.0, 10) stay closer to the prior. This matches intuition: early/late layers handle general tokenization/prediction, while middle layers specialize to the local data distribution.
Additional TTT Improvements
Embedding freeze (TTT_FREEZE_EMBEDDINGS=1): prevents vocabulary-embedding distortion when adapting to local 32K-token chunks. Token embeddings, the bigram hash, and lm_head are frozen.

Ablation Results (all on 8xH100 SXM)
Negative Results
Architecture
11 layers, 512-dim, 8 heads, 4 KV heads. GEPA attention, VE128, XSA (last 4 layers), SWA, Late Soft-Round QAT, BigramHash, SmearGate. int6+zstd quantization. Single-pass score-first TTT with EB-adaptive per-layer scaling.
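The int6 stage of the artifact packing can be sketched as symmetric round-to-nearest quantization before zstd compression. This is an assumed minimal scheme: the PR's actual format (per-tensor vs per-channel scales, the zstd stream layout) is not shown here.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization onto levels -31..31
    (a sketch; the real scheme may use per-channel scales). Returns
    the integer codes and the float scale needed to dequantize."""
    scale = float(np.abs(w).max()) / 31.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale
```

The int8 buffer of 6-bit codes would then be bit-packed and zstd-compressed to produce the sub-16 MB artifact.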
Credits
@RoyiRa PR #589 (base architecture), @thwu1 PR #180, @unnir PR #162, @JoeProAI PR #462 (sequential TTT, SwiGLU), ResFormer (VRL concept), @ndokutovich PR #486 (per-layer LR). EB-TTT concept inspired by Empirical Bayes shrinkage estimation applied to gradient SNR.