
Non-record: 1.1354 BPB — 10L TTT 22ep AdamW Cosine + LeakyReLU(0.5)² + TrigramHash #562

Open
bigbag wants to merge 1 commit into openai:main from bigbag:submission/10L-ttt22-leaky-relu-trigram-1.1354

Conversation


@bigbag bigbag commented Mar 23, 2026

Summary

  • val_bpb: 1.1354 (single seed, 8xH100 SXM)
  • Artifact: 15.35 MB (under 16MB limit)
  • Training: 600s | Eval (TTT + sliding window): 603s

Techniques

Architecture (10L, 512d, GQA 8/4)

  • Value Residual (ResFormer-style layer-0 V mixing, -0.015 BPB)
  • Gated Attention (per-head sigmoid gates, -0.003 BPB)
  • XSA on last 4 layers (-0.005 BPB)
  • LeakyReLU(0.5)² activation — preserves negative gradient flow (-0.003 BPB vs ReLU²)
  • TrigramHash — extends BigramHash to 3-token context via XOR hashing into shared embedding table
  • SmearGate, LN Scale (depth-scaled residuals), U-Net skip connections
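Two of the less standard pieces above can be sketched in PyTorch. This is a sketch, not the PR's code: the sign-preserving square in `LeakyReLUSquared` and the multiplicative mixing constants in `TrigramHash` are assumptions — the PR states only the 0.5 slope and that 3-token contexts are XOR-hashed into the shared BigramHash embedding table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeakyReLUSquared(nn.Module):
    """LeakyReLU(0.5)² sketch: square the magnitude but keep the sign, so
    negative inputs retain gradient flow. Sign preservation is an assumption;
    the PR only gives the 0.5 slope."""
    def forward(self, x):
        y = F.leaky_relu(x, negative_slope=0.5)
        return y * y.abs()

class TrigramHash(nn.Module):
    """Hash the current token and its two predecessors into a shared
    embedding table (assumes seq len >= 2). The mixing constants are
    illustrative; the PR specifies only XOR hashing into the BigramHash
    table."""
    def __init__(self, table: nn.Embedding):
        super().__init__()
        self.table = table  # shared with BigramHash => zero extra parameters

    def forward(self, ids):  # ids: (batch, seq) int64 token ids
        pad = torch.zeros_like(ids[:, :1])        # pad context with token id 0
        prev1 = torch.cat([pad, ids[:, :-1]], dim=1)
        prev2 = torch.cat([pad, pad, ids[:, :-2]], dim=1)
        # XOR-mix the trigram into one index, folded into the table size
        h = (ids * 0x9E3779B1) ^ (prev1 * 0x85EBCA77) ^ (prev2 * 0xC2B2AE3D)
        return self.table(h % self.table.num_embeddings)
```

Because `TrigramHash` indexes the same `nn.Embedding` as the bigram path, it adds lookup cost but no parameters, which matters under the 16MB artifact limit.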

Training

Quantization

  • Mixed int5 (MLP) / int6 (attention) + zstd-22
  • 3% magnitude pruning, FP16 passthrough for embeddings
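A minimal sketch of the weight-compression path above: magnitude pruning followed by symmetric integer quantization. Per-tensor scaling is an assumption (the PR does not say whether scales are per-tensor or per-channel); zstd-22 compression and FP16 embedding passthrough happen at serialization and are omitted here.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int, prune_frac: float = 0.03):
    """Zero the smallest-magnitude `prune_frac` of weights, then quantize
    symmetrically to `bits`-bit signed levels with one per-tensor scale.
    Values are stored in int8 here; a real artifact would bit-pack them
    before zstd compression."""
    w = w.copy()
    k = int(prune_frac * w.size)
    if k:
        thresh = np.partition(np.abs(w).ravel(), k)[k]
        w[np.abs(w) < thresh] = 0.0          # 3% magnitude pruning
    qmax = 2 ** (bits - 1) - 1               # 15 for int5, 31 for int6
    scale = (float(np.abs(w).max()) / qmax) or 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# Mixed precision per the PR: bits=5 for MLP weights, bits=6 for attention.
```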

Test-Time Training (22 epochs, within eval budget)

  • AdamW (lr=0.0005, wd=0.0) with per-step cosine LR decay to 0
  • Per-layer LR groups: 3x for output projections, 0.5x for input projections
  • Batched 32 sequences per GPU, distributed gradient sync via all_reduce
  • Gradient clipping at 1.0
  • TTT time: 406s, eval time: 197s
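The TTT recipe above can be sketched as a single loop. The parameter-name patterns used for the LR groups (`out_proj`, `in_proj`) are illustrative — the PR does not show its grouping keys — and the multi-GPU gradient all_reduce appears only as a comment so the sketch runs single-process.

```python
import math
import torch
import torch.nn.functional as F

def ttt_finetune(model, batches, epochs=22, base_lr=5e-4):
    """Test-time training sketch: AdamW (wd=0), per-layer LR multipliers
    (3x output projections, 0.5x input projections), per-step cosine LR
    decay to 0, gradient clipping at 1.0."""
    named = list(model.named_parameters())
    groups = [
        {"params": [p for n, p in named if "out_proj" in n], "lr": 3.0 * base_lr},
        {"params": [p for n, p in named if "in_proj" in n], "lr": 0.5 * base_lr},
        {"params": [p for n, p in named
                    if "out_proj" not in n and "in_proj" not in n],
         "lr": base_lr},
    ]
    opt = torch.optim.AdamW(groups, weight_decay=0.0)
    init_lrs = [g["lr"] for g in opt.param_groups]
    total_steps = epochs * len(batches)
    step = 0
    for _ in range(epochs):
        for x, y in batches:
            loss = F.cross_entropy(model(x).flatten(0, 1), y.flatten())
            opt.zero_grad()
            loss.backward()
            # Multi-GPU TTT would all_reduce (average) grads HERE, every
            # step -- syncing parameters only at the end diverges.
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            step += 1  # per-STEP (not per-epoch) cosine decay to 0
            decay = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
            for g, lr0 in zip(opt.param_groups, init_lrs):
                g["lr"] = lr0 * decay
            opt.step()
    return model
```

Scaling every group's LR by the same cosine factor preserves the 3x/0.5x ratios throughout the decay.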

Key Findings

  1. Batched TTT (32 seqs/GPU) is ~500x faster than chunk-based (1 seq × 256 tokens)
  2. Per-step cosine decay (not per-epoch) prevents overfitting at high epoch counts
  3. Gradient sync per step (all_reduce on grads) is critical for multi-GPU TTT — syncing parameters at the end causes divergence
  4. Per-layer LR groups compensate for uneven quantization damage (output projections are most affected)
  5. LeakyReLU(0.5)² gives consistent -0.003 BPB improvement
  6. TrigramHash reuses BigramHash's embedding table for zero extra parameters
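Finding 3 amounts to averaging gradients across ranks after every backward pass, before the optimizer step. A minimal sketch (assuming a process group is already initialized in the multi-GPU case; it degrades to a no-op single-process):

```python
import torch
import torch.distributed as dist

def sync_grads(model):
    """Average gradients across all ranks once per optimizer step.
    Called between backward() and step(); with no process group
    initialized (single process) it does nothing."""
    if not (dist.is_available() and dist.is_initialized()):
        return
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
```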

@bigbag bigbag changed the title from "Record: 1.1354 BPB — 10L TTT 22ep AdamW Cosine + LeakyReLU(0.5)² + TrigramHash" to "Non-record: 1.1354 BPB — 10L TTT 22ep AdamW Cosine + LeakyReLU(0.5)² + TrigramHash" on Mar 23, 2026
@bigbag bigbag force-pushed the submission/10L-ttt22-leaky-relu-trigram-1.1354 branch 2 times, most recently from 76c6512 to 5dca5af on March 23, 2026 18:12
…igramHash

Architecture: 10L 512d GQA 8/4, Value Residual, Gated Attention, XSA4,
LN Scale, LeakyReLU(0.5)², TrigramHash + BigramHash (shared table),
SmearGate, SWA 27ckpts, Late QAT 0.5, int5-MLP/int6-attn + zstd-22.

TTT: 22-epoch AdamW with per-step cosine LR decay, per-layer LR groups
(3x proj, 0.5x fc), batched 32 seqs/GPU, grad sync + clip 1.0.

Result: val_bpb=1.1354, artifact=15.35MB, train=600s, eval=603s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bigbag bigbag force-pushed the submission/10L-ttt22-leaky-relu-trigram-1.1354 branch from 5dca5af to c7a96b3 on March 23, 2026 18:13
