Skip to content

Non-record: 10L + Batched LoRA TTT (val_bpb=1.1160)#525

Open
hypery11 wants to merge 3 commits intoopenai:mainfrom
hypery11:submission/2026-03-23_10L_Optimized
Open

Non-record: 10L + Batched LoRA TTT (val_bpb=1.1160)#525
hypery11 wants to merge 3 commits intoopenai:mainfrom
hypery11:submission/2026-03-23_10L_Optimized

Conversation

@hypery11
Copy link

@hypery11 hypery11 commented Mar 23, 2026

Summary

10-layer transformer with batched per-document LoRA test-time training.

  • Base val_bpb: 1.1476 (10 min train)
  • TTT val_bpb: 1.1160 (8.3 min eval)
  • Artifact size: 15.75 MB

Architecture

  • 10L, 512 dim, 8/4 GQA, 3x MLP
  • Mixed int5/int6 quantization + zstd-22
  • Muon + AdamW, EMA averaging

LoRA TTT

  • Rank-8 on Q/V/LM-head, all layers
  • 64 docs batched in parallel
  • Per-document reset, Adam lr=0.01
  • 256-token chunks, 3 epochs, score on final epoch

Single seed result. 3-seed validation coming.

johndoe and others added 2 commits March 23, 2026 20:25
10-layer transformer with mixed int5/int6 quantization,
improved activations, enhanced embeddings, and EMA averaging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Per-document rank-8 LoRA adaptation on Q/V/LM-head.
Base 1.148 -> TTT 1.104. Eval time ~29min (needs speedup for 10min cap).
@hypery11 hypery11 changed the title Non-record: 10L Optimized (val_bpb=1.1477) Non-record: 10L + LoRA TTT (val_bpb=1.1039, base=1.1485) Mar 23, 2026
Batch-64 per-document LoRA adaptation. Rank-8 on Q/V/LM-head.
Base 1.148 -> TTT 1.116 in 8.3 min eval. 256-token chunks, 3 epochs.
@hypery11 hypery11 changed the title Non-record: 10L + LoRA TTT (val_bpb=1.1039, base=1.1485) Non-record: 10L + Batched LoRA TTT (val_bpb=1.1160) Mar 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant