
Non-record: 11L XSA + SwiGLU + LoRA TTT (val_bpb=1.1573, 1xH100) #2

Open

swapp1990 wants to merge 1 commit into main from submission/nonrecord-11l-xsa-lora-ttt

Conversation

@swapp1990
Owner

Summary

  • val_bpb: 1.1573 (LoRA TTT) | 15.02 MB artifact | 1xH100 PCIe, ~80 min
  • 11-layer transformer: XSA (last 4 layers), SwiGLU 3x MLP, SmearGate, U-Net skips, OrthoInit, Muon WD=0.04, SWA (SwiGLU sketch after this list)
  • Mixed quantization: int5-MLP + int6-attn + int8-embed + zstd-22 (quantization sketch below)
  • Score-then-train LoRA TTT (rank-8, 256-token chunks) brings val_bpb from 1.191 → 1.157 (LoRA sketch below)
  • 18 experiments over 5 days, from val_bpb=3.10 to 1.1573 (~$50 total compute)
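
A minimal sketch of the SwiGLU 3x MLP mentioned above, assuming PyTorch; the class name, argument names, and exact gating arrangement are illustrative, not taken from the submission.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU MLP with a 3x hidden expansion (sketch)."""
    def __init__(self, d_model: int, expansion: int = 3):
        super().__init__()
        hidden = expansion * d_model
        self.w_gate = nn.Linear(d_model, hidden, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, hidden, bias=False)    # value branch
        self.w_down = nn.Linear(hidden, d_model, bias=False)  # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(gate(x)) * up(x), then project back down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Usage: mlp = SwiGLU(d_model=512); y = mlp(torch.randn(2, 128, 512))
```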
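The mixed-quantization bullet can be illustrated with a symmetric n-bit quantizer plus zstd-22 compression. This is a sketch assuming NumPy and the `zstandard` package; the submission's actual bit-packing, per-tensor grouping, and scale storage are not specified here, and sub-byte values are left in int8 containers for clarity.

```python
import numpy as np
import zstandard

def quantize(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization to a signed `bits`-wide integer grid."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 15 for int5, 31 for int6, 127 for int8
    scale = max(float(np.abs(w).max()), 1e-8) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def compress(q: np.ndarray) -> bytes:
    # zstd level 22 trades compression time for the smallest artifact.
    return zstandard.ZstdCompressor(level=22).compress(q.tobytes())

# Example: mlp_q, s = quantize(np.random.randn(768, 2304).astype(np.float32), bits=5)
#          blob = compress(mlp_q)
```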
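A sketch of the rank-8 LoRA adapter used for test-time training, assuming PyTorch. The wrapper shows only the low-rank update structure; the score-then-train schedule over 256-token chunks is summarized in comments as an assumption about the general shape, not the submission's code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable rank-8 low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at step 0
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Assumed shape of "score-then-train" TTT over 256-token chunks:
#   1. score each chunk with the base model,
#   2. take a few gradient steps on the LoRA parameters using already-seen chunks,
#   3. re-score subsequent chunks with the adapted model.
# Only A and B receive gradients, so each test-time update is cheap.
```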

Why Non-Record

Trained on a single H100 PCIe with gradient accumulation (~80 min), not 8xH100 in 10 min. The architecture is identical to what would run on 8xH100; a gradient-accumulation sketch follows below.
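
A sketch of the gradient-accumulation loop implied above, assuming PyTorch; `accum_steps`, the data loader, and a model that returns its loss directly are placeholders, not the submission's training code.

```python
import torch

def train_epoch(model, optimizer, loader, accum_steps: int = 8):
    """Emulate a larger effective batch on one GPU by accumulating gradients."""
    optimizer.zero_grad(set_to_none=True)
    for i, (x, y) in enumerate(loader):
        loss = model(x, y) / accum_steps          # scale so summed grads match the big batch
        loss.backward()                           # grads accumulate across micro-batches
        if (i + 1) % accum_steps == 0:
            optimizer.step()                      # one optimizer step per effective batch
            optimizer.zero_grad(set_to_none=True)
```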

Test plan

🤖 Generated with Claude Code

…3, 1xH100)

Non-record submission for the parameter-golf challenge.
11-layer transformer with XSA, SwiGLU, SmearGate, U-Net skips,
mixed quantization (15 MB), and score-then-train LoRA TTT.
Trained on 1xH100 PCIe with grad accumulation (~80 min).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>