Record: 11L Muon TTT + Entropy-Adaptive Epochs (8×H100) — val_bpb 1.1179 (3-seed mean) #999

Open
aamodbhatt wants to merge 1 commit into openai:main from aamodbhatt:record-2026-03-28-muon-ttt-entropy-adaptive
aamodbhatt commented Mar 28, 2026

Summary

Two new test-time training (TTT) techniques on the SOTA base stack (PR #399 + PR #414 + PR #461): Muon-style Newton-Schulz orthogonalized updates replace SGD in the TTT loop, and entropy-adaptive epoch selection concentrates the adaptation budget on harder content. The 3-seed mean of 1.1179 beats the current SOTA (1.1194).
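For readers unfamiliar with the orthogonalization step, here is a minimal pure-Python sketch of a Newton-Schulz iteration in the style of the public Muon reference implementation. The quintic coefficients (3.4445, -4.7750, 2.0315) and the Frobenius normalization are taken from that reference, not from this PR's `train_gpt.py`, so treat them as assumptions; the function name `newton_schulz` and the list-of-lists matrix format are illustrative only.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=3, eps=1e-7):
    """Approximately orthogonalize the gradient matrix g.

    Assumption: coefficients follow the public Muon reference code;
    steps=3 mirrors the "NS steps=3" note in this PR.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    tall = len(g) > len(g[0])
    # Work on the wide orientation so X X^T is the smaller Gram matrix.
    x = transpose(g) if tall else [row[:] for row in g]
    # Scale by the Frobenius norm so every singular value is at most 1.
    norm = sum(v * v for row in x for v in row) ** 0.5 + eps
    x = [[v / norm for v in row] for row in x]
    for _ in range(steps):
        gram = matmul(x, transpose(x))   # A = X X^T
        gram2 = matmul(gram, gram)       # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)             # (b A + c A^2) X
        x = [[a * u + w for u, w in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return transpose(x) if tall else x


update = newton_schulz([[3.0, 0.0], [0.0, 4.0]], steps=3)
```

With only a few steps the singular values are pushed toward 1 rather than made exactly 1, which is the usual trade-off in Muon-style updates: the direction of the update is kept while its per-direction scale is roughly equalized.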

Run Results (3 seeds)

| Seed | legal_ttt_exact val_bpb | legal_ttt_exact val_loss | pre-quant val_bpb | train time | eval time (TTT) | artifact size |
|---|---|---|---|---|---|---|
| 1337 | 1.11765030 | 1.88710072 | 1.1366 | 599.1s | 477.9s | 15,944,410 bytes |
| 42 | 1.11812929 | 1.88790947 | 1.1371 | 599.1s | 485.3s | 15,873,826 bytes |
| 2025 | 1.11789934 | 1.88752121 | 1.1367 | 599.1s | 479.2s | 15,879,042 bytes |
| mean | 1.11789 | | | | | |
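The reported mean and spread can be rechecked from the per-seed rows with a short script. This is only a sketch: the tuple layout and variable names are illustrative, and using the sample standard deviation is an assumption, since the PR does not say which estimator produced its std figure.

```python
from statistics import mean, stdev

# (seed, legal_ttt_exact val_bpb, train seconds, eval seconds, artifact bytes)
runs = [
    (1337, 1.11765030, 599.1, 477.9, 15_944_410),
    (42,   1.11812929, 599.1, 485.3, 15_873_826),
    (2025, 1.11789934, 599.1, 479.2, 15_879_042),
]

bpbs = [bpb for _, bpb, _, _, _ in runs]
mean_bpb = mean(bpbs)
std_bpb = stdev(bpbs)  # sample std (n - 1); an assumption, see above

# Track limits as listed in the submission checklist.
within_limits = all(
    train <= 600 and evaluation <= 600 and size <= 16_000_000
    for _, _, train, evaluation, size in runs
)

print(f"mean val_bpb = {mean_bpb:.5f}, std = {std_bpb:.4f}, within limits: {within_limits}")
```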

Method Notes

  • NUM_LAYERS=11, BIGRAM_VOCAB_SIZE=1536, XSA_LAST_N=4
  • TTT_ENABLED=1, score-first path
  • TTT_MUON=1 — Newton-Schulz orthogonalized updates in TTT loop (NS steps=3)
  • TTT_ENTROPY_ADAPT=1 — entropy-adaptive 2/3/4 epochs per chunk (H_HIGH=2.1, H_LOW=1.75)
  • TTT_LR=0.002, TTT_EPOCHS=3, TTT_CHUNK_TOKENS=32768
  • NGRAM_EVAL_ENABLED=0
  • NGRAM_TWO_PASS_ENABLED=0
  • NGRAM_FULL_RESCORE=0
  • EMA_ENABLED=1, SWA_ENABLED=1, LATE_QAT=1, VE_ENABLED=1
  • WARMDOWN_ITERS=3500, MAX_WALLCLOCK_SECONDS=599
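The entropy-adaptive rule in the notes above can be sketched as a three-way threshold on the chunk's mean NLL. The direction of the mapping (harder chunks get more epochs) follows the summary; the inclusive boundaries, the NLL units, and the function name `select_epochs` are assumptions for illustration and may not match `train_gpt.py`.

```python
def select_epochs(chunk_nll, h_low=1.75, h_high=2.1, base_epochs=3):
    """Pick the number of TTT epochs for one chunk from its mean NLL.

    Assumed rule: NLL at or above h_high gets one extra epoch, NLL at or
    below h_low gets one fewer, anything in between keeps base_epochs.
    """
    if chunk_nll >= h_high:
        return base_epochs + 1  # hard chunk: 4 epochs
    if chunk_nll <= h_low:
        return base_epochs - 1  # easy chunk: 2 epochs
    return base_epochs          # typical chunk: 3 epochs


# Hypothetical chunk NLLs, one per 32768-token chunk.
schedule = [select_epochs(nll) for nll in (1.60, 1.89, 2.30)]
```

Because the commit message says the chunk NLL is globally synced, every rank would reach the same epoch count for a given chunk, which keeps the distributed TTT loop in lockstep.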

Submission Checklist

  • One folder under records/track_10min_16mb/
  • Included README.md, submission.json, train_gpt.py, and train logs (3 seeds)
  • Training <= 600s
  • Eval <= 600s
  • Artifact <= 16,000,000 bytes
  • No tokenizer/dataset modifications
  • Score-first TTT (SCORE under inference_mode before TRAIN on same chunk)
  • No n-gram, no two-pass, no external data lookup
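The score-first ordering in the checklist can be illustrated with a toy stand-in. The sketch below replaces the transformer with a smoothed unigram counter (purely hypothetical, chosen so the example runs without PyTorch) and omits `inference_mode`; what it preserves is the protocol: each chunk is scored with parameters that have not yet seen it, and only then used for adaptation.

```python
import math
from collections import Counter


class UnigramModel:
    """Toy adaptive model: add-one smoothed token frequencies."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = Counter()
        self.total = 0

    def nll(self, tokens):
        # Summed negative log-likelihood under the current counts.
        return sum(
            -math.log((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in tokens
        )

    def adapt(self, tokens):
        self.counts.update(tokens)
        self.total += len(tokens)


def score_first_eval(model, chunks):
    """Return mean NLL per token, scoring each chunk before training on it."""
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        total_nll += model.nll(chunk)  # SCORE: model has not seen this chunk
        total_tokens += len(chunk)
        model.adapt(chunk)             # TRAIN: only affects later chunks
    return total_nll / total_tokens


mean_nll = score_first_eval(UnigramModel(vocab_size=2), [[0, 0, 0, 0], [0, 0, 0, 0]])
```

In this toy run the second chunk scores better than the first only because of what was learned from the first, so no token is ever scored by a model that has already trained on it.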

…179 (3-seed mean)

Two novel TTT innovations: (1) Muon-style Newton-Schulz orthogonalized updates
replace SGD in the TTT loop; (2) entropy-adaptive 2/3/4 epochs per chunk based
on globally-synced chunk NLL. 3-seed mean 1.1179, std 0.0002. All under 16MB/600s.
aamodbhatt force-pushed the record-2026-03-28-muon-ttt-entropy-adaptive branch from bffd426 to f219235 on March 28, 2026 03:33
aryanbhosale added a commit to aryanbhosale/parameter-golf that referenced this pull request Mar 28, 2026
slope 0.75 + LR 0.027 + warmdown 3700 (PR openai#977)
No SWA with QAT (PR openai#989)
QAT from 50% + range fix [-31,31]
mHC 22-param residual mixing (PR openai#928)
VE128 + no gated_attn + no value_residual (PR openai#549)
LZMA preset 7 compression (PR openai#999)
Muon TTT with NS3 (PR openai#999)
Entropy-adaptive TTT epochs 2/3/4 (PR openai#999)
Per-layer TTT LR (PR openai#995)
TTT momentum 0.95 (PR openai#995)
eamon831 added a commit to eamon831/parameter-golf that referenced this pull request Mar 28, 2026
CLAUDE.md: Complete project state for cross-session continuity
- Leaderboard intel (verified SOTA + unverified PRs openai#1006, openai#999, openai#831)
- 8192 vocab analysis (doesn't fit — only 9,994 bytes headroom)
- Three planned improvements with code status
- Environment setup instructions (Mac MLX + RunPod H100)
- Codebase layout and git remotes

experiments.md: 4 planned experiments with commands + success criteria

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>