
Record: XSA-all + Depth Recurrence + Hedge Mixer TTT (val_bpb=1.0222, 3-seed mean)#745

Closed
stukenov wants to merge 1 commit into openai:main from stukenov:submission/v4-final-1epoch

Conversation

@stukenov

Record: XSA-all + VRL + CROWN-Q + Depth Recurrence + Hedge Mixer TTT

val_bpb = 1.0222 (3-seed mean, std 0.0067) | <16 MB | 8xH100 SXM | 600s train, 507s eval

3-Seed Results

| Seed | Pre-TTT bpb | Post-TTT bpb | TTT time | Artifact (bytes) |
|------|-------------|--------------|----------|------------------|
| 1337 | 1.1336 | 1.0201 | 507s | 15,857,972 |
| 42   | 1.1339 | 1.0165 | 508s | 15,846,228 |
| 2025 | 1.1369 | 1.0299 | 507s | 15,669,888 |
| Mean | 1.1348 | 1.0222 (std 0.0067) | 507s | |

Compliance

  • Training: 600s on 8xH100 SXM
  • Eval (TTT + sliding): 507s on 8xH100 SXM (under 600s limit)
  • All artifacts under 16,000,000 bytes
  • Score-first TTT: every token scored under torch.inference_mode() before any weight update
  • N-gram tables built from already-scored tokens only
  • No training data access during evaluation
  • GPTQ-lite: no calibration data needed
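The score-first constraint above can be illustrated with a minimal, framework-free sketch (function and parameter names here are illustrative, not the actual train_gpt.py code): each chunk is scored with the frozen current state before that same chunk is used to adapt the state, so no token's score ever depends on weights that have already seen it.

```python
def score_first_ttt(chunks, score_fn, update_fn, state):
    """Score-first test-time training loop (illustrative sketch).

    For each chunk: score it with the current state first, and only
    afterwards update the state on that chunk. In the real submission
    the scoring step would run under torch.inference_mode().
    """
    losses = []
    for chunk in chunks:
        losses.append(score_fn(state, chunk))  # score with weights that have NOT seen this chunk
        state = update_fn(state, chunk)        # only then adapt on the already-scored tokens
    return losses, state
```

With a toy "model" whose state is just a count of tokens seen, the ordering is visible: the first chunk is scored against a state of 0, the second against a state that reflects only the first chunk.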

6 Additions Over PR #549

  1. XSA on all layers (PR #634: "11L XSA-all + Full GPTQ (Budget-Legal) + Parallel Muon + Selective Pruning, val_bpb 1.1178, 3-seed mean") — −0.006 BPB
  2. Value Residual Learning (PR #657: "11L LeakyReLU² + VRL + lzma, val_bpb 1.1229, 3-seed mean") — layer 0 V blended via sigmoid gates
  3. Gated Attention (PR #638: "11L XSA-all + LeakyReLU(0.5)² + VR + GA, val_bpb 1.1164, pending 3-seed") — per-head sigmoid gates
  4. CROWN-Q (PR #693: "CROWN-Q + Full GPTQ + SWA/EMA Blend, val_bpb 1.1186, 3-seed mean") — curvature-weighted quantization penalty during warmdown
  5. Depth Recurrence (PR #686: "Depth Recurrence (layers 4 and 5 repeated), val_bpb 1.1182") — layers 4 and 5 repeated, giving 13 virtual layers from 11 physical
  6. 5-Expert Hedge Mixer (PR #688: "5-expert Hedge Mixer + TTT, 3-seed mean val_bpb 1.0745") — online mixing of neural + unigram + bigram + trigram + entropy experts via the Hedge algorithm
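The Hedge update in item 6 can be sketched as a standard multiplicative-weights mixer (a minimal illustration, not the submitted implementation; `eta` and the probability floor are assumed values): the mixture for each token is formed from the current weights before that token's outcome is observed, then every expert's weight is multiplied by exp(−eta · loss), where loss is the expert's log loss on the token.

```python
import math

def hedge_mix(step_probs, eta=0.5):
    """Online Hedge mixing of expert predictions (illustrative sketch).

    step_probs[t][e] is the probability expert e assigned to the token
    realized at step t. Returns the per-step mixture probabilities and
    the final normalized expert weights.
    """
    n = len(step_probs[0])
    weights = [1.0 / n] * n
    mixed = []
    for probs in step_probs:
        # mix with current weights BEFORE observing this step's outcome
        mixed.append(sum(w * q for w, q in zip(weights, probs)))
        # Hedge update: w *= exp(-eta * loss) with loss = -log q, i.e. w *= q**eta
        weights = [w * max(q, 1e-12) ** eta for w, q in zip(weights, probs)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return mixed, weights
```

Under log loss this update concentrates weight on whichever expert has the lowest cumulative loss, which is what lets the n-gram experts take over once they fit the eval stream, and is exactly the behavior the reviewer objects to below.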

Reproduction

SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py

All defaults in the script match the submitted results. The only environment variable is SEED (1337, 42, and 2025 for the three runs).

Credits

PR #549 (@abaybektursun), #634 (@raahilshah), #657 (@anthony-maio), #638 (@Asukabot0), #693 (@EthanYangTW), #686 (@msisovic), #688 (@RoyiRa), #493 (@parinzee), #414 (@signalrush)

@valerio-oai
Contributor

Thanks for your submission! Unfortunately, it's disallowed due to the use of hashed n-gram caches (your "Hedge Mixer"), which do not correctly renormalize or reweight the LM's token distribution, and which look ahead to the target token when mixing probabilities and therefore leak eval tokens. Please refer to the long discussion about this under the Issues tab for more details, and please submit more runs in the future!

