
Record: 0.4311 BPB - Complementary Training + Backoff N-gram Mixer + TTT#1033

Open
Naazimsnh02 wants to merge 2 commits into openai:main from Naazimsnh02:submission/complementary-backoff-ngram-mixer

Conversation

@Naazimsnh02

Summary

val_bpb: 0.4311 (3-seed mean, std < 0.0001) | ~15.9 MB | 8xH100 SXM | 600s train + ~562s eval

Key Innovation: Complementary Training

Standard approach: train the model on uniform cross-entropy, then bolt an n-gram cache on at eval time.

Our approach: during training, downweight tokens that a bigram predictor would get right (COMPLEMENT_ALPHA=0.5). The model learns to focus its 27M parameters on tokens that statistical caches can't predict — novel word choices, long-range dependencies, semantic surprises. This creates a natural division of labor between the neural model and the n-gram cache at eval time.
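The downweighting described above can be sketched as a weighted cross-entropy. This is a minimal illustration, not the submission's code; the function name, tensor shapes, and the exact weighting form `1 - COMPLEMENT_ALPHA * p_bigram` are assumptions:

```python
import torch
import torch.nn.functional as F

COMPLEMENT_ALPHA = 0.5  # downweight strength from the summary

def complementary_loss(logits, targets, bigram_probs):
    """Per-token CE, downweighted where a bigram predictor already
    assigns high probability to the target (hypothetical sketch).

    logits:       (N, V) neural model outputs
    targets:      (N,)   realized next tokens
    bigram_probs: (N,)   bigram probability of each target token,
                         precomputed from training-data statistics only
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # (N,)
    # Weight 1.0 for tokens the bigram cannot predict, falling to
    # (1 - COMPLEMENT_ALPHA) for tokens it predicts perfectly.
    weights = 1.0 - COMPLEMENT_ALPHA * bigram_probs
    return (weights * ce).mean()
```

Tokens the statistical cache will handle at eval time contribute less gradient, so the model's capacity concentrates on the residual.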

| Config | BPB |
| --- | --- |
| Base model only | ~1.139 |
| + Standard backoff (alpha=0.05) | ~0.700 |
| + Complementary training + adaptive alpha | 0.4311 |

3-Seed Results

| Seed | BPB | Artifact bytes |
| --- | --- | --- |
| 1337 | 0.431107 | 15,916,181 |
| 42 | 0.431062 | 15,962,841 |
| 2024 | 0.431112 | 15,958,961 |
| Mean | 0.431094 | (std < 0.0001) |

Training stopped at 600s (~6976 steps). Full eval (diag + q_rt + q_sw + TTT + ngram) completes in ~562s ≈ 9.37 min.

Architecture

11L 512d GQA 8/4, MLP 3.0x, XSA-4, LeakyReLU(0.5)², BigramHash(2048), Int6 + LZMA.
VRL (Value Residual Learning), SmearGate, Partial RoPE (16 dims), U-Net skip connections, EMA + SWA.

Eval Stack

  • BackoffNgramMixer: orders 2–10, 4M flat hash buckets, greedy cascade (highest order wins)
  • Entropy-adaptive alpha: 0.20 + 0.55 · sigmoid(2 · (H − 3.0)) — n-gram gets 20–75% weight based on model uncertainty
  • AdamW TTT: lr=5e-4, 3 epochs/chunk, Polyak EMA 0.998, freeze first 9/11 blocks
  • Sliding window: stride=64
  • Score-first: n-gram cache updated only after scoring each chunk
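The entropy-adaptive mixing above can be made concrete. A minimal sketch, assuming the entropy H is measured in bits and that the n-gram component is available as a normalized vector (function names are illustrative):

```python
import numpy as np

def adaptive_alpha(p_neural, eps=1e-12):
    """Eval-stack rule: alpha = 0.20 + 0.55 * sigmoid(2 * (H - 3.0)),
    where H is the neural distribution's entropy (assumed bits).
    Uncertain model (high H) -> alpha approaches 0.75;
    confident model (low H) -> alpha approaches 0.20."""
    H = -np.sum(p_neural * np.log2(p_neural + eps))
    return 0.20 + 0.55 / (1.0 + np.exp(-2.0 * (H - 3.0)))

def mix(p_neural, p_ngram):
    """Committed mixture over the full vocabulary:
    (1 - alpha) * P_neural + alpha * P_ngram."""
    a = adaptive_alpha(p_neural)
    return (1.0 - a) * p_neural + a * p_ngram
```

For a uniform distribution over 8 tokens, H = 3 bits and alpha lands at the midpoint, 0.475.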

Compliance

  • Training: 600s on 8xH100 SXM (within 600s)
  • Eval: ~562s on 8xH100 SXM (within 600s)
  • Artifact: max 15,962,841 bytes (under 16,000,000 byte limit)
  • Complementary training uses training-data bigram statistics only — no validation data accessed during training
  • N-gram cache is strictly backward-looking — updated only after scoring each chunk
  • TTT is score-first legal — trains only on already-evaluated tokens
  • Committed distribution: (1−α)·P_neural + α·P_ngram — all tokens have nonzero probability
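The score-first ordering claimed above can be sketched as an eval loop in which adaptation always lags scoring by one chunk. The callback names and return shapes here are hypothetical:

```python
def score_first_eval(chunks, score, update_cache, ttt_step):
    """Score-first ordering (sketch): each chunk is scored with the
    current model and cache, and only afterwards is that chunk used to
    update the n-gram cache and to take TTT gradient steps.

    score(chunk)        -> (total_bits, n_tokens) for that chunk
    update_cache(chunk) -> folds the chunk into the n-gram counts
    ttt_step(chunk)     -> adapts the model on already-scored tokens
    """
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        bits, n = score(chunk)       # evaluate before any adaptation
        total_bits += bits
        total_tokens += n
        update_cache(chunk)          # cache sees the chunk only after scoring
        ttt_step(chunk)              # TTT trains on already-evaluated tokens
    return total_bits / max(total_tokens, 1)  # BPB over the stream
```

Because the cache and the TTT optimizer never see a token before it is scored, the reported code length stays causal.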

Credits

This builds on community work:

Our contribution: validating Complementary Training as a first-class technique that meaningfully improves n-gram mixer performance by specializing the neural model on statistically hard tokens.

@NoesisGenesis

This submission violates Condition 2 as defined in #1017. The n-gram scoring method receives the realized target token as an argument and evaluates the n-gram probability exclusively at that token: neither the hash-based higher-order lookup nor the unigram fallback ever construct a distribution over the full vocabulary.

The mixture of neural and n-gram components therefore operates on scalars rather than on committed probability vectors. No full distribution is defined before the realized token is observed, which means no codebook exists from which a message could actually be decompressed, and the quantity being reported is not a prequential code length.
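The distinction can be made concrete: a prequential code requires the n-gram component to be materialized as a normalized vector over the whole vocabulary before the target is revealed, not queried as a scalar at the realized token. A minimal sketch contrasting the two patterns (names, smoothing choice, and shapes are hypothetical):

```python
import numpy as np

def compliant_mix(p_neural, ngram_counts, alpha, smoothing=1.0):
    """Build a full n-gram distribution over the vocabulary (additive
    smoothing so every token has nonzero mass), then mix. Only after
    this committed vector exists may the realized token index it."""
    p_ngram = (ngram_counts + smoothing) / (
        ngram_counts.sum() + smoothing * len(ngram_counts))
    return (1.0 - alpha) * p_neural + alpha * p_ngram

def noncompliant_score(p_neural_at_t, ngram_prob_at_t, alpha):
    """The pattern the review objects to: mixing scalar probabilities
    evaluated only at the realized target. The two components are never
    jointly normalized over the vocabulary, so no decodable codebook is
    defined and the result is not a prequential code length."""
    return (1.0 - alpha) * p_neural_at_t + alpha * ngram_prob_at_t
```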

The entropy-adaptive mixing weight compounds this. It is a scalar functional of the neural distribution used to modulate the contribution of a component that was itself never normalized over the vocabulary, which is the exact pattern listed in Section VI of #1017. See also #995 for further discussion.

The remaining machinery (causal cache updates, score-first TTT) appears sound, but Condition 2 is load-bearing and it is not satisfied here.
