
N-gram logit boost + HedgeMixer + score-first TTT#1014

Open
haimianbaobao007 wants to merge 1 commit into openai:main from haimianbaobao007:ngram-logit-boost-hedgemixer

Conversation

@haimianbaobao007

Summary

  • N-gram logit boost: properly normalized via softmax, fixing a hash-collision normalization bug that affects most hash-based n-gram PRs. Uses a log-count boost instead of a raw probability ratio.
  • HedgeMixer: online multiplicative-weights mixing between neural and neural+ngram experts (inspired by PR #700).
  • TTT via SGD with momentum 0.95, per-layer LR (output projection 3x, FC 0.5x), and Polyak averaging (inspired by PR #995).
  • Online bias correction: per-document logit bias vector (Nacrith 2026).
  • Numba JIT acceleration for n-gram eval (~20x speedup), EMA skip for short runs (<1000 steps), and an FA3→FA2→SDPA fallback chain for non-Hopper GPUs.
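A minimal sketch of the boosted scoring path, assuming a hashed count table. The table size, rolling hash, and `alpha` are illustrative choices, not the PR's actual code; the point is that `log1p` keeps the boost bounded and the final softmax guarantees a normalized distribution even when hash collisions corrupt individual counts:

```python
import numpy as np

VOCAB = 8
TABLE_SIZE = 2 ** 16  # hashed context buckets; collisions share counts

counts = np.zeros((TABLE_SIZE, VOCAB), dtype=np.int64)

def ctx_hash(context):
    # Simple rolling hash of the n-gram context (illustrative choice).
    h = 0
    for t in context:
        h = (h * 1000003 + t) % TABLE_SIZE
    return h

def boosted_probs(neural_logits, context, alpha=0.5):
    """Add a log-count n-gram boost to the neural logits, then softmax."""
    z = neural_logits + alpha * np.log1p(counts[ctx_hash(context)])
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

def update(context, next_token):
    # Score-first discipline: call this only AFTER the token was scored.
    counts[ctx_hash(context), next_token] += 1
```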

Results (RTX PRO 6000, 1 GPU, 535 steps)

  • FP32 base: 1.62 BPB
  • Int6 + sliding window + n-gram + HedgeMixer: 2.87 BPB (-8% vs int6 alone)
  • 535 steps is insufficient for QAT convergence; an 8xH100 run is expected to perform much better.
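The TTT optimizer described above can be sketched as follows. This is a toy NumPy version under stated assumptions: parameter names, the LR multiplier table, and the Polyak decay `tau` are illustrative, not the PR's actual values beyond those it states (momentum 0.95, output projection 3x, FC 0.5x):

```python
import numpy as np

# Per-layer LR multipliers from the PR description; other layers get 1x.
LR_MULT = {"out_proj": 3.0, "fc": 0.5}

def ttt_step(params, grads, momentum, polyak, lr=1e-3, beta=0.95, tau=0.999):
    """One test-time-training step: SGD with momentum, plus a Polyak
    (exponential moving) average of the weights used for scoring."""
    for name, w in params.items():
        m = momentum[name] = beta * momentum[name] + grads[name]
        w -= lr * LR_MULT.get(name, 1.0) * m
        # Polyak average: score with the smoothed weights, train the raw ones.
        polyak[name] = tau * polyak[name] + (1.0 - tau) * w
    return params, momentum, polyak
```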

Compliance

  • Score-first: every token is scored before any update that uses it.
  • N-gram tables and HedgeMixer weights are updated only after scoring.
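The score-first ordering for the HedgeMixer can be sketched as below: the mixed distribution scores the token first, and only then do the multiplicative-weights updates run. All names and the learning rate `eta` are illustrative, not the PR's actual code:

```python
import numpy as np

class HedgeMixer:
    """Online multiplicative-weights mixture of expert distributions."""

    def __init__(self, n_experts=2, eta=0.1):
        self.w = np.full(n_experts, 1.0 / n_experts)
        self.eta = eta

    def mix(self, expert_probs):
        # expert_probs: (n_experts, vocab), each row a distribution.
        return self.w @ expert_probs

    def update(self, expert_probs, token):
        # Called strictly AFTER the mixed distribution has scored `token`.
        losses = -np.log(expert_probs[:, token] + 1e-12)  # per-expert log loss
        self.w *= np.exp(-self.eta * losses)
        self.w /= self.w.sum()
```

Experts that assign higher probability to the observed tokens accumulate weight exponentially, which is what makes the scheme adapt to the non-stationary eval stream.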

Based on PR #549 by @abaybektursun.

🤖 Generated with Claude Code


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@immartian

The HedgeMixer (multiplicative weights for neural vs neural+ngram experts) is a clean approach to the interpolation problem. We're working on a similar idea in PR #541 where binding energy acts as the per-token confidence signal for mixing, but your online multiplicative weights adaptation is more principled for the non-stationary eval setting.

The log-count boost for n-gram normalization is a good fix — raw probability ratios from hash tables with collisions are noisy. Have you tried scaling the boost by an IDF-like term (inverse document frequency of the n-gram context)? In our experiments, rare contexts carry much more predictive signal than common ones, and weighting by context specificity improved pattern selection significantly.

535 steps on a single GPU is tough — the n-gram + HedgeMixer results should improve dramatically with more training. Good luck on the 8xH100 run.
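The IDF-style scaling suggested above might look like the following sketch. `context_df` (document frequency per hashed context) and `n_docs` are hypothetical bookkeeping, not anything from either PR:

```python
import math

def idf_scaled_alpha(base_alpha, context_df, n_docs):
    """Scale the n-gram boost strength by context rarity: rare contexts
    get a larger weight than common ones (add-one smoothed IDF)."""
    idf = math.log((1 + n_docs) / (1 + context_df))
    return base_alpha * idf
```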

