
Record: 11-gram Eval Cache + Hedge Mixer (val_bpb: 0.8609) #909

Open

sunnypatneedi wants to merge 1 commit into openai:main from sunnypatneedi:submission/v10-moonshot-0.8609


sunnypatneedi commented Mar 26, 2026

11-gram Eval Cache + Hedge Mixer on PR #549 Base

val_bpb: 0.8609 (3-seed mean, std 0.0008, sliding window stride=64) | ~15.9 MB | 8×H100 SXM

Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | Steps | Roundtrip bpb | Sliding+N-gram bpb | N-gram gain | Eval time | Artifact (bytes) |
|------|----------|-------|---------------|--------------------|-------------|-----------|------------------|
| 42   | 92 ms | ~6,500 | 1.1452 | 0.8600 | -0.2852 | ~188 s | 15,341,541 |
| 1337 | 92 ms | ~6,500 | 1.1452 | 0.8611 | -0.2841 | ~188 s | 15,918,565 |
| 2025 | 92 ms | 6,526  | 1.1452 | 0.8616 | -0.2836 | 188 s  | 15,790,804 |
| Mean | 92 ms | ~6,500 | 1.1452 | 0.8609 (std 0.0008) | -0.284 | ~188 s | |

Key Innovation: 11-gram Eval Cache with Entropy-Adaptive Mixing

The n-gram eval cache contributes -0.284 bpb, the single largest improvement over the base model. It replaces TTT entirely, freeing up the full eval-time budget.

  1. Multi-order n-gram cache (orders 2-11): 10 hash tables with 4M buckets each, uint32 count tables
  2. Score-first, update-after protocol: n-gram counts are scored before being updated (legal per @valerio-oai in issue #140)
  3. Entropy-adaptive alpha: mixing weight between neural and n-gram predictions is a function of model entropy — high-entropy (uncertain) tokens get more n-gram contribution
  4. Order-adaptive gating: higher-order matches get tighter entropy thresholds via order_centers = 3.0 - 0.25 * (matched_order - min_order)
  5. Hedge Mixer: online multiplicative-weights ensemble (beta=2.0) that learns optimal neural vs n-gram weighting across the eval run
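Items 3-4 can be sketched in a few lines. This is a hypothetical illustration: the `order_centers` formula is taken verbatim from the list above, but the sigmoid form and the way `NGRAM_ALPHA`, `NGRAM_ENT_BASE`, and `NGRAM_ENT_RANGE` combine are assumptions.

```python
import math

def entropy_adaptive_alpha(entropy, matched_order, min_order=2,
                           alpha_max=0.40, ent_base=0.05, ent_range=0.55):
    """Hypothetical mixing weight: give more mass to the n-gram prediction
    when the neural model is uncertain, with tighter entropy gates for
    higher-order matches."""
    # Order-adaptive gate center (formula from the PR description)
    center = 3.0 - 0.25 * (matched_order - min_order)
    # Sigmoid gate of entropy against the order-specific center (assumed form)
    gate = 1.0 / (1.0 + math.exp(-(entropy - center)))
    # Map the gate into [ent_base, ent_base + ent_range], capped at alpha_max
    # (how the three env-var knobs combine is an assumption)
    return min(alpha_max, ent_base + ent_range * gate)
```

Under this form, an 11-gram match opens its gate at much lower entropy (center 0.75) than a bigram match (center 3.0), matching the "tighter thresholds for higher orders" description.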

N-gram Protocol

  1. Initialize 10 hash tables (orders 2-11), each with 4M buckets of uint32 counts
  2. For each evaluation position:
    • Score: look up n-gram match for each order (highest order first), compute n-gram probability
    • Compute model entropy from neural logits
    • Compute entropy-adaptive alpha (sigmoid of entropy vs order-specific threshold)
    • Hedge Mixer blends neural and n-gram-enhanced predictions using learned weights
    • Update: increment n-gram counts for all observed n-grams at this position
  3. Sliding window eval (stride=64) processes validation tokens with the n-gram cache active
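The score-first, update-after loop above can be sketched as follows. This is a minimal illustration, not the record's implementation: the polynomial rolling hash, the folding of (context, token) into a single bucket per order, and the reduced bucket count (the record uses 4M per order) are all assumptions.

```python
import numpy as np

MIN_ORDER, MAX_ORDER, BUCKETS = 2, 11, 1 << 16

# One uint32 count table per order; (context, next-token) pairs are folded
# into a single hash bucket here for brevity
tables = {k: np.zeros(BUCKETS, dtype=np.uint32) for k in range(MIN_ORDER, MAX_ORDER + 1)}

def bucket(tokens):
    # Hypothetical polynomial rolling hash over the token ids
    h = 0
    for t in tokens:
        h = (h * 1000003 + int(t)) & 0xFFFFFFFF
    return h % BUCKETS

def score_then_update(history, token):
    """Score the (context, token) n-gram before adding it to the cache,
    so a token never scores against its own count."""
    matched_order, count = 0, 0
    # Score: try the highest order first
    for k in range(MAX_ORDER, MIN_ORDER - 1, -1):
        if len(history) >= k - 1:
            c = tables[k][bucket(history[-(k - 1):] + [token])]
            if c > 0:
                matched_order, count = k, int(c)
                break
    # Update only after scoring: increment every observed order at this position
    for k in range(MIN_ORDER, MAX_ORDER + 1):
        if len(history) >= k - 1:
            tables[k][bucket(history[-(k - 1):] + [token])] += 1
    return matched_order, count
```

On a repeated sequence, the first pass sees no matches and only populates the tables; the second pass then scores high-order hits, which is what makes the cache effective on self-similar validation text.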

Run Config

cd /workspace/parameter-golf
SEED=42 BIGRAM_VOCAB_SIZE=0 VE_DIM=64 GRADQUANT_ENABLED=0 \
  torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-03-26_sunnypatneedi_moonshot/train_gpt.py

All hyperparameters are baked into the script as defaults. Key environment variables:

# N-gram config
NGRAM_CACHE=1 NGRAM_ORDER=11 NGRAM_MIN_ORDER=2 NGRAM_BUCKETS=4194304
NGRAM_ENTROPY=1 NGRAM_ALPHA=0.40 NGRAM_ENT_BASE=0.05 NGRAM_ENT_RANGE=0.55

# Hedge Mixer
HEDGE_ENABLED=1 HEDGE_BETA=2.0

# Model (no BigramHash, VE_DIM=64 to fit 16MB across all seeds)
BIGRAM_VOCAB_SIZE=0 VE_DIM=64 GRADQUANT_ENABLED=0

# TTT disabled (n-gram replaces it)
TTT_ENABLED=0
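The Hedge Mixer named above can be sketched as a standard multiplicative-weights ensemble over two experts (the raw neural distribution and the n-gram-mixed one). `beta=2.0` matches `HEDGE_BETA`; using log-loss and renormalizing the weights each step are assumptions about the exact formulation.

```python
import math

class HedgeMixer:
    """Sketch of an online multiplicative-weights (Hedge) ensemble; experts
    with lower log-loss on observed tokens accumulate more weight."""

    def __init__(self, beta=2.0, n_experts=2):
        self.beta = beta
        self.w = [1.0 / n_experts] * n_experts

    def blend(self, probs):
        # probs: probability each expert assigns to a candidate token
        return sum(wi * pi for wi, pi in zip(self.w, probs))

    def update(self, probs_true_token):
        # Log-loss per expert on the token that actually occurred;
        # exponentially down-weight experts with higher loss
        losses = [-math.log(max(p, 1e-12)) for p in probs_true_token]
        self.w = [wi * math.exp(-self.beta * li) for wi, li in zip(self.w, losses)]
        z = sum(self.w)
        self.w = [wi / z for wi in self.w]
```

Because the update runs online during the eval pass, the mixer can learn the optimal neural-vs-n-gram weighting for the particular validation stream without any extra training time.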

Timing Budget

| Phase | Time |
|-------|------|
| Training | 600 s (≤10 min) |
| Int6 roundtrip eval (diagnostic) | ~49 s |
| Sliding window + n-gram + Hedge eval (stride=64) | ~188 s |
| Total eval | ~237 s (<10 min) |

Training Architecture (from PR #549 SOTA)

| Component | Setting |
|-----------|---------|
| Layers | 11 (512d, 8H, 4KV GQA) |
| MLP | 3× expansion, LeakyReLU(0.5)² |
| XSA | All 11 layers |
| Gated Attention | Enabled |
| RoPE | Partial (16/64 dims) |
| LN Scale | 1/√(layer+1) |
| VE64 | Layers 7-10 |
| Weight avg | EMA(0.997) + SWA (every 50 steps) |
| Quantization | Uniform Int6 + zstd-22 |
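The Uniform Int6 row can be illustrated with a minimal roundtrip sketch: map floats onto 64 levels and back. The record's actual scale/offset scheme and the zstd-22 packing of the 6-bit codes are not reproduced here; this only shows the quantization error behavior.

```python
import numpy as np

def int6_roundtrip(w):
    """Hypothetical uniform int6 quantization: 64 levels spanning [min, max]."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 63.0 if hi > lo else 1.0
    # 6-bit codes stored in uint8; a real artifact would bit-pack + zstd them
    q = np.clip(np.round((w - lo) / scale), 0, 63).astype(np.uint8)
    w_hat = lo + q.astype(np.float32) * scale
    return q, w_hat

w = np.linspace(-1, 1, 7, dtype=np.float32)
q, w_hat = int6_roundtrip(w)
```

The roundtrip error of such a scheme is bounded by half a quantization step, which is what makes the "Roundtrip bpb" column in the results table a meaningful diagnostic of quantization loss.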

Ablation

| Config | val_bpb | Delta |
|--------|---------|-------|
| Roundtrip (no n-gram, no sliding window) | 1.1452 | (baseline) |
| + Sliding window (stride=64) + 11-gram + Hedge | 0.8609 | -0.284 |

Credits

3-seed mean 0.8609 bpb (42→0.8600, 1337→0.8611, 2025→0.8616).
All artifacts under 16MB. 11-gram n-gram cache with entropy-adaptive
alpha and Hedge Mixer on PR openai#549 base architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
