
Record: 11-gram Eval Cache + Hedge Mixer (val_bpb: 0.8609) #909

Open

sunnypatneedi wants to merge 1 commit into openai:main from sunnypatneedi:submission/v10-moonshot-0.8609


sunnypatneedi commented Mar 26, 2026

11-gram Eval Cache + Hedge Mixer on PR #549 Base

val_bpb: 0.8609 (3-seed mean, std 0.0008, sliding window stride=64) | ~15.9 MB | 8×H100 SXM

Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | Steps | Roundtrip bpb | Sliding+N-gram bpb | N-gram gain | Eval time | Artifact (bytes) |
|------|----------|-------|---------------|--------------------|-------------|-----------|------------------|
| 42   | 92 ms | ~6,500 | 1.1452 | 0.8600 | -0.2852 | ~188 s | 15,341,541 |
| 1337 | 92 ms | ~6,500 | 1.1452 | 0.8611 | -0.2841 | ~188 s | 15,918,565 |
| 2025 | 92 ms | 6,526  | 1.1452 | 0.8616 | -0.2836 | 188 s  | 15,790,804 |
| Mean | 92 ms | ~6,500 | 1.1452 | 0.8609 (std 0.0008) | -0.284 | ~188 s | |

Key Innovation: 11-gram Eval Cache with Entropy-Adaptive Mixing

The n-gram eval cache contributes -0.284 bpb, the single largest improvement over the base model. It replaces TTT entirely, freeing up the full eval-time budget.

  1. Multi-order n-gram cache (orders 2-11): 10 hash tables with 4M buckets each, uint32 count tables
  2. Score-first, update-after protocol: n-gram counts are scored before being updated (legal per @valerio-oai in issue #140)
  3. Entropy-adaptive alpha: mixing weight between neural and n-gram predictions is a function of model entropy — high-entropy (uncertain) tokens get more n-gram contribution
  4. Order-adaptive gating: higher-order matches get tighter entropy thresholds via order_centers = 3.0 - 0.25 * (matched_order - min_order)
  5. Hedge Mixer: online multiplicative-weights ensemble (beta=2.0) that learns optimal neural vs n-gram weighting across the eval run
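Items 3-4 can be sketched in a few lines. This is a hypothetical illustration: the `order_centers` formula is taken verbatim from the list above, but the sigmoid form and the way `NGRAM_ALPHA`, `NGRAM_ENT_BASE`, and `NGRAM_ENT_RANGE` combine are assumptions.

```python
import math

def entropy_adaptive_alpha(entropy, matched_order, min_order=2,
                           alpha_max=0.40, ent_base=0.05, ent_range=0.55):
    """Hypothetical mixing weight: give more mass to the n-gram prediction
    when the neural model is uncertain, with tighter entropy gates for
    higher-order matches."""
    # Order-adaptive gate center (formula from the PR description)
    center = 3.0 - 0.25 * (matched_order - min_order)
    # Sigmoid gate of entropy against the order-specific center (assumed form)
    gate = 1.0 / (1.0 + math.exp(-(entropy - center)))
    # Map the gate into [ent_base, ent_base + ent_range], capped at alpha_max
    # (how the three env-var knobs combine is an assumption)
    return min(alpha_max, ent_base + ent_range * gate)
```

Under this form, an 11-gram match opens its gate at much lower entropy (center 0.75) than a bigram match (center 3.0), matching the "tighter thresholds for higher orders" description.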

N-gram Protocol

  1. Initialize 10 hash tables (orders 2-11), each with 4M buckets of uint32 counts
  2. For each evaluation position:
    • Score: look up n-gram match for each order (highest order first), compute n-gram probability
    • Compute model entropy from neural logits
    • Compute entropy-adaptive alpha (sigmoid of entropy vs order-specific threshold)
    • Hedge Mixer blends neural and n-gram-enhanced predictions using learned weights
    • Update: increment n-gram counts for all observed n-grams at this position
  3. Sliding window eval (stride=64) processes validation tokens with the n-gram cache active
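The score-first, update-after loop above can be sketched as follows. This is a minimal illustration, not the record's implementation: the polynomial rolling hash, the folding of (context, token) into a single bucket per order, and the reduced bucket count (the record uses 4M per order) are all assumptions.

```python
import numpy as np

MIN_ORDER, MAX_ORDER, BUCKETS = 2, 11, 1 << 16

# One uint32 count table per order; (context, next-token) pairs are folded
# into a single hash bucket here for brevity
tables = {k: np.zeros(BUCKETS, dtype=np.uint32) for k in range(MIN_ORDER, MAX_ORDER + 1)}

def bucket(tokens):
    # Hypothetical polynomial rolling hash over the token ids
    h = 0
    for t in tokens:
        h = (h * 1000003 + int(t)) & 0xFFFFFFFF
    return h % BUCKETS

def score_then_update(history, token):
    """Score the (context, token) n-gram before adding it to the cache,
    so a token never scores against its own count."""
    matched_order, count = 0, 0
    # Score: try the highest order first
    for k in range(MAX_ORDER, MIN_ORDER - 1, -1):
        if len(history) >= k - 1:
            c = tables[k][bucket(history[-(k - 1):] + [token])]
            if c > 0:
                matched_order, count = k, int(c)
                break
    # Update only after scoring: increment every observed order at this position
    for k in range(MIN_ORDER, MAX_ORDER + 1):
        if len(history) >= k - 1:
            tables[k][bucket(history[-(k - 1):] + [token])] += 1
    return matched_order, count
```

On a repeated sequence, the first pass sees no matches and only populates the tables; the second pass then scores high-order hits, which is what makes the cache effective on self-similar validation text.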

Run Config

cd /workspace/parameter-golf
SEED=42 BIGRAM_VOCAB_SIZE=0 VE_DIM=64 GRADQUANT_ENABLED=0 \
  torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-03-26_sunnypatneedi_moonshot/train_gpt.py

All hyperparameters are baked into the script as defaults. Key environment variables:

# N-gram config
NGRAM_CACHE=1 NGRAM_ORDER=11 NGRAM_MIN_ORDER=2 NGRAM_BUCKETS=4194304
NGRAM_ENTROPY=1 NGRAM_ALPHA=0.40 NGRAM_ENT_BASE=0.05 NGRAM_ENT_RANGE=0.55

# Hedge Mixer
HEDGE_ENABLED=1 HEDGE_BETA=2.0

# Model (no BigramHash, VE_DIM=64 to fit 16MB across all seeds)
BIGRAM_VOCAB_SIZE=0 VE_DIM=64 GRADQUANT_ENABLED=0

# TTT disabled (n-gram replaces it)
TTT_ENABLED=0
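The Hedge Mixer named above can be sketched as a standard multiplicative-weights ensemble over two experts (the raw neural distribution and the n-gram-mixed one). `beta=2.0` matches `HEDGE_BETA`; using log-loss and renormalizing the weights each step are assumptions about the exact formulation.

```python
import math

class HedgeMixer:
    """Sketch of an online multiplicative-weights (Hedge) ensemble; experts
    with lower log-loss on observed tokens accumulate more weight."""

    def __init__(self, beta=2.0, n_experts=2):
        self.beta = beta
        self.w = [1.0 / n_experts] * n_experts

    def blend(self, probs):
        # probs: probability each expert assigns to a candidate token
        return sum(wi * pi for wi, pi in zip(self.w, probs))

    def update(self, probs_true_token):
        # Log-loss per expert on the token that actually occurred;
        # exponentially down-weight experts with higher loss
        losses = [-math.log(max(p, 1e-12)) for p in probs_true_token]
        self.w = [wi * math.exp(-self.beta * li) for wi, li in zip(self.w, losses)]
        z = sum(self.w)
        self.w = [wi / z for wi in self.w]
```

Because the update runs online during the eval pass, the mixer can learn the optimal neural-vs-n-gram weighting for the particular validation stream without any extra training time.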

Timing Budget

| Phase | Time |
|-------|------|
| Training | 600 s (≤10 min) |
| Int6 roundtrip eval (diagnostic) | ~49 s |
| Sliding window + n-gram + Hedge eval (stride=64) | ~188 s |
| Total eval | ~237 s (<10 min) |

Training Architecture (from PR #549 SOTA)

| Component | Setting |
|-----------|---------|
| Layers | 11 (512d, 8H, 4KV GQA) |
| MLP | 3× expansion, LeakyReLU(0.5)² |
| XSA | All 11 layers |
| Gated Attention | Enabled |
| RoPE | Partial (16/64 dims) |
| LN Scale | 1/√(layer+1) |
| VE64 | Layers 7-10 |
| Weight avg | EMA(0.997) + SWA (every 50 steps) |
| Quantization | Uniform Int6 + zstd-22 |
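The Uniform Int6 row can be illustrated with a minimal roundtrip sketch: map floats onto 64 levels and back. The record's actual scale/offset scheme and the zstd-22 packing of the 6-bit codes are not reproduced here; this only shows the quantization error behavior.

```python
import numpy as np

def int6_roundtrip(w):
    """Hypothetical uniform int6 quantization: 64 levels spanning [min, max]."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 63.0 if hi > lo else 1.0
    # 6-bit codes stored in uint8; a real artifact would bit-pack + zstd them
    q = np.clip(np.round((w - lo) / scale), 0, 63).astype(np.uint8)
    w_hat = lo + q.astype(np.float32) * scale
    return q, w_hat

w = np.linspace(-1, 1, 7, dtype=np.float32)
q, w_hat = int6_roundtrip(w)
```

The roundtrip error of such a scheme is bounded by half a quantization step, which is what makes the "Roundtrip bpb" column in the results table a meaningful diagnostic of quantization loss.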

Ablation

| Config | val_bpb | Delta |
|--------|---------|-------|
| Roundtrip (no n-gram, no sliding window) | 1.1452 | (baseline) |
| + Sliding window (stride=64) + 11-gram + Hedge | 0.8609 | -0.284 |

Credits

3-seed mean 0.8609 bpb (42→0.8600, 1337→0.8611, 2025→0.8616).
All artifacts under 16MB. 11-gram n-gram cache with entropy-adaptive
alpha and Hedge Mixer on PR openai#549 base architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
