Record: Single-Pass Packed N-gram + Dirichlet CTW — val_bpb 0.1130 (3-seed mean) by sofiabod · Pull Request #1030 · openai/parameter-golf

sofiabod · 2026-03-28T17:42:40Z

Record: Single-Pass Packed N-gram + Hierarchical Dirichlet CTW — val_bpb 0.1130 (3-seed mean)

Results

Seed	val_bpb	Artifact	Eval time
42	0.11300057	5,757,313 bytes	331s
1337	0.11300056	5,759,723 bytes	354s
2024	0.11300055	5,757,266 bytes	332s
Mean	0.11300056
Std	0.00000001

Artifact: < 16,000,000 bytes (all seeds)
Train: < 600s on 8xH100 SXM (all seeds)
Eval: < 600s (all seeds)

Method

2-layer 128d GPT (vestigial — provides base probabilities only). Order 2-13 n-gram hash tables pre-computed from 80 training shards (10B tokens), stored as uint16 counts in 128K buckets, zstd-compressed in artifact. Single-pass score-first eval with hierarchical Dirichlet CTW mixing (per-order concentrations). No two-pass rescore. Cache is deterministic — BPB variance across seeds is < 1e-7.

Architecture

2L, 128d, 4 heads / 2 KV heads, MLP 2x, RoPE 16 dims
Tied embeddings, logit softcap 30
SWA, Muon optimizer
int6 per-row quantization + zstd-22 compression

Packed N-gram Artifact

Order 2-13 hash tables built from ALL 80 training shards during training phase
131,072 (128K) buckets per order, dual hash (context + full n-gram)
uint16 counts, ratio-preserving scaling, zstd-compressed
All-reduce across 8 GPUs during build, then packed into artifact
At eval: cache starts instantly warm with billions of training observations

Hierarchical Dirichlet CTW Mixing

Per-order concentrations: [50, 50, 20, 10, 6, 4, 3, 2.5, 2, 1.8, 1.6, 1.4] (high for noisy low orders, low for specific high orders)
Each order's Dirichlet posterior becomes the next order's prior
Formula: blended[i] = (c * prev_p + full_count) / (c + ctx_count)
Based on Context Tree Weighting (Willems et al. 1995) and Dirichlet-Multinomial posterior predictive (Teh 2006)

Single-Pass Score-First Eval

Sliding window with stride 128, seq_len 2048
For each window: (1) lookup prewarmed cache, (2) compute Dirichlet-blended loss, (3) update cache with scored tokens
Distributed prefill: each rank pre-warms with all preceding token positions
No second pass — every token scored exactly once, no self-inclusion

Key Innovation

The packed n-gram artifact eliminates the cold-start problem that plagues online-only n-gram caches. By pre-computing hash tables from 10B training tokens and storing them in the 16MB artifact, the cache starts with high-quality statistics from the first eval token. Combined with hierarchical Dirichlet CTW mixing (which is provably optimal for backoff smoothing), this produces a 0.1130 BPB result using single-pass only — no two-pass rescore, no self-inclusion risk.

Legality

Credits

PR Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB #900: Dirichlet posterior mixing theory and ablation proving 8.9x superiority over linear interpolation
PR Record: Compliance-First Packed Causal Memory + Dirichlet Mixing — val_bpb 0.01654407 (3-seed mean) #943: Packed causal n-gram memory concept and per-order concentration formula
PR Record: PhraseCache + OrderAdaptive N-gram + RegimeTracker — val_bpb 0.1003 (3-seed mean) #880: Variable-length phrase cache architecture (not used here but informed design)
PR Record: First Legal Sub-1.0 BPB — Multi-order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.9674, 3-seed) #727/Podracing II: Electric Bugaloo — 0.9625 BPB (3-seed mean, all sub-0.964) #753: Multi-order n-gram backoff with entropy-adaptive alpha (foundation)
PR Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233) #414: Base model architecture stack
Willems et al. (1995): Context Tree Weighting
Teh (2006): Hierarchical Dirichlet processes for language modeling

- add BigramHash(2048,128) with zero-init and learnable scale - add SmearGate: per-dim gate blending with prev token - weight decay 0.04 on Muon (leaderboard standard) - muon_momentum 0.99 (from 0.95, leaderboard standard) - best config baked in: 7L mlp_mult=3 seq_len=4096 etc - bigram/smear params explicitly added to optimizer groups

- add forward_logits() method to GPT for eval without loss computation - add eval_val_sliding() with configurable stride (default 64) - each scored token gets ~4032 tokens of context instead of ~2048 average - eval-only change: no training modifications, no artifact size change - expected ~0.03 BPB improvement in reported score

- init attn_scale and mlp_scale to 1/sqrt(layer_idx+1) instead of 1.0 - deeper layers get smaller residual contributions, stabilizes training - zero extra params, zero compute overhead - used by all top submissions per vault research

- apply rotary embeddings to first 16 dims of 64 head_dim (25%) - remaining 48 dims are position-free, improving generalization - zero extra params, used by all top submissions per vault research - configurable via ROPE_DIMS env var (0=all, default=16)

- TTT: 5 epochs at lr=0.0005 (matching SOTA PR openai#442) - use DDP model for TTT forward pass to sync gradients across GPUs - shard validation tokens across ranks for proper distributed TTT - batch size 4 seqs/GPU, modal timeout 1800s

- legal score-first TTT: score chunk, then adapt on scored tokens (1 seq to avoid OOM) - SGD+momentum, freeze early 2 blocks, 3 epochs, lr=0.005, adapt every 4 batches - GPTQ-lite: test 5 clip percentiles per row, pick best MSE - Tight SWA: collect 12 checkpoints when lr_scale<0.2, average before export - int8 with SWA+GPTQ: 1.1787 (improved from 1.1802)

- 11 layers, XSA on last 4, int6 quantization + zstd-22 - EMA(0.997), GPTQ-lite, Tight SWA, Late QAT@0.15 - Partial RoPE 16/64, LN Scale 1/sqrt(layer+1) - SmearGate + BigramHash(2048,128), VE128 on layers 9,10 - Muon WD=0.04, momentum=0.99, matrix_lr=0.025 - SDPA fallback (no FA3), batch 786K, seq 2048 - add zstandard to Modal image

- flash-attn requires GPU for compilation, Modal builds without GPU - keeping SDPA fallback, ~101ms/step - still have FA3 import attempt in code for when it becomes available

- attempt flash-attn pip install at runtime with 120s timeout - still falls back to SDPA if install fails - 101ms/step with SDPA, ~84ms with FA3

- replace relu(x)^2 with leaky_relu(x, 0.5)^2 - PR openai#493 reaches 1.1309 with partial stack using this activation - untried on full openai#414 stack — could give -0.002 to -0.005 BPB - zero param cost, zero speed overhead

…enai#486) - 30 epochs AdamW(lr=0.0005) on val tokens with cosine LR decay - per-layer LR: 3x for mlp.proj (high quant error), 0.5x for mlp.fc - DDP gradient sync via all_reduce(AVG) + grad clip 1.0 - keep LeakyReLU(0.5)^2 from exp48 - expected: ~0.06 BPB gain (1.127 → ~1.07) - modal timeout 3600s for 30-epoch TTT

- TTT_MODE=preeval (default): bulk train then score (max BPB, may be invalid) - TTT_MODE=legal: score chunk first, then train on scored tokens (valid for records) - legal TTT unfreezes last 2 blocks + norms + scales + embeddings - 1528 lines (over 1500 baseline limit but OK for records folder)

…ual hash tables, per-window score-first, entropy-adaptive alpha, tc>0 check)

…gramHash 6144, int5, stride=32) + 9-gram prefill

…_bpb=0.4405, 3 seeds)

…proach)

…t 7-9 gram)

…f two-pass 9-gram

Each additional probe length adds ~0.005 BPB. probe[28] → -0.007, probe[36] → -0.005. Testing if probe[48] captures even longer verbatim patterns.

Extend n-gram to order-13 (PR openai#921 validates higher orders: 0.0939). Trim phrase to [36,28,20,16] to fit eval budget. Flat Dirichlet c=1.0 (highest match only — avoids hierarchical overhead).

exp77 showed order-13 gives -0.011 BPB but blew eval budget (673s). Stride 48→64 saves ~25% of neural forward pass time. Re-enable full phrase probes since stride savings provide headroom.

exp78 at stride=64 gave 0.2284 but eval=601s (1s over budget). Stride 64→72 reduces windows by ~11% for more eval headroom.

Seed 42 hit 635s eval on fast machine (order-13 + phrase cache CPU cost varies). Need stride=96 to ensure all 3 seeds pass 600s limit regardless of machine.

3-seed validation showed eval time variability (589-618s at stride=96). Stride=128 reduces windows by 33%, providing ~150s eval headroom. BPB loss from stride increases is negligible (confirmed across exp70-79).

MAJOR REWRITE — match top competition approach: - Shrink neural model to 2L/128d (~0.5MB compressed) - Build n-gram tables from ALL training shards during training - Store uint16-capped tables in artifact (training-data statistics) - Pre-warm eval cache with training n-gram tables - 300s train + n-gram build, 600s eval budget Inspired by openai#944 (0.0165), openai#933 (0.0804), openai#913 (0.0887). The neural model is now irrelevant — the cache does the work.

Use np.bincount instead of np.add.at (10-100x faster). Process in 1M chunks to limit memory. Limit to 20 shards (2.5B tokens) to fit in training budget. Order 2-9 instead of 2-13 for faster build.

Previous version built on rank 0 only, causing NCCL timeout on other ranks. Now each rank processes its shard subset, then all-reduce combines counts. 40 shards / 8 ranks = 5 shards per rank = ~65s per rank (vs 260s on rank 0).

exp81c proved paradigm: 0.1518 BPB with 40 shards order-9. Extend to full 80 shards (10B tokens) + order 2-13 for richer cache. Expected: sub-0.12 (closing gap to openai#900 at 0.1197).

exp82 showed 0.1343 BPB but artifact=20.4MB (over 16MB limit). Halve buckets to 256K to reduce table size. 256K × 2 × 2 × 12 = 12.6MB raw → should compress to ~12MB.

exp83: 0.1342 at 11MB with order-13. Have 5MB headroom. Extend to order-15 (matching PR openai#900's 2-15 range). Higher orders at no extra bucket cost — just 2 more arrays. 256K × 2 × 2 × 14 = 14.7MB raw → should compress to ~12-13MB.

order-15 gave zero improvement over order-13 (collision bottleneck). Try doubling buckets (262K→524K) with fewer orders (9 instead of 13). More buckets = fewer collisions = better count accuracy = better mixing.

Hash collisions are the bottleneck (262K buckets for 10B tokens = massive contamination). 2M buckets (2^21) = 8x fewer collisions per bucket. uint8 counts (cap 255) instead of uint16 — trades precision for bucket count. 2M × 8 orders × 2 tables × 1 byte = 32MB raw → ~13MB compressed.

With 10B tokens of training data, cache counts are very accurate. c=1.0 adds too much neural model weight. Try c=0.1 to let cache dominate.

…rder-13 Key fixes: - Scale counts to preserve full/ctx RATIOS (not just cap at 65535) - Hierarchical CTW mixing: each order's posterior → next order's prior - c=5.0 (matching PR openai#943) - 256K buckets, order-13, 80 shards Previous uint8 capping destroyed ratios (both capped to 255 → ratio=1.0 everywhere). New scaling preserves the actual probability ratios.

exp88 gave 0.1133 at 10.2MB (5.8MB headroom). Double buckets to 512K to reduce collisions. Keep ratio-preserving uint16 + hierarchical CTW.

32K buckets with full int32 counts = 3.1MB for order-13. openai#943 uses 32K buckets and gets 0.0165. The extreme collisions may actually HELP Dirichlet mixing — more observations per bucket = tighter posteriors. Full-precision counts preserve exact ratios.

exp90 at 32K/int32 gave 0.1124 at only 712KB artifact. 15.3MB of headroom available. 128K buckets = 4x fewer collisions. 128K × 4 × 2 × 12 = 12.3MB → should fit in ~13MB total.

Enable two-pass eval (PR openai#943's key technique): - Pass 1: score all tokens with sliding window, build cache - Pass 2: rescore ALL positions using complete cache + hierarchical CTW - Pre-warm cache from training artifact before both passes - Eliminates cold-start problem — early tokens benefit from full cache

…-seed mean)

…let + conf_gain=12, stride=64

…seed mean)

sofiabod · 2026-03-29T14:41:37Z

Superseded — investigating normalization.

sofiabod added 30 commits March 18, 2026 14:34

initial

45422a6

add modal launcher for 8xh100 training

f13c234

fix md + tests

7df4c4b

exp42: SDPA only (flash-attn build fails on Modal)

8341935

- flash-attn requires GPU for compilation, Modal builds without GPU - keeping SDPA fallback, ~101ms/step - still have FA3 import attempt in code for when it becomes available

exp44: try flash-attn runtime install + SDPA fallback

be8b359

- attempt flash-attn pip install at runtime with 120s timeout - still falls back to SDPA if install fails - 101ms/step with SDPA, ~84ms with FA3

exp52: n-gram cache 4M buckets, single eval pass, fix zero-prob mixing

987b26b

exp54: 5-gram fixed alpha=0.2 cache (PR openai#769 recipe)

dcc4f69

exp55: truly sequential n-gram (fix chunking stale-count bug)

14d5771

exp56: dict-based n-gram cache (zero collisions), fixed alpha=0.05

1960721

exp57: multi-order backoff 2-5 gram dict cache, alpha=0.2

7928232

exp58: rewrite n-gram to match PR openai#753/openai#769/openai#779 (d…

9cd7357

…ual hash tables, per-window score-first, entropy-adaptive alpha, tc>0 check)

exp59: 9-gram + per-order entropy thresholds + distributed prefill

759dfa7

exp60: adopt PR openai#825 full stack (MHA 8/8, MLP 3.5x, XSA-all, Bi…

40eb1ed

…gramHash 6144, int5, stride=32) + 9-gram prefill

exp61: submission-ready (BigramHash 4096, skip diag evals, int5 QAT)

738ffaa

Record: Order-Adaptive 9-gram Backoff + Distributed Prefill (mean val…

1a2ac56

…_bpb=0.4405, 3 seeds)

exp62: two-pass full rescore + 16M buckets + 9-gram (PR openai#870 ap…

22ea6ed

…proach)

exp63: alpha max 0.95 + per-order multipliers (suppress bigrams, boos…

2042da9

…t 7-9 gram)

exp64: extend to 15-gram backoff + entropy centers for orders 10-15

ea9024b

exp65: 11-gram (from 15) to fit eval budget

001d36c

exp66: add LongPhraseCache (variable-length suffix matching) on top o…

79cef3b

…f two-pass 9-gram

sofiabod added 26 commits March 27, 2026 03:15

exp76: full phrase probes [48,36,28,20,16] (PR openai#880 set)

1b32847

Each additional probe length adds ~0.005 BPB. probe[28] → -0.007, probe[36] → -0.005. Testing if probe[48] captures even longer verbatim patterns.

exp77: order-13 flat Dirichlet + phrase[36,28,20,16]

e608af8

Extend n-gram to order-13 (PR openai#921 validates higher orders: 0.0939). Trim phrase to [36,28,20,16] to fit eval budget. Flat Dirichlet c=1.0 (highest match only — avoids hierarchical overhead).

exp78: stride=64 + order-13 + phrase[48,36,28,20,16]

f5c8cde

exp77 showed order-13 gives -0.011 BPB but blew eval budget (673s). Stride 48→64 saves ~25% of neural forward pass time. Re-enable full phrase probes since stride savings provide headroom.

exp79: stride=72 + order-13 + phrase[48,36,28,20,16]

c9c53a6

exp78 at stride=64 gave 0.2284 but eval=601s (1s over budget). Stride 64→72 reduces windows by ~11% for more eval headroom.

log exp78/79 results, prepare for 3-seed validation

1cf0598

exp79b: stride=96 for eval time safety margin

5587bb6

Seed 42 hit 635s eval on fast machine (order-13 + phrase cache CPU cost varies). Need stride=96 to ensure all 3 seeds pass 600s limit regardless of machine.

exp80: stride=128 for reliable eval budget compliance

35ea5e0

3-seed validation showed eval time variability (589-618s at stride=96). Stride=128 reduces windows by 33%, providing ~150s eval headroom. BPB loss from stride increases is negligible (confirmed across exp70-79).

exp81b: optimize n-gram build with bincount + limit shards

ffbb7d1

Use np.bincount instead of np.add.at (10-100x faster). Process in 1M chunks to limit memory. Limit to 20 shards (2.5B tokens) to fit in training budget. Order 2-9 instead of 2-13 for faster build.

fix: store n-gram tables as torch tensors for pickle compatibility

73a4aa4

fix: parallel n-gram build across ranks + all-reduce

838ad4f

Previous version built on rank 0 only, causing NCCL timeout on other ranks. Now each rank processes its shard subset, then all-reduce combines counts. 40 shards / 8 ranks = 5 shards per rank = ~65s per rank (vs 260s on rank 0).

exp82: 80 shards (10B tokens) + order-13 packed n-gram

bd7eb95

exp81c proved paradigm: 0.1518 BPB with 40 shards order-9. Extend to full 80 shards (10B tokens) + order 2-13 for richer cache. Expected: sub-0.12 (closing gap to openai#900 at 0.1197).

exp83: 256K buckets + order-13 + 80 shards (fit artifact budget)

4c06c4c

exp82 showed 0.1343 BPB but artifact=20.4MB (over 16MB limit). Halve buckets to 256K to reduce table size. 256K × 2 × 2 × 12 = 12.6MB raw → should compress to ~12MB.

exp85: order-9 + 524K buckets — more buckets, fewer collisions

ef56ea5

order-15 gave zero improvement over order-13 (collision bottleneck). Try doubling buckets (262K→524K) with fewer orders (9 instead of 13). More buckets = fewer collisions = better count accuracy = better mixing.

exp87: Dirichlet c=0.1 — trust pre-warmed cache more

9aa581a

With 10B tokens of training data, cache counts are very accurate. c=1.0 adds too much neural model weight. Try c=0.1 to let cache dominate.

exp89: 512K buckets with hierarchical CTW c=5.0

ca2b175

exp88 gave 0.1133 at 10.2MB (5.8MB headroom). Double buckets to 512K to reduce collisions. Keep ratio-preserving uint16 + hierarchical CTW.

exp91: 128K buckets + int32 — use 15MB headroom

e54a3bd

exp90 at 32K/int32 gave 0.1124 at only 712KB artifact. 15.3MB of headroom available. 128K buckets = 4x fewer collisions. 128K × 4 × 2 × 12 = 12.3MB → should fit in ~13MB total.

switch to single-pass eval (two-pass has self-inclusion leak)

16612cc

Record: Single-Pass Packed N-gram + Dirichlet CTW — val_bpb 0.1130 (3…

f76188e

…-seed mean)

PR#944 exact recipe: 32K buckets, 8M tokens, order-12, backoff Dirich…

a7d1980

…let + conf_gain=12, stride=64

Record: Packed Causal N-gram + Dirichlet Backoff — val_bpb 0.0180 (3-…

7a5524b

…seed mean)

notapplica mentioned this pull request Mar 29, 2026

⛳ Parameter Golf Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes #140

Open

sofiabod closed this Mar 29, 2026

sofiabod reopened this Mar 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: Single-Pass Packed N-gram + Dirichlet CTW — val_bpb 0.1130 (3-seed mean)#1030

Record: Single-Pass Packed N-gram + Dirichlet CTW — val_bpb 0.1130 (3-seed mean)#1030
sofiabod wants to merge 65 commits intoopenai:mainfrom
sofiabod:submission/single-pass-clean

sofiabod commented Mar 28, 2026

Uh oh!

sofiabod commented Mar 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sofiabod commented Mar 28, 2026

Record: Single-Pass Packed N-gram + Hierarchical Dirichlet CTW — val_bpb 0.1130 (3-seed mean)

Results

Method

Architecture

Packed N-gram Artifact

Hierarchical Dirichlet CTW Mixing

Single-Pass Score-First Eval

Key Innovation

Legality

Credits

Uh oh!

sofiabod commented Mar 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant