
Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB#900

Closed
Robby955 wants to merge 4 commits into openai:main from Robby955:record/dirichlet-phrase-0.1197

Conversation


@Robby955 Robby955 commented Mar 26, 2026

Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.11559 BPB

val_bpb: 0.11559 (3-seed mean, std 3.8e-6) | ~14.9 MB | 8xH100 SXM

What changed from previous version (0.11807)

Per-order concentration learning via Bayesian Online Concentration Learning (OBCL). Instead of a single c=5.0 for all n-gram orders, each order gets its own concentration learned from a posterior over a 50-point log-spaced grid [0.5, 50.0]:

| Orders | Learned c | Interpretation |
|---|---|---|
| 2-3 (bigram, trigram) | ~50.0 | Low orders noisy; need heavy neural prior |
| 4 | 6.95 | Transitional |
| 5 | 2.98 | Moderate evidence |
| 6-8 | ~2.05 | More specific matches; trust counts |
| 9-14 | ~1.86 | High-order matches precise; minimal smoothing |

This 27x spread in optimal concentration across orders is explained by the exponential decrease in hash collision rate with increasing match length.
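To make the grid-posterior idea concrete, here is a minimal sketch of how a per-order concentration could be learned online over a log-spaced grid. The grid bounds come from the description above, but the update rule, class layout, and the toy counts are assumptions, not the PR's actual OBCL implementation:

```python
import numpy as np

# Hypothetical sketch of Bayesian online concentration learning over a
# 50-point log-spaced grid [0.5, 50.0], as described above. The update
# rule and numbers below are illustrative assumptions.
GRID = np.geomspace(0.5, 50.0, 50)

class GridConcentrationLearner:
    def __init__(self):
        self.log_post = np.zeros(len(GRID))  # uniform prior over grid

    def update(self, count, total, prior):
        # Dirichlet-Multinomial predictive for the observed token under
        # each candidate c: p(token) = (count + c * prior) / (total + c)
        p = (count + GRID * prior) / (total + GRID)
        self.log_post += np.log(np.clip(p, 1e-12, 1.0))
        self.log_post -= self.log_post.max()  # numerical stability

    def concentration(self):
        post = np.exp(self.log_post)
        post /= post.sum()
        return float(post @ GRID)  # posterior mean of c

# Toy usage: strong, consistent count evidence drives the learned c down,
# matching the table's trend for high orders.
learner = GridConcentrationLearner()
for _ in range(200):
    learner.update(count=9, total=10, prior=1.0 / 1024)
print(learner.concentration())  # small c: trust the counts
```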

3-seed validation

| Seed | BPB | Artifact (bytes) | Eval time |
|---|---|---|---|
| 1337 | 0.11559293 | 14,926,561 | 448s |
| 2024 | 0.11559195 | 14,869,557 | 441s |
| 2025 | 0.11558566 | 14,814,921 | 429s |
| Mean | 0.11559 (std 3.8e-6) | | |

Approach

Same Dirichlet-Multinomial formula at every level:

p(token) = (count + c_k * prior) / (total + c_k)
  • Level 1: 15-gram recursive backoff with per-order concentrations (OBCL-learned)
  • Level 2: Phrase suffix matching (probes at 20, 16 tokens) with c_phrase=1.0
  • Base measure: Neural LM (EBLS: 3 shared blocks x 3 loops + 2 unique = 11 layers)
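The recursive backoff above can be sketched as follows. The count arrays, order subset, and base measure here are toy stand-ins (the PR uses hashed count tables over orders 2-15 and a neural LM base); the sketch only shows the backoff structure, in which each order's prior is the distribution produced by the orders below it:

```python
import numpy as np

# Minimal sketch of the two-level recursion, with exact (collision-free)
# counts. Per-order concentrations are truncated from the table above;
# the lowest-level prior stands in for the neural LM base measure.
VOCAB = 1024
PER_ORDER_C = {2: 50.0, 3: 50.0, 4: 6.95, 5: 2.98}

def dirichlet_predictive(counts, total, c, prior):
    # Same formula at every level: (count + c * prior) / (total + c)
    return (counts + c * prior) / (total + c)

def backoff_distribution(ngram_counts, base, orders=(2, 3, 4, 5)):
    """Low order -> high order: each order smooths toward the
    distribution produced by the orders below it."""
    prior = base
    for order in orders:
        counts = ngram_counts.get(order, np.zeros(VOCAB))
        prior = dirichlet_predictive(counts, counts.sum(),
                                     PER_ORDER_C[order], prior)
    return prior

# With exact counts, sum(counts) == total at every level, so the result
# is a proper distribution -- the property the review below shows is
# broken by hash collisions.
rng = np.random.default_rng(0)
base = np.full(VOCAB, 1.0 / VOCAB)
counts = {3: rng.integers(0, 5, VOCAB).astype(float)}
p = backoff_distribution(counts, base)
print(p.sum())  # -> 1.0 up to float error
```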

Key ablations

| Config | BPB | Delta |
|---|---|---|
| Neural only (EBLS + GPTQ) | 1.1745 | baseline |
| + 15-gram Dirichlet backoff (flat c=5.0) | 0.2292 | -0.945 |
| + phrase Dirichlet (c=1.0) | 0.1181 | -0.111 |
| + per-order OBCL concentrations | 0.1156 | -0.002 |
| Phrase with linear interp instead of Dirichlet | 1.0686 | 8.9x worse |

All ablation deltas exceed 200 sigma (3-seed std 3.8e-6).

Compliance

  • Training: 560s on 8xH100 (within 600s)
  • Eval: 448s worst case (within 600s)
  • Artifact: 14,926,561 bytes worst case (within 16,000,000)
  • Single-pass, strictly backward-looking, no training data at eval
  • No oracle/min(NLL) selection

Legality

N-gram caching ruled "directionally legal" by @valerio-oai (Issue #677). Single-pass, score-first, causal. We also maintain a separate neural-only submission (PR #734, 1.1198 BPB).

See README.md for full details, concentration landscapes, compression theory connection, and credits.

3-seed validated (std 0.000003):
  s1337: 0.11967683 (435s eval, 14.91MB)
  s2024: 0.11968156 (455s eval, 14.84MB)
  s2025: 0.11967545 (441s eval, 14.80MB)

Dirichlet-Multinomial posterior predictive applied at two levels:
- N-gram backoff (orders 2-15, c=5.0)
- Phrase suffix matching (probes=[20,16], c=2.0)

Ablation: removing Dirichlet from phrase mixing degrades BPB 8.9x.
….5e-6)

- Optimized phrase-level concentration from 2.0 to 1.0 via sweep
- Added phrase concentration landscape table (convex, min at 1.0)
- Expanded compression theory section (CTW connection, match-length scaling, OBCL decomposition)
- Updated 3-seed results: s1337=0.11807, s2024=0.11807, s2025=0.11806
- Longer matches need less smoothing: c* decreases from ~50 (bigrams) to 1.0 (phrases)
Log files now match claimed BPB (0.11807). All numbers are exact from
verified pod runs, not approximations.
Robby955 changed the title from "Record: Two-Level Dirichlet Posterior Mixing — 0.11968 BPB" to "Record: Two-Level Dirichlet Posterior Mixing — 0.11807 BPB" on Mar 27, 2026
Per-order concentrations learned via Bayesian Online Concentration Learning
range from 50.0 (bigrams) to 1.86 (14-grams). Improves from 0.11807 to
0.11559 (-0.00248 BPB). 3-seed mean 0.11559, std 3.8e-6.
Robby955 changed the title from "Record: Two-Level Dirichlet Posterior Mixing — 0.11807 BPB" to "Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB" on Mar 27, 2026

Eppie commented Mar 27, 2026

Moving discussion here so as to not clutter the other thread. It's late where I live and I'm a bit tired, so I had Claude write my response for me:


Your theoretical argument is correct — the Dirichlet-Multinomial posterior predictive produces a valid distribution when Σ_y count_y = N. The issue is that with hash tables, this identity doesn't hold, and the deviation is not negligible.

I implemented your exact formula (orders 2-15, per-order concentrations [50.0, 50.0, 6.95, 2.98, 2.05, 2.05, 2.05, 1.86, 1.86, 1.86, 1.86, 1.86, 1.86, 1.86], phrase probes at 20 and 16, 4M n-gram buckets, 1M phrase buckets, phrase concentration 1.0) and computed p(t) for all 1024 vocab tokens at sample positions as the cache fills:

| chunk | tokens_seen | avg_p_sum | min_p_sum | max_p_sum |
|---|---|---|---|---|
| 0 | 0 | 1.0000 | 1.0000 | 1.0000 |
| 1 | 131k | 16.8850 | 1.3289 | 67.3684 |
| 2 | 262k | 80.8645 | 2.2301 | 140.4559 |
| 5 | 655k | 271.2090 | 55.8939 | 380.3625 |
| 9 | 1.18M | 438.9720 | 215.0402 | 546.1090 |
| 25 | 3.28M | 767.6675 | 555.1482 | 875.6750 |
| 49 | 6.42M | 818.1985 | 583.1743 | 967.0769 |

These should all be exactly 1.0. The choice of prior (uniform vs neural softmax) doesn't affect whether the sum equals 1 — that depends entirely on whether Σ_y count_y = N, which is what the hash collisions break.

With 4M buckets, each of the 1024 lookups full_table[hash(ctx, t)] independently picks up counts from unrelated n-grams that collided into the same bucket. Most of these don't exceed ctx_table[hash(ctx)] so the min() clipping doesn't help, and they sum to far more than N.

You can verify this yourself — at any position after warmup, run your full hierarchical Dirichlet update (all orders + phrase) for all 1024 vocab tokens instead of just the correct one, and print sm_p.sum().
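A toy version of that check reproduces the effect. The hash, table sizes, and bigram-only counts below are deliberately tiny stand-ins (not the PR's 4M/1M tables), chosen so collisions are frequent enough to see the sum blow past 1 quickly:

```python
import numpy as np

# Toy demonstration: score every vocab token from hashed count tables
# and sum. Sizes are deliberately small so collisions are frequent;
# everything here is a stand-in for the PR's real tables.
VOCAB, BUCKETS, C = 64, 128, 5.0
full_table = np.zeros(BUCKETS)  # bucket -> count of (ctx, token) pairs
ctx_table = np.zeros(BUCKETS)   # bucket -> count of ctx occurrences

def h(*key):
    return hash(key) % BUCKETS  # deterministic for int tuples

rng = np.random.default_rng(0)
stream = rng.integers(0, VOCAB, 50_000)
for i in range(1, len(stream)):
    ctx, tok = int(stream[i - 1]), int(stream[i])
    full_table[h(ctx, tok)] += 1
    ctx_table[h(ctx)] += 1

ctx = 7
N = ctx_table[h(ctx)]
prior = 1.0 / VOCAB
# min() clipping as described above; colliding n-grams still inflate counts
counts = np.array([min(full_table[h(ctx, t)], N) for t in range(VOCAB)])
p = (counts + C * prior) / (N + C)
print(p.sum())  # far greater than 1.0 once the tables fill up
```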


Me again: I just saw your arithmetic encoder. It uses exact counts via defaultdict(), which is different from what is implemented in this PR.


mhuen commented Mar 27, 2026

Another way to put it: in the discussion in #677 you mention that everything is properly normalized because of this:

sum_y p(y) = (sum_y count_y + c * sum_y prior_y) / (N + c) = (N + c) / (N + c) = 1

The problem is that normalization is required not over the hash buckets y, but over all possible output tokens t_i: the assigned probabilities for each possible output token t_i must sum to 1. That is why this breaks down and artificially boosts the scores. The mixture PDF in #913 is fine and correct if p_ng were properly normalized over the tokens t_i.

Example:

For the extreme case of only 1 cache bin (maximal collisions):

p(token_i | ...) = (count_y + c * prior_y) / (N + c)
                 = (N + c * prior_y) / (N + c)

For easier computation we can assume c << N for now:

                 ≈ 1

And then the problem is clear: the probability for the i-th token p(token_i) is close to 1, which artificially boosts the score and results in a very small BPB. However, the probability for the k-th token p(token_k) will also be ~1, the same as for all other tokens. So:

Σ_i p(token_i | ...) ≈ vocab_size ≠ 1

The issue becomes very clear if you then think about the forward pass and how you would now sample the next token. All p(token_i | ...) ≈ 1. You would have to renormalize this to get a proper probability distribution, and what you'd get is simply a uniform prior with p_norm(token_i | ...) = 1 / vocab_size. This also makes sense because the single hash bucket adds no distinguishing power. The above argument still holds if you increase c and have fewer collisions.
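Plugging illustrative numbers into the single-bucket case makes this concrete (the count N and c are arbitrary choices, not values from the PR):

```python
# Numeric version of the single-bucket extreme case: with one hash
# bucket, every token lookup returns the full count N, so every token
# scores ~1 and the "distribution" sums to roughly vocab_size.
VOCAB, N, c = 1024, 100_000, 5.0
prior = 1.0 / VOCAB

# All 1024 lookups hit the same bucket, so count_y == N for every token.
p_token = (N + c * prior) / (N + c)  # ~1 for each token
total = VOCAB * p_token              # ~vocab_size, not 1

print(p_token, total)
```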

TLDR: it's a bug in the evaluation

@Robby955
Author

Closing this PR. The unnormalized point-estimate scoring exploits hash collision artifacts — we confirmed this ourselves when implementing full-distribution normalization (partition functions averaged ~950 instead of 1.0 across the vocab). The Dirichlet-Multinomial framework is sound with exact counts, but hash table collisions invalidate the score in practice.

Some findings from this work that may be useful to others:

  • Pitman-Yor discount is counterproductive for BPE vocabularies (d=0.0 optimal; d=0.3/0.5/0.7 monotonically worse, consistent with Zipf exponent <1 for BPE tokens)
  • Optimal concentration parameters track hash collision rate (load factor), serving dual roles as Bayesian prior strength and implicit collision correction
  • Two-pass LOO scoring removes self-prediction bias but was ruled out of scope

Pivoting to neural track. Will write up the collision-concentration analysis separately.
