
Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB#900

Closed
Robby955 wants to merge 4 commits into openai:main from Robby955:record/dirichlet-phrase-0.1197

Conversation


@Robby955 Robby955 commented Mar 26, 2026

Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.11559 BPB

val_bpb: 0.11559 (3-seed mean, std 3.8e-6) | ~14.9 MB | 8xH100 SXM

What changed from previous version (0.11807)

Per-order concentration learning via Bayesian Online Concentration Learning (OBCL). Instead of a single c=5.0 for all n-gram orders, each order gets its own concentration learned from a posterior over a 50-point log-spaced grid [0.5, 50.0]:

| Orders | Learned c | Interpretation |
|---|---|---|
| 2-3 (bigram, trigram) | ~50.0 | Low orders noisy; need heavy neural prior |
| 4 | 6.95 | Transitional |
| 5 | 2.98 | Moderate evidence |
| 6-8 | ~2.05 | More specific matches; trust counts |
| 9-14 | ~1.86 | High-order matches precise; minimal smoothing |

This 27x spread in optimal concentration across orders is explained by the exponential decrease in hash collision rate with increasing match length.
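To make the grid-posterior idea concrete, here is a minimal sketch of how a per-order concentration could be learned online over a log-spaced grid. The grid bounds come from the description above, but the update rule, class layout, and the toy counts are assumptions, not the PR's actual OBCL implementation:

```python
import numpy as np

# Hypothetical sketch of Bayesian online concentration learning over a
# 50-point log-spaced grid [0.5, 50.0], as described above. The update
# rule and numbers below are illustrative assumptions.
GRID = np.geomspace(0.5, 50.0, 50)

class GridConcentrationLearner:
    def __init__(self):
        self.log_post = np.zeros(len(GRID))  # uniform prior over grid

    def update(self, count, total, prior):
        # Dirichlet-Multinomial predictive for the observed token under
        # each candidate c: p(token) = (count + c * prior) / (total + c)
        p = (count + GRID * prior) / (total + GRID)
        self.log_post += np.log(np.clip(p, 1e-12, 1.0))
        self.log_post -= self.log_post.max()  # numerical stability

    def concentration(self):
        post = np.exp(self.log_post)
        post /= post.sum()
        return float(post @ GRID)  # posterior mean of c

# Toy usage: strong, consistent count evidence drives the learned c down,
# matching the table's trend for high orders.
learner = GridConcentrationLearner()
for _ in range(200):
    learner.update(count=9, total=10, prior=1.0 / 1024)
print(learner.concentration())  # small c: trust the counts
```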

3-seed validation

| Seed | BPB | Artifact (bytes) | Eval time |
|---|---|---|---|
| 1337 | 0.11559293 | 14,926,561 | 448s |
| 2024 | 0.11559195 | 14,869,557 | 441s |
| 2025 | 0.11558566 | 14,814,921 | 429s |
| Mean | 0.11559 (std 3.8e-6) | | |

Approach

Same Dirichlet-Multinomial formula at every level:

p(token) = (count + c_k * prior) / (total + c_k)
  • Level 1: 15-gram recursive backoff with per-order concentrations (OBCL-learned)
  • Level 2: Phrase suffix matching (probes at 20, 16 tokens) with c_phrase=1.0
  • Base measure: Neural LM (EBLS: 3 shared blocks x 3 loops + 2 unique = 11 layers)
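The recursive backoff above can be sketched as follows. The count arrays, order subset, and base measure here are toy stand-ins (the PR uses hashed count tables over orders 2-15 and a neural LM base); the sketch only shows the backoff structure, in which each order's prior is the distribution produced by the orders below it:

```python
import numpy as np

# Minimal sketch of the two-level recursion, with exact (collision-free)
# counts. Per-order concentrations are truncated from the table above;
# the lowest-level prior stands in for the neural LM base measure.
VOCAB = 1024
PER_ORDER_C = {2: 50.0, 3: 50.0, 4: 6.95, 5: 2.98}

def dirichlet_predictive(counts, total, c, prior):
    # Same formula at every level: (count + c * prior) / (total + c)
    return (counts + c * prior) / (total + c)

def backoff_distribution(ngram_counts, base, orders=(2, 3, 4, 5)):
    """Low order -> high order: each order smooths toward the
    distribution produced by the orders below it."""
    prior = base
    for order in orders:
        counts = ngram_counts.get(order, np.zeros(VOCAB))
        prior = dirichlet_predictive(counts, counts.sum(),
                                     PER_ORDER_C[order], prior)
    return prior

# With exact counts, sum(counts) == total at every level, so the result
# is a proper distribution -- the property the review below shows is
# broken by hash collisions.
rng = np.random.default_rng(0)
base = np.full(VOCAB, 1.0 / VOCAB)
counts = {3: rng.integers(0, 5, VOCAB).astype(float)}
p = backoff_distribution(counts, base)
print(p.sum())  # -> 1.0 up to float error
```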

Key ablations

| Config | BPB | Delta |
|---|---|---|
| Neural only (EBLS + GPTQ) | 1.1745 | baseline |
| + 15-gram Dirichlet backoff (flat c=5.0) | 0.2292 | -0.945 |
| + phrase Dirichlet (c=1.0) | 0.1181 | -0.111 |
| + per-order OBCL concentrations | 0.1156 | -0.002 |
| Phrase with linear interp instead of Dirichlet | 1.0686 | 8.9x worse |

All ablation deltas exceed 200 sigma (3-seed std 3.8e-6).

Compliance

  • Training: 560s on 8xH100 (within 600s)
  • Eval: 448s worst case (within 600s)
  • Artifact: 14,926,561 bytes worst case (within 16,000,000)
  • Single-pass, strictly backward-looking, no training data at eval
  • No oracle/min(NLL) selection

Legality

N-gram caching ruled "directionally legal" by @valerio-oai (Issue #677). Single-pass, score-first, causal. We also maintain a separate neural-only submission (PR #734, 1.1198 BPB).

See README.md for full details, concentration landscapes, compression theory connection, and credits.

3-seed validated (std 0.000003):
  s1337: 0.11967683 (435s eval, 14.91MB)
  s2024: 0.11968156 (455s eval, 14.84MB)
  s2025: 0.11967545 (441s eval, 14.80MB)

Dirichlet-Multinomial posterior predictive applied at two levels:
- N-gram backoff (orders 2-15, c=5.0)
- Phrase suffix matching (probes=[20,16], c=2.0)

Ablation: removing Dirichlet from phrase mixing degrades BPB 8.9x.
….5e-6)

- Optimized phrase-level concentration from 2.0 to 1.0 via sweep
- Added phrase concentration landscape table (convex, min at 1.0)
- Expanded compression theory section (CTW connection, match-length scaling, OBCL decomposition)
- Updated 3-seed results: s1337=0.11807, s2024=0.11807, s2025=0.11806
- Longer matches need less smoothing: c* decreases from ~50 (bigrams) to 1.0 (phrases)
Log files now match claimed BPB (0.11807). All numbers are exact from
verified pod runs, not approximations.
Robby955 changed the title from "Record: Two-Level Dirichlet Posterior Mixing — 0.11968 BPB" to "Record: Two-Level Dirichlet Posterior Mixing — 0.11807 BPB" on Mar 27, 2026
Per-order concentrations learned via Bayesian Online Concentration Learning
range from 50.0 (bigrams) to 1.86 (14-grams). Improves from 0.11807 to
0.11559 (-0.00248 BPB). 3-seed mean 0.11559, std 3.8e-6.
Robby955 changed the title from "Record: Two-Level Dirichlet Posterior Mixing — 0.11807 BPB" to "Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB" on Mar 27, 2026

Eppie commented Mar 27, 2026

Moving discussion here so as to not clutter the other thread. It's late where I live and I'm a bit tired, so I had Claude write my response for me:


Your theoretical argument is correct — the Dirichlet-Multinomial posterior predictive produces a valid distribution when Σ_y count_y = N. The issue is that with hash tables, this identity doesn't hold, and the deviation is not negligible.

I implemented your exact formula (orders 2-15, per-order concentrations [50.0, 50.0, 6.95, 2.98, 2.05, 2.05, 2.05, 1.86, 1.86, 1.86, 1.86, 1.86, 1.86, 1.86], phrase probes at 20 and 16, 4M n-gram buckets, 1M phrase buckets, phrase concentration 1.0) and computed p(t) for all 1024 vocab tokens at sample positions as the cache fills:

| chunk | tokens_seen | avg_p_sum | min_p_sum | max_p_sum |
|---|---|---|---|---|
| 0 | 0 | 1.0000 | 1.0000 | 1.0000 |
| 1 | 131k | 16.8850 | 1.3289 | 67.3684 |
| 2 | 262k | 80.8645 | 2.2301 | 140.4559 |
| 5 | 655k | 271.2090 | 55.8939 | 380.3625 |
| 9 | 1.18M | 438.9720 | 215.0402 | 546.1090 |
| 25 | 3.28M | 767.6675 | 555.1482 | 875.6750 |
| 49 | 6.42M | 818.1985 | 583.1743 | 967.0769 |

These should all be exactly 1.0. The choice of prior (uniform vs neural softmax) doesn't affect whether the sum equals 1 — that depends entirely on whether Σ_y count_y = N, which is what the hash collisions break.

With 4M buckets, each of the 1024 lookups full_table[hash(ctx, t)] independently picks up counts from unrelated n-grams that collided into the same bucket. Most of these don't exceed ctx_table[hash(ctx)] so the min() clipping doesn't help, and they sum to far more than N.

You can verify this yourself — at any position after warmup, run your full hierarchical Dirichlet update (all orders + phrase) for all 1024 vocab tokens instead of just the correct one, and print sm_p.sum().
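A toy version of that check reproduces the effect. The hash, table sizes, and bigram-only counts below are deliberately tiny stand-ins (not the PR's 4M/1M tables), chosen so collisions are frequent enough to see the sum blow past 1 quickly:

```python
import numpy as np

# Toy demonstration: score every vocab token from hashed count tables
# and sum. Sizes are deliberately small so collisions are frequent;
# everything here is a stand-in for the PR's real tables.
VOCAB, BUCKETS, C = 64, 128, 5.0
full_table = np.zeros(BUCKETS)  # bucket -> count of (ctx, token) pairs
ctx_table = np.zeros(BUCKETS)   # bucket -> count of ctx occurrences

def h(*key):
    return hash(key) % BUCKETS  # deterministic for int tuples

rng = np.random.default_rng(0)
stream = rng.integers(0, VOCAB, 50_000)
for i in range(1, len(stream)):
    ctx, tok = int(stream[i - 1]), int(stream[i])
    full_table[h(ctx, tok)] += 1
    ctx_table[h(ctx)] += 1

ctx = 7
N = ctx_table[h(ctx)]
prior = 1.0 / VOCAB
# min() clipping as described above; colliding n-grams still inflate counts
counts = np.array([min(full_table[h(ctx, t)], N) for t in range(VOCAB)])
p = (counts + C * prior) / (N + C)
print(p.sum())  # far greater than 1.0 once the tables fill up
```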


Me again: I just saw your arithmetic encoder. It uses exact counts via defaultdict(), which is different from what is implemented in this PR.


mhuen commented Mar 27, 2026

Another way to put it: in the discussion in #677 you mention that everything is properly normalized because of this:

sum_y p(y) = (sum_y count_y + c * sum_y prior_y) / (N + c) = (N + c) / (N + c) = 1

The problem is that normalization is required not over the hash buckets y, but over all possible output tokens t_i: the assigned probabilities for each possible output token t_i must sum to 1. That is why this breaks down and artificially boosts the scores. The mixture PDF in #913 is fine and correct if p_ng were properly normalized over the tokens t_i.

Example:

For the extreme case of only 1 cache bin (maximal collisions):

p(token_i | ...) = (count_y + c * prior_y) / (N + c)
                 = (N + c * prior_y) / (N + c)

For easier computation we can assume c << N for now:

                 ≈ 1

And then the problem is clear: the probability for the i-th token p(token_i) is close to 1, which artificially boosts the score and results in a very small BPB. However, the probability for the k-th token p(token_k) will also be ~1, the same as for all other tokens. So:

Σ_i p(token_i | ...) ≈ vocab_size ≠ 1

The issue becomes very clear if you then think about the forward pass and how you would now sample the next token. All p(token_i | ...) ≈ 1. You would have to renormalize this to get a proper probability distribution, and what you'd get is simply a uniform prior with p_norm(token_i | ...) = 1 / vocab_size. This also makes sense because the single hash bucket adds no distinguishing power. The above argument still holds if you increase c and have fewer collisions.
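Plugging illustrative numbers into the single-bucket case makes this concrete (the count N and c are arbitrary choices, not values from the PR):

```python
# Numeric version of the single-bucket extreme case: with one hash
# bucket, every token lookup returns the full count N, so every token
# scores ~1 and the "distribution" sums to roughly vocab_size.
VOCAB, N, c = 1024, 100_000, 5.0
prior = 1.0 / VOCAB

# All 1024 lookups hit the same bucket, so count_y == N for every token.
p_token = (N + c * prior) / (N + c)  # ~1 for each token
total = VOCAB * p_token              # ~vocab_size, not 1

print(p_token, total)
```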

TLDR: it's a bug in the evaluation

@Robby955
Author

Closing this PR. The unnormalized point-estimate scoring exploits hash collision artifacts — we confirmed this ourselves when implementing full-distribution normalization (partition functions averaged ~950 instead of 1.0 across the vocab). The Dirichlet-Multinomial framework is sound with exact counts, but hash table collisions invalidate the score in practice.

Some findings from this work that may be useful to others:

  • Pitman-Yor discount is counterproductive for BPE vocabularies (d=0.0 optimal; d=0.3/0.5/0.7 monotonically worse, consistent with Zipf exponent <1 for BPE tokens)
  • Optimal concentration parameters track hash collision rate (load factor), serving dual roles as Bayesian prior strength and implicit collision correction
  • Two-pass LOO scoring removes self-prediction bias but was ruled out of scope

Pivoting to neural track. Will write up the collision-concentration analysis separately.
