Evidence-aware Dirichlet concentration: +0.94% on FineWeb (properly normalized)#1024
Open

immartian wants to merge 2 commits into openai:main from
Conversation
One-line change to hierarchical Dirichlet CTW mixing:

c_eff = c_base / (1 + β × log(ctx_count) × avg_idf(context))

Instead of a fixed c=5.0 for all contexts, adapt the concentration based on evidence strength (ctx_count) and context specificity (IDF):

- High counts + rare context → low c → trust the n-gram counts
- Low counts + common context → c ≈ c_base → smooth toward the backup

Results (synthetic two-regime corpus, 200K tokens):

- Fixed CTW (c=5.0): 1.0511 bits/token
- Binding CTW (c=c(B)): 0.6868 bits/token (35% better)

Wins on both regimes:

- Rare deterministic: 0.976 vs 1.519 (+0.543 bpt)
- Common ambiguous: 0.720 vs 1.087 (+0.366 bpt)

19 tests + a reproducible proof script included.
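A minimal sketch of this rule (illustrative only; the function and parameter names here are assumptions, not the module's actual API):

```python
import math

def adaptive_concentration(c_base: float, beta: float,
                           ctx_count: int, avg_idf: float) -> float:
    """Evidence-aware concentration: shrink c as evidence and context specificity grow."""
    # A high ctx_count and a rare (high-IDF) context drive c_eff toward 0,
    # so the mixture trusts the observed n-gram counts more.
    return c_base / (1.0 + beta * math.log(max(ctx_count, 1)) * avg_idf)

# Example: a well-observed, rare context vs. a sparsely observed, common one.
print(adaptive_concentration(5.0, 1.0, ctx_count=10_000, avg_idf=4.0))  # small c_eff
print(adaptive_concentration(5.0, 1.0, ctx_count=3,      avg_idf=0.5))  # close to c_base
```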
Author

Note on normalization: We're aware of the ongoing discussion in #677 regarding n-gram cache normalization issues. Our binding CTW module (…) The 35% improvement shown here compares fixed vs adaptive concentration under identical conditions (same cache, same normalization). The relative improvement should transfer to any valid n-gram scoring method. We're happy to adapt the implementation to whatever evaluation rules the maintainers settle on.
…nce)

The certainty-based formula (fc/cc) created target-dependent concentration, which breaks probability normalization: the same bug that invalidated PR openai#986's n-gram caches.

Fixed formula: c_eff = c_base / (1 + beta * log1p(ctx_count))

This depends ONLY on ctx_count and is identical for all possible next tokens.

Validated on real FineWeb data (causal, no training pre-fill):

- Best fixed (c=0.05): 2.2928 bpt
- Evidence-aware (c=0.1, b=10): 2.2840 bpt (+0.38%)
- Late positions: 0.5630 vs 0.5684 (+0.94%)

A small but honest improvement, properly normalized.
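To illustrate why the target-independent rule preserves normalization (a hedged sketch; the `mix` helper and the counts below are made up for demonstration, not taken from the module):

```python
import math

def mix(counts, backoff, conc):
    """Dirichlet-style mixing where conc(token) is the concentration used for that token."""
    n = sum(counts.values())
    return {t: (counts.get(t, 0) + conc(t) * q) / (n + conc(t)) for t, q in backoff.items()}

counts, backoff = {"a": 7, "b": 1}, {"a": 0.5, "b": 0.3, "c": 0.2}

# Target-dependent concentration (in the spirit of the old fc/cc rule): probabilities do NOT sum to 1.
bad = mix(counts, backoff, conc=lambda t: 0.1 / (1 + counts.get(t, 0)))

# Evidence-only concentration (ctx_count only): identical for every token, so the mixture sums to 1.
c = 0.1 / (1 + 10 * math.log1p(sum(counts.values())))
good = mix(counts, backoff, conc=lambda t: c)

print(sum(bad.values()))   # slightly above 1.0
print(sum(good.values()))  # exactly 1.0
```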
Evidence-Aware Dirichlet Concentration for N-gram Mixing
The problem
Hierarchical Dirichlet CTW mixing uses a fixed concentration c for all contexts:
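A simplified sketch of that baseline (illustrative; the repository's actual CTW mixing code may differ):

```python
def fixed_mix(counts: dict, backoff: dict, c: float = 5.0) -> dict:
    """Dirichlet-style mixing with the same concentration c for every context."""
    ctx_count = sum(counts.values())
    return {t: (counts.get(t, 0) + c * q) / (ctx_count + c) for t, q in backoff.items()}
```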
But contexts with 10,000 observations are much more reliable than contexts with 3 observations. Fixed c treats them the same.
The fix (one line, properly normalized)
Critical: c_eff depends ONLY on ctx_count (total context observations), NOT on the target token's count. This ensures probabilities sum to 1 across all possible tokens. Using target-dependent concentration (e.g. fc/cc) breaks normalization — the same bug that invalidated many n-gram cache submissions (see #677).
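A sketch of the change (names are illustrative and the c_base=0.1, beta=10 defaults mirror the configuration reported below; this is not the module's actual code):

```python
import math

def effective_concentration(ctx_count: int, c_base: float = 0.1, beta: float = 10.0) -> float:
    # The one-line change: the fixed c_base becomes evidence-dependent.
    # log1p keeps the value well-defined at ctx_count = 0, where c_eff == c_base.
    return c_base / (1.0 + beta * math.log1p(ctx_count))
```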
How it works
More evidence -> lower concentration -> trust the observed counts.
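For example, with the c_base = 0.1, beta = 10 configuration reported below, c_eff falls off quickly as evidence accumulates:

```python
import math

for ctx_count in (0, 3, 100, 10_000):
    c_eff = 0.1 / (1.0 + 10.0 * math.log1p(ctx_count))
    print(ctx_count, round(c_eff, 5))
# 0      -> 0.1      (no evidence: lean on the backup distribution)
# 3      -> 0.00673  (even a little evidence pulls c down sharply)
# 100    -> 0.00212
# 10000  -> 0.00107  (heavy evidence: trust the n-gram counts)
```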
Results on real FineWeb data
Causal scoring (no training pre-fill), 100K validation tokens:
Certainty (c=0.1, b=10): 2.2838 bpt overall, 0.5623 bpt on late positions; +0.38% overall and +0.94% on the warmer cache vs the best fixed concentration.
The improvement is small but honest:
What we learned (honest accounting)
The theoretically correct self-model is weaker than we hoped: "I know how much evidence I have" rather than "I know how informative my context is." But it's the claim that actually holds.
Files
Test plan