
Partition Function Inflation: Why Hashed N-Gram Caches Produce Invalid BPB Scores (Non-Record, Analytical)#1147

Open
Robby955 wants to merge 2 commits into openai:main from Robby955:partition-function-inflation-analysis

Conversation

@Robby955 Robby955 commented Mar 30, 2026

Summary

This is a non-record analytical submission (following the precedent of PR #363) providing the mathematical explanation for why hashed n-gram caches appear to dramatically improve BPB but actually produce invalid probability distributions.

We derive the partition function Z of the recursive Dirichlet-multinomial scoring function used by cache-augmented submissions (#777, #796, #900, #986, and others). The key result:

Z* = (n_c + V · L) / (n_c + L)

where V = 1024 (vocab size), L = N/B (load factor), and n_c is the true context count. At 1M buckets (L ≈ 59), Z* exceeds 1,000 — meaning the scoring function's outputs sum to ~1,000 instead of 1.
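As a quick sanity check of this closed form, here is a minimal sketch (the helper name is mine; V and L are the values quoted above):

```python
# Sketch: closed-form partition mass of the unnormalized hashed-cache score,
# Z* = (n_c + V*L) / (n_c + L). V = 1024 and L = 59 are the values from the
# text; the function name is illustrative.
def z_star(n_c, V=1024, L=59.0):
    return (n_c + V * L) / (n_c + L)

# At n_c = 0 the total mass is exactly V, i.e. 1024 for this vocabulary:
print(z_star(0))              # -> 1024.0
# Even at a large true context count the mass remains far above 1:
print(round(z_star(1000), 1))
```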

A partial normalization sweep (dividing by Z^λ for λ ∈ [0, 1]) directly measures the penalty: at λ = 1 (full normalization), 1M buckets scores 4.10 BPB — far above the 1.130 neural baseline. Even 25% normalization (λ ≈ 0.25) erases the apparent gain.
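The linear shape of the sweep follows directly from the definition: dividing a score s(x) by Z^λ turns the per-token cost −log₂ s(x) into −log₂ s(x) + λ·log₂ Z, so the penalty grows linearly in λ. A minimal sketch (the helper name is mine):

```python
import math

# Sketch: why BPB is linear in lambda. Dividing an unnormalized score by
# Z**lam adds exactly lam * log2(Z) bits to every token's code length.
def extra_bits(Z, lam):
    """Additional bits per token at normalization strength lam."""
    return lam * math.log2(Z)

# With Z* near 1024 (the n_c = 0 value for V = 1024), full normalization
# adds log2(1024) = 10 bits per token; half normalization adds 5:
print(extra_bits(1024, 1.0))  # -> 10.0
print(extra_bits(1024, 0.5))  # -> 5.0
```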

[Figure: Lambda sweep — BPB increases linearly with normalization]

[Figure: Before and after normalization across 4 bucket sizes]

Key Findings

  1. Monotonic bucket ranking explained: Smaller buckets → higher L → larger Z* → more inflation → lower (invalid) BPB. Confirmed across 36 configurations (4 bucket sizes × 9 concentration values) and 27 per-order concentration profiles.

  2. Random collisions = real collisions: Rate-matched random collision partners replicate 99.4% of the observed BPB gap (residual = 0.009 BPB on a 1.619 BPB range). Across 8 independent remap seeds, std = 0.00004 BPB. The mechanism is count-inflation magnitude, not collision-partner structure.

  3. Synthetic floor: Adding uniform fake counts to clean 64M-bucket tables achieves 0.020 BPB — beating all real cache configurations — despite zero linguistic content.

  4. Normalization eliminates apparent gains: Under post-hoc normalization (λ = 1), all four tested bucket sizes produce BPB ≈ 4.1, all far above the 1.130 neural baseline. Step-wise normalization yields 4.15 BPB at 1M buckets. PR #978 (AnirudhRahul, "Rerun of #972 with actual full-vocab normalization") independently confirmed this: normalizing degraded its score from 0.39 to 1.51 BPB.

  5. Partial normalization sweep: BPB increases linearly with λ (R² > 0.9999 across 7 measured points). Slope ≈ 3.94 BPB/unit λ at 1M. Validated at a second bucket size (64M, λ = 0.5).

  6. Z diagnostic: The recurrence Z_n = (S_n + α · Z_{n-1}) / (c_n + α) provides a check. If Z ≫ 1, unnormalized scores are unreliable. Z* > 1 is unavoidable for any L > 0 and V > 1.
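The recurrence in point 6 can be folded across orders in a few lines. A sketch under stated assumptions: the base case Z₀ = 1 (a normalized backoff distribution) and the example inputs are mine; S_n denotes the vocabulary-summed bucket counts at order n and c_n the true context count, as in the formula above.

```python
# Sketch of the Z diagnostic recurrence: Z_n = (S_n + a * Z_{n-1}) / (c_n + a),
# with alpha the Dirichlet concentration and Z_0 = 1 assumed for a normalized
# base distribution.
def z_recurrence(S, c, alpha=1.0):
    """Fold the recurrence over orders 1..n; returns Z_n."""
    Z = 1.0
    for S_n, c_n in zip(S, c):
        Z = (S_n + alpha * Z) / (c_n + alpha)
    return Z

# Collision-free table: S_n == c_n at every order, so Z stays exactly 1.
print(z_recurrence([10, 5, 2], [10, 5, 2]))   # -> 1.0
# Hash collisions inflate S_n above c_n, pushing Z far above 1.
print(z_recurrence([600, 300, 120], [10, 5, 2]) > 1.0)  # -> True
```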

Diagnostic Code

The submission exposes its diagnostic tools through environment variables:

```shell
# Measure Z at sampled positions during evaluation
MEASURE_Z=1 MEASURE_Z_EVERY=10

# Run with partial normalization (0 = unnormalized, 1 = fully normalized)
NORM_LAMBDA=0.5

# Run with step-wise normalization
NORMALIZE_STEPWISE=1
```

Context

The author contributed the Dirichlet smoothing framework to the competition (PR #796, March 26, 2026) and several cache submissions (#777, #796, #900), all of which were closed along with other cache-based entries in the late March ruling. This analysis grew out of investigating why the author's own submissions produced scores that were too good to be true. PR #978 (AnirudhRahul) independently identified the same normalization gap; PR #1114 (minh-stakc) credits PR #900 for "Dirichlet posterior mixing theory."

Experimental Data (115+ configurations)

| Experiment | Configs | Result |
| --- | --- | --- |
| Bucket × alpha sweep | 36 | 1M < 4M < 16M < 64M at all 9 alpha values |
| Per-order profiles | 27 | 1M wins at all 6 profiles + 2 diagnostics |
| Collision control (8 seeds) | 8 | Random replicates 99.4% of observed BPB gap (std = 0.00004 BPB) |
| Synthetic floor | 3 | Fake counts beat real caches (0.020 BPB at floor = 60) |
| Z measurement | 4 | E[log₂ Z] = 5.6 to 9.6, Z_max = 1,024 ≈ V |
| Partial normalization (λ) | 7+1 | BPB linear in λ (R² > 0.9999), validated at 2 bucket sizes |
| Multi-seed validation | 3 | Seeds 1337, 2026, 415 agree within 0.0003 BPB at 1M |
| Step-wise normalization | 4 | 3.42–4.15 BPB across order ranges (all worse than neural) |

Full paper (24 pages), all CSVs, and reproducible figure generation: https://github.com/Robby955/partition-function-inflation

Score

Not applicable — this is an analytical contribution. The neural-only baseline achieves 1.130 BPB. All cache configurations perform worse after normalization.

  - Shows Z* > 1,000 at 1M buckets under Dirichlet-multinomial scoring.
  - Full normalization (λ = 1) yields 4.10 BPB — far worse than the 1.130 neural baseline.
  - 115+ experimental configurations across 8 experiment types.
  - Links to full paper and data in the public repo.
@Robby955 (Author)

A few clarifications up front, since I expect the same questions to come up in review:

  1. This PR is not claiming that all cache-based methods are useless.
    It is making a narrower claim: in this benchmark, the specific hashed recursive n-gram scoring function used by these submissions is not a normalized probability distribution, and the apparent BPB gains are therefore not valid as compression results under the reported metric.

  2. The core issue is the partition function.
    For the recursive Dirichlet-multinomial score,
    Z* = (n_c + V·L) / (n_c + L),
    so with vocabulary size V=1024 and load factor L≈59 at 1M buckets, Z* is on the order of 10^3.
    That means the system is often scoring the true token under a distribution whose total mass is nowhere near 1.

  3. The strongest empirical check is the lambda sweep.
    If we divide by Z^λ and sweep λ from 0 to 1, BPB rises almost perfectly linearly (R² > 0.9999).
    At λ=1 the 1M-bucket configuration is 4.10 BPB, far worse than the 1.130 neural baseline.
    The apparent gain disappears by about λ≈0.25.

  4. The collision-control result argues against "semantic smoothing" as the main explanation.
    A rate-matched random remap reproduces 99.4% of the observed gap across 8 seeds with tiny variance.
    That suggests the effect is primarily driven by count inflation magnitude, not by meaningful collision partner structure.

  5. The synthetic-floor control pushes this further.
    Adding uniform fake counts to otherwise clean tables can beat real cache configurations.
    That should not happen if the improvement were coming from linguistic information rather than from the scoring artifact.

  6. On measured vs estimated normalized scores:
    λ=1 at 1M is directly measured.
    64M has an explicit validation point at λ=0.5.
    The 4M and 16M normalized values are inferred using the same correction framework described in the paper/repo, and I'm happy to label those more prominently if reviewers want stricter separation between measured and extrapolated quantities.

  7. This also does not conflict with the possibility that collision-free or properly normalized cache methods might yield small real gains.
    In fact, I cite PR #511 ("Non-Record: Add PPM heuristic for test-time learning") precisely because a collision-free trie appears to produce a much smaller effect, which is consistent with the claim here:
    the huge reported hashed-cache gains are mostly an artifact, while any genuine gain from structured memorization is likely much smaller.
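The figures in points 2 and 3 above are mutually consistent, which a reader can verify with simple arithmetic (the slope, λ = 1 value, and baseline are the reported 1M-bucket numbers; the variable names are mine):

```python
# Arithmetic consistency check of the reported lambda sweep (1M buckets):
# a slope of ~3.94 BPB per unit lambda and 4.10 BPB at lambda = 1 imply an
# unnormalized intercept near 0.16 BPB, and a crossing of the 1.130 neural
# baseline near lambda = 0.25 — matching the "gain disappears by ~0.25" claim.
slope = 3.94      # reported BPB / unit lambda
bpb_at_1 = 4.10   # reported fully normalized BPB
baseline = 1.130  # neural-only baseline

intercept = bpb_at_1 - slope
lam_cross = (baseline - intercept) / slope
print(round(intercept, 2), round(lam_cross, 2))  # -> 0.16 0.25
```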
