
Record: Two-Pass N-gram Rescoring (val_bpb 0.1434)#846

Open
himanshudongre wants to merge 1 commit into openai:main from himanshudongre:submission/two-pass-ngram-0.1434

Conversation

@himanshudongre

Summary

Test plan

  • 3 seeds (1337, 42, 2024) with consistent results (std 0.00002)
  • Total eval time 339s, within 600s budget
  • Artifact size 13.4 MB, within 16 MB limit
  • All pass 2 rescored tokens previously evaluated in pass 1
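A multi-seed consistency check like the one described in the test plan could be sketched as follows. This is illustrative only: `run_eval` is a hypothetical stand-in for the submission's actual eval entry point, and the tolerance is chosen to match the reported std of 0.00002.

```python
# Sketch of a seed-consistency check: run the eval at several seeds and
# verify the spread of val_bpb is tiny. `run_eval` is a hypothetical
# stand-in, not the submission's real entry point.
import statistics

def check_seed_consistency(run_eval, seeds=(1337, 42, 2024), tol=1e-4):
    """Return per-seed val_bpb and the population std across seeds."""
    results = {seed: run_eval(seed=seed) for seed in seeds}
    spread = statistics.pstdev(results.values())
    assert spread <= tol, f"val_bpb std {spread:.6f} exceeds tolerance {tol}"
    return results, spread
```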

@she-llac

This is most likely not a legal submission, see #573 (comment)

@MatoTeziTanka

Peer review note — two-pass rescoring and the backward-looking rule

Nice work on the cache warmup analysis — the chunk-1-to-chunk-63 BPB curve is really interesting data.

One question on compliance: the backward-looking rule states that you can only test-time train on tokens "you've already evaluated your model on." In Pass 2, the early chunks (1–15) are rescored using an n-gram cache that includes statistics from chunks 16–63 — tokens that came after the ones being rescored.

The tokens were technically evaluated in Pass 1, but the cache used in Pass 2 contains forward-looking information relative to those early chunks. Is this meaningfully different from a single-pass approach where each chunk only sees statistics from prior chunks?

Not challenging the submission — just flagging it as something the maintainers may want to weigh in on, since the rules don't explicitly address multi-pass evaluation.
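To make the question concrete, here is a minimal sketch (names and chunk counts are illustrative, taken only from the numbers quoted above, not from the submission's code) of which chunks' statistics are visible to the cache when scoring a given chunk under each scheme:

```python
# Illustrative sketch of the visibility question: in a single pass the
# cache scoring chunk i holds statistics only from chunks before i; in
# pass 2, the rescored early chunks see statistics from every chunk.
# Chunk indices are 1-based; 63 chunks and 15 rescored chunks are the
# figures quoted in the discussion.

def single_pass_visible(chunk_idx):
    """Chunks whose stats are in the cache when scoring `chunk_idx`."""
    return set(range(1, chunk_idx))

def two_pass_visible(chunk_idx, n_chunks=63, rescored=range(1, 16)):
    """After pass 1 fills the cache, rescored chunks see all chunks."""
    if chunk_idx in rescored:
        return set(range(1, n_chunks + 1))
    return single_pass_visible(chunk_idx)
```

Under this framing, `two_pass_visible(5)` includes chunks 16–63 (and chunk 5 itself), which is exactly the forward-looking information the question is about.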

@himanshudongre
Author

@she-llac @MatoTeziTanka Thanks for raising this — both fair points worth addressing.

How this differs from #573:
PR #573 was closed because it ran multiple TTT passes, adapted model weights each time, and selected the best NLL across passes — that's oracle selection and training on the eval set. This approach does neither:

  • No model weights change between passes (the neural model is frozen throughout)
  • No oracle/min(NLL) selection — Pass 2 deterministically replaces Pass 1 using the same fixed mixing function
  • The n-gram cache is a frequency lookup table, not a trained model
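For illustration, a frequency lookup table of this kind can be sketched as below. The names are hypothetical, not the submission's actual code; the point is that the structure holds only co-occurrence counts, with no parameters, gradients, or loss feedback.

```python
# Minimal sketch of an n-gram frequency cache: maps a context of
# (order - 1) tokens to counts of the tokens that followed it. Nothing
# here is trained; updates are pure counting.
from collections import defaultdict

class NgramCache:
    def __init__(self, order=3):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        """Record context -> next-token co-occurrence counts."""
        n = self.order - 1
        for i in range(len(tokens) - n):
            ctx = tuple(tokens[i:i + n])
            self.counts[ctx][tokens[i + n]] += 1

    def prob(self, ctx, token):
        """Relative frequency of `token` after `ctx`; 0 if unseen."""
        seen = self.counts.get(tuple(ctx))
        if not seen:
            return 0.0
        return seen[token] / sum(seen.values())
```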
On the "forward-looking" concern (@MatoTeziTanka):
This is the more nuanced question and I appreciate you framing it clearly. The n-gram cache doesn't contain loss values, gradients, or any scoring feedback — it contains token co-occurrence frequencies. The cache doesn't know which predictions were right or wrong.

An analogy: re-reading chapter 1 of a book after finishing the whole book. You're not changing the book or your reading ability — you just have more context. That's different from taking a test, seeing your score, and retaking it.

On the rules:
The rules explicitly prohibit training on validation data before evaluating it, and oracle selection across passes. Multi-pass evaluation with a frozen model and deterministic scoring is not explicitly addressed. The eval time budget is 600 seconds — how we allocate that compute is arguably a design choice, same as choosing sliding window stride or chunk size.

That said, I recognise this is a grey area and respect that the rules may not have anticipated this pattern. I'd welcome an official ruling from @valerio-oai. If two-pass is deemed non-compliant, I'm happy to update the submission to report Pass 1 results only.

Regardless of the ruling, the cold-cache analysis (chunk 1 at 1.15 BPB vs chunk 63 at 0.12 BPB) is an interesting finding — exploring legal ways to address this asymmetry (chunk ordering strategies, progressive cache warming) seems like a worthwhile research direction.

@MatoTeziTanka

@himanshudongre Appreciate the detailed response — the distinction from #573 is well-articulated. The frozen model + deterministic scoring vs. oracle selection is a meaningful difference.

The book analogy is fair. The frequency table doesn't carry scoring feedback, just co-occurrence counts. The question is really about whether the rules intended "evaluated" to mean "scored once and done" or "processed in any order you like within the eval budget." That's a policy call above our pay grade.

Agree that tagging @valerio-oai for an official ruling is the right move. And regardless of the outcome, the cold-cache asymmetry data is a genuinely useful contribution — that chunk 1 → chunk 63 curve tells you a lot about where the BPB gains are actually coming from.

Good luck with the ruling.

pappanick added a commit to pappanick/parameter-golf that referenced this pull request Mar 26, 2026
…ders

Combines PR openai#834's learned multi-expert routing head with PR openai#846's
two-pass cold-cache rescoring. Key changes:

- Extended n-gram orders from 2-7 to 2-12 with 8M bucket hash tables
- Two-pass eval: rescore first 15 chunks with full cache after pass 1
- Per-chunk loss tracking for precise pass-1/pass-2 delta computation
- Configurable via env vars: NGRAM_MAX_ORDER, NGRAM_BUCKETS,
  TWO_PASS_ENABLED, TWO_PASS_RESCORE_CHUNKS

Based on PR openai#834 (AnirudhRahul) + PR openai#846 (himanshudongre) stack.
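The env-var configuration listed in the commit message might be read along these lines. This is a hedged sketch: the variable names come from the commit message, but the defaults and the loader function are illustrative, not taken from the actual code.

```python
# Illustrative loader for the env vars named in the commit message
# (NGRAM_MAX_ORDER, NGRAM_BUCKETS, TWO_PASS_ENABLED,
# TWO_PASS_RESCORE_CHUNKS). Defaults are assumptions based on the
# figures quoted in the commit, not values from the real code.
import os

def load_two_pass_config():
    return {
        "ngram_max_order": int(os.environ.get("NGRAM_MAX_ORDER", "12")),
        "ngram_buckets": int(os.environ.get("NGRAM_BUCKETS", "8000000")),
        "two_pass_enabled": os.environ.get("TWO_PASS_ENABLED", "1") == "1",
        "two_pass_rescore_chunks": int(
            os.environ.get("TWO_PASS_RESCORE_CHUNKS", "15")
        ),
    }
```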
@newjordan

newjordan commented Mar 26, 2026

This is an interesting thing to wake up to, and I need to decide whether I'm studying it or not... my brain hurts. Hmm, it's questionable. There is "knowledge" of the answers within this method. I could see an issue with memorizing how many "X" there are in the answers being a peek at the answers. We should not know how many (X) the test should have and then score based on those results.
