
Record: Two-Pass N-gram Rescoring (val_bpb 0.1434)#846

Open
himanshudongre wants to merge 1 commit into openai:main from himanshudongre:submission/two-pass-ngram-0.1434

Conversation

@himanshudongre

Summary

Test plan

  • 3 seeds (1337, 42, 2024) with consistent results (std 0.00002)
  • Total eval time 339s, within 600s budget
  • Artifact size 13.4 MB, within 16 MB limit
  • All pass 2 rescored tokens previously evaluated in pass 1
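A multi-seed consistency check like the one described in the test plan could be sketched as follows. This is illustrative only: `run_eval` is a hypothetical stand-in for the submission's actual eval entry point, and the tolerance is chosen to match the reported std of 0.00002.

```python
# Sketch of a seed-consistency check: run the eval at several seeds and
# verify the spread of val_bpb is tiny. `run_eval` is a hypothetical
# stand-in, not the submission's real entry point.
import statistics

def check_seed_consistency(run_eval, seeds=(1337, 42, 2024), tol=1e-4):
    """Return per-seed val_bpb and the population std across seeds."""
    results = {seed: run_eval(seed=seed) for seed in seeds}
    spread = statistics.pstdev(results.values())
    assert spread <= tol, f"val_bpb std {spread:.6f} exceeds tolerance {tol}"
    return results, spread
```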

@she-llac

This is most likely not a legal submission, see #573 (comment)

@MatoTeziTanka

Peer review note — two-pass rescoring and the backward-looking rule

Nice work on the cache warmup analysis — the chunk-1-to-chunk-63 BPB curve is really interesting data.

One question on compliance: the backward-looking rule states that you can only test-time train on tokens "you've already evaluated your model on." In Pass 2, the early chunks (1–15) are rescored using an n-gram cache that includes statistics from chunks 16–63 — tokens that came after the ones being rescored.

The tokens were technically evaluated in Pass 1, but the cache used in Pass 2 contains forward-looking information relative to those early chunks. Is this meaningfully different from a single-pass approach where each chunk only sees statistics from prior chunks?

Not challenging the submission — just flagging it as something the maintainers may want to weigh in on, since the rules don't explicitly address multi-pass evaluation.
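To make the question concrete, here is a minimal sketch (names and chunk counts are illustrative, taken only from the numbers quoted above, not from the submission's code) of which chunks' statistics are visible to the cache when scoring a given chunk under each scheme:

```python
# Illustrative sketch of the visibility question: in a single pass the
# cache scoring chunk i holds statistics only from chunks before i; in
# pass 2, the rescored early chunks see statistics from every chunk.
# Chunk indices are 1-based; 63 chunks and 15 rescored chunks are the
# figures quoted in the discussion.

def single_pass_visible(chunk_idx):
    """Chunks whose stats are in the cache when scoring `chunk_idx`."""
    return set(range(1, chunk_idx))

def two_pass_visible(chunk_idx, n_chunks=63, rescored=range(1, 16)):
    """After pass 1 fills the cache, rescored chunks see all chunks."""
    if chunk_idx in rescored:
        return set(range(1, n_chunks + 1))
    return single_pass_visible(chunk_idx)
```

Under this framing, `two_pass_visible(5)` includes chunks 16–63 (and chunk 5 itself), which is exactly the forward-looking information the question is about.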

@himanshudongre
Author

@she-llac @MatoTeziTanka Thanks for raising this — both fair points worth addressing.

How this differs from #573:
PR #573 was closed because it ran multiple TTT passes, adapted model weights each time, and selected the best NLL across passes — that's oracle selection and training on the eval set. This approach does neither:

  • No model weights change between passes (the neural model is frozen throughout)
  • No oracle/min(NLL) selection — Pass 2 deterministically replaces Pass 1 using the same fixed mixing function
  • The n-gram cache is a frequency lookup table, not a trained model
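For illustration, a frequency lookup table of this kind can be sketched as below. The names are hypothetical, not the submission's actual code; the point is that the structure holds only co-occurrence counts, with no parameters, gradients, or loss feedback.

```python
# Minimal sketch of an n-gram frequency cache: maps a context of
# (order - 1) tokens to counts of the tokens that followed it. Nothing
# here is trained; updates are pure counting.
from collections import defaultdict

class NgramCache:
    def __init__(self, order=3):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        """Record context -> next-token co-occurrence counts."""
        n = self.order - 1
        for i in range(len(tokens) - n):
            ctx = tuple(tokens[i:i + n])
            self.counts[ctx][tokens[i + n]] += 1

    def prob(self, ctx, token):
        """Relative frequency of `token` after `ctx`; 0 if unseen."""
        seen = self.counts.get(tuple(ctx))
        if not seen:
            return 0.0
        return seen[token] / sum(seen.values())
```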
On the "forward-looking" concern (@MatoTeziTanka):
This is the more nuanced question and I appreciate you framing it clearly. The n-gram cache doesn't contain loss values, gradients, or any scoring feedback — it contains token co-occurrence frequencies. The cache doesn't know which predictions were right or wrong.

An analogy: re-reading chapter 1 of a book after finishing the whole book. You're not changing the book or your reading ability — you just have more context. That's different from taking a test, seeing your score, and retaking it.

On the rules:
The rules explicitly prohibit training on validation data before evaluating it, and oracle selection across passes. Multi-pass evaluation with a frozen model and deterministic scoring is not explicitly addressed. The eval time budget is 600 seconds — how we allocate that compute is arguably a design choice, same as choosing sliding window stride or chunk size.

That said, I recognise this is a grey area and respect that the rules may not have anticipated this pattern. I'd welcome an official ruling from @valerio-oai. If two-pass is deemed non-compliant, I'm happy to update the submission to report Pass 1 results only.

Regardless of the ruling, the cold-cache analysis (chunk 1 at 1.15 BPB vs chunk 63 at 0.12 BPB) is an interesting finding — exploring legal ways to address this asymmetry (chunk ordering strategies, progressive cache warming) seems like a worthwhile research direction.

@MatoTeziTanka

@himanshudongre Appreciate the detailed response — the distinction from #573 is well-articulated. The frozen model + deterministic scoring vs. oracle selection is a meaningful difference.

The book analogy is fair. The frequency table doesn't carry scoring feedback, just co-occurrence counts. The question is really about whether the rules intended "evaluated" to mean "scored once and done" or "processed in any order you like within the eval budget." That's a policy call above our pay grade.

Agree that tagging @valerio-oai for an official ruling is the right move. And regardless of the outcome, the cold-cache asymmetry data is a genuinely useful contribution — that chunk 1 → chunk 63 curve tells you a lot about where the BPB gains are actually coming from.

Good luck with the ruling.

pappanick added a commit to pappanick/parameter-golf that referenced this pull request Mar 26, 2026
…ders

Combines PR openai#834's learned multi-expert routing head with PR openai#846's
two-pass cold-cache rescoring. Key changes:

- Extended n-gram orders from 2-7 to 2-12 with 8M bucket hash tables
- Two-pass eval: rescore first 15 chunks with full cache after pass 1
- Per-chunk loss tracking for precise pass-1/pass-2 delta computation
- Configurable via env vars: NGRAM_MAX_ORDER, NGRAM_BUCKETS,
  TWO_PASS_ENABLED, TWO_PASS_RESCORE_CHUNKS

Based on PR openai#834 (AnirudhRahul) + PR openai#846 (himanshudongre) stack.
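The env-var configuration listed in the commit message might be read along these lines. This is a hedged sketch: the variable names come from the commit message, but the defaults and the loader function are illustrative, not taken from the actual code.

```python
# Illustrative loader for the env vars named in the commit message
# (NGRAM_MAX_ORDER, NGRAM_BUCKETS, TWO_PASS_ENABLED,
# TWO_PASS_RESCORE_CHUNKS). Defaults are assumptions based on the
# figures quoted in the commit, not values from the real code.
import os

def load_two_pass_config():
    return {
        "ngram_max_order": int(os.environ.get("NGRAM_MAX_ORDER", "12")),
        "ngram_buckets": int(os.environ.get("NGRAM_BUCKETS", "8000000")),
        "two_pass_enabled": os.environ.get("TWO_PASS_ENABLED", "1") == "1",
        "two_pass_rescore_chunks": int(
            os.environ.get("TWO_PASS_RESCORE_CHUNKS", "15")
        ),
    }
```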
@newjordan

newjordan commented Mar 26, 2026

This is an interesting thing to wake up to, and I need to decide whether I'm studying it or not... my brain hurts. Hmm, it's questionable. There is "knowledge" of the answers within this method. I could see an issue with memorizing how many "X" there are in the answers being a peek at the answers. We should not know how many (X) the test should have and then score based on those results.
