Record: 0.0881 BPB — 11L Int5 GPTQ + Order-12 N-gram + Phrase Cache + 65K Chunks#961
callithyia wants to merge 2 commits into openai:main from
Conversation
… 65K Chunks. 3-seed mean 0.0881 BPB (std 0.0004). Order-12 backoff n-gram cache + long phrase cache + entropy-adaptive alpha + temperature sharpening on a fully-trained 11L 512d base model (pre-quant 1.14 BPB). 65K-token chunks resolve the eval timeout on properly trained models. Full Hessian GPTQ int5 + LZMA. Single-pass, score-first, backward-looking.
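For readers unfamiliar with the eval stack, the backoff n-gram cache named above can be sketched as follows. This is a minimal illustrative implementation, not PR #913's actual code; the `NGramCache` name and its methods are hypothetical. The idea: count continuations for every context of length 1..max_order as tokens are consumed, and at prediction time back off from the longest matching context to shorter ones.

```python
from collections import defaultdict

class NGramCache:
    """Backward-looking backoff n-gram cache (illustrative sketch).

    Counts continuations for contexts of length 1..max_order and
    predicts with the longest context that has been seen before.
    """

    def __init__(self, max_order=12):
        self.max_order = max_order
        # counts[k][context_tuple][next_token] -> frequency
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order + 1)]

    def update(self, tokens):
        # Record every (context, next-token) pair, for all orders.
        for i in range(1, len(tokens)):
            for k in range(1, min(self.max_order, i) + 1):
                ctx = tuple(tokens[i - k:i])
                self.counts[k][ctx][tokens[i]] += 1

    def predict(self, history):
        # Back off from the longest matching context to shorter ones;
        # return a normalized distribution and the order that matched.
        for k in range(min(self.max_order, len(history)), 0, -1):
            ctx = tuple(history[-k:])
            cont = self.counts[k].get(ctx)
            if cont:
                total = sum(cont.values())
                return {t: c / total for t, c in cont.items()}, k
        return None, 0  # no cache match: fall back to the neural model
```

In a score-first, single-pass eval, `predict` is called for a position before `update` sees that position's token, so the cache never leaks the token it is scoring.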
A note on the results: on n-gram cache dominance and the diminishing role of neural models.

This submission applies all 295 lines of PR #913's eval stack ("Cache Is All You Need") onto a fully-trained 11-layer 512d base model (pre-quant 1.14 BPB) — a 0.64 BPB improvement over #913's original 2-layer 128d model (pre-quant 1.78 BPB, 500K parameters, 622KB artifact). The base model represents a 54x parameter increase, Full Hessian GPTQ int5 quantization, aggressive SWA, and Parallel Muon optimization.

One practical adaptation was required: #913's default 131K-token chunk size exceeds the 600s eval budget on any model beyond a few layers, as forward pass cost scales with model depth. Reducing chunk size to 65K resolves this while providing a warmer cache through more frequent updates — a necessary change for applying cache eval techniques to models built with any meaningful training investment.

The core finding: a 0.64 BPB gap in pre-quantization model quality (1.78 vs 1.14) — representing substantial differences in architecture depth, parameter count, training compute, quantization strategy, and optimization techniques — collapses to <0.001 BPB after cache application. The n-gram cache handles approximately 97% of token predictions through pure frequency statistics. The neural model contributes meaningfully only on the narrow residual of tokens with no cache match. Beyond order-10 n-gram caching with sufficient training data, marginal returns from neural model innovation approach zero.

This suggests the current leaderboard measures n-gram engineering quality, not language model quality. The competition's meta incentivizes cache engineering over model innovation — a dynamic where a 500K-parameter model performs equivalently to one 54x its size. This entry serves as an empirical demonstration of that limitation.
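The "entropy-adaptive alpha" mentioned above can be illustrated with a small mixing function. This is a hedged sketch, not the PR's implementation: the function name, the `max_entropy` normalizer, and the alpha bounds are all illustrative constants chosen for the example. The principle is to weight the cache heavily when its predicted distribution is confident (low entropy) and fall back toward the neural model when the cache is flat or has no match.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a {token: prob} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def entropy_adaptive_blend(cache_dist, model_dist,
                           max_entropy=4.0, alpha_hi=0.95, alpha_lo=0.0):
    """Mix cache and model predictions with an entropy-dependent alpha.

    Illustrative sketch: a confident (low-entropy) cache distribution
    gets weight near alpha_hi; a flat one decays toward alpha_lo.
    """
    if not cache_dist:
        return dict(model_dist)             # no cache match: model only
    h = entropy(cache_dist)
    conf = max(0.0, 1.0 - h / max_entropy)  # 1.0 = fully confident cache
    alpha = alpha_lo + (alpha_hi - alpha_lo) * conf
    out = {t: (1 - alpha) * p for t, p in model_dist.items()}
    for t, p in cache_dist.items():
        out[t] = out.get(t, 0.0) + alpha * p
    return out
```

Under this scheme the neural model's probabilities matter only when the cache distribution is absent or high-entropy, which is consistent with the observation that the model contributes mainly on the residual of uncached tokens.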
…ed submission format
- Update merged SOTA to 1.1194 (abaybektursun, was 1.1228 signalrush)
- Add competition strategy pivot: n-gram eval cache now dominates (~0.02-0.97 bpb)
- Document PR openai#727 (0.9674), openai#741 (0.9850), openai#945 (0.0274), openai#961 (0.0881) findings
- Add Lessons Learned entries 17-20 on n-gram dominance + memorization risk
- Update Technique Reference table with n-gram entries

https://claude.ai/code/session_01Bpr2fKEnkNQmNKno8EnxWF
Merge remote's two-pass n-gram discoveries (PR openai#868 0.1181, PR openai#870 0.0935) with today's extreme n-gram findings (PR openai#945 0.0274, PR openai#961 0.0881). Keep Architecture Decisions and Legal TTT Protocol from remote. Add Lessons Learned 17-20 from 2026-03-27 research. https://claude.ai/code/session_01Bpr2fKEnkNQmNKno8EnxWF
Summary
Results (8xH100 SXM)
Key Contributions
Compliance
Credits
Eval: PR #913 (@RoyiRa). Base model: PR #549 (@abaybektursun). Architecture: PR #414 (@signalrush). Chunk concept: PR #840 (@quietsmile).