
Record: 0.0881 BPB — 11L Int5 GPTQ + Order-12 N-gram + Phrase Cache + 65K Chunks#961

Open
callithyia wants to merge 2 commits into openai:main from callithyia:record/ngram-phrase-65k-0.0881

Conversation


@callithyia commented Mar 27, 2026

Summary

  • val_bpb: 0.08808 (3-seed mean, std 0.0004)
  • Order-12 backoff n-gram cache + long phrase cache + entropy-adaptive alpha + temperature sharpening (T=0.85)
  • Fully-trained 11L 512d base model (pre-quant 1.14 BPB) with Full Hessian GPTQ int5 + LZMA (~13.0 MB)
  • 65K-token chunks: resolves eval timeout on properly trained models, where the 131K chunks of #913 ("Cache Is All You Need", val_bpb 0.0887, 622KB artifact) exceed the 600s budget
  • Single-pass, score-first, backward-looking. No TTT, no learned gate.
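
The cache/model blend described above can be sketched as follows. This is a minimal illustration, not the PR's code: the alpha schedule (linear in normalized cache entropy) and all function names are assumptions; only the T=0.85 sharpening and the entropy-adaptive idea come from the summary.

```python
import numpy as np

def sharpen(p, T=0.85):
    """Temperature sharpening: raise probabilities to 1/T and renormalize."""
    logits = np.log(np.clip(p, 1e-12, None)) / T
    e = np.exp(logits - logits.max())
    return e / e.sum()

def entropy_bits(p):
    """Shannon entropy of a distribution, in bits."""
    p = np.clip(p, 1e-12, None)
    return float(-(p * np.log2(p)).sum())

def blend(p_cache, p_model, max_entropy, T=0.85):
    """Entropy-adaptive alpha (hypothetical schedule): trust the cache more
    when its distribution is low-entropy (confident), lean on the neural
    model otherwise, then sharpen the mixture."""
    alpha = 1.0 - entropy_bits(p_cache) / max_entropy
    alpha = float(np.clip(alpha, 0.0, 1.0))
    return sharpen(alpha * p_cache + (1.0 - alpha) * p_model, T=T)
```

With a confident cache distribution the blend stays cache-dominated and sharpening pushes the top probability higher still; with a near-uniform cache distribution alpha collapses toward 0 and the model's distribution takes over.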

Results (8xH100 SXM)

| Seed | val_bpb | Pre-quant | Steps | Eval time | Artifact (bytes) |
|------|---------|-----------|-------|-----------|------------------|
| 1337 | 0.08855 | 1.1383 | 7,167 | 568s | 13,018,464 |
| 42 | 0.08793 | 1.1385 | 7,171 | 572s | 12,992,096 |
| 2024 | 0.08777 | 1.1391 | 7,172 | 563s | 13,032,072 |
| **Mean** | **0.08808** | **1.1386** | | | |

Key Contributions

  • 65K chunk optimization: 131K chunks exceed the 600s eval budget on 11L models; 65K chunks complete in time (568s vs 606s) with a warmer cache from more frequent updates.
  • Int5 GPTQ tradeoff: accepts a 0.02 BPB quantization penalty for 1MB of artifact headroom.
  • Empirical finding: the 0.64 BPB pre-quant gap (1.78 vs 1.14) collapses to <0.001 BPB after cache application — see Discussion in README.
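
To make the artifact-size arithmetic behind the int5 tradeoff concrete, here is a rough sketch. It uses round-to-nearest as a stand-in for the PR's Full Hessian GPTQ (which chooses roundings to minimize layer output error), since the storage math is identical either way: 5 bits per weight, bit-packed, then LZMA-compressed.

```python
import lzma
import numpy as np

def quantize_int5_rtn(w):
    """Round-to-nearest int5 stand-in for GPTQ (storage-equivalent).
    Symmetric range [-16, 15], one scale per tensor."""
    scale = np.abs(w).max() / 15.0
    q = np.clip(np.round(w / scale), -16, 15).astype(np.int8)
    return q, scale

def artifact_bytes(q):
    """Pack 5-bit codes tightly, LZMA-compress, and return the byte count
    as a rough artifact-size estimate."""
    codes = (q.astype(np.int16) + 16).astype(np.uint8).ravel()[:, None]
    bits = np.unpackbits(codes, axis=1)[:, 3:]   # keep the low 5 bits
    packed = np.packbits(bits.ravel())
    return len(lzma.compress(packed.tobytes(), preset=9))
```

At 5 bits per weight the 11L 512d model's parameters pack into well under the 16MB budget even before LZMA, which is where the "1MB headroom" over an int6 or int8 encoding would come from.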

Compliance

  • 3 seeds, all train ≤600s, all eval ≤600s (max 572s)
  • Artifact ≤16,000,000 bytes
  • Single-pass, score-first, no rescore
  • No validation data during training

Credits

Eval: PR #913 (@RoyiRa). Base model: PR #549 (@abaybektursun). Architecture: PR #414 (@signalrush). Chunk concept: PR #840 (@quietsmile).

… 65K Chunks

3-seed mean 0.0881 (std 0.0004). Order-12 backoff n-gram cache + long
phrase cache + entropy-adaptive alpha + temperature sharpening on a
fully-trained 11L 512d base model (pre-quant 1.14 BPB). 65K-token
chunks resolve eval timeout on properly trained models. Full Hessian
GPTQ int5 + LZMA. Single-pass, score-first, backward-looking.
@callithyia (Author) commented:

A note on the results: on n-gram cache dominance and the diminishing role of neural models.

This submission applies all 295 lines of PR #913's eval stack ("Cache Is All You Need") onto a fully-trained 11-layer 512d base model (pre-quant 1.14 BPB) — a 0.64 BPB improvement over #913's original 2-layer 128d model (pre-quant 1.78 BPB, 500K parameters, 622KB artifact). The base model represents a 54x parameter increase, Full Hessian GPTQ int5 quantization, aggressive SWA, and Parallel Muon optimization.

One practical adaptation was required: #913's default 131K-token chunk size exceeds the 600s eval budget on any model beyond a few layers, as forward pass cost scales with model depth. Reducing chunk size to 65K resolves this while providing warmer cache through more frequent updates — a necessary change for applying cache eval techniques to models built with any meaningful training investment.
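
The chunked, score-first loop described here can be sketched as below. The callback names and cache shape are hypothetical; the point is that the cache persists across chunk boundaries, so each token is scored before being added (backward-looking, single-pass), and smaller chunks simply mean the statistics refresh more often relative to forward-pass cost.

```python
import math

def score_chunked(tokens, predict, update, chunk_len=65_536):
    """Single-pass chunked scoring. `predict(cache, context, token)` returns
    the probability assigned to `token`; `update(cache, context, token)` adds
    the token to the cache only after it has been scored."""
    bits, cache = 0.0, {}
    for start in range(0, len(tokens), chunk_len):
        for i in range(start, min(start + chunk_len, len(tokens))):
            p = predict(cache, tokens[:i], tokens[i])   # score first...
            bits -= math.log2(max(p, 1e-12))
            update(cache, tokens[:i], tokens[i])        # ...then update
    return bits
```

Note the chunk size changes only how often the neural forward pass is batched, not which statistics each token may see, so shrinking chunks trades compute per eval window without violating the single-pass rule.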

The core finding: a 0.64 BPB gap in pre-quantization model quality (1.78 vs 1.14) — representing substantial differences in architecture depth, parameter count, training compute, quantization strategy, and optimization techniques — collapses to <0.001 BPB after cache application. The n-gram cache handles approximately 97% of token predictions through pure frequency statistics. The neural model contributes meaningfully only on the narrow residual of tokens with no cache match. Beyond order-10 n-gram caching with sufficient training data, marginal returns from neural model innovation approach zero.
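
A minimal sketch of the kind of backoff cache being described, assuming the standard longest-match-wins backoff (class and method names are mine, not the PR's): pure frequency counts, queried from order 12 downward, with the neural model needed only when no context of any order has been seen.

```python
from collections import defaultdict

class BackoffNgramCache:
    """Order-12 backoff n-gram cache sketch: the longest previously seen
    context wins; predictions are pure frequency statistics."""

    def __init__(self, max_order=12):
        self.max_order = max_order
        self.counts = defaultdict(lambda: defaultdict(int))  # ctx -> tok -> n

    def predict(self, history):
        """Return (next-token distribution, matched order), backing off from
        the longest context; (None, 0) if nothing matches at any order."""
        for k in range(min(self.max_order, len(history)), 0, -1):
            ctx = tuple(history[-k:])
            if ctx in self.counts:
                c = self.counts[ctx]
                total = sum(c.values())
                return {t: n / total for t, n in c.items()}, k
        return None, 0

    def update(self, history, token):
        """Register `token` under every context suffix up to max_order."""
        for k in range(1, min(self.max_order, len(history)) + 1):
            self.counts[tuple(history[-k:])][token] += 1
```

On highly repetitive validation text, a cache like this resolves the vast majority of tokens at some order ≥ 1, which is the mechanism behind the ~97% figure above: whenever any suffix of the context has been seen before, the model's output never enters the prediction at full weight.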

This suggests the current leaderboard measures n-gram engineering quality, not language model quality. The competition's meta incentivizes cache engineering over model innovation — a dynamic where a 500K-parameter model performs equivalently to one 54x its size. This entry serves as an empirical demonstration of this limitation.

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 27, 2026
- Update merged SOTA to 1.1194 (abaybektursun, was 1.1228 signalrush)
- Add competition strategy pivot: n-gram eval cache now dominates (~0.02-0.97 bpb)
- Document PR openai#727 (0.9674), openai#741 (0.9850), openai#945 (0.0274), openai#961 (0.0881) findings
- Add Lessons Learned entries 17-20 on n-gram dominance + memorization risk
- Update Technique Reference table with n-gram entries

https://claude.ai/code/session_01Bpr2fKEnkNQmNKno8EnxWF
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 27, 2026
Merge remote's two-pass n-gram discoveries (PR openai#868 0.1181, PR openai#870 0.0935)
with today's extreme n-gram findings (PR openai#945 0.0274, PR openai#961 0.0881).
Keep Architecture Decisions and Legal TTT Protocol from remote.
Add Lessons Learned 17-20 from 2026-03-27 research.

https://claude.ai/code/session_01Bpr2fKEnkNQmNKno8EnxWF
