Record: 0.0881 BPB — 11L Int5 GPTQ + Order-12 N-gram + Phrase Cache + 65K Chunks#961
callithyia wants to merge 2 commits into openai:main from
Conversation
… 65K Chunks. 3-seed mean 0.0881 BPB (std 0.0004). Order-12 backoff n-gram cache + long phrase cache + entropy-adaptive alpha + temperature sharpening on a fully-trained 11L 512d base model (pre-quant 1.14 BPB). 65K-token chunks resolve the eval timeout on properly trained models. Full Hessian GPTQ int5 + LZMA. Single-pass, score-first, backward-looking.
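For readers unfamiliar with the eval stack, the backoff n-gram cache named above can be sketched as follows. This is a minimal illustrative implementation, not PR #913's actual code; the `NGramCache` name and its methods are hypothetical. The idea: count continuations for every context of length 1..max_order as tokens are consumed, and at prediction time back off from the longest matching context to shorter ones.

```python
from collections import defaultdict

class NGramCache:
    """Backward-looking backoff n-gram cache (illustrative sketch).

    Counts continuations for contexts of length 1..max_order and
    predicts with the longest context that has been seen before.
    """

    def __init__(self, max_order=12):
        self.max_order = max_order
        # counts[k][context_tuple][next_token] -> frequency
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order + 1)]

    def update(self, tokens):
        # Record every (context, next-token) pair, for all orders.
        for i in range(1, len(tokens)):
            for k in range(1, min(self.max_order, i) + 1):
                ctx = tuple(tokens[i - k:i])
                self.counts[k][ctx][tokens[i]] += 1

    def predict(self, history):
        # Back off from the longest matching context to shorter ones;
        # return a normalized distribution and the order that matched.
        for k in range(min(self.max_order, len(history)), 0, -1):
            ctx = tuple(history[-k:])
            cont = self.counts[k].get(ctx)
            if cont:
                total = sum(cont.values())
                return {t: c / total for t, c in cont.items()}, k
        return None, 0  # no cache match: fall back to the neural model
```

In a score-first, single-pass eval, `predict` is called for a position before `update` sees that position's token, so the cache never leaks the token it is scoring.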
A note on the results: on n-gram cache dominance and the diminishing role of neural models.

This submission applies all 295 lines of PR #913's eval stack ("Cache Is All You Need") onto a fully-trained 11-layer 512d base model (pre-quant 1.14 BPB) — a 0.64 BPB improvement over #913's original 2-layer 128d model (pre-quant 1.78 BPB, 500K parameters, 622KB artifact). The base model represents a 54x parameter increase, Full Hessian GPTQ int5 quantization, aggressive SWA, and Parallel Muon optimization.

One practical adaptation was required: #913's default 131K-token chunk size exceeds the 600s eval budget on any model beyond a few layers, as forward pass cost scales with model depth. Reducing chunk size to 65K resolves this while providing a warmer cache through more frequent updates — a necessary change for applying cache eval techniques to models built with any meaningful training investment.

The core finding: a 0.64 BPB gap in pre-quantization model quality (1.78 vs 1.14) — representing substantial differences in architecture depth, parameter count, training compute, quantization strategy, and optimization techniques — collapses to <0.001 BPB after cache application. The n-gram cache handles approximately 97% of token predictions through pure frequency statistics. The neural model contributes meaningfully only on the narrow residual of tokens with no cache match. Beyond order-10 n-gram caching with sufficient training data, marginal returns from neural model innovation approach zero.

This suggests the current leaderboard measures n-gram engineering quality, not language model quality. The competition's meta incentivizes cache engineering over model innovation — a dynamic where a 500K-parameter model performs equivalently to one 54x its size. This entry serves as an empirical demonstration of that limitation.
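The "entropy-adaptive alpha" mentioned above can be illustrated with a small mixing function. This is a hedged sketch, not the PR's implementation: the function name, the `max_entropy` normalizer, and the alpha bounds are all illustrative constants chosen for the example. The principle is to weight the cache heavily when its predicted distribution is confident (low entropy) and fall back toward the neural model when the cache is flat or has no match.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a {token: prob} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def entropy_adaptive_blend(cache_dist, model_dist,
                           max_entropy=4.0, alpha_hi=0.95, alpha_lo=0.0):
    """Mix cache and model predictions with an entropy-dependent alpha.

    Illustrative sketch: a confident (low-entropy) cache distribution
    gets weight near alpha_hi; a flat one decays toward alpha_lo.
    """
    if not cache_dist:
        return dict(model_dist)             # no cache match: model only
    h = entropy(cache_dist)
    conf = max(0.0, 1.0 - h / max_entropy)  # 1.0 = fully confident cache
    alpha = alpha_lo + (alpha_hi - alpha_lo) * conf
    out = {t: (1 - alpha) * p for t, p in model_dist.items()}
    for t, p in cache_dist.items():
        out[t] = out.get(t, 0.0) + alpha * p
    return out
```

Under this scheme the neural model's probabilities matter only when the cache distribution is absent or high-entropy, which is consistent with the observation that the model contributes mainly on the residual of uncached tokens.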
…ed submission format
- Update merged SOTA to 1.1194 (abaybektursun, was 1.1228 signalrush)
- Add competition strategy pivot: n-gram eval cache now dominates (~0.02-0.97 bpb)
- Document PR openai#727 (0.9674), openai#741 (0.9850), openai#945 (0.0274), openai#961 (0.0881) findings
- Add Lessons Learned entries 17-20 on n-gram dominance + memorization risk
- Update Technique Reference table with n-gram entries

https://claude.ai/code/session_01Bpr2fKEnkNQmNKno8EnxWF
Merge remote's two-pass n-gram discoveries (PR openai#868 0.1181, PR openai#870 0.0935) with today's extreme n-gram findings (PR openai#945 0.0274, PR openai#961 0.0881). Keep Architecture Decisions and Legal TTT Protocol from remote. Add Lessons Learned 17-20 from 2026-03-27 research. https://claude.ai/code/session_01Bpr2fKEnkNQmNKno8EnxWF
Summary
Results (8xH100 SXM)
Key Contributions
Compliance
Credits
Eval: PR #913 (@RoyiRa). Base model: PR #549 (@abaybektursun). Architecture: PR #414 (@signalrush). Chunk concept: PR #840 (@quietsmile).