Record: LeakyReLU(0.9)² + N-gram Cache + Entropy-Reg QAT — val_bpb 0.9958 (3-seed mean) #885
Open
lolrazh wants to merge 1 commit into openai:main
…9958 (3-seed mean)

3-seed mean: 0.9958 BPB (std 0.0017). Seeds 1337/42/2025: 0.9977/0.9947/0.9949.

Built on the PR openai#549 stack plus four additions:
- Backward-looking 7-gram eval cache (alpha=0.2, score-first, ~98% hit rate)
- Entropy-regularized QAT (halves quant gap: 0.009 vs 0.017)
- Mixed int5/int6 quantization (front3_back1_6_middle5) + per-row GPTQ-lite
- LeakyReLU(0.9)² (0.013 BPB better than slope 0.5)

All artifacts under 16MB (~14.0 MB). All eval under 10 min (~552s TTT+ngram).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Record: LeakyReLU(0.9)² + N-gram Cache + Entropy-Reg QAT — val_bpb 0.9958
val_bpb = 0.9958 (3-seed mean, std 0.0017) | ~14.0 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
Seed | val_bpb
1337 | 0.9977
42 | 0.9947
2025 | 0.9949
Mean | 0.9958 (std 0.0017)
What's New
Backward-looking 7-gram eval cache (alpha=0.2, score-first, ~98% hit rate) — exploits FineWeb's repetitive n-gram structure. Cache starts empty, builds from scored val tokens only. No oracle, no training data access during eval.
Entropy-regularized QAT — penalty term pushes weights toward quantization grid during warmdown. Halves quant gap (0.009 vs 0.017 BPB).
Mixed int5/int6 quantization (front3_back1_6_middle5) — int6 for sensitive layers (first 3 + last 1), int5 for middle. Combined with per-row GPTQ-lite clip search.
LeakyReLU(0.9)² — slope 0.9 beats 0.5 by 0.013 BPB (controlled sweep, issue #140).
Score-first TTT (PR #549 recipe) — SGD(lr=0.002, mom=0.9), 3 epochs per 32K chunk, all blocks unfrozen.
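The backward-looking cache from the first item can be sketched as follows. This is a minimal illustration, not the PR's implementation: the names `NGramCache` and `score_stream`, the dict-based model distribution, and the use of alpha as a linear interpolation weight are all assumptions. The PR states only the order (7), alpha=0.2, and that each token is scored before it enters the cache.

```python
import math
from collections import defaultdict


class NGramCache:
    """Hypothetical backward-looking n-gram cache (order >= 2 assumed)."""

    def __init__(self, order=7, alpha=0.2):
        self.order = order
        self.alpha = alpha  # assumed to be a linear interpolation weight
        self.counts = defaultdict(lambda: defaultdict(int))

    def _context(self, history):
        # The (order - 1) most recent tokens form the lookup key.
        return tuple(history[-(self.order - 1):])

    def mix(self, history, model_probs):
        """Blend model probabilities with cache counts for this context."""
        seen = self.counts.get(self._context(history))
        if len(history) < self.order - 1 or not seen:
            return dict(model_probs)  # cache miss: model distribution as-is
        total = sum(seen.values())
        mixed = {t: (1 - self.alpha) * p for t, p in model_probs.items()}
        for tok, count in seen.items():
            mixed[tok] = mixed.get(tok, 0.0) + self.alpha * count / total
        return mixed

    def update(self, history, token):
        if len(history) >= self.order - 1:
            self.counts[self._context(history)][token] += 1


def score_stream(tokens, model_probs_fn, cache):
    """Mean negative log-likelihood (nats) with score-first cache updates."""
    history, total_nll = [], 0.0
    for tok in tokens:
        probs = cache.mix(history, model_probs_fn(history))
        total_nll -= math.log(probs[tok])  # score the token first...
        cache.update(history, tok)         # ...then let the cache see it
        history.append(tok)
    return total_nll / len(tokens)
```

Because `update` runs only after a token has been scored, the cache never holds information about the token it is currently helping to predict, which is the property the "no oracle" remark describes.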
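The mixed-precision layout and per-row clip search from the quantization item can be illustrated with a short sketch. Everything here is an assumption made for illustration: the helper names `bits_for_layer` and `quantize_row`, the symmetric integer range, and the grid of clip fractions are hypothetical, and the PR's GPTQ-lite procedure may differ in its details.

```python
def bits_for_layer(layer_idx, num_layers, front=3, back=1):
    """front3_back1_6_middle5: int6 for the first `front` and last `back`
    layers, int5 for everything in between."""
    if layer_idx < front or layer_idx >= num_layers - back:
        return 6
    return 5


def quantize_row(row, bits, clip_fracs=(1.0, 0.95, 0.9, 0.85, 0.8)):
    """Quantize one weight row, picking the clip fraction with lowest
    squared reconstruction error. Returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1  # symmetric range [-qmax, qmax] (assumed)
    absmax = max(abs(w) for w in row)
    if absmax == 0.0:
        return [0] * len(row), 1.0
    best = None
    for frac in clip_fracs:
        scale = frac * absmax / qmax
        codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
        err = sum((w - c * scale) ** 2 for w, c in zip(row, codes))
        if best is None or err < best[0]:
            best = (err, codes, scale)
    return best[1], best[2]


# Illustrative usage with a hypothetical 8-layer model.
layer_bits = [bits_for_layer(i, 8) for i in range(8)]
codes, scale = quantize_row([0.1, -0.2, 0.05, 3.0], bits=layer_bits[3])
```

Clipping below the row maximum trades a larger error on the few outliers for finer resolution on the remaining weights, which is why the search is run per row rather than once per tensor.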
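The ordering that "score-first" implies for test-time training can be shown with a framework-free sketch. The function `score_first_ttt` and its two callables are hypothetical stand-ins: in the PR the training step would be the SGD update described above, while here `score_fn` is assumed to return a chunk's summed loss and `train_fn` to run one epoch on it.

```python
def score_first_ttt(chunks, score_fn, train_fn, epochs=3):
    """Score each chunk with the current weights, then adapt on it.

    score_fn(chunk) -> summed loss over the chunk (assumed interface)
    train_fn(chunk) -> one training epoch on the chunk (assumed interface)
    """
    total_loss = 0.0
    total_tokens = 0
    for chunk in chunks:
        # The reported loss for a chunk is computed before any update on it.
        total_loss += score_fn(chunk)
        total_tokens += len(chunk)
        # Only already-scored tokens are used for adaptation.
        for _ in range(epochs):
            train_fn(chunk)
    return total_loss / total_tokens
```

With this ordering, every token's score comes from weights that have been trained only on earlier chunks.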
Timing Note
The logs show a standalone sliding-window eval (~75-98s) that ran before TTT. It is redundant because TTT includes its own sliding-window scoring, and the standalone eval's BPB is not the reported score. Without it, eval time is 576-581s (within the 600s budget). Full explanation in the README.
Credits