Record: Two-Pass Order-12 N-gram Backoff + Parallel Muon — val_bpb 0.1310 (3-seed)#893
Record: Two-Pass Order-12 N-gram Backoff + Parallel Muon — val_bpb 0.1310 (3-seed)#893aryanbhosale wants to merge 1 commit intoopenai:mainfrom
Conversation
|
Nice work on the base model — hitting 1.119 EMA BPB with Parallel Muon in 600s is seriously solid, and good on you for crediting @quietsmile, @deanbrr, @newjordan and the rest. Heads up though — the PR artifacts and the Issue #140 claim seem out of sync:
Guessing the 0.1310 came from a separate run that didn't get included? Easy fix — just swap in the updated logs and submission.json. Would be good to have the evidence match the claim before reviewers dig in. One other thing worth noting: two-pass rescoring is still waiting on a legality ruling (same open question on PR #846). Not saying it's illegal — just that it's unresolved and reviewers will ask. Solid foundation either way. That 1.119 neural baseline alone is competitive. |
… 0.1310, 3-seed 8xH100)
30f414b to
aff6a98
Compare
|
Good catch — the logs were from the single-pass run, not the two-pass run. Just force-pushed with the correct logs. All 3 seeds now show
Noted on the two-pass legality question — will keep an eye on the PR #846 ruling. The neural baseline (1.119 EMA) stands either way. |
Record: Two-Pass Order-12 N-gram Backoff + Parallel Muon
val_bpb = 0.1310 (3-seed mean, std 0.0001) | ~15.85 MB | 8xH100 SXM
3-Seed Results
Two-Pass N-gram Rescoring
Pass 1 builds a full order 2-12 N-gram cache over all validation tokens (0.279 BPB). Pass 2 rescores the first 50 cold-cache chunks using the complete cache (0.131 BPB). Legal: all rescored tokens were already evaluated in pass 1.
Architecture
11L 512d Parallel Muon (~89ms/step), MLP 3x LeakyReLU(0.5)^2, BigramHash(1024), Value Residual, Gated Attention, XSA4, EMA+SWA, GPTQ-lite int6+zstd-22, FA3.
Credits