
Record: 0.3212 BPB — Complementary N-gram 65K + Int5 GPTQ + LoRA TTT#850

Closed
callithyia wants to merge 1 commit into openai:main from callithyia:record/complementary-ngram-65k-0.3212

Conversation

@callithyia

Summary

  • val_bpb: 0.3212 (3-seed mean, std 0.0003)
  • Complementary training (alpha=0.50) + order-9 n-gram eval cache with 65K-token chunks (15x cache refresh)
  • Full Hessian GPTQ int5 + LZMA compression (~14.9 MB artifact)
  • LoRA TTT (rank 8, Polyak averaging, score-first backward-looking)
  • LeakyReLU(0.9)² + XSA-4 + VRL + Gated Attention + Parallel Muon

Results (8xH100 SXM)

| Seed | Steps | ms/step | val_bpb | Post-quant BPB | Artifact |
|------|-------|---------|---------|----------------|----------|
| 1337 | 5,457 | 101 | 0.3211 | 1.1817 | 14,965,401 bytes |
| 42   | 5,437 | 101 | 0.3210 | 1.1794 | 14,926,117 bytes |
| 2024 | 5,498 | 101 | 0.3216 | 1.1831 | 14,874,853 bytes |
| Mean | 5,464 | 101 | 0.3212 | 1.1814 | 14,922,124 bytes |
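For context on the val_bpb column: bits-per-byte converts the summed per-token negative log-likelihood (in nats) to bits and normalizes by the byte length of the evaluated text. A minimal sketch (function name and the example numbers are illustrative, not taken from this run):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed negative log-likelihood (nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Example: 1e6 nats of NLL over 4.5e6 bytes of validation text.
print(round(bits_per_byte(1_000_000, 4_500_000), 4))  # → 0.3206
```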

Key Techniques

  • Complementary training: Downweights bigram-predictable tokens, making the model deliberately weaker where n-grams are strong
  • 65K-token chunks: Cache updates 15x more frequently than 1M chunks, reducing cold-cache penalty
  • Per-order entropy centers + multipliers: Orders 5-9 boosted 2x, orders 2-3 suppressed 0.3x
  • Full Hessian GPTQ: Activation-order column permutation + Cholesky error compensation (not naive quantization)
  • LoRA TTT: Rank 8, Q+V on blocks 9-10, Polyak averaging decay=0.998
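The complementary-training bullet amounts to a per-token loss reweighting: tokens a bigram model already predicts well are downweighted by alpha, steering the neural model's capacity toward tokens the n-grams miss. A minimal sketch (the names and the exact weighting form are assumptions, not this PR's code):

```python
def complementary_loss(nll, bigram_prob, alpha=0.50):
    """Downweight per-token NLL where a bigram model is already confident.

    nll:         per-token neural NLL values
    bigram_prob: bigram probability assigned to each target token
    """
    # Weight shrinks toward (1 - alpha) as the bigram becomes confident.
    weights = [1.0 - alpha * p for p in bigram_prob]
    weighted = [w * l for w, l in zip(weights, nll)]
    return sum(weighted) / len(weighted)

# Second token is fully bigram-predictable, so its loss counts half as much.
print(complementary_loss([2.0, 2.0], [0.0, 1.0]))  # → 1.5
```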

Compliance

  • 3 seeds on 8xH100 SXM (1337, 42, 2024)
  • All seeds train ≤600s, eval ≤600s (~570s)
  • Artifact ≤16,000,000 bytes (~14.9MB)
  • No validation data during training
  • TTT backward-looking (score-first per chunk)
  • No multi-pass rescoring
  • Reproducible single script
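The "TTT backward-looking (score-first per chunk)" item means each eval chunk is scored with the current adapter weights before any update uses those tokens, so no token influences its own score. A toy sketch of that loop combined with Polyak (EMA) averaging, using scalars as stand-ins for the rank-8 LoRA parameters (all names are illustrative):

```python
def score_first_ttt(chunks, score, update, polyak=0.998):
    """Score each chunk BEFORE updating on it (backward-looking TTT)."""
    fast, slow = 0.0, 0.0          # stand-ins for adapter parameters
    total = 0.0
    for chunk in chunks:
        total += score(chunk, slow)                 # score with averaged weights first
        fast = update(chunk, fast)                  # then adapt on the scored tokens
        slow = polyak * slow + (1 - polyak) * fast  # Polyak (EMA) average
    return total

# Toy usage: score = chunk + weight, update adds the chunk to the weight.
loss = score_first_ttt([1.0, 2.0],
                       score=lambda c, w: c + w,
                       update=lambda c, w: w + c,
                       polyak=0.5)
print(loss)  # → 3.5
```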

Credits

Built on: PR #809 (n-gram cache), PR #803 (complementary training), PR #798 (entropy centers, Polyak TTT), PR #840 (65K chunks), PR #779 (integrated eval), PR #414 (GPTQ baseline).

3-seed mean 0.3212 (std 0.0003). Complementary training + order-9
n-gram eval cache with 65K-token chunks + Full Hessian GPTQ int5 +
LoRA TTT with Polyak averaging.
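The per-order multipliers described above (orders 5-9 boosted 2x, orders 2-3 suppressed 0.3x) can be read as order-dependent weights when mixing the neural probability with the cached n-gram estimates. A sketch under assumed normalization (the base weight and the exact functional form are illustrative, not taken from the PR's code):

```python
def order_weighted_mix(p_neural, ngram_probs, base_weight=0.1):
    """Mix a neural prob with per-order n-gram probs, scaled per order."""
    # Orders 5-9 boosted 2x, orders 2-3 suppressed 0.3x (per the PR description).
    multipliers = {2: 0.3, 3: 0.3, 4: 1.0, 5: 2.0, 6: 2.0, 7: 2.0, 8: 2.0, 9: 2.0}
    p, total_w = p_neural, 1.0
    for order, p_ng in ngram_probs.items():
        w = base_weight * multipliers[order]
        p += w * p_ng
        total_w += w
    return p / total_w  # renormalize so the mixing weights sum to 1

# A confident order-2 match barely moves the estimate (0.3x suppression).
print(round(order_weighted_mix(0.5, {2: 1.0}), 4))  # → 0.5146
```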
@valerio-oai
Contributor

Thanks for your submission! Unfortunately, it is disallowed due to its use of hashed n-gram caches, which do not correctly renormalize or reweight the LM's token distribution, and which look ahead to the target token when mixing probabilities, thereby leaking eval tokens. Please refer to the long discussion about this under the Issues tab for more details, and please submit more runs in the future!

newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 28, 2026
Phrase cache (PR openai#880 / PR openai#900 — proven +0.1 BPB, legal):
- Variable-length suffix matching at 48/36/28/20/16 token probe lengths
- One ctx+full count table pair per probe length (4M buckets each)
- 48-prime XOR hash — unique prime per context position up to length 48
- Dirichlet smoothing: p=(min(fc,cc)+c*neural)/(ctx+c), c=2.0
- Applied inline after n-gram mixing, before NLL conversion
- Score-first: tables updated with chunk tokens AFTER all scoring done

RegimeTracker (PR openai#880):
- Tracks match rate + token diversity over rolling 4096-token window
- Adapts effective phrase concentration: repetitive/boilerplate content
  → lower c (more cache trust); novel prose → higher c (more neural trust)
- Multiplier range [0.7, 1.5], effective_c = base_c / mult
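The Dirichlet smoothing line in the commit message can be written out directly: with no phrase match it falls back to the neural probability, and with many matches it approaches the empirical phrase frequency. A sketch using the commit's stated formula, where `fc`/`cc` are the full- and context-table counts and `ctx` the context occurrence count; the RegimeTracker would pass `effective_c = base_c / mult` in place of the default `c`:

```python
def phrase_cache_prob(fc, cc, ctx, neural, c=2.0):
    """Dirichlet-smoothed phrase prob: p = (min(fc, cc) + c*neural) / (ctx + c)."""
    return (min(fc, cc) + c * neural) / (ctx + c)

# No phrase match (all counts zero): falls back to the neural probability.
print(phrase_cache_prob(0, 0, 0, 0.25))  # → 0.25
# Strong match (fc = cc = ctx = 98): dominated by the empirical frequency.
print(round(phrase_cache_prob(98, 98, 98, 0.25), 3))  # → 0.985
```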

Config improvements:
- WARMDOWN_ITERS=2000 (confirmed best from A/B sweep)
- NGRAM_CHUNK_TOKENS=65536 (PR openai#850, 15x more cache refreshes vs 1M)
- MATRIX_LR=0.03 (PR openai#859)

ARTIFACT_NGRAM=0 remains disabled (legally gray).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
