
Record: 0.0498 bpb - Packed Training N-gram Artifact + Learned Weighting Gate (updated) #931

Closed
AnirudhRahul wants to merge 2 commits into openai:main from AnirudhRahul:record-packed-training-ngram-artifact-00498

Conversation


@AnirudhRahul AnirudhRahul commented Mar 27, 2026

3-seed mean val_bpb = 0.04979 ± 0.00014 | 15.86 MB max total size

All within budget: training < 600s ✓, eval < 600s ✓, artifact < 16MB ✓

Summary

  • Replace the hand-written n-gram blend with a learned weighting gate over the neural model plus order-2..9 n-gram experts.
  • Serialize a compact order-2..9 training n-gram cache into the artifact itself using 32K buckets with 32-bit count tables, deserialize that bundled payload at eval step 0, and continue updating it causally during validation.
  • Remove the bigram hash path to fit the packed cache under the 16MB artifact limit while keeping phrase cache and online logit calibration.
  • Update the final path so the gate mask is context-only, GPTQ calibration uses cached training batches, and the compliant reruns use TTT_EPOCHS=0.

Results

Current compliant 3-seed results:

| Seed | Final val_bpb | Artifact bytes | Total bytes | Eval time | Notes |
| --- | --- | --- | --- | --- | --- |
| 1337 | 0.04975064 | 15,142,934 | 15,303,728 | 540s | context-only gate mask, cached-batch GPTQ, TTT_EPOCHS=0 |
| 42 | 0.04994599 | 15,703,498 | 15,864,394 | 543s | context-only gate mask, cached-batch GPTQ, TTT_EPOCHS=0 |
| 7 | 0.04968095 | 15,254,206 | 15,415,102 | 537s | context-only gate mask, cached-batch GPTQ, TTT_EPOCHS=0 |

Final 3-seed mean final val_bpb: 0.04979253 with sample std 0.00013740.

Equivalent statistical view: the 99% one-sided upper confidence bound on the mean is 0.05034499 BPB, so the 3-seed result supports mean BPB < 0.0504 without relying on ambiguous p-value shorthand.
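The quoted bound can be re-derived from the per-seed numbers in the table using only the Python stdlib; the one external input is the one-sided 99% Student-t critical value for 2 degrees of freedom (≈6.9646):

```python
import math
import statistics

# Per-seed final val_bpb from the results table.
vals = [0.04975064, 0.04994599, 0.04968095]

mean = statistics.mean(vals)        # sample mean
s = statistics.stdev(vals)          # sample standard deviation (n-1)
sem = s / math.sqrt(len(vals))      # standard error of the mean

# One-sided 99% Student-t critical value for df = 2
# (equivalently scipy.stats.t.ppf(0.99, 2)).
T_99_DF2 = 6.9646

upper = mean + T_99_DF2 * sem
print(f"mean={mean:.8f} std={s:.8f} 99% UCB={upper:.8f}")
```

With n = 3 the t margin is wide (≈6.96 standard errors), which is why the upper bound sits well above the sample mean while still clearing 0.0504.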

The packed training n-gram payload is still a 32K-bucket order-2..9 cache serialized as 32-bit count tables (2,097,152 raw bytes) inside the artifact itself. The warm-start cache loaded at eval step 0 therefore comes directly from the submitted artifact rather than from any external side input, and validation continues causal online updates from there.
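The PR does not show the on-disk layout, so the following is only a minimal self-describing sketch of packing per-order 32-bit count tables into one payload (the header format and function names are hypothetical; stdlib only):

```python
import struct
from array import array

BUCKETS = 32768               # 32K hash buckets per order
ORDERS = tuple(range(2, 10))  # n-gram orders 2..9

def pack_ngram_cache(tables):
    """Serialize {order: iterable of uint32 counts} into one byte payload.

    A small header records the order count and bucket count so the
    payload can be decoded without out-of-band metadata.
    """
    payload = bytearray(struct.pack("<II", len(ORDERS), BUCKETS))
    for order in ORDERS:
        counts = array("I", tables[order])
        assert counts.itemsize == 4 and len(counts) == BUCKETS
        payload += struct.pack("<I", order)
        payload += counts.tobytes()
    return bytes(payload)

def unpack_ngram_cache(payload):
    """Inverse of pack_ngram_cache: rebuild the per-order count tables."""
    n_orders, buckets = struct.unpack_from("<II", payload, 0)
    off = 8
    tables = {}
    for _ in range(n_orders):
        (order,) = struct.unpack_from("<I", payload, off)
        off += 4
        counts = array("I")
        counts.frombytes(payload[off:off + buckets * 4])
        off += buckets * 4
        tables[order] = counts
    return tables
```

Packing raw count arrays (rather than pickled dicts) keeps the payload size fixed and cheap to deserialize at eval step 0.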

Causal Inference Scheme

  1. Start eval by deserializing the packed order-2..9 n-gram cache from the submitted artifact itself; this warm-start cache was built from training data only and is not loaded from any external file or side channel.
  2. For each validation chunk, run the model once using only left context and the current cache state.
  3. Query n-gram experts from the current cache using left context only; expert availability depends only on context evidence, not on the true next token.
  4. Blend neural + n-gram experts and score the chunk before any mutation of cache or model state.
  5. After scoring, append the chunk tokens to the streaming cache for future chunks.
  6. The reported compliant runs use TTT_EPOCHS=0, so there is no backward adaptation step in the submission path.
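Steps 2-5 amount to a strict score-then-update loop. A minimal sketch, with a toy add-one-smoothed unigram scorer standing in for the real neural + n-gram blend (all helpers here are hypothetical):

```python
import math
from collections import Counter

def causal_eval(chunks, cache, score_chunk, update_cache):
    """Single-pass causal evaluation: each chunk is scored against the
    cache state as it stood BEFORE any of that chunk's tokens were added."""
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        total_bits += score_chunk(chunk, cache)  # left context + cache only
        total_tokens += len(chunk)
        update_cache(cache, chunk)               # mutate only after scoring
    return total_bits / max(total_tokens, 1)     # bits per token

# Toy stand-in scorer: add-one-smoothed unigram model over a 4-token
# vocabulary, shown only to make the score/update ordering concrete.
VOCAB = 4

def score_chunk(chunk, cache):
    seen, total, bits = Counter(cache), len(cache), 0.0
    for tok in chunk:
        bits += -math.log2((seen[tok] + 1) / (total + VOCAB))
        seen[tok] += 1   # within-chunk left context is still causal
        total += 1
    return bits

def update_cache(cache, chunk):
    cache.extend(chunk)
```

Because `update_cache` runs strictly after `score_chunk`, no chunk's own tokens (or any future tokens) can influence its score, which is the property the compliance section asserts.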

Key Changes

  • Learned weighting gate over neural + order-2..9 n-gram experts.
  • Packed training n-gram artifact embedded into the submission itself so eval starts warm.
  • Bigram hash embedding removed to make room for the packed cache, since the packed n-gram artifact already supplies the warm low-order signal the learned gate needs.
  • Earlier cache-maturity decay and hybrid/heuristic switching logic removed from the final path.
  • Phrase cache and online logit calibration kept from the PR #880 stack.
  • Eval-time gate validity is now context-only rather than target-conditioned.
  • GPTQ calibration now uses cached training batches already seen during the run.
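The context-only gating in the first and last bullets can be sketched as a masked softmax over per-expert scores: the availability mask comes from left-context evidence only, and the gate logits would come from a learned projection of context features (omitted here). This sketch assumes at least one expert, typically the neural model, is always available:

```python
import math

def gate_blend(gate_logits, expert_probs, avail_mask):
    """Blend expert next-token distributions with a context-only gate.

    gate_logits: per-expert scores from a learned projection of the
        left-context features (the projection itself is not shown).
    expert_probs: list of per-expert probability vectors over the vocab.
    avail_mask: True where the expert has context evidence (e.g. the
        n-gram bucket for the current left context is non-empty); it
        never depends on the true next token.
    """
    # Mask unavailable experts before the softmax so the gate never
    # routes probability mass to an expert with no context evidence.
    masked = [l if ok else float("-inf")
              for l, ok in zip(gate_logits, avail_mask)]
    m = max(masked)  # assumes at least one expert is available
    exps = [math.exp(l - m) for l in masked]
    z = sum(exps)
    weights = [e / z for e in exps]
    vocab = len(expert_probs[0])
    return [sum(w * p[i] for w, p in zip(weights, expert_probs))
            for i in range(vocab)]
```

Masking before the softmax (rather than zeroing weights afterward) keeps the surviving weights properly normalized.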

Bucket-Size Ablation

In our ablations, smaller n-gram bucket sizes outperformed the larger settings we tried for this learned-gate setup. The 32K setting was the best practical point: it improved BPB while leaving enough headroom to pack the training cache into the artifact and stay under the 16MB limit.
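"Bucket size" here means the number of hash buckets per order. The PR does not specify its bucketing function, so the following is only a hypothetical illustration of hashing a left context into one of 32,768 per-order buckets:

```python
def ngram_bucket(context_tokens, order, n_buckets=32768):
    """Hash the last (order - 1) context tokens into one of n_buckets.

    Hypothetical polynomial hash; salting with the order keeps the
    per-order tables from colliding on identical token suffixes.
    """
    h = order
    for tok in context_tokens[-(order - 1):]:
        h = (h * 1000003 + tok) & 0xFFFFFFFF
    return h % n_buckets
```

Smaller tables collide more often but are denser (each bucket accumulates evidence faster), which is one plausible reason the 32K setting beat larger tables under a learned gate that can discount noisy buckets.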

Compliance

  • This is not a 2-pass method.
  • Validation is scored in a single causal pass: each chunk is scored before that chunk is used for any cache update.
  • The warm-start cache used at eval step 0 is part of the artifact itself, not a separate runtime input.
  • The packed n-gram cache in the artifact is derived from training data only and is produced within the 600 second training budget.
  • The learned gate does not use the true next token to decide which experts are available.
  • GPTQ calibration runs inside the reserved pre-export budget using cached training batches from the same timed run; it does not reopen training shards after the official wallclock limit.
  • The current reported numbers use TTT_EPOCHS=0, so there is no backward test-time adaptation in the final submission path.
  • No future validation tokens are visible when scoring the current chunk.

Reproduction

```
pip install -r requirements.txt

SEED=1337 \
ARTIFACT_NGRAM_EXPORT=1 \
MAX_WALLCLOCK_SECONDS=600 \
USE_MIXER=1 MIXER_ETA=0.1 MIXER_HEAD=multi \
USE_NGRAM_CACHE=1 NGRAM_EVAL_ORDER=9 \
TRAIN_ORACLE_BUCKETS=32768 NGRAM_EVAL_BUCKETS=32768 \
USE_PHRASE_CACHE=1 USE_REGIME_TRACKER=0 USE_LOGIT_CAL=1 \
TTT_EPOCHS=0 TTT_FREEZE_BLOCKS=2 TTT_LR=0.0001 \
TTT_CHUNK_TOKENS=131072 SKIP_SLIDING=1 EVAL_STRIDE=64 TTT_TEMPERATURE=0.85 \
CROWN_Q_LAMBDA=0.01 PRUNE_PCT=0.05 BIGRAM_VOCAB_SIZE=0 \
GPTQ_CALIBRATION_SEQS=128 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

@AnirudhRahul AnirudhRahul changed the title Record: 0.0498 bpb - Packed Training N-gram Artifact + Learned Weighting Gate Record: 0.0498 bpb - Packed Training N-gram Artifact + Learned Weighting Gate (updated) Mar 27, 2026
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 28, 2026
Three improvements over Green v1 (baseline: sliding=1.1129, ngram9=0.4489):

PR openai#931 (packed training oracle): after training, reads 2 train shards
(~200M tokens) and seeds eval n-gram tables before val token #1.
Eliminates cold-start penalty where early val chunks score with empty cache.
Legal: oracle is training-data-only, eval remains single-pass causal.

PR openai#900 (Dirichlet smoothing): replaces linear alpha mixing with
  p = (ng_count + c * neural_p) / (ctx_count + c)
Count-sensitive weighting: high-count matches trust n-gram, low-count
matches stay close to neural prior. No hand-tuned alpha per-order needed.
NGRAM_EVAL_MIN_COUNT=1 (formula handles low counts naturally).

PR openai#859 (matrix_lr): MATRIX_LR=0.03 vs 0.025 in Green — higher LR
found across 79-experiment sweep to train stronger base model.

Both new features are independent toggles (ARTIFACT_NGRAM, NGRAM_DIRICHLET)
for A/B isolation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
