
Record: 0.0498 bpb - Packed Training N-gram Artifact + Learned Weighting Gate (updated) #931

Closed
AnirudhRahul wants to merge 2 commits into openai:main from AnirudhRahul:record-packed-training-ngram-artifact-00498

Conversation


@AnirudhRahul AnirudhRahul commented Mar 27, 2026

3-seed mean val_bpb = 0.04979 ± 0.00014 | 15.86 MB max total size

All within budget: training < 600s ✓, eval < 600s ✓, artifact < 16MB ✓

Summary

  • Replace the hand-written n-gram blend with a learned weighting gate over the neural model plus order-2..9 n-gram experts.
  • Serialize a compact order-2..9 training n-gram cache into the artifact itself using 32K buckets with 32-bit count tables, deserialize that bundled payload at eval step 0, and continue updating it causally during validation.
  • Remove the bigram hash path to fit the packed cache under the 16MB artifact limit while keeping phrase cache and online logit calibration.
  • Update the final path so the gate mask is context-only, GPTQ calibration uses cached training batches, and the compliant reruns use TTT_EPOCHS=0.

Results

Current compliant 3-seed results:

| Seed | Final val_bpb | Artifact bytes | Total bytes | Eval time | Notes |
| --- | --- | --- | --- | --- | --- |
| 1337 | 0.04975064 | 15,142,934 | 15,303,728 | 540s | context-only gate mask, cached-batch GPTQ, TTT_EPOCHS=0 |
| 42 | 0.04994599 | 15,703,498 | 15,864,394 | 543s | context-only gate mask, cached-batch GPTQ, TTT_EPOCHS=0 |
| 7 | 0.04968095 | 15,254,206 | 15,415,102 | 537s | context-only gate mask, cached-batch GPTQ, TTT_EPOCHS=0 |

Final 3-seed mean final val_bpb: 0.04979253 with sample std 0.00013740.

Equivalent statistical view: the 99% one-sided upper confidence bound on the mean is 0.05034499 BPB, so the 3-seed result supports mean BPB < 0.0504 without relying on ambiguous p-value shorthand.
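The quoted bound can be re-derived from the per-seed numbers in the table using only the Python stdlib; the one external input is the one-sided 99% Student-t critical value for 2 degrees of freedom (≈6.9646):

```python
import math
import statistics

# Per-seed final val_bpb from the results table.
vals = [0.04975064, 0.04994599, 0.04968095]

mean = statistics.mean(vals)        # sample mean
s = statistics.stdev(vals)          # sample standard deviation (n-1)
sem = s / math.sqrt(len(vals))      # standard error of the mean

# One-sided 99% Student-t critical value for df = 2
# (equivalently scipy.stats.t.ppf(0.99, 2)).
T_99_DF2 = 6.9646

upper = mean + T_99_DF2 * sem
print(f"mean={mean:.8f} std={s:.8f} 99% UCB={upper:.8f}")
```

With n = 3 the t margin is wide (≈6.96 standard errors), which is why the upper bound sits well above the sample mean while still clearing 0.0504.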

The packed training n-gram payload is still a 32K-bucket order-2..9 cache serialized as 32-bit count tables (2,097,152 raw bytes) inside the artifact itself. The warm-start cache loaded at eval step 0 therefore comes directly from the submitted artifact rather than from any external side input, and validation continues causal online updates from there.
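The PR does not show the on-disk layout, so the following is only a minimal self-describing sketch of packing per-order 32-bit count tables into one payload (the header format and function names are hypothetical; stdlib only):

```python
import struct
from array import array

BUCKETS = 32768               # 32K hash buckets per order
ORDERS = tuple(range(2, 10))  # n-gram orders 2..9

def pack_ngram_cache(tables):
    """Serialize {order: iterable of uint32 counts} into one byte payload.

    A small header records the order count and bucket count so the
    payload can be decoded without out-of-band metadata.
    """
    payload = bytearray(struct.pack("<II", len(ORDERS), BUCKETS))
    for order in ORDERS:
        counts = array("I", tables[order])
        assert counts.itemsize == 4 and len(counts) == BUCKETS
        payload += struct.pack("<I", order)
        payload += counts.tobytes()
    return bytes(payload)

def unpack_ngram_cache(payload):
    """Inverse of pack_ngram_cache: rebuild the per-order count tables."""
    n_orders, buckets = struct.unpack_from("<II", payload, 0)
    off = 8
    tables = {}
    for _ in range(n_orders):
        (order,) = struct.unpack_from("<I", payload, off)
        off += 4
        counts = array("I")
        counts.frombytes(payload[off:off + buckets * 4])
        off += buckets * 4
        tables[order] = counts
    return tables
```

Packing raw count arrays (rather than pickled dicts) keeps the payload size fixed and cheap to deserialize at eval step 0.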

Causal Inference Scheme

  1. Start eval by deserializing the packed order-2..9 n-gram cache from the submitted artifact itself; this warm-start cache was built from training data only and is not loaded from any external file or side channel.
  2. For each validation chunk, run the model once using only left context and the current cache state.
  3. Query n-gram experts from the current cache using left context only; expert availability depends only on context evidence, not on the true next token.
  4. Blend neural + n-gram experts and score the chunk before any mutation of cache or model state.
  5. After scoring, append the chunk tokens to the streaming cache for future chunks.
  6. The reported compliant runs use TTT_EPOCHS=0, so there is no backward adaptation step in the submission path.
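Steps 2-5 amount to a strict score-then-update loop. A minimal sketch, with a toy add-one-smoothed unigram scorer standing in for the real neural + n-gram blend (all helpers here are hypothetical):

```python
import math
from collections import Counter

def causal_eval(chunks, cache, score_chunk, update_cache):
    """Single-pass causal evaluation: each chunk is scored against the
    cache state as it stood BEFORE any of that chunk's tokens were added."""
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        total_bits += score_chunk(chunk, cache)  # left context + cache only
        total_tokens += len(chunk)
        update_cache(cache, chunk)               # mutate only after scoring
    return total_bits / max(total_tokens, 1)     # bits per token

# Toy stand-in scorer: add-one-smoothed unigram model over a 4-token
# vocabulary, shown only to make the score/update ordering concrete.
VOCAB = 4

def score_chunk(chunk, cache):
    seen, total, bits = Counter(cache), len(cache), 0.0
    for tok in chunk:
        bits += -math.log2((seen[tok] + 1) / (total + VOCAB))
        seen[tok] += 1   # within-chunk left context is still causal
        total += 1
    return bits

def update_cache(cache, chunk):
    cache.extend(chunk)
```

Because `update_cache` runs strictly after `score_chunk`, no chunk's own tokens (or any future tokens) can influence its score, which is the property the compliance section asserts.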

Key Changes

  • Learned weighting gate over neural + order-2..9 n-gram experts.
  • Packed training n-gram artifact embedded into the submission itself so eval starts warm.
  • Bigram hash embedding removed to make room for the packed cache, since the packed n-gram artifact already supplies the warm low-order signal the learned gate needs.
  • Earlier cache-maturity decay and hybrid/heuristic switching logic removed from the final path.
  • Phrase cache and online logit calibration kept from the PR #880 stack.
  • Eval-time gate validity is now context-only rather than target-conditioned.
  • GPTQ calibration now uses cached training batches already seen during the run.
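The context-only gating in the first and last bullets can be sketched as a masked softmax over per-expert scores: the availability mask comes from left-context evidence only, and the gate logits would come from a learned projection of context features (omitted here). This sketch assumes at least one expert, typically the neural model, is always available:

```python
import math

def gate_blend(gate_logits, expert_probs, avail_mask):
    """Blend expert next-token distributions with a context-only gate.

    gate_logits: per-expert scores from a learned projection of the
        left-context features (the projection itself is not shown).
    expert_probs: list of per-expert probability vectors over the vocab.
    avail_mask: True where the expert has context evidence (e.g. the
        n-gram bucket for the current left context is non-empty); it
        never depends on the true next token.
    """
    # Mask unavailable experts before the softmax so the gate never
    # routes probability mass to an expert with no context evidence.
    masked = [l if ok else float("-inf")
              for l, ok in zip(gate_logits, avail_mask)]
    m = max(masked)  # assumes at least one expert is available
    exps = [math.exp(l - m) for l in masked]
    z = sum(exps)
    weights = [e / z for e in exps]
    vocab = len(expert_probs[0])
    return [sum(w * p[i] for w, p in zip(weights, expert_probs))
            for i in range(vocab)]
```

Masking before the softmax (rather than zeroing weights afterward) keeps the surviving weights properly normalized.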

Bucket-Size Ablation

In our ablations, smaller n-gram bucket sizes outperformed the larger settings we tried for this learned-gate setup. The 32K setting was the best practical point: it improved BPB while leaving enough headroom to pack the training cache into the artifact and stay under the 16MB limit.
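"Bucket size" here means the number of hash buckets per order. The PR does not specify its bucketing function, so the following is only a hypothetical illustration of hashing a left context into one of 32,768 per-order buckets:

```python
def ngram_bucket(context_tokens, order, n_buckets=32768):
    """Hash the last (order - 1) context tokens into one of n_buckets.

    Hypothetical polynomial hash; salting with the order keeps the
    per-order tables from colliding on identical token suffixes.
    """
    h = order
    for tok in context_tokens[-(order - 1):]:
        h = (h * 1000003 + tok) & 0xFFFFFFFF
    return h % n_buckets
```

Smaller tables collide more often but are denser (each bucket accumulates evidence faster), which is one plausible reason the 32K setting beat larger tables under a learned gate that can discount noisy buckets.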

Compliance

  • This is not a 2-pass method.
  • Validation is scored in a single causal pass: each chunk is scored before that chunk is used for any cache update.
  • The warm-start cache used at eval step 0 is part of the artifact itself, not a separate runtime input.
  • The packed n-gram cache in the artifact is derived from training data only and is produced within the 600 second training budget.
  • The learned gate does not use the true next token to decide which experts are available.
  • GPTQ calibration runs inside the reserved pre-export budget using cached training batches from the same timed run; it does not reopen training shards after the official wallclock limit.
  • The current reported numbers use TTT_EPOCHS=0, so there is no backward test-time adaptation in the final submission path.
  • No future validation tokens are visible when scoring the current chunk.

Reproduction

```
pip install -r requirements.txt

SEED=1337 \
ARTIFACT_NGRAM_EXPORT=1 \
MAX_WALLCLOCK_SECONDS=600 \
USE_MIXER=1 MIXER_ETA=0.1 MIXER_HEAD=multi \
USE_NGRAM_CACHE=1 NGRAM_EVAL_ORDER=9 \
TRAIN_ORACLE_BUCKETS=32768 NGRAM_EVAL_BUCKETS=32768 \
USE_PHRASE_CACHE=1 USE_REGIME_TRACKER=0 USE_LOGIT_CAL=1 \
TTT_EPOCHS=0 TTT_FREEZE_BLOCKS=2 TTT_LR=0.0001 \
TTT_CHUNK_TOKENS=131072 SKIP_SLIDING=1 EVAL_STRIDE=64 TTT_TEMPERATURE=0.85 \
CROWN_Q_LAMBDA=0.01 PRUNE_PCT=0.05 BIGRAM_VOCAB_SIZE=0 \
GPTQ_CALIBRATION_SEQS=128 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

@AnirudhRahul AnirudhRahul changed the title Record: 0.0498 bpb - Packed Training N-gram Artifact + Learned Weighting Gate Record: 0.0498 bpb - Packed Training N-gram Artifact + Learned Weighting Gate (updated) Mar 27, 2026
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 28, 2026
Three improvements over Green v1 (baseline: sliding=1.1129, ngram9=0.4489):

PR openai#931 (packed training oracle): after training, reads 2 train shards
(~200M tokens) and seeds eval n-gram tables before val token #1.
Eliminates cold-start penalty where early val chunks score with empty cache.
Legal: oracle is training-data-only, eval remains single-pass causal.

PR openai#900 (Dirichlet smoothing): replaces linear alpha mixing with
  p = (ng_count + c * neural_p) / (ctx_count + c)
Count-sensitive weighting: high-count matches trust n-gram, low-count
matches stay close to neural prior. No hand-tuned alpha per-order needed.
NGRAM_EVAL_MIN_COUNT=1 (formula handles low counts naturally).

PR openai#859 (matrix_lr): MATRIX_LR=0.03 vs 0.025 in Green — higher LR
found across 79-experiment sweep to train stronger base model.

Both new features are independent toggles (ARTIFACT_NGRAM, NGRAM_DIRICHLET)
for A/B isolation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
