Record: 0.3958 BPB — Causal BackoffNgramMixer (3-seed, std 0.0011) #1094
michaelwinczuk wants to merge 2 commits into openai:main
Conversation
3-seed mean 0.4027 BPB (std 0.0015): 1337=0.4024, 42=0.4044, 2024=0.4014. All artifacts under 16MB. Beats openai#803 (0.4416) by 0.0389 BPB.

Causal sequential chunk eval with BackoffNgramMixer (orders 2-10). Swarm-guided training with KG-conditioned embedding init.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thanks for the review @kooshi. Let me clarify the eval mechanism: the eval processes validation tokens in sequential non-overlapping chunks (chunk_size = seq_len = 2048). Each chunk is scored before its tokens are folded into the counts, so the n-gram counts at chunk C only contain tokens from chunks 0 through C-1. This score-first, update-after ordering is the same "backward-looking" pattern used by #803 and #779.

However, I want to flag a potential concern: our sequential chunks are non-overlapping, which means the neural model restarts with fresh context each chunk while the n-gram retains full history from all previous chunks. This could give the n-gram disproportionate influence compared to sliding-window approaches, where the neural model maintains longer context. If the organizers consider this an issue, I'm happy to adapt the eval to match #803's sliding-window + incremental-update approach. The implementation is transparent in train_gpt.py lines 1077-1101.
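A minimal sketch of the score-first, update-after loop described above (function and argument names here are hypothetical stand-ins, not the actual train_gpt.py API):

```python
def causal_chunk_eval(tokens, chunk_size, score_chunk, update_counts):
    """Score each chunk using only past counts, then fold its tokens in,
    so counts at chunk C only contain tokens from chunks 0..C-1."""
    total_loss, total_tokens = 0.0, 0
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_loss += score_chunk(chunk)   # uses only previously seen counts
        update_counts(chunk)               # now fold this chunk into the counts
        total_tokens += len(chunk)
    return total_loss / total_tokens
```

The key invariant is just the order of the two calls inside the loop: scoring always precedes updating, which is what makes the eval strictly backward-looking.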
Replace (hash_size, vocab) tables with separate context-count and full-count (context+target) flat vectors per order.

Key improvements:
- VRAM: O(num_buckets) per order, not O(hash_size × vocab): 4M buckets × 8 orders × 4 bytes × 2 = 256MB (was 460MB at 32K×1024)
- Supports 4M buckets (vs 32K) — far fewer collisions
- Orders 2-10 (was 2-7) — stronger high-order statistics
- Entropy-adaptive alpha: trust the n-gram more when the model is uncertain
- Greedy cascade backoff with min_count threshold
- Sequential causal chunk eval (all ranks identical, not sharded)
- score() method handles mixing internally

Based on PR openai#1094 (BackoffNgramMixer) by michaelwinczuk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
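A sketch of the flat-vector layout, greedy cascade backoff, and entropy-adaptive alpha described in this commit message. This is an illustrative CPU/NumPy outline under stated assumptions: class and method names are hypothetical, and Python's built-in `hash()` stands in for whatever bucket hash the GPU implementation uses.

```python
import math
import numpy as np

def adaptive_alpha(H):
    # Entropy-adaptive n-gram weight (formula from the PR description):
    # trust the n-gram more when the neural model is uncertain (high H).
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    return 0.20 + 0.55 * sigmoid(2.0 * (H - 3.0))

class FlatBackoffNgram:
    """Two flat count vectors per order: one indexed by a hash of the
    context, one by a hash of (context, target). Memory is
    O(num_buckets) per order, independent of vocab size."""

    def __init__(self, orders, num_buckets, min_count=1):
        self.orders = sorted(orders, reverse=True)  # try highest order first
        self.num_buckets = num_buckets
        self.min_count = min_count
        self.ctx = {o: np.zeros(num_buckets, np.int64) for o in self.orders}
        self.full = {o: np.zeros(num_buckets, np.int64) for o in self.orders}

    def _bucket(self, key):
        return hash(key) % self.num_buckets

    def update(self, tokens):
        for o in self.orders:
            for i in range(o - 1, len(tokens)):
                ctx = tuple(tokens[i - o + 1:i])  # length o-1 context
                self.ctx[o][self._bucket(ctx)] += 1
                self.full[o][self._bucket(ctx + (tokens[i],))] += 1

    def prob(self, context, target):
        # Greedy cascade backoff: take the highest order whose context
        # count clears min_count; otherwise fall through to lower orders.
        for o in self.orders:
            ctx = tuple(context[-(o - 1):]) if o > 1 else ()
            c = self.ctx[o][self._bucket(ctx)]
            if c >= self.min_count:
                return self.full[o][self._bucket(ctx + (target,))] / c
        return None  # no order has enough evidence; caller falls back
```

A real `score()` would then mix `p = (1-alpha)*p_neural + alpha*p_ngram` with `alpha = adaptive_alpha(H)`, falling back to the pure neural probability when `prob()` returns None.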
…s eval

Seeds: 7 (0.3948), 1337 (0.3957), 2024 (0.3969). Batched sliding-window eval with incremental n-gram updates. batch_seqs=128 for eval time compliance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kooshi Thanks for the quick look!
Summary
Key Innovation
Batched sliding-window eval with incremental n-gram updates. All ranks process ALL windows (stride=64) with batch_seqs=128 for throughput. N-gram counts update after each batch — strictly backward-looking, causal. Full 62M-token history builds incrementally as scoring progresses.
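In outline, the batched sliding-window loop might look like the sketch below. The names are hypothetical stand-ins, and this sketch folds in only the last `stride` tokens of each window so the overlap between consecutive windows is not double-counted (the first window's leading tokens are elided for simplicity):

```python
def sliding_window_eval(tokens, seq_len, stride, batch_seqs,
                        score_batch, update_counts):
    """Batched sliding-window eval: every window in a batch is scored
    before any of that batch's tokens enter the n-gram counts, so the
    counts stay strictly backward-looking while history builds up."""
    starts = range(0, len(tokens) - seq_len + 1, stride)
    windows = [tokens[s:s + seq_len] for s in starts]
    total = 0.0
    for b in range(0, len(windows), batch_seqs):
        batch = windows[b:b + batch_seqs]
        total += score_batch(batch)      # counts still exclude this batch
        for w in batch:                  # update after scoring:
            update_counts(w[-stride:])   # only the newly revealed tokens
    return total
```

With stride=64 and batch_seqs=128, each scoring step reveals at most 64×128 new tokens to the counts, and the full history accumulates across batches exactly as described above.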
Eval Stack
alpha = 0.20 + 0.55 * sigmoid(2*(H - 3.0))
p = (1 - alpha) * p_neural + alpha * p_ngram

Legality
Credits
Reproduction
Requires swarm_agents.py and kg_data.py in the same directory.

Test Plan
🤖 Generated with Claude Code