# Record: Multi-order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.9674)

**val_bpb = 0.9674** (3-seed mean, std=0.0006) | **~15.99 MB** | **No TTT**

## 3-Seed Results

| Seed | val_bpb | val_loss | Submission Size | Quantization |
|------|---------|----------|-----------------|--------------|
| 1337 | 0.96679 | 1.63238 | 15,994,366 B | int6+zstd-16 |
| 42 | 0.96703 | 1.63278 | 15,996,585 B | int6+zstd-16 |
| 7 | 0.96825 | 1.63485 | 15,988,201 B | int6+zstd-16 |
| **Mean** | **0.96736** | **1.63334** | — | — |
| **Std** | **0.00063** | — | — | — |

## Architecture

- **Model**: 11 layers, 512 dim, GQA 8H/4KV, MLP 3x expansion
- **Activation**: LeakyReLU(0.5)² (squared LeakyReLU with negative slope 0.5)
- **Attention**: XSA-all with last_n=11 (cross-sequence attention across all layers)
- **Residual**: Value Residual + Gated Attention
- **Embeddings**: SmearGate, BigramHash(4096), Partial RoPE (16/64), tied embeddings
- **Normalization**: LN Scale (learnable scale only, no bias)
- **EMA**: decay=0.997
- **Optimizer**: Muon (momentum=0.99, warmup from 0.92 over 1500 steps)
- **LR**: matrix=0.025, scalar=0.025, tied_embed=0.035

## Quantization

int6 per-row quantization + zstd-16 compression. Auto-downgrade fallback to int5 for select layers is available but was **not triggered** for any seed in this run.
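A minimal NumPy sketch of symmetric per-row int6 quantization (function names and the exact `[-31, 31]` range are assumptions; the real quantizer lives in the training script and is followed by zstd level-16 compression):

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row int6: round each row onto [-31, 31] with its own scale."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0.0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.rint(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_int6(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(512, 1536).astype(np.float32)
q, s = quantize_int6_per_row(w)
max_err = np.abs(dequantize_int6(q, s) - w).max()       # bounded by max row scale / 2
```

Per-row scales keep the rounding error proportional to each row's magnitude, which is what makes the int6 roundtrip loss penalty (1.1495 vs the fp checkpoint) small.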

## N-gram Eval Cache

The n-gram cache is an **eval-time only** technique that interpolates LM logits with n-gram statistics collected during evaluation.

### Multi-order Backoff (orders 2–7)

Instead of a single fixed n-gram order, we maintain counts for orders 2 through 7. At each position, we attempt the highest order first (7-gram). If the context has no match (count < min_count=2), we cascade down to the next lower order until a match is found or we exhaust all orders. This dramatically improves coverage compared to a fixed high-order model.
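A dictionary-backed sketch of the cascade (the real cache uses a fixed 4M-bucket hash table per GPU rather than Python dicts; class and method names here are illustrative):

```python
from collections import defaultdict

MIN_COUNT = 2   # a context observed fewer times than this counts as a miss

class BackoffNgramCache:
    """Counts for orders 2..7; lookups try the longest context first."""
    def __init__(self, max_order: int = 7):
        self.max_order = max_order
        # counts[n][context_tuple][next_token] -> count
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in range(2, max_order + 1)}

    def update(self, tokens):
        """Ingest already-scored tokens (backward-looking only)."""
        for n in self.counts:
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i:i + n - 1])
                self.counts[n][ctx][tokens[i + n - 1]] += 1

    def lookup(self, history):
        """Return (order, distribution) from the highest matching order, or None."""
        for n in range(self.max_order, 1, -1):   # 7-gram first, cascade down
            if len(history) < n - 1:
                continue
            bucket = self.counts[n].get(tuple(history[-(n - 1):]))
            if bucket is None:
                continue
            total = sum(bucket.values())
            if total < MIN_COUNT:
                continue
            return n, {tok: c / total for tok, c in bucket.items()}
        return None

cache = BackoffNgramCache()
cache.update([1, 2, 3, 1, 2, 3, 1, 2, 4])
order, dist = cache.lookup([9, 1, 2])   # (9, 1, 2) unseen at order 4+, backs off to order 3
```

Here the 4-gram context `(9, 1, 2)` was never observed, so the query cascades down and resolves at order 3, where `(1, 2)` has three observations.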

### Entropy-Adaptive Alpha

Instead of a fixed interpolation weight, alpha adapts based on the model's own entropy:

```
alpha = ent_base + ent_range * sigmoid(2 * (H - 4.0))
= 0.05 + 0.55 * sigmoid(2 * (H - 4.0))
```

- **Low entropy** (model is confident): alpha → 0.05, trust the LM
- **High entropy** (model is uncertain): alpha → 0.60, trust the n-gram cache
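The schedule above can be sketched as follows (the log base of H is not stated; natural log is assumed here, and `mix` shows one plausible way the interpolation could be applied in probability space):

```python
import math

ENT_BASE, ENT_RANGE = 0.05, 0.55

def adaptive_alpha(lm_probs):
    """Mixing weight from the model's own entropy: confident -> ~0.05, uncertain -> ~0.60."""
    H = -sum(p * math.log(p) for p in lm_probs if p > 0.0)   # nats (assumed base)
    return ENT_BASE + ENT_RANGE / (1.0 + math.exp(-2.0 * (H - 4.0)))

def mix(lm_probs, ngram_probs):
    """Interpolate the two distributions with the adaptive weight."""
    a = adaptive_alpha(lm_probs)
    return [(1.0 - a) * p + a * q for p, q in zip(lm_probs, ngram_probs)]

vocab = 1024
confident = [0.999] + [0.001 / (vocab - 1)] * (vocab - 1)   # near-zero entropy
uniform = [1.0 / vocab] * vocab                             # H = ln(1024) ~ 6.93
```

With a peaked distribution alpha stays near the 0.05 floor; at uniform entropy it saturates toward 0.60, handing most of the weight to the n-gram cache.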

### Compliance

- **Score-first, backward-looking**: n-gram counts are built from previously scored tokens only
- **No oracle selection**: alpha depends solely on the model's own output distribution (entropy), never on ground-truth labels
- **No cross-GPU sync**: each GPU maintains its own independent cache (4M buckets)

## Ablation

| Configuration | val_bpb | Delta |
|---------------|---------|-------|
| No n-gram cache (neural only) | 1.1271 | baseline |
| Fixed alpha=0.40, order=7, no backoff | 1.0336 | -0.0935 |
| Multi-order backoff (2-7) + fixed alpha=0.40 | 0.9825 | -0.1446 |
| **Multi-order backoff (2-7) + entropy-adaptive** | **0.9674** | **-0.1597** |

Entropy-adaptive alpha improves over fixed alpha by **0.0151 bpb**.

## Reproduction

```bash
cd records/track_10min_16mb/2026-03-26_Backoff_Entropy_Adaptive

# Symlink data directory
ln -sf ../../../data data

# Training (seed 1337)
SEED=1337 TTT_ENABLED=0 CANON_LAST_N=0 SWA_ENABLED=0 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 \
XSA_LAST_N=11 LEAKY_RELU=1 MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

# Eval-only with n-gram cache (uses saved model)
EVAL_ONLY="$(pwd)/final_model.pt" ITERATIONS=0 \
TTT_ENABLED=0 CANON_LAST_N=0 SWA_ENABLED=0 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
XSA_LAST_N=11 LEAKY_RELU=1 MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Credits

Built on [modded-nanogpt](https://github.com/KellerJordan/modded-nanogpt) and the following PRs:
- PR #315 (GQA + RoPE)
- PR #609 (XSA)
- PR #493 (Value Residual)
- PR #518 (Gated Attention)
- PR #413 (LeakyReLU²)
- PR #674 (SmearGate + BigramHash)
- PR #702 (Multi-order backoff concept)
W0325 14:02:27.100000 75934 torch/distributed/run.py:803]
W0325 14:02:27.100000 75934 torch/distributed/run.py:803] *****************************************
W0325 14:02:27.100000 75934 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 14:02:27.100000 75934 torch/distributed/run.py:803] *****************************************
logs/7dfa6e28-583f-42cb-b24d-fff9bf16b3e4.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:27137223
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
world_size:8 grad_accum_steps:1
sdp_backends:fa3=True cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9286 val_bpb:4.1035 train_time:0ms step_avg:0.03ms
step:1/20000 train_loss:6.9308 train_time:150ms step_avg:150.06ms
step:2/20000 train_loss:8.6115 train_time:249ms step_avg:124.59ms
step:3/20000 train_loss:7.7111 train_time:348ms step_avg:116.06ms
step:4/20000 train_loss:7.2918 train_time:446ms step_avg:111.53ms
step:5/20000 train_loss:7.0145 train_time:545ms step_avg:109.01ms
step:6/20000 train_loss:6.8949 train_time:644ms step_avg:107.40ms
step:7/20000 train_loss:6.7785 train_time:745ms step_avg:106.41ms
step:8/20000 train_loss:6.6044 train_time:846ms step_avg:105.75ms
step:9/20000 train_loss:6.2822 train_time:948ms step_avg:105.28ms
step:10/20000 train_loss:5.9656 train_time:1050ms step_avg:105.03ms
step:200/20000 train_loss:2.3412 train_time:21394ms step_avg:106.97ms
step:400/20000 train_loss:2.3885 train_time:42899ms step_avg:107.25ms
step:600/20000 train_loss:2.3141 train_time:64342ms step_avg:107.24ms
step:800/20000 train_loss:2.2158 train_time:85893ms step_avg:107.37ms
step:1000/20000 train_loss:2.2564 train_time:107395ms step_avg:107.40ms
step:1000/20000 val_loss:2.2047 val_bpb:1.3057 train_time:107400ms step_avg:107.40ms
step:1200/20000 train_loss:2.3271 train_time:128865ms step_avg:107.39ms
step:1400/20000 train_loss:2.1611 train_time:150480ms step_avg:107.49ms
step:1600/20000 train_loss:2.0540 train_time:171919ms step_avg:107.45ms
step:1800/20000 train_loss:2.1280 train_time:193452ms step_avg:107.47ms
step:2000/20000 train_loss:2.0423 train_time:214954ms step_avg:107.48ms
step:2000/20000 val_loss:2.1080 val_bpb:1.2485 train_time:214959ms step_avg:107.48ms
step:2200/20000 train_loss:2.1176 train_time:236430ms step_avg:107.47ms
step:2400/20000 train_loss:2.0447 train_time:257914ms step_avg:107.46ms
step:2600/20000 train_loss:2.0892 train_time:279472ms step_avg:107.49ms
step:2800/20000 train_loss:2.1320 train_time:301096ms step_avg:107.53ms
step:3000/20000 train_loss:2.1300 train_time:322557ms step_avg:107.52ms
step:3000/20000 val_loss:2.0608 val_bpb:1.2205 train_time:322562ms step_avg:107.52ms
step:3200/20000 train_loss:2.1391 train_time:344047ms step_avg:107.51ms
step:3400/20000 train_loss:1.9836 train_time:365561ms step_avg:107.52ms
step:3600/20000 train_loss:2.0493 train_time:387180ms step_avg:107.55ms
step:3800/20000 train_loss:2.0265 train_time:408682ms step_avg:107.55ms
step:4000/20000 train_loss:1.9238 train_time:430237ms step_avg:107.56ms
step:4000/20000 val_loss:2.0149 val_bpb:1.1933 train_time:430248ms step_avg:107.56ms
step:4200/20000 train_loss:2.0964 train_time:451717ms step_avg:107.55ms
step:4400/20000 train_loss:1.9772 train_time:473154ms step_avg:107.54ms
step:4600/20000 train_loss:1.7850 train_time:494667ms step_avg:107.54ms
step:4800/20000 train_loss:2.3677 train_time:516137ms step_avg:107.53ms
step:5000/20000 train_loss:2.0391 train_time:537641ms step_avg:107.53ms
step:5000/20000 val_loss:1.9605 val_bpb:1.1611 train_time:537651ms step_avg:107.53ms
step:5200/20000 train_loss:1.9740 train_time:559013ms step_avg:107.50ms
step:5400/20000 train_loss:1.9794 train_time:580533ms step_avg:107.51ms
step:5582/20000 val_loss:1.9308 val_bpb:1.1436 train_time:600087ms step_avg:107.50ms
stopping_early: wallclock_cap train_time:600087ms step:5582/20000
peak memory allocated: 22472 MiB reserved: 22518 MiB
ema:applying EMA weights
Serialized model: 106498817 bytes
Code size: 88158 bytes
quant_try int6 zstd-16: 15906208 bytes (limit 15911842)
Serialized model quant+zstd-16: 15906208 bytes
Total submission size: 15994366 bytes
final_int6_roundtrip val_loss:1.9408 val_bpb:1.1495 eval_time:7858ms
final_int6_roundtrip_exact val_loss:1.94083468 val_bpb:1.14947162
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
final_int6_sliding_window val_loss:1.6324 val_bpb:0.9668 stride:64 eval_time:140015ms
final_int6_sliding_window_exact val_loss:1.63238069 val_bpb:0.96679035
W0325 14:16:23.686000 78254 torch/distributed/run.py:803]
W0325 14:16:23.686000 78254 torch/distributed/run.py:803] *****************************************
W0325 14:16:23.686000 78254 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 14:16:23.686000 78254 torch/distributed/run.py:803] *****************************************
logs/28a7acbc-a2aa-4893-8866-0c5109940199.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:27137223
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
world_size:8 grad_accum_steps:1
sdp_backends:fa3=True cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:42
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9310 val_bpb:4.1049 train_time:0ms step_avg:0.02ms
step:1/20000 train_loss:6.9318 train_time:150ms step_avg:150.10ms
step:2/20000 train_loss:8.7601 train_time:250ms step_avg:125.00ms
step:3/20000 train_loss:7.7913 train_time:349ms step_avg:116.34ms
step:4/20000 train_loss:7.2741 train_time:448ms step_avg:112.12ms
step:5/20000 train_loss:7.0822 train_time:547ms step_avg:109.39ms
step:6/20000 train_loss:6.9498 train_time:646ms step_avg:107.62ms
step:7/20000 train_loss:6.8483 train_time:746ms step_avg:106.58ms
step:8/20000 train_loss:6.6872 train_time:847ms step_avg:105.85ms
step:9/20000 train_loss:6.3449 train_time:949ms step_avg:105.43ms
step:10/20000 train_loss:6.0018 train_time:1052ms step_avg:105.17ms
step:200/20000 train_loss:2.3530 train_time:21445ms step_avg:107.23ms
step:400/20000 train_loss:2.3867 train_time:42987ms step_avg:107.47ms
step:600/20000 train_loss:2.3078 train_time:64408ms step_avg:107.35ms
step:800/20000 train_loss:2.2126 train_time:85962ms step_avg:107.45ms
step:1000/20000 train_loss:2.2532 train_time:107407ms step_avg:107.41ms
step:1000/20000 val_loss:2.2021 val_bpb:1.3042 train_time:107411ms step_avg:107.41ms
step:1200/20000 train_loss:2.3292 train_time:128857ms step_avg:107.38ms
step:1400/20000 train_loss:2.1627 train_time:150454ms step_avg:107.47ms
step:1600/20000 train_loss:2.0558 train_time:171992ms step_avg:107.50ms
step:1800/20000 train_loss:2.1278 train_time:193592ms step_avg:107.55ms
step:2000/20000 train_loss:2.0468 train_time:215169ms step_avg:107.58ms
step:2000/20000 val_loss:2.1095 val_bpb:1.2493 train_time:215174ms step_avg:107.59ms
step:2200/20000 train_loss:2.1208 train_time:236695ms step_avg:107.59ms
step:2400/20000 train_loss:2.0460 train_time:258250ms step_avg:107.60ms
step:2600/20000 train_loss:2.0917 train_time:279858ms step_avg:107.64ms
step:2800/20000 train_loss:2.1327 train_time:301468ms step_avg:107.67ms
step:3000/20000 train_loss:2.1337 train_time:322996ms step_avg:107.67ms
step:3000/20000 val_loss:2.0621 val_bpb:1.2213 train_time:323001ms step_avg:107.67ms
step:3200/20000 train_loss:2.1418 train_time:344507ms step_avg:107.66ms
step:3400/20000 train_loss:1.9845 train_time:366008ms step_avg:107.65ms
step:3600/20000 train_loss:2.0518 train_time:387601ms step_avg:107.67ms
step:3800/20000 train_loss:2.0260 train_time:409111ms step_avg:107.66ms
step:4000/20000 train_loss:1.9265 train_time:430654ms step_avg:107.66ms
step:4000/20000 val_loss:2.0153 val_bpb:1.1936 train_time:430658ms step_avg:107.66ms
step:4200/20000 train_loss:2.0977 train_time:452156ms step_avg:107.66ms
step:4400/20000 train_loss:1.9785 train_time:473589ms step_avg:107.63ms
step:4600/20000 train_loss:1.7848 train_time:495173ms step_avg:107.65ms
step:4800/20000 train_loss:2.3657 train_time:516699ms step_avg:107.65ms
step:5000/20000 train_loss:2.0383 train_time:538248ms step_avg:107.65ms
step:5000/20000 val_loss:1.9608 val_bpb:1.1613 train_time:538252ms step_avg:107.65ms
step:5200/20000 train_loss:1.9767 train_time:559648ms step_avg:107.62ms
step:5400/20000 train_loss:1.9805 train_time:581161ms step_avg:107.62ms
step:5576/20000 val_loss:1.9313 val_bpb:1.1438 train_time:600110ms step_avg:107.62ms
stopping_early: wallclock_cap train_time:600110ms step:5576/20000
peak memory allocated: 22472 MiB reserved: 22518 MiB
ema:applying EMA weights
Serialized model: 106498817 bytes
Code size: 88158 bytes
quant_try int6 zstd-16: 15908427 bytes (limit 15911842)
Serialized model quant+zstd-16: 15908427 bytes
Total submission size: 15996585 bytes
final_int6_roundtrip val_loss:1.9416 val_bpb:1.1499 eval_time:7605ms
final_int6_roundtrip_exact val_loss:1.94160297 val_bpb:1.14992665
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
final_int6_sliding_window val_loss:1.6328 val_bpb:0.9670 stride:64 eval_time:140935ms
final_int6_sliding_window_exact val_loss:1.63278454 val_bpb:0.96702954