# Record: Multi-order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.9674)

**val_bpb = 0.9674** (3-seed mean, std=0.0006) | **~15.99 MB** | **No TTT**

## 3-Seed Results

| Seed | val_bpb | val_loss | Submission Size | Quantization |
|------|---------|----------|-----------------|--------------|
| 1337 | 0.96679 | 1.63238 | 15,994,366 B | int6+zstd-16 |
| 42 | 0.96703 | 1.63278 | 15,996,585 B | int6+zstd-16 |
| 7 | 0.96825 | 1.63485 | 15,988,201 B | int6+zstd-16 |
| **Mean** | **0.96736** | **1.63334** | — | — |
| **Std** | **0.00063** | — | — | — |

## Architecture

- **Model**: 11 layers, 512 dim, GQA 8H/4KV, MLP 3x expansion
- **Activation**: LeakyReLU(0.5)² (squared LeakyReLU with negative slope 0.5)
- **Attention**: XSA-all with last_n=11 (cross-sequence attention across all layers)
- **Residual**: Value Residual + Gated Attention
- **Embeddings**: SmearGate, BigramHash(4096), Partial RoPE (16/64), tied embeddings
- **Normalization**: LN Scale (learnable scale only, no bias)
- **EMA**: decay=0.997
- **Optimizer**: Muon (momentum=0.99, warmup from 0.92 over 1500 steps)
- **LR**: matrix=0.025, scalar=0.025, tied_embed=0.035

## Quantization

int6 per-row quantization + zstd-16 compression. Auto-downgrade fallback to int5 for select layers is available but was **not triggered** for any seed in this run.
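A minimal NumPy sketch of symmetric per-row int6 quantization (function names and the exact `[-31, 31]` range are assumptions; the real quantizer lives in the training script and is followed by zstd level-16 compression):

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row int6: round each row onto [-31, 31] with its own scale."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0.0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.rint(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_int6(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(512, 1536).astype(np.float32)
q, s = quantize_int6_per_row(w)
max_err = np.abs(dequantize_int6(q, s) - w).max()       # bounded by max row scale / 2
```

Per-row scales keep the rounding error proportional to each row's magnitude, which is what makes the int6 roundtrip loss penalty (1.1495 vs the fp checkpoint) small.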

## N-gram Eval Cache

The n-gram cache is an **eval-time only** technique that interpolates LM logits with n-gram statistics collected during evaluation.

### Multi-order Backoff (orders 2–7)

Instead of a single fixed n-gram order, we maintain counts for orders 2 through 7. At each position, we attempt the highest order first (7-gram). If the context has no match (count < min_count=2), we cascade down to the next lower order until a match is found or we exhaust all orders. This dramatically improves coverage compared to a fixed high-order model.
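A dictionary-backed sketch of the cascade (the real cache uses a fixed 4M-bucket hash table per GPU rather than Python dicts; class and method names here are illustrative):

```python
from collections import defaultdict

MIN_COUNT = 2   # a context observed fewer times than this counts as a miss

class BackoffNgramCache:
    """Counts for orders 2..7; lookups try the longest context first."""
    def __init__(self, max_order: int = 7):
        self.max_order = max_order
        # counts[n][context_tuple][next_token] -> count
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in range(2, max_order + 1)}

    def update(self, tokens):
        """Ingest already-scored tokens (backward-looking only)."""
        for n in self.counts:
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i:i + n - 1])
                self.counts[n][ctx][tokens[i + n - 1]] += 1

    def lookup(self, history):
        """Return (order, distribution) from the highest matching order, or None."""
        for n in range(self.max_order, 1, -1):   # 7-gram first, cascade down
            if len(history) < n - 1:
                continue
            bucket = self.counts[n].get(tuple(history[-(n - 1):]))
            if bucket is None:
                continue
            total = sum(bucket.values())
            if total < MIN_COUNT:
                continue
            return n, {tok: c / total for tok, c in bucket.items()}
        return None

cache = BackoffNgramCache()
cache.update([1, 2, 3, 1, 2, 3, 1, 2, 4])
order, dist = cache.lookup([9, 1, 2])   # (9, 1, 2) unseen at order 4+, backs off to order 3
```

Here the 4-gram context `(9, 1, 2)` was never observed, so the query cascades down and resolves at order 3, where `(1, 2)` has three observations.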

### Entropy-Adaptive Alpha

Instead of a fixed interpolation weight, alpha adapts based on the model's own entropy:

```
alpha = ent_base + ent_range * sigmoid(2 * (H - 4.0))
= 0.05 + 0.55 * sigmoid(2 * (H - 4.0))
```

- **Low entropy** (model is confident): alpha → 0.05, trust the LM
- **High entropy** (model is uncertain): alpha → 0.60, trust the n-gram cache
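The schedule above can be sketched as follows (the log base of H is not stated; natural log is assumed here, and `mix` shows one plausible way the interpolation could be applied in probability space):

```python
import math

ENT_BASE, ENT_RANGE = 0.05, 0.55

def adaptive_alpha(lm_probs):
    """Mixing weight from the model's own entropy: confident -> ~0.05, uncertain -> ~0.60."""
    H = -sum(p * math.log(p) for p in lm_probs if p > 0.0)   # nats (assumed base)
    return ENT_BASE + ENT_RANGE / (1.0 + math.exp(-2.0 * (H - 4.0)))

def mix(lm_probs, ngram_probs):
    """Interpolate the two distributions with the adaptive weight."""
    a = adaptive_alpha(lm_probs)
    return [(1.0 - a) * p + a * q for p, q in zip(lm_probs, ngram_probs)]

vocab = 1024
confident = [0.999] + [0.001 / (vocab - 1)] * (vocab - 1)   # near-zero entropy
uniform = [1.0 / vocab] * vocab                             # H = ln(1024) ~ 6.93
```

With a peaked distribution alpha stays near the 0.05 floor; at uniform entropy it saturates toward 0.60, handing most of the weight to the n-gram cache.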

### Compliance

- **Score-first, backward-looking**: n-gram counts are built from previously scored tokens only
- **No oracle selection**: alpha depends solely on the model's own output distribution (entropy), never on ground-truth labels
- **No cross-GPU sync**: each GPU maintains its own independent cache (4M buckets)

## Ablation

| Configuration | val_bpb | Delta |
|---------------|---------|-------|
| No n-gram cache (neural only) | 1.1271 | baseline |
| Fixed alpha=0.40, order=7, no backoff | 1.0336 | -0.0935 |
| Multi-order backoff (2-7) + fixed alpha=0.40 | 0.9825 | -0.1446 |
| **Multi-order backoff (2-7) + entropy-adaptive** | **0.9674** | **-0.1597** |

Entropy-adaptive alpha improves over fixed alpha by **0.0151 bpb**.

## Reproduction

```bash
cd records/track_10min_16mb/2026-03-26_Backoff_Entropy_Adaptive

# Symlink data directory
ln -sf ../../../data data

# Training (seed 1337)
SEED=1337 TTT_ENABLED=0 CANON_LAST_N=0 SWA_ENABLED=0 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 \
XSA_LAST_N=11 LEAKY_RELU=1 MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

# Eval-only with n-gram cache (uses saved model)
EVAL_ONLY="$(pwd)/final_model.pt" ITERATIONS=0 \
TTT_ENABLED=0 CANON_LAST_N=0 SWA_ENABLED=0 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
XSA_LAST_N=11 LEAKY_RELU=1 MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Credits

Built on [modded-nanogpt](https://github.com/KellerJordan/modded-nanogpt) and the following PRs:
- PR #315 (GQA + RoPE)
- PR #609 (XSA)
- PR #493 (Value Residual)
- PR #518 (Gated Attention)
- PR #413 (LeakyReLU²)
- PR #674 (SmearGate + BigramHash)
- PR #702 (Multi-order backoff concept)
W0325 14:02:27.100000 75934 torch/distributed/run.py:803]
W0325 14:02:27.100000 75934 torch/distributed/run.py:803] *****************************************
W0325 14:02:27.100000 75934 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 14:02:27.100000 75934 torch/distributed/run.py:803] *****************************************
logs/7dfa6e28-583f-42cb-b24d-fff9bf16b3e4.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:27137223
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
world_size:8 grad_accum_steps:1
sdp_backends:fa3=True cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9286 val_bpb:4.1035 train_time:0ms step_avg:0.03ms
step:1/20000 train_loss:6.9308 train_time:150ms step_avg:150.06ms
step:2/20000 train_loss:8.6115 train_time:249ms step_avg:124.59ms
step:3/20000 train_loss:7.7111 train_time:348ms step_avg:116.06ms
step:4/20000 train_loss:7.2918 train_time:446ms step_avg:111.53ms
step:5/20000 train_loss:7.0145 train_time:545ms step_avg:109.01ms
step:6/20000 train_loss:6.8949 train_time:644ms step_avg:107.40ms
step:7/20000 train_loss:6.7785 train_time:745ms step_avg:106.41ms
step:8/20000 train_loss:6.6044 train_time:846ms step_avg:105.75ms
step:9/20000 train_loss:6.2822 train_time:948ms step_avg:105.28ms
step:10/20000 train_loss:5.9656 train_time:1050ms step_avg:105.03ms
step:200/20000 train_loss:2.3412 train_time:21394ms step_avg:106.97ms
step:400/20000 train_loss:2.3885 train_time:42899ms step_avg:107.25ms
step:600/20000 train_loss:2.3141 train_time:64342ms step_avg:107.24ms
step:800/20000 train_loss:2.2158 train_time:85893ms step_avg:107.37ms
step:1000/20000 train_loss:2.2564 train_time:107395ms step_avg:107.40ms
step:1000/20000 val_loss:2.2047 val_bpb:1.3057 train_time:107400ms step_avg:107.40ms
step:1200/20000 train_loss:2.3271 train_time:128865ms step_avg:107.39ms
step:1400/20000 train_loss:2.1611 train_time:150480ms step_avg:107.49ms
step:1600/20000 train_loss:2.0540 train_time:171919ms step_avg:107.45ms
step:1800/20000 train_loss:2.1280 train_time:193452ms step_avg:107.47ms
step:2000/20000 train_loss:2.0423 train_time:214954ms step_avg:107.48ms
step:2000/20000 val_loss:2.1080 val_bpb:1.2485 train_time:214959ms step_avg:107.48ms
step:2200/20000 train_loss:2.1176 train_time:236430ms step_avg:107.47ms
step:2400/20000 train_loss:2.0447 train_time:257914ms step_avg:107.46ms
step:2600/20000 train_loss:2.0892 train_time:279472ms step_avg:107.49ms
step:2800/20000 train_loss:2.1320 train_time:301096ms step_avg:107.53ms
step:3000/20000 train_loss:2.1300 train_time:322557ms step_avg:107.52ms
step:3000/20000 val_loss:2.0608 val_bpb:1.2205 train_time:322562ms step_avg:107.52ms
step:3200/20000 train_loss:2.1391 train_time:344047ms step_avg:107.51ms
step:3400/20000 train_loss:1.9836 train_time:365561ms step_avg:107.52ms
step:3600/20000 train_loss:2.0493 train_time:387180ms step_avg:107.55ms
step:3800/20000 train_loss:2.0265 train_time:408682ms step_avg:107.55ms
step:4000/20000 train_loss:1.9238 train_time:430237ms step_avg:107.56ms
step:4000/20000 val_loss:2.0149 val_bpb:1.1933 train_time:430248ms step_avg:107.56ms
step:4200/20000 train_loss:2.0964 train_time:451717ms step_avg:107.55ms
step:4400/20000 train_loss:1.9772 train_time:473154ms step_avg:107.54ms
step:4600/20000 train_loss:1.7850 train_time:494667ms step_avg:107.54ms
step:4800/20000 train_loss:2.3677 train_time:516137ms step_avg:107.53ms
step:5000/20000 train_loss:2.0391 train_time:537641ms step_avg:107.53ms
step:5000/20000 val_loss:1.9605 val_bpb:1.1611 train_time:537651ms step_avg:107.53ms
step:5200/20000 train_loss:1.9740 train_time:559013ms step_avg:107.50ms
step:5400/20000 train_loss:1.9794 train_time:580533ms step_avg:107.51ms
step:5582/20000 val_loss:1.9308 val_bpb:1.1436 train_time:600087ms step_avg:107.50ms
stopping_early: wallclock_cap train_time:600087ms step:5582/20000
peak memory allocated: 22472 MiB reserved: 22518 MiB
ema:applying EMA weights
Serialized model: 106498817 bytes
Code size: 88158 bytes
quant_try int6 zstd-16: 15906208 bytes (limit 15911842)
Serialized model quant+zstd-16: 15906208 bytes
Total submission size: 15994366 bytes
final_int6_roundtrip val_loss:1.9408 val_bpb:1.1495 eval_time:7858ms
final_int6_roundtrip_exact val_loss:1.94083468 val_bpb:1.14947162
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
final_int6_sliding_window val_loss:1.6324 val_bpb:0.9668 stride:64 eval_time:140015ms
final_int6_sliding_window_exact val_loss:1.63238069 val_bpb:0.96679035
W0325 14:16:23.686000 78254 torch/distributed/run.py:803]
W0325 14:16:23.686000 78254 torch/distributed/run.py:803] *****************************************
W0325 14:16:23.686000 78254 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 14:16:23.686000 78254 torch/distributed/run.py:803] *****************************************
logs/28a7acbc-a2aa-4893-8866-0c5109940199.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:27137223
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
world_size:8 grad_accum_steps:1
sdp_backends:fa3=True cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:42
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9310 val_bpb:4.1049 train_time:0ms step_avg:0.02ms
step:1/20000 train_loss:6.9318 train_time:150ms step_avg:150.10ms
step:2/20000 train_loss:8.7601 train_time:250ms step_avg:125.00ms
step:3/20000 train_loss:7.7913 train_time:349ms step_avg:116.34ms
step:4/20000 train_loss:7.2741 train_time:448ms step_avg:112.12ms
step:5/20000 train_loss:7.0822 train_time:547ms step_avg:109.39ms
step:6/20000 train_loss:6.9498 train_time:646ms step_avg:107.62ms
step:7/20000 train_loss:6.8483 train_time:746ms step_avg:106.58ms
step:8/20000 train_loss:6.6872 train_time:847ms step_avg:105.85ms
step:9/20000 train_loss:6.3449 train_time:949ms step_avg:105.43ms
step:10/20000 train_loss:6.0018 train_time:1052ms step_avg:105.17ms
step:200/20000 train_loss:2.3530 train_time:21445ms step_avg:107.23ms
step:400/20000 train_loss:2.3867 train_time:42987ms step_avg:107.47ms
step:600/20000 train_loss:2.3078 train_time:64408ms step_avg:107.35ms
step:800/20000 train_loss:2.2126 train_time:85962ms step_avg:107.45ms
step:1000/20000 train_loss:2.2532 train_time:107407ms step_avg:107.41ms
step:1000/20000 val_loss:2.2021 val_bpb:1.3042 train_time:107411ms step_avg:107.41ms
step:1200/20000 train_loss:2.3292 train_time:128857ms step_avg:107.38ms
step:1400/20000 train_loss:2.1627 train_time:150454ms step_avg:107.47ms
step:1600/20000 train_loss:2.0558 train_time:171992ms step_avg:107.50ms
step:1800/20000 train_loss:2.1278 train_time:193592ms step_avg:107.55ms
step:2000/20000 train_loss:2.0468 train_time:215169ms step_avg:107.58ms
step:2000/20000 val_loss:2.1095 val_bpb:1.2493 train_time:215174ms step_avg:107.59ms
step:2200/20000 train_loss:2.1208 train_time:236695ms step_avg:107.59ms
step:2400/20000 train_loss:2.0460 train_time:258250ms step_avg:107.60ms
step:2600/20000 train_loss:2.0917 train_time:279858ms step_avg:107.64ms
step:2800/20000 train_loss:2.1327 train_time:301468ms step_avg:107.67ms
step:3000/20000 train_loss:2.1337 train_time:322996ms step_avg:107.67ms
step:3000/20000 val_loss:2.0621 val_bpb:1.2213 train_time:323001ms step_avg:107.67ms
step:3200/20000 train_loss:2.1418 train_time:344507ms step_avg:107.66ms
step:3400/20000 train_loss:1.9845 train_time:366008ms step_avg:107.65ms
step:3600/20000 train_loss:2.0518 train_time:387601ms step_avg:107.67ms
step:3800/20000 train_loss:2.0260 train_time:409111ms step_avg:107.66ms
step:4000/20000 train_loss:1.9265 train_time:430654ms step_avg:107.66ms
step:4000/20000 val_loss:2.0153 val_bpb:1.1936 train_time:430658ms step_avg:107.66ms
step:4200/20000 train_loss:2.0977 train_time:452156ms step_avg:107.66ms
step:4400/20000 train_loss:1.9785 train_time:473589ms step_avg:107.63ms
step:4600/20000 train_loss:1.7848 train_time:495173ms step_avg:107.65ms
step:4800/20000 train_loss:2.3657 train_time:516699ms step_avg:107.65ms
step:5000/20000 train_loss:2.0383 train_time:538248ms step_avg:107.65ms
step:5000/20000 val_loss:1.9608 val_bpb:1.1613 train_time:538252ms step_avg:107.65ms
step:5200/20000 train_loss:1.9767 train_time:559648ms step_avg:107.62ms
step:5400/20000 train_loss:1.9805 train_time:581161ms step_avg:107.62ms
step:5576/20000 val_loss:1.9313 val_bpb:1.1438 train_time:600110ms step_avg:107.62ms
stopping_early: wallclock_cap train_time:600110ms step:5576/20000
peak memory allocated: 22472 MiB reserved: 22518 MiB
ema:applying EMA weights
Serialized model: 106498817 bytes
Code size: 88158 bytes
quant_try int6 zstd-16: 15908427 bytes (limit 15911842)
Serialized model quant+zstd-16: 15908427 bytes
Total submission size: 15996585 bytes
final_int6_roundtrip val_loss:1.9416 val_bpb:1.1499 eval_time:7605ms
final_int6_roundtrip_exact val_loss:1.94160297 val_bpb:1.14992665
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
final_int6_sliding_window val_loss:1.6328 val_bpb:0.9670 stride:64 eval_time:140935ms
final_int6_sliding_window_exact val_loss:1.63278454 val_bpb:0.96702954