# Record: 0.1584 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03

**val_bpb = 0.1584** (3-seed mean, std 0.0008) | **15.72-15.76 MB** | 8xH100 SXM | **No TTT**

## Results

| Seed | Steps | ms/step | Sliding BPB | **Mixer BPB** | Artifact (bytes) |
|------|-------|---------|-------------|---------------|----------|
| 42 | 4,940 | 114 | 1.1362 | **0.1575** | 15,758,015 |
| 1337 | 4,930 | 114 | 1.1353 | **0.1585** | 15,723,194 |
| 2024 | 4,937 | 114 | 1.1366 | **0.1591** | 15,724,500 |
| **Mean** | | | | **0.1584 ± 0.0008** | |
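
The aggregate row can be checked directly from the per-seed BPBs (sample standard deviation over the three seeds):

```python
import statistics

# Per-seed mixer BPB values for seeds 42, 1337, 2024 (from the metadata JSON)
seed_bpb = [0.157519, 0.158454, 0.159131]

mean = statistics.mean(seed_bpb)
std = statistics.stdev(seed_bpb)  # sample std (n-1 denominator)
print(round(mean, 4), round(std, 4))  # 0.1584 0.0008
```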

## Two Key Changes from PR #834

1. **MATRIX_LR=0.03** (was 0.025) — discovered through systematic screening of 79+ experiments
2. **TTT_EPOCHS=0** — removes test-time training entirely; evaluation performs zero gradient updates on validation data

Despite removing TTT, our 3-seed mean (0.1584) **beats** PR #834's original result (0.1663 with TTT enabled). The higher matrix LR produces a better-trained model that the learned mixing head can leverage more effectively.

## Architecture (from PR #834)

- 11L, 512d, MHA 8/8, MLP 3.5x, LeakyReLU(0.5)²
- **Learned mixer head**: `Linear(512 → 7)` predicts per-token mixing weights for neural model + n-gram orders 2-7
- **Frozen n-gram oracle**: bigram/trigram/...7-gram tables precomputed from training data, used as lookup during training
- Mixed int5/int6 quantization + GPTQ + zstd, EMA(0.997), CROWN-Q penalty
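
A minimal sketch of what the mixer head above could look like: a linear gate over the hidden state producing softmax weights for 7 experts (the neural model plus n-gram orders 2-7). Names and shapes here are illustrative assumptions, not the submission's actual code:

```python
import torch
import torch.nn as nn


class MixerHead(nn.Module):
    """Predicts per-token mixing weights over 7 experts:
    expert 0 = neural LM, experts 1-6 = frozen n-gram orders 2-7."""

    def __init__(self, d_model: int = 512, n_experts: int = 7):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)  # Linear(512 -> 7)

    def forward(self, hidden: torch.Tensor, expert_probs: torch.Tensor) -> torch.Tensor:
        # hidden:       (B, T, d_model)     transformer hidden states
        # expert_probs: (B, T, 7, V)        per-expert next-token distributions
        weights = torch.softmax(self.gate(hidden), dim=-1)          # (B, T, 7)
        mixed = (weights.unsqueeze(-1) * expert_probs).sum(dim=2)   # (B, T, V)
        return mixed  # convex combination, so still a valid distribution
```

Because the gate weights are a softmax and each expert distribution is normalized, the mixed output is itself a proper distribution over the vocabulary.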

## Eval: Learned Multi-Expert Mixing (NO TTT)

- Score-first backward-looking n-gram cache (orders 2-7)
- Model-predicted mixing weights (not fixed alpha — learned during training)
- Each token gets its own expert weights based on transformer hidden state
- **515s eval time** (within 600s budget, no TTT overhead)
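
The "score-first" property in the first bullet can be illustrated with a toy bigram cache: each token is scored using counts gathered from strictly earlier tokens, and only then added to the cache. This is a hypothetical sketch (function name, smoothing, and API are assumptions), not the submission's implementation:

```python
from collections import defaultdict


def score_first_ngram_stream(tokens, order=2, vocab_size=256, alpha=1.0):
    """Backward-looking n-gram cache (order-1 tokens of context).

    Each token is scored against the cache *before* it is inserted,
    so no token's probability ever depends on itself or the future.
    Add-alpha smoothing keeps probabilities well-defined for unseen
    contexts."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    probs = []
    for i, tok in enumerate(tokens):
        ctx = tuple(tokens[max(0, i - order + 1):i])
        # 1) score with the current (past-only) cache
        p = (counts[ctx][tok] + alpha) / (totals[ctx] + alpha * vocab_size)
        probs.append(p)
        # 2) only now add this token to the cache
        counts[ctx][tok] += 1
        totals[ctx] += 1
    return probs
```

On a repetitive stream like `[1, 2, 1, 2, 1, 2]` the cached probability of the repeated continuation rises over time (1/3, then 1/2, then 3/5 with `vocab_size=3`), which is the mechanism the mixer head learns to exploit without ever peeking ahead.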

## Reproduction

```bash
MATRIX_LR=0.03 TTT_EPOCHS=0 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Legality

- No TTT (zero gradient updates on validation data)
- N-gram cache is backward-looking (score-first, cache updated after scoring)
- Learned mixing head trained on training data only (frozen oracle)
- Single-pass evaluation

## Based On

- PR #834: Learned Multi-Expert Gate + Frozen Oracle architecture
- Our systematic hyperparameter screening (79+ experiments, MATRIX_LR=0.03 discovery)

## Record Metadata

```json
{
  "track": "10min_16mb",
  "date": "2026-03-26",
  "name": "Record: 0.1584 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03",
  "author": "bigbag",
  "github": "bigbag",
  "seed_results": {
    "42": {"val_loss": 0.265964, "val_bpb": 0.157519, "artifact_bytes": 15758015},
    "1337": {"val_loss": 0.267544, "val_bpb": 0.158454, "artifact_bytes": 15723194},
    "2024": {"val_loss": 0.268686, "val_bpb": 0.159131, "artifact_bytes": 15724500}
  },
  "mean_val_loss": 0.267398,
  "mean_val_bpb": 0.158368,
  "std_val_bpb": 0.0008,
  "code_bytes": 92093
}
```