24 commits
cc6a63e
Add LongContext4096 + QAT Int4 16L experiments
FlashyFlash3011 Mar 21, 2026
538bfa6
Fix warmdown and rope_base in LongContext4096 script
FlashyFlash3011 Mar 22, 2026
54b969a
Add LongContext4096 + Full SOTA stack experiment (2-line diff from PR…
FlashyFlash3011 Mar 24, 2026
e4ee89f
Add 4 experiments: fix QAT mismatch, add int6/int4 bank-QAT scripts
FlashyFlash3011 Mar 25, 2026
15e6d9e
Add run_experiments.sh and reset RESULTS.md with correct iteration co…
FlashyFlash3011 Mar 25, 2026
ee26d04
Fix BASE path in run_experiments.sh
FlashyFlash3011 Mar 25, 2026
b7eb0ed
Fix lzma preset, TTT stride, add QAT exp to run script
FlashyFlash3011 Mar 25, 2026
b688621
results: 2026-03-25_LongContext4096_Int6_QAT seed1337
Mar 26, 2026
b937791
add recompress_l9.py utility
FlashyFlash3011 Mar 26, 2026
12edb34
exp6: Int6_QAT_2048 — same as Exp5 but ctx=2048 for size+speed fix
FlashyFlash3011 Mar 26, 2026
0b6146d
exp6: full bank QAT + submission.json
FlashyFlash3011 Mar 26, 2026
b992b40
exp7: Int6_QAT_2048_LateBank — late bank QAT + MLP_MULT=2.75 + 2048 ctx
FlashyFlash3011 Mar 26, 2026
b371be9
results: 2026-03-26_Int6_QAT_2048_LateBank seed1337
Mar 26, 2026
3c73097
fix: clamp QAT range -32->-31 to match export symmetric range
FlashyFlash3011 Mar 26, 2026
8ccc5d2
reset: remove failed exps, add BankQAT_2048train_4096eval (Option B)
FlashyFlash3011 Mar 26, 2026
4c492e6
exp: LongContext4096 + BankQAT + GatedAttn + ValueResid + zstd-22
FlashyFlash3011 Mar 26, 2026
97d4cda
fix: lzma-9 compression, TTT epochs=1/lr=0.001/freeze=4 to prevent fo…
FlashyFlash3011 Mar 26, 2026
1cc698c
tune: bank_qat_threshold 0.15->0.05 (less warmdown noise), target_mb …
FlashyFlash3011 Mar 26, 2026
74c4ce7
exp: 2026-03-27_GPTQLite_QAT_MaxLZMA_LegalTTT
FlashyFlash3011 Mar 27, 2026
d1563e1
exp: 2026-03-27_GPTQLite_QAT_MaxLZMA_LegalTTT — lzma-9, bank_qat=0.05…
FlashyFlash3011 Mar 27, 2026
4355194
fix: add git identity + save_and_push after each seed for auto-commit…
FlashyFlash3011 Mar 27, 2026
7fc776c
cleanup: remove old experiments, slim run_experiments.sh to GPTQLite …
FlashyFlash3011 Mar 27, 2026
6351d76
fix: remove TTT cosine LR decay — use constant ttt_lr across all chunks
FlashyFlash3011 Mar 28, 2026
3dd38f2
revert: restore TTT cosine LR decay
FlashyFlash3011 Mar 28, 2026
4 changes: 3 additions & 1 deletion .gitignore
@@ -8,4 +8,6 @@ data/manifest.json
data/docs_selected.jsonl
.mypy_cache/
.venv
logs/
logs/
*.pt
*.ptz
68 changes: 68 additions & 0 deletions RESULTS.md
@@ -0,0 +1,68 @@
# Experiment Results

Leaderboard #1 to beat: **1.1194 BPB** (must achieve < 1.1144 at p<0.01)

Training limit: 10 min | Eval limit: 10 min (separate)

---

## Experiment 1 — LongContext4096_FullSOTA
`records/track_10min_16mb/2026-03-24_LongContext4096_FullSOTA/`
`ITERATIONS=6000 WARMDOWN_ITERS=1440 EVAL_STRIDE=80`

| Seed | Steps | ms/step | Pre-TTT bpb | Post-TTT bpb | Artifact (bytes) |
|------|-------|---------|-------------|--------------|-----------------|
| 1337 | — | — | — | — | — |
| 42 | — | — | — | — | — |
| 2025 | — | — | — | — | — |
| **Mean** | | | | | |

---

## Experiment 2 — LongContext4096_Int4_16L_FullSOTA
`records/track_10min_16mb/2026-03-24_LongContext4096_Int4_16L_FullSOTA/`
`ITERATIONS=3500 WARMDOWN_ITERS=840 EVAL_STRIDE=80`

| Seed | Steps | ms/step | Pre-TTT bpb | Post-TTT bpb | Artifact (bytes) |
|------|-------|---------|-------------|--------------|-----------------|
| 1337 | — | — | — | — | — |
| 42 | — | — | — | — | — |
| 2025 | — | — | — | — | — |
| **Mean** | | | | | |

---

## Experiment 3 — LongContext4096_Int4_BankQAT (Risky)
`records/track_10min_16mb/2026-03-25_LongContext4096_Int4_BankQAT/`
`ITERATIONS=3500 WARMDOWN_ITERS=840 EVAL_STRIDE=80`

| Seed | Steps | ms/step | Pre-TTT bpb | Post-TTT bpb | Artifact (bytes) |
|------|-------|---------|-------------|--------------|-----------------|
| 1337 | — | — | — | — | — |
| 42 | — | — | — | — | — |
| 2025 | — | — | — | — | — |
| **Mean** | | | | | |

---

## Experiment 4 — LongContext4096_Int6_QAT (Safe)
`records/track_10min_16mb/2026-03-25_LongContext4096_Int6_QAT/`
`ITERATIONS=6000 WARMDOWN_ITERS=1440 EVAL_STRIDE=80`

| Seed | Steps | ms/step | Pre-TTT bpb | Post-TTT bpb | Artifact (bytes) |
|------|-------|---------|-------------|--------------|-----------------|
| 1337 | — | — | — | — | — |
| 42 | — | — | — | — | — |
| 2025 | — | — | — | — | — |
| **Mean** | | | | | |

---

## Summary

| Experiment | Mean Post-TTT BPB | Beat #1? | Artifact |
|------------|------------------|----------|---------|
| 1. LongContext4096_FullSOTA | — | — | — |
| 2. LongContext4096_Int4_16L | — | — | — |
| 3. LongContext4096_Int4_BankQAT | — | — | — |
| 4. LongContext4096_Int6_QAT | — | — | — |
@@ -0,0 +1,105 @@
# Ultimate: GatedAttention + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)

**Target val_bpb: < 1.1144** (to beat competition leaderboard #1 by 0.005)
**Base: 2026-03-23_LeakyReLU_LegalTTT_ParallelMuon → 1.1194 BPB**

## Results (8×H100 80GB SXM) — TBD after run

| Seed | step_avg | steps | Pre-TTT bpb | **Post-TTT bpb** | TTT gain | TTT time | Artifact |
|------|----------|-------|-------------|-----------------|----------|----------|----------|
| 1337 | — | — | — | **—** | — | — | — |
| 42 | — | — | — | **—** | — | — | — |
| 2025 | — | — | — | **—** | — | — | — |
| **Mean** | — | — | — | **—** | — | — | — |

## Changes vs 2026-03-23 Base (1.1194 BPB)

| Change | Why | Expected Δ BPB |
|--------|-----|----------------|
| **GatedAttention=True** | Per-head sigmoid gate (PR #841, nearly no-op init: weight=0, bias=4.0). Already implemented; this change only enables it by default. | -0.002 to -0.005 |
| **ValueResidual=True** | Mixes the layer-0 value v0 into all subsequent layers (PR #841, init: λ=[0.5,0.5]). Already implemented; enabled by default here. | included above |
| **QAT_ENABLED=True from step 1** | The model trains with int6 fake-quant noise for all ~7000 steps instead of only the last 175 (5% of warmdown). Full quantization adaptation. | -0.001 to -0.003 |
| **LATE_QAT_THRESHOLD=0.05** | CastedLinear QAT (for non-bank params) activates only in the final 5% of warmdown, keeping quantization noise out of the critical fine-tuning phase. | included above |
| **lzma preset=6 → 9** | CLAUDE.md: lzma-9 is 8-15% smaller than lzma-6 on int6 weights. Frees ~1.3MB artifact budget. | ~0 (enables BigramHash upgrade) |
| **BigramHash 1536 → 2048** | 2026-03-23 downgraded from 2048→1536 to fit 16MB with lzma-6. lzma-9 savings enable going back to 2048. Ablation: 2048→3072 was -0.0009; expect similar gain here. | -0.001 to -0.002 |
| **Total expected gain** | | **-0.004 to -0.010** → ~1.109 to 1.115 BPB |
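
The int6 fake-quant noise described above can be sketched numerically. This is a minimal per-tensor sketch, not the repo's implementation: the real CastedLinear QAT may scale per-row (as the GPTQ-lite export does) and would round through a straight-through estimator so gradients still flow.

```python
import numpy as np

def fake_quant_int6(w, n_levels=31):
    """Symmetric int6 fake-quantization (sketch): round weights onto the
    [-31, 31] integer grid, then dequantize immediately so the forward
    pass sees the same noise the exported model will have. The clamp to
    -31 (not -32) matches the symmetric export range fixed in commit
    3c73097."""
    scale = np.abs(w).max() / n_levels
    if scale == 0.0:
        return w
    q = np.clip(np.round(w / scale), -n_levels, n_levels)
    return q * scale
```

With `QAT_ENABLED=1` this noise is injected from step 1, so the weights the optimizer converges to are already robust to the int6 grid, rather than being quantized once at export time.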

## Preserved from 2026-03-23 (unchanged)

| Feature | Setting |
|---------|---------|
| **Architecture** | 11L, 512d, 8H, 4KV, 3× MLP |
| **Activation** | **LeakyReLU(0.5)²** — hardcoded, always on |
| **Optimizer** | **Parallel Muon + Parameter Banking** — unchanged |
| **EMA** | decay=0.997, applied at end of training |
| **SWA** | every 50 steps when LR scale < 0.2 |
| **Sliding window eval** | stride=64, seq_len=2048 |
| **Legal TTT** | score-first, 3 epochs, freeze=0, lr=0.002, SGD+momentum(0.9) |
| **BigramHash** | vocab=**2048**, dim=128 (restored from 1536) |
| **XSA** | Last 4 layers |
| **Partial RoPE** | 16/64 dims, NTK scaling |
| **LN Scale** | 1/√(layer+1) |
| **VE** | dim=128, layers 9,10 |
| **Quantization** | GPTQ-lite int6, per-row clip search, **lzma-9** |
| **Training** | TRAIN_SEQ_LEN=2048, EVAL_STRIDE=64, WARMDOWN_ITERS=3500 |

## Training Architecture

PR #414 + PR #399 + PR #841 stack:

| Component | Setting |
|-----------|---------|
| Layers | 11 (512d, 8H, 4KV) |
| MLP | 3× with **LeakyReLU(0.5)²** |
| **GatedAttention** | Per-head sigmoid gate (NEW, weight=0, bias=4.0 init) |
| **ValueResidual** | Layer-0 value injection λ=[0.5,0.5] (NEW) |
| BigramHash | **2048** vocab, 128 dim (restored from 1536) |
| XSA | Last 4 layers |
| RoPE | Partial (16/64 dims), NTK dynamic scaling |
| LN Scale | 1/√(layer+1) |
| VE128 | Layers 9-10 |
| Weight avg | EMA(0.997) + Tight SWA(every 50) |
| **QAT** | **Full from step 1** (int6 fake-quant on CastedLinear) |
| Quantization | GPTQ-lite int6 + **lzma-9** |
| Optimizer | Parameter Banking + Parallel Muon |
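
The BigramHash row above amounts to hashing each (previous, current) token pair into one of 2048 buckets, each indexing a row of a learned 128-dim embedding table. A sketch of the bucketing step, where the multiplier and the padding of position 0 are illustrative assumptions rather than the exact scheme in `train_gpt.py`:

```python
import numpy as np

def bigram_bucket_ids(token_ids, n_buckets=2048, mult=1_000_003):
    """Map each (previous, current) token pair to a bucket in
    [0, n_buckets). Each bucket selects a row of a (2048, 128)
    embedding table combined with the ordinary token embedding."""
    prev = np.empty_like(token_ids)
    prev[0] = 0                      # first token has no predecessor
    prev[1:] = token_ids[:-1]
    return (prev * mult + token_ids) % n_buckets
```

Because the table is hashed, collisions between rare bigrams are expected; growing the vocab from 1536 back to 2048 simply lowers the collision rate at the cost of artifact bytes that lzma-9 frees up.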

## Run Command

```bash
# All changed defaults are now baked in; no env var overrides needed for the 4 improvements.
# Explicit env vars shown for transparency / to allow override:
NUM_LAYERS=11 XSA_LAST_N=4 \
EMA_ENABLED=1 SWA_ENABLED=1 SWA_EVERY=50 \
ROPE_DIMS=16 LN_SCALE=1 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
GATED_ATTENTION=1 VALUE_RESIDUAL=1 \
QAT_ENABLED=1 LATE_QAT_THRESHOLD=0.05 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee train_seed1337.log
```
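
The "score-first" TTT protocol preserved from the base recipe can be sketched as a loop: each chunk contributes to the reported bpb *before* the model is updated on it, so no score ever uses its own chunk's labels. `score_fn` and `update_fn` are stand-ins for the real forward pass and SGD step:

```python
def score_first_ttt(chunks, score_fn, update_fn, epochs=3):
    """Score each eval chunk with the weights as they currently stand,
    then adapt on that chunk for `epochs` passes (this run uses 3 epochs
    of SGD, momentum 0.9, lr 0.002, no frozen blocks). Returns the mean
    of the pre-update scores."""
    losses = []
    for chunk in chunks:
        losses.append(score_fn(chunk))   # counted toward bpb first
        for _ in range(epochs):
            update_fn(chunk)             # then train on the same chunk
    return sum(losses) / len(losses)
```

Later chunks still benefit from adaptation on earlier ones, which is where the "TTT gain" column in the results table comes from.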

## Timing Budget (estimated)

| Phase | Time |
|-------|------|
| Training | 600s |
| EMA apply + int6 roundtrip eval | ~30s |
| Sliding window eval (2048, stride=64) | ~74s |
| lzma-9 compression (slower than lzma-6) | ~45s |
| Legal TTT (3ep, all blocks, 2048 ctx) | ~410s |
| **Total eval** | **~560s (< 10 min ✓)** |

## Credits

- **GatedAttention + ValueResidual**: PR #841 by @someone114514
- **LeakyReLU² activation**: PR #493 by @parinzee, PR #518 by @sofiabod
- **Optimizer (Parameter Banking + Parallel Muon)**: PR #399 by @abaybektursun
- **TTT recipe**: PR #461 by @Christopher-Lee-McClendon
- **Base model**: PR #414 by @signalrush
@@ -0,0 +1,9 @@
{
"name": "Ultimate: GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)",
"val_bpb": null,
"bytes_total": null,
"blurb": "LeakyReLU(0.5)² + GatedAttention(PR#841) + ValueResidual(PR#841) + Full QAT from step 1 + lzma-9 compression + BigramHash(2048) + Legal score-first TTT (3ep SGD, all blocks unfrozen) + Parameter Banking + Parallel Muon (PR#399). Built on PR#414 + PR#399 + PR#841 stack. Target: <1.1144 BPB.",
"author": "FlashyFlash3011",
"github_id": "FlashyFlash3011",
"date": "2026-03-27"
}