22 commits
5bcbc42
Atris Labs: attack infrastructure for Parameter Golf
keshav55 Mar 19, 2026
e58d234
v1 train_gpt.py: 10 layers, lower LR, INT6 mixed precision, eval seq len
keshav55 Mar 19, 2026
8642760
v2: sliding window eval (stride=64) + MLP 3x — ~0.05 BPB combined
keshav55 Mar 19, 2026
659162d
v3: SP-4096 support + Muon 0.99 + run_5seeds.sh for submission valida…
keshav55 Mar 19, 2026
fda5c8b
v4: weight sharing with per-layer scale adapters
keshav55 Mar 19, 2026
d9f0e80
v5: QAT via fake quantization with STE in CastedLinear
keshav55 Mar 19, 2026
7904151
fix: MLP 3x + 10 layers exceeds 16MB, fix layer_scales optimizer bug
keshav55 Mar 20, 2026
82e3dba
v6: SWA, weight decay, gradient clipping — match top 3 techniques
keshav55 Mar 20, 2026
d2f6c16
v7: BigramHash + SmearGate — match techniques from #1 and #2
keshav55 Mar 20, 2026
036cb6b
v8: match winner's recipe — Int5 MLP, zstd-22, 3 critical bug fixes
keshav55 Mar 21, 2026
e12a20d
fix: increase Modal timeout to 3600s, reduce to 3 train shards, add f…
keshav55 Mar 22, 2026
2257733
fix: lean dev config for 1xH100 — disable all heavy features
keshav55 Mar 22, 2026
503de01
8xH100 Modal runner for submission
keshav55 Mar 22, 2026
f72c225
fix: 7200s timeout, 5 warmup steps (torch.compile overhead)
keshav55 Mar 22, 2026
63a168b
fix: stream stdout live instead of capture_output (see logs in real t…
keshav55 Mar 22, 2026
c157ce7
fix: disable periodic val (sliding window eval costs 2min per call, w…
keshav55 Mar 22, 2026
31811bc
fix: enable sliding window eval for final eval (stride=64), keep peri…
keshav55 Mar 23, 2026
11bc618
Record: Atris Labs v8 — val_bpb=1.2015, 10L MLP3x Int5/Int6 + BigramH…
keshav55 Mar 23, 2026
b40f038
update: val_bpb=1.1820 (improved from 1.2015 with EVAL_SEQ_LEN=2048)
keshav55 Mar 23, 2026
a751019
update: val_bpb=1.1810 (best run, improved from 1.1820)
keshav55 Mar 23, 2026
d12793b
update: 2-seed validation — mean val_bpb=1.1807 (seed42: 1.1803, seed…
keshav55 Mar 23, 2026
a8ef146
update: 3-seed validation complete — mean val_bpb=1.1807 (std=0.0004,…
keshav55 Mar 23, 2026
194 changes: 194 additions & 0 deletions atris/ATTACK_PLAN.md
# Parameter Golf — Attack Plan

**Target:** Beat 1.2244 BPB → sub-1.20 BPB → sub-1.18 BPB → absolute minimum
**Constraint:** 16,000,000 bytes (code + int8+zlib model), 10 min on 8xH100 SXM
**Metric:** `final_int8_zlib_roundtrip val_bpb` (lower = better)
**Deadline:** April 30, 2026

---

## Current State

| Entry | BPB | Δ vs naive |
|-------|-----|-------------|
| Naive Baseline (10 min) | 1.2244 | — |
| 4-Hour Baseline | 1.2074 | -0.017 |
| **Our target** | **< 1.18** | **-0.044** |

Baseline config: 9 layers, 512 dim, 8 heads, 4 KV heads, 1024 vocab, tied embeddings, Muon optimizer, ~15.86MB artifact.

---

## Attack Vectors (Ordered by Expected Impact)

### A. Architecture — More Capacity Per Parameter

#### A1. Weight Sharing / Depth Recurrence (HIGH IMPACT)
- Share transformer blocks across layers. 3 unique blocks × 3 repeats = 9 effective layers, 1/3 the parameters.
- Universal Transformer style: same block repeated with layer-specific lightweight adapters (scalars/biases only).
- Freed parameters → wider model or more unique blocks.
- **Bonus:** Shared weights compress better under zlib (repetitive patterns). Double win.
- **Experiment:** Start with full sharing (1 block × 9), then 3×3, then 2 shared + 1 unique per position.
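
A minimal sketch of the 3-unique × 3-repeat scheme with per-layer scalar adapters. The block internals here are a stand-in MLP, not the baseline's full attention+MLP block; `SharedStack` and its names are illustrative:

```python
import torch
import torch.nn as nn

class SharedStack(nn.Module):
    """3 unique blocks, each applied 3 times -> 9 effective layers.
    A learned residual scale per effective layer is the cheap adapter."""
    def __init__(self, dim, n_unique=3, n_repeat=3):
        super().__init__()
        # stand-in block; the real train_gpt.py would share full transformer blocks
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
            for _ in range(n_unique)
        )
        self.layer_scales = nn.Parameter(torch.ones(n_unique * n_repeat))

    def forward(self, x):
        for i in range(self.layer_scales.numel()):
            x = x + self.layer_scales[i] * self.blocks[i % len(self.blocks)](x)
        return x

shared = SharedStack(512, n_unique=3, n_repeat=3)   # 3 x 3 sharing
full = SharedStack(512, n_unique=9, n_repeat=1)     # 9 fully unique layers
p_shared = sum(p.numel() for p in shared.parameters())
p_full = sum(p.numel() for p in full.parameters())
print(f"shared: {p_shared:,}  full: {p_full:,}")    # ~3x fewer weights to store
```

Both stacks do the same compute per token; only the stored (and compressed) parameter count differs.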

#### A2. Low-Rank Factorization (MEDIUM IMPACT)
- Factor Q/K/V/O projections: W = UV where U is d×r, V is r×d, r << d.
- Rank 64-128 for a 512-dim model saves significant parameters.
- Can combine with weight sharing for compound savings.
- **Experiment:** Sweep rank from 32 to 256 on attention projections.
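
The parameter arithmetic at rank 64 on one 512-dim projection, as a quick numpy sketch:

```python
import numpy as np

d, r = 512, 64              # model dim, factorization rank

full_params = d * d         # one dense projection: 262,144 weights
lowrank_params = 2 * d * r  # U (d x r) plus V (r x d): 65,536 weights, 4x fewer
print(full_params, lowrank_params)

# the forward pass becomes two thin matmuls instead of one square one
rng = np.random.default_rng(0)
U = rng.normal(0, 1 / np.sqrt(d), (d, r))
V = rng.normal(0, 1 / np.sqrt(r), (r, d))
x = rng.normal(size=(8, d))
y = (x @ U) @ V             # equivalent to x @ (U @ V), but cheaper to store
```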

#### A3. Sparse MLP / Mixture of Experts (MEDIUM IMPACT)
- Replace single 2x MLP with 4 smaller experts + router.
- More total capacity, same active parameters per token.
- **Risk:** Router overhead, load balancing complexity within 10 min.
- **Experiment:** 2 experts first (simplest), then 4.

#### A4. Sub-Quadratic Attention (LOW IMPACT at 1024 seq len)
- Linear attention, sliding window, etc.
- At seq_len=1024, quadratic attention is fine. Skip unless going longer.

### B. Compression — More Model Per Byte

#### B1. Quantization-Aware Training (HIGH IMPACT)
- Train with fake quantization in the loop. Model learns to be robust to quantization.
- INT8 QAT → INT4 QAT → ternary/binary.
- Current post-hoc INT8 loses ~0.007 BPB (1.2172 → 1.2244). QAT can eliminate this.
- **Experiment:** Add STE (straight-through estimator) for INT8 first, then push to INT4.
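
A minimal STE fake-quant sketch for symmetric per-tensor INT8; how this hooks into the baseline's CastedLinear is an assumption:

```python
import torch

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Forward: snap weights to the int8 grid. Backward: gradients pass
    straight through the rounding (the straight-through estimator)."""
    scale = w.detach().abs().max().clamp(min=1e-8) / 127.0
    w_q = (w / scale).round().clamp(-128, 127) * scale
    return w + (w_q - w).detach()  # value of w_q, gradient of w

w = torch.randn(64, 64, requires_grad=True)
fake_quant_int8(w).sum().backward()
print(torch.allclose(w.grad, torch.ones_like(w)))  # True: STE is identity in backward
```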

#### B2. BitNet / Ternary Weights (HIGH IMPACT)
- 1.58-bit weights {-1, 0, 1}. Massive compression.
- Recent papers show competitive quality at scale.
- Combined with zlib, ternary weights compress extremely well.
- **Experiment:** Replace linear layers with ternary, keep embeddings/norms in higher precision.
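
A sketch of the absmean ternary quantizer, following the BitNet b1.58 recipe; the exact integration into the baseline's linear layers is left open:

```python
import numpy as np

def ternary_quant(w):
    """Quantize to {-1, 0, +1} with a per-tensor absmean scale."""
    scale = np.abs(w).mean() + 1e-8
    return np.clip(np.round(w / scale), -1, 1).astype(np.int8), scale

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512)).astype(np.float32)
w_t, scale = ternary_quant(w)

# three symbols -> ~1.58 bits/weight before zlib, and the many exact
# zeros make the byte stream highly compressible on top of that
print(f"zeros: {np.mean(w_t == 0):.0%}")
```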

#### B3. Structured Pruning + Quantization (MEDIUM IMPACT)
- Train full model, prune channels/heads, then quantize.
- Or train with L1 regularization to encourage sparsity, then prune.

#### B4. Better Compression Algorithm (LOW-MEDIUM IMPACT)
- Replace zlib with zstd (better ratio, same speed) or lzma (best ratio, slower).
- Custom weight encoding: delta coding between layers (especially with weight sharing).
- **Check:** Does the submission format require zlib specifically? → No, just needs to fit in 16MB.
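
zstd itself needs the third-party `zstandard` package, but the direction of the claim is easy to sanity-check with stdlib codecs on int8-like weight bytes; lzma's range coder tracks the byte distribution more tightly than zlib's Huffman codes, at the cost of speed:

```python
import lzma
import zlib

import numpy as np

# int8-quantized weights: roughly normal, clipped to [-128, 127]
rng = np.random.default_rng(0)
raw = np.clip(rng.normal(0, 30, 1_000_000).round(), -128, 127).astype(np.int8).tobytes()

z = len(zlib.compress(raw, 9))
x = len(lzma.compress(raw, preset=9))
print(f"zlib: {z / len(raw):.3f} of raw, lzma: {x / len(raw):.3f} of raw")
```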

### C. Training Efficiency — More Learning Per Minute

#### C1. Learning Rate / Schedule Optimization (MEDIUM IMPACT)
- Current: linear warmdown. Try cosine, cosine with warm restarts.
- Higher peak LR with aggressive warmdown.
- Per-layer LR scaling.
- **Experiment:** Sweep LR 2x up and 2x down, try cosine schedule.

#### C2. Batch Size / Sequence Length (MEDIUM IMPACT)
- Current: 524K tokens/step, 1024 seq len.
- Larger batch = fewer steps but more stable gradients.
- Shorter sequence (512) = more steps per minute but less context.
- **Experiment:** Try 256K and 1M batch sizes, try 512 and 2048 seq len.

#### C3. Muon Optimizer Tuning (LOW-MEDIUM IMPACT)
- momentum, backend_steps, warmup parameters.
- Newton-Schulz iteration count (currently 5 in backend, 10 in function).
- **Experiment:** Sweep momentum 0.9-0.99, backend_steps 3-7.

#### C4. Data Ordering / Curriculum (LOW IMPACT)
- Sort training data by difficulty (shorter/simpler documents first).
- **Risk:** Fixed shards make this hard without preprocessing.

### D. Evaluation Tricks — Better Score Without Better Model

#### D1. Longer Context at Eval (HIGH IMPACT, LOW EFFORT)
- They explicitly allow eval at any sequence length.
- Train on 1024, eval on 2048 or 4096. More context = better predictions.
- RoPE extrapolation or NTK-aware scaling for longer eval.
- **Experiment:** Just raise the eval seq len (adjusting VAL_BATCH_SIZE to fit). Might get 0.01+ BPB for free.
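
Sliding-window eval (the stride=64 variant already on the leaderboard) gives every scored token long left context and is mostly index bookkeeping; a sketch with illustrative names:

```python
def sliding_windows(n_tokens: int, seq_len: int = 1024, stride: int = 64):
    """Yield (window_start, score_start, score_end): run the model on
    tokens[window_start:score_end] but only count loss on
    tokens[score_start:score_end], so each scored token sees up to
    seq_len - stride tokens of context."""
    pos = 0
    while pos < n_tokens:
        score_end = min(pos + stride, n_tokens)
        window_start = max(0, score_end - seq_len)
        yield window_start, pos, score_end
        pos = score_end

wins = list(sliding_windows(4096, seq_len=1024, stride=64))
w0, s0, e0 = wins[-1]
print(s0 - w0)  # 960 tokens of context before the last scored span
```

The cost is rerunning overlapping context for every stride-sized chunk, which is why this eval is slow (see the commit log above).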

#### D2. Test-Time Training (HIGH IMPACT, COMPLEX)
- Fine-tune on the validation prefix before predicting next tokens.
- Eval gets its own 10-minute budget, separate from training. That's a LOT of test-time compute.
- **Experiment:** Online SGD on val data during eval pass.
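
The test-time-training idea in one loop: take a few gradient steps on the already-seen prefix before scoring the next span. The model, split points, and step count here are stand-ins:

```python
import torch
import torch.nn.functional as F

# stand-in next-token model: embedding + linear head
vocab, dim = 1024, 64
model = torch.nn.Sequential(torch.nn.Embedding(vocab, dim),
                            torch.nn.Linear(dim, vocab))
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab, (512,))
prefix, target_span = tokens[:256], tokens[256:]

# a few online SGD steps on the prefix the eval has already revealed
for _ in range(3):
    loss = F.cross_entropy(model(prefix[:-1]), prefix[1:])
    opt.zero_grad()
    loss.backward()
    opt.step()

# then score the held-out span with the adapted weights
with torch.no_grad():
    span_loss = F.cross_entropy(model(target_span[:-1]), target_span[1:])
print(span_loss.item())
```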

#### D3. Ensembling (MEDIUM IMPACT)
- Train 2-3 models with different seeds, average predictions.
- Must fit ALL models in 16MB → only viable with very small individual models.
- Or: train one model, create pseudo-ensemble via dropout at eval time.

### E. Tokenizer — Different Encoding Efficiency

#### E1. Vocab Size Sweep (MEDIUM IMPACT)
- 1024 is tiny. Each token encodes only a few bytes.
- 2048 or 4096 vocab: fewer tokens to predict, but larger embedding table.
- BPB is tokenizer-agnostic, so bigger vocab helps IF the model can learn the embeddings.
- **Experiment:** Try 512, 2048, 4096 with appropriate model size adjustments.
- **Risk:** They scrutinize tokenizer changes closely. Must be airtight.

---

## Autoresearch Loop Design

```
┌─────────────────────────────────────────────┐
│ AUTORESEARCH │
│ │
│ 1. Read ATTACK_PLAN.md + past results │
│ 2. Pick highest-impact untested idea │
│ 3. Modify train_gpt.py │
│ 4. Run: torchrun --nproc_per_node=8 │
│ train_gpt.py (10 min cap) │
│ 5. Read final_int8_zlib_roundtrip val_bpb │
│ 6. If improved ≥ 0.001: KEEP, log result │
│ If regressed: REVERT, log negative │
│ 7. Repeat │
│ │
│ Cost: ~$3.30/experiment (8xH100 @ $20/hr) │
│ Rate: ~5 experiments/hour │
│ Budget: $500 = ~150 experiments │
└─────────────────────────────────────────────┘
```

---

## Phase Plan

### Phase 1: Foundation (Days 1-3)
- [x] Clone repo, read baseline code
- [x] Map attack vectors
- [ ] Reproduce baseline on 1xH100 (verify ~1.22 BPB)
- [ ] Set up autoresearch harness
- [ ] Apply for compute grant
- [ ] Run MLX smoke tests locally for fast iteration on arch ideas

### Phase 2: Low-Hanging Fruit (Days 4-7)
- [ ] Eval at longer sequence length (D1) — potentially free BPB
- [ ] LR / schedule sweep (C1)
- [ ] Muon hyperparameter sweep (C3)
- [ ] QAT implementation (B1) — eliminate the 0.007 BPB quant loss

### Phase 3: Architecture Innovation (Days 8-14)
- [ ] Weight sharing experiments (A1)
- [ ] Low-rank attention (A2)
- [ ] Vocab size sweep (E1)
- [ ] BitNet/ternary exploration (B2)

### Phase 4: Advanced Techniques (Days 15-25)
- [ ] Test-time training (D2)
- [ ] MoE sparse MLP (A3)
- [ ] Compound improvements — stack all winners
- [ ] Population-based search (ARTEMIS-style) on top performers

### Phase 5: Polish & Submit (Days 26-43)
- [ ] Stack all winning changes
- [ ] Run 5+ seeds for statistical significance
- [ ] Write submission README
- [ ] Submit PR

---

## Key Insights From Our Research

1. **Karpathy's autoresearch** found 20 improvements on a "well-tuned" codebase. The baseline here is explicitly "not SOTA" — there's likely 50+ improvements waiting.

2. **The 5-minute rule transfers.** 10 min fixed budget = identical compute per experiment. Improvements that work here genuinely extract more from same compute.

3. **Weight sharing + quantization = double compression.** Shared weights have identical byte patterns → zlib compresses them to nearly zero. This is the architectural insight most people will miss.

4. **Eval tricks are legal and encouraged.** "We encourage competitors to push the bounds of evaluation methods as aggressively as with training methods." Test-time training with the separate 10-min eval budget is the sleeper weapon.

5. **The scoring gap is small.** 0.005 BPB to set a new record. That's achievable with a single good idea.
115 changes: 115 additions & 0 deletions atris/INTEL.md
# Competitive Intelligence — Updated 2026-03-20 (Cycle 9)

## OFFICIAL LEADERBOARD (14 merged entries!)

| Rank | BPB | Author | Key Techniques | PR |
|------|-----|--------|----------------|----|
| **1** | **1.1428** | thwu1 | Int5-MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 | #180 |
| 2 | 1.1458 | Raahil Shah | SmearGate + BigramHash + MLP3x + OrthoInit + MuonWD + SWA | #162 |
| 3 | 1.1502 | aruniyer | 11L MLP3x + WD=0.04 + zstd-22 + int6 QAT | #86 |
| 4 | 1.1556 | aquariouseworkman | SmearGate + BigramHash + MLP3x + int6 STE QAT | #65 |
| 5 | 1.1586 | yahya010 | 10L int6 QAT + zstd-22 + MLP2.6x + Muon0.99 | #63 |
| 6 | 1.1630 | aquariouseworkman | Int6 blocks + int8 embed + MLP3x + sliding window | #65 |
| 7 | 1.1748 | notapplica | Spectral embed + residual mixing + sliding window | #60 |
| 8 | 1.1925 | Matthew Li | Sliding window eval stride=64 (zero training changes!) | #50 |
| 9 | 1.1928 | samacqua | Sliding window + LoRA TTT (test-time training) | #77 |
| 10 | 1.2014 | Spokane Way | 4k seq length + better hyperparams | #52 |
| 11 | 1.2060 | Spokane Way | 2048 seq length | #49 |
| 12 | 1.2147 | Nan Liu | 10L mixed int8/int6 | #39 |
| 13 | 1.2197 | Renier Velazco | FP16 tied embed + warmdown tuning | #42 |
| 14 | 1.2244 | Baseline | 9L 512d MLP2x 1024vocab | — |

## WINNING TECHNIQUES WE'RE MISSING

### 1. BigramHash (CRITICAL — used by #1 and #2)
- Hash consecutive token pairs → lookup in 4096-10240 bucket embedding table
- 128-dim bigram embeddings projected to model_dim
- Captures local bigram context (~524K params for 4096 buckets)
- Implementation: XOR hash with coprime multipliers
- **Impact: ~0.003 BPB improvement**
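
A sketch of the hashed-bigram lookup as described; the specific coprime multipliers and initialization are illustrative:

```python
import numpy as np

N_BUCKETS, BIGRAM_DIM, MODEL_DIM = 4096, 128, 512

rng = np.random.default_rng(0)
bigram_table = rng.normal(0, 0.02, (N_BUCKETS, BIGRAM_DIM)).astype(np.float32)
proj = rng.normal(0, 0.02, (BIGRAM_DIM, MODEL_DIM)).astype(np.float32)

def bigram_features(tokens: np.ndarray) -> np.ndarray:
    """Hash (prev, cur) token pairs into buckets via XOR of coprime
    multiples, look up a 128-dim embedding, project to model_dim."""
    prev = np.concatenate([[0], tokens[:-1]])             # pad position 0
    buckets = ((prev * 1000003) ^ (tokens * 999983)) % N_BUCKETS
    return bigram_table[buckets] @ proj                   # (seq, MODEL_DIM)

tokens = rng.integers(0, 1024, size=64)
feats = bigram_features(tokens)
print(bigram_table.size)  # 524288 params for the table, matching the note above
```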

### 2. SmearGate (used by #2, #4)
- Per-dimension learned gate blending current token with previous token embedding
- Applied after embedding normalization
- Only ~512 params (one gate vector per dim)
- Captures temporal continuity
- **Impact: ~0.002 BPB improvement**
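
Sketched as a module; the additive smear form below is one plausible reading of the description, and placement after embedding normalization follows the note above:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Per-dimension learned gate blending each token's embedding with the
    previous token's: x_t <- x_t + sigmoid(g) * x_{t-1}. Costs dim params."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, x):  # x: (batch, seq, dim), e.g. right after embed norm
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        return x + torch.sigmoid(self.gate) * prev

sg = SmearGate(512)
x = torch.randn(2, 16, 512)
y = sg(x)
print(sum(p.numel() for p in sg.parameters()))  # 512 parameters total
```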

### 3. SWA — Stochastic Weight Averaging (used by #1, #2)
- Collect checkpoints every 50 steps during warmdown
- Average them at the end (24+ snapshots)
- Start at 40% through training (#1) or 50% (#2)
- Zero artifact cost — just averages weights
- **Impact: ~0.002-0.003 BPB improvement**
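
The averaging itself is just a running mean over state dicts; a framework-free sketch with synthetic "checkpoints" standing in for warmdown snapshots:

```python
import numpy as np

# stand-in checkpoints: weights = optimum + SGD noise, captured every 50 steps
rng = np.random.default_rng(0)
true_w = rng.normal(size=256)
checkpoints = [{"w": true_w + rng.normal(scale=0.1, size=256)}
               for _ in range(24)]

# incremental mean -> one set of weights, zero extra artifact bytes
avg = {k: np.zeros_like(v) for k, v in checkpoints[0].items()}
for i, ckpt in enumerate(checkpoints, start=1):
    for k, v in ckpt.items():
        avg[k] += (v - avg[k]) / i   # no need to keep all 24 in memory

# averaging cancels noise around the optimum
noise_single = np.abs(checkpoints[-1]["w"] - true_w).mean()
noise_avg = np.abs(avg["w"] - true_w).mean()
print(noise_avg < noise_single)  # True
```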

### 4. Weight Decay (used by #1, #2, #3)
- WD=0.04 for Muon (decoupled: `p.data.mul_(1 - lr * wd)`)
- WD=0.01-0.04 for AdamW on embeddings/scalars
- Not in baseline at all
- **Impact: ~0.002 BPB improvement**
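
Decoupled decay is applied directly to the weights, outside whatever update the optimizer computes. The one-liner from the note above in context, with SGD standing in for Muon and a zero gradient to isolate the decay term:

```python
import torch

lr, wd = 0.02, 0.04
p = torch.nn.Parameter(torch.ones(4))
opt = torch.optim.SGD([p], lr=lr)     # stand-in for Muon

(p * 0).sum().backward()              # zero gradient -> only decay acts
p.data.mul_(1 - lr * wd)              # decoupled weight decay, pre-step
opt.step()

print(p.data)  # every entry shrank by exactly lr*wd = 0.08%
```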

### 5. Int5 Quantization (used by #1)
- MLP weights at Int5 [-16,15]: 3 zero high bits per byte
- zstd-22 compresses Int5 at 1.88x (vs 1.51x for Int6)
- Saves ~1.86MB → funds 10th layer
- **Impact: enables more params within 16MB budget**
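
The compression claim is easy to check: clamping to the Int5 range leaves redundant high bits in every stored byte, which a byte-level compressor exploits. zlib stands in for zstd-22 here (zstd is third-party), and the synthetic weight distribution is an assumption:

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(0, 40, size=1_000_000)

int8 = np.clip(np.round(raw), -128, 127).astype(np.int8).tobytes()
# rescale so weights use the Int5 range [-16, 15] instead
int5 = np.clip(np.round(raw / 8), -16, 15).astype(np.int8).tobytes()

r8 = len(int8) / len(zlib.compress(int8, 9))
r5 = len(int5) / len(zlib.compress(int5, 9))
print(f"int8 ratio: {r8:.2f}x   int5 ratio: {r5:.2f}x")  # int5 compresses harder
```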

### 6. zstd-22 instead of zlib (used by #1, #3, #5)
- Better compression ratio than zlib
- More room for parameters
- **Impact: ~0.5-1MB saved → more model capacity**

### 7. OrthoInit + muP (used by #2, #4)
- Orthogonal weight initialization
- Output projections scaled by 1/√(2·num_layers)
- Better gradient flow
- **Impact: ~0.001 BPB improvement**

### 8. Gradient Clipping (used by #1)
- grad_clip_norm=0.3 (baseline: 0.0 = disabled)
- Stabilizes training, especially with higher LR/WD

## WHAT WE HAVE vs WHAT WE NEED

| Technique | We Have? | They Have? | Gap? |
|-----------|----------|------------|------|
| 10 layers | ✅ | ✅ | — |
| Lower LR 0.02 | ✅ | ✅ | — |
| INT6 QAT | ✅ | ✅ | — |
| Sliding window eval | ✅ | ✅ | — |
| Muon 0.99 | ✅ | ✅ | — |
| Weight sharing | ✅ | ❌ | We have extra |
| MLP 3x | ✅ (config) | ✅ | — |
| **BigramHash** | ❌ | ✅ (#1,#2) | **MISSING** |
| **SmearGate** | ❌ | ✅ (#2,#4) | **MISSING** |
| **SWA** | ❌ | ✅ (#1,#2) | **MISSING** |
| **Weight Decay** | ❌ | ✅ (#1,#2,#3) | **MISSING** |
| **Int5 quant** | ❌ | ✅ (#1) | **MISSING** |
| **zstd compression** | ❌ | ✅ (#1,#3,#5) | **MISSING** |
| **OrthoInit** | ❌ | ✅ (#2,#4) | **MISSING** |
| **Gradient clip** | ❌ | ✅ (#1) | **MISSING** |

## PRIORITY IMPLEMENTATION ORDER

1. **SWA** — Zero cost, average checkpoints during warmdown. Easiest win.
2. **Weight Decay** — Add WD=0.04 to Muon, WD=0.01 to Adam. One-line changes.
3. **Gradient clipping** — Set GRAD_CLIP_NORM=0.3. Already an env var!
4. **zstd-22** — Replace zlib.compress with zstd. Small code change.
5. **BigramHash** — Need to implement hash table + projection. ~50 lines.
6. **SmearGate** — Learned gate after embeddings. ~20 lines.
7. **Int5 quantization** — Extend our INT6 to INT5 for MLP layers. ~30 lines.
8. **OrthoInit** — Change weight initialization. ~10 lines.

## REALISTIC TARGET

If we stack ALL techniques the top 3 are using:
- Base (our v5): ~1.19 BPB (sliding window + 10L + QAT + lower LR)
- + SWA: ~1.187
- + WD: ~1.185
- + BigramHash: ~1.182
- + SmearGate: ~1.180
- + Int5 + zstd: ~1.175 (more room for params)
- + OrthoInit: ~1.173

**Realistic target: 1.14-1.15 BPB** (competitive with top 3)
**To beat #1 (1.1428): need novel technique or better hyperparameter tuning**