108 commits
7256b18
docs: fractal transformer research plan — weight sharing + gravity + …
Mar 18, 2026
90a0896
results: first local ladder — fractal 3x3 beats baseline by 7.1% BPB,…
Mar 19, 2026
c96ec49
Add exact clone of PR #254 — best pending SOTA (1.1313 BPB)
Mar 21, 2026
5e31538
Add XSA last 3 layers to #254 SOTA clone
Mar 21, 2026
75ef44f
Fix XSA GQA broadcast bug — expand KV heads before manual attention
Mar 21, 2026
8f6a8cc
Add 3 SOTA improvement experiments: MTP, SwiGLU, Vocab1536
Mar 21, 2026
18c0e66
Add FarnsworthEngine v2: full improvement stack on SOTA254 base
Mar 21, 2026
d0d4775
Add FA3→FA2→SDPA fallback chain for pod restart resilience
Mar 21, 2026
82c0624
Revert FA3 fallback chain — was unauthorized code change to baseline …
Mar 21, 2026
a768362
Fix FA3 NaN: cast qkv to bf16 before FA3 call, disable dynamo DDP opt
Mar 21, 2026
ae0afa0
Add 2-seed validation scripts for exp A/B/C
Mar 21, 2026
4abe6f5
Log exp A/B results: both behind baseline, zlib fallback bug found
Mar 22, 2026
ac31eeb
Fix XSA NaN: position 0 has no valid targets when self-mask + causal …
Mar 22, 2026
c916c9f
Disable XSA in ttt_only run — manual attention too slow vs FA3
Mar 22, 2026
c68de2c
Add run_v2_ttt_noXSA.sh — TTT v2 + temp scaling, all FA3, max speed
Mar 22, 2026
27889d6
Restore XSA_LAST_N=3 in run_v2_ttt_only.sh (keep existing test intact)
Mar 22, 2026
277511f
Log v2 TTT-only + XSA=3 result: 1.1982 BPB (worse than 1.1301 baseline)
Mar 22, 2026
c269e90
Strip verbose logging from v2 train loop — match baseline format
Mar 22, 2026
bcfb475
Log v2 noXSA result: 1.1538/1.1315 BPB — TTT v2 hurt, no edge over ba…
Mar 22, 2026
2c00f0a
Log exp_a/b/c results: all worse than 1.1301 baseline, exp_c never ran
Mar 22, 2026
ee44ac0
Add exp D: TTT 8 epochs + stride 32 (eval-only improvement)
Mar 22, 2026
d1ab62e
Add SAM (Sharpness-Aware Minimization) option for TTT
Mar 22, 2026
853bce8
Add baseline reproduction script — verify 1.1303 on current FA3 build
Mar 22, 2026
56d9a79
Add SAM to baseline TTT — test sharpness-aware adaptation on proven code
Mar 22, 2026
fcf812f
Log exp D result: 1.1295 BPB — new best (-0.0008 vs baseline)
Mar 22, 2026
39064ef
Log exp D seed 42: 1.1307 BPB — confirms improvement (mean 1.1301)
Mar 22, 2026
38fb7cc
Add exp_d SAM variant — TTT 8ep + stride 32 + sharpness-aware TTT
Mar 22, 2026
473aa62
Log exp D seed 7: 1.1313 BPB but 16.18 MB — over size limit
Mar 22, 2026
33a1276
Add Partial RoPE + LN Scale (from PR #315) to sota254 + run_sam
Mar 22, 2026
86ef9af
Add exp_d/run_sam_clean.sh — pure SAM A/B test, no other changes
Mar 22, 2026
d2750e9
Log exp D seeds 7+137: both over size limit
Mar 22, 2026
12ca289
Add Sponge Bath experiment: TTT 8ep + stride 32 eval-only improvement
Mar 22, 2026
1b9a2dc
Add PR#315 clone with TTT 8ep + run script
Mar 22, 2026
fdd256e
Log D+SAM+PR315tricks: 1.1274 BPB new best; add SAM to pr315 run script
Mar 22, 2026
b10ef8a
Log PR315+TTT results: 1.1240 BPB (invalidated — TTT now banned)
Mar 22, 2026
ca939f2
Add PR374 enchilada experiment: 12L/2KV/2.75xMLP + train@1024 + EMA
Mar 22, 2026
0fa2d63
Add RunPod 8xH100 setup guide — every gotcha we've hit 3 times
Mar 22, 2026
2d47e82
Add fractal cadence experiment: F/N alternation pattern
Mar 22, 2026
9782e18
Add PR374-safe: EMA + earlier QAT + longer warmdown only, shape uncha…
Mar 22, 2026
00b8d97
Add PR374-depth: 12L/4KV/2.625xMLP (same params, +1 layer) + EMA + QA…
Mar 22, 2026
0b92cd9
Add minified train_gpt (30KB vs 74KB) with EMA+SWA edge
Mar 22, 2026
4ff70c4
Fix log0 keyword arg bug in minified script
Mar 22, 2026
6aa7a9f
Fix Block kwarg: layer_idx -> li in minified script
Mar 22, 2026
8ac59a5
Fix a.te -> a.tie_embeddings in minified script
Mar 22, 2026
a51766b
Add pr374_submit: trimmed winner for submission
Mar 22, 2026
9102365
Remove duplicate stride=64 eval from submit script
Mar 22, 2026
f48b98f
Revert "Remove duplicate stride=64 eval from submit script"
Mar 22, 2026
9692858
Add fractal cadence H100 script: 4 unique × 3 loops + ortho positions
Mar 22, 2026
6eaaf81
Add fractal cadence long run (1.6575 BPB @ 3929 steps) + H100 script
Mar 22, 2026
f43c720
Add pr374_slim: comment-stripped pr374_safe (67KB vs 74KB, -6.4KB)
Mar 22, 2026
fb551c2
Add fractal cadence auto-research sweep runner
Mar 22, 2026
d04199b
Fix FA3 head_dim limit: 12 heads / 6 kv_heads for dim=768 (head_dim=64)
Mar 22, 2026
2f3e804
Add pr374_ttt: PR374 + EMA + TTT(20ep,SAM) — rock the house
Mar 22, 2026
469974c
Fix early QAT trigger: skip warmdown/QAT for first 100 steps
Mar 22, 2026
4a4908a
Optimize fractal: tuned H100 defaults + focused sweep grid
Mar 22, 2026
3363866
Add TTT eval: test-time training on already-graded tokens
Mar 22, 2026
ed2e0e0
Log depth + TTT results: 1.1223 BPB (12L+TTT20ep+SAM)
Mar 22, 2026
4392ca7
Save pr374_throne: our #1 non-TTT record holder (1.1243 BPB)
Mar 22, 2026
5bc584b
Add autonomous overnight fractal optimizer (Karpathy auto-research)
Mar 22, 2026
d0ff238
Rewrite autoresearch: Qwen-guided fractal optimizer via Ollama
Mar 22, 2026
9ae4ed2
Add 3 leapfrog variants based on PR#414 SOTA (1.1233 BPB)
Mar 22, 2026
e03dd34
Add pod setup script for leapfrog variants (v1/v2/v3)
Mar 22, 2026
b12f744
Fix setup_pod.sh: apply FA3 build fixes from RUNPOD_SETUP.md
Mar 22, 2026
3dc64e4
Update fractal H100 with overnight Qwen findings
Mar 22, 2026
5a08f56
Fix v1 TTT burst: apply EMA first, then QAT-aware sharpening
Mar 22, 2026
cd89c3c
Add inner-TTT to fractal eval: recursive weight improvement per loop
Mar 22, 2026
18f41c0
v1 combo: burst+QAT before EMA + 15 GPTQ percentiles
Mar 22, 2026
e239db2
Isolate TTT to fractal layers only: blocks + loop_pos + skip_weights
Mar 22, 2026
b5c90dd
Add TTT drift gate: snap block weights back toward trained originals
Mar 22, 2026
3c6f5bc
Add v4: burst + distill + train_seq_len=1024 for more steps
Mar 22, 2026
c75145c
Add SOTA autoresearch: Qwen-guided edge finding for 1.1233 record
Mar 22, 2026
5808ddd
Fix TTT graph bug + upgrade to 3 blocks x 896d (23M params, ~14MB)
Mar 22, 2026
1a9b681
Save leapfrog experiment results: 6 variants tested, v1 wins at 1.12319
Mar 22, 2026
3bb985c
Fractal v2: dim=960 (25.7M, ~15.8MB), Muon LR back to 0.025
Mar 22, 2026
41d743f
Disable TTT in v2 (backward graph bug), focus on bigger model
Mar 22, 2026
6a6802b
Add v5: QAT percentile fix + TrigramHash + EMA-SWA blend
Mar 22, 2026
7e1220d
Fix TTT graph bug: fresh vc + detached inputs per loop
Mar 22, 2026
0500004
Two v2 scripts: with TTT and without, for A/B comparison
Mar 22, 2026
0243779
Fractal v3: MLP 3.0->3.3 fills 16MB budget (27.4M params, ~15.2MB)
Mar 22, 2026
a734827
Fractal v4: 1.5x batch (1.18M tokens/step) to use spare VRAM
Mar 22, 2026
b42deae
Log The Frugendorff: fractal cadence baseline at 1.2113 BPB
Mar 22, 2026
659707b
Add v6: fractal 6L×2 loops (12 effective) + 480d/16H/4xMLP
Mar 22, 2026
9ea76e2
Frugendorff v5: TTT warmup-then-freeze to capture 1.19 peak
Mar 22, 2026
bbbb285
Log all Frugendorff results: v3 TTT peak 1.1901, v4 batch 1.2186
Mar 22, 2026
3e0182b
Fix v6: 512d/16H/8KV (head_dim=32, FA3 requires multiple of 8)
Mar 22, 2026
1017b30
Log v5/v6 results + add fractal sweep script
Mar 22, 2026
67ca84b
Frugendorff 8-hour single GPU longrun
Mar 22, 2026
e0527c1
Complete Frugendorff documentation: all runs, findings, architecture …
Mar 23, 2026
b34f424
Add v7: legal score-first TTT eval (PR #461 recipe)
Mar 23, 2026
03849f5
Add v8: mutual learning — flat GPT + fractal GPT co-training
Mar 23, 2026
c446351
v7: add TTT early stopping — train only first 60 chunks
Mar 23, 2026
8fd4326
Update v6 to sweep winner: 4×3/512d/8H/4KV/4xMLP
Mar 23, 2026
7c9d8cc
Log v7 TTT results: early stop at 60 chunks = 1.12312 (best)
Mar 23, 2026
19acd25
The Frugendorff Squared: 6L x 2 loops, dim=640, MLP 4x, 28.5M params
Mar 23, 2026
504d3d9
Log higher LR (0.030) + MTP results — both worse
Mar 23, 2026
c8552f4
The Frugendorff Squared: 1.1478 BPB — 0.025 from world lead
Mar 23, 2026
7f556e6
Stack distillation onto Frugendorff Squared
Mar 23, 2026
f1ac766
Prep Frugendorff Squared PR submission (draft — needs image)
Mar 23, 2026
edea2e1
Add train script to Frugendorff submission folder
Mar 23, 2026
fe27bfb
TTT stabilization: EMA scoring + fixed cosine LR + embed freeze
Mar 23, 2026
fe46767
Add GPTQ quantization: Hessian-aware error compensation for int6
Mar 23, 2026
5ef90ae
Stack low-hanging fruit: fix GPTQ, earlier QAT, matched clipping
Mar 23, 2026
17b34e9
Add post-quant training burst: repair quant damage before eval
Mar 23, 2026
711eb3a
Remove post-quant burst — illegal (uses training data during eval)
Mar 23, 2026
0a22c2b
Add selective int8 for sensitive layers (GPTQ-int8 upgrade path)
Mar 23, 2026
769b01b
Frugendorff V3: v7 full stack + fractal loops (GPTQ + TTT + PQB)
Mar 23, 2026
b5986e4
Add AdamW TTT option — PR #462 shows 5x better TTT gain vs SGD
Mar 23, 2026
e605d1c
Record: GPTQ + Early QAT + TTT EMA — 3-seed mean val_bpb 1.1215
Mar 23, 2026
49 changes: 49 additions & 0 deletions FRUGENDORFF_PR_DRAFT.md
@@ -0,0 +1,49 @@
## PR DRAFT — DO NOT SUBMIT YET (add image first)

### Title:
The Frugendorff Squared: Fractal Weight Sharing + MLP 4x (1.1478 BPB, 15.15MB)

### Body:

## Summary

Non-record submission exploring **fractal weight sharing** — a novel approach where 6 unique transformer blocks are looped 2× each, providing 12 effective layers of depth with only 6 blocks of stored parameters. The freed parameter budget enables **MLP 4x expansion**, which is the primary quality driver.

<!-- ADD YOUR IMAGE HERE -->

- **val_bpb: 1.1478** (sliding window stride=64) | **15.15 MB** | 8xH100 SXM, 600s
- 28.2M params, 4,390 steps at 136.7ms/step
- Full pipeline: Muon + SWA + Late QAT + Training Replay + Self-Distillation + EMA

## Key Insight

MLP 4x gives ~2% relative BPB improvement over MLP 3x, but doesn't fit in 16MB with 12 unique layers. Fractal weight sharing (6 unique × 2 loops) fits it in 15.15 MB. The weight sharing is the compression technique; the MLP 4x is the quality lever.

## Architecture

- 6 unique blocks × 2 loops = 12 effective depth
- dim=640, 10 heads, 5 KV (GQA), head_dim=64
- MLP 4x (hidden=2560), relu-squared
- Orthogonal loop positions, U-Net skips, SmearGate, BigramHash, VE128, XSA

## No TTT on validation data

All training uses training data only: late replay re-feeds buffered training batches, and self-distillation uses an EMA teacher on training data. Fully compliant with issue #402.

## Test plan

- [x] 8xH100 SXM, 600s
- [x] Artifact under 16MB (15.15 MB)
- [x] No TTT on validation data (per issue #402)
- [x] Post-quant roundtrip verified
- [x] Sliding window eval (stride=64)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---
### Command to create PR (after adding image):
```
gh pr create --repo openai/parameter-golf \
--title "The Frugendorff Squared: Fractal Weight Sharing + MLP 4x (1.1478 BPB, 15.15MB)" \
--body "$(cat FRUGENDORFF_PR_BODY.md)"
```
269 changes: 269 additions & 0 deletions PLAN.md
@@ -0,0 +1,269 @@
# Parameter Golf — Fractal Transformer Research Plan
**DGX Spark · GB10 · March 2026**

---

## Challenge Summary

| Constraint | Value |
|------------|-------|
| Artifact size | ≤16MB (code + int8 quantized + zlib compressed weights) |
| Training time | ≤10 minutes on 8×H100 |
| Metric | bits-per-byte (BPB) on FineWeb validation set |
| Baseline | 1.2244 BPB |
| Record threshold | ≤1.2194 BPB (must beat by ≥0.005) |
| 4-hour unlimited baseline | 1.2074 BPB |
| Challenge window | March 18 → April 30, 2026 |
| Repo | https://github.com/newjordan/parameter-golf |
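
Since the metric is bits-per-byte, a cross-entropy measured in nats has to be converted before comparing against these thresholds. A minimal sketch of the conversion (the function name and example numbers are ours; the official eval harness in the repo is authoritative):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy (in nats) over a byte span into BPB."""
    return total_nll_nats / (math.log(2) * total_bytes)

# e.g. an average of 0.85 nats per byte of validation text:
print(round(bits_per_byte(850.0, 1000), 4))  # 1.2263
```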

---

## Our Approach: Fractal Transformer + Gravity + AttnRes

### Core Thesis

Weight-shared transformer layers with learned gravitational auxiliary losses
and attention residuals will achieve lower BPB than the baseline's 9-unique-layer
architecture within the same 16MB parameter budget.

### Three Innovations Combined

**1. Fractal Architecture (Weight Sharing / Depth Recurrence)**

Instead of 9 unique layers, use 3 unique layers repeated in 3 loops.

```
CURRENT BASELINE:
9 unique layers × 512 dim = ~14M params

OUR APPROACH:
3 unique layers × 3 loops = 9 effective layers
Wider layers (~700 dim) with same total param count
Loop position embedding tells shared weights which pass they're on
```
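
The scheme above can be sketched as a forward pass (a toy illustration with our own names; the real implementation is a torch module):

```python
import numpy as np

def fractal_forward(x, blocks, loop_pos_emb):
    """Run the same unique blocks once per loop (9 effective layers from 3
    stored ones); loop_pos_emb[k] tags pass k so shared weights can adapt."""
    for pos in loop_pos_emb:
        x = x + pos                      # loop-position signal
        for block in blocks:             # same parameters every loop
            x = block(x)
    return x

# toy check: 3 "blocks" reused over 3 loops = 9 applications
blocks = [lambda x: x + 1.0] * 3
pos = [np.zeros(4)] * 3
out = fractal_forward(np.zeros(4), blocks, pos)
print(out)  # [9. 9. 9. 9.]
```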

Why this helps:
- Fewer unique parameters → more room in 16MB budget → wider layers
- Wider layers = richer features per layer
- Weight sharing compresses extremely well under int8+zlib
- Depth recurrence explicitly encouraged by the challenge README
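
To make the size argument concrete, here is a rough artifact-size probe (entirely our own sketch, using random weights as a stand-in; trained weights, and especially repeated shared ones, typically compress at least as well):

```python
import zlib
import numpy as np

def int8_zlib_bytes(n_params: int, seed: int = 0) -> int:
    """Quantize n_params Gaussian floats to symmetric int8, then zlib-compress."""
    w = np.random.default_rng(seed).standard_normal(n_params).astype(np.float32)
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return len(zlib.compress(q.tobytes(), level=9))

# 3 shared layers store a third of the unique weights of 9 unique layers,
# so the compressed artifact shrinks roughly proportionally
print(int8_zlib_bytes(1_000_000) / 2**20, "MiB per 1M params")
```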

**2. Gravity (Learned Auxiliary Losses)**

At the end of each loop, peek at the output using the shared lm_head and
compute an auxiliary cross-entropy loss. The weights are LEARNED parameters.

```python
self.gravity_weights = nn.Parameter(torch.tensor([0.1, 0.3, 1.0]))  # one learnable scalar per loop

total_loss = 0
for loop in range(3):
    x = run_shared_layers(x, loop_pos=loop)   # same 3 shared blocks each pass
    loop_logits = lm_head(rms_norm(x))        # peek via the shared lm_head
    loop_loss = cross_entropy(loop_logits, targets)
    # softplus keeps each learned loop weight positive
    total_loss += softplus(self.gravity_weights[loop]) * loop_loss
```

Why this helps:
- 3× gradient signal — every layer gets direct supervision, not diluted backprop
- Model discovers optimal loop weighting during training
- Especially powerful with weight sharing: same weights receive gradient from 3 depths
- Zero new parameters (3 scalars for weights, reuses existing lm_head)
- ~1.2% compute overhead (2 extra lm_head calls)

The "gravity" analogy:
- Loop 1 output is far from the target → strong pull, large updates
- Loop 2 is closer → medium pull, refinement
- Loop 3 is nearest → full weight, precision
- Each loop starts from a better position because the previous loop was already pulled toward the answer

**3. AttnRes (Attention Residuals)**

Replace fixed skip connections with learned, input-dependent attention over depth.
From Moonshot's paper (arxiv:2603.15031).

```
Standard residuals: x = x + layer_output (fixed, uniform weight)
AttnRes: x = softmax(query · [prev_outputs]) · [prev_outputs]
```

Each layer has a single learned query vector w_l ∈ R^d that attends over all
previous loop outputs. The softmax produces content-aware, input-dependent
weights instead of fixed uniform accumulation.

Why this helps:
- Paper shows 1.25× compute equivalent for near-zero parameter cost
- Replaces BOTH the baseline's U-Net skips AND resid_mix
- Only 9 × dim new parameters (≈4,608 at the baseline's dim=512)
- Critical for weight sharing: lets later loops selectively reference earlier loops
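
A minimal NumPy sketch of the mechanism at a single token position (our own illustration; the paper's exact formulation, normalization, and per-head details may differ):

```python
import numpy as np

def attn_res(query: np.ndarray, prev_outputs: list) -> np.ndarray:
    """Blend previous loop outputs with weights from a learned query vector.

    query: (d,) learned per-layer vector
    prev_outputs: list of (d,) hidden states, e.g. [embedding, loop1_out, ...]
    """
    h = np.stack(prev_outputs)            # (k, d)
    scores = h @ query                    # one score per stored output
    scores -= scores.max()                # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ h                    # content-aware blend, not a fixed sum

rng = np.random.default_rng(0)
d = 8
x = attn_res(rng.standard_normal(d), [rng.standard_normal(d) for _ in range(3)])
print(x.shape)  # (8,)
```

With a zero query the weights collapse to uniform, recovering a plain average of the stored states; training the query is what makes the blend selective.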

### What We Remove From Baseline

| Component | Parameters | Replaced By |
|-----------|-----------|-------------|
| U-Net encoder/decoder split | structural | Fractal loops |
| skip_weights (9 × 512) | 4,608 | AttnRes queries |
| resid_mix (9 × 2 × 512) | 9,216 | AttnRes |
| **Total removed** | **~13,824** | |

### What We Add

| Component | Parameters | Purpose |
|-----------|-----------|---------|
| AttnRes queries (9 layers) | 4,608 | Selective depth attention |
| Loop position embeddings (3 loops) | ~2,100 | Tell weights which loop they're in |
| Gravity weights (3 scalars) | 3 | Learned auxiliary loss weighting |
| **Total added** | **~6,711** | |

**Net: ~7,113 parameters saved → reinvested into wider layers.**
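
The bookkeeping above checks out (note the AttnRes query count is quoted at the baseline's dim=512, while the loop-position figure assumes 3 loops × ~700 dim):

```python
removed = 9 * 512 + 9 * 2 * 512   # skip_weights + resid_mix
added = 9 * 512 + 3 * 700 + 3     # AttnRes queries + loop pos emb + gravity scalars
print(removed, added, removed - added)  # 13824 6711 7113
```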

---

## Architecture Diagram

```
INPUT TOKENS (1024 vocab)
EMBEDDING (1024 × ~700 dim)
LOOP 1 (broad strokes):
├── Layer A (attention + MLP, loop_pos=0)
├── Layer B (attention + MLP, loop_pos=0)
├── Layer C (attention + MLP, loop_pos=0)
├── GRAVITY: peek → compute loss₁ (learned weight ~0.1)
└── Store loop 1 output for AttnRes
LOOP 2 (refinement):
├── AttnRes: attend over [embedding, loop1_output]
├── Layer A (attention + MLP, loop_pos=1) ← same weights as loop 1
├── Layer B (attention + MLP, loop_pos=1)
├── Layer C (attention + MLP, loop_pos=1)
├── GRAVITY: peek → compute loss₂ (learned weight ~0.3)
└── Store loop 2 output for AttnRes
LOOP 3 (precision):
├── AttnRes: attend over [embedding, loop1_output, loop2_output]
├── Layer A (attention + MLP, loop_pos=2) ← same weights again
├── Layer B (attention + MLP, loop_pos=2)
├── Layer C (attention + MLP, loop_pos=2)
└── FINAL LOSS: full cross-entropy (weight = 1.0)
OUTPUT: logits → BPB
```

Each loop tightens the representation:
- Loop 1: rough sketch (only sees embedding)
- Loop 2: refinement (sees embedding + loop 1 output via AttnRes)
- Loop 3: precision (sees full history, committed to answer)

---

## Information Tightening Mechanisms

### Gravity (primary — Frosty's intuition)
Each loop is pulled toward the final answer by its own loss signal. Later loops
start from better positions because earlier loops were already course-correcting.
The model learns how hard each loop should pull (learned gravity weights).

### AttnRes (secondary — from Moonshot paper)
Selective attention over previous loop outputs. Later loops can choose which
earlier representations are useful for each specific token, not a fixed blend.

### Future: Ring Buffer + Temperature Cooling (Phase 4)
- Ring buffer: bounded memory with eviction of unhelpful previous states
- Temperature: AttnRes attention sharpens with depth (soft early, committed late)
- Only add if Phase 1-3 show signal

---

## Experiment Sequence

### Phase 1: Establish Weight Sharing Baselines
1. Run baseline as-is → establish local BPB reference
2. 3 shared layers × 3 loops, same total params, ~512 dim → does sharing work?
3. 3 shared layers × 3 loops, wider ~700 dim → does width help?
4. 2 shared layers × 4 loops, widest ~850 dim → more loops?
5. 4 shared layers × 2 loops, ~620 dim → fewer loops?

### Phase 2: Add Gravity
6. Best config from Phase 1 + gravity with learned weights
7. Compare: gravity learned vs gravity fixed [0.1, 0.3, 1.0] vs no gravity

### Phase 3: Add AttnRes
8. Best from Phase 2 + full AttnRes
9. Test: AttnRes before attention only / before MLP only / both
10. Test: AttnRes with vs without gravity

### Phase 4: Advanced Mechanisms
11. Add ring buffer (bounded memory with eviction)
12. Add temperature cooling on AttnRes
13. Try combining all mechanisms

### Phase 5: Optimize for Submission
14. Verify int8+zlib artifact ≤16MB
15. Tune width to maximize quality within size budget
16. Port winning config to official train_gpt.py style
17. Run on cloud 8×H100, verify 10-minute timing
18. Prepare submission folder for /records

---

## Workflow

### Local (DGX Spark, free, unlimited)
- Adapted research fork without Triton/torch.compile dependency
- Shorter training budget (2 min per experiment)
- Smaller batch size
- Same model, data, tokenizer, BPB metric
- Results won't match H100 numbers but relative ordering transfers
- Run 50-100 experiments to find winning configuration
- Autoresearch agent runs overnight (Phase 1-4)

### Cloud (H100s, paid, limited)
- Take best configuration from local experiments
- Run at full scale: 8×H100, 10 minutes, full batch
- Verify BPB, artifact size, timing
- Prepare official submission

---

## Source Material

### Attention Residuals (Moonshot)
- Paper: arxiv:2603.15031
- Repo: https://github.com/MoonshotAI/Attention-Residuals
- Core: replace fixed residual connections with softmax attention over depth
- Result: matches 1.25× compute baseline at near-zero parameter cost

### Autoresearch (Karpathy)
- Repo: https://github.com/karpathy/autoresearch
- Core: AI agent modifies train.py, trains 5 min, keeps/discards, loops forever
- Adapted as our outer optimization loop

### Parameter Golf Baseline
- Repo: https://github.com/openai/parameter-golf
- Architecture: 9-layer GPT, 512 dim, 1024 vocab, GQA, Muon optimizer
- Key features: U-Net skip connections, resid_mix, ReLU², logit softcapping
- BPB: 1.2244 (10 min), 1.2074 (4 hour)

---

## Key Insight

The competition rewards compression quality per parameter. Weight sharing is
the ultimate compression — the same function applied repeatedly. AttnRes gives
that repeated function the ability to selectively reference its earlier outputs.
Gravity ensures every repetition is actively pulled toward the correct answer.

The fractal structure means each loop genuinely tightens the representation:
same weights, progressively richer input, direct loss supervision at every
stage. The model isn't just repeating — it's refining.

---

*Plan authored by Octavian + Frosty · Spark-2949 · 2026-03-18*