108 commits
6e503d9
docs: fractal transformer research plan — weight sharing + gravity + …
Mar 18, 2026
73271f3
results: first local ladder — fractal 3x3 beats baseline by 7.1% BPB,…
Mar 19, 2026
aa20600
Add exact clone of PR #254 — best pending SOTA (1.1313 BPB)
Mar 21, 2026
2636011
Add XSA last 3 layers to #254 SOTA clone
Mar 21, 2026
4e4cc7f
Fix XSA GQA broadcast bug — expand KV heads before manual attention
Mar 21, 2026
44d290d
Add 3 SOTA improvement experiments: MTP, SwiGLU, Vocab1536
Mar 21, 2026
83efa9c
Add FarnsworthEngine v2: full improvement stack on SOTA254 base
Mar 21, 2026
e0d06d0
Add FA3→FA2→SDPA fallback chain for pod restart resilience
Mar 21, 2026
d94c7a1
Revert FA3 fallback chain — was unauthorized code change to baseline …
Mar 21, 2026
7171b6a
Fix FA3 NaN: cast qkv to bf16 before FA3 call, disable dynamo DDP opt
Mar 21, 2026
c0adf16
Add 2-seed validation scripts for exp A/B/C
Mar 21, 2026
a54066a
Log exp A/B results: both behind baseline, zlib fallback bug found
Mar 22, 2026
065bd06
Fix XSA NaN: position 0 has no valid targets when self-mask + causal …
Mar 22, 2026
0b2c73c
Disable XSA in ttt_only run — manual attention too slow vs FA3
Mar 22, 2026
2d79228
Add run_v2_ttt_noXSA.sh — TTT v2 + temp scaling, all FA3, max speed
Mar 22, 2026
508cdf1
Restore XSA_LAST_N=3 in run_v2_ttt_only.sh (keep existing test intact)
Mar 22, 2026
c1e74ba
Log v2 TTT-only + XSA=3 result: 1.1982 BPB (worse than 1.1301 baseline)
Mar 22, 2026
f263214
Strip verbose logging from v2 train loop — match baseline format
Mar 22, 2026
7bdf6de
Log v2 noXSA result: 1.1538/1.1315 BPB — TTT v2 hurt, no edge over ba…
Mar 22, 2026
2620ec3
Log exp_a/b/c results: all worse than 1.1301 baseline, exp_c never ran
Mar 22, 2026
aea1e39
Add exp D: TTT 8 epochs + stride 32 (eval-only improvement)
Mar 22, 2026
e407bea
Add SAM (Sharpness-Aware Minimization) option for TTT
Mar 22, 2026
4fb1bec
Add baseline reproduction script — verify 1.1303 on current FA3 build
Mar 22, 2026
3583889
Add SAM to baseline TTT — test sharpness-aware adaptation on proven code
Mar 22, 2026
9d86a37
Log exp D result: 1.1295 BPB — new best (-0.0008 vs baseline)
Mar 22, 2026
79c9c2a
Log exp D seed 42: 1.1307 BPB — confirms improvement (mean 1.1301)
Mar 22, 2026
87c2831
Add exp_d SAM variant — TTT 8ep + stride 32 + sharpness-aware TTT
Mar 22, 2026
e24283a
Log exp D seed 7: 1.1313 BPB but 16.18 MB — over size limit
Mar 22, 2026
e6d3dc5
Add Partial RoPE + LN Scale (from PR #315) to sota254 + run_sam
Mar 22, 2026
753ebd1
Add exp_d/run_sam_clean.sh — pure SAM A/B test, no other changes
Mar 22, 2026
d8053e6
Log exp D seeds 7+137: both over size limit
Mar 22, 2026
169e4a3
Add Sponge Bath experiment: TTT 8ep + stride 32 eval-only improvement
Mar 22, 2026
e65d662
Add PR#315 clone with TTT 8ep + run script
Mar 22, 2026
9de9e70
Log D+SAM+PR315tricks: 1.1274 BPB new best; add SAM to pr315 run script
Mar 22, 2026
24267e1
Log PR315+TTT results: 1.1240 BPB (invalidated — TTT now banned)
Mar 22, 2026
f7c1a70
Add PR374 enchilada experiment: 12L/2KV/2.75xMLP + train@1024 + EMA
Mar 22, 2026
511428d
Add RunPod 8xH100 setup guide — every gotcha we've hit 3 times
Mar 22, 2026
2369654
Add fractal cadence experiment: F/N alternation pattern
Mar 22, 2026
fbc1888
Add PR374-safe: EMA + earlier QAT + longer warmdown only, shape uncha…
Mar 22, 2026
a9e5d06
Add PR374-depth: 12L/4KV/2.625xMLP (same params, +1 layer) + EMA + QA…
Mar 22, 2026
9049b86
Add minified train_gpt (30KB vs 74KB) with EMA+SWA edge
Mar 22, 2026
602a6d2
Fix log0 keyword arg bug in minified script
Mar 22, 2026
6990d60
Fix Block kwarg: layer_idx -> li in minified script
Mar 22, 2026
833efbf
Fix a.te -> a.tie_embeddings in minified script
Mar 22, 2026
5fff32a
Add pr374_submit: trimmed winner for submission
Mar 22, 2026
8aa618a
Remove duplicate stride=64 eval from submit script
Mar 22, 2026
2ebc631
Revert "Remove duplicate stride=64 eval from submit script"
Mar 22, 2026
194cf3d
Add fractal cadence H100 script: 4 unique × 3 loops + ortho positions
Mar 22, 2026
a47bfa2
Add fractal cadence long run (1.6575 BPB @ 3929 steps) + H100 script
Mar 22, 2026
e3770d1
Add pr374_slim: comment-stripped pr374_safe (67KB vs 74KB, -6.4KB)
Mar 22, 2026
cfc576e
Add fractal cadence auto-research sweep runner
Mar 22, 2026
86b1161
Fix FA3 head_dim limit: 12 heads / 6 kv_heads for dim=768 (head_dim=64)
Mar 22, 2026
166d3b4
Add pr374_ttt: PR374 + EMA + TTT(20ep,SAM) — rock the house
Mar 22, 2026
7351d10
Fix early QAT trigger: skip warmdown/QAT for first 100 steps
Mar 22, 2026
79f951a
Optimize fractal: tuned H100 defaults + focused sweep grid
Mar 22, 2026
a31c39d
Add TTT eval: test-time training on already-graded tokens
Mar 22, 2026
170c6b7
Log depth + TTT results: 1.1223 BPB (12L+TTT20ep+SAM)
Mar 22, 2026
30b63ba
Save pr374_throne: our #1 non-TTT record holder (1.1243 BPB)
Mar 22, 2026
346fd87
Add autonomous overnight fractal optimizer (Karpathy auto-research)
Mar 22, 2026
525d755
Rewrite autoresearch: Qwen-guided fractal optimizer via Ollama
Mar 22, 2026
eb7f8df
Add 3 leapfrog variants based on PR#414 SOTA (1.1233 BPB)
Mar 22, 2026
2717b3c
Add pod setup script for leapfrog variants (v1/v2/v3)
Mar 22, 2026
05beadf
Fix setup_pod.sh: apply FA3 build fixes from RUNPOD_SETUP.md
Mar 22, 2026
b8914a5
Update fractal H100 with overnight Qwen findings
Mar 22, 2026
82d16b4
Fix v1 TTT burst: apply EMA first, then QAT-aware sharpening
Mar 22, 2026
57cc969
Add inner-TTT to fractal eval: recursive weight improvement per loop
Mar 22, 2026
0ee493d
v1 combo: burst+QAT before EMA + 15 GPTQ percentiles
Mar 22, 2026
7236a59
Isolate TTT to fractal layers only: blocks + loop_pos + skip_weights
Mar 22, 2026
0e6fd68
Add TTT drift gate: snap block weights back toward trained originals
Mar 22, 2026
50638c6
Add v4: burst + distill + train_seq_len=1024 for more steps
Mar 22, 2026
1529671
Add SOTA autoresearch: Qwen-guided edge finding for 1.1233 record
Mar 22, 2026
1f65df5
Fix TTT graph bug + upgrade to 3 blocks x 896d (23M params, ~14MB)
Mar 22, 2026
041a854
Save leapfrog experiment results: 6 variants tested, v1 wins at 1.12319
Mar 22, 2026
d14df67
Fractal v2: dim=960 (25.7M, ~15.8MB), Muon LR back to 0.025
Mar 22, 2026
7b5232a
Disable TTT in v2 (backward graph bug), focus on bigger model
Mar 22, 2026
3699d19
Add v5: QAT percentile fix + TrigramHash + EMA-SWA blend
Mar 22, 2026
79abf6c
Fix TTT graph bug: fresh vc + detached inputs per loop
Mar 22, 2026
bf17d79
Two v2 scripts: with TTT and without, for A/B comparison
Mar 22, 2026
61aee00
Fractal v3: MLP 3.0->3.3 fills 16MB budget (27.4M params, ~15.2MB)
Mar 22, 2026
672d037
Fractal v4: 1.5x batch (1.18M tokens/step) to use spare VRAM
Mar 22, 2026
5f9b9e9
Log The Frugendorff: fractal cadence baseline at 1.2113 BPB
Mar 22, 2026
e758544
Add v6: fractal 6L×2 loops (12 effective) + 480d/16H/4xMLP
Mar 22, 2026
f196e84
Frugendorff v5: TTT warmup-then-freeze to capture 1.19 peak
Mar 22, 2026
7fba307
Log all Frugendorff results: v3 TTT peak 1.1901, v4 batch 1.2186
Mar 22, 2026
5f4eb5f
Fix v6: 512d/16H/8KV (head_dim=32, FA3 requires multiple of 8)
Mar 22, 2026
74eebb6
Log v5/v6 results + add fractal sweep script
Mar 22, 2026
bd37700
Frugendorff 8-hour single GPU longrun
Mar 22, 2026
cc2ff3a
Complete Frugendorff documentation: all runs, findings, architecture …
Mar 23, 2026
dec4594
Add v7: legal score-first TTT eval (PR #461 recipe)
Mar 23, 2026
735b22f
Add v8: mutual learning — flat GPT + fractal GPT co-training
Mar 23, 2026
4fbad04
v7: add TTT early stopping — train only first 60 chunks
Mar 23, 2026
6d68de3
Update v6 to sweep winner: 4×3/512d/8H/4KV/4xMLP
Mar 23, 2026
bd1e51d
Log v7 TTT results: early stop at 60 chunks = 1.12312 (best)
Mar 23, 2026
afba076
The Frugendorff Squared: 6L x 2 loops, dim=640, MLP 4x, 28.5M params
Mar 23, 2026
4fafae8
Log higher LR (0.030) + MTP results — both worse
Mar 23, 2026
c40bfce
The Frugendorff Squared: 1.1478 BPB — 0.025 from world lead
Mar 23, 2026
57062a5
Stack distillation onto Frugendorff Squared
Mar 23, 2026
eef3124
Prep Frugendorff Squared PR submission (draft — needs image)
Mar 23, 2026
2c91897
Add train script to Frugendorff submission folder
Mar 23, 2026
cb4f879
TTT stabilization: EMA scoring + fixed cosine LR + embed freeze
Mar 23, 2026
ffad368
Add GPTQ quantization: Hessian-aware error compensation for int6
Mar 23, 2026
8e0a01f
Stack low-hanging fruit: fix GPTQ, earlier QAT, matched clipping
Mar 23, 2026
4f3bde6
Add post-quant training burst: repair quant damage before eval
Mar 23, 2026
217eaad
Remove post-quant burst — illegal (uses training data during eval)
Mar 23, 2026
f94c09a
Add selective int8 for sensitive layers (GPTQ-int8 upgrade path)
Mar 23, 2026
8b22b10
Frugendorff V3: v7 full stack + fractal loops (GPTQ + TTT + PQB)
Mar 23, 2026
a0e0841
Add AdamW TTT option — PR #462 shows 5x better TTT gain vs SGD
Mar 23, 2026
464f859
Record: GPTQ + Early QAT + TTT EMA — 3-seed mean val_bpb 1.1215
Mar 23, 2026
49 changes: 49 additions & 0 deletions FRUGENDORFF_PR_DRAFT.md
@@ -0,0 +1,49 @@
## PR DRAFT — DO NOT SUBMIT YET (add image first)

### Title:
The Frugendorff Squared: Fractal Weight Sharing + MLP 4x (1.1478 BPB, 15.15MB)

### Body:

## Summary

Non-record submission exploring **fractal weight sharing** — a novel approach where 6 unique transformer blocks are looped 2× each, providing 12 effective layers of depth with only 6 blocks of stored parameters. The freed parameter budget enables **MLP 4x expansion**, which is the primary quality driver.

<!-- ADD YOUR IMAGE HERE -->

- **val_bpb: 1.1478** (sliding window stride=64) | **15.15 MB** | 8xH100 SXM, 600s
- 28.2M params, 4,390 steps at 136.7ms/step
- Full pipeline: Muon + SWA + Late QAT + Training Replay + Self-Distillation + EMA

## Key Insight

MLP 4x gives ~2% relative BPB improvement over MLP 3x, but doesn't fit in 16MB with 12 unique layers. Fractal weight sharing (6 unique × 2 loops) fits it in 15.15 MB. The weight sharing is the compression technique; the MLP 4x is the quality lever.

## Architecture

- 6 unique blocks × 2 loops = 12 effective depth
- dim=640, 10 heads, 5 KV (GQA), head_dim=64
- MLP 4x (hidden=2560), relu-squared
- Orthogonal loop positions, U-Net skips, SmearGate, BigramHash, VE128, XSA
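
As a sanity check on the reported model size, the listed shapes imply roughly the following block parameter count. This is a rough recomputation that ignores embeddings, norms, and the extras (SmearGate, BigramHash, VE128, XSA), so it should land a little under the reported 28.2M total:

```python
# Frugendorff Squared shapes from the architecture list above
dim, heads, kv_heads, head_dim, mlp_hidden, blocks = 640, 10, 5, 64, 2560, 6

# Attention: Q + O projections are dim x (heads*head_dim); K and V use GQA heads
attn = dim * heads * head_dim + 2 * dim * kv_heads * head_dim + heads * head_dim * dim
# MLP 4x: up and down projections
mlp = 2 * dim * mlp_hidden
per_block = attn + mlp

print(per_block * blocks)  # 27033600 — ~27.0M before embeddings and extras
```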

## No TTT on validation data

All training uses training data only. Late replay buffers training batches. Self-distillation uses EMA teacher on training data. Fully compliant with issue #402.

## Test plan

- [x] 8xH100 SXM, 600s
- [x] Artifact under 16MB (15.15 MB)
- [x] No TTT on validation data (per issue #402)
- [x] Post-quant roundtrip verified
- [x] Sliding window eval (stride=64)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---
### Command to create PR (after adding image):
```
gh pr create --repo openai/parameter-golf \
--title "The Frugendorff Squared: Fractal Weight Sharing + MLP 4x (1.1478 BPB, 15.15MB)" \
--body "$(cat FRUGENDORFF_PR_BODY.md)"
```
269 changes: 269 additions & 0 deletions PLAN.md
@@ -0,0 +1,269 @@
# Parameter Golf — Fractal Transformer Research Plan
**DGX Spark · GB10 · March 2026**

---

## Challenge Summary

| Constraint | Value |
|------------|-------|
| Artifact size | ≤16MB (code + int8 quantized + zlib compressed weights) |
| Training time | ≤10 minutes on 8×H100 |
| Metric | bits-per-byte (BPB) on FineWeb validation set |
| Baseline | 1.2244 BPB |
| Record threshold | ≤1.2194 BPB (must beat by ≥0.005) |
| 4-hour unlimited baseline | 1.2074 BPB |
| Challenge window | March 18 → April 30, 2026 |
| Repo | https://github.com/newjordan/parameter-golf |
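
For reference, bits-per-byte can be derived from summed cross-entropy. The helper below is an illustrative sketch of that conversion (the challenge's official eval script may differ in details such as windowing):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert cross-entropy summed over all predicted tokens (in nats)
    into bits per byte of the underlying validation text."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Hypothetical numbers: 1M tokens at mean loss 3.40 nats over ~4.33M bytes
print(round(bits_per_byte(3.40e6, 4_330_000), 2))  # 1.13
```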

---

## Our Approach: Fractal Transformer + Gravity + AttnRes

### Core Thesis

Weight-shared transformer layers with learned gravitational auxiliary losses
and attention residuals will achieve lower BPB than the baseline's 9-unique-layer
architecture within the same 16MB parameter budget.

### Three Innovations Combined

**1. Fractal Architecture (Weight Sharing / Depth Recurrence)**

Instead of 9 unique layers, use 3 unique layers repeated in 3 loops.

```
CURRENT BASELINE:
  9 unique layers × 512 dim = ~14M params

OUR APPROACH:
  3 unique layers × 3 loops = 9 effective layers
  Wider layers (~700 dim) with same total param count
  Loop position embedding tells shared weights which pass they're on
```

Why this helps:
- Fewer unique parameters → more room in 16MB budget → wider layers
- Wider layers = richer features per layer
- Weight sharing compresses extremely well under int8+zlib
- Depth recurrence explicitly encouraged by the challenge README
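
A minimal sketch of the loop structure, using illustrative names (`FractalStack`, with stock `nn.TransformerEncoderLayer` blocks standing in for the real attention+MLP blocks):

```python
import torch
import torch.nn as nn

class FractalStack(nn.Module):
    """Sketch: 3 unique blocks unrolled over 3 loops = 9 effective layers."""
    def __init__(self, dim: int = 704, n_blocks: int = 3, n_loops: int = 3):
        super().__init__()
        # Only n_blocks sets of weights are stored, however deep we unroll.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, dim_feedforward=4 * dim,
                                       batch_first=True)
            for _ in range(n_blocks))
        # One learned vector per loop: tells the shared weights which pass it is.
        self.loop_pos = nn.Parameter(torch.zeros(n_loops, dim))
        self.n_loops = n_loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for loop in range(self.n_loops):
            h = x + self.loop_pos[loop]        # mark which pass this is
            for block in self.blocks:          # same weights on every loop
                h = block(h)
            x = h
        return x

m = FractalStack(dim=64)
out = m(torch.randn(2, 16, 64))   # (batch, seq, dim) in and out
```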

**2. Gravity (Learned Auxiliary Losses)**

At the end of each loop, peek at the output using the shared lm_head and
compute an auxiliary cross-entropy loss. The weights are LEARNED parameters.

```python
# 3 learned loop-loss weights; softplus keeps them positive during training
self.gravity_weights = nn.Parameter(torch.tensor([0.1, 0.3, 1.0]))

total_loss = 0
for loop in range(3):
    x = run_shared_layers(x, loop_pos=loop)       # same shared blocks each pass
    loop_logits = lm_head(rms_norm(x))            # peek with the shared lm_head
    loop_loss = F.cross_entropy(loop_logits, targets)
    total_loss += F.softplus(self.gravity_weights[loop]) * loop_loss
```

Why this helps:
- 3× gradient signal — every layer gets direct supervision, not diluted backprop
- Model discovers optimal loop weighting during training
- Especially powerful with weight sharing: same weights receive gradient from 3 depths
- Essentially zero new parameters (just 3 scalar weights; reuses the existing lm_head)
- ~1.2% compute overhead (2 extra lm_head calls)

The "gravity" analogy:
- Loop 1 output is far from the target → strong pull, large updates
- Loop 2 is closer → medium pull, refinement
- Loop 3 is nearest → full weight, precision
- Each loop starts from a better position because the previous loop was already pulled toward the answer

**3. AttnRes (Attention Residuals)**

Replace fixed skip connections with learned, input-dependent attention over depth.
From Moonshot's paper (arxiv:2603.15031).

```
Standard residuals: x = x + layer_output (fixed, uniform weight)
AttnRes: x = softmax(query · [prev_outputs]) · [prev_outputs]
```

Each layer has a single learned query vector w_l ∈ R^d that attends over all
previous loop outputs. The softmax produces content-aware, input-dependent
weights instead of fixed uniform accumulation.

Why this helps:
- Paper shows 1.25× compute equivalent for near-zero parameter cost
- Replaces BOTH the baseline's U-Net skips AND resid_mix
- Only 9 × dim ≈ 4,608 new parameters
- Critical for weight sharing: lets later loops selectively reference earlier loops
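
The depth attention above can be sketched as follows. The class name and einsum layout are illustrative, not the paper's reference implementation:

```python
import torch
import torch.nn as nn

class AttnRes(nn.Module):
    """One learned query vector per layer attends over all previous outputs."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim) / dim ** 0.5)  # w_l in R^d

    def forward(self, history: list) -> torch.Tensor:
        # history: tensors of shape (batch, seq, dim) — embedding + earlier loops
        stack = torch.stack(history)                    # (k, batch, seq, dim)
        scores = torch.einsum("kbsd,d->kbs", stack, self.query)
        weights = torch.softmax(scores, dim=0)          # per-token mix over depth
        return torch.einsum("kbs,kbsd->bsd", weights, stack)

ar = AttnRes(dim=64)
history = [torch.randn(2, 16, 64) for _ in range(3)]    # embedding + 2 loop outputs
mixed = ar(history)                                     # (2, 16, 64)
```

Because the softmax runs over the history axis per token, each position gets its own blend of earlier representations rather than a fixed uniform sum.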

### What We Remove From Baseline

| Component | Parameters | Replaced By |
|-----------|-----------|-------------|
| U-Net encoder/decoder split | structural | Fractal loops |
| skip_weights (9 × 512) | 4,608 | AttnRes queries |
| resid_mix (9 × 2 × 512) | 9,216 | AttnRes |
| **Total removed** | **~13,824** | |

### What We Add

| Component | Parameters | Purpose |
|-----------|-----------|---------|
| AttnRes queries (9 layers) | 4,608 | Selective depth attention |
| Loop position embeddings (3 loops) | ~2,100 | Tell weights which loop they're in |
| Gravity weights (3 scalars) | 3 | Learned auxiliary loss weighting |
| **Total added** | **~6,711** | |

**Net: ~7,113 parameters saved → reinvested into wider layers.**
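
A quick recomputation of the two tables (using dim=512 for the removed vectors and ~700 dim for the loop position embeddings, as the tables themselves do):

```python
dim = 512
removed = 9 * dim + 9 * 2 * dim          # skip_weights + resid_mix
added = 9 * dim + 3 * 700 + 3            # AttnRes queries + loop embeds (~700d) + gravity scalars
print(removed, added, removed - added)   # 13824 6711 7113
```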

---

## Architecture Diagram

```
INPUT TOKENS (1024 vocab)
EMBEDDING (1024 × ~700 dim)
LOOP 1 (broad strokes):
  ├── Layer A (attention + MLP, loop_pos=0)
  ├── Layer B (attention + MLP, loop_pos=0)
  ├── Layer C (attention + MLP, loop_pos=0)
  ├── GRAVITY: peek → compute loss₁ (learned weight ~0.1)
  └── Store loop 1 output for AttnRes
LOOP 2 (refinement):
  ├── AttnRes: attend over [embedding, loop1_output]
  ├── Layer A (attention + MLP, loop_pos=1) ← same weights as loop 1
  ├── Layer B (attention + MLP, loop_pos=1)
  ├── Layer C (attention + MLP, loop_pos=1)
  ├── GRAVITY: peek → compute loss₂ (learned weight ~0.3)
  └── Store loop 2 output for AttnRes
LOOP 3 (precision):
  ├── AttnRes: attend over [embedding, loop1_output, loop2_output]
  ├── Layer A (attention + MLP, loop_pos=2) ← same weights again
  ├── Layer B (attention + MLP, loop_pos=2)
  ├── Layer C (attention + MLP, loop_pos=2)
  └── FINAL LOSS: full cross-entropy (weight = 1.0)
OUTPUT: logits → BPB
```

Each loop tightens the representation:
- Loop 1: rough sketch (only sees embedding)
- Loop 2: refinement (sees embedding + loop 1 output via AttnRes)
- Loop 3: precision (sees full history, committed to answer)

---

## Information Tightening Mechanisms

### Gravity (primary — Frosty's intuition)
Each loop is pulled toward the final answer by its own loss signal. Later loops
start from better positions because earlier loops were already course-correcting.
The model learns how hard each loop should pull (learned gravity weights).

### AttnRes (secondary — from Moonshot paper)
Selective attention over previous loop outputs. Later loops can choose which
earlier representations are useful for each specific token, not a fixed blend.

### Future: Ring Buffer + Temperature Cooling (Phase 4)
- Ring buffer: bounded memory with eviction of unhelpful previous states
- Temperature: AttnRes attention sharpens with depth (soft early, committed late)
- Only add if Phase 1-3 show signal

---

## Experiment Sequence

### Phase 1: Establish Weight Sharing Baselines
1. Run baseline as-is → establish local BPB reference
2. 3 shared layers × 3 loops, same total params, ~512 dim → does sharing work?
3. 3 shared layers × 3 loops, wider ~700 dim → does width help?
4. 2 shared layers × 4 loops, widest ~850 dim → more loops?
5. 4 shared layers × 2 loops, ~620 dim → fewer loops?

### Phase 2: Add Gravity
6. Best config from Phase 1 + gravity with learned weights
7. Compare: gravity learned vs gravity fixed [0.1, 0.3, 1.0] vs no gravity

### Phase 3: Add AttnRes
8. Best from Phase 2 + full AttnRes
9. Test: AttnRes before attention only / before MLP only / both
10. Test: AttnRes with vs without gravity

### Phase 4: Advanced Mechanisms
11. Add ring buffer (bounded memory with eviction)
12. Add temperature cooling on AttnRes
13. Try combining all mechanisms

### Phase 5: Optimize for Submission
14. Verify int8+zlib artifact ≤16MB
15. Tune width to maximize quality within size budget
16. Port winning config to official train_gpt.py style
17. Run on cloud 8×H100, verify 10-minute timing
18. Prepare submission folder for /records
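
Step 14 can be approximated offline with a rough quantize-and-compress helper. The scheme here (symmetric per-tensor int8 with float32 scales, then one zlib pass) is an assumption for estimation, not the official packing format:

```python
import io
import zlib
import numpy as np

def artifact_mb(tensors: dict, level: int = 9) -> float:
    """Rough artifact size estimate: per-tensor symmetric int8 quantization,
    then zlib over the concatenated bytes (one float32 scale per tensor)."""
    buf = io.BytesIO()
    for name, w in tensors.items():
        scale = np.abs(w).max() / 127.0 + 1e-12          # symmetric range
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        buf.write(np.float32(scale).tobytes())
        buf.write(q.tobytes())
    return len(zlib.compress(buf.getvalue(), level)) / 2**20

# Toy example: a single 1024x700 embedding-sized matrix
weights = {"emb": np.random.randn(1024, 700).astype(np.float32)}
print(round(artifact_mb(weights), 3))  # well under 16 MB for this toy dict
```

Because shared weights are stored once, the weight-sharing design should compress noticeably better than 9 unique layers under this kind of pass.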

---

## Workflow

### Local (DGX Spark, free, unlimited)
- Adapted research fork without Triton/torch.compile dependency
- Shorter training budget (2 min per experiment)
- Smaller batch size
- Same model, data, tokenizer, BPB metric
- Results won't match H100 numbers but relative ordering transfers
- Run 50-100 experiments to find winning configuration
- Autoresearch agent runs overnight (Phase 1-4)

### Cloud (H100s, paid, limited)
- Take best configuration from local experiments
- Run at full scale: 8×H100, 10 minutes, full batch
- Verify BPB, artifact size, timing
- Prepare official submission

---

## Source Material

### Attention Residuals (Moonshot)
- Paper: arxiv:2603.15031
- Repo: https://github.com/MoonshotAI/Attention-Residuals
- Core: replace fixed residual connections with softmax attention over depth
- Result: matches 1.25× compute baseline at near-zero parameter cost

### Autoresearch (Karpathy)
- Repo: https://github.com/karpathy/autoresearch
- Core: AI agent modifies train.py, trains 5 min, keeps/discards, loops forever
- Adapted as our outer optimization loop

### Parameter Golf Baseline
- Repo: https://github.com/openai/parameter-golf
- Architecture: 9-layer GPT, 512 dim, 1024 vocab, GQA, Muon optimizer
- Key features: U-Net skip connections, resid_mix, ReLU², logit softcapping
- BPB: 1.2244 (10 min), 1.2074 (4 hour)

---

## Key Insight

The competition rewards compression quality per parameter. Weight sharing is
the ultimate compression — the same function applied repeatedly. AttnRes gives
that repeated function the ability to selectively reference its earlier outputs.
Gravity ensures every repetition is actively pulled toward the correct answer.

The fractal structure means each loop genuinely tightens the representation:
same weights, progressively richer input, direct loss supervision at every
stage. The model isn't just repeating — it's refining.

---

*Plan authored by Octavian + Frosty · Spark-2949 · 2026-03-18*