119 commits
1b3a8f4
docs: fractal transformer research plan — weight sharing + gravity + …
Mar 18, 2026
2537179
results: first local ladder — fractal 3x3 beats baseline by 7.1% BPB,…
Mar 19, 2026
203e59d
Add exact clone of PR #254 — best pending SOTA (1.1313 BPB)
Mar 21, 2026
caf1fc7
Add XSA last 3 layers to #254 SOTA clone
Mar 21, 2026
2342712
Fix XSA GQA broadcast bug — expand KV heads before manual attention
Mar 21, 2026
0fa4ea3
Add 3 SOTA improvement experiments: MTP, SwiGLU, Vocab1536
Mar 21, 2026
d5f2207
Add FarnsworthEngine v2: full improvement stack on SOTA254 base
Mar 21, 2026
fe35f95
Add FA3→FA2→SDPA fallback chain for pod restart resilience
Mar 21, 2026
2aa5315
Revert FA3 fallback chain — was unauthorized code change to baseline …
Mar 21, 2026
cb95d03
Fix FA3 NaN: cast qkv to bf16 before FA3 call, disable dynamo DDP opt
Mar 21, 2026
960ada4
Add 2-seed validation scripts for exp A/B/C
Mar 21, 2026
abdfdea
Log exp A/B results: both behind baseline, zlib fallback bug found
Mar 22, 2026
4ef403f
Fix XSA NaN: position 0 has no valid targets when self-mask + causal …
Mar 22, 2026
b356a91
Disable XSA in ttt_only run — manual attention too slow vs FA3
Mar 22, 2026
c8bfd59
Add run_v2_ttt_noXSA.sh — TTT v2 + temp scaling, all FA3, max speed
Mar 22, 2026
12f37dc
Restore XSA_LAST_N=3 in run_v2_ttt_only.sh (keep existing test intact)
Mar 22, 2026
0c4d861
Log v2 TTT-only + XSA=3 result: 1.1982 BPB (worse than 1.1301 baseline)
Mar 22, 2026
ecc4755
Strip verbose logging from v2 train loop — match baseline format
Mar 22, 2026
7a11286
Log v2 noXSA result: 1.1538/1.1315 BPB — TTT v2 hurt, no edge over ba…
Mar 22, 2026
f56f695
Log exp_a/b/c results: all worse than 1.1301 baseline, exp_c never ran
Mar 22, 2026
35110f9
Add exp D: TTT 8 epochs + stride 32 (eval-only improvement)
Mar 22, 2026
07f0bba
Add SAM (Sharpness-Aware Minimization) option for TTT
Mar 22, 2026
4b100ac
Add baseline reproduction script — verify 1.1303 on current FA3 build
Mar 22, 2026
4bfbe60
Add SAM to baseline TTT — test sharpness-aware adaptation on proven code
Mar 22, 2026
723d095
Log exp D result: 1.1295 BPB — new best (-0.0008 vs baseline)
Mar 22, 2026
2bf6171
Log exp D seed 42: 1.1307 BPB — confirms improvement (mean 1.1301)
Mar 22, 2026
a1b9703
Add exp_d SAM variant — TTT 8ep + stride 32 + sharpness-aware TTT
Mar 22, 2026
ef4dc44
Log exp D seed 7: 1.1313 BPB but 16.18 MB — over size limit
Mar 22, 2026
7429285
Add Partial RoPE + LN Scale (from PR #315) to sota254 + run_sam
Mar 22, 2026
32de9e9
Add exp_d/run_sam_clean.sh — pure SAM A/B test, no other changes
Mar 22, 2026
3cc6006
Log exp D seeds 7+137: both over size limit
Mar 22, 2026
aa4b1c0
Add Sponge Bath experiment: TTT 8ep + stride 32 eval-only improvement
Mar 22, 2026
f9ef363
Add PR#315 clone with TTT 8ep + run script
Mar 22, 2026
8dd0e89
Log D+SAM+PR315tricks: 1.1274 BPB new best; add SAM to pr315 run script
Mar 22, 2026
719b69d
Log PR315+TTT results: 1.1240 BPB (invalidated — TTT now banned)
Mar 22, 2026
72573d9
Add PR374 enchilada experiment: 12L/2KV/2.75xMLP + train@1024 + EMA
Mar 22, 2026
6f88e84
Add RunPod 8xH100 setup guide — every gotcha we've hit 3 times
Mar 22, 2026
b27e134
Add fractal cadence experiment: F/N alternation pattern
Mar 22, 2026
63714e4
Add PR374-safe: EMA + earlier QAT + longer warmdown only, shape uncha…
Mar 22, 2026
b34c3de
Add PR374-depth: 12L/4KV/2.625xMLP (same params, +1 layer) + EMA + QA…
Mar 22, 2026
7fba755
Add minified train_gpt (30KB vs 74KB) with EMA+SWA edge
Mar 22, 2026
48711f6
Fix log0 keyword arg bug in minified script
Mar 22, 2026
c328493
Fix Block kwarg: layer_idx -> li in minified script
Mar 22, 2026
e190db5
Fix a.te -> a.tie_embeddings in minified script
Mar 22, 2026
59f6d3d
Add pr374_submit: trimmed winner for submission
Mar 22, 2026
038b2e7
Remove duplicate stride=64 eval from submit script
Mar 22, 2026
8595a9a
Revert "Remove duplicate stride=64 eval from submit script"
Mar 22, 2026
e8692f5
Add fractal cadence H100 script: 4 unique × 3 loops + ortho positions
Mar 22, 2026
f0eab08
Add fractal cadence long run (1.6575 BPB @ 3929 steps) + H100 script
Mar 22, 2026
7d93af3
Add pr374_slim: comment-stripped pr374_safe (67KB vs 74KB, -6.4KB)
Mar 22, 2026
594315f
Add fractal cadence auto-research sweep runner
Mar 22, 2026
64a606a
Fix FA3 head_dim limit: 12 heads / 6 kv_heads for dim=768 (head_dim=64)
Mar 22, 2026
545f4d0
Add pr374_ttt: PR374 + EMA + TTT(20ep,SAM) — rock the house
Mar 22, 2026
ee16ade
Fix early QAT trigger: skip warmdown/QAT for first 100 steps
Mar 22, 2026
80958dd
Optimize fractal: tuned H100 defaults + focused sweep grid
Mar 22, 2026
fa6f6d6
Add TTT eval: test-time training on already-graded tokens
Mar 22, 2026
a043ab3
Log depth + TTT results: 1.1223 BPB (12L+TTT20ep+SAM)
Mar 22, 2026
307bc81
Save pr374_throne: our #1 non-TTT record holder (1.1243 BPB)
Mar 22, 2026
c0adc01
Add autonomous overnight fractal optimizer (Karpathy auto-research)
Mar 22, 2026
c8195b6
Rewrite autoresearch: Qwen-guided fractal optimizer via Ollama
Mar 22, 2026
6f1339f
Add 3 leapfrog variants based on PR#414 SOTA (1.1233 BPB)
Mar 22, 2026
2205ab1
Add pod setup script for leapfrog variants (v1/v2/v3)
Mar 22, 2026
f38c563
Fix setup_pod.sh: apply FA3 build fixes from RUNPOD_SETUP.md
Mar 22, 2026
3daeff7
Update fractal H100 with overnight Qwen findings
Mar 22, 2026
4073091
Fix v1 TTT burst: apply EMA first, then QAT-aware sharpening
Mar 22, 2026
92a48d2
Add inner-TTT to fractal eval: recursive weight improvement per loop
Mar 22, 2026
12b5538
v1 combo: burst+QAT before EMA + 15 GPTQ percentiles
Mar 22, 2026
f45beef
Isolate TTT to fractal layers only: blocks + loop_pos + skip_weights
Mar 22, 2026
66ebae7
Add TTT drift gate: snap block weights back toward trained originals
Mar 22, 2026
39c346b
Add v4: burst + distill + train_seq_len=1024 for more steps
Mar 22, 2026
f2ed46e
Add SOTA autoresearch: Qwen-guided edge finding for 1.1233 record
Mar 22, 2026
2ba8b6f
Fix TTT graph bug + upgrade to 3 blocks x 896d (23M params, ~14MB)
Mar 22, 2026
c3b0643
Save leapfrog experiment results: 6 variants tested, v1 wins at 1.12319
Mar 22, 2026
c3eda01
Fractal v2: dim=960 (25.7M, ~15.8MB), Muon LR back to 0.025
Mar 22, 2026
1a06df2
Disable TTT in v2 (backward graph bug), focus on bigger model
Mar 22, 2026
e5834b4
Add v5: QAT percentile fix + TrigramHash + EMA-SWA blend
Mar 22, 2026
9273a98
Fix TTT graph bug: fresh vc + detached inputs per loop
Mar 22, 2026
b73d229
Two v2 scripts: with TTT and without, for A/B comparison
Mar 22, 2026
52c49dc
Fractal v3: MLP 3.0->3.3 fills 16MB budget (27.4M params, ~15.2MB)
Mar 22, 2026
c7dac39
Fractal v4: 1.5x batch (1.18M tokens/step) to use spare VRAM
Mar 22, 2026
c28434e
Log The Frugendorff: fractal cadence baseline at 1.2113 BPB
Mar 22, 2026
8a50d68
Add v6: fractal 6L×2 loops (12 effective) + 480d/16H/4xMLP
Mar 22, 2026
feef78e
Frugendorff v5: TTT warmup-then-freeze to capture 1.19 peak
Mar 22, 2026
b60ee7a
Log all Frugendorff results: v3 TTT peak 1.1901, v4 batch 1.2186
Mar 22, 2026
4d02995
Fix v6: 512d/16H/8KV (head_dim=32, FA3 requires multiple of 8)
Mar 22, 2026
cb5acc1
Log v5/v6 results + add fractal sweep script
Mar 22, 2026
26bb918
Frugendorff 8-hour single GPU longrun
Mar 22, 2026
4394e44
Complete Frugendorff documentation: all runs, findings, architecture …
Mar 23, 2026
1633685
Add v7: legal score-first TTT eval (PR #461 recipe)
Mar 23, 2026
d03b8aa
Add v8: mutual learning — flat GPT + fractal GPT co-training
Mar 23, 2026
7e42a13
v7: add TTT early stopping — train only first 60 chunks
Mar 23, 2026
4ead1ee
Update v6 to sweep winner: 4×3/512d/8H/4KV/4xMLP
Mar 23, 2026
d3b1e2f
Log v7 TTT results: early stop at 60 chunks = 1.12312 (best)
Mar 23, 2026
961dbbd
The Frugendorff Squared: 6L x 2 loops, dim=640, MLP 4x, 28.5M params
Mar 23, 2026
c51d394
Log higher LR (0.030) + MTP results — both worse
Mar 23, 2026
0263f44
The Frugendorff Squared: 1.1478 BPB — 0.025 from world lead
Mar 23, 2026
d60df79
Stack distillation onto Frugendorff Squared
Mar 23, 2026
868accf
Prep Frugendorff Squared PR submission (draft — needs image)
Mar 23, 2026
132f449
Add train script to Frugendorff submission folder
Mar 23, 2026
5229c66
TTT stabilization: EMA scoring + fixed cosine LR + embed freeze
Mar 23, 2026
f20e387
Add GPTQ quantization: Hessian-aware error compensation for int6
Mar 23, 2026
011e7d9
Stack low-hanging fruit: fix GPTQ, earlier QAT, matched clipping
Mar 23, 2026
988e83b
Add post-quant training burst: repair quant damage before eval
Mar 23, 2026
fcfbd34
Remove post-quant burst — illegal (uses training data during eval)
Mar 23, 2026
286b2ad
Add selective int8 for sensitive layers (GPTQ-int8 upgrade path)
Mar 23, 2026
c96450e
Frugendorff V3: v7 full stack + fractal loops (GPTQ + TTT + PQB)
Mar 23, 2026
ff4a79e
Add AdamW TTT option — PR #462 shows 5x better TTT gain vs SGD
Mar 23, 2026
8fe38f0
Frugendorff V3: SwiGLU + AdamW TTT + 8K bigram (PR#462 techniques)
Mar 23, 2026
98f3e2f
Fix DDP unused params: add find_unused_parameters=True
Mar 23, 2026
c97ad54
SwiGLU + GPTQ: PR #462 architecture with our quant improvements
Mar 23, 2026
03e5f54
Add OptRot: Hadamard rotation before GPTQ quantization
Mar 23, 2026
eb09f7e
Frugendorff V3: Star-ReLU SwiGLU + VRL + MLP 4.0 (fits 16MB)
Mar 23, 2026
7020bf6
Log session results: 1.0763 SwiGLU (size problem), 1.1215 v7 (submitted)
Mar 23, 2026
addbea9
Frugendorff V4: GEPA-matched config for 16MB compression
Mar 23, 2026
5c912b5
Frugendorff V4 A/B: 5x2 and 4x2 variants for H100 calibration
Mar 23, 2026
324843b
Add short TTT experiment: no-EMA, 50 chunks, SGD
Mar 23, 2026
d36c159
Add XSA-all(11) + short TTT run script
Mar 23, 2026
4185795
Record: v7 GPTQ + Short TTT — val_bpb 1.1207 (seed 1337)
Mar 23, 2026
874237b
Frugendorff compression shim on SwiGLU SOTA (1.0763 model)
Mar 23, 2026
49 changes: 49 additions & 0 deletions FRUGENDORFF_PR_DRAFT.md
## PR DRAFT — DO NOT SUBMIT YET (add image first)

### Title:
The Frugendorff Squared: Fractal Weight Sharing + MLP 4x (1.1478 BPB, 15.15MB)

### Body:

## Summary

Non-record submission exploring **fractal weight sharing** — a novel approach where 6 unique transformer blocks are looped 2× each, providing 12 effective layers of depth with only 6 blocks of stored parameters. The freed parameter budget enables **MLP 4x expansion**, which is the primary quality driver.

<!-- ADD YOUR IMAGE HERE -->

- **val_bpb: 1.1478** (sliding window stride=64) | **15.15 MB** | 8xH100 SXM, 600s
- 28.2M params, 4,390 steps at 136.7ms/step
- Full pipeline: Muon + SWA + Late QAT + Training Replay + Self-Distillation + EMA

## Key Insight

MLP 4x gives ~2% relative BPB improvement over MLP 3x, but doesn't fit in 16MB with 12 unique layers. Fractal weight sharing (6 unique × 2 loops) fits it in 15.15 MB. The weight sharing is the compression technique; the MLP 4x is the quality lever.
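A back-of-the-envelope count supports this claim (our own estimate, assuming bias-free projections and ignoring embeddings and norms):

```python
# Rough per-block parameter count for the config above:
# dim=640, 10 heads / 5 KV (GQA), head_dim=64, MLP hidden=2560.
dim, kv_heads, head_dim, hidden = 640, 5, 64, 2560

attn = dim * dim                    # q projection
attn += 2 * dim * kv_heads * head_dim  # k and v projections (GQA)
attn += dim * dim                   # output projection
mlp = 2 * dim * hidden              # up + down projections
block = attn + mlp                  # ~4.5M params per block

unique12 = 12 * block               # 12 unique layers: blows the int8 budget
shared6 = 6 * block                 # 6 unique blocks x 2 loops: half the storage
print(f"{unique12/1e6:.1f}M vs {shared6/1e6:.1f}M stored params")
```

The shared count plus the small embedding table lands near the stated 28.2M, while 12 unique layers would store roughly twice the weights.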

## Architecture

- 6 unique blocks × 2 loops = 12 effective depth
- dim=640, 10 heads, 5 KV (GQA), head_dim=64
- MLP 4x (hidden=2560), relu-squared
- Orthogonal loop positions, U-Net skips, SmearGate, BigramHash, VE128, XSA

## No TTT on validation data

All training uses training data only: the late replay stage re-uses buffered training batches, and self-distillation uses an EMA teacher on training data. Fully compliant with issue #402.

## Test plan

- [x] 8xH100 SXM, 600s
- [x] Artifact under 16MB (15.15 MB)
- [x] No TTT on validation data (per issue #402)
- [x] Post-quant roundtrip verified
- [x] Sliding window eval (stride=64)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---
### Command to create PR (after adding image):
```sh
gh pr create --repo openai/parameter-golf \
--title "The Frugendorff Squared: Fractal Weight Sharing + MLP 4x (1.1478 BPB, 15.15MB)" \
--body "$(cat FRUGENDORFF_PR_BODY.md)"
```
269 changes: 269 additions & 0 deletions PLAN.md
# Parameter Golf — Fractal Transformer Research Plan
**DGX Spark · GB10 · March 2026**

---

## Challenge Summary

| Constraint | Value |
|------------|-------|
| Artifact size | ≤16MB (code + int8 quantized + zlib compressed weights) |
| Training time | ≤10 minutes on 8×H100 |
| Metric | bits-per-byte (BPB) on FineWeb validation set |
| Baseline | 1.2244 BPB |
| Record threshold | ≤1.2194 BPB (must beat by ≥0.005) |
| 4-hour unlimited baseline | 1.2074 BPB |
| Challenge window | March 18 → April 30, 2026 |
| Repo | https://github.com/newjordan/parameter-golf |

---
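For reference, bits-per-byte is mean cross-entropy converted from nats to bits and normalized by raw bytes rather than tokens; a small sketch (the helper name is ours):

```python
import math

def bits_per_byte(ce_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits-per-byte.

    Total nats become total bits (divide by ln 2), then we normalize by
    the number of raw bytes in the evaluated text instead of by tokens.
    """
    total_bits = ce_loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes
```

With a 1024-entry vocab, tokens cover only a few bytes each, so BPB is noticeably lower than bits-per-token.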

## Our Approach: Fractal Transformer + Gravity + AttnRes

### Core Thesis

Weight-shared transformer layers with learned gravitational auxiliary losses
and attention residuals will achieve lower BPB than the baseline's 9-unique-layer
architecture within the same 16MB parameter budget.

### Three Innovations Combined

**1. Fractal Architecture (Weight Sharing / Depth Recurrence)**

Instead of 9 unique layers, use 3 unique layers repeated in 3 loops.

```
CURRENT BASELINE:
9 unique layers × 512 dim = ~14M params

OUR APPROACH:
3 unique layers × 3 loops = 9 effective layers
Wider layers (~700 dim) with same total param count
Loop position embedding tells shared weights which pass they're on
```

Why this helps:
- Fewer unique parameters → more room in 16MB budget → wider layers
- Wider layers = richer features per layer
- Weight sharing compresses extremely well under int8+zlib
- Depth recurrence explicitly encouraged by the challenge README
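
The loop structure above can be sketched end to end. A minimal numpy toy (names like `loop_pos_emb` and the single-matrix "block" are illustrative, not from the repo):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_unique, n_loops, seq = 16, 3, 3, 8

# 3 unique "blocks" (toy: one weight matrix each) reused in every loop
blocks = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(n_unique)]
# loop position embedding: tells the shared weights which pass they are on
loop_pos_emb = rng.standard_normal((n_loops, dim)) * 0.02

x = rng.standard_normal((seq, dim))
for loop in range(n_loops):           # 3 loops ...
    h = x + loop_pos_emb[loop]        # inject the loop position
    for W in blocks:                  # ... over the SAME 3 blocks
        h = h + np.maximum(h @ W, 0.0)  # toy residual block (relu)
    x = h
# stored: 3 blocks of weights; effective depth seen by the input: 9 layers
```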

**2. Gravity (Learned Auxiliary Losses)**

At the end of each loop, peek at the output using the shared lm_head and
compute an auxiliary cross-entropy loss. The weights are LEARNED parameters.

```python
# Learned per-loop loss weights; softplus keeps them positive during training
self.gravity_weights = nn.Parameter(torch.tensor([0.1, 0.3, 1.0]))

total_loss = 0.0
for loop in range(3):
    x = run_shared_layers(x, loop_pos=loop)       # same shared weights each pass
    loop_logits = lm_head(rms_norm(x))            # reuse the shared lm_head to peek
    loop_loss = cross_entropy(loop_logits, targets)
    total_loss += softplus(self.gravity_weights[loop]) * loop_loss
```

Why this helps:
- 3× gradient signal — every layer gets direct supervision, not diluted backprop
- Model discovers optimal loop weighting during training
- Especially powerful with weight sharing: same weights receive gradient from 3 depths
- Zero new parameters (3 scalars for weights, reuses existing lm_head)
- ~1.2% compute overhead (2 extra lm_head calls)

The "gravity" analogy:
- Loop 1 output is far from the target → strong pull, large updates
- Loop 2 is closer → medium pull, refinement
- Loop 3 is nearest → full weight, precision
- Each loop starts from a better position because the previous loop was already pulled toward the answer

**3. AttnRes (Attention Residuals)**

Replace fixed skip connections with learned, input-dependent attention over depth.
From Moonshot's paper (arxiv:2603.15031).

```
Standard residuals: x = x + layer_output (fixed, uniform weight)
AttnRes: x = softmax(query · [prev_outputs]) · [prev_outputs]
```

Each layer has a single learned query vector w_l ∈ R^d that attends over all
previous loop outputs. The softmax produces content-aware, input-dependent
weights instead of fixed uniform accumulation.

Why this helps:
- Paper shows 1.25× compute equivalent for near-zero parameter cost
- Replaces BOTH the baseline's U-Net skips AND resid_mix
- Only 9 × dim ≈ 4,608 new parameters
- Critical for weight sharing: lets later loops selectively reference earlier loops
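
The mechanism described above can be written in a few lines of numpy (function and variable names are ours, a sketch rather than the paper's implementation):

```python
import numpy as np

def attn_res(query: np.ndarray, prev_outputs: list) -> np.ndarray:
    """Attention over depth: one learned query vector scores each previous
    loop output per token, softmax over the depth axis, weighted sum.

    query: (dim,); prev_outputs: list of (seq, dim) arrays.
    """
    H = np.stack(prev_outputs)            # (depth, seq, dim)
    scores = H @ query                    # (depth, seq) -- input-dependent
    w = np.exp(scores - scores.max(0))    # softmax over depth, per token
    w = w / w.sum(0)
    return np.einsum("ds,dsv->sv", w, H)  # (seq, dim) blended residual

rng = np.random.default_rng(0)
dim, seq = 16, 8
q = rng.standard_normal(dim)              # the layer's single learned query
outs = [rng.standard_normal((seq, dim)) for _ in range(3)]
mixed = attn_res(q, outs)                 # replaces x = x + layer_output
```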

### What We Remove From Baseline

| Component | Parameters | Replaced By |
|-----------|-----------|-------------|
| U-Net encoder/decoder split | structural | Fractal loops |
| skip_weights (9 × 512) | 4,608 | AttnRes queries |
| resid_mix (9 × 2 × 512) | 9,216 | AttnRes |
| **Total removed** | **~13,824** | |

### What We Add

| Component | Parameters | Purpose |
|-----------|-----------|---------|
| AttnRes queries (9 layers) | 4,608 | Selective depth attention |
| Loop position embeddings (3 loops) | ~2,100 | Tell weights which loop they're in |
| Gravity weights (3 scalars) | 3 | Learned auxiliary loss weighting |
| **Total added** | **~6,711** | |

**Net: ~7,113 parameters saved → reinvested into wider layers.**
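
The bookkeeping in the two tables checks out (the tables use dim=512 for the removed and AttnRes counts, and ~700 dim for the loop position embeddings):

```python
# Removed (baseline: 9 layers at dim=512)
skip_weights = 9 * 512          # 4,608
resid_mix = 9 * 2 * 512         # 9,216
removed = skip_weights + resid_mix

# Added
attnres_queries = 9 * 512       # 4,608
loop_pos_emb = 3 * 700          # ~2,100 (3 loops at ~700 dim)
gravity = 3                     # 3 learned scalars
added = attnres_queries + loop_pos_emb + gravity

print(removed, added, removed - added)   # 13824 6711 7113
```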

---

## Architecture Diagram

```
INPUT TOKENS (1024 vocab)
EMBEDDING (1024 × ~700 dim)
LOOP 1 (broad strokes):
├── Layer A (attention + MLP, loop_pos=0)
├── Layer B (attention + MLP, loop_pos=0)
├── Layer C (attention + MLP, loop_pos=0)
├── GRAVITY: peek → compute loss₁ (learned weight ~0.1)
└── Store loop 1 output for AttnRes
LOOP 2 (refinement):
├── AttnRes: attend over [embedding, loop1_output]
├── Layer A (attention + MLP, loop_pos=1) ← same weights as loop 1
├── Layer B (attention + MLP, loop_pos=1)
├── Layer C (attention + MLP, loop_pos=1)
├── GRAVITY: peek → compute loss₂ (learned weight ~0.3)
└── Store loop 2 output for AttnRes
LOOP 3 (precision):
├── AttnRes: attend over [embedding, loop1_output, loop2_output]
├── Layer A (attention + MLP, loop_pos=2) ← same weights again
├── Layer B (attention + MLP, loop_pos=2)
├── Layer C (attention + MLP, loop_pos=2)
└── FINAL LOSS: full cross-entropy (weight = 1.0)
OUTPUT: logits → BPB
```

Each loop tightens the representation:
- Loop 1: rough sketch (only sees embedding)
- Loop 2: refinement (sees embedding + loop 1 output via AttnRes)
- Loop 3: precision (sees full history, committed to answer)

---

## Information Tightening Mechanisms

### Gravity (primary — Frosty's intuition)
Each loop is pulled toward the final answer by its own loss signal. Later loops
start from better positions because earlier loops were already course-correcting.
The model learns how hard each loop should pull (learned gravity weights).

### AttnRes (secondary — from Moonshot paper)
Selective attention over previous loop outputs. Later loops can choose which
earlier representations are useful for each specific token, not a fixed blend.

### Future: Ring Buffer + Temperature Cooling (Phase 4)
- Ring buffer: bounded memory with eviction of unhelpful previous states
- Temperature: AttnRes attention sharpens with depth (soft early, committed late)
- Only add if Phase 1-3 show signal

---

## Experiment Sequence

### Phase 1: Establish Weight Sharing Baselines
1. Run baseline as-is → establish local BPB reference
2. 3 shared layers × 3 loops, same total params, ~512 dim → does sharing work?
3. 3 shared layers × 3 loops, wider ~700 dim → does width help?
4. 2 shared layers × 4 loops, widest ~850 dim → more loops?
5. 4 shared layers × 2 loops, ~620 dim → fewer loops?

### Phase 2: Add Gravity
6. Best config from Phase 1 + gravity with learned weights
7. Compare: gravity learned vs gravity fixed [0.1, 0.3, 1.0] vs no gravity

### Phase 3: Add AttnRes
8. Best from Phase 2 + full AttnRes
9. Test: AttnRes before attention only / before MLP only / both
10. Test: AttnRes with vs without gravity

### Phase 4: Advanced Mechanisms
11. Add ring buffer (bounded memory with eviction)
12. Add temperature cooling on AttnRes
13. Try combining all mechanisms

### Phase 5: Optimize for Submission
14. Verify int8+zlib artifact ≤16MB
15. Tune width to maximize quality within size budget
16. Port winning config to official train_gpt.py style
17. Run on cloud 8×H100, verify 10-minute timing
18. Prepare submission folder for /records
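
Step 14 can be sanity-checked locally with a rough harness (per-tensor absmax int8 followed by zlib level 9 is our assumption about the artifact recipe, not the official script):

```python
import io
import zlib
import numpy as np

def artifact_mb(state: dict) -> float:
    """Approximate artifact size: per-tensor absmax int8 quantization,
    then zlib level 9 over the concatenated bytes, reported in MB."""
    buf = io.BytesIO()
    for w in state.values():
        scale = float(np.abs(w).max()) / 127.0 or 1.0   # avoid div-by-zero
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        buf.write(q.tobytes())
    return len(zlib.compress(buf.getvalue(), level=9)) / 2**20

rng = np.random.default_rng(0)
toy = {"blocks": rng.standard_normal((3, 450_000)).astype(np.float32)}
print(f"{artifact_mb(toy):.2f} MB")   # Gaussian int8 compresses only modestly
```

For the real check, load the trained state dict (as numpy arrays) in place of `toy`.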

---

## Workflow

### Local (DGX Spark, free, unlimited)
- Adapted research fork without Triton/torch.compile dependency
- Shorter training budget (2 min per experiment)
- Smaller batch size
- Same model, data, tokenizer, BPB metric
- Results won't match H100 numbers but relative ordering transfers
- Run 50-100 experiments to find winning configuration
- Autoresearch agent runs overnight (Phase 1-4)

### Cloud (H100s, paid, limited)
- Take best configuration from local experiments
- Run at full scale: 8×H100, 10 minutes, full batch
- Verify BPB, artifact size, timing
- Prepare official submission

---

## Source Material

### Attention Residuals (Moonshot)
- Paper: arxiv:2603.15031
- Repo: https://github.com/MoonshotAI/Attention-Residuals
- Core: replace fixed residual connections with softmax attention over depth
- Result: matches 1.25× compute baseline at near-zero parameter cost

### Autoresearch (Karpathy)
- Repo: https://github.com/karpathy/autoresearch
- Core: AI agent modifies train.py, trains 5 min, keeps/discards, loops forever
- Adapted as our outer optimization loop

### Parameter Golf Baseline
- Repo: https://github.com/openai/parameter-golf
- Architecture: 9-layer GPT, 512 dim, 1024 vocab, GQA, Muon optimizer
- Key features: U-Net skip connections, resid_mix, ReLU², logit softcapping
- BPB: 1.2244 (10 min), 1.2074 (4 hour)

---

## Key Insight

The competition rewards compression quality per parameter. Weight sharing is
the ultimate compression — the same function applied repeatedly. AttnRes gives
that repeated function the ability to selectively reference its earlier outputs.
Gravity ensures every repetition is actively pulled toward the correct answer.

The fractal structure means each loop genuinely tightens the representation:
same weights, progressively richer input, direct loss supervision at every
stage. The model isn't just repeating — it's refining.

---

*Plan authored by Octavian + Frosty · Spark-2949 · 2026-03-18*