GPTQ + Early QAT + Legal TTT — 3-seed mean val_bpb 1.1215 by newjordan · Pull Request #578 · openai/parameter-golf

newjordan · 2026-03-23T22:13:56Z

Summary

3-seed mean val_bpb: 1.1215 (std: 0.0008)
Best seed: 1.1206 (seed 1337)
Artifact: 15.56 MB (int6+zstd-22)
Training: 600s on 8xH100 SXM | Eval: ~330s

3-Seed Results

Seed	val_bpb
1337	1.12059684
42	1.12178348
7	1.12214237
Mean	1.12150756
Std	0.00082

Key Innovations

1. GPTQ Quantization (biggest contributor: −0.0027 BPB)

Replaces naive per-row int6 quantization with GPTQ (Hessian-aware error compensation):

256-sample calibration on training data → per-layer H = X^T X
Optimal per-row scales via 5-percentile search
Column reordering by ascending Hessian diagonal
Block-128 column-wise quantization with Cholesky-factored error compensation
Quant tax reduced from 0.0082 to 0.0058 BPB (32% reduction)

2. Early QAT with Matched Clipping

QAT threshold 0.15→0.5: ~1750 QAT steps instead of ~521 (3x more)
STE uses 99.95th percentile clipping (matches GPTQ export quantizer)

3. Legal Score-First TTT with EMA + Cosine Fix

EMA scoring (decay=0.995): smoothed weights for eval, raw weights for training
Fixed cosine LR decay over actual training window (200 chunks)
Embedding freeze during TTT (tok_emb, bigram, ve_shared)
SGD + momentum 0.9, 3 epochs/chunk, grad clip 1.0

Architecture

11L/512d/8H/4KV/3xMLP (relu²) with U-Net skip connections, Partial RoPE (16/64), XSA last 4 layers, BigramHash(2048), VE128 on layers 9-10, SmearGate, logit softcap 30, tied embeddings. 26.99M params.

Credits

Base architecture: @signalrush (PR Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233) #414)
Legal TTT recipe: @anantdgoel (PR Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB) #461)
GPTQ: Frantar et al. (2022), adapted for int6 per-row quantization

Run

SEED=1337 torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-03-23_11L_GPTQ_TTT_EMA_QAT_1.1206/train_gpt.py

Requires Flash Attention 3 (Hopper selective build). See RUNPOD_SETUP.md.

🤖 Generated with Claude Code

…AttnRes

… gravity needs more steps

@timowhite88

11L Int6 MLP3x + SmearGate + BigramHash + OrthoInit + TTT SGD 3ep Exact reproduction of @timowhite88's FarnsworthEngine recipe. No modifications — run as-is to validate baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

openai#1 untried combination from competition commentary: TTT (from openai#254) + XSA (from openai#265) = estimated 1.117-1.121 BPB XSA_LAST_N=3 excludes self-attention in final 3 layers. Zero extra params, frees attention capacity for cross-token focus. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

8 Q heads with 4 KV heads needs repeat_interleave before matmul. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

exp_a: Multi-Token Prediction (MTP_NUM_HEADS=2, excluded from export) exp_b: SwiGLU MLP replacing ReLU² (hidden=1024, same param count) exp_c: Vocab 1536 tokenizer for better bytes-per-token ratio All based on PR openai#254 SOTA clone (1.1303 BPB). Priority: exp_c first. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TTT v2 (cosine LR decay, discriminative per-layer LR, low momentum 0.3, WD), seq-length curriculum (256→2048), batch warmup (262K→786K), D2Z LR schedule, XSA last 3, temperature scaling, optional Mousse optimizer. Two run scripts: full stack (run_v2.sh) and conservative TTT-only (run_v2_ttt_only.sh). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

flash_attn_interface (FA3 Hopper) → flash_attn (FA2) → torch SDPA. Script never crashes on missing flash-attn. Run scripts attempt pip install on startup if FA3 not found. Applied to both sota254 and sota_v2. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…+ untested v2 Restores all four files to their state at 83efa9c. The FA3→FA2→SDPA fallback was added in response to an environment question and should not have touched application code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

torch.compile can promote tensors to fp32 which hits missing FA3 kernels (disabled at build time). Explicit bf16 cast prevents silent NaN output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

A (MTP): 1.1619 BPB roundtrip — worse than baseline B (SwiGLU): 1.1348 BPB sliding — close but +0.0045 vs baseline Both artifacts over 16MB due to missing zstandard (zlib fallback) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…combine The self-exclusion mask + causal mask leaves position 0 with all -inf, producing NaN from softmax. Fix: don't self-exclude position 0 since it has no other causal targets to attend to. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

XSA_LAST_N=3 was costing ~25% step time due to manual matmul path. Set to 0 to isolate TTT v2 + temp scaling gains at full speed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

XSA manual attention killed step speed, only 4771/9000 steps completed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…seline Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

exp_a MTP: 1.1619, exp_b SwiGLU: 1.1570, exp_c: missing tokenizer data. TTT v1 hurt in both exp_a and exp_b (same pattern as TTT v2). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Same model/artifact as SOTA254 baseline — zero risk. More TTT adaptation (3→8 epochs) and finer sliding window (64→32 stride). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TTT_SAM=1 enables SAM during test-time training. Two forward+backward passes per step: first computes gradient, perturbs weights by rho in gradient direction, then recomputes gradient at the perturbed point. Uses the perturbed gradient to update original weights, seeking flatter minima that generalize better. Motivated by TTT consistently overfitting: loss goes down but eval gets worse across all runs. SAM directly targets this failure mode. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Exact settings from README. If this doesn't reproduce, the FA3 build is the variable, not the code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Same training as the 1.1303 baseline, only change is TTT_SAM=1. SAM seeks flatter minima during test-time training to fix the TTT overfitting pattern (loss down, eval up). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TTT 8 epochs + stride 32. Stride made no difference — all gain from extra TTT adaptation. Same model/artifact, eval-only change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Both seeds beat baseline. TTT 8 epochs is a free win. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Seed 7 compresses worse than 1337/42. BPB improved but artifact exceeds 16 MB cap. Need passing 3rd seed for submission. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ROPE_DIMS=16: apply rotary to 25% of head dims, rest position-free LN_SCALE=1: scale RMSNorm output by 1/sqrt(layer+1) Both env-var gated, default off — existing runs unaffected. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Novel architecture: 3 unique blocks x 4 loops = 12 effective depth, 960d, F/N/N cadence, orthogonal loop positions. Completely original design built from scratch in one session. Gap to SOTA: 0.088 BPB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Qwen research found 12L/480d/16H/4xMLP beats baseline by 2% relative. Problem: 29.5M params won't fit in 16MB. Solution: fractal weight sharing. 6 unique layers × 2 loops = 12 effective depth with ~half the block params. Loop position embeddings differentiate passes through shared blocks. Also fix autoresearch_sota.py parser (val_bpb never captured due to whitespace mismatch in output parsing). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Phase 1: TTT warmup for 1500 windows — adapt shared blocks Phase 2: Freeze weights, standard compiled sliding window eval Tuned for the 1.1901 sweet spot observed in v3 TTT: - TTT_EPOCHS=1 (was 3 — less drift per window) - TTT_LR=5e-5 (was 1e-4 — gentler updates) - TTT_DRIFT=0.05 (was 0.1 — tighter leash) - TTT_STRIDE=128 (was 64 — 2x faster warmup) - TTT_WARMUP_WINDOWS=1500 (freeze after peak adaptation) Same model: 3x4 loops, 960d, MLP 3.3, 27.4M params. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v3 standard: 1.2118 sliding BPB v3 TTT: peaked at 1.1901 at window 1400, then drifted to ~1.205 v4 (1.5x batch): 1.2186 — bigger batch slightly helped v5 (TTT warmup-then-freeze) ready for next run to capture 1.19 peak Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

480d/16H gave head_dim=30 which FA3 rejects. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v5 (QAT fix + trigram + EMA-SWA): 1.12439 — all 3 changes hurt v6 (fractal 6×2, 512d/16H/4xMLP): 1.17566 — dead end sweep_fractal.py: 36-config automated sweep testing fractal vs flat across depths, dims, MLP widths, loop counts, gravity, attnres. ~4 hours on DGX Spark. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

8 hours on 1xH100, eval every ~2 hours, same effective batch as 8-GPU. 3 blocks x 4 loops, dim=960, MLP 3.3, cadence 3. See how far fractal goes with unlimited time. warmdown_iters=15000, grad_accum=8 for 1 GPU. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…analysis Full results table (v1-v5 + longrun), experimental findings on TTT calibration, architecture-as-compression thesis, Qwen overnight sweep results (141 runs), single GPU longrun plateau finding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Our v1 base (1.1232 pre-TTT) + legal TTT should give ~1.1211. PR openai#473 gets 1.1213 from a 1.1234 base — our base is 0.0002 better. TTT recipe: 32K-token chunks, score-first (inference_mode), then train (SGD lr=0.002, momentum=0.9, 3 epochs, freeze blocks 0-1, cosine LR decay across chunks, grad clip 1.0). Removed TTT burst (replaced by legal TTT eval). 1499 lines (under 1500 limit). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Two models train simultaneously, teaching each other via KL divergence: - Model A: 11L/512d/8H/4KV/3xMLP (our SOTA flat architecture) - Model B: 4×3 fractal, 512d/8H/4KV/4xMLP (sweep winner) Each step: both forward on same batch, each gets CE + alpha*KL from the other's soft labels. Architectural diversity prevents collapse. Ensemble logits at eval. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Full 1893-chunk TTT degraded from 1.1233 to 1.1360 (bust). But peaked at chunk 51 with 1.1119 — best score seen all day. Fix: stop training after 60 chunks, keep scoring the rest with the peak-adapted model. TTT_MAX_TRAIN_CHUNKS=60. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previous v6 (6×2/512d/16H/8KV) failed on H100 due to head_dim=32 being slow on FA3 and wrong architecture. New config matches the DGX sweep winner: 4 unique × 3 loops, standard 8H/4KV (head_dim=64), 4xMLP. 12M params. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Full TTT: 1.13599 (degraded). Early stop 60: 1.12312. Gentle TTT (lr=0.0005, 1ep, freeze=4): 1.12328 (identical). Peak at chunk 51: 1.1119 but can't maintain over full val set. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fills the 16MB budget (est ~15.8MB). 33% wider than the 1.1757 run. 6 unique layers x 2 loops = 12 effective depth. dim=640, 10 heads, 5 KV (GQA), head_dim=64 (FA3 optimal). MLP 4x + EMA + TTT burst (training data only, issue openai#402 compliant). The Frugendorff: sounds fake, hits different. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Higher LR: 1.12467 TTT (vs 1.12312 at default LR). Qwen signal doesn't transfer Muon. MTP: ~1.16+ base, not enough steps to converge with auxiliary heads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

6 unique blocks x 2 loops = 12 effective depth, dim=640, MLP 4x. Fractal weight sharing enables MLP 4x within 16MB budget (15.15MB). Pre-quant 1.1570, sliding window 1.1478. Gap to SOTA: 0.025 BPB. Missing distillation + quant tightening could recover ~0.012. We frugendorffed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Added self-distillation: EMA as teacher, 50 steps, temp=2.0, alpha=0.7. Runs after TTT burst, before final EMA application. Pipeline: train -> SWA -> QAT -> TTT burst -> distill -> EMA -> quant. Targets ~0.003 BPB improvement from weight smoothing + quant friendliness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Submission folder + README + submission.json + PR draft. Framed as research submission: fractal weight sharing enables MLP 4x. User needs to add image before submitting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three-mechanism fix to capture the 1.1119 chunk-51 peak: 1. Fixed cosine LR decay — denominator changed from num_chunks (1892) to ttt_max_train_chunks. The old schedule was decorative (99.8% of max LR at chunk 51). Now properly decays over the training window. 2. TTT-EMA — maintain exponential moving average of weights during TTT. Score each chunk with smoothed EMA weights, train with raw weights. Decay=0.995 gives ~50-chunk half-life matching the best-performance window. Prevents single-chunk noise from degrading later scores. 3. Embed freezing — freeze tok_emb (tied with lm_head), bigram, and ve_shared during TTT. Small embedding changes have outsized effect on every token; this removes the highest-variance overfitting path. All mechanisms independently toggleable via env vars: TTT_EMA_DECAY=0 # disable EMA TTT_FREEZE_EMBED=0 # disable embed freeze TTT_MAX_TRAIN_CHUNKS=N # control training window (default 200) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replaces naive per-row int6 quantization with GPTQ for attention and MLP weight matrices. GPTQ quantizes columns one at a time, compensating each column's quantization error in the remaining unquantized columns using H = X^T X (Hessian of the layer's input activations). Three new functions: - gptq_calibrate: hooks all linear layers, runs 128 training sequences to collect per-layer Hessians (~5s on H100) - gptq_quantize_weight: column-wise quantization with block-128 error propagation, Cholesky-factored Hessian inverse - mixed_quantize_int6_gptq: drop-in replacement for mixed_quantize_int6, uses GPTQ when Hessian is available, falls back to naive otherwise Expected impact: 30-50% reduction in quantization tax (0.0082 BPB). The quant tax is the primary bottleneck — chunk-51 TTT peak (1.1119) proves the model capacity exists; GPTQ captures it at the source. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Five improvements stacked: 1. GPTQ per-row scale fix — pre-compute optimal per-row scales via percentile search BEFORE column-wise error compensation. Old code used per-column amax then re-quantized at the end, losing GPTQ benefits. Now scales are fixed and consistent throughout. 2. Column reordering — process columns in ascending Hessian diagonal order (least-important first). Standard GPTQ optimization that concentrates error compensation on the most important columns. 3. Earlier QAT — LATE_QAT_THRESHOLD 0.15→0.5, giving ~1750 QAT steps instead of ~521. The model has 3x more time to adapt to int6 quantization noise before final weights are frozen. 4. QAT clipping match — STE fake quantization now uses 99.95th percentile clipping instead of row_max, matching the GPTQ export quantizer. Eliminates train/export quantization mismatch. 5. More GPTQ calibration — 128→256 samples for more stable Hessians. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

After GPTQ quantization + dequantization, fine-tune the model on training data for 100 steps with STE (quantization-aware). This repairs quant damage at the source using training data (legal), so every eval chunk gets scored by the "peak" model from chunk 0. Key insight: TTT chunk 51 achieves 1.1119 because it repairs quant damage — but the first 50 chunks are scored with a damaged model, dragging the average up. Post-quant burst pre-repairs the damage so ALL eval chunks benefit. Settings (env var configurable): PQB_ENABLED=1 # on by default PQB_STEPS=100 # 100 SGD steps with cosine decay PQB_LR=0.001 # conservative LR PQB_FREEZE_BLOCKS=2 # freeze early layers + embeddings STE enabled during burst so weights stay near quantization grid points, preventing re-quantization from undoing the fine-tuning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Competition rules: "you aren't allowed to access any training data during evaluation." The burst ran post-training, accessing train shards during eval time. Moving it into training time costs 175 training steps for 100 repair steps — not worth the trade. GPTQ + earlier QAT + TTT stabilization remain as the strategy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Layers matching INT8_SENSITIVE patterns get GPTQ with int8 range (127 levels) instead of int6 (31 levels) — 4x more quantization precision for the most damage-prone parameters. Default: INT8_SENSITIVE=attn.proj (attention output projections, which suffer ~3.4x more quant damage per PR openai#481 analysis). Controlled via env var, comma-separated patterns. Empty = disabled. Costs ~0.1-0.3MB extra compressed size (int8 has higher entropy). We have 0.44MB headroom (15.56MB artifact, 16MB limit). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v7 base (GPTQ, legal TTT, PQB, EMA) with fractal weight sharing: 6 unique x 2 loops = 12 eff depth, dim=640, 10H/5KV, MLP 4x. Inherits everything that made v7 hit 1.1206. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PR openai#462 achieves 1.0672 BPB. Their key finding: switching TTT optimizer from SGD to AdamW gives 5x more improvement (0.053 vs 0.011 BPB). AdamW's per-parameter adaptive LR handles the heterogeneous update needs of attention/MLP/control params naturally — exactly what we were trying to do manually. New defaults (matching PR openai#462 recipe): TTT_OPTIMIZER=adamw (was implicit SGD) TTT_LR=0.0005 (was 0.002) TTT_EPOCHS=10 (was 3) TTT_FREEZE_BLOCKS=0 (was 2) Fallback to SGD: TTT_OPTIMIZER=sgd TTT_LR=0.002 TTT_EPOCHS=3 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

3-seed results on 8xH100 SXM, 600s training + ~330s eval: Seed 1337: 1.1206 BPB Seed 42: 1.1218 BPB Seed 7: 1.1221 BPB Mean: 1.1215 BPB (std 0.0008) Key innovations: 1. GPTQ quantization — Hessian-aware int6 with column reordering and optimal per-row scale search. Reduces quant tax 32%. 2. Early QAT — threshold 0.5 (3x more QAT steps) with percentile clipping matching the GPTQ export quantizer. 3. Legal score-first TTT with EMA scoring, fixed cosine LR decay, and embedding freeze. Artifact: 15.56 MB (int6+zstd-22) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

newjordan · 2026-03-23T22:19:08Z

Note: This is a resubmission of PR #508 (originally submitted 2026-03-23T06:02:59Z). The original PR was closed when the source repository's visibility was accidentally changed. The code, results, and artifact are identical to the original submission.

Octavian and others added 30 commits March 23, 2026 17:12

docs: fractal transformer research plan — weight sharing + gravity + …

7256b18

…AttnRes

results: first local ladder — fractal 3x3 beats baseline by 7.1% BPB,…

90a0896

… gravity needs more steps

Fix XSA GQA broadcast bug — expand KV heads before manual attention

75ef44f

8 Q heads with 4 KV heads needs repeat_interleave before matmul. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix FA3 NaN: cast qkv to bf16 before FA3 call, disable dynamo DDP opt

a768362

torch.compile can promote tensors to fp32 which hits missing FA3 kernels (disabled at build time). Explicit bf16 cast prevents silent NaN output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add 2-seed validation scripts for exp A/B/C

ae0afa0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Disable XSA in ttt_only run — manual attention too slow vs FA3

c916c9f

XSA_LAST_N=3 was costing ~25% step time due to manual matmul path. Set to 0 to isolate TTT v2 + temp scaling gains at full speed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add run_v2_ttt_noXSA.sh — TTT v2 + temp scaling, all FA3, max speed

c68de2c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Restore XSA_LAST_N=3 in run_v2_ttt_only.sh (keep existing test intact)

27889d6

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log v2 TTT-only + XSA=3 result: 1.1982 BPB (worse than 1.1301 baseline)

277511f

XSA manual attention killed step speed, only 4771/9000 steps completed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Strip verbose logging from v2 train loop — match baseline format

c269e90

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log v2 noXSA result: 1.1538/1.1315 BPB — TTT v2 hurt, no edge over ba…

bcfb475

…seline Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log exp_a/b/c results: all worse than 1.1301 baseline, exp_c never ran

2c00f0a

exp_a MTP: 1.1619, exp_b SwiGLU: 1.1570, exp_c: missing tokenizer data. TTT v1 hurt in both exp_a and exp_b (same pattern as TTT v2). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add exp D: TTT 8 epochs + stride 32 (eval-only improvement)

ee44ac0

Same model/artifact as SOTA254 baseline — zero risk. More TTT adaptation (3→8 epochs) and finer sliding window (64→32 stride). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add baseline reproduction script — verify 1.1303 on current FA3 build

853bce8

Exact settings from README. If this doesn't reproduce, the FA3 build is the variable, not the code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log exp D result: 1.1295 BPB — new best (-0.0008 vs baseline)

fcf812f

TTT 8 epochs + stride 32. Stride made no difference — all gain from extra TTT adaptation. Same model/artifact, eval-only change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log exp D seed 42: 1.1307 BPB — confirms improvement (mean 1.1301)

39064ef

Both seeds beat baseline. TTT 8 epochs is a free win. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add exp_d SAM variant — TTT 8ep + stride 32 + sharpness-aware TTT

38fb7cc

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log exp D seed 7: 1.1313 BPB but 16.18 MB — over size limit

473aa62

Seed 7 compresses worse than 1337/42. BPB improved but artifact exceeds 16 MB cap. Need passing 3rd seed for submission. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add exp_d/run_sam_clean.sh — pure SAM A/B test, no other changes

86ef9af

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Octavian and others added 28 commits March 23, 2026 17:12

Fix v6: 512d/16H/8KV (head_dim=32, FA3 requires multiple of 8)

3e0182b

480d/16H gave head_dim=30 which FA3 rejects. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log higher LR (0.030) + MTP results — both worse

504d3d9

Higher LR: 1.12467 TTT (vs 1.12312 at default LR). Qwen signal doesn't transfer Muon. MTP: ~1.16+ base, not enough steps to converge with auxiliary heads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add train script to Frugendorff submission folder

edea2e1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

newjordan mentioned this pull request Mar 24, 2026

Three Breadsticks: 1.1190 BPB (3-seed mean 1.1195) #656

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPTQ + Early QAT + Legal TTT — 3-seed mean val_bpb 1.1215#578

GPTQ + Early QAT + Legal TTT — 3-seed mean val_bpb 1.1215#578
newjordan wants to merge 108 commits intoopenai:mainfrom
newjordan:submission/gptq-ttt-ema-1.1215

newjordan commented Mar 23, 2026

Uh oh!

newjordan commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

newjordan commented Mar 23, 2026

Summary

3-Seed Results

Key Innovations

1. GPTQ Quantization (biggest contributor: −0.0027 BPB)

2. Early QAT with Matched Clipping

3. Legal Score-First TTT with EMA + Cosine Fix

Architecture

Credits

Run

Uh oh!

newjordan commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant