Record: SwiGLU+VE128+NoTTT val_bpb=1.1181 (3-seed mean) #505
Open
JoeProAI wants to merge 1 commit into openai:main from
Conversation
Non-TTT submission: 3-seed mean 1.11807945 (std 0.000836)
Seed 42: 1.11885136
Seed 123: 1.11691774
Seed 7: 1.11846924
Architecture: 11L SwiGLU (Star-ReLU), U-Net skip gates, XSA4, VE128 (layers 9-10), BigramHash 8192, EMA 0.997, seq_len=2048, batch=786K, warmdown=3500
Key finding: seq_len 2048 yields -0.008 bpb vs 1024. No test-time training.
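The quoted mean and std follow directly from the three per-seed numbers; a quick check (the std matches Python's population std, not the sample std):

```python
from statistics import mean, pstdev

# Per-seed val_bpb as reported in this PR
seeds = {42: 1.11885136, 123: 1.11691774, 7: 1.11846924}

vals = list(seeds.values())
m = mean(vals)    # 3-seed mean
s = pstdev(vals)  # population std over the 3 seeds

print(f"mean={m:.8f} std={s:.6f}")  # mean=1.11807945 std=0.000836
```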
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
Star-ReLU: learnable per-channel scale+bias on the relu² activation (MetaFormer). Same architecture as PR openai#505's "SwiGLU": 2 weight matrices, not a gated MLP. Zero step-time overhead, ~34K params (66KB fp16).
TrigramHash: 3-token xor hash embedding extending BigramHash to trigram context. 4096 buckets, 32-dim, ~147K params (108KB int6). Independent contribution.
BigramHash doubled to 4096 buckets (from 2048) for fewer collisions.
All features are env-var controlled and default ON. Artifact headroom: ~466KB remaining (well within the 16MB cap).
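A minimal sketch of the two pieces this commit describes. The Star-ReLU init values (scale=1, bias=0) and the xor-hash mixing constants are assumptions for illustration, not the commit's exact choices:

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """Learnable per-channel scale and bias on a squared-ReLU activation
    (MetaFormer-style). Init values are an assumption."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * torch.relu(x) ** 2 + self.bias

def trigram_bucket(t0: int, t1: int, t2: int, n_buckets: int = 4096) -> int:
    """Xor-mix three token ids into one hash-embedding bucket.
    The multiplier constants are illustrative, not the PR's exact hash."""
    h = t0 ^ (t1 * 0x9E3779B1) ^ (t2 * 0x85EBCA6B)
    return h % n_buckets
```

At init the module reduces to plain relu², so enabling it is a no-op until the scale/bias parameters move during training.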
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
Key changes from studying PR openai#505 (1.1181) and openai#486 (1.0887):
- train_batch_tokens: 524K → 786K (all top entries use this)
- bigram_hash_buckets: 4096 → 8192 (PR openai#505 uses 8192, openai#493 uses 10240)
- grad_clip_norm: 0.3 → 0.0 (PR openai#505 disables clipping)
- Star-ReLU and TrigramHash enabled in all run scripts
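The commits repeatedly describe env-var-controlled features with defaults ON; a sketch of that pattern under the values above. Variable names other than those quoted in the commit messages (e.g. TRAIN_BATCH_TOKENS) are assumptions:

```python
import os

def load_config(env=os.environ):
    """Read the tunables this commit changed, falling back to its new defaults."""
    return {
        "train_batch_tokens": int(env.get("TRAIN_BATCH_TOKENS", 786_432)),
        "bigram_hash_buckets": int(env.get("BIGRAM_HASH_BUCKETS", 8192)),
        "grad_clip_norm": float(env.get("GRAD_CLIP_NORM", 0.0)),  # 0.0 disables clipping
        "star_relu": env.get("STAR_RELU", "1") == "1",
        "trigram_hash": env.get("TRIGRAM_HASH", "1") == "1",
    }
```

Run scripts can then flip any feature without code edits, e.g. `GRAD_CLIP_NORM=0.3 python train.py` to restore clipping.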
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
Sigmoid skip gates (PR openai#505): replace additive skip connections with a sigmoid-gated blend: x = gate*x + (1-gate)*scaled_skip. Learned per-dim gates init to sigmoid(0)=0.5. SIGMOID_SKIP_GATES=1 (default on).
Decoder 2x LR (PR openai#505): decoder layers (>= num_encoder_layers) get DECODER_LR_MULT=2.0 applied on top of the per-layer lr splits. Combined with the quant-damage lr, the decoder proj gets 1.5*2 = 3x base lr.
Both features are env-var controlled and default ON.
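The gated blend above can be sketched as a small module; per the commit, the gate logits start at 0 so the blend opens at an even 0.5/0.5 split (module name and tensor shapes are assumptions):

```python
import torch
import torch.nn as nn

class SigmoidSkipGate(nn.Module):
    """x = g*x + (1-g)*skip with per-dim learned gate logits,
    initialized to 0 so sigmoid(0) = 0.5 at the start of training."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate_logit = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_logit)
        return g * x + (1 - g) * skip
```

At init this is exactly the average of the two branches, so it starts close to the additive skip it replaces (up to a factor of 2) and learns to lean either way per dimension.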
I ran this and got a ~20 MB artifact, but maybe something is different on my machine?
newjordan pushed a commit to newjordan/parameter-golf that referenced this pull request on Mar 23, 2026
Upgrades train_gpt_swiglu.py with every proven technique for max quality:
- seq_len 1024 → 2048, batch 524K → 786K (PR openai#505: -0.009 BPB)
- LeakyReLU(0.5)² replaces ReLU² (preserves negative gradient flow)
- VRL: sigmoid-gated first-block mixing into attention input
- Legal score-first TTT ported from v7 (disabled by default)
- int8 GPTQ for attn.proj (lower quant tax on sensitive layers)
- grad_clip 0 → 0.3, EMA 0.9985 → 0.997, warmdown 6000 → 3500
- All illegal TTT remains purged. Score-first only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
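The LeakyReLU(0.5)² swap above, read literally, is a one-liner; unlike relu(x)², the pre-square value is slope*x for negative inputs, so the gradient is nonzero there (whether the commit also restores the sign after squaring is not stated, so this sketch squares directly):

```python
import torch
import torch.nn.functional as F

def leaky_relu_sq(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    # For x < 0 this gives (slope*x)**2, whose gradient 2*slope**2*x is
    # nonzero, while relu(x)**2 is identically flat on the negative side.
    return F.leaky_relu(x, negative_slope=slope) ** 2
```

For example, at x = -2 the activation is ((-2)*0.5)² = 1 with gradient -1, where ReLU² would output 0 with gradient 0.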
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
Run6 (1.1635) showed sigmoid gates + decoder 2x LR + bigram 8192 all hurt. Those techniques are from PR openai#505, which has a different architecture (kv8, h1792); they don't transfer to our kv4/h1536 setup.
Reverted: SIGMOID_SKIP_GATES=0, DECODER_LR_MULT=1.0, BIGRAM_HASH_BUCKETS=4096, GRAD_CLIP_NORM=0.3
Our unique stack remains: VR + GA + Star-ReLU + per-layer lr + GradQuant + TrigramHash
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
…iques

Key finding: PR openai#505 (1.1181) does NOT fit in 16MB: their 8KV+h1792 config produces ~20MB artifacts. The real non-TTT target is openai#445 at 1.1236.
Novel technique analysis: DG Attention (differential values), BitNet b1.58 (ternary weights + depth recurrence), arithmetic coding (replaces zstd-22), LeakyReLU(0.5)² (-0.003 BPB, zero params).
SwiGLU + VE128 + U-Net Skip Gates (No TTT)
val_bpb (3-seed mean): 1.11807945 (std: 0.000836, best: 1.11691774)
Architecture
11-layer transformer:
Training Configuration
Key Ablation: Sequence Length
Provenance
All architectural components were discovered through systematic ablation search and Codex-guided exploration. Value Embeddings were adapted from the community. No test-time training.
Disclosure
This work is independently funded with no sponsorship, grants, or funding from OpenAI or any other organization. All compute was self-funded on Modal (personal account).
Compute
~18 min/seed on 8xH100 SXM (Modal). Three seeds for verification.