Record: SwiGLU+VE128+NoTTT val_bpb=1.1181 (3-seed mean) #505
Open
JoeProAI wants to merge 1 commit into openai:main from
Conversation
Non-TTT submission: 3-seed mean 1.11807945 (std 0.000836)
Seed 42: 1.11885136
Seed 123: 1.11691774
Seed 7: 1.11846924
Architecture: 11L SwiGLU (Star-ReLU), U-Net skip gates, XSA4, VE128 (layers 9-10), BigramHash 8192, EMA 0.997, seq_len=2048, batch=786K, warmdown=3500
Key finding: seq_len 2048 yields -0.008 bpb vs 1024. No test-time training.
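The quoted mean and std follow directly from the three per-seed numbers; a quick check (the std matches Python's population std, not the sample std):

```python
from statistics import mean, pstdev

# Per-seed val_bpb as reported in this PR
seeds = {42: 1.11885136, 123: 1.11691774, 7: 1.11846924}

vals = list(seeds.values())
m = mean(vals)    # 3-seed mean
s = pstdev(vals)  # population std over the 3 seeds

print(f"mean={m:.8f} std={s:.6f}")  # mean=1.11807945 std=0.000836
```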
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
Star-ReLU: learnable per-channel scale+bias on the relu² activation (MetaFormer). Same architecture as PR openai#505's "SwiGLU": 2 weight matrices, not a gated MLP. Zero step-time overhead, ~34K params (66KB fp16).
TrigramHash: 3-token xor hash embedding extending BigramHash to trigram context. 4096 buckets, 32-dim, ~147K params (108KB int6). Independent contribution.
BigramHash doubled to 4096 buckets (from 2048) for fewer collisions.
All features are env-var controlled and default ON. Artifact headroom: ~466KB remaining (well within the 16MB cap).
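A minimal sketch of the two pieces this commit describes. The Star-ReLU init values (scale=1, bias=0) and the xor-hash mixing constants are assumptions for illustration, not the commit's exact choices:

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """Learnable per-channel scale and bias on a squared-ReLU activation
    (MetaFormer-style). Init values are an assumption."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * torch.relu(x) ** 2 + self.bias

def trigram_bucket(t0: int, t1: int, t2: int, n_buckets: int = 4096) -> int:
    """Xor-mix three token ids into one hash-embedding bucket.
    The multiplier constants are illustrative, not the PR's exact hash."""
    h = t0 ^ (t1 * 0x9E3779B1) ^ (t2 * 0x85EBCA6B)
    return h % n_buckets
```

At init the module reduces to plain relu², so enabling it is a no-op until the scale/bias parameters move during training.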
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
Key changes from studying PR openai#505 (1.1181) and openai#486 (1.0887):
- train_batch_tokens: 524K → 786K (all top entries use this)
- bigram_hash_buckets: 4096 → 8192 (PR openai#505 uses 8192, openai#493 uses 10240)
- grad_clip_norm: 0.3 → 0.0 (PR openai#505 disables clipping)
- Star-ReLU and TrigramHash enabled in all run scripts
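The commits repeatedly describe env-var-controlled features with defaults ON; a sketch of that pattern under the values above. Variable names other than those quoted in the commit messages (e.g. TRAIN_BATCH_TOKENS) are assumptions:

```python
import os

def load_config(env=os.environ):
    """Read the tunables this commit changed, falling back to its new defaults."""
    return {
        "train_batch_tokens": int(env.get("TRAIN_BATCH_TOKENS", 786_432)),
        "bigram_hash_buckets": int(env.get("BIGRAM_HASH_BUCKETS", 8192)),
        "grad_clip_norm": float(env.get("GRAD_CLIP_NORM", 0.0)),  # 0.0 disables clipping
        "star_relu": env.get("STAR_RELU", "1") == "1",
        "trigram_hash": env.get("TRIGRAM_HASH", "1") == "1",
    }
```

Run scripts can then flip any feature without code edits, e.g. `GRAD_CLIP_NORM=0.3 python train.py` to restore clipping.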
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
Sigmoid skip gates (PR openai#505): replace additive skip connections with a sigmoid-gated blend: x = gate*x + (1-gate)*scaled_skip. Learned per-dim gates init to sigmoid(0)=0.5. SIGMOID_SKIP_GATES=1 (default on).
Decoder 2x LR (PR openai#505): decoder layers (>= num_encoder_layers) get DECODER_LR_MULT=2.0 applied on top of the per-layer lr splits. Combined with the quant-damage lr, the decoder proj gets 1.5*2 = 3x base lr.
Both features are env-var controlled and default ON.
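The gated blend above can be sketched as a small module; per the commit, the gate logits start at 0 so the blend opens at an even 0.5/0.5 split (module name and tensor shapes are assumptions):

```python
import torch
import torch.nn as nn

class SigmoidSkipGate(nn.Module):
    """x = g*x + (1-g)*skip with per-dim learned gate logits,
    initialized to 0 so sigmoid(0) = 0.5 at the start of training."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate_logit = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_logit)
        return g * x + (1 - g) * skip
```

At init this is exactly the average of the two branches, so it starts close to the additive skip it replaces (up to a factor of 2) and learns to lean either way per dimension.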
I ran this and got a ~20 MB artifact, but maybe something is different on my machine?
newjordan pushed a commit to newjordan/parameter-golf that referenced this pull request on Mar 23, 2026
Upgrades train_gpt_swiglu.py with every proven technique for max quality:
- seq_len 1024 → 2048, batch 524K → 786K (PR openai#505: -0.009 BPB)
- LeakyReLU(0.5)² replaces ReLU² (preserves negative gradient flow)
- VRL: sigmoid-gated first-block mixing into attention input
- Legal score-first TTT ported from v7 (disabled by default)
- int8 GPTQ for attn.proj (lower quant tax on sensitive layers)
- grad_clip 0 → 0.3, EMA 0.9985 → 0.997, warmdown 6000 → 3500
- All illegal TTT remains purged. Score-first only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
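The LeakyReLU(0.5)² swap above, read literally, is a one-liner; unlike relu(x)², the pre-square value is slope*x for negative inputs, so the gradient is nonzero there (whether the commit also restores the sign after squaring is not stated, so this sketch squares directly):

```python
import torch
import torch.nn.functional as F

def leaky_relu_sq(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    # For x < 0 this gives (slope*x)**2, whose gradient 2*slope**2*x is
    # nonzero, while relu(x)**2 is identically flat on the negative side.
    return F.leaky_relu(x, negative_slope=slope) ** 2
```

For example, at x = -2 the activation is ((-2)*0.5)² = 1 with gradient -1, where ReLU² would output 0 with gradient 0.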
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
Run6 (1.1635) showed sigmoid gates + decoder 2x LR + bigram 8192 all hurt. Those techniques are from PR openai#505, which has a different architecture (kv8, h1792); they don't transfer to our kv4/h1536 setup.
Reverted: SIGMOID_SKIP_GATES=0, DECODER_LR_MULT=1.0, BIGRAM_HASH_BUCKETS=4096, GRAD_CLIP_NORM=0.3
Our unique stack remains: VR + GA + Star-ReLU + per-layer lr + GradQuant + TrigramHash
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
…iques

Key finding: PR openai#505 (1.1181) does NOT fit in 16MB: their 8KV+h1792 config produces ~20MB artifacts. The real non-TTT target is openai#445 at 1.1236.
Novel technique analysis: DG Attention (differential values), BitNet b1.58 (ternary weights + depth recurrence), arithmetic coding (replaces zstd-22), LeakyReLU(0.5)² (-0.003 BPB, zero params).
SwiGLU + VE128 + U-Net Skip Gates (No TTT)
val_bpb (3-seed mean): 1.11807945 (std: 0.000836, best: 1.11691774)
Architecture
11-layer transformer:
Training Configuration
Key Ablation: Sequence Length
Provenance
All architectural components were discovered through systematic ablation search and Codex-guided exploration. Value Embeddings were adapted from the community. No test-time training.
Disclosure
This work is independently funded with no sponsorship, grants, or funding from OpenAI or any other organization. All compute was self-funded on Modal (personal account).
Compute
~18 min/seed on 8xH100 SXM (Modal). Three seeds for verification.