
GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)#347

Draft
FlashyFlash3011 wants to merge 24 commits into openai:main from FlashyFlash3011:flashyflash3011/long-context-4096-qat-int4-16l

Conversation


@FlashyFlash3011 commented Mar 21, 2026

Submission

Experiment: records/track_10min_16mb/2026-03-27_GPTQLite_QAT_MaxLZMA_LegalTTT/
Target: < 1.1144 BPB (beat leaderboard #1 by 0.005)


GPTQLite: GatedAttention + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)

Built on the PR #414 + PR #399 + PR #841 stack. Four improvements over the current #1 (1.1194 BPB):

| Change | Why | Expected Δ BPB |
| --- | --- | --- |
| GatedAttention (PR #841) | Per-head sigmoid gate, weight=0/bias=4.0 init (near no-op at start) | −0.002 to −0.005 |
| ValueResidual (PR #841) | Layer-0 value injected into all layers, λ=[0.5, 0.5] init | included above |
| Full QAT from step 1 | CastedLinear int6 fake-quant for all ~7000 steps vs last 5% only | −0.001 to −0.003 |
| lzma-9 + BigramHash 2048 | lzma-9 is ~10% smaller than lzma-6, freeing budget to restore BigramHash from 1536 to 2048 | −0.001 to −0.002 |
| **Total expected** | | −0.004 to −0.010 → ~1.109–1.115 BPB |
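As a sketch of the first two rows (module and helper names here are hypothetical, not the PR's actual code): the gate is a per-head sigmoid over a linear projection of the residual stream, initialised with weight=0 and bias=4.0 so every gate starts at sigmoid(4.0) ≈ 0.982 and the block is nearly a no-op; the value residual mixes each layer's value tensor with the layer-0 value, with the mixing weights initialised to λ=[0.5, 0.5].

```python
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    """Per-head sigmoid gate applied to attention-head outputs (illustrative sketch)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.gate = nn.Linear(dim, n_heads)
        nn.init.zeros_(self.gate.weight)        # weight = 0 ...
        nn.init.constant_(self.gate.bias, 4.0)  # ... bias = 4.0: sigmoid(4) ~ 0.982 at init

    def forward(self, x: torch.Tensor, heads: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim); heads: (B, T, n_heads, head_dim)
        g = torch.sigmoid(self.gate(x))         # (B, T, n_heads)
        return heads * g.unsqueeze(-1)

def value_residual(v: torch.Tensor, v0: torch.Tensor, lam=(0.5, 0.5)) -> torch.Tensor:
    """Mix this layer's value tensor with the layer-0 value; lam is the
    [0.5, 0.5] init (learned per layer in the real model)."""
    return lam[0] * v + lam[1] * v0
```

Because the gate projection starts at zero weight, the initial forward pass multiplies every head by the same constant ≈ 0.982, which is what makes the change safe to enable from step 1.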

Architecture (unchanged from base): 11L · 512d · 8H · 4KV · 3× MLP · LeakyReLU(0.5)² · XSA-4 · Partial RoPE (16/64) · LN Scale · VE128 (layers 9–10) · EMA(0.997) + SWA(every 50) · GPTQ-lite int6 · Legal TTT (3ep SGD, all blocks, lr=0.002) · Parameter Banking + Parallel Muon


Status

  • Experiment script complete (train_gpt.py, submission.json, README.md)
  • All changes baked into defaults; the only env-var overrides required are QAT_ENABLED=1 and TTT_ENABLED=1
  • Full 8×H100 runs (3 seeds: 1337, 42, 2025) — in progress
  • Results table and train logs — to be added after runs complete

Two new submissions targeting sub-1.1698 BPB:

1. 2026-03-21_LongContext4096_FullStack
   - 4096-token training context + full modern SOTA stack
   - Sliding window eval stride=256 (3840 context tokens per position)
   - Same eval cost as SOTA: 64x4096 = 256x1024 tokens per batch
   - NTK-aware RoPE base=40000, re-tuned LRs/momentum for 4096 context

2. 2026-03-21_QAT_Int4_16L
   - Int4 nibble-packing enables 16 transformer layers in 16MB budget
   - QAT with straight-through estimator activates at 15% of training
   - All SOTA techniques carried forward (Muon WD, FP16 embed, Overtone init)
- warmdown_iters: 1600 -> 800 (~12% of ~6700 steps vs prior 24%)
- rope_base: 40000 -> 41832 (proper NTK formula: 10000 x 4^(64/62)
  instead of naive 4x multiplication)
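The re-tuned rope_base can be checked directly from the quoted formula. Evaluating 10000 × 4^(64/62) gives ≈ 41829, a hair under the committed 41832 (the small gap likely comes from a rounding or exponent variant); this is only a sanity check, not project code.

```python
import math

def ntk_rope_base(base: float, scale: float, head_dim: int) -> float:
    # NTK-aware scaling: base' = base * scale ** (d / (d - 2));
    # with base=10000, scale=4, d=64 this is the 10000 * 4^(64/62) quoted above
    return base * scale ** (head_dim / (head_dim - 2))

print(round(ntk_rope_base(10000, 4, 64)))  # ~41829
```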
…penai#549)

- train_seq_len and eval_seq_len raised 2048 -> 4096
- All SOTA techniques inherited: 11L, LeakyReLU(0.5)^2, SmearGate,
  BigramHash, XSA-4, Partial RoPE, LN Scale, VE128, EMA+SWA,
  GPTQ-lite, Parallel Muon, OrthoInit, Legal TTT
- Dynamic NTK auto-scales rope_base to ~48550 for 4096 context
- SDPA fallback added for flash_attn_3 unavailability (local testing)
- rocm-smi fallback for nvidia-smi on ROCm hardware
- Update QAT Int4 expected BPB estimate to ~1.13-1.14
Fixes:
- LongContext4096_Int4_16L_FullSOTA: CastedLinear fake-quant was 6-bit (/31.0)
  but export was int4 — fixed to /7.0 clamp(-8,7) to match export precision
- QAT_Int4_16L_FullSOTA: same CastedLinear fix + adds int4 pack/unpack/quant
  functions and switches export from int6 to int4
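A standalone sketch of the int4 path those fixes describe (helper names are illustrative, not the repo's): /7.0 scaling with clamp(-8, 7) for quantisation, and two signed nibbles packed per byte for export.

```python
import numpy as np

def quant_int4(w: np.ndarray):
    """Symmetric int4 quantisation: scale so |max| maps to 7, clamp to [-8, 7]."""
    scale = float(np.abs(w).max()) / 7.0 or 1.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack two int4 values per byte, low nibble first (assumes even length)."""
    u = (q.astype(np.int16) & 0xF).astype(np.uint8).reshape(-1, 2)
    return (u[:, 0] | (u[:, 1] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: split nibbles and sign-extend back to int8."""
    lo = (packed & 0xF).astype(np.int8)
    hi = ((packed >> 4) & 0xF).astype(np.int8)
    q = np.stack([lo, hi], axis=1).reshape(-1)
    return np.where(q >= 8, q - 16, q).astype(np.int8)
```

The mismatch the fix addresses: if training fake-quantises to 6 bits (/31.0) but export rounds to this 4-bit grid, the exported weights land on points the optimiser never saw; aligning both to /7.0 clamp(-8, 7) removes that train/export gap.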

New scripts:
- 2026-03-25_LongContext4096_Int6_QAT (safe): LongContext4096_FullSOTA with
  QAT_ENABLED=1 by default so 6-bit QAT runs from step 1, late_qat_threshold=0.0
- 2026-03-25_LongContext4096_Int4_BankQAT (risky): same Int4 stack plus
  _fake_quant_int4_bank() applied to all bank weight slices in the forward
  pass — first time the ~95% of params in qo/kv/mlp banks are QAT-prepared

Also: add zstandard to requirements.txt; add missing README/submission.json
@FlashyFlash3011 changed the title from "LongContext 4096 + Full SOTA Stack & QAT Int4 → 16 Layers" to "LongContext 4096 + Full SOTA Stack + QAT Int4/Int6 → 16 Layers" on Mar 25, 2026
FlashyFlash3011 and others added 18 commits March 25, 2026 18:29
Combines NewTest (PR openai#841 base) with SOTA experiments that achieved ~1.12 BPB:
- train_seq_len/eval_seq_len: 2048 → 4096 (long context from user's SOTA exps)
- bigram_vocab_size: 3072 → 2048, bigram_dim: 112 → 128 (proven SOTA settings)
- xsa_last_n: 11 → 4 (from user's best experiments)
- gated_attention + value_residual: enabled by default (PR openai#824/838 show ~0.018 BPB improvement)
- Bank QAT: symmetric int6 STE fake-quant on all weight banks during warmdown
- Fix: CastedLinear QAT clip range (-32,31) → (-31,31) to match export format
- Compression: lzma-6 → zstd-22 (PR openai#824/838: 14.9MB vs ~16MB, critical for fitting under limit)
- Fix: target_mb budget uses decimal MB (1e6) not MiB (1024^2) matching competition rules
- Budget-aware ±1 weight pruning retained from NewTest
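The decimal-MB fix is worth a concrete check: under the rule that the budget is 16 × 10⁶ bytes, sizing against MiB (1024²) would over-allocate by 777,216 bytes, ~0.78 MB. A sketch (not the repo's actual budget code) of the check, plus a minimal lzma round-trip at the preset the PR standardises on:

```python
import lzma

DECIMAL_BUDGET = 16 * 10**6   # 16,000,000 bytes: competition's decimal MB
MIB_BUDGET = 16 * 1024**2     # 16,777,216 bytes: would overshoot by 777,216

def within_budget(payload: bytes, budget: int = DECIMAL_BUDGET) -> bool:
    return len(payload) <= budget

# lzma preset 9 trades slower compression for a smaller stream
blob = bytes(range(256)) * 4096               # ~1 MB of toy, compressible data
packed = lzma.compress(blob, preset=9)
assert len(packed) < len(blob) and within_budget(packed)
```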
@FlashyFlash3011 deleted the flashyflash3011/long-context-4096-qat-int4-16l branch March 27, 2026 13:06
@FlashyFlash3011 changed the title from "LongContext 4096 + Full SOTA Stack + QAT Int4/Int6 → 16 Layers" to "GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)" on Mar 27, 2026
