GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048) #347
Draft
FlashyFlash3011 wants to merge 24 commits into openai:main from
Conversation
Two new submissions targeting sub-1.1698 BPB:

1. 2026-03-21_LongContext4096_FullStack
   - 4096-token training context + full modern SOTA stack
   - Sliding window eval stride=256 (3840 context tokens per position)
   - Same eval cost as SOTA: 64x4096 = 256x1024 tokens per batch
   - NTK-aware RoPE base=40000, re-tuned LRs/momentum for 4096 context
2. 2026-03-21_QAT_Int4_16L
   - Int4 nibble-packing enables 16 transformer layers in the 16MB budget
   - QAT with straight-through estimator activates at 15% of training
   - All SOTA techniques carried forward (Muon WD, FP16 embed, Overtone init)
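The nibble-packing idea behind the 16L budget is two signed int4 weights per byte. A minimal pure-Python sketch (function names are illustrative; the submission packs tensors, not Python lists):

```python
def pack_int4(vals):
    """Pack signed int4 values (-8..7) two per byte, low nibble first."""
    if len(vals) % 2:
        vals = vals + [0]  # pad to an even count
    out = bytearray()
    for lo, hi in zip(vals[0::2], vals[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_int4(data, n):
    """Inverse of pack_int4: recover n signed int4 values."""
    vals = []
    for b in data:
        for nib in (b & 0xF, b >> 4):
            vals.append(nib - 16 if nib >= 8 else nib)
    return vals[:n]
```

At 0.5 bytes per weight this halves storage versus int8, which is what lets five extra transformer layers fit under the same 16MB cap.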
- warmdown_iters: 1600 -> 800 (~12% of ~6700 steps vs prior 24%)
- rope_base: 40000 -> 41832 (proper NTK formula: 10000 x 4^(64/62) instead of naive 4x multiplication)
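The NTK-aware base scaling above can be sketched as follows (helper name is illustrative, not from the PR):

```python
def ntk_rope_base(base: float, scale: float, head_dim: int) -> float:
    # NTK-aware RoPE scaling: base * scale ** (d / (d - 2)),
    # rather than the naive base * scale.
    # For base=10000, scale=4, head_dim=64 this gives ~4.18e4,
    # in line with the rope_base quoted above.
    return base * scale ** (head_dim / (head_dim - 2))
```

The exponent d/(d-2) spreads the context extension across all rotary frequencies instead of uniformly stretching them, which is why the result lands slightly above the naive 4x value of 40000.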
…penai#549) - train_seq_len and eval_seq_len raised 2048 -> 4096 - All SOTA techniques inherited: 11L, LeakyReLU(0.5)^2, SmearGate, BigramHash, XSA-4, Partial RoPE, LN Scale, VE128, EMA+SWA, GPTQ-lite, Parallel Muon, OrthoInit, Legal TTT - Dynamic NTK auto-scales rope_base to ~48550 for 4096 context - SDPA fallback added for flash_attn_3 unavailability (local testing) - rocm-smi fallback for nvidia-smi on ROCm hardware - Update QAT Int4 expected BPB estimate to ~1.13-1.14
Fixes:
- LongContext4096_Int4_16L_FullSOTA: CastedLinear fake-quant was 6-bit (/31.0) but export was int4; fixed to /7.0 clamp(-8,7) to match export precision
- QAT_Int4_16L_FullSOTA: same CastedLinear fix, plus int4 pack/unpack/quant functions and a switch of the export from int6 to int4

New scripts:
- 2026-03-25_LongContext4096_Int6_QAT (safe): LongContext4096_FullSOTA with QAT_ENABLED=1 by default so 6-bit QAT runs from step 1, late_qat_threshold=0.0
- 2026-03-25_LongContext4096_Int4_BankQAT (risky): same Int4 stack plus _fake_quant_int4_bank() applied to all bank weight slices in the forward pass; first time the ~95% of params in qo/kv/mlp banks are QAT-prepared

Also: add zstandard to requirements.txt; add missing README/submission.json
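A pure-Python sketch of the corrected fake-quant (forward pass only; the PR's CastedLinear and _fake_quant_int4_bank presumably operate on tensor slices with a straight-through estimator in the backward pass, so the scalar-list form and function name here are illustrative):

```python
def fake_quant_int4(w):
    """Symmetric int4 fake-quant: quantize to [-8, 7], then dequantize.

    The bug above divided by 31.0 (a 6-bit range) while the export
    packed int4; dividing by 7.0 with clamp(-8, 7) makes train-time
    precision match the int4 export."""
    m = max(abs(x) for x in w)
    scale = m / 7.0 if m > 0 else 1.0
    q = [min(max(round(x / scale), -8), 7) for x in w]
    return [qi * scale for qi in q]
```

With the old /31.0 scale the network trained against far finer quantization noise than the int4 export actually applied, so export-time BPB degraded; matching the ranges closes that gap.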
Combines NewTest (PR openai#841 base) with SOTA experiments that achieved ~1.12 BPB:
- train_seq_len/eval_seq_len: 2048 → 4096 (long context from user's SOTA exps)
- bigram_vocab_size: 3072 → 2048, bigram_dim: 112 → 128 (proven SOTA settings)
- xsa_last_n: 11 → 4 (from user's best experiments)
- gated_attention + value_residual: enabled by default (PR openai#824/838 show ~0.018 BPB improvement)
- Bank QAT: symmetric int6 STE fake-quant on all weight banks during warmdown
- Fix: CastedLinear QAT clip range (-32,31) → (-31,31) to match export format
- Compression: lzma-6 → zstd-22 (PR openai#824/838: 14.9MB vs ~16MB, critical for fitting under the limit)
- Fix: target_mb budget uses decimal MB (1e6), not MiB (1024^2), matching competition rules
- Budget-aware ±1 weight pruning retained from NewTest
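The MB-vs-MiB budget fix is worth spelling out, since the two conventions differ by roughly 4.9% at a 16 MB target (names below are assumed, not from the PR):

```python
TARGET_MB = 16

def within_budget(size_bytes: int, target_mb: float = TARGET_MB) -> bool:
    # Competition rules count decimal megabytes (1e6 bytes), not MiB (2**20).
    return size_bytes <= target_mb * 1_000_000

# Using MiB would over-allow by 16*1024**2 - 16*10**6 = 777,216 bytes,
# enough to push an otherwise-passing artifact over the real limit.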
…, TTT epochs=1/freeze=4/lr=0.001
Submission
Experiment: records/track_10min_16mb/2026-03-27_GPTQLite_QAT_MaxLZMA_LegalTTT/
Target: < 1.1144 BPB (beat leaderboard #1 by 0.005)
GPTQLite: GatedAttention + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)
Built on the PR #414 + PR #399 + PR #841 stack. Four improvements over the current #1 (1.1194 BPB):
Architecture (unchanged from base): 11L · 512d · 8H · 4KV · 3× MLP · LeakyReLU(0.5)² · XSA-4 · Partial RoPE (16/64) · LN Scale · VE128 (layers 9–10) · EMA(0.997) + SWA(every 50) · GPTQ-lite int6 · Legal TTT (3ep SGD, all blocks, lr=0.002) · Parameter Banking + Parallel Muon
Status
Files: train_gpt.py, submission.json, README.md
Flags: QAT_ENABLED=1, TTT_ENABLED=1