
GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)#347

Draft
FlashyFlash3011 wants to merge 24 commits into openai:main from FlashyFlash3011:flashyflash3011/long-context-4096-qat-int4-16l

Conversation


@FlashyFlash3011 commented Mar 21, 2026

Submission

Experiment: records/track_10min_16mb/2026-03-27_GPTQLite_QAT_MaxLZMA_LegalTTT/
Target: < 1.1144 BPB (beat leaderboard #1 by 0.005)


GPTQLite: GatedAttention + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)

Built on the PR #414 + PR #399 + PR #841 stack. Four improvements over the current #1 (1.1194 BPB):

| Change | Why | Expected Δ BPB |
| --- | --- | --- |
| GatedAttention (PR #841) | Per-head sigmoid gate, weight=0/bias=4.0 init (near no-op at start) | −0.002 to −0.005 |
| ValueResidual (PR #841) | Layer-0 value injected into all layers, λ=[0.5, 0.5] init | included above |
| Full QAT from step 1 | CastedLinear int6 fake-quant for all ~7000 steps vs last 5% only | −0.001 to −0.003 |
| lzma-9 + BigramHash 2048 | lzma-9 is ~10% smaller than lzma-6, freeing budget to restore BigramHash from 1536 to 2048 | −0.001 to −0.002 |
| **Total expected** | | −0.004 to −0.010 → ~1.109–1.115 BPB |
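As a sketch of the first two rows (module and helper names here are hypothetical, not the PR's actual code): the gate is a per-head sigmoid over a linear projection of the residual stream, initialised with weight=0 and bias=4.0 so every gate starts at sigmoid(4.0) ≈ 0.982 and the block is nearly a no-op; the value residual mixes each layer's value tensor with the layer-0 value, with the mixing weights initialised to λ=[0.5, 0.5].

```python
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    """Per-head sigmoid gate applied to attention-head outputs (illustrative sketch)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.gate = nn.Linear(dim, n_heads)
        nn.init.zeros_(self.gate.weight)        # weight = 0 ...
        nn.init.constant_(self.gate.bias, 4.0)  # ... bias = 4.0: sigmoid(4) ~ 0.982 at init

    def forward(self, x: torch.Tensor, heads: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim); heads: (B, T, n_heads, head_dim)
        g = torch.sigmoid(self.gate(x))         # (B, T, n_heads)
        return heads * g.unsqueeze(-1)

def value_residual(v: torch.Tensor, v0: torch.Tensor, lam=(0.5, 0.5)) -> torch.Tensor:
    """Mix this layer's value tensor with the layer-0 value; lam is the
    [0.5, 0.5] init (learned per layer in the real model)."""
    return lam[0] * v + lam[1] * v0
```

Because the gate projection starts at zero weight, the initial forward pass multiplies every head by the same constant ≈ 0.982, which is what makes the change safe to enable from step 1.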

Architecture (unchanged from base): 11L · 512d · 8H · 4KV · 3× MLP · LeakyReLU(0.5)² · XSA-4 · Partial RoPE (16/64) · LN Scale · VE128 (layers 9–10) · EMA(0.997) + SWA(every 50) · GPTQ-lite int6 · Legal TTT (3ep SGD, all blocks, lr=0.002) · Parameter Banking + Parallel Muon


Status

  • Experiment script complete (train_gpt.py, submission.json, README.md)
  • All changes baked into defaults; the only env-var overrides required are QAT_ENABLED=1 and TTT_ENABLED=1
  • Full 8×H100 runs (3 seeds: 1337, 42, 2025) — in progress
  • Results table and train logs — to be added after runs complete

Two new submissions targeting sub-1.1698 BPB:

1. 2026-03-21_LongContext4096_FullStack
   - 4096-token training context + full modern SOTA stack
   - Sliding window eval stride=256 (3840 context tokens per position)
   - Same eval cost as SOTA: 64x4096 = 256x1024 tokens per batch
   - NTK-aware RoPE base=40000, re-tuned LRs/momentum for 4096 context

2. 2026-03-21_QAT_Int4_16L
   - Int4 nibble-packing enables 16 transformer layers in 16MB budget
   - QAT with straight-through estimator activates at 15% of training
   - All SOTA techniques carried forward (Muon WD, FP16 embed, Overtone init)
- warmdown_iters: 1600 -> 800 (~12% of ~6700 steps vs prior 24%)
- rope_base: 40000 -> 41832 (proper NTK formula: 10000 x 4^(64/62)
  instead of naive 4x multiplication)
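The re-tuned rope_base can be checked directly from the quoted formula. Evaluating 10000 × 4^(64/62) gives ≈ 41829, a hair under the committed 41832 (the small gap likely comes from a rounding or exponent variant); this is only a sanity check, not project code.

```python
import math

def ntk_rope_base(base: float, scale: float, head_dim: int) -> float:
    # NTK-aware scaling: base' = base * scale ** (d / (d - 2));
    # with base=10000, scale=4, d=64 this is the 10000 * 4^(64/62) quoted above
    return base * scale ** (head_dim / (head_dim - 2))

print(round(ntk_rope_base(10000, 4, 64)))  # ~41829
```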
…penai#549)

- train_seq_len and eval_seq_len raised 2048 -> 4096
- All SOTA techniques inherited: 11L, LeakyReLU(0.5)^2, SmearGate,
  BigramHash, XSA-4, Partial RoPE, LN Scale, VE128, EMA+SWA,
  GPTQ-lite, Parallel Muon, OrthoInit, Legal TTT
- Dynamic NTK auto-scales rope_base to ~48550 for 4096 context
- SDPA fallback added for flash_attn_3 unavailability (local testing)
- rocm-smi fallback for nvidia-smi on ROCm hardware
- Update QAT Int4 expected BPB estimate to ~1.13-1.14
Fixes:
- LongContext4096_Int4_16L_FullSOTA: CastedLinear fake-quant was 6-bit (/31.0)
  but export was int4 — fixed to /7.0 clamp(-8,7) to match export precision
- QAT_Int4_16L_FullSOTA: same CastedLinear fix + adds int4 pack/unpack/quant
  functions and switches export from int6 to int4
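A standalone sketch of the int4 path those fixes describe (helper names are illustrative, not the repo's): /7.0 scaling with clamp(-8, 7) for quantisation, and two signed nibbles packed per byte for export.

```python
import numpy as np

def quant_int4(w: np.ndarray):
    """Symmetric int4 quantisation: scale so |max| maps to 7, clamp to [-8, 7]."""
    scale = float(np.abs(w).max()) / 7.0 or 1.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack two int4 values per byte, low nibble first (assumes even length)."""
    u = (q.astype(np.int16) & 0xF).astype(np.uint8).reshape(-1, 2)
    return (u[:, 0] | (u[:, 1] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: split nibbles and sign-extend back to int8."""
    lo = (packed & 0xF).astype(np.int8)
    hi = ((packed >> 4) & 0xF).astype(np.int8)
    q = np.stack([lo, hi], axis=1).reshape(-1)
    return np.where(q >= 8, q - 16, q).astype(np.int8)
```

The mismatch the fix addresses: if training fake-quantises to 6 bits (/31.0) but export rounds to this 4-bit grid, the exported weights land on points the optimiser never saw; aligning both to /7.0 clamp(-8, 7) removes that train/export gap.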

New scripts:
- 2026-03-25_LongContext4096_Int6_QAT (safe): LongContext4096_FullSOTA with
  QAT_ENABLED=1 by default so 6-bit QAT runs from step 1, late_qat_threshold=0.0
- 2026-03-25_LongContext4096_Int4_BankQAT (risky): same Int4 stack plus
  _fake_quant_int4_bank() applied to all bank weight slices in the forward
  pass — first time the ~95% of params in qo/kv/mlp banks are QAT-prepared

Also: add zstandard to requirements.txt; add missing README/submission.json
@FlashyFlash3011 changed the title from "LongContext 4096 + Full SOTA Stack & QAT Int4 → 16 Layers" to "LongContext 4096 + Full SOTA Stack + QAT Int4/Int6 → 16 Layers" on Mar 25, 2026
FlashyFlash3011 and others added 18 commits March 25, 2026 18:29
Combines NewTest (PR openai#841 base) with SOTA experiments that achieved ~1.12 BPB:
- train_seq_len/eval_seq_len: 2048 → 4096 (long context from user's SOTA exps)
- bigram_vocab_size: 3072 → 2048, bigram_dim: 112 → 128 (proven SOTA settings)
- xsa_last_n: 11 → 4 (from user's best experiments)
- gated_attention + value_residual: enabled by default (PR openai#824/838 show ~0.018 BPB improvement)
- Bank QAT: symmetric int6 STE fake-quant on all weight banks during warmdown
- Fix: CastedLinear QAT clip range (-32,31) → (-31,31) to match export format
- Compression: lzma-6 → zstd-22 (PR openai#824/838: 14.9MB vs ~16MB, critical for fitting under limit)
- Fix: target_mb budget uses decimal MB (1e6) not MiB (1024^2) matching competition rules
- Budget-aware ±1 weight pruning retained from NewTest
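The decimal-MB fix is worth a concrete check: under the rule that the budget is 16 × 10⁶ bytes, sizing against MiB (1024²) would over-allocate by 777,216 bytes, ~0.78 MB. A sketch (not the repo's actual budget code) of the check, plus a minimal lzma round-trip at the preset the PR standardises on:

```python
import lzma

DECIMAL_BUDGET = 16 * 10**6   # 16,000,000 bytes: competition's decimal MB
MIB_BUDGET = 16 * 1024**2     # 16,777,216 bytes: would overshoot by 777,216

def within_budget(payload: bytes, budget: int = DECIMAL_BUDGET) -> bool:
    return len(payload) <= budget

# lzma preset 9 trades slower compression for a smaller stream
blob = bytes(range(256)) * 4096               # ~1 MB of toy, compressible data
packed = lzma.compress(blob, preset=9)
assert len(packed) < len(blob) and within_budget(packed)
```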
@FlashyFlash3011 deleted the flashyflash3011/long-context-4096-qat-int4-16l branch March 27, 2026 13:06
@FlashyFlash3011 changed the title from "LongContext 4096 + Full SOTA Stack + QAT Int4/Int6 → 16 Layers" to "GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)" on Mar 27, 2026
