
Record: pcloadloveletter v6 — Novel Codebook+Huffman Compression + AdamW TTT (val_bpb=1.0487)#532

Closed
NotADevIAmaMeatPopsicle wants to merge 12 commits into openai:main from NotADevIAmaMeatPopsicle:stacked-wins

Conversation


NotADevIAmaMeatPopsicle commented Mar 23, 2026

Summary

  • val_bpb: 1.0487 (seed 1337, 8xH100 SXM)
  • Artifact: 14.12 MB (88.3% of 16 MB cap — 1.88 MB headroom)
  • Architecture: 11 layers, d_model 512, 8 heads / 4 KV heads (GQA), MLP hidden 1500, tied embeddings
  • Additional seeds pending compute credits

Novel Contribution: Codebook + Huffman Compression

Standard pipeline (int6 + zstd) produces 18+ MB on our architecture — over the 16 MB cap. We built a novel compression pipeline that fits comfortably:

  1. Per-tensor k-means codebook quantization — non-uniform levels matched to weight distributions (CB-48 MLP / CB-80 QKV / CB-64 proj), tuned across 40 experiments
  2. Huffman entropy coding of codebook indices — exploits non-uniform index distribution that zstd misses
  3. Custom binary format (PCLL) + zstd-22 final compression

Result: 14.12 MB (saves 21% / 3.9 MB vs int6+zstd). Prior work (PR #212) tested codebook+zstd and got 25% larger artifacts — our Huffman stage is the key innovation that makes codebook compression viable.

The 1.88 MB of headroom could fund ~2.5M additional parameters (a 12th layer or wider MLP) in future iterations.
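
The two novel stages above can be sketched as follows (a minimal numpy/stdlib illustration; function names, the quantile-based cluster init, and the toy data are invented, and the real pipeline runs per tensor with the codebook sizes listed above):

```python
import heapq
from collections import Counter
import numpy as np

def kmeans_codebook(w, k=48, iters=10):
    """1-D k-means: fit k non-uniform quantization levels to the weights."""
    w = w.ravel()
    levels = np.quantile(w, np.linspace(0, 1, k))  # init at quantiles (assumed)
    for _ in range(iters):
        idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
        for j in range(k):
            members = w[idx == j]
            if members.size:
                levels[j] = members.mean()
    return levels, idx

def huffman_lengths(indices):
    """Huffman code length (bits) per codebook index, from its frequency."""
    heap = [(f, n, {s: 0})
            for n, (s, f) in enumerate(Counter(indices.tolist()).items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        # merging two subtrees pushes every symbol in them one level deeper
        heapq.heappush(heap, (f1 + f2, uid, {s: d + 1 for s, d in {**a, **b}.items()}))
        uid += 1
    return heap[0][2]

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, 4096).astype(np.float32)
levels, idx = kmeans_codebook(w, k=48)
lens = huffman_lengths(idx)
huff_bits = int(sum(lens[s] * np.sum(idx == s) for s in lens))
print(f"fixed-width: {idx.size * 6} bits, Huffman: {huff_bits} bits")
```

Because k-means places levels where weight density is high, the index distribution is skewed, which is what the Huffman stage exploits directly rather than leaving it to zstd's generic matching.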

vs. Prior Submissions

| Submission | BPB | Compression | TTT | Artifact |
|---|---|---|---|---|
| PR #512 (PROTEUS) | 0.9512 | int6+zstd | LoRA 3ep | 15.4 MB |
| PR #517 (Goldfish) | 0.978 | int6+zstd | Cosine 100ep | 15.5 MB |
| Ours (#532) | 1.0487 | Codebook+Huffman | AdamW 10ep | 14.12 MB |
| PR #518 (sofiabod) | 1.0622 | int6+zstd | Cosine 50ep | ~15.8 MB |
| PR #462 (GEPA) | 1.0672 | int6+zstd | AdamW 10ep | ~15 MB |

Training Stack

  • NorMuon + Adam hybrid (MATRIX_LR=0.03, WD=0.04)
  • EMA (decay=0.997)
  • Partial RoPE (16/64 dims), LN Scale (1/sqrt(layer+1)), XSA on last 4 layers
  • Value Residual (arXiv:2410.17897) — cache layer-0 V, blend via learned lambda
  • Gated Attention (arXiv:2505.06708) — per-head sigmoid gate
  • LeakyReLU(0.5)^2 activation
  • OrthoInit, logit softcap 30, RoPE base 50K
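
The Value Residual and squared-LeakyReLU items above can be sketched as follows (minimal numpy illustration; the shapes and scalar lambda are toy simplifications of the learned, per-head versions in the model):

```python
import numpy as np

def value_residual(v_cur, v_layer0, lam):
    """arXiv:2410.17897: blend this layer's V with the cached layer-0 V
    via a learned mixing coefficient (scalar `lam` here for simplicity)."""
    return lam * v_cur + (1.0 - lam) * v_layer0

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(0.5) followed by squaring, per the stack list above."""
    return np.where(x > 0, x, slope * x) ** 2

v0 = np.ones((4, 8))       # cached values from layer 0
v5 = np.zeros((4, 8))      # values at a later layer
blended = value_residual(v5, v0, lam=0.7)

x = np.array([-2.0, 1.0])
print(leaky_relu_sq(x))    # (-2 * 0.5)**2 = 1.0 and 1.0**2 = 1.0
```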

Test-Time Training

AdamW TTT, 10 epochs, cosine lr schedule, per-layer lr groups:

  • Output projections (c_proj, mlp.proj): 3x base lr
  • MLP FC: 0.5x base lr
  • Base lr=0.001, grad clip 1.0, all params unfrozen
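
A sketch of the per-layer grouping (the parameter-name patterns are assumptions inferred from the bullets above; the actual module names in train_gpt.py may differ):

```python
# Hypothetical mapping from parameter name to TTT learning rate.
BASE_LR = 1e-3

def ttt_lr(name: str) -> float:
    if "c_proj" in name or "mlp.proj" in name:
        return 3.0 * BASE_LR   # output projections: 3x base lr
    if "mlp.fc" in name:
        return 0.5 * BASE_LR   # MLP FC: 0.5x base lr
    return BASE_LR             # everything else (all params unfrozen)

# groups in the shape torch.optim.AdamW accepts, with grad clip applied
# separately via clip_grad_norm_(params, 1.0)
names = ["blocks.3.attn.c_proj.weight", "blocks.3.mlp.fc.weight", "wte.weight"]
groups = [{"params": [n], "lr": ttt_lr(n)} for n in names]
print([g["lr"] for g in groups])
```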

Results

| Stage | BPB | Notes |
|---|---|---|
| Pre-quant (step 5364) | 1.1511 | 600s training, 112ms/step |
| Post-codebook-compression | ~1.16 | 14.12 MB artifact |
| Post-TTT (10 epochs) | 1.0487 | Cosine schedule, per-layer lr |

Timing

| Phase | Time |
|---|---|
| Training | 600s (112ms/step, 5,364 steps) |
| Codebook compression | ~270s (k-means on CPU) |
| TTT (10 epochs) | ~250s |
| Sliding window eval (stride=32) | ~55s |
| Total | ~1175s (well within budget) |

Reproduction

pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install sentencepiece zstandard huggingface_hub
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-03-22_pcloadloveletter_v6/train_gpt.py

Platform: RunPod 8xH100 SXM, PyTorch 2.6.0+cu124

Credits

Team

Built by Artie AI

NotADevIAmaMeatPopsicle and others added 12 commits March 19, 2026 14:32
Combines three proven orthogonal improvements:
- Sliding window eval (stride=64) from SlidingWindowEval
- Sequence length 2048 from LongContextSeq2048
- FP16 embedding passthrough from FP16Embed_WD3600
- Tuned LR schedule (warmdown=3600, matrix_lr=0.06)
- MLP hidden shrunk to 992 to fit fp16 embed in 16MB budget

Pending validation on 8xH100 SXM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Team: pcloadloveletter (Artie AI)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ough

Full current-meta implementation:
- Int6 quantization ([-32,31] in int8 containers) + zstd-22 compression
- MLP 3x (1536 hidden) funded by int6 savings
- fp16 passthrough for tied embedding + last 2 K projections
- SmearGate: bigram blending module (~512 params)
- Muon tuned: mom=0.99, lr=0.02, warmdown=3000
- Sliding window eval stride=64
- Train seq len 2048

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: quantize_float_tensor used raw abs().amax() for scale
computation. With only 31 int6 levels, outlier weights inflate the
scale and collapse most values to zero. Fixed by using percentile
clipping (99.99984th quantile) matching PR openai#114's proven approach.

Also fixed asymmetric range [-32,31] -> symmetric [-31,31].
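
The fix amounts to something like this (illustrative numpy sketch; the function name and toy data are invented, while the quantile and the symmetric [-31, 31] range come from the commit message):

```python
import numpy as np

def quantize_int6(w, q=0.9999984):
    # clip the scale at a high quantile so a single outlier cannot
    # inflate it and collapse the rest of the tensor to zero
    scale = max(np.quantile(np.abs(w), q) / 31.0, 1e-12)  # floor only guards div-by-zero
    return np.clip(np.round(w / scale), -31, 31).astype(np.int8), scale

rng = np.random.default_rng(0)
w = np.append(rng.normal(0, 0.03, 1_000_000), 5.0)   # one extreme outlier

q6, scale = quantize_int6(w)
naive = np.round(w / (np.abs(w).max() / 31.0))        # the buggy absmax scale
print(f"clipped: {np.mean(q6 != 0):.1%} nonzero, "
      f"absmax: {np.mean(naive != 0):.1%} nonzero")
```

With the raw absmax scale, almost every weight rounds to zero; with the quantile-clipped scale, the bulk of the distribution survives quantization.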

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause of catastrophic quant gap: clamp_min(1.0/31) = 0.032
forced a minimum quantization step of 0.032. Zero-init proj
weights (absmax ~0.03) were almost entirely zeroed out since
values < 0.016 round to 0 with that step size. All 18 projection
matrices effectively destroyed.

Fix: use 1e-12 floor (just prevents div-by-zero, doesn't clamp
real scales). This allows the scale to match actual weight range.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous test showed artifact at 16.19MB (over by 190KB).
MLP_HIDDEN=1500 saves ~332K params, estimated artifact ~15.94MB.
Added mlp_hidden parameter threading through MLP/Block/GPT classes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- SWA during warmdown: average ~7 checkpoints in second half of
  warmdown for flatter loss landscape generalization (-0.003 to -0.005 BPB)
- GRAD_CLIP_NORM=0.3: stabilizes train@2048 (from PR openai#114 sweep)
- TRAIN_BATCH_TOKENS=786432: larger batch for seq2048 (from PR openai#114)
- EVAL_STRIDE=256: faster eval, equal or better BPB than stride=64
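
The SWA change can be sketched as follows (toy numpy illustration; the checkpoint format is invented):

```python
import numpy as np

def swa_average(checkpoints):
    """Uniform average of a list of {name: array} weight snapshots."""
    return {k: np.mean([c[k] for c in checkpoints], axis=0)
            for k in checkpoints[0]}

# ~7 snapshots from the second half of warmdown, per the commit above
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(7)]
avg = swa_average(ckpts)
print(avg["w"][0, 0])  # 3.0, the mean of 0..6
```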

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NorMuon adds per-row second-moment tracking after Newton-Schulz
orthogonalization, then normalizes and rescales to preserve total
norm. Based on arXiv:2510.05491 and PR openai#89. Expected -0.005 to
-0.010 BPB improvement. Drop-in replacement (same class name).
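
A loose sketch of that normalization step (numpy; the beta2 value and variable names are assumptions, not taken from PR openai#89):

```python
import numpy as np

def normuon_rescale(g_ortho, row_v, beta2=0.95, eps=1e-8):
    """Per-row second-moment normalization after Newton-Schulz,
    rescaled so the update's total norm is preserved."""
    row_sq = np.mean(g_ortho ** 2, axis=1, keepdims=True)  # one value per row
    row_v = beta2 * row_v + (1 - beta2) * row_sq           # EMA second moment
    update = g_ortho / (np.sqrt(row_v) + eps)              # normalize rows
    update *= np.linalg.norm(g_ortho) / (np.linalg.norm(update) + eps)
    return update, row_v

rng = np.random.default_rng(0)
g = rng.normal(size=(8, 16))       # stand-in for the orthogonalized gradient
upd, v1 = normuon_rescale(g, np.zeros((8, 1)))
print(np.linalg.norm(upd), np.linalg.norm(g))  # total norm preserved
```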

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
batch_seqs=1024 caused CUDA OOM during sliding window eval on
8xH100 (tried to allocate 11.7GB with only 10.5GB free per GPU).
Reduce to 64 windows per batch to fit in memory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major upgrades matching current top leaderboard techniques:
- 11 layers (from 9): +2 layers funded by int6 compression headroom
- BigramHash(2048,128): hash-based bigram embedding for token-pair context
- MuonWD=0.04: decoupled weight decay on Muon optimizer
- OrthoInit: orthogonal initialization for all large 2D weights
- RoPE base 50K (from 10K): better position discrimination at seq2048
- SWA every 50 steps (from 200): smoother weight averaging
- Eval stride 64 (from 256): more context per scored token
- Late-K fp16 updated for blocks.9/.10 (was .7/.8 for 9 layers)

Target: ~1.13-1.14 BPB on 8xH100 (vs 1.1645 on v3)
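
The bigram-embedding idea can be sketched as follows (the hash function and lookup are illustrative inventions; only the (2048, 128) shape comes from the commit):

```python
import numpy as np

N_BUCKETS, DIM = 2048, 128

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # simple multiplicative mixing hash (illustrative, not the PR's hash)
    return ((prev_tok * 0x9E3779B1) ^ cur_tok) % N_BUCKETS

# the table is a learned parameter in practice; zeros here for shape only
table = np.zeros((N_BUCKETS, DIM), dtype=np.float32)

tokens = [5, 17, 42, 17]
feats = np.stack([table[bigram_bucket(p, c)]
                  for p, c in zip(tokens, tokens[1:])])
print(feats.shape)  # one 128-dim feature per adjacent token pair
```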

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pression

Standard meta additions:
- Partial RoPE (16/64 dims) for position-free pattern learning
- LN Scale (1/sqrt(layer_idx+1)) for training stability
- XSA on last 4 layers to remove self-value bias
- Late QAT (STE int6 at lr_scale<0.1) for quantization robustness
- Tight SWA (scale<0.2) for zero-penalty weight averaging
- LR bump 0.02 → 0.025

Novel compression pipeline (our differentiation):
- Per-tensor k-means codebook quantization (non-uniform levels)
- Mixed codebook sizes: CB-48 MLP / CB-80 attn-QKV / CB-64 attn-proj
- Huffman entropy coding (beats zstd by 1.66 MB on weight data)
- Custom binary format (PCLL) — no pickle, no ZIP
- 40 experiments on beastmode 3080 validated this approach
- Estimated 3.82 MB (21%) savings vs baseline int6+zstd

Not yet validated on GPU — needs round-trip BPB test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Novel codebook+Huffman compression pipeline (14.12 MB artifact, 21% savings
vs int6+zstd) + EMA + Value Residual + Gated Attention + AdamW TTT.

8xH100 SXM, PyTorch 2.6.0, seed 1337.
Author

Closing — fixing TTT compliance (multi-epoch global TTT is non-causal after epoch 1). Will resubmit with legal per-document score-first TTT. Compression pipeline and architecture unchanged.

