
Record: pcloadloveletter v6 — Novel Codebook+Huffman Compression + AdamW TTT (val_bpb=1.0487)#532

Closed
NotADevIAmaMeatPopsicle wants to merge 12 commits into openai:main from NotADevIAmaMeatPopsicle:stacked-wins

Conversation


NotADevIAmaMeatPopsicle commented Mar 23, 2026

Summary

  • val_bpb: 1.0487 (seed 1337, 8xH100 SXM)
  • Artifact: 14.12 MB (88.3% of 16 MB cap — 1.88 MB headroom)
  • Architecture: 11 layers, d_model 512, 8 heads / 4 KV heads (GQA), MLP hidden 1500, tied embeddings
  • Additional seeds pending compute credits

Novel Contribution: Codebook + Huffman Compression

Standard pipeline (int6 + zstd) produces 18+ MB on our architecture — over the 16 MB cap. We built a novel compression pipeline that fits comfortably:

  1. Per-tensor k-means codebook quantization — non-uniform levels matched to weight distributions (CB-48 MLP / CB-80 QKV / CB-64 proj), tuned across 40 experiments
  2. Huffman entropy coding of codebook indices — exploits non-uniform index distribution that zstd misses
  3. Custom binary format (PCLL) + zstd-22 final compression

Result: 14.12 MB (saves 21% / 3.9 MB vs int6+zstd). Prior work (PR #212) tested codebook+zstd and got 25% larger artifacts — our Huffman stage is the key innovation that makes codebook compression viable.

The 1.88 MB of headroom could fund ~2.5M additional parameters (a 12th layer or wider MLP) in future iterations.
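
The two novel stages above can be sketched as follows (a minimal numpy/stdlib illustration; function names, the quantile-based cluster init, and the toy data are invented, and the real pipeline runs per tensor with the codebook sizes listed above):

```python
import heapq
from collections import Counter
import numpy as np

def kmeans_codebook(w, k=48, iters=10):
    """1-D k-means: fit k non-uniform quantization levels to the weights."""
    w = w.ravel()
    levels = np.quantile(w, np.linspace(0, 1, k))  # init at quantiles (assumed)
    for _ in range(iters):
        idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
        for j in range(k):
            members = w[idx == j]
            if members.size:
                levels[j] = members.mean()
    return levels, idx

def huffman_lengths(indices):
    """Huffman code length (bits) per codebook index, from its frequency."""
    heap = [(f, n, {s: 0})
            for n, (s, f) in enumerate(Counter(indices.tolist()).items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        # merging two subtrees pushes every symbol in them one level deeper
        heapq.heappush(heap, (f1 + f2, uid, {s: d + 1 for s, d in {**a, **b}.items()}))
        uid += 1
    return heap[0][2]

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, 4096).astype(np.float32)
levels, idx = kmeans_codebook(w, k=48)
lens = huffman_lengths(idx)
huff_bits = int(sum(lens[s] * np.sum(idx == s) for s in lens))
print(f"fixed-width: {idx.size * 6} bits, Huffman: {huff_bits} bits")
```

Because k-means places levels where weight density is high, the index distribution is skewed, which is what the Huffman stage exploits directly rather than leaving it to zstd's generic matching.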

vs. Prior Submissions

| Submission | BPB | Compression | TTT | Artifact |
|---|---|---|---|---|
| PR #512 (PROTEUS) | 0.9512 | int6+zstd | LoRA 3ep | 15.4 MB |
| PR #517 (Goldfish) | 0.978 | int6+zstd | Cosine 100ep | 15.5 MB |
| Ours (#532) | 1.0487 | Codebook+Huffman | AdamW 10ep | 14.12 MB |
| PR #518 (sofiabod) | 1.0622 | int6+zstd | Cosine 50ep | ~15.8 MB |
| PR #462 (GEPA) | 1.0672 | int6+zstd | AdamW 10ep | ~15 MB |

Training Stack

  • NorMuon + Adam hybrid (MATRIX_LR=0.03, WD=0.04)
  • EMA (decay=0.997)
  • Partial RoPE (16/64 dims), LN Scale (1/sqrt(layer+1)), XSA on last 4 layers
  • Value Residual (arXiv:2410.17897) — cache layer-0 V, blend via learned lambda
  • Gated Attention (arXiv:2505.06708) — per-head sigmoid gate
  • LeakyReLU(0.5)^2 activation
  • OrthoInit, logit softcap 30, RoPE base 50K
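
The Value Residual and squared-LeakyReLU items above can be sketched as follows (minimal numpy illustration; the shapes and scalar lambda are toy simplifications of the learned, per-head versions in the model):

```python
import numpy as np

def value_residual(v_cur, v_layer0, lam):
    """arXiv:2410.17897: blend this layer's V with the cached layer-0 V
    via a learned mixing coefficient (scalar `lam` here for simplicity)."""
    return lam * v_cur + (1.0 - lam) * v_layer0

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(0.5) followed by squaring, per the stack list above."""
    return np.where(x > 0, x, slope * x) ** 2

v0 = np.ones((4, 8))       # cached values from layer 0
v5 = np.zeros((4, 8))      # values at a later layer
blended = value_residual(v5, v0, lam=0.7)

x = np.array([-2.0, 1.0])
print(leaky_relu_sq(x))    # (-2 * 0.5)**2 = 1.0 and 1.0**2 = 1.0
```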

Test-Time Training

AdamW TTT, 10 epochs, cosine lr schedule, per-layer lr groups:

  • Output projections (c_proj, mlp.proj): 3x base lr
  • MLP FC: 0.5x base lr
  • Base lr=0.001, grad clip 1.0, all params unfrozen
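
A sketch of the per-layer grouping (the parameter-name patterns are assumptions inferred from the bullets above; the actual module names in train_gpt.py may differ):

```python
# Hypothetical mapping from parameter name to TTT learning rate.
BASE_LR = 1e-3

def ttt_lr(name: str) -> float:
    if "c_proj" in name or "mlp.proj" in name:
        return 3.0 * BASE_LR   # output projections: 3x base lr
    if "mlp.fc" in name:
        return 0.5 * BASE_LR   # MLP FC: 0.5x base lr
    return BASE_LR             # everything else (all params unfrozen)

# groups in the shape torch.optim.AdamW accepts, with grad clip applied
# separately via clip_grad_norm_(params, 1.0)
names = ["blocks.3.attn.c_proj.weight", "blocks.3.mlp.fc.weight", "wte.weight"]
groups = [{"params": [n], "lr": ttt_lr(n)} for n in names]
print([g["lr"] for g in groups])
```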

Results

| Stage | BPB | Notes |
|---|---|---|
| Pre-quant (step 5364) | 1.1511 | 600s training, 112ms/step |
| Post-codebook-compression | ~1.16 | 14.12 MB artifact |
| Post-TTT (10 epochs) | 1.0487 | Cosine schedule, per-layer lr |

Timing

| Phase | Time |
|---|---|
| Training | 600s (112ms/step, 5,364 steps) |
| Codebook compression | ~270s (k-means on CPU) |
| TTT (10 epochs) | ~250s |
| Sliding window eval (stride=32) | ~55s |
| Total | ~1175s (well within budget) |

Reproduction

pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install sentencepiece zstandard huggingface_hub
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-03-22_pcloadloveletter_v6/train_gpt.py

Platform: RunPod 8xH100 SXM, PyTorch 2.6.0+cu124

Credits

Team

Built by Artie AI

NotADevIAmaMeatPopsicle and others added 12 commits March 19, 2026 14:32
Combines three proven orthogonal improvements:
- Sliding window eval (stride=64) from SlidingWindowEval
- Sequence length 2048 from LongContextSeq2048
- FP16 embedding passthrough from FP16Embed_WD3600
- Tuned LR schedule (warmdown=3600, matrix_lr=0.06)
- MLP hidden shrunk to 992 to fit fp16 embed in 16MB budget

Pending validation on 8xH100 SXM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Team: pcloadloveletter (Artie AI)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ough

Full current-meta implementation:
- Int6 quantization ([-32,31] in int8 containers) + zstd-22 compression
- MLP 3x (1536 hidden) funded by int6 savings
- fp16 passthrough for tied embedding + last 2 K projections
- SmearGate: bigram blending module (~512 params)
- Muon tuned: mom=0.99, lr=0.02, warmdown=3000
- Sliding window eval stride=64
- Train seq len 2048

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: quantize_float_tensor used raw abs().amax() for scale
computation. With only 31 int6 levels, outlier weights inflate the
scale and collapse most values to zero. Fixed by using percentile
clipping (99.99984th quantile) matching PR openai#114's proven approach.

Also fixed asymmetric range [-32,31] -> symmetric [-31,31].
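
The fix amounts to something like this (illustrative numpy sketch; the function name and toy data are invented, while the quantile and the symmetric [-31, 31] range come from the commit message):

```python
import numpy as np

def quantize_int6(w, q=0.9999984):
    # clip the scale at a high quantile so a single outlier cannot
    # inflate it and collapse the rest of the tensor to zero
    scale = max(np.quantile(np.abs(w), q) / 31.0, 1e-12)  # floor only guards div-by-zero
    return np.clip(np.round(w / scale), -31, 31).astype(np.int8), scale

rng = np.random.default_rng(0)
w = np.append(rng.normal(0, 0.03, 1_000_000), 5.0)   # one extreme outlier

q6, scale = quantize_int6(w)
naive = np.round(w / (np.abs(w).max() / 31.0))        # the buggy absmax scale
print(f"clipped: {np.mean(q6 != 0):.1%} nonzero, "
      f"absmax: {np.mean(naive != 0):.1%} nonzero")
```

With the raw absmax scale, almost every weight rounds to zero; with the quantile-clipped scale, the bulk of the distribution survives quantization.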

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause of catastrophic quant gap: clamp_min(1.0/31) = 0.032
forced a minimum quantization step of 0.032. Zero-init proj
weights (absmax ~0.03) were almost entirely zeroed out since
values < 0.016 round to 0 with that step size. All 18 projection
matrices effectively destroyed.

Fix: use 1e-12 floor (just prevents div-by-zero, doesn't clamp
real scales). This allows the scale to match actual weight range.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous test showed artifact at 16.19MB (over by 190KB).
MLP_HIDDEN=1500 saves ~332K params, estimated artifact ~15.94MB.
Added mlp_hidden parameter threading through MLP/Block/GPT classes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- SWA during warmdown: average ~7 checkpoints in second half of
  warmdown for flatter loss landscape generalization (-0.003 to -0.005 BPB)
- GRAD_CLIP_NORM=0.3: stabilizes train@2048 (from PR openai#114 sweep)
- TRAIN_BATCH_TOKENS=786432: larger batch for seq2048 (from PR openai#114)
- EVAL_STRIDE=256: faster eval, equal or better BPB than stride=64
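
The SWA change can be sketched as follows (toy numpy illustration; the checkpoint format is invented):

```python
import numpy as np

def swa_average(checkpoints):
    """Uniform average of a list of {name: array} weight snapshots."""
    return {k: np.mean([c[k] for c in checkpoints], axis=0)
            for k in checkpoints[0]}

# ~7 snapshots from the second half of warmdown, per the commit above
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(7)]
avg = swa_average(ckpts)
print(avg["w"][0, 0])  # 3.0, the mean of 0..6
```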

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NorMuon adds per-row second-moment tracking after Newton-Schulz
orthogonalization, then normalizes and rescales to preserve total
norm. Based on arXiv:2510.05491 and PR openai#89. Expected -0.005 to
-0.010 BPB improvement. Drop-in replacement (same class name).
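
A loose sketch of that normalization step (numpy; the beta2 value and variable names are assumptions, not taken from PR openai#89):

```python
import numpy as np

def normuon_rescale(g_ortho, row_v, beta2=0.95, eps=1e-8):
    """Per-row second-moment normalization after Newton-Schulz,
    rescaled so the update's total norm is preserved."""
    row_sq = np.mean(g_ortho ** 2, axis=1, keepdims=True)  # one value per row
    row_v = beta2 * row_v + (1 - beta2) * row_sq           # EMA second moment
    update = g_ortho / (np.sqrt(row_v) + eps)              # normalize rows
    update *= np.linalg.norm(g_ortho) / (np.linalg.norm(update) + eps)
    return update, row_v

rng = np.random.default_rng(0)
g = rng.normal(size=(8, 16))       # stand-in for the orthogonalized gradient
upd, v1 = normuon_rescale(g, np.zeros((8, 1)))
print(np.linalg.norm(upd), np.linalg.norm(g))  # total norm preserved
```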

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
batch_seqs=1024 caused CUDA OOM during sliding window eval on
8xH100 (tried to allocate 11.7GB with only 10.5GB free per GPU).
Reduce to 64 windows per batch to fit in memory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major upgrades matching current top leaderboard techniques:
- 11 layers (from 9): +2 layers funded by int6 compression headroom
- BigramHash(2048,128): hash-based bigram embedding for token-pair context
- MuonWD=0.04: decoupled weight decay on Muon optimizer
- OrthoInit: orthogonal initialization for all large 2D weights
- RoPE base 50K (from 10K): better position discrimination at seq2048
- SWA every 50 steps (from 200): smoother weight averaging
- Eval stride 64 (from 256): more context per scored token
- Late-K fp16 updated for blocks.9/.10 (was .7/.8 for 9 layers)

Target: ~1.13-1.14 BPB on 8xH100 (vs 1.1645 on v3)
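
The bigram-embedding idea can be sketched as follows (the hash function and lookup are illustrative inventions; only the (2048, 128) shape comes from the commit):

```python
import numpy as np

N_BUCKETS, DIM = 2048, 128

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # simple multiplicative mixing hash (illustrative, not the PR's hash)
    return ((prev_tok * 0x9E3779B1) ^ cur_tok) % N_BUCKETS

# the table is a learned parameter in practice; zeros here for shape only
table = np.zeros((N_BUCKETS, DIM), dtype=np.float32)

tokens = [5, 17, 42, 17]
feats = np.stack([table[bigram_bucket(p, c)]
                  for p, c in zip(tokens, tokens[1:])])
print(feats.shape)  # one 128-dim feature per adjacent token pair
```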

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pression

Standard meta additions:
- Partial RoPE (16/64 dims) for position-free pattern learning
- LN Scale (1/sqrt(layer_idx+1)) for training stability
- XSA on last 4 layers to remove self-value bias
- Late QAT (STE int6 at lr_scale<0.1) for quantization robustness
- Tight SWA (scale<0.2) for zero-penalty weight averaging
- LR bump 0.02 → 0.025

Novel compression pipeline (our differentiation):
- Per-tensor k-means codebook quantization (non-uniform levels)
- Mixed codebook sizes: CB-48 MLP / CB-80 attn-QKV / CB-64 attn-proj
- Huffman entropy coding (beats zstd by 1.66 MB on weight data)
- Custom binary format (PCLL) — no pickle, no ZIP
- 40 experiments on beastmode 3080 validated this approach
- Estimated 3.82 MB (21%) savings vs baseline int6+zstd

Not yet validated on GPU — needs round-trip BPB test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Novel codebook+Huffman compression pipeline (14.12 MB artifact, 21% savings
vs int6+zstd) + EMA + Value Residual + Gated Attention + AdamW TTT.

8xH100 SXM, PyTorch 2.6.0, seed 1337.
Author

Closing — fixing TTT compliance (multi-epoch global TTT is non-causal after epoch 1). Will resubmit with legal per-document score-first TTT. Compression pipeline and architecture unchanged.

