Record: pcloadloveletter v6 — Novel Codebook+Huffman Compression + AdamW TTT (val_bpb=1.0487) #532
Closed
NotADevIAmaMeatPopsicle wants to merge 12 commits into openai:main from
Conversation
Combines three proven orthogonal improvements:
- Sliding window eval (stride=64) from SlidingWindowEval
- Sequence length 2048 from LongContextSeq2048
- FP16 embedding passthrough from FP16Embed_WD3600
- Tuned LR schedule (warmdown=3600, matrix_lr=0.06)
- MLP hidden shrunk to 992 to fit the fp16 embedding in the 16 MB budget

Pending validation on 8xH100 SXM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Team: pcloadloveletter (Artie AI) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ough

Full current-meta implementation:
- Int6 quantization ([-32,31] in int8 containers) + zstd-22 compression
- MLP 3x (1536 hidden) funded by int6 savings
- fp16 passthrough for the tied embedding + last 2 K projections
- SmearGate: bigram blending module (~512 params)
- Muon tuned: mom=0.99, lr=0.02, warmdown=3000
- Sliding window eval, stride=64
- Train seq len 2048

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: quantize_float_tensor used the raw abs().amax() for scale computation. With only 31 int6 levels, outlier weights inflate the scale and collapse most values to zero. Fixed by using percentile clipping (the 99.99984th quantile), matching PR openai#114's proven approach. Also fixed the asymmetric range [-32,31] -> symmetric [-31,31]. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Root cause of the catastrophic quant gap: clamp_min(1.0/31) = 0.032 forced a minimum quantization step of 0.032. Zero-init projection weights (absmax ~0.03) were almost entirely zeroed out, since values < 0.016 round to 0 at that step size; all 18 projection matrices were effectively destroyed. Fix: use a 1e-12 floor, which only prevents division by zero and never clamps real scales, so the scale can match the actual weight range. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
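The two quantizer fixes above (percentile-clipped scale, symmetric [-31,31] range, and a 1e-12 floor instead of clamp_min(1/31)) can be sketched as follows. This is a minimal illustration, not the repository's quantize_float_tensor; the function names are assumptions.

```python
import torch

def quantize_int6(w: torch.Tensor, q: float = 0.9999984) -> tuple[torch.Tensor, float]:
    """Symmetric int6 quantization into int8 containers.

    The scale comes from a high quantile of |w| rather than the raw absmax,
    so a single outlier cannot inflate the step size and zero out the bulk
    of the weights. The 1e-12 floor only guards against division by zero;
    unlike clamp_min(1/31) it never clamps a real scale.
    """
    absmax = torch.quantile(w.abs().float().flatten(), q)
    scale = absmax.clamp_min(1e-12) / 31.0            # 31 levels per side: [-31, 31]
    q_w = (w / scale).round().clamp(-31, 31).to(torch.int8)
    return q_w, scale.item()

def dequantize_int6(q_w: torch.Tensor, scale: float) -> torch.Tensor:
    return q_w.float() * scale
```

With zero-init projections (absmax ~0.03) the computed scale is ~0.001 instead of the old forced 0.032 floor, so small weights survive the round trip.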
Previous test showed artifact at 16.19MB (over by 190KB). MLP_HIDDEN=1500 saves ~332K params, estimated artifact ~15.94MB. Added mlp_hidden parameter threading through MLP/Block/GPT classes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- SWA during warmdown: average ~7 checkpoints in the second half of warmdown for flatter-loss-landscape generalization (-0.003 to -0.005 BPB)
- GRAD_CLIP_NORM=0.3: stabilizes training at seq 2048 (from the PR openai#114 sweep)
- TRAIN_BATCH_TOKENS=786432: larger batch for seq 2048 (from PR openai#114)
- EVAL_STRIDE=256: faster eval, equal or better BPB than stride=64

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
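The checkpoint averaging in the first bullet amounts to a uniform mean over late-warmdown state dicts. A minimal sketch (the function name and the convention of skipping non-float buffers are assumptions, not the PR's exact implementation):

```python
import copy

import torch

def swa_average(state_dicts: list[dict]) -> dict:
    """Uniformly average a list of model state dicts (stochastic weight
    averaging over the ~7 late-warmdown checkpoints). Integer buffers
    (step counters, etc.) are copied from the first checkpoint unchanged."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        if avg[key].is_floating_point():
            stacked = torch.stack([sd[key].float() for sd in state_dicts])
            avg[key] = stacked.mean(dim=0).to(avg[key].dtype)
    return avg
```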
NorMuon adds per-row second-moment tracking after Newton-Schulz orthogonalization, then normalizes and rescales to preserve total norm. Based on arXiv:2510.05491 and PR openai#89. Expected -0.005 to -0.010 BPB improvement. Drop-in replacement (same class name). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
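The NorMuon step described above can be sketched as a post-orthogonalization rescaling. This is an illustrative reading of the commit message (per-row second moment, normalize, restore total norm); beta2 and eps are assumed hyperparameters, and the real optimizer wraps this inside the Muon update loop.

```python
import torch

def normuon_rescale(update: torch.Tensor, v: torch.Tensor,
                    beta2: float = 0.95, eps: float = 1e-8) -> tuple[torch.Tensor, torch.Tensor]:
    """Per-row second-moment normalization applied after Newton-Schulz
    orthogonalization, rescaled so the update keeps its original norm.

    `update` is the orthogonalized gradient (rows = output dims) and `v`
    the running per-row second-moment estimate, updated as an EMA.
    """
    row_ms = update.pow(2).mean(dim=1)                 # per-row mean square
    v = beta2 * v + (1 - beta2) * row_ms               # EMA second moment
    normed = update / (v.sqrt().unsqueeze(1) + eps)    # per-row normalize
    normed = normed * (update.norm() / normed.norm().clamp_min(eps))
    return normed, v
```

The final rescale preserves the total Frobenius norm, so the change is purely directional: rows with a large running second moment are damped relative to quiet rows.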
batch_seqs=1024 caused CUDA OOM during sliding window eval on 8xH100 (tried to allocate 11.7GB with only 10.5GB free per GPU). Reduce to 64 windows per batch to fit in memory. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
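The fix above caps how many sliding windows are scored per forward pass. A self-contained sketch of that chunked evaluation loop (shapes, names, and the assumption that `model` returns per-token logits are all illustrative; the repo's eval code differs in detail):

```python
import torch

@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor, window: int = 2048,
                       stride: int = 64, batch_seqs: int = 64) -> float:
    """Sliding-window eval scoring `stride` new tokens per window, with at
    most `batch_seqs` windows per forward pass (the OOM fix: 1024 windows
    per batch exceeded per-GPU memory on 8xH100; 64 fits)."""
    starts = list(range(0, tokens.numel() - window, stride))
    total_nll, total_count = 0.0, 0
    for i in range(0, len(starts), batch_seqs):
        chunk = torch.stack([tokens[s:s + window + 1] for s in starts[i:i + batch_seqs]])
        logits = model(chunk[:, :-1])                         # (B, window, vocab)
        nll = torch.nn.functional.cross_entropy(
            logits[:, -stride:].reshape(-1, logits.size(-1)), # score only new tokens
            chunk[:, 1:][:, -stride:].reshape(-1),
            reduction="sum")
        total_nll += nll.item()
        total_count += chunk.size(0) * stride
    # nats -> bits per token (equals bits per byte for a byte-level vocab)
    return total_nll / total_count / torch.log(torch.tensor(2.0)).item()
```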
Major upgrades matching current top leaderboard techniques:
- 11 layers (from 9): +2 layers funded by int6 compression headroom
- BigramHash(2048,128): hash-based bigram embedding for token-pair context
- MuonWD=0.04: decoupled weight decay on the Muon optimizer
- OrthoInit: orthogonal initialization for all large 2D weights
- RoPE base 50K (from 10K): better position discrimination at seq 2048
- SWA every 50 steps (from 200): smoother weight averaging
- Eval stride 64 (from 256): more context per scored token
- Late-K fp16 updated for blocks.9/.10 (was .7/.8 for 9 layers)

Target: ~1.13-1.14 BPB on 8xH100 (vs 1.1645 on v3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
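The BigramHash(2048,128) item above is a hash-based bigram embedding: each (previous, current) token pair is hashed into a fixed table and looked up in a small embedding. A minimal sketch under stated assumptions (the multiplicative hash constant and the zero-padding of position 0 are illustrative, not the PR's exact choices):

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hash-based bigram embedding: maps each (prev, cur) token pair to a
    128-dim vector via a 2048-bucket hash table, per the commit message."""

    def __init__(self, n_buckets: int = 2048, dim: int = 128):
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq). Pair each token with its predecessor.
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0                                   # no predecessor at position 0
        h = (prev * 1000003 + tokens) % self.n_buckets   # cheap multiplicative hash
        return self.emb(h)
```

The table costs only 2048 × 128 parameters, so collisions are accepted in exchange for token-pair context that is nearly free under the size cap.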
…pression

Standard meta additions:
- Partial RoPE (16/64 dims) for position-free pattern learning
- LN scale (1/sqrt(layer_idx+1)) for training stability
- XSA on the last 4 layers to remove self-value bias
- Late QAT (STE int6 at lr_scale<0.1) for quantization robustness
- Tight SWA (scale<0.2) for zero-penalty weight averaging
- LR bump 0.02 → 0.025

Novel compression pipeline (our differentiation):
- Per-tensor k-means codebook quantization (non-uniform levels)
- Mixed codebook sizes: CB-48 MLP / CB-80 attn-QKV / CB-64 attn-proj
- Huffman entropy coding (beats zstd by 1.66 MB on weight data)
- Custom binary format (PCLL) — no pickle, no ZIP
- 40 experiments on beastmode 3080 validated this approach
- Estimated 3.82 MB (21%) savings vs baseline int6+zstd

Not yet validated on GPU — needs a round-trip BPB test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
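The two compression stages above — per-tensor k-means codebooks and Huffman coding of the resulting indices — can be sketched as follows. This is a small-tensor illustration (the pairwise-distance k-means is O(N·k) memory and the function names are assumptions), not the PCLL pipeline itself:

```python
import heapq
from collections import Counter

import torch

def kmeans_codebook(w: torch.Tensor, k: int = 64, iters: int = 10):
    """1-D k-means codebook quantization of a weight tensor.
    k=64 matches the attn-proj tier (CB-64); CB-48/CB-80 only change k.
    Quantile init spreads the non-uniform levels over the distribution."""
    flat = w.flatten().float()
    codebook = torch.quantile(flat, torch.linspace(0, 1, k))
    for _ in range(iters):
        idx = (flat.unsqueeze(1) - codebook.unsqueeze(0)).abs().argmin(dim=1)
        for j in range(k):
            members = flat[idx == j]
            if members.numel():
                codebook[j] = members.mean()
    idx = (flat.unsqueeze(1) - codebook.unsqueeze(0)).abs().argmin(dim=1)
    return codebook, idx

def huffman_bits(symbols) -> int:
    """Total bits needed to entropy-code the index stream with a Huffman code."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return len(symbols)  # degenerate single-symbol stream: 1 bit each
    # heap entries: (count, tiebreak, {symbol: code_length_so_far})
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    lengths = heap[0][2]
    return sum(freq[s] * lengths[s] for s in freq)
```

Because the index distribution after k-means is far from uniform, the Huffman stage beats both a fixed-width encoding and (per the 3080 experiments above) zstd on the same data.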
Novel codebook+Huffman compression pipeline (14.12 MB artifact, 21% savings vs int6+zstd) + EMA + Value Residual + Gated Attention + AdamW TTT. 8xH100 SXM, PyTorch 2.6.0, seed 1337.
Author
Closing — fixing TTT compliance (multi-epoch global TTT is non-causal after epoch 1). Will resubmit with a legal per-document, score-first TTT. Compression pipeline and architecture unchanged.
Summary
Novel Contribution: Codebook + Huffman Compression
Standard pipeline (int6 + zstd) produces 18+ MB on our architecture — over the 16 MB cap. We built a novel compression pipeline that fits comfortably:
Result: 14.12 MB (saves 21% / 3.9 MB vs int6+zstd). Prior work (PR #212) tested codebook+zstd and got 25% larger artifacts — our Huffman stage is the key innovation that makes codebook compression viable.
The 1.88 MB of headroom could fund ~2.5M additional parameters (a 12th layer or wider MLP) in future iterations.
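The headroom arithmetic above can be checked back-of-envelope (assuming 6 bits per parameter in the int6 container, and ignoring the entropy-coding stage, which would stretch the budget further — so the ~2.5M figure is conservative):

```python
CAP_MB = 16.0
ARTIFACT_MB = 14.12

headroom_mb = CAP_MB - ARTIFACT_MB               # ~1.88 MB under the cap
bits_per_param = 6                               # int6 container, pre-Huffman
extra_params = headroom_mb * 1024**2 * 8 / bits_per_param
print(f"{extra_params / 1e6:.2f}M parameters of headroom")
```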
vs. Prior Submissions
Training Stack
Test-Time Training
AdamW TTT, 10 epochs, cosine lr schedule, per-layer lr groups:
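A minimal sketch of that TTT optimizer setup. The 10-epoch cosine schedule is from the summary; the base lr and the rule for scaling embedding parameters down are illustrative assumptions, since the per-group values are not published here:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def make_ttt_optimizer(model: torch.nn.Module, base_lr: float = 1e-4,
                       n_epochs: int = 10):
    """AdamW with per-layer lr groups and a cosine schedule for test-time
    training. The 0.1x multiplier on embedding params is a hypothetical
    grouping rule, not the PR's exact values."""
    groups = []
    for name, param in model.named_parameters():
        scale = 0.1 if "emb" in name else 1.0
        groups.append({"params": [param], "lr": base_lr * scale})
    opt = AdamW(groups, weight_decay=0.0)
    sched = CosineAnnealingLR(opt, T_max=n_epochs)
    return opt, sched
```

At test time this runs for 10 epochs over the (already scored) document stream, stepping the scheduler once per epoch.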
Results
Timing
Reproduction
Platform: RunPod 8xH100 SXM, PyTorch 2.6.0+cu124
Credits
Team
Built by Artie AI