Non-record: Negative results — eval-time interventions, mixed-precision GPTQ, loss truncation#1103
Open
abaybektursun wants to merge 2 commits into openai:main
Conversation
…on GPTQ, loss truncation
8 experiments on the PR openai#1019 stack: 1 positive (memmap −0.0033), 7 negative.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Negative results on eval-time interventions, mixed-precision quantization, and loss truncation on the PR #1019 stack (1.1147 BPB).
7 experiments, all negative. Tested on 4×H100 PCIe (vast.ai spot) with seed 314 and MAX_WALLCLOCK_SECONDS=1800 (equivalent to ~6800 steps, matching the PR's ~6922).
1–2. kNN-LM (Single-Layer and Multi-Layer)
Single-layer: k=64, λ=0.015, T=1.0, L2 distance, 2M store, final hidden states (512-dim). Delta stabilized at +0.0026 BPB after store filled. L2 retrieval in raw hidden space adds noise — the model's XSA on all 11 layers already captures inter-position patterns.
Multi-layer: Concatenated hidden states from all 11 layers (5632-dim), L2-normalized, cosine similarity. Worse than single-layer (+0.0031). Early layers add noise to the retrieval key. Also hit fp16 overflow (NaN) and fp32 OOM (45GB temporary tensor) before settling on normalized keys.
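The interpolation both variants used can be sketched as follows. This is a hedged toy version (Khandelwal-style kNN-LM), not the experiment's code: the datastore, the 10-token vocabulary, and the `knn_probs`/`interpolate` names are all illustrative; only λ=0.015 and the L2 metric come from the run above.

```python
import numpy as np

def knn_probs(query, keys, values, vocab_size, k=2, temperature=1.0):
    # L2 distances from the query hidden state to every stored key
    d = np.linalg.norm(keys - query, axis=1)
    idx = np.argsort(d)[:k]                     # k nearest neighbours
    w = np.exp(-d[idx] / temperature)           # softmax over negative distance
    w /= w.sum()
    p = np.zeros(vocab_size)
    for tok, weight in zip(values[idx], w):
        p[tok] += weight                        # aggregate neighbour votes per token
    return p

def interpolate(p_neural, p_knn, lam=0.015):
    # λ=0.015 is the value from the experiment above
    return (1.0 - lam) * p_neural + lam * p_knn

# Toy datastore: three (key, next-token) pairs in a 2-dim key space
keys = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
values = np.array([2, 2, 7])
p_knn = knn_probs(np.array([0.1, 0.1]), keys, values, vocab_size=10)
p = interpolate(np.full(10, 0.1), p_knn)
```

With λ this small, the kNN distribution only nudges the neural one, which is why a noisy retrieval key shifts the delta by only a few thousandths of a BPB.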
3. Sliding Window Logit Averaging
Averaged NLLs across all 32 overlapping windows per position (geometric mean of probabilities). Result: +0.024 BPB. Short-context windows (64 tokens) produce much worse predictions than max-context windows (2048 tokens). Averaging dilutes the best prediction with 31 worse ones.
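The dilution effect can be sketched numerically. The window counts and probabilities below are toy numbers chosen to mirror the 1-strong-vs-31-weak setup, not the repo's eval code:

```python
import numpy as np

def geomean_bpb(probs):
    # geometric mean of per-window probabilities for a target token is
    # exp(mean log-prob); negate and divide by log 2 for bits per token
    return -np.log(probs).mean(axis=0) / np.log(2)

strong = np.full((1, 4), 0.9)    # one max-context window per position
weak = np.full((31, 4), 0.5)     # 31 short-context windows, much worse predictions
combined = np.vstack([strong, weak])

best_only = geomean_bpb(strong).mean()   # score the best window alone
averaged = geomean_bpb(combined).mean()  # geometric-mean averaging dilutes it
```

Because the geometric mean is an arithmetic mean in log space, every weak window contributes its full NLL, so one good window cannot rescue the average.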
4. SelfExtend / Extended Context
SelfExtend (position remapping, group_size=4): 2.35 BPB. Conflicts with the model's existing NTK-aware RoPE scaling.
Plain extended context (4096 tokens, built-in NTK): 1.59 BPB. Model was trained at seq_len=2048; extending to 4096 produces out-of-distribution attention patterns.
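A hedged sketch of SelfExtend-style grouped relative-position remapping with group_size=4, as described in the SelfExtend paper; the neighbour `window` size is an assumed illustrative parameter, not a value from the experiment:

```python
def remap_rel_pos(rel, group_size=4, window=2):
    # Nearby keys keep their exact relative position; distant keys are
    # collapsed into shared grouped positions, compressing the range the
    # RoPE embedding must cover. Any remapping like this fights a model
    # whose RoPE is already NTK-scaled.
    if rel < window:
        return rel
    return window + (rel - window) // group_size

compressed = remap_rel_pos(100)  # a distant key lands on a grouped position
exact = remap_rel_pos(1)         # a neighbour keeps its true position
```

The remapping changes every long-range relative position the model sees, which is consistent with the 2.35 BPB blow-up on a model already relying on NTK-aware scaling.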
5. N-gram Log-Linear Blend
Single-pass (legal): at each scored position, neural logits and an interpolated n-gram distribution are blended via
log P = (1−α)·log P_neural + α·log P_ngram, then normalized. Strictly causal: the n-gram cache is updated only after scoring. Best α=0.02 gave −0.0003 BPB on a 1% test slice (confirmed on 5% of the full eval). Problem: the pure-Python n-gram loop runs at ~9K tok/s, so the full 62M-token eval would take ~2 hours, far exceeding the 10-minute eval budget. It would need a C extension to be practical.
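The blend formula can be sketched as follows (illustrative only; the eps smoothing and the 3-token toy distributions are assumptions, only α=0.02 comes from the experiment):

```python
import numpy as np

def loglinear_blend(p_neural, p_ngram, alpha=0.02, eps=1e-10):
    # log P = (1-α)·log P_neural + α·log P_ngram, then renormalize
    logp = (1 - alpha) * np.log(p_neural + eps) + alpha * np.log(p_ngram + eps)
    p = np.exp(logp - logp.max())   # subtract max for numerical stability
    return p / p.sum()

p = loglinear_blend(np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1]))
```

At α=0.02 the n-gram term acts as a mild reranker; tokens the n-gram model has never seen are penalized only slightly, which keeps the blend safe but also caps the possible gain.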
6. Mixed-Precision GPTQ
Hypothesis (from model analysis): MLP accounts for 80% of quantization damage; attention Q matrices at 72.6% SVD utilization are insensitive. Steal bits from attention, give to MLP_down.
Result: SVD utilization does not predict quantization robustness. Int4 attention destroys critical signal, and uniform int6 is already near-optimal.
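The bit-budget trade-off can be illustrated with a toy symmetric round-to-nearest quantizer. This is not GPTQ (no Hessian-based error compensation) and not the experiment's code, just a sketch of why dropping attention to int4 costs far more error than int6 saves elsewhere:

```python
import numpy as np

def quantize(w, bits):
    # symmetric round-to-nearest with a single per-tensor scale
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

rng = np.random.default_rng(314)            # seed reused from the setup above
w = rng.normal(size=1000)                   # stand-in for an attention weight tensor
err_int4 = np.mean((w - quantize(w, 4)) ** 2)
err_int6 = np.mean((w - quantize(w, 6)) ** 2)
```

Halving the bit count roughly quadruples the step size, so MSE grows by an order of magnitude; "insensitive by SVD utilization" does not mean insensitive to that.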
7. Loss Truncation
Clipped per-token loss at 95th percentile during training (top 5% of losses zeroed).
The model learned to predict only "easy" tokens and abandoned the hard ones. At 27M params, every gradient signal matters — the top 5% includes genuine learning signal (rare words, domain terms, context-dependent predictions), not just noise.
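The truncation rule tested above can be sketched as follows (illustrative, not the training code; the toy loss values are assumptions):

```python
import numpy as np

def truncate_losses(losses, pct=95):
    # zero out the top 5% of per-token losses each step, so the hardest
    # tokens contribute no gradient
    cutoff = np.percentile(losses, pct)
    return np.where(losses > cutoff, 0.0, losses)

losses = np.array([0.1, 0.2, 0.3, 0.4, 10.0])  # one hard token dominates
kept = truncate_losses(losses)
```

At 27M params the "hard" tail is exactly where rare words and context-dependent predictions live, so zeroing it removes signal rather than noise.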
Conclusions
This model is tightly optimized. Uniform int6 quantization, max-context sliding window scoring, and standard cross-entropy loss are all locally optimal. No free lunch from post-hoc interventions.
Eval-time symbolic methods don't help. kNN, n-gram blending, logit averaging, extended context — all negative. The model already captures the patterns these methods target via XSA, BigramHash, and attention.
Quantization precision should be uniform. SVD utilization does not predict quantization sensitivity. All weight types need equal precision.
Loss truncation hurts at 27M params. Every training signal matters at this scale. Filtering "hard" tokens removes genuine learning signal.
🤖 Generated with Claude Code