Non-record: Negative results — eval-time interventions, mixed-precision GPTQ, loss truncation#1103

Open
abaybektursun wants to merge 2 commits into openai:main from abaybektursun:nonrecord/eval-time-quant-losstrunc-negative

Conversation


@abaybektursun abaybektursun commented Mar 29, 2026

Summary

Negative results on eval-time interventions, mixed-precision quantization, and loss truncation on the PR #1019 stack (1.1147 BPB).

7 experiments, all negative. Tested on 4×H100 PCIe (vast.ai spot) with seed 314 and MAX_WALLCLOCK_SECONDS=1800 (~6800 steps, matching the PR's ~6922).

| # | Experiment | Type | Delta BPB | Verdict |
| --- | --- | --- | --- | --- |
| 1 | Single-layer kNN-LM (k=64, L2, 2M store) | Eval-time | +0.0026 | Negative |
| 2 | Multi-layer kNN-LM (11 layers, cosine, 2M store) | Eval-time | +0.0031 | Negative |
| 3 | Sliding window logit averaging (32 windows) | Eval-time | +0.024 | Catastrophically negative |
| 4 | SelfExtend / extended context 4096 | Eval-time | +0.48 | Catastrophically negative |
| 5 | N-gram log-linear blend (single-pass, α=0.02) | Eval-time | −0.0003 | Marginal, too slow for 10-min eval |
| 6 | Mixed-precision GPTQ (int4 attn, int8 MLP_down) | Post-training | +0.047 | Catastrophically negative |
| 7 | Loss truncation (95th percentile) | Training | +0.081 | Catastrophically negative |

1–2. kNN-LM (Single-Layer and Multi-Layer)

Single-layer: k=64, λ=0.015, T=1.0, L2 distance, 2M store, final hidden states (512-dim). Delta stabilized at +0.0026 BPB once the store filled. L2 retrieval in raw hidden space adds noise — the model's XSA across all 11 layers already captures the inter-position patterns retrieval would supply.

Multi-layer: Concatenated hidden states from all 11 layers (5632-dim), L2-normalized, cosine similarity. Worse than single-layer (+0.0031). Early layers add noise to the retrieval key. Also hit fp16 overflow (NaN) and fp32 OOM (45GB temporary tensor) before settling on normalized keys.
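For reference, the interpolation tried here follows the standard kNN-LM recipe (Khandelwal et al.): retrieve the k nearest stored (hidden state, next token) pairs, softmax over negative distances, and mix with the model's distribution. A minimal numpy sketch with illustrative shapes — `knn_lm_interpolate` is a hypothetical helper, not code from this PR:

```python
import numpy as np

def knn_lm_interpolate(p_model, query, keys, values, vocab_size,
                       k=64, lam=0.015, temp=1.0):
    """Blend the model's next-token distribution with a kNN distribution
    built from stored (hidden_state, next_token) pairs."""
    # L2 distances from the query hidden state to every stored key
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances -> weights on the retrieved tokens
    w = np.exp(-dists[nearest] / temp)
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nearest], w)   # scatter-add neighbour mass
    return (1.0 - lam) * p_model + lam * p_knn
```

With λ=0.015 the kNN term can only nudge the distribution, so the +0.0026 delta means the retrieved mass was net-harmful, not just diluted.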


3. Sliding Window Logit Averaging

Averaged NLLs across all 32 overlapping windows per position (geometric mean of probabilities). Result: +0.024 BPB. Short-context windows (64 tokens) produce much worse predictions than max-context windows (2048 tokens). Averaging dilutes the best prediction with 31 worse ones.
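The per-position combination is a geometric mean of probabilities, which is the same as arithmetically averaging the windows' NLLs before exponentiating. A toy sketch (hypothetical helper, made-up numbers) of why one strong max-context window gets dragged down:

```python
import math

def geomean_prob(window_probs):
    """Geometric mean of one token's probability across overlapping windows;
    equivalent to averaging the windows' per-token NLLs, then exponentiating."""
    mean_nll = sum(-math.log(p) for p in window_probs) / len(window_probs)
    return math.exp(-mean_nll)

# One good 2048-context window (p=0.5) averaged with two weak 64-context
# windows (p=0.1) scores far below the best window alone.
```

Since BPB is itself an average NLL, this averaging can only help if short-context windows are competitive — here they aren't.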


4. SelfExtend / Extended Context

SelfExtend (position remapping, group_size=4): 2.35 BPB. Conflicts with the model's existing NTK-aware RoPE scaling.

Plain extended context (4096 tokens, built-in NTK): 1.59 BPB. Model was trained at seq_len=2048; extending to 4096 produces out-of-distribution attention patterns.
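For context, SelfExtend keeps exact relative positions inside a neighbour window and floor-divides them by the group size outside it, with a shift so the two regimes meet at the boundary. A sketch of that remapping — the PR used group_size=4; the neighbour-window value below is an illustrative assumption, not the PR's setting:

```python
def self_extend_rel_pos(q_pos, k_pos, group_size=4, neighbor_window=512):
    """Relative position fed to RoPE under SelfExtend: exact inside the
    neighbour window, floor-divided (grouped) outside, shifted so the two
    regimes join continuously at the window boundary."""
    rel = q_pos - k_pos
    if rel <= neighbor_window:
        return rel
    grouped = q_pos // group_size - k_pos // group_size
    return grouped + neighbor_window - neighbor_window // group_size
```

This remapping assumes vanilla RoPE; applied on top of the model's NTK-aware scaling, the positions get rescaled twice, consistent with the 2.35 BPB blow-up.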


5. N-gram Log-Linear Blend

Single-pass (legal): at each scored position, the neural distribution and an interpolated n-gram distribution are blended via log P = (1-α)·log P_neural + α·log P_ngram, then renormalized. Strictly causal — the n-gram cache is updated only after a position is scored.

Best α=0.02 gave −0.0003 BPB on 1% test (confirmed on 5% of full eval). Problem: pure Python n-gram loop runs at ~9K tok/s. Full 62M token eval would take ~2 hours, far exceeding the 10-minute eval budget. Would need a C extension to be practical.
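The blend itself is a one-liner in log space; the renormalization is needed because a log-linear mixture of two distributions doesn't sum to 1. A numpy sketch (hypothetical helper name):

```python
import numpy as np

def log_linear_blend(logp_neural, logp_ngram, alpha=0.02):
    """log P = (1-a)·log P_neural + a·log P_ngram, renormalized via
    log-sum-exp so the result is a proper distribution."""
    mixed = (1.0 - alpha) * logp_neural + alpha * logp_ngram
    mixed -= np.logaddexp.reduce(mixed)   # stable renormalization in log space
    return np.exp(mixed)
```

The vectorized blend is cheap; the bottleneck reported above is the Python-side n-gram counting that produces `logp_ngram`, which is why a C extension would be needed.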


6. Mixed-Precision GPTQ

Hypothesis (from model analysis): MLP accounts for 80% of quantization damage; attention Q matrices at 72.6% SVD utilization are insensitive. Steal bits from attention, give to MLP_down.

| Config | Attn | MLP_up | MLP_down | Size | BPB | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| uniform_int6 | 6 | 6 | 6 | 14.87 MiB | 1.11773 | baseline |
| attn4_mlpD8 | 4 | 6 | 8 | 12.84 MiB | 1.16472 | +0.047 |
| attn5_mlpD7 | 5 | 6 | 7 | 13.88 MiB | 1.12998 | +0.012 |

SVD utilization does not predict quantization robustness. Int4 attention destroys critical signal. Uniform int6 is already near-optimal.
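To make the bit widths concrete: a plain symmetric round-to-nearest quantizer (a simplified stand-in for GPTQ, which additionally compensates each column's rounding error against a Hessian estimate) shows how much representable precision is lost going from int6 to int4:

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric per-row round-to-nearest quantization: snap each weight to
    the nearest of 2**bits signed levels, then dequantize back to float."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale
```

Dropping from int6 (31 positive levels per row) to int4 (7) roughly quadruples the rounding error per weight, which in the attention matrices was evidently not free.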


7. Loss Truncation

Clipped per-token loss at 95th percentile during training (top 5% of losses zeroed).

| Metric | Baseline | Loss truncation | Delta |
| --- | --- | --- | --- |
| Val BPB @ step 4000 | 1.1959 | 1.2772 | +0.081 |

The model learned to predict only "easy" tokens and abandoned the hard ones. At 27M params, every gradient signal matters — the top 5% includes genuine learning signal (rare words, domain terms, context-dependent predictions), not just noise.
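The intervention itself is tiny. A numpy sketch of a per-batch version — computing the percentile per batch is my assumption; the PR may track a running threshold instead:

```python
import numpy as np

def truncate_losses(token_losses, percentile=95.0):
    """Loss truncation: zero the per-token losses above the given percentile,
    so the hardest ~5% of tokens contribute no gradient."""
    cutoff = np.percentile(token_losses, percentile)
    return np.where(token_losses > cutoff, 0.0, token_losses)
```

In training this is applied before the mean-reduce, so the masked tokens silently vanish from the gradient — exactly the rare-word signal the paragraph above says a 27M-param model cannot afford to lose.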


Conclusions

  1. This model is tightly optimized. Uniform int6 quantization, max-context sliding window scoring, and standard cross-entropy loss are all locally optimal. No free lunch from post-hoc interventions.

  2. Eval-time symbolic methods don't help. kNN, n-gram blending, logit averaging, extended context — all negative. The model already captures the patterns these methods target via XSA, BigramHash, and attention.

  3. Quantization precision should be uniform. SVD utilization does not predict quantization sensitivity. All weight types need equal precision.

  4. Loss truncation hurts at 27M params. Every training signal matters at this scale. Filtering "hard" tokens removes genuine learning signal.

🤖 Generated with Claude Code

…on GPTQ, loss truncation

8 experiments on PR openai#1019 stack: 1 positive (memmap −0.0033), 7 negative.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun changed the title from "Non-record: Negative results — eval-time interventions, mixed-precision GPTQ, loss truncation, memmap pipeline (+)" to "Non-record: Negative results — eval-time interventions, mixed-precision GPTQ, loss truncation" on Mar 29, 2026
