Non-record: Negative results — eval-time interventions, mixed-precision GPTQ, loss truncation#1103
Open
abaybektursun wants to merge 2 commits into openai:main
Conversation
…on GPTQ, loss truncation
8 experiments on the PR openai#1019 stack: 1 positive (memmap −0.0033), 7 negative.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Negative results on eval-time interventions, mixed-precision quantization, and loss truncation on the PR #1019 stack (1.1147 BPB).
7 experiments, all negative. Tested on 4×H100 PCIe (vast.ai spot) with seed 314 and MAX_WALLCLOCK_SECONDS=1800 (equivalent to ~6800 steps, matching the PR's ~6922).
1–2. kNN-LM (Single-Layer and Multi-Layer)
Single-layer: k=64, λ=0.015, T=1.0, L2 distance, 2M store, final hidden states (512-dim). Delta stabilized at +0.0026 BPB after store filled. L2 retrieval in raw hidden space adds noise — the model's XSA on all 11 layers already captures inter-position patterns.
Multi-layer: Concatenated hidden states from all 11 layers (5632-dim), L2-normalized, cosine similarity. Worse than single-layer (+0.0031). Early layers add noise to the retrieval key. Also hit fp16 overflow (NaN) and fp32 OOM (45GB temporary tensor) before settling on normalized keys.
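The interpolation both variants used can be sketched as follows. This is a hedged toy version (Khandelwal-style kNN-LM), not the experiment's code: the datastore, the 10-token vocabulary, and the `knn_probs`/`interpolate` names are all illustrative; only λ=0.015 and the L2 metric come from the run above.

```python
import numpy as np

def knn_probs(query, keys, values, vocab_size, k=2, temperature=1.0):
    # L2 distances from the query hidden state to every stored key
    d = np.linalg.norm(keys - query, axis=1)
    idx = np.argsort(d)[:k]                     # k nearest neighbours
    w = np.exp(-d[idx] / temperature)           # softmax over negative distance
    w /= w.sum()
    p = np.zeros(vocab_size)
    for tok, weight in zip(values[idx], w):
        p[tok] += weight                        # aggregate neighbour votes per token
    return p

def interpolate(p_neural, p_knn, lam=0.015):
    # λ=0.015 is the value from the experiment above
    return (1.0 - lam) * p_neural + lam * p_knn

# Toy datastore: three (key, next-token) pairs in a 2-dim key space
keys = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
values = np.array([2, 2, 7])
p_knn = knn_probs(np.array([0.1, 0.1]), keys, values, vocab_size=10)
p = interpolate(np.full(10, 0.1), p_knn)
```

With λ this small, the kNN distribution only nudges the neural one, which is why a noisy retrieval key shifts the delta by only a few thousandths of a BPB.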
3. Sliding Window Logit Averaging
Averaged NLLs across all 32 overlapping windows per position (geometric mean of probabilities). Result: +0.024 BPB. Short-context windows (64 tokens) produce much worse predictions than max-context windows (2048 tokens). Averaging dilutes the best prediction with 31 worse ones.
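The dilution effect can be sketched numerically. The window counts and probabilities below are toy numbers chosen to mirror the 1-strong-vs-31-weak setup, not the repo's eval code:

```python
import numpy as np

def geomean_bpb(probs):
    # geometric mean of per-window probabilities for a target token is
    # exp(mean log-prob); negate and divide by log 2 for bits per token
    return -np.log(probs).mean(axis=0) / np.log(2)

strong = np.full((1, 4), 0.9)    # one max-context window per position
weak = np.full((31, 4), 0.5)     # 31 short-context windows, much worse predictions
combined = np.vstack([strong, weak])

best_only = geomean_bpb(strong).mean()   # score the best window alone
averaged = geomean_bpb(combined).mean()  # geometric-mean averaging dilutes it
```

Because the geometric mean is an arithmetic mean in log space, every weak window contributes its full NLL, so one good window cannot rescue the average.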
4. SelfExtend / Extended Context
SelfExtend (position remapping, group_size=4): 2.35 BPB. Conflicts with the model's existing NTK-aware RoPE scaling.
Plain extended context (4096 tokens, built-in NTK): 1.59 BPB. Model was trained at seq_len=2048; extending to 4096 produces out-of-distribution attention patterns.
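A hedged sketch of SelfExtend-style grouped relative-position remapping with group_size=4, as described in the SelfExtend paper; the neighbour `window` size is an assumed illustrative parameter, not a value from the experiment:

```python
def remap_rel_pos(rel, group_size=4, window=2):
    # Nearby keys keep their exact relative position; distant keys are
    # collapsed into shared grouped positions, compressing the range the
    # RoPE embedding must cover. Any remapping like this fights a model
    # whose RoPE is already NTK-scaled.
    if rel < window:
        return rel
    return window + (rel - window) // group_size

compressed = remap_rel_pos(100)  # a distant key lands on a grouped position
exact = remap_rel_pos(1)         # a neighbour keeps its true position
```

The remapping changes every long-range relative position the model sees, which is consistent with the 2.35 BPB blow-up on a model already relying on NTK-aware scaling.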
5. N-gram Log-Linear Blend
Single-pass (legal): at each scored position, neural logits and an interpolated n-gram distribution are blended via
log P = (1−α)·log P_neural + α·log P_ngram, then normalized. Strictly causal: the n-gram cache is updated only after scoring. Best α=0.02 gave −0.0003 BPB on a 1% test slice (confirmed on 5% of the full eval). Problem: the pure-Python n-gram loop runs at ~9K tok/s, so the full 62M-token eval would take ~2 hours, far exceeding the 10-minute eval budget. It would need a C extension to be practical.
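The blend formula can be sketched as follows (illustrative only; the eps smoothing and the 3-token toy distributions are assumptions, only α=0.02 comes from the experiment):

```python
import numpy as np

def loglinear_blend(p_neural, p_ngram, alpha=0.02, eps=1e-10):
    # log P = (1-α)·log P_neural + α·log P_ngram, then renormalize
    logp = (1 - alpha) * np.log(p_neural + eps) + alpha * np.log(p_ngram + eps)
    p = np.exp(logp - logp.max())   # subtract max for numerical stability
    return p / p.sum()

p = loglinear_blend(np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1]))
```

At α=0.02 the n-gram term acts as a mild reranker; tokens the n-gram model has never seen are penalized only slightly, which keeps the blend safe but also caps the possible gain.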
6. Mixed-Precision GPTQ
Hypothesis (from model analysis): MLP accounts for 80% of quantization damage; attention Q matrices at 72.6% SVD utilization are insensitive. Steal bits from attention, give to MLP_down.
Result: SVD utilization does not predict quantization robustness. Int4 attention destroys critical signal, and uniform int6 is already near-optimal.
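The bit-budget trade-off can be illustrated with a toy symmetric round-to-nearest quantizer. This is not GPTQ (no Hessian-based error compensation) and not the experiment's code, just a sketch of why dropping attention to int4 costs far more error than int6 saves elsewhere:

```python
import numpy as np

def quantize(w, bits):
    # symmetric round-to-nearest with a single per-tensor scale
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

rng = np.random.default_rng(314)            # seed reused from the setup above
w = rng.normal(size=1000)                   # stand-in for an attention weight tensor
err_int4 = np.mean((w - quantize(w, 4)) ** 2)
err_int6 = np.mean((w - quantize(w, 6)) ** 2)
```

Halving the bit count roughly quadruples the step size, so MSE grows by an order of magnitude; "insensitive by SVD utilization" does not mean insensitive to that.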
7. Loss Truncation
Clipped per-token loss at 95th percentile during training (top 5% of losses zeroed).
The model learned to predict only "easy" tokens and abandoned the hard ones. At 27M params, every gradient signal matters — the top 5% includes genuine learning signal (rare words, domain terms, context-dependent predictions), not just noise.
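The truncation rule tested above can be sketched as follows (illustrative, not the training code; the toy loss values are assumptions):

```python
import numpy as np

def truncate_losses(losses, pct=95):
    # zero out the top 5% of per-token losses each step, so the hardest
    # tokens contribute no gradient
    cutoff = np.percentile(losses, pct)
    return np.where(losses > cutoff, 0.0, losses)

losses = np.array([0.1, 0.2, 0.3, 0.4, 10.0])  # one hard token dominates
kept = truncate_losses(losses)
```

At 27M params the "hard" tail is exactly where rare words and context-dependent predictions live, so zeroing it removes signal rather than noise.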
Conclusions
This model is tightly optimized. Uniform int6 quantization, max-context sliding window scoring, and standard cross-entropy loss are all locally optimal. No free lunch from post-hoc interventions.
Eval-time symbolic methods don't help. kNN, n-gram blending, logit averaging, extended context — all negative. The model already captures the patterns these methods target via XSA, BigramHash, and attention.
Quantization precision should be uniform. SVD utilization does not predict quantization sensitivity. All weight types need equal precision.
Loss truncation hurts at 27M params. Every training signal matters at this scale. Filtering "hard" tokens removes genuine learning signal.
🤖 Generated with Claude Code