Novel approaches analysis for sub-1.10 BPB Parameter Golf by Copilot · Pull Request #1 · kailean/parameter-golf

Copilot · 2026-03-30T02:55:01Z

Comprehensive analysis of 8 proposed angles for pushing below 1.10 BPB, grounded in competition ablation data from Issue openai#140 (2,100+ forks, 100+ submissions) and this repo's leaderboard entries.

Key findings

MSE ≠ artifact size kills Angles 2 (non-uniform quant) and 3 (sparsification) — lower reconstruction error increases index entropy, defeating zstd-22 compression (Non-record: Compression moonshots — 8 negative/marginal findings (Procrustes, SWA smoothness, selective fp16, pruning+zstd) openai/parameter-golf#1048, Non-record: 12L Low-Rank Q + QAT (1xH100, pre-quant 1.2035) openai/parameter-golf#316)
Depth recurrence ≥3 loops is dead — int6 quantization error amplifies ~900× over 3 cycles (Non-record: Depth Recurrence in Parameter-Constrained Transformers — What Works, What Doesn't, and Why openai/parameter-golf#363)
Context mixing (Angle 5) is the highest-EV path — EngramLite gated hashing (Record Submission: 1.1091 BPB - Turbo-Muon + EngramLite + ParamBanking + XSA (11L 512d) openai/parameter-golf#1089, 1.1086 BPB) + BackoffNgramMixer (Record: 0.3958 BPB — Causal BackoffNgramMixer (3-seed, std 0.0011) openai/parameter-golf#1094, 0.40 BPB) are independently proven but never combined

Deliverable: `pg_novel_ideas.md`

8 angle analyses with will-it-work verdicts citing specific PR results
Ranked idea list (3 tiers + dead ideas table with evidence)
3 POC code stubs: EngramLite (gated multi-head bigram+trigram hash replacing BigramHash), Complementary Training loss (down-weights n-gram-predictable tokens), BackoffNgramMixer (causal, full-vocab normalized eval-time cache)
THE MOONSHOT: Complementary Training + EngramLite + BackoffNgramMixer — the neural model specializes on what n-grams can't predict, then the free eval-time cache handles the rest

Example: Complementary Training loss weighting

# p_bigram[i] = bigram_probs[prev_token[i], target[i]]
weights = 1.0 - complement_alpha * p_bigram  # down-weight easy tokens
weights = mx.clip(weights, 0.1, 1.0)
loss = (ce_per_token * weights / weights.mean()).mean()

No training runs executed — analysis and code stubs only.

Original prompt

Referenz: 145bf29 analyse all the tests and best bpb acores and MISSION: Brainstorm and prototype genuinely novel approaches to push
Parameter Golf below 1.10 bpb. This is NOT about tweaking hyperparameters.
This is about finding ideas that the 2,100+ forks haven't tried.

Read this first for context on what everyone IS doing:
openai#140

Then think about these angles. For each one, write a short analysis
(will it work? why/why not? estimated bpb impact?) and if it's
promising, write a proof-of-concept implementation stub in Python.
Save everything to ~/pg_novel_ideas.md.

=== ANGLE 1: COMPRESSION IS THE OBJECTIVE ===

The metric is bits-per-byte. Cross-entropy loss optimizes
token-level prediction, but bpb measures byte-level compression.
These are NOT the same thing. Every current submission trains
with cross-entropy loss then measures bpb. What if we directly
optimized for bpb during training?

Ideas to explore:

Byte-level auxiliary loss alongside token loss
Loss weighting by bytes-per-token (tokens that decode to more
bytes matter more for bpb)
Is there a differentiable approximation of bpb that we could
backprop through?

=== ANGLE 2: NON-UNIFORM QUANTIZATION ===

Everyone uses uniform int6 (64 evenly spaced levels). But neural
network weights follow approximately Gaussian distributions — most
values cluster near zero, few are large. Uniform quantization wastes
precision on the tails.

Ideas to explore:

K-means quantization: cluster weights into 64 centroids that
match the actual distribution. Store a 64-entry codebook + 6-bit
indices. The codebook costs ~256 bytes per row — negligible.
QAT with learned codebook: train the codebook end-to-end with STE
Log-scale quantization: levels spaced logarithmically around zero
NormalFloat (NF4/NF6): quantization levels matched to a normal
distribution, as in QLoRA. Has anyone tried NF6 for Parameter Golf?

Estimate: if non-uniform quant reduces quantization error by 30%,
that's roughly equivalent to having 30% more effective parameters.

=== ANGLE 3: ENTROPY-CODED WEIGHTS ===

Current pipeline: quantize to int6 → compress with zstd. But zstd
treats all bytes equally. What if we designed the weight distribution
to be maximally compressible?

Ideas to explore:

Weight entropy regularization during training: add a loss term
that penalizes high-entropy weight distributions. Train the model
to have weights that compress well, not just weights that predict well.
Sparse + dense hybrid: force 50% of weights to exactly zero
(structured pruning), then use the freed bits for higher precision
on remaining weights (int8 instead of int6 on the surviving weights)
ANS (Asymmetric Numeral Systems) coding instead of zstd: custom
entropy coder tuned to the exact weight distribution

=== ANGLE 4: HYPERNETWORK WEIGHT GENERATION ===

Instead of storing 37M quantized parameters, store a tiny neural
network (200K params) that GENERATES the weight matrices. At load
time, run the hypernetwork to produce all weights. The hypernetwork
itself is the artifact.

This is essentially neural network compression via implicit
representation. A 200K-param hypernetwork stored in int8 = ~200KB.
That leaves 15.8MB for... more hypernetworks, or auxiliary tables,
or a completely different use of the budget.

Question: can a 200K-param network generate 37M coherent weights?
Probably not directly. But what about: hypernetwork generates a
low-rank basis, and the artifact stores the coefficients?

=== ANGLE 5: CONTEXT MIXING (from compression theory) ===

The best non-neural compressors (cmix, paq8) use context mixing:
multiple simple predictors combined with learned weights. Each
predictor specializes in a different pattern (bigrams, trigrams,
word boundaries, repeating contexts). The mixer learns how much
to trust each predictor for each byte.

BigramHash is already a primitive version of this. What if we went
further?

TrigramHash (hash of 3 previous tokens → logit bias)
Skip-gram hash (token[t-2] × token[t] → logit bias)
Byte-level predictor alongside the token-level transformer
A tiny context mixing layer on top that blends transformer
logits with n-gram logits

Estimate: TrigramHash(8192) adds ~8M params in int6 = ~6MB.
Combined with BigramHash(16384), that's ~8MB of n-gram tables
plus ~7MB of transformer. Total: ~15MB. The n-gram tables are
essentially a compressed language model in themselves.

=== ANGLE 6: DEPTH RECURRENCE WITH PROGRESSIVE ADAPTATION ===

PR openai#831 showed novel architectures fail because
throughput-quantization co-optimization is the binding constraint.
But what if recurrence was done RIGHT?

4 unique blocks × 3 loops = 12 effective layers from 4 layers
of parameters
FiLM conditioning (scale/shift per loop) so each pass behaves
differently — costs ~3K params to...

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Analyzes 8 angles for pushing Parameter Golf below 1.10 bpb: - Byte-weighted CE loss, non-uniform quantization, entropy-coded weights, hypernetwork generation, context mixing, depth recurrence, multi-model ensemble, and information-theoretic bounds. Includes ranked idea list, 3 proof-of-concept code stubs (EngramLite, Complementary Training + BackoffNgramMixer, Skip-gram Hash + Shallow Recurrence), implementation difficulty estimates, and a Moonshot section. Key finding: The integrated Complementary Training + EngramLite + BackoffNgramMixer stack is the highest-EV unexplored combination. Agent-Logs-Url: https://github.com/kailean/parameter-golf/sessions/caad42c8-8417-45c0-bdf7-90803cf87976 Co-authored-by: kailean <49617037+kailean@users.noreply.github.com>

- Add explicit gate bias initialization (-2.0) in EngramLiteEmbedding - Fix parameter budget: hash_size=2048, embed_dim=128 with projection to 1024 - Clarify shallow recurrence pass_scales as pseudocode with implementation notes - Ensure POC 2 smoke test config matches actual class parameters Agent-Logs-Url: https://github.com/kailean/parameter-golf/sessions/caad42c8-8417-45c0-bdf7-90803cf87976 Co-authored-by: kailean <49617037+kailean@users.noreply.github.com>

Initial plan

ea95da0

Copilot AI assigned Copilot and kailean Mar 30, 2026

Copilot started work on behalf of kailean March 30, 2026 02:55 View session

Copilot AI and others added 2 commits March 30, 2026 03:06

Copilot AI changed the title ~~[WIP] Explore novel methods to achieve below 1.10 bpb in Parameter Golf~~ Novel approaches analysis for sub-1.10 BPB Parameter Golf Mar 30, 2026

Copilot finished work on behalf of kailean March 30, 2026 03:10

Copilot AI requested a review from kailean March 30, 2026 03:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Novel approaches analysis for sub-1.10 BPB Parameter Golf#1

Novel approaches analysis for sub-1.10 BPB Parameter Golf#1
Copilot wants to merge 3 commits intomainfrom
copilot/brainstorm-novel-approaches

Copilot AI commented Mar 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Copilot AI commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Key findings

Deliverable: pg_novel_ideas.md

Example: Complementary Training loss weighting

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Mar 30, 2026 •

edited

Loading

Deliverable: `pg_novel_ideas.md`