Novel approaches analysis for sub-1.10 BPB Parameter Golf#1

Draft
Copilot wants to merge 3 commits into main from copilot/brainstorm-novel-approaches

Conversation

Copilot AI commented Mar 30, 2026

Comprehensive analysis of 8 proposed angles for pushing below 1.10 BPB, grounded in competition ablation data from Issue openai#140 (2,100+ forks, 100+ submissions) and this repo's leaderboard entries.

Key findings

Deliverable: pg_novel_ideas.md

  • 8 angle analyses with will-it-work verdicts citing specific PR results
  • Ranked idea list (3 tiers + dead ideas table with evidence)
  • 3 POC code stubs: EngramLite (gated multi-head bigram+trigram hash replacing BigramHash), Complementary Training loss (down-weights n-gram-predictable tokens), BackoffNgramMixer (causal, full-vocab normalized eval-time cache)
  • THE MOONSHOT: Complementary Training + EngramLite + BackoffNgramMixer — the neural model specializes on what n-grams can't predict, then the free eval-time cache handles the rest

Example: Complementary Training loss weighting

import mlx.core as mx  # assuming the repo's MLX training setup

# p_bigram[i] = bigram_probs[prev_token[i], target[i]]  (table lookup, no gradient)
weights = 1.0 - complement_alpha * p_bigram  # down-weight tokens the bigram table already predicts
weights = mx.clip(weights, 0.1, 1.0)         # floor keeps easy tokens from being ignored entirely
loss = (ce_per_token * weights / weights.mean()).mean()  # renormalize to preserve the loss scale

No training runs executed — analysis and code stubs only.

Original prompt

Reference: 145bf29. Analyze all the tests and best bpb scores. MISSION: Brainstorm and prototype genuinely novel approaches to push
Parameter Golf below 1.10 bpb. This is NOT about tweaking hyperparameters.
This is about finding ideas that the 2,100+ forks haven't tried.

Read this first for context on what everyone IS doing:
openai#140

Then think about these angles. For each one, write a short analysis
(will it work? why/why not? estimated bpb impact?) and if it's
promising, write a proof-of-concept implementation stub in Python.
Save everything to ~/pg_novel_ideas.md.

=== ANGLE 1: COMPRESSION IS THE OBJECTIVE ===

The metric is bits-per-byte. Cross-entropy loss optimizes
token-level prediction, but bpb measures byte-level compression.
These are NOT the same thing. Every current submission trains
with cross-entropy loss then measures bpb. What if we directly
optimized for bpb during training?

Ideas to explore:

  • Byte-level auxiliary loss alongside token loss
  • Loss weighting by bytes-per-token (tokens that decode to more
    bytes matter more for bpb)
  • Is there a differentiable approximation of bpb that we could
    backprop through?
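On the last bullet: batch-level bpb is already differentiable, since it is just the summed cross-entropy (converted to bits) divided by the number of decoded bytes. A minimal sketch of both that and the byte-weighting heuristic from the bullets above, with hypothetical helper names not taken from the repo:

```python
import math

def batch_bpb(ce_per_token_nats, bytes_per_token):
    """Differentiable batch-level bpb: total cross-entropy in bits
    divided by the number of bytes the batch decodes to."""
    total_bits = sum(ce_per_token_nats) / math.log(2)
    return total_bits / sum(bytes_per_token)

def byte_weighted_ce(ce_per_token_nats, bytes_per_token):
    """The weighting heuristic: tokens that decode to more bytes
    contribute more, normalized so the loss keeps its usual scale."""
    n = len(ce_per_token_nats)
    mean_b = sum(bytes_per_token) / n
    return sum(ce * b / mean_b
               for ce, b in zip(ce_per_token_nats, bytes_per_token)) / n
```

Note the two are not identical objectives: bpb weights every token's nats equally and only the denominator is in bytes, while the heuristic shifts gradient toward byte-heavy tokens.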

=== ANGLE 2: NON-UNIFORM QUANTIZATION ===

Everyone uses uniform int6 (64 evenly spaced levels). But neural
network weights follow approximately Gaussian distributions — most
values cluster near zero, few are large. Uniform quantization wastes
precision on the tails.

Ideas to explore:

  • K-means quantization: cluster weights into 64 centroids that
    match the actual distribution. Store a 64-entry codebook + 6-bit
    indices. The codebook costs ~256 bytes per row — negligible.
  • QAT with learned codebook: train the codebook end-to-end with STE
  • Log-scale quantization: levels spaced logarithmically around zero
  • NormalFloat (NF4/NF6): quantization levels matched to a normal
    distribution, as in QLoRA. Has anyone tried NF6 for Parameter Golf?

Estimate: if non-uniform quant reduces quantization error by 30%,
that's roughly equivalent to having 30% more effective parameters.
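The k-means idea from the first bullet can be sketched in a few lines of 1-D Lloyd's iteration. This is a toy illustration, not the repo's quantizer; a real int6 replacement would run this per weight matrix (or per row) and store the 64-float codebook plus 6-bit indices:

```python
def kmeans_quantize(weights, k=64, iters=20):
    """Toy 1-D Lloyd's algorithm: cluster weight values into k
    centroids and return (codebook, indices)."""
    ws = sorted(weights)
    # deterministic quantile init spreads centroids over the value range
    centroids = [ws[(i * (len(ws) - 1)) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for w in weights:
            j = min(range(k), key=lambda i: abs(w - centroids[i]))
            buckets[j].append(w)
        # move each centroid to the mean of its bucket (keep it if empty)
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    idx = [min(range(k), key=lambda i: abs(w - centroids[i])) for w in weights]
    return centroids, idx
```

For Gaussian-shaped weights this concentrates levels near zero exactly where uniform int6 wastes them, which is the same intuition behind NF4/NF6.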

=== ANGLE 3: ENTROPY-CODED WEIGHTS ===

Current pipeline: quantize to int6 → compress with zstd. But zstd
treats all bytes equally. What if we designed the weight distribution
to be maximally compressible?

Ideas to explore:

  • Weight entropy regularization during training: add a loss term
    that penalizes high-entropy weight distributions. Train the model
    to have weights that compress well, not just weights that predict well.
  • Sparse + dense hybrid: force 50% of weights to exactly zero
    (structured pruning), then use the freed bits for higher precision
    on remaining weights (int8 instead of int6 on the surviving weights)
  • ANS (Asymmetric Numeral Systems) coding instead of zstd: custom
    entropy coder tuned to the exact weight distribution
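The quantity all three bullets revolve around is the empirical entropy of the quantized codes: an ideal entropy coder (ANS included) approaches this rate, so it directly estimates the compressed artifact size. A small sketch with a hypothetical helper name (a differentiable regularizer would replace the hard histogram with a soft one):

```python
import math
from collections import Counter

def code_entropy_bits(codes):
    """Empirical entropy (bits/symbol) of quantized weight codes.
    Compressed size is roughly len(codes) * H / 8 bytes under an
    ideal entropy coder."""
    counts = Counter(codes)
    n = len(codes)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

A peaky code distribution (e.g. 50% forced to the zero level, per the sparse+dense bullet) drives H well below 6 bits/symbol, which is exactly the headroom the freed bits would come from.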

=== ANGLE 4: HYPERNETWORK WEIGHT GENERATION ===

Instead of storing 37M quantized parameters, store a tiny neural
network (200K params) that GENERATES the weight matrices. At load
time, run the hypernetwork to produce all weights. The hypernetwork
itself is the artifact.

This is essentially neural network compression via implicit
representation. A 200K-param hypernetwork stored in int8 = ~200KB.
That leaves 15.8MB for... more hypernetworks, or auxiliary tables,
or a completely different use of the budget.

Question: can a 200K-param network generate 37M coherent weights?
Probably not directly. But what about: hypernetwork generates a
low-rank basis, and the artifact stores the coefficients?
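The budget arithmetic for that last variant is easy to check: if the hypernetwork regenerates the rank-r basis V (r × in) at load time, the artifact only stores the coefficient matrices U (out × r). A hypothetical accounting helper, not repo code:

```python
def stored_params(shapes, rank):
    """Full vs. stored parameter counts when a hypernetwork emits the
    rank-r basis and the artifact keeps only per-matrix coefficients.
    shapes: list of (out_dim, in_dim) weight-matrix shapes."""
    full = sum(o * i for o, i in shapes)
    stored = sum(o * rank for o, _ in shapes)
    return full, stored
```

For four 1024×1024 matrices at rank 64 this stores 16× fewer values, which is the kind of ratio that would make the 200K-param hypernetwork plus coefficients fit the budget, if the generated basis is good enough.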

=== ANGLE 5: CONTEXT MIXING (from compression theory) ===

The best non-neural compressors (cmix, paq8) use context mixing:
multiple simple predictors combined with learned weights. Each
predictor specializes in a different pattern (bigrams, trigrams,
word boundaries, repeating contexts). The mixer learns how much
to trust each predictor for each byte.

BigramHash is already a primitive version of this. What if we went
further?

  • TrigramHash (hash of 3 previous tokens → logit bias)
  • Skip-gram hash (token[t-2] × token[t] → logit bias)
  • Byte-level predictor alongside the token-level transformer
  • A tiny context mixing layer on top that blends transformer
    logits with n-gram logits

Estimate: TrigramHash(8192) adds ~8M params in int6 = ~6MB.
Combined with BigramHash(16384), that's ~8MB of n-gram tables
plus ~7MB of transformer. Total: ~15MB. The n-gram tables are
essentially a compressed language model in themselves.
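The mechanics above can be sketched in a few lines. Both names below are hypothetical toys, not the repo's BigramHash or the BackoffNgramMixer POC: a rolling hash maps an n-gram context to a table slot, and a mixer blends model and n-gram probability vectors with fixed trust weights (a real context mixer would learn the weights per context):

```python
def ngram_hash(tokens, table_size):
    """Cheap rolling hash of an n-gram context into a table index."""
    h = 0
    for t in tokens:
        h = (h * 1000003 + t) % table_size
    return h

class ToyMixer:
    """Blend two predictors' probability vectors with fixed weights,
    renormalizing so the result is still a distribution."""
    def __init__(self, w_model=0.7, w_ngram=0.3):
        self.w = (w_model, w_ngram)

    def mix(self, p_model, p_ngram):
        mixed = [self.w[0] * a + self.w[1] * b
                 for a, b in zip(p_model, p_ngram)]
        s = sum(mixed)
        return [m / s for m in mixed]
```

When the n-gram table has seen the exact context, its sharp distribution pulls the mix toward near-free bits; when it hasn't, the transformer dominates.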

=== ANGLE 6: DEPTH RECURRENCE WITH PROGRESSIVE ADAPTATION ===

PR openai#831 showed novel architectures fail because
throughput-quantization co-optimization is the binding constraint.
But what if recurrence was done RIGHT?

  • 4 unique blocks × 3 loops = 12 effective layers from 4 layers
    of parameters
  • FiLM conditioning (scale/shift per loop) so each pass behaves
    differently — costs ~3K params to...
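The FiLM-per-loop idea can be sketched abstractly. Everything here is hypothetical (the prompt is truncated above): `block` stands in for one shared transformer block as a vector-to-vector callable, and each loop applies its own learned scale/shift, costing 2·d parameters per loop (~3K total for d≈512 and 3 loops, matching the estimate above):

```python
def film_recurrence(x, block, film_params, loops=3):
    """Shallow recurrence: reuse one block for several passes, with a
    per-loop FiLM scale/shift so each pass behaves differently.
    film_params[i] = (scale_vec, shift_vec) for loop i."""
    for scale, shift in film_params[:loops]:
        x = block(x)                                   # shared weights
        x = [s * v + b for s, b, v in zip(scale, shift, x)]  # loop-specific FiLM
    return x
```

With an identity block and FiLM (scale 2, shift 1), three loops map 1.0 → 3 → 7 → 15, showing how cheap per-loop conditioning breaks the symmetry between passes.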


Copilot AI and others added 2 commits March 30, 2026 03:06
Analyzes 8 angles for pushing Parameter Golf below 1.10 bpb:
- Byte-weighted CE loss, non-uniform quantization, entropy-coded weights,
  hypernetwork generation, context mixing, depth recurrence, multi-model
  ensemble, and information-theoretic bounds.

Includes ranked idea list, 3 proof-of-concept code stubs (EngramLite,
Complementary Training + BackoffNgramMixer, Skip-gram Hash + Shallow
Recurrence), implementation difficulty estimates, and a Moonshot section.

Key finding: The integrated Complementary Training + EngramLite +
BackoffNgramMixer stack is the highest-EV unexplored combination.

Agent-Logs-Url: https://github.com/kailean/parameter-golf/sessions/caad42c8-8417-45c0-bdf7-90803cf87976

Co-authored-by: kailean <49617037+kailean@users.noreply.github.com>
- Add explicit gate bias initialization (-2.0) in EngramLiteEmbedding
- Fix parameter budget: hash_size=2048, embed_dim=128 with projection to 1024
- Clarify shallow recurrence pass_scales as pseudocode with implementation notes
- Ensure POC 2 smoke test config matches actual class parameters

Agent-Logs-Url: https://github.com/kailean/parameter-golf/sessions/caad42c8-8417-45c0-bdf7-90803cf87976

Co-authored-by: kailean <49617037+kailean@users.noreply.github.com>
Copilot AI changed the title [WIP] Explore novel methods to achieve below 1.10 bpb in Parameter Golf Novel approaches analysis for sub-1.10 BPB Parameter Golf Mar 30, 2026
Copilot AI requested a review from kailean March 30, 2026 03:10