Novel approaches analysis for sub-1.10 BPB Parameter Golf (#1)
Analyzes 8 angles for pushing Parameter Golf below 1.10 bpb:

- Byte-weighted CE loss
- Non-uniform quantization
- Entropy-coded weights
- Hypernetwork generation
- Context mixing
- Depth recurrence
- Multi-model ensemble
- Information-theoretic bounds

Includes a ranked idea list, 3 proof-of-concept code stubs (EngramLite, Complementary Training + BackoffNgramMixer, Skip-gram Hash + Shallow Recurrence), implementation difficulty estimates, and a Moonshot section.

Key finding: the integrated Complementary Training + EngramLite + BackoffNgramMixer stack is the highest-EV unexplored combination.

Follow-up fixes:

- Add explicit gate bias initialization (-2.0) in EngramLiteEmbedding
- Fix parameter budget: hash_size=2048, embed_dim=128 with projection to 1024
- Clarify shallow recurrence pass_scales as pseudocode with implementation notes
- Ensure POC 2 smoke-test config matches actual class parameters

Agent-Logs-Url: https://github.com/kailean/parameter-golf/sessions/caad42c8-8417-45c0-bdf7-90803cf87976
Co-authored-by: kailean <49617037+kailean@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Explore novel methods to achieve below 1.10 bpb in Parameter Golf" to "Novel approaches analysis for sub-1.10 BPB Parameter Golf" on Mar 30, 2026.
Comprehensive analysis of 8 proposed angles for pushing below 1.10 BPB, grounded in competition ablation data from Issue openai#140 (2,100+ forks, 100+ submissions) and this repo's leaderboard entries.
Key findings

Deliverable: pg_novel_ideas.md (example: Complementary Training loss weighting)

No training runs executed — analysis and code stubs only.
Original prompt
Reference: 145bf29 — analyse all the tests and best bpb scores. MISSION: Brainstorm and prototype genuinely novel approaches to push
Parameter Golf below 1.10 bpb. This is NOT about tweaking hyperparameters.
This is about finding ideas that the 2,100+ forks haven't tried.
Read this first for context on what everyone IS doing:
openai#140
Then think about these angles. For each one, write a short analysis
(will it work? why/why not? estimated bpb impact?) and if it's
promising, write a proof-of-concept implementation stub in Python.
Save everything to ~/pg_novel_ideas.md.
=== ANGLE 1: COMPRESSION IS THE OBJECTIVE ===
The metric is bits-per-byte. Cross-entropy loss optimizes
token-level prediction, but bpb measures byte-level compression.
These are NOT the same thing. Every current submission trains
with cross-entropy loss then measures bpb. What if we directly
optimized for bpb during training?
Ideas to explore:
- Byte-weighted CE loss: weight each token's loss by the number of
bytes it decodes to (those bytes matter more for bpb)
- Is the bpb metric itself differentiable enough to
backprop through?
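As a sketch of the byte-weighted loss idea — function and parameter names here are hypothetical, and `token_byte_lens` is assumed to come from the tokenizer vocabulary — normalizing summed token NLL by decoded bytes makes the training objective equal the eval metric:

```python
import numpy as np

def byte_weighted_ce(log_probs, targets, token_byte_lens):
    """Byte-aligned cross-entropy (illustrative sketch).

    log_probs:       (seq, vocab) log-probabilities from the model
    targets:         (seq,) target token ids
    token_byte_lens: (vocab,) byte length of each token's string

    Standard CE averages nats per *token*; bpb averages bits per *byte*.
    Summing per-token NLL and dividing by the total bytes the targets
    decode to yields bits-per-byte directly, so gradients account for
    how many bytes each token covers.
    """
    nll_nats = -log_probs[np.arange(len(targets)), targets]
    total_bytes = token_byte_lens[targets].sum()
    return nll_nats.sum() / (total_bytes * np.log(2.0))  # bits per byte
```

In a real training loop the same weighting would be applied per-token before the mean, with `token_byte_lens` precomputed once from the vocabulary.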
=== ANGLE 2: NON-UNIFORM QUANTIZATION ===
Everyone uses uniform int6 (64 evenly spaced levels). But neural
network weights follow approximately Gaussian distributions — most
values cluster near zero, few are large. Uniform quantization wastes
precision on the tails.
Ideas to explore:
- Learned codebook (k-means / Lloyd-Max): choose the 64 levels to
match the actual distribution. Store a 64-entry codebook + 6-bit
indices. The codebook costs ~256 bytes per row — negligible.
- NormalFloat levels fitted to a Gaussian
distribution, as in QLoRA. Has anyone tried NF6 for Parameter Golf?
Estimate: if non-uniform quant reduces quantization error by 30%,
that's roughly equivalent to having 30% more effective parameters.
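A minimal sketch of the learned-codebook idea, using 1-D k-means (Lloyd-Max) to place levels where the weight mass actually is; sizes and iteration counts are illustrative:

```python
import numpy as np

def lloydmax_codebook(w, n_levels=64, iters=25):
    """Fit quantization levels to the empirical weight distribution
    with 1-D k-means (Lloyd-Max), instead of spacing them uniformly.
    Returns the codebook and the per-weight level indices."""
    w = w.ravel()
    # quantile init: every level starts with probability mass near it
    levels = np.quantile(w, np.linspace(0.0, 1.0, n_levels))
    for _ in range(iters):
        idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(n_levels):
            mask = idx == k
            if mask.any():
                levels[k] = w[mask].mean()  # move level to cell centroid
    idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
    return levels, idx
```

For Gaussian-ish weights this typically cuts quantization MSE relative to uniform levels at the same bit width; the 64-entry codebook is the only storage overhead.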
=== ANGLE 3: ENTROPY-CODED WEIGHTS ===
Current pipeline: quantize to int6 → compress with zstd. But zstd
treats all bytes equally. What if we designed the weight distribution
to be maximally compressible?
Ideas to explore:
- Add a regularizer to the loss
that penalizes high-entropy weight distributions. Train the model
to have weights that compress well, not just weights that predict well.
- Zero out entire rows or blocks
(structured pruning), then use the freed bits for higher precision
on remaining weights (int8 instead of int6 on the surviving weights).
- Replace zstd with an arithmetic
entropy coder tuned to the exact weight distribution.
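The entropy-regularizer idea can be sketched as a measurement first: the empirical entropy of the binned weight histogram is a proxy for the bits an entropy coder (or zstd, roughly) needs per weight. A trainable version would replace the hard histogram with a soft, differentiable binning; this hard version is for illustration only:

```python
import numpy as np

def weight_entropy_bits(w, n_bins=64):
    """Empirical entropy (bits per weight) of the binned weight
    histogram — a proxy for compressed artifact size.
    Lower entropy => the quantized weights compress better."""
    hist, _ = np.histogram(w, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]                        # drop empty bins (0 log 0 := 0)
    return float(-(p * np.log2(p)).sum())
```

With 64 bins the entropy is at most 6 bits/weight (uniform occupancy); a peaked, compressible distribution scores well below that, which is exactly the gap a regularizer would try to widen.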
=== ANGLE 4: HYPERNETWORK WEIGHT GENERATION ===
Instead of storing 37M quantized parameters, store a tiny neural
network (200K params) that GENERATES the weight matrices. At load
time, run the hypernetwork to produce all weights. The hypernetwork
itself is the artifact.
This is essentially neural network compression via implicit
representation. A 200K-param hypernetwork stored in int8 = ~200KB.
That leaves 15.8MB for... more hypernetworks, or auxiliary tables,
or a completely different use of the budget.
Question: can a 200K-param network generate 37M coherent weights?
Probably not directly. But what about: hypernetwork generates a
low-rank basis, and the artifact stores the coefficients?
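The low-rank variant of that question can be sketched directly. Shapes are illustrative, and the seed-based generator is a stand-in for a real hypernetwork (a small MLP would replace it) — but even the stand-in shows the storage arithmetic:

```python
import numpy as np

d, r = 256, 16                    # illustrative: d x d weights, rank-r basis
rng = np.random.default_rng(42)   # the "hypernetwork" stand-in:
                                  # the basis is regenerated at load time

basis = rng.standard_normal((r, d))            # NOT stored in the artifact
coeffs = rng.standard_normal((d, r)) * 0.02    # stored: d*r values, not d*d

W = coeffs @ basis                # materialize full weights at load time

# Storage: d*r coefficients in int8 ~= 4 KB here, vs d*d direct weights
# in int6 ~= 48 KB — a 12x reduction, at the cost of rank-r expressivity.
```

The open question from the text remains: whether rank-r structure (or a learned generator on top of it) retains enough capacity to be worth the 12x storage win.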
=== ANGLE 5: CONTEXT MIXING (from compression theory) ===
The best non-neural compressors (cmix, paq8) use context mixing:
multiple simple predictors combined with learned weights. Each
predictor specializes in a different pattern (bigrams, trigrams,
word boundaries, repeating contexts). The mixer learns how much
to trust each predictor for each byte.
BigramHash is already a primitive version of this. What if we went
further?
- A learned per-byte gate that mixes the transformer's
logits with n-gram logits
Estimate: TrigramHash(8192) adds ~8M params in int6 = ~6MB.
Combined with BigramHash(16384), that's ~8MB of n-gram tables
plus ~7MB of transformer. Total: ~15MB. The n-gram tables are
essentially a compressed language model in themselves.
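A minimal context-mixing sketch along these lines — class and parameter names are hypothetical, and the gate weights would be produced by a small learned network conditioned on the context rather than passed in as constants:

```python
import numpy as np

class HashedNgramTable:
    """Maps the last n context bytes to a row of 256 byte-logits via a
    cheap hash — one "simple predictor" in a context-mixing ensemble,
    in the spirit of BigramHash/TrigramHash."""
    def __init__(self, n, table_size=8192, seed=0):
        self.n = n
        self.table_size = table_size
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((table_size, 256)) * 0.01

    def logits(self, context: bytes):
        key = int.from_bytes(context[-self.n:], "big") % self.table_size
        return self.table[key]

def mix(logit_rows, gate_logits):
    """Softmax-gated convex combination of predictor logits; a learned
    gate decides how much to trust each predictor for this byte."""
    g = np.exp(gate_logits - np.max(gate_logits))
    g /= g.sum()
    return sum(gi * row for gi, row in zip(g, logit_rows))
```

The transformer's own logits would simply be one more row passed to `mix`, which is what makes the n-gram tables a fallback language model rather than a replacement.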
=== ANGLE 6: DEPTH RECURRENCE WITH PROGRESSIVE ADAPTATION ===
PR openai#831 showed novel architectures fail because
throughput-quantization co-optimization is the binding constraint.
But what if recurrence was done RIGHT?
- Multiple passes through one shared block: more depth from the same set
of parameters
- A cheap per-pass modulation so each pass behaves
differently — costs ~3K params to...
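The per-pass idea can be sketched as a residual block reused across passes, where the only per-pass parameters are cheap scale vectors (the name `pass_scales` follows the pseudocode mentioned in the commit notes; the block body itself is a stand-in):

```python
import numpy as np

def recurrent_forward(x, W, pass_scales):
    """Run one shared block len(pass_scales) times. W (the expensive
    weights) is shared across passes; each pass owns only a d-dim scale
    vector, so k extra passes cost ~k*d parameters, not k full blocks."""
    for scale in pass_scales:
        x = x + scale * np.tanh(x @ W)   # per-pass modulated residual update
    return x
```

At d ≈ 1024, three passes add roughly 3K scale parameters while reusing the same W — consistent with the order of magnitude of the "~3K params" estimate above.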