Research: Why Novel Architectures Fail at 16MB — Throughput-Quantization Co-optimization #831
Open
sseanliu wants to merge 2 commits into openai:main from
Conversation
…ded eval context

Non-record research submission. Proposes caching K/V pairs across sliding windows to extend the effective context from 2K to 50K+ tokens at eval time. Backward-looking, zero artifact cost, rule-compliant. Implementation provided but untested due to compute constraints. Base: PR openai#287 reproduction at 1.1284 BPB.

… throughput-quantization co-optimization
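The cross-window K/V caching idea can be sketched as follows. This is a hedged illustration, not the PR's actual implementation: the class name `SlidingWindowKVCache` and its parameters are hypothetical, and real K/V entries would be attention tensors rather than strings.

```python
from collections import deque


class SlidingWindowKVCache:
    """Hypothetical sketch: retain K/V pairs that fall out of the model's
    fixed attention window, so eval-time context can extend well beyond it."""

    def __init__(self, window=2048, max_cached=50_000):
        self.window = window          # size of the live attention window
        self.max_cached = max_cached  # long-range retention budget (e.g. 50K+ tokens)
        self.cache = deque(maxlen=max_cached)  # (key, value) pairs, oldest first

    def append(self, key, value):
        # Every new token's K/V is recorded; deque's maxlen silently evicts
        # the oldest entries once the retention budget is exceeded.
        self.cache.append((key, value))

    def effective_context(self):
        # Tokens the model can still attend over, window plus cached history.
        return len(self.cache)

    def window_kv(self):
        # The K/V pairs visible to plain sliding-window attention.
        return list(self.cache)[-self.window:]


# Toy usage with a tiny window to show the cache outliving the window.
cache = SlidingWindowKVCache(window=4, max_cached=10)
for t in range(8):
    cache.append(f"k{t}", f"v{t}")
```

With `window=4`, standard sliding-window attention would see only the last 4 pairs, while the cache keeps all 8, which is the gap the proposal aims to exploit at eval time.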
The fact that hypersphere normalization and hourglass weight splitting are fundamentally incompatible with low-bit quantization is very interesting.
Non-record research submission
We systematically evaluated 6 architectural innovations from March 2026 papers on the PR #549 SOTA stack. All failed. The unified finding: at 16MB/600s, throughput-quantization co-optimization is the binding constraint, not model quality.
Experiments
Key Insight
The SOTA stack is a co-optimized system: Parallel Muon (batched banks) + torch.compile (fused kernels) + int6 per-row quantization + H100 tensor cores. Breaking any one pillar cascades into the others. To beat SOTA, you must co-optimize all four simultaneously (as the ternary PR #640 did).
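To make the quantization pillar concrete, here is a minimal sketch of per-row symmetric int6 quantization in plain Python. This is an assumption-laden illustration of the general technique, not the stack's fused H100 kernel: the function names are mine, and I assume a symmetric scheme mapping each row into the signed range [-31, 31].

```python
def quantize_int6_per_row(matrix):
    """Per-row symmetric quantization sketch: each row gets its own scale
    so its largest-magnitude value maps to the edge of the int6 range."""
    quantized, scales = [], []
    for row in matrix:
        # Per-row scale; `or 1.0` guards the all-zero row (scale would be 0).
        scale = max(abs(v) for v in row) / 31 or 1.0
        quantized.append([max(-31, min(31, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize_per_row(quantized, scales):
    """Recover approximate float values from int6 codes and per-row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]
```

The point of per-row scales is that one outlier row no longer forces a coarse grid on every other row; the cost is storing one float scale per row and doing the rescale inside the matmul epilogue, which is why it only pays off when fused into the kernel.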
The Throughput Tax
At 83ms/step, each 1ms of per-step overhead costs ~7 steps over the run, and each step improves BPB by ~0.001. Therefore, to break even, a technique must improve BPB by ~0.007 per millisecond of overhead it adds. No tested technique clears this bar.
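The break-even arithmetic above can be written out directly. The constants are taken from the text; the function name is mine.

```python
STEP_TIME_MS = 83.0       # baseline step time stated in the text
STEPS_LOST_PER_MS = 7     # steps forfeited per 1 ms of added overhead (from the text)
BPB_PER_STEP = 0.001      # marginal BPB improvement per step (from the text)


def break_even_bpb(overhead_ms):
    """BPB improvement a technique must deliver just to pay for its overhead."""
    return overhead_ms * STEPS_LOST_PER_MS * BPB_PER_STEP


# A technique adding 1 ms/step must buy ~0.007 BPB; 3 ms/step needs ~0.021.
```

Viewed this way, even a "free-looking" 2-3 ms of unfused kernel launches demands a BPB gain larger than most architectural changes deliver in total, which is the throughput tax in one line.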
Novel Findings
See the README for full details on all 6 experiments, with analysis.
Test plan