
Research: Why Novel Architectures Fail at 16MB — Throughput-Quantization Co-optimization#831

Open
sseanliu wants to merge 2 commits into openai:main from sseanliu:submission/why-novel-architectures-fail

Conversation

@sseanliu

Non-record research submission

We systematically evaluated 6 architectural innovations from March 2026 papers on the PR #549 SOTA stack. All failed. The unified finding: at 16MB/600s, throughput-quantization co-optimization is the binding constraint, not model quality.

Experiments

| Technique | Paper | Step Time | BPB | Why It Failed |
|---|---|---|---|---|
| MUD Optimizer | 2603.17970 | 88ms (+5%) | 1.1581 | `solve_triangular` can't use tensor cores |
| Info-Max (XSA-all + LeakyReLU(0.9)) | — | 89ms (+7%) | 1.1261 | XSA-all overhead eats its own gain |
| Hourglass FFN | 2602.06471 | 92ms (+11%) | 1.4519 | Split weights catastrophic for int6 (+0.33 quant gap) |
| nGPT Hypersphere | 2410.01131 | 122ms (+47%) | 1.6915 | Unit-norm weights incompatible with int6 (+0.35 quant gap) |
| TrigramHash Competition | — | 98ms (+18%) | 1.1298 | Hash overhead costs more steps than trigram saves |
| SSM Hybrid (GatedDeltaNet) | 2412.06464 | 282ms (+240%) | 1.2516 | Breaks torch.compile, memory-bound on H100 |

Key Insight

The SOTA stack is a co-optimized system: Parallel Muon (batched banks) + torch.compile (fused kernels) + int6 per-row quantization + H100 tensor cores. Breaking any one pillar cascades into the others. To beat SOTA, you must co-optimize all four simultaneously (as the ternary PR #640 did).

The Throughput Tax

At 83ms/step, each 1ms of overhead costs ~7 steps over the run, and each step improves BPB by ~0.001. Therefore, any technique must improve BPB by at least 0.007 per millisecond of overhead it adds. No tested technique clears this bar.
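The break-even check above can be sketched as a one-liner. The ~7 steps/ms and ~0.001 BPB/step figures are the rough empirical estimates stated in this PR; the function name `breakeven_bpb_gain` is a hypothetical helper, not part of the actual stack.

```python
def breakeven_bpb_gain(overhead_ms, steps_lost_per_ms=7.0, bpb_per_step=0.001):
    """BPB improvement a technique must deliver to pay for its per-step overhead.

    Defaults are the rough figures reported above: ~7 steps lost per ms of
    overhead, ~0.001 BPB gained per step.
    """
    return overhead_ms * steps_lost_per_ms * bpb_per_step

# Example: Hourglass FFN adds ~9ms/step over the 83ms baseline,
# so it would need to improve BPB by ~0.063 just to break even.
required = breakeven_bpb_gain(9)
```

Applying this to the table: even the cheapest failed technique (MUD at +5ms) would need a ~0.035 BPB improvement to break even, an order of magnitude more than any of them delivered.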

Novel Findings

  1. MLP shape affects quantizability — Hourglass sub-block weights produce distributions int6 can't handle
  2. Hypersphere normalization is incompatible with per-row quantization — normalized weights need angular-aware quantization
  3. GatedDeltaNet matches per-step quality but is 3.4x slower without torch.compile support
  4. The throughput tax formula generalizes to any constrained-compute training setting
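To make findings 1 and 2 concrete, here is a minimal pure-Python sketch of symmetric per-row int6 quantization, the general scheme named above. The function name and the toy row are illustrative assumptions, not the stack's actual implementation; the "quant gap" figures in the table measure the BPB hit when weights reconstructed this way replace full-precision ones.

```python
def quantize_row_int6(row):
    """Symmetric per-row int6 quantization: one scale per row, codes in [-31, 31].

    Per-element reconstruction error is bounded by scale/2, where the scale
    is set by the row's largest-magnitude value. Weight distributions that
    interact badly with this per-row scale (e.g. hourglass sub-block splits
    or unit-norm rows, per the findings above) show up as a larger quant gap.
    """
    scale = max(abs(x) for x in row) / 31.0 or 1.0  # guard against all-zero rows
    codes = [max(-31, min(31, round(x / scale))) for x in row]
    return [c * scale for c in codes], scale

# Reconstruct a toy row; every element comes back within scale/2 of the original.
deq, scale = quantize_row_int6([0.5, -1.0, 0.25, 0.0])
```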

See the README for full details and analysis of all 6 experiments.

Test plan

…ded eval context

Non-record research submission. Proposes caching K/V pairs across sliding
windows to extend effective context from 2K to 50K+ tokens at eval time.
Backward-looking, zero artifact cost, rule-compliant. Implementation provided
but untested due to compute constraints. Base: PR openai#287 reproduction at 1.1284 BPB.
@henrycashe26

The fact that hypersphere normalization and hourglass weight splitting are fundamentally incompatible with low-bit quantization is very interesting.

