
Research: Why Novel Architectures Fail at 16MB — Throughput-Quantization Co-optimization#831

Open
sseanliu wants to merge 2 commits into openai:main from sseanliu:submission/why-novel-architectures-fail

Conversation

@sseanliu

Non-record research submission

We systematically evaluated 6 architectural innovations from March 2026 papers on the PR #549 SOTA stack. All failed. The unified finding: at 16MB/600s, throughput-quantization co-optimization is the binding constraint, not model quality.

Experiments

| Technique | Paper | Step Time | BPB | Why It Failed |
|---|---|---|---|---|
| MUD Optimizer | 2603.17970 | 88ms (+5%) | 1.1581 | `solve_triangular` can't use tensor cores |
| Info-Max (XSA-all + LeakyReLU(0.9)) | — | 89ms (+7%) | 1.1261 | XSA-all overhead eats its own gain |
| Hourglass FFN | 2602.06471 | 92ms (+11%) | 1.4519 | Split weights catastrophic for int6 (+0.33 quant gap) |
| nGPT Hypersphere | 2410.01131 | 122ms (+47%) | 1.6915 | Unit-norm weights incompatible with int6 (+0.35 quant gap) |
| TrigramHash Competition | — | 98ms (+18%) | 1.1298 | Hash overhead costs more steps than trigram saves |
| SSM Hybrid (GatedDeltaNet) | 2412.06464 | 282ms (+240%) | 1.2516 | Breaks torch.compile, memory-bound on H100 |

Key Insight

The SOTA stack is a co-optimized system: Parallel Muon (batched banks) + torch.compile (fused kernels) + int6 per-row quantization + H100 tensor cores. Breaking any one pillar cascades into the others. To beat SOTA, you must co-optimize all four simultaneously (as the ternary PR #640 did).

The Throughput Tax

At 83ms/step, each 1ms of overhead costs ~7 steps over the run, and each step improves BPB by ~0.001. Therefore, any technique must improve BPB by at least 0.007 per millisecond of overhead it adds. No tested technique clears this bar.
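The break-even check above can be sketched as a one-liner. The ~7 steps/ms and ~0.001 BPB/step figures are the rough empirical estimates stated in this PR; the function name `breakeven_bpb_gain` is a hypothetical helper, not part of the actual stack.

```python
def breakeven_bpb_gain(overhead_ms, steps_lost_per_ms=7.0, bpb_per_step=0.001):
    """BPB improvement a technique must deliver to pay for its per-step overhead.

    Defaults are the rough figures reported above: ~7 steps lost per ms of
    overhead, ~0.001 BPB gained per step.
    """
    return overhead_ms * steps_lost_per_ms * bpb_per_step

# Example: Hourglass FFN adds ~9ms/step over the 83ms baseline,
# so it would need to improve BPB by ~0.063 just to break even.
required = breakeven_bpb_gain(9)
```

Applying this to the table: even the cheapest failed technique (MUD at +5ms) would need a ~0.035 BPB improvement to break even, an order of magnitude more than any of them delivered.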

Novel Findings

  1. MLP shape affects quantizability — Hourglass sub-block weights produce distributions int6 can't handle
  2. Hypersphere normalization is incompatible with per-row quantization — normalized weights need angular-aware quantization
  3. GatedDeltaNet matches per-step quality but is 3.4x slower without torch.compile support
  4. The throughput tax formula generalizes to any constrained-compute training setting
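To make findings 1 and 2 concrete, here is a minimal pure-Python sketch of symmetric per-row int6 quantization, the general scheme named above. The function name and the toy row are illustrative assumptions, not the stack's actual implementation; the "quant gap" figures in the table measure the BPB hit when weights reconstructed this way replace full-precision ones.

```python
def quantize_row_int6(row):
    """Symmetric per-row int6 quantization: one scale per row, codes in [-31, 31].

    Per-element reconstruction error is bounded by scale/2, where the scale
    is set by the row's largest-magnitude value. Weight distributions that
    interact badly with this per-row scale (e.g. hourglass sub-block splits
    or unit-norm rows, per the findings above) show up as a larger quant gap.
    """
    scale = max(abs(x) for x in row) / 31.0 or 1.0  # guard against all-zero rows
    codes = [max(-31, min(31, round(x / scale))) for x in row]
    return [c * scale for c in codes], scale

# Reconstruct a toy row; every element comes back within scale/2 of the original.
deq, scale = quantize_row_int6([0.5, -1.0, 0.25, 0.0])
```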

See the README for full details and analysis of all 6 experiments.

Test plan

…ded eval context

Non-record research submission. Proposes caching K/V pairs across sliding
windows to extend effective context from 2K to 50K+ tokens at eval time.
Backward-looking, zero artifact cost, rule-compliant. Implementation provided
but untested due to compute constraints. Base: PR openai#287 reproduction at 1.1284 BPB.
@henrycashe26

The fact that hypersphere normalization and hourglass weight splitting are fundamentally incompatible with low-bit quantization is very interesting.

