
Optimize inference: fused kernels, GEMV loop, fix uniform leak#15

Merged
m96-chan merged 3 commits into main from perf/inference-tuning on Feb 22, 2026

Conversation

@m96-chan
Owner

Summary

  • Fused RMSNorm+Quantize shader (rmsnorm_quantize.wgsl): Merges two sequential dispatches into one 3-pass kernel, eliminating the intermediate normed buffer. Saves 56 dispatches per token (705→649) across 28 transformer layers.
  • Optimized ternary_gemv inner loop: Reduced the 16-iteration scalar loop to a 4-iteration byte-by-byte extraction with a batched 4-way multiply-accumulate, removing the per-iteration division/modulo and bounds checking.
  • Fixed uniform buffer leak: Added prefill uniform buffers to all nn/ classes (BitLinear, Attention, FFN, TransformerBlock, BitNetModel). N>1 paths now reuse pre-created buffers via writeBuffer() instead of leaking ~899 createBuffer() calls per prefill forward pass.
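For reference, the byte-wise extraction can be sketched on the CPU. This is a minimal sketch, not the shader itself: the packing layout (16 two-bit ternary values per u32, encoded 0 → 0, 1 → +1, 2 → −1) is an assumption for illustration.

```typescript
// Hypothetical packing: 16 ternary weights per 32-bit word, 2 bits each,
// with weight i stored at bit offset 2*i. Encoding assumed: 0 -> 0, 1 -> +1, 2 -> -1.
function decode2(bits: number): number {
  return bits === 1 ? 1 : bits === 2 ? -1 : 0;
}

// Naive form: one weight extracted per iteration (16 iterations per word).
function dotNaive(word: number, x: Float32Array, base: number): number {
  let acc = 0;
  for (let i = 0; i < 16; i++) {
    const bits = (word >>> (i * 2)) & 0x3;
    acc += decode2(bits) * x[base + i];
  }
  return acc;
}

// Optimized form: one byte (4 weights) per iteration, 4 iterations total,
// with a 4-way multiply-accumulate and no per-element division/modulo.
function dotByteWise(word: number, x: Float32Array, base: number): number {
  let acc = 0;
  for (let b = 0; b < 4; b++) {
    const byte = (word >>> (b * 8)) & 0xff;
    const j = base + b * 4;
    acc +=
      decode2(byte & 0x3) * x[j] +
      decode2((byte >>> 2) & 0x3) * x[j + 1] +
      decode2((byte >>> 4) & 0x3) * x[j + 2] +
      decode2((byte >>> 6) & 0x3) * x[j + 3];
  }
  return acc;
}
```

Both forms compute the same dot product; the byte-wise version simply amortizes the unpacking work, which is the shape of the win on the GPU side.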

Closes #4

Test plan

  • npm run build passes
  • npm run lint passes
  • Load BitNet 2B-4T in web-chat, send a message, verify coherent output
  • Check Chrome DevTools GPU memory — uniform buffer count stays flat after prefill
  • Load Falcon-E 1B — verify multi-model support still works
  • Qualitative tok/s improvement check on decode path

🤖 Generated with Claude Code

m96-chan and others added 2 commits February 22, 2026 11:07
…et-25")

The multi-model support PR introduced a strict equality check `arch === "bitnet-25"`
for activation function detection. The actual GGUF architecture string for BitNet
2B-4T is "bitnet-b1.58", causing the model to use silu instead of relu² and
producing garbage output. Use `arch.startsWith("bitnet")` to match all variants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
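The corrected detection logic amounts to a prefix match instead of strict equality. A sketch (the function name and return values here are illustrative, not the repo's actual API):

```typescript
// BitNet variants ship with different GGUF architecture strings
// ("bitnet-25", "bitnet-b1.58", ...); a prefix match covers them all,
// while other architectures fall back to silu.
function activationFor(arch: string): "relu2" | "silu" {
  return arch.startsWith("bitnet") ? "relu2" : "silu";
}
```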
1. Fused RMSNorm+Quantize shader: eliminates intermediate normed buffer
   and saves 56 dispatches per token (2 per block × 28 layers)
2. Optimized ternary_gemv inner loop: 4 iterations instead of 16,
   byte-by-byte extraction with batched multiply-accumulate
3. Pre-created prefill uniform buffers: eliminates ~899 createBuffer
   calls per prefill pass by reusing buffers via writeBuffer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@m96-chan force-pushed the perf/inference-tuning branch from 0761488 to b0f0664 on February 22, 2026 02:13
The fused shader produces incorrect output on Chrome's ANGLE OpenGL ES
backend (used on Linux+Wayland). Likely a driver/ANGLE bug with reusing
workgroup shared memory for multiple reductions in a single dispatch.
Keep the other two optimizations (GEMV loop + prefill uniforms).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
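For clarity on what the (now reverted) fused kernel computed, here is a CPU reference of RMSNorm followed by symmetric int8 quantization with a per-row scale. The epsilon value and the absmax-based scale are common conventions, assumed here rather than taken from the repo's shader:

```typescript
// RMSNorm + symmetric int8 quantization, as a single CPU function.
// eps and the absmax/127 scale are assumed conventions.
function rmsnormQuantize(
  x: Float32Array,
  weight: Float32Array,
  eps = 1e-5,
): { q: Int8Array; scale: number } {
  // Reduction 1: root-mean-square of the row.
  let sumSq = 0;
  for (const v of x) sumSq += v * v;
  const inv = 1 / Math.sqrt(sumSq / x.length + eps);

  // Normalized, weighted activations (the intermediate the fused
  // shader avoided materializing in a separate buffer).
  const normed = x.map((v, i) => v * inv * weight[i]);

  // Reduction 2: per-row absmax scale, then round to int8.
  let absMax = 0;
  for (const v of normed) absMax = Math.max(absMax, Math.abs(v));
  const scale = absMax / 127 || 1;
  const q = new Int8Array(normed.length);
  for (let i = 0; i < normed.length; i++) q[i] = Math.round(normed[i] / scale);
  return { q, scale };
}
```

The two reductions above are exactly the pattern that reused workgroup shared memory within one dispatch in the fused shader, which is where the ANGLE/GLES backend misbehaved.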
@m96-chan m96-chan merged commit 7486eb6 into main Feb 22, 2026
4 checks passed


Development

Successfully merging this pull request may close these issues.

Inference performance tuning

1 participant