
Optimize inference: fused kernels, GEMV loop, fix uniform leak#15

Merged
m96-chan merged 3 commits into main from perf/inference-tuning on Feb 22, 2026

Conversation

@m96-chan
Owner

Summary

  • Fused RMSNorm+Quantize shader (rmsnorm_quantize.wgsl): Merges two sequential dispatches into one 3-pass kernel, eliminating the intermediate normed buffer. Saves 56 dispatches per token (705→649) across 28 transformer layers.
  • Optimized ternary_gemv inner loop: Reduced the 16-iteration scalar loop to a 4-iteration byte-by-byte extraction with a batched 4-way multiply-accumulate, removing the per-iteration division/modulo and bounds checking.
  • Fixed uniform buffer leak: Added prefill uniform buffers to all nn/ classes (BitLinear, Attention, FFN, TransformerBlock, BitNetModel). N>1 paths now reuse pre-created buffers via writeBuffer() instead of leaking ~899 createBuffer() calls per prefill forward pass.
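For reference, the byte-wise extraction can be sketched on the CPU. This is a minimal sketch, not the shader itself: the packing layout (16 two-bit ternary values per u32, encoded 0 → 0, 1 → +1, 2 → −1) is an assumption for illustration.

```typescript
// Hypothetical packing: 16 ternary weights per 32-bit word, 2 bits each,
// with weight i stored at bit offset 2*i. Encoding assumed: 0 -> 0, 1 -> +1, 2 -> -1.
function decode2(bits: number): number {
  return bits === 1 ? 1 : bits === 2 ? -1 : 0;
}

// Naive form: one weight extracted per iteration (16 iterations per word).
function dotNaive(word: number, x: Float32Array, base: number): number {
  let acc = 0;
  for (let i = 0; i < 16; i++) {
    const bits = (word >>> (i * 2)) & 0x3;
    acc += decode2(bits) * x[base + i];
  }
  return acc;
}

// Optimized form: one byte (4 weights) per iteration, 4 iterations total,
// with a 4-way multiply-accumulate and no per-element division/modulo.
function dotByteWise(word: number, x: Float32Array, base: number): number {
  let acc = 0;
  for (let b = 0; b < 4; b++) {
    const byte = (word >>> (b * 8)) & 0xff;
    const j = base + b * 4;
    acc +=
      decode2(byte & 0x3) * x[j] +
      decode2((byte >>> 2) & 0x3) * x[j + 1] +
      decode2((byte >>> 4) & 0x3) * x[j + 2] +
      decode2((byte >>> 6) & 0x3) * x[j + 3];
  }
  return acc;
}
```

Both forms compute the same dot product; the byte-wise version simply amortizes the unpacking work, which is the shape of the win on the GPU side.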

Closes #4

Test plan

  • npm run build passes
  • npm run lint passes
  • Load BitNet 2B-4T in web-chat, send a message, verify coherent output
  • Check Chrome DevTools GPU memory — uniform buffer count stays flat after prefill
  • Load Falcon-E 1B — verify multi-model support still works
  • Qualitative tok/s improvement check on decode path

🤖 Generated with Claude Code

m96-chan and others added 2 commits February 22, 2026 11:07
…et-25")

The multi-model support PR introduced a strict equality check `arch === "bitnet-25"`
for activation function detection. The actual GGUF architecture string for BitNet
2B-4T is "bitnet-b1.58", causing the model to use silu instead of relu² and
producing garbage output. Use `arch.startsWith("bitnet")` to match all variants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
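The corrected detection logic amounts to a prefix match instead of strict equality. A sketch (the function name and return values here are illustrative, not the repo's actual API):

```typescript
// BitNet variants ship with different GGUF architecture strings
// ("bitnet-25", "bitnet-b1.58", ...); a prefix match covers them all,
// while other architectures fall back to silu.
function activationFor(arch: string): "relu2" | "silu" {
  return arch.startsWith("bitnet") ? "relu2" : "silu";
}
```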
1. Fused RMSNorm+Quantize shader: eliminates intermediate normed buffer
   and saves 56 dispatches per token (2 per block × 28 layers)
2. Optimized ternary_gemv inner loop: 4 iterations instead of 16,
   byte-by-byte extraction with batched multiply-accumulate
3. Pre-created prefill uniform buffers: eliminates ~899 createBuffer
   calls per prefill pass by reusing buffers via writeBuffer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@m96-chan force-pushed the perf/inference-tuning branch from 0761488 to b0f0664 on February 22, 2026 02:13
The fused shader produces incorrect output on Chrome's ANGLE OpenGL ES
backend (used on Linux+Wayland). Likely a driver/ANGLE bug with reusing
workgroup shared memory for multiple reductions in a single dispatch.
Keep the other two optimizations (GEMV loop + prefill uniforms).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
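For clarity on what the (now reverted) fused kernel computed, here is a CPU reference of RMSNorm followed by symmetric int8 quantization with a per-row scale. The epsilon value and the absmax-based scale are common conventions, assumed here rather than taken from the repo's shader:

```typescript
// RMSNorm + symmetric int8 quantization, as a single CPU function.
// eps and the absmax/127 scale are assumed conventions.
function rmsnormQuantize(
  x: Float32Array,
  weight: Float32Array,
  eps = 1e-5,
): { q: Int8Array; scale: number } {
  // Reduction 1: root-mean-square of the row.
  let sumSq = 0;
  for (const v of x) sumSq += v * v;
  const inv = 1 / Math.sqrt(sumSq / x.length + eps);

  // Normalized, weighted activations (the intermediate the fused
  // shader avoided materializing in a separate buffer).
  const normed = x.map((v, i) => v * inv * weight[i]);

  // Reduction 2: per-row absmax scale, then round to int8.
  let absMax = 0;
  for (const v of normed) absMax = Math.max(absMax, Math.abs(v));
  const scale = absMax / 127 || 1;
  const q = new Int8Array(normed.length);
  for (let i = 0; i < normed.length; i++) q[i] = Math.round(normed[i] / scale);
  return { q, scale };
}
```

The two reductions above are exactly the pattern that reused workgroup shared memory within one dispatch in the fused shader, which is where the ANGLE/GLES backend misbehaved.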
@m96-chan m96-chan merged commit 7486eb6 into main Feb 22, 2026
4 checks passed


Development

Successfully merging this pull request may close these issues.

Inference performance tuning

1 participant