Optimize inference: fused kernels, GEMV loop, fix uniform leak#15
Merged
Optimize inference: fused kernels, GEMV loop, fix uniform leak#15
Conversation
…et-25")
The multi-model support PR introduced a strict equality check `arch === "bitnet-25"`
for activation function detection. The actual GGUF architecture string for BitNet
2B-4T is "bitnet-b1.58", causing the model to use silu instead of relu² and
producing garbage output. Use `arch.startsWith("bitnet")` to match all variants.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Fused RMSNorm+Quantize shader: eliminates intermediate normed buffer and saves 56 dispatches per token (2 per block × 28 layers) 2. Optimized ternary_gemv inner loop: 4 iterations instead of 16, byte-by-byte extraction with batched multiply-accumulate 3. Pre-created prefill uniform buffers: eliminates ~899 createBuffer calls per prefill pass by reusing buffers via writeBuffer Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0761488 to
b0f0664
Compare
The fused shader produces incorrect output on Chrome's ANGLE OpenGL ES backend (used on Linux+Wayland). Likely a driver/ANGLE bug with reusing workgroup shared memory for multiple reductions in a single dispatch. Keep the other two optimizations (GEMV loop + prefill uniforms). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
rmsnorm_quantize.wgsl): Merges two sequential dispatches into one 3-pass kernel, eliminating the intermediatenormedbuffer. Saves 56 dispatches per token (705→649) across 28 transformer layers.ternary_gemvinner loop: Reduced from 16-iteration scalar loop to 4-iteration byte-by-byte extraction with batched 4-way multiply-accumulate. Removed per-iteration division/modulo and bounds checking.nn/classes (BitLinear,Attention,FFN,TransformerBlock,BitNetModel). N>1 paths now reuse pre-created buffers viawriteBuffer()instead of leaking ~899createBuffer()calls per prefill forward pass.Closes #4
Test plan
npm run buildpassesnpm run lintpasses🤖 Generated with Claude Code