3-bit Lloyd-Max KV Cache Compression for LLM Inference on NVIDIA DGX Spark GB10 — 5.12x compression, 0.983 cosine similarity, pure numpy on ARM unified memory
compression numpy transformers quantization lloyd-max kv-cache unified-memory vllm llm-inference vibe-coding claude-code gb10 nvidia-dgx-spark arm-aarch64
Updated Apr 3, 2026 - Python