Custom native vLLM for NVIDIA DGX Spark GB10 (ARM aarch64, Blackwell sm_121)
Stock vLLM does not work on the DGX Spark GB10 out of the box. The GB10 is NVIDIA's first ARM-based desktop AI platform with a Blackwell-architecture GPU (compute capability sm_121) and 128GB of unified CPU/GPU memory. These characteristics break several assumptions baked into vLLM, NCCL, CUTLASS, and Ray. This repository packages the 8 required fixes into a one-command installer so you can go from a fresh DGX Spark to running large language models without wading through days of debugging.
The DGX Spark GB10 differs from datacenter GPUs in three fundamental ways:
- Blackwell sm_121 compute capability -- Most prebuilt CUDA libraries (NCCL, CUTLASS) ship kernels for sm_120/sm_120a but not sm_121. Operations that rely on these kernels crash at runtime.
- Unified CPU/GPU memory (128GB shared) -- Ray's out-of-memory killer sees GPU-resident model weights as system memory usage and kills workers that are functioning correctly.
- ARM aarch64 architecture -- Some x86-assumed build paths and binary distributions do not apply.
Without the fixes in this repository, you will encounter cryptic NCCL transport failures, CUTLASS internal errors, Ray OOM kills after successful model loading, and silent IP mismatches in multi-node setups.
| Component | Specification |
|---|---|
| Platform | NVIDIA DGX Spark GB10 |
| Architecture | ARM aarch64 |
| GPU | NVIDIA Blackwell (sm_121) |
| Memory | 128GB unified CPU/GPU |
| Interconnect (multi-node) | 200GbE QSFP direct cable |
| QSFP interface | enp1s0f0np0 (primary), enp1s0f1np1 (secondary) |

Software prerequisites:

| Component | Version / purpose |
|---|---|
| Ubuntu | 24.04+ (aarch64) |
| CUDA | 13.0 |
| Python | 3.12 |
| python3.12-dev | Required for Triton compilation |
| git, build-essential | Required for NCCL build |
git clone https://github.com/your-org/vllm-gb10.git
cd vllm-gb10
chmod +x install.sh
./install.sh

After installation completes, launch a model:
source ~/vllm-env/bin/activate
source configs/single-node.env
vllm serve /path/to/your/model \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--max-model-len 32768 \
--enforce-eager

For a multi-node setup, run the installer on every node. Then:
On the head node:
source configs/multi-node-head.env
ray start --head --port=6379
vllm serve /path/to/model \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--trust-remote-code \
--gpu-memory-utilization 0.80 \
--max-model-len 16384 \
--enforce-eager

On each worker node:
source configs/multi-node-worker.env
ray start --address=<HEAD_QSFP_IP>:6379

See configs/example-launch.sh for complete launch commands for tested models.
Error without fix:
Message truncated: received 176 bytes instead of 172
...connection refused during multi-node init
Why: Pre-built NCCL packages (pip 2.28.9, deb 2.29.3) do not include GPU kernels for Blackwell sm_121. The GPU-side NCCL collectives silently fail or produce truncated messages because the required compute kernels were never compiled into the distributed binary.
Fix: Build NCCL v2.28.9 from source with -gencode=arch=compute_121,code=sm_121 and place the resulting libnccl.so ahead of all other NCCL libraries in LD_LIBRARY_PATH. This must be done on every node in a multi-node setup.
Script: scripts/build-nccl-sm121.sh
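
One way to confirm the ordering requirement above is to ask which LD_LIBRARY_PATH entry supplies libnccl first. The following is a minimal sketch, not part of the repository: the function name `first_nccl_dir` is hypothetical, and it only inspects LD_LIBRARY_PATH, so it does not account for RPATH entries or for libraries that a Python package loads by absolute path.

```shell
# Print the first search-path entry that contains a libnccl.so* file.
# The dynamic loader walks LD_LIBRARY_PATH left to right, so this is the
# copy that wins among the directories listed there.
# Usage: first_nccl_dir [colon-separated-path]   (defaults to $LD_LIBRARY_PATH)
first_nccl_dir() {
    search_path="${1:-${LD_LIBRARY_PATH:-}}"
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        IFS=$old_ifs
        # Skip empty entries (for example from a leading or doubled colon)
        [ -n "$dir" ] || continue
        for lib in "$dir"/libnccl.so*; do
            # An unmatched glob stays literal, so test that the file exists
            if [ -e "$lib" ]; then
                printf '%s\n' "$dir"
                return 0
            fi
        done
    done
    IFS=$old_ifs
    return 1
}
```

Calling `first_nccl_dir` with no argument inspects the current environment. The directory it prints should be the one holding the source-built libnccl.so; if it prints a system or pip location instead, the pre-built library is shadowing the custom build.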
Error without fix: Extreme packet fragmentation on 200GbE links. NCCL tensor transfers degrade to a fraction of available bandwidth. Multi-node inference becomes unusably slow or times out.
Why: The default MTU of 1500 bytes on QSFP interfaces wastes bandwidth on 200GbE direct-connect links. Large tensor transfers benefit enormously from jumbo frames.
Fix: Configure MTU 9000 on QSFP interfaces via netplan on all nodes. Verify with ping -M do -s 8972 <PEER_QSFP_IP>.
Script: scripts/setup-qsfp.sh
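
The 8972 in the ping command is not arbitrary: it is the largest ICMP payload that fits in a single 9000-byte IPv4 packet. The short sketch below shows the arithmetic, assuming a 20-byte IPv4 header without options; the helper name `max_ping_payload` is hypothetical.

```shell
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus the 20-byte IPv4 header (no options) and the 8-byte ICMP header.
# Usage: max_ping_payload <mtu>
max_ping_payload() {
    echo $(( $1 - 20 - 8 ))
}
```

`max_ping_payload 9000` prints 8972. Because `ping -M do` forbids fragmentation, a payload of that size only gets through if every hop on the path really carries a 9000-byte MTU.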
Error without fix:
RuntimeError: Expected 2 unique IPs but got 3
Why: When a node has multiple network interfaces (LAN + QSFP), Ray registers the node with the QSFP IP while vLLM's get_ip() uses socket-based detection and returns the LAN IP. vLLM sees 3 unique IPs instead of the expected 2 and refuses to create the distributed placement group.
Fix: Set VLLM_HOST_IP to the QSFP IP on every node. vLLM's network_utils.py:get_ip() checks this environment variable first, ensuring consistent IP reporting across Ray and vLLM.
Config: Set in configs/multi-node-head.env and configs/multi-node-worker.env.
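
A quick sanity check before starting Ray is to confirm that VLLM_HOST_IP really is an address on the QSFP interface. The sketch below is one possible way to do that and is not part of the repository; the function names are hypothetical, and it assumes the one-line output format of `ip -4 -o addr show`.

```shell
# Read `ip -4 -o addr show` output on stdin and print each IPv4 address
# without its prefix length (192.168.100.10/24 becomes 192.168.100.10).
iface_ipv4_addrs() {
    awk '$3 == "inet" { split($4, parts, "/"); print parts[1] }'
}

# Succeed only if the given IP exactly matches one of the addresses on stdin.
# Usage: ip -4 -o addr show dev <iface> | ip_is_on_iface <ip>
ip_is_on_iface() {
    iface_ipv4_addrs | grep -qxF "$1"
}
```

On a node, `ip -4 -o addr show dev enp1s0f0np0 | ip_is_on_iface "$VLLM_HOST_IP"` exits with status 0 when the variable matches the QSFP address and non-zero when it is unset or points at another interface.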
Error without fix: NCCL picks the LAN interface for collective operations. Multi-node transfers run over the slow management network instead of the 200GbE QSFP link. Timeouts and degraded performance.
Why: NCCL has no way to know which interface is the high-speed interconnect. Without explicit configuration, it selects whichever interface the OS routing table prefers, which is typically the LAN.
Fix: Set these environment variables on all processes:
NCCL_SOCKET_IFNAME=enp1s0f0np0 # Force QSFP primary interface
UCX_NET_DEVICES=enp1s0f0np0 # UCX transport also uses QSFP
NCCL_NET_GDR_LEVEL=SYS # GPU Direct RDMA level
NCCL_P2P_DISABLE=1 # No P2P (separate physical nodes)
NCCL_IB_DISABLE=1 # No InfiniBand (using Socket transport)
Config: Set in all env files under configs/.
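
Because these settings only take effect if they reach every process, it can help to check that each variable is exported, not merely set in the current shell. The following is a small sketch under that assumption; the function name `missing_nccl_vars` is hypothetical and the variable list simply mirrors the one above.

```shell
# Print the name of every required variable that is not exported.
# printenv only sees exported variables, which is what child processes
# such as Ray workers and vLLM inherit.
missing_nccl_vars() {
    for name in NCCL_SOCKET_IFNAME UCX_NET_DEVICES NCCL_NET_GDR_LEVEL \
                NCCL_P2P_DISABLE NCCL_IB_DISABLE; do
        printenv "$name" > /dev/null || echo "$name"
    done
}
```

Running `missing_nccl_vars` after sourcing an env file should print nothing. Any name it prints is a variable that processes started from this shell will not see.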
Error without fix:
RuntimeError: Error Internal
when calling cutlass_scaled_mm during FP8 model inference.
Why: vLLM's prebuilt _C.abi3.so contains CUTLASS FP8 kernels compiled for sm_120 and sm_120a, but not sm_121 (GB10 Blackwell). When vLLM detects the GPU supports sm_12x, it tries to use these kernels and crashes.
Fix: Patch w8a8_utils.py to force cutlass_fp8_supported() and cutlass_block_fp8_supported() to return False. This makes vLLM fall back to Triton-based FP8 kernels, which dynamically compile for the current GPU and work correctly on sm_121. Both the functions AND the module-level constants must be patched, because multiple callers invoke the functions directly.
Patch: patches/cutlass-fp8-sm121.patch
Error without fix:
ModuleNotFoundError: No module named 'triton'
or model inspection errors from transformers version mismatches between nodes.
Why: In a multi-node Ray cluster, the worker node executes model loading and inference code in its local Python environment. If packages are missing or versions differ from the head node, workers crash with import errors or produce incorrect model architectures.
Fix: The installer ensures identical package versions on all nodes:
- vllm==0.18.1
- triton==3.6.0
- transformers==4.57.6
- python3.12-dev system package
Run install.sh on every node to guarantee alignment.
Error without fix:
ray.exceptions.OutOfMemoryError: Memory on the node (IP: 192.168.100.10)
was 115.65GB / 121.69GB (0.950369), which exceeds the memory usage threshold of 0.95.
This happens AFTER successfully loading all model shards, during the profiling step.
Why: The DGX Spark has unified CPU/GPU memory. When model weights are loaded onto the GPU, they appear as regular system memory usage to Ray's memory monitor. Ray's default threshold of 0.95 (95%) triggers the OOM killer even though the system is functioning correctly -- the memory is being used exactly as intended.
Fix: Set RAY_memory_usage_threshold=1.0 so that Ray's memory monitor only triggers at 100% usage, which effectively disables its OOM killer. Combine with conservative vLLM settings (gpu_memory_utilization=0.80, reduced max_model_len) to leave headroom for KV cache and runtime allocations.
Script: patches/ray-memory-threshold.sh
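
The numbers in the error message show how narrow the margin is. The sketch below reproduces only the ratio comparison, using the rounded figures from the log; it is an illustration, not Ray's actual implementation, and the function name `exceeds_threshold` is hypothetical.

```shell
# Exit 0 when used/total is strictly above the threshold, non-zero otherwise.
# Usage: exceeds_threshold <used_gb> <total_gb> <threshold>
exceeds_threshold() {
    awk -v used="$1" -v total="$2" -v limit="$3" \
        'BEGIN { exit !(used / total > limit) }'
}
```

`exceeds_threshold 115.65 121.69 0.95` succeeds, because 115.65 / 121.69 is just over 0.95, while the same call with a threshold of 1.0 fails. Usage can never exceed 100% of the node's memory, which is why raising the threshold to 1.0 keeps the monitor from firing.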
Error without fix:
Thinking/reasoning tokens from models like Qwen3 are invisible in streaming output, or raw <think> tags leak into the content display.
Why: vLLM with --reasoning-parser qwen3 sends thinking tokens via delta.reasoning / delta.reasoning_content in SSE chunks, not as <think> tags inside delta.content. Applications that only parse <think> tags miss the reasoning output entirely.
Fix: Application-level fix. When consuming vLLM's streaming output with reasoning models, check for delta.reasoning and delta.reasoning_content fields first. Fall back to <think> tag parsing for models that do not use a reasoning parser. See the example in configs/example-launch.sh for the correct --reasoning-parser flag.
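
As a rough illustration of the field precedence described above, the sketch below splits a streamed response into reasoning and content lines. It is an assumption-laden example, not code from this repository: it requires `jq` to be installed, the function name `split_reasoning` is hypothetical, and it assumes OpenAI-style streaming chunks in which each SSE `data:` line carries one JSON object with `choices[0].delta`.

```shell
# Read an SSE stream on stdin and print one line per chunk, tagged with
# "reasoning" or "content" and separated from the text by a tab.
# delta.reasoning is checked first, then delta.reasoning_content,
# and only then delta.content.
split_reasoning() {
    sed -n 's/^data: //p' \
        | grep -v '^\[DONE\]$' \
        | jq -r '
            .choices[0].delta
            | if (.reasoning // .reasoning_content) != null
              then "reasoning\t" + (.reasoning // .reasoning_content)
              elif .content != null
              then "content\t" + .content
              else empty
              end'
}
```

Piping the output of `curl -sN` against the server's `/v1/chat/completions` endpoint (with `"stream": true` in the request body) into `split_reasoning` shows which chunks carry thinking tokens. A client for a model served without a reasoning parser would still need the `<think>` tag fallback, which this sketch does not implement.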
This section walks through setting up two DGX Spark GB10 nodes for tensor-parallel inference.
- Two DGX Spark GB10 nodes connected via 200GbE QSFP direct cable
- Both nodes have run install.sh successfully
- The model is available at the same path on both nodes (e.g., via NFS or copied)
On each node, run:
sudo ./scripts/setup-qsfp.sh

Edit the generated netplan config to assign unique QSFP IPs:
- Head node: 192.168.100.10/24
- Worker node: 192.168.100.11/24
Verify connectivity:
ping -M do -s 8972 <PEER_QSFP_IP>

On the head node, start Ray:
source ~/vllm-env/bin/activate
source configs/multi-node-head.env
# Set your QSFP IP
export VLLM_HOST_IP=192.168.100.10
export RAY_memory_usage_threshold=1.0
ray start --head --port=6379 --num-gpus=1

On the worker node, join the Ray cluster:
source ~/vllm-env/bin/activate
source configs/multi-node-worker.env
# Set YOUR QSFP IP (not the head's)
export VLLM_HOST_IP=192.168.100.11
export RAY_memory_usage_threshold=1.0
ray start --address=192.168.100.10:6379 --num-gpus=1

Back on the head node, launch vLLM:
export HF_HUB_OFFLINE=1
vllm serve /home/atc/hf_models/Qwen3.5-122B-A10B-FP8 \
--served-model-name manthan-general \
--port 8000 \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--trust-remote-code \
--gpu-memory-utilization 0.80 \
--max-model-len 16384 \
--enforce-eager \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--max-num-batched-tokens 4096

Model loading takes approximately 13 minutes (39 safetensors shards). The API server is ready when the health endpoint returns 200:
curl http://localhost:8000/health

For models that fit in a single GB10's 128GB unified memory:
source ~/vllm-env/bin/activate
source configs/single-node.env
export HF_HUB_OFFLINE=1
vllm serve /path/to/model \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--max-model-len 32768 \
--enforce-eager \
--enable-prefix-caching

Tested models:

| Model | Nodes | TP | gpu_memory_utilization | max_model_len | Notes |
|---|---|---|---|---|---|
| Qwen3.5-122B-A10B-FP8 | 2 | 2 | 0.80 | 16384 | Primary production model, uses reasoning + tool parsers |
| Qwen3-Omni-30B | 1 | 1 | 0.85 | 32768 | Launch with vllm serve --omni |
| Qwen3.5-35B-A3B | 1 | 1 | 0.85 | 32768 | Lightweight MoE, fits easily on single node |
If NCCL reports "Message truncated" or multi-node init fails with "connection refused": you are using a pre-built NCCL that lacks sm_121 kernels. Rebuild from source:
./scripts/build-nccl-sm121.sh
Then verify that the custom library comes first in LD_LIBRARY_PATH.

If FP8 inference fails with "RuntimeError: Error Internal" in cutlass_scaled_mm: the CUTLASS FP8 patch has not been applied. Run:
./patches/cutlass-fp8-sm121.patch

If Ray raises OutOfMemoryError after the model shards have loaded: set RAY_memory_usage_threshold=1.0 in your environment. The unified memory architecture causes Ray to count GPU-resident memory as system memory.

If vLLM reports "Expected 2 unique IPs but got 3": VLLM_HOST_IP is not set or is set to the wrong IP. Each node must export VLLM_HOST_IP set to its own QSFP IP.

If Triton compilation fails: install python3.12-dev:
sudo apt install -y python3.12-dev

If multi-node inference is slow or times out: NCCL is likely using the LAN interface instead of QSFP. Verify that NCCL_SOCKET_IFNAME=enp1s0f0np0 is set and that the QSFP link is up with MTU 9000.

If nvidia-smi does not show GPU memory usage: this is normal on GB10. The unified memory architecture does not report GPU memory separately via nvidia-smi. Use gpu_memory_utilization to control how much memory vLLM allocates.

If vLLM runs out of memory during the profiling step: reduce gpu_memory_utilization (try 0.75) and max_model_len (try 8192). The profiling step allocates KV cache and may exceed available memory on unified-memory systems.
vllm-gb10/
├── install.sh # Main installer (run on every node)
├── patches/
│ ├── cutlass-fp8-sm121.patch # CUTLASS FP8 sm_121 patch script
│ └── ray-memory-threshold.sh # Ray unified memory configuration
├── scripts/
│ ├── build-nccl-sm121.sh # NCCL source build for sm_121
│ ├── setup-qsfp.sh # QSFP 200GbE network configuration
│ └── verify-install.sh # Installation verification
├── configs/
│ ├── single-node.env # Environment for single-node inference
│ ├── multi-node-head.env # Environment for Ray head node
│ ├── multi-node-worker.env # Environment for Ray worker node
│ └── example-launch.sh # Example launch commands
├── LICENSE # Apache 2.0
└── README.md # This file
Contributions are welcome. If you have a DGX Spark GB10 and have found additional fixes or improvements:
- Fork the repository
- Create a feature branch (git checkout -b fix/description)
- Test on actual GB10 hardware -- this project cannot be meaningfully tested on other platforms
- Submit a pull request with a clear description of the problem and fix
Please include the error message you encountered and the exact vLLM/CUDA/NCCL versions in your PR description.
Apache License 2.0. See LICENSE.
Developed on NVIDIA DGX Spark GB10 hardware. All fixes were discovered through hands-on debugging of real multi-node deployments running Qwen3.5-122B-A10B-FP8 with tensor parallelism.