Positive: WSL2 2.7.0 enables CUDA graph capture on RTX 5090 (Blackwell sm_120) — nvidia-cdi-refresh fix documented

## WSL2 2.7.0 enables CUDA graph capture on RTX 5090 (Blackwell sm_120) — community finding

Just wanted to document a positive finding for the WSL2 + Blackwell community.

**Short version:** CUDA graph capture (previously crashing with `cudaErrorUnknown` on RTX 5090 + WSL2) now works on WSL2 2.7.0 once the two boot-time races described below are addressed. This unlocks full vLLM performance on Blackwell under WSL2.

**Hardware:** RTX 5090 32GB (sm_120, Blackwell), Windows 11, WSL2 2.7.0, CUDA 12.8

**What was crashing:** Any attempt to run vLLM (or another workload that captures CUDA graphs) failed with `cudaErrorUnknown` during graph capture. The workaround was `--enforce-eager` mode, which disables CUDA graphs and costs roughly 8x in throughput.
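For anyone comparing the two modes, here is a minimal sketch of the launch commands. The `vllm_cmd` helper and its `eager`/`graphs` mode names are my own illustration, not part of vLLM, and the model ID is whatever you pass in; only `vllm serve` and `--enforce-eager` come from vLLM itself.

```bash
# Sketch: print the vLLM launch command with or without CUDA graphs.
# vllm_cmd is a hypothetical helper; the model ID is supplied by the caller.
vllm_cmd() {
  local model="$1" mode="${2:-graphs}"
  if [ "$mode" = "eager" ]; then
    # Old workaround: skip CUDA graph capture entirely.
    echo "vllm serve ${model} --enforce-eager"
  else
    # Default: vLLM captures CUDA graphs at startup.
    echo "vllm serve ${model}"
  fi
}
```

Running `$(vllm_cmd <your-model> eager)` reproduces the old workaround; dropping the second argument gives the default graph-capturing launch.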

**What fixed it (beyond 2.7.0 itself):**

Two system-level issues were causing instability that compounded the dxgkrnl issues:

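1. **nvidia-cdi-refresh** probes CUDA devices about 11 seconds into boot, racing with Blackwell driver initialization. Hard-masking it resolves the instability:
```bash
# Run as root (or prefix each command with sudo).
# Symlinking a unit to /dev/null masks it, the same effect as `systemctl mask`.
ln -sf /dev/null /etc/systemd/system/nvidia-cdi-refresh.path
ln -sf /dev/null /etc/systemd/system/nvidia-cdi-refresh.service
# Reload unit files so systemd picks up the masks.
systemctl daemon-reload
```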

2. **Boot timing:** CUDA services (Ollama, etc.) need an `ExecStartPre=/bin/sleep 45` delay so they do not race with dxgkrnl initialization on Blackwell.
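One way to apply the delay from item 2 without editing the vendor unit file is a systemd drop-in. This is a sketch under assumptions: the `write_delay_dropin` helper and the drop-in file name `10-cuda-delay.conf` are hypothetical, and `ollama.service` is only an example unit name; check what your service is actually called.

```bash
# Sketch: write a systemd drop-in that delays a CUDA service's start.
# write_delay_dropin and 10-cuda-delay.conf are illustrative names.
# Args: unit name, delay in seconds, optional root dir (default /etc/systemd/system).
write_delay_dropin() {
  local unit="$1" delay="$2" root="${3:-/etc/systemd/system}"
  local dir="${root}/${unit}.d"
  mkdir -p "$dir"
  # ExecStartPre lines in a drop-in are added to any the unit already defines.
  printf '[Service]\nExecStartPre=/bin/sleep %s\n' "$delay" > "${dir}/10-cuda-delay.conf"
}
```

As root, `write_delay_dropin ollama.service 45` followed by `systemctl daemon-reload` should give the 45-second delay described above; `systemctl cat ollama.service` shows whether the drop-in was picked up.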

**Result with 2.7.0 + above fixes:** vLLM runs with full CUDA graphs, ~140 tok/s on Qwen3-14B-AWQ. Stable across reboots. No enforce-eager needed.

**Still not working:** FP8 quantization falls back to an emulated path (3x slower than INT4 AWQ). Blackwell FP8 tensor cores appear not yet exposed through dxgkrnl.

Full benchmark writeup: https://github.com/vllm-project/vllm/issues/37242

Thanks to the WSL team for the 2.7.0 improvements — this is a meaningful unlock for the AI/ML community on Windows + Blackwell.
