Fix: torch.AcceleratorError: CUDA error: an illegal memory access was encountered#89
itechbear wants to merge 1 commit into sgl-project:main from
Conversation
Update synchronization method in scheduler.py. Replaced copy_done.synchronize() with torch.cuda.synchronize(self.device) for better synchronization handling.
Hi. Could you show your benchmark code? BTW, could you refer to #58 and try
The information about benchmark scripts is documented in the "Benchmark command" section. I'll try setting
The problem still exists. I added a debugging statement to scheduler.py. The server hung and failed to finish any request. I had to abort the client.
This really looks tricky. I don't have access to Ampere GPUs, but on H200 and B200, under the same config (without MINISGL_OVERLAP_EXTRA_SYNC), there's no illegal memory access.

```shell
# server
python -m minisgl --model Qwen/Qwen3-1.7B --num-pages 45228 --attn "fi" --max-running-req 128 --port 1919
# client
python -m sglang.bench_serving --backend sglang-oai-chat --dataset-name random --random-input 128 --random-output 128 --num-prompts 500 --request-rate 128 --random-range-ratio 1.0 --base-url http://127.0.0.1:1919
```

It seems that the issue only exists on Ampere GPUs... Could you try some older commits? If the issue still exists, could you please try:
I'll try setting
Thanks. Unfortunately I don't have access to a 3060/A100. The fastest way to locate which kernel is encountering the IMA is to use a CUDA coredump. This bug really looks puzzling to me because I have never reproduced it on Hopper/Blackwell GPUs. Anyway, I will take a static pass over the codebase to check again.
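For reference, enabling a driver-side CUDA coredump is just environment configuration; the variables below are documented in the cuda-gdb manual, while the file path and the minisgl invocation are illustrative:

```shell
# Illustrative setup; see the cuda-gdb documentation for the full option list.
export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1         # dump GPU state on the first CUDA exception
export CUDA_COREDUMP_FILE=/tmp/core_%h_%p.nvcudmp  # %h = hostname, %p = pid
python -m minisgl --model Qwen/Qwen3-1.7B --num-pages 45228 --attn "fi" --max-running-req 128 --port 1919
# afterwards, open the dump in cuda-gdb:
#   (cuda-gdb) target cudacore /tmp/core_<hostname>_<pid>.nvcudmp
```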
I created a CUDA coredump file following the article https://blog.vllm.ai/2025/08/11/cuda-debugging.html. I hope it helps you debug, since I don't have much deep knowledge about CUDA.
Thanks a lot. I will look into it tonight 👍🏿
I managed to reproduce this bug on my H200 by adding
I just benchmarked the latest main branch. While the previous problem seems to be gone, a new error arises. Benchmark script:
Thanks. This should be another issue, and I will fix it later. Around 2 weeks ago we made a small update to the radix cache implementation, and that possibly broke something.

Summary
Replaced copy_done.synchronize() with torch.cuda.synchronize(self.device) for better synchronization handling. This fixes the error
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
raised when the server is being benchmarked. I'm not quite sure whether the patch causes performance degradation, since it might effectively disable the CPU/GPU overlapped scheduling feature. But I think it is a good starting point.
The Problem
When I benchmarked the server, it consistently raised errors in the server logs.
Hardware
Server log
Benchmark command
The bench_serving.py was taken from https://github.com/sgl-project/sglang/blob/main/python/sglang/bench_serving.py without any modification.
Rationale of the fix
The following content was mostly generated by AI. I reviewed it before adding it here.
Root Cause: Resource Race in Overlap Scheduling
In the overlap_loop of Scheduler, the next batch ($N$) is scheduled and launched on the GPU before the results of the previous batch ($N-1$) are processed:
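The race can be illustrated without a GPU. The toy model below is a sketch, not minisgl code: a worker thread stands in for the CUDA stream, and a `threading.Event` stands in for the `torch.cuda.Event` recorded after batch $N-1$'s device-to-host copy. The event fires while batch $N$ is still mutating the shared pool, so a CPU that waits only on the event can observe the pool mid-update.

```python
import threading
import time

pool = {"batch": None}         # stand-in for a shared KV/page pool
copy_done = threading.Event()  # fires after batch N-1's result copy
all_done = threading.Event()   # fires once the whole queue drains

def stream_worker():
    # Plays the role of the CUDA stream executing queued work in order.
    pool["batch"] = "N-1"
    time.sleep(0.01)           # batch N-1 runs
    copy_done.set()            # N-1's result copy has finished
    time.sleep(0.05)           # batch N is still running...
    pool["batch"] = "N"        # ...and still touching the shared pool
    all_done.set()

t = threading.Thread(target=stream_worker)
t.start()

copy_done.wait()                  # like copy_done.synchronize()
seen_after_event = pool["batch"]  # batch N may still be in flight here

all_done.wait()                   # like torch.cuda.synchronize(device)
seen_after_full_sync = pool["batch"]
t.join()

print(seen_after_event)      # the event alone does not cover batch N
print(seen_after_full_sync)  # the full sync waited for everything
```

In the real scheduler the mid-flight read is not a stale string but freed or reused device memory, which is what surfaces as the illegal memory access.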
Why the Fix Works
The original code used copy_done.synchronize(), which only blocked the CPU until batch $N-1$ was finished. This allowed _process_last_data to proceed while batch $N$ was still running.
By replacing it with torch.cuda.synchronize(self.device), the scheduler now waits for all ongoing GPU operations on the device to complete. This includes the speculatively launched batch $N$. As a result:
While this fix restores stability by effectively serializing parts of the execution, it ensures that speculative tokens don't conflict with resource management, preventing the CUDA illegal memory access reported in the logs.