Open
Description
Driver version: 575.57.08
CUDA version: 12.9
Container runtime: gVisor or runc (both fail)
GPU: H200
We are attempting to run cuda-checkpoint on an SGLang server.
Build the image
FROM lmsysorg/sglang:v0.5.0rc2-cu126
RUN apt-get update && apt-get install -y wget && \
    wget https://github.com/NVIDIA/cuda-checkpoint/raw/refs/heads/main/bin/x86_64_Linux/cuda-checkpoint -O /usr/bin/cuda-checkpoint && \
    chmod +x /usr/bin/cuda-checkpoint
RUN pip install protobuf
docker build . --tag sglang-server
Run the container, and run cuda-checkpoint inside
#!/bin/bash
set -ex

docker build . --tag sglang-server

# Print a timestamped message to stderr.
log() {
    echo "[$(date +"%H:%M:%S")]" "$@" >&2
}

# Start the SGLang server in the background and capture the container ID.
container_id="$(sudo docker run --detach --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    --name=sglang-server \
    sglang-server \
    python3 -m sglang.launch_server \
    --model-path "mistralai/Mistral-7B-Instruct-v0.3" \
    --host 0.0.0.0 --port 30000)"

sleep 3; log "checkpointing internally"

# Query the CUDA checkpoint state of every root-owned process in the container.
get_state() {
    sudo docker exec "$container_id" \
        sh -c "ps aux | grep -v grep | grep '^root' | grep -v cuda-checkpoint | awk '{ print \$2 }' | xargs -I{} cuda-checkpoint --get-state --pid {}" || true
}

# Toggle the CUDA checkpoint state of every root-owned process in the container.
toggle_all() {
    sudo docker exec "$container_id" \
        sh -c "ps aux | grep -v grep | grep '^root' | grep -v cuda-checkpoint | awk '{ print \$2 }' | xargs -I{} cuda-checkpoint --toggle --pid {}" || true
}

get_state
toggle_all
get_state
cuda-checkpoint fails with "OS call failed or operation not supported on this OS" when running --get-state and --toggle on a checkpointed PyTorch PID.
This ultimately causes CRIU to fail, since /dev/nvidiaX FDs are left open.
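To see which processes still hold NVIDIA device FDs after the toggle, a check along these lines may help. It is only a sketch: the function name pids_with_open_fds and its two arguments are hypothetical, and it assumes it is run inside the container with permission to read /proc/PID/fd.

```shell
#!/bin/bash
# pids_with_open_fds [PREFIX] [PROC_ROOT]
# Print the PID of every process that has at least one open FD whose target
# path starts with PREFIX (default: /dev/nvidia). PROC_ROOT defaults to /proc
# and exists only so the function can be pointed at a test directory.
pids_with_open_fds() {
    local prefix="${1:-/dev/nvidia}" proc_root="${2:-/proc}"
    local fd_dir pid fd target
    for fd_dir in "$proc_root"/[0-9]*/fd; do
        [ -d "$fd_dir" ] || continue
        # Extract the PID component from .../<pid>/fd.
        pid="${fd_dir%/fd}"
        pid="${pid##*/}"
        for fd in "$fd_dir"/*; do
            # Skip entries that cannot be read (for example, no permission).
            target="$(readlink "$fd" 2>/dev/null)" || continue
            case "$target" in
                "$prefix"*) echo "$pid"; break ;;
            esac
        done
    done
}

# Example usage (inside the container):
# pids_with_open_fds /dev/nvidia
```

Any PID printed after toggle_all is a process whose /dev/nvidiaX FDs were not released, which narrows down which cuda-checkpoint invocation to inspect.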
Experiments
- I added retries to the cuda-checkpoint invocation, but it seems to fail consistently.
- I tried enabling driver persistence, but received the same error.
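For anyone reproducing the retry experiment, one way to wrap the invocation is sketched here. The retry helper, its argument order, and the attempt and delay values are hypothetical and not necessarily what was used in this report; only the --toggle and --pid flags come from the script above.

```shell
#!/bin/bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it exits 0 or MAX_ATTEMPTS attempts have been made.
retry() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example usage (the PID variable is a placeholder):
# retry 5 1 cuda-checkpoint --toggle --pid "$pid"
```

Because the command runs inside an if condition, a failing attempt does not abort a script that uses set -e.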
This seems similar to #34 and #27, although using a later driver version.
Abbreviated strace of cuda-checkpoint --toggle --pid