
[GPU Dependence] "OS call failed or operation not supported on this OS" for get-state and toggle #37

@mattnappo

Description

Driver version: 575.57.08
CUDA version: 12.9
Container runtime: gVisor or runc (both fail)
GPU: H200

We are attempting to run cuda-checkpoint on an SGLang server.

Build the image

FROM lmsysorg/sglang:v0.5.0rc2-cu126

RUN apt-get update && apt-get install -y wget && \
    wget https://github.com/NVIDIA/cuda-checkpoint/raw/refs/heads/main/bin/x86_64_Linux/cuda-checkpoint -O /usr/bin/cuda-checkpoint && \
    chmod +x /usr/bin/cuda-checkpoint

RUN pip install protobuf

docker build . --tag sglang-server

Run the container, and run cuda-checkpoint inside

#!/bin/bash

set -ex

docker build . --tag sglang-server

log() {
  echo "[$(date +"%H:%M:%S")]" "$@" >&2
}

container_id="$(sudo docker run --detach --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    --name=sglang-server \
    sglang-server \
    python3 -m sglang.launch_server \
        --model-path "mistralai/Mistral-7B-Instruct-v0.3" \
        --host 0.0.0.0 --port 30000)"

sleep 3; log "checkpointing internally"

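# Run `cuda-checkpoint --get-state` against every root-owned process in the container.
# `|| true` keeps `set -e` from aborting the script when cuda-checkpoint fails.
get_state() {
  sudo docker exec "$container_id" \
    sh -c "ps aux | grep -v grep | grep '^root' | grep -v cuda-checkpoint | awk '{ print \$2 }' | xargs -I{} cuda-checkpoint --get-state --pid {}" || true
}

# Run `cuda-checkpoint --toggle` against the same set of processes.
toggle_all() {
  sudo docker exec "$container_id" \
    sh -c "ps aux | grep -v grep | grep '^root' | grep -v cuda-checkpoint | awk '{ print \$2 }' | xargs -I{} cuda-checkpoint --toggle --pid {}" || true
}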

get_state
toggle_all
get_state

cuda-checkpoint fails with "OS call failed or operation not supported on this OS" when running --get-state and --toggle on a checkpointed PyTorch PID.

This ultimately causes CRIU to fail, since /dev/nvidiaX FDs are left open.
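
To see which processes still hold NVIDIA device nodes open before handing off to CRIU, something like the following may help. It is only a sketch: the function name nvidia_fd_pids and its optional PROC_ROOT argument are hypothetical names introduced here, and it assumes a Linux /proc layout where each /proc/<pid>/fd entry is a symlink to the open file.

```shell
# nvidia_fd_pids [PROC_ROOT]   (hypothetical helper, not part of cuda-checkpoint)
# Prints each PID under PROC_ROOT (default: /proc) that has at least one
# open file descriptor whose target starts with /dev/nvidia.
nvidia_fd_pids() {
  proc_root="${1:-/proc}"
  for fd_dir in "$proc_root"/[0-9]*/fd; do
    # Skip unmatched globs and processes that exited in the meantime.
    [ -d "$fd_dir" ] || continue
    for fd in "$fd_dir"/*; do
      # Entries we cannot read (or that are not symlinks) are skipped.
      target="$(readlink "$fd" 2>/dev/null)" || continue
      case "$target" in
        /dev/nvidia*)
          pid_dir="${fd_dir%/fd}"
          echo "${pid_dir##*/}"
          break   # one match is enough; report each PID once
          ;;
      esac
    done
  done
}
```

The function would need to run inside the container as root (for example, from a shell started with docker exec) so that the file descriptors of the server processes are readable.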

Experiments

  • I added retries to the cuda-checkpoint invocation, but it still fails consistently.
  • I tried enabling driver persistence, but received the same error.
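
For reference, the retry experiment can be sketched roughly as below. This is not the exact wrapper used in the experiment; the retry function, its argument order, and the $pid variable in the usage comment are illustrative assumptions. Only the cuda-checkpoint flags are taken from the script above.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, sleeping
# DELAY_SECONDS between attempts. Returns 0 on success, otherwise the
# exit status of the last failed attempt.
retry() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while true; do
    if "$@"; then
      return 0
    else
      status=$?
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts (exit status $status)" >&2
      return "$status"
    fi
    echo "attempt $attempt failed (exit status $status); retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example usage ($pid is a placeholder for a target process ID):
#   retry 5 1 cuda-checkpoint --toggle --pid "$pid"
```

Because the helper returns the last exit status rather than swallowing it, a consistent failure like the one reported here stays visible to the caller.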

This seems similar to #34 and #27, although we are using a later driver version.

Abbreviated strace of cuda-checkpoint --toggle --pid

strace.txt
