
Segfault reported when running Primus on 8 MI250X GPUs #555

@secondspass

Hello there,

I'm trying to run Primus in an Apptainer container on a node with 8 MI250X GPUs. Inside the running container, I'm using primus-cli direct. The container image is rocm/primus:v26.1 from Docker Hub. Running from within the Apptainer container is effectively the same as running directly on the node: it has access to all 8 GPUs and the network.

Clone Primus first

git clone --recursive -b v0.7.0 https://github.com/AMD-AGI/Primus/

To build the container

apptainer build primusdockerhub.sif docker://docker.io/rocm/primus:v26.1

To run Primus

# start a shell with the container
apptainer shell primusdockerhub.sif

# By default this starts a container shell and also mounts your current working directory from the host, so you should still see
# the contents of your current directory within the container shell when you run `ls`

# Now we're in the running container 
cd ./Primus # cd-ing into the Primus repository we had cloned earlier in the current directory
# running the qwen2.5 pretrain from the MI300X examples. There was no directory for MI250X.
./runner/primus-cli direct -- train pretrain --config ./examples/megatron/configs/MI300X/qwen2.5_7B-BF16-pretrain.yaml
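
The description says the container sees all 8 GPUs. One way to double-check that from inside the container shell is to count the DRM render nodes and compare the number with what the same command prints on the host. This is only a minimal sketch: `count_render_nodes` is a hypothetical helper, and it assumes that each GPU shows up as a `renderD*` entry under `/dev/dri`, which may not hold on every system.

```shell
# Hypothetical helper: count DRM render nodes (renderD*) in a directory.
# Assumption: each GPU appears as one renderD* entry under /dev/dri.
count_render_nodes() {
    dir="${1:-/dev/dri}"
    count=0
    for node in "$dir"/renderD*; do
        # When nothing matches, the glob stays literal, so check existence.
        if [ -e "$node" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Usage (run once on the host and once inside the container shell):
#   count_render_nodes
```

If the two counts differ, the container is not seeing the same set of devices as the host, which would be worth ruling out before looking at the training run itself.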

I've attached the output file from the ./runner/primus-cli direct run. At the very end of the output, the processes started by torchrun report SIGSEGV.

interactiveoutput.txt
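
To jump straight to the crash reports in the attached log, the file can be searched for segfault markers. This is a rough sketch: `find_segv_lines` is a hypothetical helper, and the search patterns (`SIGSEGV`, `Signal 11`, `exitcode: -11`) are assumptions about how the failures are worded, so they may need adjusting to the actual log.

```shell
# Hypothetical helper: print "line-number:text" for each line of a log
# that looks like a segfault report.
# Assumption: failures are worded as SIGSEGV, "Signal 11" or "exitcode: -11".
find_segv_lines() {
    grep -n -i -E 'sigsegv|signal 11|exitcode[[:space:]]*:[[:space:]]*-11' "$1"
}

# Usage:
#   find_segv_lines interactiveoutput.txt
```

`grep` exits with status 1 when nothing matches, so an empty result together with a non-zero exit status means none of the assumed patterns occur in the file.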

EDIT (2026-02-19): Updated some of the above instructions because they were incorrect. The repository needs to be cloned recursively, and the apptainer --bind flags are not necessary.
