Description
Hello there,
I'm trying to run Primus through an Apptainer container on a node with 8 MI250X GPUs. Within the running container, I'm using primus-cli direct. The container image is rocm/primus:v26.1 from Docker Hub. Running from within the Apptainer container is effectively the same as running directly on the node: it has access to all 8 GPUs and the network.
Clone Primus first
git clone --recursive -b v0.7.0 https://github.com/AMD-AGI/Primus/
To build the container
apptainer build primusdockerhub.sif docker://docker.io/rocm/primus:v26.1
To run Primus
# start a shell with the container
apptainer shell primusdockerhub.sif
# by default this starts a container shell and also mounts your current working directory from the host, so you should still see your current
# directory's contents within the container shell when you run `ls`
# Now we're in the running container
cd ./Primus # cd-ing into the Primus repository we had cloned earlier in the current directory
# running the qwen2.5 pretrain from the MI300X examples. There was no directory for MI250X.
./runner/primus-cli direct -- train pretrain --config ./examples/megatron/configs/MI300X/qwen2.5_7B-BF16-pretrain.yaml
I've attached the output file from the ./runner/primus-cli direct run. At the very end of the output, the processes started by torchrun report SIGSEGV.
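For anyone reproducing this, a minimal sketch of how the SIGSEGV reports could be pulled out of a saved run log. The helper name `find_segv` and the log file name `primus_run.log` are hypothetical; they are not part of Primus or the attached output.

```shell
# Hypothetical helper: list every line in a saved log that mentions SIGSEGV.
# grep -n prefixes each match with its line number in the log.
find_segv() {
    log_file="$1"
    grep -n "SIGSEGV" "$log_file"
}

# Assumed usage: capture the run output to a file (name is an assumption), then scan it.
# ./runner/primus-cli direct -- train pretrain --config ./examples/megatron/configs/MI300X/qwen2.5_7B-BF16-pretrain.yaml 2>&1 | tee primus_run.log
# find_segv primus_run.log
```

The line numbers make it easier to jump to the surrounding context in the log and see what each process printed just before it crashed.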
EDIT (2026-02-19): Updated some of the instructions above because they were incorrect. The repository needs to be cloned recursively, and the apptainer --bind flags are not necessary.