@amd-wsung102
Contributor

Motivation

This PR continues Amir's commit cde5f3d and enables CTX for multinode cases using the new template variables.

Submission Checklist

@amd-wsung102
Contributor Author

The changes made in this PR include:

  • Using the existing code changes in this commit.
  • Enabling CTX in internode_ll.cu for multinode use cases. A case is treated as multinode when the newly added template variable multinode is true; its value is chosen during runtime dispatch based on the number of ranks.
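
For readers unfamiliar with the pattern, here is a minimal host-side sketch of how a runtime rank count can select a compile-time multinode template argument. It is not the code from this PR: kRanksPerNode, launch_dispatch, and dispatch are hypothetical names, and the assumption that "more ranks than fit on one node" means multinode is mine, not something the PR states.

```cpp
#include <cassert>
#include <string>

// Assumption for illustration only: 8 ranks (GPUs) per node.
constexpr int kRanksPerNode = 8;

// Hypothetical stand-in for a kernel launcher templated on `multinode`.
// In real code this would launch a kernel; here it only reports which
// compile-time variant was selected.
template <bool multinode>
std::string launch_dispatch() {
    if (multinode) {
        return "ctx-enabled path";   // multinode variant
    }
    return "single-node path";       // single-node variant
}

// Runtime dispatch: map the runtime rank count to a template argument.
std::string dispatch(int num_ranks) {
    assert(num_ranks > 0);
    const bool multinode = num_ranks > kRanksPerNode;  // assumed criterion
    return multinode ? launch_dispatch<true>() : launch_dispatch<false>();
}
```

Under these assumptions, dispatch(8) selects the single-node variant and dispatch(16) selects the multinode one. Because the choice is a template argument, each variant is compiled separately and the branch costs nothing inside the kernel body.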


@RichardChamberlain1 left a comment

Needs some general tidy-up.

I ran the single-node test and it hung on inter-node; see here:

https://ml-ci-internal.amd.com/job/DeepEP/job/Experimental/101/console

Has this been stress tested? If so, which Docker image did you base it on?

@amd-wsung102
Contributor Author

The Docker image is based on rocm6.3.4_ubuntu24.04_py3.12_pytorch_release_2.4.0. This has not been stress tested yet. I am trying to stress test it on OCI, but OCI reports this error: "permission denied while trying to connect to the docker API at unix:///var/run/docker.sock".

I will keep trying to run it on OCI.
