One of our machines uses PBS/Torque as the job scheduler and aprun to launch jobs. In this environment, judging by the behavior of our scripts, Legate appears not to detect that we are in a multi-node run. As a result, each process runs its own independent copy of the script, which causes a range of problems (slow runs, out-of-memory failures, filesystem races, etc.).
@manopapad pointed me to the code that attempts to detect the network, and we can clearly see aprun is not supported:
legate/src/cpp/legate/runtime/detail/argument_parsing/util.cc
Lines 29 to 33 in 60b41a2

    unsigned int num_ranks()
    {
      constexpr EnvironmentVariable<std::uint32_t> OMPI_COMM_WORLD_SIZE{"OMPI_COMM_WORLD_SIZE"};
      constexpr EnvironmentVariable<std::uint32_t> MV2_COMM_WORLD_SIZE{"MV2_COMM_WORLD_SIZE"};
      constexpr EnvironmentVariable<std::uint32_t> SLURM_NTASKS{"SLURM_NTASKS"};

This raises two larger questions.
- First, we need a way to initialize when the network is not one of these. A manual flag or some sort of hint might be an option.
- Second, we should have standardized documentation that covers: (a) how to debug whether the network is working, (b) how to run with an unsupported launcher, and (c) how to check that all the rank/core/GPU/etc. bindings are working in this scenario.
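
To make the first point concrete, here is a rough sketch of what a manual hint could look like. It is not Legate's actual implementation: the `LEGATE_NUM_RANKS_HINT` variable, the `RankInfo` struct, and the `detect_num_ranks()` function are all hypothetical names invented for illustration, and falling back to a single rank when nothing is set is an assumption about the current behavior. The only names taken from the snippet above are the three launcher variables. Returning the name of the variable that was matched might also help with debugging point (a).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <exception>
#include <limits>
#include <optional>
#include <string>

// Hypothetical result type: the rank count plus the variable it came from.
struct RankInfo {
  std::uint32_t count;
  std::string source;  // "default" when no variable was usable
};

// Read a strictly positive integer from an environment variable.
// Returns nullopt if the variable is unset, empty, malformed, or out of range.
std::optional<std::uint32_t> read_env_u32(const char* name)
{
  const char* raw = std::getenv(name);
  if (raw == nullptr || raw[0] < '0' || raw[0] > '9') return std::nullopt;
  try {
    const std::string text{raw};
    std::size_t pos = 0;
    const unsigned long value = std::stoul(text, &pos);
    if (pos != text.size() || value == 0 ||
        value > std::numeric_limits<std::uint32_t>::max()) {
      return std::nullopt;
    }
    return static_cast<std::uint32_t>(value);
  } catch (const std::exception&) {
    return std::nullopt;
  }
}

// Check a user-supplied hint first, then the launcher variables from the
// snippet above. LEGATE_NUM_RANKS_HINT is a made-up name, not a real
// Legate setting.
RankInfo detect_num_ranks()
{
  constexpr const char* candidates[] = {
    "LEGATE_NUM_RANKS_HINT",  // hypothetical manual override
    "OMPI_COMM_WORLD_SIZE",
    "MV2_COMM_WORLD_SIZE",
    "SLURM_NTASKS",
  };
  for (const char* name : candidates) {
    if (const auto value = read_env_u32(name)) return {*value, name};
  }
  // Assumption: with no recognized variable, treat the run as single-rank.
  return {1, "default"};
}
```

With something like this, an aprun job could export the hint in the PBS script before launching, and logging the `source` field at startup would show whether detection fell through to the default.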