Legate fails to detect network when launched with aprun #1013

@elliottslaughter

One of our machines uses PBS/Torque as the job scheduler and aprun to launch jobs. In this environment, judging by the behavior of our scripts, Legate appears to be failing to detect that we're in a multi-node run. As a result, each rank executes its own independent copy of the script, which leads to various bad behaviors (slow runs, out-of-memory errors, filesystem races, etc.).

@manopapad pointed me to the code that attempts to detect the network, and we can clearly see aprun is not supported:

    unsigned int num_ranks()
    {
      constexpr EnvironmentVariable<std::uint32_t> OMPI_COMM_WORLD_SIZE{"OMPI_COMM_WORLD_SIZE"};
      constexpr EnvironmentVariable<std::uint32_t> MV2_COMM_WORLD_SIZE{"MV2_COMM_WORLD_SIZE"};
      constexpr EnvironmentVariable<std::uint32_t> SLURM_NTASKS{"SLURM_NTASKS"};
      // ...

This raises two larger questions.

  1. First, we need a way to initialize when the network is not one of these. A manual flag or some sort of hint might be an option.
  2. Second, we should have standardized documentation that covers: (a) how to debug whether the network is working, (b) how to run with an unsupported launcher, and (c) advice for checking that all the rank/core/GPU/etc. bindings are working when you're in this scenario.
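
To make the first point concrete, here is a minimal sketch of what a manual hint could look like. It is not Legate's actual implementation: the variable name `LEGATE_NUM_RANKS_HINT` and the helpers `read_count` and `num_ranks_with_hint` are hypothetical, and the fallback to a single rank is an assumption based on the behavior described above. The idea is only that an explicit user-supplied hint is consulted before the three launcher variables from the snippet.

```cpp
#include <charconv>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <initializer_list>
#include <optional>
#include <system_error>

// Parse an environment variable as a positive 32-bit count.
// Returns std::nullopt when the variable is unset, empty, malformed, or zero.
std::optional<std::uint32_t> read_count(const char* name)
{
  const char* value = std::getenv(name);
  if (value == nullptr || *value == '\0') { return std::nullopt; }

  std::uint32_t parsed = 0;
  const char* end      = value + std::strlen(value);
  const auto [ptr, ec] = std::from_chars(value, end, parsed);
  if (ec != std::errc{} || ptr != end || parsed == 0) { return std::nullopt; }
  return parsed;
}

// HYPOTHETICAL override name, used for illustration only; this is not an
// existing Legate option.
constexpr const char* RANK_HINT_VAR = "LEGATE_NUM_RANKS_HINT";

std::uint32_t num_ranks_with_hint()
{
  // 1. An explicit hint from the user wins, so unsupported launchers
  //    (such as aprun) can still be described manually.
  if (const auto hint = read_count(RANK_HINT_VAR)) { return *hint; }

  // 2. Otherwise consult the launcher variables quoted in the issue.
  for (const char* name : {"OMPI_COMM_WORLD_SIZE", "MV2_COMM_WORLD_SIZE", "SLURM_NTASKS"}) {
    if (const auto count = read_count(name)) { return *count; }
  }

  // 3. Nothing recognized: assume a single-rank run (assumed fallback).
  return 1;
}
```

With something along these lines, a job launched through aprun could export the hint in the batch script before invoking the launcher. A malformed hint is ignored rather than treated as fatal in this sketch; a real implementation might prefer to report an error so that a typo does not silently fall back to single-rank behavior.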
