Add optional Training Plan support for HyperPod instance groups by newabdosheham · Pull Request #1004 · awslabs/awsome-distributed-training

newabdosheham · 2026-02-26T23:39:07Z

Why new PR: “Old PR #930 auto-closed due to upstream history rewrite; this is rebased onto new main.”

What changed: “Adds optional Training Plan support for HyperPod instance groups + HyperPod Slurm observability improvements.”

Signed-off-by: Ankur Srivastava <awsankur@amazon.com>

* Change readme to refer to recent test cases & assets * Remove borked ascii logo

Description of changes: - Updating the CloudFormation template for the HyperPod Slurm observability stack. Now it automatically register Prometheus as the data-source for Grafana, and installs pre-configured dashboard to Grafana. - Updating lifecycle script to install metric exporters and OTEL collector in each node - more scalable architecture similar to the HyperPod EKS Observability. - Added "ObservabilityConfig" class in config.py. It has Prometheus Remote Write URL and advanced flag. For more details - https://catalog.workshops.aws/sagemaker-hyperpod/en-US/09-observability --------- Co-authored-by: Madhubalasri-B <madbal@amazon.com>

…an ARN

newabdosheham · 2026-02-27T00:32:18Z

This PR supersedes #930 (auto-closed due to upstream history rewrite).

Addressing review feedback from #930:

Refactored instance_groups to a list of objects with embedded name
Moved Training Plan support into the instance group object via optional training_plan_arn (instead of global vars / selecting a group by name)
Updated the Terraform example tfvars accordingly

Happy to adjust if you’d prefer a different naming/schema.

newabdosheham · 2026-02-27T00:46:22Z

Hi @KeitaW @bluecrayon52 — thanks again for the guidance on #930.

This PR (#1004) supersedes #930 (auto-closed due to upstream history rewrite) and is rebased onto the new main.

It addresses the review feedback from #930 by:

Standardizing instance_groups to a list of objects

Embedding name in each instance group object

Supporting Training Plans via optional per-group training_plan_arn (no global training-plan vars)

Updating terraform.tfvars.example accordingly

Would you mind taking a look when you have a moment? Thank you!

cc @perifaws @awsankur @shimomut

paragao

Review

Thanks for the contribution! The core Terraform changes for Training Plan support look well-implemented — the map(object) → list(object) refactoring and the optional training_plan_arn pattern are clean and idiomatic.

However, this PR includes several unrelated regressions that need to be cleaned up before it can be merged. Please ensure the PR only contains changes related to Training Plan support and nothing else.

1. Revert or fix README.md changes (Blocking)

The README.md appears to have been regressed to an older version of the file, likely from a bad rebase. Specific issues:

Links changed from awslabs/awsome-distributed-training back to aws-samples/awsome-distributed-training (incorrect — the repo has migrated)
Removed sections (micro-benchmarks, additional architectures like 6.ldap_server, 7.sagemaker-hyperpod-eks, 8.accounting-database)
Stale directory references (0.common instead of 0.s3, missing _and_ in 4.validation_observability)
Incorrect product name capitalization ("Hyperpod" → should be "HyperPod", "Pytorch" → "PyTorch", "MegatronLM" → "Megatron-LM")

Action: Please exclude README.md from this PR entirely (git checkout main -- README.md), or if you have intentional README changes, submit them in a separate PR with proper corrections.

2. Revert .gitattributes changes (Blocking)

The .gitattributes file has had Git LFS tracking rules removed for .gif, .zip, .tar.bz2, and .tar.gz files. This is almost certainly unintentional from the rebase and risks breaking large file handling in the repository.

Action: Please revert this file — git checkout main -- .gitattributes.

3. Remove unrelated binary ZIP changes (Blocking)

Two Lambda function ZIP files under 4.validation_and_observability/4.prometheus-grafana/assets/ are modified, but there are no corresponding source code changes in this PR. Binary changes without source make the diff impossible to review and pose a supply chain risk.

Action: Please revert these files — they are unrelated to Training Plan support.

Summary

Please clean up the branch so that it only includes the Terraform changes for Training Plan support (the instance_groups refactoring, training_plan_arn optional field, precondition, variables, and example tfvars). All unrelated file changes should be reverted to match main. This will make the PR much easier to review and merge.

Once the cleanup is done, happy to re-review. The core feature work looks good.

perifaws and others added 9 commits February 27, 2026 03:06

Removed inclusion of nsight

3490914

Boilerplate settings

4a2e90c

Added nsys profile for BioNemo

1f3f5c1

Signed-off-by: Ankur Srivastava <awsankur@amazon.com>

Updated twith comments from PR

dbc9d7c

Signed-off-by: Ankur Srivastava <awsankur@amazon.com>

Fix/readme - fix borked logo (awslabs#356)

b1677a1

* Change readme to refer to recent test cases & assets * Remove borked ascii logo

Deleting SMP/SMDDP test-cases based on PM's input (awslabs#617)

2796df4

Add optional Training Plan support for HyperPod instance groups

84f4998

Standardize instance_groups to list objects with optional training pl…

3ceef94

…an ARN

paragao requested changes Mar 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add optional Training Plan support for HyperPod instance groups#1004

Add optional Training Plan support for HyperPod instance groups#1004
newabdosheham wants to merge 9 commits intoawslabs:mainfrom
newabdosheham:feature/hyperpod-training-plan

newabdosheham commented Feb 26, 2026

Uh oh!

newabdosheham commented Feb 27, 2026

Uh oh!

newabdosheham commented Feb 27, 2026

Uh oh!

paragao left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

newabdosheham commented Feb 26, 2026

Uh oh!

newabdosheham commented Feb 27, 2026

Uh oh!

newabdosheham commented Feb 27, 2026

Uh oh!

paragao left a comment

Choose a reason for hiding this comment

Review

1. Revert or fix README.md changes (Blocking)

2. Revert .gitattributes changes (Blocking)

3. Remove unrelated binary ZIP changes (Blocking)

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants