Skip to content

Add optional Training Plan support for HyperPod instance groups#1004

Open
newabdosheham wants to merge 9 commits intoawslabs:mainfrom
newabdosheham:feature/hyperpod-training-plan
Open

Add optional Training Plan support for HyperPod instance groups#1004
newabdosheham wants to merge 9 commits intoawslabs:mainfrom
newabdosheham:feature/hyperpod-training-plan

Conversation

@newabdosheham
Copy link

Why new PR: “Old PR #930 auto-closed due to upstream history rewrite; this is rebased onto new main.”

What changed: “Adds optional Training Plan support for HyperPod instance groups + HyperPod Slurm observability improvements.”

perifaws and others added 9 commits February 27, 2026 03:06
Signed-off-by: Ankur Srivastava <awsankur@amazon.com>
Signed-off-by: Ankur Srivastava <awsankur@amazon.com>
* Change readme to refer to recent test cases & assets

* Remove borked ascii logo
Description of changes:

- Updating the CloudFormation template for the HyperPod Slurm observability stack. Now it automatically register Prometheus as the data-source for Grafana, and installs pre-configured dashboard to Grafana.
- Updating lifecycle script to install metric exporters and OTEL collector in each node - more scalable architecture similar to the HyperPod EKS Observability.
- Added "ObservabilityConfig" class in config.py. It has Prometheus Remote Write URL and advanced flag.

For more details - https://catalog.workshops.aws/sagemaker-hyperpod/en-US/09-observability

---------

Co-authored-by: Madhubalasri-B <madbal@amazon.com>
@newabdosheham
Copy link
Author

This PR supersedes #930 (auto-closed due to upstream history rewrite).

Addressing review feedback from #930:

  • Refactored instance_groups to a list of objects with embedded name
  • Moved Training Plan support into the instance group object via optional training_plan_arn (instead of global vars / selecting a group by name)
  • Updated the Terraform example tfvars accordingly

Happy to adjust if you’d prefer a different naming/schema.

@newabdosheham
Copy link
Author

Hi @KeitaW @bluecrayon52 — thanks again for the guidance on #930.

This PR (#1004) supersedes #930 (auto-closed due to upstream history rewrite) and is rebased onto the new main.

It addresses the review feedback from #930 by:

Standardizing instance_groups to a list of objects

Embedding name in each instance group object

Supporting Training Plans via optional per-group training_plan_arn (no global training-plan vars)

Updating terraform.tfvars.example accordingly

Would you mind taking a look when you have a moment? Thank you!

cc @perifaws @awsankur @shimomut

Copy link
Contributor

@paragao paragao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Thanks for the contribution! The core Terraform changes for Training Plan support look well-implemented — the map(object)list(object) refactoring and the optional training_plan_arn pattern are clean and idiomatic.

However, this PR includes several unrelated regressions that need to be cleaned up before it can be merged. Please ensure the PR only contains changes related to Training Plan support and nothing else.

1. Revert or fix README.md changes (Blocking)

The README.md appears to have been regressed to an older version of the file, likely from a bad rebase. Specific issues:

  • Links changed from awslabs/awsome-distributed-training back to aws-samples/awsome-distributed-training (incorrect — the repo has migrated)
  • Removed sections (micro-benchmarks, additional architectures like 6.ldap_server, 7.sagemaker-hyperpod-eks, 8.accounting-database)
  • Stale directory references (0.common instead of 0.s3, missing _and_ in 4.validation_observability)
  • Incorrect product name capitalization ("Hyperpod" → should be "HyperPod", "Pytorch" → "PyTorch", "MegatronLM" → "Megatron-LM")

Action: Please exclude README.md from this PR entirely (git checkout main -- README.md), or if you have intentional README changes, submit them in a separate PR with proper corrections.

2. Revert .gitattributes changes (Blocking)

The .gitattributes file has had Git LFS tracking rules removed for .gif, .zip, .tar.bz2, and .tar.gz files. This is almost certainly unintentional from the rebase and risks breaking large file handling in the repository.

Action: Please revert this file — git checkout main -- .gitattributes.

3. Remove unrelated binary ZIP changes (Blocking)

Two Lambda function ZIP files under 4.validation_and_observability/4.prometheus-grafana/assets/ are modified, but there are no corresponding source code changes in this PR. Binary changes without source make the diff impossible to review and pose a supply chain risk.

Action: Please revert these files — they are unrelated to Training Plan support.


Summary

Please clean up the branch so that it only includes the Terraform changes for Training Plan support (the instance_groups refactoring, training_plan_arn optional field, precondition, variables, and example tfvars). All unrelated file changes should be reverted to match main. This will make the PR much easier to review and merge.

Once the cleanup is done, happy to re-review. The core feature work looks good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants