Add optional Training Plan support for HyperPod instance groups#1004
Add optional Training Plan support for HyperPod instance groups#1004newabdosheham wants to merge 9 commits intoawslabs:mainfrom
Conversation
Signed-off-by: Ankur Srivastava <awsankur@amazon.com>
Signed-off-by: Ankur Srivastava <awsankur@amazon.com>
* Change readme to refer to recent test cases & assets * Remove borked ascii logo
Description of changes: - Updating the CloudFormation template for the HyperPod Slurm observability stack. Now it automatically register Prometheus as the data-source for Grafana, and installs pre-configured dashboard to Grafana. - Updating lifecycle script to install metric exporters and OTEL collector in each node - more scalable architecture similar to the HyperPod EKS Observability. - Added "ObservabilityConfig" class in config.py. It has Prometheus Remote Write URL and advanced flag. For more details - https://catalog.workshops.aws/sagemaker-hyperpod/en-US/09-observability --------- Co-authored-by: Madhubalasri-B <madbal@amazon.com>
|
This PR supersedes #930 (auto-closed due to upstream history rewrite). Addressing review feedback from #930:
Happy to adjust if you’d prefer a different naming/schema. |
|
Hi @KeitaW @bluecrayon52 — thanks again for the guidance on #930. This PR (#1004) supersedes #930 (auto-closed due to upstream history rewrite) and is rebased onto the new main. It addresses the review feedback from #930 by: Standardizing instance_groups to a list of objects Embedding name in each instance group object Supporting Training Plans via optional per-group training_plan_arn (no global training-plan vars) Updating terraform.tfvars.example accordingly Would you mind taking a look when you have a moment? Thank you! |
paragao
left a comment
There was a problem hiding this comment.
Review
Thanks for the contribution! The core Terraform changes for Training Plan support look well-implemented — the map(object) → list(object) refactoring and the optional training_plan_arn pattern are clean and idiomatic.
However, this PR includes several unrelated regressions that need to be cleaned up before it can be merged. Please ensure the PR only contains changes related to Training Plan support and nothing else.
1. Revert or fix README.md changes (Blocking)
The README.md appears to have been regressed to an older version of the file, likely from a bad rebase. Specific issues:
- Links changed from
awslabs/awsome-distributed-trainingback toaws-samples/awsome-distributed-training(incorrect — the repo has migrated) - Removed sections (micro-benchmarks, additional architectures like
6.ldap_server,7.sagemaker-hyperpod-eks,8.accounting-database) - Stale directory references (
0.commoninstead of0.s3, missing_and_in4.validation_observability) - Incorrect product name capitalization ("Hyperpod" → should be "HyperPod", "Pytorch" → "PyTorch", "MegatronLM" → "Megatron-LM")
Action: Please exclude README.md from this PR entirely (git checkout main -- README.md), or if you have intentional README changes, submit them in a separate PR with proper corrections.
2. Revert .gitattributes changes (Blocking)
The .gitattributes file has had Git LFS tracking rules removed for .gif, .zip, .tar.bz2, and .tar.gz files. This is almost certainly unintentional from the rebase and risks breaking large file handling in the repository.
Action: Please revert this file — git checkout main -- .gitattributes.
3. Remove unrelated binary ZIP changes (Blocking)
Two Lambda function ZIP files under 4.validation_and_observability/4.prometheus-grafana/assets/ are modified, but there are no corresponding source code changes in this PR. Binary changes without source make the diff impossible to review and pose a supply chain risk.
Action: Please revert these files — they are unrelated to Training Plan support.
Summary
Please clean up the branch so that it only includes the Terraform changes for Training Plan support (the instance_groups refactoring, training_plan_arn optional field, precondition, variables, and example tfvars). All unrelated file changes should be reverted to match main. This will make the PR much easier to review and merge.
Once the cleanup is done, happy to re-review. The core feature work looks good.
Why new PR: “Old PR #930 auto-closed due to upstream history rewrite; this is rebased onto new main.”
What changed: “Adds optional Training Plan support for HyperPod instance groups + HyperPod Slurm observability improvements.”