fix: overhaul CI workflows for FSDP regression tests#1024

Open
paragao wants to merge 1 commit into main from fix/ci-workflows

Conversation


@paragao paragao commented Mar 17, 2026

Summary

Restructure and harden the GitHub Actions CI workflows that run FSDP distributed training regression tests on a remote Slurm cluster via SSH.

Changes

Workflow structural fixes

  • Fix YAML syntax errors in all 4 Slurm workflows (heredoc EOF terminators at column 0 inside YAML block scalars)
  • Remove broken EKS regression workflow (fsdp-eks-regression.yml)
  • Replace heavyweight 814-line PR review workflow with lightweight path-aware linter (pr-lint.yml)
  • Update actions/stale from v5 to v9
  • Resolve all actionlint and shellcheck findings

Slurm job monitoring improvements

  • Replace broken squeue-based job status detection with sacct exit code checking (squeue returning empty was incorrectly treated as COMPLETED)
  • Add SSH retry wrapper (ssh_cmd) with exponential backoff
  • Add SSH keepalive settings (ServerAliveInterval/ServerAliveCountMax)
  • Add dedicated enroot cleanup job to avoid race conditions
  • Add inline error log dump (last 200 lines) before job failure exit
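The SSH retry wrapper can be sketched as a generic retry-with-exponential-backoff function (a minimal illustration, not the PR's exact ssh_cmd; delays are shortened for readability):

```shell
#!/usr/bin/env bash
# Sketch: retry a command up to 3 times with exponential backoff.
# Names and delays are illustrative; the workflow's ssh_cmd wraps ssh.
retry_cmd() {
  local attempt=1 max_attempts=3 delay=1
  while true; do
    if "$@"; then
      return 0                       # success, stop retrying
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry_cmd: giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))             # exponential backoff: 1s, 2s, 4s, ...
    attempt=$((attempt + 1))
  done
}

# The workflow's wrapper presumably looks something like:
#   ssh_cmd() {
#     retry_cmd ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
#       "$CLUSTER_HOST" "$@"
#   }
```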

Runtime fixes

  • Skip sudo apt install in create_venv.sh when sudo is unavailable
  • Fix shell variable escaping for sbatch filename in heredoc
  • Insert venv activation after last #SBATCH directive (not line 2)
  • Add /opt/slurm/bin to PATH for non-login SSH sessions
  • Fix LD_PRELOAD path to system NCCL (/lib/x86_64-linux-gnu/libnccl.so)
  • Pass HF_TOKEN to Slurm jobs via sed injection
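The "insert after the last #SBATCH directive" fix boils down to locating the last directive and appending after it; a standalone sketch (hypothetical paths, not the PR's exact sed expression):

```shell
#!/usr/bin/env bash
# Build a toy sbatch file, then insert venv activation after the LAST
# '#SBATCH' directive instead of hard-coding line 2.
sbatch_file=$(mktemp)
cat > "$sbatch_file" <<'EOF'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --job-name=fsdp-test
srun python train.py
EOF

# Find the line number of the last #SBATCH directive...
last_sbatch=$(grep -n '^#SBATCH' "$sbatch_file" | tail -n1 | cut -d: -f1)
# ...then append the activation line right after it (GNU sed 'a' command).
sed -i "${last_sbatch}a source /fsx/venv/bin/activate" "$sbatch_file"
```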

HuggingFace data loading (fixes HTTP 429 rate limiting)

  • Pre-download C4 dataset shards to local FSx storage on the cluster
  • Generate a JSON manifest of local file paths per split
  • Modify train_utils.py to use load_dataset("json", data_files=...) when HF_DATA_FILES_MANIFEST env var points to a valid manifest
  • Set HF_HOME and HF_DATASETS_CACHE for shared FSx caching
  • Set HF_HUB_OFFLINE=1 in sbatch files to block runtime API calls
  • Pre-cache tokenizers in the workflow pre-download step
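The manifest-based fallback in train_utils.py presumably reduces to logic like the following (a sketch; resolve_data_files and the manifest shape are assumptions, not the PR's exact code):

```python
import json
import os

def resolve_data_files():
    """Return a {split: [local paths]} dict when HF_DATA_FILES_MANIFEST
    points at a valid manifest file, else None (caller falls back to the
    Hub). Hypothetical helper; the actual train_utils.py change may differ.
    """
    manifest_path = os.environ.get("HF_DATA_FILES_MANIFEST", "")
    if not manifest_path or not os.path.isfile(manifest_path):
        return None
    with open(manifest_path) as f:
        # Expected shape: {"train": ["/fsx/c4/train-00000.json.gz", ...],
        #                  "validation": [...]}
        return json.load(f)

# Usage sketch:
# data_files = resolve_data_files()
# if data_files is not None:
#     dataset = load_dataset("json", data_files=data_files)  # local FSx files
# else:
#     dataset = load_dataset("allenai/c4", "en")              # Hub fallback
```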

Test matrix

  • Align both venv and container workflows: cluster: [p5], model_config: [llama3_1_8b, llama3_1_70b]
  • Set NCCL_DEBUG=WARN (reduced from INFO) in all sbatch files
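In GitHub Actions terms, the aligned matrix presumably looks like this (the fail-fast setting is an assumption, not confirmed by the PR):

```yaml
strategy:
  fail-fast: false          # assumption: let sibling matrix entries finish
  matrix:
    cluster: [p5]
    model_config: [llama3_1_8b, llama3_1_70b]
```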

Files changed (19)

  • Workflows: fsdp-regression-test-venv.yml, fsdp-regression-test-container.yml, megatron-ci-slurm.yaml, closing-soon.yml, pr-lint.yml (new)
  • Workflows removed: fsdp-eks-regression.yml, pr-review-and-slurm-test.yml
  • Sbatch files (9): all 3.test_cases/pytorch/FSDP/slurm/*-training.sbatch — LD_PRELOAD, HF caching, NCCL_DEBUG
  • Scripts: create_venv.sh — sudo fallback
  • Training code: train_utils.py — manifest-based local data loading
  • Other: .gitignore — added log.failed

Testing

Validated through 15+ CI workflow runs on the p5 cluster, iteratively fixing issues from OIDC configuration through data loading.


@KeitaW KeitaW left a comment


Review — Deployment Pipeline & Positives

Nice cleanup across the board — the sacct fix, SSH resilience, and heredoc corrections are all substantive improvements.

Security scanning (bandit) dropped without replacement

The old pr-review-and-slurm-test.yml ran bandit security scanning on all Python files. The new pr-lint.yml replaces it with only flake8 and bash -n. I understand the intent is a lightweight replacement (and the old workflow had YAML issues that made it non-functional), but this is a net reduction in security coverage.

I'd suggest either adding a bandit step to pr-lint.yml (scoped to changed files only, to keep it fast), or tracking the re-addition as a follow-up issue so it doesn't fall through the cracks.
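A changed-files-only bandit step could look roughly like this (hypothetical: it assumes the workflow's existing changed-files step exposes py_count/py_files outputs, and the version pin is illustrative):

```yaml
- name: Bandit scan (changed Python files only)
  if: steps.changed.outputs.py_count != '0'
  run: |
    pip install bandit==1.7.9
    # -ll: only report MEDIUM severity and above, keeping PR noise low
    echo "${{ steps.changed.outputs.py_files }}" | xargs -r bandit -ll
```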


Things That Look Great

  • sacct for proper job status detection: The old pattern (squeue empty → "COMPLETED") silently treated FAILED/OOM_KILLED jobs as successes. Using sacct to check the actual exit code is the correct fix and will catch real failures.
  • SSH retry wrapper with keepalive: The ssh_cmd() function with ServerAliveInterval=60 and 3-retry logic is a smart pattern for the 6-hour monitoring loops where SSH connections would previously drop silently.
  • Dedicated cleanup-enroot job: Moving enroot image cleanup to a separate job that needs: [build, run-tests] with if: always() eliminates the race condition where matrix entries could try to delete images still in use by sibling jobs.
  • Heredoc indentation fixes: The EOF terminator placement was genuinely broken (unindented EOF inside indented YAML blocks). Fixing this across all 4 workflow files makes them actually parseable.
  • ShellCheck compliance: Quoting $GITHUB_OUTPUT / $GITHUB_ENV redirects (SC2086) and grouping consecutive redirects into { } >> blocks (SC2129) are good hygiene that prevents subtle word-splitting bugs in CI.
  • Scoped PR linting: Only linting files changed in the PR (via git diff --name-only --diff-filter=ACMR) is much faster and avoids noise from pre-existing issues in the codebase.
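The sacct-based success check reduces to parsing state and exit code from one accounting line; a standalone sketch of that parsing (field layout assumed from sacct -X --noheader --format=State,ExitCode; the workflow's exact invocation may differ):

```shell
#!/usr/bin/env bash
# Classify a Slurm job from one sacct line such as "COMPLETED 0:0".
# In the workflow the line would come from something like:
#   sacct -j "$JOB_ID" -X --noheader --format=State,ExitCode
job_status() {
  local state exit_code
  state=$(echo "$1" | awk '{print $1}')
  exit_code=$(echo "$1" | awk '{print $2}' | cut -d: -f1)
  if [ "$state" = "COMPLETED" ] && [ "$exit_code" = "0" ]; then
    echo "COMPLETED"
  else
    echo "FAILED (state=$state, exit=$exit_code)"
  fi
}
```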

Comment on lines +85 to +101
ERRORS=0

echo "${{ steps.changed.outputs.sh_files }}" | while IFS= read -r f; do
  [ -z "$f" ] && continue
  [ -f "$f" ] || continue
  echo "  Checking: $f"
  if ! bash -n "$f" 2>&1; then
    ERRORS=$((ERRORS + 1))
  fi
done

if [ "$ERRORS" -gt 0 ]; then
  echo "Found syntax errors in $ERRORS shell script(s)"
  exit 1
fi

echo "All shell scripts passed syntax check"

Shell syntax check never fails due to subshell variable scope

The ERRORS counter is incremented inside a while loop that reads from a pipe (echo | while). In bash, the right side of a pipe runs in a subshell, so ERRORS is always 0 in the parent shell after the loop exits. The if [ "$ERRORS" -gt 0 ] check will never trigger, even when bash -n reports syntax errors.

I'd suggest using a here-string to keep the loop in the parent shell:

Suggested change

Before:

    ERRORS=0
    echo "${{ steps.changed.outputs.sh_files }}" | while IFS= read -r f; do
      [ -z "$f" ] && continue
      [ -f "$f" ] || continue
      echo "  Checking: $f"
      if ! bash -n "$f" 2>&1; then
        ERRORS=$((ERRORS + 1))
      fi
    done
    if [ "$ERRORS" -gt 0 ]; then
      echo "Found syntax errors in $ERRORS shell script(s)"
      exit 1
    fi
    echo "All shell scripts passed syntax check"

After:

    ERRORS=0
    while IFS= read -r f; do
      [ -z "$f" ] && continue
      [ -f "$f" ] || continue
      echo "  Checking: $f"
      if ! bash -n "$f" 2>&1; then
        ERRORS=$((ERRORS + 1))
      fi
    done <<< "${{ steps.changed.outputs.sh_files }}"
    if [ "$ERRORS" -gt 0 ]; then
      echo "Found syntax errors in $ERRORS shell script(s)"
      exit 1
    fi
    echo "All shell scripts passed syntax check"
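The subshell pitfall is easy to reproduce in isolation (minimal bash demonstration):

```shell
#!/usr/bin/env bash
# In bash, each side of a pipeline runs in a subshell, so the increment
# below is lost when the loop exits:
count=0
printf 'a\nb\n' | while IFS= read -r _; do count=$((count + 1)); done
echo "pipe loop counted: $count"         # prints 0

# A here-string keeps the loop in the parent shell, so state survives:
count=0
while IFS= read -r _; do count=$((count + 1)); done <<< "$(printf 'a\nb')"
echo "here-string loop counted: $count"  # prints 2
```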

- name: Lint Python Files
  if: steps.changed.outputs.py_count != '0'
  run: |
    pip install flake8

Unpinned flake8 version

pip install flake8 installs whatever version is current. A future major release could change rules or output format and break the workflow. I'd suggest pinning to a specific version.

Suggested change

Before:

    pip install flake8

After:

    pip install flake8==7.1.1

@KeitaW KeitaW left a comment


Overall LGTM, a few nits.

@paragao paragao changed the title from "fix: overhaul CI workflows -- fix YAML errors, improve robustness, clean up linting" to "fix: overhaul CI workflows for FSDP regression tests" on Mar 19, 2026