fix: overhaul CI workflows for FSDP regression tests #1024
Conversation
KeitaW
left a comment
Review — Deployment Pipeline & Positives
Nice cleanup across the board — the sacct fix, SSH resilience, and heredoc corrections are all substantive improvements.
Security scanning (bandit) dropped without replacement
The old pr-review-and-slurm-test.yml ran bandit security scanning on all Python files. The new pr-lint.yml replaces it with only flake8 and bash -n. I understand the intent is a lightweight replacement (and the old workflow had YAML issues that made it non-functional), but this is a net reduction in security coverage.
I'd suggest either adding a bandit step to pr-lint.yml (scoped to changed files only, to keep it fast), or tracking the re-addition as a follow-up issue so it doesn't fall through the cracks.
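One possible shape for such a step, as a sketch only: the step name is invented, and the `py_files` output of the `changed` step is an assumption mirroring the existing `sh_files`/`py_count` outputs (it is not confirmed by this PR). With the default `bash -e` shell in GitHub Actions, any bandit failure fails the step.

```yaml
# Hypothetical addition to pr-lint.yml; `py_files` output name is assumed.
- name: Security Scan (bandit)
  if: steps.changed.outputs.py_count != '0'
  run: |
    pip install bandit
    while IFS= read -r f; do
      [ -z "$f" ] && continue
      [ -f "$f" ] || continue
      bandit "$f"
    done <<< "${{ steps.changed.outputs.py_files }}"
```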
Things That Look Great
- `sacct` for proper job status detection: The old pattern (`squeue` empty → "COMPLETED") silently treated FAILED/OOM_KILLED jobs as successes. Using `sacct` to check the actual exit code is the correct fix and will catch real failures.
- SSH retry wrapper with keepalive: The `ssh_cmd()` function with `ServerAliveInterval=60` and 3-retry logic is a smart pattern for the 6-hour monitoring loops where SSH connections would previously drop silently.
- Dedicated `cleanup-enroot` job: Moving enroot image cleanup to a separate job that `needs: [build, run-tests]` with `if: always()` eliminates the race condition where matrix entries could try to delete images still in use by sibling jobs.
- Heredoc indentation fixes: The `EOF` terminator placement was genuinely broken (unindented `EOF` inside indented YAML blocks). Fixing this across all 4 workflow files makes them actually parseable.
- ShellCheck compliance: Quoting `$GITHUB_OUTPUT`/`$GITHUB_ENV` redirects (SC2086) and grouping consecutive redirects into `{ } >>` blocks (SC2129) are good hygiene that prevents subtle word-splitting bugs in CI.
- Scoped PR linting: Only linting files changed in the PR (via `git diff --name-only --diff-filter=ACMR`) is much faster and avoids noise from pre-existing issues in the codebase.
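The `sacct`-based check in the first point boils down to parsing a `State|ExitCode` pair for the job. A minimal sketch (the helper name is hypothetical; the `sacct` flags are standard, but the workflow's actual code may differ):

```shell
# Hypothetical helper: judge success from one "State|ExitCode" line, e.g.
# the first line of:
#   sacct -j "$JOB_ID" --format=State,ExitCode --noheader --parsable2
# Anything other than COMPLETED with exit 0:0 counts as a failure.
job_succeeded() {
  local state=${1%%|*}      # text before the first '|'
  local exit_code=${1#*|}   # text after the first '|'
  [ "$state" = "COMPLETED" ] && [ "$exit_code" = "0:0" ]
}
```

Unlike the old `squeue`-empty heuristic, this distinguishes FAILED and OUT_OF_MEMORY terminal states from a genuine success.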
```shell
ERRORS=0

echo "${{ steps.changed.outputs.sh_files }}" | while IFS= read -r f; do
  [ -z "$f" ] && continue
  [ -f "$f" ] || continue
  echo "  Checking: $f"
  if ! bash -n "$f" 2>&1; then
    ERRORS=$((ERRORS + 1))
  fi
done

if [ "$ERRORS" -gt 0 ]; then
  echo "Found syntax errors in $ERRORS shell script(s)"
  exit 1
fi

echo "All shell scripts passed syntax check"
```
Shell syntax check never fails due to subshell variable scope
The ERRORS counter is incremented inside a while loop that reads from a pipe (echo | while). In bash, the right side of a pipe runs in a subshell, so ERRORS is always 0 in the parent shell after the loop exits. The if [ "$ERRORS" -gt 0 ] check will never trigger, even when bash -n reports syntax errors.
I'd suggest using a here-string to keep the loop in the parent shell:
```diff
 ERRORS=0
-echo "${{ steps.changed.outputs.sh_files }}" | while IFS= read -r f; do
+while IFS= read -r f; do
   [ -z "$f" ] && continue
   [ -f "$f" ] || continue
   echo "  Checking: $f"
   if ! bash -n "$f" 2>&1; then
     ERRORS=$((ERRORS + 1))
   fi
-done
+done <<< "${{ steps.changed.outputs.sh_files }}"
 if [ "$ERRORS" -gt 0 ]; then
   echo "Found syntax errors in $ERRORS shell script(s)"
   exit 1
 fi
 echo "All shell scripts passed syntax check"
```
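The scoping difference is easy to demonstrate in isolation:

```shell
# The right-hand side of a pipe runs in a subshell, so variable updates
# made inside the piped loop are discarded when the subshell exits.
COUNT=0
printf 'a\nb\n' | while IFS= read -r _; do COUNT=$((COUNT + 1)); done
echo "after piped loop: COUNT=$COUNT"        # still 0 in the parent shell

# A here-string keeps the loop in the parent shell, so updates persist.
COUNT=0
while IFS= read -r _; do COUNT=$((COUNT + 1)); done <<< "$(printf 'a\nb')"
echo "after here-string loop: COUNT=$COUNT"  # 2
```

(Process substitution, `done < <(...)`, would work equally well here; the here-string is just the smallest change.)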
```yaml
- name: Lint Python Files
  if: steps.changed.outputs.py_count != '0'
  run: |
    pip install flake8
```
Unpinned flake8 version
pip install flake8 installs whatever version is current. A future major release could change rules or output format and break the workflow. I'd suggest pinning to a specific version.
```diff
-pip install flake8
+pip install flake8==7.1.1
```
Restructure and harden the GitHub Actions CI workflows that run FSDP
distributed training regression tests on a remote Slurm cluster via SSH.
Workflow structural fixes:
- Fix YAML syntax errors in all 4 Slurm workflows (heredoc EOF terminators
at column 0 inside YAML block scalars)
- Remove broken EKS regression workflow (fsdp-eks-regression.yml)
- Replace heavyweight 814-line PR review workflow with lightweight
path-aware linter (pr-lint.yml)
- Update actions/stale from v5 to v9
- Resolve all actionlint and shellcheck findings
Slurm job monitoring improvements:
- Replace broken squeue-based job status detection with sacct exit code
checking (squeue returning empty was incorrectly treated as COMPLETED)
- Add SSH retry wrapper (ssh_cmd) with exponential backoff
- Add SSH keepalive settings (ServerAliveInterval/ServerAliveCountMax)
- Add dedicated enroot cleanup job to avoid race conditions
- Add inline error log dump (last 200 lines) before job failure exit
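The retry logic described above can be sketched generically. The function name, attempt count, and backoff timing below are illustrative assumptions, not the PR's exact code; the real `ssh_cmd` wrapper also passes the `ServerAliveInterval`/`ServerAliveCountMax` keepalive options to ssh.

```shell
# Generic sketch of a retry wrapper in the spirit of ssh_cmd().
# Runs the given command up to 3 times with a growing delay between tries.
retry_cmd() {
  local max_attempts=3 attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "failed after $max_attempts attempts: $*" >&2
      return 1
    fi
    sleep "$attempt"           # crude backoff: 1s, then 2s
    attempt=$((attempt + 1))
  done
}
```

In the workflows, every remote call in the multi-hour monitoring loop would go through such a wrapper so a single dropped connection no longer aborts the run.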
Runtime fixes:
- Skip sudo apt install in create_venv.sh when sudo is unavailable
- Fix shell variable escaping for sbatch filename in heredoc
- Insert venv activation after last SBATCH directive (not line 2)
- Add /opt/slurm/bin to PATH for non-login SSH sessions
- Fix LD_PRELOAD path to system NCCL (/lib/x86_64-linux-gnu/libnccl.so)
- Pass HF_TOKEN to Slurm jobs via sed injection
HuggingFace data loading (fixes HTTP 429 rate limiting):
- Pre-download C4 dataset shards to local FSx storage on the cluster
- Generate a JSON manifest of local file paths per split
- Modify train_utils.py to use load_dataset('json', data_files=...)
when HF_DATA_FILES_MANIFEST env var points to a valid manifest
- Set HF_HOME and HF_DATASETS_CACHE for shared FSx caching
- Set HF_HUB_OFFLINE=1 in sbatch files to block runtime API calls
- Pre-cache tokenizers in the workflow pre-download step
Test matrix:
- Align both venv and container workflows to cluster: [p5],
model_config: [llama3_1_8b, llama3_1_70b]
- Set NCCL_DEBUG=WARN (reduced from INFO) in all sbatch files
Summary
Restructure and harden the GitHub Actions CI workflows that run FSDP distributed training regression tests on a remote Slurm cluster via SSH.
Changes
Workflow structural fixes
- Fix YAML syntax errors in all 4 Slurm workflows (heredoc `EOF` terminators at column 0 inside YAML block scalars)
- Remove broken EKS regression workflow (`fsdp-eks-regression.yml`)
- Replace heavyweight 814-line PR review workflow with lightweight path-aware linter (`pr-lint.yml`)
- Update `actions/stale` from v5 to v9

Slurm job monitoring improvements
- Replace broken `squeue`-based job status detection with `sacct` exit code checking (`squeue` returning empty was incorrectly treated as COMPLETED)
- Add SSH retry wrapper (`ssh_cmd`) with exponential backoff
- Add SSH keepalive settings (`ServerAliveInterval`/`ServerAliveCountMax`)

Runtime fixes
- Skip `sudo apt install` in `create_venv.sh` when sudo is unavailable
- Insert venv activation after last `#SBATCH` directive (not line 2)
- Add `/opt/slurm/bin` to `PATH` for non-login SSH sessions
- Fix `LD_PRELOAD` path to system NCCL (`/lib/x86_64-linux-gnu/libnccl.so`)
- Pass `HF_TOKEN` to Slurm jobs via sed injection

HuggingFace data loading (fixes HTTP 429 rate limiting)
- Modify `train_utils.py` to use `load_dataset("json", data_files=...)` when the `HF_DATA_FILES_MANIFEST` env var points to a valid manifest
- Set `HF_HOME` and `HF_DATASETS_CACHE` for shared FSx caching
- Set `HF_HUB_OFFLINE=1` in sbatch files to block runtime API calls

Test matrix
- Align both venv and container workflows to `cluster: [p5]`, `model_config: [llama3_1_8b, llama3_1_70b]`
- Set `NCCL_DEBUG=WARN` (reduced from `INFO`) in all sbatch files

Files changed (19)
- Workflows: `fsdp-regression-test-venv.yml`, `fsdp-regression-test-container.yml`, `megatron-ci-slurm.yaml`, `closing-soon.yml`, `pr-lint.yml` (new)
- Removed: `fsdp-eks-regression.yml`, `pr-review-and-slurm-test.yml`
- `3.test_cases/pytorch/FSDP/slurm/*-training.sbatch`: LD_PRELOAD, HF caching, NCCL_DEBUG
- `create_venv.sh`: sudo fallback
- `train_utils.py`: manifest-based local data loading
- `.gitignore`: added `log.failed`

Testing
Validated through 15+ CI workflow runs on the p5 cluster, iteratively fixing issues from OIDC configuration through data loading.