fix: overhaul CI workflows for FSDP regression tests#1024

Open
paragao wants to merge 1 commit into main from fix/ci-workflows

Conversation


@paragao paragao commented Mar 17, 2026

Summary

Restructure and harden the GitHub Actions CI workflows that run FSDP distributed training regression tests on a remote Slurm cluster via SSH.

Changes

Workflow structural fixes

  • Fix YAML syntax errors in all 4 Slurm workflows (heredoc EOF terminators at column 0 inside YAML block scalars)
  • Remove broken EKS regression workflow (fsdp-eks-regression.yml)
  • Replace heavyweight 814-line PR review workflow with lightweight path-aware linter (pr-lint.yml)
  • Update actions/stale from v5 to v9
  • Resolve all actionlint and shellcheck findings

Slurm job monitoring improvements

  • Replace broken squeue-based job status detection with sacct exit code checking (squeue returning empty was incorrectly treated as COMPLETED)
  • Add SSH retry wrapper (ssh_cmd) with exponential backoff
  • Add SSH keepalive settings (ServerAliveInterval/ServerAliveCountMax)
  • Add dedicated enroot cleanup job to avoid race conditions
  • Add inline error log dump (last 200 lines) before job failure exit
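The SSH retry wrapper can be sketched as a generic retry-with-exponential-backoff function (a minimal illustration, not the PR's exact ssh_cmd; delays are shortened for readability):

```shell
#!/usr/bin/env bash
# Sketch: retry a command up to 3 times with exponential backoff.
# Names and delays are illustrative; the workflow's ssh_cmd wraps ssh.
retry_cmd() {
  local attempt=1 max_attempts=3 delay=1
  while true; do
    if "$@"; then
      return 0                       # success, stop retrying
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry_cmd: giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))             # exponential backoff: 1s, 2s, 4s, ...
    attempt=$((attempt + 1))
  done
}

# The workflow's wrapper presumably looks something like:
#   ssh_cmd() {
#     retry_cmd ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
#       "$CLUSTER_HOST" "$@"
#   }
```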

Runtime fixes

  • Skip sudo apt install in create_venv.sh when sudo is unavailable
  • Fix shell variable escaping for sbatch filename in heredoc
  • Insert venv activation after last #SBATCH directive (not line 2)
  • Add /opt/slurm/bin to PATH for non-login SSH sessions
  • Fix LD_PRELOAD path to system NCCL (/lib/x86_64-linux-gnu/libnccl.so)
  • Pass HF_TOKEN to Slurm jobs via sed injection
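The "insert after the last #SBATCH directive" fix boils down to locating the last directive and appending after it; a standalone sketch (hypothetical paths, not the PR's exact sed expression):

```shell
#!/usr/bin/env bash
# Build a toy sbatch file, then insert venv activation after the LAST
# '#SBATCH' directive instead of hard-coding line 2.
sbatch_file=$(mktemp)
cat > "$sbatch_file" <<'EOF'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --job-name=fsdp-test
srun python train.py
EOF

# Find the line number of the last #SBATCH directive...
last_sbatch=$(grep -n '^#SBATCH' "$sbatch_file" | tail -n1 | cut -d: -f1)
# ...then append the activation line right after it (GNU sed 'a' command).
sed -i "${last_sbatch}a source /fsx/venv/bin/activate" "$sbatch_file"
```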

HuggingFace data loading (fixes HTTP 429 rate limiting)

  • Pre-download C4 dataset shards to local FSx storage on the cluster
  • Generate a JSON manifest of local file paths per split
  • Modify train_utils.py to use load_dataset("json", data_files=...) when HF_DATA_FILES_MANIFEST env var points to a valid manifest
  • Set HF_HOME and HF_DATASETS_CACHE for shared FSx caching
  • Set HF_HUB_OFFLINE=1 in sbatch files to block runtime API calls
  • Pre-cache tokenizers in the workflow pre-download step
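The manifest-based fallback in train_utils.py presumably reduces to logic like the following (a sketch; resolve_data_files and the manifest shape are assumptions, not the PR's exact code):

```python
import json
import os

def resolve_data_files():
    """Return a {split: [local paths]} dict when HF_DATA_FILES_MANIFEST
    points at a valid manifest file, else None (caller falls back to the
    Hub). Hypothetical helper; the actual train_utils.py change may differ.
    """
    manifest_path = os.environ.get("HF_DATA_FILES_MANIFEST", "")
    if not manifest_path or not os.path.isfile(manifest_path):
        return None
    with open(manifest_path) as f:
        # Expected shape: {"train": ["/fsx/c4/train-00000.json.gz", ...],
        #                  "validation": [...]}
        return json.load(f)

# Usage sketch:
# data_files = resolve_data_files()
# if data_files is not None:
#     dataset = load_dataset("json", data_files=data_files)  # local FSx files
# else:
#     dataset = load_dataset("allenai/c4", "en")              # Hub fallback
```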

Test matrix

  • Align both venv and container workflows: cluster: [p5], model_config: [llama3_1_8b, llama3_1_70b]
  • Set NCCL_DEBUG=WARN (reduced from INFO) in all sbatch files
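In GitHub Actions terms, the aligned matrix presumably looks like this (the fail-fast setting is an assumption, not confirmed by the PR):

```yaml
strategy:
  fail-fast: false          # assumption: let sibling matrix entries finish
  matrix:
    cluster: [p5]
    model_config: [llama3_1_8b, llama3_1_70b]
```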

Files changed (19)

  • Workflows: fsdp-regression-test-venv.yml, fsdp-regression-test-container.yml, megatron-ci-slurm.yaml, closing-soon.yml, pr-lint.yml (new)
  • Workflows removed: fsdp-eks-regression.yml, pr-review-and-slurm-test.yml
  • Sbatch files (9): all 3.test_cases/pytorch/FSDP/slurm/*-training.sbatch — LD_PRELOAD, HF caching, NCCL_DEBUG
  • Scripts: create_venv.sh — sudo fallback
  • Training code: train_utils.py — manifest-based local data loading
  • Other: .gitignore — added log.failed

Testing

Validated through 15+ CI workflow runs on the p5 cluster, iteratively fixing issues from OIDC configuration through data loading.


@KeitaW KeitaW left a comment


Review — Deployment Pipeline & Positives

Nice cleanup across the board — the sacct fix, SSH resilience, and heredoc corrections are all substantive improvements.

Security scanning (bandit) dropped without replacement

The old pr-review-and-slurm-test.yml ran bandit security scanning on all Python files. The new pr-lint.yml replaces it with only flake8 and bash -n. I understand the intent is a lightweight replacement (and the old workflow had YAML issues that made it non-functional), but this is a net reduction in security coverage.

I'd suggest either adding a bandit step to pr-lint.yml (scoped to changed files only, to keep it fast), or tracking the re-addition as a follow-up issue so it doesn't fall through the cracks.
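A changed-files-only bandit step could look roughly like this (hypothetical: it assumes the workflow's existing changed-files step exposes py_count/py_files outputs, and the version pin is illustrative):

```yaml
- name: Bandit scan (changed Python files only)
  if: steps.changed.outputs.py_count != '0'
  run: |
    pip install bandit==1.7.9
    # -ll: only report MEDIUM severity and above, keeping PR noise low
    echo "${{ steps.changed.outputs.py_files }}" | xargs -r bandit -ll
```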


Things That Look Great

  • sacct for proper job status detection: The old pattern (squeue empty → "COMPLETED") silently treated FAILED/OOM_KILLED jobs as successes. Using sacct to check the actual exit code is the correct fix and will catch real failures.
  • SSH retry wrapper with keepalive: The ssh_cmd() function with ServerAliveInterval=60 and 3-retry logic is a smart pattern for the 6-hour monitoring loops where SSH connections would previously drop silently.
  • Dedicated cleanup-enroot job: Moving enroot image cleanup to a separate job that needs: [build, run-tests] with if: always() eliminates the race condition where matrix entries could try to delete images still in use by sibling jobs.
  • Heredoc indentation fixes: The EOF terminator placement was genuinely broken (unindented EOF inside indented YAML blocks). Fixing this across all 4 workflow files makes them actually parseable.
  • ShellCheck compliance: Quoting $GITHUB_OUTPUT / $GITHUB_ENV redirects (SC2086) and grouping consecutive redirects into { } >> blocks (SC2129) are good hygiene that prevents subtle word-splitting bugs in CI.
  • Scoped PR linting: Only linting files changed in the PR (via git diff --name-only --diff-filter=ACMR) is much faster and avoids noise from pre-existing issues in the codebase.
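The sacct-based success check reduces to parsing state and exit code from one accounting line; a standalone sketch of that parsing (field layout assumed from sacct -X --noheader --format=State,ExitCode; the workflow's exact invocation may differ):

```shell
#!/usr/bin/env bash
# Classify a Slurm job from one sacct line such as "COMPLETED 0:0".
# In the workflow the line would come from something like:
#   sacct -j "$JOB_ID" -X --noheader --format=State,ExitCode
job_status() {
  local state exit_code
  state=$(echo "$1" | awk '{print $1}')
  exit_code=$(echo "$1" | awk '{print $2}' | cut -d: -f1)
  if [ "$state" = "COMPLETED" ] && [ "$exit_code" = "0" ]; then
    echo "COMPLETED"
  else
    echo "FAILED (state=$state, exit=$exit_code)"
  fi
}
```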

Comment on lines +85 to +101
ERRORS=0

echo "${{ steps.changed.outputs.sh_files }}" | while IFS= read -r f; do
  [ -z "$f" ] && continue
  [ -f "$f" ] || continue
  echo "  Checking: $f"
  if ! bash -n "$f" 2>&1; then
    ERRORS=$((ERRORS + 1))
  fi
done

if [ "$ERRORS" -gt 0 ]; then
  echo "Found syntax errors in $ERRORS shell script(s)"
  exit 1
fi

echo "All shell scripts passed syntax check"

Shell syntax check never fails due to subshell variable scope

The ERRORS counter is incremented inside a while loop that reads from a pipe (echo | while). In bash, the right side of a pipe runs in a subshell, so ERRORS is always 0 in the parent shell after the loop exits. The if [ "$ERRORS" -gt 0 ] check will never trigger, even when bash -n reports syntax errors.

I'd suggest using a here-string to keep the loop in the parent shell:

Suggested change

Before:

    ERRORS=0
    echo "${{ steps.changed.outputs.sh_files }}" | while IFS= read -r f; do
      [ -z "$f" ] && continue
      [ -f "$f" ] || continue
      echo "  Checking: $f"
      if ! bash -n "$f" 2>&1; then
        ERRORS=$((ERRORS + 1))
      fi
    done
    if [ "$ERRORS" -gt 0 ]; then
      echo "Found syntax errors in $ERRORS shell script(s)"
      exit 1
    fi
    echo "All shell scripts passed syntax check"

After:

    ERRORS=0
    while IFS= read -r f; do
      [ -z "$f" ] && continue
      [ -f "$f" ] || continue
      echo "  Checking: $f"
      if ! bash -n "$f" 2>&1; then
        ERRORS=$((ERRORS + 1))
      fi
    done <<< "${{ steps.changed.outputs.sh_files }}"
    if [ "$ERRORS" -gt 0 ]; then
      echo "Found syntax errors in $ERRORS shell script(s)"
      exit 1
    fi
    echo "All shell scripts passed syntax check"
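The subshell pitfall is easy to reproduce in isolation (minimal bash demonstration):

```shell
#!/usr/bin/env bash
# In bash, each side of a pipeline runs in a subshell, so the increment
# below is lost when the loop exits:
count=0
printf 'a\nb\n' | while IFS= read -r _; do count=$((count + 1)); done
echo "pipe loop counted: $count"         # prints 0

# A here-string keeps the loop in the parent shell, so state survives:
count=0
while IFS= read -r _; do count=$((count + 1)); done <<< "$(printf 'a\nb')"
echo "here-string loop counted: $count"  # prints 2
```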

- name: Lint Python Files
  if: steps.changed.outputs.py_count != '0'
  run: |
    pip install flake8

Unpinned flake8 version

pip install flake8 installs whatever version is current. A future major release could change rules or output format and break the workflow. I'd suggest pinning to a specific version.

Suggested change

Before:

    pip install flake8

After:

    pip install flake8==7.1.1

@KeitaW KeitaW left a comment


Overall LGTM, a few nits.

@paragao paragao changed the title from "fix: overhaul CI workflows -- fix YAML errors, improve robustness, clean up linting" to "fix: overhaul CI workflows for FSDP regression tests" on Mar 19, 2026