SLURM GPU job performance monitoring tool.
This tool monitors GPU utilization for SLURM array jobs, identifying which specific GPU each job is using and collecting performance statistics.
- Job-GPU Mapping: Identifies which GPU each SLURM job is using by matching job processes with GPU processes
- Real-time Monitoring: Continuously monitors GPU utilization, memory usage, and performance metrics
- Parallel Monitoring: Can monitor multiple jobs simultaneously using threading
- Statistical Analysis: Generates summary statistics and efficiency reports
- Visualization: Creates plots showing GPU utilization over time
- SLURM Integration: Can be run as a SLURM job itself for scalability
# Python packages
pip install pandas matplotlib numpy
# System requirements
- SSH access to compute nodes
- nvidia-smi available on GPU nodes
- SLURM utilities (squeue, scontrol)

# Basic usage - monitor $USER's jobs indefinitely
python3 slurm_gpu_monitor.py --username $USER
# Monitor for specific duration (300 seconds)
python3 slurm_gpu_monitor.py --username $USER --duration 300
# Specify monitoring interval and output directory
python3 slurm_gpu_monitor.py --username $USER --interval 10 --output-dir my_results
# Save raw data to CSV
python3 slurm_gpu_monitor.py --username $USER --csv-file raw_data.csv

# Make the batch script executable
chmod +x slurm_gpu_monitor.sh
# Submit monitoring job
sbatch slurm_gpu_monitor.sh --username $USER
# Monitor for specific duration
sbatch slurm_gpu_monitor.sh --username $USER --duration 1800
# With custom settings
sbatch slurm_gpu_monitor.sh --username $USER --interval 10 --output-dir results_$(date +%Y%m%d)

# Start monitoring
python3 slurm_gpu_monitor.py --username $USER --interval 5
# The script will:
# 1. Find all running GPU jobs for the user
# 2. For each job, identify which GPU it's using
# 3. Monitor GPU utilization continuously
# 4. Save data and generate reports when stopped (Ctrl+C)

The tool generates several output files:
- job_summary.csv: Summary statistics for each monitored job
- gpu_monitoring_plots.png: Combined visualization of all jobs
- job_<ID>_monitoring.png: Individual plots for each job
- report.txt: Detailed text report with statistics
- raw_data.csv: Raw monitoring data (if --csv-file is specified)
gpu_monitoring_output/
├── job_summary.csv
├── gpu_monitoring_plots.png
├── job_245126_monitoring.png
├── job_245127_monitoring.png
├── report.txt
└── raw_data.csv (optional)
How the tool works:
- Job Discovery: Uses squeue to find running GPU jobs for the specified user
- GPU Identification (see the sketch below):
  - Gets job processes using ps on each compute node
  - Gets GPU processes using nvidia-smi --query-compute-apps
  - Matches job PIDs with GPU processes to identify which GPU each job uses
- Monitoring: Continuously queries GPU utilization with nvidia-smi for the identified GPU
- Data Collection: Stores timestamped utilization data for each job
- Analysis: Generates statistics and visualizations when monitoring completes
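For reference, below is a minimal sketch of how such a PID-to-GPU mapping can be built from two nvidia-smi queries over SSH. The function name get_gpu_processes mirrors the step above, but the implementation and the returned fields are assumptions, not the script's verbatim code:

# Sketch (assumed implementation): map GPU compute processes on a node to GPU indices
import subprocess

def get_gpu_processes(node):
    """Return {pid: {'gpu_index': int, 'used_memory_mib': int}} for one node."""
    # --query-compute-apps reports GPU UUIDs, so first build a UUID -> index table
    uuid_out = subprocess.check_output(
        ["ssh", node, "nvidia-smi", "--query-gpu=uuid,index",
         "--format=csv,noheader"], text=True)
    uuid_to_index = {}
    for line in uuid_out.strip().splitlines():
        uuid, index = [field.strip() for field in line.split(",")]
        uuid_to_index[uuid] = int(index)

    # then list every compute process currently running on any GPU of that node
    apps_out = subprocess.check_output(
        ["ssh", node, "nvidia-smi",
         "--query-compute-apps=pid,gpu_uuid,used_memory",
         "--format=csv,noheader,nounits"], text=True)
    processes = {}
    for line in apps_out.strip().splitlines():
        if not line.strip():
            continue  # no compute processes running on this node
        pid, gpu_uuid, used_memory = [field.strip() for field in line.split(",")]
        processes[int(pid)] = {
            "gpu_index": uuid_to_index.get(gpu_uuid),
            "used_memory_mib": int(used_memory),
        }
    return processes

With such a mapping in hand, checking whether a job's PIDs appear in it yields both the match and the GPU index, as the pseudocode later in this section shows.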
The job_summary.csv output looks like this:
job_id,node,gpu_index,duration_minutes,avg_gpu_utilization,max_gpu_utilization,avg_memory_utilization,max_memory_utilization,avg_memory_used_gb,max_memory_used_gb
245126,somagpu084,1,45.2,78.5,95.0,65.3,82.1,12.4,15.8
245127,somagpu084,2,45.1,82.1,98.2,71.2,89.4,13.7,16.2
245128,somagpu084,3,45.0,15.2,45.8,12.4,28.7,2.8,6.4
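Because the summary is plain CSV, it is easy to post-process. For example, a small pandas snippet to flag under-utilized jobs (the 30% cutoff is an arbitrary illustration, not something the tool enforces):

# Flag jobs whose average GPU utilization looks low (threshold is illustrative)
import pandas as pd

summary = pd.read_csv("gpu_monitoring_output/job_summary.csv")
underused = summary[summary["avg_gpu_utilization"] < 30]
print(underused[["job_id", "node", "gpu_index", "avg_gpu_utilization"]])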
The script uses a sophisticated approach to identify which GPU each job is using:
# 1. Get all processes for the job
job_processes = get_job_processes(node, job_id)
# 2. Get GPU processes and their GPU assignments
gpu_processes = get_gpu_processes(node) # nvidia-smi --query-compute-apps
# 3. Find intersection - which job processes are using GPUs
for process in job_processes:
    if process.pid in gpu_processes:
        gpu_index = gpu_processes[process.pid]['gpu_index']
        # Monitor this specific GPU

The tool can monitor multiple jobs simultaneously using threading:
from concurrent.futures import ThreadPoolExecutor

# Each job gets its own monitoring thread
with ThreadPoolExecutor(max_workers=len(jobs)) as executor:
    for job in jobs:
        executor.submit(monitor_job, job)

The tool generates comprehensive statistics (a sketch of the computation follows the list):
- Average and maximum GPU utilization
- Memory usage patterns
- Job efficiency metrics
- Duration and performance summaries
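As an illustration, a summary row like those in job_summary.csv can be reduced from a job's timestamped samples roughly as follows; the sample keys and the helper name summarize_job are assumptions that mirror the CSV columns, not the script's exact internals:

# Sketch: reduce one job's timestamped samples to a summary row
import pandas as pd

def summarize_job(job_id, node, gpu_index, samples):
    # samples: list of dicts with 'timestamp', 'gpu_util', 'mem_util', 'mem_used_gb'
    df = pd.DataFrame(samples)
    duration_min = (df["timestamp"].max() - df["timestamp"].min()).total_seconds() / 60
    return {
        "job_id": job_id,
        "node": node,
        "gpu_index": gpu_index,
        "duration_minutes": round(duration_min, 1),
        "avg_gpu_utilization": round(df["gpu_util"].mean(), 1),
        "max_gpu_utilization": round(df["gpu_util"].max(), 1),
        "avg_memory_utilization": round(df["mem_util"].mean(), 1),
        "max_memory_utilization": round(df["mem_util"].max(), 1),
        "avg_memory_used_gb": round(df["mem_used_gb"].mean(), 1),
        "max_memory_used_gb": round(df["mem_used_gb"].max(), 1),
    }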
If you want to monitor only specific job IDs, you can modify the script:
# Add job filtering in get_running_jobs()
target_jobs = ['245126', '245127', '245128']
jobs = [job for job in jobs if job['job_id'] in target_jobs]

Use different monitoring intervals for different scenarios:
# High-frequency monitoring for short jobs
python3 slurm_gpu_monitor.py --username $USER --interval 1
# Low-frequency monitoring for long jobs
python3 slurm_gpu_monitor.py --username $USER --interval 30

For large job arrays, submit the monitoring as a separate job:
# Submit your job array
sbatch --array=1-100 your_gpu_job.sh
# Submit monitoring job
sbatch slurm_gpu_monitor.sh --username $USER --duration 3600

For real-time monitoring, you can extend the script to output live data (a fuller sampling sketch follows the snippet):
# Add real-time output in monitor_job()
print(f"Job {job_id}: GPU {gpu_util}%, Memory {mem_util}%")-
Common issues and quick checks:
- SSH Connection Issues
  # Test SSH access to compute nodes
  ssh somagpu084 'nvidia-smi'
- Permission Issues
  # Ensure you can access job information
  scontrol show job <job_id>
- Missing Dependencies
  # Install required packages
  pip install pandas matplotlib numpy
- GPU Not Detected
  - The job might not be using the GPU yet
  - Check whether the job is actually running GPU code
  - Verify that nvidia-smi works on the compute node
Add debug output to troubleshoot issues:
# In the script, add debugging
import logging
logging.basicConfig(level=logging.DEBUG)

A few performance considerations:
- Monitoring Overhead: Each SSH call has overhead; balance interval against accuracy
- Network Load: Multiple concurrent SSH connections hit the same node
- Storage: Raw data can grow large for long monitoring periods (one mitigation is sketched below)
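For the storage point, one simple mitigation is to stream raw samples to the CSV as they arrive instead of holding everything in memory; a sketch, with column names that are assumptions matching the samples above:

# Sketch: append samples to disk incrementally so memory stays flat on long runs
import csv
import os

FIELDS = ["timestamp", "job_id", "gpu_index", "gpu_util", "mem_util", "mem_used_mib"]

def append_samples(csv_path, rows):
    # write the header only when creating the file for the first time
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)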
The tool integrates well with SLURM workflows:
# Submit your GPU jobs and capture the array job ID
ARRAY_JOB_ID=$(sbatch --parsable --array=1-50 gpu_job.sh)

# Monitor them once the array job starts running
sbatch --dependency=after:${ARRAY_JOB_ID} slurm_gpu_monitor.sh --username $USER
# Or run monitoring in parallel
sbatch slurm_gpu_monitor.sh --username $USER --duration 1800 &
A typical end-to-end workflow:

- Submit your GPU job array:
  sbatch --array=1-100 sleap_inference.sh
- Start monitoring:
  python3 slurm_gpu_monitor.py --username $USER --interval 5
- Let it run until the jobs complete, or stop it with Ctrl+C
- Analyze the results:
  - Check job_summary.csv for efficiency metrics
  - Review the plots in gpu_monitoring_plots.png
  - Read the detailed report in report.txt
- Optimize based on the findings:
  - Jobs with low GPU utilization might need optimization
  - Memory usage patterns can inform resource requests
  - Duration data helps with time limit settings
This tool provides comprehensive insights into GPU utilization patterns, helping optimize resource usage and identify performance bottlenecks in SLURM GPU job arrays.