AdamDorwart/HPLdiag
This directory contains everything you need to run a single-node
HPL problem on Comet using 12 MPI processes and 2 threads per
process.
log/ Stores simplified logs of test results
results/ Stores individual job stdout/err files
compute/HPL.dat HPL input data for CPU tests
compute/hpl_1n_12p_2t.comet Slurm batch script for HPL benchmarks on Comet compute nodes
compute/xhpl.cpu HPL executable for hybrid (MPI+OpenMP) execution
compute/ssd-test.fio FIO input data used for SSD test
compute/fio Flexible IO tester. Used for SSD benchmarks
gpu/HPL.dat HPL input data for GPU tests (larger problem size for longer runtime)
gpu/hpl_1n_12p_4g_2t.comet Slurm batch script for HPL benchmarks on Comet gpu nodes
gpu/xhpl.gpu HPL executable for CUDA GPU execution
hosts List of Comet nodes with partitions
HPLdiag.sh Runs diagnostics on all nodes in a provided input file (see hosts) and puts the results in log/HPLresults.MMDDYY
HPLcancel.sh Cancels all running tests for the current $USER. Logs actions in log/HPLresults.MMDDYY
sendStatusEmail.sh Sends a status email on the health of the system given an input log file (located in log/); uses statusMailingList for recipients
statusEmail.txt Template used by the sendStatusEmail.sh script
statusMailingList Line-delimited file of email addresses to send status reports to
To run a diagnostic across all of Comet, use the provided hosts file:
$ ./HPLdiag.sh hosts
To run a diagnostic on particular nodes and partitions, list them in a file, one node and partition per line:
$ cat > smallTest
comet-xx-yy compute
comet-ww-zz compute
comet-ii-jj shared
$ ./HPLdiag.sh smallTest
The 'hosts' file can be generated with:
sinfo -o "%n %R" -h | column -t | sort -u -k1,1 > hosts
The end of the file will sometimes contain a line for a 'fat' partition. Remove it as needed.
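Removing that line can be scripted. The following is a minimal sketch, assuming the partition name is the second whitespace-separated column, as produced by the sinfo format above. The function name drop_partition is hypothetical and not part of this repository.

```shell
# Hypothetical helper: read "node partition" lines on stdin and drop
# every line whose second column equals the partition name given in $1.
drop_partition() {
    awk -v part="$1" '$2 != part'
}

# Example usage (writes to a temporary file first so hosts is never
# read and truncated at the same time):
# drop_partition fat < hosts > hosts.tmp && mv hosts.tmp hosts
```

Matching on the column rather than grepping for "fat" avoids dropping a node whose name happens to contain that string.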
TODO:
For GPU nodes, use combination of HPL and AMBER
For SSD tests, look into IOR and FIO. Note that drive performance degrades as sectors get marked for TRIM, so drives need to be periodically wiped to achieve full speed
Create Health Monitor cronjob to manage automated benchmarks
Analyze duration times of results
Better input handling for HPLdiag
- Accept piping or input file
- Options to run specific tests
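One possible shape for the "piping or input file" item is sketched below. This is not how HPLdiag.sh currently works; the function names read_hosts and list_targets are hypothetical, and the sketch only assumes the "node partition" line format used by the hosts file.

```shell
# Hypothetical helper: print the node list from the file named in $1,
# or from standard input when no argument (or "-") is given.
read_hosts() {
    if [ "$#" -ge 1 ] && [ "$1" != "-" ]; then
        cat -- "$1"
    else
        cat
    fi
}

# Hypothetical loop: emit one "node partition" pair per line and skip
# blank lines. A real script would submit a job here instead of printing.
list_targets() {
    read_hosts "$@" | while read -r node partition; do
        [ -n "$node" ] || continue
        printf '%s %s\n' "$node" "$partition"
    done
}

# Example usage:
# list_targets hosts
# grep compute hosts | list_targets
```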
Useful commands:
Get the status of all nodes currently enqueued
sinfo -o "%n %T %E" -n `squeue -h -o %n -u $USER | sed ':a;N;$!ba;s/\n/,/g'`
Create a hosts retest file from the TIMEOUTs in a result file
grep 'HPL-.*-TIMEOUT' log/HPLresults.$timestamp | awk '{print $4}' | grep -f - hosts > retestHosts
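The retest command can be wrapped in a small function. This is a sketch under two assumptions taken from the command itself: the node name is the fourth whitespace-separated field of each log line, and timed-out results carry a token of the form HPL-<something>-TIMEOUT. The function name retest_hosts is hypothetical. The sketch also matches node names as fixed whole words, so that one node name cannot match another that merely begins with it.

```shell
# Hypothetical wrapper around the retest command.
# $1: result log file, $2: hosts file.
# Prints the hosts-file lines for every node that timed out.
retest_hosts() {
    grep -e 'HPL-.*-TIMEOUT' -- "$1" \
        | awk '{print $4}' \
        | sort -u \
        | grep -w -F -f - -- "$2"
}

# Example usage:
# retest_hosts "log/HPLresults.$timestamp" hosts > retestHosts
```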
About
A Cron job distributed system monitor. Performs periodic diagnostics using the High Performance Linpack benchmark to test for abnormal behavior in an HPC system.