Retry SLURM job submission #58
PerilousApricot wants to merge 1 commit into opensciencegrid:v1_18_bosco
Under load, the SLURM scheduler is prone to barf on any client commands. Retry the job submission if it fails.
    retcode=$?
    retry=0
    MAX_RETRY=3
    until [ $retry -eq $MAX_RETRY ] ; do
This looks like the number of attempts to submit is 3 but the number of retries is actually 2, right?
The first attempt is retry=0; once retry=3, the loop condition fails and the loop exits. That should be three tries, right?
Yeah, that's 3 tries total but only 2 retries, so we should either call the variable MAX_TRIES or bump the initial value of retry.
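To make the counting in this thread concrete, here is a minimal sketch of the same loop shape with a counter added. `submit_job` and `attempts` are hypothetical names introduced only for illustration; the stub always fails so that every iteration runs, and the `sleep` is left out to keep the sketch quick.

```shell
#!/bin/bash
# Hypothetical stand-in for the real sbatch call; it always fails so
# that the loop runs until the limit is reached.
submit_job() { return 1; }

retry=0
MAX_RETRY=3
attempts=0   # hypothetical counter, not part of the PR

until [ "$retry" -eq "$MAX_RETRY" ]; do
    attempts=$((attempts + 1))
    if submit_job; then
        break
    fi
    retry=$((retry + 1))
done

# One initial attempt plus the retries that followed it.
echo "attempts=$attempts retries=$((attempts - 1))"
```

With MAX_RETRY=3 this prints `attempts=3 retries=2`, which is the distinction being drawn above: the limit bounds total tries, not retries.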
    jobID=`${slurm_binpath}/sbatch $bls_tmp_file` # actual submission
    retcode=$?
    retry=0
    MAX_RETRY=3
Could we add this as a config variable slurm_max_submit_retries here, defaulting to 0, and reference it via ${slurm_max_submit_retries}?
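A rough sketch of what that could look like, assuming `slurm_max_submit_retries` is set by the sourced configuration and falls back to 0 when unset. `submit_job`, `max_retries`, and `tries` are hypothetical names; the stub stands in for the sbatch call and always fails.

```shell
#!/bin/bash
# Assumption: slurm_max_submit_retries comes from the sourced config file.
# The ":-0" fallback gives the proposed default of zero retries.
max_retries=${slurm_max_submit_retries:-0}

# Hypothetical stand-in for the sbatch call; always fails in this sketch.
submit_job() { return 1; }

tries=0
retcode=1
# The first pass is the initial submission; each later pass is a retry,
# so the loop runs at most max_retries + 1 times.
while [ "$tries" -le "$max_retries" ]; do
    tries=$((tries + 1))
    if submit_job; then
        retcode=0
        break
    else
        retcode=$?
    fi
done

echo "tries=$tries retcode=$retcode"
```

Counting from the initial submission means a default of 0 preserves the current single-attempt behavior, and the variable name matches what it actually limits.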
    break
    fi
    retry=$[$retry+1]
    sleep 10
Could we make the sleep back off exponentially?
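One possible shape for that, as a sketch rather than a drop-in patch: double the delay after each failed attempt and skip the sleep after the last one. `submit_job`, `MAX_TRIES`, `BASE_DELAY`, and `delays` are hypothetical names, and the 1-second base is chosen only so the sketch finishes quickly (the PR sleeps a fixed 10 seconds).

```shell
#!/bin/bash
# Hypothetical stand-in for the sbatch call; always fails in this sketch.
submit_job() { return 1; }

MAX_TRIES=3                 # assumed limit, for illustration only
delay=${BASE_DELAY:-1}      # hypothetical base delay in seconds
delays=""                   # records the waits so they can be inspected
tries=0

while [ "$tries" -lt "$MAX_TRIES" ]; do
    tries=$((tries + 1))
    if submit_job; then
        break
    fi
    # Only wait if another attempt will follow.
    if [ "$tries" -lt "$MAX_TRIES" ]; then
        delays="$delays $delay"
        sleep "$delay"
        delay=$((delay * 2))   # exponential backoff: 1, 2, 4, ...
    fi
done

echo "tries=$tries waited:$delays"
```

With the values above the waits are 1 and 2 seconds. A production version would probably also cap the delay so a long retry chain cannot stall the submit script indefinitely.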
Hi @PerilousApricot, we've had some discussions locally that we actually want to disable job retries on the CE completely (https://opensciencegrid.atlassian.net/browse/SOFTWARE-3407) since pilots are just resource requests. Would this have a negative impact for a site, say getting dinged by the WLCG for failing pilots?
TBH, I don't know the WLCG effects, and I'm a little unsure of the context from the JIRA ticket as well.
The idea is that pilots are cheap, fairly uniform resource requests, so if a CE fails to submit a pilot job to the batch system, it should just give up and wait for more pilots. I was wondering if pilot success/failure at sites is tracked closely.
Under load, the SLURM scheduler is prone to barf on any client commands.
Retry the job submission if it fails.
The retry settings should be exposed as configurable values instead of being hardcoded.