feat: add job lifecycle tracking and retry logic for stuck jobs #49
base: main
Conversation
- Track jobs immediately upon acquisition to monitor the full lifecycle (see the sketch after this description)
- Implement a maximum of 3 provisioning retries with 15-second intervals
- Add background monitoring for jobs stuck over 5 minutes
- Enhance logging at each job state transition
- Add 8 unit tests for job tracking and concurrent access safety

Fixes issues where jobs remain stuck indefinitely when:
- Provisioning fails and jobs stay acquired without retry
- Jobs are acquired but never assigned by GitHub Actions
- Container crashes leave jobs in a limbo state

Tests: make test passes with 23 total specs
Lint: make lint passes with no issues
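For illustration, here is a minimal sketch of what the tracking side could look like. The `AcquiredJobInfo` name is taken from this PR's changed files; every field and helper name below is an assumption, not the PR's actual code.

```go
package runners

import (
	"sync"
	"time"
)

// AcquiredJobInfo records when a job was acquired and how far it has progressed.
// The struct name comes from this PR; the fields shown are illustrative assumptions.
type AcquiredJobInfo struct {
	JobID           string
	RunnerRequestID int64
	AcquiredAt      time.Time
	State           string // e.g. "acquired", "provisioning", "assigned"
}

// jobTracker guards the tracking map so the message processor and the
// background monitor can access it concurrently.
type jobTracker struct {
	mu   sync.Mutex
	jobs map[string]AcquiredJobInfo
}

func newJobTracker() *jobTracker {
	return &jobTracker{jobs: make(map[string]AcquiredJobInfo)}
}

// track records (or updates) a job keyed by its unique job ID.
func (t *jobTracker) track(info AcquiredJobInfo) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.jobs[info.JobID] = info
}

// remove forgets a job once it has completed or been cleaned up.
func (t *jobTracker) remove(jobID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.jobs, jobID)
}
```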
fix: implement active cleanup for stuck jobs and add provisioning timeout

Jobs stuck for >5 minutes are now automatically removed from tracking and marked as canceled, allowing VM cleanup to proceed. Added 10-minute timeout per provisioning attempt to prevent indefinite hangs from SSH or network issues.
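Continuing the tracker sketch above (still only a sketch, assuming the same types plus the standard `context` package), the background cleanup described here could be a goroutine that periodically sweeps the map. The one-minute poll interval and the method name are assumptions.

```go
const stuckJobTimeout = 5 * time.Minute

// monitorStuckJobs drops jobs that have been tracked for longer than
// stuckJobTimeout so VM cleanup can proceed.
func (t *jobTracker) monitorStuckJobs(ctx context.Context) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			t.mu.Lock()
			for id, info := range t.jobs {
				if time.Since(info.AcquiredAt) > stuckJobTimeout {
					// Stop tracking the stuck job; the caller would also mark
					// it as canceled so the VM can be reclaimed.
					delete(t.jobs, id)
				}
			}
			t.mu.Unlock()
		}
	}
}
```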
Thank you for your contribution. There are several important things that are worth discussing:
As you can see, the second job overwrote the first one instead of creating its own entry.
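A minimal, self-contained illustration of the failure mode being described, assuming the entries end up under a shared map key rather than a per-job one (the actual key used by the PR is not shown here):

```go
package main

import "fmt"

type jobInfo struct{ jobID string }

func main() {
	// Two jobs stored under the same (non-unique) key: the second assignment
	// silently replaces the first, so only one job stays tracked.
	tracked := map[string]jobInfo{}
	tracked["shared-key"] = jobInfo{jobID: "job-a"}
	tracked["shared-key"] = jobInfo{jobID: "job-b"} // overwrites job-a
	fmt.Println(len(tracked), tracked["shared-key"].jobID) // prints: 1 job-b
}
```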
```go
p.logger.Infof("Provisioning runner for job %s (RunnerRequestId: %d), attempt %d/%d", jobId, runnerRequestId, attempt, maxProvisioningRetries)

// ...

// Create timeout context for this provisioning attempt
provisionCtx, cancel := context.WithTimeout(p.ctx, provisioningTimeout)
```
This timeout is not only for the provisioning, but also for the job duration.
Provisioning does the following:
- It creates the VM
- It creates the runner
- It runs the job
- It deletes the VM
If the timeout is hit before all of these operations finish, some of them may never complete.
Imagine the following scenario:
- A job run takes 11 minutes
- Everything is configured properly
- The 10-minute timeout expires while the job is still running
- VM delete does not happen because the context is already cancelled
- There is an orphaned VM in the cluster that will not be removed
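One possible way to keep a per-attempt timeout without losing the VM cleanup, shown only as a sketch with placeholder helper names (createVMAndRunJob, deleteVM), is to detach the deletion step from the timed-out context using context.WithoutCancel (Go 1.21+):

```go
// Sketch, not the PR's code: the provisioning/run phases stay under the
// timeout, while VM deletion runs on a context detached from that deadline.
provisionCtx, cancel := context.WithTimeout(p.ctx, provisioningTimeout)
defer cancel()

if err := p.createVMAndRunJob(provisionCtx, jobId); err != nil {
	p.logger.Infof("Provisioning/run for job %s did not finish cleanly: %v", jobId, err)
}

// Detached from provisionCtx's deadline, but still bounded on its own.
cleanupCtx, cleanupCancel := context.WithTimeout(context.WithoutCancel(p.ctx), 2*time.Minute)
defer cleanupCancel()
if err := p.deleteVM(cleanupCtx, jobId); err != nil {
	p.logger.Infof("Failed to delete VM for job %s: %v", jobId, err)
}
```

The separate timeout on the cleanup context keeps the delete call from hanging forever while still letting it run after the provisioning deadline has passed.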
Summary
Fixes job queuing and stale VM cleanup issues where jobs remain stuck indefinitely when provisioning fails or when acquired jobs never receive assignment messages from GitHub Actions.
Problem
Jobs could get stuck in GitHub's queue in multiple scenarios:
- Provisioning fails and jobs stay acquired without retry
- Jobs are acquired but never assigned by GitHub Actions
- Container crashes leave jobs in a limbo state
Related to MacStadium ticket SERVICE-203600
Solution
1. Job Lifecycle Tracking
2. Provisioning Retry Logic (see the sketch after this list)
3. Stuck Job Monitoring & Cleanup
4. Provisioning Timeout Protection
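As a rough sketch of how items 2 and 4 could fit together: the constants reflect the values stated in this PR, while the receiver type and the provision helper are illustrative assumptions.

```go
const (
	maxProvisioningRetries = 3                // from the PR description
	retryInterval          = 15 * time.Second // from the PR description
	provisioningTimeout    = 10 * time.Minute // from the commit message
)

// provisionWithRetry gives each attempt its own timeout context and retries
// failed attempts after a short pause.
func (p *messageProcessor) provisionWithRetry(jobId string, runnerRequestId int64) error {
	var lastErr error
	for attempt := 1; attempt <= maxProvisioningRetries; attempt++ {
		p.logger.Infof("Provisioning runner for job %s (RunnerRequestId: %d), attempt %d/%d",
			jobId, runnerRequestId, attempt, maxProvisioningRetries)

		attemptCtx, cancel := context.WithTimeout(p.ctx, provisioningTimeout)
		lastErr = p.provision(attemptCtx, jobId, runnerRequestId)
		cancel()
		if lastErr == nil {
			return nil
		}
		if attempt < maxProvisioningRetries {
			time.Sleep(retryInterval)
		}
	}
	return lastErr
}
```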
Changes
Files Modified:
- pkg/github/runners/types.go - Added AcquiredJobInfo struct and tracking fields
- pkg/github/runners/message-processor.go - Core implementation with active cleanup
- pkg/github/runners/message-processor_test.go - Unit tests (10 new tests)
Testing