@laurenchurch laurenchurch commented Dec 3, 2025

Summary

Fixes job-queuing and stale-VM-cleanup issues where jobs remain stuck indefinitely when provisioning fails or when acquired jobs never receive assignment messages from GitHub Actions.

Problem

Jobs could get stuck in GitHub's queue in multiple scenarios:

  1. Provisioning failures caused jobs to remain acquired indefinitely
  2. Jobs acquired by the integration but never assigned by GitHub had no visibility or timeout detection
  3. Stuck jobs (>5 minutes) were only logged but never cleaned up, preventing VM cleanup

Related to MacStadium ticket SERVICE-203600

Solution

1. Job Lifecycle Tracking

  • Track jobs immediately upon acquisition (not just assignment)
  • Monitor job state transitions through entire lifecycle
  • Clean up tracking when jobs start or complete

2. Provisioning Retry Logic

  • Max 3 provisioning attempts per job with 15-second intervals
  • Enhanced logging at each attempt with success/failure details
  • Proper cleanup after exhausting retries

3. Stuck Job Monitoring & Cleanup

  • Background monitoring checks every 2 minutes
  • Actively removes jobs stuck >5 minutes from tracking
  • Marks stuck jobs as canceled to stop provisioning attempts
  • Logs all stuck jobs on graceful shutdown

4. Provisioning Timeout Protection

  • 10-minute timeout per provisioning attempt
  • Prevents indefinite hangs from SSH/network issues
  • Specific logging for timeout vs other errors

Changes

Files Modified:

  • pkg/github/runners/types.go - Added AcquiredJobInfo struct and tracking fields
  • pkg/github/runners/message-processor.go - Core implementation with active cleanup
  • pkg/github/runners/message-processor_test.go - Unit tests (10 new tests)

Testing

 $ go test ./... -v

 Running Suite: Env Suite
 Ran 11 of 11 Specs in 0.001 seconds
 --- PASS: TestEnv (0.00s)
 PASS
 ok  	github.com/macstadium/orka-github-actions-integration/pkg/env

 Running Suite: Github Suite
 Ran 9 of 9 Specs in 0.000 seconds
 --- PASS: TestGithub (0.00s)
 PASS
 ok  	github.com/macstadium/orka-github-actions-integration/pkg/github

 === RUN   TestTrackAcquiredJob
 --- PASS: TestTrackAcquiredJob (0.00s)
 === RUN   TestRemoveAcquiredJob
 --- PASS: TestRemoveAcquiredJob (0.00s)
 === RUN   TestIsJobAcquired
 --- PASS: TestIsJobAcquired (0.00s)
 === RUN   TestGetAcquiredJobs
 --- PASS: TestGetAcquiredJobs (0.00s)
 === RUN   TestLogStuckJobs_NoStuckJobs
 --- PASS: TestLogStuckJobs_NoStuckJobs (0.00s)
 === RUN   TestLogStuckJobs_WithStuckJob
 --- PASS: TestLogStuckJobs_WithStuckJob (0.00s)
 === RUN   TestLogStuckJobs_WithStuckJobDefaultId
 --- PASS: TestLogStuckJobs_WithStuckJobDefaultId (0.00s)
 === RUN   TestCanceledJobFunctions
 --- PASS: TestCanceledJobFunctions (0.00s)
 === RUN   TestConcurrentJobTracking
 --- PASS: TestConcurrentJobTracking (0.00s)
 PASS
 ok  	github.com/macstadium/orka-github-actions-integration/pkg/github/runners

 Running Suite: Utils Suite
 Ran 3 of 3 Specs in 0.000 seconds
 --- PASS: TestUtils (0.00s)
 PASS
 ok  	github.com/macstadium/orka-github-actions-integration/pkg/utils

- Track jobs immediately upon acquisition to monitor full lifecycle
- Implement max 3 provisioning retries with 15-second intervals
- Add background monitoring for jobs stuck over 5 minutes
- Enhance logging at each job state transition
- Add 8 unit tests for job tracking and concurrent access safety

Fixes issues where jobs remain stuck indefinitely when:
- Provisioning fails and jobs stay acquired without retry
- Jobs are acquired but never assigned by GitHub Actions
- Container crashes leave jobs in limbo state

Tests: make test passes with 23 total specs
Lint: make lint passes with no issues
@laurenchurch laurenchurch marked this pull request as ready for review December 3, 2025 22:13
@laurenchurch laurenchurch requested a review from a team as a code owner December 3, 2025 22:13
Commit: fix: implement active cleanup for stuck jobs and add provisioning timeout

Jobs stuck for >5 minutes are now automatically removed from tracking and marked as canceled, allowing VM cleanup to proceed. Added a 10-minute timeout per provisioning attempt to prevent indefinite hangs from SSH or network issues.
ispasov (Collaborator) commented Jan 19, 2026

Thank you for your contribution.

There are several important things worth discussing:

  1. I would suggest splitting this PR into two: one PR for the max provisioning attempts, and one for job tracking.
    The reason is that we try to keep PRs as small and as focused as possible. This makes them easier to review, test, and ultimately merge.
  2. About max provisioning attempts - when they are reached, the VM and the runner are deleted. However, the job is neither canceled nor failed. A new runner is also not provisioned, and the job is stuck.
  3. About job tracking - the requestRunnerID is heavily utilized here. In fact, most of the time (if not all the time), the runner id is 0. This makes jobs overwrite each other.
    Here is an example in which I started two jobs at the same time

{"level":"info","ts":"2026-01-19T15:03:51+02:00","logger":"runner-message-processor-67","msg":"Tracked acquired job: RunnerRequestId=0, JobId=c1887df8-79ee-5090-8ab4-30ce8fe7c8a1"}

{"level":"info","ts":"2026-01-19T15:03:54+02:00","logger":"runner-message-processor-67","msg":"Updated acquired job with JobId: RunnerRequestId=0, JobId=5056206e-3e23-53d0-8b55-12ef57c0294d"}

As you can see the second job overwrote the first one, instead of creating its own entry.
  4. The acquire-job logic is mostly legacy and no longer used. The integration no longer receives JobAvailable messages; it only receives JobAssigned. Because the `RequestRunnerId` is also 0, the actual call to `acquire` the job is never made.

p.logger.Infof("Provisioning runner for job %s (RunnerRequestId: %d), attempt %d/%d", jobId, runnerRequestId, attempt, maxProvisioningRetries)

// Create timeout context for this provisioning attempt
provisionCtx, cancel := context.WithTimeout(p.ctx, provisioningTimeout)

This timeout is not only for the provisioning, but also for the job duration.
Provisioning does the following:

  1. It creates the VM
  2. It creates the runner
  3. It runs the job
  4. It deletes the VM

If the timeout is hit before all of these operations are finished, it is possible that they are not completed.

Imagine the following scenario:

  1. A job run takes 11 minutes
  2. Everything is configured properly
  3. VM delete does not happen as the context is already cancelled
  4. There is an orphaned VM in the cluster that will not be removed
