@laurenchurch laurenchurch commented Dec 3, 2025

Summary

Fixes job-queuing and stale-VM-cleanup issues where jobs remain stuck indefinitely when provisioning fails or when acquired jobs never receive assignment messages from GitHub Actions.

Problem

Jobs could get stuck in GitHub's queue in multiple scenarios:

  1. Provisioning failures caused jobs to remain acquired indefinitely
  2. Jobs acquired by the integration but never assigned by GitHub had no visibility or timeout detection
  3. Stuck jobs (>5 minutes) were only logged but never cleaned up, preventing VM cleanup

Related to MacStadium ticket SERVICE-203600

Solution

1. Job Lifecycle Tracking

  • Track jobs immediately upon acquisition (not just assignment)
  • Monitor job state transitions through entire lifecycle
  • Clean up tracking when jobs start or complete

2. Provisioning Retry Logic

  • Max 3 provisioning attempts per job with 15-second intervals
  • Enhanced logging at each attempt with success/failure details
  • Proper cleanup after exhausting retries

3. Stuck Job Monitoring & Cleanup

  • Background monitoring checks every 2 minutes
  • Actively removes jobs stuck >5 minutes from tracking
  • Marks stuck jobs as canceled to stop provisioning attempts
  • Logs all stuck jobs on graceful shutdown

4. Provisioning Timeout Protection

  • 10-minute timeout per provisioning attempt
  • Prevents indefinite hangs from SSH/network issues
  • Specific logging for timeout vs other errors

Changes

Files Modified:

  • pkg/github/runners/types.go - Added AcquiredJobInfo struct and tracking fields
  • pkg/github/runners/message-processor.go - Core implementation with active cleanup
  • pkg/github/runners/message-processor_test.go - Unit tests (10 new tests)

Testing

 $ go test ./... -v

 Running Suite: Env Suite
 Ran 11 of 11 Specs in 0.001 seconds
 --- PASS: TestEnv (0.00s)
 PASS
 ok  	github.com/macstadium/orka-github-actions-integration/pkg/env

 Running Suite: Github Suite
 Ran 9 of 9 Specs in 0.000 seconds
 --- PASS: TestGithub (0.00s)
 PASS
 ok  	github.com/macstadium/orka-github-actions-integration/pkg/github

 === RUN   TestTrackAcquiredJob
 --- PASS: TestTrackAcquiredJob (0.00s)
 === RUN   TestRemoveAcquiredJob
 --- PASS: TestRemoveAcquiredJob (0.00s)
 === RUN   TestIsJobAcquired
 --- PASS: TestIsJobAcquired (0.00s)
 === RUN   TestGetAcquiredJobs
 --- PASS: TestGetAcquiredJobs (0.00s)
 === RUN   TestLogStuckJobs_NoStuckJobs
 --- PASS: TestLogStuckJobs_NoStuckJobs (0.00s)
 === RUN   TestLogStuckJobs_WithStuckJob
 --- PASS: TestLogStuckJobs_WithStuckJob (0.00s)
 === RUN   TestLogStuckJobs_WithStuckJobDefaultId
 --- PASS: TestLogStuckJobs_WithStuckJobDefaultId (0.00s)
 === RUN   TestCanceledJobFunctions
 --- PASS: TestCanceledJobFunctions (0.00s)
 === RUN   TestConcurrentJobTracking
 --- PASS: TestConcurrentJobTracking (0.00s)
 PASS
 ok  	github.com/macstadium/orka-github-actions-integration/pkg/github/runners

 Running Suite: Utils Suite
 Ran 3 of 3 Specs in 0.000 seconds
 --- PASS: TestUtils (0.00s)
 PASS
 ok  	github.com/macstadium/orka-github-actions-integration/pkg/utils

- Track jobs immediately upon acquisition to monitor full lifecycle
- Implement max 3 provisioning retries with 15-second intervals
- Add background monitoring for jobs stuck over 5 minutes
- Enhance logging at each job state transition
- Add 8 unit tests for job tracking and concurrent access safety

Fixes issues where jobs remain stuck indefinitely when:
- Provisioning fails and jobs stay acquired without retry
- Jobs are acquired but never assigned by GitHub Actions
- Container crashes leave jobs in limbo state

Tests: make test passes with 23 total specs
Lint: make lint passes with no issues
@laurenchurch laurenchurch marked this pull request as ready for review December 3, 2025 22:13
@laurenchurch laurenchurch requested a review from a team as a code owner December 3, 2025 22:13
Commit: fix: implement active cleanup for stuck jobs and add provisioning timeout

Jobs stuck for >5 minutes are now automatically removed from tracking and marked as canceled, allowing VM cleanup to proceed. Added a 10-minute timeout per provisioning attempt to prevent indefinite hangs from SSH or network issues.
ispasov (Collaborator) commented Jan 19, 2026

Thank you for your contribution.

There are several important things worth discussing:

  1. I would suggest splitting this PR into two: one PR for the max provisioning attempts, and one for job tracking.
    The reason is that we try to keep PRs as small and as focused as possible. This makes them easier to review, test, and ultimately merge.
  2. About max provisioning attempts - when they are reached, the VM and the runner are deleted. However, the job is neither canceled nor failed. A new runner is also not provisioned, and the job is stuck.
  3. About job tracking - the requestRunnerID is heavily utilized here. In fact, most of the time (if not all the time), the runner id is 0. This makes jobs overwrite each other.
    Here is an example in which I started two jobs at the same time

{"level":"info","ts":"2026-01-19T15:03:51+02:00","logger":"runner-message-processor-67","msg":"Tracked acquired job: RunnerRequestId=0, JobId=c1887df8-79ee-5090-8ab4-30ce8fe7c8a1"}

{"level":"info","ts":"2026-01-19T15:03:54+02:00","logger":"runner-message-processor-67","msg":"Updated acquired job with JobId: RunnerRequestId=0, JobId=5056206e-3e23-53d0-8b55-12ef57c0294d"}

As you can see the second job overwrote the first one, instead of creating its own entry.
  4. The acquire-job logic is mostly legacy and no longer used. The integration no longer receives JobAvailable messages; it only receives JobAssigned. Because the `RequestRunnerId` is also 0, the actual call to `acquire` the job is never made.

p.logger.Infof("Provisioning runner for job %s (RunnerRequestId: %d), attempt %d/%d", jobId, runnerRequestId, attempt, maxProvisioningRetries)

// Create timeout context for this provisioning attempt
provisionCtx, cancel := context.WithTimeout(p.ctx, provisioningTimeout)

This timeout is not only for the provisioning, but also for the job duration.
Provisioning does the following:

  1. It creates the VM
  2. It creates the runner
  3. It runs the job
  4. It deletes the VM

If the timeout is hit before all of these operations are finished, it is possible that they are not completed.

Imagine the following scenario:

  1. A job run takes 11 minutes
  2. Everything is configured properly
  3. VM delete does not happen as the context is already cancelled
  4. There is an orphaned VM in the cluster that will not be removed
