Skip to content

Conversation

@owenmartin-toast
Copy link
Contributor

This change introduces a mechanism to ensure that runners are properly de-registered from GitHub before their associated VMs are deleted. A new method, ensureRunnerDeregistered, has been added to wait for the de-registration process, with a timeout and polling interval defined. If the runner does not de-register in time, it will be force-deleted. This enhances the reliability of the VM deletion process, and most importantly prevents a scenario where a GitHub Scale Set believes it has an active runner, and therefore will not provision any new ones for incoming requests

Copy link
Collaborator

@ispasov ispasov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution.
I have made a couple of comments. Let me know what you think.

Also, I have tested the proposed change and it is working great.

)

const (
runnerDeregistrationTimeout = 30 * time.Second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest making this part of the env config so that they can be easily changed.


func (p *RunnerProvisioner) deleteVM(ctx context.Context, runnerName string) {
// Wait for runner to de-register from GitHub before deleting VM
p.ensureRunnerDeregistered(ctx, runnerName)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why before deleting a VM?

Can we delete the VM so we can free up space for other runners, before we wait for the runner deregistration?

It is pretty much fire-and-forget anyways.

Also, the comment is not needed. Generally we add comments only to explain "why" if something is not obvious. Here we can see that we are waiting for the runner to be deregistered before we delete the VM.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleting before the de-registration happens should be fine, but what we've seen (and what the original fix is intended to resolve) is that sometimes the VMs were being removed before the de-registration process was even firing, which left those 'ghost' instances.
But, with the eventual-cleanup that's now happening, we're probably fine to delete the VM, rely on "normal" de-registration to happen in the majority of cases, and rely on the eventual-cleanup to do a force-de-registration in the cases when it doesn't happen via the VM.
I guess two things need answering:

  1. I have defaulted to delete-first-then-wait, but added configuration to allow people to switch to a wait-then-delete flow if they prefer - are you good with that?
  2. Do we want to simplify this, and always delete then always try to deregister from this code, regardless of whether the de-registration succeeded when invoked from the VM?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would simplify it - delete the VM first and try to deregister.
The reason is that a VM takes an actual quota from the cluster So there might be another one waiting for this to be deleted.
Why the runner does not take any quota.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I've done that, and the other comments have been addressed too. Thanks for your speedy feedback!


deadline := time.Now().Add(runnerDeregistrationTimeout)
for time.Now().Before(deadline) {
runner, err := p.actionsClient.GetRunner(ctx, runnerName)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be checking if the context is cancelled here before.

A more idiomatic way would be to rewrite this to use a Ticker and a select statement.
Something like:

for {
		select {
		case <-ctx.Done():
         ....
        case <-ticker.C:
        ....

This would also handle the context check. You can of course check it with the current implementation as well.

Add functionality to ensure GitHub Actions runners are properly deregistered,
preventing ghost runners from remaining in GitHub after VM deletion.

Approach:
- Delete VM first to free infrastructure resources immediately
- Then ensure runner deregistration (no-op if already deregistered)
- Force-delete runners that don't deregister within timeout

Features:
- Configurable deregistration timeout (default: 30s)
- Configurable polling interval for status checks (default: 2s)
- Idiomatic Go implementation using context, Ticker, and select patterns
- Proper cleanup with defer statements for resource management

Configuration:
- RUNNER_DEREGISTRATION_TIMEOUT: Maximum time to wait for runner deregistration
- RUNNER_DEREGISTRATION_POLL_INTERVAL: Interval between status checks

Implementation details:
- Use context.WithTimeout() for timeout management
- Use time.NewTicker() for polling intervals
- Use select statement for clean control flow
- Support context cancellation for graceful shutdown

Testing:
- Add comprehensive test suite for RunnerProvisioner
- Test runner deregistration success scenarios
- Test force-deletion on timeout
- Test context cancellation handling
- Test transient error handling

Co-Authored-By: Claude <noreply@anthropic.com>
Copy link
Collaborator

@ispasov ispasov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Thank you for your contribution.

My final suggestion would be to remove all inline comments (that are not func doc comments).
They do not really bring any value as they do not say why sth is done, but what is done. And this is already visible from the code

@owenmartin-toast
Copy link
Contributor Author

I removed the inline comments per your suggestion, thanks!

@ispasov ispasov merged commit 7836b01 into macstadium:main Jan 22, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants