Implement runner de-registration handling in RunnerProvisioner #51
Conversation
Force-pushed from ace558f to 3b1583c
ispasov left a comment
Thank you for your contribution.
I have made a couple of comments. Let me know what you think.
Also, I have tested the proposed change and it is working great.
)

const (
	runnerDeregistrationTimeout = 30 * time.Second
I would suggest making this part of the env config so that they can be easily changed.
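For illustration only, a minimal sketch of what an env-based override could look like, using the RUNNER_DEREGISTRATION_* variable names that ended up in the final commit message; the package name and the envDuration helper are assumptions, not code from the repo:

```go
package provisioner

import (
	"os"
	"time"
)

// envDuration reads a duration from the named environment variable and falls
// back to def when the variable is unset or cannot be parsed.
func envDuration(name string, def time.Duration) time.Duration {
	if v, ok := os.LookupEnv(name); ok {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

var (
	runnerDeregistrationTimeout      = envDuration("RUNNER_DEREGISTRATION_TIMEOUT", 30*time.Second)
	runnerDeregistrationPollInterval = envDuration("RUNNER_DEREGISTRATION_POLL_INTERVAL", 2*time.Second)
)
```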
func (p *RunnerProvisioner) deleteVM(ctx context.Context, runnerName string) {
	// Wait for runner to de-register from GitHub before deleting VM
	p.ensureRunnerDeregistered(ctx, runnerName)
Why wait before deleting the VM?
Can we delete the VM first, so we free up space for other runners, and only then wait for the runner de-registration? It is pretty much fire-and-forget anyway.
Also, the comment is not needed. Generally we add comments only to explain "why" when something is not obvious; here we can already see that we are waiting for the runner to be de-registered before we delete the VM.
Deleting before the de-registration happens should be fine, but what we've seen (and what the original fix is intended to resolve) is that sometimes the VMs were being removed before the de-registration process had even fired, which left those 'ghost' instances.
But with the eventual cleanup that's now happening, we're probably fine to delete the VM, rely on "normal" de-registration to happen in the majority of cases, and rely on the eventual cleanup to force a de-registration in the cases where it doesn't happen from the VM.
I guess two things need answering:
- I have defaulted to delete-first-then-wait, but added configuration to allow people to switch to a wait-then-delete flow if they prefer - are you good with that?
- Do we want to simplify this, and always delete then always try to deregister from this code, regardless of whether the de-registration succeeded when invoked from the VM?
I would simplify it - delete the VM first and then try to deregister.
The reason is that a VM takes actual quota from the cluster, so there might be another one waiting for this one to be deleted, while the runner itself does not take any quota.
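A rough sketch of the delete-first flow being agreed here; the vmClient field, its DeleteVM method, and the use of the standard log package are assumptions standing in for whatever the provisioner actually uses to tear down VMs:

```go
func (p *RunnerProvisioner) deleteVM(ctx context.Context, runnerName string) {
	// Free the cluster quota first so another runner can be scheduled.
	if err := p.vmClient.DeleteVM(ctx, runnerName); err != nil {
		log.Printf("deleting VM for runner %s failed: %v", runnerName, err)
	}

	// Then make sure the runner is gone from GitHub. This is a no-op when the
	// runner already de-registered itself from inside the VM.
	p.ensureRunnerDeregistered(ctx, runnerName)
}
```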
Cool, I've done that, and the other comments have been addressed too. Thanks for your speedy feedback!
deadline := time.Now().Add(runnerDeregistrationTimeout)
for time.Now().Before(deadline) {
	runner, err := p.actionsClient.GetRunner(ctx, runnerName)
We should be checking whether the context has been cancelled before this.
A more idiomatic way would be to rewrite this to use a Ticker and a select statement. Something like:
for {
    select {
    case <-ctx.Done():
        ....
    case <-ticker.C:
        ....
    }
}
This would also handle the context check. You can of course check it with the current implementation as well.
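A fuller sketch of that shape, assuming actionsClient exposes GetRunner as shown in the diff plus a RemoveRunner force-delete method (hypothetical name), that GetRunner returns a nil runner once de-registration has completed, and that the context, errors, log, and time packages are imported:

```go
func (p *RunnerProvisioner) ensureRunnerDeregistered(ctx context.Context, runnerName string) {
	// Bound the wait with the configured timeout.
	ctx, cancel := context.WithTimeout(ctx, runnerDeregistrationTimeout)
	defer cancel()

	ticker := time.NewTicker(runnerDeregistrationPollInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			if errors.Is(ctx.Err(), context.DeadlineExceeded) {
				// The runner did not de-register in time: force-delete it so
				// no ghost registration is left behind on GitHub.
				if err := p.actionsClient.RemoveRunner(context.Background(), runnerName); err != nil {
					log.Printf("force-deleting runner %s failed: %v", runnerName, err)
				}
			}
			return
		case <-ticker.C:
			runner, err := p.actionsClient.GetRunner(ctx, runnerName)
			if err != nil {
				// Treat errors as transient and keep polling until the deadline.
				continue
			}
			if runner == nil {
				// Runner is no longer registered; nothing left to do.
				return
			}
		}
	}
}
```

Using the select on ctx.Done() covers both the timeout and caller cancellation in one place, which is the main point of the suggestion.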
Force-pushed from 3b1583c to 2bf0ddd
Add functionality to ensure GitHub Actions runners are properly deregistered, preventing ghost runners from remaining in GitHub after VM deletion.

Approach:
- Delete VM first to free infrastructure resources immediately
- Then ensure runner deregistration (no-op if already deregistered)
- Force-delete runners that don't deregister within timeout

Features:
- Configurable deregistration timeout (default: 30s)
- Configurable polling interval for status checks (default: 2s)
- Idiomatic Go implementation using context, Ticker, and select patterns
- Proper cleanup with defer statements for resource management

Configuration:
- RUNNER_DEREGISTRATION_TIMEOUT: Maximum time to wait for runner deregistration
- RUNNER_DEREGISTRATION_POLL_INTERVAL: Interval between status checks

Implementation details:
- Use context.WithTimeout() for timeout management
- Use time.NewTicker() for polling intervals
- Use select statement for clean control flow
- Support context cancellation for graceful shutdown

Testing:
- Add comprehensive test suite for RunnerProvisioner
- Test runner deregistration success scenarios
- Test force-deletion on timeout
- Test context cancellation handling
- Test transient error handling

Co-Authored-By: Claude <noreply@anthropic.com>
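As an illustration of the context-cancellation case listed above, a minimal sketch of how such a test could look; the fakeActionsClient stub, the Runner type, the client interface shape, and the RunnerProvisioner literal are all assumptions rather than the repo's actual test code:

```go
package provisioner

import (
	"context"
	"testing"
	"time"
)

// fakeActionsClient is a hypothetical stub that always reports the runner as
// still registered, so only cancellation can end the wait.
type fakeActionsClient struct{}

func (f *fakeActionsClient) GetRunner(ctx context.Context, name string) (*Runner, error) {
	return &Runner{Name: name}, nil
}

func (f *fakeActionsClient) RemoveRunner(ctx context.Context, name string) error {
	return nil
}

func TestEnsureRunnerDeregistered_ContextCancelled(t *testing.T) {
	p := &RunnerProvisioner{actionsClient: &fakeActionsClient{}}

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // cancel up front; the call should return almost immediately

	done := make(chan struct{})
	go func() {
		p.ensureRunnerDeregistered(ctx, "runner-1")
		close(done)
	}()

	select {
	case <-done:
		// Returned promptly: cancellation was honoured.
	case <-time.After(2 * time.Second):
		t.Fatal("ensureRunnerDeregistered did not honour context cancellation")
	}
}
```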
Force-pushed from 2bf0ddd to fc5b243
ispasov left a comment
Looks good.
Thank you for your contribution.
My final suggestion would be to remove all inline comments (those that are not func doc comments).
They do not really add value, since they describe what is done rather than why, and the what is already visible from the code.
I removed the inline comments per your suggestion, thanks!
This change introduces a mechanism to ensure that runners are properly de-registered from GitHub when their associated VMs are deleted. A new method, ensureRunnerDeregistered, has been added to wait for the de-registration process, with a timeout and polling interval defined. If the runner does not de-register in time, it will be force-deleted. This enhances the reliability of the VM deletion process and, most importantly, prevents a scenario where a GitHub Scale Set believes it has an active runner and therefore will not provision any new ones for incoming requests.