fix: update armada to respect node pod limits #4517
base: master
Signed-off-by: Jason Parraga <jparraga@stackav.com>
	FailedPodChecks          podchecks.FailedChecks
	PendingPodChecks         *podchecks.Checks
	FatalPodSubmissionErrors []string
	ResourcesToSanitize      []string
Can you please add docs for the new field?
In our docs, we often link to config on https://pkg.go.dev. For the executor, for example, that is https://pkg.go.dev/github.com/armadaproject/armada/internal/executor/configuration#ApplicationConfiguration. It is easier for end users if they can see the config field docs there.
}

// Sanitizes pod resources that may be used during scheduling but are invalid at runtime.
func (submitService *SubmitService) sanitizePodResources(pod *v1.Pod) {
nit: this can be a plain sanitizePodResources function; there is no need for it to be a receiver method on SubmitService. Same for sanitizeResourceList.
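A minimal sketch of what that suggestion could look like, not the PR's actual code. To stay self-contained it uses stand-in types (`ResourceList`, `Container`, `Pod`) instead of the real `k8s.io/api/core/v1` structs, and it assumes the list of resource names is passed in as a parameter and that init containers are sanitized too; both are assumptions for illustration.

```go
package main

import "fmt"

// ResourceList is a hypothetical stand-in for v1.ResourceList.
// Quantities are plain strings here to avoid external dependencies.
type ResourceList map[string]string

// Container is a hypothetical stand-in for v1.Container.
type Container struct {
	Name     string
	Requests ResourceList
	Limits   ResourceList
}

// Pod is a hypothetical stand-in for v1.Pod.
type Pod struct {
	Containers     []Container
	InitContainers []Container
}

// sanitizeResourceList removes each named resource from the list in place.
// Deleting from a nil map is a no-op in Go, so nil lists are safe.
func sanitizeResourceList(list ResourceList, resourcesToSanitize []string) {
	for _, name := range resourcesToSanitize {
		delete(list, name)
	}
}

// sanitizePodResources strips the named resources from the requests and
// limits of every container. It is a free function: it needs no state
// from a service struct, only the pod and the names to remove.
// Assumption: init containers are handled the same way as containers.
func sanitizePodResources(pod *Pod, resourcesToSanitize []string) {
	for _, containers := range [][]Container{pod.Containers, pod.InitContainers} {
		for i := range containers {
			sanitizeResourceList(containers[i].Requests, resourcesToSanitize)
			sanitizeResourceList(containers[i].Limits, resourcesToSanitize)
		}
	}
}

func main() {
	pod := &Pod{Containers: []Container{{
		Name:     "main",
		Requests: ResourceList{"cpu": "1", "pods": "1"},
		Limits:   ResourceList{"cpu": "1", "pods": "1"},
	}}}
	sanitizePodResources(pod, []string{"pods"})
	fmt.Println(pod.Containers[0].Requests, pod.Containers[0].Limits)
}
```

With this shape the caller would pass the configured list explicitly, which also makes the helpers easy to unit test without constructing a service.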
Hey, so this is generally a good direction, but there are a few things I think we'd like to consider and possibly change.
What type of PR is this?
Bug fix
What this PR does / why we need it:
This pull request updates Armada to respect node pod limits. Without this change, Armada can try to schedule pods onto nodes that have no remaining capacity for concurrently running pods. When this happens, the pods get stuck in a pending state and eventually have their leases returned. Presumably Armada counts these pods towards fair share even though they cannot run.
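To illustrate the failure mode, here is a rough sketch of a node fit check that treats the pod count as just another resource. It is not Armada's scheduler code: the `Resources` type and the `withPodSlot` and `fits` helpers are hypothetical names, and the resource key `pods` with one slot consumed per pod is an assumption modeled on how Kubernetes reports node pod capacity.

```go
package main

import "fmt"

// Resources maps a resource name to an integer amount
// (for example CPU in millicores, or a count of pod slots).
type Resources map[string]int64

// withPodSlot returns a copy of the request that also asks for one
// "pods" slot, so pod count is checked like any other resource.
func withPodSlot(request Resources) Resources {
	out := Resources{"pods": 1}
	for name, amount := range request {
		out[name] = amount
	}
	return out
}

// fits reports whether the request can be added to the node's current
// usage without exceeding any allocatable amount. A resource missing
// from allocatable is treated as having zero capacity.
func fits(allocatable, used, request Resources) bool {
	for name, want := range request {
		if used[name]+want > allocatable[name] {
			return false
		}
	}
	return true
}

func main() {
	allocatable := Resources{"cpu": 4000, "pods": 2}
	used := Resources{"cpu": 1000, "pods": 2} // node is already at its pod limit
	job := Resources{"cpu": 500}

	// Checking CPU alone says the job fits, even though no pod slot is free.
	fmt.Println("cpu only:", fits(allocatable, used, job))
	// Including the pod slot catches the limit before the pod is scheduled.
	fmt.Println("with pods:", fits(allocatable, used, withPodSlot(job)))
}
```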
[...] pods [...]
[...] pods resource in scheduling decisions by default
[...] pods resources by default
Which issue(s) this PR fixes:
Fixes #4515
Special notes for your reviewer:
To roll out safely, the executors must be updated first. After that, the scheduler and server can be rolled out in any order.