-
Notifications
You must be signed in to change notification settings - Fork 754
Refactor checkHealth function #1508
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
541c6cb to
44d450f
Compare
internal/plugin/server.go
Outdated
| }() | ||
|
|
||
| // Start recovery worker to detect when unhealthy devices become healthy | ||
| go plugin.runRecoveryWorker() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we split the refactoring (that doesn't add any new behaviour) into a different PR from the one that adds devices becoming healthy again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds like a good idea, and even more based on your other comment #1508 (review)
I wanted a re-factor, but that interface is a diff conversation. Going to work on splitting this PR
elezar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the context of the k8s-dra-driver-gpu we discused the Interface that we would expect a DeviceHealthCheckProvider to have. Where is that considered here? From the perspective of the device plugin (or its associated ResourceManager), I would expect a DevideHealthCheckProvider to be instantiated and we would develop against this intervace.
As I discussed in NVIDIA/k8s-dra-driver-gpu#689 I would expect this interface to look something like:
type DeviceHealthCheckProvider interface {
Start(context.Context) error
Stop()
Health() <-channel Device
(alternatively one could split the Health channel into Healthy() and Unhealthy()).
internal/plugin/server.go
Outdated
| // If health provider not available, wait for context cancellation | ||
| if plugin.healthProvider == nil { | ||
| <-plugin.ctx.Done() | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under which conditions is the healthProvider nil? Could we not rather ALWAYS use at least a "no-op" healthProvider to ensure that we don't need to special case this here or at any point where we call Start or Stop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the suggestion, adopted
internal/rm/health.go
Outdated
| // envDisableHealthChecks defines the environment variable that is checked to determine whether healthchecks | ||
| // should be disabled. If this envvar is set to "all" or contains the string "xids", healthchecks are | ||
| // disabled entirely. If set, the envvar is treated as a comma-separated list of Xids to ignore. Note that | ||
| // this is in addition to the Application errors that are already ignored. | ||
| // envDisableHealthChecks defines the environment variable that is | ||
| // checked to determine whether healthchecks should be disabled. If | ||
| // this envvar is set to "all" or contains the string "xids", | ||
| // healthchecks are disabled entirely. If set, the envvar is treated | ||
| // as a comma-separated list of Xids to ignore. Note that this is in | ||
| // addition to the Application errors that are already ignored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nit: For complex refactorings, keeping changes to a minimum is important as we are able to reduce the noise and focus on the changes. In cases like these, we should update these comments as a separate commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now in a [no-relnote] commit
| nvml: nvml, | ||
| config: config, | ||
| devices: devices, | ||
| healthChan: make(chan *Device, 64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Why 64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size would be len(devices) × 4, but I thought 64 was a safe hard coded number as it covers all possible len(devices) sizes
internal/rm/health.go
Outdated
| if p.started { | ||
| p.mu.Unlock() | ||
| return fmt.Errorf("health provider already started") | ||
| } | ||
| p.started = true | ||
| p.mu.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to not defer p.mu.Unlock() instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defer would be simpler but slower. Using defer would hold the mutex during {NVML initialization, Event set creation, Device registration}, blocking other operations (like Stop()).
internal/rm/health.go
Outdated
| wg sync.WaitGroup | ||
|
|
||
| // State guards | ||
| mu sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use an IsA relationship to simplify taking and releasing the lock:
| mu sync.Mutex | |
| sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the suggestion, adopted
internal/rm/health.go
Outdated
| ret := p.nvml.Init() | ||
| if ret != nvml.SUCCESS { | ||
| if *r.config.Flags.FailOnInitError { | ||
| if *p.config.Flags.FailOnInitError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Let's not rename r to p in a single commit. (see comment on managing diffs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the suggestion, adopted , now an independent commit
internal/rm/health.go
Outdated
| p.xidsDisabled = getDisabledHealthCheckXids() | ||
| if p.xidsDisabled.IsAllDisabled() { | ||
| klog.Info("Health checks disabled via DP_DISABLE_HEALTHCHECKS") | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should happen at construction and not as Start is called. If all healthChecks are disabled, we should return a no-op HealthProvider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the suggestion, adopted
| klog.Warningf("NVML init failed: %v; health checks disabled", ret) | ||
| return nil | ||
| } | ||
| defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain the move away from a deferred shutdown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All error paths after Init() have now individual clean up logic.
| } | ||
| return fmt.Errorf("failed to create event set: %v", ret) | ||
| } | ||
| defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason that we don't use the deferred cleanup here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All error paths after Init() have now individual clean up logic.
| // HealthProvider returns a no-op HealthProvider for Tegra devices. | ||
| // Tegra devices do not support health monitoring. | ||
| func (r *tegraResourceManager) HealthProvider() HealthProvider { | ||
| return &noopHealthProvider{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. You already have it. Why not use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now used as proposed in above comment
internal/rm/tegra_manager.go
Outdated
| } | ||
|
|
||
| func (n *noopHealthProvider) Start(context.Context) error { | ||
| n.healthChan = make(chan *Device) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just do this at construction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we need to actually create a channel? Can we no leave it nil?
3f81103 to
91f4a6c
Compare
Extract device health checking logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context. - Add HealthProvider interface (Start/Stop/Health methods) - Implement nvmlHealthProvider with WaitGroup coordination - Update ResourceManager to return HealthProvider instead of CheckHealth - Update device plugin to use HealthProvider - Add no-op implementation for Tegra devices This refactoring improves code modularity and testability without changing existing behavior. Prepares foundation for future device recovery features. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
This patch refactors the device health check system by extracting the logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context.
No behavior changes - this is a pure refactoring to improve code modularity and testability.