Skip to content

Conversation

@ArangoGutierrez
Copy link
Collaborator

@ArangoGutierrez ArangoGutierrez commented Nov 18, 2025

This patch refactors the device health check system by extracting the logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context.
No behavior changes - this is a pure refactoring to improve code modularity and testability.

@ArangoGutierrez ArangoGutierrez self-assigned this Nov 18, 2025
@ArangoGutierrez ArangoGutierrez force-pushed the gtg branch 2 times, most recently from 541c6cb to 44d450f Compare November 18, 2025 18:13
@ArangoGutierrez ArangoGutierrez added the feature issue/PR that proposes a new feature or functionality label Nov 18, 2025
@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review November 18, 2025 18:45
}()

// Start recovery worker to detect when unhealthy devices become healthy
go plugin.runRecoveryWorker()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we split the refactoring (that doesn't add any new behaviour) into a different PR from the one that adds devices becoming healthy again?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds like a good idea, and even more based on your other comment #1508 (review)
I wanted a re-factor, but that interface is a diff conversation. Going to work on splitting this PR

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the context of the k8s-dra-driver-gpu we discused the Interface that we would expect a DeviceHealthCheckProvider to have. Where is that considered here? From the perspective of the device plugin (or its associated ResourceManager), I would expect a DevideHealthCheckProvider to be instantiated and we would develop against this intervace.

As I discussed in NVIDIA/k8s-dra-driver-gpu#689 I would expect this interface to look something like:

type DeviceHealthCheckProvider interface {
   Start(context.Context) error
   Stop()
   Health() <-channel Device

(alternatively one could split the Health channel into Healthy() and Unhealthy()).

Comment on lines 278 to 282
// If health provider not available, wait for context cancellation
if plugin.healthProvider == nil {
<-plugin.ctx.Done()
return nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under which conditions is the healthProvider nil? Could we not rather ALWAYS use at least a "no-op" healthProvider to ensure that we don't need to special case this here or at any point where we call Start or Stop?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion, adopted

Comment on lines 30 to 40
// envDisableHealthChecks defines the environment variable that is checked to determine whether healthchecks
// should be disabled. If this envvar is set to "all" or contains the string "xids", healthchecks are
// disabled entirely. If set, the envvar is treated as a comma-separated list of Xids to ignore. Note that
// this is in addition to the Application errors that are already ignored.
// envDisableHealthChecks defines the environment variable that is
// checked to determine whether healthchecks should be disabled. If
// this envvar is set to "all" or contains the string "xids",
// healthchecks are disabled entirely. If set, the envvar is treated
// as a comma-separated list of Xids to ignore. Note that this is in
// addition to the Application errors that are already ignored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nit: For complex refactorings, keeping changes to a minimum is important as we are able to reduce the noise and focus on the changes. In cases like these, we should update these comments as a separate commit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now in a [no-relnote] commit

nvml: nvml,
config: config,
devices: devices,
healthChan: make(chan *Device, 64),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Why 64?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size would be len(devices) × 4, but I thought 64 was a safe hard coded number as it covers all possible len(devices) sizes

Comment on lines 119 to 124
if p.started {
p.mu.Unlock()
return fmt.Errorf("health provider already started")
}
p.started = true
p.mu.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to not defer p.mu.Unlock() instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defer would be simpler but slower. Using defer would hold the mutex during {NVML initialization, Event set creation, Device registration}, blocking other operations (like Stop()).

wg sync.WaitGroup

// State guards
mu sync.Mutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use an IsA relationship to simplify taking and releasing the lock:

Suggested change
mu sync.Mutex
sync.Mutex

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion, adopted

ret := p.nvml.Init()
if ret != nvml.SUCCESS {
if *r.config.Flags.FailOnInitError {
if *p.config.Flags.FailOnInitError {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Let's not rename r to p in a single commit. (see comment on managing diffs).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion, adopted , now an independent commit

Comment on lines 127 to 131
p.xidsDisabled = getDisabledHealthCheckXids()
if p.xidsDisabled.IsAllDisabled() {
klog.Info("Health checks disabled via DP_DISABLE_HEALTHCHECKS")
return nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should happen at construction and not as Start is called. If all healthChecks are disabled, we should return a no-op HealthProvider.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion, adopted

klog.Warningf("NVML init failed: %v; health checks disabled", ret)
return nil
}
defer func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain the move away from a deferred shutdown?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All error paths after Init() have now individual clean up logic.

}
return fmt.Errorf("failed to create event set: %v", ret)
}
defer func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason that we don't use the deferred cleanup here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All error paths after Init() have now individual clean up logic.

// HealthProvider returns a no-op HealthProvider for Tegra devices.
// Tegra devices do not support health monitoring.
func (r *tegraResourceManager) HealthProvider() HealthProvider {
return &noopHealthProvider{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. You already have it. Why not use it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now used as proposed in above comment

}

func (n *noopHealthProvider) Start(context.Context) error {
n.healthChan = make(chan *Device)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just do this at construction?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we need to actually create a channel? Can we no leave it nil?

@ArangoGutierrez ArangoGutierrez force-pushed the gtg branch 3 times, most recently from 3f81103 to 91f4a6c Compare November 24, 2025 17:02
Extract device health checking logic into a dedicated HealthProvider
interface with proper lifecycle management using WaitGroups and context.

- Add HealthProvider interface (Start/Stop/Health methods)
- Implement nvmlHealthProvider with WaitGroup coordination
- Update ResourceManager to return HealthProvider instead of CheckHealth
- Update device plugin to use HealthProvider
- Add no-op implementation for Tegra devices

This refactoring improves code modularity and testability without
changing existing behavior. Prepares foundation for future device
recovery features.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature issue/PR that proposes a new feature or functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants