Refactor checkHealth function #1508

ArangoGutierrez · 2025-11-18T14:43:28Z

This patch refactors the device health check system by extracting the logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context.
No behavior changes - this is a pure refactoring to improve code modularity and testability.

elezar · 2025-11-20T13:58:21Z

internal/plugin/server.go

 	}()

+	// Start recovery worker to detect when unhealthy devices become healthy
+	go plugin.runRecoveryWorker()


Can we split the refactoring (that doesn't add any new behaviour) into a different PR from the one that adds devices becoming healthy again?

Sounds like a good idea, even more so given your other comment, #1508 (review).
I wanted a refactor, but that interface is a different conversation. I'm going to work on splitting this PR.

elezar

In the context of the k8s-dra-driver-gpu we discussed the interface that we would expect a DeviceHealthCheckProvider to have. Where is that considered here? From the perspective of the device plugin (or its associated ResourceManager), I would expect a DeviceHealthCheckProvider to be instantiated, and we would develop against this interface.

As I discussed in NVIDIA/k8s-dra-driver-gpu#689 I would expect this interface to look something like:

type DeviceHealthCheckProvider interface {
	Start(context.Context) error
	Stop()
	Health() <-chan Device
}

(alternatively one could split the Health channel into Healthy() and Unhealthy()).

elezar · 2025-11-24T15:29:27Z

internal/plugin/server.go

+	// If health provider not available, wait for context cancellation
+	if plugin.healthProvider == nil {
+		<-plugin.ctx.Done()
+		return nil
+	}


Under which conditions is the healthProvider nil? Could we not rather ALWAYS use at least a "no-op" healthProvider to ensure that we don't need to special case this here or at any point where we call Start or Stop?

thanks for the suggestion, adopted

elezar · 2025-11-24T15:30:53Z

internal/rm/health.go

-	// envDisableHealthChecks defines the environment variable that is checked to determine whether healthchecks
-	// should be disabled. If this envvar is set to "all" or contains the string "xids", healthchecks are
-	// disabled entirely. If set, the envvar is treated as a comma-separated list of Xids to ignore. Note that
-	// this is in addition to the Application errors that are already ignored.
+	// envDisableHealthChecks defines the environment variable that is
+	// checked to determine whether healthchecks should be disabled. If
+	// this envvar is set to "all" or contains the string "xids",
+	// healthchecks are disabled entirely. If set, the envvar is treated
+	// as a comma-separated list of Xids to ignore. Note that this is in
+	// addition to the Application errors that are already ignored.


This is a nit: For complex refactorings, keeping changes to a minimum is important as we are able to reduce the noise and focus on the changes. In cases like these, we should update these comments as a separate commit.

now in a [no-relnote] commit

elezar · 2025-11-24T15:31:27Z

internal/rm/health.go

+		nvml:       nvml,
+		config:     config,
+		devices:    devices,
+		healthChan: make(chan *Device, 64),


Question: Why 64?

The size would be len(devices) × 4, but I thought 64 was a safe hard-coded number, as it covers all possible len(devices) sizes.
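For comparison, deriving the capacity from the device count could look like the sketch below. The factor of 4 is taken from the reply above (its rationale is not stated in the thread); `eventsPerDevice` and `newHealthChan` are hypothetical names.

```go
package main

import "fmt"

// Device is a hypothetical stand-in for the plugin's device type.
type Device struct{ ID string }

// eventsPerDevice is the multiplier quoted in the reply; treat it as an
// assumption rather than a documented constant.
const eventsPerDevice = 4

// newHealthChan sizes the buffer from the number of devices instead of
// using a fixed capacity.
func newHealthChan(devices []*Device) chan *Device {
	return make(chan *Device, len(devices)*eventsPerDevice)
}

func main() {
	devices := make([]*Device, 8)
	fmt.Println(cap(newHealthChan(devices)))
}
```

Note that an empty device list yields an unbuffered channel here, which may or may not be desirable.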

elezar · 2025-11-24T15:32:09Z

internal/rm/health.go

+	if p.started {
+		p.mu.Unlock()
+		return fmt.Errorf("health provider already started")
+	}
+	p.started = true
+	p.mu.Unlock()


Any reason to not defer p.mu.Unlock() instead?

defer would be simpler but slower. Using defer would hold the mutex during NVML initialization, event set creation, and device registration, blocking other operations (like Stop()).
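One way to get both a deferred unlock and a short critical section is to move the flag check into a small helper. This is only a sketch under that assumption; `provider` and `markStarted` are hypothetical names and not part of the PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type provider struct {
	mu      sync.Mutex
	started bool
}

// markStarted flips the started flag under the lock and reports whether
// this call was the one that flipped it. The lock is released on return,
// before any slow initialization runs.
func (p *provider) markStarted() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.started {
		return false
	}
	p.started = true
	return true
}

func (p *provider) Start() error {
	if !p.markStarted() {
		return errors.New("health provider already started")
	}
	// Slow initialization would run here without holding p.mu.
	return nil
}

func main() {
	p := &provider{}
	fmt.Println(p.Start())
	fmt.Println(p.Start())
}
```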

elezar · 2025-11-24T15:32:44Z

internal/rm/health.go

+	wg     sync.WaitGroup
+
+	// State guards
+	mu      sync.Mutex


We could use an IsA relationship to simplify taking and releasing the lock:

Suggested change

-	mu      sync.Mutex
+	sync.Mutex

thanks for the suggestion, adopted
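As an illustration of the embedding suggestion, the sketch below uses a hypothetical `provider` type (not the PR's struct). Embedding `sync.Mutex` promotes `Lock` and `Unlock` onto the type, so call sites shrink to `p.Lock()`; the trade-off is that those methods also become part of the type's method set.

```go
package main

import (
	"fmt"
	"sync"
)

type provider struct {
	sync.Mutex // embedded: provider gains Lock and Unlock directly
	started    bool
}

// setStarted reports whether this call changed started from false to true.
func (p *provider) setStarted() bool {
	p.Lock()
	defer p.Unlock()
	if p.started {
		return false
	}
	p.started = true
	return true
}

// countWinners races n goroutines on setStarted and counts how many succeed.
func countWinners(p *provider, n int) int {
	var wg sync.WaitGroup
	wins := make(chan struct{}, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if p.setStarted() {
				wins <- struct{}{}
			}
		}()
	}
	wg.Wait()
	return len(wins)
}

func main() {
	fmt.Println(countWinners(&provider{}, 10))
}
```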

elezar · 2025-11-24T15:33:42Z

internal/rm/health.go

+	ret := p.nvml.Init()
 	if ret != nvml.SUCCESS {
-		if *r.config.Flags.FailOnInitError {
+		if *p.config.Flags.FailOnInitError {


nit: Let's not rename r to p in a single commit. (see comment on managing diffs).

thanks for the suggestion, adopted; now an independent commit

elezar · 2025-11-24T15:34:32Z

internal/rm/health.go

+	p.xidsDisabled = getDisabledHealthCheckXids()
+	if p.xidsDisabled.IsAllDisabled() {
+		klog.Info("Health checks disabled via DP_DISABLE_HEALTHCHECKS")
 		return nil
 	}


This should happen at construction and not as Start is called. If all healthChecks are disabled, we should return a no-op HealthProvider.

thanks for the suggestion, adopted
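A sketch of what deciding at construction could look like. The rule in `allChecksDisabled` follows the doc comment quoted earlier in this review ("all", or a value containing "xids"); `newHealthProvider`, `activeHealthProvider`, and the string parameter are hypothetical and only stand in for the real constructor and the NVML-backed provider.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Device is a hypothetical stand-in for the plugin's device type.
type Device struct{ ID string }

// HealthProvider is assumed to have this shape.
type HealthProvider interface {
	Start(context.Context) error
	Stop()
	Health() <-chan *Device
}

type noopHealthProvider struct{}

func (noopHealthProvider) Start(context.Context) error { return nil }
func (noopHealthProvider) Stop()                       {}
func (noopHealthProvider) Health() <-chan *Device      { return nil }

// activeHealthProvider is a hypothetical placeholder for the NVML-backed
// provider; its methods are stubs.
type activeHealthProvider struct{ healthChan chan *Device }

func (p *activeHealthProvider) Start(context.Context) error { return nil }
func (p *activeHealthProvider) Stop()                       {}
func (p *activeHealthProvider) Health() <-chan *Device      { return p.healthChan }

// allChecksDisabled applies the rule from the doc comment: the value is
// "all" or contains the string "xids".
func allChecksDisabled(value string) bool {
	return value == "all" || strings.Contains(value, "xids")
}

// newHealthProvider picks the implementation once, so Start never has to
// re-check whether health checks are disabled.
func newHealthProvider(disableValue string) HealthProvider {
	if allChecksDisabled(disableValue) {
		return noopHealthProvider{}
	}
	return &activeHealthProvider{healthChan: make(chan *Device, 64)}
}

func main() {
	for _, v := range []string{"all", "xids", "13,31", ""} {
		fmt.Printf("%q -> %T\n", v, newHealthProvider(v))
	}
}
```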

elezar · 2025-11-24T15:36:14Z

internal/rm/health.go

+		klog.Warningf("NVML init failed: %v; health checks disabled", ret)
 		return nil
 	}
-	defer func() {


Could you explain the move away from a deferred shutdown?

All error paths after Init() now have individual cleanup logic.

elezar · 2025-11-24T15:36:43Z

internal/rm/health.go

+		}
 		return fmt.Errorf("failed to create event set: %v", ret)
 	}
-	defer func() {


is there a reason that we don't use the deferred cleanup here?

All error paths after Init() now have individual cleanup logic.
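For reference, one alternative to per-path cleanup is a deferred cleanup that only fires when the function is returning an error, which keeps the library initialized on success. This is a sketch of that general Go pattern, not the approach taken in the PR; `fakeLib` is a hypothetical stand-in that records calls instead of talking to NVML.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeLib is a hypothetical stand-in for a library with Init/Shutdown.
type fakeLib struct {
	calls        []string
	failEventSet bool
}

func (l *fakeLib) Init() error { l.calls = append(l.calls, "init"); return nil }
func (l *fakeLib) Shutdown()   { l.calls = append(l.calls, "shutdown") }

func (l *fakeLib) NewEventSet() error {
	l.calls = append(l.calls, "eventset")
	if l.failEventSet {
		return errors.New("event set unavailable")
	}
	return nil
}

// start uses a named result so the deferred function can see whether the
// function is returning an error, and shuts down only in that case.
func start(l *fakeLib) (err error) {
	if err := l.Init(); err != nil {
		return err
	}
	defer func() {
		if err != nil {
			l.Shutdown()
		}
	}()
	if err = l.NewEventSet(); err != nil {
		return fmt.Errorf("failed to create event set: %w", err)
	}
	return nil
}

func main() {
	l := &fakeLib{failEventSet: true}
	fmt.Println(start(l), l.calls)
}
```

Every error path added after `Init()` is covered by the single deferred block, at the cost of relying on the named return value.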

elezar · 2025-11-24T15:42:55Z

internal/rm/tegra_manager.go

+// HealthProvider returns a no-op HealthProvider for Tegra devices.
+// Tegra devices do not support health monitoring.
+func (r *tegraResourceManager) HealthProvider() HealthProvider {
+	return &noopHealthProvider{}


OK. You already have it. Why not use it?

Now used, as proposed in the comment above.

elezar · 2025-11-24T15:43:30Z

internal/rm/tegra_manager.go

+}
+
+func (n *noopHealthProvider) Start(context.Context) error {
+	n.healthChan = make(chan *Device)


Why not just do this at construction?

Also, do we need to actually create a channel? Can we not leave it nil?

Extract device health checking logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context.

- Add HealthProvider interface (Start/Stop/Health methods)
- Implement nvmlHealthProvider with WaitGroup coordination
- Update ResourceManager to return HealthProvider instead of CheckHealth
- Update device plugin to use HealthProvider
- Add no-op implementation for Tegra devices

This refactoring improves code modularity and testability without changing existing behavior. Prepares foundation for future device recovery features.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

ArangoGutierrez self-assigned this Nov 18, 2025

ArangoGutierrez force-pushed the gtg branch 2 times, most recently from 541c6cb to 44d450f Compare November 18, 2025 18:13

ArangoGutierrez added the feature issue/PR that proposes a new feature or functionality label Nov 18, 2025

ArangoGutierrez requested a review from elezar November 18, 2025 18:45

ArangoGutierrez marked this pull request as ready for review November 18, 2025 18:45

elezar reviewed Nov 20, 2025

ArangoGutierrez force-pushed the gtg branch from 44d450f to 7875a1b Compare November 21, 2025 13:58

ArangoGutierrez requested a review from elezar November 21, 2025 15:16

elezar reviewed Nov 24, 2025

ArangoGutierrez force-pushed the gtg branch 3 times, most recently from 3f81103 to 91f4a6c Compare November 24, 2025 17:02

ArangoGutierrez added 3 commits November 24, 2025 18:06

[no-relnote] Format doc comment at internal/rm/health.go

ab25e41

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

[no-relnote] refactor use p as receiver for nvmlHealthProvider

33636eb

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

ArangoGutierrez force-pushed the gtg branch from 91f4a6c to 33636eb Compare November 24, 2025 17:06

ArangoGutierrez requested a review from elezar November 24, 2025 17:07

ArangoGutierrez requested a review from klueska November 24, 2025 17:07

2 participants
