@sergio-correia
Contributor

  • Add operational context to TPM mutex errors for better debugging
  • Remove unused session request code (89 lines of dead code)
  • Fix panic on missing ek_handle configuration

@sergio-correia changed the title from "Reliability improvements for the Keylime agent:" to "Reliability improvements for the Keylime agent" on Dec 17, 2025

Replace panic with proper error handling for ek_handle configuration.
Instead of using .expect() which would panic on configuration errors,
use .map_err() to provide a helpful error message that guides users
to verify their ek_handle configuration.

Changes:
- Replace .expect("failed to get ek_handle") with .map_err()
- Add descriptive error message for configuration issues
- Remove completed TODO comment
- Add test to verify error handling behavior

This improves reliability by ensuring configuration errors result in
graceful failures with actionable error messages instead of panics.

Assisted-by: Claude Sonnet 4.5
Signed-off-by: Sergio Correia <scorreia@redhat.com>
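
The change described in this commit can be illustrated with a minimal sketch. It is not the agent's actual code: `AgentError`, `lookup`, and `ek_handle` are hypothetical stand-ins, and the configuration is modeled as a plain `HashMap`. Only the pattern comes from the commit message, namely replacing `.expect("failed to get ek_handle")` with `.map_err()` and an actionable message.

```rust
use std::collections::HashMap;
use std::fmt;

// Hypothetical error type; the agent's real error enum is not shown here.
#[derive(Debug)]
enum AgentError {
    Configuration(String),
}

impl fmt::Display for AgentError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AgentError::Configuration(msg) => write!(f, "configuration error: {msg}"),
        }
    }
}

impl std::error::Error for AgentError {}

// Hypothetical stand-in for the agent's configuration lookup.
fn lookup(config: &HashMap<String, String>, key: &str) -> Result<String, String> {
    config
        .get(key)
        .cloned()
        .ok_or_else(|| format!("missing key '{key}'"))
}

// Before (panics on a missing key):
//     lookup(config, "ek_handle").expect("failed to get ek_handle")
// After (returns an error the caller can report and handle):
fn ek_handle(config: &HashMap<String, String>) -> Result<String, AgentError> {
    lookup(config, "ek_handle").map_err(|e| {
        AgentError::Configuration(format!(
            "failed to get ek_handle: {e}; please verify the ek_handle \
             setting in the agent configuration"
        ))
    })
}
```

With an empty map, `ek_handle(&config)` returns an `Err` whose message names the setting to check. Once the key is present, the same call returns its value. In neither case does the process panic.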

PoP (Proof of Possession) authentication is now handled by
keylime::auth::AuthenticationClient via middleware in ResilientClient.

The get_session_request() trait method and get_session_request_final()
implementation were never used in production code - only in tests. This
commit removes this dead code to eliminate confusion and reduce technical
debt.

Assisted-by: Claude Sonnet 4.5
Signed-off-by: Sergio Correia <scorreia@redhat.com>

Replace all 10 production code mutex unwraps with proper error handling
to eliminate //#[allow_ci] bypass markers in the shared TPM library.

Changes:
- Replaced MutexPoisoned with MutexPoisonedDuringOperation error variant
- Each error now includes the operation name (e.g., "create_ek", "quote")
- Error messages provide clear context about which TPM operation failed
- All 10 mutex lock sites updated with operation-specific error handling

Benefits:
- Eliminates unexpected panics in favor of graceful error propagation
- Improved debugging: error messages identify the exact operation that failed
- Better observability: clear error messages explain the issue and required action
- Zero runtime overhead: uses &'static str for operation names

Example error message:
"TPM context mutex was poisoned during 'quote' operation. This indicates
a critical bug where a thread panicked while holding the TPM lock. The
agent must be restarted."

Assisted-by: Claude Sonnet 4.5
Signed-off-by: Sergio Correia <scorreia@redhat.com>
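
A rough sketch of the pattern this commit describes follows. The variant name `MutexPoisonedDuringOperation` and the error text are taken from the commit message. Everything else is an assumption made for illustration: the field name `operation`, the `TpmError` enum, and the `Tpm`, `FakeTpmContext`, and `poison` helpers. The real TPM library types will differ.

```rust
use std::fmt;
use std::sync::{Arc, Mutex, MutexGuard};
use std::thread;

// Assumed shape of the error variant; only its name comes from the commit.
#[derive(Debug, PartialEq)]
enum TpmError {
    MutexPoisonedDuringOperation { operation: &'static str },
}

impl fmt::Display for TpmError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TpmError::MutexPoisonedDuringOperation { operation } => write!(
                f,
                "TPM context mutex was poisoned during '{operation}' operation. \
                 This indicates a critical bug where a thread panicked while \
                 holding the TPM lock. The agent must be restarted."
            ),
        }
    }
}

impl std::error::Error for TpmError {}

// Hypothetical stand-in for the real TPM context.
struct FakeTpmContext {
    quotes_issued: u32,
}

struct Tpm {
    inner: Mutex<FakeTpmContext>,
}

impl Tpm {
    // Before: self.inner.lock().unwrap()
    // After: a poisoned lock becomes an error tagged with the operation name.
    // The name is a &'static str, so tagging allocates nothing.
    fn lock(
        &self,
        operation: &'static str,
    ) -> Result<MutexGuard<'_, FakeTpmContext>, TpmError> {
        self.inner
            .lock()
            .map_err(|_| TpmError::MutexPoisonedDuringOperation { operation })
    }

    fn quote(&self) -> Result<u32, TpmError> {
        let mut ctx = self.lock("quote")?;
        ctx.quotes_issued += 1;
        Ok(ctx.quotes_issued)
    }
}

// Poisons the mutex by panicking in another thread while the lock is held.
fn poison(tpm: &Arc<Tpm>) {
    let t = Arc::clone(tpm);
    let _ = thread::spawn(move || {
        let _guard = t.inner.lock().unwrap();
        panic!("simulated failure while holding the TPM lock");
    })
    .join();
}
```

Calling `quote()` on a healthy `Tpm` returns `Ok`. After `poison(&tpm)`, the same call returns `Err(TpmError::MutexPoisonedDuringOperation { operation: "quote" })` rather than panicking, and the error's `Display` output matches the example message quoted in the commit.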