@mssonicbld

Description

HLD: https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/graceful-shutdown/graceful-shutdown.md
These changes build upon enhancements in sonic-platform-common#567

This change introduces enhancements to the ModuleBase class to support graceful shutdown and startup operations for DPU and other module types.

It adds new methods and transition handling logic to ensure platform modules follow an ordered and coordinated shutdown/startup procedure, minimizing hardware inconsistencies and transient errors during reboot or DPU detachment.

Key changes include:

Added transition management APIs:

set_module_state_transition()
get_module_state_transition()
clear_module_state_transition()
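
As a rough illustration of how the three transition APIs relate, here is a minimal sketch that uses an in-memory dictionary in place of CHASSIS_MODULE_TABLE. The class name `TransitionStore`, the method signatures, and the field names (`transition_in_progress`, `transition_type`, `transition_start_time`) are assumptions made for this example; they are not taken from the PR.

```python
import threading
import time


class TransitionStore:
    """Hypothetical in-memory stand-in for CHASSIS_MODULE_TABLE."""

    def __init__(self):
        self._table = {}
        self._lock = threading.Lock()

    def set_module_state_transition(self, module_name, transition_type):
        # Record that a transition has started (field names are assumed).
        with self._lock:
            self._table[module_name] = {
                "transition_in_progress": "True",
                "transition_type": transition_type,
                "transition_start_time": str(time.time()),
            }
        return True

    def get_module_state_transition(self, module_name):
        # Return a copy so callers cannot mutate the stored entry.
        with self._lock:
            return dict(self._table.get(module_name, {}))

    def clear_module_state_transition(self, module_name):
        # True if an entry existed and was removed, False otherwise.
        with self._lock:
            return self._table.pop(module_name, None) is not None


store = TransitionStore()
store.set_module_state_transition("DPU0", "shutdown")
```

A caller would typically set the transition before starting the operation, read it from other daemons to decide whether to back off, and clear it once the operation finishes or times out.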

Introduced graceful lifecycle handling:

  • _graceful_shutdown_handler(), which waits for an external transition to complete using the gnoi_halt_in_progress field, with timeout handling
  • Database-backed transition tracking in CHASSIS_MODULE_TABLE
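
The wait-with-timeout behavior described for the shutdown handler can be sketched as a simple polling loop. This is not the PR's implementation: the function name `wait_for_halt_completion`, the `read_field` callback, the string values `"True"`/`"False"`, and the polling interval are all assumptions chosen to keep the example self-contained.

```python
import time


def wait_for_halt_completion(read_field, timeout_s, poll_interval_s=0.5,
                             clock=time.monotonic, sleep=time.sleep):
    """Wait until gnoi_halt_in_progress is no longer "True".

    read_field is a hypothetical callable that returns the current value
    of a named field for the module. Returns True if the external halt
    finished before the deadline, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if read_field("gnoi_halt_in_progress") != "True":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

Using a monotonic clock for the deadline keeps the timeout unaffected by wall-clock adjustments, which matters on systems where time may be stepped during boot or shutdown.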

Added supporting helpers and safeguards:

  • File-based operation locks to ensure concurrency safety during transitions
  • Caching of the transition timeout configuration read from platform.json
  • Error handling and logging intended to prevent partial updates in Redis DB
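
The first two helpers above can be illustrated with a short sketch: an advisory file lock built on `fcntl.flock`, and a cached loader for timeout values. The names `module_operation_lock`, `load_transition_timeouts`, the `dpu_transition_timeouts` key, and the default values are hypothetical placeholders; the PR does not state the actual key names or defaults.

```python
import fcntl
import functools
import json
import os
from contextlib import contextmanager

# Placeholder defaults in seconds; the real values are not given in the PR.
DEFAULT_TIMEOUTS = {"shutdown": 180, "startup": 300}


@contextmanager
def module_operation_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the with-block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


@functools.lru_cache(maxsize=None)
def load_transition_timeouts(platform_json_path):
    """Read timeouts once per path; later calls return the cached dict."""
    timeouts = dict(DEFAULT_TIMEOUTS)
    try:
        with open(platform_json_path) as f:
            overrides = json.load(f)["dpu_transition_timeouts"]  # assumed key
        timeouts.update(
            {k: int(overrides[k]) for k in timeouts if k in overrides}
        )
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing or malformed config: fall back to the defaults
    return timeouts
```

The lock is advisory, so it only protects against other processes that take the same lock file; the cache avoids re-reading platform.json on every transition.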

Motivation and Context

This enhancement is part of the SmartSwitch / DPU graceful shutdown/reboot and state management effort.
Currently, ModuleBase lacks lifecycle orchestration methods for safe shutdown or startup of DPUs and peripheral modules.
By adding transition-aware handling, the system can:

  • Avoid race conditions between platform daemons during reboot/shutdown
  • Ensure state transitions are reflected in Redis (CHASSIS_MODULE_TABLE)
  • Support controlled detach/reattach of PCIe devices and sensor configuration reloads
  • Enable PMON daemons to coordinate module-level transitions consistently

This work aligns with SONiC’s graceful reboot framework and the upcoming DPU lifecycle enhancements tracked internally.

How Has This Been Tested?

Testing was performed on both SmartSwitch (DPU-enabled) and non-DPU platforms:

  • ✅ Unit tests added under tests/test_module_base.py covering:
    • Transition management (set/get/clear)
    • Timeout behavior and concurrency lock handling
    • PCIe detach/reattach and sensor config updates
    • Graceful shutdown/startup flows (set_admin_state_gracefully)
  • ✅ Verified Redis DB updates for transition keys under CHASSIS_MODULE_TABLE
  • ✅ Simulated shutdown and startup sequences:
    • module_pre_shutdown() → safely detaches PCIe and updates state
    • module_post_startup() → rescans PCIe and restores sensor configuration
  • ✅ Regression-tested existing platform daemons to ensure backward compatibility
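
The simulated shutdown and startup sequences listed above can be pictured with a small stand-in class that records call order. The method names come from this description, but the class `SketchModule`, the signatures, and the exact ordering inside `set_admin_state_gracefully` are assumptions for illustration, not the code under test.

```python
class SketchModule:
    """Hypothetical stand-in that records the order of lifecycle calls."""

    def __init__(self):
        self.calls = []
        self.admin_up = True

    def module_pre_shutdown(self):
        # The real method is described as detaching PCIe and updating state.
        self.calls.append("module_pre_shutdown")
        return True

    def module_post_startup(self):
        # The real method is described as rescanning PCIe and restoring
        # the sensor configuration.
        self.calls.append("module_post_startup")
        return True

    def set_admin_state(self, up):
        self.calls.append("set_admin_state(%s)" % up)
        self.admin_up = up
        return True

    def set_admin_state_gracefully(self, up):
        # Assumed ordering: prepare before powering down,
        # restore after powering up.
        if not up:
            return self.module_pre_shutdown() and self.set_admin_state(False)
        return self.set_admin_state(True) and self.module_post_startup()


module = SketchModule()
module.set_admin_state_gracefully(False)
module.set_admin_state_gracefully(True)
```

A unit test in this style can assert on the recorded call order without touching real PCIe devices or Redis.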

Additional Information (Optional)

@mssonicbld

Original PR: #608

@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld mssonicbld merged commit fad8eda into sonic-net:202511 Dec 3, 2025
5 of 6 checks passed