Clean up stale OVN Metadata agents from TripleO during adoption #1177
base: main
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Build failed (check pipeline).
https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/87167ff0f91245588551cc0c142084a9
✔️ noop SUCCESS in 0s
recheck
klgill left a comment:
@fultonj I added a few minor style edits.
docs_user/modules/proc_adopting-compute-services-to-the-data-plane.adoc (outdated)
    ----
    $ oc exec openstackclient -- openstack network agent list
    ----
    +
Review comment:
    +
docs_user/modules/proc_adopting-compute-services-to-the-data-plane.adoc (outdated)
docs_user/modules/proc_adopting-compute-services-to-the-data-plane.adoc (outdated)
docs_user/modules/proc_adopting-compute-services-to-the-data-plane.adoc (outdated)
docs_user/modules/proc_adopting-networker-services-to-the-data-plane.adoc (outdated)
docs_user/modules/proc_adopting-networker-services-to-the-data-plane.adoc (outdated)
462d320 to 49e3b36
Add cleanup task to remove dead OVN Metadata agents that were stopped during TripleO service cleanup but never unregistered from Neutron's database. This prevents the neutron_verify task from failing when it detects agents in the "XXX" (not alive) state.
Problem:
During TripleO to EDPM adoption, the tripleo_cleanup playbook stops TripleO services on compute nodes, including neutron-ovn-metadata-agent. However, these agents don't unregister themselves from Neutron - they simply stop sending heartbeats. When the new EDPM deployment creates containerized metadata agents, they may register as new agents (with different chassis IDs or hostnames) rather than updating the old TripleO agent records. This leaves orphaned "dead" agents in Neutron's database that show as "Alive: XXX" (False).
The neutron_verify task checks that no metadata agents are in the "XXX" state, so adoption fails even though all new EDPM agents are working.
This issue can affect ANY TripleO to EDPM adoption where the new deployment doesn't perfectly match the old hostnames/chassis IDs. Common scenarios include:
1. Different hostnames: Nodesets may use site-specific naming patterns (e.g., osp-dcn1-compute-*, osp-dcn2-compute-*) that differ from central site nodes (osp-compute-*), which prevents the new EDPM agents from reusing the old TripleO agent IDs.
2. Different chassis IDs: Each site's OVN chassis has unique IDs that change between TripleO and EDPM, forcing creation of new agent records rather than updating existing ones.
3. Scale: Deployments with more compute nodes across multiple sites increase the probability that at least one will create a new agent record instead of reusing the old one.
4. Different OVN bridge mappings: A nodeset may use different physical network mappings (leaf0, leaf1, leaf2), which can result in different chassis configurations.
Solution:
Add a cleanup task that deletes all metadata agents marked as "not alive" before the verification runs. This mirrors the existing cleanup for dhcp and ovn-controller-gateway agents.
The task uses "|| true" to ensure it doesn't fail if there are no stale agents to delete (e.g., in environments where agents were reused).
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: John Fulton <fulton@redhat.com>
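The delete-then-verify flow described in this commit message could be sketched roughly as follows. This is not the task added by this PR: the function names are hypothetical, and the sketch assumes that `openstack network agent list -f value` prints `True`/`False` in the Alive column (the table view shows `:-)`/`XXX`) and that metadata agents are listed with the agent type `OVN Metadata agent`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the task added by this PR.

# Print the IDs of OVN Metadata agents that are reported as not alive.
# Expects "ID <agent type> Alive" lines on stdin (assumed -f value layout).
stale_metadata_agent_ids() {
    awk '/OVN Metadata agent/ && $NF == "False" { print $1 }'
}

# Succeed only if no OVN Metadata agent line on stdin is marked not alive.
metadata_agents_all_alive() {
    [ -z "$(stale_metadata_agent_ids)" ]
}

# List agents in machine-readable form through the openstackclient pod.
list_agents() {
    oc exec openstackclient -- openstack network agent list \
        -f value -c ID -c "Agent Type" -c Alive
}

# Delete every stale metadata agent. "|| true" keeps the step from failing
# when a delete returns an error or there is nothing to delete.
cleanup_stale_metadata_agents() {
    list_agents | stale_metadata_agent_ids | while read -r agent_id; do
        oc exec openstackclient -- openstack network agent delete "$agent_id" \
            </dev/null || true
    done
}
```

Calling `cleanup_stale_metadata_agents` and then `list_agents | metadata_agents_all_alive` would mimic running the cleanup before the verification.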
49e3b36 to 0e2c00e
Build failed (check pipeline).
https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/9459d2e5676a45b39ef5647745ee0575
✔️ noop SUCCESS in 0s
klgill left a comment:
/lgtm