@fultonj
Contributor

@fultonj fultonj commented Dec 5, 2025

Add cleanup task to remove dead OVN Metadata agents that were stopped during TripleO service cleanup but never unregistered from Neutron's database. This prevents the neutron_verify task from failing when it detects agents in the "XXX" (not alive) state.

Problem:
During TripleO-to-EDPM adoption, the tripleo_cleanup playbook stops TripleO services on compute nodes, including neutron-ovn-metadata-agent. These agents do not unregister themselves from Neutron; they simply stop sending heartbeats. When the new EDPM deployment creates containerized metadata agents, they may register as new agents (with different chassis IDs or hostnames) rather than updating the old TripleO agent records. This leaves orphaned "dead" agents in Neutron's database that show as "Alive: XXX" (False).

The neutron_verify task requires that no metadata agent is in the "XXX" state, so adoption fails even though all new EDPM agents are working.
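
A minimal sketch of the kind of check that trips here, not the actual neutron_verify implementation. The function name `no_dead_metadata_agents` is hypothetical, and the sketch assumes the default table output of `openstack network agent list`, where a dead agent shows `XXX` in the Alive column:

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only.
# Reads the table printed by `openstack network agent list` on stdin.
# Returns 0 when no OVN Metadata agent row contains XXX, non-zero otherwise.
no_dead_metadata_agents() {
    # Keep only the metadata agent rows, then look for the "not alive" marker.
    # The leading "!" inverts the result so that "no match" means success.
    ! grep -F 'OVN Metadata agent' | grep -qF 'XXX'
}
```

It could be fed with `oc exec openstackclient -- openstack network agent list | no_dead_metadata_agents`. A single stale TripleO record is enough to make it fail, regardless of how many new agents are healthy.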

This issue can affect ANY TripleO to EDPM adoption where the new deployment doesn't perfectly match the old hostnames/chassis IDs. Common scenarios include:

  1. Different hostnames: Nodesets may use site-specific naming patterns (e.g., osp-dcn1-compute-*, osp-dcn2-compute-*) that differ from central site nodes (osp-compute-*), which prevents the new EDPM agents from reusing the old TripleO agent IDs.

  2. Different chassis IDs: Each site's OVN chassis has unique IDs that change between TripleO and EDPM, forcing creation of new agent records rather than updating existing ones.

  3. Scale: The more compute nodes a deployment has across multiple sites, the higher the probability that at least one creates a new agent record instead of reusing the old one.

  4. Different OVN bridge mappings: A nodeset may use different physical network mappings (leaf0, leaf1, leaf2), which can result in different chassis configurations.

Solution:
Add a cleanup task that deletes all metadata agents marked as "not alive" before the verification runs. This mirrors the existing cleanup for dhcp and ovn-controller-gateway agents.

The task uses "|| true" to ensure it doesn't fail if there are no stale agents to delete (e.g., in environments where agents were reused).
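
The cleanup described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the task added by this PR: both function names are hypothetical, and the awk field positions assume the default table columns (ID, Agent Type, Host, Availability Zone, Alive, ...).

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, and the field positions assume
# the default table layout: | ID | Agent Type | Host | Availability Zone | Alive | ...

# Read `openstack network agent list` table output on stdin and print the IDs
# of OVN Metadata agents whose Alive column shows XXX.
dead_metadata_agent_ids() {
    awk -F'|' '$3 ~ /OVN Metadata agent/ && $6 ~ /XXX/ {
        gsub(/ /, "", $2)   # strip the padding around the ID
        print $2
    }'
}

# Delete each dead metadata agent. "|| true" keeps the step from failing,
# and the loop simply does nothing when there are no stale agents.
cleanup_dead_metadata_agents() {
    local id
    for id in $(oc exec openstackclient -- openstack network agent list | dead_metadata_agent_ids); do
        oc exec openstackclient -- openstack network agent delete "$id" || true
    done
}
```

Filtering on both the agent type and the Alive column keeps the deletion scoped to stale metadata agents, so live agents and other agent types are left untouched.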

Co-Authored-By: Claude <noreply@anthropic.com>

@openshift-ci

openshift-ci bot commented Dec 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sathlan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fultonj fultonj requested review from jistr and olliewalsh December 5, 2025 16:01
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/87167ff0f91245588551cc0c142084a9

✔️ noop SUCCESS in 0s
adoption-standalone-to-crc-ceph RETRY_LIMIT in 6m 56s
✔️ adoption-standalone-to-crc-no-ceph SUCCESS in 3h 08m 54s
✔️ adoption-docs-preview SUCCESS in 1m 19s

@fultonj
Contributor Author

fultonj commented Dec 10, 2025

recheck

@fultonj fultonj requested a review from a team December 15, 2025 20:27
Contributor

@klgill klgill left a comment

@fultonj I added a few minor style edits.

----
$ oc exec openstackclient -- openstack network agent list
----
+
Contributor

Suggested change
+

@fultonj fultonj force-pushed the neutron_metadata_cleanup branch 2 times, most recently from 462d320 to 49e3b36 Compare December 16, 2025 02:22
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: John Fulton <fulton@redhat.com>
@fultonj fultonj force-pushed the neutron_metadata_cleanup branch from 49e3b36 to 0e2c00e Compare December 16, 2025 02:23
@fultonj
Contributor Author

fultonj commented Dec 16, 2025

> @fultonj I added a few minor style edits.

Thanks @klgill I've applied your suggestions.

@fultonj fultonj requested a review from klgill December 16, 2025 02:33
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/9459d2e5676a45b39ef5647745ee0575

✔️ noop SUCCESS in 0s
✔️ adoption-standalone-to-crc-ceph SUCCESS in 3h 00m 02s
adoption-standalone-to-crc-no-ceph FAILURE in 1h 35m 21s
✔️ adoption-docs-preview SUCCESS in 1m 19s

Contributor

@klgill klgill left a comment

/lgtm
