@damdo
Member

@damdo damdo commented Dec 11, 2025

Summary

This PR adds a new monitor test crd-version-checker that tracks CustomResourceDefinition (CRD) changes during cluster upgrades. The test captures CRD snapshots before and after an upgrade, generates a summary of changes, and validates that new API versions follow safe upgrade practices.

Motivation

When new API versions are introduced to CRDs during an upgrade, it's important that:

  • The storage version remains the previous version (not the new one) until data migration is complete
  • Both old and new versions are served to maintain compatibility

This test helps catch potential issues where a new API version is immediately set as the storage version, which could cause problems with existing data stored in etcd.

What does this include

Summary Generation

  • This test writes a JSON summary file (crd-version-summary.json) to storage for inspection
  • Computes differences between before/after snapshots:
      • Added CRDs
      • Removed CRDs
      • Changed CRDs (with details on added/removed versions and storage changes)
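
As an illustration of the diff described above, here is a minimal sketch of how the three categories could be computed from two snapshots. The type and function names (CRDSnapshot, CRDChange, Summary, Diff) and the JSON field names are assumptions made for this example; they are not taken from the PR's code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// CRDSnapshot is a hypothetical, simplified view of one CRD: the names of
// its versions and the single version marked as storage.
type CRDSnapshot struct {
	Versions []string `json:"versions"`
	Storage  string   `json:"storage"`
}

// CRDChange describes how one CRD differs between the two snapshots.
type CRDChange struct {
	Name            string   `json:"name"`
	AddedVersions   []string `json:"addedVersions,omitempty"`
	RemovedVersions []string `json:"removedVersions,omitempty"`
	StorageBefore   string   `json:"storageBefore"`
	StorageAfter    string   `json:"storageAfter"`
}

// Summary mirrors the three categories listed above.
type Summary struct {
	Added   []string    `json:"added,omitempty"`
	Removed []string    `json:"removed,omitempty"`
	Changed []CRDChange `json:"changed,omitempty"`
}

// missing returns the sorted elements of a that are not present in b.
func missing(a, b []string) []string {
	inB := make(map[string]struct{}, len(b))
	for _, v := range b {
		inB[v] = struct{}{}
	}
	var out []string
	for _, v := range a {
		if _, ok := inB[v]; !ok {
			out = append(out, v)
		}
	}
	sort.Strings(out)
	return out
}

// Diff compares two snapshots keyed by CRD name.
func Diff(before, after map[string]CRDSnapshot) Summary {
	var s Summary
	for name, a := range after {
		b, existed := before[name]
		if !existed {
			s.Added = append(s.Added, name)
			continue
		}
		c := CRDChange{
			Name:            name,
			AddedVersions:   missing(a.Versions, b.Versions),
			RemovedVersions: missing(b.Versions, a.Versions),
			StorageBefore:   b.Storage,
			StorageAfter:    a.Storage,
		}
		if len(c.AddedVersions) > 0 || len(c.RemovedVersions) > 0 || b.Storage != a.Storage {
			s.Changed = append(s.Changed, c)
		}
	}
	for name := range before {
		if _, ok := after[name]; !ok {
			s.Removed = append(s.Removed, name)
		}
	}
	// Sort for deterministic output; map iteration order is random in Go.
	sort.Strings(s.Added)
	sort.Strings(s.Removed)
	sort.Slice(s.Changed, func(i, j int) bool { return s.Changed[i].Name < s.Changed[j].Name })
	return s
}

func main() {
	before := map[string]CRDSnapshot{
		"widgets.example.com": {Versions: []string{"v1beta1"}, Storage: "v1beta1"},
		"gadgets.example.com": {Versions: []string{"v1"}, Storage: "v1"},
	}
	after := map[string]CRDSnapshot{
		"widgets.example.com": {Versions: []string{"v1beta1", "v1"}, Storage: "v1beta1"},
		"gizmos.example.com":  {Versions: []string{"v1"}, Storage: "v1"},
	}
	out, err := json.MarshalIndent(Diff(before, after), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

In this sample data, gizmos.example.com is reported as added, gadgets.example.com as removed, and widgets.example.com as changed with v1 as an added version and an unchanged storage version.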

Validation Checks

This test defines the following checks:

  • [sig-api-machinery] CRDs with new API versions should not change storage version immediately
    Fails if a CRD has a new version added AND that version immediately becomes the storage version
    This check ensures safe API version introduction practices

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci bot requested review from deads2k and p0lyn0mial December 11, 2025 21:05
@openshift-ci
Contributor

openshift-ci bot commented Dec 11, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: damdo
Once this PR has been reviewed and has the lgtm label, please assign xueqzhan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-ci
Contributor

openshift-ci bot commented Dec 12, 2025

@damdo: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-aws-ovn-microshift | 1758db2 | link | true | /test e2e-aws-ovn-microshift
ci/prow/e2e-aws-ovn-serial-1of2 | 1758db2 | link | true | /test e2e-aws-ovn-serial-1of2

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-trt

openshift-trt bot commented Dec 12, 2025

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: 1758db2

Job Name | New Test Risk
pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade | Medium - "[Monitor:crd-version-checker][sig-api-machinery] CRDs with new API versions should not change storage version immediately" is a new test, and was only seen in one job.

New tests seen in this PR at sha: 1758db2

  • "[Monitor:crd-version-checker][Jira:"kube-apiserver"] monitor test crd-version-checker cleanup" [Total: 12, Pass: 12, Fail: 0, Flake: 0]
  • "[Monitor:crd-version-checker][Jira:"kube-apiserver"] monitor test crd-version-checker collection" [Total: 10, Pass: 10, Fail: 0, Flake: 0]
  • "[Monitor:crd-version-checker][Jira:"kube-apiserver"] monitor test crd-version-checker interval construction" [Total: 12, Pass: 12, Fail: 0, Flake: 0]
  • "[Monitor:crd-version-checker][Jira:"kube-apiserver"] monitor test crd-version-checker preparation" [Total: 12, Pass: 12, Fail: 0, Flake: 0]
  • "[Monitor:crd-version-checker][Jira:"kube-apiserver"] monitor test crd-version-checker setup" [Total: 10, Pass: 10, Fail: 0, Flake: 0]
  • "[Monitor:crd-version-checker][Jira:"kube-apiserver"] monitor test crd-version-checker test evaluation" [Total: 12, Pass: 12, Fail: 0, Flake: 0]
  • "[Monitor:crd-version-checker][Jira:"kube-apiserver"] monitor test crd-version-checker writing to storage" [Total: 12, Pass: 12, Fail: 0, Flake: 0]
  • "[Monitor:crd-version-checker][sig-api-machinery] CRDs with new API versions should not change storage version immediately" [Total: 1, Pass: 1, Fail: 0, Flake: 0]

@damdo
Member Author

damdo commented Dec 12, 2025

@damdo
Member Author

damdo commented Dec 12, 2025

/assign @JoelSpeed

}

// CRDCondition captures the condition information for a CRD.
type CRDCondition struct {
Contributor

Can't we use a metav1.Condition here?

if err != nil {
	return fmt.Errorf("unable to determine if cluster is MicroShift: %v", err)
}
if isMicroShift {
Contributor

Why don't we support microshift?

Comment on lines +315 to +316
beforeVersions := make(map[string]bool)
afterVersions := make(map[string]bool)
Contributor

Nit: if you're using this only as a set, map[string]struct{} is more idiomatic and more efficient.
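
For reference, a minimal sketch of the set idiom suggested here. The helper name versionSet and the sample version names are hypothetical and do not come from the PR.

```go
package main

import "fmt"

// versionSet builds a set of version names. The empty struct value has zero
// size, so the map carries only its keys.
func versionSet(names []string) map[string]struct{} {
	set := make(map[string]struct{}, len(names))
	for _, n := range names {
		set[n] = struct{}{}
	}
	return set
}

func main() {
	before := versionSet([]string{"v1beta1"})
	for _, v := range []string{"v1beta1", "v1"} {
		// Membership is tested with the two-value lookup form.
		if _, ok := before[v]; !ok {
			fmt.Println("new version:", v) // prints "new version: v1"
		}
	}
}
```

Compared with map[string]bool, this form also rules out the ambiguous state of a key that is present but set to false.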

Comment on lines +376 to +380
// Rationale: When a new API version is introduced during an upgrade, existing
// data in etcd is still stored in the old format. Setting the new version as
// the storage version immediately would require a migration. The safe approach
// is to serve both versions but keep the old version as storage until all
// existing objects have been migrated.
Contributor

This is not correct. You have to change the storage version before you can run the storage version migration.

The reason we want the storage version to remain the old version for one release is so that, if you roll back for any reason, the old schema can still decode the objects in etcd. If you switched the storage version immediately and then rolled back, any object written between the upgrade and the rollback would no longer be decodable.
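
The rollback scenario described here can be pictured with a toy model. This is only a sketch of the reasoning: canDecode and the version names are hypothetical, and the function does not represent the apiserver's actual decoding path.

```go
package main

import "fmt"

// canDecode reports whether a release that knows the given schema versions
// can read an object persisted at storedVersion. Toy model only.
func canDecode(knownVersions []string, storedVersion string) bool {
	for _, v := range knownVersions {
		if v == storedVersion {
			return true
		}
	}
	return false
}

func main() {
	// The CRD as shipped by the old release, which is what a rollback restores.
	oldRelease := []string{"v1beta1"}

	// Case 1: the new release adds v1 but keeps v1beta1 as the storage
	// version, so objects written after the upgrade are stored as v1beta1.
	fmt.Println(canDecode(oldRelease, "v1beta1")) // true

	// Case 2: the new release switches storage to v1 immediately, so objects
	// written after the upgrade are stored as v1.
	fmt.Println(canDecode(oldRelease, "v1")) // false
}
```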

Comment on lines +384 to +385
// - The new version is marked as the storage version
// - The old version is no longer the storage version
Contributor

Only one version can be the storage version, so this sounds tautological.

return []*junitapi.JUnitTestCase{{
	Name: testName,
	SkipMessage: &junitapi.SkipMessage{
		Message: "Missing CRD snapshots, cannot perform check",
Contributor

Do we always return a skip when not running during an upgrade? Is that what we want, versus just outputting an empty JUnit? I'm not sure what the correct etiquette would be here.

afterStorage := getStorageVersion(afterCRD)

// Identify newly added versions
beforeVersionSet := make(map[string]bool)
Contributor

Nit: idiomatically should be map[string]struct{}


// Check if a new version became the storage version
for _, newVersion := range newVersions {
	if afterStorage == newVersion && beforeStorage != newVersion {
Contributor

The first part of this if statement is sufficient. If it's a newly added version and the storage version matches it, we can fail.

We don't need to check the second half: how could the previous storage version be a newly added version? I don't think the second check can ever be false.
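
A minimal sketch of the simplified check, assuming the before snapshot is a valid CRD whose storage version is one of its own versions. The function name newStorageVersion and its signature are hypothetical, not the PR's code.

```go
package main

import "fmt"

// newStorageVersion returns the newly added version that became the storage
// version, or "" if no newly added version did.
func newStorageVersion(beforeVersions, afterVersions []string, afterStorage string) string {
	seen := make(map[string]struct{}, len(beforeVersions))
	for _, v := range beforeVersions {
		seen[v] = struct{}{}
	}
	for _, v := range afterVersions {
		if _, existed := seen[v]; existed {
			continue
		}
		// v did not exist before, so it cannot have been the previous
		// storage version; comparing against afterStorage alone is enough.
		if v == afterStorage {
			return v
		}
	}
	return ""
}

func main() {
	before := []string{"v1beta1"}
	after := []string{"v1beta1", "v1"}

	fmt.Printf("%q\n", newStorageVersion(before, after, "v1beta1")) // "" (safe)
	fmt.Printf("%q\n", newStorageVersion(before, after, "v1"))      // "v1" (should fail)
}
```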
