
WIP AGENT-1428: Add e2e extended tests for NoRegistryClusterInstall#30824

Draft
pawanpinjarkar wants to merge 1 commit into openshift:main from pawanpinjarkar:no-registry-cluster-install-e2e-extended-tests

Conversation

@pawanpinjarkar

@pawanpinjarkar pawanpinjarkar commented Mar 2, 2026

WIP

Summary by CodeRabbit

  • Tests
    • Added a comprehensive extended test suite for the InternalReleaseImage feature: validates resource presence and status, registry accessibility from masters and workloads, disruption and recovery behavior, deletion protection, post-install attributes, HA across masters, and dynamic node scaling scenarios.
    • Included the new suite in the extended test set so it runs as part of extended testing.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 2, 2026
@openshift-ci-robot
Copy link

openshift-ci-robot commented Mar 2, 2026

@pawanpinjarkar: This pull request references AGENT-1428 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

WIP

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Mar 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 2d74b43 and 23e3532.

📒 Files selected for processing (2)
  • test/extended/include.go
  • test/extended/internalreleaseimage/internalreleaseimage.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/extended/include.go
  • test/extended/internalreleaseimage/internalreleaseimage.go

Walkthrough

Adds a new Ginkgo-based extended test suite for InternalReleaseImage (IRI) and registers it via a blank import; tests exercise IRI presence, status/conditions, internal registry accessibility, reconciliation after disruption, admission-policy deletion protection, HA across masters, workload image pulls, and MachineSet scaling.

Changes

Cohort / File(s) — Summary

Test Registration — test/extended/include.go
Added a blank import to register the new internalreleaseimage extended tests. Minimal change (single-line import).

InternalReleaseImage Test Suite — test/extended/internalreleaseimage/internalreleaseimage.go
New ~668-line Ginkgo/Gomega test file implementing end-to-end tests and helpers: feature-gate checks, IRI status/conditions, image digest extraction, registry reachability, master/node interactions, disruption/reconciliation loops, ValidatingAdmissionPolicy deletion checks, workload image-pull tests, MachineSet scale up/down with cleanup, and HA verification across masters. Review focus: heavy test logic, API interactions, and network/cluster disruption sequences.

Sequence Diagram(s)

sequenceDiagram
    participant Tester as Test Runner
    participant API as Kubernetes API
    participant Registry as Internal Registry
    participant Masters as Master Nodes
    participant Workers as Worker Nodes

    Tester->>API: verify feature gate & list IRI resources
    API-->>Tester: return IRI object & status
    Tester->>Registry: HTTP health / manifest requests
    Registry-->>Tester: HTTP 200 / manifest data
    Tester->>Masters: probe registry reachability on each master
    Masters-->>Tester: HTTP/ping responses
    Tester->>API: simulate disruption (delete/taint resources)
    API-->>Tester: acknowledge deletion/taint
    Tester->>API: wait and verify reconciliation/restoration
    API-->>Tester: restored MachineConfigs/Pods/status
    Tester->>API: scale MachineSet (create worker)
    API-->>Workers: provision new node
    Workers-->>Tester: join cluster and pull images from Registry

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 20.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Test Structure And Quality — ⚠️ Warning: the test block combines multiple unrelated behaviors (resource listing, status validation, version matching, MachineConfig ownership) into a single It block, violating the single-responsibility principle; Expect calls lack custom failure messages and there is no AfterEach cleanup. Resolution: refactor each distinct validation into a separate It block, add BeforeEach/AfterEach hooks for cluster-scoped resource setup/teardown, ensure all assertions include descriptive messages, and align timeout/wait patterns with repository conventions.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the title directly and accurately summarizes the main change: adding e2e extended tests for the NoRegistryClusterInstall feature.
  • Stable And Deterministic Test Names — ✅ Passed: all test titles in the new test file are static and deterministic, containing no dynamic content like pod names, timestamps, UUIDs, node names, or IP addresses that would change between test runs.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 2, 2026
@openshift-ci

openshift-ci bot commented Mar 2, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

openshift-ci-robot commented Mar 2, 2026

@pawanpinjarkar: This pull request references AGENT-1428 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

WIP

Summary by CodeRabbit

  • Tests
  • Added comprehensive test suite for InternalReleaseImage feature, validating resource configuration, registry accessibility, disruption recovery, deletion protection, and cluster integration scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
test/extended/internalreleaseimage/internalreleaseimage.go (3)

149-175: Test title doesn't match actual validation.

The test is titled "be accessible on all master nodes" but only verifies that master nodes have internal IPs. It does not actually test registry accessibility. Consider either:

  1. Renaming to "have valid internal IPs on all master nodes", or
  2. Adding actual registry accessibility checks using checkIRIRegistryAccessible
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 149
- 175, The test labeled in g.It("be accessible on all master nodes
[apigroup:machineconfiguration.openshift.io]") only verifies master node
internal IPs (it iterates masterNodes.Items and checks nodeIP) but doesn't
actually test registry accessibility; either rename the test string to reflect
IP validation (e.g., "have valid internal IPs on all master nodes") or extend
the loop to call the existing checkIRIRegistryAccessible function for each
master nodeIP (use nodeIP and node.Name from the masterNodes iteration) and
assert its success so the test truly validates registry accessibility.

700-709: Registry /v2/ endpoint may return 401 Unauthorized without credentials.

Per the Docker Registry HTTP API v2 spec, the /v2/ endpoint can return 401 Unauthorized when authentication is required, which still indicates the registry is accessible. Restricting to only 200 OK may cause false negatives.

Suggested fix
-	if resp.StatusCode != http.StatusOK {
+	// Registry v2 API returns 200 OK or 401 Unauthorized (if auth required)
+	// Both indicate the registry is accessible
+	if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusUnauthorized {
 		e2e.Logf("Registry responded with unexpected status %d at %s", resp.StatusCode, registryURL)
 		return false
 	}
+
+	// Only check API version header on successful response
+	if resp.StatusCode == http.StatusOK {
+		apiVersion := resp.Header.Get("Docker-Distribution-Api-Version")
+		if apiVersion != "registry/2.0" {
+			e2e.Logf("Registry API version mismatch at %s: %s", registryURL, apiVersion)
+			return false
+		}
+	}
 
-	apiVersion := resp.Header.Get("Docker-Distribution-Api-Version")
-	if apiVersion != "registry/2.0" {
-		e2e.Logf("Registry API version mismatch at %s: %s", registryURL, apiVersion)
-		return false
-	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 700
- 709, The current check treats any non-200 response as failure; change the
logic so that resp.StatusCode values 200 OK and 401 Unauthorized are considered
successful accessibility responses (treat other statuses as failure and log them
using registryURL and resp.StatusCode), and only validate the
Docker-Distribution-Api-Version header (apiVersion) when resp.StatusCode ==
http.StatusOK; if resp.StatusCode == http.StatusUnauthorized, log that auth is
required and return true without checking apiVersion (use the resp, registryURL,
and apiVersion symbols to locate and update the code).

371-372: Replace time.Sleep with polling to avoid flakiness.

A fixed sleep of 10 seconds may be insufficient in some environments or wasteful in others. Consider polling until the registry is confirmed unavailable on the target node.

Suggested approach
 		g.By("Waiting for registry to be unavailable on target node")
-		time.Sleep(10 * time.Second)
+		err = wait.PollUntilContextTimeout(context.Background(), 2*time.Second, 30*time.Second, true,
+			func(_ context.Context) (bool, error) {
+				accessible := checkIRIRegistryAccessible(targetIP)
+				return !accessible, nil
+			})
+		o.Expect(err).NotTo(o.HaveOccurred(), "Registry should become unavailable on target node")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 371
- 372, The fixed time.Sleep(10 * time.Second) after g.By("Waiting for registry
to be unavailable on target node") should be replaced with a polling loop (e.g.,
wait.PollImmediate or a simple for/select with time.Ticker and timeout) that
repeatedly checks the registry reachability and returns as soon as it's
unavailable; replace the time.Sleep call with a call to a helper like
checkRegistryUnavailableOnNode(node) (or inline an HTTP/TCP probe) and poll
until success or timeout, logging progress with g.By and failing the test if the
timeout elapses.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/internalreleaseimage/internalreleaseimage.go`:
- Around line 680-684: The TLS client setup in the http.Client transport sets
InsecureSkipVerify but does not set a minimum TLS version; update the TLS config
used in the Transport (the tls.Config instance inside the http.Client
initialization in this file) to include MinVersion: tls.VersionTLS12 (or
tls.VersionTLS13 if you prefer stricter) so the connection enforces a minimum
TLS protocol version while still keeping InsecureSkipVerify for the test
registry.
- Around line 533-540: The scaling call incorrectly uses initialWorkerCount
(total cluster workers) instead of the selected MachineSet's replica count; fix
by reading the MachineSet's current spec.replicas (e.g., via
oc.AsAdmin().Run("get").Args("machineset", machineSetName, "-n",
"openshift-machine-api", "-o", "jsonpath={.spec.replicas}") or by fetching the
MachineSet object in code) and compute targetReplicas := currentReplicas + 1,
then pass fmt.Sprintf("--replicas=%d", targetReplicas) to
oc.AsAdmin().Run("scale").Args(...). Ensure you reference machineSetName and
replace initialWorkerCount in the scale invocation.

---

Nitpick comments:
In `@test/extended/internalreleaseimage/internalreleaseimage.go`:
- Around line 149-175: The test labeled in g.It("be accessible on all master
nodes [apigroup:machineconfiguration.openshift.io]") only verifies master node
internal IPs (it iterates masterNodes.Items and checks nodeIP) but doesn't
actually test registry accessibility; either rename the test string to reflect
IP validation (e.g., "have valid internal IPs on all master nodes") or extend
the loop to call the existing checkIRIRegistryAccessible function for each
master nodeIP (use nodeIP and node.Name from the masterNodes iteration) and
assert its success so the test truly validates registry accessibility.
- Around line 700-709: The current check treats any non-200 response as failure;
change the logic so that resp.StatusCode values 200 OK and 401 Unauthorized are
considered successful accessibility responses (treat other statuses as failure
and log them using registryURL and resp.StatusCode), and only validate the
Docker-Distribution-Api-Version header (apiVersion) when resp.StatusCode ==
http.StatusOK; if resp.StatusCode == http.StatusUnauthorized, log that auth is
required and return true without checking apiVersion (use the resp, registryURL,
and apiVersion symbols to locate and update the code).
- Around line 371-372: The fixed time.Sleep(10 * time.Second) after
g.By("Waiting for registry to be unavailable on target node") should be replaced
with a polling loop (e.g., wait.PollImmediate or a simple for/select with
time.Ticker and timeout) that repeatedly checks the registry reachability and
returns as soon as it's unavailable; replace the time.Sleep call with a call to
a helper like checkRegistryUnavailableOnNode(node) (or inline an HTTP/TCP probe)
and poll until success or timeout, logging progress with g.By and failing the
test if the timeout elapses.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 02e33c2 and 1e880cb.

📒 Files selected for processing (2)
  • test/extended/include.go
  • test/extended/internalreleaseimage/internalreleaseimage.go

@openshift-ci

openshift-ci bot commented Mar 2, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pawanpinjarkar
Once this PR has been reviewed and has the lgtm label, please assign xueqzhan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pawanpinjarkar pawanpinjarkar changed the title AGENT-1428: Add e2e extended tests for NoRegistryClusterInstall WIP AGENT-1428: Add e2e extended tests for NoRegistryClusterInstall Mar 2, 2026
@pawanpinjarkar pawanpinjarkar force-pushed the no-registry-cluster-install-e2e-extended-tests branch from 1e880cb to d70e8e0 Compare March 2, 2026 21:36
@openshift-ci-robot

openshift-ci-robot commented Mar 2, 2026

@pawanpinjarkar: This pull request references AGENT-1428 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

WIP

Summary by CodeRabbit

  • Tests
  • Added a comprehensive extended test suite for the InternalReleaseImage feature: validates resource presence and status, registry accessibility from masters and workloads, disruption and recovery behavior, deletion protection, post-install attributes, HA across masters, and dynamic node scaling scenarios.
  • Expanded the set of included extended tests to ensure these checks run.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (2)
test/extended/internalreleaseimage/internalreleaseimage.go (2)

691-694: ⚠️ Potential issue | 🟡 Minor

Set TLS minimum version explicitly in client config.

Line 693 sets InsecureSkipVerify but omits MinVersion; set a minimum protocol version explicitly.

Proposed fix
 	client := &http.Client{
 		Transport: &http.Transport{
-			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
+			TLSClientConfig: &tls.Config{
+				InsecureSkipVerify: true,
+				MinVersion:         tls.VersionTLS12,
+			},
 		},
 		Timeout: 10 * time.Second,
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 691
- 694, The TLS config used for the http.Client sets InsecureSkipVerify but omits
a minimum TLS version; update the tls.Config in the http.Transport (the client
variable) to include MinVersion: tls.VersionTLS12 (use tls.VersionTLS12) so the
config explicitly enforces TLS 1.2 as the minimum while keeping
InsecureSkipVerify as currently set.

544-551: ⚠️ Potential issue | 🟠 Major

Scale target uses total worker count instead of selected MachineSet replicas.

Line 548 should scale from the MachineSet’s current replica count, not cluster-wide worker count. Otherwise, it can overscale/underscale when multiple MachineSets exist.

Proposed fix
+		currentReplicaCount, err := strconv.Atoi(strings.TrimSpace(currentReplicas))
+		o.Expect(err).NotTo(o.HaveOccurred(), "Failed to parse MachineSet replicas: %q", currentReplicas)
+
 		g.By("Scaling MachineSet up by 1")
 		_, err = oc.AsAdmin().Run("scale").Args(
 			"machineset", machineSetName,
 			"-n", "openshift-machine-api",
-			fmt.Sprintf("--replicas=%d", initialWorkerCount+1),
+			fmt.Sprintf("--replicas=%d", currentReplicaCount+1),
 		).Output()
 		o.Expect(err).NotTo(o.HaveOccurred(), "Failed to scale MachineSet")
-		e2e.Logf("Scaled MachineSet %s to %d replicas", machineSetName, initialWorkerCount+1)
+		e2e.Logf("Scaled MachineSet %s to %d replicas", machineSetName, currentReplicaCount+1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 544
- 551, The scale call is using initialWorkerCount (cluster-wide workers) instead
of the MachineSet's current replica count, which can mis-scale when multiple
MachineSets exist; update the logic around machineSetName to first query the
MachineSet's current replicas (e.g., via
oc.AsAdmin().Run("get").Args("machineset", machineSetName, "-n",
"openshift-machine-api", "-o", "jsonpath={.spec.replicas}") or equivalent),
parse that value to an int, compute desiredReplicas = currentReplicas + 1, and
then pass fmt.Sprintf("--replicas=%d", desiredReplicas) to the
oc.AsAdmin().Run("scale").Args(...) call (replace usage of initialWorkerCount),
ensuring error handling for the get/parse steps.
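The parse-then-increment step the fix describes can be sketched in isolation. `nextReplicas` is a hypothetical helper: it takes the raw jsonpath output for the selected MachineSet's .spec.replicas and computes the target, so scaling is relative to that MachineSet rather than the cluster-wide worker count.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nextReplicas parses the raw `oc get machineset ... -o jsonpath={.spec.replicas}`
// output and returns current+1 for the scale-up.
func nextReplicas(jsonpathOutput string) (int, error) {
	current, err := strconv.Atoi(strings.TrimSpace(jsonpathOutput))
	if err != nil {
		return 0, fmt.Errorf("parsing MachineSet replicas %q: %w", jsonpathOutput, err)
	}
	return current + 1, nil
}

func main() {
	// oc output typically carries a trailing newline; TrimSpace handles it.
	n, err := nextReplicas("3\n")
	fmt.Println(n, err) // 4 <nil>
}
```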
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/internalreleaseimage/internalreleaseimage.go`:
- Around line 611-620: The test currently only checks newNode.Status.Conditions
for NodeReady but does not perform the promised registry connectivity check; add
an active probe that runs from the new worker node to the registry endpoint on
port 22625 (e.g. create a short-lived Pod scheduled onto newNode.Name or exec
into a probe Pod on that node) that attempts a TCP connect or HTTP request to
the registry host:22625, waits for a successful response/exit code, and assert
that the probe succeeded before or in addition to the existing
o.Expect(nodeReady) check; reference newNode.Name for scheduling and the
registry endpoint/port (22625) when implementing the probe and teardown.
- Around line 678-683: The current filter for machineconfigs uses only
OwnerReferences[0], which is brittle; update the loop that iterates mcList.Items
(where mc.OwnerReferences and iriMCs and IRIResourceName are referenced) to scan
all entries in mc.OwnerReferences and append mc.Name to iriMCs if any ownerRef
has Kind == "InternalReleaseImage" and Name == IRIResourceName, breaking after
append to avoid duplicate adds; remove reliance on indexing and the
len(mc.OwnerReferences) check since the for-range handles empty slices.
- Around line 372-380: The test currently sleeps 10s then only checks other
masters, missing an assertion that the target master actually went down; replace
the fixed time.Sleep with a polling/timeout loop that repeatedly calls
checkIRIRegistryAccessible(masterIPs[0] or the target IP variable) until it
returns false (target unreachable) or a timeout is reached, and fail the test if
the target remains reachable after the timeout; once the target unreachability
is asserted, proceed to verify peers using the existing loop that calls
checkIRIRegistryAccessible(masterIPs[i]) and expects true so HA is validated.
- Around line 508-533: The current check uses the raw JSON string in variable
machinesets to decide whether MachineSets exist, which is unreliable; instead
parse or directly query for an item and skip based on a parsed/queried result:
replace the len(machinesets) == 0 check by either decoding the JSON output into
a struct/slice and verifying len(items) > 0 or by running a jsonpath query (like
the existing jsonpath used to get machineSetName) and skipping when
machineSetName is empty; update the logic around machinesets and machineSetName
(the oc.AsAdmin().Run("get") call that fetches machinesets and the subsequent
jsonpath call that assigns machineSetName) so the existence check is performed
on the parsed result (or machineSetName) before proceeding.
- Around line 150-176: The test titled "be accessible on all master nodes..."
only verifies discovery of NodeInternalIP (masterNodes, nodeIP) but does not
actually probe registry reachability; modify the loop in the test to call an
active probe function (e.g., checkIRIRegistryAccessible(nodeIP)) for each master
after resolving nodeIP and assert its success, ensuring any errors are surfaced
via o.Expect or t/Fatal equivalents; if checkIRIRegistryAccessible does not
exist, implement it to perform a network probe (HTTP/TCP as appropriate) and
return an error/boolean so the test can fail when a master cannot reach the IRI
registry.

---

Duplicate comments:
In `@test/extended/internalreleaseimage/internalreleaseimage.go`:
- Around line 691-694: The TLS config used for the http.Client sets
InsecureSkipVerify but omits a minimum TLS version; update the tls.Config in the
http.Transport (the client variable) to include MinVersion: tls.VersionTLS12
(use tls.VersionTLS12) so the config explicitly enforces TLS 1.2 as the minimum
while keeping InsecureSkipVerify as currently set.
- Around line 544-551: The scale call is using initialWorkerCount (cluster-wide
workers) instead of the MachineSet's current replica count, which can mis-scale
when multiple MachineSets exist; update the logic around machineSetName to first
query the MachineSet's current replicas (e.g., via
oc.AsAdmin().Run("get").Args("machineset", machineSetName, "-n",
"openshift-machine-api", "-o", "jsonpath={.spec.replicas}") or equivalent),
parse that value to an int, compute desiredReplicas = currentReplicas + 1, and
then pass fmt.Sprintf("--replicas=%d", desiredReplicas) to the
oc.AsAdmin().Run("scale").Args(...) call (replace usage of initialWorkerCount),
ensuring error handling for the get/parse steps.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 1e880cb and d70e8e0.

📒 Files selected for processing (2)
  • test/extended/include.go
  • test/extended/internalreleaseimage/internalreleaseimage.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/extended/include.go

Comment on lines +611 to +620
g.By(fmt.Sprintf("Verifying new worker node %s can access IRI registry", newNode.Name))
nodeReady := false
for _, condition := range newNode.Status.Conditions {
	if condition.Type == corev1.NodeReady && condition.Status == corev1.ConditionTrue {
		nodeReady = true
		break
	}
}
o.Expect(nodeReady).To(o.BeTrue(), "New worker node should be Ready, indicating successful registry access")

Copy link


⚠️ Potential issue | 🟠 Major

Connectivity claim is not validated; only NodeReady is checked.

Line 611 says registry connectivity is being verified, but Lines 612-620 only assert NodeReady. Add an active connectivity probe from the new worker node to the registry endpoint on port 22625.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 611
- 620, The test currently only checks newNode.Status.Conditions for NodeReady
but does not perform the promised registry connectivity check; add an active
probe that runs from the new worker node to the registry endpoint on port 22625
(e.g. create a short-lived Pod scheduled onto newNode.Name or exec into a probe
Pod on that node) that attempts a TCP connect or HTTP request to the registry
host:22625, waits for a successful response/exit code, and assert that the probe
succeeded before or in addition to the existing o.Expect(nodeReady) check;
reference newNode.Name for scheduling and the registry endpoint/port (22625)
when implementing the probe and teardown.

@pawanpinjarkar pawanpinjarkar force-pushed the no-registry-cluster-install-e2e-extended-tests branch from d70e8e0 to 2d74b43 Compare March 2, 2026 22:07
@openshift-ci-robot

openshift-ci-robot commented Mar 2, 2026

@pawanpinjarkar: This pull request references AGENT-1428 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

WIP

Summary by CodeRabbit

  • Tests
  • Added a comprehensive extended test suite for the InternalReleaseImage feature: validates resource presence and status, registry accessibility from masters and workloads, disruption and recovery behavior, deletion protection, post-install attributes, HA across masters, and dynamic node scaling scenarios.
  • Expanded the set of included extended tests so the new suite is executed as part of extended testing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (4)
test/extended/internalreleaseimage/internalreleaseimage.go (4)

313-320: ⚠️ Potential issue | 🟠 Major

Assert target-node registry outage before validating HA peers.

Line 314 uses a fixed sleep, and the test never verifies the disrupted node actually became unreachable. This can pass even when disruption failed.

Proposed fix
 		g.By("Waiting for registry to be unavailable on target node")
-		time.Sleep(10 * time.Second)
+		err = wait.PollUntilContextTimeout(context.Background(), 2*time.Second, 30*time.Second, true,
+			func(context.Context) (bool, error) {
+				return !checkIRIRegistryAccessible(targetIP), nil
+			})
+		o.Expect(err).NotTo(o.HaveOccurred(),
+			"Target node registry should become unavailable after service stop")
 
 		g.By("Verifying registry is still accessible on other master nodes")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 313
- 320, The test currently uses a fixed time.Sleep and never asserts the
disrupted target node actually went down; replace the sleep with a polling loop
that repeatedly calls checkIRIRegistryAccessible(masterIPs[0]) until it returns
false (with a configurable timeout and interval) and use o.Expect(false) or
o.Expect(accessible).To(o.BeFalse()) to fail the test if the node remains
reachable after the timeout; once the loop confirms masterIPs[0] is unreachable,
proceed to the existing loop that verifies other masters via
checkIRIRegistryAccessible(masterIPs[i]); keep references to
checkIRIRegistryAccessible and masterIPs so the change is local to this block.

556-565: ⚠️ Potential issue | 🟠 Major

The “registry connectivity” claim is not validated.

Lines 557-565 only re-check NodeReady; they don’t probe port 22625 from the new worker node. This can report success without proving registry reachability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 556
- 565, The current check only inspects newNode.Status.Conditions (the NodeReady
loop around newNode) but does not verify registry connectivity on port 22625;
replace or augment that validation by executing a network probe from the new
node (for example schedule or exec a short pod onto the node or create a
Job/DaemonSet targeted to newNode) that attempts a TCP connection to the
registry host:22625 and assert success; reference the existing newNode object
and the NodeReady check to gate running the probe (i.e., once newNode is Ready
run the probe) and fail the test if the TCP dial from the node to port 22625
fails.
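The probe the review asks for can be kept simple: a TCP dial to the registry port with a timeout. The sketch below is illustrative, not the test's actual helpers; it uses a local listener as a stand-in for the registry on port 22625, since in the real test the dial would have to be executed from (or exec'd on) the new worker node.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"time"
)

// probeTCP reports whether a TCP connection to host:port succeeds
// within the timeout. Helper name and signature are hypothetical.
func probeTCP(host string, port int, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(host, strconv.Itoa(port)), timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// demo stands up a local listener in place of the registry, probes it
// while open, then closes it and probes again.
func demo() (up, down bool) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, false
	}
	port := ln.Addr().(*net.TCPAddr).Port
	up = probeTCP("127.0.0.1", port, 2*time.Second)
	ln.Close()
	down = probeTCP("127.0.0.1", port, 500*time.Millisecond)
	return up, down
}

func main() {
	up, down := demo()
	fmt.Println(up, down) // prints "true false"
}
```

In the e2e context the same dial would be wrapped in a pod or `oc debug node/...` invocation targeted at the new node, with the Ready check gating when the probe runs.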

449-457: ⚠️ Potential issue | 🟠 Major

MachineSet support check is ineffective on raw JSON length.

Line 455 checks len(machinesets) == 0, but JSON output is non-empty even when .items is empty. This can fail late with noisier errors.

Proposed fix
-		g.By("Checking if cluster supports worker scaling")
-		machinesets, err := oc.AsAdmin().Run("get").Args(
-			"machinesets",
-			"-n", "openshift-machine-api",
-			"-o", "json",
-		).Output()
-		if err != nil || len(machinesets) == 0 {
-			g.Skip("Cluster does not support MachineSet scaling (may be bare metal or single node)")
-		}
+		g.By("Selecting a MachineSet to scale")
+		machineSetName, err := oc.AsAdmin().Run("get").Args(
+			"machinesets",
+			"-n", "openshift-machine-api",
+			"-o", "jsonpath={.items[0].metadata.name}",
+		).Output()
+		if err != nil || strings.TrimSpace(machineSetName) == "" {
+			g.Skip("Cluster does not support MachineSet scaling (may be bare metal or single node)")
+		}
+		machineSetName = strings.TrimSpace(machineSetName)
+		e2e.Logf("Selected MachineSet: %s", machineSetName)
 
 		g.By("Getting current worker count")
@@
-		g.By("Selecting a MachineSet to scale")
-		machineSetName, err := oc.AsAdmin().Run("get").Args(
-			"machinesets",
-			"-n", "openshift-machine-api",
-			"-o", "jsonpath={.items[0].metadata.name}",
-		).Output()
-		o.Expect(err).NotTo(o.HaveOccurred(), "Failed to get MachineSet name")
-		o.Expect(machineSetName).ShouldNot(o.BeEmpty(), "MachineSet name should not be empty")
-		e2e.Logf("Selected MachineSet: %s", machineSetName)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 449
- 457, The check using len(machinesets) is unreliable because the JSON string is
non-empty even when .items is empty; after calling
oc.AsAdmin().Run("get").Args(...).Output() (the machinesets variable),
parse/unmarshal that JSON and verify the .items array length (e.g., unmarshal
into map[string]interface{} and check items slice length) or alternatively call
oc.AsAdmin().Run("get").Args("machinesets","-n","openshift-machine-api","-o","jsonpath={.items}").Output()
and ensure the returned items are non-empty before deciding to Skip; update the
condition that currently references machinesets to inspect the parsed .items
result instead.
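The core of this finding is that `oc get ... -o json` returns a non-empty string even when the list has no entries, so a length check on the raw output proves nothing. A minimal sketch of the unmarshal-and-check approach (the `hasItems` helper name is hypothetical):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hasItems reports whether a Kubernetes list response, as returned by
// `oc get ... -o json`, actually contains entries in .items.
func hasItems(raw string) (bool, error) {
	var list struct {
		Items []json.RawMessage `json:"items"`
	}
	if err := json.Unmarshal([]byte(raw), &list); err != nil {
		return false, err
	}
	return len(list.Items) > 0, nil
}

func main() {
	// An empty MachineSet list is still ~45 characters of JSON,
	// so len(output) == 0 never triggers.
	empty := `{"apiVersion":"v1","kind":"List","items":[]}`
	ok, _ := hasItems(empty)
	fmt.Println(len(empty) > 0, ok) // prints "true false"
}
```

The jsonpath variant in the proposed fix sidesteps the parsing entirely, which is why it is the tidier option here.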

158-161: ⚠️ Potential issue | 🟡 Minor

Avoid assuming OwnerReferences[0] is the IRI owner.

Both Line 160 and Lines 624-627 rely on the first owner ref entry. Ordering is not guaranteed, so this is brittle and can cause false negatives.

Proposed fix
-			o.Expect(mc.OwnerReferences[0].Kind).Should(o.Equal("InternalReleaseImage"),
-				"MachineConfig %s should be owned by InternalReleaseImage", mcName)
+			ownedByIRI := false
+			for _, ownerRef := range mc.OwnerReferences {
+				if ownerRef.Kind == "InternalReleaseImage" && ownerRef.Name == IRIResourceName {
+					ownedByIRI = true
+					break
+				}
+			}
+			o.Expect(ownedByIRI).To(o.BeTrue(),
+				"MachineConfig %s should be owned by InternalReleaseImage", mcName)
@@
-	for _, mc := range mcList.Items {
-		if len(mc.OwnerReferences) > 0 &&
-			mc.OwnerReferences[0].Kind == "InternalReleaseImage" &&
-			mc.OwnerReferences[0].Name == IRIResourceName {
-			iriMCs = append(iriMCs, mc.Name)
-		}
-	}
+	for _, mc := range mcList.Items {
+		for _, ownerRef := range mc.OwnerReferences {
+			if ownerRef.Kind == "InternalReleaseImage" && ownerRef.Name == IRIResourceName {
+				iriMCs = append(iriMCs, mc.Name)
+				break
+			}
+		}
+	}

Also applies to: 623-628

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 158
- 161, The test is brittle because it assumes mc.OwnerReferences[0] is the
InternalReleaseImage owner; update the assertions that reference
mc.OwnerReferences (the checks around mc and mcName and the similar block at
lines ~623-628) to instead verify that at least one OwnerReference entry has
Kind == "InternalReleaseImage" (and, if applicable, the expected Name/UID), i.e.
scan mc.OwnerReferences for a matching entry rather than indexing [0]; replace
the index-based assertions with a predicate-based check that confirms existence
of an InternalReleaseImage owner.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between d70e8e0 and 2d74b43.

📒 Files selected for processing (2)
  • test/extended/include.go
  • test/extended/internalreleaseimage/internalreleaseimage.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/extended/include.go

@pawanpinjarkar pawanpinjarkar force-pushed the no-registry-cluster-install-e2e-extended-tests branch from 2d74b43 to 23e3532 on March 2, 2026 at 22:18
@openshift-ci-robot

openshift-ci-robot commented Mar 2, 2026

@pawanpinjarkar: This pull request references AGENT-1428 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

WIP

Summary by CodeRabbit

  • Tests
  • Added a comprehensive extended test suite for the InternalReleaseImage feature: validates resource presence and status, registry accessibility from masters and workloads, disruption and recovery behavior, deletion protection, post-install attributes, HA across masters, and dynamic node scaling scenarios.
  • Included the new suite in the extended test set so it runs as part of extended testing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

e2e.Logf("Status validation complete: all %d releases are valid and healthy", len(iri.Status.Releases))
})

g.It("restore deleted MachineConfigs [Disruptive][apigroup:machineconfiguration.openshift.io]", func() {
Contributor

I don't think this is a Disruptive test, as it does not alter or reboot a node (at least maybe it could be considered serial, but I guess the code is resilient enough to remain as a parallel one)

e2e.Logf("Successfully verified restoration of all %d MachineConfigs", originalCount)
})

g.It("prevent deletion when in use [Disruptive][apigroup:machineconfiguration.openshift.io]", func() {
Contributor

Definitely not a disruptive test

mcClient := oc.MachineConfigurationClient().MachineconfigurationV1alpha1()

g.By("Verifying exactly one IRI resource exists cluster-wide")
iriList, err := mcClient.InternalReleaseImages().List(context.Background(), metav1.ListOptions{})
Contributor

This seems redundant, because just a few lines below the IRI is retrieved by name (which is the correct thing to do)

var oc = exutil.NewCLIWithoutNamespace("no-registry")

g.BeforeEach(func() {
skipIfNoRegistryFeatureNotEnabled(oc)
Contributor

This skip is fine (for now), but not enough. I think the skipping logic should be fine-tuned to avoid running the IRI tests against unsupported platforms or producing false negatives.

  • Once the feature will be promoted as GA, the FG skip test should be removed
  • Currently, the feature is available only for baremetal and none platform types (any other platform can be safely skipped)
  • Having the feature enabled in general is not enough, because it must be activated. In this case the activation is given by the presence of the IRI resource

o.Expect(err).NotTo(o.HaveOccurred(), "Failed to list InternalReleaseImage resources")
o.Expect(iriList.Items).Should(o.HaveLen(1))

g.By("Getting the InternalReleaseImage resource")
Contributor

That's ok, even though as mentioned before this check should be used in the skipping logic

e2e.Logf("IRI resource still exists as expected")
})

g.It("maintain high availability of registry across master nodes [Disruptive][apigroup:machineconfiguration.openshift.io]", func() {
Contributor

I have some real concerns about having this test here (for the feature promotion); let's have a discussion about that.

len(masterIPs)-1, len(masterIPs))
})

g.It("allow workloads to pull images from internal registry [apigroup:machineconfiguration.openshift.io]", func() {
Contributor

Another example of weird phrasing given the previous Describe behavior (it's not the resource that allows that)

o.Expect(iri.Status.Releases).ShouldNot(o.BeEmpty(), "IRI should have releases")

g.By("Selecting an image from the release payload")
releaseImage := iri.Status.Releases[0].Image
Contributor

Probably any well-known payload image could work as well, but I guess it's not easy right now to retrieve its pullspec (it probably will be once https://issues.redhat.com/browse/AGENT-1312 is available; cc @danielerez)

return iriMCs, nil
}

func checkIRIRegistryAccessible(ip string) bool {
Contributor

This is not going to work in the CI, since the port is not accessible

e2e.Logf("Workload successfully pulled and ran image from internal IRI registry")
})

g.It("allow new worker nodes to connect to registry on port 22625 [Disruptive][Slow][apigroup:machineconfiguration.openshift.io]", func() {
Contributor

I don't think this test is going to work in the CI, and I also find it overly complicated, so let's discuss it; I think it should be removed


Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.


3 participants