Update todolist sample app for a more robust todo2 container and other stability improvements #2112

Merged
weshayutin merged 9 commits into openshift:oadp-dev from weshayutin:westest
Mar 6, 2026

@weshayutin
Contributor

@weshayutin weshayutin commented Mar 4, 2026

Why the changes were made

Update the todolist MariaDB and todolist MongoDB apps to use the new version introduced in
migtools/mig-demo-apps#9.

Summary

  • Migrate all E2E sample applications (mongo-persistent, mysql-persistent) from the legacy oadp-ci-todolist-* container images to the new unified oadp-ci-todo2-go-testing-* images, updating environment variables (DB_BACKEND, MONGO_DATABASE), mount paths (/var/lib/mongodb), and device paths (/dev/block-pv) to match the new image contracts.
  • Add a /healthz polling gate to VerifyBackupRestoreData so that post-restore data verification waits for the app to become alive before comparing todo data, with configurable retry and linear backoff.
  • After KOPIA (file-system) restores, delete the app pods and let the deployment controller recreate them, to work around OVN-Kubernetes networking issues where Velero-restored pods have broken network namespaces.
  • Add a skipReadyz option to VerifyBackupRestoreData for VM-based (virt) tests where the app route is not directly reachable until the VM completes cloud-init; in this path, /todo-incomplete is polled with up to 10 retries.

Details

Image migration

| App | Old image | New image |
|---|---|---|
| mongo-persistent (all variants) | quay.io/migtools/oadp-ci-todolist-mongo-go-testing:latest | quay.io/migtools/oadp-ci-todo2-go-testing-mongodb:latest |
| mysql-persistent (all variants) | quay.io/migtools/oadp-ci-todolist-mariadb-go-testing:testing | quay.io/migtools/oadp-ci-todo2-go-testing-mariadb:latest |

Post-restore verification improvements (lib/apps.go)

  • VerifyBackupRestoreData now accepts an optional variadic skipReadyz ...bool parameter.
  • For container-based todo apps: polls /healthz up to 5 times (5s linear backoff) before fetching /todo-incomplete.
  • For VM-based todo apps (skipReadyz=true): skips the healthz gate and instead polls /todo-incomplete up to 10 times (10s linear backoff) to wait for the VM to finish booting.
  • New helpers: isHealthzAlive() and errRespIndicatesHTTPError() for cleaner response validation.

KOPIA restore pod restart

  • In both backup_restore_suite_test.go and backup_restore_cli_suite_test.go, after a KOPIA restore completes, pods with label e2e-app=true are deleted so the deployment controller creates fresh pods with proper OVN-Kubernetes networking.

Manifest updates

  • mongo-persistent-block: Updated the init container to use the same app image, switched from sh to bash, installed e2fsprogs at runtime for block-device formatting, and updated the device path to /dev/block-pv.
  • mysql-persistent: Added CHOWN and FOWNER capabilities to the SCC and container security context to support the new image.
  • All manifests: Added DB_BACKEND environment variable, updated readiness probe to /readyz where applicable.

Test Plan

  • Run E2E backup/restore tests for mongo-persistent (CSI, file-system, block)
  • Run E2E backup/restore tests for mysql-persistent (CSI, file-system, two-vol)
  • Run E2E virt backup/restore tests to validate the skipReadyz polling path
  • Run CLI-based backup/restore tests to validate the KOPIA pod restart path
  • Verify that healthz polling correctly gates data verification on slow-starting apps

Summary by CodeRabbit

  • New Features

    • Automatic pod recovery for file-system restores now triggers fresh pod creation with proper networking configuration.
  • Tests

    • Enhanced restore verification with improved readiness checks across multiple environments and enhanced retry logic.
    • Updated container images and configurations for test database applications (MongoDB, MySQL/MariaDB).

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 4, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/e2e/sample-applications/mysql-persistent/mysql-persistent.yaml (1)

98-98: Replace :latest tag with a pinned digest for reproducible test runs.

The :latest tag is mutable and can introduce test flakes when the image is updated. This issue appears across multiple e2e sample application manifests; consider pinning to a specific @sha256:<digest> instead.

♻️ Proposed fix
-            image: quay.io/migtools/oadp-ci-todo2-go-testing-mariadb:latest
+            image: quay.io/migtools/oadp-ci-todo2-go-testing-mariadb@sha256:<pinned-digest>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/sample-applications/mysql-persistent/mysql-persistent.yaml` at line
98, The manifest currently pins the container image with a mutable tag
"quay.io/migtools/oadp-ci-todo2-go-testing-mariadb:latest"; replace that tag
with an immutable digest (e.g.,
"quay.io/migtools/oadp-ci-todo2-go-testing-mariadb@sha256:<digest>") by
resolving the correct sha256 via the registry (quay or `skopeo/ctr/podman`
inspect) and update this image reference; also scan other e2e sample manifests
for the same pattern and pin those images similarly to ensure reproducible test
runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e/lib/apps.go`:
- Around line 496-516: The readyz loop treats err == nil as success but
MakeRequest (used with getRequestParameters) doesn't enforce HTTP success codes,
so HTTP 5xx responses can falsely mark the app as ready; modify MakeRequest to return
the HTTP status code (e.g., add a statusCode int return) or otherwise expose it,
then in the readyz check loop (the block using getRequestParameters and
MakeRequest) require statusCode == 200 (and optionally validate a known body
like "ok"/"ready" if the app defines one) before breaking; also update any tests
that call MakeRequest (and the proxy curl invocations in common_helpers.go) to
ensure curl uses --fail or that MakeRequest treats non-2xx responses as errors
so the readyz check only passes on a real healthy response.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e6a2acaa-fb4c-4049-afaf-5cdd83eafeb4

📥 Commits

Reviewing files that changed from the base of the PR and between 13e07c0 and 1afcdc6.

📒 Files selected for processing (6)
  • tests/e2e/lib/apps.go
  • tests/e2e/sample-applications/mongo-persistent/mongo-persistent-csi.yaml
  • tests/e2e/sample-applications/mongo-persistent/mongo-persistent.yaml
  • tests/e2e/sample-applications/mysql-persistent/mysql-persistent-csi.yaml
  • tests/e2e/sample-applications/mysql-persistent/mysql-persistent-twovol-csi.yaml
  • tests/e2e/sample-applications/mysql-persistent/mysql-persistent.yaml

Signed-off-by: Wesley Hayutin <weshayutin@gmail.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/e2e/sample-applications/mysql-persistent/mysql-persistent.yaml (1)

36-38: allowedCapabilities is good, but SCC is still effectively privileged.

Line 36 adds a capability allowlist, but with allowPrivilegedContainer: true and allowPrivilegeEscalation: true in the same SCC, this does not materially enforce least privilege. Consider tightening those flags if this manifest is intended to model secure defaults.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/sample-applications/mysql-persistent/mysql-persistent.yaml` around
lines 36 - 38, The SCC currently lists allowedCapabilities but still permits
privileged behavior; update the SecurityContextConstraints/SCC by setting
allowPrivilegedContainer: false and allowPrivilegeEscalation: false (or remove
these true settings) so the capability allowlist is effective; locate the SCC
resource containing allowedCapabilities, allowPrivilegedContainer, and
allowPrivilegeEscalation and tighten those flags to false to enforce least
privilege.
tests/e2e/backup_restore_cli_suite_test.go (1)

178-191: Extract duplicated KOPIA pod-restart logic into a shared helper.

This block duplicates tests/e2e/backup_restore_suite_test.go (lines 269-282). A shared helper in package e2e_test will reduce drift between the CLI and non-CLI restore paths.

Refactor sketch
-	if brCase.BackupRestoreType == lib.KOPIA {
-		log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
-		err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
-			context.Background(),
-			metav1.DeleteOptions{},
-			metav1.ListOptions{LabelSelector: "e2e-app=true"},
-		)
-		gomega.Expect(err).ToNot(gomega.HaveOccurred())
-	}
+	restartRestoredAppPodsIfNeeded(brCase.BackupRestoreType, brCase.Namespace)
// package e2e_test (shared by both *_suite_test.go files)
func restartRestoredAppPodsIfNeeded(backupRestoreType lib.BackupRestoreType, appNamespace string) {
	if backupRestoreType != lib.KOPIA {
		return
	}
	log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", appNamespace)
	err := kubernetesClientForSuiteRun.CoreV1().Pods(appNamespace).DeleteCollection(
		context.Background(),
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "e2e-app=true"},
	)
	gomega.Expect(err).ToNot(gomega.HaveOccurred())
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/backup_restore_cli_suite_test.go` around lines 178 - 191, Extract
the duplicated KOPIA pod-restart logic into a shared helper named
restartRestoredAppPodsIfNeeded(backupRestoreType lib.BackupRestoreType,
appNamespace string) in package e2e_test; implement it to return early if
backupRestoreType != lib.KOPIA, otherwise log the restart message and call
kubernetesClientForSuiteRun.CoreV1().Pods(appNamespace).DeleteCollection(context.Background(),
metav1.DeleteOptions{}, metav1.ListOptions{LabelSelector: "e2e-app=true"}) and
assert no error with gomega.Expect(err).ToNot(gomega.HaveOccurred()); then
replace the duplicated blocks in both backup_restore_cli_suite_test.go and
backup_restore_suite_test.go with a single call to
restartRestoredAppPodsIfNeeded(brCase.BackupRestoreType, brCase.Namespace).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e/backup_restore_suite_test.go`:
- Around line 269-282: The DeleteCollection call used when
brCase.BackupRestoreType == lib.KOPIA
(kubernetesClientForSuiteRun.CoreV1().Pods(...).DeleteCollection) currently uses
context.Background() and can hang tests; replace it with a context created via
context.WithTimeout (choose a reasonable timeout like 30s), pass that context to
DeleteCollection, and defer the cancel() to ensure the call is bounded; make the
identical change in tests/e2e/backup_restore_cli_suite_test.go for the
corresponding DeleteCollection call as well.

In `@tests/e2e/sample-applications/mysql-persistent/mysql-persistent.yaml`:
- Line 101: Replace the mutable image tag used in the manifest: update the image
reference quay.io/migtools/oadp-ci-todo2-go-testing-mariadb:latest to an
immutable tag or digest (e.g.,
quay.io/migtools/oadp-ci-todo2-go-testing-mariadb:<specific-version> or
quay.io/migtools/oadp-ci-todo2-go-testing-mariadb@sha256:<digest>) so the e2e
manifest in mysql-persistent.yaml references a fixed artifact and tests become
deterministic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 060dfeb3-febd-4e79-98f4-b425fd3bddcb

📥 Commits

Reviewing files that changed from the base of the PR and between 1afcdc6 and d15f16d.

📒 Files selected for processing (3)
  • tests/e2e/backup_restore_cli_suite_test.go
  • tests/e2e/backup_restore_suite_test.go
  • tests/e2e/sample-applications/mysql-persistent/mysql-persistent.yaml

Signed-off-by: Wesley Hayutin <weshayutin@gmail.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/e2e/sample-applications/mongo-persistent/mongo-persistent-csi.yaml (1)

80-100: Recommend matching readiness behavior with the other Mongo restore checks.

For consistency with the updated block variant and restore readiness flow, consider using /readyz for readiness in this variant as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/sample-applications/mongo-persistent/mongo-persistent-csi.yaml`
around lines 80 - 100, The readiness behavior in this Mongo variant should match
the other Mongo restore checks by using the /readyz endpoint; update the
container's readinessProbe (the container with image
quay.io/migtools/oadp-ci-todo2-go-testing-mongodb:latest) to probe /readyz
instead of the current path (or add a readinessProbe if missing) so the restore
readiness flow is consistent with the updated block variant.
tests/e2e/sample-applications/mongo-persistent/mongo-persistent.yaml (1)

93-107: Consider aligning readiness endpoint with /readyz across Mongo variants.

This file still uses /healthz probes while the block variant moved readiness to /readyz. Keeping endpoint semantics uniform across variants reduces e2e drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/sample-applications/mongo-persistent/mongo-persistent.yaml` around
lines 93 - 107, The readiness endpoint in this Kubernetes container spec uses
/healthz while other Mongo variants use /readyz; update the readinessProbe
configuration (look for the readinessProbe.path entry in this container spec) to
use "/readyz" instead of "/healthz" so the readinessProbe aligns with other
Mongo variants (leave other fields like livenessProbe, ports, env, volumeMounts
unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e/sample-applications/mongo-persistent/mongo-persistent-block.yaml`:
- Around line 97-100: The init-container currently installs e2fsprogs at runtime
using apt-get; instead, update the container image (used by this init-container)
to include e2fsprogs baked in and change the init logic to validate its presence
(e.g., check for mkfs.ext4 or e2fsck) rather than running apt-get. Locate the
init script that sets DEVICE and MOUNT_POINT and replace the apt-get install
lines with a runtime check that exits with a clear error if the required
binaries are missing, ensuring the test image is rebuilt with e2fsprogs
included.
- Around line 141-145: The startup snippet should fail fast and safely quote
variables: add "set -e" at the top of the script and change unquoted variable
uses to quoted ones (use "$DEVICE" and "$MOUNT_POINT" wherever referenced, e.g.,
in mkdir and mount calls) so mount failures abort the script instead of falling
through to exec /opt/todolist/entrypoint.sh; ensure the mount command is run as
mount "$DEVICE" "$MOUNT_POINT" (or followed by an explicit check/exit on
failure) before exec-ing entrypoint.sh.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e9cf6dde-fd7b-4214-990e-405b4c7c23af

📥 Commits

Reviewing files that changed from the base of the PR and between d15f16d and 5402ae3.

📒 Files selected for processing (3)
  • tests/e2e/sample-applications/mongo-persistent/mongo-persistent-block.yaml
  • tests/e2e/sample-applications/mongo-persistent/mongo-persistent-csi.yaml
  • tests/e2e/sample-applications/mongo-persistent/mongo-persistent.yaml

Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/e2e/lib/apps.go (1)

498-529: ⚠️ Potential issue | 🔴 Critical

skipReadyz currently skips restore data verification for mysql/mongo.

Line 498 gates both the /readyz probe and the /todo-incomplete fetch. When skipReadyz is true, respData is never populated for mysql-persistent/mongo-persistent, so the comparison at Lines 550-553 can fail with empty response data. This breaks the intended “skip healthz only” behavior.

Proposed fix
-		if !shouldSkipReadyz && (namespace == "mysql-persistent" || namespace == "mongo-persistent") {
-			// ensure that the application endpoint is reachable
-			requestParams := getRequestParameters(appEndpointURL+"/readyz", proxyPodParams, GET, nil)
-			const maxReadyzAttempts = 5
-			for attempt := 1; attempt <= maxReadyzAttempts; attempt++ {
-				log.Printf("readyz check attempt %d/%d: GET %s/readyz\n", attempt, maxReadyzAttempts, appEndpointURL)
-				respData, errResp, err = MakeRequest(*requestParams)
-				if err == nil {
-					log.Printf("readyz endpoint is alive (attempt %d/%d): %s\n", attempt, maxReadyzAttempts, respData)
-					break
-				}
-				if errResp != "" {
-					log.Printf("Request response error msg: %s\n", errResp)
-				}
-				if attempt == maxReadyzAttempts {
-					log.Printf("readyz endpoint did not become alive after %d attempts: %v\n", maxReadyzAttempts, err)
-					return err
-				}
-				backoff := time.Duration(attempt) * 5 * time.Second
-				log.Printf("readyz attempt %d/%d failed, retrying in %s: %v\n", attempt, maxReadyzAttempts, backoff, err)
-				time.Sleep(backoff)
-			}
-			// Make request to the "todo-incomplete" endpoint
-			requestParamsTodoIncomplete := getRequestParameters(appEndpointURL+"/todo-incomplete", proxyPodParams, GET, nil)
-			respData, errResp, err = MakeRequest(*requestParamsTodoIncomplete)
-			if err != nil {
-				if errResp != "" {
-					log.Printf("Request response error msg: %s\n", errResp)
-				}
-				return err
-			}
-		}
+		if namespace == "mysql-persistent" || namespace == "mongo-persistent" {
+			if !shouldSkipReadyz {
+				// ensure that the application endpoint is reachable
+				requestParams := getRequestParameters(appEndpointURL+"/readyz", proxyPodParams, GET, nil)
+				const maxReadyzAttempts = 5
+				for attempt := 1; attempt <= maxReadyzAttempts; attempt++ {
+					log.Printf("readyz check attempt %d/%d: GET %s/readyz\n", attempt, maxReadyzAttempts, appEndpointURL)
+					respData, errResp, err = MakeRequest(*requestParams)
+					if err == nil {
+						log.Printf("readyz endpoint is alive (attempt %d/%d): %s\n", attempt, maxReadyzAttempts, respData)
+						break
+					}
+					if errResp != "" {
+						log.Printf("Request response error msg: %s\n", errResp)
+					}
+					if attempt == maxReadyzAttempts {
+						log.Printf("readyz endpoint did not become alive after %d attempts: %v\n", maxReadyzAttempts, err)
+						return err
+					}
+					backoff := time.Duration(attempt) * 5 * time.Second
+					log.Printf("readyz attempt %d/%d failed, retrying in %s: %v\n", attempt, maxReadyzAttempts, backoff, err)
+					time.Sleep(backoff)
+				}
+			}
+			// always fetch restore data, even when readyz is skipped
+			requestParamsTodoIncomplete := getRequestParameters(appEndpointURL+"/todo-incomplete", proxyPodParams, GET, nil)
+			respData, errResp, err = MakeRequest(*requestParamsTodoIncomplete)
+			if err != nil {
+				if errResp != "" {
+					log.Printf("Request response error msg: %s\n", errResp)
+				}
+				return err
+			}
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/lib/apps.go` around lines 498 - 529, The current if-block around
shouldSkipReadyz also skips the "/todo-incomplete" verification for
mysql-persistent/mongo-persistent; change logic so only the "/readyz" probe is
skipped when shouldSkipReadyz is true but the "/todo-incomplete" request always
runs. Concretely, keep the readyz retry loop guarded by shouldSkipReadyz (the
block using appEndpointURL+"/readyz", maxReadyzAttempts, backoff and
MakeRequest), then move or duplicate the
getRequestParameters(appEndpointURL+"/todo-incomplete", proxyPodParams, GET,
nil) + respData, errResp, err = MakeRequest(*requestParamsTodoIncomplete)
handling outside that conditional so respData is populated for subsequent
comparisons even when shouldSkipReadyz is true.
♻️ Duplicate comments (1)
tests/e2e/lib/apps.go (1)

505-507: ⚠️ Potential issue | 🟠 Major

/readyz probe can still false-pass on HTTP error responses.

Line 505 treats err == nil as readiness, but this was already flagged: transport success is not equivalent to health success. Please enforce HTTP success semantics (or a known ready payload) before breaking the retry loop.

#!/bin/bash
# Verify current readyz success criterion and request execution behavior (read-only).
set -euo pipefail

echo "== Verify readyz loop in tests/e2e/lib/apps.go =="
rg -n -C3 'readyz|err == nil|MakeRequest\(' tests/e2e/lib/apps.go

echo
echo "== Verify MakeRequest proxy curl flags in tests/e2e/lib/common_helpers.go =="
rg -n -C3 'curl|--silent|--show-error|--fail|MakeRequest\(' tests/e2e/lib/common_helpers.go
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/lib/apps.go` around lines 505 - 507, The readyz loop currently
treats transport success (err == nil) as readiness; update the loop that calls
MakeRequest for the readyz endpoint so it also validates HTTP success semantics
or a known success payload before breaking: use the status code returned by
MakeRequest (or the response object) and only break when the status is in the
2xx range (or respData contains the expected ready payload), otherwise log the
non-2xx status/body and continue retrying; adjust the condition around err ==
nil in the readyz check (variables: MakeRequest, respData, attempt,
maxReadyzAttempts) to require both no error and an explicit HTTP success/payload
match.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7ad1748a-5ff3-4467-b6de-fd5de3d461a2

📥 Commits

Reviewing files that changed from the base of the PR and between 5402ae3 and 3abf894.

📒 Files selected for processing (2)
  • tests/e2e/lib/apps.go
  • tests/e2e/virt_backup_restore_suite_test.go

Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
tests/e2e/lib/apps.go (1)

502-508: ⚠️ Potential issue | 🟠 Major

Add response body validation to the readyz check.

The check at line 505 treats err == nil as success, but MakeRequest uses curl with --silent --show-error without --fail. HTTP error responses (5xx) will still return exit code 0, causing the readyz check to falsely pass when the endpoint is unhealthy.

Consider validating the response body (e.g., checking for "ok" or expected content):

Proposed fix
 			for attempt := 1; attempt <= maxReadyzAttempts; attempt++ {
 				log.Printf("readyz check attempt %d/%d: GET %s/readyz\n", attempt, maxReadyzAttempts, appEndpointURL)
 				respData, errResp, err = MakeRequest(*requestParams)
-				if err == nil {
+				if err == nil && strings.TrimSpace(respData) == "ok" {
 					log.Printf("readyz endpoint is alive (attempt %d/%d): %s\n", attempt, maxReadyzAttempts, respData)
 					break
 				}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/lib/apps.go` around lines 502 - 508, The readyz loop currently
treats err == nil from MakeRequest as success, but MakeRequest (used in the loop
around appEndpointURL) can return exit code 0 for HTTP 5xx; update the success
check in the readyz check to validate the response body and/or HTTP status:
after calling MakeRequest(*requestParams) inspect respData and errResp (or
parsed status in errResp) and only break when respData contains the expected OK
payload (e.g., "ok" or the service-specific healthy string) and there is no HTTP
error status; otherwise log the failure and continue retrying. Ensure you update
the check that uses respData, errResp, err and keep the existing logging around
MakeRequest.
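When the probe goes through curl, as this comment says MakeRequest does (--silent --show-error without --fail), one alternative to adding --fail is to have curl append the status code with --write-out and parse it. This is a sketch under assumptions: splitCurlOutput is a hypothetical helper, and how MakeRequest actually builds and runs its curl command is not shown in this thread. The --write-out '\n%{http_code}' form is standard curl; the single quotes pass \n through the shell, and curl expands it to a newline.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitCurlOutput is a hypothetical parser for stdout produced by
//
//	curl --silent --show-error --write-out '\n%{http_code}' <url>
//
// curl prints the response body, then the write-out string, so the last
// line holds the HTTP status code.
func splitCurlOutput(out string) (body string, status int, err error) {
	out = strings.TrimRight(out, "\r\n") // tolerate a trailing newline from the exec layer
	idx := strings.LastIndex(out, "\n")
	if idx < 0 {
		return "", 0, fmt.Errorf("no status line in curl output %q", out)
	}
	status, err = strconv.Atoi(strings.TrimSpace(out[idx+1:]))
	if err != nil {
		return "", 0, fmt.Errorf("parsing status code: %w", err)
	}
	return out[:idx], status, nil
}

// isHTTPSuccess reports whether status is in the 2xx range.
func isHTTPSuccess(status int) bool { return status >= 200 && status <= 299 }

func main() {
	for _, out := range []string{"ok\n200", "Internal Server Error\n500\n"} {
		body, status, err := splitCurlOutput(out)
		fmt.Printf("body=%q status=%d ready=%t err=%v\n", body, status, isHTTPSuccess(status), err)
	}
}
```

Combining the parsed status with the body check suggested above keeps the readyz loop from breaking on an HTTP 5xx page.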

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a48a2d6f-ac68-4df0-b523-b2cb65761ad9

📥 Commits

Reviewing files that changed from the base of the PR and between 3abf894 and 541a300.

📒 Files selected for processing (1)
  • tests/e2e/lib/apps.go

Signed-off-by: Wesley Hayutin <weshayutin@gmail.com>
@weshayutin weshayutin changed the title DNM: testing todo version 2 Update todolist sample app for a more robust todo2 container and other stability improvements Mar 5, 2026
@openshift openshift deleted a comment from coderabbitai bot Mar 5, 2026
@weshayutin
Contributor Author

wooot.. let's go. https://prow.ci.openshift.org/pr-history/?org=openshift&repo=oadp-operator&pr=2112
e2e-test-cli is optional atm..
@kaovilai @mpryc please review for merge. e2e tests passing 4, 5 times in a row.

@mpryc
Copy link
Contributor

mpryc commented Mar 6, 2026

/test 4.21-e2e-test-aws

Contributor

@mpryc mpryc left a comment

/lgtm with small room to improve around apt-get

Signed-off-by: Wesley Hayutin <weshayutin@gmail.com>
@coderabbitai
Contributor

coderabbitai bot commented Mar 6, 2026

Walkthrough

This PR adds post-restore pod recovery for file-system backups (KOPIA/restic) to trigger fresh pod creation with proper networking, enhances restore verification logic with healthz checks and polling, updates function signatures to support optional readiness checks, and updates sample application manifests with new container images and environment variable configurations.

Changes

Cohort / File(s) Summary
Post-Restore Pod Recovery
tests/e2e/backup_restore_cli_suite_test.go, tests/e2e/backup_restore_suite_test.go
Adds pod deletion logic after KOPIA file-system restores. Deletes pods labeled e2e-app=true in the target namespace to force deployment controller to recreate pods with proper networking, preserving PVC data.
Restore Verification Enhancement
tests/e2e/lib/apps.go
Expands VerifyBackupRestoreData function with variadic skipReadyz parameter to support optional post-restore readiness checks. Introduces healthz-based readiness gates, polling and retry logic for Todo app endpoints, and separate verification paths for parks-app, replacing single-request verification with multi-path resilient validation.
Test Suite Integration
tests/e2e/virt_backup_restore_suite_test.go
Updates the VerifyBackupRestoreData call to pass the new boolean skipReadyz argument, so VM-based restores skip the healthz gate and instead poll the todo endpoint until the VM finishes cloud-init.
MongoDB Sample Applications
tests/e2e/sample-applications/mongo-persistent/mongo-persistent*.yaml
Updates container images across block-device and CSI variants. Replaces MONGO_INITDB_ROOT_USERNAME/PASSWORD and MONGO_INITDB_DATABASE with DB_BACKEND and MONGO_DATABASE environment variables. Changes mount path from /data/db to /var/lib/mongodb. Block variant updates device path and readiness probe.
MySQL Sample Applications
tests/e2e/sample-applications/mysql-persistent/mysql-persistent*.yaml
Updates container images to oadp-ci-todo2-go-testing-mariadb variant. Adds DB_BACKEND=mariadb environment variable across all manifests. Standard variant adds allowedCapabilities (CHOWN, FOWNER) to SecurityContextConstraints and container securityContext.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 warning, 1 inconclusive)

  • Test Structure And Quality: ⚠️ Warning. Explanation: Test code lacks error messages in assertions, and pod deletion is placed inline rather than in cleanup blocks, creating resource leak risks if tests fail before cleanup executes. Resolution: Add error messages to all Expect assertions; move pod deletion to AfterEach/defer blocks; ensure all Eventually calls consistently specify timeouts; follow existing repository cleanup patterns.
  • Stable And Deterministic Test Names: ❓ Inconclusive. Explanation: Unable to read the virt_backup_restore_suite_test.go file. No shell command execution capability available. Resolution: Please provide the file contents directly or specify the verification criteria for this test file.
✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title clearly describes the main change: updating the todolist sample app image to a newer version and adding stability improvements, which is fully reflected in the PR's comprehensive changes.
  • Description check: ✅ Passed. The description is comprehensive and follows the template with both required sections completed; it explains why changes were made, references the upstream PR, provides detailed summaries of image migrations, verification improvements, and includes a test plan.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (4)
tests/e2e/backup_restore_suite_test.go (1)

276-281: ⚠️ Potential issue | 🟠 Major

Use a timeout-bounded context for pod DeleteCollection.

Line 277 uses context.Background() for a network API call; this can hang the test flow under API/server stalls.

🔧 Suggested fix
 	if brCase.BackupRestoreType == lib.KOPIA {
 		log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
+		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
+		defer cancel()
 		err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
-			context.Background(),
+			ctx,
 			metav1.DeleteOptions{},
 			metav1.ListOptions{LabelSelector: "e2e-app=true"},
 		)
 		gomega.Expect(err).ToNot(gomega.HaveOccurred())
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/backup_restore_suite_test.go` around lines 276 - 281, Replace the
unbounded context in the DeleteCollection call with a timeout-bounded context:
create a context with a reasonable timeout using context.WithTimeout (and defer
cancel()), then pass that ctx into
kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(...).
Ensure you cancel the context to release resources and keep the DeleteCollection
invocation and subsequent gomega.Expect(err).ToNot(gomega.HaveOccurred())
behavior unchanged.
tests/e2e/backup_restore_cli_suite_test.go (1)

185-190: ⚠️ Potential issue | 🟠 Major

Bound the pod deletion call with a timeout context.

Line 186 uses context.Background() for DeleteCollection; if the API call stalls, this can hang the test path.

🔧 Suggested fix
 	if brCase.BackupRestoreType == lib.KOPIA {
 		log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
+		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
+		defer cancel()
 		err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
-			context.Background(),
+			ctx,
 			metav1.DeleteOptions{},
 			metav1.ListOptions{LabelSelector: "e2e-app=true"},
 		)
 		gomega.Expect(err).ToNot(gomega.HaveOccurred())
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/backup_restore_cli_suite_test.go` around lines 185 - 190, The
DeleteCollection call uses context.Background() which can hang; change it to a
bounded context by creating a context with timeout (e.g., via
context.WithTimeout) before calling
kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection and
ensure you defer the cancel() so the timeout is enforced and resources are
cleaned up; update the call site that currently passes context.Background() to
pass the new timed context instead.
tests/e2e/sample-applications/mongo-persistent/mongo-persistent-block.yaml (1)

141-145: ⚠️ Potential issue | 🟠 Major

Make the block-device startup script fail fast and quote variables.

Lines 143-144 run mount flow without strict mode and unquoted vars; mount failures can lead to misleading startup behavior.

🔧 Suggested fix
               - |
+                set -eu
                 DEVICE="/dev/block-pv"
                 MOUNT_POINT="/var/lib/mongodb"
-                mkdir -p $MOUNT_POINT
-                mount $DEVICE $MOUNT_POINT
+                mkdir -p "$MOUNT_POINT"
+                mount "$DEVICE" "$MOUNT_POINT"
                 exec /opt/todolist/entrypoint.sh
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/sample-applications/mongo-persistent/mongo-persistent-block.yaml`
around lines 141 - 145, The startup script should fail fast and safely quote
variables: enable strict mode (set -euo pipefail) at the top, quote DEVICE and
MOUNT_POINT when creating the directory and when calling mount (use "$DEVICE"
and "$MOUNT_POINT"), check the return value of mount and exit with a non‑zero
status and an error log if mount fails, and only exec
/opt/todolist/entrypoint.sh after a successful mount; update the block that
references DEVICE, MOUNT_POINT and the final exec to follow these changes.
tests/e2e/lib/apps.go (1)

519-527: ⚠️ Potential issue | 🟠 Major

HTTP error responses can still be treated as healthy/ready in proxy mode.

Line 526 and Line 567 assume err == nil reliably means HTTP success, but proxy requests can still return err == nil on HTTP 4xx/5xx when curl doesn’t fail on status. That leaves this block vulnerable to false readiness/data-availability decisions.

🔧 Suggested hardening (root cause in request helper)
--- a/tests/e2e/lib/common_helpers.go
+++ b/tests/e2e/lib/common_helpers.go
@@
- curlInProxyCmd = "curl -X GET --silent --show-error " + params.URL
+ curlInProxyCmd = "curl -X GET --fail-with-body --silent --show-error " + params.URL
@@
- curlInProxyCmd = fmt.Sprintf("curl -X POST -d %s --silent --show-error %s", body, params.URL)
+ curlInProxyCmd = fmt.Sprintf("curl -X POST -d %s --fail-with-body --silent --show-error %s", body, params.URL)
@@
- curlInProxyCmd = fmt.Sprintf("curl -X POST --silent --show-error %s", params.URL)
+ curlInProxyCmd = fmt.Sprintf("curl -X POST --fail-with-body --silent --show-error %s", params.URL)

Use this to verify the concern and affected paths:

#!/bin/bash
# Verify curl proxy commands do not currently fail on HTTP status codes.
rg -n -C2 --type go 'curl -X (GET|POST).*--silent --show-error' tests/e2e/lib/common_helpers.go

# Verify readiness/data checks that currently depend on err/body heuristics.
rg -n -C3 --type go 'healthz check attempt|isHealthzAlive|maxTodoAttempts == 1|todo-incomplete|errRespIndicatesHTTPError' tests/e2e/lib/apps.go

Expected result: first command shows proxy curl invocations without --fail/--fail-with-body; second command shows call sites in this file that rely on that behavior.

Also applies to: 567-568, 632-650

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/lib/apps.go` around lines 519 - 527, The readiness check
incorrectly treats err == nil as success for proxy-mode requests; update either
the request helper or this caller so HTTP error statuses are not considered
healthy: modify MakeRequest (used here) to make proxy curl fail on HTTP errors
(add --fail or --fail-with-body to the curl args in
getRequestParameters/MakeRequest) so it returns a non-nil error for 4xx/5xx, or,
if you prefer caller-side hardening, explicitly inspect errResp for HTTP status
>= 400 before accepting success and only call isHealthzAlive when status is 2xx;
reference MakeRequest, getRequestParameters, isHealthzAlive and proxyPodParams
to locate the code to change.
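If switching every proxy command to --fail-with-body is undesirable (that flag needs curl 7.76.0 or newer in the proxy image), the caller-side option can be sketched in Go: append curl's --write-out '\n%{http_code}' to the proxy command and split the status off the captured output. splitStatus and isHTTPSuccess are hypothetical helpers, and the sketch assumes MakeRequest returns curl's stdout unchanged.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitStatus separates the response body from the status code that
// curl appends when run with: --write-out '\n%{http_code}'
func splitStatus(out string) (body string, status int, err error) {
	out = strings.TrimRight(out, "\r\n") // tolerate a trailing newline from the exec wrapper
	idx := strings.LastIndex(out, "\n")
	if idx < 0 {
		return "", 0, fmt.Errorf("no status line in %q", out)
	}
	status, err = strconv.Atoi(strings.TrimSpace(out[idx+1:]))
	if err != nil {
		return "", 0, fmt.Errorf("unparsable status line: %w", err)
	}
	return out[:idx], status, nil
}

// isHTTPSuccess accepts only 2xx codes, so a 503 body is never "healthy".
func isHTTPSuccess(status int) bool { return status >= 200 && status < 300 }

func main() {
	for _, out := range []string{"ok\n200", "upstream connect error\n503\n"} {
		body, status, err := splitStatus(out)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("body=%q status=%d healthy=%v\n", body, status, isHTTPSuccess(status))
	}
}
```

Checking the status code first and the body second keeps isHealthzAlive-style body checks from ever seeing an error page.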

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f1f76b6a-4093-458a-bf4e-7a6b9f2a018a

📥 Commits

Reviewing files that changed from the base of the PR and between 13e07c0 and 996ca99.

📒 Files selected for processing (10)
  • tests/e2e/backup_restore_cli_suite_test.go
  • tests/e2e/backup_restore_suite_test.go
  • tests/e2e/lib/apps.go
  • tests/e2e/sample-applications/mongo-persistent/mongo-persistent-block.yaml
  • tests/e2e/sample-applications/mongo-persistent/mongo-persistent-csi.yaml
  • tests/e2e/sample-applications/mongo-persistent/mongo-persistent.yaml
  • tests/e2e/sample-applications/mysql-persistent/mysql-persistent-csi.yaml
  • tests/e2e/sample-applications/mysql-persistent/mysql-persistent-twovol-csi.yaml
  • tests/e2e/sample-applications/mysql-persistent/mysql-persistent.yaml
  • tests/e2e/virt_backup_restore_suite_test.go

@mpryc
Contributor

mpryc commented Mar 6, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 6, 2026
@weshayutin
Contributor Author

[Screenshot from 2026-03-06 09-52-22]

@openshift-ci

openshift-ci bot commented Mar 6, 2026

@weshayutin: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/4.21-e2e-test-cli-aws (commit 996ca99, required: false). Rerun command: /test 4.21-e2e-test-cli-aws


Comment on lines +178 to +182
// For file-system backup restores (KOPIA/restic), the restored pods may have
// broken networking because OVN-Kubernetes doesn't fully wire the network
// namespace for pods recreated by Velero with a restore-wait init container.
// Deleting the pods lets the deployment controller create fresh ones with
// proper networking while preserving the restored PVC data.
Member


is this doc'd in velero?

Contributor Author


um... ur the maintainer :)


Contributor Author


@kaovilai I can easily recreate the issue now, perhaps we need a new clean upstream velero bug and doc :)

@openshift-ci

openshift-ci bot commented Mar 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaovilai, mpryc, weshayutin


@weshayutin weshayutin merged commit 7c8a3f1 into openshift:oadp-dev Mar 6, 2026
15 of 16 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.


3 participants