VMBackup HotplugFailed on VMs with Fedora/RHEL DataSource: "disk.img: no such file or directory" #26

@shubham-pampattiwar

Description

Problem

When performing a backup of a VM provisioned from a Fedora DataSource (using dataVolumeTemplates with sourceRef), the KubeVirt VMBackup hotplug step fails with:

```
Warning  HotplugFailed  virtualmachineinstance/test-vm
  failed to mount filesystem hotplug volume vmb-...-backup-target-pvc:
  lstat /proc/1/root/var/lib/kubelet/plugins/kubernetes.io/csi/
  openshift-storage.rbd.csi.ceph.com/.../disk.img: no such file or directory
```

The VMBackup still reports `Done: True` despite this failure, resulting in an empty backup target PVC (the directory structure exists but there are no qcow2 files). The datamover pod then fails with `no qcow2 files found in /backup-data`.

Environment

  • OpenShift CNV v4.99.0-0.1771785652
  • Storage: ocs-storagecluster-ceph-rbd
  • Feature gates enabled: IncrementalBackup, UtilityVolumes, HotplugVolumes

What works vs. what doesn't

| VM | Root Disk | Machine Type | Firmware | Hotplug | Backup |
| --- | --- | --- | --- | --- | --- |
| cirros (containerdisk source, 150Mi, RWO Block) | Small containerdisk | q35 | BIOS | Works | Works |
| test-vm (Fedora DataSource, 30Gi, RWX Block) | Large DataSource | pc-q35-rhel9.2.0 | EFI + SMM | HotplugFailed | Empty PVC |

Both VMs have changedBlockTracking: "true". Both backup target PVCs are created identically (10Gi, RWO, Filesystem mode, same storage class).

How the backup target PVC is created

Our datamover controller creates the backup target PVC via ensureTempPVC() — a plain Filesystem PVC with no special configuration:

```go
pvc := &corev1.PersistentVolumeClaim{
    ObjectMeta: metav1.ObjectMeta{
        Name:      pvcName,      // "kubevirt-backup-<du.Name>"
        Namespace: namespace,    // VM namespace
    },
    Spec: corev1.PersistentVolumeClaimSpec{
        AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
        Resources: corev1.VolumeResourceRequirements{
            Requests: corev1.ResourceList{
                corev1.ResourceStorage: resource.MustParse("10Gi"),
            },
        },
        // No volumeMode specified (defaults to Filesystem)
        // No storageClassName specified (uses cluster default)
    },
}
```

Question for KubeVirt team: Should the backup target PVC be created differently? Does the VMBackup API expect a specific volumeMode, annotation, or pre-existing disk.img file?

Observed behavior

  1. Controller creates temp PVC kubevirt-backup-<name> in VM namespace (Filesystem, 10Gi, empty)
  2. VMBackup CR is created with forceFullBackup: true
  3. virt-handler logs "successfully mounted" at mount.go:561 (Kubernetes-level PVC attachment succeeds)
  4. HotplugFailed: virt-handler's filesystem hotplug into the VM fails because it looks for a disk.img inside the freshly provisioned, empty PVC
  5. VMBackup reports VirtualMachineBackupCompletedSuccessfully / Done: True despite the hotplug failure
  6. QEMU never receives the backup volume — virt-launcher logs show zero backup activity (no backup commands, no NBD export, no volume attachment inside the VM)
  7. PV is rebound to OADP namespace, datamover pod mounts it
  8. Debug pod inspection confirmed the PVC is empty — the checkpoint directory structure exists (test-vm/<checkpoint-name>/) but contains no files at all
  9. Datamover pod fails: no qcow2 files found in /backup-data

Reproduced consistently across 4 backup attempts (b, c, d, e) for test-vm.
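Because `Done: True` cannot be trusted on its own (step 5 vs. steps 6-8), our controller currently has to cross-check the status against HotplugFailed events and the actual PVC contents. A minimal sketch of that validation logic; the struct, helper name, and inputs are ours for illustration, not part of the KubeVirt API:

```go
package main

import "fmt"

// backupResult captures the three signals observed above. Illustrative
// only; this is not a KubeVirt type.
type backupResult struct {
	vmBackupDone       bool     // VMBackup reports Done: True
	hotplugFailedCount int      // HotplugFailed events seen for the VMI
	qcow2Files         []string // qcow2 files found on the target PVC
}

// verifyBackup refuses to trust Done: True alone: the backup only counts
// as successful if no HotplugFailed event fired AND at least one qcow2
// file actually landed on the backup target PVC.
func verifyBackup(r backupResult) error {
	if !r.vmBackupDone {
		return fmt.Errorf("VMBackup not done")
	}
	if r.hotplugFailedCount > 0 {
		return fmt.Errorf("VMBackup reports done, but %d HotplugFailed event(s) were recorded", r.hotplugFailedCount)
	}
	if len(r.qcow2Files) == 0 {
		return fmt.Errorf("VMBackup reports done, but no qcow2 files were written")
	}
	return nil
}

func main() {
	// test-vm as observed: Done: True, one HotplugFailed event, empty PVC.
	err := verifyBackup(backupResult{vmBackupDone: true, hotplugFailedCount: 1})
	fmt.Println(err != nil) // the misleading "success" is caught
}
```

This is exactly the kind of cross-checking that question 4 below argues the VMBackup controller itself should do.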

Key observations from investigation

The disk.img expectation

KubeVirt's filesystem PV disk documentation states that for standard filesystem hotplug volumes, a disk.img file must exist inside the PVC (auto-created for regular VM disks). However, the backup target PVC is not a VM disk — it's a destination for qcow2 export data. Nobody creates disk.img in it, and the hotplug code doesn't distinguish between a regular disk PVC and a backup target PVC.

Two-step mount process

The mount at mount.go:561 (Kubernetes-level PVC attachment to the virt-launcher pod) succeeds for both VMs. The failure happens at the second step — virt-handler's filesystem hotplug into the VM — which tries to bind-mount disk.img from the PVC into QEMU. This second step succeeds for cirros but fails for test-vm, despite identical backup target PVC setup.

Mount duration difference

The backup volume mount duration differs between the two VMs:

  • cirros: mounted for ~1 second before unmount
  • test-vm: mounted for ~33 seconds before unmount

UtilityVolumes feature gate

The UtilityVolumes feature gate is enabled on the cluster. KubeVirt introduced UtilityVolumes (PR #15922) specifically for backup PVC attachment — UtilityVolumes mount as a directory (no disk.img needed) rather than using the standard filesystem hotplug path. However, the VMBackup controller appears to attach the backup PVC as a regular hotplug volume rather than via spec.utilityVolumes on the VMI.

VMBackup completion status is misleading

  • cirros: VirtualMachineBackupCompletedWithWarning (guest agent not connected) — backup data was written
  • test-vm: VirtualMachineBackupCompletedSuccessfully — but no backup data was written at all

The "successful" completion is incorrect. The VMBackup should detect the HotplugFailed condition and report failure.

Questions for KubeVirt/CNV team

  1. Why does hotplug work for cirros but fail for test-vm? Both backup target PVCs are identical. The VM configuration differences (EFI firmware, machine type pc-q35-rhel9.2.0, disk size/access mode) may influence which hotplug code path is taken.

  2. Should the backup target PVC use UtilityVolumes? The UtilityVolumes feature gate is enabled but the VMBackup controller doesn't seem to use it. Is this expected for this KubeVirt version?

  3. Should we create the backup target PVC differently? Does it need Block volumeMode, a pre-created disk.img, or specific annotations?

  4. VMBackup should fail when HotplugFailed occurs. Currently it reports Done: True with no data written. This makes it impossible for downstream consumers to detect the failure from the VMBackup status alone.

How to reproduce

  1. Deploy a Fedora VM from the fedora DataSource in openshift-virtualization-os-images with changedBlockTracking: "true", EFI firmware, pc-q35-rhel9.2.0 machine type
  2. Trigger a Velero backup with snapshotMoveData: true
  3. Observe HotplugFailed events: `oc get events -n <vm-namespace> --field-selector reason=HotplugFailed`
  4. Inspect the backup PVC contents — checkpoint directory will be empty
  5. Observe datamover pod failure: no qcow2 files found in /backup-data
