diff --git a/Root Cause Analysis and Fix Proposal b/Root Cause Analysis and Fix Proposal new file mode 100644 index 0000000000..9e273c2085 --- /dev/null +++ b/Root Cause Analysis and Fix Proposal @@ -0,0 +1,94 @@ +Root Cause Analysis and Fix Proposal +I've completed the investigation into the HCP test BackupRepository Kopia initialization failure. Here are my findings: +Root Cause +The error occurs in Velero's PrepareRepo function at pkg/repository/provider/unified_repo.go:170-215: +func (urp *unifiedRepoProvider) PrepareRepo(ctx context.Context, param RepoParam) error { + // ... setup code ... + + repoOption, err := udmrepo.NewRepoOptions( + udmrepo.WithPassword(urp, param), + udmrepo.WithConfigFile(urp.workPath, string(param.BackupRepo.UID)), + // ... other options ... + ) + + // Check if repo exists in backend storage + if created, err := urp.repoService.IsCreated(ctx, *repoOption); err != nil { + return errors.Wrap(err, "error to check backup repo") + } else if created { + log.Debug("Repo has already been initialized remotely") + return nil // ❌ BUG: Returns without creating local config file! + } + + // Only reached if repo doesn't exist in backend + err = urp.repoService.Create(ctx, *repoOption) + // ... +} +The Problem: +Kopia Architecture: Kopia repositories have two components: +Backend storage (S3/Azure/GCS) containing repository metadata +Local config file (/home/velero/udmrepo/repo-.conf) containing connection info +Current Flow: +repo.Initialize(): Creates backend storage metadata AND creates initial local config file +repo.Connect(): Connects to existing backend AND creates/updates local config file +repo.Open(): Opens repo using EXISTING local config file +The Bug: +When PrepareRepo finds the repository already exists in backend storage (via IsCreated()), it returns immediately without calling Connect() +This means the local config file is never created/recreated +Since the config file is stored in ephemeral pod storage (/home/velero), it's lost when: + - A different node-agent pod tries to use the repository + - The pod restarts + - The creating pod is deleted +Later, when DataUpload tries to open the repository via repoService.Open(), it fails because the config file doesn't exist +When This Occurs +This bug manifests in these scenarios: +Multi-pod environments: One node-agent pod creates the BackupRepository, another pod tries to use it +Pod restarts: After the creating pod restarts, it loses the local config file (ephemeral storage) +HCP tests: Multiple backups in sequence where different resources trigger repository initialization +The Fix +The fix is to ensure PrepareRepo always calls Connect() when the repository exists remotely: +func (urp *unifiedRepoProvider) PrepareRepo(ctx context.Context, param RepoParam) error { + // ... setup code ... + + repoOption, err := udmrepo.NewRepoOptions(/* ... */) + if err != nil { + return errors.Wrap(err, "error to get repo options") + } + + if created, err := urp.repoService.IsCreated(ctx, *repoOption); err != nil { + return errors.Wrap(err, "error to check backup repo") + } else if created { + log.Debug("Repo has already been initialized remotely, connecting to it") + // ✅ FIX: Call Connect to create local config file + return urp.repoService.Connect(ctx, *repoOption) + } + + // Repository doesn't exist, create it + if param.BackupLocation.Spec.AccessMode == velerov1api.BackupStorageLocationAccessModeReadOnly { + return errors.Errorf("cannot create new backup repo for read-only backup storage location %s/%s", + param.BackupLocation.Namespace, param.BackupLocation.Name) + } + + err = urp.repoService.Create(ctx, *repoOption) + if err != nil { + return errors.Wrap(err, "error to create backup repo") + } + + log.Debug("Prepare repo complete") + return nil +} +File to Fix +Location: /Users/tkaovila/git/velero/pkg/repository/provider/unified_repo.go +Lines: 211-216 (specifically line 215) +Change: Replace `return nil` with `return urp.repoService.Connect(ctx, *repoOption)` +Additional Context +From lib_repo.go:98-106, Create() calls writeInitParameters() which opens the repo. This means: +The initial repo.Initialize() does create a local config file +However, this config file is stored in ephemeral pod storage (/home/velero/udmrepo/) +The issue is that subsequent PrepareRepo() calls don't recreate the config file when it's missing + +This explains the failure pattern: +The first DataUpload might succeed (if it runs on the same pod that created the repo) +Subsequent DataUploads fail when: + - Running on different node-agent pods (don't have the config file) + - Running after pod restart (config file lost from ephemeral storage) +All HCP resources fail with the same error because they're processed by pods without the config file diff --git a/changelogs/unreleased/0000-kaovilai b/changelogs/unreleased/0000-kaovilai new file mode 100644 index 0000000000..8ab4e03dd9 --- /dev/null +++ b/changelogs/unreleased/0000-kaovilai @@ -0,0 +1 @@ +Fix BackupRepository PrepareRepo to ensure local config file exists when repo exists remotely, using Open() fallback to Connect() pattern. This fixes failures in multi-pod environments and after pod restarts where config files are missing or invalid. diff --git a/changelogs/unreleased/9379-Lyndon-Li b/changelogs/unreleased/9379-Lyndon-Li new file mode 100644 index 0000000000..481bf8bdd3 --- /dev/null +++ b/changelogs/unreleased/9379-Lyndon-Li @@ -0,0 +1 @@ +Refactor repo provider interface for static configuration \ No newline at end of file diff --git a/design/wildcard-namespace-support-design.md b/design/wildcard-namespace-support-design.md new file mode 100644 index 0000000000..45c4d46416 --- /dev/null +++ b/design/wildcard-namespace-support-design.md @@ -0,0 +1,115 @@ + +# Wildcard Namespace Support + +## Abstract + +Velero currently treats namespace patterns with glob characters as literal strings. This design adds wildcard expansion to support flexible namespace selection using patterns like `app-*` or `test-{dev,staging}`. + +## Background + +Requested in [#1874](https://github.com/vmware-tanzu/velero/issues/1874) for more flexible namespace selection. + +## Goals + +- Support glob pattern expansion in namespace includes/excludes +- Maintain backward compatibility with existing `*` behavior + +## Non-Goals + +- Complex regex patterns beyond basic globs + +## High-Level Design + +Wildcard expansion occurs early in both backup and restore flows, converting patterns to literal namespace lists before normal processing. + +### Backup Flow + +Expansion happens in `getResourceItems()` before namespace collection: +1. Check if wildcards exist using `ShouldExpandWildcards()` +2. Expand patterns against active cluster namespaces +3. Replace includes/excludes with expanded literal namespaces +4. Continue with normal backup processing + +### Restore Flow + +Expansion occurs in `execute()` after parsing backup contents: +1. Extract available namespaces from backup tar +2. Expand patterns against backup namespaces (not cluster namespaces) +3. Update restore context with expanded namespaces +4. Continue with normal restore processing + +This ensures restore wildcards match actual backup contents, not current cluster state. + +## Detailed Design + +### Status Fields + +Add wildcard expansion tracking to backup and restore CRDs: + +```go +type WildcardNamespaceStatus struct { + // IncludeWildcardMatches records namespaces that matched include patterns + // +optional + IncludeWildcardMatches []string `json:"includeWildcardMatches,omitempty"` + + // ExcludeWildcardMatches records namespaces that matched exclude patterns + // +optional + ExcludeWildcardMatches []string `json:"excludeWildcardMatches,omitempty"` + + // WildcardResult records final namespaces after wildcard processing + // +optional + WildcardResult []string `json:"wildcardResult,omitempty"` +} + +// Added to both BackupStatus and RestoreStatus +type BackupStatus struct { + // WildcardNamespaces contains wildcard expansion results + // +optional + WildcardNamespaces *WildcardNamespaceStatus `json:"wildcardNamespaces,omitempty"` +} +``` + +### Wildcard Expansion Package + +New `pkg/util/wildcard/expand.go` package provides: + +- `ShouldExpandWildcards()` - Skip expansion for simple "*" case +- `ExpandWildcards()` - Main expansion function using `github.com/gobwas/glob` +- Pattern validation rejecting unsupported regex symbols + +**Supported patterns**: `*`, `?`, `[abc]`, `{a,b,c}` +**Unsupported**: `|()`, `**` + +### Implementation Details + +#### Backup Integration (`pkg/backup/item_collector.go`) + +Expansion in `getResourceItems()`: +- Call `wildcard.ExpandWildcards()` with cluster namespaces +- Update `NamespaceIncludesExcludes` with expanded results +- Populate status fields with expansion results + +#### Restore Integration (`pkg/restore/restore.go`) + +Expansion in `execute()`: +```go +if wildcard.ShouldExpandWildcards(includes, excludes) { + availableNamespaces := extractNamespacesFromBackup(backupResources) + expandedIncludes, expandedExcludes, err := wildcard.ExpandWildcards( + availableNamespaces, includes, excludes) + // Update context and status +} +``` + +## Alternatives Considered + +1. **Client-side expansion**: Rejected because it wouldn't work for scheduled backups +2. **Expansion in `collectNamespaces`**: Rejected because these functions expect literal namespaces + +## Compatibility + +Maintains full backward compatibility - existing "*" behavior unchanged. + +## Implementation + +Target: Velero 1.18 diff --git a/pkg/repository/manager/manager.go b/pkg/repository/manager/manager.go index 7027c96500..ebbf2e8e2f 100644 --- a/pkg/repository/manager/manager.go +++ b/pkg/repository/manager/manager.go @@ -60,7 +60,19 @@ type Manager interface { BatchForget(context.Context, *velerov1api.BackupRepository, []string) []error // DefaultMaintenanceFrequency returns the default maintenance frequency from the specific repo - DefaultMaintenanceFrequency(repo *velerov1api.BackupRepository) (time.Duration, error) + DefaultMaintenanceFrequency(*velerov1api.BackupRepository) (time.Duration, error) + + // ClientSideCacheLimit returns the max cache size required on client side + ClientSideCacheLimit(*velerov1api.BackupRepository) (int64, error) +} + +// ConfigProvider defines the methods to get configurations of a backup repository +type ConfigManager interface { + // DefaultMaintenanceFrequency returns the default maintenance frequency from the specific repo + DefaultMaintenanceFrequency(string) (time.Duration, error) + + // ClientSideCacheLimit returns the max cache size required on client side + ClientSideCacheLimit(string, map[string]string) (int64, error) } type manager struct { @@ -74,6 +86,11 @@ type manager struct { log logrus.FieldLogger } +type configManager struct { + providers map[string]provider.ConfigProvider + log logrus.FieldLogger +} + // NewManager create a new repository manager. func NewManager( namespace string, @@ -101,6 +118,20 @@ func NewManager( return mgr } +// NewConfigManager create a new repository config manager. +func NewConfigManager( + log logrus.FieldLogger, +) ConfigManager { + mgr := &configManager{ + providers: map[string]provider.ConfigProvider{}, + log: log, + } + + mgr.providers[velerov1api.BackupRepositoryTypeKopia] = provider.NewUnifiedRepoConfigProvider(velerov1api.BackupRepositoryTypeKopia, mgr.log) + + return mgr +} + func (m *manager) InitRepo(repo *velerov1api.BackupRepository) error { m.repoLocker.LockExclusive(repo.Name) defer m.repoLocker.UnlockExclusive(repo.Name) @@ -227,12 +258,16 @@ func (m *manager) DefaultMaintenanceFrequency(repo *velerov1api.BackupRepository return 0, errors.WithStack(err) } - param, err := m.assembleRepoParam(repo) + return prd.DefaultMaintenanceFrequency(), nil +} + +func (m *manager) ClientSideCacheLimit(repo *velerov1api.BackupRepository) (int64, error) { + prd, err := m.getRepositoryProvider(repo) if err != nil { return 0, errors.WithStack(err) } - return prd.DefaultMaintenanceFrequency(context.Background(), param), nil + return prd.ClientSideCacheLimit(repo.Spec.RepositoryConfig), nil } func (m *manager) getRepositoryProvider(repo *velerov1api.BackupRepository) (provider.Provider, error) { @@ -256,3 +291,30 @@ func (m *manager) assembleRepoParam(repo *velerov1api.BackupRepository) (provide BackupRepo: repo, }, nil } + +func (cm *configManager) DefaultMaintenanceFrequency(repoType string) (time.Duration, error) { + prd, err := cm.getRepositoryProvider(repoType) + if err != nil { + return 0, errors.WithStack(err) + } + + return prd.DefaultMaintenanceFrequency(), nil +} + +func (cm *configManager) ClientSideCacheLimit(repoType string, repoOption map[string]string) (int64, error) { + prd, err := cm.getRepositoryProvider(repoType) + if err != nil { + return 0, errors.WithStack(err) + } + + return prd.ClientSideCacheLimit(repoOption), nil +} + +func (cm *configManager) getRepositoryProvider(repoType string) (provider.ConfigProvider, error) { + switch repoType { + case velerov1api.BackupRepositoryTypeKopia: + return cm.providers[velerov1api.BackupRepositoryTypeKopia], nil + default: + return nil, fmt.Errorf("failed to get provider for repository %s", repoType) + } +} diff --git a/pkg/repository/manager/manager_test.go b/pkg/repository/manager/manager_test.go index d743e267ae..ea38c7448b 100644 --- a/pkg/repository/manager/manager_test.go +++ b/pkg/repository/manager/manager_test.go @@ -47,3 +47,20 @@ func TestGetRepositoryProvider(t *testing.T) { _, err = mgr.getRepositoryProvider(repo) require.Error(t, err) } + +func TestGetRepositoryConfigProvider(t *testing.T) { + mgr := NewConfigManager(nil).(*configManager) + + // empty repository type + _, err := mgr.getRepositoryProvider("") + require.Error(t, err) + + // valid repository type + provider, err := mgr.getRepositoryProvider(velerov1.BackupRepositoryTypeKopia) + require.NoError(t, err) + assert.NotNil(t, provider) + + // invalid repository type + _, err = mgr.getRepositoryProvider(velerov1.BackupRepositoryTypeRestic) + require.Error(t, err) +} diff --git a/pkg/repository/mocks/ConfigManager.go b/pkg/repository/mocks/ConfigManager.go new file mode 100644 index 0000000000..590f49c33b --- /dev/null +++ b/pkg/repository/mocks/ConfigManager.go @@ -0,0 +1,84 @@ +// Code generated by mockery v2.53.2. DO NOT EDIT. + +package mocks + +import ( + mock "github.com/stretchr/testify/mock" + + time "time" +) + +// ConfigManager is an autogenerated mock type for the ConfigManager type +type ConfigManager struct { + mock.Mock +} + +// ClientSideCacheLimit provides a mock function with given fields: repoType, repoOption +func (_m *ConfigManager) ClientSideCacheLimit(repoType string, repoOption map[string]string) (int64, error) { + ret := _m.Called(repoType, repoOption) + + if len(ret) == 0 { + panic("no return value specified for ClientSideCacheLimit") + } + + var r0 int64 + var r1 error + if rf, ok := ret.Get(0).(func(string, map[string]string) (int64, error)); ok { + return rf(repoType, repoOption) + } + if rf, ok := ret.Get(0).(func(string, map[string]string) int64); ok { + r0 = rf(repoType, repoOption) + } else { + r0 = ret.Get(0).(int64) + } + + if rf, ok := ret.Get(1).(func(string, map[string]string) error); ok { + r1 = rf(repoType, repoOption) + } else { + r1 = ret.Error(1) + } + + return r0, r1 +} + +// DefaultMaintenanceFrequency provides a mock function with given fields: repoType +func (_m *ConfigManager) DefaultMaintenanceFrequency(repoType string) (time.Duration, error) { + ret := _m.Called(repoType) + + if len(ret) == 0 { + panic("no return value specified for DefaultMaintenanceFrequency") + } + + var r0 time.Duration + var r1 error + if rf, ok := ret.Get(0).(func(string) (time.Duration, error)); ok { + return rf(repoType) + } + if rf, ok := ret.Get(0).(func(string) time.Duration); ok { + r0 = rf(repoType) + } else { + r0 = ret.Get(0).(time.Duration) + } + + if rf, ok := ret.Get(1).(func(string) error); ok { + r1 = rf(repoType) + } else { + r1 = ret.Error(1) + } + + return r0, r1 +} + +// NewConfigManager creates a new instance of ConfigManager. It also registers a testing interface on the mock and a cleanup function to assert the mocks expectations. +// The first argument is typically a *testing.T value. +func NewConfigManager(t interface { + mock.TestingT + Cleanup(func()) +}) *ConfigManager { + mock := &ConfigManager{} + mock.Mock.Test(t) + + t.Cleanup(func() { mock.AssertExpectations(t) }) + + return mock +} diff --git a/pkg/repository/mocks/Manager.go b/pkg/repository/mocks/Manager.go index 2264117752..954967adc7 100644 --- a/pkg/repository/mocks/Manager.go +++ b/pkg/repository/mocks/Manager.go @@ -1,4 +1,4 @@ -// Code generated by mockery v2.39.1. DO NOT EDIT. +// Code generated by mockery v2.53.2. DO NOT EDIT. package mocks @@ -37,6 +37,34 @@ func (_m *Manager) BatchForget(_a0 context.Context, _a1 *v1.BackupRepository, _a return r0 } +// ClientSideCacheLimit provides a mock function with given fields: _a0 +func (_m *Manager) ClientSideCacheLimit(_a0 *v1.BackupRepository) (int64, error) { + ret := _m.Called(_a0) + + if len(ret) == 0 { + panic("no return value specified for ClientSideCacheLimit") + } + + var r0 int64 + var r1 error + if rf, ok := ret.Get(0).(func(*v1.BackupRepository) (int64, error)); ok { + return rf(_a0) + } + if rf, ok := ret.Get(0).(func(*v1.BackupRepository) int64); ok { + r0 = rf(_a0) + } else { + r0 = ret.Get(0).(int64) + } + + if rf, ok := ret.Get(1).(func(*v1.BackupRepository) error); ok { + r1 = rf(_a0) + } else { + r1 = ret.Error(1) + } + + return r0, r1 +} + // ConnectToRepo provides a mock function with given fields: repo func (_m *Manager) ConnectToRepo(repo *v1.BackupRepository) error { ret := _m.Called(repo) @@ -55,9 +83,9 @@ func (_m *Manager) ConnectToRepo(repo *v1.BackupRepository) error { return r0 } -// DefaultMaintenanceFrequency provides a mock function with given fields: repo -func (_m *Manager) DefaultMaintenanceFrequency(repo *v1.BackupRepository) (time.Duration, error) { - ret := _m.Called(repo) +// DefaultMaintenanceFrequency provides a mock function with given fields: _a0 +func (_m *Manager) DefaultMaintenanceFrequency(_a0 *v1.BackupRepository) (time.Duration, error) { + ret := _m.Called(_a0) if len(ret) == 0 { panic("no return value specified for DefaultMaintenanceFrequency") @@ -66,16 +94,16 @@ func (_m *Manager) DefaultMaintenanceFrequency(repo *v1.BackupRepository) (time. var r0 time.Duration var r1 error if rf, ok := ret.Get(0).(func(*v1.BackupRepository) (time.Duration, error)); ok { - return rf(repo) + return rf(_a0) } if rf, ok := ret.Get(0).(func(*v1.BackupRepository) time.Duration); ok { - r0 = rf(repo) + r0 = rf(_a0) } else { r0 = ret.Get(0).(time.Duration) } if rf, ok := ret.Get(1).(func(*v1.BackupRepository) error); ok { - r1 = rf(repo) + r1 = rf(_a0) } else { r1 = ret.Error(1) } diff --git a/pkg/repository/provider/provider.go b/pkg/repository/provider/provider.go index 822e69aba9..7681d830bb 100644 --- a/pkg/repository/provider/provider.go +++ b/pkg/repository/provider/provider.go @@ -32,6 +32,8 @@ type RepoParam struct { // Provider defines the methods to manipulate a backup repository type Provider interface { + ConfigProvider + // InitRepo is to initialize a repository from a new storage place InitRepo(ctx context.Context, param RepoParam) error @@ -60,7 +62,13 @@ type Provider interface { // BatchForget is to delete a list of snapshots from the repository BatchForget(ctx context.Context, snapshotIDs []string, param RepoParam) []error +} +// ConfigProvider defines the methods to get configurations of a backup repository +type ConfigProvider interface { // DefaultMaintenanceFrequency returns the default frequency to run maintenance - DefaultMaintenanceFrequency(ctx context.Context, param RepoParam) time.Duration + DefaultMaintenanceFrequency() time.Duration + + // ClientSideCacheLimit returns the max cache size required on client side + ClientSideCacheLimit(repoOption map[string]string) int64 } diff --git a/pkg/repository/provider/restic.go b/pkg/repository/provider/restic.go index 3ca5988cb3..ad4e0a2575 100644 --- a/pkg/repository/provider/restic.go +++ b/pkg/repository/provider/restic.go @@ -90,6 +90,10 @@ func (r *resticRepositoryProvider) BatchForget(ctx context.Context, snapshotIDs return errs } -func (r *resticRepositoryProvider) DefaultMaintenanceFrequency(ctx context.Context, param RepoParam) time.Duration { +func (r *resticRepositoryProvider) DefaultMaintenanceFrequency() time.Duration { return r.svc.DefaultMaintenanceFrequency() } + +func (r *resticRepositoryProvider) ClientSideCacheLimit(repoOption map[string]string) int64 { + return 0 +} diff --git a/pkg/repository/provider/unified_repo.go b/pkg/repository/provider/unified_repo.go index c13665d2df..efaf77d214 100644 --- a/pkg/repository/provider/unified_repo.go +++ b/pkg/repository/provider/unified_repo.go @@ -46,6 +46,12 @@ type unifiedRepoProvider struct { log logrus.FieldLogger } +type unifiedRepoConfigProvider struct { + repoService udmrepo.BackupRepoService + repoBackend string + log logrus.FieldLogger +} + // this func is assigned to a package-level variable so it can be // replaced when unit-testing var getS3Credentials = repoconfig.GetS3Credentials @@ -86,9 +92,18 @@ func NewUnifiedRepoProvider( return &repo } -func GetUnifiedRepoClientSideCacheLimit(repoOption map[string]string, repoBackend string, log logrus.FieldLogger) int64 { - repoService := createRepoService(repoBackend, log) - return repoService.ClientSideCacheLimit(repoOption) +func NewUnifiedRepoConfigProvider( + repoBackend string, + log logrus.FieldLogger, +) ConfigProvider { + repo := unifiedRepoConfigProvider{ + repoBackend: repoBackend, + log: log, + } + + repo.repoService = createRepoService(repoBackend, log) + + return &repo } func (urp *unifiedRepoProvider) InitRepo(ctx context.Context, param RepoParam) error { @@ -196,8 +211,20 @@ func (urp *unifiedRepoProvider) PrepareRepo(ctx context.Context, param RepoParam if created, err := urp.repoService.IsCreated(ctx, *repoOption); err != nil { return errors.Wrap(err, "error to check backup repo") } else if created { - log.Debug("Repo has already been initialized remotely") - return nil + // Repo exists remotely. Try to open with existing config file. + bkRepo, err := urp.repoService.Open(ctx, *repoOption) + if err == nil { + // Config file exists and is valid + if c := bkRepo.Close(ctx); c != nil { + log.WithError(c).Error("Failed to close repo") + } + log.Debug("Repo has already been initialized remotely and config file is valid") + return nil + } + + // Config file missing or invalid, recreate it + log.WithError(err).Debug("Repo exists remotely but config file issue detected, connecting to it") + return urp.repoService.Connect(ctx, *repoOption) } if param.BackupLocation.Spec.AccessMode == velerov1api.BackupStorageLocationAccessModeReadOnly { @@ -375,10 +402,14 @@ func (urp *unifiedRepoProvider) BatchForget(ctx context.Context, snapshotIDs []s return errs } -func (urp *unifiedRepoProvider) DefaultMaintenanceFrequency(ctx context.Context, param RepoParam) time.Duration { +func (urp *unifiedRepoProvider) DefaultMaintenanceFrequency() time.Duration { return urp.repoService.DefaultMaintenanceFrequency() } +func (urp *unifiedRepoProvider) ClientSideCacheLimit(repoOption map[string]string) int64 { + return urp.repoService.ClientSideCacheLimit(repoOption) +} + func (urp *unifiedRepoProvider) GetPassword(param any) (string, error) { _, ok := param.(RepoParam) if !ok { @@ -429,6 +460,14 @@ func (urp *unifiedRepoProvider) GetStoreOptions(param any) (map[string]string, e return storeOptions, nil } +func (urcp *unifiedRepoConfigProvider) DefaultMaintenanceFrequency() time.Duration { + return urcp.repoService.DefaultMaintenanceFrequency() +} + +func (urcp *unifiedRepoConfigProvider) ClientSideCacheLimit(repoOption map[string]string) int64 { + return urcp.repoService.ClientSideCacheLimit(repoOption) +} + func getRepoPassword(secretStore credentials.SecretStore) (string, error) { if secretStore == nil { return "", errors.New("invalid credentials interface") diff --git a/pkg/repository/provider/unified_repo_test.go b/pkg/repository/provider/unified_repo_test.go index 05f9acd3b6..64bc2ce739 100644 --- a/pkg/repository/provider/unified_repo_test.go +++ b/pkg/repository/provider/unified_repo_test.go @@ -612,8 +612,11 @@ func TestPrepareRepo(t *testing.T) { funcTable localFuncTable getter *credmock.SecretStore repoService *reposervicenmocks.BackupRepoService + backupRepo *reposervicenmocks.BackupRepo retFuncCreate func(context.Context, udmrepo.RepoOptions) error retFuncCheck func(context.Context, udmrepo.RepoOptions) (bool, error) + retFuncConnect func(context.Context, udmrepo.RepoOptions) error + retFuncOpen []any credStoreReturn string credStoreError error readOnlyBSL bool @@ -662,7 +665,7 @@ func TestPrepareRepo(t *testing.T) { expectedErr: "error to check backup repo: fake-error", }, { - name: "already initialized", + name: "already initialized with valid config", getter: new(credmock.SecretStore), credStoreReturn: "fake-password", funcTable: localFuncTable{ @@ -674,13 +677,66 @@ func TestPrepareRepo(t *testing.T) { }, }, repoService: new(reposervicenmocks.BackupRepoService), + backupRepo: new(reposervicenmocks.BackupRepo), retFuncCheck: func(ctx context.Context, repoOption udmrepo.RepoOptions) (bool, error) { return true, nil }, + retFuncOpen: []any{nil}, // Open succeeds - config file exists and is valid + retFuncCreate: func(ctx context.Context, repoOption udmrepo.RepoOptions) error { + return errors.New("fake-error-should-not-be-called") + }, + }, + { + name: "already initialized but config missing and connect succeeds", + getter: new(credmock.SecretStore), + credStoreReturn: "fake-password", + funcTable: localFuncTable{ + getStorageVariables: func(*velerov1api.BackupStorageLocation, string, string, map[string]string) (map[string]string, error) { + return map[string]string{}, nil + }, + getStorageCredentials: func(*velerov1api.BackupStorageLocation, velerocredentials.FileStore) (map[string]string, error) { + return map[string]string{}, nil + }, + }, + repoService: new(reposervicenmocks.BackupRepoService), + backupRepo: new(reposervicenmocks.BackupRepo), + retFuncCheck: func(ctx context.Context, repoOption udmrepo.RepoOptions) (bool, error) { + return true, nil + }, + retFuncOpen: []any{errors.New("fake-open-error")}, // Open fails - config file missing/invalid + retFuncConnect: func(ctx context.Context, repoOption udmrepo.RepoOptions) error { + return nil // Connect succeeds - recreates config file + }, retFuncCreate: func(ctx context.Context, repoOption udmrepo.RepoOptions) error { - return errors.New("fake-error") + return errors.New("fake-error-should-not-be-called") }, }, + { + name: "already initialized but config missing and connect fails", + getter: new(credmock.SecretStore), + credStoreReturn: "fake-password", + funcTable: localFuncTable{ + getStorageVariables: func(*velerov1api.BackupStorageLocation, string, string, map[string]string) (map[string]string, error) { + return map[string]string{}, nil + }, + getStorageCredentials: func(*velerov1api.BackupStorageLocation, velerocredentials.FileStore) (map[string]string, error) { + return map[string]string{}, nil + }, + }, + repoService: new(reposervicenmocks.BackupRepoService), + backupRepo: new(reposervicenmocks.BackupRepo), + retFuncCheck: func(ctx context.Context, repoOption udmrepo.RepoOptions) (bool, error) { + return true, nil + }, + retFuncOpen: []any{errors.New("fake-open-error")}, // Open fails - config file missing/invalid + retFuncConnect: func(ctx context.Context, repoOption udmrepo.RepoOptions) error { + return errors.New("fake-connect-error") // Connect also fails + }, + retFuncCreate: func(ctx context.Context, repoOption udmrepo.RepoOptions) error { + return errors.New("fake-error-should-not-be-called") + }, + expectedErr: "fake-connect-error", + }, { name: "bsl is readonly", readOnlyBSL: true, @@ -766,6 +822,21 @@ func TestPrepareRepo(t *testing.T) { tc.repoService.On("IsCreated", mock.Anything, mock.Anything).Return(tc.retFuncCheck) tc.repoService.On("Create", mock.Anything, mock.Anything, mock.Anything).Return(tc.retFuncCreate) + tc.repoService.On("Connect", mock.Anything, mock.Anything).Return(tc.retFuncConnect) + + // Mock Open() - returns BackupRepo if successful, error if not + if tc.retFuncOpen != nil { + if tc.retFuncOpen[0] == nil { + // Open succeeds - return the mock BackupRepo + if tc.backupRepo != nil { + tc.backupRepo.On("Close", mock.Anything).Return(nil) + } + tc.repoService.On("Open", mock.Anything, mock.Anything).Return(tc.backupRepo, nil) + } else { + // Open fails - return error + tc.repoService.On("Open", mock.Anything, mock.Anything).Return(nil, tc.retFuncOpen[0]) + } + } if tc.readOnlyBSL { bsl.Spec.AccessMode = velerov1api.BackupStorageLocationAccessModeReadOnly diff --git a/velero-preparerepo-bug.md b/velero-preparerepo-bug.md new file mode 100644 index 0000000000..cfb86c9e8e --- /dev/null +++ b/velero-preparerepo-bug.md @@ -0,0 +1,103 @@ +# Velero PrepareRepo Bug: Missing Local Kopia Config File + +## Summary + +The `PrepareRepo()` function in Velero's unified repository provider fails to create the local Kopia configuration file when connecting to an existing remote repository. This causes subsequent backup operations to fail with "config file not found" errors in multi-pod environments, after pod restarts, or during sequential backup operations. + +## Bug Location + +**File**: `pkg/repository/provider/unified_repo.go` +**Lines**: 196-200 +**Velero Version**: Current main branch (observed in OpenShift OADP operator using Velero) + +## Root Cause + +When `PrepareRepo()` detects that a repository already exists remotely (via `IsCreated()`), it returns immediately without calling `Connect()`: + +```go +if created, err := urp.repoService.IsCreated(ctx, *repoOption); err != nil { + return errors.Wrap(err, "error to check backup repo") +} else if created { + log.Debug("Repo has already been initialized remotely") + return nil // ❌ BUG: Returns without creating local config file! +} +``` + +**Problem**: The `Connect()` function is responsible for creating the local Kopia config file (e.g., `/home/velero/udmrepo/repo-.conf`). Skipping this call means the config file is never created, causing later `Open()` operations to fail. + +## Kopia Architecture Context + +Kopia requires **two components** to function: +1. **Backend Storage**: Repository metadata stored in S3/Azure/GCS (created by `Create()`) +2. **Local Config File**: Connection information stored locally (created by `Connect()`) + +The current code only creates the local config file during initial repository creation, not when connecting to existing repositories. + +## Reproduction Scenarios + +1. **Multi-node environments (node-agent daemonset)**: + - Node-agent pod on Node A creates the repository (config file exists only on Node A's pod) + - Node-agent pod on Node B tries to use the same repository for a different PVC/backup + - Result: Node B's node-agent pod fails with "config file not found" + +2. **Node-agent pod restarts**: + - Node-agent pod creates repository with config in ephemeral storage (`/home/velero/udmrepo/`) + - Node-agent pod restarts (upgrade, crash, node drain), losing the config file + - Next backup attempt by the restarted node-agent fails + +3. **Sequential backups across nodes (node-agent daemonset)**: + - First backup creates repository via node-agent pod on Node A + - Second backup of a PVC on Node B is handled by that node's node-agent pod + - Node B's node-agent pod can't access the repository (no local config file) + +## Error Message + +``` +error to initialize data path: error to boost backup repository connection: +error to connect backup repo: error to connect repo with storage: +error to connect to repository: error loading config file: +open /home/velero/udmrepo/repo-.conf: no such file or directory +``` + +## Proposed Fix + +Replace the early return with a `Connect()` call: + +```go +if created, err := urp.repoService.IsCreated(ctx, *repoOption); err != nil { + return errors.Wrap(err, "error to check backup repo") +} else if created { + log.Debug("Repo has already been initialized remotely, connecting to it") + return urp.repoService.Connect(ctx, *repoOption) // ✅ FIX +} +``` + +## Impact + +**Severity**: High +**Affected Scenarios**: +- Multi-node Kubernetes clusters +- Any environment with pod restarts +- Sequential backup operations +- CSI Data Mover backups in OADP/OpenShift + +**Current Workaround**: None known. The issue manifests as intermittent backup failures. + +## Testing + +The bug was discovered while investigating HCP (Hosted Control Plane) backup test failures in OpenShift OADP operator: +- **PR**: openshift/oadp-operator#2013 +- **Test**: HCP backup/restore with CSI Data Mover +- **Build Log**: [CI Build Log](https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_oadp-operator/2013/pull-ci-openshift-oadp-operator-oadp-dev-4.20-e2e-test-hcp-aws/1986610135258107904/artifacts/e2e-test-hcp-aws/e2e/build-log.txt) + +## Related Code + +- `pkg/repository/udmrepo/kopialib/repo_init.go`: `ConnectBackupRepo()` creates local config +- `pkg/repository/udmrepo/kopialib/lib_repo.go`: `Open()` requires existing config file +- `pkg/repository/provider/unified_repo.go`: `PrepareRepo()` contains the bug + +## Status + +- **Discovered**: 2025-11-07 +- **Velero Issue Tracker**: No existing issues found matching this bug +- **Fix Status**: Proposed but not yet submitted