
Valkey master-replica support#131

Open
olehoerb wants to merge 13 commits into master from valkey-cluster-config

Conversation

@olehoerb
Collaborator

Description

feat: Implements complete Valkey master-replica support with leader-election-based backup/restore coordination. Ensures safe backup and restore operations in multi-replica deployments: only the Valkey master (pod-0) performs backup/restore operations.

  • Add leader election package using https://pkg.go.dev/k8s.io/client-go/tools/leaderelection for coordination
  • Add Valkey master-replica database implementation
  • Add integration tests for master-replica restore workflow
  • Add Kubernetes deployment examples and backing resources
  • Integrate leader election checks into backup coordination

References: #124
To fully close #124, a Helm chart is still needed.

- cluster setup not working correctly
- fix backup leader/master mismatch
- add context management
- optimize backup coordination
- refactor the name from valkey-cluster to valkey-master-replica
@olehoerb requested a review from a team as a code owner October 29, 2025 14:26
@metal-robot bot added the area: control-plane label (Affects the metal-stack control-plane area.) Oct 29, 2025
Contributor

@majst01 left a comment


Nice work, first simple comments from my side

Comment on lines +347 to +352
parts := strings.Split(podName, "-")
if len(parts) == 0 {
	return -1
}

ordinalStr := parts[len(parts)-1]
Contributor


consider strings.Cut
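A sketch of this kind of simplification (the helper name `podOrdinal` is illustrative, not the PR's actual function; since pod names can contain several dashes, `strings.LastIndex` fits the "take the last segment" semantics more directly than `strings.Cut`, which splits on the first separator):

```go
package main

import (
	"strconv"
	"strings"
)

// podOrdinal extracts the trailing StatefulSet ordinal from a pod name
// such as "valkey-master-replica-2". It returns -1 if no ordinal is found.
func podOrdinal(podName string) int {
	// Take everything after the last dash, since StatefulSet pod names
	// always end in "-<ordinal>".
	i := strings.LastIndex(podName, "-")
	if i < 0 {
		return -1
	}
	ordinal, err := strconv.Atoi(podName[i+1:])
	if err != nil {
		return -1
	}
	return ordinal
}
```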

Collaborator Author


Thanks for pointing it out! But I think I found an even better solution here after reviewing the strings functions.

Comment on lines +239 to +245
init.sh: "#!/bin/sh\nset -e\n\n# Extract pod ordinal from hostname (valkey-0, valkey-1,
etc.)\nORDINAL=$(hostname | sed 's/.*-//')\n\n# Pod 0 is the master, others are
replicas\nif [ \"$ORDINAL\" = \"0\" ]; then\n echo \"I am the master (pod-0)\"\t\t\n
\ exec valkey-server --port 6379 --bind 0.0.0.0\nelse\n echo \"I am a replica
(pod-$ORDINAL), connecting to master at valkey-0.valkey.${POD_NAMESPACE}.svc.cluster.local\"\n
\ exec valkey-server --port 6379 --bind 0.0.0.0 --replicaof valkey-0.valkey.${POD_NAMESPACE}.svc.cluster.local
6379\nfi\n"
Contributor


Can be done with a multiline yaml content e.g. |
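The same script expressed with a YAML block scalar could look like this (a sketch reconstructed from the quoted flow-scalar content above):

```yaml
init.sh: |
  #!/bin/sh
  set -e

  # Extract pod ordinal from hostname (valkey-0, valkey-1, etc.)
  ORDINAL=$(hostname | sed 's/.*-//')

  # Pod 0 is the master, others are replicas
  if [ "$ORDINAL" = "0" ]; then
    echo "I am the master (pod-0)"
    exec valkey-server --port 6379 --bind 0.0.0.0
  else
    echo "I am a replica (pod-$ORDINAL), connecting to master at valkey-0.valkey.${POD_NAMESPACE}.svc.cluster.local"
    exec valkey-server --port 6379 --bind 0.0.0.0 --replicaof valkey-0.valkey.${POD_NAMESPACE}.svc.cluster.local 6379
  fi
```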

Comment on lines +247 to +253
init.sh: "#!/bin/sh\nset -e\n\n# Extract pod ordinal from hostname (valkey-0, valkey-1,
etc.)\nORDINAL=$(hostname | sed 's/.*-//')\n\n# Pod 0 is the master, others are
replicas\nif [ \"$ORDINAL\" = \"0\" ]; then\n echo \"I am the master (pod-0)\"\t\t\n
\ exec valkey-server --port 6379 --bind 0.0.0.0\nelse\n echo \"I am a replica
(pod-$ORDINAL), connecting to master at valkey-0.valkey.${POD_NAMESPACE}.svc.cluster.local\"\n
\ exec valkey-server --port 6379 --bind 0.0.0.0 --replicaof valkey-0.valkey.${POD_NAMESPACE}.svc.cluster.local
6379\nfi\n"
Contributor


multiline content

@github-project-automation bot moved this to In Progress in Development Oct 29, 2025
@olehoerb requested a review from majst01 October 31, 2025 07:29
@vknabel moved this from In Progress to Upcoming in Development Nov 3, 2025
Contributor

@Gerrit91 left a comment


Actually looks really good and promising!

Not fully done yet but here are some first review comments.

Comment on lines +63 to +68
if leaderElector, ok := b.db.(database.DatabaseLeaderElector); ok {
	if !leaderElector.ShouldPerformBackup(ctx) {
		b.log.Debug("skipping backup - not elected as leader")
		return
	}
}
Contributor


Better use an extended contract in the DatabaseProber interface instead of a type cast, something like:

if !b.db.IsLeader(ctx) {
	b.log.Debug("skipping backup - not elected as leader")
	return
}

Makefile Outdated
.PHONY: test-integration-valkey-master-replica
test-integration-valkey-master-replica: kind-cluster-create
	kind --name backup-restore-sidecar load docker-image ghcr.io/metal-stack/backup-restore-sidecar:latest
	kind --name backup-restore-sidecar load docker-image ghcr.io/valkey-io/valkey:8.1-alpine
Contributor


Can be removed as this comes from the internet.

Comment on lines +216 to +221
if podName == "" {
	return nil, fmt.Errorf("cluster mode requires POD_NAME environment variable to be set")
}
if podNamespace == "" {
	return nil, fmt.Errorf("cluster mode requires POD_NAMESPACE environment variable to be set")
}
Contributor


These checks can be moved into the leader election package if they are required for it.

client *redis.Client

clusterMode bool
clusterSize int
Contributor


This field is not used and can probably be removed.

backup-cron-schedule: "*/1 * * * *"
object-prefix: valkey-test
object-prefix: valkey-test-${POD_NAME}
redis-addr: localhost:6379
Contributor


This key is not used for the valkey backend, so it can be removed.

log.Info("Creating Valkey instance", "clusterMode", clusterMode, "clusterSize", clusterSize)

v.client = redis.NewClient(&redis.Options{
Addr: "localhost:6379",
Contributor


This should ideally come from a configuration parameter (e.g. "valkey-addr")

}

if !isMaster {
db.log.Info("elected as backup leader but not Valkey master, skipping backup")
Contributor


How do you ensure that the leader election converges on the database master, so that no backups are skipped in case of a constant mismatch?

v := &Valkey{
log: log,
datadir: datadir,
password: getPassword(password),
Contributor


You can replace the getPassword function in favor of a function from metal-lib:

Suggested change
password: getPassword(password),
password: pointer.SafeDerefOrDefault(password, ""),
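Roughly, such a helper can be sketched as a small generic function (this is an illustrative reconstruction; the actual metal-lib implementation may differ in detail):

```go
package main

// SafeDerefOrDefault sketches the metal-lib pointer helper the reviewer
// refers to: it dereferences p if non-nil, otherwise it returns fallback.
func SafeDerefOrDefault[T any](p *T, fallback T) T {
	if p == nil {
		return fallback
	}
	return *p
}
```

Using a shared generic helper like this removes the need for a type-specific `getPassword` function.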


// Leader election considers database role: only restore if this pod will be the Valkey master
// In master-replica mode, pod-0 is always the master (determined by init.sh)
podName := os.Getenv("POD_NAME")
Contributor


Maybe it's more robust to retrieve this from the leaderElection package, because the pod name might originate from the configuration and not only from the environment variable.

return fmt.Errorf("restore file not present: %s", dump)
}

if db.clusterMode {
Contributor


Probably it's better to reduce code indentation by negating this expression. The for loop in this function has explicit returns anyway, so the last line of this function does not need to be repeated.

Suggested change
if db.clusterMode {
if !db.clusterMode {

Contributor

@Gerrit91 left a comment


I was able to play around a bit with the setup. Is it correct that this is not a real Valkey Cluster as described here, but rather just adds replicas, which cannot accept writes and are not promoted to master instances under any circumstances?

I came to this conclusion because I ran:

❯ k exec -it valkey-master-replica-0 -c valkey -- valkey-cli cluster info
ERR This instance has cluster support disabled

When killing pod-0 constantly, I can still read values from the database, but it's not possible to write anymore:

❯ k exec -it statefulsets/valkey-master-replica -c valkey -- valkey-cli set foo foo
(error) READONLY You can't write against a read only replica.

I think this is also a good improvement in general, as it allows serving read requests during a node roll or similar. But I am not sure if leader election is really required in this setup, because backups and restores can only be done from pod-0 anyway? Also, we should probably not call it clusterMode?

Can I ask you to describe the approach somewhere in a markdown document in the docs folder? I think this would help a lot. :)

Spec: corev1.PodSpec{
HostNetwork: true,
HostNetwork: true,
ServiceAccountName: "valkey-backup-restore",
Contributor


It would probably be good to add a topology spread constraint here for the example, keyed on the node name, like:

      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: valkey
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway

- use isMaster for backup coordination instead of leader election
- fix STATEFUL_NAME mismatch on valkey container
- restore HostNetwork for standalone, disable for master-replica
- fix newValkeyClient retry to actually ping
- added docs describing the approach
- replaced pointer usage with Go 1.26 pointer
- replaced pointer usage with Go 1.26 pointer
@olehoerb requested a review from Gerrit91 February 17, 2026 08:00
Comment on lines +85 to +86
1. All pods compete for a `Lease` resource.
2. The winner checks its pod ordinal. Only pod-0 (the future master) actually restores from backup.
Contributor


This I do not understand. Why isn't checking the ordinal enough to decide whether the backup needs to be restored? What benefit does the leader election bring?

Contributor


Imagine the scenario where all pods are terminated and all volumes are lost. Then the pods start up fresh and a replica wins the election and restores the data. When the master comes up, it will not restore the data because it's not the leader, and it starts without data. Wouldn't the replicas then sync empty data from the master?

Collaborator Author


That's actually true. The ordinal check is enough, and using the ordinal check should fix the issue with syncing.
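The agreed-on simplification could be sketched roughly like this (function and pod names are illustrative, not the PR's actual code):

```go
package main

import (
	"strconv"
	"strings"
)

// shouldRestore sketches the simplification discussed above: instead of a
// leader election, only the pod that will become the Valkey master
// (StatefulSet ordinal 0) restores from backup, so a fresh master can never
// come up empty while a replica holds the restored data.
func shouldRestore(podName string) bool {
	// Pod names end in "-<ordinal>"; take the segment after the last dash.
	i := strings.LastIndex(podName, "-")
	if i < 0 {
		return false
	}
	ordinal, err := strconv.Atoi(podName[i+1:])
	return err == nil && ordinal == 0
}
```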


## Topology

This is **not** a Valkey Cluster (which requires `cluster-enabled yes` and has built-in sharding/failover). Instead it is a simple master-replica setup using a Kubernetes StatefulSet:
Contributor


Did you evaluate the cluster mode at some point? What were the issues that prevent it from being used? In general, it would be really nice if a node roll of a K8s cluster did not cause any interruption of the service in terms of write operations.


Successfully merging this pull request may close these issues.

Evaluate concept for valkey in clustered configuration

3 participants