Description
There is a test to reproduce it:

```shell
make test-chaos GINKGO_EXTRA_OPTS='--focus="recovers when Robin ConfigMap has stale primaries from failed scale-down"'
```
When a RedkeyCluster with purgeKeysOnRebalance=true is scaled down (e.g. from 8 to 3 primaries), Robin can enter a permanent loop trying to reach "ghost nodes" — pods that no longer exist because the StatefulSet was recreated with fewer replicas. The cluster never recovers without manual intervention.
Root cause: The operator does not update Robin's ConfigMap (redis-cluster-robin) when scaling down with purgeKeysOnRebalance=true. The ConfigMap retains the old primaries: 8 value while the CR specifies 3 primaries and the StatefulSet has only 3 replicas. Robin faithfully tries to reach 8 nodes because that's what its configuration says, but pods 3-7 don't exist.
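The missing step can be sketched as a small helper that rewrites the stale fields in the ConfigMap payload. This is an illustration only: the helper name and the regex-on-YAML approach are assumptions, and the real operator presumably marshals a typed config struct rather than patching text.

```go
package main

import (
	"fmt"
	"regexp"
)

// syncRobinConfig rewrites the primaries count and status in the Robin
// config payload (the application-configmap.yml data key). Hypothetical
// helper for illustration; not the operator's actual code.
func syncRobinConfig(configYAML string, primaries int, status string) string {
	reP := regexp.MustCompile(`(?m)^(\s*primaries:\s*)\d+`)
	reS := regexp.MustCompile(`(?m)^(\s*status:\s*)\S+`)
	out := reP.ReplaceAllString(configYAML, fmt.Sprintf("${1}%d", primaries))
	return reS.ReplaceAllString(out, "${1}"+status)
}

func main() {
	stale := "cluster:\n  primaries: 8\n  status: ScalingUp\n"
	// Scale-down to 3 primaries: the config must be synced before (or with)
	// the StatefulSet recreation, which is the step the operator skips.
	fmt.Print(syncRobinConfig(stale, 3, "ScalingDown"))
}
```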
Steps to Reproduce
- Create a RedkeyCluster with purgeKeysOnRebalance=true and e.g. 6 primaries
- Scale up to 8 primaries (the StatefulSet is deleted and recreated with 8 replicas)
- While scaling/integrity checking is still in progress, delete some pods and scale down to 3 primaries
- The StatefulSet is recreated with 3 replicas (pods 0-2 exist, pods 3-7 do not)
Root Cause: Robin ConfigMap Not Updated
The operator fails to update the Robin ConfigMap before recreating the StatefulSet. Evidence from the stuck cluster:
```yaml
# kubectl get configmaps redis-cluster-robin -o json | jq -r '.data["application-configmap.yml"]'
metadata:
  namespace: chaos-5-bc6m5
redis:
  standalone: false
reconciler:
  interval_seconds: 30
  operation_cleanup_interval_seconds: 30
cluster:
  namespace: chaos-5-bc6m5
  name: redis-cluster
  primaries: 8          # <-- CR says 3, StatefulSet has 3 replicas
  replicas_per_primary: 0
  status: ScalingUp     # <-- operator never updated this either
  ephemeral: true
  health_probe_interval_seconds: 60
  healing_time_seconds: 60
  max_retries: 10
  back_off: 10s
metrics:
  interval_seconds: 60
  redis_info_keys: []
```
The CR specifies primaries: 3 and the StatefulSet was recreated with 3 replicas, but the Robin ConfigMap still shows primaries: 8 and status: ScalingUp. Robin is doing exactly what it was told — trying to bring up 8 nodes — but 5 of them don't exist.
The operator likely updates the ConfigMap in a code path that is skipped or fails silently during the purgeKeysOnRebalance=true scale-down flow, particularly when a conflicting Robin operation (e.g. CheckIntegrity) is in progress at the time of the scale request.
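The suspected failure mode can be illustrated with a toy reconcile step. All names here are hypothetical; the actual code path has not been identified yet:

```go
package main

import "fmt"

// reconcileScaleDown illustrates the suspected bug: when Robin reports a
// conflicting operation, the branch returns early and the ConfigMap update
// is silently skipped, while the StatefulSet is still recreated elsewhere
// in the reconcile loop. Hypothetical sketch, not the operator's code.
func reconcileScaleDown(conflictingOp bool, updateConfigMap func()) error {
	if conflictingOp {
		// Suspected bug: early return before the ConfigMap is synced,
		// leaving primaries: 8 in place for Robin to act on.
		return fmt.Errorf("conflicting Robin operation in progress, requeueing")
	}
	updateConfigMap()
	return nil
}

func main() {
	updated := false
	err := reconcileScaleDown(true, func() { updated = true })
	// With CheckIntegrity running, the update never happens.
	fmt.Println("configmap updated:", updated, "err:", err)
}
```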
Observed Behavior
Robin retains redis-cluster-3 through redis-cluster-7 in its target node list (because its ConfigMap says primaries: 8) despite only 3 pods existing. It enters an infinite cycle:
- Robin tries to initialize each ghost node sequentially, each attempt failing with "failed to connect after 10 retries"
- Robin attempts CLUSTER MEET with ghost nodes, which fails with "ERR Invalid node address specified: :6379" (empty IP, since the pod doesn't exist)
- Robin alternates between ScalingUp, ScalingUpError, and CheckingIntegrity statuses
- The operator is stuck in ScalingDown / EndingFastScaling, polling Robin every 30 seconds
- The 3 running Redis pods have cluster_slots_assigned:0; slots were never distributed after recreation
Additionally, when the operator asks Robin to recreate the cluster during an ongoing CheckIntegrity operation, Robin responds: Cluster cannot be recreated right now due to conflicting operation (operation=CheckIntegrity, status=Running).
The CR status.nodes map still references old node IDs and IPs from before the StatefulSet recreation, while the actual running nodes have new IDs. Robin detected the ID/IP changes but was unable to complete the reconciliation.
This loop continues indefinitely (observed running 3+ hours with no recovery).
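The set of ghost nodes follows directly from the stale count. A minimal sketch of the mismatch, assuming StatefulSet-style pod naming (`<name>-<ordinal>`):

```go
package main

import "fmt"

// ghostNodes returns the pod names Robin targets but that no longer exist:
// ordinals from the actual replica count up to the stale configured
// primaries count. Assumes StatefulSet naming (<name>-<ordinal>).
func ghostNodes(name string, configuredPrimaries, actualReplicas int) []string {
	var ghosts []string
	for i := actualReplicas; i < configuredPrimaries; i++ {
		ghosts = append(ghosts, fmt.Sprintf("%s-%d", name, i))
	}
	return ghosts
}

func main() {
	// Stale ConfigMap says 8 primaries; only 3 replicas actually exist,
	// so Robin retries redis-cluster-3 through redis-cluster-7 forever.
	fmt.Println(ghostNodes("redis-cluster", 8, 3))
}
```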
Expected Behavior
The operator must update the Robin ConfigMap with the correct primaries count before (or at the same time as) recreating the StatefulSet during a purgeKeysOnRebalance=true scale-down. This ensures Robin targets the correct number of nodes.
Robin should also be resilient to stale configuration: when pods in its target list don't exist, it should detect the mismatch and either reload its configuration or report a clear error rather than retrying indefinitely.
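One possible shape for that resilience, as a sketch (the function name, the existence-check hook, and the error wording are assumptions, not Robin's current API):

```go
package main

import "fmt"

// checkTargets verifies that every configured target pod actually exists
// before a scaling pass starts. On a mismatch it reports a clear error
// (prompting a config reload) instead of retrying connections indefinitely.
// Hypothetical sketch of the proposed behavior.
func checkTargets(targets []string, exists func(string) bool) error {
	var missing []string
	for _, t := range targets {
		if !exists(t) {
			missing = append(missing, t)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("stale configuration: %d target pod(s) do not exist (%v); reload config before retrying", len(missing), missing)
	}
	return nil
}

func main() {
	pods := map[string]bool{"redis-cluster-0": true, "redis-cluster-1": true, "redis-cluster-2": true}
	err := checkTargets([]string{"redis-cluster-0", "redis-cluster-5"},
		func(p string) bool { return pods[p] })
	fmt.Println(err)
}
```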
Environment
- Discovered during chaos testing with purgeKeysOnRebalance=true
- Namespace: chaos-5-bc6m5 (chaos test run)
- Chaos test configuration: 10 iterations, scaling between 3-8 primaries with concurrent pod deletions and operator restarts
- Failure occurred at iteration 4 after scaling 8→3 primaries while pods were being deleted
Relevant Logs
Robin ConfigMap (stale)
```yaml
primaries: 8          # should be 3
status: ScalingUp
```
Robin logs (ghost node connection failures)
```text
Error initializing node redis-cluster-7: "failed to connect after 10 retries"
Error initializing node redis-cluster-6: "failed to connect after 10 retries"
Error initializing node redis-cluster-5: "failed to connect after 10 retries"
Error initializing node redis-cluster-4: "failed to connect after 10 retries"
Error initializing node redis-cluster-3: "failed to connect after 10 retries"
```
Robin logs (CLUSTER MEET with empty IP)
```text
ERR Invalid node address specified: :6379
```
Operator logs (stuck polling loop)
```text
Finishing fast scaling {"redkey-cluster": "chaos-5-bc6m5/redis-cluster"}
Waiting for cluster to be Ready in Robin {"redkey-cluster": "chaos-5-bc6m5/redis-cluster"}
```
Redis CLUSTER INFO (0 slots assigned)
```text
cluster_slots_assigned:0
cluster_slots_ok:0
cluster_known_nodes:3
cluster_size:0
```