Robin stuck in infinite loop after purgeKeysOnRebalance scale-down #48

@DanielDorado

Description

There is a test to reproduce it:

  make test-chaos GINKGO_EXTRA_OPTS='--focus="recovers when Robin ConfigMap has stale primaries from failed scale-down"'

When a RedkeyCluster with purgeKeysOnRebalance=true is scaled down (e.g. from 8 to 3 primaries), Robin can enter a permanent loop trying to reach "ghost nodes" — pods that no longer exist because the StatefulSet was recreated with fewer replicas. The cluster never recovers without manual intervention.

Root cause: The operator does not update Robin's ConfigMap (redis-cluster-robin) when scaling down with purgeKeysOnRebalance=true. The ConfigMap retains the old primaries: 8 value while the CR specifies 3 primaries and the StatefulSet has only 3 replicas. Robin faithfully tries to reach 8 nodes because that's what its configuration says, but pods 3-7 don't exist.

Steps to Reproduce

  1. Create a RedkeyCluster with purgeKeysOnRebalance=true and e.g. 6 primaries
  2. Scale up to 8 primaries (StatefulSet is deleted and recreated with 8 replicas)
  3. While scaling/integrity-check is still in progress, delete some pods and scale down to 3 primaries
  4. The StatefulSet is recreated with 3 replicas (pods 0-2 exist, pods 3-7 do not)

Root Cause: Robin ConfigMap Not Updated

The operator fails to update the Robin ConfigMap before recreating the StatefulSet. Evidence from the stuck cluster:

# kubectl get configmaps redis-cluster-robin -o json | jq -r '.data["application-configmap.yml"]'
metadata:
    namespace: chaos-5-bc6m5
redis:
    standalone: false
    reconciler:
        interval_seconds: 30
        operation_cleanup_interval_seconds: 30
    cluster:
        namespace: chaos-5-bc6m5
        name: redis-cluster
        primaries: 8           # <-- CR says 3, StatefulSet has 3 replicas
        replicas_per_primary: 0
        status: ScalingUp      # <-- operator never updated this either
        ephemeral: true
        health_probe_interval_seconds: 60
        healing_time_seconds: 60
        max_retries: 10
        back_off: 10s
    metrics:
        interval_seconds: 60
        redis_info_keys: []

The CR specifies primaries: 3 and the StatefulSet was recreated with 3 replicas, but the Robin ConfigMap still shows primaries: 8 and status: ScalingUp. Robin is doing exactly what it was told — trying to bring up 8 nodes — but 5 of them don't exist.

The operator likely updates the ConfigMap in a code path that is skipped or fails silently during the purgeKeysOnRebalance=true scale-down flow, particularly when a conflicting Robin operation (e.g. CheckIntegrity) is in progress at the time of the scale request.
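To make the required ordering concrete, here is a minimal Go sketch of the fix: render the Robin config for the *new* primary count and write it before the StatefulSet is recreated. The function name, signature, and the exact YAML shape are illustrative (modeled on the stale application-configmap.yml shown above), not the operator's actual API.

```go
package main

import "fmt"

// renderRobinConfig builds the ConfigMap payload Robin should see after a
// scale operation. Hypothetical helper: field names mirror the
// application-configmap.yml dump above, but this is a sketch, not the
// operator's real template.
func renderRobinConfig(primaries, replicasPerPrimary int, status string) string {
	return fmt.Sprintf(
		"redis:\n"+
			"    cluster:\n"+
			"        primaries: %d\n"+
			"        replicas_per_primary: %d\n"+
			"        status: %s\n",
		primaries, replicasPerPrimary, status)
}

func main() {
	// On the failing 8 -> 3 scale-down, this update must land before the
	// StatefulSet recreation (and must not be skipped when a Robin
	// operation such as CheckIntegrity is in flight), so Robin never
	// targets pods 3-7 that no longer exist.
	fmt.Print(renderRobinConfig(3, 0, "ScalingDown"))
}
```

The key point is the ordering, not the rendering: whatever code path writes this ConfigMap today must also run, and be retried on conflict, in the purgeKeysOnRebalance=true scale-down flow.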

Observed Behavior

Robin retains redis-cluster-3 through redis-cluster-7 in its target node list (because its ConfigMap says primaries: 8) despite only 3 pods existing. It enters an infinite cycle:

  1. Robin tries to initialize each ghost node sequentially, each failing with failed to connect after 10 retries
  2. Robin attempts CLUSTER MEET with ghost nodes, which fails with ERR Invalid node address specified: :6379 (empty IP since pod doesn't exist)
  3. Robin's status alternates between ScalingUp, ScalingUpError, and CheckingIntegrity
  4. The operator is stuck in ScalingDown / EndingFastScaling, polling Robin every 30 seconds
  5. The 3 running Redis pods have cluster_slots_assigned:0 — slots were never distributed after recreation

Additionally, when the operator asks Robin to recreate the cluster during an ongoing CheckIntegrity operation, Robin responds: Cluster cannot be recreated right now due to conflicting operation (operation=CheckIntegrity, status=Running).

The CR status.nodes map still references old node IDs and IPs from before the StatefulSet recreation, while the actual running nodes have new IDs. Robin detected the ID/IP changes but was unable to complete the reconciliation.

This loop continues indefinitely (observed running 3+ hours with no recovery).

Expected Behavior

The operator must update the Robin ConfigMap with the correct primaries count before (or at the same time as) recreating the StatefulSet during a purgeKeysOnRebalance=true scale-down. This ensures Robin targets the correct number of nodes.

Robin should also be resilient to stale configuration: when pods in its target list don't exist, it should detect the mismatch and either reload its configuration or report a clear error rather than retrying indefinitely.
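The resilience check suggested above can be sketched as a pure function: given the primary count from Robin's config and the pods that actually exist, return the configured targets with no backing pod. Pod names follow the redis-cluster-<ordinal> convention seen in the logs; the function itself is hypothetical, not Robin's real code.

```go
package main

import "fmt"

// ghostNodes returns the configured primary targets that have no backing
// pod. Hypothetical sketch: Robin could run this before each retry cycle
// and, on a non-empty result, reload its configuration or fail loudly
// instead of looping on CLUSTER MEET against empty addresses.
func ghostNodes(configuredPrimaries int, existingPods map[string]bool) []string {
	var ghosts []string
	for i := 0; i < configuredPrimaries; i++ {
		name := fmt.Sprintf("redis-cluster-%d", i)
		if !existingPods[name] {
			ghosts = append(ghosts, name)
		}
	}
	return ghosts
}

func main() {
	// Stale config says 8 primaries, but only pods 0-2 survived the
	// scale-down: the 5 ghosts (redis-cluster-3 .. redis-cluster-7)
	// should surface as a config/reality mismatch, not infinite retries.
	existing := map[string]bool{
		"redis-cluster-0": true,
		"redis-cluster-1": true,
		"redis-cluster-2": true,
	}
	fmt.Println(ghostNodes(8, existing))
}
```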

Environment

  • Discovered during chaos testing with purgeKeysOnRebalance=true
  • Namespace: chaos-5-bc6m5 (chaos test run)
  • Chaos test configuration: 10 iterations, scaling between 3-8 primaries with concurrent pod deletions and operator restarts
  • Failure occurred at iteration 4 after scaling 8→3 primaries while pods were being deleted

Relevant Logs

Robin ConfigMap (stale)

primaries: 8    # should be 3
status: ScalingUp

Robin logs (ghost node connection failures)

Error initializing node redis-cluster-7: "failed to connect after 10 retries"
Error initializing node redis-cluster-6: "failed to connect after 10 retries"
Error initializing node redis-cluster-5: "failed to connect after 10 retries"
Error initializing node redis-cluster-4: "failed to connect after 10 retries"
Error initializing node redis-cluster-3: "failed to connect after 10 retries"

Robin logs (CLUSTER MEET with empty IP)

ERR Invalid node address specified: :6379

Operator logs (stuck polling loop)

Finishing fast scaling {"redkey-cluster": "chaos-5-bc6m5/redis-cluster"}
Waiting for cluster to be Ready in Robin {"redkey-cluster": "chaos-5-bc6m5/redis-cluster"}

Redis CLUSTER INFO (0 slots assigned)

cluster_slots_assigned:0
cluster_slots_ok:0
cluster_known_nodes:3
cluster_size:0
