Description
Which component are you using?:
/area cluster-autoscaler
What version of the component are you using?:
Component version: v1.33.0
What k8s version are you using (kubectl version)?: 1.34.0
What environment is this in?: Cluster Autoscaler runs in the management cluster alongside the Cluster API + CAPA objects; the workload cluster is set up by CAPI + CAPA on EC2 instances (not EKS)
What did you expect to happen?:
The unneeded node to be scaled down and removed, with the MachineDeployment replicas going back to 0.
What happened instead?:
While Cluster Autoscaler was able to scale up, and also marked the appropriate node for scale-down, the node was never removed and the MachineDeployment replica count remained 1 (it should have gone to 0, since there are no pods on the node).
How to reproduce it (as minimally and precisely as possible):
- create a Management Cluster (on Kind)
- create a workload cluster on EC2 instances (not EKS) using Cluster API
- create MachineDeployments as follows for all 3 AZs (a quick check of the annotations is sketched after the manifest):
  apiVersion: cluster.x-k8s.io/v1beta1
  kind: MachineDeployment
  metadata:
    name: tpfm-k0-md-storage-ap-south-1a
    namespace: default
    annotations:
      # CA discovery bounds (required)
      cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
      cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
      # Scale-from-zero capacity (set to *allocatable* for t3a.medium)
      capacity.cluster-autoscaler.kubernetes.io/cpu: "2"
      capacity.cluster-autoscaler.kubernetes.io/memory: "3748Mi"
      # Scale-from-zero scheduling predicates (so CA knows a new node matches your pods)
      capacity.cluster-autoscaler.kubernetes.io/labels: "node-role.tpfm.io=storage"
      capacity.cluster-autoscaler.kubernetes.io/taints: "node-role.tpfm.io=storage:NoSchedule"
  spec:
    clusterName: tpfm-k0
    replicas: 0
    selector:
      matchLabels: null
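To rule out a mismatch between what CA discovers and what is actually on the object, the annotations can be double-checked from the management cluster. A minimal sketch (names come from the manifest above and the node name from the logs further down; a local copy of the workload kubeconfig named tpfm-k0.kubeconfig is assumed):

  # Confirm the min/max and scale-from-zero annotations are really on the MachineDeployment.
  kubectl -n default get machinedeployment tpfm-k0-md-storage-ap-south-1a \
    -o jsonpath='{.metadata.annotations}'

  # Compare the capacity annotations with the allocatable resources of a live t3a.medium node.
  kubectl --kubeconfig tpfm-k0.kubeconfig get node ip-10-0-233-80.ap-south-1.compute.internal \
    -o jsonpath='{.status.allocatable}'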
- install Cluster Autoscaler with the Helm chart using the following values (an example install command is sketched after the values):
  cloudProvider: clusterapi
  # Choose a tag matching your workload Kubernetes minor when available
  image:
    tag: v1.33.0
  # clusterapi-specific
  clusterAPIMode: kubeconfig-incluster
  clusterAPIKubeconfigSecret: ca-workload-kubeconfig
  clusterAPIWorkloadKubeconfigPath: /etc/kubernetes/workload
  clusterAPIConfigMapsNamespace: kube-system
  # Autodiscovery: search CAPI objects in mgmt cluster "default" ns, by labels
  autoDiscovery:
    namespace: default
    labels:
      - cluster.x-k8s.io/cluster-name: tpfm-k0
  extraArgs:
    v: 7
    kubeconfig: /etc/kubernetes/workload/tpfm-k0.kubeconfig
    clusterapi-cloud-config-authoritative: "true"
    balance-similar-node-groups: "true"
    balancing-label_1: node-role.tpfm.io
    # balancing-ignore-label_1: "topology.kubernetes.io/zone"
    # balancing-ignore-label_2: "failure-domain.beta.kubernetes.io/zone"
    expander: "least-waste"
    # Scale UP responsiveness
    new-pod-scale-up-delay: "0s"
    max-node-provision-time: "5m"
    # Scale DOWN responsiveness
    scale-down-unneeded-time: "2m"          # default ~10m; how long a node must be unneeded before it is eligible for scale-down
    scale-down-delay-after-add: "2m"        # default ~10m; how long after a scale-up before scale-down evaluation resumes
    # scale-down-delay-after-delete: "30s"  # how long after a node deletion before scale-down evaluation resumes; defaults to scanInterval
    scale-down-delay-after-failure: "2m"    # default ~3m
    scale-down-utilization-threshold: "0.6" # default 0.5 → marks nodes underutilized sooner
    max-scale-down-parallelism: "20"        # delete more empty nodes per loop (safe if churn is OK)
  rbac:
    additionalRules:
      - apiGroups:
          - infrastructure.cluster.x-k8s.io
        resources:
          - awsmachinetemplates
        verbs:
          - get
          - list
          - watch
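For reference, the values above would be applied roughly like this (a sketch; the release name, namespace, and values file name are assumptions, and the chart repo is the upstream kubernetes/autoscaler one):

  # Install/upgrade Cluster Autoscaler in the management cluster with the values above.
  helm repo add autoscaler https://kubernetes.github.io/autoscaler
  helm repo update
  helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
    --namespace kube-system \
    --values ca-values.yaml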
Anything else we need to know?:
- Cluster Autoscaler logs show that the correct node is marked for scale-down:
  I0915 14:46:25.853378       1 actuator.go:175] Scale-down: removing empty node "ip-10-0-233-80.ap-south-1.compute.internal"
  I0915 14:46:25.854400       1 clusterapi_controller.go:790] node "ip-10-0-233-80.ap-south-1.compute.internal" is in nodegroup "MachineDeployment/default/tpfm-k0-md-storage-ap-south-1c"
  I0915 14:46:25.854984       1 actuator.go:295] Scale-down: waiting 5s before trying to delete nodes
  I0915 14:46:25.906719       1 round_trippers.go:632] "Response" status="200 OK" milliseconds=53
  I0915 14:46:25.906959       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"82103712-160b-4c59-939f-2a9964446011", APIVersion:"v1", ResourceVersion:"93959", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: removing empty node "ip-10-0-233-80.ap-south-1.compute.internal"
  I0915 14:46:25.907027       1 round_trippers.go:527] "Request" verb="PATCH" url="https://default-tpfm-k0-apiserver.com:6443/api/v1/namespaces/kube-system/events/cluster-autoscaler-status.1865706f86b05b8d" headers=<
    Accept: application/vnd.kubernetes.protobuf, */*
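After this point, the corresponding CAPI objects in the management cluster can be checked to see whether the scale-down ever reached them (a sketch; the namespace and names are the ones from above):

  # Did the MachineDeployment/MachineSet replica counts ever change, and is the Machine still there?
  kubectl -n default get machinedeployments,machinesets,machines -o wide
  kubectl -n default describe machinedeployment tpfm-k0-md-storage-ap-south-1c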
- the Machine object is marked with
    annotations:
      cluster.x-k8s.io/delete-machine: 2025-09-15 14:16:56.862721311 +0000 UTC m=+11157.135416490
  but the replica count on the MachineDeployment remains 1
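Since the annotation is set but the replica count never drops, it may help to look for the (failed or missing) replica patch on both sides. A sketch, assuming a Helm release called cluster-autoscaler in kube-system and a standard clusterctl-installed capi-controller-manager in capi-system:

  # Look for scale-down / MachineDeployment patch activity or RBAC denials in CA logs.
  kubectl -n kube-system logs deploy/cluster-autoscaler | grep -iE 'scale.?down|machinedeployment|forbidden'

  # Check whether the CAPI controller sees the delete-machine annotation or a replica change.
  kubectl -n capi-system logs deploy/capi-controller-manager | grep -i tpfm-k0-md-storage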
- in the status ConfigMap of the Cluster Autoscaler in the workload cluster:
    scaleDown:
      status: CandidatesPresent
      candidates: 1
      lastProbeTime: "2025-09-15T14:50:26.294111768Z"
      lastTransitionTime: "2025-09-15T11:19:24.712973162Z"
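The full scale-down section can be dumped from the workload cluster like this (the ConfigMap name comes from the events above; the kubeconfig path mirrors the Helm values and will differ outside the CA pod):

  # Dump the autoscaler status ConfigMap from the workload cluster.
  kubectl --kubeconfig /etc/kubernetes/workload/tpfm-k0.kubeconfig \
    -n kube-system get configmap cluster-autoscaler-status -o yaml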
- since Cluster Autoscaler was actually able to add the node, I believe RBAC is not the problem
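That said, scale-up and scale-down exercise different verbs, so an explicit permission check is cheap (a sketch; the ServiceAccount name assumes the Helm release is called cluster-autoscaler in kube-system):

  # Verify the CA ServiceAccount can patch MachineDeployments and delete Machines in the mgmt cluster.
  kubectl auth can-i patch machinedeployments.cluster.x-k8s.io -n default \
    --as=system:serviceaccount:kube-system:cluster-autoscaler
  kubectl auth can-i update machinedeployments.cluster.x-k8s.io/scale -n default \
    --as=system:serviceaccount:kube-system:cluster-autoscaler
  kubectl auth can-i delete machines.cluster.x-k8s.io -n default \
    --as=system:serviceaccount:kube-system:cluster-autoscaler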
Any help is appreciated, thanks!