Multiple errors after upgrade #5671

@josemrs

Description

After checking multiple breaking changes, I thought I had the upgrade under control; apparently not.

We run EKS 1.32 AWSManagedControlPlanes with 1.32 AWSManagedMachinePools, using custom AL2 AMIs.

The upgrade was going to be in two stages: first to the "latest v1beta1", then to the latest v1beta2, as recommended here.

So I ran:

./clusterctl-v1.10.6 upgrade plan

Checking new release availability...

Latest release available for the v1beta1 API Version of Cluster API (contract):

NAME                    NAMESPACE                           TYPE                     CURRENT VERSION   NEXT VERSION
bootstrap-kubeadm       capi-kubeadm-bootstrap-system       BootstrapProvider        v1.7.3            v1.10.6
control-plane-kubeadm   capi-kubeadm-control-plane-system   ControlPlaneProvider     v1.7.3            v1.10.6
cluster-api             capi-system                         CoreProvider             v1.7.3            v1.10.6
infrastructure-aws      capa-system                         InfrastructureProvider   v2.5.2            v2.9.1

You can now apply the upgrade by executing the following command:

clusterctl upgrade apply --contract v1beta1

So I ran the upgrade command to do the intermediate upgrade, and everything was upgraded. However, both CAPI and CAPA started complaining constantly about reconciliation and connection errors.

Perhaps it is this, but I thought I had it under control because of this.

These are the logs. I tried to pick only the ones for one particular cluster; we have almost 30, all failing like this.

Logs from capa-controller-manager
I0919 10:58:42.605598       1 awsmanagedmachinepool_controller.go:202] "Reconciling AWSManagedMachinePool" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="82996b04-ef8f-4b26-b570-95f5010121cb" MachinePool="prod/services-prod-pool-ap-southeast-2a" cluster="prod/services.REDACTED"
I0919 10:58:42.605729       1 launchtemplate.go:81] "checking for existing launch template" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="82996b04-ef8f-4b26-b570-95f5010121cb" MachinePool="prod/services-prod-pool-ap-southeast-2a" cluster="prod/services.REDACTED"
[...]
I0919 10:58:45.429754       1 tags.go:128] "Reconciling ASG tags" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="82996b04-ef8f-4b26-b570-95f5010121cb" MachinePool="prod/services-prod-pool-ap-southeast-2a" cluster="prod/services.REDACTED" cluster-name="services_ap-southeast-2_prod_alienvault_cloud" nodegroup-name="services-prod-pool-ap-southeast-2a"
Logs from capi-controller-manager
E0919 11:01:39.644472       1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2b" namespace="prod" name="services-prod-pool-ap-southeast-2b" reconcileID="dd96348e-37dc-4d9d-90f8-33b72cca5aa1"
E0919 11:01:42.691574       1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2c" namespace="prod" name="services-prod-pool-ap-southeast-2c" reconcileID="4b104a11-3d94-401f-b227-c89eceb45e71"
E0919 11:01:44.009112       1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2c" namespace="prod" name="services-prod-pool-ap-southeast-2c" reconcileID="35240758-7625-420d-85cc-517b095fa4f4"
E0919 11:01:52.674593       1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="5cd6d5a9-452a-474b-bcff-09ad0e98e6a1"
E0919 11:01:52.952752       1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="36a5298a-d1d2-4e8c-a7e3-da275b13d90b"
Logs from capi-kubeadm-bootstrap-controller-manager
I0919 10:57:44.297447       1 cluster_accessor.go:320] "Disconnecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="de112319-22c9-4bc8-a248-da3869cb4f13"
I0919 10:57:44.297492       1 cluster_accessor.go:327] "Disconnected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="de112319-22c9-4bc8-a248-da3869cb4f13"
I0919 10:57:44.298712       1 cluster_accessor.go:252] "Connecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="b212685a-8419-4acd-8ff3-7d893b41a2e3"
I0919 10:57:47.933214       1 cluster_accessor.go:274] "Connected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="b212685a-8419-4acd-8ff3-7d893b41a2e3"
Logs from capi-kubeadm-control-plane-system
I0919 11:00:09.828007       1 cluster_accessor.go:320] "Disconnecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="f74b3271-9d4b-4b6a-95a7-7abe21839a7b"
I0919 11:00:09.828056       1 cluster_accessor.go:327] "Disconnected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="f74b3271-9d4b-4b6a-95a7-7abe21839a7b"
I0919 11:00:09.829332       1 cluster_accessor.go:252] "Connecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="95222f01-14a5-4e4b-bec3-372e95d9b983"
I0919 11:00:13.479651       1 cluster_accessor.go:274] "Connected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="95222f01-14a5-4e4b-bec3-372e95d9b983"
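
The capi-controller-manager errors above all follow one pattern: a MachinePool fails to reconcile because an object is already owned by another MachinePool. A minimal sketch, assuming the log line format shown above, that condenses such lines into "contested object, current owner, MachinePools that failed to claim it". The function name and the template-built sample lines are hypothetical and only for illustration:

```python
import re
from collections import defaultdict

# Sample lines rebuilt from the capi-controller-manager log format above.
# The timestamp is reused and reconcileIDs are omitted for brevity.
TEMPLATE = (
    'E0919 11:01:39.644472       1 controller.go:347] "Reconciler error" '
    'err="Object prod/services.REDACTED is already owned by another MachinePool '
    'controller services-prod-pool-prometheus-ap-southeast-2" '
    'controller="machinepool" controllerGroup="cluster.x-k8s.io" '
    'controllerKind="MachinePool" MachinePool="prod/{pool}"'
)
LOG_LINES = [
    TEMPLATE.format(pool=pool)
    for pool in (
        "services-prod-pool-ap-southeast-2b",
        "services-prod-pool-ap-southeast-2c",
        "services-prod-pool-ap-southeast-2c",
        "services-prod-pool-ap-southeast-2a",
        "services-prod-pool-ap-southeast-2a",
    )
]

# Captures the contested object, its current owner, and the MachinePool
# whose reconcile failed.
PATTERN = re.compile(
    r'err="Object (?P<obj>\S+) is already owned by another MachinePool '
    r'controller (?P<owner>[^"]+)".*? MachinePool="(?P<pool>[^"]+)"'
)


def summarize_ownership_conflicts(lines):
    """Map (object, current owner) to the sorted MachinePools that failed to claim it."""
    conflicts = defaultdict(set)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            conflicts[(match["obj"], match["owner"])].add(match["pool"])
    return {key: sorted(pools) for key, pools in conflicts.items()}


if __name__ == "__main__":
    for (obj, owner), pools in summarize_ownership_conflicts(LOG_LINES).items():
        print(f"{obj} is owned by {owner}; rejected: {', '.join(pools)}")
```

For the lines above, this reports a single contested object, `prod/services.REDACTED`, owned by the prometheus pool and rejected for the 2a, 2b and 2c pools.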

This is the config of this particular cluster:

ap-southeast-2 cluster YAML
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: services.REDACTED
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "0"
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v2beta2
    kind: AWSManagedControlPlane
    name: services.REDACTED
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSManagedCluster
    name: services.REDACTED

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedCluster
metadata:
  name: services.REDACTED
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "10"
spec: {}

---
apiVersion: controlplane.cluster.x-k8s.io/v1beta2
kind: AWSManagedControlPlane
metadata:
  name: services.REDACTED
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "20"
spec:
  associateOIDCProvider: true
  eksClusterName: services_REDACTED_1
  region: ap-southeast-2
  version: v1.32.0
  network:
    vpc:
      id: vpc-XXXXXXXXXX
    subnets:
    - id: subnet-X
    - id: subnet-Y
    - id: subnet-Z
    securityGroupOverrides: 
      node-eks-additional: sg-W
  endpointAccess:
    private: true
    public: false
  bastion:
    enabled: false
  oidcIdentityProviderConfig:
    identityProviderConfigName: Okta
    issuerUrl: https://.okta.com/oauth2/XXXXXXXXXXXX
    clientId: XXXXXXXXX
    usernameClaim: preferred_username
    groupsClaim: groups
    groupsPrefix: "okta:"
  logging:
    apiServer: false
    controllerManager: false
    audit: false
    authenticator: false
    scheduler: false
  iamAuthenticatorConfig:
    mapRoles:
    - username: "kubernetes-admin"
      rolearn: "arn:aws:iam::XXXXXXXXXXXX:role/saas-OktaAdmins"
      groups:
      - "system:masters"
  addons:
  - name: "kube-proxy"
    version: "v1.32.6-eksbuild.6"
    conflictResolution: "overwrite"
  - name: "vpc-cni"
    version: "v1.20.1-eksbuild.1"
    conflictResolution: "overwrite"
  - name: "aws-ebs-csi-driver"
    version: "v1.48.0-eksbuild.1"
    conflictResolution: "overwrite"
    serviceAccountRoleARN: "arn:aws:iam::XXXXXXXXXXXX:role/prod-AmazonEKS_EBS_CSI_DriverRole"
  vpcCni:
    env:
    - name: POD_SECURITY_GROUP_ENFORCING_MODE
      value: standard
    - name: ENABLE_POD_ENI
      value: "true"
    - name: ENABLE_PREFIX_DELEGATION
      value: "true"
  additionalTags:
    Owner: "EngOps"
    created_by: "https://bitbucket.org/redacted/capi-cluster"
    Environment: "prod"
  identityRef:
    kind: AWSClusterRoleIdentity
    name: prod
  roleAdditionalPolicies:
  - arn:aws:iam::aws:policy/AmazonEKSVPCResourceController
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
kind: EKSConfig
metadata:
  name: services.REDACTED
  namespace: prod
spec:
  boostrapCommandOverride: "# Self-bootstrap embedded in AMI, doing nothing here for cluster"
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-prometheus-ap-southeast-2
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "30"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
  - ap-southeast-2a
  - ap-southeast-2b
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-prometheus-ap-southeast-2

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-prometheus-ap-southeast-2
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "30"
spec:
  eksNodegroupName: services-prod-pool-prometheus
  availabilityZones:
  - ap-southeast-2a
  - ap-southeast-2b
  scaling:
    minSize: 2
    maxSize: 4
  updateConfig:
    maxUnavailable: 1
  awsLaunchTemplate:
    instanceType: m5.large
    ami:
      id: ami-YYYYYY
  labels:
    usm.io/role: prometheus
  taints:
  - key: dedicated
    effect: no-schedule
    value: prometheus
  subnetIDs:
  - subnet-X
  - subnet-Y
  roleAdditionalPolicies:
  - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-ap-southeast-2a
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "40"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
  - ap-southeast-2a
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-ap-southeast-2a

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-ap-southeast-2a
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "40"
spec:
  eksNodegroupName: services-prod-pool-ap-southeast-2a
  availabilityZones:
  - ap-southeast-2a
  scaling:
    minSize: 2
    maxSize: 25
  updateConfig:
    maxUnavailablePercentage: 40
  subnetIDs:
  - subnet-X
  awsLaunchTemplate:
    instanceType: m5.xlarge
    ami:
      id: ami-YYYYYY
  roleAdditionalPolicies:
  - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-ap-southeast-2b
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "41"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
  - ap-southeast-2b
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-ap-southeast-2b

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-ap-southeast-2b
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "41"
spec:
  eksNodegroupName: services-prod-pool-ap-southeast-2b
  availabilityZones:
  - ap-southeast-2b
  scaling:
    minSize: 2
    maxSize: 25
  updateConfig:
    maxUnavailablePercentage: 40
  subnetIDs:
  - subnet-Y
  awsLaunchTemplate:
    instanceType: m5.xlarge
    ami:
      id: ami-YYYYYY
  roleAdditionalPolicies:
  - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-ap-southeast-2c
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "42"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
  - ap-southeast-2c
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-ap-southeast-2c

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-ap-southeast-2c
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "42"
spec:
  eksNodegroupName: services-prod-pool-ap-southeast-2c
  availabilityZones:
  - ap-southeast-2c
  scaling:
    minSize: 2
    maxSize: 25
  updateConfig:
    maxUnavailablePercentage: 40
  subnetIDs:
  - subnet-Z
  awsLaunchTemplate:
    instanceType: m5.xlarge
    ami:
      id: ami-YYYYYY
  roleAdditionalPolicies:
  - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

Labels

priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
