
Enable kubernetes_node_scale benchmark (up to 5k nodes) on AWS EKS with Karpenter#6512

Open
kiryl-filatau wants to merge 11 commits into GoogleCloudPlatform:master from kiryl-filatau:aws-5k

Conversation

@kiryl-filatau (Collaborator) commented Mar 4, 2026

Summary

Enables running the kubernetes_node_scale benchmark (0→5k→0→5k nodes) on AWS EKS with Karpenter. The benchmark scales a deployment with pod anti-affinity, measures scale-up, scale-down, and a second scale-up, then tears down the cluster.

Main changes

  • kubernetes_node_scale benchmark — template and scaling logic (scale-up, scale-down, phases), metrics collection, and timeouts tuned for large runs.

  • EKS + Karpenter — Nodepool template (instance types including t, CPU limit derived from scale target), EKS/Karpenter cluster lifecycle and cleanup.

  • Karpenter scaling by node count — NodePool CPU limit is computed from kubernetes_scale_num_nodes: max(1000, ceil(nodes × 2 × 1.05)) (e.g. 10 nodes → 1000, 5k → 10500). Controller pod resources scale with the same flag:

    • Default or <500 nodes: 1 CPU / 1Gi
    • 500–1000 nodes: 2 CPU / 8Gi
    • >1000 nodes: 4 CPU / 16Gi
      One configuration works for both small and 5k-node runs.
  • Teardown robustness — Orphan ENI deletion in _CleanupKarpenter: retry with backoff on AWS throttle (RequestLimitExceeded), treat "ENI not found" as success; uses suppress_failure for these cases.

  • Tracker — Single get nodes pass in _StopWatchingForNodeChanges; resolve machine type only for current nodes, use "unknown" for others to avoid thousands of kubectl calls on 5k-node runs.

  • Tests — kubernetes_scale_benchmark_test mocks updated to return valid kubectl -o json output ({"items": [...]}) so tests pass after GetStatusConditionsForResourceType was switched from jsonpath to full JSON.
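The sizing rules above can be sketched as plain functions (function names are hypothetical; the PR derives these values inline from the kubernetes_scale_num_nodes flag):

```python
import math


def nodepool_cpu_limit(num_nodes: int) -> int:
    """NodePool CPU limit: max(1000, ceil(nodes * 2 * 1.05))."""
    return max(1000, math.ceil(num_nodes * 2 * 1.05))


def controller_resources(num_nodes: int) -> tuple:
    """Karpenter controller pod (cpu, memory) tier by scale target."""
    if num_nodes > 1000:
        return ('4', '16Gi')
    if num_nodes >= 500:
        return ('2', '8Gi')
    return ('1', '1Gi')
```

For example, nodepool_cpu_limit(10) is 1000 and nodepool_cpu_limit(5000) is 10500, matching the examples in the bullet above.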

@kiryl-filatau kiryl-filatau marked this pull request as ready for review March 11, 2026 19:18
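The test-mock change in the Tests bullet above amounts to the kubectl mock returning a full JSON document. A hedged sketch of that shape and how a parser would flatten it (exact mock structure assumed from the description):

```python
import json

# Hypothetical shape of the updated mock output: a full
# `kubectl get ... -o json` document with an "items" list,
# rather than a jsonpath string.
fake_stdout = json.dumps({
    'items': [
        {'status': {'conditions': [{'type': 'Ready', 'status': 'True'}]}},
    ]
})

# The parser side flattens the status conditions out of each item.
conditions = [
    cond
    for item in json.loads(fake_stdout)['items']
    for cond in item.get('status', {}).get('conditions', [])
]
```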
# Output can be quite large, so we'll conditionally suppress it.
['get', resource_type, '-o', 'json'],
timeout=60 * 5, # 5 minutes for large clusters (e.g. 1000 pods)
suppress_logging=NUM_PODS.value > 20,
Collaborator:
nice this is clever

Collaborator Author:
Thanks

def _PostCreate(self):
"""Performs post-creation steps for the cluster."""
super()._PostCreate()
# Karpenter controller resources: default 1/1Gi; scale up when node_scale target is set.
Collaborator:
Can we just not specify anything & let Karpenter decide? Or is this indeed necessary? It seems clever but a little annoying / bad user experience by Karpenter.

Collaborator Author:
These are the resources for the Karpenter controller pod (the node where Karpenter itself runs). Karpenter doesn’t manage that node, so it can’t “decide” these values; we have to set them. For runs with ~10 nodes, 1/1Gi is sufficient; we only increase the resources when node_scale is 500+ or 1000+.

'v'
+ full_version.strip().strip('"').split(f'{self.cluster_version}-v')[1]
)
# NodePool CPU limit: scale with benchmark target (nodes * 2 + 5%), min 1000.
Collaborator:
Does the machine type matter here as well? If I am using a larger machine type, do I need to also set a larger CPU limit? This again seems a little annoying to have to set manually (but maybe makes sense given Karpenter can be machine-type agnostic).

Collaborator Author:
Makes sense to include machine type adjustment, I’ll think about how to cover it.
Thanks.

suppress_failure=lambda stdout, stderr, retcode: (
'no matching resources found' in stderr.lower()
or 'timed out' in stderr.lower()
or 'context deadline exceeded' in stderr.lower()
Collaborator:
These look very similar to the RETRYABLE_KUBECTL_ERRORS list:

Just use kubectl.RunRetryableKubectlCommand instead & get these for free. If that code is missing some of these (like 'timed out'), consider adding them. It looks like suppress_failure is supported too, so you can mix both - which would probably be good for 'no matching resources found', as that sounds like a wait-command-specific error message to ignore.

Collaborator Author:
@hubatish
Updated: EKS cleanup now uses RunRetryableKubectlCommand, with suppress_failure applied only to "no resources found"-style messages; the retryable error list is extended and matching is case-insensitive. Please check.
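The resulting suppress_failure predicate might look roughly like this (a sketch; the exact message list and callable live in the PR's EKS cleanup code):

```python
# Messages whose presence in stderr means the failure should be treated
# as success; the list here is illustrative, not the PR's exact one.
_SUPPRESSIBLE_MESSAGES = (
    'no matching resources found',
    'no resources found',
)


def suppress_no_resources(stdout: str, stderr: str, retcode: int) -> bool:
    """Return True to suppress the failure (case-insensitive match)."""
    del stdout, retcode  # Unused; kept to match the callback signature.
    err = stderr.lower()
    return any(msg in err for msg in _SUPPRESSIBLE_MESSAGES)
```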

),
)
max_retries = 5
backoff_seconds = 10
Collaborator:
While this backoff logic looks pretty reasonable, prefer reusing the backoff logic in vm_util.Retry, which means moving this code to a subfunction & adding said decorator.
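A minimal stand-in for the suggested pattern (PKB's actual vm_util.Retry decorator has its own signature and options; this sketch only illustrates moving the retry loop out of the cleanup code and into a decorator):

```python
import functools
import time


def retry(max_retries=5, backoff_seconds=10, retryable=(Exception,)):
    """Minimal stand-in for a Retry decorator with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = backoff_seconds
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except retryable:
                    if attempt == max_retries - 1:
                        raise  # Out of retries; surface the last error.
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator
```

The orphan-ENI deletion would then move into a small function decorated this way, with the AWS throttle error (RequestLimitExceeded) mapped to a retryable exception.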

"""Stop watching the cluster for node add/remove events."""
polled_events = self._cluster.GetEvents()

# Resolve machine type only for current nodes; use "unknown" for the rest.
Collaborator:
Oh, this makes sense. Was this causing the cluster to take a long time querying everything?

Collaborator Author:
Yep, it was the main reason.
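The single-pass idea can be sketched as: build the lookup of current nodes once, resolve machine types only for those, and default everything else to "unknown" (function and variable names hypothetical):

```python
def resolve_machine_types(event_node_names, current_machine_types):
    """Map node name -> machine type using one pre-fetched lookup table.

    current_machine_types comes from a single `kubectl get nodes -o json`
    pass; nodes that have already been removed fall back to "unknown",
    avoiding one kubectl call per node on 5k-node runs.
    """
    return {
        name: current_machine_types.get(name, 'unknown')
        for name in event_node_names
    }
```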

if name in _current_node_names:
machine_type = _GetMachineTypeFromNodeName(self._cluster, name)
else:
machine_type = "unknown"
Collaborator:
Something around here is probably what is causing the TypeError.
