
Enable kubernetes_node_scale benchmark (up to 5k nodes) on AWS EKS with Karpenter#6512

Open
kiryl-filatau wants to merge 11 commits into GoogleCloudPlatform:master from kiryl-filatau:aws-5k

Conversation

@kiryl-filatau (Collaborator) commented Mar 4, 2026

Summary

Enables running the kubernetes_node_scale benchmark (0→5k→0→5k nodes) on AWS EKS with Karpenter. The benchmark scales a deployment with pod anti-affinity, measures scale-up, scale-down, and a second scale-up, then tears down the cluster.

Main changes

  • kubernetes_node_scale benchmark — template and scaling logic (scale-up, scale-down, phases), metrics collection, and timeouts tuned for large runs.

  • EKS + Karpenter — Nodepool template (instance types including t, CPU limit derived from scale target), EKS/Karpenter cluster lifecycle and cleanup.

  • Karpenter scaling by node count — NodePool CPU limit is computed from kubernetes_scale_num_nodes: max(1000, ceil(nodes × 2 × 1.05)) (e.g. 10 nodes → 1000, 5k → 10500). Controller pod resources scale with the same flag:

    • Default or <500 nodes: 1 CPU / 1Gi
    • 500–1000 nodes: 2 CPU / 8Gi
    • >1000 nodes: 4 CPU / 16Gi
      One configuration works for both small and 5k-node runs.
  • Teardown robustness — Orphan ENI deletion in _CleanupKarpenter: retry with backoff on AWS throttle (RequestLimitExceeded), treat "ENI not found" as success; uses suppress_failure for these cases.

  • Tracker — Single get nodes pass in _StopWatchingForNodeChanges; resolve machine type only for current nodes, use "unknown" for others to avoid thousands of kubectl calls on 5k-node runs.

  • Tests — kubernetes_scale_benchmark_test mocks updated to return valid kubectl -o json output ({"items": [...]}) so tests pass after GetStatusConditionsForResourceType was switched from jsonpath to full JSON.
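The sizing rules above can be sketched as plain functions (function names are hypothetical; the PR derives these values inline from the kubernetes_scale_num_nodes flag):

```python
import math


def nodepool_cpu_limit(num_nodes: int) -> int:
    """NodePool CPU limit: max(1000, ceil(nodes * 2 * 1.05))."""
    return max(1000, math.ceil(num_nodes * 2 * 1.05))


def controller_resources(num_nodes: int) -> tuple:
    """Karpenter controller pod (cpu, memory) tier by scale target."""
    if num_nodes > 1000:
        return ('4', '16Gi')
    if num_nodes >= 500:
        return ('2', '8Gi')
    return ('1', '1Gi')
```

For example, nodepool_cpu_limit(10) is 1000 and nodepool_cpu_limit(5000) is 10500, matching the examples in the bullet above.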

@kiryl-filatau kiryl-filatau marked this pull request as ready for review March 11, 2026 19:18
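The test-mock change in the Tests bullet above amounts to the kubectl mock returning a full JSON document. A hedged sketch of that shape and how a parser would flatten it (exact mock structure assumed from the description):

```python
import json

# Hypothetical shape of the updated mock output: a full
# `kubectl get ... -o json` document with an "items" list,
# rather than a jsonpath string.
fake_stdout = json.dumps({
    'items': [
        {'status': {'conditions': [{'type': 'Ready', 'status': 'True'}]}},
    ]
})

# The parser side flattens the status conditions out of each item.
conditions = [
    cond
    for item in json.loads(fake_stdout)['items']
    for cond in item.get('status', {}).get('conditions', [])
]
```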
# Output can be quite large, so we'll conditionally suppress it.
['get', resource_type, '-o', 'json'],
timeout=60 * 5, # 5 minutes for large clusters (e.g. 1000 pods)
suppress_logging=NUM_PODS.value > 20,
Collaborator:
nice this is clever

Collaborator Author:
Thanks

def _PostCreate(self):
"""Performs post-creation steps for the cluster."""
super()._PostCreate()
# Karpenter controller resources: default 1/1Gi; scale up when node_scale target is set.
Collaborator:
Can we just not specify anything & let Karpenter decide? Or is this indeed necessary? It seems clever but a little annoying / bad user experience by Karpenter.

Collaborator Author:
These are the resources for the Karpenter controller pod (the node where Karpenter itself runs). Karpenter doesn’t manage that node, so it can’t “decide” these values; we have to set them. For runs with ~10 nodes, 1/1Gi is sufficient; we only increase the resources when node_scale is 500+ or 1000+.

'v'
+ full_version.strip().strip('"').split(f'{self.cluster_version}-v')[1]
)
# NodePool CPU limit: scale with benchmark target (nodes * 2 + 5%), min 1000.
Collaborator:
Does the machine type matter here as well? If I am using a larger machine type, do I need to also set a larger CPU limit? This again seems a little annoying to have to set manually (but maybe makes sense given Karpenter can be machine-type agnostic).

Collaborator Author:
Makes sense to include machine type adjustment, I’ll think about how to cover it.
Thanks.

suppress_failure=lambda stdout, stderr, retcode: (
'no matching resources found' in stderr.lower()
or 'timed out' in stderr.lower()
or 'context deadline exceeded' in stderr.lower()
Collaborator:
These look very similar to the RETRYABLE_KUBECTL_ERRORS list:

Just use kubectl.RunRetryableKubectlCommand instead & get these for free. If that code is missing some of these (like 'timed out'), consider adding them. It looks like suppress_failure is supported too, so you can mix both - which would probably be good for 'no matching resources found', as that sounds like a wait-command-specific error message to ignore.

Collaborator Author:
@hubatish
Updated: EKS cleanup now uses RunRetryableKubectlCommand, with suppress_failure applied only to "no resources found"-style messages; the retryable error list is extended and matching is case-insensitive. Please check.
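The resulting suppress_failure predicate might look roughly like this (a sketch; the exact message list and callable live in the PR's EKS cleanup code):

```python
# Messages whose presence in stderr means the failure should be treated
# as success; the list here is illustrative, not the PR's exact one.
_SUPPRESSIBLE_MESSAGES = (
    'no matching resources found',
    'no resources found',
)


def suppress_no_resources(stdout: str, stderr: str, retcode: int) -> bool:
    """Return True to suppress the failure (case-insensitive match)."""
    del stdout, retcode  # Unused; kept to match the callback signature.
    err = stderr.lower()
    return any(msg in err for msg in _SUPPRESSIBLE_MESSAGES)
```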

),
)
max_retries = 5
backoff_seconds = 10
Collaborator:
While this backoff logic looks pretty reasonable, prefer reusing the backoff logic in vm_util.Retry, which means moving this code to a subfunction & adding said decorator.
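A minimal stand-in for the suggested pattern (PKB's actual vm_util.Retry decorator has its own signature and options; this sketch only illustrates moving the retry loop out of the cleanup code and into a decorator):

```python
import functools
import time


def retry(max_retries=5, backoff_seconds=10, retryable=(Exception,)):
    """Minimal stand-in for a Retry decorator with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = backoff_seconds
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except retryable:
                    if attempt == max_retries - 1:
                        raise  # Out of retries; surface the last error.
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator
```

The orphan-ENI deletion would then move into a small function decorated this way, with the AWS throttle error (RequestLimitExceeded) mapped to a retryable exception.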

"""Stop watching the cluster for node add/remove events."""
polled_events = self._cluster.GetEvents()

# Resolve machine type only for current nodes; use "unknown" for the rest.
Collaborator:
Oh, this makes sense. Was this causing the cluster to take a long time querying everything?

Collaborator Author:
Yep, it was the main reason.
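The single-pass idea can be sketched as: build the lookup of current nodes once, resolve machine types only for those, and default everything else to "unknown" (function and variable names hypothetical):

```python
def resolve_machine_types(event_node_names, current_machine_types):
    """Map node name -> machine type using one pre-fetched lookup table.

    current_machine_types comes from a single `kubectl get nodes -o json`
    pass; nodes that have already been removed fall back to "unknown",
    avoiding one kubectl call per node on 5k-node runs.
    """
    return {
        name: current_machine_types.get(name, 'unknown')
        for name in event_node_names
    }
```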

if name in _current_node_names:
machine_type = _GetMachineTypeFromNodeName(self._cluster, name)
else:
machine_type = "unknown"
Collaborator:
Something around here is probably what is causing the TypeError.
