
Splunk Operator: CPU changes break total cluster capacity + rolling updates are too slow #1645

@ductrung-nguyen

Description

Please select the type of request

Enhancement

Tell us more

Describe the request

Here's the problem: when you change CPU requests on StatefulSet pods (indexers, search heads, etc.), the operator simply applies the change without accounting for what that does to your total cluster capacity.

Double the CPU per pod and you're now requesting 2x the total CPU; halve it and you've cut your capacity in half. Neither is usually what you wanted.

This is really painful for license-based deployments or when you're trying to optimize costs. Every time you want to resize pods, you have to manually recalculate the replica count. It's tedious and error-prone.

Also, rolling updates on large clusters are painfully slow since pods update one at a time. With 50+ replicas, you're sitting around forever waiting for updates to finish.

Expected behavior

Would be nice to have:

  1. CPU-aware scaling - the operator should auto-adjust the replica count when CPU per pod changes, so total CPU stays constant. For example, going from 10 pods @ 4 CPU to 8 CPU per pod should result in 5 pods @ 8 CPU (still 40 CPU total) instead of 10 pods @ 8 CPU (80 CPU total).

  2. Parallel pod updates - let me configure how many pods update at once, either as a percentage (25% at a time) or an absolute number (3 at a time). The default can stay at 1 for backward compatibility.
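The CPU-aware scaling math is simple. A minimal Go sketch (the helper name and signature are hypothetical, not actual operator code) of recomputing the replica count so total requested CPU is preserved, rounding up so capacity never silently drops:

```go
package main

import "fmt"

// replicasPreservingTotalCPU returns the replica count that keeps the total
// requested CPU (in millicores) at or just above the old total when the
// per-pod request changes. Hypothetical helper, not actual operator code.
func replicasPreservingTotalCPU(oldReplicas, oldMilliCPU, newMilliCPU int32) int32 {
	total := oldReplicas * oldMilliCPU
	// Ceiling division: round up so we never fall below the old total.
	return (total + newMilliCPU - 1) / newMilliCPU
}

func main() {
	// 10 pods @ 4 CPU, then raise the request to 8 CPU per pod:
	// total stays at 40 CPU with 5 pods instead of doubling to 80.
	fmt.Println(replicasPreservingTotalCPU(10, 4000, 8000)) // 5
}
```

Rounding up means a change to a per-pod size that doesn't divide the old total evenly (e.g. 4 CPU pods resized to 3 CPU) slightly overshoots rather than undershoots capacity, which seems like the safer default.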

Splunk setup on K8S

Happens with any StatefulSet-based component - indexers, search heads, cluster manager, etc.

Reproduction/Testing steps

For the CPU issue:

  1. Deploy indexer cluster with 10 replicas @ 4 CPU each (40 total CPU)
  2. Update CPU request to 8 per pod
  3. Watch as you now have 10 pods @ 8 CPU = 80 total CPU
  4. Your cloud bill just doubled

For slow updates:

  1. Deploy a large cluster (20+ replicas)
  2. Change image or any pod template config
  3. Go make coffee. Then another coffee. Then another one.
  4. Still updating one pod at a time...

K8s environment

Any K8s 1.19+. Happens everywhere.

Proposed changes (optional)

Could use annotations to opt-in:

  • Something like operator.splunk.com/preserve-total-cpu: "true" for CPU awareness
  • Something like operator.splunk.com/parallel-pod-updates: "0.25" for 25% at a time
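A sketch of how such an annotation value might be interpreted, assuming values below 1 mean a fraction of the replica count and values of 1 or more mean an absolute pod count (the annotation and this parsing rule are proposals, not existing operator behavior):

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parallelBatchSize interprets the proposed
// operator.splunk.com/parallel-pod-updates annotation value:
// values in (0, 1) are a fraction of replicas, values >= 1 are an
// absolute count. Empty or invalid input falls back to 1 (the
// current one-pod-at-a-time behavior).
func parallelBatchSize(value string, replicas int) int {
	f, err := strconv.ParseFloat(value, 64)
	if err != nil || f <= 0 {
		return 1
	}
	if f < 1 {
		n := int(math.Ceil(f * float64(replicas)))
		if n < 1 {
			n = 1
		}
		return n
	}
	n := int(f)
	if n > replicas {
		n = replicas
	}
	return n
}

func main() {
	fmt.Println(parallelBatchSize("0.25", 50)) // 13 pods at a time
	fmt.Println(parallelBatchSize("3", 50))    // 3 pods at a time
}
```

Falling back to a batch size of 1 on any parse error keeps the opt-in safe: a typo in the annotation degrades to today's behavior rather than a surprise mass update.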

Additional context

This would be super helpful for:

  • Cost optimization when resizing pods without changing overall footprint
  • Large cluster maintenance where you don't want to wait all day for rolling updates
