Description
Please select the type of request
Enhancement
Tell us more
Describe the request
Here's the problem: when you change CPU requests on StatefulSet pods (indexers, search heads, etc.), the operator just applies the change without accounting for what it does to your total cluster capacity.
Double the CPU per pod and you're now requesting 2x the total CPU; halve it and you've cut capacity in half. Neither is usually what you wanted.
This is really annoying for license-based deployments or when you're trying to optimize costs. Every time you want to resize pods, you have to manually recalculate the replica count. It's tedious and error-prone.
Also, rolling updates on large clusters are painfully slow since pods update one at a time. With 50+ replicas, you're sitting around forever waiting for updates to finish.
Expected behavior
Would be nice to have:
- CPU-aware scaling - the operator should auto-adjust the replica count when the CPU per pod changes, so total CPU stays constant. For example, going from 10 pods @ 4 CPU to 8 CPU per pod should give 5 pods, not 10 pods @ 8 CPU (see the sketch after this list).
- Parallel pod updates - let me configure how many pods update at once, either as a percentage (25% at a time) or an absolute number (3 at a time). The default can stay at 1 for backward compatibility.
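A minimal Go sketch of the replica math the first item asks for, assuming a hypothetical helper (`recalcReplicas` is illustrative only, not existing operator code): keep replicas × CPU-per-pod constant and round up so total capacity never drops below the original.

```go
package main

import (
	"fmt"
	"math"
)

// recalcReplicas is a hypothetical helper showing the requested behavior:
// pick a replica count that keeps total CPU (replicas * cpuPerPod) roughly
// constant when the per-pod CPU request changes. Rounds up so total capacity
// never falls below the original.
func recalcReplicas(oldReplicas int32, oldCPU, newCPU float64) int32 {
	totalCPU := float64(oldReplicas) * oldCPU
	replicas := int32(math.Ceil(totalCPU / newCPU))
	if replicas < 1 {
		replicas = 1
	}
	return replicas
}

func main() {
	// 10 pods @ 4 CPU -> raise the request to 8 CPU per pod -> 5 pods, still 40 CPU total.
	fmt.Println(recalcReplicas(10, 4, 8)) // 5
	// 10 pods @ 4 CPU -> lower the request to 2 CPU per pod -> 20 pods, still 40 CPU total.
	fmt.Println(recalcReplicas(10, 4, 2)) // 20
}
```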
Splunk setup on K8S
Happens with any StatefulSet-based component - indexers, search heads, cluster manager, etc.
Reproduction/Testing steps
For the CPU issue:
- Deploy indexer cluster with 10 replicas @ 4 CPU each (40 total CPU)
- Update CPU request to 8 per pod
- Watch as you now have 10 pods @ 8 CPU = 80 total CPU
- Your cloud bill just doubled
For slow updates:
- Deploy a large cluster (20+ replicas)
- Change image or any pod template config
- Go make coffee. Then another coffee. Then another one.
- Still updating one pod at a time...
K8s environment
Any K8s 1.19+. Happens everywhere.
Proposed changes (optional)
Could use annotations to opt-in:
- Something like `operator.splunk.com/preserve-total-cpu: "true"` for CPU awareness
- Something like `operator.splunk.com/parallel-pod-updates: "0.25"` for 25% at a time
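A minimal sketch of how the parallel-update value could be interpreted, assuming the semantics proposed above (a value below 1 is a fraction of the replica count, 1 or more is an absolute pod count). The function name and fallback behavior are illustrative assumptions, not existing operator behavior.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parallelBatchSize is a hypothetical interpretation of the proposed
// operator.splunk.com/parallel-pod-updates annotation: values < 1 are read as
// a fraction of the replica count, values >= 1 as an absolute pod count.
// Falls back to 1 (today's one-pod-at-a-time behavior) on a missing or
// invalid value.
func parallelBatchSize(annotation string, replicas int32) int32 {
	v, err := strconv.ParseFloat(annotation, 64)
	if err != nil || v <= 0 {
		return 1 // backward-compatible default
	}
	var batch int32
	if v < 1 {
		batch = int32(math.Floor(float64(replicas) * v))
	} else {
		batch = int32(v)
	}
	if batch < 1 {
		batch = 1
	}
	if batch > replicas {
		batch = replicas
	}
	return batch
}

func main() {
	fmt.Println(parallelBatchSize("0.25", 20)) // 5 pods per batch
	fmt.Println(parallelBatchSize("3", 20))    // 3 pods per batch
	fmt.Println(parallelBatchSize("", 20))     // 1 (default)
}
```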
Additional context
This would be super helpful for:
- Cost optimization when resizing pods without changing overall footprint
- Large cluster maintenance where you don't want to wait all day for rolling updates