Version: 4.0.0 Last Updated: December 2025 Audience: Platform Engineers, DevOps Teams
- Sizing Recommendations
- AKS Cluster Optimization
- Node Pool Configuration
- Pod Resource Management
- Autoscaling Configuration
- Database Optimization
- Network Performance
- Observability Overhead
- AI Workload Optimization
- Load Testing
- Capacity Planning
This guide provides recommendations for optimizing the performance of the Three Horizons Platform across all components.
| Profile | Nodes | vCPUs | Memory | Monthly Cost | Use Case |
|---|---|---|---|---|---|
| Small | 3 | 6 | 12 GB | ~$800 | Dev/POC |
| Medium | 5 | 20 | 40 GB | ~$3,500 | Standard Production |
| Large | 10 | 40 | 80 GB | ~$12,000 | Enterprise |
| XLarge | 15+ | 60+ | 120+ GB | ~$35,000 | Mission Critical |
Choose Small when:
- Development or testing environment
- < 10 concurrent developers
- < 20 microservices
- No GPU workloads
Choose Medium when:
- Standard production workload
- 10-50 concurrent developers
- 20-50 microservices
- Light AI/ML workloads
Choose Large when:
- Enterprise production
- 50-200 concurrent developers
- 50-100 microservices
- Moderate AI/ML workloads
- GPU requirements
Choose XLarge when:
- Mission-critical systems
- 200+ concurrent developers
- 100+ microservices
- Heavy AI/ML workloads
- Multi-region deployment
# terraform.tfvars optimizations
sku_tier = "Standard" # Standard tier is SLA-backed; use it for production

# For high-throughput environments
kubernetes_version = "1.29" # Pin to a recent supported version

# API server authorized IP ranges (reduces attack surface)
api_server_authorized_ip_ranges = ["10.0.0.0/8", "YOUR_OFFICE_IP/32"]

# Optimized autoscaler settings
auto_scaler_profile = {
  balance_similar_node_groups = true
  expander = "random"
  max_graceful_termination_sec = 600
  max_node_provisioning_time = "15m"
  max_unready_nodes = 3
  max_unready_percentage = 45
  new_pod_scale_up_delay = "10s"
  scale_down_delay_after_add = "10m"
  scale_down_delay_after_delete = "10s"
  scale_down_delay_after_failure = "3m"
  scan_interval = "10s"
  scale_down_unneeded = "10m"
  scale_down_unready = "20m"
  scale_down_utilization_threshold = 0.5
  empty_bulk_delete_max = 10
  skip_nodes_with_local_storage = false
  skip_nodes_with_system_pods = true
}

- Use availability zones for high availability
- Separate node pools for different workload types
- Enable Uptime SLA for production clusters
- Use managed identities instead of service principals
| Workload Type | Recommended SKU | vCPUs | Memory | Notes |
|---|---|---|---|---|
| System | Standard_D2s_v5 | 2 | 8 GB | Control plane components |
| General | Standard_D4s_v5 | 4 | 16 GB | Most workloads |
| Memory-intensive | Standard_E4s_v5 | 4 | 32 GB | Databases, caches |
| CPU-intensive | Standard_F8s_v2 | 8 | 16 GB | Compute workloads |
| GPU | Standard_NC6s_v3 | 6 | 112 GB | AI/ML training |
# System node pool - minimal, dedicated to system components
default_node_pool = {
  name = "system"
  vm_size = "Standard_D2s_v5"
  node_count = 3
  zones = ["1", "2", "3"]
  enable_auto_scaling = true
  min_count = 3
  max_count = 5
  only_critical_addons_enabled = true
}

# User workloads node pool
additional_node_pools = {
  user = {
    name = "user"
    vm_size = "Standard_D4s_v5"
    node_count = 3
    zones = ["1", "2", "3"]
    enable_auto_scaling = true
    min_count = 3
    max_count = 20
    node_labels = {
      "workload" = "user"
    }
  }

  # GPU node pool for AI workloads
  gpu = {
    name = "gpu"
    vm_size = "Standard_NC6s_v3"
    node_count = 0
    enable_auto_scaling = true
    min_count = 0
    max_count = 4
    node_taints = ["nvidia.com/gpu=true:NoSchedule"]
    node_labels = {
      "workload" = "gpu"
      "accelerator" = "nvidia"
    }
  }
}

# Recommended starting points
resources:
  requests:
    cpu: "100m" # 0.1 CPU core
    memory: "128Mi" # 128 MB
  limits:
    cpu: "500m" # 0.5 CPU core
    memory: "512Mi" # 512 MB

| App Type | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| API Service | 100m | 500m | 128Mi | 512Mi |
| Web Frontend | 50m | 200m | 64Mi | 256Mi |
| Background Worker | 200m | 1000m | 256Mi | 1Gi |
| Database Client | 100m | 500m | 256Mi | 1Gi |
| AI Inference | 500m | 2000m | 1Gi | 4Gi |
# Guaranteed QoS (requests = limits)
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

# Burstable QoS (requests < limits)
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto" # or "Off" for recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: "50m"
          memory: "64Mi"
        maxAllowed:
          cpu: "2000m"
          memory: "4Gi"

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-processor
spec:
  scaleTargetRef:
    name: queue-processor
  minReplicaCount: 0
  maxReplicaCount: 50
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        messageCount: "5"
        connectionFromEnv: SERVICEBUS_CONNECTION

# Connection pooling settings
max_connections = 200
shared_buffers = 256MB
effective_cache_size = 768MB
maintenance_work_mem = 64MB
work_mem = 4MB
# Performance settings
random_page_cost = 1.1 # For SSD storage
effective_io_concurrency = 200

# Write performance
wal_buffers = 16MB
checkpoint_completion_target = 0.9

# terraform.tfvars
postgresql_sku_name = "GP_Standard_D4s_v3" # 4 vCores, 16 GB RAM
postgresql_storage_mb = 65536 # 64 GB
# High availability
postgresql_ha_enabled = true
postgresql_geo_redundant_backup = true

# terraform.tfvars
redis_sku_name = "Premium"
redis_family = "P"
redis_capacity = 1 # P1 = 6GB
# Cluster mode for high throughput
redis_cluster_enabled = true
redis_shard_count = 2

- Use connection pooling - Reduce connection overhead
- Set appropriate TTLs - Prevent memory bloat
- Use pipelining - Batch commands for efficiency
- Monitor memory - Set maxmemory-policy
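The pipelining bullet can be sketched with redis-py; `warm_cache`, the key names, and the connection details below are illustrative assumptions, not part of the platform:

```python
def warm_cache(client, n: int = 100, ttl: int = 3600) -> None:
    """Queue n SET commands on one pipeline so they travel in a single
    round trip, each with a TTL so the cache cannot grow without bound."""
    pipe = client.pipeline()
    for i in range(n):
        pipe.set(f"session:{i}", "payload", ex=ttl)
    pipe.execute()

# Usage with redis-py (host and key are placeholders):
# import redis
# client = redis.Redis(host="<cache-name>.redis.cache.windows.net",
#                      port=6380, ssl=True, password="<access-key>")
# warm_cache(client)
```

Compared with issuing 100 individual `SET` calls, the pipeline pays the network round-trip cost once, which matters when the cache sits across an Azure availability zone.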
# terraform.tfvars
network_plugin = "azure"
network_plugin_mode = "overlay" # More efficient IP usage
network_policy = "calico"| Cluster Size | Pod CIDR | Max Pods |
|---|---|---|
| Small | /20 | 4,096 |
| Medium | /18 | 16,384 |
| Large | /16 | 65,536 |
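The Max Pods column follows directly from the CIDR prefix length and can be checked with the standard library; the 10.244.0.0 base address below is just an example value:

```python
import ipaddress

def max_pod_ips(cidr: str) -> int:
    """Address count of the pod CIDR - the ceiling on schedulable pod IPs."""
    return ipaddress.ip_network(cidr).num_addresses

for size, cidr in [("Small", "10.244.0.0/20"),
                   ("Medium", "10.244.0.0/18"),
                   ("Large", "10.244.0.0/16")]:
    print(size, max_pod_ips(cidr))  # 4096, 16384, 65536
```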
For high-traffic environments, consider:
- Istio - Full-featured but resource intensive
- Linkerd - Lighter weight, lower overhead
- No mesh - Direct service-to-service communication
# NGINX Ingress tuning
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
data:
  proxy-body-size: "100m"
  proxy-connect-timeout: "15"
  proxy-read-timeout: "60"
  proxy-send-timeout: "60"
  use-gzip: "true"
  gzip-level: "5"
  worker-processes: "auto"
  max-worker-connections: "65536"
  keepalive: "75"

# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: "50GB"
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 4Gi

# Drop high-cardinality labels
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_.*' # Drop Go runtime metrics
    action: drop
  - regex: 'pod' # labeldrop matches label names, not label values
    action: labeldrop

# Loki retention
loki:
  limits_config:
    retention_period: 168h # 7 days
    max_query_series: 5000
  chunk_store_config:
    max_look_back_period: 168h

| Model | TPM (Tokens/min) | RPM (Requests/min) |
|---|---|---|
| GPT-4o | 30,000 | 300 |
| GPT-4o-mini | 100,000 | 1,000 |
| Embeddings | 1,000,000 | 6,000 |
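Requests beyond these TPM/RPM quotas are rejected with HTTP 429, so callers should retry with exponential backoff. A minimal sketch; the `status_code` attribute check is an assumption about the client's exception type (the openai SDK's `RateLimitError` does expose one):

```python
import random
import time

def with_backoff(call, max_retries: int = 5, base: float = 1.0):
    """Retry `call` on HTTP 429, sleeping base * 2**attempt plus jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            rate_limited = getattr(exc, "status_code", None) == 429
            if not rate_limited or attempt == max_retries - 1:
                raise
            time.sleep(base * (2 ** attempt) + random.uniform(0, base))
```

The jitter spreads out retries from concurrent replicas so they do not hit the quota window in lockstep.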
# Batch embeddings
texts = ["text1", "text2", "text3"]
embeddings = openai.embeddings.create(
    model="text-embedding-3-large",
    input=texts  # Send as batch
)

# Use streaming for long responses
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    stream=True  # Reduces perceived latency
)

# terraform.tfvars
ai_search_sku = "standard" # or "standard2" for high volume
ai_search_replica_count = 2
ai_search_partition_count = 1
# Enable semantic search for better relevance
ai_search_semantic_search_sku = "standard"

- k6 - Modern load testing tool
- Locust - Python-based, distributed
- Apache JMeter - Enterprise grade
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 }, // Ramp up
    { duration: '5m', target: 100 }, // Sustained load
    { duration: '2m', target: 200 }, // Peak
    { duration: '2m', target: 0 },   // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% of requests under 500ms
    http_req_failed: ['rate<0.01'],   // <1% errors
  },
};

export default function () {
  const res = http.get('https://api.example.com/health');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1);
}

- Test in staging - Mirror production configuration
- Start small - Gradually increase load
- Monitor everything - Watch cluster metrics during tests
- Test failure scenarios - Include error conditions
- Document baselines - Record performance benchmarks
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Node CPU | >70% | >85% | Add nodes |
| Node Memory | >75% | >90% | Add nodes |
| Pod CPU | >80% | >95% | Increase limits or replicas |
| Pod Memory | >80% | >95% | Increase limits |
| API Latency | >500ms | >1s | Scale or optimize |
| Error Rate | >1% | >5% | Investigate |
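The bands in the table reduce to a trivial classifier, handy when wiring these thresholds into custom alerting; the function name is illustrative:

```python
def classify(value: float, warning: float, critical: float) -> str:
    """Map a utilization or error-rate reading onto the alert bands above."""
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"

print(classify(0.88, 0.70, 0.85))  # node CPU at 88% -> "critical"
```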
Required Nodes = (Current Workload × Growth Factor) / Node Capacity

Example:
- Current: 50 pods using 100m CPU each = 5 CPU cores
- Growth: 2x in 6 months
- Node capacity: 4 usable cores (D4s_v5)

Required = (5 × 2) / 4 = 2.5 → 3 nodes minimum
Add buffer: 3 × 1.3 = 3.9 → 4 nodes recommended
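The worked example above can be checked in a few lines; `required_nodes` is a hypothetical helper, not platform tooling:

```python
import math

def required_nodes(current_cores: float, growth_factor: float,
                   node_capacity_cores: float, buffer: float = 1.3) -> int:
    """Apply the formula, round up, then add the recommended headroom."""
    minimum = math.ceil(current_cores * growth_factor / node_capacity_cores)
    return math.ceil(minimum * buffer)

print(required_nodes(5, 2, 4))  # 4 nodes, matching the example
```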
- Use spot instances for non-critical workloads
- Right-size resources based on actual usage
- Implement pod disruption budgets for efficient scaling
- Use reserved instances for predictable workloads
- Schedule non-prod shutdowns during off-hours
- Cluster Overview
  - Node count and status
  - Total CPU/Memory utilization
  - Pod count by namespace
- Application Performance
  - Request rate (RPS)
  - Error rate
  - P50/P95/P99 latency
- Resource Efficiency
  - Requested vs actual usage
  - Cost per namespace
  - Idle resource percentage
- Scaling Events
  - HPA scaling events
  - Node scaling events
  - Pod restarts
| Task | Agent | Example Prompt |
|---|---|---|
| Prometheus metric analysis | @sre | "Query Prometheus for p99 latency over the last 24h" |
| AKS node tuning | @terraform | "Help me adjust the AKS node pool SKU and autoscaler settings" |
| Resource limits review | @sre | "Which pods are over their memory limits?" |
| Cost vs performance analysis | @architect | "Evaluate trade-offs between Standard_D4s_v3 and Standard_D8s_v3" |

Tip: Ask @sre to analyze metrics before making tuning decisions; data-driven changes are safer.
Document Version: 4.0.0 Last Updated: December 2025 Maintainer: Platform Engineering Team