This guide walks you through verifying GPU setup and deploying your first GPU-accelerated LLM.
- GKE cluster deployed via
terraform/gke/quick-start.sh - kubectl configured and connected
- Updated controller deployed with GPU support
# Check all nodes
kubectl get nodes -o wide
# Check GPU nodes specifically (may be 0 if auto-scaled down)
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4
# Check GPU allocatable resources
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable."nvidia\.com/gpu"# Check device plugin daemonset
kubectl get daemonset -n kube-system | grep nvidia
# Check device plugin pods (only runs on GPU nodes)
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide
# Check logs
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds --tail=50# Deploy test pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: gpu-test
spec:
restartPolicy: OnFailure
containers:
- name: cuda-test
image: nvidia/cuda:12.2.0-base-ubuntu22.04
command: ["nvidia-smi"]
resources:
limits:
nvidia.com/gpu: 1
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
nodeSelector:
cloud.google.com/gke-accelerator: nvidia-tesla-t4
EOF
# Wait for pod (may take 2-3 min if node scaling from 0)
kubectl wait --for=condition=Ready pod/gpu-test --timeout=300s || echo "Pod may still be pending..."
# Check status
kubectl get pod gpu-test -o wide
# View GPU info
kubectl logs gpu-test
# Expected output: nvidia-smi showing 1x Tesla T4, ~15GB memory
# Clean up
kubectl delete pod gpu-testTroubleshooting:
- If pod stays Pending: Check
kubectl describe pod gpu-testfor events - If "Insufficient nvidia.com/gpu": GPU nodes may be scaling up (wait 2-3 min)
- If node selector doesn't match: Verify GPU node labels with
kubectl get nodes --show-labels
The controller now includes GPU scheduling logic. Rebuild and deploy:
# From project root
cd /Users/defilan/stuffy/code/ai/llmkube
# Regenerate CRDs with new GPU fields
make generate
make manifests
# Install updated CRDs
make install
# Build and push new controller image
make docker-build docker-push IMG=ghcr.io/defilan/llmkube-controller:v0.2.0
# Deploy updated controller
make deploy IMG=ghcr.io/defilan/llmkube-controller:v0.2.0
# Verify controller is running
kubectl get pods -n llmkube-system
kubectl logs -n llmkube-system deployment/llmkube-controller-manager --tail=50GPU Features Added:
- ✅ Node selector for
nvidia-tesla-t4nodes - ✅ Tolerations for
nvidia.com/gputaint - ✅
--n-gpu-layersflag for layer offloading - ✅ Automatic GPU resource requests
# Create a GPU-accelerated model
kubectl apply -f - <<EOF
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
name: phi-3-7b-gpu
namespace: default
spec:
source: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf
format: gguf
quantization: Q4
hardware:
accelerator: cuda
gpu:
enabled: true
count: 1
vendor: nvidia
layers: -1 # Offload all layers to GPU
resources:
cpu: "4"
memory: "8Gi"
EOF
# Create GPU inference service
kubectl apply -f - <<EOF
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
name: phi-3-7b-gpu-service
namespace: default
spec:
modelRef: phi-3-7b-gpu
replicas: 1
resources:
gpu: 1
gpuMemory: "16Gi"
cpu: "2"
memory: "4Gi"
endpoint:
port: 8080
type: ClusterIP
EOF# Watch model status
kubectl get model phi-3-7b-gpu -w
# Watch service status
kubectl get inferenceservice phi-3-7b-gpu-service -w
# Check pod scheduling (should land on GPU node)
kubectl get pods -l app=phi-3-7b-gpu-service -o wide
# View init container logs (model download)
POD=$(kubectl get pod -l app=phi-3-7b-gpu-service -o jsonpath='{.items[0].metadata.name}')
kubectl logs $POD -c model-downloader -f
# View main container logs (GPU detection)
kubectl logs $POD -c llama-server -fExpected GPU Logs:
llama_model_load: loaded meta data with XX key-value pairs and XX tensors
llama_model_load: using CUDA backend
llm_load_tensors: offloading XX repeating layers to GPU
llm_load_tensors: offloaded XX/XX layers to GPU
llama_new_context_with_model: CUDA buffer size = 4096.00 MiB
# Port forward to service
kubectl port-forward svc/phi-3-7b-gpu-service 8080:8080 &
# Run benchmark script
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-3-7b-gpu",
"messages": [
{"role": "user", "content": "Explain quantum computing in 100 words"}
],
"max_tokens": 100,
"stream": false
}' | jq .
# Check response time and tokens/sec in logs
kubectl logs $POD -c llama-server --tail=20Benchmark Targets (T4 GPU):
- Prompt Processing: >50 tok/s (vs ~29 CPU)
- Token Generation: >100 tok/s (vs ~18.5 CPU)
- Latency P99: <1s for simple queries
- GPU Utilization: Should see >80% during inference
# SSH into GPU node (for debugging)
NODE=$(kubectl get pod $POD -o jsonpath='{.spec.nodeName}')
echo "Pod is running on node: $NODE"
# Deploy monitoring pod on same node
kubectl run gpu-monitor --rm -it --image=nvidia/cuda:12.2.0-base-ubuntu22.04 \
--overrides='{"spec": {"nodeSelector": {"kubernetes.io/hostname": "'$NODE'"}, "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists"}]}}' \
-- watch -n 1 nvidia-smi# Deploy same model on CPU
kubectl apply -f - <<EOF
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
name: phi-3-7b-cpu-service
spec:
modelRef: phi-3-7b-gpu # Same model, different service
replicas: 1
resources:
cpu: "4"
memory: "8Gi"
endpoint:
type: ClusterIP
EOF
# Test CPU performance
kubectl port-forward svc/phi-3-7b-cpu-service 8081:8080 &
time curl -X POST http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'| Metric | CPU (n2-standard-4) | GPU (T4) | Speedup |
|---|---|---|---|
| Prompt Processing | ~29 tok/s | >100 tok/s | 3-5x |
| Token Generation | ~18.5 tok/s | >100 tok/s | 5-10x |
| Cold Start | ~5s | ~8s | -1.6x |
| Latency P99 | ~2s | <1s | 2x |
| Cost/1K tokens | $0.002 | $0.008 | 4x more expensive |
Cost Analysis:
- GPU is 4x more expensive but 5-10x faster
- For production workloads with latency SLOs, GPU is cost-effective
- For batch/async workloads, CPU may be sufficient
# Delete inference services
kubectl delete inferenceservice phi-3-7b-gpu-service phi-3-7b-cpu-service
# Delete model
kubectl delete model phi-3-7b-gpu
# Stop port forwards
killall kubectl
# Optionally: Scale GPU nodes to 0 to save money
kubectl scale deployment -n kube-system --replicas=0 $(kubectl get deploy -n kube-system -o name | grep gpu)- Set up monitoring (Phase 2-3): Prometheus + Grafana for GPU metrics
- CLI support (Phase 2): Add
llmkube deploy --gpu 1flag - Multi-GPU (Phase 4-5): Test 13B models on 2x T4 GPUs
- Optimize costs: Implement auto-scaling based on traffic
kubectl describe pod $POD
# Check for:
# - "Insufficient nvidia.com/gpu": Wait for node to scale up
# - "node(s) didn't match Pod's node affinity": GPU nodes not available
# - "taint node.kubernetes.io/not-ready": Node still initializingkubectl logs $POD -c llama-server | grep -i cuda
# Should see: "using CUDA backend"
# If not: Image may not have CUDA support, check image tag# Check GPU utilization
kubectl exec -it $POD -- nvidia-smi
# Should show >80% GPU-Util during inference
# If low: Increase batch size or concurrent requestskubectl logs $POD -c llama-server | grep -i "out of memory"
# Solution: Reduce layers offloaded or use smaller quantization
# Example: Set gpu.layers: 20 instead of -1 in Model spec