diff --git a/documentdb-playground/telemetry/README.md b/documentdb-playground/telemetry/README.md new file mode 100644 index 00000000..f20df3a7 --- /dev/null +++ b/documentdb-playground/telemetry/README.md @@ -0,0 +1,311 @@ +# DocumentDB Multi-Tenant Telemetry Setup + +This directory contains scripts to set up complete multi-tenant telemetry infrastructure for DocumentDB on Azure Kubernetes Service (AKS) with namespace-based isolation and dedicated monitoring stacks per team. + +## Prerequisites + +- Azure CLI installed and configured +- kubectl installed +- Helm installed +- jq installed (for JSON parsing) +- An active Azure subscription +- Existing AKS cluster with DocumentDB Operator installed + +## Scripts Overview + +### deploy-multi-tenant-telemetry.sh + +**Primary deployment script** that sets up complete multi-tenant infrastructure: +- Creates isolated namespaces for teams (sales-namespace, accounts-namespace) +- Deploys DocumentDB clusters per team with proper CNPG configuration +- Sets up dedicated OpenTelemetry Collectors with CPU/memory monitoring +- Installs separate Prometheus and Grafana instances per team +- Configures proper RBAC and service accounts + +**Usage:** +```bash +# Deploy complete multi-tenant stack +./deploy-multi-tenant-telemetry.sh + +# Deploy only DocumentDB clusters +./deploy-multi-tenant-telemetry.sh --documentdb-only + +# Deploy only telemetry stack +./deploy-multi-tenant-telemetry.sh --telemetry-only + +# Skip waiting for deployments (for status checking) +./deploy-multi-tenant-telemetry.sh --skip-wait +``` + +### setup-grafana-dashboards.sh + +**Automated dashboard creation** that programmatically sets up monitoring dashboards: +- Creates comprehensive CPU and Memory monitoring dashboards +- Configures namespace-specific metric filtering +- Includes pod count and resource utilization metrics +- Uses Grafana API for automated deployment + +**Usage:** +```bash +# Create dashboard for sales team +./setup-grafana-dashboards.sh sales-namespace + +# Create dashboard for accounts team +./setup-grafana-dashboards.sh accounts-namespace +``` + +### delete-multi-tenant-telemetry.sh + +**Application cleanup script** that removes multi-tenant applications while preserving infrastructure: +- Deletes DocumentDB clusters per team +- Removes OpenTelemetry collectors +- Cleans up Prometheus and Grafana monitoring stacks +- Deletes team namespaces and associated resources + +**Usage:** +```bash +# Delete everything (applications only, keeps infrastructure) +./delete-multi-tenant-telemetry.sh --delete-all + +# Delete only DocumentDB clusters +./delete-multi-tenant-telemetry.sh --delete-documentdb + +# Delete only monitoring (Prometheus/Grafana) +./delete-multi-tenant-telemetry.sh --delete-monitoring + +# Delete with no confirmation prompts +./delete-multi-tenant-telemetry.sh --delete-all --force +``` + +### Infrastructure Management Scripts + +#### create-cluster.sh +**Infrastructure setup** - Creates AKS cluster and operators only: +```bash +# Create cluster + DocumentDB operator + OpenTelemetry operator +./create-cluster.sh --install-all + +# Create cluster only +./create-cluster.sh + +# Install operators on existing cluster +./create-cluster.sh --install-operator +``` + +#### delete-cluster.sh +**Infrastructure cleanup** - Removes cluster and all Azure resources: +```bash +# Delete entire AKS cluster and Azure resources +./delete-cluster.sh --delete-all + +# Delete only cluster (keeps resource group) +./delete-cluster.sh --delete-cluster +``` + +## Script Organization 
+ +### Infrastructure vs Applications + +Our scripts are organized with **clean separation of concerns**: + +| **Infrastructure Scripts** | **Application Scripts** | +|---------------------------|-------------------------| +| `create-cluster.sh` | `deploy-multi-tenant-telemetry.sh` | +| `delete-cluster.sh` | `delete-multi-tenant-telemetry.sh` | +| | `setup-grafana-dashboards.sh` | + +**Infrastructure Scripts** manage: +- ✅ AKS cluster creation/deletion +- ✅ Azure resource management +- ✅ DocumentDB operator installation +- ✅ OpenTelemetry operator installation +- ✅ Core platform components (cert-manager, CSI drivers) + +**Application Scripts** manage: +- 📦 DocumentDB cluster deployments per team +- 🔧 OpenTelemetry collector configurations +- 📊 Monitoring stacks (Prometheus, Grafana) +- 🏠 Team namespaces and application resources + +### Benefits of This Approach + +- **🔄 Reusable Infrastructure**: Create cluster once, deploy multiple application stacks +- **💰 Cost Optimization**: Delete applications without losing cluster setup +- **🔧 Independent Updates**: Update monitoring without touching infrastructure +- **👥 Team Isolation**: Each team can manage their own application stack +- **🚀 Faster Iterations**: Deploy/destroy applications in seconds, not minutes + +## Architecture Overview + +### Multi-Tenant DocumentDB + Telemetry Stack + +Our implementation provides **complete namespace isolation** with dedicated resources per team: + +``` +┌─── sales-namespace ────────────────────────────┐ ┌─── accounts-namespace ──────────────────────┐ +│ • DocumentDB Cluster (documentdb-sales) │ │ • DocumentDB Cluster (documentdb-accounts) │ +│ • OpenTelemetry Collector (sales-focused) │ │ • OpenTelemetry Collector (accounts-focused)│ +│ • Prometheus Server (prometheus-sales) │ │ • Prometheus Server (prometheus-accounts) │ +│ • Grafana Instance (grafana-sales) │ │ • Grafana Instance (grafana-accounts) │ +│ • Dedicated RBAC & Service Accounts │ │ • Dedicated RBAC & Service Accounts │ +└─────────────────────────────────────────────────┘ └──────────────────────────────────────────────┘ +``` + +### What Gets Deployed + +#### Per Team/Namespace: +- **DocumentDB Cluster**: CNPG-managed PostgreSQL cluster with proper operator integration +- **OpenTelemetry Collector**: Namespace-scoped metric collection focusing on CPU/Memory +- **Prometheus Server**: Time-series database for storing team-specific metrics +- **Grafana Instance**: Visualization dashboard with automated dashboard provisioning +- **RBAC Configuration**: Service accounts, cluster roles, and bindings for secure access + +#### Shared Components: +- **DocumentDB Operator**: Cluster-wide operator managing all DocumentDB instances +- **OpenTelemetry Operator**: Cluster-wide operator managing collector deployments + +## Recommended Workflow + +### 1. Infrastructure Setup (One Time) +```bash +# Create AKS cluster with all required operators +cd scripts/ +./create-cluster.sh --install-all +``` + +### 2. Application Deployment (Repeatable) +```bash +# Deploy multi-tenant DocumentDB + monitoring +./deploy-multi-tenant-telemetry.sh + +# Create automated dashboards +./setup-grafana-dashboards.sh sales-namespace +./setup-grafana-dashboards.sh accounts-namespace +``` + +### 3. 
Access & Monitor +```bash +# Access Grafana dashboards +kubectl port-forward -n sales-namespace svc/grafana-sales 3001:3000 & +kubectl port-forward -n accounts-namespace svc/grafana-accounts 3002:3000 & + +# Open in browser: http://localhost:3001 and http://localhost:3002 +# Login: admin / admin123 +``` + +### 4. Cleanup Applications (Keep Infrastructure) +```bash +# Remove all applications, keep cluster running +./delete-multi-tenant-telemetry.sh --delete-all +``` + +### 5. Full Cleanup (When Done) +```bash +# Delete entire Azure infrastructure +./delete-cluster.sh --delete-all +``` + +## Quick Start Guide + +### 1. Deploy Complete Multi-Tenant Stack +```bash +# Deploy DocumentDB clusters + telemetry for both teams +cd scripts/ +./deploy-multi-tenant-telemetry.sh +``` + +### 2. Create Monitoring Dashboards +```bash +# Create automated dashboards for both teams +./setup-grafana-dashboards.sh sales-namespace +./setup-grafana-dashboards.sh accounts-namespace +``` + +### 3. Access Grafana Dashboards +```bash +# Port-forward to sales Grafana (runs in background) +kubectl port-forward -n sales-namespace svc/grafana-sales 3001:3000 > /dev/null 2>&1 & + +# Port-forward to accounts Grafana (runs in background) +kubectl port-forward -n accounts-namespace svc/grafana-accounts 3002:3000 > /dev/null 2>&1 & + +# Access dashboards in browser: +# Sales Team: http://localhost:3001 +# Accounts Team: http://localhost:3002 +# Login: admin / admin123 +``` + +## Monitoring Capabilities + +### Metrics Collected (CPU & Memory Focus) +- **container_cpu_usage_seconds_total**: CPU usage per container +- **container_memory_working_set_bytes**: Memory usage per container +- **container_spec_memory_limit_bytes**: Memory limits per container +- **Pod count and status metrics** + +### Dashboard Features +- **CPU Usage by Container**: Real-time CPU utilization with 5-minute rate calculation +- **Memory Usage by Container**: Memory consumption in MB per container +- **Memory Usage Percentage**: Memory usage as percentage of configured limits +- **Pod Count Monitoring**: Number of active pods per namespace + +### Namespace Isolation +Each OpenTelemetry collector is configured with strict namespace filtering: +```yaml +metric_relabel_configs: + - source_labels: [namespace] + regex: '^(sales-namespace)$' # Only sales-namespace metrics + action: keep +``` + +## Advanced Usage + +### Deployment Options +```bash +# Deploy only DocumentDB clusters (skip telemetry) +./deploy-multi-tenant-telemetry.sh --documentdb-only + +# Deploy only telemetry stack (skip DocumentDB) +./deploy-multi-tenant-telemetry.sh --telemetry-only + +# Check deployment status without waiting +./deploy-multi-tenant-telemetry.sh --skip-wait +``` + +### Accessing Different Components +```bash +# Check DocumentDB cluster status +kubectl get clusters -n sales-namespace +kubectl get clusters -n accounts-namespace + +# View OpenTelemetry collector logs +kubectl logs -n sales-namespace -l app.kubernetes.io/name=opentelemetry-collector + +# Access Prometheus directly +kubectl port-forward -n sales-namespace svc/prometheus-sales-server 9090:80 +``` + +### Troubleshooting +```bash +# Check all pods status +kubectl get pods -n sales-namespace +kubectl get pods -n accounts-namespace + +# View collector configuration +kubectl get otelcol -n sales-namespace otel-collector-sales -o yaml + +# Check metric collection +kubectl logs -n sales-namespace deployment/otel-collector-sales +``` + +## Cost Management + +**Important**: This setup creates dedicated resources per team. 
Monitor costs and clean up when testing is complete: + +```bash +# Clean up multi-tenant resources +kubectl delete namespace sales-namespace accounts-namespace + +# Or use legacy cleanup (if applicable) +./delete-cluster.sh +``` \ No newline at end of file diff --git a/documentdb-playground/telemetry/otel-collector-accounts.yaml b/documentdb-playground/telemetry/otel-collector-accounts.yaml new file mode 100644 index 00000000..786eeb41 --- /dev/null +++ b/documentdb-playground/telemetry/otel-collector-accounts.yaml @@ -0,0 +1,118 @@ +apiVersion: v1 +kind: ServiceAccount +metadata: + name: otel-collector + namespace: accounts-namespace +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: otel-collector-accounts +rules: +- apiGroups: [""] + resources: ["nodes", "nodes/proxy", "nodes/metrics", "services", "endpoints", "pods"] + verbs: ["get", "list", "watch"] +- nonResourceURLs: ["/metrics", "/metrics/cadvisor"] + verbs: ["get"] +- apiGroups: ["apps"] + resources: ["daemonsets", "deployments", "replicasets"] + verbs: ["get", "list", "watch"] +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: otel-collector-accounts +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: otel-collector-accounts +subjects: +- kind: ServiceAccount + name: otel-collector + namespace: accounts-namespace +--- +apiVersion: opentelemetry.io/v1beta1 +kind: OpenTelemetryCollector +metadata: + name: documentdb-accounts-collector + namespace: accounts-namespace +spec: + mode: deployment # Single pod per namespace, not DaemonSet + replicas: 1 + serviceAccount: otel-collector + config: + receivers: + # Scrape container CPU/Memory metrics from DocumentDB pods + prometheus: + config: + scrape_configs: + # Container CPU/Memory metrics via Kubernetes API proxy to cAdvisor + - job_name: 'accounts-container-metrics' + kubernetes_sd_configs: + - role: node + relabel_configs: + # Use Kubernetes API proxy to access cAdvisor + - target_label: __address__ + replacement: kubernetes.default.svc:443 + - source_labels: [__meta_kubernetes_node_name] + regex: (.+) + target_label: __metrics_path__ + replacement: '/api/v1/nodes/$1/proxy/metrics/cadvisor' + - source_labels: [__meta_kubernetes_node_name] + target_label: instance + - target_label: tenant + replacement: 'accounts' + scheme: https + tls_config: + ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt + insecure_skip_verify: true + bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token + metric_relabel_configs: + # Filter only accounts namespace containers after scraping + - source_labels: [namespace] + regex: 'accounts-namespace' + action: keep + # Keep only running containers (exclude POD sandbox) + - source_labels: [container] + regex: '^$|POD' + action: drop + + processors: + batch: + timeout: 10s + send_batch_size: 1024 + + attributes: + actions: + - key: service.name + value: "documentdb-accounts-telemetry" + action: insert + - key: telemetry.source + value: "otel-collector-accounts" + action: insert + - key: tenant + value: "accounts" + action: insert + + exporters: + # Export to accounts team's dedicated Prometheus + prometheusremotewrite: + endpoint: "http://prometheus-accounts-server.accounts-namespace.svc.cluster.local:80/api/v1/write" + external_labels: + tenant: "accounts" + cluster: "documentdb-accounts" + + # Alternative: Export to tenant-specific external backend + # azuremonitor: + # instrumentation_key: "${ACCOUNTS_AZURE_MONITOR_KEY}" + + service: + 
pipelines: + metrics: + receivers: [prometheus] + processors: [attributes, batch] + exporters: [prometheusremotewrite] + + telemetry: + logs: + level: "info" \ No newline at end of file diff --git a/documentdb-playground/telemetry/otel-collector-sales.yaml b/documentdb-playground/telemetry/otel-collector-sales.yaml new file mode 100644 index 00000000..96d1c4ea --- /dev/null +++ b/documentdb-playground/telemetry/otel-collector-sales.yaml @@ -0,0 +1,118 @@ +apiVersion: v1 +kind: ServiceAccount +metadata: + name: otel-collector + namespace: sales-namespace +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: otel-collector-sales +rules: +- apiGroups: [""] + resources: ["nodes", "nodes/proxy", "nodes/metrics", "services", "endpoints", "pods"] + verbs: ["get", "list", "watch"] +- nonResourceURLs: ["/metrics", "/metrics/cadvisor"] + verbs: ["get"] +- apiGroups: ["apps"] + resources: ["daemonsets", "deployments", "replicasets"] + verbs: ["get", "list", "watch"] +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: otel-collector-sales +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: otel-collector-sales +subjects: +- kind: ServiceAccount + name: otel-collector + namespace: sales-namespace +--- +apiVersion: opentelemetry.io/v1beta1 +kind: OpenTelemetryCollector +metadata: + name: documentdb-sales-collector + namespace: sales-namespace +spec: + mode: deployment # Single pod per namespace, not DaemonSet + replicas: 1 + serviceAccount: otel-collector + config: + receivers: + # Scrape container CPU/Memory metrics from DocumentDB pods + prometheus: + config: + scrape_configs: + # Container CPU/Memory metrics via Kubernetes API proxy to cAdvisor + - job_name: 'sales-container-metrics' + kubernetes_sd_configs: + - role: node + relabel_configs: + # Use Kubernetes API proxy to access cAdvisor + - target_label: __address__ + replacement: kubernetes.default.svc:443 + - source_labels: [__meta_kubernetes_node_name] + regex: (.+) + target_label: __metrics_path__ + replacement: '/api/v1/nodes/$1/proxy/metrics/cadvisor' + - source_labels: [__meta_kubernetes_node_name] + target_label: instance + - target_label: tenant + replacement: 'sales' + scheme: https + tls_config: + ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt + insecure_skip_verify: true + bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token + metric_relabel_configs: + # Filter only sales namespace containers after scraping + - source_labels: [namespace] + regex: 'sales-namespace' + action: keep + # Keep only running containers (exclude POD sandbox) + - source_labels: [container] + regex: '^$|POD' + action: drop + + processors: + batch: + timeout: 10s + send_batch_size: 1024 + + attributes: + actions: + - key: service.name + value: "documentdb-sales-telemetry" + action: insert + - key: telemetry.source + value: "otel-collector-sales" + action: insert + - key: tenant + value: "sales" + action: insert + + exporters: + # Export to sales team's dedicated Prometheus + prometheusremotewrite: + endpoint: "http://prometheus-sales-server.sales-namespace.svc.cluster.local:80/api/v1/write" + external_labels: + tenant: "sales" + cluster: "documentdb-sales" + + # Alternative: Export to tenant-specific external backend + # azuremonitor: + # instrumentation_key: "${SALES_AZURE_MONITOR_KEY}" + + service: + pipelines: + metrics: + receivers: [prometheus] + processors: [attributes, batch] + exporters: [prometheusremotewrite] + + telemetry: + 
logs:
+          level: "info"
\ No newline at end of file
diff --git a/documentdb-playground/telemetry/scripts/create-cluster.sh b/documentdb-playground/telemetry/scripts/create-cluster.sh
new file mode 100755
index 00000000..916ff132
--- /dev/null
+++ b/documentdb-playground/telemetry/scripts/create-cluster.sh
@@ -0,0 +1,731 @@
+#!/bin/bash
+
+# DocumentDB AKS Cluster Creation Script
+# This script creates a complete AKS cluster with all dependencies for DocumentDB
+
+set -e  # Exit on any error
+
+# Configuration
+CLUSTER_NAME="ray-ddb-cluster"
+RESOURCE_GROUP="ray-documentdb-rg"
+LOCATION="West US 2"
+NODE_COUNT=2
+NODE_SIZE="Standard_D4s_v5"
+KUBERNETES_VERSION="1.31.11"
+
+# DocumentDB Operator Configuration
+# For testing: use hossain-rayhan/documentdb-operator (fork with Azure enhancements)
+# For production: use microsoft/documentdb-operator (official)
+OPERATOR_GITHUB_ORG="hossain-rayhan"
+OPERATOR_CHART_VERSION="0.1.112"
+
+# Feature flags - set to "true" to enable, "false" to skip
+INSTALL_OPERATOR="${INSTALL_OPERATOR:-false}"
+DEPLOY_INSTANCE="${DEPLOY_INSTANCE:-false}"
+CREATE_STORAGE_CLASS="${CREATE_STORAGE_CLASS:-false}"
+
+
+# GitHub credentials - check environment variables first, can be overridden by command line
+GITHUB_USERNAME="${GITHUB_USERNAME:-}"
+GITHUB_TOKEN="${GITHUB_TOKEN:-}"
+
+# Parse command line arguments
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --skip-operator)
+            INSTALL_OPERATOR="false"
+            shift
+            ;;
+        --skip-instance)
+            DEPLOY_INSTANCE="false"
+            shift
+            ;;
+        --install-operator)
+            INSTALL_OPERATOR="true"
+            shift
+            ;;
+        --deploy-instance)
+            DEPLOY_INSTANCE="true"
+            shift
+            ;;
+        --install-all)
+            INSTALL_OPERATOR="true"
+            DEPLOY_INSTANCE="true"
+            shift
+            ;;
+
+        --create-storage-class)
+            CREATE_STORAGE_CLASS="true"
+            shift
+            ;;
+        --skip-storage-class)
+            CREATE_STORAGE_CLASS="false"
+            shift
+            ;;
+        --cluster-name)
+            CLUSTER_NAME="$2"
+            shift 2
+            ;;
+        --resource-group)
+            RESOURCE_GROUP="$2"
+            shift 2
+            ;;
+        --location)
+            LOCATION="$2"
+            shift 2
+            ;;
+        --github-username)
+            GITHUB_USERNAME="$2"
+            shift 2
+            ;;
+        --github-token)
+            GITHUB_TOKEN="$2"
+            shift 2
+            ;;
+        -h|--help)
+            echo "Usage: $0 [OPTIONS]"
+            echo ""
+            echo "Options:"
+            echo "  --skip-operator          Skip DocumentDB operator installation (default)"
+            echo "  --skip-instance          Skip DocumentDB instance deployment (default)"
+            echo "  --install-operator       Install DocumentDB operator only (assumes cluster exists)"
+            echo "  --deploy-instance        Deploy DocumentDB instance only (assumes cluster+operator exist)"
+            echo "  --install-all            Create cluster + install operator + deploy instance"
+
+            echo "  --create-storage-class   Create custom Premium SSD storage class"
+            echo "  --skip-storage-class     Use AKS default storage (StandardSSD_LRS) - default"
+            echo "  --cluster-name NAME      AKS cluster name (default: ray-ddb-cluster)"
+            echo "  --resource-group RG      Azure resource group (default: ray-documentdb-rg)"
+            echo "  --location LOCATION      Azure location (default: West US 2)"
+            echo "  --github-username        GitHub username for operator installation"
+            echo "  --github-token           GitHub token for operator installation"
+            echo "  -h, --help               Show this help message"
+            echo ""
+            echo "Examples:"
+            echo "  $0                       # Create cluster only"
+            echo "  $0 --install-operator    # Install operator only (assumes cluster exists)"
+            echo "  $0 --deploy-instance     # Deploy DocumentDB only (assumes cluster+operator exist)"
+
+            echo "  $0 --install-all --github-username myuser --github-token ghp_xxx  # Full setup with GitHub auth"
+            echo "  $0 --install-all         # Create cluster + install operator + deploy 
instance" + exit 0 + ;; + *) + echo "Unknown option: $1" + exit 1 + ;; + esac +done + +# Colors for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' # No Color + +# Logging function +log() { + echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1" +} + +success() { + echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')] ✅ $1${NC}" +} + +warn() { + echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] ⚠️ $1${NC}" +} + +error() { + echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ❌ $1${NC}" + exit 1 +} + +# Check prerequisites +check_prerequisites() { + log "Checking prerequisites..." + + # Check Azure CLI + if ! command -v az &> /dev/null; then + error "Azure CLI not found. Please install Azure CLI first." + fi + + # Check kubectl + if ! command -v kubectl &> /dev/null; then + error "kubectl not found. Please install kubectl first." + fi + + # Check Helm + if ! command -v helm &> /dev/null; then + error "Helm not found. Please install Helm first." + fi + + # Check Azure login + if ! az account show &> /dev/null; then + error "Not logged into Azure. Please run 'az login' first." + fi + + success "All prerequisites met" +} + +# Create resource group +create_resource_group() { + log "Creating resource group: $RESOURCE_GROUP in location: $LOCATION" + + # Check if resource group already exists + if az group show --name $RESOURCE_GROUP &> /dev/null; then + warn "Resource group $RESOURCE_GROUP already exists. Skipping creation." + return 0 + fi + + # Create resource group + az group create --name $RESOURCE_GROUP --location "$LOCATION" + + if [ $? -eq 0 ]; then + success "Resource group created successfully" + else + error "Failed to create resource group" + fi +} + +# Create AKS cluster +create_cluster() { + log "Creating AKS cluster: $CLUSTER_NAME" + + # Check if cluster already exists + if az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME &> /dev/null; then + warn "Cluster $CLUSTER_NAME already exists. Skipping cluster creation." + else + # Create AKS cluster with managed identity and required addons + az aks create \ + --resource-group $RESOURCE_GROUP \ + --name $CLUSTER_NAME \ + --node-count $NODE_COUNT \ + --node-vm-size $NODE_SIZE \ + --kubernetes-version $KUBERNETES_VERSION \ + --enable-managed-identity \ + --enable-addons monitoring \ + --enable-cluster-autoscaler \ + --min-count 2 \ + --max-count 5 \ + --generate-ssh-keys \ + --network-plugin azure \ + --network-policy azure \ + --load-balancer-sku standard + + if [ $? -eq 0 ]; then + success "AKS cluster created successfully" + else + error "Failed to create AKS cluster" + fi + fi + + # Get cluster credentials + log "Getting cluster credentials..." + az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME --overwrite-existing + + # Handle WSL case - copy Windows kubeconfig to WSL + if grep -qi microsoft /proc/version 2>/dev/null; then + log "Detected WSL environment, copying kubeconfig from Windows to WSL..." + WIN_KUBE_CONFIG="/mnt/c/Users/$(whoami)/.kube/config" + if [ -f "$WIN_KUBE_CONFIG" ]; then + mkdir -p ~/.kube + cp "$WIN_KUBE_CONFIG" ~/.kube/config + chmod 600 ~/.kube/config + log "Kubeconfig copied to WSL" + else + warn "Windows kubeconfig not found at expected location" + fi + fi + + success "Cluster credentials configured" +} + +# Install Azure CSI drivers +install_azure_csi_drivers() { + log "Checking Azure CSI drivers..." 
+ + # Check if CSI drivers are already enabled (modern AKS clusters have them by default) + CSI_STATUS=$(az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME --query "storageProfile" -o json 2>/dev/null) + DISK_CSI_ENABLED=$(echo "$CSI_STATUS" | jq -r '.diskCsiDriver.enabled // false') + FILE_CSI_ENABLED=$(echo "$CSI_STATUS" | jq -r '.fileCsiDriver.enabled // false') + + if [ "$DISK_CSI_ENABLED" == "true" ] && [ "$FILE_CSI_ENABLED" == "true" ]; then + success "Azure CSI drivers already enabled (Disk: ✅, File: ✅)" + return 0 + fi + + log "CSI drivers not fully enabled - installing..." + log "Current status: Disk=$DISK_CSI_ENABLED, File=$FILE_CSI_ENABLED" + + # Azure Disk CSI driver (only if not enabled) + if [ "$DISK_CSI_ENABLED" != "true" ]; then + log "Enabling Azure Disk CSI driver..." + az aks update --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME --enable-disk-driver >/dev/null 2>&1 + fi + + # Azure File CSI driver (only if not enabled) + if [ "$FILE_CSI_ENABLED" != "true" ]; then + log "Enabling Azure File CSI driver..." + az aks update --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME --enable-file-driver >/dev/null 2>&1 + fi + + success "Azure CSI drivers configured" +} + +# Verify Azure Load Balancer (built-in to AKS) +configure_load_balancer() { + log "Verifying Azure Load Balancer..." + + # Azure Load Balancer is built into AKS, just verify it's working + if kubectl get service kubernetes -n default >/dev/null 2>&1; then + success "Azure Load Balancer verified (built-in to AKS)" + else + warn "Unable to verify Kubernetes API service" + fi +} + +# Install cert-manager +install_cert_manager() { + log "Installing cert-manager..." + + # Check if already installed + if helm list -n cert-manager | grep -q cert-manager; then + warn "cert-manager already installed. Skipping installation." + return 0 + fi + + # Add Jetstack Helm repository + helm repo add jetstack https://charts.jetstack.io + helm repo update + + # Install cert-manager + helm install cert-manager jetstack/cert-manager \ + --namespace cert-manager \ + --create-namespace \ + --version v1.13.2 \ + --set installCRDs=true \ + --set prometheus.enabled=false \ + --set webhook.timeoutSeconds=30 + + # Wait for cert-manager to be ready + log "Waiting for cert-manager to be ready..." + sleep 30 + kubectl wait --for=condition=ready pod -l app.kubernetes.io/instance=cert-manager -n cert-manager --timeout=300s || warn "cert-manager pods may still be starting" + + success "cert-manager installed" +} + +# Create optimized storage class for Azure (optional) +create_storage_class() { + if [ "$CREATE_STORAGE_CLASS" != "true" ]; then + warn "Skipping custom storage class creation (using AKS default StandardSSD_LRS)" + return 0 + fi + + log "Creating DocumentDB custom Premium SSD storage class..." + + # Check if storage class already exists + if kubectl get storageclass documentdb-storage &> /dev/null; then + warn "DocumentDB storage class already exists. Skipping creation." + return 0 + fi + + kubectl apply -f - < /dev/null; then + error "Cannot reach ghcr.io. Please check your internet connection and firewall settings." + fi + + # Install DocumentDB operator using enhanced fork with Azure support + log "Installing DocumentDB operator from GitHub Container Registry (enhanced fork with Azure support)..." + + # Check for GitHub authentication + if [ -z "$GITHUB_TOKEN" ] || [ -z "$GITHUB_USERNAME" ]; then + error "DocumentDB operator installation requires GitHub authentication. 
+ +GitHub credentials can be provided via: +1. Environment variables (recommended): + export GITHUB_USERNAME='your-github-username' + export GITHUB_TOKEN='your-github-token' + +2. Command line arguments: + --github-username --github-token + +To create a GitHub token: +1. Go to https://github.com/settings/tokens +2. Generate a new token with 'read:packages' scope +3. Set the environment variables as shown above + +Then run the script again with --install-operator" + fi + + # Authenticate with GitHub Container Registry + log "Authenticating with GitHub Container Registry..." + if ! echo "$GITHUB_TOKEN" | helm registry login ghcr.io --username "$GITHUB_USERNAME" --password-stdin; then + error "Failed to authenticate with GitHub Container Registry. Please verify your GITHUB_TOKEN and GITHUB_USERNAME." + fi + + # Install DocumentDB operator from OCI registry + log "Pulling and installing DocumentDB operator from ghcr.io/${OPERATOR_GITHUB_ORG}/documentdb-operator..." + helm install documentdb-operator \ + oci://ghcr.io/${OPERATOR_GITHUB_ORG}/documentdb-operator \ + --version ${OPERATOR_CHART_VERSION} \ + --namespace documentdb-operator \ + --create-namespace \ + --wait \ + --timeout 10m + + if [ $? -eq 0 ]; then + success "DocumentDB operator installed successfully from ${OPERATOR_GITHUB_ORG}/documentdb-operator:${OPERATOR_CHART_VERSION}" + else + error "Failed to install DocumentDB operator from OCI registry. Please verify: +- Your GitHub token has 'read:packages' scope +- You have access to ${OPERATOR_GITHUB_ORG}/documentdb-operator repository +- The chart version ${OPERATOR_CHART_VERSION} exists" + fi + + # Wait for operator to be ready + log "Waiting for DocumentDB operator to be ready..." + kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=documentdb-operator -n documentdb-operator --timeout=300s || warn "DocumentDB operator pods may still be starting" + + success "DocumentDB operator installed" +} + +# Deploy DocumentDB instance (optional) +deploy_documentdb_instance() { + if [ "$DEPLOY_INSTANCE" != "true" ]; then + warn "Skipping DocumentDB instance deployment (--skip-instance specified or not enabled)" + return 0 + fi + + log "Deploying DocumentDB instance..." + + # Check if operator is installed + if ! kubectl get deployment -n documentdb-operator documentdb-operator &> /dev/null; then + error "DocumentDB operator not found. Cannot deploy instance without operator." + fi + + # Create DocumentDB namespace + kubectl apply -f - < /dev/null; then + warn "OpenTelemetry Operator already installed. Skipping installation." + return 0 + fi + + # Install OpenTelemetry Operator + log "Installing OpenTelemetry Operator from upstream..." + kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml + + # Wait for operator to be ready + log "Waiting for OpenTelemetry Operator to be ready..." + kubectl wait --for=condition=available deployment/opentelemetry-operator-controller-manager -n opentelemetry-operator-system --timeout=300s || warn "OpenTelemetry Operator may still be starting" + + success "OpenTelemetry Operator installed (ready for multi-tenant collectors)" +} + +# Print summary +print_summary() { + echo "" + echo "==================================================" + echo "🎉 AKS CLUSTER SETUP COMPLETE!" 
+    echo "=================================================="
+    echo "Cluster Name: $CLUSTER_NAME"
+    echo "Resource Group: $RESOURCE_GROUP"
+    echo "Location: $LOCATION"
+    echo "Operator Installed: $INSTALL_OPERATOR"
+    echo "Instance Deployed: $DEPLOY_INSTANCE"
+    echo "OpenTelemetry Operator: Installed"
+    echo "Custom Storage Class: $CREATE_STORAGE_CLASS"
+    echo ""
+    echo "✅ Components installed:"
+    echo "   - AKS cluster with managed nodes"
+    echo "   - Azure CSI drivers (Disk & File)"
+    echo "   - Azure Load Balancer (built-in)"
+    echo "   - cert-manager"
+    if [ "$CREATE_STORAGE_CLASS" == "true" ]; then
+        echo "   - DocumentDB Premium SSD storage class"
+    else
+        echo "   - Using AKS default StandardSSD_LRS storage"
+    fi
+    if [ "$INSTALL_OPERATOR" == "true" ]; then
+        echo "   - DocumentDB operator"
+    fi
+    if [ "$DEPLOY_INSTANCE" == "true" ]; then
+        echo "   - DocumentDB instance (sample-documentdb)"
+    fi
+    echo "   - OpenTelemetry Operator (for multi-tenant collectors)"
+    echo ""
+    echo "💡 Next steps:"
+    echo "   - Verify cluster: kubectl get nodes"
+    echo "   - Check all pods: kubectl get pods --all-namespaces"
+    if [ "$INSTALL_OPERATOR" == "true" ]; then
+        echo "   - Check operator: kubectl get pods -n documentdb-operator"
+    fi
+    if [ "$DEPLOY_INSTANCE" == "true" ]; then
+        echo "   - Check DocumentDB: kubectl get documentdb -n documentdb-instance-ns"
+        echo "   - Check service status: kubectl get svc -n documentdb-instance-ns"
+        echo "   - Wait for LoadBalancer IP: kubectl get svc documentdb-service-sample-documentdb -n documentdb-instance-ns -w"
+        echo "   - Once IP is assigned, connect: mongodb://docdbadmin:SecurePassword123!@<EXTERNAL-IP>:10260/"
+    fi
+    if [ "$ENABLE_TELEMETRY" == "true" ]; then
+        echo "   - Check telemetry: kubectl get pods -n documentdb-telemetry"
+        echo "   - Access Grafana: kubectl port-forward -n documentdb-telemetry svc/grafana 3000:80"
+        echo "   - Access Prometheus: kubectl port-forward -n documentdb-telemetry svc/prometheus-server 9090:80"
+        echo "   - Grafana login: admin / admin123"
+    fi
+    echo ""
+    echo "⚠️ IMPORTANT: Run './delete-cluster.sh' when done to avoid Azure charges!"
+    echo "=================================================="
+}
+
+# Main execution
+main() {
+    log "Starting DocumentDB AKS cluster setup..."
+    log "Configuration:"
+    log "  Cluster: $CLUSTER_NAME"
+    log "  Resource Group: $RESOURCE_GROUP"
+    log "  Location: $LOCATION"
+    log "  Install Operator: $INSTALL_OPERATOR"
+    log "  Deploy Instance: $DEPLOY_INSTANCE"
+    log "  Enable Telemetry: $ENABLE_TELEMETRY"
+    if [ ! -z "$GITHUB_USERNAME" ]; then
+        log "  GitHub Username: $GITHUB_USERNAME"
+        log "  GitHub Token: ${GITHUB_TOKEN:+***provided***}"
+    fi
+    echo ""
+
+    # Validate GitHub credentials if operator installation is requested
+    if [ "$INSTALL_OPERATOR" == "true" ] && ([ -z "$GITHUB_TOKEN" ] || [ -z "$GITHUB_USERNAME" ]); then
+        error "DocumentDB operator installation requires GitHub authentication.
+
+GitHub credentials can be provided via:
+
+1. Environment variables (recommended):
+   export GITHUB_USERNAME='your-github-username'
+   export GITHUB_TOKEN='your-github-token'
+
+2. Command line arguments:
+   --github-username <username> --github-token <token>
+
+Example with command line:
+  $0 --install-operator --github-username myuser --github-token ghp_xxxxxxxxxxxx
+
+To create a GitHub token:
+1. Go to https://github.com/settings/tokens
+2. Generate a new token with 'read:packages' scope
+3. 
Set via environment variables or command line arguments" + fi + + check_prerequisites + + # Simple logic based on parameters + if [ "$INSTALL_OPERATOR" == "true" ] && [ "$DEPLOY_INSTANCE" != "true" ]; then + # Case 1: --install-operator only + log "🔧 Installing operator only (assumes cluster exists)" + setup_kubeconfig + install_documentdb_operator + + elif [ "$DEPLOY_INSTANCE" == "true" ] && [ "$INSTALL_OPERATOR" != "true" ]; then + # Case 2: --deploy-instance only + log "🚀 Deploying DocumentDB instance only (assumes cluster+operator exist)" + setup_kubeconfig + deploy_documentdb_instance + + elif [ "$INSTALL_OPERATOR" == "true" ] && [ "$DEPLOY_INSTANCE" == "true" ]; then + # Case 3: --install-all (both flags set) + log "🎯 Installing everything: cluster + operator + instance" + setup_cluster_infrastructure + install_documentdb_operator + deploy_documentdb_instance + + else + # Case 4: No flags - create cluster only + log "🏗️ Creating cluster only (no operator, no instance)" + setup_cluster_infrastructure + fi + + # Always install OpenTelemetry Operator (infrastructure component for multi-tenant collectors) + log "📊 Installing OpenTelemetry Operator (infrastructure)..." + setup_kubeconfig # Ensure we have cluster access + install_opentelemetry_operator + + print_summary +} + +# Helper function to set up cluster infrastructure +setup_cluster_infrastructure() { + # Check if cluster already exists + CLUSTER_EXISTS=$(az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME --query "name" -o tsv 2>/dev/null) + + if [ "$CLUSTER_EXISTS" == "$CLUSTER_NAME" ]; then + log "✅ Cluster $CLUSTER_NAME already exists, skipping infrastructure setup" + setup_kubeconfig + else + log "Creating new cluster and infrastructure..." + create_resource_group + create_cluster + install_azure_csi_drivers + configure_load_balancer + install_cert_manager + create_storage_class + fi +} + +# Helper function to set up kubeconfig +setup_kubeconfig() { + # Verify cluster exists + if ! az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME >/dev/null 2>&1; then + error "Cluster $CLUSTER_NAME not found. Create cluster first." + fi + + # Get cluster credentials + log "Getting cluster credentials..." + az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME --overwrite-existing + + # Handle WSL case + if grep -qi microsoft /proc/version 2>/dev/null; then + log "Detected WSL environment, copying kubeconfig from Windows to WSL..." 
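+        # Assumes the Windows user name matches the WSL user name reported by whoami; adjust the path if they differ.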
+ WIN_KUBE_CONFIG="/mnt/c/Users/$(whoami)/.kube/config" + if [ -f "$WIN_KUBE_CONFIG" ]; then + mkdir -p ~/.kube + cp "$WIN_KUBE_CONFIG" ~/.kube/config + chmod 600 ~/.kube/config + log "Kubeconfig copied to WSL" + fi + fi + + success "Cluster credentials configured" +} + +# Run main function +main "$@" \ No newline at end of file diff --git a/documentdb-playground/telemetry/scripts/delete-cluster.sh b/documentdb-playground/telemetry/scripts/delete-cluster.sh new file mode 100755 index 00000000..72cdd379 --- /dev/null +++ b/documentdb-playground/telemetry/scripts/delete-cluster.sh @@ -0,0 +1,407 @@ +#!/bin/bash + +# DocumentDB AKS Cluster Deletion Script +# This script comprehensively deletes the AKS cluster and all associated Azure resources + +set -e # Exit on any error + +# Configuration (should match create-cluster.sh) +CLUSTER_NAME="ray-ddb-cluster" +RESOURCE_GROUP="ray-documentdb-rg" +LOCATION="West US 2" + +# Deletion scope flags +DELETE_INSTANCE="${DELETE_INSTANCE:-false}" +DELETE_OPERATOR="${DELETE_OPERATOR:-false}" +DELETE_CLUSTER="${DELETE_CLUSTER:-false}" +DELETE_ALL="${DELETE_ALL:-false}" + +# Parse command line arguments +while [[ $# -gt 0 ]]; do + case $1 in + --cluster-name) + CLUSTER_NAME="$2" + shift 2 + ;; + --resource-group) + RESOURCE_GROUP="$2" + shift 2 + ;; + --delete-instance) + DELETE_INSTANCE="true" + shift + ;; + --delete-operator) + DELETE_OPERATOR="true" + shift + ;; + --delete-cluster) + DELETE_CLUSTER="true" + shift + ;; + --delete-all) + DELETE_ALL="true" + DELETE_INSTANCE="true" + DELETE_OPERATOR="true" + DELETE_CLUSTER="true" + shift + ;; + --force) + FORCE_DELETE="true" + shift + ;; + -h|--help) + echo "Usage: $0 [OPTIONS]" + echo "" + echo "Options:" + echo " --delete-instance Delete DocumentDB instance only" + echo " --delete-operator Delete DocumentDB operator only" + echo " --delete-cluster Delete AKS cluster only" + echo " --delete-all Delete everything (instance + operator + cluster)" + echo " --cluster-name NAME AKS cluster name (default: ray-ddb-cluster)" + echo " --resource-group RG Azure resource group (default: ray-documentdb-rg)" + echo " --force Skip confirmation prompts" + echo " -h, --help Show this help message" + echo "" + echo "Examples:" + echo " $0 --delete-instance # Delete DocumentDB instance only" + echo " $0 --delete-operator # Delete operator only" + echo " $0 --delete-cluster # Delete cluster only" + echo " $0 --delete-all # Delete everything" + exit 0 + ;; + *) + echo "Unknown option: $1" + exit 1 + ;; + esac +done + +# Colors for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' # No Color + +# Logging function +log() { + echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1" +} + +success() { + echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')] ✅ $1${NC}" +} + +warn() { + echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] ⚠️ $1${NC}" +} + +error() { + echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ❌ $1${NC}" +} + +# Check prerequisites +check_prerequisites() { + log "Checking prerequisites..." + + # Check Azure CLI + if ! command -v az &> /dev/null; then + error "Azure CLI not found. Cannot proceed with deletion." + exit 1 + fi + + # Check kubectl + if ! command -v kubectl &> /dev/null; then + warn "kubectl not found. Some cleanup steps may be skipped." + fi + + # Check Azure login + if ! az account show &> /dev/null; then + error "Not logged into Azure. Please run 'az login' first." 
+ exit 1 + fi + + success "Prerequisites met" +} + +# Confirmation prompt +confirm_deletion() { + if [ "$FORCE_DELETE" == "true" ]; then + return 0 + fi + + echo "" + echo "⚠️ WARNING: This will permanently delete the following resources:" + + if [ "$DELETE_INSTANCE" == "true" ]; then + echo " - DocumentDB instances and namespaces" + fi + + if [ "$DELETE_OPERATOR" == "true" ]; then + echo " - DocumentDB operator" + fi + + if [ "$DELETE_CLUSTER" == "true" ]; then + echo " - AKS Cluster: $CLUSTER_NAME" + echo " - Resource Group: $RESOURCE_GROUP (and ALL resources within it)" + echo " - All associated Azure resources (LoadBalancers, Disks, Network Security Groups, etc.)" + echo "" + echo "💰 This action will stop all Azure charges for these resources." + fi + + echo "" + read -p "Are you sure you want to proceed? Type 'yes' to confirm: " confirmation + + if [ "$confirmation" != "yes" ]; then + echo "Deletion cancelled." + exit 0 + fi +} + +# Delete DocumentDB instances (legacy single-tenant only) +delete_documentdb_instances() { + log "Deleting legacy DocumentDB instances..." + + if command -v kubectl &> /dev/null && kubectl cluster-info &> /dev/null; then + # Delete legacy DocumentDB instances (single-tenant setup) + kubectl delete documentdb --all -n documentdb-instance-ns --ignore-not-found=true || warn "No legacy DocumentDB instances found" + + # Delete legacy DocumentDB namespace + kubectl delete namespace documentdb-instance-ns --ignore-not-found=true || warn "Legacy DocumentDB namespace not found" + + warn "⚠️ For multi-tenant DocumentDB cleanup, use: ./delete-multi-tenant-telemetry.sh" + success "Legacy DocumentDB instances cleanup completed" + else + warn "kubectl not available or cluster not accessible. Skipping DocumentDB cleanup." + fi +} + +# Delete DocumentDB operator +delete_documentdb_operator() { + log "Deleting DocumentDB operator..." + + if command -v kubectl &> /dev/null && kubectl cluster-info &> /dev/null; then + # Delete operator using Helm if available + if command -v helm &> /dev/null; then + helm uninstall documentdb-operator -n documentdb-operator --ignore-not-found 2>/dev/null || warn "DocumentDB operator Helm release not found" + fi + + # Delete operator namespace + kubectl delete namespace documentdb-operator --ignore-not-found=true || warn "Failed to delete DocumentDB operator namespace" + + success "DocumentDB operator deleted" + else + warn "kubectl not available or cluster not accessible. Skipping operator cleanup." + fi +} + +# Delete cert-manager +delete_cert_manager() { + log "Deleting cert-manager..." + + if command -v kubectl &> /dev/null && kubectl cluster-info &> /dev/null && command -v helm &> /dev/null; then + helm uninstall cert-manager -n cert-manager --ignore-not-found 2>/dev/null || warn "cert-manager Helm release not found" + kubectl delete namespace cert-manager --ignore-not-found=true || warn "Failed to delete cert-manager namespace" + success "cert-manager deleted" + else + warn "kubectl or helm not available. Skipping cert-manager cleanup." + fi +} + +# Delete Load Balancer services +delete_load_balancer_services() { + log "Deleting LoadBalancer services..." 
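+    # Removing Service objects of type LoadBalancer first prompts Azure to release the
+    # associated public IPs and load-balancer rules before the cluster itself is deleted.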
+ + if command -v kubectl &> /dev/null && kubectl cluster-info &> /dev/null; then + # Delete all LoadBalancer services to trigger Azure LoadBalancer cleanup + kubectl get services --all-namespaces -o json | \ + jq -r '.items[] | select(.spec.type=="LoadBalancer") | "\(.metadata.namespace) \(.metadata.name)"' | \ + while read namespace name; do + if [ -n "$namespace" ] && [ -n "$name" ]; then + log "Deleting LoadBalancer service: $name in namespace: $namespace" + kubectl delete service "$name" -n "$namespace" --ignore-not-found=true || warn "Failed to delete service $name" + fi + done 2>/dev/null || warn "Failed to query LoadBalancer services" + + # Wait a moment for Azure to process the deletions + log "Waiting for Azure LoadBalancer cleanup..." + sleep 30 + + success "LoadBalancer services deleted" + else + warn "kubectl not available. Skipping LoadBalancer service cleanup." + fi +} + +# Delete AKS cluster +delete_aks_cluster() { + log "Deleting AKS cluster: $CLUSTER_NAME" + + # Check if cluster exists + if ! az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME &> /dev/null; then + warn "AKS cluster $CLUSTER_NAME not found. Skipping cluster deletion." + return 0 + fi + + # Delete the AKS cluster + log "This may take 10-15 minutes..." + az aks delete --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME --yes --no-wait + + # Wait for deletion to complete + log "Waiting for AKS cluster deletion to complete..." + while az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME &> /dev/null; do + log "Cluster still exists, waiting..." + sleep 30 + done + + success "AKS cluster deleted" +} + +# Delete resource group and all resources +delete_resource_group() { + log "Deleting resource group: $RESOURCE_GROUP" + + # Check if resource group exists + if ! az group show --name $RESOURCE_GROUP &> /dev/null; then + warn "Resource group $RESOURCE_GROUP not found. Skipping resource group deletion." + return 0 + fi + + # Delete the entire resource group (this removes all resources within it) + log "This may take 10-20 minutes..." + az group delete --name $RESOURCE_GROUP --yes --no-wait + + # Wait for deletion to complete + log "Waiting for resource group deletion to complete..." + while az group show --name $RESOURCE_GROUP &> /dev/null; do + log "Resource group still exists, waiting..." + sleep 60 + done + + success "Resource group deleted" +} + +# Clean up local kubectl context +cleanup_kubectl_context() { + log "Cleaning up local kubectl context..." + + if command -v kubectl &> /dev/null; then + # Remove the cluster context + kubectl config delete-context "$CLUSTER_NAME" 2>/dev/null || warn "kubectl context not found" + kubectl config delete-cluster "$CLUSTER_NAME" 2>/dev/null || warn "kubectl cluster config not found" + kubectl config unset "users.clusterUser_${RESOURCE_GROUP}_${CLUSTER_NAME}" 2>/dev/null || warn "kubectl user config not found" + + success "kubectl context cleaned up" + else + warn "kubectl not available. Skipping kubectl context cleanup." + fi +} + +# Verify cleanup +verify_cleanup() { + log "Verifying cleanup..." + + # Check if resource group still exists + if az group show --name $RESOURCE_GROUP &> /dev/null; then + error "Resource group $RESOURCE_GROUP still exists. Manual cleanup may be required." 
+ return 1 + fi + + success "✅ All Azure resources have been successfully deleted" + success "✅ No Azure charges should be incurred for these resources" +} + +# Print summary +print_summary() { + echo "" + echo "==================================================" + echo "🗑️ SELECTIVE DELETION COMPLETE!" + echo "==================================================" + echo "Deleted Resources:" + + if [ "$DELETE_INSTANCE" == "true" ]; then + echo " - DocumentDB instances and namespaces" + fi + + if [ "$DELETE_OPERATOR" == "true" ]; then + echo " - DocumentDB operator" + fi + + if [ "$DELETE_CLUSTER" == "true" ]; then + echo " - AKS Cluster: $CLUSTER_NAME" + echo " - Resource Group: $RESOURCE_GROUP" + echo " - All associated Azure resources" + fi + + echo "" + echo "✅ Cleanup completed successfully" + + if [ "$DELETE_CLUSTER" == "true" ]; then + echo "✅ All Azure charges for these resources have been stopped" + echo "" + echo "💡 If you need to recreate the cluster:" + echo " ./create-cluster.sh --install-all" + else + echo "" + echo "💡 Next steps based on what's still running:" + if [ "$DELETE_INSTANCE" == "true" ] && [ "$DELETE_OPERATOR" == "false" ]; then + echo " - Deploy new instance: ./create-cluster.sh --deploy-instance" + fi + if [ "$DELETE_OPERATOR" == "true" ] && [ "$DELETE_CLUSTER" == "false" ]; then + echo " - Install operator: ./create-cluster.sh --install-operator" + echo " - Deploy instance: ./create-cluster.sh --deploy-instance" + fi + fi + echo "==================================================" +} + +# Main execution +main() { + log "Starting DocumentDB AKS selective deletion..." + log "Target cluster: $CLUSTER_NAME in resource group: $RESOURCE_GROUP" + log "Deletion scope:" + log " Instance: $DELETE_INSTANCE" + log " Operator: $DELETE_OPERATOR" + log " Cluster: $DELETE_CLUSTER" + echo "" + + # Check if any deletion flag is set + if [ "$DELETE_INSTANCE" != "true" ] && [ "$DELETE_OPERATOR" != "true" ] && [ "$DELETE_CLUSTER" != "true" ]; then + error "No deletion scope specified. Use --delete-instance, --delete-operator, --delete-cluster, or --delete-all" + exit 1 + fi + + # Execute deletion steps + check_prerequisites + confirm_deletion + + log "🗑️ Beginning selective deletion process..." + + # Selective deletion based on flags + if [ "$DELETE_INSTANCE" == "true" ]; then + delete_documentdb_instances + fi + + if [ "$DELETE_OPERATOR" == "true" ]; then + delete_documentdb_operator + fi + + if [ "$DELETE_CLUSTER" == "true" ]; then + delete_cert_manager + delete_load_balancer_services + delete_aks_cluster + delete_resource_group + cleanup_kubectl_context + verify_cleanup + fi + + # Show summary + print_summary +} + +# Handle script interruption +trap 'echo -e "\n${RED}Script interrupted. 
Some resources may not have been deleted.${NC}"; exit 1' INT + +# Run main function +main "$@" \ No newline at end of file diff --git a/documentdb-playground/telemetry/scripts/delete-multi-tenant-telemetry.sh b/documentdb-playground/telemetry/scripts/delete-multi-tenant-telemetry.sh new file mode 100755 index 00000000..80682c98 --- /dev/null +++ b/documentdb-playground/telemetry/scripts/delete-multi-tenant-telemetry.sh @@ -0,0 +1,378 @@ +#!/bin/bash + +# Multi-Tenant DocumentDB + Telemetry Cleanup Script +# This script removes all multi-tenant DocumentDB applications and monitoring stack + +set -e + +# Configuration +TEAMS=("sales" "accounts") +NAMESPACES=("sales-namespace" "accounts-namespace") + +# Cleanup scope flags +DELETE_DOCUMENTDB="${DELETE_DOCUMENTDB:-false}" +DELETE_COLLECTORS="${DELETE_COLLECTORS:-false}" +DELETE_MONITORING="${DELETE_MONITORING:-false}" +DELETE_NAMESPACES="${DELETE_NAMESPACES:-false}" +DELETE_ALL="${DELETE_ALL:-false}" + +# Parse command line arguments +while [[ $# -gt 0 ]]; do + case $1 in + --delete-documentdb) + DELETE_DOCUMENTDB="true" + shift + ;; + --delete-collectors) + DELETE_COLLECTORS="true" + shift + ;; + --delete-monitoring) + DELETE_MONITORING="true" + shift + ;; + --delete-namespaces) + DELETE_NAMESPACES="true" + shift + ;; + --delete-all) + DELETE_ALL="true" + DELETE_DOCUMENTDB="true" + DELETE_COLLECTORS="true" + DELETE_MONITORING="true" + DELETE_NAMESPACES="true" + shift + ;; + --force) + FORCE_DELETE="true" + shift + ;; + -h|--help) + echo "Usage: $0 [OPTIONS]" + echo "" + echo "Multi-tenant DocumentDB and telemetry cleanup script" + echo "" + echo "Options:" + echo " --delete-documentdb Delete DocumentDB clusters only" + echo " --delete-collectors Delete OpenTelemetry collectors only" + echo " --delete-monitoring Delete Prometheus/Grafana monitoring only" + echo " --delete-namespaces Delete team namespaces (includes all above)" + echo " --delete-all Delete everything (DocumentDB + collectors + monitoring + namespaces)" + echo " --force Skip confirmation prompts" + echo " -h, --help Show this help message" + echo "" + echo "Examples:" + echo " $0 --delete-all # Remove everything" + echo " $0 --delete-documentdb # Remove only DocumentDB clusters" + echo " $0 --delete-monitoring # Remove only Prometheus/Grafana" + echo " $0 --delete-all --force # Remove everything without confirmation" + echo "" + echo "Affected namespaces: ${NAMESPACES[*]}" + exit 0 + ;; + *) + echo "Unknown option: $1" + echo "Use --help for usage information" + exit 1 + ;; + esac +done + +# Colors for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' # No Color + +# Logging functions +log() { + echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1" +} + +success() { + echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')] ✅ $1${NC}" +} + +warn() { + echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] ⚠️ $1${NC}" +} + +error() { + echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ❌ $1${NC}" + exit 1 +} + +# Check prerequisites +check_prerequisites() { + log "Checking prerequisites..." + + # Check kubectl + if ! command -v kubectl &> /dev/null; then + error "kubectl not found. Cannot proceed with cleanup." + fi + + # Check cluster access + if ! kubectl cluster-info &> /dev/null; then + error "Cannot access Kubernetes cluster. Please check your kubectl configuration." + fi + + # Check Helm + if ! command -v helm &> /dev/null; then + warn "Helm not found. Some monitoring cleanup may require manual intervention." 
+ fi + + success "Prerequisites met" +} + +# Confirmation prompt +confirm_deletion() { + if [ "$FORCE_DELETE" == "true" ]; then + return 0 + fi + + echo "" + echo "⚠️ WARNING: This will permanently delete the following multi-tenant resources:" + echo "" + + if [ "$DELETE_DOCUMENTDB" == "true" ] || [ "$DELETE_ALL" == "true" ]; then + echo "📦 DocumentDB Clusters:" + for team in "${TEAMS[@]}"; do + echo " - documentdb-$team (in ${team}-namespace)" + done + fi + + if [ "$DELETE_COLLECTORS" == "true" ] || [ "$DELETE_ALL" == "true" ]; then + echo "🔧 OpenTelemetry Collectors:" + for team in "${TEAMS[@]}"; do + echo " - documentdb-${team}-collector (in ${team}-namespace)" + done + fi + + if [ "$DELETE_MONITORING" == "true" ] || [ "$DELETE_ALL" == "true" ]; then + echo "📊 Monitoring Stacks:" + for team in "${TEAMS[@]}"; do + echo " - prometheus-$team (Helm release)" + echo " - grafana-$team (Helm release)" + done + fi + + if [ "$DELETE_NAMESPACES" == "true" ] || [ "$DELETE_ALL" == "true" ]; then + echo "🏠 Namespaces:" + for ns in "${NAMESPACES[@]}"; do + echo " - $ns (and ALL resources within it)" + done + fi + + echo "" + echo "💡 This will NOT affect:" + echo " - AKS cluster infrastructure" + echo " - DocumentDB operator" + echo " - OpenTelemetry operator" + echo " - Other namespaces" + echo "" + + read -p "Are you sure you want to proceed? (yes/no): " -r + if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then + log "Operation cancelled by user" + exit 0 + fi +} + +# Delete DocumentDB clusters +delete_documentdb_clusters() { + log "Deleting DocumentDB clusters..." + + for i in "${!TEAMS[@]}"; do + team="${TEAMS[$i]}" + namespace="${NAMESPACES[$i]}" + + log "Deleting DocumentDB cluster for team: $team" + + # Delete DocumentDB cluster + kubectl delete documentdb documentdb-$team -n $namespace --ignore-not-found=true || warn "DocumentDB cluster for $team not found or failed to delete" + + # Wait for cluster to be fully deleted + log "Waiting for DocumentDB cluster $team to be fully deleted..." + timeout=120 + while kubectl get documentdb documentdb-$team -n $namespace &> /dev/null && [ $timeout -gt 0 ]; do + echo -n "." + sleep 2 + timeout=$((timeout - 2)) + done + echo "" + + if [ $timeout -le 0 ]; then + warn "Timeout waiting for DocumentDB cluster $team to be deleted" + else + success "DocumentDB cluster $team deleted successfully" + fi + + # Delete secrets and configmaps + kubectl delete secret documentdb-credentials -n $namespace --ignore-not-found=true || true + kubectl delete configmap --all -n $namespace --ignore-not-found=true || true + done + + success "DocumentDB clusters cleanup completed" +} + +# Delete OpenTelemetry collectors +delete_otel_collectors() { + log "Deleting OpenTelemetry collectors..." + + for i in "${!TEAMS[@]}"; do + team="${TEAMS[$i]}" + namespace="${NAMESPACES[$i]}" + + log "Deleting OpenTelemetry collector for team: $team" + + # Delete OpenTelemetry collector + kubectl delete otelcol documentdb-${team}-collector -n $namespace --ignore-not-found=true || warn "OpenTelemetry collector for $team not found" + + # Delete collector service account and RBAC + kubectl delete serviceaccount otel-collector-$team -n $namespace --ignore-not-found=true || true + kubectl delete clusterrolebinding otel-collector-$team --ignore-not-found=true || true + done + + success "OpenTelemetry collectors cleanup completed" +} + +# Delete monitoring stack (Prometheus & Grafana) +delete_monitoring_stack() { + log "Deleting monitoring stacks..." + + if ! 
command -v helm &> /dev/null; then + error "Helm is required to delete monitoring stack. Please install Helm or delete manually." + fi + + for team in "${TEAMS[@]}"; do + namespace="${team}-namespace" + + log "Deleting monitoring stack for team: $team" + + # Delete Grafana + log "Deleting Grafana for $team..." + helm uninstall grafana-$team -n $namespace --ignore-not-found 2>/dev/null || warn "Grafana release for $team not found" + + # Delete Prometheus + log "Deleting Prometheus for $team..." + helm uninstall prometheus-$team -n $namespace --ignore-not-found 2>/dev/null || warn "Prometheus release for $team not found" + + # Wait for PVCs to be cleaned up (they may have finalizers) + log "Waiting for persistent volumes to be cleaned up..." + sleep 5 + + # Force delete any remaining PVCs if they exist + kubectl delete pvc --all -n $namespace --ignore-not-found=true || true + done + + success "Monitoring stacks cleanup completed" +} + +# Delete team namespaces +delete_team_namespaces() { + log "Deleting team namespaces..." + + for namespace in "${NAMESPACES[@]}"; do + log "Deleting namespace: $namespace" + + # Delete namespace (this will delete all resources within it) + kubectl delete namespace $namespace --ignore-not-found=true || warn "Failed to delete namespace $namespace" + + # Wait for namespace to be fully deleted + log "Waiting for namespace $namespace to be fully deleted..." + timeout=120 + while kubectl get namespace $namespace &> /dev/null && [ $timeout -gt 0 ]; do + echo -n "." + sleep 2 + timeout=$((timeout - 2)) + done + echo "" + + if [ $timeout -le 0 ]; then + warn "Timeout waiting for namespace $namespace to be deleted" + else + success "Namespace $namespace deleted successfully" + fi + done + + success "Team namespaces cleanup completed" +} + +# Clean up cluster-wide resources specific to multi-tenant setup +cleanup_cluster_resources() { + log "Cleaning up cluster-wide multi-tenant resources..." + + # Delete cluster roles and bindings for each team + for team in "${TEAMS[@]}"; do + kubectl delete clusterrole otel-collector-$team --ignore-not-found=true || true + kubectl delete clusterrolebinding otel-collector-$team --ignore-not-found=true || true + done + + success "Cluster-wide resources cleaned up" +} + +# Main execution function +main() { + log "Starting multi-tenant DocumentDB + telemetry cleanup..." + + check_prerequisites + + # If no specific flags are set, show help + if [ "$DELETE_DOCUMENTDB" != "true" ] && [ "$DELETE_COLLECTORS" != "true" ] && [ "$DELETE_MONITORING" != "true" ] && [ "$DELETE_NAMESPACES" != "true" ] && [ "$DELETE_ALL" != "true" ]; then + warn "No cleanup scope specified. Use --help to see available options." 
+ echo "" + echo "Quick options:" + echo " --delete-all Delete everything" + echo " --delete-documentdb Delete DocumentDB clusters only" + echo " --help Show full help" + exit 1 + fi + + confirm_deletion + + # Execute cleanup in proper order + if [ "$DELETE_DOCUMENTDB" == "true" ] || [ "$DELETE_ALL" == "true" ]; then + delete_documentdb_clusters + fi + + if [ "$DELETE_COLLECTORS" == "true" ] || [ "$DELETE_ALL" == "true" ]; then + delete_otel_collectors + fi + + if [ "$DELETE_MONITORING" == "true" ] || [ "$DELETE_ALL" == "true" ]; then + delete_monitoring_stack + fi + + if [ "$DELETE_NAMESPACES" == "true" ] || [ "$DELETE_ALL" == "true" ]; then + delete_team_namespaces + else + # Clean up cluster resources even if not deleting namespaces + cleanup_cluster_resources + fi + + # Summary + echo "" + echo "==================================================" + echo "🎉 MULTI-TENANT CLEANUP COMPLETE!" + echo "==================================================" + echo "" + echo "✅ Cleanup completed successfully" + echo "" + echo "💡 What was cleaned up:" + [ "$DELETE_DOCUMENTDB" == "true" ] || [ "$DELETE_ALL" == "true" ] && echo " - DocumentDB clusters for teams: ${TEAMS[*]}" + [ "$DELETE_COLLECTORS" == "true" ] || [ "$DELETE_ALL" == "true" ] && echo " - OpenTelemetry collectors for teams: ${TEAMS[*]}" + [ "$DELETE_MONITORING" == "true" ] || [ "$DELETE_ALL" == "true" ] && echo " - Prometheus/Grafana monitoring stacks" + [ "$DELETE_NAMESPACES" == "true" ] || [ "$DELETE_ALL" == "true" ] && echo " - Team namespaces: ${NAMESPACES[*]}" + echo "" + echo "🏗️ Infrastructure still available:" + echo " - AKS cluster (use delete-cluster.sh to remove)" + echo " - DocumentDB operator" + echo " - OpenTelemetry operator" + echo "" + echo "🚀 Ready for new multi-tenant deployments!" 
+ echo " Use: ./deploy-multi-tenant-telemetry.sh" +} + +# Run main function +main "$@" \ No newline at end of file diff --git a/documentdb-playground/telemetry/scripts/deploy-multi-tenant-telemetry.sh b/documentdb-playground/telemetry/scripts/deploy-multi-tenant-telemetry.sh new file mode 100755 index 00000000..ccfdce3a --- /dev/null +++ b/documentdb-playground/telemetry/scripts/deploy-multi-tenant-telemetry.sh @@ -0,0 +1,551 @@ +#!/bin/bash + +# Multi-Tenant DocumentDB + Telemetry Deployment Script +# This script deploys complete DocumentDB clusters with isolated monitoring stacks for different teams + +set -e + +# Configuration +SALES_NAMESPACE="sales-namespace" +ACCOUNTS_NAMESPACE="accounts-namespace" +TELEMETRY_NAMESPACE="documentdb-telemetry" + +# Deployment options +DEPLOY_DOCUMENTDB=true +DEPLOY_TELEMETRY=true +SKIP_WAIT=false + +# Parse command line arguments +usage() { + echo "Usage: $0 [OPTIONS]" + echo "" + echo "Options:" + echo " --telemetry-only Deploy only telemetry stack (skip DocumentDB)" + echo " --documentdb-only Deploy only DocumentDB (skip telemetry)" + echo " --skip-wait Skip waiting for deployments to be ready" + echo " --help Show this help message" + echo "" + echo "Examples:" + echo " $0 # Deploy everything (DocumentDB + Telemetry)" + echo " $0 --telemetry-only # Deploy only collectors, Prometheus, Grafana" + echo " $0 --documentdb-only # Deploy only DocumentDB clusters" +} + +while [[ $# -gt 0 ]]; do + case $1 in + --telemetry-only) + DEPLOY_DOCUMENTDB=false + shift + ;; + --documentdb-only) + DEPLOY_TELEMETRY=false + shift + ;; + --skip-wait) + SKIP_WAIT=true + shift + ;; + --help) + usage + exit 0 + ;; + *) + error "Unknown option: $1" + usage + exit 1 + ;; + esac +done + +# Colors for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' # No Color + +log() { + echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1" +} + +success() { + echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')] ✅${NC} $1" +} + +warn() { + echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] ⚠️${NC} $1" +} + +error() { + echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ❌${NC} $1" + exit 1 +} + +# Check if OpenTelemetry Operator is installed +check_prerequisites() { + log "Checking prerequisites..." + + if ! kubectl get namespace opentelemetry-operator-system > /dev/null 2>&1; then + error "OpenTelemetry Operator is not installed. Please install it first." + fi + + if ! helm version > /dev/null 2>&1; then + error "Helm is not installed. Please install Helm first." + fi + + # Add Prometheus Helm repo if not already added + if ! helm repo list | grep -q prometheus-community; then + log "Adding Prometheus Helm repository..." + helm repo add prometheus-community https://prometheus-community.github.io/helm-charts + helm repo update + fi + + # Add Grafana Helm repo if not already added + if ! helm repo list | grep -q grafana; then + log "Adding Grafana Helm repository..." + helm repo add grafana https://grafana.github.io/helm-charts + helm repo update + fi + + success "Prerequisites check completed" +} + +# Create namespaces for teams +create_namespaces() { + log "Creating team namespaces..." + + # Sales namespace + if ! kubectl get namespace $SALES_NAMESPACE > /dev/null 2>&1; then + kubectl create namespace $SALES_NAMESPACE + kubectl label namespace $SALES_NAMESPACE team=sales + success "Created sales namespace: $SALES_NAMESPACE" + else + log "Sales namespace already exists: $SALES_NAMESPACE" + fi + + # Accounts namespace + if ! 
kubectl get namespace $ACCOUNTS_NAMESPACE > /dev/null 2>&1; then + kubectl create namespace $ACCOUNTS_NAMESPACE + kubectl label namespace $ACCOUNTS_NAMESPACE team=accounts + success "Created accounts namespace: $ACCOUNTS_NAMESPACE" + else + log "Accounts namespace already exists: $ACCOUNTS_NAMESPACE" + fi +} + +# Deploy Prometheus for a namespace +deploy_prometheus() { + local namespace=$1 + local team=$2 + + log "Deploying Prometheus for $team team in namespace: $namespace" + + helm upgrade --install prometheus-$team prometheus-community/prometheus \ + --namespace $namespace \ + --set server.persistentVolume.size=10Gi \ + --set server.retention=15d \ + --set server.global.scrape_interval=15s \ + --set server.global.evaluation_interval=15s \ + --set alertmanager.enabled=false \ + --set prometheus-node-exporter.enabled=false \ + --set prometheus-pushgateway.enabled=false \ + --set kube-state-metrics.enabled=false \ + --set server.service.type=ClusterIP \ + --set server.ingress.enabled=false \ + --wait --timeout=300s + + success "Prometheus deployed for $team team" +} + +# Deploy Grafana for a namespace +deploy_grafana() { + local namespace=$1 + local team=$2 + local prometheus_url="http://prometheus-$team-server.$namespace.svc.cluster.local" + + log "Deploying Grafana for $team team in namespace: $namespace" + + # Create Grafana values for this team + cat > /tmp/grafana-$team-values.yaml < /dev/null && pwd )" + TELEMETRY_DIR="$(dirname "$SCRIPT_DIR")" + + # Deploy Sales collector + if [ -f "$TELEMETRY_DIR/otel-collector-sales.yaml" ]; then + log "Deploying Sales team OpenTelemetry Collector..." + kubectl apply -f "$TELEMETRY_DIR/otel-collector-sales.yaml" + success "Sales collector deployed" + else + error "Sales collector configuration not found: $TELEMETRY_DIR/otel-collector-sales.yaml" + fi + + # Deploy Accounts collector + if [ -f "$TELEMETRY_DIR/otel-collector-accounts.yaml" ]; then + log "Deploying Accounts team OpenTelemetry Collector..." + kubectl apply -f "$TELEMETRY_DIR/otel-collector-accounts.yaml" + success "Accounts collector deployed" + else + error "Accounts collector configuration not found: $TELEMETRY_DIR/otel-collector-accounts.yaml" + fi +} + +# Deploy monitoring stack for each team +deploy_monitoring_stacks() { + log "Deploying monitoring stacks for each team..." + + # Deploy Sales monitoring stack + deploy_prometheus $SALES_NAMESPACE "sales" + deploy_grafana $SALES_NAMESPACE "sales" + + # Deploy Accounts monitoring stack + deploy_prometheus $ACCOUNTS_NAMESPACE "accounts" + deploy_grafana $ACCOUNTS_NAMESPACE "accounts" + + success "All monitoring stacks deployed" +} + +# Deploy DocumentDB instance for a team +deploy_documentdb() { + local namespace=$1 + local team=$2 + local cluster_name="documentdb-$team" + + log "Deploying DocumentDB cluster for $team team in namespace: $namespace" + + # Create DocumentDB credentials secret (must be named 'documentdb-credentials') + cat > /tmp/documentdb-$team-secret.yaml < /tmp/documentdb-$team-cluster.yaml < /dev/null; then + echo "Error: Cannot connect to Grafana at $grafana_url" + echo "Make sure port-forward is running: kubectl port-forward -n $NAMESPACE svc/grafana-$TEAM ${1}:3000" + return 1 + fi + + # Create the dashboard + response=$(curl -s -X POST \ + -H "Content-Type: application/json" \ + -u "$auth" \ + -d "$DASHBOARD_JSON" \ + "$grafana_url/api/dashboards/db") + + if echo "$response" | grep -q '"status":"success"'; then + dashboard_url=$(echo "$response" | jq -r '.url') + echo "✅ Dashboard created successfully!" 
+ echo "🔗 Access it at: $grafana_url$dashboard_url" + else + echo "❌ Error creating dashboard:" + echo "$response" | jq '.' + return 1 + fi +} + +# Create dashboards based on namespace +case $NAMESPACE in + "sales-namespace") + echo "Setting up Sales team dashboard..." + create_dashboard 3001 + ;; + "accounts-namespace") + echo "Setting up Accounts team dashboard..." + create_dashboard 3002 + ;; + *) + echo "Unknown namespace: $NAMESPACE" + echo "Supported namespaces: sales-namespace, accounts-namespace" + exit 1 + ;; +esac + +echo "" +echo "Dashboard setup complete! 🎉" +echo "" +echo "To view your dashboard:" +echo "1. Open your browser to the URL shown above" +echo "2. Login with username: admin, password: admin123" +echo "3. The dashboard should be available in your dashboards list" +echo "" +echo "The dashboard includes:" +echo "- CPU Usage by Container" +echo "- Memory Usage by Container" +echo "- Memory Usage Percentage" +echo "- Pod Count" +echo "" +echo "All metrics are filtered to show only workloads in the $NAMESPACE namespace." \ No newline at end of file diff --git a/documentdb-playground/telemetry/telemetry-design.md b/documentdb-playground/telemetry/telemetry-design.md new file mode 100644 index 00000000..567f3611 --- /dev/null +++ b/documentdb-playground/telemetry/telemetry-design.md @@ -0,0 +1,637 @@ +# DocumentDB Telemetry Architecture Design + +## Overview + +This document outlines the telemetry architecture for collecting CPU and memory metrics from DocumentDB instances running on Kubernetes and visualizing them through Grafana dashboards. + +## Current DocumentDB Architecture + +### Pod Structure +Each DocumentDB instance consists of: +- **1 Pod per instancePerNode** (currently limited to 1) +- **2 Containers per Pod**: + 1. **PostgreSQL Container**: The main DocumentDB engine (based on PostgreSQL with DocumentDB extensions) + 2. **Gateway Container**: DocumentDB gateway sidecar for MongoDB API compatibility + +### Deployment Flow +1. **Cluster Preparation**: Install dependencies (CloudNative-PG operator, storage classes, etc.) +2. **Operator Installation**: Deploy DocumentDB operator +3. 
**Instance Deployment**: Create DocumentDB custom resources + +## Proposed Telemetry Architecture + +### Architecture Decision: DaemonSet vs Sidecar + +**RECOMMENDED: DaemonSet Approach (One Collector Per Node)** + +For DocumentDB monitoring, we recommend **one OpenTelemetry Collector per node** (DaemonSet) rather than sidecar injection: + +#### **Why DaemonSet is Better for DocumentDB:** + +| Factor | DaemonSet (✅ Recommended) | Sidecar | +|--------|---------------------------|---------| +| **Resource Usage** | 50MB RAM per node | 50MB RAM per DocumentDB pod | +| **Node Metrics** | ✅ Full node visibility | ❌ No node-level metrics | +| **Scalability** | Linear with nodes | Linear with pods | +| **Management** | Simple (3-5 collectors) | Complex (10+ collectors) | +| **DocumentDB Context** | Perfect for current 1-pod-per-node | Overkill for current setup | + +#### **Resource Comparison Example:** +```yaml +# Scenario: 9 DocumentDB pods across 3 nodes (3 pods per node) +# instancesPerNode: 3 (maximum supported) + +# DaemonSet: 3 collectors total (1 per node) +Total Resources: 150MB RAM, 150m CPU + +# Sidecar: 9 collectors (1 per DocumentDB pod) +Total Resources: 450MB RAM, 450m CPU + +# DaemonSet saves: 67% resources +``` + +#### **When to Consider Sidecar:** +- High-cardinality custom application metrics +- Pod-specific configuration requirements +- Multi-tenant isolation needs +- Different metric collection intervals per pod + +#### **For DocumentDB Use Case:** +- ✅ **Infrastructure monitoring focus** (CPU, memory, I/O) +- ✅ **Node-level context important** (node resources affect DocumentDB performance) +- ✅ **Current architecture**: 1 pod per node, future support for up to 3 pods per node +- ✅ **Resource efficiency** critical for production deployments + +### Architecture Components + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Grafana Dashboard │ +│ (Visualization Layer) │ +└─────────────────────────┬───────────────────────────────────────┘ + │ +┌─────────────────────────┴───────────────────────────────────────┐ +│ Prometheus │ +│ (Metrics Storage) │ +└─────────────────────────┬───────────────────────────────────────┘ + │ +┌─────────────────────────┴───────────────────────────────────────┐ +│ OpenTelemetry Collector (DaemonSet) │ +│ (Unified Metrics Collection) │ +│ ┌─────────────────────────────────────────────────────────────┐│ +│ │ Receivers: ││ +│ │ • kubeletstats (cAdvisor + Node metrics) ││ +│ │ • k8s_cluster (Kube State Metrics) ││ +│ │ • prometheus (scraping endpoints) ││ +│ │ • filelog (container logs) ││ +│ └─────────────────────────────────────────────────────────────┘│ +│ ┌─────────────────────────────────────────────────────────────┐│ +│ │ Processors: ││ +│ │ • resource detection ││ +│ │ • attribute enhancement ││ +│ │ • metric filtering ││ +│ └─────────────────────────────────────────────────────────────┘│ +│ ┌─────────────────────────────────────────────────────────────┐│ +│ │ Exporters: ││ +│ │ • prometheusremotewrite ││ +│ └─────────────────────────────────────────────────────────────┘│ +└─────────────────────────┬───────────────────────────────────────┘ + │ +┌─────────────────────────┴───────────────────────────────────────┐ +│ Kubernetes Cluster │ +│ ┌─────────────────────────────────────────────────────────────┐│ +│ │ DocumentDB Pods ││ +│ │ ┌─────────────────┐ ┌─────────────────┐ ││ +│ │ │ PostgreSQL │ │ Gateway │ ││ +│ │ │ Container │ │ Container │ ││ +│ │ │ (DocumentDB) │ │ (MongoDB API) │ ││ +│ │ └─────────────────┘ 
└─────────────────┘ ││ +│ └─────────────────────────────────────────────────────────────┘│ +└─────────────────────────────────────────────────────────────────┘ +``` + +### 1. Metrics Collection Layer (OpenTelemetry Collector) + +The OpenTelemetry Collector runs as a DaemonSet on each node and provides unified collection of all metrics through various receivers: + +#### A. Kubelet Stats Receiver (Replaces cAdvisor + Node Exporter) +- **Source**: Kubelet's built-in metrics API +- **Container Metrics Collected**: + - CPU usage (cores, percentage) + - Memory usage (RSS, cache, swap) + - Memory limits and requests + - CPU limits and requests + - Network I/O + - Filesystem I/O +- **Node Metrics Collected**: + - Node CPU utilization + - Node memory utilization + - Node filesystem usage + - Node network statistics + +#### B. Kubernetes Cluster Receiver (Replaces Kube State Metrics) +- **Source**: Kubernetes API server +- **Metrics Collected**: + - Pod status and phases + - Container restart counts + - Resource requests and limits + - DocumentDB custom resource status + - Node status and conditions + +#### C. Prometheus Receiver (For Application Metrics) +- **Source**: Application metrics endpoints from DocumentDB containers +- **Use Case**: Custom DocumentDB application metrics +- **Future Enhancement**: Gateway container request metrics (Read/Write operations) + +#### D. OTLP Receiver (Optional Future Enhancement) +- **Source**: Direct OpenTelemetry instrumentation from applications +- **Use Case**: High-performance metrics collection from DocumentDB Gateway +- **Protocol**: Native OpenTelemetry Protocol (OTLP) + +#### OpenTelemetry Collector Configuration +```yaml +receivers: + kubeletstats: + collection_interval: 20s + auth_type: "serviceAccount" + endpoint: "https://${env:K8S_NODE_NAME}:10250" + insecure_skip_verify: true + metric_groups: + - container + - pod + - node + - volume + metrics: + k8s.container.cpu_limit: + enabled: true + k8s.container.cpu_request: + enabled: true + k8s.container.memory_limit: + enabled: true + k8s.container.memory_request: + enabled: true + + k8s_cluster: + auth_type: serviceAccount + node: ${env:K8S_NODE_NAME} + metadata_exporters: [prometheus] + + # Application metrics from DocumentDB Gateway containers + prometheus/gateway: + config: + scrape_configs: + - job_name: 'documentdb-gateway' + kubernetes_sd_configs: + - role: pod + relabel_configs: + - source_labels: [__meta_kubernetes_pod_label_app] + regex: 'documentdb.*' + action: keep + - source_labels: [__meta_kubernetes_pod_container_name] + regex: 'gateway' + action: keep + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] + action: keep + regex: true + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] + action: replace + target_label: __metrics_path__ + regex: (.+) + - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port] + action: replace + regex: ([^:]+)(?::\d+)?;(\d+) + replacement: $1:$2 + target_label: __address__ + + # Future: Native OTLP for high-performance metrics + otlp: + protocols: + grpc: + endpoint: 0.0.0.0:4317 + http: + endpoint: 0.0.0.0:4318 + +processors: + resourcedetection: + detectors: [env, k8snode, kubernetes] + timeout: 2s + override: false + + attributes/documentdb: + actions: + - key: documentdb.instance + from_attribute: k8s.pod.label.app + action: insert + - key: documentdb.component + from_attribute: k8s.container.name + action: insert + - key: documentdb.operation_type + from_attribute: operation + action: 
insert + + filter/documentdb: + metrics: + include: + match_type: regexp + resource_attributes: + - key: k8s.pod.label.app + value: "documentdb.*" + +exporters: + prometheusremotewrite: + endpoint: "http://prometheus:9090/api/v1/write" + tls: + insecure: true + +service: + pipelines: + metrics: + receivers: [kubeletstats, k8s_cluster, prometheus/gateway, otlp] + processors: [resourcedetection, attributes/documentdb, filter/documentdb] + exporters: [prometheusremotewrite] +``` + +### 2. Metrics Storage Layer + +#### Prometheus Configuration (Simplified) +Since OpenTelemetry Collector handles all metric collection and forwarding, Prometheus configuration is simplified: + +```yaml +# Prometheus receives metrics via remote write from OpenTelemetry Collector +global: + scrape_interval: 15s + evaluation_interval: 15s + +# OpenTelemetry Collector pushes metrics here +remote_write_configs: [] # Not needed as OTel pushes via API + +# Optional: Direct scraping of Prometheus metrics from OTel Collector itself +scrape_configs: + - job_name: 'otel-collector' + static_configs: + - targets: ['otel-collector:8888'] # OTel Collector's own metrics +``` + +### 3. Visualization Layer + +#### Grafana Dashboard Structure + +##### Panel 1: DocumentDB Instance Overview +- **Metrics**: + - Total number of DocumentDB instances + - Instance health status + - Pod restarts in last 24h + +##### Panel 2: CPU Metrics +- **PostgreSQL Container CPU**: + - `rate(k8s_container_cpu_time{k8s_container_name="postgres",k8s_pod_label_app=~"documentdb.*"}[5m])` +- **Gateway Container CPU**: + - `rate(k8s_container_cpu_time{k8s_container_name="gateway",k8s_pod_label_app=~"documentdb.*"}[5m])` +- **CPU Utilization vs Limits**: + - `(rate(k8s_container_cpu_time[5m]) / k8s_container_cpu_limit) * 100` + +##### Panel 3: Memory Metrics +- **PostgreSQL Container Memory**: + - `k8s_container_memory_usage{k8s_container_name="postgres",k8s_pod_label_app=~"documentdb.*"}` +- **Gateway Container Memory**: + - `k8s_container_memory_usage{k8s_container_name="gateway",k8s_pod_label_app=~"documentdb.*"}` +- **Memory Utilization vs Limits**: + - `(k8s_container_memory_usage / k8s_container_memory_limit) * 100` + +##### Panel 4: Gateway Application Metrics (Future Enhancement) +- **Read Operations per Second**: + - `rate(documentdb_gateway_read_operations_total[5m])` +- **Write Operations per Second**: + - `rate(documentdb_gateway_write_operations_total[5m])` +- **Operation Latency**: + - `histogram_quantile(0.95, rate(documentdb_gateway_operation_duration_seconds_bucket[5m]))` +- **Error Rate**: + - `rate(documentdb_gateway_errors_total[5m]) / rate(documentdb_gateway_operations_total[5m]) * 100` + +##### Panel 5: Resource Efficiency +- **CPU Requests vs Usage** +- **Memory Requests vs Usage** +- **Resource waste indicators** + +## Application Metrics Integration (Future Enhancement) + +### Gateway Container Metrics + +When the DocumentDB Gateway container starts emitting application metrics, the DaemonSet architecture seamlessly supports this through multiple collection methods: + +#### Method 1: Prometheus Metrics Endpoint (Recommended) +```yaml +# Gateway container exposes metrics on /metrics endpoint +apiVersion: v1 +kind: Pod +metadata: + annotations: + prometheus.io/scrape: "true" + prometheus.io/port: "8080" + prometheus.io/path: "/metrics" +spec: + containers: + - name: gateway + image: ghcr.io/microsoft/documentdb/documentdb-local:16 + ports: + - containerPort: 8080 + name: metrics +``` + +#### Method 2: OTLP Direct Push (High Performance) 
+```yaml +# Gateway pushes metrics directly to OTel Collector +# No scraping needed, lower latency, higher throughput +environment: + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: "http://localhost:4317" # OTel Collector on same node + - name: OTEL_SERVICE_NAME + value: "documentdb-gateway" +``` + +### Expected Gateway Metrics + +#### Request Metrics +- `documentdb_gateway_requests_total{method, status}` - Total API requests +- `documentdb_gateway_request_duration_seconds` - Request latency histogram +- `documentdb_gateway_active_connections` - Current active connections + +#### Operation Metrics +- `documentdb_gateway_read_operations_total{database, collection}` - Read operations +- `documentdb_gateway_write_operations_total{database, collection}` - Write operations +- `documentdb_gateway_delete_operations_total{database, collection}` - Delete operations +- `documentdb_gateway_query_operations_total{database, collection}` - Query operations + +#### Performance Metrics +- `documentdb_gateway_operation_duration_seconds{operation_type}` - Operation latency +- `documentdb_gateway_cache_hits_total` - Cache hit rate +- `documentdb_gateway_cache_misses_total` - Cache miss rate +- `documentdb_gateway_connection_pool_size` - Connection pool metrics + +#### Error Metrics +- `documentdb_gateway_errors_total{error_type, operation}` - Error counts +- `documentdb_gateway_timeouts_total{operation}` - Timeout counts +- `documentdb_gateway_retries_total{operation}` - Retry attempts + +### DaemonSet Advantages for Application Metrics + +#### ✅ **Perfect Compatibility** +- **Prometheus scraping**: OTel Collector autodiscovers Gateway pods +- **OTLP push**: Gateway can push directly to collector on same node +- **Service discovery**: Automatic discovery of new DocumentDB instances +- **Label propagation**: Kubernetes labels automatically added to metrics + +#### ✅ **Network Efficiency** +- **Local collection**: Metrics collected on same node (low latency) +- **Reduced hops**: No cross-node network traffic for metrics +- **Batch processing**: Efficient batching before sending to Prometheus + +#### ✅ **Operational Benefits** +- **Single configuration**: Same collector handles infra + app metrics +- **Unified pipeline**: Infrastructure and application metrics in same flow +- **Consistent labeling**: Same resource detection and attribute processing +- **Simplified debugging**: One place to troubleshoot metrics collection + +### Updated Architecture with Application Metrics + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Grafana Dashboard │ +│ Infrastructure + Application Metrics │ +└─────────────────────────┬───────────────────────────────────────┘ + │ +┌─────────────────────────┴───────────────────────────────────────┐ +│ Prometheus │ +│ (Unified Storage) │ +└─────────────────────────┬───────────────────────────────────────┘ + │ +┌─────────────────────────┴───────────────────────────────────────┐ +│ OpenTelemetry Collector (DaemonSet) │ +│ (Unified Collection Agent) │ +│ ┌─────────────────────────────────────────────────────────────┐│ +│ │ Receivers: ││ +│ │ • kubeletstats (Infrastructure metrics) ││ +│ │ • k8s_cluster (Kubernetes metrics) ││ +│ │ • prometheus (Gateway /metrics scraping) ││ +│ │ • otlp (Gateway direct push) ← NEW ││ +│ └─────────────────────────────────────────────────────────────┘│ +└─────────────────────────┬───────────────────────────────────────┘ + │ +┌─────────────────────────┴───────────────────────────────────────┐ +│ DocumentDB Pods │ +│ ┌─────────────────┐ 
┌─────────────────┐ │ +│ │ PostgreSQL │ │ Gateway │ │ +│ │ Container │ │ Container │ │ +│ │ │ │ • /metrics ← NEW│ │ +│ │ │ │ • OTLP push ← NEW│ │ +│ └─────────────────┘ └─────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +## Implementation Plan + +### Phase 1: OpenTelemetry Collector Setup +1. **Deploy OpenTelemetry Operator** + ```bash + # Install OpenTelemetry Operator + kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml + ``` + +2. **Deploy OpenTelemetry Collector as DaemonSet** + ```yaml + apiVersion: opentelemetry.io/v1alpha1 + kind: OpenTelemetryCollector + metadata: + name: documentdb-metrics-collector + namespace: documentdb-telemetry + spec: + mode: daemonset + serviceAccount: otel-collector + config: | + # [OpenTelemetry configuration from above] + ``` + +3. **Deploy Prometheus (Simplified)** + ```bash + # Deploy Prometheus without Node Exporter or Kube State Metrics + helm install prometheus prometheus-community/prometheus \ + --namespace monitoring \ + --create-namespace \ + --set nodeExporter.enabled=false \ + --set kubeStateMetrics.enabled=false \ + --set server.persistentVolume.enabled=true + ``` + +### Phase 2: DocumentDB Application Metrics Integration +1. **Gateway Container Enhancement** + - Add metrics endpoint (`/metrics` on port 8080) + - Implement OpenTelemetry instrumentation + - Add prometheus annotations to pods + +2. **Collector Configuration Update** + ```yaml + # Add to existing OTel Collector config + receivers: + prometheus/gateway: + config: + scrape_configs: + - job_name: 'documentdb-gateway' + kubernetes_sd_configs: + - role: pod + ``` + +3. **Enhanced Dashboards** + - Add application metrics panels + - Create alerts for operation errors + - Add capacity planning metrics + +### Phase 3: Advanced Application Monitoring +1. **Create DocumentDB-specific Grafana dashboard** +2. **Implement custom metrics for DocumentDB operations** +3. 
**Add capacity planning metrics**
+
+## Configuration Examples
+
+### DocumentDB Pod Labels for Monitoring
+The DocumentDB operator should add these labels to pods for proper metric collection:
+
+```yaml
+metadata:
+  labels:
+    app.kubernetes.io/name: documentdb
+    app.kubernetes.io/instance: "{{ .Values.documentdb.name }}"
+    app.kubernetes.io/component: database
+    documentdb.microsoft.com/instance: "{{ .Values.documentdb.name }}"
+```
+
+### Prometheus Recording Rules (Updated for OpenTelemetry metrics)
+```yaml
+groups:
+  - name: documentdb.rules
+    rules:
+      - record: documentdb:cpu_usage_rate
+        expr: rate(k8s_container_cpu_time{k8s_container_name=~"postgres|gateway",k8s_pod_label_app=~"documentdb.*"}[5m])
+
+      - record: documentdb:memory_usage_bytes
+        expr: k8s_container_memory_usage{k8s_container_name=~"postgres|gateway",k8s_pod_label_app=~"documentdb.*"}
+
+      - record: documentdb:cpu_utilization_percent
+        expr: (documentdb:cpu_usage_rate / k8s_container_cpu_limit) * 100
+```
+
+### Alert Rules (Updated for OpenTelemetry metrics)
+```yaml
+groups:
+  - name: documentdb.alerts
+    rules:
+      - alert: DocumentDBHighCPUUsage
+        expr: documentdb:cpu_utilization_percent > 80
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "DocumentDB instance {{ $labels.k8s_pod_name }} has high CPU usage"
+          description: "CPU usage is above 80% for 5 minutes"
+
+      - alert: DocumentDBHighMemoryUsage
+        expr: (documentdb:memory_usage_bytes / k8s_container_memory_limit) * 100 > 85
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "DocumentDB instance {{ $labels.k8s_pod_name }} has high memory usage"
+          description: "Memory usage is above 85% for 5 minutes"
+```
+
+## Deployment Instructions
+
+### 1. Deploy OpenTelemetry Monitoring Stack
+```bash
+# Create telemetry namespace
+kubectl create namespace documentdb-telemetry
+
+# Deploy OpenTelemetry Operator
+kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
+
+# Deploy OpenTelemetry Collector
+kubectl apply -f documentdb-playground/telemetry/otel-collector.yaml
+
+# Deploy Prometheus (simplified, without Node Exporter or Kube State Metrics)
+helm install prometheus prometheus-community/prometheus \
+  --namespace documentdb-telemetry \
+  --set prometheus-node-exporter.enabled=false \
+  --set kube-state-metrics.enabled=false
+
+# Deploy Grafana
+helm install grafana grafana/grafana \
+  --namespace documentdb-telemetry
+```
+
+### 2. Configure DocumentDB for Monitoring
+Update the DocumentDB operator to include monitoring labels and annotations in the CNPG cluster specification.
+
+### 3. Import Grafana Dashboard
+Import the pre-built DocumentDB dashboard JSON into Grafana for immediate visualization.
+
+## Security Considerations
+
+1. **RBAC**: Ensure the OpenTelemetry Collector has only the minimal permissions required for Kubelet API access
+2. **Network Policies**: Restrict access to metrics endpoints and collector APIs (see the example policy below)
+3. **Data Retention**: Configure appropriate retention policies for metrics in Prometheus
+4. **Authentication**: Secure Grafana with proper authentication
+5. **Service Account**: Use a dedicated service account for the OpenTelemetry Collector with appropriately scoped cluster roles
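+
+For consideration 2, the sketch below shows one way to restrict ingress to the collector. It is an example only: the namespace, pod label selectors, and ports are assumptions based on the deployments described in this document and should be adjusted to match the actual manifests.
+
+```yaml
+# Example NetworkPolicy (assumed labels and ports): only Prometheus may scrape the
+# collector's own metrics (8888), and only in-cluster workloads may push OTLP
+# (4317/4318). Adjust selectors to the labels your collector pods actually carry.
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: otel-collector-ingress
+  namespace: documentdb-telemetry
+spec:
+  podSelector:
+    matchLabels:
+      app.kubernetes.io/component: opentelemetry-collector   # assumed operator-applied label
+  policyTypes:
+    - Ingress
+  ingress:
+    - from:
+        - podSelector:
+            matchLabels:
+              app.kubernetes.io/name: prometheus              # assumed Prometheus pod label
+      ports:
+        - protocol: TCP
+          port: 8888   # collector self-metrics
+    - from:
+        - namespaceSelector: {}                               # tighten to DocumentDB namespaces as needed
+      ports:
+        - protocol: TCP
+          port: 4317   # OTLP gRPC
+        - protocol: TCP
+          port: 4318   # OTLP HTTP
+```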
+
+## OpenTelemetry RBAC Configuration
+```yaml
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: otel-collector
+  namespace: documentdb-telemetry
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: otel-collector
+rules:
+- apiGroups: [""]
+  resources: ["nodes", "nodes/proxy", "nodes/metrics", "nodes/stats", "namespaces", "services", "endpoints", "pods"]
+  verbs: ["get", "list", "watch"]
+- apiGroups: ["apps"]
+  resources: ["daemonsets", "deployments", "replicasets", "statefulsets"]
+  verbs: ["get", "list", "watch"]
+- apiGroups: ["db.microsoft.com"]
+  resources: ["documentdbs"]
+  verbs: ["get", "list", "watch"]
+- nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
+  verbs: ["get"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: otel-collector
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: otel-collector
+subjects:
+- kind: ServiceAccount
+  name: otel-collector
+  namespace: documentdb-telemetry
+```
+
+## Monitoring Best Practices
+
+1. **Label Consistency**: Use consistent labeling across all DocumentDB resources
+2. **Metric Cardinality**: Avoid high-cardinality labels that could impact Prometheus performance
+3. **Alert Thresholds**: Set realistic thresholds based on workload patterns
+4. **Dashboard Organization**: Group related metrics and use consistent color schemes
+5. **Performance Impact**: Monitor the monitoring stack's own resource usage
+
+## Future Enhancements
+
+1. **Custom DocumentDB Metrics**: Implement DocumentDB-specific application metrics
+2. **Distributed Tracing**: Add OpenTelemetry tracing for request-level visibility
+3. **Log Aggregation**: Integrate with the ELK stack for log analysis
+4. **Capacity Planning**: Implement predictive analytics for resource planning
+5. **Multi-Cloud Support**: Extend monitoring to work across different cloud providers
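+
+## Verifying Metric Flow (Example)
+
+A quick way to confirm that DocumentDB container metrics are reaching Prometheus through the collector. This is a sketch: it assumes the Helm release name `prometheus` in the `documentdb-telemetry` namespace (so the chart's default service is `prometheus-server` on port 80) and the OpenTelemetry metric names used throughout this document; adjust to match the actual deployment.
+
+```bash
+# Port-forward the Prometheus server deployed in the instructions above (assumed service name/port)
+kubectl port-forward -n documentdb-telemetry svc/prometheus-server 9090:80 &
+PF_PID=$!
+sleep 3
+
+# Count series for a DocumentDB container metric collected via kubeletstats
+curl -s 'http://localhost:9090/api/v1/query' \
+  --data-urlencode 'query=k8s_container_memory_usage{k8s_container_name=~"postgres|gateway"}' \
+  | jq '.data.result | length'
+
+# A non-zero count means metrics are flowing from the collector's
+# prometheusremotewrite exporter into Prometheus.
+
+kill $PF_PID
+```
\ No newline at end of file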