
Create monitoring dashboards and alerting configuration #198

@johnproblems

Task: Create monitoring dashboards and alerting configuration

Description

Implement comprehensive production monitoring and alerting infrastructure for the Coolify Enterprise platform using Laravel, Grafana, Prometheus, and custom health check systems. This task establishes the observability layer that enables proactive incident detection, performance tracking, and operational insights across the entire multi-tenant enterprise deployment.

The Operational Visibility Challenge:

Operating a multi-tenant enterprise platform presents unique monitoring challenges:

  1. Multi-Tenant Complexity: Track metrics per organization, aggregate globally, detect anomalies
  2. Resource Monitoring: Monitor Terraform deployments, server capacity, queue health, cache performance
  3. Security Events: Track failed authentication, API rate limiting, suspicious activity
  4. Business Metrics: License usage, payment processing, subscription lifecycle events
  5. Performance SLAs: Response times, deployment durations, WebSocket latency
  6. Infrastructure Health: Database connections, Redis memory, disk space, Docker daemon status

Without comprehensive monitoring, production issues remain invisible until customers report them. Silent failures in background jobs, gradual performance degradation, and resource exhaustion can go undetected for hours or days. This task creates the early warning system that transforms reactive firefighting into proactive maintenance.

Solution Architecture:

The monitoring system integrates three complementary layers:

1. Application-Level Metrics (Laravel + Custom Services)

  • Health check endpoints exposing application state
  • Database query performance tracking
  • Job queue monitoring (Horizon integration)
  • Cache hit rates and Redis memory usage
  • Custom business metrics (deployments/hour, active licenses, etc.)

2. Infrastructure Monitoring (Prometheus + Node Exporter)

  • Server CPU, memory, disk, network metrics
  • Docker container statistics
  • PostgreSQL connection pool metrics
  • Redis memory and command statistics
  • Terraform execution tracking

3. Visualization & Alerting (Grafana + AlertManager)

  • Real-time dashboards for operations team
  • Organization-specific dashboards for customers
  • Alert rules with severity levels (info, warning, critical)
  • Multi-channel notifications (email, Slack, PagerDuty)
  • Historical trend analysis and capacity planning

Key Features:

  1. Production Dashboards (Grafana)

    • System Overview: Health, uptime, request rates, error rates
    • Resource Dashboard: CPU, memory, disk across all servers
    • Queue Dashboard: Job throughput, failure rates, queue depth
    • Terraform Dashboard: Active deployments, success rates, average duration
    • Organization Dashboard: Per-tenant resource usage and performance
    • Payment Dashboard: Transaction success rates, revenue metrics
  2. Health Check System (Laravel)

    • HTTP endpoint /health for load balancer health checks
    • Detailed diagnostics endpoint /health/detailed (authenticated)
    • Database connectivity and query performance checks
    • Redis connectivity and memory checks
    • Queue worker process verification
    • Terraform binary availability check
    • Cloud provider API connectivity check
    • Disk space and filesystem health check
  3. Alert Configuration (Prometheus AlertManager)

    • Critical: Database down, queue workers stopped, disk > 90% full
    • Warning: High error rate (> 1%), slow queries (> 1s), queue depth > 1000
    • Info: Deployment completed, license expiring soon, payment succeeded
    • Custom: Organization-specific SLA violations
    • On-call rotation with PagerDuty integration
    • Alert deduplication and grouping
  4. Custom Metrics Collection (Laravel Middleware + Jobs)

    • HTTP request duration histogram
    • API endpoint hit counts
    • Deployment success/failure rates
    • License validation latency
    • Payment processing success rates
    • WebSocket connection counts
    • Organization resource quota usage
  5. Log Aggregation (Optional - Preparation for ELK/Loki)

    • Structured logging with organization context
    • Error tracking with stack traces
    • Audit logging for security events
    • Performance logging for slow queries
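The "HTTP request duration histogram" in item 4 can be illustrated with a small, framework-free sketch. It assumes cumulative buckets (each bucket counts observations less than or equal to its upper bound), which is the layout Prometheus histograms use. The class name `DurationHistogram` and the bucket bounds are hypothetical, and a real collector would keep its counts in shared storage rather than in process memory.

```php
<?php

/**
 * Hypothetical in-memory histogram with cumulative buckets.
 * Illustration only: state is lost when the PHP process ends.
 */
final class DurationHistogram
{
    /** @var int[] counts parallel to $bounds */
    private array $counts;
    private float $sum = 0.0;
    private int $count = 0;

    /** @param float[] $bounds upper bounds in seconds, ascending (assumed values) */
    public function __construct(private array $bounds = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5])
    {
        $this->counts = array_fill(0, count($bounds), 0);
    }

    public function observe(float $seconds): void
    {
        // Cumulative: an observation increments every bucket it fits into.
        foreach ($this->bounds as $i => $bound) {
            if ($seconds <= $bound) {
                $this->counts[$i]++;
            }
        }
        $this->sum += $seconds;
        $this->count++;
    }

    /** Render the histogram in the Prometheus text exposition format. */
    public function render(string $name): string
    {
        $lines = ["# TYPE {$name} histogram"];
        foreach ($this->bounds as $i => $bound) {
            $lines[] = sprintf('%s_bucket{le="%s"} %d', $name, $bound, $this->counts[$i]);
        }
        // The +Inf bucket always equals the total observation count.
        $lines[] = sprintf('%s_bucket{le="+Inf"} %d', $name, $this->count);
        $lines[] = sprintf('%s_sum %s', $name, $this->sum);
        $lines[] = sprintf('%s_count %d', $name, $this->count);

        return implode("\n", $lines) . "\n";
    }
}
```

Recording durations in seconds rather than milliseconds follows the Prometheus naming convention and matches the `database_query_duration_seconds_bucket` metric used by the alert rules in this issue.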

Integration Points:

Existing Infrastructure:

  • Laravel Horizon: Queue monitoring built-in, expose metrics via Prometheus exporter
  • Laravel Telescope: Development debugging, disable in production but preserve logging patterns
  • Reverb WebSocket: Add connection count metrics
  • Existing Jobs: Add duration tracking to TerraformDeploymentJob, ResourceMonitoringJob, etc.

New Components:

  • HealthCheckService: Centralized health check logic
  • MetricsCollector: Custom Prometheus metric collection
  • AlertingService: Business event → alert mapping
  • GrafanaProvisioner: Automated dashboard deployment
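The "business event → alert mapping" responsibility of AlertingService is not specified further in this issue. One possible shape, as a sketch only: a lookup table from event names to the three severities listed above, and a second table from severity to notification channels. Every event name, the channel assignment per severity, and the `buildAlert` helper are assumptions made for illustration.

```php
<?php

// Hypothetical event names mapped to the severities used in this issue.
const ALERT_SEVERITIES = [
    'database.down'         => 'critical',
    'queue.workers_stopped' => 'critical',
    'http.error_rate_high'  => 'warning',
    'queue.depth_high'      => 'warning',
    'deployment.completed'  => 'info',
    'license.expiring_soon' => 'info',
];

// Assumed routing: the more severe the alert, the more channels it reaches.
const SEVERITY_CHANNELS = [
    'critical' => ['pagerduty', 'slack', 'email'],
    'warning'  => ['slack', 'email'],
    'info'     => ['email'],
];

/**
 * Build an alert payload for a business event.
 * Unmapped events default to "warning" so they stay visible instead of
 * being silently dropped or paging someone.
 */
function buildAlert(string $event, array $context = []): array
{
    $severity = ALERT_SEVERITIES[$event] ?? 'warning';

    return [
        'event'    => $event,
        'severity' => $severity,
        'channels' => SEVERITY_CHANNELS[$severity],
        'context'  => $context,
    ];
}
```

Keeping the mapping in data rather than in conditionals means the tables could later move into config/monitoring.php without changing the calling code.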

Why This Task is Critical:

Monitoring is not optional for production systems: it is the difference between detecting issues internally and discovering them through customer complaints. For a multi-tenant enterprise platform it matters even more:

  1. Customer SLA Compliance: Prove uptime and performance commitments with metrics
  2. Capacity Planning: Identify resource bottlenecks before they cause outages
  3. Security Incident Response: Detect and respond to attacks in real-time
  4. Performance Optimization: Identify slow queries, inefficient code paths
  5. Business Intelligence: Track platform growth, usage patterns, revenue trends
  6. On-Call Effectiveness: Alert on-call engineers with actionable context

This task establishes the foundation for reliable operations at scale, enabling the team to maintain high availability and performance as the platform grows.

Acceptance Criteria

  • Prometheus server deployed and collecting metrics from all application nodes
  • Grafana deployed with data source connected to Prometheus
  • 8+ production dashboards created (System, Resource, Queue, Terraform, Organization, Payment, Security, Business)
  • Health check endpoint /health returns 200 OK when system healthy
  • Detailed health check endpoint /health/detailed returns comprehensive diagnostics
  • HealthCheckService implements 10+ health checks (database, Redis, queue, disk, etc.)
  • CollectMetrics middleware tracks HTTP request duration and status codes
  • Custom metrics exported for business events (deployments, licenses, payments)
  • AlertManager configured with alert rules (critical, warning, info levels)
  • Alert rules created for critical scenarios (database down, queue stopped, disk full)
  • Multi-channel alerting configured (email, Slack, PagerDuty)
  • Alert deduplication and grouping configured
  • Organization-specific metrics filtered and displayed correctly
  • Historical data retention configured (30 days detailed, 1 year aggregated)
  • Dashboard refresh rates optimized (real-time: 5s, historical: 1m)
  • Grafana authentication integrated with Laravel Sanctum or SSO
  • API documentation for health check and metrics endpoints
  • Operational runbook for interpreting alerts and dashboards

Technical Details

File Paths

Health Check System:

  • /home/topgun/topgun/app/Services/Monitoring/HealthCheckService.php (new)
  • /home/topgun/topgun/app/Http/Controllers/HealthCheckController.php (new)
  • /home/topgun/topgun/routes/web.php (modify - add health check routes)

Metrics Collection:

  • /home/topgun/topgun/app/Services/Monitoring/MetricsCollector.php (new)
  • /home/topgun/topgun/app/Http/Middleware/CollectMetrics.php (new)
  • /home/topgun/topgun/app/Console/Commands/ExportMetrics.php (new)

Alert Configuration:

  • /home/topgun/topgun/app/Services/Monitoring/AlertingService.php (new)
  • /home/topgun/topgun/config/monitoring.php (new)

Infrastructure (Deployment):

  • /home/topgun/topgun/docker/prometheus/prometheus.yml (new)
  • /home/topgun/topgun/docker/prometheus/alerts.yml (new)
  • /home/topgun/topgun/docker/grafana/provisioning/datasources/prometheus.yml (new)
  • /home/topgun/topgun/docker/grafana/provisioning/dashboards/ (dashboard JSON files)
  • /home/topgun/topgun/docker-compose.monitoring.yml (new - monitoring stack)

Documentation:

  • /home/topgun/topgun/docs/operations/monitoring-guide.md (new)
  • /home/topgun/topgun/docs/operations/alert-runbook.md (new)

Database Schema

No new database tables required. Existing tables used for metrics:

-- Query for organization metrics
SELECT
    organization_id,
    COUNT(DISTINCT server_id) as server_count,
    COUNT(DISTINCT application_id) as app_count,
    SUM(CASE WHEN status = 'running' THEN 1 ELSE 0 END) as running_apps
FROM applications
WHERE deleted_at IS NULL
GROUP BY organization_id;

-- Query for deployment metrics
SELECT
    DATE_TRUNC('hour', created_at) as hour,
    COUNT(*) as total_deployments,
    COUNT(*) FILTER (WHERE status = 'completed') as successful_deployments,
    AVG(EXTRACT(EPOCH FROM (completed_at - started_at))) as avg_duration_seconds
FROM terraform_deployments
WHERE created_at >= NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour DESC;

HealthCheckService Implementation

File: app/Services/Monitoring/HealthCheckService.php

<?php

namespace App\Services\Monitoring;

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Queue;
use Illuminate\Support\Facades\Redis;
use Symfony\Component\Process\Process;

class HealthCheckService
{
    private array $checks = [];

    /**
     * Run all health checks
     *
     * @return array Health check results
     */
    public function runAll(): array
    {
        $results = [
            'status' => 'healthy',
            'timestamp' => now()->toIso8601String(),
            'checks' => [],
            'metadata' => [
                'environment' => config('app.env'),
                'version' => config('app.version', 'unknown'),
            ],
        ];

        // Run all checks
        $results['checks']['database'] = $this->checkDatabase();
        $results['checks']['redis'] = $this->checkRedis();
        $results['checks']['queue'] = $this->checkQueue();
        $results['checks']['disk'] = $this->checkDiskSpace();
        $results['checks']['terraform'] = $this->checkTerraform();
        $results['checks']['docker'] = $this->checkDocker();
        $results['checks']['reverb'] = $this->checkReverb();

        // Determine overall health status
        foreach ($results['checks'] as $check) {
            if ($check['status'] === 'unhealthy') {
                $results['status'] = 'unhealthy';
                break;
            } elseif ($check['status'] === 'degraded' && $results['status'] === 'healthy') {
                $results['status'] = 'degraded';
            }
        }

        return $results;
    }

    /**
     * Check database connectivity and performance
     *
     * @return array
     */
    private function checkDatabase(): array
    {
        try {
            $start = microtime(true);

            // Test connection
            DB::connection()->getPdo();

            // Test query performance
            DB::table('organizations')->limit(1)->get();

            $duration = (microtime(true) - $start) * 1000;

            // Get connection pool stats
            $connections = DB::select('SELECT count(*) as active_connections FROM pg_stat_activity');
            $activeConnections = $connections[0]->active_connections ?? 0;

            $status = 'healthy';
            if ($duration > 1000) {
                $status = 'degraded';
            }

            return [
                'status' => $status,
                'message' => 'Database connection healthy',
                'latency_ms' => round($duration, 2),
                'active_connections' => $activeConnections,
            ];
        } catch (\Exception $e) {
            return [
                'status' => 'unhealthy',
                'message' => 'Database connection failed',
                'error' => $e->getMessage(),
            ];
        }
    }

    /**
     * Check Redis connectivity and memory usage
     *
     * @return array
     */
    private function checkRedis(): array
    {
        try {
            $start = microtime(true);

            // Test connection
            Cache::store('redis')->get('health-check-test');

            $duration = (microtime(true) - $start) * 1000;

            // Get Redis info
            $redis = Redis::connection();
            $info = $redis->info('memory');

            $usedMemory = $info['used_memory_human'] ?? 'unknown';
            $maxMemory = $info['maxmemory_human'] ?? 'unlimited';

            $status = 'healthy';
            if ($duration > 100) {
                $status = 'degraded';
            }

            return [
                'status' => $status,
                'message' => 'Redis connection healthy',
                'latency_ms' => round($duration, 2),
                'used_memory' => $usedMemory,
                'max_memory' => $maxMemory,
            ];
        } catch (\Exception $e) {
            return [
                'status' => 'unhealthy',
                'message' => 'Redis connection failed',
                'error' => $e->getMessage(),
            ];
        }
    }

    /**
     * Check queue worker status
     *
     * @return array
     */
    private function checkQueue(): array
    {
        try {
            // Note: 'illuminate:queue:restart' holds the timestamp of the last
            // `queue:restart` signal. It does not prove that workers are running.
            $restartSignal = Cache::get('illuminate:queue:restart');

            // Get queue size
            $queueSize = Queue::size('default');
            $terraformQueueSize = Queue::size('terraform');

            $status = 'healthy';
            if ($queueSize > 1000 || $terraformQueueSize > 50) {
                $status = 'degraded';
            }

            return [
                'status' => $status,
                'message' => 'Queue system operational',
                'default_queue_size' => $queueSize,
                'terraform_queue_size' => $terraformQueueSize,
                'horizon_restart' => $restartSignal !== null,
            ];
        } catch (\Exception $e) {
            return [
                'status' => 'unhealthy',
                'message' => 'Queue check failed',
                'error' => $e->getMessage(),
            ];
        }
    }

    /**
     * Check disk space
     *
     * @return array
     */
    private function checkDiskSpace(): array
    {
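        try {
            $path = base_path();
            $freeSpace = disk_free_space($path);
            $totalSpace = disk_total_space($path);

            // Both functions return false on failure. Guard before dividing:
            // a DivisionByZeroError would not be caught by catch (\Exception).
            if ($freeSpace === false || $totalSpace === false || $totalSpace <= 0) {
                throw new \RuntimeException("Unable to read disk statistics for {$path}");
            }

            $percentUsed = 100 - (($freeSpace / $totalSpace) * 100);

            $status = 'healthy';
            $message = 'Disk space sufficient';
            if ($percentUsed > 90) {
                $status = 'unhealthy';
                $message = 'Disk space critically low';
            } elseif ($percentUsed > 80) {
                $status = 'degraded';
                $message = 'Disk space running low';
            }

            return [
                'status' => $status,
                'message' => $message,
                'percent_used' => round($percentUsed, 2),
                'free_space_gb' => round($freeSpace / (1024 ** 3), 2),
                'total_space_gb' => round($totalSpace / (1024 ** 3), 2),
            ];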
        } catch (\Exception $e) {
            return [
                'status' => 'unhealthy',
                'message' => 'Disk space check failed',
                'error' => $e->getMessage(),
            ];
        }
    }

    /**
     * Check Terraform binary availability
     *
     * @return array
     */
    private function checkTerraform(): array
    {
        try {
            $terraformPath = config('terraform.binary_path', '/usr/local/bin/terraform');

            $process = new Process([$terraformPath, 'version', '-json']);
            $process->run();

            if ($process->isSuccessful()) {
                $output = json_decode($process->getOutput(), true);

                return [
                    'status' => 'healthy',
                    'message' => 'Terraform available',
                    'version' => $output['terraform_version'] ?? 'unknown',
                    'path' => $terraformPath,
                ];
            }

            return [
                'status' => 'degraded',
                'message' => 'Terraform command failed',
                'error' => $process->getErrorOutput(),
            ];
        } catch (\Exception $e) {
            return [
                'status' => 'unhealthy',
                'message' => 'Terraform binary not found',
                'error' => $e->getMessage(),
            ];
        }
    }

    /**
     * Check Docker daemon connectivity
     *
     * @return array
     */
    private function checkDocker(): array
    {
        try {
            $process = new Process(['docker', 'version', '--format', '{{.Server.Version}}']);
            $process->run();

            if ($process->isSuccessful()) {
                return [
                    'status' => 'healthy',
                    'message' => 'Docker daemon accessible',
                    'version' => trim($process->getOutput()),
                ];
            }

            return [
                'status' => 'degraded',
                'message' => 'Docker command failed',
                'error' => $process->getErrorOutput(),
            ];
        } catch (\Exception $e) {
            return [
                'status' => 'unhealthy',
                'message' => 'Docker daemon not accessible',
                'error' => $e->getMessage(),
            ];
        }
    }

    /**
     * Check Reverb WebSocket server
     *
     * @return array
     */
    private function checkReverb(): array
    {
        try {
            // Check if Reverb process is running
            $process = new Process(['pgrep', '-f', 'reverb:start']);
            $process->run();

            $isRunning = $process->isSuccessful();

            return [
                'status' => $isRunning ? 'healthy' : 'degraded',
                'message' => $isRunning ? 'Reverb WebSocket server running' : 'Reverb not detected',
                'running' => $isRunning,
            ];
        } catch (\Exception $e) {
            return [
                'status' => 'degraded',
                'message' => 'Could not check Reverb status',
                'error' => $e->getMessage(),
            ];
        }
    }

    /**
     * Get quick health status (for load balancer)
     *
     * @return bool
     */
    public function isHealthy(): bool
    {
        try {
            // Quick checks only
            DB::connection()->getPdo();
            Cache::store('redis')->get('health-check-test');

            return true;
        } catch (\Exception $e) {
            return false;
        }
    }
}

HealthCheckController Implementation

File: app/Http/Controllers/HealthCheckController.php

<?php

namespace App\Http\Controllers;

use App\Services\Monitoring\HealthCheckService;
use Illuminate\Http\JsonResponse;

class HealthCheckController extends Controller
{
    public function __construct(
        private HealthCheckService $healthCheckService
    ) {}

    /**
     * Simple health check for load balancers
     *
     * @return JsonResponse
     */
    public function index(): JsonResponse
    {
        if ($this->healthCheckService->isHealthy()) {
            return response()->json([
                'status' => 'healthy',
                'timestamp' => now()->toIso8601String(),
            ]);
        }

        return response()->json([
            'status' => 'unhealthy',
            'timestamp' => now()->toIso8601String(),
        ], 503);
    }

    /**
     * Detailed health check (authenticated)
     *
     * @return JsonResponse
     */
    public function detailed(): JsonResponse
    {
        $results = $this->healthCheckService->runAll();

        $statusCode = match ($results['status']) {
            'healthy' => 200,
            'degraded' => 200,
            'unhealthy' => 503,
            default => 500,
        };

        return response()->json($results, $statusCode);
    }
}

CollectMetrics Middleware

File: app/Http/Middleware/CollectMetrics.php

<?php

namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\Redis;

class CollectMetrics
{
    /**
     * Handle an incoming request
     *
     * @param Request $request
     * @param Closure $next
     * @return mixed
     */
    public function handle(Request $request, Closure $next): mixed
    {
        $start = microtime(true);

        $response = $next($request);

        $duration = (microtime(true) - $start) * 1000;

        // Collect metrics
        $this->recordMetric([
            'type' => 'http_request',
            'method' => $request->method(),
            // Prefer the route template (e.g. "servers/{id}") over the raw path
            // to keep label cardinality bounded.
            'path' => $request->route()?->uri() ?? $request->path(),
            // getStatusCode() exists on every Symfony response type,
            // including streamed and binary file responses.
            'status' => $response->getStatusCode(),
            'duration_ms' => round($duration, 2),
            'timestamp' => now()->timestamp,
            'organization_id' => $request->user()?->current_organization_id,
        ]);

        return $response;
    }

    /**
     * Record metric to Redis for Prometheus scraping
     *
     * @param array $metric
     * @return void
     */
    private function recordMetric(array $metric): void
    {
        try {
            // List commands are not part of the Cache repository API,
            // so talk to Redis directly through the Redis facade.
            Redis::rpush('metrics:http_requests', json_encode($metric));

            // Trim to last 10000 metrics to prevent unbounded growth
            Redis::ltrim('metrics:http_requests', -10000, -1);
        } catch (\Throwable $e) {
            // Fail silently - don't let metrics collection break requests
            Log::debug('Failed to record metric', ['error' => $e->getMessage()]);
        }
    }
}

Configuration File

File: config/monitoring.php

<?php

return [
    /*
    |--------------------------------------------------------------------------
    | Health Check Configuration
    |--------------------------------------------------------------------------
    */
    'health_checks' => [
        'enabled' => env('HEALTH_CHECKS_ENABLED', true),
        'cache_ttl' => env('HEALTH_CHECK_CACHE_TTL', 30), // Cache results for 30 seconds
    ],

    /*
    |--------------------------------------------------------------------------
    | Metrics Collection
    |--------------------------------------------------------------------------
    */
    'metrics' => [
        'enabled' => env('METRICS_COLLECTION_ENABLED', true),
        'endpoints' => [
            'http_requests' => true,
            'queue_jobs' => true,
            'database_queries' => false, // Too verbose for production
        ],
    ],

    /*
    |--------------------------------------------------------------------------
    | Alerting Configuration
    |--------------------------------------------------------------------------
    */
    'alerting' => [
        'enabled' => env('ALERTING_ENABLED', true),

        'channels' => [
            'email' => [
                'enabled' => env('ALERT_EMAIL_ENABLED', true),
                'to' => env('ALERT_EMAIL_TO', 'ops@example.com'),
            ],
            'slack' => [
                'enabled' => env('ALERT_SLACK_ENABLED', false),
                'webhook_url' => env('ALERT_SLACK_WEBHOOK_URL'),
            ],
            'pagerduty' => [
                'enabled' => env('ALERT_PAGERDUTY_ENABLED', false),
                'integration_key' => env('ALERT_PAGERDUTY_KEY'),
            ],
        ],

        'thresholds' => [
            'error_rate' => env('ALERT_ERROR_RATE_THRESHOLD', 0.01), // 1%
            'response_time_p95' => env('ALERT_RESPONSE_TIME_P95_MS', 1000), // 1 second
            'queue_depth' => env('ALERT_QUEUE_DEPTH_THRESHOLD', 1000),
            'disk_usage_percent' => env('ALERT_DISK_USAGE_PERCENT', 90),
        ],
    ],

    /*
    |--------------------------------------------------------------------------
    | Prometheus Configuration
    |--------------------------------------------------------------------------
    */
    'prometheus' => [
        'enabled' => env('PROMETHEUS_ENABLED', true),
        'scrape_interval' => env('PROMETHEUS_SCRAPE_INTERVAL', '15s'),
        'retention_days' => env('PROMETHEUS_RETENTION_DAYS', 30),
    ],

    /*
    |--------------------------------------------------------------------------
    | Grafana Configuration
    |--------------------------------------------------------------------------
    */
    'grafana' => [
        'enabled' => env('GRAFANA_ENABLED', true),
        'url' => env('GRAFANA_URL', 'http://grafana:3000'),
        'admin_user' => env('GRAFANA_ADMIN_USER', 'admin'),
        // No fallback value: a production stack should never start with a known admin password
        'admin_password' => env('GRAFANA_ADMIN_PASSWORD'),
    ],
];
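The `health_checks.cache_ttl` setting above is not used by the HealthCheckService shown earlier. A minimal sketch of what it could control, assuming the intent is to avoid re-running expensive checks on every probe: wrap a check in a small time-based cache. The class `TtlCachedCheck` is hypothetical, and the clock is injectable only so the behaviour can be exercised without sleeping.

```php
<?php

/**
 * Hypothetical wrapper that caches the result of an expensive check
 * for a fixed number of seconds.
 */
final class TtlCachedCheck
{
    private ?array $cached = null;
    private float $expiresAt = 0.0;

    /**
     * @param callable(): array $check  the expensive check to run
     * @param int               $ttlSeconds how long a result stays valid
     * @param callable(): float|null $clock returns "now" in seconds
     */
    public function __construct(
        private $check,
        private int $ttlSeconds,
        private $clock = null,
    ) {
        $this->clock ??= static fn (): float => microtime(true);
    }

    public function run(): array
    {
        $now = ($this->clock)();

        // Re-run the check only when nothing is cached or the entry expired.
        if ($this->cached === null || $now >= $this->expiresAt) {
            $this->cached = ($this->check)();
            $this->expiresAt = $now + $this->ttlSeconds;
        }

        return $this->cached;
    }
}
```

Inside the Laravel application the same effect could probably be had with `Cache::remember()` keyed on the check name and using `config('monitoring.health_checks.cache_ttl')` as the TTL. The load balancer endpoint `/health` is better left uncached so that it reflects the current state.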

Prometheus Configuration

File: docker/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'coolify-enterprise'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load alerting rules
rule_files:
  - 'alerts.yml'

# Scrape configurations
scrape_configs:
  # Laravel application metrics
  - job_name: 'laravel-app'
    static_configs:
      - targets: ['app:9090']
    metrics_path: '/metrics'
    scrape_interval: 15s

  # Node Exporter for server metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 30s

  # PostgreSQL Exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
    scrape_interval: 30s

  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
    scrape_interval: 30s

  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Alert Rules Configuration

File: docker/prometheus/alerts.yml

groups:
  - name: critical_alerts
    interval: 1m
    rules:
      # Database down
      - alert: DatabaseDown
        expr: up{job="postgres"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL database is down"
          description: "Database {{ $labels.instance }} has been down for more than 1 minute"

      # Redis down
      - alert: RedisDown
        expr: up{job="redis"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis cache is down"
          description: "Redis instance {{ $labels.instance }} is unreachable"

      # High disk usage
      - alert: HighDiskUsage
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critically low"
          description: "Disk on {{ $labels.instance }} has only {{ $value }}% free space left (usage above 90%)"

      # Queue workers stopped
      - alert: QueueWorkersDown
        expr: horizon_workers_total == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Queue workers are not running"
          description: "No Horizon workers detected for 2 minutes"

  - name: warning_alerts
    interval: 5m
    rules:
      # High error rate
      - alert: HighErrorRate
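        # Aggregate both sides with sum(): dividing the raw vectors matches series
        # label-by-label, so each 5xx series would only be divided by itself.
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01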
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP error rate detected"
          description: "Error rate is {{ humanizePercentage $value }} over the last 5 minutes"

      # Slow database queries
      - alert: SlowDatabaseQueries
        expr: histogram_quantile(0.95, rate(database_query_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow database queries detected"
          description: "95th percentile query time is {{ humanizeDuration $value }}"

      # High queue depth
      - alert: HighQueueDepth
        expr: horizon_queue_depth > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Queue depth is high"
          description: "Queue {{ $labels.queue }} has {{ $value }} pending jobs"

      # High memory usage
      - alert: HighMemoryUsage
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage is high"
          description: "Available memory on {{ $labels.instance }} is below 20%"

  - name: info_alerts
    interval: 15m
    rules:
      # Deployment completed
      - alert: DeploymentCompleted
        expr: increase(terraform_deployments_completed_total[15m]) > 0
        labels:
          severity: info
        annotations:
          summary: "Infrastructure deployment completed"
          description: "{{ $value }} Terraform deployment(s) completed in the last 15 minutes"

      # License expiring soon
      - alert: LicenseExpiringSoon
        expr: (enterprise_license_expiry_timestamp - time()) < 604800
        labels:
          severity: info
        annotations:
          summary: "Enterprise license expiring soon"
          description: "License for organization {{ $labels.organization }} expires in {{ humanizeDuration $value }}"

Grafana Dashboard Provisioning

File: docker/grafana/provisioning/datasources/prometheus.yml

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: '15s'
      queryTimeout: '60s'

Routes Configuration

File: routes/web.php (add these routes)

// Health check endpoints
Route::get('/health', [HealthCheckController::class, 'index'])
    ->name('health');

Route::get('/health/detailed', [HealthCheckController::class, 'detailed'])
    ->middleware('auth:sanctum')
    ->name('health.detailed');

// Metrics endpoint (for Prometheus scraping)
Route::get('/metrics', [MetricsController::class, 'export'])
    ->middleware('throttle:60,1')
    ->name('metrics');
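The `/metrics` route refers to a MetricsController that this issue does not show. As a sketch of what its `export` action might produce, the function below aggregates request records of the shape the middleware stores (`method`, `path`, `status`) into counter lines in the Prometheus text exposition format. The function names are hypothetical. Note the limitation: counts derived from a trimmed Redis list can decrease between scrapes, which breaks `rate()`, so a real exporter would more likely keep monotonic counters (for example with Redis increments).

```php
<?php

/** Escape a label value per the Prometheus text exposition format. */
function escapeLabel(string $value): string
{
    return str_replace(['\\', '"', "\n"], ['\\\\', '\\"', '\\n'], $value);
}

/**
 * Aggregate request records into http_requests_total counter lines.
 *
 * @param array<int, array{method: string, path: string, status: int|string}> $records
 */
function renderRequestCounters(array $records): string
{
    $series = [];
    foreach ($records as $record) {
        $labels = [
            'method' => (string) $record['method'],
            'path'   => (string) $record['path'],
            'status' => (string) $record['status'],
        ];
        // A JSON-encoded label set is a collision-free grouping key.
        $key = json_encode($labels);
        $series[$key] ??= ['labels' => $labels, 'count' => 0];
        $series[$key]['count']++;
    }
    ksort($series); // stable output order between scrapes

    $lines = [
        '# HELP http_requests_total Total HTTP requests observed.',
        '# TYPE http_requests_total counter',
    ];
    foreach ($series as $entry) {
        $lines[] = sprintf(
            'http_requests_total{method="%s",path="%s",status="%s"} %d',
            escapeLabel($entry['labels']['method']),
            escapeLabel($entry['labels']['path']),
            escapeLabel($entry['labels']['status']),
            $entry['count']
        );
    }

    return implode("\n", $lines) . "\n";
}
```

A controller action would return this string with the content type `text/plain; version=0.0.4`. Because the output includes per-organization data once an `organization_id` label is added, restricting the route to the Prometheus server (network policy or a bearer token) is worth considering in addition to the throttle.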

Implementation Approach

Step 1: Set Up Health Check System

  1. Create HealthCheckService with all check methods
  2. Create HealthCheckController with simple and detailed endpoints
  3. Register routes in web.php
  4. Test health checks manually

Step 2: Implement Metrics Collection

  1. Create CollectMetrics middleware
  2. Register middleware in Kernel.php
  3. Create MetricsController for Prometheus export
  4. Test metrics collection and export

Step 3: Deploy Prometheus

  1. Create prometheus.yml configuration
  2. Create alerts.yml with alert rules
  3. Add Prometheus to docker-compose.monitoring.yml
  4. Deploy and verify scraping

Step 4: Deploy Grafana

  1. Create datasource provisioning configuration
  2. Create dashboard JSON files (System, Resource, Queue, etc.)
  3. Add Grafana to docker-compose.monitoring.yml
  4. Configure authentication

Step 5: Configure AlertManager

  1. Create alertmanager.yml configuration
  2. Configure notification channels (email, Slack, PagerDuty)
  3. Test alert routing and delivery
  4. Set up alert deduplication

Step 6: Create Dashboards

  1. System Overview Dashboard (general health)
  2. Resource Dashboard (CPU, memory, disk)
  3. Queue Dashboard (Horizon metrics)
  4. Terraform Dashboard (deployment tracking)
  5. Organization Dashboard (per-tenant metrics)
  6. Payment Dashboard (transaction tracking)
  7. Security Dashboard (failed auth, rate limits)
  8. Business Dashboard (KPIs, growth metrics)

Step 7: Integrate with Existing Systems

  1. Add metrics to TerraformDeploymentJob
  2. Add metrics to ResourceMonitoringJob
  3. Add metrics to payment processing
  4. Add metrics to license validation
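
One way to add metrics to existing jobs without scattering timing code through them is a small wrapper that measures a unit of work and hands the result to a metrics sink. The sketch below is illustrative only: measure, the entry shape, and the $sink callable are all hypothetical names, not part of the existing jobs.

```php
<?php

declare(strict_types=1);

/**
 * Time a unit of work and report its duration and outcome to a metrics sink.
 * Hypothetical helper; the metric names and entry shape are assumptions.
 *
 * @param array<string, string> $labels
 * @param callable(): mixed $work
 * @param callable(array<string, mixed>): void $sink
 */
function measure(string $metric, array $labels, callable $work, callable $sink): mixed
{
    $start = hrtime(true); // monotonic clock, nanoseconds
    $outcome = 'success';

    try {
        return $work();
    } catch (\Throwable $e) {
        $outcome = 'failure';
        throw $e; // the caller still sees the original exception
    } finally {
        $entry = [
            'metric' => $metric,
            'labels' => $labels,
            'outcome' => $outcome,
            'duration_ms' => round((hrtime(true) - $start) / 1_000_000, 2),
        ];
        try {
            $sink($entry);
        } catch (\Throwable) {
            // Recording a metric must never break the job itself.
        }
    }
}
```

Inside a job such as TerraformDeploymentJob, the body of handle() could be passed as $work, with a $sink that pushes the entry to whatever store the metrics endpoint reads from.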

Step 8: Documentation

  1. Write monitoring guide
  2. Write alert runbook
  3. Document dashboard usage
  4. Create troubleshooting guide

Step 9: Testing

  1. Trigger alerts manually
  2. Verify alert delivery
  3. Test dashboard functionality
  4. Load test metrics collection

Step 10: Deployment and Training

  1. Deploy monitoring stack to production
  2. Train operations team on dashboards
  3. Establish on-call rotation
  4. Document escalation procedures

Test Strategy

Unit Tests

File: tests/Unit/Services/HealthCheckServiceTest.php

<?php

use App\Services\Monitoring\HealthCheckService;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Cache;

beforeEach(function () {
    $this->healthCheckService = app(HealthCheckService::class);
});

it('returns healthy status when all checks pass', function () {
    $result = $this->healthCheckService->runAll();

    expect($result['status'])->toBe('healthy');
    expect($result['checks'])->toHaveKeys([
        'database', 'redis', 'queue', 'disk', 'terraform', 'docker', 'reverb'
    ]);
});

it('checks database connectivity', function () {
    $result = invade($this->healthCheckService)->checkDatabase();

    expect($result)->toHaveKeys(['status', 'message', 'latency_ms']);
    expect($result['status'])->toBeIn(['healthy', 'degraded']);
});

it('checks Redis connectivity', function () {
    $result = invade($this->healthCheckService)->checkRedis();

    expect($result)->toHaveKeys(['status', 'message', 'latency_ms']);
    expect($result['status'])->toBeIn(['healthy', 'degraded']);
});

it('detects unhealthy state when database is down', function () {
    DB::shouldReceive('connection->getPdo')
        ->andThrow(new \PDOException('Connection failed'));

    $result = invade($this->healthCheckService)->checkDatabase();

    expect($result['status'])->toBe('unhealthy');
    expect($result)->toHaveKey('error');
});

it('provides quick health status for load balancers', function () {
    $isHealthy = $this->healthCheckService->isHealthy();

    expect($isHealthy)->toBeTrue();
});

Integration Tests

File: tests/Feature/Monitoring/HealthCheckEndpointTest.php

<?php

use App\Models\User;
use Illuminate\Support\Facades\DB;

it('returns 200 OK when system is healthy', function () {
    $response = $this->get('/health');

    $response->assertOk();
    $response->assertJson([
        'status' => 'healthy',
    ]);
});

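it('requires authentication for detailed health check', function () {
    // Send a JSON request so the auth middleware answers 401
    // instead of redirecting a browser-style request to the login page.
    $response = $this->getJson('/health/detailed');

    $response->assertUnauthorized();
});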

it('returns detailed health information when authenticated', function () {
    $user = User::factory()->create();

    $response = $this->actingAs($user)
        ->get('/health/detailed');

    $response->assertOk();
    $response->assertJsonStructure([
        'status',
        'timestamp',
        'checks' => [
            'database',
            'redis',
            'queue',
            'disk',
        ],
        'metadata',
    ]);
});

it('returns 503 when system is unhealthy', function () {
    // Mock database failure
    DB::shouldReceive('connection->getPdo')
        ->andThrow(new \PDOException('Connection failed'));

    $response = $this->get('/health');

    $response->assertStatus(503);
    $response->assertJson([
        'status' => 'unhealthy',
    ]);
});

Metrics Collection Tests

File: tests/Feature/Monitoring/MetricsCollectionTest.php

<?php

use Illuminate\Support\Facades\Redis;

it('collects HTTP request metrics', function () {
    $this->get('/api/organizations');

    // Read the most recent entry from the metrics list.
    // Assumes the MetricsCollector middleware pushes to this list through the Redis facade;
    // the Cache repository does not expose list commands such as lrange().
    $metrics = Redis::lrange('metrics:http_requests', -1, -1);

    expect($metrics)->not->toBeEmpty();

    $metric = json_decode($metrics[0], true);
    expect($metric)->toHaveKeys(['type', 'method', 'path', 'status', 'duration_ms']);
    expect($metric['type'])->toBe('http_request');
});

it('does not break requests when metrics fail', function () {
    // Simulate Redis failure
    Redis::shouldReceive('rpush')
        ->andThrow(new \Exception('Redis down'));

    // Request should still succeed
    $response = $this->get('/api/organizations');

    $response->assertOk();
});

Alert Testing

Manual Test Plan:

  1. Database Down Alert

    • Stop PostgreSQL container
    • Verify alert fires within 1 minute
    • Verify notification delivery
    • Restart PostgreSQL
    • Verify alert resolves
  2. High Disk Usage Alert

    • Fill disk to >90%
    • Verify alert fires within 5 minutes
    • Clean up disk space
    • Verify alert resolves
  3. High Error Rate Alert

    • Trigger 500 errors (e.g., break database connection)
    • Generate traffic to hit 1% error threshold
    • Verify alert fires
    • Fix error source
    • Verify alert resolves
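
For the error-rate scenario it helps to know in advance how much failing traffic is needed before the alert should fire. The sketch below spells out the arithmetic behind a 1% threshold. The function name is hypothetical, and the strict "greater than" comparison is an assumption; the actual rule in alerts.yml decides whether the boundary value counts.

```php
<?php

declare(strict_types=1);

/**
 * Decide whether the share of failed requests exceeds a threshold.
 * Hypothetical helper mirroring the arithmetic of an error-rate alert rule.
 */
function errorRateExceeds(int $errors, int $total, float $threshold = 0.01): bool
{
    if ($total <= 0) {
        // No traffic means no measurable error rate, so the alert should not fire.
        return false;
    }

    return ($errors / $total) > $threshold;
}
```

With 1,000 requests in the evaluation window, 5 failures (0.5%) stay under the threshold while 20 failures (2%) cross it, so the manual test should generate comfortably more than 10 failing requests per 1,000.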

Definition of Done

  • HealthCheckService implemented with 10+ health checks
  • HealthCheckController created with simple and detailed endpoints
  • Health check routes registered and tested
  • MetricsCollector middleware implemented
  • Metrics export endpoint created for Prometheus
  • Prometheus deployed and scraping metrics
  • Grafana deployed with Prometheus datasource
  • 8+ production dashboards created (System, Resource, Queue, Terraform, Organization, Payment, Security, Business)
  • AlertManager configured with notification channels
  • Alert rules created for critical, warning, and info levels
  • Alerts tested and verified delivery
  • Organization-specific metrics filtering working
  • Historical data retention configured (30 days detailed, 1 year aggregated)
  • Grafana authentication configured
  • Monitoring configuration documented
  • Alert runbook created with response procedures
  • Operations team trained on dashboards
  • Unit tests written for health checks (>90% coverage)
  • Integration tests written for endpoints
  • Manual alert testing completed
  • Production deployment successful
  • On-call rotation established
  • Laravel Pint formatting applied
  • PHPStan level 5 passing
  • Code reviewed and approved

Related Tasks

  • Depends on: Task 89 (CI/CD pipeline for deployment automation)
  • Integrates with: Task 18 (TerraformDeploymentJob metrics)
  • Integrates with: Task 24 (ResourceMonitoringJob metrics)
  • Integrates with: Task 46 (PaymentService metrics)
  • Integrates with: Task 54 (API rate limiting metrics)
  • Supports: All production operations and incident response
