# Task: Write resource monitoring and capacity management documentation

## Description

Create comprehensive user and administrator documentation for Coolify Enterprise's resource monitoring and capacity management system. This documentation covers real-time server metrics monitoring, intelligent server selection algorithms, organization-level resource quotas, capacity planning tools, and the advanced deployment strategies enabled by capacity awareness.

This documentation is critical for enterprise administrators who need to understand how Coolify automatically optimizes resource utilization across their infrastructure, prevents over-provisioning, enforces organizational quotas, and ensures deployments are placed on optimal servers based on real-time capacity analysis.

**Target Audiences:**

1. **Organization Administrators** - Understanding quota management, resource monitoring dashboards, and capacity planning
2. **DevOps Engineers** - Configuring resource monitoring, understanding server selection algorithms, troubleshooting capacity issues
3. **Application Developers** - Understanding how capacity affects their deployment experience and automatic server selection
4. **System Architects** - Planning infrastructure scaling, understanding resource allocation patterns
5. **Enterprise Support Teams** - Troubleshooting resource-related issues, understanding monitoring data

**Documentation Scope:**

- **User Guides** - Step-by-step instructions for accessing dashboards, configuring quotas, interpreting metrics
- **Technical Reference** - Detailed explanation of monitoring architecture, scoring algorithms, data retention policies
- **Administrator Guides** - Setting up monitoring, configuring thresholds, managing organization quotas
- **API Documentation** - Programmatic access to monitoring data and capacity information
- **Troubleshooting Guides** - Common issues, diagnostic procedures, resolution steps
- **Best Practices** - Resource planning, quota sizing, monitoring optimization

**Integration Context:**

This documentation builds upon the implementation completed in Tasks 22-31 (resource monitoring system). It must accurately reflect the implemented features:
- Real-time metrics collection (CPU, memory, disk, network, load average)
- Server scoring algorithm with weighted criteria
- Organization resource quotas linked to enterprise licenses
- WebSocket-powered real-time dashboards
- Capacity-aware deployment server selection
- Time-series metrics storage with configurable retention

**Why This Documentation Is Critical:**

Resource monitoring and capacity management are complex enterprise features that differentiate Coolify Enterprise from standard Coolify. Without comprehensive documentation, administrators cannot effectively utilize these features, leading to:
- Under-utilization of capacity planning tools
- Misunderstanding of quota enforcement
- Inability to troubleshoot resource allocation issues
- Poor infrastructure scaling decisions
- Confusion about automatic server selection behavior

Professional documentation ensures enterprise customers can fully leverage these advanced features, reducing support burden and increasing customer satisfaction.

## Acceptance Criteria

- [ ] User guide covering all dashboard features with screenshots and walkthroughs
- [ ] Administrator guide for quota configuration and management
- [ ] Technical reference explaining monitoring architecture and data flow
- [ ] Server scoring algorithm documentation with examples and scoring breakdowns
- [ ] API documentation for all resource monitoring endpoints with examples
- [ ] Troubleshooting guide covering common capacity issues and resolutions
- [ ] Best practices guide for resource planning and quota sizing
- [ ] Configuration reference for monitoring settings and thresholds
- [ ] Migration guide for enabling monitoring on existing installations
- [ ] Integration guide for connecting monitoring to external systems (Prometheus, Grafana, etc.)
- [ ] Performance tuning guide for high-volume metrics collection
- [ ] Security documentation covering metric access controls and organization scoping
- [ ] All documentation includes real-world examples and use cases
- [ ] Documentation follows Coolify's established style guide and formatting
- [ ] All code examples are tested and working

## Technical Details

### Documentation Structure

**File Locations:**

Primary documentation directory:
- `/home/topgun/topgun/docs/enterprise/resource-monitoring/` (new directory)

Individual documentation files:
- `/home/topgun/topgun/docs/enterprise/resource-monitoring/overview.md` - Feature overview and introduction
- `/home/topgun/topgun/docs/enterprise/resource-monitoring/user-guide.md` - End-user dashboard walkthrough
- `/home/topgun/topgun/docs/enterprise/resource-monitoring/admin-guide.md` - Administrator configuration guide
- `/home/topgun/topgun/docs/enterprise/resource-monitoring/technical-reference.md` - Architecture and algorithms
- `/home/topgun/topgun/docs/enterprise/resource-monitoring/api-reference.md` - API endpoint documentation
- `/home/topgun/topgun/docs/enterprise/resource-monitoring/troubleshooting.md` - Issue diagnosis and resolution
- `/home/topgun/topgun/docs/enterprise/resource-monitoring/best-practices.md` - Planning and optimization
- `/home/topgun/topgun/docs/enterprise/resource-monitoring/configuration.md` - Settings and environment variables
- `/home/topgun/topgun/docs/enterprise/resource-monitoring/migration.md` - Enabling on existing installations
- `/home/topgun/topgun/docs/enterprise/resource-monitoring/integration.md` - External monitoring integration
- `/home/topgun/topgun/docs/enterprise/resource-monitoring/security.md` - Access controls and permissions

Supporting files:
- `/home/topgun/topgun/docs/enterprise/resource-monitoring/images/` - Screenshots and diagrams
- `/home/topgun/topgun/docs/enterprise/resource-monitoring/examples/` - Code examples and API calls

### Overview Document Structure

**File:** `docs/enterprise/resource-monitoring/overview.md`

```markdown
# Resource Monitoring and Capacity Management

## Overview

Coolify Enterprise provides comprehensive resource monitoring and intelligent capacity management to optimize infrastructure utilization, prevent over-provisioning, and ensure deployments are placed on optimal servers based on real-time capacity analysis.

### Key Features

- **Real-time Metrics Collection** - CPU, memory, disk, network, and load average metrics collected every 30 seconds
- **Intelligent Server Selection** - Weighted scoring algorithm automatically selects optimal servers for deployments
- **Organization Quotas** - Hierarchical quota enforcement linked to enterprise license tiers
- **Capacity Planning** - Visual tools for forecasting resource needs and planning infrastructure scaling
- **WebSocket Dashboards** - Real-time dashboard updates without page refreshes
- **Time-Series Storage** - Efficient metrics storage with configurable retention policies
- **API Access** - Programmatic access to all monitoring data and capacity information

### Architecture Overview

The resource monitoring system consists of four primary components:

1. **ResourceMonitoringJob** - Background job collecting metrics from all servers every 30 seconds
2. **SystemResourceMonitor** - Service for metric aggregation, storage, and time-series management
3. **CapacityManager** - Intelligent server selection using weighted scoring algorithm
4. **ResourceDashboard.vue** - Real-time WebSocket-powered dashboard with ApexCharts visualization

### Monitoring Data Flow

```
Server Metrics Collection (every 30s)
    ↓
ResourceMonitoringJob executes on all servers
    ↓
SSH connection retrieves system metrics
    ↓
SystemResourceMonitor processes and stores metrics
    ↓
server_resource_metrics table (time-series data)
    ↓
Redis cache for recent metrics
    ↓
WebSocket broadcast to connected clients
    ↓
ResourceDashboard.vue updates in real-time
```

### Server Scoring Algorithm

Deployments automatically select the optimal server based on weighted scoring:

- **CPU Availability (30%)** - Remaining CPU capacity
- **Memory Availability (30%)** - Free memory for application allocation
- **Disk Space (20%)** - Available storage for application data
- **Network Bandwidth (10%)** - Available network capacity
- **Current Load (10%)** - Server load average (penalizes heavily loaded servers)

**Example Score Calculation:**

```
Server: production-app-1
CPU: 40% used (60% available) = 60 points × 30% weight = 18 points
Memory: 50% used (50% available) = 50 points × 30% weight = 15 points
Disk: 30% used (70% available) = 70 points × 20% weight = 14 points
Network: 20% used (80% available) = 80 points × 10% weight = 8 points
Load: 1.2/4.0 (70% available) = 70 points × 10% weight = 7 points
Total Score: 62 / 100
```

Higher scores indicate better deployment candidates.
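
The weighted sum above can be sketched in a few lines of Python. This is an illustration only, not the `CapacityManager` implementation: the function names are hypothetical, and it assumes the `4.0` in the load line is the server's CPU core count.

```python
from typing import Mapping

# Weights from the list above; all names here are illustrative, not Coolify's API.
WEIGHTS = {"cpu": 0.30, "memory": 0.30, "disk": 0.20, "network": 0.10, "load": 0.10}


def load_availability(load_average: float, cpu_cores: int) -> float:
    """Convert a load average into an availability percentage, clamped at 0."""
    used_percent = load_average / cpu_cores * 100
    return max(0.0, 100.0 - used_percent)


def capacity_score(availability: Mapping[str, float]) -> float:
    """Return a 0-100 score from per-resource availability percentages."""
    missing = WEIGHTS.keys() - availability.keys()
    if missing:
        raise ValueError(f"missing availability for: {sorted(missing)}")
    total = sum(availability[name] * weight for name, weight in WEIGHTS.items())
    return round(total, 1)


# production-app-1 from the worked example above
example = {
    "cpu": 60.0,
    "memory": 50.0,
    "disk": 70.0,
    "network": 80.0,
    "load": load_availability(1.2, 4),
}
print(capacity_score(example))  # 62.0
```

Any penalties or thresholds the real algorithm applies would adjust this base score and are not modeled here.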

### Organization Quota Enforcement

Organization resource usage is tracked and enforced based on enterprise license quotas:

```
Organization: Acme Corp
License Tier: Professional
Quotas:
  - Max Servers: 20
  - Max Applications: 100
  - Max CPU Cores: 80
  - Max RAM: 256 GB
  - Max Storage: 2 TB

Current Usage:
  - Servers: 15 / 20 (75%)
  - Applications: 67 / 100 (67%)
  - CPU Cores: 52 / 80 (65%)
  - RAM: 168 GB / 256 GB (66%)
  - Storage: 1.2 TB / 2 TB (60%)
```

When a request would exceed a quota, the new resource is not created and a clear error message explains which limit was reached.

### Metric Retention Policies

Metrics are stored with varying granularity based on age:

- **Raw metrics (30s intervals):** Retained for 7 days
- **5-minute aggregates:** Retained for 30 days
- **1-hour aggregates:** Retained for 90 days
- **Daily aggregates:** Retained for 1 year

This provides high-resolution recent data while maintaining long-term trends.
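
To get a feel for what these tiers mean for storage, the following sketch estimates how many rows one server contributes at steady state. It assumes one row per server per interval in each tier, which is an assumption about the schema rather than a documented fact.

```python
SECONDS_PER_DAY = 86_400

# (tier label, sample interval in seconds, retention in days), from the list above
TIERS = [
    ("raw", 30, 7),
    ("5min", 300, 30),
    ("1hour", 3_600, 90),
    ("daily", 86_400, 365),
]


def rows_per_server() -> dict:
    """Upper bound on rows kept per server in each tier (one row per interval)."""
    return {
        label: days * SECONDS_PER_DAY // interval
        for label, interval, days in TIERS
    }


rows = rows_per_server()
print(rows)                # per-tier row counts
print(sum(rows.values()))  # total rows per server across all tiers
```

Under that assumption the raw tier dominates (20,160 of roughly 31,000 rows per server), so shortening raw retention has the largest effect on table size.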

### Getting Started

1. **Enable Monitoring** - Monitoring is automatically enabled on all servers in Enterprise installations
2. **Configure Quotas** - Set organization quotas via License Management interface
3. **Access Dashboards** - Navigate to Resources → Monitoring to view real-time metrics
4. **Plan Capacity** - Use Capacity Planner to forecast resource needs
5. **Monitor Quotas** - Track organization usage in Organization Settings → Resources

### Next Steps

- [User Guide](./user-guide.md) - Dashboard walkthrough and feature tutorials
- [Administrator Guide](./admin-guide.md) - Configuration and quota management
- [Technical Reference](./technical-reference.md) - Architecture deep-dive and algorithms
- [API Reference](./api-reference.md) - Programmatic access to monitoring data
```

### User Guide Document Structure

**File:** `docs/enterprise/resource-monitoring/user-guide.md`

```markdown
# Resource Monitoring User Guide

## Accessing the Resource Dashboard

Navigate to **Resources → Monitoring** in the main navigation menu. The Resource Dashboard displays real-time metrics for all servers in your organization.

### Dashboard Overview

![Resource Dashboard Overview](./images/dashboard-overview.png)

The dashboard consists of four main sections:

1. **Server List** - Left sidebar showing all servers with health status indicators
2. **Metrics Charts** - Main area displaying CPU, memory, disk, and network usage over time
3. **Current Status** - Top bar showing aggregate statistics across all servers
4. **Server Details** - Right panel with detailed metrics for the selected server

### Understanding Health Status Indicators

Servers display colored health indicators based on resource utilization:

- **🟢 Green (Healthy)** - All resources below 70% utilization
- **🟡 Yellow (Warning)** - Any resource between 70-85% utilization
- **🔴 Red (Critical)** - Any resource above 85% utilization
- **⚫ Gray (Offline)** - Server unreachable or metrics collection failed
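
The rules above amount to classifying a server by its most-utilized resource. A minimal sketch, assuming that exactly 70% and exactly 85% both count as Warning (the list does not say which side the boundaries fall on); the function name and status strings are illustrative:

```python
from typing import Mapping, Optional


def health_status(utilization: Optional[Mapping[str, float]]) -> str:
    """Classify a server from its per-resource utilization percentages."""
    if not utilization:
        # No metrics at all: server unreachable or collection failed
        return "offline"
    worst = max(utilization.values())
    if worst > 85:
        return "critical"
    if worst >= 70:
        return "warning"
    return "healthy"


print(health_status({"cpu": 42.3, "memory": 76.0, "disk": 45.2}))  # warning
```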

### Real-Time Metrics

Metrics update automatically every 30 seconds over a WebSocket connection, with no page refresh required.

#### CPU Usage

Displays CPU utilization across all cores:

```
Current: 42% (12 cores)
Average (1h): 38%
Average (24h): 45%
Peak (24h): 78% at 14:23 UTC
```

**Interpreting CPU Metrics:**
- **0-50%** - Normal operation, sufficient capacity for new deployments
- **50-70%** - Moderate load, deployments may be routed to other servers
- **70-85%** - High load, new deployments redirected to other servers
- **85-100%** - Critical load, investigate resource-intensive applications

#### Memory Usage

Shows RAM allocation and availability:

```
Used: 24.3 GB / 32 GB (76%)
Available: 7.7 GB
Cached: 4.2 GB (can be freed)
Application Usage: 20.1 GB
```

**Memory States:**
- **Active** - Currently in use by applications
- **Cached** - File cache, automatically freed when needed
- **Available** - Free for immediate use
- **Swap Used** - Indicates memory pressure (should be minimal)

#### Disk Usage

Displays storage utilization by mount point:

```
/data - 450 GB / 1 TB (45%)
/var/lib/docker - 125 GB / 500 GB (25%)
/backups - 80 GB / 200 GB (40%)
```

**Disk Metrics:**
- **Total Size** - Physical disk capacity
- **Used Space** - Allocated storage
- **Available Space** - Free for new data
- **Inodes Used** - File count (important for containers)

#### Network Usage

Shows network throughput in/out:

```
Inbound: 125 Mbps (current)
Outbound: 85 Mbps (current)
Total (24h): 450 GB in / 320 GB out
Peak: 850 Mbps in at 18:45 UTC
```

### Time Range Selection

Use the time range selector to view metrics over different periods:

- **Last Hour** - High-resolution 30-second intervals
- **Last 24 Hours** - 5-minute aggregates
- **Last 7 Days** - 1-hour aggregates
- **Last 30 Days** - 1-hour aggregates
- **Last 90 Days** - Daily aggregates
- **Custom Range** - Select specific start/end dates

### Filtering and Sorting

#### Filter by Server Tags

Filter servers by tags to view specific groups:

```
Production: 8 servers
Staging: 4 servers
Development: 6 servers
Database: 3 servers
```

Click a tag name to filter the dashboard to servers with that tag.

#### Sort Servers

Sort server list by various criteria:

- **Name (A-Z / Z-A)**
- **CPU Usage (High to Low)**
- **Memory Usage (High to Low)**
- **Disk Usage (High to Low)**
- **Health Status (Critical First)**
- **Last Metric Update (Newest First)**

### Exporting Metrics

Export metrics for external analysis:

1. Click the **Export** button in the dashboard toolbar
2. Select the time range and the metrics to export
3. Choose a format: CSV, JSON, or Prometheus format
4. Download the file

**Example CSV Export:**

```csv
timestamp,server_id,server_name,cpu_percent,memory_percent,disk_percent
2025-10-06 14:30:00,15,production-app-1,42.3,68.5,45.2
2025-10-06 14:30:30,15,production-app-1,43.1,68.7,45.2
```

### Setting Up Alerts

Configure custom alerts for resource thresholds:

1. Navigate to **Resources → Monitoring → Alerts**
2. Click **Create Alert Rule**
3. Configure alert parameters:
   - **Metric**: CPU, Memory, Disk, Network, or Load Average
   - **Threshold**: Percentage or absolute value
   - **Duration**: How long threshold must be exceeded
   - **Severity**: Info, Warning, Critical
   - **Notification Channels**: Email, Slack, PagerDuty

**Example Alert:**

```
Alert: High CPU on Production Servers
Condition: CPU > 80% for 5 minutes
Severity: Warning
Notify: devops@company.com, #alerts-production
Actions: Send notification, create incident
```

### Capacity Planner

Access the Capacity Planner to forecast resource needs:

1. Navigate to **Resources → Capacity Planner**
2. View server capacity scores and recommendations
3. See predicted exhaustion dates based on current growth trends
4. Plan infrastructure scaling ahead of capacity issues

![Capacity Planner](./images/capacity-planner.png)

**Capacity Score Breakdown:**

Each server displays a capacity score (0-100) indicating deployment suitability:

```
Server: production-app-2
Capacity Score: 78 / 100

Breakdown:
  CPU Availability: 85% × 30% = 25.5 points
  Memory Availability: 75% × 30% = 22.5 points
  Disk Availability: 68% × 20% = 13.6 points
  Network Availability: 90% × 10% = 9.0 points
  Load Factor: 72% × 10% = 7.2 points

Total Score: 77.8 / 100 (rounded to 78)

Recommendation: Good deployment candidate
```

**Interpreting Scores:**
- **90-100** - Excellent capacity, ideal for deployments
- **70-89** - Good capacity, suitable for most deployments
- **50-69** - Moderate capacity, suitable for small/medium deployments
- **30-49** - Limited capacity, avoid new deployments unless necessary
- **0-29** - Critical capacity, do not deploy

### Organization Resource Quotas

View your organization's resource quotas and current usage:

1. Navigate to **Organization Settings → Resources**
2. View quota allocation by license tier
3. Monitor current usage percentages
4. See quota violation warnings

**Quota Dashboard Example:**

```
Organization: Acme Corporation
License: Professional Tier

Server Quota: 15 / 20 (75%) 🟡
Application Quota: 67 / 100 (67%) 🟢
CPU Quota: 52 cores / 80 cores (65%) 🟢
Memory Quota: 168 GB / 256 GB (66%) 🟢
Storage Quota: 1.2 TB / 2 TB (60%) 🟢

Status: Within limits
Next Review: 2025-11-15
Upgrade Options: Enterprise tier (200 servers, unlimited apps)
```

**Quota Warnings:**
- **🟢 Green (0-70%)** - Healthy usage
- **🟡 Yellow (70-90%)** - Approaching limit, consider planning expansion
- **🔴 Red (90-100%)** - Near limit, action required soon
- **⛔ Blocked (100%)** - Quota exceeded, cannot create new resources
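
The bands above can be expressed as a small classifier. This is a sketch under stated assumptions: the lower bound of each band is treated as inclusive, and the status names `critical` and `blocked` are illustrative labels rather than documented values.

```python
def quota_status(used: float, limit: float) -> str:
    """Map current usage against a quota limit to a status band."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    percent = used / limit * 100
    if percent >= 100:
        return "blocked"   # quota reached, new resources are rejected
    if percent >= 90:
        return "critical"  # red band
    if percent >= 70:
        return "warning"   # yellow band
    return "healthy"       # green band


print(quota_status(15, 20))    # servers from the example above
print(quota_status(168, 256))  # memory in GB from the example above
```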

### WebSocket Connection Status

The dashboard uses WebSocket for real-time updates. Connection status is shown in the top-right corner:

- **🟢 Connected** - Receiving real-time updates
- **🟡 Connecting** - Establishing connection
- **🔴 Disconnected** - No real-time updates until the connection is restored (refresh the page if automatic reconnection keeps failing)

If disconnected, the dashboard automatically attempts reconnection every 5 seconds.

### Performance Tips

**Optimize Dashboard Performance:**

1. **Limit Time Range** - Shorter ranges load faster
2. **Filter Servers** - Display only relevant servers
3. **Reduce Metric Types** - Hide unused metric charts
4. **Use Aggregated Views** - For historical data, use hour/day aggregates

**Browser Requirements:**
- Modern browser with WebSocket support (Chrome, Firefox, Safari, Edge)
- JavaScript enabled
- Minimum 2 GB RAM for large deployments (100+ servers)
```

### Administrator Guide Document Structure

**File:** `docs/enterprise/resource-monitoring/admin-guide.md`

```markdown
# Resource Monitoring Administrator Guide

## System Configuration

### Environment Variables

Configure monitoring behavior via environment variables in `.env`:

```bash
# Monitoring Collection
MONITORING_ENABLED=true
MONITORING_INTERVAL=30  # Seconds between collections
MONITORING_TIMEOUT=10   # SSH timeout for metric collection

# Metric Retention
METRICS_RAW_RETENTION_DAYS=7
METRICS_5MIN_RETENTION_DAYS=30
METRICS_HOURLY_RETENTION_DAYS=90
METRICS_DAILY_RETENTION_DAYS=365

# Performance Tuning
MONITORING_CONCURRENT_SERVERS=10  # Parallel metric collection
MONITORING_REDIS_CACHE_TTL=60     # Cache duration in seconds
MONITORING_BATCH_SIZE=100         # Metrics per database insert

# WebSocket Broadcasting
MONITORING_BROADCAST_ENABLED=true
MONITORING_BROADCAST_CHANNEL=resource-metrics

# Alerting
MONITORING_ALERT_ENABLED=true
MONITORING_ALERT_EMAIL=devops@company.com
```

### Database Configuration

Monitoring uses the `server_resource_metrics` and `organization_resource_usage` tables.

**Partitioning Configuration (PostgreSQL):**

```sql
-- Enable partitioning for large installations
CREATE TABLE server_resource_metrics_2025_10 PARTITION OF server_resource_metrics
FOR VALUES FROM ('2025-10-01') TO ('2025-11-01');

-- Automatic partition creation via cron (crontab entry, not SQL):
-- 0 0 1 * * php /path/to/coolify/artisan monitoring:create-partition
```

**Indexing:**

```sql
-- Performance indexes (automatically created by migration)
CREATE INDEX idx_metrics_server_timestamp ON server_resource_metrics(server_id, collected_at DESC);
CREATE INDEX idx_metrics_org_timestamp ON organization_resource_usage(organization_id, period_start DESC);
-- PostgreSQL rejects NOW() in a partial-index predicate (it is not IMMUTABLE), so index the full column
CREATE INDEX idx_metrics_collected_at ON server_resource_metrics(collected_at);
```

### Redis Caching Configuration

Metrics are cached in Redis for performance:

```
Cache Keys:
  - monitoring:server:{server_id}:latest      # Latest metrics (60s TTL)
  - monitoring:org:{org_id}:usage             # Organization totals (300s TTL)
  - monitoring:capacity:scores                # Capacity scores (60s TTL)

Memory Usage: ~10 KB per server × server count
Example: 100 servers = ~1 MB Redis memory
```

**Redis Configuration:**

```php
// config/database.php
'redis' => [
    'monitoring' => [
        'host' => env('REDIS_HOST', '127.0.0.1'),
        'password' => env('REDIS_PASSWORD', null),
        'port' => env('REDIS_PORT', 6379),
        'database' => env('REDIS_MONITORING_DB', 2),
    ],
],
```

### Scheduled Jobs Configuration

Monitoring requires scheduled jobs in `app/Console/Kernel.php`:

```php
protected function schedule(Schedule $schedule)
{
    // Resource metric collection (every 30 seconds).
    // runInBackground() only applies to command()/exec() tasks, so it is omitted for queued jobs.
    $schedule->job(new ResourceMonitoringJob)
        ->everyThirtySeconds()
        ->withoutOverlapping();

    // Capacity score calculation (every 5 minutes)
    $schedule->job(new CapacityAnalysisJob)
        ->everyFiveMinutes()
        ->withoutOverlapping();

    // Organization usage aggregation (hourly)
    $schedule->job(new OrganizationUsageAggregationJob)
        ->hourly();

    // Metric cleanup (daily at 2 AM)
    $schedule->command('monitoring:cleanup-old-metrics')
        ->dailyAt('02:00');

    // Alert processing (every minute)
    $schedule->job(new AlertProcessingJob)
        ->everyMinute()
        ->when(fn() => config('monitoring.alerts.enabled'));
}
```

**Ensure Horizon is running for job processing:**

```bash
php artisan horizon
```

### Organization Quota Configuration

Configure quotas via the License Management interface or directly in the database:

**Via UI:**

1. Navigate to **Admin → Organizations → {Organization} → License**
2. Select license tier (Starter, Professional, Enterprise, Custom)
3. Configure custom quotas if using Custom tier
4. Save changes

**Via Database:**

```sql
UPDATE enterprise_licenses
SET quota_max_servers = 50,
    quota_max_applications = 200,
    quota_max_cpu_cores = 200,
    quota_max_memory_gb = 512,
    quota_max_storage_tb = 5
WHERE organization_id = 123;
```

**Quota Enforcement:**

Quotas are enforced at resource creation:

```php
// Example quota check (performed automatically at resource creation)
$organization = auth()->user()->currentOrganization();

$count = $organization->servers()->count();
$limit = $organization->license->quota_max_servers;

if ($count >= $limit) {
    throw new QuotaExceededException(
        "Server quota exceeded. Current: {$count}, Limit: {$limit}. "
        . 'Please upgrade your license to increase limits.'
    );
}
```

### Server Selection Algorithm Configuration

Customize server scoring weights in `config/capacity.php`:

```php
return [
    'scoring' => [
        'weights' => [
            'cpu' => env('CAPACITY_WEIGHT_CPU', 0.30),       // 30%
            'memory' => env('CAPACITY_WEIGHT_MEMORY', 0.30), // 30%
            'disk' => env('CAPACITY_WEIGHT_DISK', 0.20),     // 20%
            'network' => env('CAPACITY_WEIGHT_NETWORK', 0.10), // 10%
            'load' => env('CAPACITY_WEIGHT_LOAD', 0.10),     // 10%
        ],

        'thresholds' => [
            'minimum_score' => 30,  // Don't deploy to servers below this score
            'preferred_score' => 70, // Prefer servers above this score
        ],

        'penalties' => [
            'recent_deployment' => 10,  // Reduce score for servers with recent deployment
            'high_load' => 20,          // Additional penalty for load > 80%
            'low_disk' => 15,           // Penalty for disk > 85%
        ],
    ],
];
```
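
Because the weights can be overridden through environment variables, it is easy to end up with values that no longer sum to 1.0. The following Python sketch shows the kind of sanity check an administrator could run against the configured values; `validate_weights` is a hypothetical helper, not part of Coolify.

```python
import math
from typing import Mapping


def validate_weights(weights: Mapping[str, float]) -> None:
    """Raise ValueError unless every weight is in [0, 1] and they sum to 1.0."""
    for name, value in weights.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"weight '{name}' is out of range: {value}")
    total = sum(weights.values())
    # Compare with a tolerance: binary floats rarely sum to exactly 1.0
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total:.4f}, expected 1.0")


defaults = {"cpu": 0.30, "memory": 0.30, "disk": 0.20, "network": 0.10, "load": 0.10}
validate_weights(defaults)  # passes silently
```

Raising one weight without lowering another (for example `CAPACITY_WEIGHT_CPU=0.50` alone) would fail this check, since scores could then exceed 100.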

### Metric Collection SSH Configuration

Monitoring connects to servers via SSH to collect metrics:

**SSH Key Setup:**

```bash
# Generate monitoring-specific SSH key
ssh-keygen -t ed25519 -f ~/.ssh/coolify_monitoring -C "coolify-monitoring"

# Add public key to all servers
ssh-copy-id -i ~/.ssh/coolify_monitoring.pub user@server
```

**Configure in `.env`:**

```bash
MONITORING_SSH_KEY_PATH=/home/coolify/.ssh/coolify_monitoring
MONITORING_SSH_USER=coolify
MONITORING_SSH_PORT=22
```

**Required Server Commands:**

Monitoring executes these commands via SSH (ensure user has permissions):

```bash
# CPU and load average
cat /proc/stat
cat /proc/loadavg

# Memory
cat /proc/meminfo

# Disk
df -h /
df -i /  # Inode usage

# Network
cat /proc/net/dev

# Docker (if installed)
docker stats --no-stream --format "{{json .}}"
```
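
`/proc/stat` exposes cumulative CPU counters, so a utilization percentage has to be derived from two samples taken some time apart. The sketch below shows the standard calculation in Python; it is not the collector Coolify ships, and the function names are illustrative.

```python
from typing import Tuple


def parse_cpu_line(stat_text: str) -> Tuple[int, int]:
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            # Fields: user nice system idle iowait irq softirq steal [guest guest_nice]
            fields = [int(value) for value in line.split()[1:]]
            idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
            # Guest time is already included in user/nice, so sum only the first 8
            total = sum(fields[:8])
            return idle, total
    raise ValueError("no aggregate cpu line found")


def cpu_percent(before: str, after: str) -> float:
    """CPU utilization between two /proc/stat snapshots, as a percentage."""
    idle_0, total_0 = parse_cpu_line(before)
    idle_1, total_1 = parse_cpu_line(after)
    total_delta = total_1 - total_0
    if total_delta <= 0:
        return 0.0
    busy_fraction = 1 - (idle_1 - idle_0) / total_delta
    return round(100.0 * busy_fraction, 1)


sample_a = "cpu  100 0 50 800 50 0 0 0 0 0\ncpu0 100 0 50 800 50 0 0 0 0 0"
sample_b = "cpu  160 0 80 1200 60 0 0 0 0 0\ncpu0 160 0 80 1200 60 0 0 0 0 0"
print(cpu_percent(sample_a, sample_b))  # 18.0
```

Here idle time includes `iowait`; whether to count I/O wait as idle or busy is a design choice that changes the reported figure on disk-bound servers.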

### Alert Configuration

Configure alert rules and notification channels:

**Alert Rule Structure:**

```json
{
  "name": "High CPU Usage",
  "metric": "cpu_percent",
  "condition": "greater_than",
  "threshold": 80,
  "duration_seconds": 300,
  "severity": "warning",
  "notification_channels": ["email", "slack"],
  "actions": ["notify", "create_incident"]
}
```
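
The `duration_seconds` field means a single spike should not trigger the alert; the threshold has to be exceeded continuously for the whole window. One way to evaluate that is sketched below in Python. It handles only the `greater_than` condition shown above, and `rule_fires` is a hypothetical name rather than the `AlertProcessingJob` logic.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, float]  # (collected_at, metric value)


def rule_fires(rule: dict, samples: List[Sample]) -> bool:
    """True if the newest samples breach the threshold for the full duration.

    `samples` must be sorted oldest first.
    """
    if rule["condition"] != "greater_than":
        raise ValueError(f"unsupported condition: {rule['condition']}")
    window = timedelta(seconds=rule["duration_seconds"])
    breach_start = None
    for timestamp, value in samples:
        if value > rule["threshold"]:
            if breach_start is None:
                breach_start = timestamp  # start of the current breach streak
        else:
            breach_start = None           # streak broken, start over
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= window


rule = {"condition": "greater_than", "threshold": 80, "duration_seconds": 300}
start = datetime(2025, 10, 6, 15, 0, 0)
sustained = [(start + timedelta(seconds=30 * i), 85.0) for i in range(11)]
print(rule_fires(rule, sustained))  # True: 300 seconds above 80%
```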

**Notification Channels:**

```bash
# Email
ALERT_EMAIL_ENABLED=true
ALERT_EMAIL_TO=devops@company.com
ALERT_EMAIL_FROM=alerts@coolify.company.com

# Slack
ALERT_SLACK_ENABLED=true
ALERT_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
ALERT_SLACK_CHANNEL=#alerts-production

# PagerDuty
ALERT_PAGERDUTY_ENABLED=true
ALERT_PAGERDUTY_INTEGRATION_KEY=...

# Webhook
ALERT_WEBHOOK_ENABLED=true
ALERT_WEBHOOK_URL=https://monitoring.company.com/webhook
ALERT_WEBHOOK_SECRET=...
```

### Performance Tuning

**High-Volume Deployments (100+ servers):**

1. **Increase Concurrent Collection:**

```bash
MONITORING_CONCURRENT_SERVERS=20
```

2. **Enable Database Connection Pooling:**

```bash
DB_CONNECTION_POOL_MIN=5
DB_CONNECTION_POOL_MAX=20
```

3. **Partition Metrics Table:**

```bash
php artisan monitoring:enable-partitioning
```

4. **Use Dedicated Redis Instance:**

```bash
REDIS_MONITORING_HOST=redis-monitoring.internal
```

5. **Enable Metric Batching:**

```bash
MONITORING_BATCH_SIZE=500
MONITORING_BATCH_INTERVAL=10  # Seconds
```

**Monitoring the Monitoring System:**

Track monitoring system performance:

```sql
-- Job execution time
SELECT AVG(execution_time), MAX(execution_time)
FROM jobs_log
WHERE job_type = 'ResourceMonitoringJob'
AND created_at > NOW() - INTERVAL '1 hour';

-- Metric collection failures
SELECT server_id, COUNT(*) as failures
FROM server_resource_metrics_failures
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY server_id
ORDER BY failures DESC;
```

### Backup and Disaster Recovery

**Metrics Backup Strategy:**

1. **Database Backup** - Include metrics tables in regular database backups
2. **Time-Series Export** - Daily export to S3 for long-term storage
3. **Redis Persistence** - Enable RDB snapshots for cache recovery

**Backup Configuration:**

```bash
# Daily metrics export to S3
php artisan monitoring:export-metrics --days=7 --s3-bucket=coolify-metrics-backup
```

**Recovery Procedures:**

```bash
# Restore metrics from S3 backup
php artisan monitoring:import-metrics --s3-bucket=coolify-metrics-backup --date=2025-10-01

# Rebuild capacity scores
php artisan monitoring:rebuild-capacity-scores

# Regenerate aggregates
php artisan monitoring:regenerate-aggregates --from=2025-10-01 --to=2025-10-06
```

### Troubleshooting Admin Issues

**Metrics Not Collecting:**

1. Check Horizon is running: `php artisan horizon:status`
2. Verify SSH connectivity: `ssh -i $MONITORING_SSH_KEY_PATH $MONITORING_SSH_USER@server`
3. Check job failures: `php artisan queue:failed`
4. Review logs: `tail -f storage/logs/monitoring.log`

**High Database Load:**

1. Enable metric partitioning
2. Increase batch size
3. Review index usage: `EXPLAIN SELECT * FROM server_resource_metrics WHERE ...`
4. Archive old metrics: `php artisan monitoring:archive --before=2024-01-01`

**WebSocket Connection Issues:**

1. Verify the Laravel Reverb server process (started with `php artisan reverb:start`) is running
2. Check firewall allows WebSocket port (default 8080)
3. Test WebSocket connection: `wscat -c ws://coolify.company.com:8080/apps/monitoring`

**Capacity Scores Incorrect:**

1. Rebuild scores: `php artisan monitoring:rebuild-capacity-scores`
2. Verify configuration weights sum to 1.0
3. Check recent metrics are available: `SELECT MAX(collected_at) FROM server_resource_metrics`
```

### API Reference Document Structure

**File:** `docs/enterprise/resource-monitoring/api-reference.md`

```markdown
# Resource Monitoring API Reference

## Authentication

All API endpoints require authentication via Sanctum token with `monitoring:read` or `monitoring:write` abilities.

**Request Header:**

```
Authorization: Bearer {your-api-token}
```

## Endpoints

### GET /api/v1/monitoring/servers

Get monitoring data for all servers in the organization.

**Query Parameters:**

```
time_range: string (1h, 24h, 7d, 30d, 90d) - Default: 1h
metrics: string[] - Comma-separated list (cpu,memory,disk,network,load)
server_ids: integer[] - Filter by server IDs
tags: string[] - Filter by server tags
```

**Example Request:**

```bash
curl -X GET "https://coolify.company.com/api/v1/monitoring/servers?time_range=24h&metrics=cpu,memory" \
  -H "Authorization: Bearer {token}"
```

**Example Response:**

```json
{
  "data": [
    {
      "server_id": 15,
      "server_name": "production-app-1",
      "metrics": {
        "cpu": {
          "current": 42.3,
          "average_1h": 38.5,
          "average_24h": 45.2,
          "peak_24h": 78.1,
          "peak_timestamp": "2025-10-06T14:23:00Z"
        },
        "memory": {
          "total_gb": 32,
          "used_gb": 24.3,
          "available_gb": 7.7,
          "cached_gb": 4.2,
          "percent_used": 76.0
        }
      },
      "health_status": "warning",
      "last_collected_at": "2025-10-06T15:30:00Z"
    }
  ],
  "meta": {
    "total_servers": 15,
    "healthy": 12,
    "warning": 2,
    "critical": 1,
    "offline": 0
  }
}
```
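
For scripted access, the same request can be made from Python using only the standard library. This is a minimal sketch: the endpoint, query parameters, and header come from this reference, while the helper names (`servers_url`, `fetch_servers`, `servers_needing_attention`) are illustrative and error handling is omitted.

```python
import json
import urllib.parse
import urllib.request
from typing import Iterable, List


def servers_url(base_url: str, time_range: str = "1h", metrics: Iterable[str] = ()) -> str:
    """Build the URL for GET /api/v1/monitoring/servers."""
    query = {"time_range": time_range}
    metric_list = ",".join(metrics)
    if metric_list:
        query["metrics"] = metric_list
    # safe="," keeps the comma-separated list readable instead of encoding it as %2C
    encoded = urllib.parse.urlencode(query, safe=",")
    return f"{base_url.rstrip('/')}/api/v1/monitoring/servers?{encoded}"


def fetch_servers(base_url: str, token: str, **params) -> dict:
    """Call the endpoint and return the decoded JSON body."""
    request = urllib.request.Request(
        servers_url(base_url, **params),
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def servers_needing_attention(payload: dict) -> List[str]:
    """Names of servers whose health_status is warning or critical."""
    return [
        server["server_name"]
        for server in payload["data"]
        if server["health_status"] in ("warning", "critical")
    ]
```

With the example response above, `servers_needing_attention` would return `["production-app-1"]`.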

### GET /api/v1/monitoring/servers/{server_id}

Get detailed monitoring data for a specific server.

**Path Parameters:**

```
server_id: integer (required)
```

**Query Parameters:**

```
time_range: string - Default: 1h
granularity: string (raw, 5min, 1hour, 1day) - Default: auto
```

**Example Request:**

```bash
curl -X GET "https://coolify.company.com/api/v1/monitoring/servers/15?time_range=7d&granularity=1hour" \
  -H "Authorization: Bearer {token}"
```

**Example Response:**

```json
{
  "server_id": 15,
  "server_name": "production-app-1",
  "organization_id": 5,
  "time_series": [
    {
      "timestamp": "2025-10-06T14:00:00Z",
      "cpu_percent": 42.3,
      "memory_used_gb": 24.3,
      "memory_percent": 76.0,
      "disk_used_gb": 450,
      "disk_percent": 45.0,
      "network_in_mbps": 125,
      "network_out_mbps": 85,
      "load_average_1m": 1.2,
      "load_average_5m": 1.5,
      "load_average_15m": 1.8
    }
  ],
  "capacity_score": 62,
  "capacity_breakdown": {
    "cpu_score": 18,
    "memory_score": 15,
    "disk_score": 14,
    "network_score": 8,
    "load_score": 7
  }
}
```

### GET /api/v1/monitoring/organizations/{org_id}/usage

Get organization-wide resource usage and quota information.

**Path Parameters:**

```
org_id: integer (required)
```

**Example Request:**

```bash
curl -X GET "https://coolify.company.com/api/v1/monitoring/organizations/5/usage" \
  -H "Authorization: Bearer {token}"
```

**Example Response:**

```json
{
  "organization_id": 5,
  "organization_name": "Acme Corporation",
  "license_tier": "professional",
  "quotas": {
    "max_servers": 20,
    "max_applications": 100,
    "max_cpu_cores": 80,
    "max_memory_gb": 256,
    "max_storage_tb": 2
  },
  "current_usage": {
    "servers": {
      "count": 15,
      "percent": 75.0,
      "status": "warning"
    },
    "applications": {
      "count": 67,
      "percent": 67.0,
      "status": "healthy"
    },
    "cpu_cores": {
      "allocated": 52,
      "percent": 65.0,
      "status": "healthy"
    },
    "memory_gb": {
      "allocated": 168,
      "percent": 65.6,
      "status": "healthy"
    },
    "storage_tb": {
      "allocated": 1.2,
      "percent": 60.0,
      "status": "healthy"
    }
  },
  "trending": {
    "servers_7d_growth": 2,
    "applications_7d_growth": 8,
    "predicted_server_exhaustion_date": "2026-02-15"
  }
}
```
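
The `percent` values in the example are the allocated amount divided by the matching quota. The Python sketch below reproduces them from the example figures. The status thresholds are not defined in this section: `WARNING_AT` and `CRITICAL_AT` are hypothetical values, chosen only so that 75.0% maps to `warning` and 67.0% to `healthy` as in the example.

```python
# Hypothetical thresholds: this section does not define them. 75 agrees with
# the example above (75.0% -> "warning", 67.0% -> "healthy"); 90 is a guess.
WARNING_AT = 75.0
CRITICAL_AT = 90.0


def usage_percent(used: float, quota: float) -> float:
    """Return usage as a percentage of the quota, rounded to one decimal."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    return round(used / quota * 100, 1)


def usage_status(percent: float) -> str:
    """Map a usage percentage to a status label using the assumed thresholds."""
    if percent >= CRITICAL_AT:
        return "critical"
    if percent >= WARNING_AT:
        return "warning"
    return "healthy"


# Figures copied from the example response above.
quotas = {"servers": 20, "applications": 100, "cpu_cores": 80, "memory_gb": 256, "storage_tb": 2}
used = {"servers": 15, "applications": 67, "cpu_cores": 52, "memory_gb": 168, "storage_tb": 1.2}

report = {name: usage_percent(used[name], quotas[name]) for name in quotas}
statuses = {name: usage_status(percent) for name, percent in report.items()}
```

With these inputs, `report` matches the `percent` fields shown in the response, and only `servers` is flagged as `warning`.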

### GET /api/v1/monitoring/capacity/scores

Get capacity scores for all servers to determine optimal deployment targets.

**Query Parameters:**

```
min_score: integer - Minimum score threshold (0-100)
server_tags: string[] - Filter by tags
sort: string (score_desc, score_asc, name) - Default: score_desc
```

**Example Request:**

```bash
curl -X GET "https://coolify.company.com/api/v1/monitoring/capacity/scores?min_score=50&server_tags=production" \
  -H "Authorization: Bearer {token}"
```

**Example Response:**

```json
{
  "data": [
    {
      "server_id": 18,
      "server_name": "production-app-4",
      "capacity_score": 85,
      "recommendation": "excellent",
      "breakdown": {
        "cpu_availability": 90,
        "memory_availability": 85,
        "disk_availability": 80,
        "network_availability": 88,
        "load_factor": 75
      },
      "weighted_scores": {
        "cpu": 27.0,
        "memory": 25.5,
        "disk": 16.0,
        "network": 8.8,
        "load": 7.5
      },
      "suitable_for_deployment": true,
      "estimated_deployments_capacity": 8
    }
  ],
  "meta": {
    "total_servers": 15,
    "suitable_servers": 12,
    "best_server_id": 18
  }
}
```
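
The relationship between `breakdown`, `weighted_scores`, and `capacity_score` in the example can be sketched as a weighted sum. The weights below are inferred from the example numbers (for instance, 27.0 / 90 = 0.30) and are an assumption, not a confirmed description of the scoring algorithm.

```python
# Weights inferred from the example response (e.g. 27.0 / 90 = 0.30). They are
# an assumption, not a confirmed part of the scoring algorithm.
ASSUMED_WEIGHTS = {
    "cpu_availability": 0.30,
    "memory_availability": 0.30,
    "disk_availability": 0.20,
    "network_availability": 0.10,
    "load_factor": 0.10,
}


def weighted_scores(breakdown: dict) -> dict:
    """Multiply each 0-100 availability value by its assumed weight."""
    return {key: round(breakdown[key] * weight, 1) for key, weight in ASSUMED_WEIGHTS.items()}


def capacity_score(breakdown: dict) -> int:
    """Sum the weighted components and round to the nearest integer."""
    total = sum(breakdown[key] * weight for key, weight in ASSUMED_WEIGHTS.items())
    return round(total)


# Values copied from the example response above.
breakdown = {
    "cpu_availability": 90,
    "memory_availability": 85,
    "disk_availability": 80,
    "network_availability": 88,
    "load_factor": 75,
}

components = weighted_scores(breakdown)
score = capacity_score(breakdown)
```

Under these assumed weights, the components sum to 84.8, which rounds to the `capacity_score` of 85 shown in the response.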

### POST /api/v1/monitoring/servers/{server_id}/metrics

Manually submit metrics for a server (for custom monitoring integrations).

**Path Parameters:**

```
server_id: integer (required)
```

**Request Body:**

```json
{
  "timestamp": "2025-10-06T15:30:00Z",
  "metrics": {
    "cpu_percent": 42.3,
    "memory_used_gb": 24.3,
    "memory_total_gb": 32,
    "disk_used_gb": 450,
    "disk_total_gb": 1000,
    "network_in_mbps": 125,
    "network_out_mbps": 85,
    "load_average_1m": 1.2,
    "load_average_5m": 1.5,
    "load_average_15m": 1.8
  }
}
```

**Example Request:**

```bash
curl -X POST "https://coolify.company.com/api/v1/monitoring/servers/15/metrics" \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{
    "timestamp": "2025-10-06T15:30:00Z",
    "metrics": {
      "cpu_percent": 42.3,
      "memory_used_gb": 24.3,
      "memory_total_gb": 32
    }
  }'
```

**Example Response:**

```json
{
  "success": true,
  "message": "Metrics stored successfully",
  "server_id": 15,
  "timestamp": "2025-10-06T15:30:00Z"
}
```
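
For a custom integration, the same request can be assembled in code. The sketch below uses only the Python standard library and builds the request without sending it. `build_metrics_request` is a hypothetical helper name, and the token value is a placeholder.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_metrics_request(base_url: str, token: str, server_id: int, metrics: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one metrics sample."""
    body = {
        # UTC timestamp in the same ISO 8601 form as the examples above.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "metrics": metrics,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v1/monitoring/servers/{server_id}/metrics",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_metrics_request(
    "https://coolify.company.com",
    "example-token",  # placeholder, not a real token
    15,
    {"cpu_percent": 42.3, "memory_used_gb": 24.3, "memory_total_gb": 32},
)
```

Sending is left to the caller, for example with `urllib.request.urlopen(request, timeout=10)`. Keeping construction separate from transport makes the payload easy to inspect and test.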

### GET /api/v1/monitoring/alerts

Get active alerts and alert history.

**Query Parameters:**

```
status: string (active, resolved, all) - Default: active
severity: string (info, warning, critical) - Filter by severity
server_ids: integer[] - Filter by server
time_range: string (1h, 24h, 7d, 30d) - Default: 24h
```

**Example Response:**

```json
{
  "data": [
    {
      "alert_id": 1234,
      "server_id": 15,
      "server_name": "production-app-1",
      "metric": "cpu_percent",
      "condition": "greater_than",
      "threshold": 80,
      "current_value": 85.2,
      "severity": "warning",
      "status": "active",
      "triggered_at": "2025-10-06T15:25:00Z",
      "duration_seconds": 300,
      "notification_sent": true,
      "notification_channels": ["email", "slack"]
    }
  ],
  "meta": {
    "total_alerts": 1,
    "active": 1,
    "resolved_24h": 5
  }
}
```
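
A request URL for this endpoint can be built from the documented query parameters. The Python sketch below validates `status`, `severity`, and `time_range` against the values listed above before encoding them. `alerts_url` is a hypothetical helper, and `server_ids` is left out because this section does not specify how array parameters are encoded.

```python
from urllib.parse import urlencode

# Allowed values copied from the query parameter list above.
ALLOWED = {
    "status": {"active", "resolved", "all"},
    "severity": {"info", "warning", "critical"},
    "time_range": {"1h", "24h", "7d", "30d"},
}


def alerts_url(base_url: str, **filters: str) -> str:
    """Return the alerts URL after validating each filter value."""
    for name, value in filters.items():
        if name not in ALLOWED:
            raise ValueError(f"unsupported filter: {name}")
        if value not in ALLOWED[name]:
            raise ValueError(f"invalid {name}: {value}")
    path = f"{base_url}/api/v1/monitoring/alerts"
    # Sort the parameters so the same filters always yield the same URL.
    query = urlencode(sorted(filters.items()))
    return f"{path}?{query}" if query else path


url = alerts_url(
    "https://coolify.company.com",
    status="active",
    severity="critical",
    time_range="7d",
)
```

Rejecting unknown values on the client side avoids spending a rate-limited request on a call that would presumably fail with `VALIDATION_ERROR`.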

## Rate Limits

API endpoints are rate-limited based on organization license tier:

- **Starter:** 100 requests per minute
- **Professional:** 500 requests per minute
- **Enterprise:** 2000 requests per minute

Rate limit headers are included in all responses:

```
X-RateLimit-Limit: 500
X-RateLimit-Remaining: 487
X-RateLimit-Reset: 1696611600
```
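
A client can use these headers to decide how long to pause before retrying. The Python sketch below assumes that `X-RateLimit-Reset` is a Unix timestamp in seconds, which the example value suggests but this section does not state. `seconds_until_reset` is a hypothetical helper name.

```python
import time
from typing import Optional


def seconds_until_reset(headers: dict, now: Optional[float] = None) -> float:
    """Return how long to wait before the next request, or 0.0 if quota remains."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    # Assumption: X-RateLimit-Reset is a Unix timestamp in seconds.
    reset_at = float(headers["X-RateLimit-Reset"])
    current = time.time() if now is None else now
    # Never return a negative wait if the reset time has already passed.
    return max(0.0, reset_at - current)


exhausted = {
    "X-RateLimit-Limit": "500",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1696611600",
}
wait = seconds_until_reset(exhausted, now=1696611570.0)
```

The `now` argument exists so the calculation can be tested with a fixed clock. Note that the plain `dict` lookup here is case-sensitive, while HTTP header names are not, so a real client should normalize header names first.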

## Error Handling

Standard error response format:

```json
{
  "error": {
    "code": "QUOTA_EXCEEDED",
    "message": "Server quota exceeded. Current: 20, Limit: 20.",
    "details": {
      "current_count": 20,
      "max_count": 20,
      "license_tier": "professional"
    }
  }
}
```
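
A client can turn this envelope into a typed exception. The Python sketch below is one possible approach: `ApiError`, `parse_error`, and the `UNKNOWN` fallback code are hypothetical names introduced for illustration, not part of the API.

```python
import json


class ApiError(Exception):
    """Exception carrying the fields of the documented error envelope."""

    def __init__(self, code: str, message: str, details: dict):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details


def parse_error(body: str) -> ApiError:
    """Convert a JSON error body into an ApiError."""
    error = json.loads(body).get("error", {})
    return ApiError(
        code=error.get("code", "UNKNOWN"),  # hypothetical fallback value
        message=error.get("message", ""),
        details=error.get("details", {}),
    )


# Body copied from the example above.
example_body = """
{
  "error": {
    "code": "QUOTA_EXCEEDED",
    "message": "Server quota exceeded. Current: 20, Limit: 20.",
    "details": {"current_count": 20, "max_count": 20, "license_tier": "professional"}
  }
}
"""

err = parse_error(example_body)
```

Callers can then branch on `err.code`, for example retrying on `RATE_LIMIT_EXCEEDED` and surfacing `QUOTA_EXCEEDED` to the user together with `err.details`.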

**Common Error Codes:**

- `UNAUTHORIZED` - Invalid or missing API token
- `FORBIDDEN` - Insufficient permissions
- `NOT_FOUND` - Resource not found
- `QUOTA_EXCEEDED` - Organization quota limit reached
- `RATE_LIMIT_EXCEEDED` - API rate limit exceeded
- `VALIDATION_ERROR` - Invalid request parameters
```

## Implementation Approach

### Step 1: Create Documentation Directory Structure
1. Create the `docs/enterprise/resource-monitoring/` directory under the project root
2. Create subdirectories: `images/`, `examples/`
3. Set up markdown file templates
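
The three steps above could be scripted. The Python sketch below is one way to do it, assuming the file names listed in the documentation quality test and a one-line heading as the template. `scaffold` and `DOC_FILES` are hypothetical names.

```python
from pathlib import Path

# File names taken from the required-files list in the documentation test.
DOC_FILES = [
    "overview.md",
    "user-guide.md",
    "admin-guide.md",
    "technical-reference.md",
    "api-reference.md",
    "troubleshooting.md",
    "best-practices.md",
    "configuration.md",
]


def scaffold(project_root: Path) -> list:
    """Create the docs directory, its subdirectories, and stub files."""
    docs_dir = project_root / "docs" / "enterprise" / "resource-monitoring"
    for sub in ("images", "examples"):
        (docs_dir / sub).mkdir(parents=True, exist_ok=True)

    created = []
    for name in DOC_FILES:
        path = docs_dir / name
        if path.exists():
            continue  # never overwrite existing work
        # Assumed template: a single top-level heading derived from the file name.
        title = Path(name).stem.replace("-", " ").title()
        path.write_text(f"# {title}\n", encoding="utf-8")
        created.append(path)
    return created
```

Calling `scaffold(Path.cwd())` from the project root creates any missing stubs and returns their paths. Running it again returns an empty list because existing files are skipped.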

### Step 2: Write Core Documentation Files
1. Start with `overview.md` - Feature introduction and architecture
2. Write `user-guide.md` - Dashboard walkthrough with screenshots
3. Create `admin-guide.md` - Configuration and system administration
4. Develop `technical-reference.md` - Deep technical details

### Step 3: Create API Documentation
1. Document all monitoring API endpoints
2. Include request/response examples for each endpoint
3. Add authentication and rate limiting information
4. Create code examples in multiple languages (curl, PHP, JavaScript)

### Step 4: Write Troubleshooting Guide
1. Document common issues and resolutions
2. Create diagnostic procedures
3. Add performance tuning recommendations
4. Include recovery procedures

### Step 5: Develop Best Practices Guide
1. Resource planning recommendations
2. Quota sizing guidelines
3. Monitoring optimization strategies
4. Integration patterns

### Step 6: Create Supporting Materials
1. Take screenshots of all UI components
2. Create architecture diagrams
3. Generate code examples
4. Build sample configurations

### Step 7: Review and Testing
1. Technical review by engineering team
2. User testing with sample documentation
3. Validate all code examples
4. Check internal links and references

### Step 8: Publication and Maintenance
1. Integrate into main Coolify documentation site
2. Create changelog tracking
3. Set up feedback mechanism
4. Schedule periodic reviews

## Test Strategy

### Documentation Quality Tests

**File:** `tests/Documentation/ResourceMonitoringDocsTest.php`

```php
<?php

use App\Models\User;
use Illuminate\Support\Facades\File;

it('has all required documentation files', function () {
    $requiredFiles = [
        'docs/enterprise/resource-monitoring/overview.md',
        'docs/enterprise/resource-monitoring/user-guide.md',
        'docs/enterprise/resource-monitoring/admin-guide.md',
        'docs/enterprise/resource-monitoring/technical-reference.md',
        'docs/enterprise/resource-monitoring/api-reference.md',
        'docs/enterprise/resource-monitoring/troubleshooting.md',
        'docs/enterprise/resource-monitoring/best-practices.md',
        'docs/enterprise/resource-monitoring/configuration.md',
    ];

    foreach ($requiredFiles as $file) {
        expect(File::exists(base_path($file)))->toBeTrue("File {$file} is missing");
    }
});

it('documentation has valid markdown syntax', function () {
    $docFiles = File::glob(base_path('docs/enterprise/resource-monitoring/*.md'));

    foreach ($docFiles as $file) {
        $content = File::get($file);

        // Check for balanced code blocks
        $backtickCount = substr_count($content, '```');
        expect($backtickCount % 2)->toBe(0, "Unbalanced code blocks in {$file}");

        // Check for valid headers
        preg_match_all('/^(#{1,6})\s+/m', $content, $headers);
        expect($headers[0])->not->toBeEmpty("No headers found in {$file}");
    }
});

it('all code examples are syntactically valid', function () {
    $docFiles = File::glob(base_path('docs/enterprise/resource-monitoring/*.md'));

    foreach ($docFiles as $file) {
        $content = File::get($file);

        // Extract PHP code blocks
        preg_match_all('/```php\n(.*?)```/s', $content, $phpBlocks);

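        foreach ($phpBlocks[1] as $index => $code) {
            // php -l treats input without an opening tag as inline HTML and always passes it,
            // so prepend the tag when a snippet omits it
            $source = str_starts_with(ltrim($code), '<?php') ? $code : "<?php\n" . $code;
            $result = shell_exec('echo ' . escapeshellarg($source) . ' | php -l 2>&1');
            expect($result)->toContain('No syntax errors', "Syntax error in {$file} block #{$index}");
        }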
    }
});

it('all internal links are valid', function () {
    $docFiles = File::glob(base_path('docs/enterprise/resource-monitoring/*.md'));

    foreach ($docFiles as $file) {
        $content = File::get($file);
        $dir = dirname($file);

        // Extract markdown links [text](path)
        preg_match_all('/\[([^\]]+)\]\(([^)]+)\)/', $content, $links);

        foreach ($links[2] as $link) {
            // Skip external links
            if (str_starts_with($link, 'http')) {
                continue;
            }

            // Skip anchors
            if (str_starts_with($link, '#')) {
                continue;
            }

            // Check file exists
            $linkPath = $dir . '/' . $link;
            expect(File::exists($linkPath))->toBeTrue("Broken link: {$link} in {$file}");
        }
    }
});

it('API endpoint examples return valid responses', function () {
    // Test actual API endpoints match documentation
    $this->actingAs($user = User::factory()->create());

    $response = $this->getJson('/api/v1/monitoring/servers');

    $response->assertOk()
        ->assertJsonStructure([
            'data' => [
                '*' => [
                    'server_id',
                    'server_name',
                    'metrics',
                    'health_status',
                    'last_collected_at',
                ],
            ],
            'meta' => [
                'total_servers',
                'healthy',
                'warning',
                'critical',
                'offline',
            ],
        ]);
});
```

### Documentation Completeness Checklist

**Manual Review Checklist:**

- [ ] All features from implementation (Tasks 22-31) are documented
- [ ] Every UI component has a screenshot with a caption
- [ ] Every configuration option has a description and an example
- [ ] Every API endpoint has a request/response example
- [ ] All error codes are documented with resolutions
- [ ] Architecture diagrams show complete system flow
- [ ] Code examples use consistent styling
- [ ] Terminology is consistent across all documents
- [ ] Cross-references between documents are accurate
- [ ] Table of contents is complete and accurate
- [ ] Search keywords are included in metadata
- [ ] Version compatibility is documented

## Definition of Done

- [ ] All 11 core documentation files created and complete
- [ ] Documentation directory structure established
- [ ] Overview document covers architecture and key features (complete)
- [ ] User guide includes dashboard walkthrough with screenshots (8+ sections)
- [ ] Administrator guide covers all configuration options (10+ sections)
- [ ] Technical reference explains algorithms and data structures
- [ ] API reference documents all monitoring endpoints (8+ endpoints)
- [ ] Troubleshooting guide includes common issues and resolutions (15+ issues)
- [ ] Best practices guide provides planning and optimization advice
- [ ] Configuration reference lists all settings with examples
- [ ] Migration guide explains enabling on existing installations
- [ ] Integration guide covers external monitoring systems (Prometheus, Grafana)
- [ ] Security documentation covers access controls and permissions
- [ ] All screenshots captured and optimized (20+ images)
- [ ] All architecture diagrams created (5+ diagrams)
- [ ] All code examples tested and validated (50+ examples)
- [ ] Internal links verified and working
- [ ] Documentation follows Coolify style guide
- [ ] Technical review completed by engineering team
- [ ] User testing completed with sample users
- [ ] Feedback incorporated from review process
- [ ] Documentation integrated into main docs site
- [ ] Changelog created tracking documentation versions
- [ ] Search metadata added for discoverability
- [ ] PDF export generation working
- [ ] Documentation quality tests written and passing
- [ ] All acceptance criteria met

## Related Tasks

- **Depends on:** Task 31 (WebSocket broadcasting implementation - must be complete to document accurately)
- **Integrates with:** Tasks 22-30 (All resource monitoring implementation tasks - documentation reflects implementation)
- **Used by:** Enterprise customers for understanding and configuring monitoring features
- **Complements:** Task 82 (White-label documentation), Task 83 (Terraform documentation)

