Task: Write resource monitoring and capacity management documentation
Description
Create comprehensive user and administrator documentation for Coolify Enterprise's resource monitoring and capacity management system. This documentation covers real-time server metrics monitoring, intelligent server selection algorithms, organization-level resource quotas, capacity planning tools, and the advanced deployment strategies enabled by capacity awareness.
This documentation is critical for enterprise administrators who need to understand how Coolify automatically optimizes resource utilization across their infrastructure, prevents over-provisioning, enforces organizational quotas, and ensures deployments are placed on optimal servers based on real-time capacity analysis.
Target Audiences:
- Organization Administrators - Understanding quota management, resource monitoring dashboards, and capacity planning
- DevOps Engineers - Configuring resource monitoring, understanding server selection algorithms, troubleshooting capacity issues
- Application Developers - Understanding how capacity affects their deployment experience and automatic server selection
- System Architects - Planning infrastructure scaling, understanding resource allocation patterns
- Enterprise Support Teams - Troubleshooting resource-related issues, understanding monitoring data
Documentation Scope:
- User Guides - Step-by-step instructions for accessing dashboards, configuring quotas, interpreting metrics
- Technical Reference - Detailed explanation of monitoring architecture, scoring algorithms, data retention policies
- Administrator Guides - Setting up monitoring, configuring thresholds, managing organization quotas
- API Documentation - Programmatic access to monitoring data and capacity information
- Troubleshooting Guides - Common issues, diagnostic procedures, resolution steps
- Best Practices - Resource planning, quota sizing, monitoring optimization
Integration Context:
This documentation builds upon the implementation completed in Tasks 22-31 (resource monitoring system). It must accurately reflect the implemented features:
- Real-time metrics collection (CPU, memory, disk, network, load average)
- Server scoring algorithm with weighted criteria
- Organization resource quotas linked to enterprise licenses
- WebSocket-powered real-time dashboards
- Capacity-aware deployment server selection
- Time-series metrics storage with configurable retention
Why This Documentation Is Critical:
Resource monitoring and capacity management are complex enterprise features that differentiate Coolify Enterprise from standard Coolify. Without comprehensive documentation, administrators cannot effectively utilize these features, leading to:
- Under-utilization of capacity planning tools
- Misunderstanding of quota enforcement
- Inability to troubleshoot resource allocation issues
- Poor infrastructure scaling decisions
- Confusion about automatic server selection behavior
Professional documentation ensures enterprise customers can fully leverage these advanced features, reducing support burden and increasing customer satisfaction.
Acceptance Criteria
- User guide covering all dashboard features with screenshots and walkthroughs
- Administrator guide for quota configuration and management
- Technical reference explaining monitoring architecture and data flow
- Server scoring algorithm documentation with examples and scoring breakdowns
- API documentation for all resource monitoring endpoints with examples
- Troubleshooting guide covering common capacity issues and resolutions
- Best practices guide for resource planning and quota sizing
- Configuration reference for monitoring settings and thresholds
- Migration guide for enabling monitoring on existing installations
- Integration guide for connecting monitoring to external systems (Prometheus, Grafana, etc.)
- Performance tuning guide for high-volume metrics collection
- Security documentation covering metric access controls and organization scoping
- All documentation includes real-world examples and use cases
- Documentation follows Coolify's established style guide and formatting
- All code examples are tested and working
Technical Details
Documentation Structure
File Locations:
Primary documentation directory:
/home/topgun/topgun/docs/enterprise/resource-monitoring/ (new directory)
Individual documentation files:
- /home/topgun/topgun/docs/enterprise/resource-monitoring/overview.md - Feature overview and introduction
- /home/topgun/topgun/docs/enterprise/resource-monitoring/user-guide.md - End-user dashboard walkthrough
- /home/topgun/topgun/docs/enterprise/resource-monitoring/admin-guide.md - Administrator configuration guide
- /home/topgun/topgun/docs/enterprise/resource-monitoring/technical-reference.md - Architecture and algorithms
- /home/topgun/topgun/docs/enterprise/resource-monitoring/api-reference.md - API endpoint documentation
- /home/topgun/topgun/docs/enterprise/resource-monitoring/troubleshooting.md - Issue diagnosis and resolution
- /home/topgun/topgun/docs/enterprise/resource-monitoring/best-practices.md - Planning and optimization
- /home/topgun/topgun/docs/enterprise/resource-monitoring/configuration.md - Settings and environment variables
- /home/topgun/topgun/docs/enterprise/resource-monitoring/migration.md - Enabling on existing installations
- /home/topgun/topgun/docs/enterprise/resource-monitoring/integration.md - External monitoring integration
- /home/topgun/topgun/docs/enterprise/resource-monitoring/security.md - Access controls and permissions
Supporting files:
- /home/topgun/topgun/docs/enterprise/resource-monitoring/images/ - Screenshots and diagrams
- /home/topgun/topgun/docs/enterprise/resource-monitoring/examples/ - Code examples and API calls
Overview Document Structure
File: docs/enterprise/resource-monitoring/overview.md
# Resource Monitoring and Capacity Management
## Overview
Coolify Enterprise provides comprehensive resource monitoring and intelligent capacity management to optimize infrastructure utilization, prevent over-provisioning, and ensure deployments are placed on optimal servers based on real-time capacity analysis.
### Key Features
- **Real-time Metrics Collection** - CPU, memory, disk, network, and load average metrics collected every 30 seconds
- **Intelligent Server Selection** - Weighted scoring algorithm automatically selects optimal servers for deployments
- **Organization Quotas** - Hierarchical quota enforcement linked to enterprise license tiers
- **Capacity Planning** - Visual tools for forecasting resource needs and planning infrastructure scaling
- **WebSocket Dashboards** - Real-time dashboard updates without page refreshes
- **Time-Series Storage** - Efficient metrics storage with configurable retention policies
- **API Access** - Programmatic access to all monitoring data and capacity information
### Architecture Overview
The resource monitoring system consists of four primary components:
1. **ResourceMonitoringJob** - Background job collecting metrics from all servers every 30 seconds
2. **SystemResourceMonitor** - Service for metric aggregation, storage, and time-series management
3. **CapacityManager** - Intelligent server selection using weighted scoring algorithm
4. **ResourceDashboard.vue** - Real-time WebSocket-powered dashboard with ApexCharts visualization
### Monitoring Data Flow
Server Metrics Collection (every 30s)
↓
ResourceMonitoringJob executes on all servers
↓
SSH connection retrieves system metrics
↓
SystemResourceMonitor processes and stores metrics
↓
server_resource_metrics table (time-series data)
↓
Redis cache for recent metrics
↓
WebSocket broadcast to connected clients
↓
ResourceDashboard.vue updates in real-time
### Server Scoring Algorithm
Deployments automatically select the optimal server based on weighted scoring:
- **CPU Availability (30%)** - Remaining CPU capacity
- **Memory Availability (30%)** - Free memory for application allocation
- **Disk Space (20%)** - Available storage for application data
- **Network Bandwidth (10%)** - Available network capacity
- **Current Load (10%)** - Server load average (penalizes heavily loaded servers)
**Example Score Calculation:**
Server: production-app-1
CPU: 40% used (60% available) = 60 points × 30% weight = 18 points
Memory: 50% used (50% available) = 50 points × 30% weight = 15 points
Disk: 30% used (70% available) = 70 points × 20% weight = 14 points
Network: 20% used (80% available) = 80 points × 10% weight = 8 points
Load: 1.2/4.0 (70% available) = 70 points × 10% weight = 7 points
Total Score: 62 / 100
Higher scores indicate better deployment candidates.
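The calculation above can be reproduced with a short Python sketch. This is an illustration of the weighted sum only, not Coolify's implementation; the function names are hypothetical, and treating the `4.0` in `1.2/4.0` as the load the server can absorb is an assumption.

```python
# Weights from the list above. They are expected to sum to 1.0.
WEIGHTS = {"cpu": 0.30, "memory": 0.30, "disk": 0.20, "network": 0.10, "load": 0.10}


def load_availability(load_average: float, load_capacity: float) -> float:
    """Convert a load average into an availability percentage, clamped to 0-100.

    Assumption: load_capacity is the load the server can absorb
    (for example, its CPU core count).
    """
    used_fraction = min(max(load_average / load_capacity, 0.0), 1.0)
    return (1.0 - used_fraction) * 100.0


def capacity_score(availability: dict) -> float:
    """Weighted sum of per-resource availability percentages (0-100 each)."""
    total = sum(availability[name] * weight for name, weight in WEIGHTS.items())
    return round(total, 1)


# Availability figures for production-app-1 from the example above.
production_app_1 = {
    "cpu": 60.0,
    "memory": 50.0,
    "disk": 70.0,
    "network": 80.0,
    "load": load_availability(1.2, 4.0),
}
score = capacity_score(production_app_1)
print(score)  # 62.0
```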
### Organization Quota Enforcement
Organization resource usage is tracked and enforced based on enterprise license quotas:
Organization: Acme Corp
License Tier: Professional
Quotas:
- Max Servers: 20
- Max Applications: 100
- Max CPU Cores: 80
- Max RAM: 256 GB
- Max Storage: 2 TB
Current Usage:
- Servers: 15 / 20 (75%)
- Applications: 67 / 100 (67%)
- CPU Cores: 52 / 80 (65%)
- RAM: 168 GB / 256 GB (66%)
- Storage: 1.2 TB / 2 TB (60%)
When a request would exceed a quota, the new resource is not created and a clear error message is returned.
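A minimal Python sketch of the enforcement check, using the Acme Corp figures above. The `Quota` class, `QuotaExceeded` exception and `check_quota` helper are hypothetical names for illustration; the real check runs inside Coolify at resource creation.

```python
from dataclasses import dataclass


class QuotaExceeded(Exception):
    """Raised when a request would push usage past the licensed limit."""


@dataclass(frozen=True)
class Quota:
    name: str
    used: float
    limit: float

    def percent_used(self) -> float:
        return round(self.used / self.limit * 100, 1)

    def allows(self, requested: float) -> bool:
        """True if adding `requested` units stays within the limit."""
        return self.used + requested <= self.limit


def check_quota(quota: Quota, requested: float) -> None:
    """Raise QuotaExceeded if the request does not fit in the quota."""
    if not quota.allows(requested):
        raise QuotaExceeded(
            f"{quota.name} quota exceeded. Current: {quota.used:g}, "
            f"Requested: {requested:g}, Limit: {quota.limit:g}."
        )


servers = Quota("Server", used=15, limit=20)
ram_gb = Quota("RAM", used=168, limit=256)

check_quota(servers, 5)  # 15 + 5 = 20, still within the limit
try:
    check_quota(servers, 6)  # 15 + 6 = 21, rejected
except QuotaExceeded as error:
    print(error)
```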
### Metric Retention Policies
Metrics are stored with varying granularity based on age:
- **Raw metrics (30s intervals):** Retained for 7 days
- **5-minute aggregates:** Retained for 30 days
- **1-hour aggregates:** Retained for 90 days
- **Daily aggregates:** Retained for 1 year
This provides high-resolution recent data while maintaining long-term trends.
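The tiers above imply a simple lookup: for a data point of a given age, the finest granularity still retained is the first tier whose retention window covers it. A Python sketch under that reading (the tier labels and the function name are illustrative assumptions):

```python
from datetime import timedelta
from typing import Optional

# (granularity label, retention window), ordered from finest to coarsest.
RETENTION_TIERS = [
    ("raw_30s", timedelta(days=7)),
    ("5min", timedelta(days=30)),
    ("1hour", timedelta(days=90)),
    ("1day", timedelta(days=365)),
]


def finest_granularity(age: timedelta) -> Optional[str]:
    """Return the finest granularity still stored for data of this age,
    or None if the data is older than every retention window."""
    for label, retention in RETENTION_TIERS:
        if age <= retention:
            return label
    return None


print(finest_granularity(timedelta(days=3)))    # raw_30s
print(finest_granularity(timedelta(days=45)))   # 1hour
print(finest_granularity(timedelta(days=400)))  # None
```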
### Getting Started
1. **Enable Monitoring** - Monitoring is automatically enabled on all servers in Enterprise installations
2. **Configure Quotas** - Set organization quotas via License Management interface
3. **Access Dashboards** - Navigate to Resources → Monitoring to view real-time metrics
4. **Plan Capacity** - Use Capacity Planner to forecast resource needs
5. **Monitor Quotas** - Track organization usage in Organization Settings → Resources
### Next Steps
- [User Guide](./user-guide.md) - Dashboard walkthrough and feature tutorials
- [Administrator Guide](./admin-guide.md) - Configuration and quota management
- [Technical Reference](./technical-reference.md) - Architecture deep-dive and algorithms
- [API Reference](./api-reference.md) - Programmatic access to monitoring data
User Guide Document Structure
File: docs/enterprise/resource-monitoring/user-guide.md
# Resource Monitoring User Guide
## Accessing the Resource Dashboard
Navigate to **Resources → Monitoring** in the main navigation menu. The Resource Dashboard displays real-time metrics for all servers in your organization.
### Dashboard Overview

The dashboard consists of four main sections:
1. **Server List** - Left sidebar showing all servers with health status indicators
2. **Metrics Charts** - Main area displaying CPU, memory, disk, and network usage over time
3. **Current Status** - Top bar showing aggregate statistics across all servers
4. **Server Details** - Right panel with detailed metrics for the selected server
### Understanding Health Status Indicators
Servers display colored health indicators based on resource utilization:
- **🟢 Green (Healthy)** - All resources below 70% utilization
- **🟡 Yellow (Warning)** - Any resource between 70-85% utilization
- **🔴 Red (Critical)** - Any resource above 85% utilization
- **⚫ Gray (Offline)** - Server unreachable or metrics collection failed
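The indicator rules above amount to classifying a server by its most heavily used resource. A Python sketch of that rule (the function name and the status strings are assumptions, and exactly 85% is treated as Warning here):

```python
from typing import Optional


def health_status(utilization: Optional[dict]) -> str:
    """Classify a server from per-resource utilization percentages.

    Pass None or an empty dict when metrics collection failed.
    """
    if not utilization:
        return "offline"
    worst = max(utilization.values())
    if worst > 85:
        return "critical"
    if worst >= 70:
        return "warning"
    return "healthy"


print(health_status({"cpu": 42.3, "memory": 76.0, "disk": 45.0}))  # warning
print(health_status({"cpu": 30.0, "memory": 55.0, "disk": 40.0}))  # healthy
print(health_status(None))                                         # offline
```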
### Real-Time Metrics
Metrics update automatically every 30 seconds over a WebSocket connection; no page refresh is required.
#### CPU Usage
Displays CPU utilization across all cores:
Current: 42% (12 cores)
Average (1h): 38%
Average (24h): 45%
Peak (24h): 78% at 14:23 UTC
**Interpreting CPU Metrics:**
- **0-50%** - Normal operation, sufficient capacity for new deployments
- **50-70%** - Moderate load, deployments may be routed to other servers
- **70-85%** - High load, new deployments redirected to other servers
- **85-100%** - Critical load, investigate resource-intensive applications
#### Memory Usage
Shows RAM allocation and availability:
Used: 24.3 GB / 32 GB (76%)
Available: 7.7 GB
Cached: 4.2 GB (can be freed)
Application Usage: 20.1 GB
**Memory States:**
- **Active** - Currently in use by applications
- **Cached** - File cache, automatically freed when needed
- **Available** - Free for immediate use
- **Swap Used** - Indicates memory pressure (should be minimal)
#### Disk Usage
Displays storage utilization by mount point:
/data - 450 GB / 1 TB (45%)
/var/lib/docker - 125 GB / 500 GB (25%)
/backups - 80 GB / 200 GB (40%)
**Disk Metrics:**
- **Total Size** - Physical disk capacity
- **Used Space** - Allocated storage
- **Available Space** - Free for new data
- **Inodes Used** - File count (important for containers)
#### Network Usage
Shows network throughput in/out:
Inbound: 125 Mbps (current)
Outbound: 85 Mbps (current)
Total (24h): 450 GB in / 320 GB out
Peak: 850 Mbps in at 18:45 UTC
### Time Range Selection
Use the time range selector to view metrics over different periods:
- **Last Hour** - High-resolution 30-second intervals
- **Last 24 Hours** - 5-minute aggregates
- **Last 7 Days** - 1-hour aggregates
- **Last 30 Days** - 1-hour aggregates
- **Last 90 Days** - Daily aggregates
- **Custom Range** - Select specific start/end dates
### Filtering and Sorting
#### Filter by Server Tags
Filter servers by tags to view specific groups:
Production: 8 servers
Staging: 4 servers
Development: 6 servers
Database: 3 servers
Click a tag name to filter the dashboard to servers with that tag.
#### Sort Servers
Sort server list by various criteria:
- **Name (A-Z / Z-A)**
- **CPU Usage (High to Low)**
- **Memory Usage (High to Low)**
- **Disk Usage (High to Low)**
- **Health Status (Critical First)**
- **Last Metric Update (Newest First)**
### Exporting Metrics
Export metrics for external analysis:
1. Click **Export** button in dashboard toolbar
2. Select time range and metrics to export
3. Choose format: CSV, JSON, or Prometheus format
4. Download file
**Example CSV Export:**
```csv
timestamp,server_id,server_name,cpu_percent,memory_percent,disk_percent
2025-10-06 14:30:00,15,production-app-1,42.3,68.5,45.2
2025-10-06 14:30:30,15,production-app-1,43.1,68.7,45.2
```
### Setting Up Alerts
Configure custom alerts for resource thresholds:
1. Navigate to **Resources → Monitoring → Alerts**
2. Click **Create Alert Rule**
3. Configure alert parameters:
   - **Metric:** CPU, Memory, Disk, Network, or Load Average
   - **Threshold:** Percentage or absolute value
   - **Duration:** How long the threshold must be exceeded
   - **Severity:** Info, Warning, Critical
   - **Notification Channels:** Email, Slack, PagerDuty

**Example Alert:**
Alert: High CPU on Production Servers
Condition: CPU > 80% for 5 minutes
Severity: Warning
Notify: devops@company.com, #alerts-production
Actions: Send notification, create incident
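The Duration parameter means a single spike does not fire the alert; the threshold must be exceeded continuously. One way to evaluate such a rule over 30-second samples is sketched below in Python. This is an assumed interpretation for illustration (any sample at or below the threshold resets the timer), and `alert_is_firing` is a hypothetical name.

```python
from datetime import datetime, timedelta


def alert_is_firing(samples, threshold, duration):
    """Return True when the latest unbroken run of samples above `threshold`
    spans at least `duration`.

    `samples` is a list of (timestamp, value) tuples sorted oldest first.
    """
    breach_started = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp
        else:
            breach_started = None  # any dip below the threshold resets the run
    if breach_started is None:
        return False
    return samples[-1][0] - breach_started >= duration


start = datetime(2025, 10, 6, 15, 20, 0)
# Eleven samples, 30 seconds apart, cover exactly five minutes.
sustained = [(start + timedelta(seconds=30 * i), 85.2) for i in range(11)]
interrupted = list(sustained)
interrupted[5] = (interrupted[5][0], 60.0)

print(alert_is_firing(sustained, 80, timedelta(minutes=5)))    # True
print(alert_is_firing(interrupted, 80, timedelta(minutes=5)))  # False
```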
### Capacity Planner
Access the Capacity Planner to forecast resource needs:
- Navigate to Resources → Capacity Planner
- View server capacity scores and recommendations
- See predicted exhaustion dates based on current growth trends
- Plan infrastructure scaling ahead of capacity issues
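How the planner derives exhaustion dates is not specified here. As a rough mental model, a straight-line projection from current usage and an average daily growth rate looks like the Python sketch below; the function name and the linear-growth assumption are illustrative, and the product may use a different forecasting method.

```python
import math
from datetime import date, timedelta
from typing import Optional


def projected_exhaustion(today: date, used: float, limit: float,
                         growth_per_day: float) -> Optional[date]:
    """Project the date on which `used` reaches `limit` at a constant growth rate.

    Returns None when usage is flat or shrinking, and `today` when the
    limit has already been reached.
    """
    if used >= limit:
        return today
    if growth_per_day <= 0:
        return None
    days_left = math.ceil((limit - used) / growth_per_day)
    return today + timedelta(days=days_left)


# Hypothetical volume: 450 GB of 1000 GB used, growing 5 GB per day.
print(projected_exhaustion(date(2025, 10, 6), 450, 1000, 5))  # 2026-01-24
```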
Capacity Score Breakdown:
Each server displays a capacity score (0-100) indicating deployment suitability:
Server: production-app-2
Capacity Score: 78 / 100
Breakdown:
CPU Availability: 85% × 30% = 25.5 points
Memory Availability: 75% × 30% = 22.5 points
Disk Availability: 68% × 20% = 13.6 points
Network Availability: 90% × 10% = 9.0 points
Load Factor: 72% × 10% = 7.2 points
Total Score: 77.8 / 100 (rounded to 78)
Recommendation: Good deployment candidate
Interpreting Scores:
- 90-100 - Excellent capacity, ideal for deployments
- 70-89 - Good capacity, suitable for most deployments
- 50-69 - Moderate capacity, suitable for small/medium deployments
- 30-49 - Limited capacity, avoid new deployments unless necessary
- 0-29 - Critical capacity, do not deploy
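The score bands can be turned into a lookup, and the same scores can drive target selection. The Python sketch below picks the highest-scoring server while skipping anything in the 0-29 band; the band labels, function names and the simple "highest score wins" rule are assumptions for illustration, not the documented selection logic.

```python
from typing import Optional

# (lowest score in band, label), ordered from best to worst.
BANDS = [(90, "excellent"), (70, "good"), (50, "moderate"), (30, "limited"), (0, "critical")]


def recommendation(score: float) -> str:
    """Map a 0-100 capacity score to its band label."""
    for floor, label in BANDS:
        if score >= floor:
            return label
    return "critical"


def pick_server(scores: dict, minimum_score: float = 30) -> Optional[str]:
    """Return the name of the highest-scoring server at or above
    `minimum_score`, or None if no server qualifies."""
    eligible = {name: score for name, score in scores.items() if score >= minimum_score}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


scores = {"production-app-1": 62, "production-app-2": 78, "production-app-3": 25}
best = pick_server(scores)
print(best, recommendation(scores[best]))  # production-app-2 good
```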
### Organization Resource Quotas
View your organization's resource quotas and current usage:
- Navigate to Organization Settings → Resources
- View quota allocation by license tier
- Monitor current usage percentages
- See quota violation warnings
Quota Dashboard Example:
Organization: Acme Corporation
License: Professional Tier
Server Quota: 15 / 20 (75%) 🟡
Application Quota: 67 / 100 (67%) 🟢
CPU Quota: 52 cores / 80 cores (65%) 🟢
Memory Quota: 168 GB / 256 GB (66%) 🟢
Storage Quota: 1.2 TB / 2 TB (60%) 🟢
Status: Within limits
Next Review: 2025-11-15
Upgrade Options: Enterprise tier (200 servers, unlimited apps)
Quota Warnings:
- 🟢 Green (0-70%) - Healthy usage
- 🟡 Yellow (70-90%) - Approaching limit, consider planning expansion
- 🔴 Red (90-100%) - Near limit, action required soon
- ⛔ Blocked (100%) - Quota exceeded, cannot create new resources
### WebSocket Connection Status
The dashboard uses WebSocket for real-time updates. Connection status is shown in the top-right corner:
- 🟢 Connected - Receiving real-time updates
- 🟡 Connecting - Establishing connection
- 🔴 Disconnected - No real-time updates (refresh the page if automatic reconnection does not succeed)
If disconnected, the dashboard automatically attempts reconnection every 5 seconds.
### Performance Tips
Optimize Dashboard Performance:
- Limit Time Range - Shorter ranges load faster
- Filter Servers - Display only relevant servers
- Reduce Metric Types - Hide unused metric charts
- Use Aggregated Views - For historical data, use hour/day aggregates
Browser Requirements:
- Modern browser with WebSocket support (Chrome, Firefox, Safari, Edge)
- JavaScript enabled
- Minimum 2 GB RAM for large deployments (100+ servers)
Administrator Guide Document Structure
File: docs/enterprise/resource-monitoring/admin-guide.md
# Resource Monitoring Administrator Guide
## System Configuration
### Environment Variables
Configure monitoring behavior via environment variables in `.env`:
```bash
# Monitoring Collection
MONITORING_ENABLED=true
MONITORING_INTERVAL=30 # Seconds between collections
MONITORING_TIMEOUT=10 # SSH timeout for metric collection
# Metric Retention
METRICS_RAW_RETENTION_DAYS=7
METRICS_5MIN_RETENTION_DAYS=30
METRICS_HOURLY_RETENTION_DAYS=90
METRICS_DAILY_RETENTION_DAYS=365
# Performance Tuning
MONITORING_CONCURRENT_SERVERS=10 # Parallel metric collection
MONITORING_REDIS_CACHE_TTL=60 # Cache duration in seconds
MONITORING_BATCH_SIZE=100 # Metrics per database insert
# WebSocket Broadcasting
MONITORING_BROADCAST_ENABLED=true
MONITORING_BROADCAST_CHANNEL=resource-metrics
# Alerting
MONITORING_ALERT_ENABLED=true
MONITORING_ALERT_EMAIL=devops@company.com
```
### Database Configuration
Monitoring uses the server_resource_metrics and organization_resource_usage tables.
Partitioning Configuration (PostgreSQL):
-- Enable partitioning for large installations
CREATE TABLE server_resource_metrics_2025_10 PARTITION OF server_resource_metrics
FOR VALUES FROM ('2025-10-01') TO ('2025-11-01');
# Automatic partition creation via cron (crontab entry, not SQL; runs at midnight on the 1st of each month)
0 0 1 * * php /path/to/coolify/artisan monitoring:create-partition
Indexing:
-- Performance indexes (automatically created by migration)
CREATE INDEX idx_metrics_server_timestamp ON server_resource_metrics(server_id, collected_at DESC);
CREATE INDEX idx_metrics_org_timestamp ON organization_resource_usage(organization_id, period_start DESC);
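-- PostgreSQL rejects NOW() in a partial-index predicate (only IMMUTABLE expressions are allowed), so the column is indexed without a WHERE clause
CREATE INDEX idx_metrics_collected_at ON server_resource_metrics(collected_at);
### Redis Caching Configuration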
Metrics are cached in Redis for performance:
Cache Keys:
- monitoring:server:{server_id}:latest # Latest metrics (60s TTL)
- monitoring:org:{org_id}:usage # Organization totals (300s TTL)
- monitoring:capacity:scores # Capacity scores (60s TTL)
Memory Usage: ~10 KB per server × server count
Example: 100 servers = ~1 MB Redis memory
Redis Configuration:
# config/database.php
'redis' => [
'monitoring' => [
'host' => env('REDIS_HOST', '127.0.0.1'),
'password' => env('REDIS_PASSWORD', null),
'port' => env('REDIS_PORT', 6379),
'database' => env('REDIS_MONITORING_DB', 2),
],
],
### Scheduled Jobs Configuration
Monitoring requires scheduled jobs in app/Console/Kernel.php:
protected function schedule(Schedule $schedule)
{
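// Resource metric collection (every 30 seconds; sub-minute scheduling requires Laravel 10.15 or later)
// runInBackground() is omitted: it applies to scheduled commands, and queued jobs already run asynchronously
$schedule->job(new ResourceMonitoringJob)
->everyThirtySeconds()
->withoutOverlapping();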
// Capacity score calculation (every 5 minutes)
$schedule->job(new CapacityAnalysisJob)
->everyFiveMinutes()
->withoutOverlapping();
// Organization usage aggregation (hourly)
$schedule->job(new OrganizationUsageAggregationJob)
->hourly();
// Metric cleanup (daily at 2 AM)
$schedule->command('monitoring:cleanup-old-metrics')
->dailyAt('02:00');
// Alert processing (every minute)
$schedule->job(new AlertProcessingJob)
->everyMinute()
->when(fn() => config('monitoring.alerts.enabled'));
}
Ensure Horizon is running for job processing:
php artisan horizon
## Organization Quota Configuration
Configure quotas via the License Management interface or directly in the database:
Via UI:
- Navigate to Admin → Organizations → {Organization} → License
- Select license tier (Starter, Professional, Enterprise, Custom)
- Configure custom quotas if using Custom tier
- Save changes
Via Database:
UPDATE enterprise_licenses
SET quota_max_servers = 50,
quota_max_applications = 200,
quota_max_cpu_cores = 200,
quota_max_memory_gb = 512,
quota_max_storage_tb = 5
WHERE organization_id = 123;
Quota Enforcement:
Quotas are enforced at resource creation:
// Example quota check (performed automatically at resource creation)
$organization = auth()->user()->currentOrganization();
$count = $organization->servers()->count();
$limit = $organization->license->quota_max_servers;
if ($count >= $limit) {
    throw new QuotaExceededException(
        "Server quota exceeded. Current: {$count}, Limit: {$limit}. " .
        "Please upgrade your license to increase limits."
    );
}
## Server Selection Algorithm Configuration
Customize server scoring weights in config/capacity.php:
return [
'scoring' => [
'weights' => [
'cpu' => env('CAPACITY_WEIGHT_CPU', 0.30), // 30%
'memory' => env('CAPACITY_WEIGHT_MEMORY', 0.30), // 30%
'disk' => env('CAPACITY_WEIGHT_DISK', 0.20), // 20%
'network' => env('CAPACITY_WEIGHT_NETWORK', 0.10), // 10%
'load' => env('CAPACITY_WEIGHT_LOAD', 0.10), // 10%
],
'thresholds' => [
'minimum_score' => 30, // Don't deploy to servers below this score
'preferred_score' => 70, // Prefer servers above this score
],
'penalties' => [
'recent_deployment' => 10, // Reduce score for servers with recent deployment
'high_load' => 20, // Additional penalty for load > 80%
'low_disk' => 15, // Penalty for disk > 85%
],
],
];
## Metric Collection SSH Configuration
Monitoring connects to servers via SSH to collect metrics:
SSH Key Setup:
# Generate monitoring-specific SSH key
ssh-keygen -t ed25519 -f ~/.ssh/coolify_monitoring -C "coolify-monitoring"
# Add public key to all servers
ssh-copy-id -i ~/.ssh/coolify_monitoring.pub user@server
Configure in .env:
MONITORING_SSH_KEY_PATH=/home/coolify/.ssh/coolify_monitoring
MONITORING_SSH_USER=coolify
MONITORING_SSH_PORT=22
Required Server Commands:
Monitoring executes these commands via SSH (ensure user has permissions):
# CPU and load average
cat /proc/stat
cat /proc/loadavg
# Memory
cat /proc/meminfo
# Disk
df -h /
df -i / # Inode usage
# Network
cat /proc/net/dev
# Docker (if installed)
docker stats --no-stream --format "{{json .}}"
## Alert Configuration
Configure alert rules and notification channels:
Alert Rule Structure:
{
"name": "High CPU Usage",
"metric": "cpu_percent",
"condition": "greater_than",
"threshold": 80,
"duration_seconds": 300,
"severity": "warning",
"notification_channels": ["email", "slack"],
"actions": ["notify", "create_incident"]
}
Notification Channels:
# Email
ALERT_EMAIL_ENABLED=true
ALERT_EMAIL_TO=devops@company.com
ALERT_EMAIL_FROM=alerts@coolify.company.com
# Slack
ALERT_SLACK_ENABLED=true
ALERT_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
ALERT_SLACK_CHANNEL=#alerts-production
# PagerDuty
ALERT_PAGERDUTY_ENABLED=true
ALERT_PAGERDUTY_INTEGRATION_KEY=...
# Webhook
ALERT_WEBHOOK_ENABLED=true
ALERT_WEBHOOK_URL=https://monitoring.company.com/webhook
ALERT_WEBHOOK_SECRET=...
## Performance Tuning
High-Volume Deployments (100+ servers):
- Increase Concurrent Collection:
MONITORING_CONCURRENT_SERVERS=20
- Enable Database Connection Pooling:
DB_CONNECTION_POOL_MIN=5
DB_CONNECTION_POOL_MAX=20
- Partition Metrics Table:
php artisan monitoring:enable-partitioning
- Use Dedicated Redis Instance:
REDIS_MONITORING_HOST=redis-monitoring.internal
- Enable Metric Batching:
MONITORING_BATCH_SIZE=500
MONITORING_BATCH_INTERVAL=10 # Seconds
Monitoring the Monitoring System:
Track monitoring system performance:
-- Job execution time
SELECT AVG(execution_time), MAX(execution_time)
FROM jobs_log
WHERE job_type = 'ResourceMonitoringJob'
AND created_at > NOW() - INTERVAL '1 hour';
-- Metric collection failures
SELECT server_id, COUNT(*) as failures
FROM server_resource_metrics_failures
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY server_id
ORDER BY failures DESC;
## Backup and Disaster Recovery
Metrics Backup Strategy:
- Database Backup - Include metrics tables in regular database backups
- Time-Series Export - Daily export to S3 for long-term storage
- Redis Persistence - Enable RDB snapshots for cache recovery
Backup Configuration:
# Daily metrics export to S3
php artisan monitoring:export-metrics --days=7 --s3-bucket=coolify-metrics-backup
Recovery Procedures:
# Restore metrics from S3 backup
php artisan monitoring:import-metrics --s3-bucket=coolify-metrics-backup --date=2025-10-01
# Rebuild capacity scores
php artisan monitoring:rebuild-capacity-scores
# Regenerate aggregates
php artisan monitoring:regenerate-aggregates --from=2025-10-01 --to=2025-10-06
## Troubleshooting Admin Issues
Metrics Not Collecting:
- Check Horizon is running:
php artisan horizon:status
- Verify SSH connectivity:
ssh -i "$MONITORING_SSH_KEY_PATH" "$MONITORING_SSH_USER@server"
- Check job failures:
php artisan queue:failed
- Review logs:
tail -f storage/logs/monitoring.log
High Database Load:
- Enable metric partitioning
- Increase batch size
- Review index usage:
EXPLAIN SELECT * FROM server_resource_metrics WHERE ...
- Archive old metrics:
php artisan monitoring:archive --before=2024-01-01
WebSocket Connection Issues:
- Verify the Laravel Reverb server process is running (it is started with `php artisan reverb:start`)
- Check firewall allows WebSocket port (default 8080)
- Test WebSocket connection:
wscat -c ws://coolify.company.com:8080/apps/monitoring
Capacity Scores Incorrect:
- Rebuild scores (same command as in Recovery Procedures):
php artisan monitoring:rebuild-capacity-scores
- Verify configuration weights sum to 1.0
- Check recent metrics are available:
SELECT MAX(collected_at) FROM server_resource_metrics;
API Reference Document Structure
File: docs/enterprise/resource-monitoring/api-reference.md
# Resource Monitoring API Reference
## Authentication
All API endpoints require authentication via Sanctum token with `monitoring:read` or `monitoring:write` abilities.
**Request Header:**
Authorization: Bearer {your-api-token}
## Endpoints
### GET /api/v1/monitoring/servers
Get monitoring data for all servers in the organization.
**Query Parameters:**
time_range: string (1h, 24h, 7d, 30d, 90d) - Default: 1h
metrics: string[] - Comma-separated list (cpu,memory,disk,network,load)
server_ids: integer[] - Filter by server IDs
tags: string[] - Filter by server tags
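The query parameters above can also be assembled programmatically. The Python sketch below builds the request URL with the standard library; `servers_url` is a hypothetical helper, and encoding `server_ids` and `tags` as comma-separated values is an assumption (the reference only states this for `metrics`).

```python
from urllib.parse import urlencode

ALLOWED_TIME_RANGES = {"1h", "24h", "7d", "30d", "90d"}


def servers_url(base_url, time_range="1h", metrics=None, server_ids=None, tags=None):
    """Build the URL for GET /api/v1/monitoring/servers."""
    if time_range not in ALLOWED_TIME_RANGES:
        raise ValueError(f"unsupported time_range: {time_range}")
    params = {"time_range": time_range}
    if metrics:
        params["metrics"] = ",".join(metrics)
    if server_ids:
        params["server_ids"] = ",".join(str(server_id) for server_id in server_ids)
    if tags:
        params["tags"] = ",".join(tags)
    # safe="," keeps list separators readable instead of encoding them as %2C.
    return f"{base_url}/api/v1/monitoring/servers?{urlencode(params, safe=',')}"


url = servers_url("https://coolify.company.com", "24h", metrics=["cpu", "memory"])
print(url)
```

The token is sent separately in the `Authorization: Bearer` header described under Authentication.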
**Example Request:**
```bash
curl -X GET "https://coolify.company.com/api/v1/monitoring/servers?time_range=24h&metrics=cpu,memory" \
-H "Authorization: Bearer {token}"
```
**Example Response:**
{
"data": [
{
"server_id": 15,
"server_name": "production-app-1",
"metrics": {
"cpu": {
"current": 42.3,
"average_1h": 38.5,
"average_24h": 45.2,
"peak_24h": 78.1,
"peak_timestamp": "2025-10-06T14:23:00Z"
},
"memory": {
"total_gb": 32,
"used_gb": 24.3,
"available_gb": 7.7,
"cached_gb": 4.2,
"percent_used": 76.0
}
},
"health_status": "warning",
"last_collected_at": "2025-10-06T15:30:00Z"
}
],
"meta": {
"total_servers": 15,
"healthy": 12,
"warning": 2,
"critical": 1,
"offline": 0
}
}
### GET /api/v1/monitoring/servers/{server_id}
Get detailed monitoring data for a specific server.
Path Parameters:
server_id: integer (required)
Query Parameters:
time_range: string - Default: 1h
granularity: string (raw, 5min, 1hour, 1day) - Default: auto
Example Request:
curl -X GET "https://coolify.company.com/api/v1/monitoring/servers/15?time_range=7d&granularity=1hour" \
-H "Authorization: Bearer {token}"
Example Response:
{
"server_id": 15,
"server_name": "production-app-1",
"organization_id": 5,
"time_series": [
{
"timestamp": "2025-10-06T14:00:00Z",
"cpu_percent": 42.3,
"memory_used_gb": 24.3,
"memory_percent": 76.0,
"disk_used_gb": 450,
"disk_percent": 45.0,
"network_in_mbps": 125,
"network_out_mbps": 85,
"load_average_1m": 1.2,
"load_average_5m": 1.5,
"load_average_15m": 1.8
}
],
"capacity_score": 62,
"capacity_breakdown": {
"cpu_score": 18,
"memory_score": 15,
"disk_score": 14,
"network_score": 8,
"load_score": 7
}
}
### GET /api/v1/monitoring/organizations/{org_id}/usage
Get organization-wide resource usage and quota information.
Path Parameters:
org_id: integer (required)
Example Request:
curl -X GET "https://coolify.company.com/api/v1/monitoring/organizations/5/usage" \
-H "Authorization: Bearer {token}"
Example Response:
{
"organization_id": 5,
"organization_name": "Acme Corporation",
"license_tier": "professional",
"quotas": {
"max_servers": 20,
"max_applications": 100,
"max_cpu_cores": 80,
"max_memory_gb": 256,
"max_storage_tb": 2
},
"current_usage": {
"servers": {
"count": 15,
"percent": 75.0,
"status": "warning"
},
"applications": {
"count": 67,
"percent": 67.0,
"status": "healthy"
},
"cpu_cores": {
"allocated": 52,
"percent": 65.0,
"status": "healthy"
},
"memory_gb": {
"allocated": 168,
"percent": 65.6,
"status": "healthy"
},
"storage_tb": {
"allocated": 1.2,
"percent": 60.0,
"status": "healthy"
}
},
"trending": {
"servers_7d_growth": 2,
"applications_7d_growth": 8,
"predicted_server_exhaustion_date": "2026-02-15"
}
}
### GET /api/v1/monitoring/capacity/scores
Get capacity scores for all servers to determine optimal deployment targets.
Query Parameters:
min_score: integer - Minimum score threshold (0-100)
server_tags: string[] - Filter by tags
sort: string (score_desc, score_asc, name) - Default: score_desc
Example Request:
curl -X GET "https://coolify.company.com/api/v1/monitoring/capacity/scores?min_score=50&server_tags=production" \
-H "Authorization: Bearer {token}"
Example Response:
{
"data": [
{
"server_id": 18,
"server_name": "production-app-4",
"capacity_score": 85,
"recommendation": "good",
"breakdown": {
"cpu_availability": 90,
"memory_availability": 85,
"disk_availability": 80,
"network_availability": 88,
"load_factor": 75
},
"weighted_scores": {
"cpu": 27.0,
"memory": 25.5,
"disk": 16.0,
"network": 8.8,
"load": 7.5
},
"suitable_for_deployment": true,
"estimated_deployments_capacity": 8
}
],
"meta": {
"total_servers": 15,
"suitable_servers": 12,
"best_server_id": 18
}
}
### POST /api/v1/monitoring/servers/{server_id}/metrics
Manually submit metrics for a server (for custom monitoring integrations).
Path Parameters:
server_id: integer (required)
Request Body:
{
"timestamp": "2025-10-06T15:30:00Z",
"metrics": {
"cpu_percent": 42.3,
"memory_used_gb": 24.3,
"memory_total_gb": 32,
"disk_used_gb": 450,
"disk_total_gb": 1000,
"network_in_mbps": 125,
"network_out_mbps": 85,
"load_average_1m": 1.2,
"load_average_5m": 1.5,
"load_average_15m": 1.8
}
}
Example Request:
curl -X POST "https://coolify.company.com/api/v1/monitoring/servers/15/metrics" \
-H "Authorization: Bearer {token}" \
-H "Content-Type: application/json" \
-d '{
"timestamp": "2025-10-06T15:30:00Z",
"metrics": {
"cpu_percent": 42.3,
"memory_used_gb": 24.3,
"memory_total_gb": 32
}
}'
Example Response:
{
"success": true,
"message": "Metrics stored successfully",
"server_id": 15,
"timestamp": "2025-10-06T15:30:00Z"
}
### GET /api/v1/monitoring/alerts
Get active alerts and alert history.
Query Parameters:
status: string (active, resolved, all) - Default: active
severity: string (info, warning, critical) - Filter by severity
server_ids: integer[] - Filter by server
time_range: string (1h, 24h, 7d, 30d) - Default: 24h
Example Response:
{
"data": [
{
"alert_id": 1234,
"server_id": 15,
"server_name": "production-app-1",
"metric": "cpu_percent",
"condition": "greater_than",
"threshold": 80,
"current_value": 85.2,
"severity": "warning",
"status": "active",
"triggered_at": "2025-10-06T15:25:00Z",
"duration_seconds": 300,
"notification_sent": true,
"notification_channels": ["email", "slack"]
}
],
"meta": {
"total_alerts": 1,
"active": 1,
"resolved_24h": 5
}
}
Rate Limits
API endpoints are rate-limited based on organization license tier:
- Starter: 100 requests per minute
- Professional: 500 requests per minute
- Enterprise: 2000 requests per minute
Rate limit headers are included in all responses:
X-RateLimit-Limit: 500
X-RateLimit-Remaining: 487
X-RateLimit-Reset: 1696611600
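A client can use these headers to pause before it exhausts its quota. The Python sketch below computes how long to wait, assuming `X-RateLimit-Reset` is a Unix timestamp in seconds (the sample value looks like one, but the source does not state the unit). `seconds_until_reset` is a hypothetical helper name.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str],
                        now: Optional[float] = None) -> float:
    """Return how long to wait before the next request.

    Returns 0.0 while requests remain in the current window. Assumes
    X-RateLimit-Reset is a Unix epoch timestamp in seconds.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(headers["X-RateLimit-Reset"])
    current = time.time() if now is None else now
    # Never return a negative wait if the window has already reset.
    return max(0.0, reset_at - current)
```

A plain `dict` is case-sensitive, so the header names must match exactly; most HTTP client libraries expose case-insensitive header mappings that can be passed in directly.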
Error Handling
Standard error response format:
{
"error": {
"code": "QUOTA_EXCEEDED",
"message": "Server quota exceeded. Current: 20, Limit: 20.",
"details": {
"current_count": 20,
"max_count": 20,
"license_tier": "professional"
}
}
}
Common Error Codes:
- UNAUTHORIZED - Invalid or missing API token
- FORBIDDEN - Insufficient permissions
- NOT_FOUND - Resource not found
- QUOTA_EXCEEDED - Organization quota limit reached
- RATE_LIMIT_EXCEEDED - API rate limit exceeded
- VALIDATION_ERROR - Invalid request parameters
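Client code usually benefits from turning this error envelope into a typed exception. The Python sketch below parses the documented `error` object into one. The class name `CoolifyApiError`, the `UNKNOWN` fallback code, and the choice to treat only `RATE_LIMIT_EXCEEDED` as retryable are illustrative assumptions, not documented behavior.

```python
import json
from typing import Any, Dict, Optional

# Assumption: only rate-limit errors are worth retrying automatically.
RETRYABLE_CODES = {"RATE_LIMIT_EXCEEDED"}


class CoolifyApiError(Exception):
    """Hypothetical exception wrapping the standard error response."""

    def __init__(self, code: str, message: str,
                 details: Optional[Dict[str, Any]] = None) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details or {}

    @property
    def retryable(self) -> bool:
        return self.code in RETRYABLE_CODES


def parse_error(body: str) -> CoolifyApiError:
    """Build an exception from a JSON error response body."""
    payload = json.loads(body)
    error = payload.get("error", {})
    return CoolifyApiError(
        code=error.get("code", "UNKNOWN"),
        message=error.get("message", ""),
        details=error.get("details"),
    )
```

Callers can then branch on `err.code`, for example surfacing `details["max_count"]` to the user when the code is `QUOTA_EXCEEDED`.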
## Implementation Approach
### Step 1: Create Documentation Directory Structure
1. Create `/docs/enterprise/resource-monitoring/` directory
2. Create subdirectories: `images/`, `examples/`
3. Set up markdown file templates
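
The three steps above can be scripted. The Python sketch below creates the directory, its two subdirectories, and a placeholder file for each of the eight documents that the quality tests in this task require. The `scaffold` helper and the one-line placeholder content are assumptions for illustration; existing files are left untouched.

```python
import tempfile
from pathlib import Path
from typing import List

# File names taken from the required-files list in the documentation tests.
DOC_FILES = [
    "overview.md",
    "user-guide.md",
    "admin-guide.md",
    "technical-reference.md",
    "api-reference.md",
    "troubleshooting.md",
    "best-practices.md",
    "configuration.md",
]


def scaffold(root: Path) -> List[Path]:
    """Create the docs tree under root and return newly created files."""
    base = root / "docs" / "enterprise" / "resource-monitoring"
    for sub in ("images", "examples"):
        (base / sub).mkdir(parents=True, exist_ok=True)

    created = []
    for name in DOC_FILES:
        path = base / name
        if not path.exists():  # never overwrite existing documentation
            path.write_text(f"# {path.stem}\n", encoding="utf-8")
            created.append(path)
    return created


if __name__ == "__main__":
    # Demonstrate against a throwaway directory rather than a real checkout.
    with tempfile.TemporaryDirectory() as tmp:
        for created_path in scaffold(Path(tmp)):
            print(created_path.relative_to(tmp))
```
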
### Step 2: Write Core Documentation Files
1. Start with `overview.md` - Feature introduction and architecture
2. Write `user-guide.md` - Dashboard walkthrough with screenshots
3. Create `admin-guide.md` - Configuration and system administration
4. Develop `technical-reference.md` - Deep technical details
### Step 3: Create API Documentation
1. Document all monitoring API endpoints
2. Include request/response examples for each endpoint
3. Add authentication and rate limiting information
4. Create code examples in multiple languages (curl, PHP, JavaScript)
### Step 4: Write Troubleshooting Guide
1. Document common issues and resolutions
2. Create diagnostic procedures
3. Add performance tuning recommendations
4. Include recovery procedures
### Step 5: Develop Best Practices Guide
1. Resource planning recommendations
2. Quota sizing guidelines
3. Monitoring optimization strategies
4. Integration patterns
### Step 6: Create Supporting Materials
1. Take screenshots of all UI components
2. Create architecture diagrams
3. Generate code examples
4. Build sample configurations
### Step 7: Review and Testing
1. Technical review by engineering team
2. User testing with sample documentation
3. Validate all code examples
4. Check internal links and references
### Step 8: Publication and Maintenance
1. Integrate into main Coolify documentation site
2. Create changelog tracking
3. Set up feedback mechanism
4. Schedule periodic reviews
## Test Strategy
### Documentation Quality Tests
**File:** `tests/Documentation/ResourceMonitoringDocsTest.php`
```php
<?php

use App\Models\User;
use Illuminate\Support\Facades\File;

it('has all required documentation files', function () {
    $requiredFiles = [
        'docs/enterprise/resource-monitoring/overview.md',
        'docs/enterprise/resource-monitoring/user-guide.md',
        'docs/enterprise/resource-monitoring/admin-guide.md',
        'docs/enterprise/resource-monitoring/technical-reference.md',
        'docs/enterprise/resource-monitoring/api-reference.md',
        'docs/enterprise/resource-monitoring/troubleshooting.md',
        'docs/enterprise/resource-monitoring/best-practices.md',
        'docs/enterprise/resource-monitoring/configuration.md',
    ];

    foreach ($requiredFiles as $file) {
        expect(File::exists(base_path($file)))->toBeTrue("File {$file} is missing");
    }
});

it('documentation has valid markdown syntax', function () {
    $docFiles = File::glob(base_path('docs/enterprise/resource-monitoring/*.md'));

    foreach ($docFiles as $file) {
        $content = File::get($file);

        // Check for balanced code blocks
        $backtickCount = substr_count($content, '```');
        expect($backtickCount % 2)->toBe(0, "Unbalanced code blocks in {$file}");

        // Check for valid headers
        preg_match_all('/^(#{1,6})\s+/m', $content, $headers);
        expect($headers[0])->not->toBeEmpty("No headers found in {$file}");
    }
});

it('all code examples are syntactically valid', function () {
    $docFiles = File::glob(base_path('docs/enterprise/resource-monitoring/*.md'));

    foreach ($docFiles as $file) {
        $content = File::get($file);

        // Extract PHP code blocks
        preg_match_all('/```php\n(.*?)```/s', $content, $phpBlocks);

        foreach ($phpBlocks[1] as $index => $code) {
            // php -l treats input without an opening tag as inline HTML,
            // so prepend one when the snippet omits it
            if (! str_starts_with(ltrim($code), '<?php')) {
                $code = "<?php\n" . $code;
            }

            $result = shell_exec("printf '%s' " . escapeshellarg($code) . ' | php -l 2>&1');

            // toContain() is variadic (extra arguments are treated as needles),
            // so assert on a boolean to keep the custom failure message
            expect(str_contains((string) $result, 'No syntax errors'))
                ->toBeTrue("Syntax error in {$file} block #{$index}");
        }
    }
});

it('all internal links are valid', function () {
    $docFiles = File::glob(base_path('docs/enterprise/resource-monitoring/*.md'));

    foreach ($docFiles as $file) {
        $content = File::get($file);
        $dir = dirname($file);

        // Extract markdown links [text](path)
        preg_match_all('/\[([^\]]+)\]\(([^)]+)\)/', $content, $links);

        foreach ($links[2] as $link) {
            // Skip external links
            if (str_starts_with($link, 'http')) {
                continue;
            }

            // Skip anchors
            if (str_starts_with($link, '#')) {
                continue;
            }

            // Check file exists, ignoring any #fragment after the path
            $linkPath = $dir . '/' . explode('#', $link, 2)[0];
            expect(File::exists($linkPath))->toBeTrue("Broken link: {$link} in {$file}");
        }
    }
});

it('API endpoint examples return valid responses', function () {
    // Test actual API endpoints match documentation
    $this->actingAs(User::factory()->create());

    $response = $this->getJson('/api/v1/monitoring/servers');

    $response->assertOk()
        ->assertJsonStructure([
            'data' => [
                '*' => [
                    'server_id',
                    'server_name',
                    'metrics',
                    'health_status',
                    'last_collected_at',
                ],
            ],
            'meta' => [
                'total_servers',
                'healthy',
                'warning',
                'critical',
            ],
        ]);
});
```
### Documentation Completeness Checklist
**Manual Review Checklist:**
- All features from implementation (Tasks 22-31) are documented
- Every UI component has screenshot with caption
- Every configuration option has description and example
- Every API endpoint has request/response example
- All error codes are documented with resolutions
- Architecture diagrams show complete system flow
- Code examples use consistent styling
- Terminology is consistent across all documents
- Cross-references between documents are accurate
- Table of contents is complete and accurate
- Search keywords are included in metadata
- Version compatibility is documented
## Definition of Done
- All 11 core documentation files created and complete
- Documentation directory structure established
- Overview document covers architecture and key features (complete)
- User guide includes dashboard walkthrough with screenshots (8+ sections)
- Administrator guide covers all configuration options (10+ sections)
- Technical reference explains algorithms and data structures
- API reference documents all monitoring endpoints (8+ endpoints)
- Troubleshooting guide includes common issues and resolutions (15+ issues)
- Best practices guide provides planning and optimization advice
- Configuration reference lists all settings with examples
- Migration guide explains enabling on existing installations
- Integration guide covers external monitoring systems (Prometheus, Grafana)
- Security documentation covers access controls and permissions
- All screenshots captured and optimized (20+ images)
- All architecture diagrams created (5+ diagrams)
- All code examples tested and validated (50+ examples)
- Internal links verified and working
- Documentation follows Coolify style guide
- Technical review completed by engineering team
- User testing completed with sample users
- Feedback incorporated from review process
- Documentation integrated into main docs site
- Changelog created tracking documentation versions
- Search metadata added for discoverability
- PDF export generation working
- Documentation quality tests written and passing
- All acceptance criteria met
## Related Tasks
- Depends on: Task 31 (WebSocket broadcasting implementation - must be complete to document accurately)
- Integrates with: Task 22-30 (All resource monitoring implementation tasks - documentation reflects implementation)
- Used by: Enterprise customers for understanding and configuring monitoring features
- Complements: Task 82 (White-label documentation), Task 83 (Terraform documentation)
