Task: Implement automatic rollback mechanism on health check failures
Description
Implement an automatic rollback system that monitors application health during and after deployments and reverts to the previous stable version when health checks fail. This safety mechanism protects production environments from broken deployments by continuously validating application health with configurable health check strategies (HTTP endpoints, TCP connections, custom scripts) and orchestrating rollbacks when issues are detected.
Modern deployment strategies (rolling updates, blue-green, canary) require intelligent health validation to ensure applications remain available. This task builds the foundation for production-grade deployment safety by implementing:
- Multi-Strategy Health Checking: HTTP endpoint validation, TCP port checks, custom script execution, container status monitoring
- Configurable Health Policies: Define success criteria (status codes, response times, consecutive successes), failure thresholds (max retries, timeout durations)
- Automated Rollback Orchestration: Trigger rollback on health check failures, coordinate state restoration across deployment strategies, preserve previous deployment artifacts
- Health Check Persistence: Store health check results in database, track health history for post-deployment analysis, generate health trend reports
- Real-Time Notifications: Alert administrators via WebSocket, email, and Slack when health checks fail and rollbacks execute
- Deployment History: Maintain comprehensive deployment and rollback audit trail with state snapshots
Integration with Existing Coolify Architecture:
- Extends ApplicationDeploymentJob with health check validation phases
- Integrates with existing Server SSH execution infrastructure via ExecuteRemoteCommand trait
- Uses Coolify's existing notification system for health check failure alerts
- Leverages Docker container inspection for health status validation
- Coordinates with proxy configuration updates (Nginx/Traefik) for traffic management
Integration with Enterprise Deployment System:
- Works with EnhancedDeploymentService (Task 32) for strategy-aware rollbacks
- Coordinates with CapacityManager (Task 26) for resource state restoration
- Uses health check data in deployment decision-making algorithms
- Integrates with resource monitoring for correlation between health and resource usage
Why this task is critical: Automatic rollback is the safety net that prevents catastrophic production failures. Without health-based rollbacks, broken deployments can take applications offline for extended periods while administrators manually diagnose and fix issues. Automated rollback is targeted to start within 30 seconds of failure detection and to keep total downtime under 60 seconds, minimizing customer impact. This transforms deployments from high-risk operations requiring human supervision into reliable automated processes that self-correct when problems occur.
Acceptance Criteria
Core Functionality
- Health check system supports HTTP endpoint validation with configurable status codes, response time thresholds, and response body validation
- Health check system supports TCP port connectivity checks for non-HTTP services (databases, Redis, message queues)
- Health check system supports custom script execution for application-specific validation logic
- Health check system supports Docker container health status inspection
- Health checks execute on configurable intervals (default: every 10 seconds for 5 minutes post-deployment)
- Rollback triggers automatically when the number of consecutive health check failures reaches the failure threshold (default: 3 consecutive failures)
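The HTTP criteria above (status codes, response time, body validation) can be sketched as a single pure function. This is a minimal illustration only: the function name evaluateHttpProbe, its parameters, and the returned array shape are hypothetical and not part of Coolify. It assumes PHP 8.0+ for str_contains.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: decide whether a single HTTP probe passed.
 * Parameter names are illustrative assumptions, not Coolify APIs.
 *
 * @param int[] $expectedStatusCodes status codes that count as healthy
 * @return array{passed: bool, reason: string}
 */
function evaluateHttpProbe(
    int $statusCode,
    float $elapsedSeconds,
    string $body,
    array $expectedStatusCodes = [200],
    float $maxResponseSeconds = 10.0,
    ?string $expectedBodyContains = null
): array {
    // 1. The status code must be one of the configured values.
    if (!in_array($statusCode, $expectedStatusCodes, true)) {
        return ['passed' => false, 'reason' => "unexpected status code {$statusCode}"];
    }
    // 2. The response must arrive within the configured time threshold.
    if ($elapsedSeconds > $maxResponseSeconds) {
        return ['passed' => false, 'reason' => 'response time above threshold'];
    }
    // 3. Optional body validation: the body must contain the expected text.
    if ($expectedBodyContains !== null && !str_contains($body, $expectedBodyContains)) {
        return ['passed' => false, 'reason' => 'expected text missing from body'];
    }
    return ['passed' => true, 'reason' => 'ok'];
}
```

Keeping the evaluation separate from the network call makes the pass/fail rules easy to unit test without a running application.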
Rollback Orchestration
- Rollback preserves previous deployment artifacts (Docker images, configuration files, environment variables)
- Rollback restores previous Docker container configuration exactly (image tag, environment, volumes, networks)
- Rollback coordinates with proxy configuration (Nginx/Traefik) to route traffic back to previous version
- Rollback executes within 30 seconds of health check failure detection (target: < 60 seconds total downtime)
- Rollback handles partial failures gracefully (some servers succeed, others fail)
Deployment Strategy Integration
- Rolling update rollback reverts servers in reverse order, restoring traffic to old containers first
- Blue-green rollback switches traffic back to previous environment without destroying new environment
- Canary rollback immediately stops traffic to canary instances and removes them from load balancer
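The three strategy-specific behaviours above can be sketched as a planner that returns ordered rollback steps. This is a hedged illustration: the function planRollback, the strategy keys ('rolling', 'blue_green', 'canary'), and the step wording are assumptions made for this example, not names from the existing codebase.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical planner: returns ordered rollback steps as plain strings.
 *
 * @param string[] $updatedServers servers in the order they were updated
 * @return string[]
 */
function planRollback(string $strategy, array $updatedServers): array
{
    switch ($strategy) {
        case 'rolling':
            // Revert in reverse order; restore traffic before stopping the new container.
            $steps = [];
            foreach (array_reverse($updatedServers) as $server) {
                $steps[] = "restore traffic to old container on {$server}";
                $steps[] = "stop new container on {$server}";
            }
            return $steps;
        case 'blue_green':
            // Switch traffic back, but leave the new environment in place.
            return [
                'switch proxy back to previous environment',
                'keep new environment for inspection',
            ];
        case 'canary':
            // Cut canary traffic first, then remove instances from the load balancer.
            $steps = ['set canary traffic weight to 0'];
            foreach ($updatedServers as $server) {
                $steps[] = "remove {$server} from load balancer";
            }
            return $steps;
        default:
            throw new InvalidArgumentException("Unknown strategy: {$strategy}");
    }
}
```

For example, planRollback('rolling', ['s1', 's2']) lists the steps for s2 before those for s1, matching the reverse-order requirement.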
Configuration & Policy
- Health check configuration stored in database per application with sensible defaults
- Health check policies support: success threshold (consecutive successes), failure threshold (consecutive failures), timeout durations, retry intervals
- Applications can define multiple health check endpoints with AND/OR logic (all must pass OR any must pass)
- Health check configuration UI integrated into application settings
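The AND/OR combination of multiple health checks can be sketched as follows. The function name aggregateHealth and the mode values 'all' and 'any' are hypothetical. Treating an empty result set as unhealthy is an assumption of this sketch, not a stated requirement. It assumes PHP 8.0+ for match.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical aggregator for several health check outcomes.
 *
 * @param bool[] $results one pass/fail flag per configured health check
 * @param string $mode    'all' (AND: every check must pass) or 'any' (OR: one pass suffices)
 */
function aggregateHealth(array $results, string $mode = 'all'): bool
{
    if ($results === []) {
        // Assumption: with no checks, nothing vouches for the deployment.
        return false;
    }

    // Count how many checks passed.
    $passed = count(array_filter($results, fn (bool $r): bool => $r));

    return match ($mode) {
        'all' => $passed === count($results),
        'any' => $passed > 0,
        default => throw new InvalidArgumentException("Unknown mode: {$mode}"),
    };
}
```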
Persistence & Reporting
- Health check results persisted to database with timestamps, status, response details, execution duration
- Deployment history tracks rollback events with triggering health check failure details
- Health check dashboard displays real-time status during deployments with historical trends
- Administrators can view detailed health check logs for failed deployments
Notifications & Alerts
- Real-time WebSocket notifications broadcast health check failures and rollback initiation
- Email notifications sent to application owners on health check failures and rollback completion
- Slack/Discord webhook integration for team notifications (optional, configurable per application)
Error Handling & Edge Cases
- Rollback system handles cases where previous deployment artifacts are missing (logs warning, prevents rollback)
- Health checks time out gracefully without blocking the deployment job indefinitely
- Rollback handles concurrent deployment attempts with proper locking
- System distinguishes between temporary network issues (retry) vs. persistent failures (rollback)
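One possible way to separate temporary blips from persistent failures is to count consecutive results, so that a single success resets the failure streak. The sketch below assumes this approach. The class name HealthStreakTracker and the returned state strings are hypothetical, and the default thresholds mirror the defaults stated in this task (1 success, 3 failures). It assumes PHP 8.0+ for constructor promotion.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical tracker for consecutive health check successes and failures.
 */
final class HealthStreakTracker
{
    private int $successes = 0;
    private int $failures = 0;

    public function __construct(
        private int $successThreshold = 1,
        private int $failureThreshold = 3,
    ) {
    }

    /** Record one check result and return 'healthy', 'rollback', or 'pending'. */
    public function record(bool $passed): string
    {
        if ($passed) {
            $this->successes++;
            $this->failures = 0; // a success ends any failure streak
        } else {
            $this->failures++;
            $this->successes = 0; // a failure ends any success streak
        }

        if ($this->failures >= $this->failureThreshold) {
            return 'rollback';
        }
        if ($this->successes >= $this->successThreshold) {
            return 'healthy';
        }
        return 'pending';
    }
}
```

With the defaults, two failures followed by a success report 'healthy' and no rollback occurs; only three failures in a row return 'rollback'.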
Technical Details
File Paths
Service Layer (NEW):
- app/Services/Enterprise/Deployment/HealthCheckService.php - Health check execution and validation
- app/Services/Enterprise/Deployment/RollbackOrchestrator.php - Rollback coordination across strategies
- app/Contracts/HealthCheckServiceInterface.php - Health check service interface
- app/Contracts/RollbackOrchestratorInterface.php - Rollback orchestrator interface
Models (NEW):
- app/Models/Enterprise/HealthCheckConfig.php - Health check configuration per application
- app/Models/Enterprise/HealthCheckResult.php - Health check execution results
- app/Models/Enterprise/DeploymentHistory.php - Deployment and rollback audit trail
- app/Models/Enterprise/DeploymentSnapshot.php - State snapshots for rollback restoration
Jobs (ENHANCE EXISTING):
- app/Jobs/ApplicationDeploymentJob.php - Enhance with health check validation phase
- app/Jobs/HealthCheckMonitorJob.php - NEW: Scheduled health check monitoring post-deployment
Actions (NEW):
- app/Actions/Deployment/ExecuteHealthCheck.php - Execute individual health check
- app/Actions/Deployment/ExecuteRollback.php - Execute rollback for single deployment
- app/Actions/Deployment/CreateDeploymentSnapshot.php - Capture deployment state before changes
- app/Actions/Deployment/RestoreDeploymentSnapshot.php - Restore previous deployment state
Database Migrations:
- database/migrations/2025_01_XX_create_health_check_configs_table.php
- database/migrations/2025_01_XX_create_health_check_results_table.php
- database/migrations/2025_01_XX_create_deployment_histories_table.php
- database/migrations/2025_01_XX_create_deployment_snapshots_table.php
Tests:
- tests/Unit/Enterprise/Deployment/HealthCheckServiceTest.php
- tests/Unit/Enterprise/Deployment/RollbackOrchestratorTest.php
- tests/Feature/Enterprise/Deployment/AutomaticRollbackTest.php
- tests/Feature/Enterprise/Deployment/HealthCheckExecutionTest.php
Database Schema
health_check_configs table:
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::create('health_check_configs', function (Blueprint $table) {
            $table->id();
            $table->foreignId('application_id')->constrained()->cascadeOnDelete();
            $table->string('name')->nullable(); // User-defined name for this health check
            $table->enum('type', ['http', 'tcp', 'script', 'docker_container'])->default('http');

            // HTTP health check configuration
            $table->string('http_endpoint')->nullable(); // e.g., /health, /api/status
            $table->string('http_method')->default('GET'); // GET, POST, HEAD
            $table->json('http_expected_status_codes')->nullable(); // [200, 204]
            $table->integer('http_timeout_seconds')->default(10);
            $table->text('http_expected_body_contains')->nullable(); // Optional body validation
            $table->json('http_headers')->nullable(); // Custom headers

            // TCP health check configuration
            $table->integer('tcp_port')->nullable();
            $table->integer('tcp_timeout_seconds')->default(5);

            // Script health check configuration
            $table->text('script_command')->nullable(); // Shell command to execute
            $table->integer('script_timeout_seconds')->default(30);
            $table->integer('script_expected_exit_code')->default(0);

            // Docker container health check
            $table->boolean('use_docker_health_status')->default(false);

            // Health check policy
            $table->integer('success_threshold')->default(1); // Consecutive successes needed
            $table->integer('failure_threshold')->default(3); // Consecutive failures before rollback
            $table->integer('check_interval_seconds')->default(10); // Time between checks
            $table->integer('initial_delay_seconds')->default(30); // Wait before first check
            $table->integer('monitoring_duration_seconds')->default(300); // How long to monitor (5 min default)

            // Rollback policy
            $table->boolean('auto_rollback_enabled')->default(true);
            $table->boolean('notify_on_failure')->default(true);
            $table->json('notification_channels')->nullable(); // ['email', 'slack', 'discord']

            $table->boolean('is_active')->default(true);
            $table->timestamps();

            $table->index(['application_id', 'is_active']);
        });
    }

    public function down(): void
    {
        Schema::dropIfExists('health_check_configs');
    }
};

health_check_results table:
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::create('health_check_results', function (Blueprint $table) {
            $table->id();
            $table->foreignId('health_check_config_id')->constrained()->cascadeOnDelete();
            $table->foreignId('deployment_id')->nullable()->constrained('application_deployments')->nullOnDelete();
            $table->foreignId('application_id')->constrained()->cascadeOnDelete();
            $table->foreignId('server_id')->nullable()->constrained()->nullOnDelete();
            $table->enum('status', ['success', 'failure', 'timeout', 'error'])->index();