pgControlPlane is a comprehensive, production-ready control plane for managing PostgreSQL clusters with high availability, automated failover, backup orchestration, and advanced observability.

pgElephant/pgControlPlane

pgControlPlane

Modern PostgreSQL Control Plane for Enterprise Cluster Management

pgControlPlane is a comprehensive, production-ready control plane for managing PostgreSQL clusters with high availability, automated failover, backup orchestration, and advanced observability. It seamlessly integrates all pgElephant components (pgbalancer, pgBackRest, pgraft, FauxDB, pgSentinel) into a unified cluster management platform.

🚀 Features

Core Capabilities

  • Multi-Cluster Management: Manage multiple PostgreSQL clusters from a single control plane
  • Automated Failover: Intelligent failover with configurable policies and safety checks
  • Blue-Green Deployments: Zero-downtime upgrades and migrations
  • Backup Orchestration: Automated backup scheduling with pgBackRest and WAL-G integration
  • Configuration Management: Centralized configuration with drift detection and auto-remediation
  • Connection Pooling: Integrated pgbalancer management with automatic configuration updates

Architecture

  • Dual API: gRPC for performance + REST for convenience with OpenAPI specs
  • PostgreSQL Persistence: Production-ready state storage with migrations
  • OpenTelemetry: Complete observability with traces, metrics, and logs
  • WebSocket Streaming: Real-time cluster status and event notifications
  • Kubernetes Native: CRDs, operator, and Helm charts for K8s deployments
  • Multi-Cloud: Works on VMs, bare metal, and Kubernetes across all cloud providers

Advanced Features

  • Smart Reconciliation: Continuous state reconciliation with configurable intervals
  • Health Scoring: Advanced health metrics for intelligent decision-making
  • Quorum-Based Operations: Safe promotions with quorum requirements
  • Automated Healing: Self-healing clusters with automatic node recovery
  • Point-in-Time Recovery: PITR support with backup/restore orchestration
  • Monitoring Integration: Built-in Prometheus, Grafana, and pgSentinel integration
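
The quorum and health-scoring bullets above can be illustrated with a small sketch. It assumes a strict-majority rule and a `health_score` between 0.0 and 1.0; the `NodeView` type and both function names are hypothetical and are not taken from the pgControlPlane code base.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeView:
    """Hypothetical snapshot of one cluster member as seen by the control plane."""
    node_id: str
    reachable: bool
    health_score: float  # assumed scale: 0.0 (failed) to 1.0 (healthy)


def has_quorum(nodes: List[NodeView]) -> bool:
    """Return True when a strict majority of all members is reachable."""
    reachable = sum(1 for node in nodes if node.reachable)
    return reachable > len(nodes) // 2


def pick_promotion_candidate(nodes: List[NodeView]) -> Optional[NodeView]:
    """Return the healthiest reachable node, or None when quorum is lost."""
    if not has_quorum(nodes):
        return None
    return max((n for n in nodes if n.reachable), key=lambda n: n.health_score)
```

Refusing to promote without a majority is what keeps a partitioned minority from electing its own primary; the real policy may weigh more signals than a single score.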

Security

  • mTLS: Mutual TLS between control plane and agents
  • RBAC: Role-based access control with fine-grained permissions
  • Vault Integration: Secret management with HashiCorp Vault
  • Audit Logging: Complete audit trail for compliance
  • JWT Authentication: Modern token-based authentication with refresh tokens

📋 Quick Start

⚡ Deploy a complete cluster in one command:

cd pgControlPlane
./scripts/deploy-full-cluster.sh --name mycluster --nodes 3

This single command deploys:

  • ✅ 3-node PostgreSQL cluster with automated failover
  • ✅ pgbalancer for connection pooling and load balancing
  • ✅ pgBackRest for automated backups
  • ✅ pgSentinel for real-time monitoring
  • ✅ FauxDB for MongoDB compatibility layer
  • ✅ Control plane agents on all nodes

See QUICKSTART.md for detailed instructions.

Prerequisites

  • Go 1.22+
  • PostgreSQL 14+ (optional, managed by control plane)
  • Docker & Docker Compose (for local deployment)
  • Kubernetes 1.27+ (optional, for production K8s deployment)

Installation

Binary Installation

# Download latest release
curl -L https://github.com/pgElephant/pgControlPlane/releases/latest/download/pgcp-linux-amd64.tar.gz | tar xz
sudo mv pgcp /usr/local/bin/

# Verify installation
pgcp version

Build from Source

git clone https://github.com/pgElephant/pgControlPlane
cd pgControlPlane
make build
sudo make install

Docker (Complete Stack with All Components)

# Deploy the complete pgElephant stack
cd pgControlPlane
docker-compose -f deployments/complete-stack.yaml up -d

# Verify all services are running
docker-compose -f deployments/complete-stack.yaml ps

# Access the cluster
psql postgresql://postgres:postgres@localhost:5435/production

Services included:

Kubernetes

# Add Helm repository
helm repo add pgelephant https://pgelephant.github.io/charts
helm repo update

# Install with Helm
helm install pgcontrolplane pgelephant/pgcontrolplane \
  --namespace pgcontrolplane \
  --create-namespace \
  --set database.url="postgres://user:pass@host:5432/pgcp"

Running Locally

# Set up database
createdb pgcontrolplane
make migrate

# Configure
export PGCP_DATABASE_URL="postgres://localhost:5432/pgcontrolplane"
export PGCP_LOG_LEVEL="info"

# Run control plane
make run

# Or run with Docker Compose
docker-compose up -d

Deploy Your First Cluster

Option 1: Automated Deployment (Recommended)

# Deploy complete cluster with all components
./scripts/deploy-full-cluster.sh \
  --name production \
  --nodes 3 \
  --version 16.1 \
  --replication async

# Output includes connection info and all service URLs

Option 2: API-based Deployment

# Provision via orchestrator API
curl -X POST http://localhost:8080/api/v1/orchestrator/provision \
  -H "Content-Type: application/json" \
  -d '{
    "name": "production",
    "postgres_version": "16.1",
    "node_count": 3,
    "region": "us-east-1",
    "enable_pgbalancer": true,
    "enable_pgbackrest": true,
    "enable_pgraft": false,
    "enable_fauxdb": true,
    "enable_pgsentinel": true,
    "replication_mode": "async",
    "backup_schedule": "0 2 * * *",
    "instance_type": "m5.large",
    "storage_gb": 100,
    "extensions": ["pg_stat_statements", "pg_stat_insights"]
  }'

# Check cluster status
curl http://localhost:8080/api/v1/clusters/production/status
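
The same provisioning call can be prepared from a script. The sketch below only builds the request and does not send it; the helper name `build_provision_request` is hypothetical, and it is an assumption that the endpoint accepts this subset of the fields shown in the curl example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # REST port used throughout this README


def build_provision_request(name: str, node_count: int,
                            postgres_version: str = "16.1",
                            replication_mode: str = "async") -> urllib.request.Request:
    """Build (but do not send) a POST to the orchestrator provision endpoint."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    payload = {
        "name": name,
        "postgres_version": postgres_version,
        "node_count": node_count,
        "replication_mode": replication_mode,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/orchestrator/provision",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would submit it to a running control plane.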

Option 3: Kubernetes CRD

apiVersion: controlplane.pgelephant.com/v1
kind: PgCluster
metadata:
  name: production
spec:
  name: production
  postgresVersion: "16.1"
  nodeCount: 3
  enablePgBalancer: true
  enablePgBackRest: true
  enablePgSentinel: true

🔗 Integrated pgElephant Components

pgControlPlane seamlessly orchestrates all pgElephant components:

pgbalancer - Connection Pooling & Load Balancing

  • Purpose: Intelligent connection pooling and load distribution
  • Features: Round-robin/least-connected balancing, REST API control, MQTT clustering
  • Integration: Auto-configured with backend nodes, health checks, automatic failover
  • Access: Clients connect through pgbalancer for optimal performance

pgBackRest - Enterprise Backup Solution

  • Purpose: Automated backup and point-in-time recovery
  • Features: Full/incremental/differential backups, encryption, compression
  • Integration: Scheduled backups, retention policies, restore automation
  • Storage: S3, Azure Blob, GCS, or local filesystem

pgraft - Raft Consensus (Optional)

  • Purpose: Strong consistency with distributed consensus
  • Features: Leader election, log replication, etcd-compatible API
  • Integration: Alternative to streaming replication for CP guarantees
  • Use Case: Financial systems, inventory management, critical data

pgSentinel - Real-Time Monitoring

  • Purpose: Comprehensive cluster monitoring and alerting
  • Features: Query analytics, replication monitoring, performance insights
  • Integration: Auto-discovers all nodes, tracks metrics, generates alerts
  • Dashboard: Beautiful web UI with real-time updates

FauxDB - MongoDB Compatibility Layer

  • Purpose: MongoDB API compatibility for PostgreSQL
  • Features: MongoDB wire protocol, JSON document storage, MongoDB query language
  • Integration: Transparently translates MongoDB requests to PostgreSQL
  • Use Case: Migrate MongoDB applications to PostgreSQL without code changes

pg_stat_insights - Query Analytics Extension

  • Purpose: Deep query performance analysis
  • Features: Query tracking, execution plans, performance trends
  • Integration: Installed on all nodes, data aggregated by pgSentinel
  • Benefits: Identify slow queries, optimize performance

📚 Architecture

Component Interaction

┌─────────────┐
│   Clients   │
└──────┬──────┘
       │
┌──────▼─────────┐      ┌───────────────┐
│  pgbalancer    │◄─────┤ pgControlPlane│
│ (Port 5433)    │      │  (Port 8080)  │
└───┬───┬───┬────┘      └───────┬───────┘
    │   │   │                   │
┌───▼───▼───▼────┐    ┌─────────▼─────────┐
│ PostgreSQL     │◄───┤   Agents          │
│ Nodes (1-N)    │    │   (on each node)  │
│ + pg_stat_     │    └───────────────────┘
│   insights     │
└────────────────┘              │
        │                       │
┌───────▼──────────┐   ┌────────▼─────────┐
│  pgBackRest      │   │   pgSentinel     │
│  (Backups)       │   │   (Monitoring)   │
└──────────────────┘   └──────────────────┘
                               │
                       ┌───────▼────────┐
                       │    FauxDB      │
                       │   (Testing)    │
                       └────────────────┘

System Architecture

                          ┌─────────────────┐
                          │    Clients      │
                          │  Applications   │
                          └────────┬────────┘
                                   │
                    ┌──────────────┼──────────────┐
                    │              │              │
          ┌─────────▼────────┐     │     ┌────────▼─────────┐
          │   pgbalancer     │     │     │  pgControlPlane  │
          │ Connection Pool  │     │     │   Control API    │
          │ Load Balancer    │     │     │  (Port 8080)     │
          │  (Port 5433)     │◄────┼─────┤                  │
          └─────────┬────────┘     │     └────────┬─────────┘
                    │              │              │
       ┌────────────┼──────────────┘              │
       │            │                             │
       │  ┌─────────▼───────────────────────┐     │
       │  │    PostgreSQL Cluster Nodes     │     │
       │  │  ┌─────────────────────────┐    │     │
       │  │  │  Node 1 (Primary)       │◄───┼─────┤
       │  │  │  + Agent                │    │     │
       │  │  │  + pg_stat_insights     │    │     │
       │  │  └───────────┬─────────────┘    │     │
       │  │              │ Replication      │     │
       │  │  ┌───────────▼─────────────┐    │     │
       │  │  │  Node 2 (Replica)       │◄───┼─────┤
       │  │  │  + Agent                │    │     │
       │  │  │  + pg_stat_insights     │    │     │
       │  │  └───────────┬─────────────┘    │     │
       │  │              │ Replication      │     │
       │  │  ┌───────────▼─────────────┐    │     │
       │  │  │  Node 3 (Replica)       │◄───┼─────┤
       │  │  │  + Agent                │    │     │
       │  │  │  + pg_stat_insights     │    │     │
       │  │  └─────────────────────────┘    │     │
       │  └────────┬────────────────────────┘     │
       │           │                              │
       │  ┌────────▼──────────┐       ┌───────────▼──────────┐
       │  │   pgBackRest      │       │   pgSentinel         │
       │  │  Backup System    │       │  Monitoring Hub      │
       │  │  - Full/Incr      │       │  - Dashboard         │
       │  │  - PITR           │       │  - Alerts            │
       │  │  - S3 Storage     │       │  - Analytics         │
       │  └───────────────────┘       └──────────┬───────────┘
       │                                         │
       │                              ┌──────────▼──────────┐
       │                              │     FauxDB          │
       │                              │  MongoDB Compat     │
       └──────────────────────────────┤  - Wire Protocol    │
                                      │  - JSON Documents   │
                                      │  - Query Trans.     │
                                      └─────────────────────┘

Legend:
━━━  Data Flow          ◄───  Management/Control
│    Replication        ┌─┐   Component/Service

Control Plane Internal Architecture

┌──────────────────────── pgControlPlane ─────────────────────────┐
│                                                                 │
│  ┌───────────────────── API Layer ─────────────────────┐        │
│  │                                                     │        │
│  │  ┌──────────┐    ┌──────────┐    ┌──────────────┐   │        │
│  │  │   REST   │    │   gRPC   │    │  WebSocket   │   │        │
│  │  │ (8080)   │    │  (9090)  │    │   (8081)     │   │        │
│  │  └────┬─────┘    └────┬─────┘    └──────┬───────┘   │        │
│  └───────┼───────────────┼─────────────────┼───────────┘        │
│          │               │                 │                    │
│  ┌───────▼───────────────▼─────────────────▼───────────┐        │
│  │                 Service Layer                       │        │
│  │                                                     │        │
│  │  ┌──────────────┐  ┌──────────────┐  ┌───────────┐  │        │
│  │  │   Cluster    │  │  Reconciler  │  │   Agent   │  │        │
│  │  │   Manager    │  │     Loop     │  │Coordinator│  │        │
│  │  └──────────────┘  └──────────────┘  └───────────┘  │        │
│  │                                                     │        │
│  │  ┌──────────────┐  ┌──────────────┐  ┌───────────┐  │        │
│  │  │    Backup    │  │ Orchestrator │  │ WebSocket │  │        │
│  │  │   Manager    │  │   (Deploy)   │  │  Manager  │  │        │
│  │  └──────────────┘  └──────────────┘  └───────────┘  │        │
│  └────────────────────────┬────────────────────────────┘        │
│                           │                                     │
│  ┌────────────────────────▼─────────────────────────────────┐   │
│  │      PostgreSQL State Store (Control Plane DB)           │   │
│  │                                                          │   │
│  │   Tables:  clusters | nodes | agents | backups           │   │
│  │            events | configurations                       │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌─────────────────── Observability Stack ──────────────────┐   │
│  │                                                          │   │
│  │  • OpenTelemetry   • Prometheus   • Metrics (Port 2112)  │   │
│  │  • Traces          • Logs         • Grafana Dashboards   │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Data Flow

  1. API Layer: Accepts requests via REST, gRPC, or WebSocket
  2. Service Layer: Business logic for cluster operations
  3. Persistence Layer: PostgreSQL for state + event sourcing
  4. Agent Communication: mTLS-secured commands to agents
  5. Observability: OpenTelemetry spans and metrics throughout

Key Services

Cluster Manager

  • CRUD operations for clusters and nodes
  • Health check orchestration
  • Topology management

Reconciler

  • Continuous reconciliation loop (default: 30s)
  • Detects and repairs drift
  • Automated failover when needed
  • Configuration synchronization
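
One pass of such a loop can be sketched as a diff between desired and observed settings followed by a repair call. This is an illustration only: `detect_drift`, `reconcile_once`, and the `apply` callback are hypothetical names, and the actual reconciler may track far more than flat key/value settings.

```python
def detect_drift(desired: dict, observed: dict) -> dict:
    """Return {key: (desired_value, observed_value)} for every mismatched setting."""
    drift = {}
    for key, want in desired.items():
        have = observed.get(key)
        if have != want:
            drift[key] = (want, have)
    return drift


def reconcile_once(desired: dict, observed: dict, apply) -> int:
    """Run one reconciliation pass and return the number of settings repaired.

    `apply(key, value)` is a hypothetical callback standing in for an agent command.
    """
    drift = detect_drift(desired, observed)
    for key, (want, _have) in drift.items():
        apply(key, want)
    return len(drift)
```

A scheduler would call `reconcile_once` once per interval (30s by default, per the list above).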

Agent Coordinator

  • Agent registration and heartbeat
  • Command dispatch with retry logic
  • Health monitoring and pruning
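
Heartbeat-based pruning can be pictured as follows. The sketch assumes the coordinator keeps a map of agent ID to last heartbeat time and drops entries older than a TTL; the function name and data layout are hypothetical, and the 5-minute value mirrors `PGCP_AGENT_TTL=5m` from the configuration section.

```python
from datetime import datetime, timedelta

AGENT_TTL = timedelta(minutes=5)  # mirrors PGCP_AGENT_TTL=5m


def prune_stale_agents(last_heartbeat: dict, now: datetime,
                       ttl: timedelta = AGENT_TTL) -> list:
    """Remove agents whose last heartbeat is older than `ttl`; return their IDs."""
    stale = sorted(agent_id for agent_id, seen in last_heartbeat.items()
                   if now - seen > ttl)
    for agent_id in stale:
        del last_heartbeat[agent_id]
    return stale
```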

Backup Manager

  • Scheduled backup orchestration
  • PITR support
  • Backup verification
  • Retention policy enforcement
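
As an illustration of retention enforcement, the sketch below applies a "keep the last N full backups" rule and expires everything older than the oldest retained full backup. The policy, the function name, and the dictionary keys are assumptions made for this example; pgBackRest has its own retention settings, and how pgControlPlane drives them is not shown in this README.

```python
from typing import Dict, List


def expired_backups(backups: List[Dict[str, str]], keep_full: int) -> List[str]:
    """Return IDs of backups older than the oldest retained full backup.

    Each backup is a dict with hypothetical keys "id", "type", and "taken_at"
    (UTC timestamps in one fixed ISO 8601 format, so they sort as strings).
    """
    if keep_full < 1:
        raise ValueError("keep_full must be at least 1")
    ordered = sorted(backups, key=lambda b: b["taken_at"])
    fulls = [b for b in ordered if b["type"] == "full"]
    if len(fulls) <= keep_full:
        return []  # nothing to expire yet
    cutoff = fulls[-keep_full]["taken_at"]
    return [b["id"] for b in ordered if b["taken_at"] < cutoff]
```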

🔧 Configuration

Environment Variables

# Server
PGCP_HTTP_PORT=8080
PGCP_GRPC_PORT=9090
PGCP_WS_PORT=8081

# Database
PGCP_DATABASE_URL=postgres://localhost:5432/pgcontrolplane
PGCP_DATABASE_MAX_CONNS=100
PGCP_DATABASE_MAX_IDLE_CONNS=10

# Security
PGCP_JWT_SECRET=your-secret-key
PGCP_JWT_EXPIRY=24h
PGCP_TLS_CERT=/path/to/cert.pem
PGCP_TLS_KEY=/path/to/key.pem
PGCP_MTLS_CA=/path/to/ca.pem

# Reconciliation
PGCP_RECONCILE_INTERVAL=30s
PGCP_PROMOTION_TIMEOUT=30s
PGCP_SAFE_PROMOTE=true

# Agent
PGCP_AGENT_TTL=5m
PGCP_AGENT_PRUNE_INTERVAL=1m

# Observability
PGCP_LOG_LEVEL=info
PGCP_LOG_FORMAT=json
PGCP_METRICS_PORT=2112
PGCP_TRACING_ENDPOINT=http://jaeger:14268/api/traces

# Features
PGCP_AUTO_FAILOVER=true
PGCP_AUTO_HEALING=true
PGCP_CONFIG_DRIFT_DETECTION=true
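
To show how these variables might be consumed, here is a minimal loader sketch. It reads a handful of the `PGCP_*` names listed above and falls back to the example values shown; `parse_duration` and `load_settings` are hypothetical helpers, and the parser only understands single-unit durations such as `30s`, `5m`, or `24h`.

```python
import os
import re
from datetime import timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text: str) -> timedelta:
    """Parse simple durations such as '30s', '5m', or '24h'."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def load_settings(env=os.environ) -> dict:
    """Read a few PGCP_* variables, using the example values above as fallbacks."""
    return {
        "http_port": int(env.get("PGCP_HTTP_PORT", "8080")),
        "reconcile_interval": parse_duration(env.get("PGCP_RECONCILE_INTERVAL", "30s")),
        "agent_ttl": parse_duration(env.get("PGCP_AGENT_TTL", "5m")),
        "auto_failover": env.get("PGCP_AUTO_FAILOVER", "true").lower() == "true",
    }
```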

Configuration File

# config.yaml
server:
  http_port: 8080
  grpc_port: 9090
  ws_port: 8081
  read_timeout: 30s
  write_timeout: 30s

database:
  url: postgres://localhost:5432/pgcontrolplane
  max_connections: 100
  max_idle_connections: 10
  connection_timeout: 10s

security:
  jwt:
    secret: ${JWT_SECRET}
    expiry: 24h
  tls:
    enabled: true
    cert_file: /etc/pgcp/tls/cert.pem
    key_file: /etc/pgcp/tls/key.pem
    ca_file: /etc/pgcp/tls/ca.pem
  rbac:
    enabled: true

reconciler:
  interval: 30s
  promotion_timeout: 30s
  safe_promote: true
  max_concurrent_reconciles: 5

agents:
  ttl: 5m
  prune_interval: 1m
  command_timeout: 2m

observability:
  logging:
    level: info
    format: json
  metrics:
    enabled: true
    port: 2112
  tracing:
    enabled: true
    endpoint: http://jaeger:14268/api/traces
    sample_rate: 0.1

features:
  auto_failover: true
  auto_healing: true
  config_drift_detection: true
  backup_orchestration: true

🌐 API Reference

REST API

Full OpenAPI/Swagger documentation is available at /api/docs.

Authentication

# Login
POST /api/v1/auth/login
{
  "username": "admin",
  "password": "secret"
}

# Response
{
  "access_token": "eyJ...",
  "refresh_token": "eyJ...",
  "expires_in": 86400
}
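
A client would typically turn the login response into request headers and track when the token lapses. The sketch assumes the API expects an `Authorization: Bearer <token>` header, which this README does not state; both helper names are hypothetical.

```python
import json
from datetime import datetime, timedelta


def bearer_headers(login_body: str) -> dict:
    """Turn a login response body into request headers (Bearer scheme assumed)."""
    tokens = json.loads(login_body)
    return {"Authorization": f"Bearer {tokens['access_token']}"}


def expires_at(login_body: str, issued_at: datetime) -> datetime:
    """Compute when the access token lapses from its `expires_in` seconds."""
    tokens = json.loads(login_body)
    return issued_at + timedelta(seconds=tokens["expires_in"])
```

The `expires_in` value of 86400 seconds in the example response equals 24 hours, the same duration as `PGCP_JWT_EXPIRY=24h`.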

Clusters

# List clusters
GET /api/v1/clusters

# Get cluster
GET /api/v1/clusters/{id}

# Create cluster
POST /api/v1/clusters
{
  "name": "production",
  "region": "us-east-1",
  "postgres_version": "16.1",
  "replication_mode": "sync",
  "auto_failover": true
}

# Update cluster
PUT /api/v1/clusters/{id}

# Delete cluster
DELETE /api/v1/clusters/{id}

# Get cluster status
GET /api/v1/clusters/{id}/status

# Get cluster topology
GET /api/v1/clusters/{id}/topology

# Get cluster metrics
GET /api/v1/clusters/{id}/metrics

Nodes

# List nodes
GET /api/v1/clusters/{cluster_id}/nodes

# Add node
POST /api/v1/clusters/{cluster_id}/nodes
{
  "host": "192.168.1.10",
  "port": 5432,
  "role": "replica",
  "priority": 100
}

# Remove node
DELETE /api/v1/clusters/{cluster_id}/nodes/{node_id}

# Promote node
POST /api/v1/clusters/{cluster_id}/nodes/{node_id}/promote
{
  "force": false
}

Backups

# List backups
GET /api/v1/clusters/{cluster_id}/backups

# Create backup
POST /api/v1/clusters/{cluster_id}/backups
{
  "type": "full",
  "compression": true
}

# Restore backup
POST /api/v1/clusters/{cluster_id}/restore
{
  "backup_id": "backup-123",
  "point_in_time": "2024-01-15T10:30:00Z"
}
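
Before submitting a restore, a client can check that the recovery target is usable: PostgreSQL point-in-time recovery cannot target a moment before the base backup finished. The sketch below builds the request body with that check; `build_restore_payload` and its `backup_finished_at` argument are hypothetical, and the timestamp format is taken from the example above.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # matches the example timestamp above


def parse_utc(text: str) -> datetime:
    """Parse a UTC timestamp such as '2024-01-15T10:30:00Z'."""
    return datetime.strptime(text, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def build_restore_payload(backup_id: str, backup_finished_at: str,
                          point_in_time: str) -> dict:
    """Build the restore body, rejecting targets that precede the base backup."""
    if parse_utc(point_in_time) < parse_utc(backup_finished_at):
        raise ValueError("point_in_time precedes the end of the base backup")
    return {"backup_id": backup_id, "point_in_time": point_in_time}
```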

gRPC API

See api/proto/controlplane.proto for full service definitions

service ControlPlane {
  rpc CreateCluster(CreateClusterRequest) returns (Cluster);
  rpc GetCluster(GetClusterRequest) returns (Cluster);
  rpc ListClusters(ListClustersRequest) returns (ListClustersResponse);
  rpc UpdateCluster(UpdateClusterRequest) returns (Cluster);
  rpc DeleteCluster(DeleteClusterRequest) returns (Empty);
  
  rpc AddNode(AddNodeRequest) returns (Node);
  rpc RemoveNode(RemoveNodeRequest) returns (Empty);
  rpc PromoteNode(PromoteNodeRequest) returns (PromoteNodeResponse);
  
  rpc StreamClusterEvents(StreamClusterEventsRequest) returns (stream ClusterEvent);
}

WebSocket API

// Connect
const ws = new WebSocket('ws://localhost:8081/api/v1/ws');

// Subscribe to cluster events
ws.send(JSON.stringify({
  action: 'subscribe',
  cluster_id: 'production'
}));

// Receive events
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Event:', data);
};

🔍 Observability

Metrics

Prometheus metrics exposed on port 2112:

# Control Plane Metrics
pgcp_clusters_total
pgcp_nodes_total
pgcp_reconcile_runs_total
pgcp_reconcile_duration_seconds
pgcp_promotions_total
pgcp_promotions_failed_total
pgcp_failovers_total
pgcp_agent_commands_total
pgcp_agent_commands_duration_seconds
pgcp_backup_operations_total
pgcp_backup_size_bytes

# Per-Cluster Metrics
pgcp_cluster_health_score
pgcp_cluster_replication_lag_seconds
pgcp_cluster_nodes_up
pgcp_cluster_nodes_down
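
For quick checks without a Prometheus server, the text exposition output can be parsed directly. The parser below is deliberately minimal (no timestamps, no spaces inside label values), and the `cluster="..."` label used in the test data is an assumption about how per-cluster series are labelled.

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition lines into {series: float}.

    Minimal on purpose: skips comments and blank lines, and assumes no
    spaces inside label values and no trailing timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples
```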

Logging

Structured JSON logs with correlation IDs:

{
  "level": "info",
  "timestamp": "2024-01-15T10:30:45Z",
  "correlation_id": "req-123-abc",
  "component": "reconciler",
  "cluster_id": "production",
  "message": "promoting node to primary",
  "node_id": "node-2",
  "reason": "primary_down"
}
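
Because every record carries a `correlation_id`, one request can be followed across components with a simple filter. The sketch assumes one JSON object per line in the log stream, which may differ from how the logs are actually emitted; `filter_logs` is a hypothetical helper.

```python
import json


def filter_logs(lines, **fields):
    """Yield parsed log records whose fields equal every given key=value pair."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as stack traces
        if all(record.get(key) == value for key, value in fields.items()):
            yield record
```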

Tracing

OpenTelemetry traces for all operations:

Promote Node
├── Check Quorum (15ms)
├── Validate Candidate (8ms)
├── Send Promote Command (120ms)
│   ├── Agent Call (100ms)
│   └── Retry Logic (20ms)
├── Update State (25ms)
└── Notify Watchers (10ms)
Total: 178ms
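
The total in the trace above is the sum of the top-level spans, with the two child spans accounting for the whole of "Send Promote Command". The arithmetic can be checked with a few lines, assuming the spans run one after another as the tree suggests:

```python
# Durations in milliseconds, copied from the example trace above.
top_level_spans = {
    "Check Quorum": 15,
    "Validate Candidate": 8,
    "Send Promote Command": 120,  # parent of the two child spans below
    "Update State": 25,
    "Notify Watchers": 10,
}
promote_children = {"Agent Call": 100, "Retry Logic": 20}

total_ms = sum(top_level_spans.values())
children_ms = sum(promote_children.values())
```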

🚢 Deployment

Docker Compose

version: '3.8'

services:
  pgcontrolplane:
    image: pgelephant/pgcontrolplane:latest
    ports:
      - "8080:8080"
      - "9090:9090"
      - "2112:2112"
    environment:
      PGCP_DATABASE_URL: postgres://pgcp:secret@postgres:5432/pgcontrolplane
      PGCP_LOG_LEVEL: info
    depends_on:
      - postgres

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: pgcontrolplane
      POSTGRES_USER: pgcp
      POSTGRES_PASSWORD: secret
    volumes:
      - pgcp-data:/var/lib/postgresql/data

volumes:
  pgcp-data:

Kubernetes

apiVersion: v1
kind: Namespace
metadata:
  name: pgcontrolplane

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgcontrolplane
  namespace: pgcontrolplane
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pgcontrolplane
  template:
    metadata:
      labels:
        app: pgcontrolplane
    spec:
      containers:
      - name: pgcontrolplane
        image: pgelephant/pgcontrolplane:latest
        ports:
        - containerPort: 8080
          name: http
        - containerPort: 9090
          name: grpc
        - containerPort: 2112
          name: metrics
        env:
        - name: PGCP_DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: pgcp-secrets
              key: database-url
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

📖 Examples & Use Cases

Example 1: Development Cluster

# Minimal single-node cluster for development
export ENABLE_PGSENTINEL=false
export ENABLE_FAUXDB=false
export INSTANCE_TYPE=t3.small

./scripts/deploy-full-cluster.sh --name dev --nodes 1

Example 2: High Availability Production Cluster

# 5-node cluster with synchronous replication
./scripts/deploy-full-cluster.sh \
  --name production \
  --nodes 5 \
  --version 16.1 \
  --replication sync \
  --region us-east-1

Example 3: Raft Consensus Cluster

# Strong consistency with pgraft
./scripts/deploy-full-cluster.sh \
  --name financial \
  --nodes 5 \
  --with-raft

Example 4: Multi-Region Setup

# Primary region
./scripts/deploy-full-cluster.sh \
  --name primary \
  --nodes 3 \
  --region us-east-1

# Standby region  
./scripts/deploy-full-cluster.sh \
  --name standby \
  --nodes 3 \
  --region us-west-2

Example 5: Complete Testing Environment

# Full stack with all testing tools
export ENABLE_FAUXDB=true
export ENABLE_PGSENTINEL=true

./scripts/deploy-full-cluster.sh \
  --name testing \
  --nodes 3

# Run automated tests
curl -X POST http://localhost:5000/api/tests/run

See the examples/ directory for more:

  • basic-cluster/ - Simple 3-node cluster setup
  • ha-cluster/ - High-availability configuration
  • kubernetes/ - Complete Kubernetes deployment
  • Full documentation at QUICKSTART.md

🧪 Testing

# Run unit tests
make test

# Run integration tests
make test-integration

# Run end-to-end tests
make test-e2e

# Run with coverage
make test-coverage

# Run benchmarks
make bench

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

📄 License

Apache License 2.0 - see LICENSE file for details.

🔗 Links

🙏 Acknowledgments

Built with ❤️ by the pgElephant team and contributors.

Special thanks to the PostgreSQL community and the following projects:

  • PostgreSQL
  • pgBackRest
  • Patroni
  • etcd
  • OpenTelemetry
