Monitoring Alerting
Real-time monitoring, alerting, capacity planning, and resource analysis for production PostgreSQL databases.
The category provides five monitoring tools:
| Tool | Purpose | Category | 
|---|---|---|
| monitor_real_time | Real-time performance monitoring | Observability | 
| alert_threshold_set | Metric threshold analysis and alerting | Alerting | 
| capacity_planning | Growth projection and capacity forecasting | Planning | 
| resource_usage_analyze | CPU/Memory/IO resource analysis | Performance | 
| replication_monitor | Replication status and lag monitoring | High Availability | 
monitor_real_time
Real-time monitoring of database performance metrics, including queries, locks, connections, and I/O.
Parameters:
- include_queries (boolean, optional): Include currently running queries
- include_locks (boolean, optional): Include lock information
- include_io (boolean, optional): Include I/O statistics
- limit (integer, optional): Maximum number of results (default: 10)
Returns:
- timestamp: Current timestamp
- metrics.connections: Connection statistics by state
- metrics.active_queries: Currently running queries
- metrics.locks: Lock information by type
- metrics.io_statistics: Heap and index block statistics
- metrics.database: Database size and modifications
Example:
result = monitor_real_time(
    include_queries=True,
    include_locks=True,
    include_io=True,
    limit=10
)
# Returns: {
#   "timestamp": "2025-10-03 15:30:45",
#   "metrics": {
#     "connections": {
#       "total": 25,
#       "by_state": [
#         {"state": "active", "count": 3},
#         {"state": "idle", "count": 22}
#       ]
#     },
#     "active_queries": {...},
#     "locks": {"total": 5, "blocked": 0},
#     "io_statistics": {
#       "heap_hit_ratio_percent": 99.2,
#       "index_hit_ratio_percent": 99.8
#     }
#   }
# }
Use Cases:
- Production monitoring dashboards
- Real-time performance troubleshooting
- Lock contention detection
- Connection pool monitoring
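For dashboards it can help to reduce the result to a few headline numbers. The following is a minimal sketch that assumes the result has the shape shown in the example output above; `summarize_connections` is a hypothetical helper, not part of the tool set, and the sample dictionary is copied from that example.

```python
from typing import Any


def summarize_connections(result: dict[str, Any]) -> dict[str, Any]:
    """Summarize connection states from a monitor_real_time-style result."""
    metrics = result["metrics"]
    connections = metrics["connections"]
    by_state = {row["state"]: row["count"] for row in connections["by_state"]}
    total = connections["total"]
    active = by_state.get("active", 0)
    return {
        "total": total,
        "by_state": by_state,
        # Share of connections doing work right now; 0.0 when there are none.
        "active_percent": round(100.0 * active / total, 1) if total else 0.0,
        # Assumption: the "locks" key may be missing if locks were not requested.
        "has_blocked_locks": metrics.get("locks", {}).get("blocked", 0) > 0,
    }


# Sample copied from the example output above (active_queries omitted).
sample = {
    "timestamp": "2025-10-03 15:30:45",
    "metrics": {
        "connections": {
            "total": 25,
            "by_state": [
                {"state": "active", "count": 3},
                {"state": "idle", "count": 22},
            ],
        },
        "locks": {"total": 5, "blocked": 0},
    },
}

summary = summarize_connections(sample)
print(summary)
```

A large idle share, as in this sample, is common behind a connection pool; a rising `active_percent` together with blocked locks is usually the more interesting signal.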
alert_threshold_set
Analyze database metrics against configurable alert thresholds.
Parameters:
- metric_type (string, required): Metric to check (cache_hit_ratio, connection_count, transaction_age, database_size, replication_lag)
- warning_threshold (number, required): Warning threshold value
- critical_threshold (number, required): Critical threshold value
- check_current (boolean, optional): Check current value against thresholds
Returns:
- metric_type: Metric being monitored
- thresholds: Configured warning and critical thresholds
- current_value: Current metric value
- alert_status: Current status (ok, warning, critical)
- unit: Measurement unit
Example:
# Monitor cache hit ratio
result = alert_threshold_set(
    metric_type="cache_hit_ratio",
    warning_threshold=95.0,
    critical_threshold=90.0,
    check_current=True
)
# Returns: {
#   "metric_type": "cache_hit_ratio",
#   "thresholds": {"warning": 95.0, "critical": 90.0},
#   "current_value": 99.3,
#   "alert_status": "ok",
#   "unit": "percent"
# }
# Monitor connection pool
result = alert_threshold_set(
    metric_type="connection_count",
    warning_threshold=80,
    critical_threshold=95,
    check_current=True
)
Supported Metrics:
- cache_hit_ratio: Buffer cache hit percentage
- connection_count: Active database connections
- transaction_age: Longest running transaction age (seconds)
- database_size: Total database size (bytes)
- replication_lag: Replication lag (seconds, replicas only)
Use Cases:
- Automated alerting systems
- Threshold-based monitoring
- Performance degradation detection
- Capacity warning systems
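Note that the two examples above use thresholds in opposite directions: for cache_hit_ratio the critical value is lower than the warning value, while for connection_count it is higher. The sketch below shows one way a status could be derived from such a pair. It is an illustration only: `classify` is a hypothetical function, and inferring the direction from the threshold ordering is an assumption, not documented behavior of alert_threshold_set.

```python
def classify(value: float, warning: float, critical: float) -> str:
    """Return "ok", "warning", or "critical" for a metric value.

    Assumption: if critical < warning, lower values are worse (for example
    cache_hit_ratio); otherwise higher values are worse (for example
    connection_count).
    """
    if critical < warning:
        # Lower is worse: alert when the value falls below a threshold.
        if value < critical:
            return "critical"
        if value < warning:
            return "warning"
        return "ok"
    # Higher is worse: alert when the value reaches a threshold.
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


print(classify(99.3, warning=95.0, critical=90.0))  # cache hit ratio example
print(classify(85, warning=80, critical=95))        # connection count example
```

Whether a value exactly on a threshold counts as an alert is a design choice; the sketch treats "below 95%" strictly for lower-is-worse metrics and inclusively for higher-is-worse ones.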
capacity_planning
Analyze database growth trends and project future capacity needs.
Parameters:
- forecast_days (integer, required): Number of days to forecast
- include_table_growth (boolean, optional): Include per-table growth analysis
- include_index_growth (boolean, optional): Include per-index growth analysis
Returns:
- current_state: Current database size and top tables/indexes
- projections: Growth forecasts for specified period
- recommendations: Storage and capacity recommendations
Example:
result = capacity_planning(
    forecast_days=90,
    include_table_growth=True,
    include_index_growth=True
)
# Returns: {
#   "current_state": {
#     "total_size": {"bytes": 50000000000, "gb": 46.57, "pretty": "47 GB"},
#     "user_data_size_mb": 35000,
#     "index_size_mb": 8500,
#     "top_tables": [
#       {"table": "orders", "size_mb": 12500, "row_count": 5000000},
#       {"table": "users", "size_mb": 8900, "row_count": 2000000}
#     ]
#   },
#   "projections": {
#     "forecast_days": 90,
#     "estimated_daily_growth_mb": 450,
#     "estimated_total_growth_gb": 38.6,
#     "projected_total_size_gb": 85.2
#   },
#   "recommendations": {
#     "recommended_storage_gb": 127.8,
#     "buffer_percentage": 50,
#     "planning_horizon_days": 90
#   }
# }
Use Cases:
- Storage capacity planning
- Budget forecasting
- Growth trend analysis
- Infrastructure scaling decisions
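The size figures in the example output above appear to be related by simple arithmetic: the projected size is the current size plus the estimated growth, and the recommended storage adds the buffer percentage on top. The sketch below reproduces those two numbers. `project_storage` is a hypothetical helper, and the formula is inferred from the example values rather than taken from the tool's implementation.

```python
def project_storage(current_gb: float, growth_gb: float, buffer_percent: float) -> dict[str, float]:
    """Project total size and recommended storage, rounded to one decimal."""
    projected = current_gb + growth_gb
    # Headroom on top of the projection, e.g. 50 means provision 1.5x.
    recommended = projected * (1 + buffer_percent / 100)
    return {
        "projected_total_size_gb": round(projected, 1),
        "recommended_storage_gb": round(recommended, 1),
    }


# Values taken from the example output above.
plan = project_storage(current_gb=46.57, growth_gb=38.6, buffer_percent=50)
print(plan)
```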
resource_usage_analyze
Analyze CPU, memory, and I/O resource usage patterns.
Parameters:
- include_cpu (boolean, optional): Include CPU usage analysis
- include_memory (boolean, optional): Include memory usage analysis
- include_io (boolean, optional): Include I/O usage analysis
Returns:
- resource_analysis.memory: Buffer cache and shared memory stats
- resource_analysis.io: Disk I/O and cache hit ratios
- resource_analysis.cpu: Query execution time statistics
- recommendations: Resource optimization suggestions
Example:
result = resource_usage_analyze(
    include_cpu=True,
    include_memory=True,
    include_io=True
)
# Returns: {
#   "resource_analysis": {
#     "memory": {
#       "shared_buffers": "16384",
#       "buffer_cache_hit_ratio": 99.5,
#       "buffer_hits": 5234567,
#       "disk_reads": 25678
#     },
#     "io": {
#       "heap_blocks_from_disk": 12345,
#       "heap_blocks_from_cache": 987654,
#       "heap_hit_ratio": 98.8,
#       "index_blocks_from_disk": 3456,
#       "index_blocks_from_cache": 654321,
#       "index_hit_ratio": 99.5
#     },
#     "cpu": {
#       "total_execution_time_ms": 125678,
#       "total_calls": 456789,
#       "avg_query_time_ms": 0.275,
#       "max_query_time_ms": 1250.45
#     }
#   },
#   "recommendations": [
#     {
#       "category": "memory",
#       "priority": "info",
#       "recommendation": "Buffer cache hit ratio excellent at 99.5%"
#     }
#   ]
# }
Requirements: CPU analysis requires the pg_stat_statements extension.
Use Cases:
- Performance bottleneck identification
- Resource optimization
- Infrastructure right-sizing
- Cost optimization
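The ratios in the example output above follow from the raw counters: a hit ratio is cache hits divided by hits plus disk reads, and the average query time is total execution time divided by total calls. The sketch below recomputes them from the example numbers; `hit_ratio_percent` is a hypothetical helper, and the tool itself may round or compute these values differently.

```python
from typing import Optional


def hit_ratio_percent(hits: int, disk_reads: int) -> Optional[float]:
    """Percentage of block requests served from cache, or None with no traffic."""
    total = hits + disk_reads
    if total == 0:
        return None
    return round(100.0 * hits / total, 1)


# Counters copied from the example output above.
buffer_ratio = hit_ratio_percent(hits=5234567, disk_reads=25678)
heap_ratio = hit_ratio_percent(hits=987654, disk_reads=12345)
index_ratio = hit_ratio_percent(hits=654321, disk_reads=3456)
avg_query_time_ms = round(125678 / 456789, 3)

print(buffer_ratio, heap_ratio, index_ratio, avg_query_time_ms)
```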
replication_monitor
Monitor replication status, lag, and health for primary and replica databases.
Parameters:
- include_wal_status (boolean, optional): Include WAL sender/receiver status
- include_slots (boolean, optional): Include replication slot information
Returns:
- replication_status.is_replica: Whether this is a replica
- replication_status.role: Database role (primary/replica)
- replication_status.wal_senders: WAL sender connections (primary only)
- replication_status.replication_slots: Active replication slots
- lag_info: Replication lag statistics (replica only)
Example:
# On primary database
result = replication_monitor(
    include_wal_status=True,
    include_slots=True
)
# Returns: {
#   "replication_status": {
#     "is_replica": False,
#     "role": "primary",
#     "wal_senders": {
#       "count": 2,
#       "senders": [
#         {
#           "application_name": "replica1",
#           "client_addr": "10.0.1.5",
#           "state": "streaming",
#           "sync_state": "async",
#           "sent_lsn": "0/5A2F3C0",
#           "write_lsn": "0/5A2F3C0",
#           "flush_lsn": "0/5A2F3C0"
#         }
#       ]
#     },
#     "replication_slots": {
#       "total_count": 2,
#       "inactive_count": 0,
#       "slots": [...]
#     }
#   }
# }
# On replica database
result = replication_monitor(include_wal_status=True)
# Returns: {
#   "replication_status": {
#     "is_replica": True,
#     "role": "replica"
#   },
#   "lag_info": {
#     "receive_lsn": "0/5A2F3C0",
#     "replay_lsn": "0/5A2F380",
#     "lag_bytes": 64,
#     "lag_seconds": 0.05,
#     "is_replaying": True
#   }
# }
Use Cases:
- High availability monitoring
- Replication health checks
- Lag detection and alerting
- Disaster recovery readiness
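The lag_bytes value in the replica example above is the distance between two WAL positions. A PostgreSQL LSN is written as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit position, so the difference can be computed directly. The sketch below shows that arithmetic; `lsn_to_int` and `lag_bytes` are hypothetical helper names.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as "0/5A2F3C0" to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(receive_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL received by the replica but not yet replayed."""
    return lsn_to_int(receive_lsn) - lsn_to_int(replay_lsn)


# LSNs copied from the replica example above.
print(lag_bytes("0/5A2F3C0", "0/5A2F380"))
```

On the server itself, the built-in `pg_wal_lsn_diff` function performs the same subtraction in SQL.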
# 1. Real-time metrics
metrics = monitor_real_time(
    include_queries=True,
    include_locks=True,
    include_io=True
)
# 2. Check thresholds
cache_status = alert_threshold_set(
    metric_type="cache_hit_ratio",
    warning_threshold=95,
    critical_threshold=90,
    check_current=True
)
conn_status = alert_threshold_set(
    metric_type="connection_count",
    warning_threshold=80,
    critical_threshold=95,
    check_current=True
)
# 3. Resource analysis
resources = resource_usage_analyze(
    include_cpu=True,
    include_memory=True,
    include_io=True
)

# 1. Analyze growth
capacity = capacity_planning(
    forecast_days=90,
    include_table_growth=True,
    include_index_growth=True
)
# 2. Current resource usage
resources = resource_usage_analyze(
    include_cpu=True,
    include_memory=True,
    include_io=True
)
# 3. Project needs
# Use capacity["recommendations"]["recommended_storage_gb"]
# Use capacity["projections"]["projected_total_size_gb"]

# 1. Check replication status
repl_status = replication_monitor(
    include_wal_status=True,
    include_slots=True
)
# 2. Monitor lag
if repl_status["replication_status"]["is_replica"]:
    lag_alert = alert_threshold_set(
        metric_type="replication_lag",
        warning_threshold=5,
        critical_threshold=30,
        check_current=True
    )

- Run monitor_real_time() every 1-5 minutes
- Check alert_threshold_set() for key metrics
- Review resource_usage_analyze() daily
- Run capacity_planning() monthly
- Track growth trends over time
- Plan upgrades 3 months in advance
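The cadences above can be expressed as a small schedule table. The following is a sketch only: `SCHEDULE` and `due_tasks` are hypothetical names, "monthly" is approximated as 30 days, and the 5-minute interval is the upper end of the suggested 1-5 minute range. A real deployment would more likely rely on cron or an existing scheduler.

```python
from datetime import datetime, timedelta

# Intervals taken from the list above (assumptions noted in the text).
SCHEDULE = {
    "monitor_real_time": timedelta(minutes=5),
    "resource_usage_analyze": timedelta(days=1),
    "capacity_planning": timedelta(days=30),
}


def due_tasks(last_run: dict[str, datetime], now: datetime) -> list[str]:
    """Return the tasks whose interval has elapsed, or that have never run."""
    due = []
    for task, interval in SCHEDULE.items():
        previous = last_run.get(task)
        if previous is None or now - previous >= interval:
            due.append(task)
    return due


now = datetime(2025, 10, 3, 15, 30)
last_run = {
    "monitor_real_time": now - timedelta(minutes=6),
    "resource_usage_analyze": now - timedelta(hours=2),
}
pending = due_tasks(last_run, now)
print(pending)
```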
Recommended Thresholds:
# Cache hit ratio
cache_warning = 95.0    # Below 95%: investigate
cache_critical = 90.0   # Below 90%: urgent action
# Connections
conn_warning = 80       # 80% of max_connections
conn_critical = 95      # 95% of max_connections
# Transaction age
txn_warning = 300       # 5 minutes
txn_critical = 1800     # 30 minutes
# Replication lag
lag_warning = 5         # 5 seconds
lag_critical = 30       # 30 seconds

- Monitor lag every 30 seconds
- Alert on WAL sender disconnections
- Watch for inactive replication slots
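The last two checks can be run against a replication_monitor result from the primary. The sketch below assumes the result has the shape shown in the primary example earlier; `replication_issues` and its `expected_senders` parameter are hypothetical, and treating any sender state other than "streaming" as a problem is a simplification.

```python
from typing import Any


def replication_issues(result: dict[str, Any], expected_senders: int = 0) -> list[str]:
    """List human-readable problems found in a primary's replication status."""
    status = result["replication_status"]
    issues = []

    wal_senders = status.get("wal_senders", {})
    count = wal_senders.get("count", 0)
    if count < expected_senders:
        issues.append(f"expected {expected_senders} WAL senders, found {count}")
    for sender in wal_senders.get("senders", []):
        if sender.get("state") != "streaming":
            name = sender.get("application_name", "unknown")
            issues.append(f"WAL sender {name} is in state {sender.get('state')!r}")

    inactive = status.get("replication_slots", {}).get("inactive_count", 0)
    if inactive:
        issues.append(f"{inactive} inactive replication slot(s)")
    return issues


# Illustrative degraded primary: one sender missing, one catching up, one idle slot.
degraded = {
    "replication_status": {
        "is_replica": False,
        "role": "primary",
        "wal_senders": {
            "count": 1,
            "senders": [{"application_name": "replica1", "state": "catchup"}],
        },
        "replication_slots": {"total_count": 2, "inactive_count": 1},
    }
}

problems = replication_issues(degraded, expected_senders=2)
print(problems)
```

Inactive slots deserve attention because a slot that no consumer is reading from causes the primary to retain WAL, which can eventually fill the disk.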
# 1. Identify connections
metrics = monitor_real_time(include_queries=True)
# 2. Analyze queries
from Core import get_top_queries
slow_queries = get_top_queries(sort_by="calls", limit=20)
# 3. Check for connection leaks
# Review application connection pooling

# 1. Analyze resource usage
resources = resource_usage_analyze(include_memory=True, include_io=True)
# 2. Check buffer cache size
# Consider increasing shared_buffers in postgresql.conf
# 3. Review query patterns
# Identify queries causing excessive disk I/O

# 1. Monitor replication
repl = replication_monitor(include_wal_status=True)
# 2. Check network connectivity
# Verify network bandwidth between primary and replica
# 3. Analyze replica load
# Check if replica is under heavy read load

- Core Database Tools - Basic health monitoring
- Performance Intelligence - Query optimization
- Backup & Recovery - Backup strategies
- Security Best Practices - Secure monitoring
See Home for more tool categories.