Monitoring Guide

This guide covers monitoring Guardian Agent in production environments.

Metrics

Guardian Agent exposes Prometheus-compatible metrics at the /metrics endpoint.

Available Metrics

Counter Metrics

guardian_validations_total - Total number of action validations
guardian_validations_allowed - Number of allowed validations
guardian_validations_denied - Number of denied validations
guardian_cache_hits - Number of cache hits
guardian_cache_misses - Number of cache misses
guardian_errors_total - Total number of errors
guardian_llm_reasoning_total - Number of LLM reasoning calls

Error Metrics by Type

guardian_errors_by_type{type="..."} - Error count by error type
- Types: PolicyValidation, OpaError, JwtError, IoError, SerializationError, etc.

Latency Metrics

guardian_avg_latency_ms - Average latency in milliseconds
guardian_latency_histogram{le="..."} - Latency histogram buckets
- Buckets: 1ms, 5ms, 10ms, 50ms, 100ms, 500ms, 1s, 5s
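
To make the histogram buckets concrete, the following sketch estimates a percentile from cumulative bucket counts using linear interpolation inside the matching bucket, which is the general approach Prometheus takes in histogram_quantile. The bucket counts are made-up sample data, and the assumption that bucket bounds are expressed in milliseconds is not confirmed by this guide.

```python
import math


def quantile_from_buckets(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound_ms, cumulative_count) pairs sorted by
    bound and ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the interval (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is the highest finite bound.
                return lower_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for the buckets listed above (bounds in ms).
SAMPLE_BUCKETS = [
    (1, 400), (5, 700), (10, 850), (50, 940), (100, 990),
    (500, 998), (1000, 1000), (5000, 1000), (math.inf, 1000),
]

if __name__ == "__main__":
    for q in (0.5, 0.95, 0.99):
        print(f"P{int(q * 100)} is about {quantile_from_buckets(q, SAMPLE_BUCKETS):.1f} ms")
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket bounds bracket the real latencies. With the sample data, the P95 estimate lands between the 50ms and 100ms bounds.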

Accessing Metrics

# Direct access
curl http://localhost:8080/metrics

# Kubernetes port forward
kubectl port-forward svc/guardian-agent 8080:8080
curl http://localhost:8080/metrics
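
The text returned by /metrics can also be processed in a script. The sketch below parses unlabelled samples from the Prometheus text exposition format and derives a cache hit ratio. The sample payload is illustrative: the HELP and TYPE lines and the values are assumptions, not output captured from Guardian Agent.

```python
SAMPLE_METRICS = """\
# HELP guardian_validations_total Total number of action validations
# TYPE guardian_validations_total counter
guardian_validations_total 1000
guardian_validations_allowed 950
guardian_validations_denied 50
guardian_cache_hits 800
guardian_cache_misses 200
guardian_errors_total 5
guardian_errors_by_type{type="OpaError"} 3
"""


def parse_metrics(text):
    """Return {metric_name: value} for unlabelled samples only."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and HELP/TYPE metadata.
        if not line or line.startswith("#"):
            continue
        # Labelled series such as guardian_errors_by_type{...} are skipped
        # to keep this parser small; label values may contain spaces.
        if "{" in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        samples[parts[0]] = float(parts[1])
    return samples


def cache_hit_ratio(samples):
    """Hits divided by total lookups, or None when there were no lookups."""
    hits = samples.get("guardian_cache_hits", 0.0)
    misses = samples.get("guardian_cache_misses", 0.0)
    lookups = hits + misses
    return hits / lookups if lookups else None


if __name__ == "__main__":
    samples = parse_metrics(SAMPLE_METRICS)
    print("cache hit ratio:", cache_hit_ratio(samples))
```

For anything beyond a quick check, a maintained parser is a safer choice than this hand-rolled one, since the exposition format also allows timestamps and escaped label values.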

Prometheus Integration

ServiceMonitor (Prometheus Operator)

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: guardian-agent
  labels:
    app: guardian-agent
spec:
  selector:
    matchLabels:
      app: guardian-agent
  endpoints:
  - port: http
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s

Prometheus Configuration

scrape_configs:
  - job_name: 'guardian-agent'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: guardian-agent

Grafana Dashboards

Key Metrics to Monitor

Validation Rate
- Validations per second
- Allow/deny ratio
- Cache hit rate

Latency
- P50, P95, P99 latency
- Average latency
- Latency distribution

Errors
- Error rate
- Error types
- Error trends

LLM Reasoning
- Reasoning calls per second
- Reasoning latency

Sample Grafana Dashboard

{
  "dashboard": {
    "title": "Guardian Agent",
    "panels": [
      {
        "title": "Validation Rate",
        "targets": [
          {
            "expr": "rate(guardian_validations_total[5m])"
          }
        ]
      },
      {
        "title": "Latency (P95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(guardian_latency_histogram[5m])))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(guardian_errors_total[5m])"
          }
        ]
      }
    ]
  }
}
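
The sample dashboard covers three of the indicators listed under Key Metrics to Monitor. As a rough sketch, the script below generates the same abbreviated JSON shape for several of the remaining ones. The PromQL expressions are assumptions derived from the metric names in this guide, treating each as a counter. A dashboard that Grafana can import needs further fields, such as panel types and layout, that are omitted here.

```python
import json

# Panel title -> PromQL expression. These expressions are illustrative
# assumptions based on the metric names documented above.
PANELS = {
    "Validations per second": "rate(guardian_validations_total[5m])",
    "Deny ratio": (
        "rate(guardian_validations_denied[5m])"
        " / rate(guardian_validations_total[5m])"
    ),
    "Cache hit rate": (
        "rate(guardian_cache_hits[5m])"
        " / (rate(guardian_cache_hits[5m]) + rate(guardian_cache_misses[5m]))"
    ),
    "LLM reasoning calls per second": "rate(guardian_llm_reasoning_total[5m])",
    "Errors by type": "sum by (type) (rate(guardian_errors_by_type[5m]))",
}


def build_dashboard(title, panels):
    """Build a dashboard dict in the same abbreviated shape as the sample."""
    return {
        "dashboard": {
            "title": title,
            "panels": [
                {"title": panel_title, "targets": [{"expr": expr}]}
                for panel_title, expr in panels.items()
            ],
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_dashboard("Guardian Agent", PANELS), indent=2))
```

Keeping the expressions in one mapping makes it easier to review them against the metric list when a metric is renamed.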

Alerting Rules

Prometheus Alerting Rules

groups:
  - name: guardian_agent
    rules:
      - alert: GuardianAgentHighErrorRate
        expr: rate(guardian_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Guardian Agent high error rate"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: GuardianAgentHighLatency
        expr: guardian_avg_latency_ms > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Guardian Agent high latency"
          description: "Average latency is {{ $value }}ms"

      - alert: GuardianAgentDown
        expr: up{job="guardian-agent"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Guardian Agent is down"
          description: "Guardian Agent has been down for more than 1 minute"
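
To show what the error-rate threshold means in absolute terms, the sketch below computes a simplified per-second rate from two counter readings taken five minutes apart. This is only an approximation of what rate() does: Prometheus works from all samples in the window and extrapolates to the window boundaries. The counter values are made up for illustration.

```python
ERROR_RATE_THRESHOLD = 0.1   # errors per second, from GuardianAgentHighErrorRate
WINDOW_SECONDS = 5 * 60      # matches the [5m] range in the rule


def per_second_rate(first, last, window_seconds):
    """Simplified counter rate: increase divided by elapsed seconds."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    increase = last - first
    if increase < 0:
        # The counter was reset (for example by a restart), so count only
        # what accumulated since the reset.
        increase = last
    return increase / window_seconds


if __name__ == "__main__":
    # Hypothetical readings of guardian_errors_total five minutes apart.
    rate = per_second_rate(first=5, last=41, window_seconds=WINDOW_SECONDS)
    print(f"rate={rate:.3f} errors/sec, above threshold: {rate > ERROR_RATE_THRESHOLD}")
```

At 0.1 errors per second, the threshold corresponds to more than 30 errors in a five-minute window. The rule's `for: 5m` clause additionally requires the condition to hold for five consecutive minutes before the alert fires.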

Health Checks

HTTP Health Endpoint

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "device_id": "abc123",
  "uptime_seconds": 3600.0,
  "metrics": {
    "validations_total": 1000,
    "validations_allowed": 950,
    "validations_denied": 50,
    "cache_hits": 800,
    "cache_misses": 200,
    "errors_total": 5,
    "avg_latency_ms": 2.5
  }
}
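
The health response can be consumed by a script as well. The following sketch fetches the endpoint and derives a few ratios from the fields shown in the response above. The function names are illustrative, and the code assumes the response always has the shape shown here.

```python
import json
from urllib.request import urlopen


def fetch_health(url="http://localhost:8080/health", timeout=5.0):
    """Fetch and decode the health endpoint's JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def ratio(part, whole):
    """Return part / whole, or None when whole is zero."""
    return part / whole if whole else None


def summarize_health(payload):
    """Derive status and ratios from a decoded /health payload."""
    metrics = payload.get("metrics", {})
    total = metrics.get("validations_total", 0)
    lookups = metrics.get("cache_hits", 0) + metrics.get("cache_misses", 0)
    return {
        "healthy": payload.get("status") == "healthy",
        "deny_ratio": ratio(metrics.get("validations_denied", 0), total),
        "error_ratio": ratio(metrics.get("errors_total", 0), total),
        "cache_hit_ratio": ratio(metrics.get("cache_hits", 0), lookups),
    }


# The example response from this guide, used so the sketch runs offline.
SAMPLE_HEALTH = {
    "status": "healthy",
    "device_id": "abc123",
    "uptime_seconds": 3600.0,
    "metrics": {
        "validations_total": 1000,
        "validations_allowed": 950,
        "validations_denied": 50,
        "cache_hits": 800,
        "cache_misses": 200,
        "errors_total": 5,
        "avg_latency_ms": 2.5,
    },
}

if __name__ == "__main__":
    print(summarize_health(SAMPLE_HEALTH))
```

Against a running instance, `summarize_health(fetch_health())` would produce the same summary from live data.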

Kubernetes Probes

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Log Aggregation

Structured Logging

Guardian Agent uses structured JSON logging. Logs include:

- Request IDs
- Correlation IDs
- Timestamps
- Event types
- Action details
- Verdicts

Log Format

{
  "timestamp": "2024-01-01T12:00:00Z",
  "event_type": "action_validated",
  "request_id": "abc-123",
  "correlation_id": "xyz-789",
  "action": {
    "type": "file_write",
    "resource": "/tmp/test.txt"
  },
  "verdict": {
    "allowed": true,
    "reason": "Policy evaluation completed"
  }
}
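
Because each log entry is a single JSON object, the log file can be summarized with a few lines of code. The sketch below counts allowed and denied verdicts in a sequence of JSON lines. It assumes one event per line with the field names shown above. The sample lines are invented, including the `file_delete` action type and the denial reason.

```python
import json
from collections import Counter


def count_verdicts(lines):
    """Count allowed/denied verdicts for action_validated events."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            counts["unparseable"] += 1
            continue
        # Ignore anything that is not a validation event.
        if not isinstance(event, dict) or event.get("event_type") != "action_validated":
            continue
        verdict = event.get("verdict") or {}
        counts["allowed" if verdict.get("allowed") else "denied"] += 1
    return counts


# Hypothetical log lines in the format shown above.
SAMPLE_LINES = [
    '{"timestamp": "2024-01-01T12:00:00Z", "event_type": "action_validated", '
    '"request_id": "abc-123", "action": {"type": "file_write"}, '
    '"verdict": {"allowed": true, "reason": "Policy evaluation completed"}}',
    '{"timestamp": "2024-01-01T12:00:01Z", "event_type": "action_validated", '
    '"request_id": "abc-124", "action": {"type": "file_delete"}, '
    '"verdict": {"allowed": false, "reason": "Denied by policy"}}',
    "not json",
]

if __name__ == "__main__":
    print(dict(count_verdicts(SAMPLE_LINES)))
```

To run it against a real file, pass an open file handle, since iterating over a file yields its lines.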

Log Collection

Fluentd/Fluent Bit

<source>
  @type tail
  path /var/lib/guardian/*.jsonl
  pos_file /var/log/fluentd-guardian.pos
  tag guardian
  format json
</source>

Filebeat

filebeat.inputs:
  - type: log
    paths:
      - /var/lib/guardian/*.jsonl
    json.keys_under_root: true
    json.add_error_key: true

Performance Monitoring

Key Performance Indicators (KPIs)

- Throughput: validations per second
- Latency: P50, P95, P99 response times
- Error Rate: errors per second
- Cache Efficiency: cache hit ratio
- Resource Usage: CPU, memory, disk I/O

Monitoring Commands

# Check current metrics
curl http://localhost:8080/metrics | grep guardian_

# Check health
curl http://localhost:8080/health | jq

# Monitor logs
tail -f /var/lib/guardian/log.jsonl | jq

# Check resource usage (Kubernetes)
kubectl top pod -l app.kubernetes.io/name=guardian-agent
