Monitoring Guide

This guide covers monitoring Guardian Agent in production environments.

Metrics

Guardian Agent exposes Prometheus-compatible metrics at the /metrics endpoint.

Available Metrics

Counter Metrics

  • guardian_validations_total - Total number of action validations
  • guardian_validations_allowed - Number of allowed validations
  • guardian_validations_denied - Number of denied validations
  • guardian_cache_hits - Number of cache hits
  • guardian_cache_misses - Number of cache misses
  • guardian_errors_total - Total number of errors
  • guardian_llm_reasoning_total - Number of LLM reasoning calls

Error Metrics by Type

  • guardian_errors_by_type{type="..."} - Error count by error type
    • Types: PolicyValidation, OpaError, JwtError, IoError, SerializationError, etc.

Latency Metrics

  • guardian_avg_latency_ms - Average latency in milliseconds
  • guardian_latency_histogram{le="..."} - Latency histogram buckets
    • Buckets: 1ms, 5ms, 10ms, 50ms, 100ms, 500ms, 1s, 5s

Accessing Metrics

# Direct access
curl http://localhost:8080/metrics

# Kubernetes port forward
kubectl port-forward svc/guardian-agent 8080:8080
curl http://localhost:8080/metrics
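
The counters listed above can be combined into derived figures such as the cache hit rate. The following is a minimal sketch that parses Prometheus text exposition output and computes that ratio. The sample text is hypothetical (the exact output of Guardian Agent is not shown in this guide), and the helper names `parse_metrics` and `cache_hit_rate` are illustrative only.

```python
from typing import Dict

# Hypothetical /metrics output; real output may include more series and comments.
SAMPLE_METRICS = """\
# TYPE guardian_cache_hits counter
guardian_cache_hits 800
guardian_cache_misses 200
guardian_validations_total 1000
guardian_errors_by_type{type="OpaError"} 3
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Collect unlabelled samples from Prometheus text exposition format."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        name, _, rest = line.partition(" ")
        # Labelled series (error types, histogram buckets) are ignored here.
        if "{" in name or not rest:
            continue
        # The value is the first field; an optional timestamp may follow.
        samples[name] = float(rest.split()[0])
    return samples


def cache_hit_rate(samples: Dict[str, float]) -> float:
    """Return hits / (hits + misses), or 0.0 when there is no cache traffic."""
    hits = samples.get("guardian_cache_hits", 0.0)
    misses = samples.get("guardian_cache_misses", 0.0)
    total = hits + misses
    return hits / total if total else 0.0


if __name__ == "__main__":
    print(f"cache hit rate: {cache_hit_rate(parse_metrics(SAMPLE_METRICS)):.2%}")
```

In practice the text would come from the `/metrics` endpoint rather than an embedded string. For anything beyond a quick check, a PromQL expression in Prometheus is usually the better place for this calculation.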

Prometheus Integration

ServiceMonitor (Prometheus Operator)

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: guardian-agent
  labels:
    app: guardian-agent
spec:
  selector:
    matchLabels:
      app: guardian-agent
  endpoints:
  - port: http
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s

Prometheus Configuration

scrape_configs:
  - job_name: 'guardian-agent'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: guardian-agent

Grafana Dashboards

Key Metrics to Monitor

  1. Validation Rate
    • Validations per second
    • Allow/deny ratio
    • Cache hit rate
  2. Latency
    • P50, P95, P99 latency
    • Average latency
    • Latency distribution
  3. Errors
    • Error rate
    • Error types
    • Error trends
  4. LLM Reasoning
    • Reasoning calls per second
    • Reasoning latency

Sample Grafana Dashboard

{
  "dashboard": {
    "title": "Guardian Agent",
    "panels": [
      {
        "title": "Validation Rate",
        "targets": [
          {
            "expr": "rate(guardian_validations_total[5m])"
          }
        ]
      },
      {
        "title": "Latency (P95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(guardian_latency_histogram[5m])))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(guardian_errors_total[5m])"
          }
        ]
      }
    ]
  }
}
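
To make the P95 panel easier to interpret, the sketch below estimates a quantile from cumulative histogram buckets using linear interpolation inside the matching bucket, which is the approach `histogram_quantile` documents. It assumes `guardian_latency_histogram` exposes cumulative counts per `le` bound; the bucket counts in `SAMPLE_BUCKETS` are made up for illustration, and `estimate_quantile` is a hypothetical helper, not part of Guardian Agent.

```python
from typing import List, Tuple

INF = float("inf")

# (upper bound in ms, cumulative count); bounds follow the bucket list in this
# guide, counts are invented for the example.
SAMPLE_BUCKETS: List[Tuple[float, float]] = [
    (1, 400), (5, 900), (10, 960), (50, 990), (100, 998),
    (500, 1000), (1000, 1000), (5000, 1000), (INF, 1000),
]


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets."""
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if bound == INF:
                # Falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    for q in (0.5, 0.95, 0.99):
        print(f"P{int(q * 100)} ~ {estimate_quantile(q, SAMPLE_BUCKETS):.2f} ms")
```

Because the estimate interpolates within a bucket, its accuracy depends on the bucket boundaries: with bounds at 10ms and 50ms, any quantile that lands between them is only known to that resolution.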

Alerting Rules

Prometheus Alerting Rules

groups:
  - name: guardian_agent
    rules:
      - alert: GuardianAgentHighErrorRate
        expr: rate(guardian_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Guardian Agent high error rate"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: GuardianAgentHighLatency
        expr: guardian_avg_latency_ms > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Guardian Agent high latency"
          description: "Average latency is {{ $value }}ms"

      - alert: GuardianAgentDown
        expr: up{job="guardian-agent"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Guardian Agent is down"
          description: "Guardian Agent has been down for more than 1 minute"
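
As a rough illustration of what the `GuardianAgentHighErrorRate` expression measures, the sketch below computes a per-second increase from a series of counter samples and compares it with the 0.1 threshold. This is a simplification: Prometheus's `rate()` also extrapolates to the window boundaries, which is omitted here. The sample values and the `simple_rate` helper are hypothetical.

```python
from typing import List, Tuple

# (unix timestamp in seconds, guardian_errors_total value); invented samples
# that include one counter reset (a process restart) at t=120.
SAMPLES: List[Tuple[float, float]] = [
    (0, 100), (60, 103), (120, 2), (180, 8), (240, 20), (300, 35),
]

THRESHOLD = 0.1  # errors per second, as in the alert rule


def simple_rate(samples: List[Tuple[float, float]]) -> float:
    """Per-second increase of a counter over the sampled window."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # Counter reset: the counter restarted from zero, so the whole
            # current value counts as increase.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    rate = simple_rate(SAMPLES)
    print(f"error rate: {rate:.3f}/s, above threshold: {rate > THRESHOLD}")
```

The alert fires only when the condition stays true for the whole `for: 5m` period, so a single window above the threshold is not enough on its own.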

Health Checks

HTTP Health Endpoint

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "device_id": "abc123",
  "uptime_seconds": 3600.0,
  "metrics": {
    "validations_total": 1000,
    "validations_allowed": 950,
    "validations_denied": 50,
    "cache_hits": 800,
    "cache_misses": 200,
    "errors_total": 5,
    "avg_latency_ms": 2.5
  }
}
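
A script or external checker can use this response for more than an up/down signal. The sketch below is one possible approach, assuming the response keeps the shape shown above: it flags a non-healthy status and an error ratio above a limit. The 1% limit and the `health_problems` helper are illustrative assumptions, not Guardian Agent defaults.

```python
import json
from typing import List

# Body copied from the sample response above.
SAMPLE_BODY = """{
  "status": "healthy",
  "device_id": "abc123",
  "uptime_seconds": 3600.0,
  "metrics": {
    "validations_total": 1000,
    "validations_allowed": 950,
    "validations_denied": 50,
    "cache_hits": 800,
    "cache_misses": 200,
    "errors_total": 5,
    "avg_latency_ms": 2.5
  }
}"""


def health_problems(body: str, max_error_ratio: float = 0.01) -> List[str]:
    """Return a list of human-readable problems; empty means all checks pass."""
    data = json.loads(body)
    problems: List[str] = []
    status = data.get("status")
    if status != "healthy":
        problems.append(f"status is {status!r}")
    metrics = data.get("metrics", {})
    total = metrics.get("validations_total", 0)
    errors = metrics.get("errors_total", 0)
    # Only judge the error ratio once at least one validation has run.
    if total and errors / total > max_error_ratio:
        problems.append(f"error ratio {errors / total:.2%} exceeds {max_error_ratio:.2%}")
    return problems


if __name__ == "__main__":
    print(health_problems(SAMPLE_BODY) or "no problems found")
```

The counters in the health response are cumulative since startup, so a ratio computed this way reflects the whole uptime rather than recent behaviour.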

Kubernetes Probes

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Log Aggregation

Structured Logging

Guardian Agent uses structured JSON logging. Logs include:

  • Request IDs
  • Correlation IDs
  • Timestamps
  • Event types
  • Action details
  • Verdicts

Log Format

{
  "timestamp": "2024-01-01T12:00:00Z",
  "event_type": "action_validated",
  "request_id": "abc-123",
  "correlation_id": "xyz-789",
  "action": {
    "type": "file_write",
    "resource": "/tmp/test.txt"
  },
  "verdict": {
    "allowed": true,
    "reason": "Policy evaluation completed"
  }
}
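
Since each log entry is a single JSON object, ad hoc analysis needs little more than a JSON parser. The sketch below counts verdicts per action type, assuming one JSON object per line with the fields shown above. The second sample event and the `count_verdicts` helper are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, Tuple

# Two invented JSONL lines in the format shown above, plus one malformed line.
SAMPLE_LINES = [
    '{"timestamp": "2024-01-01T12:00:00Z", "event_type": "action_validated", '
    '"request_id": "abc-123", "correlation_id": "xyz-789", '
    '"action": {"type": "file_write", "resource": "/tmp/test.txt"}, '
    '"verdict": {"allowed": true, "reason": "Policy evaluation completed"}}',
    '{"timestamp": "2024-01-01T12:00:01Z", "event_type": "action_validated", '
    '"request_id": "abc-124", "correlation_id": "xyz-789", '
    '"action": {"type": "file_write", "resource": "/etc/passwd"}, '
    '"verdict": {"allowed": false, "reason": "Policy evaluation completed"}}',
    "not json",
]


def count_verdicts(lines: Iterable[str]) -> "Counter[Tuple[str, str]]":
    """Count (action type, verdict) pairs for action_validated events."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or malformed lines
        if not isinstance(event, dict) or event.get("event_type") != "action_validated":
            continue
        action_type = event.get("action", {}).get("type", "unknown")
        allowed = event.get("verdict", {}).get("allowed")
        counts[(action_type, "allowed" if allowed else "denied")] += 1
    return counts


if __name__ == "__main__":
    for (action_type, verdict), n in sorted(count_verdicts(SAMPLE_LINES).items()):
        print(f"{action_type:12s} {verdict:8s} {n}")
```

To run this against a real log, pass an open file object instead of `SAMPLE_LINES`; the function reads it line by line.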

Log Collection

Fluentd/Fluent Bit

<source>
  @type tail
  path /var/lib/guardian/*.jsonl
  pos_file /var/log/fluentd-guardian.pos
  tag guardian
  <parse>
    @type json
  </parse>
</source>

Filebeat

filebeat.inputs:
  - type: log
    paths:
      - /var/lib/guardian/*.jsonl
    json.keys_under_root: true
    json.add_error_key: true

Performance Monitoring

Key Performance Indicators (KPIs)

  1. Throughput: Validations per second
  2. Latency: P50, P95, P99 response times
  3. Error Rate: Errors per second
  4. Cache Efficiency: Cache hit ratio
  5. Resource Usage: CPU, memory, disk I/O

Monitoring Commands

# Check current metrics
curl -s http://localhost:8080/metrics | grep guardian_

# Check health
curl -s http://localhost:8080/health | jq

# Monitor logs
tail -f /var/lib/guardian/log.jsonl | jq

# Check resource usage (Kubernetes)
kubectl top pod -l app.kubernetes.io/name=guardian-agent

Troubleshooting

High Latency

  1. Check the cache hit rate. A low hit rate may indicate complex policies that cache poorly
  2. Check OPA server latency if using OPA server mode
  3. Check LLM reasoning latency if using LLM reasoner
  4. Review resource limits

High Error Rate

  1. Check error types in metrics
  2. Review logs for error details
  3. Check OPA connectivity if using OPA server
  4. Verify configuration is valid

Low Cache Hit Rate

  1. Review policy complexity
  2. Check cache configuration (TTL, max size)
  3. Consider increasing cache size if memory allows

Next Steps