This guide covers monitoring Guardian Agent in production environments.
Guardian Agent exposes Prometheus-compatible metrics at the /metrics endpoint.
- guardian_validations_total - Total number of action validations
- guardian_validations_allowed - Number of allowed validations
- guardian_validations_denied - Number of denied validations
- guardian_cache_hits - Number of cache hits
- guardian_cache_misses - Number of cache misses
- guardian_errors_total - Total number of errors
- guardian_llm_reasoning_total - Number of LLM reasoning calls
- guardian_errors_by_type{type="..."} - Error count by error type
  - Types: PolicyValidation, OpaError, JwtError, IoError, SerializationError, etc.
- guardian_avg_latency_ms - Average latency in milliseconds
- guardian_latency_histogram{le="..."} - Latency histogram buckets
  - Buckets: 1ms, 5ms, 10ms, 50ms, 100ms, 500ms, 1s, 5s
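To make the metric list above concrete, the following is a minimal Python sketch that parses Prometheus text-format output and groups guardian_errors_by_type by its type label. The sample text is hypothetical (the source does not show actual /metrics output), and the parser is deliberately simplified: it ignores timestamps and escaped characters in label values.

```python
import re

# Matches `name value` or `name{labels} value` sample lines.
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)')
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def errors_by_type(samples):
    """Map each error type label to its count."""
    return {
        labels["type"]: value
        for name, labels, value in samples
        if name == "guardian_errors_by_type" and "type" in labels
    }


# Hypothetical scrape output, for illustration only.
SAMPLE = """\
# TYPE guardian_errors_total counter
guardian_errors_total 5
guardian_errors_by_type{type="OpaError"} 3
guardian_errors_by_type{type="JwtError"} 2
"""

print(errors_by_type(parse_metrics(SAMPLE)))
# {'OpaError': 3.0, 'JwtError': 2.0}
```

For production use, a maintained parser such as the one shipped with the official Prometheus Python client is likely a better choice than a hand-written regex.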
# Direct access
curl http://localhost:8080/metrics
# Kubernetes port forward
kubectl port-forward svc/guardian-agent 8080:8080
curl http://localhost:8080/metrics

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: guardian-agent
  labels:
    app: guardian-agent
spec:
  selector:
    matchLabels:
      app: guardian-agent
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s

scrape_configs:
  - job_name: 'guardian-agent'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: guardian-agent
- Validation Rate
  - Validations per second
  - Allow/deny ratio
  - Cache hit rate
- Latency
  - P50, P95, P99 latency
  - Average latency
  - Latency distribution
- Errors
  - Error rate
  - Error types
  - Error trends
- LLM Reasoning
  - Reasoning calls per second
  - Reasoning latency
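The latency percentiles listed above are estimated from histogram buckets rather than measured exactly. The Python sketch below shows the usual approach (linear interpolation inside the bucket that contains the target rank), which is similar in spirit to what PromQL's histogram_quantile does. It assumes the buckets are cumulative, as is the Prometheus convention; the source does not confirm this for Guardian Agent, and the counts are hypothetical.

```python
def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound_ms, count) pairs.

    `buckets` must be sorted by upper bound, and each count must include
    all observations at or below that bound.
    """
    if not buckets:
        raise ValueError("at least one bucket is required")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return None  # no observations recorded yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]


# Bucket bounds from this guide (in ms); the counts are made up.
BUCKETS = [
    (1, 200), (5, 700), (10, 900), (50, 980),
    (100, 1000), (500, 1000), (1000, 1000), (5000, 1000),
]

p95 = estimate_quantile(0.95, BUCKETS)
print(f"P95 ~ {p95:.1f} ms")  # prints "P95 ~ 35.0 ms"
```

Because the result is interpolated, its accuracy is limited by bucket width: here the true P95 could lie anywhere between 10 ms and 50 ms.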
{
  "dashboard": {
    "title": "Guardian Agent",
    "panels": [
      {
        "title": "Validation Rate",
        "targets": [
          {
            "expr": "rate(guardian_validations_total[5m])"
          }
        ]
      },
      {
        "title": "Latency (P95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(guardian_latency_histogram[5m])))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(guardian_errors_total[5m])"
          }
        ]
      }
    ]
  }
}

groups:
  - name: guardian_agent
    rules:
      - alert: GuardianAgentHighErrorRate
        expr: rate(guardian_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Guardian Agent high error rate"
          description: "Error rate is {{ $value }} errors/sec"
      - alert: GuardianAgentHighLatency
        expr: guardian_avg_latency_ms > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Guardian Agent high latency"
          description: "Average latency is {{ $value }}ms"
      - alert: GuardianAgentDown
        expr: up{job="guardian-agent"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Guardian Agent is down"
          description: "Guardian Agent has been down for more than 1 minute"

curl http://localhost:8080/health

Response:
{
  "status": "healthy",
  "device_id": "abc123",
  "uptime_seconds": 3600.0,
  "metrics": {
    "validations_total": 1000,
    "validations_allowed": 950,
    "validations_denied": 50,
    "cache_hits": 800,
    "cache_misses": 200,
    "errors_total": 5,
    "avg_latency_ms": 2.5
  }
}

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Guardian Agent uses structured JSON logging. Logs include:
- Request IDs
- Correlation IDs
- Timestamps
- Event types
- Action details
- Verdicts
{
  "timestamp": "2024-01-01T12:00:00Z",
  "event_type": "action_validated",
  "request_id": "abc-123",
  "correlation_id": "xyz-789",
  "action": {
    "type": "file_write",
    "resource": "/tmp/test.txt"
  },
  "verdict": {
    "allowed": true,
    "reason": "Policy evaluation completed"
  }
}

<source>
  @type tail
  path /var/lib/guardian/*.jsonl
  pos_file /var/log/fluentd-guardian.pos
  tag guardian
  format json
</source>

filebeat.inputs:
  - type: log
    paths:
      - /var/lib/guardian/*.jsonl
    json.keys_under_root: true
    json.add_error_key: true

- Throughput: Validations per second
- Latency: P50, P95, P99 response times
- Error Rate: Errors per second
- Cache Efficiency: Cache hit ratio
- Resource Usage: CPU, memory, disk I/O
# Check current metrics
curl http://localhost:8080/metrics | grep guardian_
# Check health
curl http://localhost:8080/health | jq
# Monitor logs
tail -f /var/lib/guardian/log.jsonl | jq
# Check resource usage (Kubernetes)
kubectl top pod -l app.kubernetes.io/name=guardian-agent

- Check cache hit rate - low cache hits may indicate policy complexity
- Check OPA server latency if using OPA server mode
- Check LLM reasoning latency if using LLM reasoner
- Review resource limits
- Check error types in metrics
- Review logs for error details
- Check OPA connectivity if using OPA server
- Verify configuration is valid
- Review policy complexity
- Check cache configuration (TTL, max size)
- Consider increasing cache size if memory allows
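For the log review step in the checklists above, a short script can be quicker than reading raw JSONL. The Python sketch below tallies event types and collects denied actions. It relies only on the fields shown in the sample log entry earlier in this guide; the second and third sample lines are hypothetical, and real logs may contain other event types or fields.

```python
import json
from collections import Counter


def summarize_log(lines):
    """Return (event type counts, list of denied actions) for JSONL lines."""
    event_counts = Counter()
    denials = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            event_counts["<unparseable>"] += 1
            continue
        if not isinstance(entry, dict):
            event_counts["<unparseable>"] += 1
            continue
        event_counts[entry.get("event_type", "<unknown>")] += 1
        verdict = entry.get("verdict") or {}
        if verdict.get("allowed") is False:
            action = entry.get("action") or {}
            denials.append(
                (entry.get("request_id"), action.get("type"), verdict.get("reason"))
            )
    return event_counts, denials


# First line mirrors the sample entry in this guide; the rest are made up.
SAMPLE_LINES = [
    '{"event_type": "action_validated", "request_id": "abc-123", '
    '"action": {"type": "file_write", "resource": "/tmp/test.txt"}, '
    '"verdict": {"allowed": true, "reason": "Policy evaluation completed"}}',
    '{"event_type": "action_validated", "request_id": "def-456", '
    '"action": {"type": "file_write", "resource": "/etc/passwd"}, '
    '"verdict": {"allowed": false, "reason": "Path not permitted by policy"}}',
    "not json",
]

counts, denied = summarize_log(SAMPLE_LINES)
print(counts["action_validated"], len(denied))  # 2 1
```

Since an open file is an iterable of lines, the same function can be applied directly to a log file such as /var/lib/guardian/log.jsonl by passing the file object.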