From 9f4ce3ff76ca5dbbecdc6906f48117c895b5689d Mon Sep 17 00:00:00 2001
From: Max Malkin
Date: Tue, 3 Mar 2026 11:30:44 -0700
Subject: [PATCH 1/2] remove unneeded docs

---
 docs/capacity-planning.md | 283 -------------------------
 docs/runbook.md           | 434 --------------------------------------
 docs/threat-model.md      | 334 -----------------------------
 3 files changed, 1051 deletions(-)
 delete mode 100644 docs/capacity-planning.md
 delete mode 100644 docs/runbook.md
 delete mode 100644 docs/threat-model.md

diff --git a/docs/capacity-planning.md b/docs/capacity-planning.md
deleted file mode 100644
index 162f5fa..0000000
--- a/docs/capacity-planning.md
+++ /dev/null
@@ -1,283 +0,0 @@
-# AgentAuth Capacity Planning Guide
-
-This document provides guidance for sizing AgentAuth deployments and planning for growth.
-
-## Overview
-
-AgentAuth consists of three main services with different scaling characteristics:
-
-| Service | Scaling Model | Primary Constraint |
-|---------|---------------|-------------------|
-| Registry | Vertical + Horizontal | CPU (crypto operations) |
-| Verifier | Horizontal | Network I/O, Memory |
-| Audit Archiver | Single instance | Database I/O |
-
-## Current Baseline Metrics
-
-These metrics should be updated after each major release or significant traffic change.
-
-| Metric | Current Value | 12-Month Projection |
-|--------|---------------|---------------------|
-| Token verifications/second | - | - |
-| Token issuances/second | - | - |
-| Audit events/day | - | - |
-| Active agents | - | - |
-| Active service providers | - | - |
-
-## Resource Sizing Guidelines
-
-### Registry Service
-
-**Scaling triggers:**
-- CPU utilization > 60% sustained
-- Token issuance p99 > 50ms
-
-**Initial sizing:**
-```yaml
-replicas: 3
-resources:
-  requests:
-    cpu: 500m
-    memory: 256Mi
-  limits:
-    cpu: 2000m
-    memory: 1Gi
-```
-
-**Scaling formula:**
-- 1 registry replica per 500 token issuances/second
-- Add 1 replica for each 1000 concurrent grant approval sessions
-
-**Memory considerations:**
-- Base: ~100MB
-- Per connection pool: ~10MB per pool
-- Audit buffer: Up to 100MB when backpressured
-
-### Verifier Service
-
-**Scaling triggers:**
-- p99 latency > 5ms
-- Request rate > 1000 req/s per replica
-
-**Initial sizing:**
-```yaml
-replicas: 5
-resources:
-  requests:
-    cpu: 250m
-    memory: 128Mi
-  limits:
-    cpu: 1000m
-    memory: 512Mi
-```
-
-**Scaling formula:**
-- 1 verifier replica per 1000 token verifications/second sustained
-- Account for burst: 2x replicas for 2x peak-to-average ratio
-
-**Memory considerations:**
-- Base: ~50MB
-- Connection pools: ~20MB total
-- In-flight requests: ~1KB per request
-
-### Audit Archiver
-
-**Sizing:**
-```yaml
-replicas: 1 # Leader election, only one active
-resources:
-  requests:
-    cpu: 100m
-    memory: 128Mi
-  limits:
-    cpu: 500m
-    memory: 256Mi
-```
-
-**Constraints:**
-- Single instance with leader election
-- Must complete archival within maintenance window
-- Partition creation must run before month end
-
-## Database Sizing
-
-### PostgreSQL Primary
-
-**Initial sizing:**
-- vCPU: 8
-- RAM: 32GB
-- Storage: 500GB SSD (provisioned IOPS)
-
-**Scaling triggers:**
-- Write IOPS > 70% provisioned
-- Connection count > 80% max_connections
-- Storage > 70% capacity
-
-**Growth formula:**
-- Audit events: ~1KB per event average
-- 1M events/day = ~1GB/day = ~30GB/month (before compression)
-- After 90-day retention: ~90GB hot storage
-
-### PostgreSQL Read Replicas
-
-**Initial:** 2 replicas for high availability
-
-**Scaling triggers:**
-- Replica lag > 5 seconds sustained
-- Read query latency increasing
-
-**Sizing:** Match primary for CPU/RAM, can use smaller storage
-
-### Redis Cluster
-
-**Initial sizing:**
-- 3 primary nodes + 3 replicas
-- 8GB RAM per node
-- Total: 24GB usable (after replication)
-
-**Memory allocation:**
-| Store | Allocation | Eviction Policy |
-|-------|------------|-----------------|
-| Token cache | 40% | allkeys-lru |
-| Nonce store | 40% | noeviction |
-| Rate limiters | 20% | volatile-ttl |
-
-**Sizing formula:**
-- Token cache: ~500 bytes per cached token
-- 1M active tokens = ~500MB
-- Nonce store: ~100 bytes per nonce
-- 1M nonces = ~100MB
-
-**Critical threshold:** Nonce store at 70% triggers alert
-
-## Network Considerations
-
-### Bandwidth Requirements
-
-| Path | Estimate |
-|------|----------|
-| Verifier → Redis | 1KB per verification |
-| Verifier → PostgreSQL (fallback) | 2KB per verification |
-| Registry → PostgreSQL | 5KB per token issuance |
-| Registry → KMS | 1KB per signing operation |
-
-**Per 10,000 verifications/second:**
-- Redis: ~10MB/s
-- With 5% fallback to PostgreSQL: ~1MB/s additional
-
-### Latency Requirements
-
-| Path | Target | Maximum |
-|------|--------|---------|
-| Verifier → Redis | <1ms | 5ms |
-| Verifier → PostgreSQL | <5ms | 20ms |
-| Registry → PostgreSQL | <10ms | 50ms |
-| Registry → KMS | <50ms | 200ms |
-
-## Scaling Scenarios
-
-### Scenario 1: 10x Traffic Increase
-
-**Current:** 1,000 verifications/second
-**Target:** 10,000 verifications/second
-
-**Changes needed:**
-- Verifier: 5 → 15 replicas
-- Redis: Add 3 more primary nodes
-- Registry: 3 → 5 replicas
-- PostgreSQL: Add 2 more read replicas
-
-### Scenario 2: New Region Deployment
-
-**For each new region:**
-- Full verifier deployment (can operate read-only)
-- Redis cluster (for local caching)
-- PostgreSQL read replica (for fallback)
-- Registry not required (can call primary region)
-
-### Scenario 3: High Burst Events
-
-**For 10x burst capacity:**
-- HPA maxReplicas = 3x normal
-- Redis: Ensure headroom in memory
-- Rate limiting at edge to shed excess load
-
-## Cost Optimization
-
-### Right-sizing Recommendations
-
-1. **Off-peak scaling:** Scale verifiers down to 50% during low-traffic hours
-2. **Spot instances:** Verifiers are stateless, suitable for spot/preemptible
-3. **Reserved capacity:** Registry and database benefit from reserved pricing
-
-### Resource Efficiency Targets
-
-| Metric | Target |
-|--------|--------|
-| CPU utilization (avg) | 40-60% |
-| Memory utilization (avg) | 50-70% |
-| Cache hit ratio | >95% |
-| Database connection utilization | 50-70% |
-
-## Capacity Planning Checklist
-
-### Monthly Review
-- [ ] Update baseline metrics table
-- [ ] Review resource utilization trends
-- [ ] Check database storage growth
-- [ ] Verify audit partition creation
-- [ ] Review Redis memory usage
-
-### Quarterly Review
-- [ ] Update 12-month projections
-- [ ] Review and adjust HPA settings
-- [ ] Load test at projected capacity
-- [ ] Review cost vs. capacity tradeoffs
-
-### Pre-Launch Checklist
-- [ ] Load test at 2x expected peak
-- [ ] Verify auto-scaling works correctly
-- [ ] Confirm database can handle projected writes
-- [ ] Verify Redis cluster can handle projected cache size
-- [ ] Test failover scenarios
-
-## Monitoring Dashboard Queries
-
-### Key Capacity Metrics (Prometheus)
-
-```promql
-# Verifications per second
-sum(rate(agentauth_tokens_verified_total[5m]))
-
-# Token issuances per second
-sum(rate(agentauth_tokens_issued_total[5m]))
-
-# Active tokens (approximate)
-sum(agentauth_active_tokens)
-
-# Redis memory usage percentage
-redis_memory_used_bytes / redis_memory_max_bytes * 100
-
-# Database connections in use
-pg_stat_activity_count / pg_settings_max_connections * 100
-
-# Audit events per day (24h)
-sum(increase(agentauth_audit_events_total[24h]))
-```
-
-## Emergency Procedures
-
-### If approaching Redis memory limit
-1. Enable aggressive LRU eviction on token cache
-2. Reduce token cache TTL
-3. Scale Redis cluster (add nodes)
-
-### If approaching database storage limit
-1. Run emergency archival job
-2. Drop oldest partitions after archiving
-3. Add storage capacity
-
-### If approaching connection limits
-1. Review and close idle connections
-2. Reduce connection pool sizes temporarily
-3. Add read replicas to distribute load
diff --git a/docs/runbook.md b/docs/runbook.md
deleted file mode 100644
index 999299d..0000000
--- a/docs/runbook.md
+++ /dev/null
@@ -1,434 +0,0 @@
-# AgentAuth Operations Runbook
-
-This runbook provides guidance for responding to alerts and operational issues in the AgentAuth system.
-
-## Table of Contents
-
-- [Token Verification Errors](#token-verify-errors)
-- [Token Verification Latency](#token-verify-latency)
-- [Circuit Breaker](#circuit-breaker)
-- [Revocation Lag](#revocation-lag)
-- [Audit Lag](#audit-lag)
-- [Audit Buffer](#audit-buffer)
-- [Nonce Store](#nonce-store)
-- [Replica Lag](#replica-lag)
-- [SLO Budget](#slo-budget)
-- [OOM](#oom)
-- [Cache Hit Ratio](#cache-hit-ratio)
-- [Redis Unavailable](#redis-unavailable)
-- [Archival Issues](#archival-issues)
-
----
-
-## token-verify-errors
-
-### What this alert means
-Token verification requests are failing at a rate above 0.1%. This indicates a significant portion of authentication requests are not succeeding.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check the verifier logs for error patterns: `kubectl logs -l app=verifier -n agentauth --tail=100`
-2. Check Redis connectivity: `redis-cli -h $REDIS_HOST ping`
-3. Check if circuit breakers are open: Look at `agentauth_circuit_breaker_state` metrics
-4. If Redis is down, verifier should fall back to PostgreSQL - verify this is working
-
-### How to verify recovery
-- Error rate drops below 0.1%
-- `agentauth_tokens_verified_total{outcome="allowed"}` is increasing
-- No new error logs appearing
-
-### Root cause investigation steps
-1. Correlate with deployment events - was there a recent deploy?
-2. Check for Redis cluster issues
-3. Check for database connectivity issues
-4. Look for patterns in failed token JTIs - are specific agents affected?
-5. Check if KMS is available for public key fetching
-
-### Known false-positive conditions
-- Brief spikes during rolling deployments
-- Intentional load testing with invalid tokens
-
----
-
-## token-verify-latency
-
-### What this alert means
-Token verification p99 latency is above 5ms. This may indicate Redis issues or increased database fallback.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check Redis latency: `redis-cli --latency -h $REDIS_HOST`
-2. Check cache hit ratio: `agentauth_cache_hit_ratio{cache="token"}`
-3. If cache hit ratio is low, check if Redis has memory pressure
-4. Check verifier pod resource usage: `kubectl top pods -l app=verifier -n agentauth`
-
-### How to verify recovery
-- p99 latency drops below 5ms
-- Cache hit ratio returns to >95%
-
-### Root cause investigation steps
-1. Check Redis cluster for hot keys or memory pressure
-2. Check if there's a cache stampede (many requests for same cold key)
-3. Review database query performance
-4. Check network latency between verifier and Redis
-
-### Known false-positive conditions
-- Initial cold start after deployment
-- After Redis failover
-
----
-
-## circuit-breaker
-
-### What this alert means
-A circuit breaker has been open for more than 2 minutes, indicating a dependency is failing.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Identify which circuit breaker: Check `agentauth_circuit_breaker_state` labels
-2. For Redis: Check cluster health with `redis-cli cluster info`
-3. For PostgreSQL: Check connection with `pg_isready -h $PG_HOST`
-4. For KMS: Check cloud provider status page
-
-### How to verify recovery
-- Circuit breaker state changes to 0 (closed) or 2 (half-open attempting recovery)
-- Dependency connectivity restored
-
-### Root cause investigation steps
-1. Check dependency service logs and metrics
-2. Check network connectivity and DNS resolution
-3. Review recent infrastructure changes
-4. Check for resource exhaustion on dependency services
-
-### Known false-positive conditions
-- Planned maintenance on dependencies
-- Brief network partitions that self-heal
-
----
-
-## revocation-lag
-
-### What this alert means
-Token revocations are taking more than 200ms to propagate to the cache. This could allow revoked tokens to be used during the lag window.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check Redis write latency
-2. Check registry to verifier network latency
-3. Verify revocation events are being published
-4. Check for Redis replication lag in cluster mode
-
-### How to verify recovery
-- `agentauth_revocation_propagation_seconds` p99 drops below 200ms
-- Revocation test completes within expected time
-
-### Root cause investigation steps
-1. Check Redis cluster for write performance issues
-2. Review revocation event publishing code path
-3. Check for network issues between services
-4. Verify Redis cluster replication is healthy
-
-### Known false-positive conditions
-- During Redis cluster failover
-
----
-
-## audit-lag
-
-### What this alert means
-Audit events are taking more than 30 seconds to be written. This may indicate database issues or backpressure.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check PostgreSQL connections: `SELECT count(*) FROM pg_stat_activity`
-2. Check for long-running transactions: `SELECT * FROM pg_stat_activity WHERE state = 'active'`
-3. Check audit buffer usage: `agentauth_audit_buffer_pct`
-4. Check for disk I/O issues on database
-
-### How to verify recovery
-- `agentauth_audit_write_lag_seconds` drops below 30s
-- Audit buffer usage decreasing
-
-### Root cause investigation steps
-1. Check database for lock contention
-2. Review recent schema or index changes
-3. Check for partition issues (is next month's partition created?)
-4. Analyze slow query logs
-
-### Known false-positive conditions
-- During large batch operations
-- During partition rotation
-
----
-
-## audit-buffer
-
-### What this alert means
-The in-memory audit buffer is above 70% capacity. If it fills completely, primary operations will start failing.
-
-### Immediate mitigation steps (first 5 minutes)
-1. **This is critical** - audit writes must succeed or operations will fail
-2. Check PostgreSQL connectivity immediately
-3. Check for database transaction locks
-4. Consider scaling registry replicas down temporarily to reduce write volume
-5. Check disk space on database server
-
-### How to verify recovery
-- `agentauth_audit_buffer_pct` drops below 50%
-- Audit write lag returning to normal
-
-### Root cause investigation steps
-1. Check database for the root cause of slow writes
-2. Review audit table partitioning
-3. Check for disk I/O saturation
-4. Verify database autovacuum is working
-
-### Known false-positive conditions
-- None - this alert should always be investigated
-
----
-
-## nonce-store
-
-### What this alert means
-The nonce store Redis memory is above 70%. If it reaches capacity with `noeviction` policy, new requests will be rejected rather than risk replay attacks.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check Redis memory usage: `redis-cli info memory`
-2. Check nonce TTLs are working: Keys should expire with token lifetime
-3. Consider scaling Redis cluster if persistent
-4. Check for abnormal traffic patterns
-
-### How to verify recovery
-- `agentauth_nonce_store_memory_pct` drops below 60%
-- Memory growth rate returns to normal
-
-### Root cause investigation steps
-1. Check for abnormal request volume
-2. Verify nonce TTLs are being set correctly
-3. Check for memory leaks in Redis configuration
-4. Review token lifetime settings
-
-### Known false-positive conditions
-- After major traffic spikes (should self-heal as nonces expire)
-
----
-
-## replica-lag
-
-### What this alert means
-PostgreSQL read replica is more than 5 seconds behind the primary. Read queries may return stale data.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check replication status: `SELECT * FROM pg_stat_replication`
-2. Check replica disk I/O and CPU
-3. Check network between primary and replica
-4. Consider failing over to a healthy replica if multiple are available
-
-### How to verify recovery
-- Replica lag drops below 5 seconds
-- `pg_stat_replication` shows active streaming
-
-### Root cause investigation steps
-1. Check for large transactions on primary
-2. Review replica resource utilization
-3. Check network bandwidth between primary and replica
-4. Review WAL generation rate on primary
-
-### Known false-positive conditions
-- During large bulk operations
-- During initial replica sync
-
----
-
-## slo-budget
-
-### What this alert means
-Error budget is being consumed at 5x the normal rate. At this rate, the monthly error budget will be exhausted prematurely.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Identify the source of errors from recent alerts
-2. Check for recent deployments that may have introduced issues
-3. Consider rolling back recent changes
-4. Freeze non-critical deployments
-
-### How to verify recovery
-- Error rate returns to baseline
-- Error budget burn rate drops below 2x normal
-
-### Root cause investigation steps
-1. Correlate with other alerts and deployment events
-2. Review error logs for patterns
-3. Check dependency health
-4. Review recent code changes
-
-### Known false-positive conditions
-- Intentional chaos engineering exercises
-- Load testing
-
----
-
-## oom
-
-### What this alert means
-A pod was killed due to exceeding its memory limit.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Pod should auto-restart - verify it's running
-2. Check if it's a recurring issue: `kubectl get events -n agentauth --field-selector reason=OOMKilled`
-3. Check current memory usage of surviving pods
-4. Consider increasing memory limits if consistently hitting limits
-
-### How to verify recovery
-- Pod is running and healthy
-- Memory usage is stable
-
-### Root cause investigation steps
-1. Check for memory leaks using profiling tools
-2. Review recent code changes that may affect memory usage
-3. Analyze heap dumps if available
-4. Check for unbounded caches or buffers
-
-### Known false-positive conditions
-- None - OOM kills should always be investigated
-
----
-
-## cache-hit-ratio
-
-### What this alert means
-Token cache hit ratio is below 90%, meaning more than 10% of verifications are hitting the database.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check Redis connectivity and health
-2. Check cache eviction rate: `redis-cli info stats | grep evicted`
-3. Check if there's a spike in unique tokens being verified
-4. Verify cache population is working correctly
-
-### How to verify recovery
-- Cache hit ratio returns above 90%
-- Database query rate decreases
-
-### Root cause investigation steps
-1. Check Redis memory pressure and eviction policy
-2. Review token access patterns
-3. Check for cache invalidation bugs
-4. Verify cache warming on startup
-
-### Known false-positive conditions
-- After verifier pod restart (cache needs to warm)
-- After Redis restart
-
----
-
-## redis-unavailable
-
-### What this alert means
-The verifier is unable to connect to Redis. It should fall back to PostgreSQL but with degraded latency.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check Redis cluster health: `redis-cli cluster info`
-2. Check network connectivity to Redis
-3. Verify verifier is falling back to PostgreSQL correctly
-4. Check Redis for OOM or connection limit issues
-
-### How to verify recovery
-- Redis connection errors stop
-- Latency returns to normal (sub-5ms)
-
-### Root cause investigation steps
-1. Check Redis logs for errors
-2. Review network configuration and firewall rules
-3. Check for Redis cluster failover events
-4. Verify Redis resource limits
-
-### Known false-positive conditions
-- During planned Redis maintenance
-
----
-
-## archival-issues
-
-### Archival Job Failed
-
-#### What this means
-The audit archival job failed to complete successfully.
-
-#### Immediate steps
-1. Check archiver logs: `kubectl logs -l app=audit-archiver -n agentauth`
-2. Verify database connectivity
-3. Check cold storage (S3/GCS) access
-
-#### Recovery verification
-- Next scheduled job completes successfully
-- `agentauth_archival_job_status` returns to 1
-
-### Partition Creation Failed
-
-#### What this means
-Failed to create next month's audit partition. If not fixed, audit writes will fail when the current partition ends.
-
-#### Immediate steps
-1. **This is critical** - manually create the partition if needed
-2. Check database connectivity
-3. Check for disk space issues
-4. Verify database user permissions
-
-#### Recovery verification
-- Partition exists for next month
-- `agentauth_partition_creation_status` returns to 1
-
-### Cold Storage Upload Failed
-
-#### What this means
-Archived audit data failed to upload to cold storage (S3/GCS).
-
-#### Immediate steps
-1. Check cloud provider credentials
-2. Verify bucket exists and is accessible
-3. Check for network issues to cloud storage
-
-#### Recovery verification
-- Upload succeeds on retry
-- `agentauth_cold_storage_upload_status` returns to 1
-
----
-
-## General Troubleshooting
-
-### Useful Commands
-
-```bash
-# Check all pod status
-kubectl get pods -n agentauth
-
-# Check recent events
-kubectl get events -n agentauth --sort-by='.lastTimestamp'
-
-# Check service logs
-kubectl logs -l app=registry -n agentauth --tail=100
-kubectl logs -l app=verifier -n agentauth --tail=100
-
-# Check metrics endpoint
-kubectl port-forward svc/registry-metrics 9090:9090 -n agentauth
-curl localhost:9090/metrics
-
-# Check Redis
-redis-cli -h $REDIS_HOST cluster info
-redis-cli -h $REDIS_HOST info memory
-
-# Check PostgreSQL
-psql -h $PG_HOST -U agentauth -c "SELECT count(*) FROM pg_stat_activity"
-```
-
-### Escalation Path
-
-1. **P1 (Critical)**: Page on-call immediately
-   - Token verification down
-   - Audit buffer full
-   - Nonce store full
-
-2. **P2 (High)**: Page during business hours
-   - High latency
-   - Circuit breakers open
-   - SLO budget burning fast
-
-3. **P3 (Medium)**: Ticket for next business day
-   - Replica lag
-   - Cache hit ratio low
-   - Archival issues
diff --git a/docs/threat-model.md b/docs/threat-model.md
deleted file mode 100644
index 4892530..0000000
--- a/docs/threat-model.md
+++ /dev/null
@@ -1,334 +0,0 @@
-# AgentAuth Threat Model
-
-This document identifies security threats to the AgentAuth system and describes the mitigations implemented, residual risks, and detection mechanisms for each threat vector.
-
-## Overview
-
-AgentAuth is a capability-based authentication system for AI agents. The system involves:
-- **Registry Service**: Issues and manages agent access tokens (AATs)
-- **Verifier Service**: Validates tokens for service providers
-- **Agent SDK**: Client library for agents to authenticate
-- **Approval UI**: Human-facing interface for capability approvals
-- **Service Providers**: Third-party services that accept AgentAuth tokens
-
-## Threat Vectors
-
----
-
-### 1. Stolen Registry Signing Key
-
-**Attack Description:**
-An attacker obtains the registry's private signing key, enabling them to forge arbitrary AATs and capability grants. This is a catastrophic compromise that would allow impersonation of any agent.
- -**Mitigations Implemented:** -- Registry signing keys are stored exclusively in Hardware Security Modules (HSM) via KMS backends (AWS KMS, GCP Cloud KMS, or HashiCorp Vault Transit) -- Keys never exist in plaintext form on any server - all signing operations occur within the HSM -- The `InMemorySigningBackend` and `PlaintextKeyfile` backends are disabled in production via compile-time feature flags and CI checks -- Key rotation is supported via the `key_id` field in tokens and the `/well-known/agentauth/keys` endpoint - -**Residual Risk:** -- Compromise of cloud provider KMS infrastructure (extremely rare, covered by provider SLAs) -- Insider threat with KMS admin access - -**Detection:** -- Monitor KMS audit logs for unusual signing operations -- Alert on tokens signed with unknown `key_id` values -- Track signing operation volume - sudden spikes indicate compromise - ---- - -### 2. Stolen Agent Private Key - -**Attack Description:** -An attacker steals an agent's private key, allowing them to authenticate as that agent and perform actions within the agent's granted capabilities. - -**Mitigations Implemented:** -- Agent keys are stored in KMS, never as plaintext in the agent's runtime environment -- OTP-based bootstrap flow ensures agents never handle raw private keys -- DPoP (Demonstration of Proof of Possession) sender-constraint requires proof of key possession for every authenticated request -- Short token lifetimes (15 minutes maximum) limit the window of exploitation - -**Residual Risk:** -- Compromise of the KMS where agent keys are stored -- If an attacker also has network MITM capability during the 15-minute token window - -**Detection:** -- Monitor for DPoP proofs signed with keys not matching the `cnf` claim -- Alert on authentication from unexpected IP addresses/regions -- Track behavioral anomalies (sudden capability usage patterns) - ---- - -### 3. 
Phished Human Principal Credential - -**Attack Description:** -An attacker tricks a human principal into approving malicious capability grants through phishing or social engineering. - -**Mitigations Implemented:** -- WebAuthn/Passkey required for approval assertions - phishing-resistant by design -- Approval assertions are cryptographically signed and bound to the specific capability set shown -- Two-step confirmation required for dangerous capabilities (Transact, Delete) -- Capability descriptions rendered in plain English to prevent confusion - -**Residual Risk:** -- Real-time phishing where attacker proxies the legitimate UI -- Social engineering to approve legitimate-looking but malicious requests - -**Detection:** -- Monitor for unusual approval patterns (time, location, frequency) -- Alert on approvals from new devices -- Audit log all approval decisions with human-readable capability descriptions - ---- - -### 4. AAT Interception and Replay - -**Attack Description:** -An attacker intercepts a valid AAT from network traffic and attempts to reuse it. - -**Mitigations Implemented:** -- Nonce-based replay prevention: each token usage includes a unique nonce stored in Redis -- DPoP sender-constraint: tokens are bound to a specific keypair; replay without the private key fails -- Short token lifetimes (15 minutes) minimize replay window -- TLS required for all communications - -**Residual Risk:** -- If an attacker compromises both the AAT and the agent's DPoP private key -- Redis failure allowing nonce storage bypass - -**Detection:** -- Alert on nonce replay attempts (logged with source IP) -- Monitor for high-volume verification requests with identical nonces -- Track verification failures with "nonce already used" errors - ---- - -### 5. AAT Claims Forgery - -**Attack Description:** -An attacker attempts to modify token claims (capabilities, expiry, service provider binding) to escalate privileges. 
- -**Mitigations Implemented:** -- All token claims are covered by the registry's Ed25519 signature -- `key_id` field is verified before selecting the public key for verification -- Tampered claims cause signature verification failure -- Verification uses constant-time comparison (via `subtle` crate) to prevent timing attacks - -**Residual Risk:** -- Theoretical cryptographic break of Ed25519 (currently considered infeasible) - -**Detection:** -- Log all verification failures with reason codes -- Alert on repeated forgery attempts from the same source -- Monitor for attempts to use old/rotated `key_id` values - ---- - -### 6. Cross-Service-Provider Token Reuse - -**Attack Description:** -An attacker takes a token issued for Service Provider A and attempts to use it at Service Provider B. - -**Mitigations Implemented:** -- Every AAT contains a `service_provider_id` claim binding it to a specific service provider -- Verifiers must validate that the `service_provider_id` matches their own identity -- DPoP proofs include the target URL, preventing replay across different endpoints - -**Residual Risk:** -- Service provider misconfiguration not checking `service_provider_id` - -**Detection:** -- Log service_provider_id mismatches at verification time -- Alert on tokens verified by unexpected service providers (via audit logs) - ---- - -### 7. Malicious Service Provider Forging Audit Records - -**Attack Description:** -A compromised or malicious service provider attempts to forge audit records to hide unauthorized access or frame other entities. 
- -**Mitigations Implemented:** -- Audit events include a hash chain: each event contains `previous_event_hash` -- Registry signs all audit records with `registry_signature` -- `UPDATE` and `DELETE` operations are revoked at the database level for the service role -- Audit events are immutable and append-only - -**Residual Risk:** -- Registry compromise allowing signing of malicious audit records -- Database admin with elevated privileges - -**Detection:** -- Audit chain integrity verification endpoint (`/v1/audit/:agent_id/verify`) -- Alert on hash chain breaks or missing events -- Regular automated chain integrity checks - ---- - -### 8. Approval UI CSRF - -**Attack Description:** -An attacker tricks a logged-in human principal into submitting an approval request through a malicious website. - -**Mitigations Implemented:** -- `SameSite=Strict` cookie policy prevents cross-site request inclusion -- Double Submit Cookie pattern: CSRF token in cookie and request body must match -- `Origin` header validation rejects requests from unexpected origins -- Approval assertion is cryptographically signed via WebAuthn - cannot be forged without the user's authenticator - -**Residual Risk:** -- Browser vulnerabilities bypassing SameSite -- XSS in the approval UI itself (mitigated by CSP) - -**Detection:** -- Log requests with missing or mismatched CSRF tokens -- Alert on approval attempts from unexpected origins -- Monitor for patterns indicating automated CSRF attempts - ---- - -### 9. Grant Request Flooding / Approval Spam - -**Attack Description:** -An attacker floods the system with grant requests or approval submissions to overwhelm human reviewers or cause denial of service. 
- -**Mitigations Implemented:** -- Maximum 5 pending approval requests per agent at any time -- Approval requests expire after 1 hour if not acted upon -- Denied requests trigger exponential backoff cooldown: 1h, 4h, 24h -- Rate limiting at load balancer, middleware, and SDK levels - -**Residual Risk:** -- Distributed attack from many compromised agents -- Resource exhaustion if flood protection thresholds are too high - -**Detection:** -- Monitor pending approval counts per agent -- Alert on agents hitting the pending limit repeatedly -- Track denial rates and cooldown trigger frequency - ---- - -### 10. Agent Manifest Spoofing / Impersonation - -**Attack Description:** -An attacker creates a fake agent manifest claiming to be a legitimate agent or claiming capabilities beyond what should be allowed. - -**Mitigations Implemented:** -- Agent manifests are signed and registered through the registry -- `model_origin` field tracks the source model provider -- Registry validates manifest claims during registration -- Capability grants cannot exceed what was declared in the original manifest - -**Residual Risk:** -- Compromised agent provisioning pipeline -- Social engineering to get a malicious manifest approved - -**Detection:** -- Audit log all manifest registrations -- Alert on capability requests exceeding manifest declarations -- Monitor for manifests claiming sensitive `model_origin` values - ---- - -### 11. Registry Compromise - -**Attack Description:** -An attacker gains control of the registry service, potentially accessing all agent data and signing keys. 
- -**Mitigations Implemented:** -- Signing keys stored in HSM - even full registry compromise cannot extract raw keys -- Registry does not store tokens - only issues them -- Write operations require proper authentication -- Separation of registry (write-heavy) and verifier (read-only) services limits blast radius -- Database credentials are minimal-privilege - -**Residual Risk:** -- Attacker could issue new tokens during compromise window -- Access to agent metadata and grant history - -**Detection:** -- Intrusion detection on registry hosts -- Anomaly detection on token issuance rates -- File integrity monitoring on registry binaries -- Alert on unusual database query patterns - ---- - -### 12. Supply Chain Attack on SDK - -**Attack Description:** -An attacker compromises the SDK build process or dependencies to inject malicious code that exfiltrates tokens or keys. - -**Mitigations Implemented:** -- `cargo-deny` enforces license compliance and bans known-malicious crates -- `cargo audit` checks for known vulnerabilities in dependencies -- SDK makes no network requests except to configured registry/KMS endpoints -- No telemetry or analytics in the SDK -- Banned crates list includes native-tls (uses rustls only) - -**Residual Risk:** -- Zero-day in a dependency before it's added to advisory database -- Compromise of crates.io infrastructure - -**Detection:** -- Reproducible builds enable verification -- Network monitoring can detect unexpected outbound connections -- Dependency diff review in CI for any new dependencies - ---- - -### 13. Secret Zero / First Provisioning - -**Attack Description:** -An attacker intercepts the initial provisioning process to obtain or substitute agent credentials. 
-
-**Mitigations Implemented:**
-- OTP (one-time password) bootstrap flow: the agent receives a single-use provisioning token
-- The OTP is immediately invalidated after first use
-- The keypair is generated inside the KMS - the agent receives only a key reference, never the raw key
-- Reuse of an OTP returns `409 Conflict` and emits a security audit event
-
-**Residual Risk:**
-- OTP interception during initial deployment
-- Compromise of the system distributing OTPs
-
-**Detection:**
-- Audit log all bootstrap attempts
-- Alert on OTP reuse attempts
-- Monitor for bootstrap requests from unexpected sources
-
----
-
-## Security Invariants
-
-The following invariants must hold for the system to be secure:
-
-1. **No plaintext keys in production**: `InMemorySigningBackend` and `PlaintextKeyfile` are never instantiated outside `#[cfg(test)]`
-2. **Constant-time comparisons**: All secret comparisons use `subtle::ConstantTimeEq` or ed25519-dalek's internal constant-time verification
-3. **TLS everywhere**: No service starts without TLS configured
-4. **Audit atomicity**: Audit write failures cause the primary operation to fail
-5. **Nonce uniqueness**: Every token usage has a unique nonce that cannot be replayed
-6. **DPoP binding**: Tokens without valid DPoP proofs are rejected
-7. **Capability boundary**: Agents cannot request capabilities beyond their manifest
-
----
-
-## Incident Response
-
-In the event of a security incident:
-
-1. **Immediate**: Revoke affected tokens; rotate compromised keys via the KMS
-2. **Short-term**: Review audit logs to determine the blast radius; notify affected service providers
-3. **Long-term**: Perform root cause analysis, implement additional mitigations, and update the threat model
-
----
-
-## Review Schedule
-
-This threat model should be reviewed:
-- After any significant architectural change
-- After any security incident
-- At minimum quarterly
-
-Last reviewed: Stage 5 implementation

From d544bb7060ac47d6636881825e86e6460c38b973 Mon Sep 17 00:00:00 2001
From: Max Malkin
Date: Tue, 3 Mar 2026 11:34:47 -0700
Subject: [PATCH 2/2] update README
---
 README.md | 34 +++++-----------------------------
 1 file changed, 5 insertions(+), 29 deletions(-)

diff --git a/README.md b/README.md
index 0d186c6..18c75b8 100644
--- a/README.md
+++ b/README.md
@@ -75,7 +75,7 @@ cargo build --workspace
cargo nextest run --workspace
```

-### Running the Services
+### Running Services

The easiest way to run all services locally is with the dev runner script:

```bash
./dev.sh
```

-This starts the registry, verifier, and approval UI in a single terminal with colored log output. Press Ctrl+C to stop all services.
+This starts the registry, verifier, and approval UI in a single terminal with colored log output. 
-To run services individually:
+To run each service individually:

```bash
# Start the registry service
@@ -119,12 +119,8 @@ agentauth/
├── load-tests/ # k6 load test scripts
├── chaos/ # Chaos engineering experiments
-├── deploy/
-│ ├── helm/ # Kubernetes Helm charts
-│ └── grafana/ # Grafana dashboards
-└── docs/
- ├── threat-model.md # Security threat model
- ├── runbook.md # Operations runbook
- └── capacity-planning.md # Sizing guidelines
+└── deploy/
+ ├── helm/ # Kubernetes Helm charts
+ └── grafana/ # Grafana dashboards
```

## SDK Usage

@@ -172,8 +168,6 @@ headers = await client.authenticate_headers("service-provider-id", "POST", "/api...

## Security

-AgentAuth is designed with security as a primary concern:
-
- All signing keys stored in HSMs (AWS KMS, GCP Cloud KMS, Vault Transit)
- DPoP sender-constraint prevents token theft
- Nonce-based replay prevention
@@ -181,24 +175,6 @@ AgentAuth is designed with security as a primary concern:
- Immutable audit log with hash chain integrity
- WebAuthn/Passkey for human approval signing

-See [docs/threat-model.md](docs/threat-model.md) for the full threat model.
-
-## Performance
-
-Target performance characteristics:
-
-| Operation | Throughput | p99 Latency |
-|-----------|------------|-------------|
-| Token verification (warm) | 10,000 req/s | < 5ms |
-| Token verification (cold) | 1,000 req/s | < 20ms |
-| Token issuance | 500 req/s | < 50ms |
-
-## Documentation
-
-- [Threat Model](docs/threat-model.md) - Security analysis and mitigations
-- [Operations Runbook](docs/runbook.md) - Alert response procedures
-- [Capacity Planning](docs/capacity-planning.md) - Sizing and scaling guidelines
-
## License

MIT License
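
Aside for reviewers: the README's Security section keeps the "nonce-based replay prevention" bullet. The idea can be pictured with a toy sketch — illustrative only; `NonceCache` and `check_and_store` are hypothetical names, not the AgentAuth SDK API, and a production verifier would also expire entries and share this state across replicas:

```python
# Toy illustration of nonce-based replay prevention (hypothetical names,
# not the AgentAuth API). A verifier accepts a nonce the first time it is
# presented and rejects any replay of the same nonce.
import secrets


class NonceCache:
    """Remembers nonces already seen."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def check_and_store(self, nonce: str) -> bool:
        """Return True on first use of a nonce, False on replay."""
        if nonce in self._seen:
            return False
        self._seen.add(nonce)
        return True


cache = NonceCache()
nonce = secrets.token_urlsafe(16)  # fresh nonce attached to one token usage
assert cache.check_and_store(nonce)      # first presentation accepted
assert not cache.check_and_store(nonce)  # replay rejected
```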