From 9f4ce3ff76ca5dbbecdc6906f48117c895b5689d Mon Sep 17 00:00:00 2001
From: Max Malkin
Date: Tue, 3 Mar 2026 11:30:44 -0700
Subject: [PATCH 1/2] remove unneeded docs

---
 docs/capacity-planning.md | 283 -------------------------
 docs/runbook.md           | 434 --------------------------------------
 docs/threat-model.md      | 334 -----------------------------
 3 files changed, 1051 deletions(-)
 delete mode 100644 docs/capacity-planning.md
 delete mode 100644 docs/runbook.md
 delete mode 100644 docs/threat-model.md

diff --git a/docs/capacity-planning.md b/docs/capacity-planning.md
deleted file mode 100644
index 162f5fa..0000000
--- a/docs/capacity-planning.md
+++ /dev/null
@@ -1,283 +0,0 @@
-# AgentAuth Capacity Planning Guide
-
-This document provides guidance for sizing AgentAuth deployments and planning for growth.
-
-## Overview
-
-AgentAuth consists of three main services with different scaling characteristics:
-
-| Service | Scaling Model | Primary Constraint |
-|---------|---------------|-------------------|
-| Registry | Vertical + Horizontal | CPU (crypto operations) |
-| Verifier | Horizontal | Network I/O, Memory |
-| Audit Archiver | Single instance | Database I/O |
-
-## Current Baseline Metrics
-
-These metrics should be updated after each major release or significant traffic change.
-
-| Metric | Current Value | 12-Month Projection |
-|--------|---------------|---------------------|
-| Token verifications/second | - | - |
-| Token issuances/second | - | - |
-| Audit events/day | - | - |
-| Active agents | - | - |
-| Active service providers | - | - |
-
-## Resource Sizing Guidelines
-
-### Registry Service
-
-**Scaling triggers:**
-- CPU utilization > 60% sustained
-- Token issuance p99 > 50ms
-
-**Initial sizing:**
-```yaml
-replicas: 3
-resources:
-  requests:
-    cpu: 500m
-    memory: 256Mi
-  limits:
-    cpu: 2000m
-    memory: 1Gi
-```
-
-**Scaling formula:**
-- 1 registry replica per 500 token issuances/second
-- Add 1 replica for each 1000 concurrent grant approval sessions
-
-**Memory considerations:**
-- Base: ~100MB
-- Per connection pool: ~10MB per pool
-- Audit buffer: Up to 100MB when backpressured
-
-### Verifier Service
-
-**Scaling triggers:**
-- p99 latency > 5ms
-- Request rate > 1000 req/s per replica
-
-**Initial sizing:**
-```yaml
-replicas: 5
-resources:
-  requests:
-    cpu: 250m
-    memory: 128Mi
-  limits:
-    cpu: 1000m
-    memory: 512Mi
-```
-
-**Scaling formula:**
-- 1 verifier replica per 1000 token verifications/second sustained
-- Account for burst: 2x replicas for 2x peak-to-average ratio
-
-**Memory considerations:**
-- Base: ~50MB
-- Connection pools: ~20MB total
-- In-flight requests: ~1KB per request
-
-### Audit Archiver
-
-**Sizing:**
-```yaml
-replicas: 1 # Leader election, only one active
-resources:
-  requests:
-    cpu: 100m
-    memory: 128Mi
-  limits:
-    cpu: 500m
-    memory: 256Mi
-```
-
-**Constraints:**
-- Single instance with leader election
-- Must complete archival within maintenance window
-- Partition creation must run before month end
-
-## Database Sizing
-
-### PostgreSQL Primary
-
-**Initial sizing:**
-- vCPU: 8
-- RAM: 32GB
-- Storage: 500GB SSD (provisioned IOPS)
-
-**Scaling triggers:**
-- Write IOPS > 70% provisioned
-- Connection count > 80% max_connections
-- Storage > 70% capacity
-
-**Growth formula:**
-- Audit events: ~1KB per event average
-- 1M events/day = ~1GB/day = ~30GB/month (before compression)
-- After 90-day retention: ~90GB hot storage
-
-### PostgreSQL Read Replicas
-
-**Initial:** 2 replicas for high availability
-
-**Scaling triggers:**
-- Replica lag > 5 seconds sustained
-- Read query latency increasing
-
-**Sizing:** Match primary for CPU/RAM, can use smaller storage
-
-### Redis Cluster
-
-**Initial sizing:**
-- 3 primary nodes + 3 replicas
-- 8GB RAM per node
-- Total: 24GB usable (after replication)
-
-**Memory allocation:**
-| Store | Allocation | Eviction Policy |
-|-------|------------|-----------------|
-| Token cache | 40% | allkeys-lru |
-| Nonce store | 40% | noeviction |
-| Rate limiters | 20% | volatile-ttl |
-
-**Sizing formula:**
-- Token cache: ~500 bytes per cached token
-- 1M active tokens = ~500MB
-- Nonce store: ~100 bytes per nonce
-- 1M nonces = ~100MB
-
-**Critical threshold:** Nonce store at 70% triggers alert
-
-## Network Considerations
-
-### Bandwidth Requirements
-
-| Path | Estimate |
-|------|----------|
-| Verifier → Redis | 1KB per verification |
-| Verifier → PostgreSQL (fallback) | 2KB per verification |
-| Registry → PostgreSQL | 5KB per token issuance |
-| Registry → KMS | 1KB per signing operation |
-
-**Per 10,000 verifications/second:**
-- Redis: ~10MB/s
-- With 5% fallback to PostgreSQL: ~1MB/s additional
-
-### Latency Requirements
-
-| Path | Target | Maximum |
-|------|--------|---------|
-| Verifier → Redis | <1ms | 5ms |
-| Verifier → PostgreSQL | <5ms | 20ms |
-| Registry → PostgreSQL | <10ms | 50ms |
-| Registry → KMS | <50ms | 200ms |
-
-## Scaling Scenarios
-
-### Scenario 1: 10x Traffic Increase
-
-**Current:** 1,000 verifications/second
-**Target:** 10,000 verifications/second
-
-**Changes needed:**
-- Verifier: 5 → 15 replicas
-- Redis: Add 3 more primary nodes
-- Registry: 3 → 5 replicas
-- PostgreSQL: Add 2 more read replicas
-
-### Scenario 2: New Region Deployment
-
-**For each new region:**
-- Full verifier deployment (can operate read-only)
-- Redis cluster (for local caching)
-- PostgreSQL read replica (for fallback)
-- Registry not required (can call primary region)
-
-### Scenario 3: High Burst Events
-
-**For 10x burst capacity:**
-- HPA maxReplicas = 3x normal
-- Redis: Ensure headroom in memory
-- Rate limiting at edge to shed excess load
-
-## Cost Optimization
-
-### Right-sizing Recommendations
-
-1. **Off-peak scaling:** Scale verifiers down to 50% during low-traffic hours
-2. **Spot instances:** Verifiers are stateless, suitable for spot/preemptible
-3. **Reserved capacity:** Registry and database benefit from reserved pricing
-
-### Resource Efficiency Targets
-
-| Metric | Target |
-|--------|--------|
-| CPU utilization (avg) | 40-60% |
-| Memory utilization (avg) | 50-70% |
-| Cache hit ratio | >95% |
-| Database connection utilization | 50-70% |
-
-## Capacity Planning Checklist
-
-### Monthly Review
-- [ ] Update baseline metrics table
-- [ ] Review resource utilization trends
-- [ ] Check database storage growth
-- [ ] Verify audit partition creation
-- [ ] Review Redis memory usage
-
-### Quarterly Review
-- [ ] Update 12-month projections
-- [ ] Review and adjust HPA settings
-- [ ] Load test at projected capacity
-- [ ] Review cost vs. capacity tradeoffs
-
-### Pre-Launch Checklist
-- [ ] Load test at 2x expected peak
-- [ ] Verify auto-scaling works correctly
-- [ ] Confirm database can handle projected writes
-- [ ] Verify Redis cluster can handle projected cache size
-- [ ] Test failover scenarios
-
-## Monitoring Dashboard Queries
-
-### Key Capacity Metrics (Prometheus)
-
-```promql
-# Verifications per second
-sum(rate(agentauth_tokens_verified_total[5m]))
-
-# Token issuances per second
-sum(rate(agentauth_tokens_issued_total[5m]))
-
-# Active tokens (approximate)
-sum(agentauth_active_tokens)
-
-# Redis memory usage percentage
-redis_memory_used_bytes / redis_memory_max_bytes * 100
-
-# Database connections in use
-pg_stat_activity_count / pg_settings_max_connections * 100
-
-# Audit events per day (24h)
-sum(increase(agentauth_audit_events_total[24h]))
-```
-
-## Emergency Procedures
-
-### If approaching Redis memory limit
-1. Enable aggressive LRU eviction on token cache
-2. Reduce token cache TTL
-3. Scale Redis cluster (add nodes)
-
-### If approaching database storage limit
-1. Run emergency archival job
-2. Drop oldest partitions after archiving
-3. Add storage capacity
-
-### If approaching connection limits
-1. Review and close idle connections
-2. Reduce connection pool sizes temporarily
-3. Add read replicas to distribute load
diff --git a/docs/runbook.md b/docs/runbook.md
deleted file mode 100644
index 999299d..0000000
--- a/docs/runbook.md
+++ /dev/null
@@ -1,434 +0,0 @@
-# AgentAuth Operations Runbook
-
-This runbook provides guidance for responding to alerts and operational issues in the AgentAuth system.
-
-## Table of Contents
-
-- [Token Verification Errors](#token-verify-errors)
-- [Token Verification Latency](#token-verify-latency)
-- [Circuit Breaker](#circuit-breaker)
-- [Revocation Lag](#revocation-lag)
-- [Audit Lag](#audit-lag)
-- [Audit Buffer](#audit-buffer)
-- [Nonce Store](#nonce-store)
-- [Replica Lag](#replica-lag)
-- [SLO Budget](#slo-budget)
-- [OOM](#oom)
-- [Cache Hit Ratio](#cache-hit-ratio)
-- [Redis Unavailable](#redis-unavailable)
-- [Archival Issues](#archival-issues)
-
----
-
-## token-verify-errors
-
-### What this alert means
-Token verification requests are failing at a rate above 0.1%. This indicates a significant portion of authentication requests are not succeeding.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check the verifier logs for error patterns: `kubectl logs -l app=verifier -n agentauth --tail=100`
-2. Check Redis connectivity: `redis-cli -h $REDIS_HOST ping`
-3. Check if circuit breakers are open: Look at `agentauth_circuit_breaker_state` metrics
-4. If Redis is down, verifier should fall back to PostgreSQL - verify this is working
-
-### How to verify recovery
-- Error rate drops below 0.1%
-- `agentauth_tokens_verified_total{outcome="allowed"}` is increasing
-- No new error logs appearing
-
-### Root cause investigation steps
-1. Correlate with deployment events - was there a recent deploy?
-2. Check for Redis cluster issues
-3. Check for database connectivity issues
-4. Look for patterns in failed token JTIs - are specific agents affected?
-5. Check if KMS is available for public key fetching
-
-### Known false-positive conditions
-- Brief spikes during rolling deployments
-- Intentional load testing with invalid tokens
-
----
-
-## token-verify-latency
-
-### What this alert means
-Token verification p99 latency is above 5ms. This may indicate Redis issues or increased database fallback.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check Redis latency: `redis-cli --latency -h $REDIS_HOST`
-2. Check cache hit ratio: `agentauth_cache_hit_ratio{cache="token"}`
-3. If cache hit ratio is low, check if Redis has memory pressure
-4. Check verifier pod resource usage: `kubectl top pods -l app=verifier -n agentauth`
-
-### How to verify recovery
-- p99 latency drops below 5ms
-- Cache hit ratio returns to >95%
-
-### Root cause investigation steps
-1. Check Redis cluster for hot keys or memory pressure
-2. Check if there's a cache stampede (many requests for same cold key)
-3. Review database query performance
-4. Check network latency between verifier and Redis
-
-### Known false-positive conditions
-- Initial cold start after deployment
-- After Redis failover
-
----
-
-## circuit-breaker
-
-### What this alert means
-A circuit breaker has been open for more than 2 minutes, indicating a dependency is failing.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Identify which circuit breaker: Check `agentauth_circuit_breaker_state` labels
-2. For Redis: Check cluster health with `redis-cli cluster info`
-3. For PostgreSQL: Check connection with `pg_isready -h $PG_HOST`
-4. For KMS: Check cloud provider status page
-
-### How to verify recovery
-- Circuit breaker state changes to 0 (closed) or 2 (half-open attempting recovery)
-- Dependency connectivity restored
-
-### Root cause investigation steps
-1. Check dependency service logs and metrics
-2. Check network connectivity and DNS resolution
-3. Review recent infrastructure changes
-4. Check for resource exhaustion on dependency services
-
-### Known false-positive conditions
-- Planned maintenance on dependencies
-- Brief network partitions that self-heal
-
----
-
-## revocation-lag
-
-### What this alert means
-Token revocations are taking more than 200ms to propagate to the cache. This could allow revoked tokens to be used during the lag window.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check Redis write latency
-2. Check registry to verifier network latency
-3. Verify revocation events are being published
-4. Check for Redis replication lag in cluster mode
-
-### How to verify recovery
-- `agentauth_revocation_propagation_seconds` p99 drops below 200ms
-- Revocation test completes within expected time
-
-### Root cause investigation steps
-1. Check Redis cluster for write performance issues
-2. Review revocation event publishing code path
-3. Check for network issues between services
-4. Verify Redis cluster replication is healthy
-
-### Known false-positive conditions
-- During Redis cluster failover
-
----
-
-## audit-lag
-
-### What this alert means
-Audit events are taking more than 30 seconds to be written. This may indicate database issues or backpressure.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check PostgreSQL connections: `SELECT count(*) FROM pg_stat_activity`
-2. Check for long-running transactions: `SELECT * FROM pg_stat_activity WHERE state = 'active'`
-3. Check audit buffer usage: `agentauth_audit_buffer_pct`
-4. Check for disk I/O issues on database
-
-### How to verify recovery
-- `agentauth_audit_write_lag_seconds` drops below 30s
-- Audit buffer usage decreasing
-
-### Root cause investigation steps
-1. Check database for lock contention
-2. Review recent schema or index changes
-3. Check for partition issues (is next month's partition created?)
-4. Analyze slow query logs
-
-### Known false-positive conditions
-- During large batch operations
-- During partition rotation
-
----
-
-## audit-buffer
-
-### What this alert means
-The in-memory audit buffer is above 70% capacity. If it fills completely, primary operations will start failing.
-
-### Immediate mitigation steps (first 5 minutes)
-1. **This is critical** - audit writes must succeed or operations will fail
-2. Check PostgreSQL connectivity immediately
-3. Check for database transaction locks
-4. Consider scaling registry replicas down temporarily to reduce write volume
-5. Check disk space on database server
-
-### How to verify recovery
-- `agentauth_audit_buffer_pct` drops below 50%
-- Audit write lag returning to normal
-
-### Root cause investigation steps
-1. Check database for the root cause of slow writes
-2. Review audit table partitioning
-3. Check for disk I/O saturation
-4. Verify database autovacuum is working
-
-### Known false-positive conditions
-- None - this alert should always be investigated
-
----
-
-## nonce-store
-
-### What this alert means
-The nonce store Redis memory is above 70%. If it reaches capacity with `noeviction` policy, new requests will be rejected rather than risk replay attacks.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check Redis memory usage: `redis-cli info memory`
-2. Check nonce TTLs are working: Keys should expire with token lifetime
-3. Consider scaling Redis cluster if persistent
-4. Check for abnormal traffic patterns
-
-### How to verify recovery
-- `agentauth_nonce_store_memory_pct` drops below 60%
-- Memory growth rate returns to normal
-
-### Root cause investigation steps
-1. Check for abnormal request volume
-2. Verify nonce TTLs are being set correctly
-3. Check for memory leaks in Redis configuration
-4. Review token lifetime settings
-
-### Known false-positive conditions
-- After major traffic spikes (should self-heal as nonces expire)
-
----
-
-## replica-lag
-
-### What this alert means
-PostgreSQL read replica is more than 5 seconds behind the primary. Read queries may return stale data.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check replication status: `SELECT * FROM pg_stat_replication`
-2. Check replica disk I/O and CPU
-3. Check network between primary and replica
-4. Consider failing over to a healthy replica if multiple are available
-
-### How to verify recovery
-- Replica lag drops below 5 seconds
-- `pg_stat_replication` shows active streaming
-
-### Root cause investigation steps
-1. Check for large transactions on primary
-2. Review replica resource utilization
-3. Check network bandwidth between primary and replica
-4. Review WAL generation rate on primary
-
-### Known false-positive conditions
-- During large bulk operations
-- During initial replica sync
-
----
-
-## slo-budget
-
-### What this alert means
-Error budget is being consumed at 5x the normal rate. At this rate, the monthly error budget will be exhausted prematurely.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Identify the source of errors from recent alerts
-2. Check for recent deployments that may have introduced issues
-3. Consider rolling back recent changes
-4. Freeze non-critical deployments
-
-### How to verify recovery
-- Error rate returns to baseline
-- Error budget burn rate drops below 2x normal
-
-### Root cause investigation steps
-1. Correlate with other alerts and deployment events
-2. Review error logs for patterns
-3. Check dependency health
-4. Review recent code changes
-
-### Known false-positive conditions
-- Intentional chaos engineering exercises
-- Load testing
-
----
-
-## oom
-
-### What this alert means
-A pod was killed due to exceeding its memory limit.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Pod should auto-restart - verify it's running
-2. Check if it's a recurring issue: `kubectl get events -n agentauth --field-selector reason=OOMKilled`
-3. Check current memory usage of surviving pods
-4. Consider increasing memory limits if consistently hitting limits
-
-### How to verify recovery
-- Pod is running and healthy
-- Memory usage is stable
-
-### Root cause investigation steps
-1. Check for memory leaks using profiling tools
-2. Review recent code changes that may affect memory usage
-3. Analyze heap dumps if available
-4. Check for unbounded caches or buffers
-
-### Known false-positive conditions
-- None - OOM kills should always be investigated
-
----
-
-## cache-hit-ratio
-
-### What this alert means
-Token cache hit ratio is below 90%, meaning more than 10% of verifications are hitting the database.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check Redis connectivity and health
-2. Check cache eviction rate: `redis-cli info stats | grep evicted`
-3. Check if there's a spike in unique tokens being verified
-4. Verify cache population is working correctly
-
-### How to verify recovery
-- Cache hit ratio returns above 90%
-- Database query rate decreases
-
-### Root cause investigation steps
-1. Check Redis memory pressure and eviction policy
-2. Review token access patterns
-3. Check for cache invalidation bugs
-4. Verify cache warming on startup
-
-### Known false-positive conditions
-- After verifier pod restart (cache needs to warm)
-- After Redis restart
-
----
-
-## redis-unavailable
-
-### What this alert means
-The verifier is unable to connect to Redis. It should fall back to PostgreSQL but with degraded latency.
-
-### Immediate mitigation steps (first 5 minutes)
-1. Check Redis cluster health: `redis-cli cluster info`
-2. Check network connectivity to Redis
-3. Verify verifier is falling back to PostgreSQL correctly
-4. Check Redis for OOM or connection limit issues
-
-### How to verify recovery
-- Redis connection errors stop
-- Latency returns to normal (sub-5ms)
-
-### Root cause investigation steps
-1. Check Redis logs for errors
-2. Review network configuration and firewall rules
-3. Check for Redis cluster failover events
-4. Verify Redis resource limits
-
-### Known false-positive conditions
-- During planned Redis maintenance
-
----
-
-## archival-issues
-
-### Archival Job Failed
-
-#### What this means
-The audit archival job failed to complete successfully.
-
-#### Immediate steps
-1. Check archiver logs: `kubectl logs -l app=audit-archiver -n agentauth`
-2. Verify database connectivity
-3. Check cold storage (S3/GCS) access
-
-#### Recovery verification
-- Next scheduled job completes successfully
-- `agentauth_archival_job_status` returns to 1
-
-### Partition Creation Failed
-
-#### What this means
-Failed to create next month's audit partition. If not fixed, audit writes will fail when the current partition ends.
-
-#### Immediate steps
-1. **This is critical** - manually create the partition if needed
-2. Check database connectivity
-3. Check for disk space issues
-4. Verify database user permissions
-
-#### Recovery verification
-- Partition exists for next month
-- `agentauth_partition_creation_status` returns to 1
-
-### Cold Storage Upload Failed
-
-#### What this means
-Archived audit data failed to upload to cold storage (S3/GCS).
-
-#### Immediate steps
-1. Check cloud provider credentials
-2. Verify bucket exists and is accessible
-3. Check for network issues to cloud storage
-
-#### Recovery verification
-- Upload succeeds on retry
-- `agentauth_cold_storage_upload_status` returns to 1
-
----
-
-## General Troubleshooting
-
-### Useful Commands
-
-```bash
-# Check all pod status
-kubectl get pods -n agentauth
-
-# Check recent events
-kubectl get events -n agentauth --sort-by='.lastTimestamp'
-
-# Check service logs
-kubectl logs -l app=registry -n agentauth --tail=100
-kubectl logs -l app=verifier -n agentauth --tail=100
-
-# Check metrics endpoint
-kubectl port-forward svc/registry-metrics 9090:9090 -n agentauth
-curl localhost:9090/metrics
-
-# Check Redis
-redis-cli -h $REDIS_HOST cluster info
-redis-cli -h $REDIS_HOST info memory
-
-# Check PostgreSQL
-psql -h $PG_HOST -U agentauth -c "SELECT count(*) FROM pg_stat_activity"
-```
-
-### Escalation Path
-
-1. **P1 (Critical)**: Page on-call immediately
-   - Token verification down
-   - Audit buffer full
-   - Nonce store full
-
-2. **P2 (High)**: Page during business hours
-   - High latency
-   - Circuit breakers open
-   - SLO budget burning fast
-
-3. **P3 (Medium)**: Ticket for next business day
-   - Replica lag
-   - Cache hit ratio low
-   - Archival issues
diff --git a/docs/threat-model.md b/docs/threat-model.md
deleted file mode 100644
index 4892530..0000000
--- a/docs/threat-model.md
+++ /dev/null
@@ -1,334 +0,0 @@
-# AgentAuth Threat Model
-
-This document identifies security threats to the AgentAuth system and describes the mitigations implemented, residual risks, and detection mechanisms for each threat vector.
-
-## Overview
-
-AgentAuth is a capability-based authentication system for AI agents. The system involves:
-- **Registry Service**: Issues and manages agent access tokens (AATs)
-- **Verifier Service**: Validates tokens for service providers
-- **Agent SDK**: Client library for agents to authenticate
-- **Approval UI**: Human-facing interface for capability approvals
-- **Service Providers**: Third-party services that accept AgentAuth tokens
-
-## Threat Vectors
-
----
-
-### 1. Stolen Registry Signing Key
-
-**Attack Description:**
-An attacker obtains the registry's private signing key, enabling them to forge arbitrary AATs and capability grants. This is a catastrophic compromise that would allow impersonation of any agent.
- -**Mitigations Implemented:** -- Registry signing keys are stored exclusively in Hardware Security Modules (HSM) via KMS backends (AWS KMS, GCP Cloud KMS, or HashiCorp Vault Transit) -- Keys never exist in plaintext form on any server - all signing operations occur within the HSM -- The `InMemorySigningBackend` and `PlaintextKeyfile` backends are disabled in production via compile-time feature flags and CI checks -- Key rotation is supported via the `key_id` field in tokens and the `/well-known/agentauth/keys` endpoint - -**Residual Risk:** -- Compromise of cloud provider KMS infrastructure (extremely rare, covered by provider SLAs) -- Insider threat with KMS admin access - -**Detection:** -- Monitor KMS audit logs for unusual signing operations -- Alert on tokens signed with unknown `key_id` values -- Track signing operation volume - sudden spikes indicate compromise - ---- - -### 2. Stolen Agent Private Key - -**Attack Description:** -An attacker steals an agent's private key, allowing them to authenticate as that agent and perform actions within the agent's granted capabilities. - -**Mitigations Implemented:** -- Agent keys are stored in KMS, never as plaintext in the agent's runtime environment -- OTP-based bootstrap flow ensures agents never handle raw private keys -- DPoP (Demonstration of Proof of Possession) sender-constraint requires proof of key possession for every authenticated request -- Short token lifetimes (15 minutes maximum) limit the window of exploitation - -**Residual Risk:** -- Compromise of the KMS where agent keys are stored -- If an attacker also has network MITM capability during the 15-minute token window - -**Detection:** -- Monitor for DPoP proofs signed with keys not matching the `cnf` claim -- Alert on authentication from unexpected IP addresses/regions -- Track behavioral anomalies (sudden capability usage patterns) - ---- - -### 3. 
Phished Human Principal Credential - -**Attack Description:** -An attacker tricks a human principal into approving malicious capability grants through phishing or social engineering. - -**Mitigations Implemented:** -- WebAuthn/Passkey required for approval assertions - phishing-resistant by design -- Approval assertions are cryptographically signed and bound to the specific capability set shown -- Two-step confirmation required for dangerous capabilities (Transact, Delete) -- Capability descriptions rendered in plain English to prevent confusion - -**Residual Risk:** -- Real-time phishing where attacker proxies the legitimate UI -- Social engineering to approve legitimate-looking but malicious requests - -**Detection:** -- Monitor for unusual approval patterns (time, location, frequency) -- Alert on approvals from new devices -- Audit log all approval decisions with human-readable capability descriptions - ---- - -### 4. AAT Interception and Replay - -**Attack Description:** -An attacker intercepts a valid AAT from network traffic and attempts to reuse it. - -**Mitigations Implemented:** -- Nonce-based replay prevention: each token usage includes a unique nonce stored in Redis -- DPoP sender-constraint: tokens are bound to a specific keypair; replay without the private key fails -- Short token lifetimes (15 minutes) minimize replay window -- TLS required for all communications - -**Residual Risk:** -- If an attacker compromises both the AAT and the agent's DPoP private key -- Redis failure allowing nonce storage bypass - -**Detection:** -- Alert on nonce replay attempts (logged with source IP) -- Monitor for high-volume verification requests with identical nonces -- Track verification failures with "nonce already used" errors - ---- - -### 5. AAT Claims Forgery - -**Attack Description:** -An attacker attempts to modify token claims (capabilities, expiry, service provider binding) to escalate privileges. 
- -**Mitigations Implemented:** -- All token claims are covered by the registry's Ed25519 signature -- `key_id` field is verified before selecting the public key for verification -- Tampered claims cause signature verification failure -- Verification uses constant-time comparison (via `subtle` crate) to prevent timing attacks - -**Residual Risk:** -- Theoretical cryptographic break of Ed25519 (currently considered infeasible) - -**Detection:** -- Log all verification failures with reason codes -- Alert on repeated forgery attempts from the same source -- Monitor for attempts to use old/rotated `key_id` values - ---- - -### 6. Cross-Service-Provider Token Reuse - -**Attack Description:** -An attacker takes a token issued for Service Provider A and attempts to use it at Service Provider B. - -**Mitigations Implemented:** -- Every AAT contains a `service_provider_id` claim binding it to a specific service provider -- Verifiers must validate that the `service_provider_id` matches their own identity -- DPoP proofs include the target URL, preventing replay across different endpoints - -**Residual Risk:** -- Service provider misconfiguration not checking `service_provider_id` - -**Detection:** -- Log service_provider_id mismatches at verification time -- Alert on tokens verified by unexpected service providers (via audit logs) - ---- - -### 7. Malicious Service Provider Forging Audit Records - -**Attack Description:** -A compromised or malicious service provider attempts to forge audit records to hide unauthorized access or frame other entities. 
- -**Mitigations Implemented:** -- Audit events include a hash chain: each event contains `previous_event_hash` -- Registry signs all audit records with `registry_signature` -- `UPDATE` and `DELETE` operations are revoked at the database level for the service role -- Audit events are immutable and append-only - -**Residual Risk:** -- Registry compromise allowing signing of malicious audit records -- Database admin with elevated privileges - -**Detection:** -- Audit chain integrity verification endpoint (`/v1/audit/:agent_id/verify`) -- Alert on hash chain breaks or missing events -- Regular automated chain integrity checks - ---- - -### 8. Approval UI CSRF - -**Attack Description:** -An attacker tricks a logged-in human principal into submitting an approval request through a malicious website. - -**Mitigations Implemented:** -- `SameSite=Strict` cookie policy prevents cross-site request inclusion -- Double Submit Cookie pattern: CSRF token in cookie and request body must match -- `Origin` header validation rejects requests from unexpected origins -- Approval assertion is cryptographically signed via WebAuthn - cannot be forged without the user's authenticator - -**Residual Risk:** -- Browser vulnerabilities bypassing SameSite -- XSS in the approval UI itself (mitigated by CSP) - -**Detection:** -- Log requests with missing or mismatched CSRF tokens -- Alert on approval attempts from unexpected origins -- Monitor for patterns indicating automated CSRF attempts - ---- - -### 9. Grant Request Flooding / Approval Spam - -**Attack Description:** -An attacker floods the system with grant requests or approval submissions to overwhelm human reviewers or cause denial of service. 
- -**Mitigations Implemented:** -- Maximum 5 pending approval requests per agent at any time -- Approval requests expire after 1 hour if not acted upon -- Denied requests trigger exponential backoff cooldown: 1h, 4h, 24h -- Rate limiting at load balancer, middleware, and SDK levels - -**Residual Risk:** -- Distributed attack from many compromised agents -- Resource exhaustion if flood protection thresholds are too high - -**Detection:** -- Monitor pending approval counts per agent -- Alert on agents hitting the pending limit repeatedly -- Track denial rates and cooldown trigger frequency - ---- - -### 10. Agent Manifest Spoofing / Impersonation - -**Attack Description:** -An attacker creates a fake agent manifest claiming to be a legitimate agent or claiming capabilities beyond what should be allowed. - -**Mitigations Implemented:** -- Agent manifests are signed and registered through the registry -- `model_origin` field tracks the source model provider -- Registry validates manifest claims during registration -- Capability grants cannot exceed what was declared in the original manifest - -**Residual Risk:** -- Compromised agent provisioning pipeline -- Social engineering to get a malicious manifest approved - -**Detection:** -- Audit log all manifest registrations -- Alert on capability requests exceeding manifest declarations -- Monitor for manifests claiming sensitive `model_origin` values - ---- - -### 11. Registry Compromise - -**Attack Description:** -An attacker gains control of the registry service, potentially accessing all agent data and signing keys. 
- -**Mitigations Implemented:** -- Signing keys stored in HSM - even full registry compromise cannot extract raw keys -- Registry does not store tokens - only issues them -- Write operations require proper authentication -- Separation of registry (write-heavy) and verifier (read-only) services limits blast radius -- Database credentials are minimal-privilege - -**Residual Risk:** -- Attacker could issue new tokens during compromise window -- Access to agent metadata and grant history - -**Detection:** -- Intrusion detection on registry hosts -- Anomaly detection on token issuance rates -- File integrity monitoring on registry binaries -- Alert on unusual database query patterns - ---- - -### 12. Supply Chain Attack on SDK - -**Attack Description:** -An attacker compromises the SDK build process or dependencies to inject malicious code that exfiltrates tokens or keys. - -**Mitigations Implemented:** -- `cargo-deny` enforces license compliance and bans known-malicious crates -- `cargo audit` checks for known vulnerabilities in dependencies -- SDK makes no network requests except to configured registry/KMS endpoints -- No telemetry or analytics in the SDK -- Banned crates list includes native-tls (uses rustls only) - -**Residual Risk:** -- Zero-day in a dependency before it's added to advisory database -- Compromise of crates.io infrastructure - -**Detection:** -- Reproducible builds enable verification -- Network monitoring can detect unexpected outbound connections -- Dependency diff review in CI for any new dependencies - ---- - -### 13. Secret Zero / First Provisioning - -**Attack Description:** -An attacker intercepts the initial provisioning process to obtain or substitute agent credentials. 
-
-**Mitigations Implemented:**
-- OTP (one-time password) bootstrap flow: the agent receives a single-use provisioning token
-- The OTP is immediately invalidated after first use
-- The keypair is generated inside the KMS - the agent receives only a key reference, never the raw key
-- Reuse of an OTP returns `409 Conflict` and emits a security audit event
-
-**Residual Risk:**
-- OTP interception during initial deployment
-- Compromise of the system distributing OTPs
-
-**Detection:**
-- Audit log all bootstrap attempts
-- Alert on OTP reuse attempts
-- Monitor for bootstrap requests from unexpected sources
-
----
-
-## Security Invariants
-
-The following invariants must hold for the system to be secure:
-
-1. **No plaintext keys in production**: `InMemorySigningBackend` and `PlaintextKeyfile` are never instantiated outside `#[cfg(test)]`
-2. **Constant-time comparisons**: All secret comparisons use `subtle::ConstantTimeEq` or ed25519-dalek's internal constant-time verification
-3. **TLS everywhere**: No service starts without TLS configured
-4. **Audit atomicity**: Audit write failures cause the primary operation to fail
-5. **Nonce uniqueness**: Every token usage has a unique nonce that cannot be replayed
-6. **DPoP binding**: Tokens without valid DPoP proofs are rejected
-7. **Capability boundary**: Agents cannot request capabilities beyond their manifest
-
----
-
-## Incident Response
-
-In the event of a security incident:
-
-1. **Immediate**: Revoke affected tokens; rotate compromised keys via the KMS
-2. **Short-term**: Review audit logs to determine the blast radius; notify affected service providers
-3. **Long-term**: Perform root cause analysis, implement additional mitigations, and update the threat model
-
----
-
-## Review Schedule
-
-This threat model should be reviewed:
-- After any significant architectural change
-- After any security incident
-- At minimum quarterly
-
-Last reviewed: Stage 5 implementation

From d544bb7060ac47d6636881825e86e6460c38b973 Mon Sep 17 00:00:00 2001
From: Max Malkin
Date: Tue, 3 Mar 2026 11:34:47 -0700
Subject: [PATCH 2/2] update README
---
 README.md | 34 +++++-----------------------------
 1 file changed, 5 insertions(+), 29 deletions(-)

diff --git a/README.md b/README.md
index 0d186c6..18c75b8 100644
--- a/README.md
+++ b/README.md
@@ -75,7 +75,7 @@ cargo build --workspace
cargo nextest run --workspace
```

-### Running the Services
+### Running Services

The easiest way to run all services locally is with the dev runner script:

```bash
./dev.sh
```

-This starts the registry, verifier, and approval UI in a single terminal with colored log output. Press Ctrl+C to stop all services.
+This starts the registry, verifier, and approval UI in a single terminal with colored log output. 
-To run services individually:
+To run each service individually:

```bash
# Start the registry service
@@ -119,12 +119,8 @@ agentauth/
├── load-tests/ # k6 load test scripts
├── chaos/ # Chaos engineering experiments
-├── deploy/
-│ ├── helm/ # Kubernetes Helm charts
-│ └── grafana/ # Grafana dashboards
-└── docs/
- ├── threat-model.md # Security threat model
- ├── runbook.md # Operations runbook
- └── capacity-planning.md # Sizing guidelines
+└── deploy/
+ ├── helm/ # Kubernetes Helm charts
+ └── grafana/ # Grafana dashboards
```

## SDK Usage

@@ -172,8 +168,6 @@ headers = await client.authenticate_headers("service-provider-id", "POST", "/api...

## Security

-AgentAuth is designed with security as a primary concern:
-
- All signing keys stored in HSMs (AWS KMS, GCP Cloud KMS, Vault Transit)
- DPoP sender-constraint prevents token theft
- Nonce-based replay prevention
@@ -181,24 +175,6 @@ AgentAuth is designed with security as a primary concern:
- Immutable audit log with hash chain integrity
- WebAuthn/Passkey for human approval signing

-See [docs/threat-model.md](docs/threat-model.md) for the full threat model.
-
-## Performance
-
-Target performance characteristics:
-
-| Operation | Throughput | p99 Latency |
-|-----------|------------|-------------|
-| Token verification (warm) | 10,000 req/s | < 5ms |
-| Token verification (cold) | 1,000 req/s | < 20ms |
-| Token issuance | 500 req/s | < 50ms |
-
-## Documentation
-
-- [Threat Model](docs/threat-model.md) - Security analysis and mitigations
-- [Operations Runbook](docs/runbook.md) - Alert response procedures
-- [Capacity Planning](docs/capacity-planning.md) - Sizing and scaling guidelines
-
## License

MIT License
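
Aside for reviewers: the README's Security section keeps the "nonce-based replay prevention" bullet. The idea can be pictured with a toy sketch — illustrative only; `NonceCache` and `check_and_store` are hypothetical names, not the AgentAuth SDK API, and a production verifier would also expire entries and share this state across replicas:

```python
# Toy illustration of nonce-based replay prevention (hypothetical names,
# not the AgentAuth API). A verifier accepts a nonce the first time it is
# presented and rejects any replay of the same nonce.
import secrets


class NonceCache:
    """Remembers nonces already seen."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def check_and_store(self, nonce: str) -> bool:
        """Return True on first use of a nonce, False on replay."""
        if nonce in self._seen:
            return False
        self._seen.add(nonce)
        return True


cache = NonceCache()
nonce = secrets.token_urlsafe(16)  # fresh nonce attached to one token usage
assert cache.check_and_store(nonce)      # first presentation accepted
assert not cache.check_and_store(nonce)  # replay rejected
```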