TARSy (Thoughtful Alert Response System) is an intelligent Site Reliability Engineering system that automatically processes alerts through sequential agent chains, retrieves runbooks, and uses MCP (Model Context Protocol) servers to gather system information for comprehensive multi-stage incident analysis.
Inspired by the spirit of sci-fi AI, TARSy is your reliable companion for SRE operations. 🤖
*Demo video: demo-parallel-agents.webm*
- README.md: This file - project overview and quick start
- docs/architecture-overview.md: High-level architecture concepts and design principles
- docs/functional-areas-design.md: Functional areas design and architecture documentation
- Python 3.13+ - Core backend runtime
- Node.js 18+ - Frontend development and build tools
- npm - Node.js package manager (comes with Node.js)
- uv - Modern Python package and project manager
  - Install: `pip install uv`
  - Alternative: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Podman (or Docker) - Container runtime
- podman-compose - Multi-container application management
  - Install: `pip install podman-compose`
Quick Check: Run `make check-prereqs` to verify all prerequisites are installed.
```bash
# 1. Initial setup (one-time only)
make setup

# 2. Configure API keys (REQUIRED)
# Edit backend/.env and set your API keys:
# - GOOGLE_API_KEY (get from https://aistudio.google.com/app/apikey)
# - GITHUB_TOKEN (get from https://github.com/settings/tokens)

# 3. Ensure Kubernetes/OpenShift access (REQUIRED)
# See [K8s Access Requirements](#k8s-access-reqs) section below for details

# 4. Start all services
make dev
```

Services will be available at:
- 🖥️ TARSy Dashboard: http://localhost:5173
- Manual Alert Submission: http://localhost:5173/submit-alert
- 🔧 Backend API: http://localhost:8000 (docs at /docs)

Stop all services: `make stop`
For production-like testing with containerized services, authentication, and database:
```bash
# 1. Initial setup (one-time only)
make setup

# 2. Configure API keys and OAuth (REQUIRED)
# Edit backend/.env and set your API keys + OAuth configuration
# - See config/README.md for OAuth2 proxy customization (client IDs, secrets, org/team)
# - See docs/oauth2-proxy-setup.md for detailed GitHub OAuth setup guide
# - Configure LLM providers in backend/.env (GOOGLE_API_KEY, etc.)

# 3. Deploy complete containerized stack
make containers-deploy        # Preserves database data (recommended)
# OR for fresh start:
make containers-deploy-fresh  # Clean rebuild including database
```

Services will be available at:
- 🖥️ TARSy Dashboard: http://localhost:8080 (with OAuth authentication)
- 🔧 Backend API: http://localhost:8080/api (protected by OAuth2-proxy)
- 🗄️ PostgreSQL Database: localhost:5432 (admin access)
Container Management:
- Update apps (preserve database): `make containers-deploy`
- Fresh deployment: `make containers-deploy-fresh`
- Stop containers: `make containers-stop`
- View logs: `make containers-logs`
- Check status: `make containers-status`
- Clean up: `make containers-clean` (removes all containers and data)
For deploying TARSy to OpenShift or Kubernetes clusters:
```bash
# Complete deployment with local builds
make openshift-deploy
```

📖 For complete OpenShift deployment guide: See deploy/README.md
This deployment is designed for development and testing environments, serving as a reference for production deployments in separate repositories.
- 🛠️ Configuration-Based Agents: Deploy new agents and chain definitions via YAML configuration without code changes
- 🔧 Flexible Alert Processing: Accept arbitrary JSON payloads from any monitoring system
- 🧠 Chain-Based Agent Architecture: Specialized agents with domain-specific tools and AI reasoning working in coordinated stages
- ⚡ Parallel Agent Execution: Run multiple agents concurrently for independent domain investigation with automatic synthesis. Supports multi-agent parallelism, replica parallelism for redundancy, and comparison parallelism for A/B testing different LLM providers or strategies
- 🔌 MCP Server Integration: Agents dynamically connect to MCP servers for domain-specific tools (kubectl, database clients, monitoring APIs). Add new MCP servers via configuration without code changes
- 🤖 Multi-LLM Provider Support: Configure and switch between multiple LLM providers (OpenAI, Google, Anthropic, xAI, etc.) via YAML, or define your own. Optional Google Search grounding for Gemini models enhances responses with real-time web information. Native thinking mode for Gemini 2.0+ provides visible internal reasoning and reliable structured tool calling
- 📚 GitHub Runbook Integration: Optional automatic retrieval and inclusion of relevant runbooks from GitHub repositories per agent chain. Contextualizes investigations with team knowledge
- 📋 Comprehensive Audit Trail: Complete visibility into chain processing workflows with stage-level timeline reconstruction
- 🖥️ SRE Dashboard: Real-time monitoring with live LLM streaming and interactive chain timeline visualization
- 💬 Follow-up Chat: Continue investigating after sessions complete - ask clarifying questions, request deeper analysis, or explore different aspects with full context and tool access
- ⏸️ Pause & Resume: Long-running investigations automatically pause at iteration limits and can be resumed with one click, preserving full conversation state and continuing exactly where they left off. For parallel stages, only paused agents re-execute while completed results are preserved
- 🔒 Data Masking: Hybrid masking system combining code-based structural analysis (Kubernetes Secrets) with regex patterns (API keys, passwords, certificates, emails, SSH keys) to automatically protect sensitive data in MCP responses and alert payloads
- 📝 Tool Result Summarization: Automatic summarization of verbose MCP tool outputs using LLM-powered analysis. Reduces token usage and improves agent reasoning by focusing on relevant information while preserving full results in audit logs
TARSy uses an AI-powered, chain-based architecture: alerts flow through sequential stages of specialized agents, each building on the previous stage's work and using domain-specific tools, to deliver comprehensive expert recommendations to engineers.
📖 For high-level architecture concepts: See Architecture Overview
- Alert arrives from monitoring systems with flexible JSON payload
- Orchestrator selects appropriate agent chain based on alert type
- Runbook downloaded automatically from GitHub for chain guidance
- Sequential stages execute where each agent builds upon previous stage data using AI to select and execute domain-specific tools
- Stages can run multiple agents in parallel for independent investigation
- Parallel results automatically synthesized into unified analysis
- Automatic pause if investigation reaches iteration limits - preserves full state and allows manual resume with one click
- Comprehensive multi-stage analysis provided to engineers with actionable recommendations
- Follow-up chat available after investigation completes - engineers can ask questions, request more comprehensive analysis, or explore different aspects
- Full audit trail captured with stage-level detail for monitoring and continuous improvement
```mermaid
sequenceDiagram
    participant MonitoringSystem
    participant Orchestrator
    participant AgentChains
    participant GitHub
    participant AI
    participant MCPServers
    participant Dashboard
    participant Engineer

    MonitoringSystem->>Orchestrator: Send Alert
    Orchestrator->>AgentChains: Assign Alert & Context
    AgentChains->>GitHub: Download Runbook
    loop Investigation Loop
        AgentChains->>AI: Investigate with LLM
        AI->>MCPServers: Query/Actuate as needed
    end
    AgentChains->>Dashboard: Send Analysis & Recommendations
    Engineer->>Dashboard: Review & Take Action
```
- Start All Services: Run `make dev` to start the backend and dashboard
- Submit an Alert: Use Manual Alert Submission at http://localhost:5173/submit-alert for testing TARSy
- Monitor via Dashboard: Watch real-time progress updates and historical analysis at http://localhost:5173
- View Results: See detailed processing timelines and comprehensive LLM analysis
- Stop Services: Run `make stop` when finished
- Deploy Stack: Run `make containers-deploy` (preserves database) or `make containers-deploy-fresh` (clean start)
- Login: Navigate to http://localhost:8080 and authenticate via GitHub OAuth
- Submit Alert: Use the dashboard at http://localhost:8080/submit-alert (OAuth protected)
- Monitor Processing: Watch real-time progress with full audit trail
- Stop Containers: Run `make containers-stop` when finished
Tip: Use `make status` or `make containers-status` to check which services are running.
The containerized deployment provides a production-like environment with:
- 🔐 OAuth2 Authentication: GitHub OAuth integration via oauth2-proxy
- 🌐 Reverse Proxy: Nginx handles all traffic routing and CORS
- 🗄️ PostgreSQL Database: Persistent storage for processing history
- 📦 Production Builds: Optimized frontend and backend containers
- 🔒 Security: All API endpoints protected behind authentication
Architecture Overview:
```
Browser → Nginx (8080) → OAuth2-Proxy → Backend (FastAPI)
                       → Dashboard (Static Files)
```
📖 For OAuth2-proxy setup instructions: See docs/oauth2-proxy-setup.md
The system now supports flexible alert types from any monitoring source:
- Kubernetes Agent: Processes alerts from Kubernetes clusters (namespaces, pods, services, etc.)
- Any Monitoring System: Accepts arbitrary JSON payloads from Prometheus, AWS CloudWatch, ArgoCD, Datadog, etc.
- Agent-Agnostic Processing: New alert types can be added by creating specialized agents and updating the agent registry
- LLM-Driven Analysis: Agents intelligently interpret any alert data structure without code changes to core system
The LLM-driven approach with flexible data structures means diverse alert types can be handled from any monitoring source, as long as:
- A runbook exists for the alert type
- An appropriate specialized agent is available or can be created
- The MCP servers have relevant tools for the monitoring domain
TARSy requires read-only access to a Kubernetes or OpenShift cluster to analyze and troubleshoot Kubernetes infrastructure issues. The system uses the kubernetes-mcp-server, which connects to your cluster via kubeconfig.
TARSy does not use oc or kubectl commands directly. Instead, it:
- Uses Kubernetes MCP Server: All cluster access goes through the kubernetes-mcp-server
- Reads kubeconfig: Authenticates using your existing kubeconfig file
- Read-Only Operations: Configured with `--read-only --disable-destructive` flags
- No Modifications: Cannot create, update, or delete cluster resources (an optional way to double-check your credentials is sketched below)
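The `--read-only --disable-destructive` guardrails are enforced by the MCP server itself, independent of your cluster credentials. If you also want the credentials themselves to be least-privilege, you can probe them with standard `oc` commands. This is an optional sanity check, not something TARSy runs:

```bash
# Read access is required for TARSy's investigations
oc auth can-i get pods           # expected: yes
oc auth can-i list events        # expected: yes

# Mutating verbs are blocked by the MCP server flags either way,
# but for a least-privilege setup they should ideally be denied too
oc auth can-i delete pods        # ideally: no
oc auth can-i create deployments # ideally: no
```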
If you're already logged into your OpenShift/Kubernetes cluster:
```bash
# Verify your current access
oc whoami
oc cluster-info

# TARSy will automatically use your current kubeconfig
# Default location: ~/.kube/config or $KUBECONFIG
```

To use a specific kubeconfig file:
```bash
# Set in backend/.env
KUBECONFIG=/path/to/your/kubeconfig

# Or set environment variable
export KUBECONFIG=/path/to/your/kubeconfig
```

API Endpoints:

- `GET /health` - Comprehensive health check with service status and warnings (HTTP 503 for degraded/unhealthy)
- `POST /api/v1/alerts` - Submit a new alert for processing (returns a `session_id` immediately; see the curl sketch after this list)
  - Optional alert_type: The `alert_type` field is optional and defaults to the configured default (typically "kubernetes")
  - Custom MCP Configuration: Optionally override the default agent MCP server configuration via the `mcp` field in the request payload. This allows you to specify which MCP servers and tools to use for processing, providing fine-grained control over available tooling per alert.
- `GET /api/v1/alert-types` - Get supported alert types and the default alert type
- `WebSocket /api/v1/ws` - Real-time progress updates via WebSocket with channel subscriptions
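For example, a minimal submission with curl might look like the sketch below (assuming the development setup, where the backend listens on http://localhost:8000). Everything in the payload other than `alert_type` is illustrative, since the endpoint accepts arbitrary JSON; the authoritative request and response schemas are in the OpenAPI docs at /docs:

```bash
# Submit an alert with an arbitrary JSON payload (field names are illustrative).
# alert_type is optional and falls back to the configured default.
curl -s -X POST http://localhost:8000/api/v1/alerts \
  -H "Content-Type: application/json" \
  -d '{
        "alert_type": "kubernetes",
        "namespace": "payments",
        "message": "CrashLoopBackOff on pod payments-api"
      }'

# The response includes a session_id; follow it in the dashboard or via
# GET /api/v1/history/sessions/{session_id}.
```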
- `GET /api/v1/history/sessions` - List alert processing sessions with filtering and pagination
- `GET /api/v1/history/sessions/{session_id}` - Get detailed session with chronological timeline
- `GET /api/v1/history/sessions/{session_id}/final-analysis` - Get final analysis and executive summary with optional LLM conversation history
  - Query params: `include_conversation=true` (analysis conversation), `include_chat_conversation=true` (chat conversation)
- `POST /api/v1/history/sessions/{session_id}/resume` - Resume a paused session from where it left off. Session must be in `PAUSED` state
- `POST /api/v1/history/sessions/{session_id}/cancel` - Cancel an active or paused session. Session must not be in a terminal state (COMPLETED, FAILED, CANCELLED)
- `POST /api/v1/sessions/{session_id}/chat` - Create a follow-up chat for a completed session (see the sketch after this list)
- `GET /api/v1/sessions/{session_id}/chat-available` - Check if chat is available for a session
- `POST /api/v1/chats/{chat_id}/messages` - Send a message to the chat (AI response streams via WebSocket)
- `GET /api/v1/chats/{chat_id}` - Get chat details
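A rough end-to-end chat flow is sketched below, again against the development backend on http://localhost:8000. The `message` request field and the `$SESSION_ID`/`$CHAT_ID` variables are assumptions for illustration; check the OpenAPI docs at /docs for the exact schemas:

```bash
# Illustrative flow; exact request/response shapes may differ - see /docs.

# 1. Confirm chat is available for a completed session
curl -s http://localhost:8000/api/v1/sessions/$SESSION_ID/chat-available

# 2. Create a follow-up chat (the response should contain a chat id)
curl -s -X POST http://localhost:8000/api/v1/sessions/$SESSION_ID/chat

# 3. Ask a question; the AI response streams over the WebSocket
#    (/api/v1/ws), so watch the dashboard for the answer
curl -s -X POST http://localhost:8000/api/v1/chats/$CHAT_ID/messages \
  -H "Content-Type: application/json" \
  -d '{"message": "Which pod restarts correlate with the OOM events?"}'
```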
- `GET /api/v1/system/warnings` - Active system warnings (MCP/LLM init failures, etc.)
- `GET /api/v1/system/mcp-servers` - Get available MCP servers and their tools (used for custom MCP configuration)
- Alert Types: Define any alert type in `config/agents.yaml` - no hardcoding required, just create corresponding runbooks
- MCP Servers: Define custom MCP servers in `config/agents.yaml` with support for stdio, HTTP, and SSE transports. Can override built-in MCP servers (e.g., customize kubernetes-server with a specific kubeconfig)
- Agents: Create traditional hardcoded agent classes extending BaseAgent, or define configuration-based agents in `config/agents.yaml` (a hypothetical fragment follows this list). Can override built-in agents to customize behavior
- Chains: Define multi-stage workflows in `config/agents.yaml`. Can override built-in chains to customize investigation workflows
- LLM Providers: Built-in providers work out-of-the-box (OpenAI, Google, xAI, Anthropic, Vertex AI). Add custom providers via `config/llm_providers.yaml` for proxy configurations or model overrides
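To give a flavor of the configuration surface, here is a hypothetical `config/agents.yaml` fragment. The key names and structure are illustrative only, not the authoritative schema; see the Architecture Overview and the shipped example configuration for the real format:

```yaml
# Hypothetical fragment: illustrative key names, not the authoritative schema.
mcp_servers:
  monitoring-server:
    transport: stdio              # stdio, HTTP, and SSE transports are supported
    command: "./monitoring-mcp"   # launcher for a custom MCP server

agents:
  monitoring-agent:
    mcp_servers:
      - monitoring-server
    custom_instructions: "Correlate alerts across services before concluding."

chains:
  monitoring-investigation:
    alert_types:
      - prometheus-alert
    stages:
      - name: data-collection     # stages run sequentially, building on each other
        agent: monitoring-agent
      - name: analysis
        agent: monitoring-agent
```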
📖 For detailed extensibility examples: See Extensibility section in the Architecture Overview
TARSy uses Alembic for database schema versioning and migrations. The migration system automatically applies pending migrations on startup, ensuring your database schema is always up-to-date.
Quick Migration Workflow:
```bash
# 1. Modify SQLModel in backend/tarsy/models/

# 2. Generate migration from model changes
make migration msg="Add new field to AlertSession"

# 3. Review generated file in backend/alembic/versions/

# 4. Test migration
make migration-upgrade

# 5. If needed, rollback
make migration-downgrade
```

Available Migration Commands:
```bash
make migration msg="Description"  # Generate migration from model changes
make migration-manual msg="Desc"  # Create empty migration for manual changes
make migration-upgrade            # Apply all pending migrations
make migration-downgrade          # Rollback last migration
make migration-status             # Show current database version
make migration-history            # Show full migration history
```

📖 For complete migration documentation: See docs/database-migrations.md
```bash
# Run back-end and front-end (dashboard) tests
make test
```

The test suite includes comprehensive end-to-end integration tests covering the complete alert processing pipeline, agent specialization, error handling, and performance scenarios, with full mocking of external services.
