A visual mission control for AI-powered SRE and support agents — watch your autonomous Ops assistant investigate and remediate incidents in real time.
NEW in v2.1 (December 2025):
- ⭐ Security Middleware Enforced - CORS, Helmet, rate limiting now actively protecting all endpoints
- ⭐ Concurrent Mission Queue - 3 missions execute simultaneously (up to 3x throughput compared with sequential execution)
- ⭐ 80+ Comprehensive Tests - Full coverage of auth, middleware, and API routes
- ⭐ React Performance Optimized - memo, useCallback, useMemo prevent unnecessary re-renders
- ⭐ Structured Logging - Winston logger replaces console, ESLint enforced
- ⭐ OpenAPI/Swagger Docs - Complete API documentation with schemas
- ✅ BullMQ Job Queue - Redis-backed queue with retry logic and progress tracking
- ✅ Queue Management API - Real-time queue status and job position endpoints
v2.0 Foundation:
- ✅ PostgreSQL + Prisma - Persistent database storage
- ✅ JWT Authentication - Secure user authentication and RBAC
- ✅ WebSocket Support - Real-time updates (no more polling!)
- ✅ Docker Deployment - Full containerization with docker-compose
- ✅ CI/CD Pipeline - Automated testing and builds via GitHub Actions
📖 Upgrade Guide | 🏗️ Architecture Docs | 📋 Changelog
Ops-Agent-Desktop is a visual "mission control" interface for autonomous operations agents. It provides:
- Command Console: Submit high-level mission prompts like "Diagnose 500 errors on checkout service" or "Fix the Nginx error on Server 4"
- Live Agent View: Real-time browser automation screenshots showing exactly what your agent is doing
- Mission Timeline: Step-by-step event log categorized by type (OBSERVATION, ACTION, RCA, REMEDIATION)
- RCA & Remediation Display: Integration points for AutoRCA-Core (graph-based root cause analysis) and Secure-MCP-Gateway (policy-based action approvals)
This project demonstrates how AI-powered SRE and support agents can be made transparent, auditable, and trustworthy through visual feedback and clear separation between read-only observations and write interventions.
- SRE/DevOps Engineers building autonomous reliability agents
- Platform Engineers integrating AI into incident response workflows
- Security Teams requiring human-in-the-loop approvals for AI-driven actions
- Engineering Leaders evaluating AI-powered ops tooling
- Researchers exploring agent architectures for autonomous operations
┌─────────────────────────────────────────────────────────────────┐
│ USER / OPERATOR │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ FRONTEND (React + TypeScript) │
│ ┌──────────────────────┐ ┌────────────────────────────┐ │
│ │ Command Console │ │ Live Agent View │ │
│ │ - Mission prompt │ │ - Browser screenshots │ │
│ │ - Timeline of steps │ │ - RCA summary display │ │
│ │ - Status badges │ │ - Remediation proposals │ │
│ └──────────────────────┘ └────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│ HTTP API + Polling
▼
┌─────────────────────────────────────────────────────────────────┐
│ BACKEND (Node.js + TypeScript) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Mission Orchestration Service │ │
│ │ - Parses mission prompts │ │
│ │ - Manages mission lifecycle (PENDING → RUNNING → DONE) │ │
│ │ - Coordinates browser automation + integrations │ │
│ └──────────┬────────────────────────────────┬──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────┐ ┌──────────────────────────┐ │
│ │ Browser Agent │ │ Integration Stubs │ │
│ │ (Playwright) │ │ - AutoRCA-Core (RCA) │ │
│ │ - Navigate dashboards│ │ - Secure-MCP-Gateway │ │
│ │ - Capture screenshots│ │ (Action Approvals) │ │
│ │ - Execute actions │ │ │ │
│ └──────────┬───────────┘ └──────────────────────────┘ │
└─────────────┼──────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ MOCK OPS DASHBOARD │
│ - Simulated service health metrics │
│ - Error logs (500 errors on checkout service) │
│ - Quick actions (Restart Service, Scale, Rollback) │
└─────────────────────────────────────────────────────────────────┘
FUTURE INTEGRATIONS:
┌─────────────────────────────────────────────────────────────────┐
│ - AutoRCA-Core: Graph-based RCA over logs/metrics/traces │
│ - Secure-MCP-Gateway: Policy enforcement + human approvals │
│ - Real dashboards: Grafana, Kibana, Datadog, PagerDuty, etc. │
└─────────────────────────────────────────────────────────────────┘
Instead of writing scripts, operators type natural language commands:
- "Diagnose 500 errors on checkout service"
- "Find the root cause of database connection pool exhaustion"
- "Restart the failed Nginx service on Server 4"
Watch your agent navigate dashboards, read logs, and interact with UIs — just like a human operator would. Every step is captured as a screenshot and logged.
All agent actions are categorized and timestamped:
- OBSERVATION: Read-only actions (viewing dashboards, reading logs)
- ACTION: Write interventions (restart service, scale replicas) — require gateway approval
- RCA: Root cause analysis steps (calling AutoRCA-Core)
- REMEDIATION: Proposed fixes (calling Secure-MCP-Gateway for approval)
Clear separation between safe observations and risky interventions. All ACTION steps are designed to flow through Secure-MCP-Gateway for policy-based approvals before execution.
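The split between read-only steps and write interventions can be expressed directly in the type system. The sketch below is illustrative only: the `MissionStep` shape and the `requiresGatewayApproval` and `canExecute` helpers are assumed names, not the project's actual definitions in backend/src/types/mission.ts.

```typescript
// Step categories as listed in the mission timeline.
type StepType = "OBSERVATION" | "ACTION" | "RCA" | "REMEDIATION";

// Hypothetical step shape; the real type may carry more fields.
interface MissionStep {
  type: StepType;
  description: string;
  timestamp: string; // ISO 8601
  approved?: boolean; // assumed flag, set once the gateway approves an ACTION
}

// Only write interventions need gateway approval.
// Observations, RCA steps, and remediation proposals do not execute changes.
function requiresGatewayApproval(step: MissionStep): boolean {
  return step.type === "ACTION";
}

// A step may run if it needs no approval, or if approval was granted.
function canExecute(step: MissionStep): boolean {
  return !requiresGatewayApproval(step) || step.approved === true;
}
```

Keeping the gate in one function means a new write-capable step type only has to be added in one place.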
Built to plug into:
- AutoRCA-Core: Agentic root cause analysis engine
- Secure-MCP-Gateway: Security-first MCP gateway for ops tools
- Real dashboards: Grafana, Kibana, Datadog, etc. (replace mock-app with your actual stack)
- Concurrent Execution: 3 missions run simultaneously via BullMQ job queue
- React Optimizations: Memoized components prevent unnecessary re-renders
- Queue Management: Track job position, progress, and estimated wait time
- Automatic Retries: 3 attempts with exponential backoff on failures
- Security Middleware: Helmet headers, CORS validation, rate limiting enforced
- Structured Logging: Winston logger with correlation IDs and audit trails
- 80+ Security Tests: Comprehensive coverage of auth flows and attack vectors
- XSS Protection: Automatic input sanitization on all endpoints
- No Console Logging: ESLint enforces structured logging only
- OpenAPI/Swagger: Complete API documentation with schemas
- Comprehensive Tests: 80+ tests covering critical security paths
- Type Safety: Enhanced TypeScript with strict mode
- Queue Visibility: Real-time metrics on mission queue status
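The retry behaviour listed above (3 attempts with exponential backoff) can be sketched without any queue library. This is a minimal illustration, not the project's BullMQ configuration: the function names and the 1000 ms base delay are assumptions.

```typescript
// Delay before each retry: baseMs, 2 * baseMs, 4 * baseMs, ...
// With 3 attempts there are at most 2 retries, so 2 delays.
function backoffDelays(maxAttempts: number, baseMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < maxAttempts - 1; retry++) {
    delays.push(baseMs * Math.pow(2, retry));
  }
  return delays;
}

// Runs `task` up to `maxAttempts` times, sleeping between failed attempts.
// `sleep` is injectable so the wrapper can be tested without real waiting.
async function withRetries<T>(
  task: () => Promise<T>,
  maxAttempts: number = 3,
  baseMs: number = 1000, // assumed base delay
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  const delays = backoffDelays(maxAttempts, baseMs);
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // No sleep after the final attempt.
      if (attempt < delays.length) await sleep(delays[attempt]);
    }
  }
  throw lastError;
}
```

In BullMQ itself, retries are normally configured per job through the `attempts` and `backoff` job options rather than hand-written; the wrapper above only shows the timing idea.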
- Node.js 18+ and npm (or pnpm/yarn)
- Playwright browsers (installed with npx playwright install chromium during setup)
# Clone the repository
git clone https://github.com/nik-kale/Ops-Agent-Desktop.git
cd Ops-Agent-Desktop
# Install dependencies for all workspaces
npm install
# Install Playwright browsers
npx playwright install chromium

You'll need three terminals to run the full stack:

npm run dev:backend
# Runs on http://localhost:3001
npm run dev:frontend
# Runs on http://localhost:5173
npm run dev:mock
# Runs on http://localhost:5174

Or run all three concurrently:

npm run dev

1. Open the Mission Control UI: Navigate to http://localhost:5173
2. Submit a Mission: In the command console (left panel), enter a mission prompt like:
   Diagnose 500 errors on checkout service
3. Watch the Agent Work:
   - The mission timeline will populate with real-time steps
   - Browser screenshots appear in the Live Agent View (right panel)
   - RCA summary and remediation proposals will appear as the mission progresses
4. Review Results:
   - See the root cause analysis from AutoRCA-Core (stubbed for MVP)
   - Review the proposed remediation plan
   - Observe intervention actions awaiting approval from Secure-MCP-Gateway
Ops-Agent-Desktop/
├── backend/ # Node.js + TypeScript backend
│ ├── src/
│ │ ├── api/ # Express API routes
│ │ ├── browser/ # Browser automation agent (Playwright)
│ │ ├── missions/ # Mission orchestration service
│ │ ├── types/ # TypeScript types
│ │ └── index.ts # Server entry point
│ ├── screenshots/ # Auto-generated mission screenshots
│ ├── package.json
│ └── tsconfig.json
│
├── frontend/ # React + TypeScript UI
│ ├── src/
│ │ ├── components/ # UI components (CommandConsole, LiveView)
│ │ ├── hooks/ # React hooks (useMission)
│ │ ├── services/ # API client
│ │ ├── types/ # TypeScript types
│ │ ├── App.tsx # Main app component
│ │ └── main.tsx # Entry point
│ ├── package.json
│ ├── tsconfig.json
│ └── vite.config.ts
│
├── mock-app/ # Simulated ops dashboard
│ ├── src/
│ │ ├── App.tsx # Mock dashboard UI
│ │ └── main.tsx
│ ├── package.json
│ └── vite.config.ts
│
├── package.json # Root workspace config
├── CLAUDE.md # Project instructions and design principles
└── README.md # This file
Here's what happens when you submit: "Diagnose 500 errors on checkout service"
1. Frontend → POST /api/missions → Backend
2. Backend creates a new Mission and starts the Browser Agent
3. Browser Agent:
   - Opens mock ops dashboard (http://localhost:5174)
   - Takes screenshot → OBSERVATION step
   - Detects error alert on dashboard → OBSERVATION step
   - Navigates to logs page → OBSERVATION step
   - Takes screenshot of error logs → OBSERVATION step
4. Backend calls AutoRCA-Core stub:
   - Analyzes logs/metrics/traces (simulated)
   - Returns RCA summary → RCA step
5. Backend calls Secure-MCP-Gateway stub:
   - Proposes remediation plan → REMEDIATION step
   - Awaits approval for intervention actions
6. Browser Agent (after approval):
   - Clicks "Restart Service" button → ACTION step
   - Takes final screenshot → OBSERVATION step
7. Mission status → COMPLETED
All steps are streamed to the frontend via polling and displayed in real-time.
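A polling loop for this flow might look like the sketch below. It is a hedged illustration rather than the project's useMission hook: `MissionSnapshot`, `pollMission`, the `FAILED` status, and the default interval are assumptions. The fetcher is injected so the loop can run without a server.

```typescript
// COMPLETED appears in the walkthrough; FAILED is an assumed terminal state.
type MissionStatus = "PENDING" | "RUNNING" | "COMPLETED" | "FAILED";

interface MissionSnapshot {
  id: string;
  status: MissionStatus;
  steps: unknown[];
}

function isTerminal(status: MissionStatus): boolean {
  return status === "COMPLETED" || status === "FAILED";
}

// Polls until the mission reaches a terminal status or maxPolls is exhausted.
async function pollMission(
  id: string,
  getMission: (id: string) => Promise<MissionSnapshot>,
  onUpdate: (mission: MissionSnapshot) => void,
  intervalMs: number = 1000, // assumed polling interval
  maxPolls: number = 600,
): Promise<MissionSnapshot> {
  for (let i = 0; i < maxPolls; i++) {
    const mission = await getMission(id);
    onUpdate(mission); // e.g. refresh the timeline in the UI
    if (isTerminal(mission.status)) return mission;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Mission ${id} did not finish after ${maxPolls} polls`);
}
```

In a browser, `getMission` could wrap a request to GET /api/missions/{id} and return the `mission` field of the JSON response.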
POST /api/missions
Content-Type: application/json
{
"prompt": "Diagnose 500 errors on checkout service"
}
Response: 201 Created
{
"missionId": "abc-123-def-456"
}

GET /api/missions/{id}
Response: 200 OK
{
"mission": {
"id": "abc-123",
"prompt": "...",
"status": "COMPLETED",
"steps": [...],
"rcaSummary": "...",
"remediationProposal": "..."
}
}

GET /api/missions/{id}/stream
Response: 200 OK
{
"mission": {...},
"latestScreenshot": "/screenshots/screenshot-abc.png"
}

GET /api/missions
Response: 200 OK
{
"missions": [...]
}

GET /api/queue/status
Response: 200 OK
{
"status": {
"waiting": 5, // Jobs waiting to execute
"active": 3, // Currently executing
"completed": 42, // Successfully completed
"failed": 2, // Failed jobs
"delayed": 0, // Delayed/scheduled
"workers": 3, // Active workers
"concurrency": 3 // Max concurrent jobs
}
}

GET /api/queue/job/{jobId}
Response: 200 OK
{
"job": {
"id": "abc-123",
"state": "waiting",
"progress": 0,
"position": 3, // Position in queue
"attempts": 0,
"timestamp": "2025-12-26T10:30:00Z"
}
}

GET /health
Response: 200 OK
{
"status": "ok",
"service": "ops-agent-desktop-backend"
}

| Capability | Metric |
|---|---|
| Concurrent Missions | 3 simultaneous executions |
| Queue Throughput | Up to 3x vs. sequential |
| Retry Logic | 3 attempts, exponential backoff |
| Rate Limiting | 100 req/15min (general), 10 missions/hour |
| Component Performance | 90% fewer re-renders |
| Test Coverage | 80+ tests, 60%+ coverage |
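The rate limits in the table (100 requests per 15 minutes in general, 10 missions per hour) can be illustrated with a small fixed-window limiter. This is a self-contained sketch, not the project's middleware: the class name, the in-memory store, and the choice of a fixed window over a sliding one are all assumptions.

```typescript
// In-memory fixed-window limiter keyed by client (for example an IP or user ID).
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly hits = new Map<string, { windowStart: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed. `now` is injectable for tests.
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.hits.get(key);
    // First request, or the previous window has expired: start a new window.
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.hits.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

// Limits taken from the table above.
const generalLimiter = new FixedWindowLimiter(100, 15 * 60 * 1000);
const missionLimiter = new FixedWindowLimiter(10, 60 * 60 * 1000);
```

An in-memory map only works for a single backend process; with several instances, the counters would need a shared store such as Redis.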
Ops-Agent-Desktop is one component of a larger Autonomous Operations Fabric:
| Component | Role |
|---|---|
| AutoRCA-Core (ADAPT-RCA) | Graph-based root cause analysis engine for logs/metrics/traces |
| Secure-MCP-Gateway | Security-first MCP gateway with policy enforcement and human approvals |
| Ops-Agent-Desktop (this project) | Visual mission control for browser-based agent investigations |
| awesome-autonomous-ops | Curated list of tools and resources for AI-powered ops |
Replace the stub in backend/src/browser/browserAgent.ts:performRCA() with:
const rcaResult = await fetch('http://autorca-core:8000/analyze', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
logs: extractedLogs,
metrics: extractedMetrics,
traces: extractedTraces,
timeWindow: { start: incident.startTime, end: 'now' },
}),
});
const rcaSummary = await rcaResult.json();

Replace the stub in backend/src/browser/browserAgent.ts:proposeRemediation() with:
const remediation = await fetch('http://secure-mcp-gateway:8080/propose', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
rcaSummary,
availableActions: ['restart', 'scale', 'rollback'],
}),
});
const proposal = await remediation.json();
// Proposal will include approval_required: true/false
// For write actions, poll gateway for approval status

Replace http://localhost:5174 in browserAgent.ts with:
- Grafana: https://your-grafana.example.com/d/your-dashboard
- Kibana: https://your-kibana.example.com/app/logs
- Datadog: https://app.datadoghq.com/dashboard/your-dashboard
Update the navigation logic to match your actual dashboard's DOM structure.
Ops-Agent-Desktop is designed with a clear security boundary:
| Action Type | Risk Level | Gateway Required | Example |
|---|---|---|---|
| OBSERVATION | Low | ❌ No | Reading dashboards, viewing logs, taking screenshots |
| RCA | Low | ❌ No | Running analysis, generating summaries |
| REMEDIATION (Proposal) | Medium | ❌ No | Proposing fixes (not executing) |
| ACTION | High | ✅ Yes | Restarting services, scaling resources, modifying configs |
All ACTION steps should route through Secure-MCP-Gateway with:
- Policy-based allow/deny/review decisions
- Approval UI for human operators (Slack, PagerDuty, custom dashboard)
- Audit logging of all decisions and executions
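To make the allow/deny/review idea concrete, here is a minimal policy sketch with an audit trail. The rules are invented for illustration (unknown actions are denied, production writes go to review) and do not describe Secure-MCP-Gateway's real policy format. All type and function names are hypothetical.

```typescript
type Decision = "allow" | "deny" | "review";

interface ActionRequest {
  action: string; // e.g. "restart", "scale", "rollback"
  target: string; // e.g. "checkout-service"
  environment: "staging" | "production";
}

interface AuditEntry {
  request: ActionRequest;
  decision: Decision;
  decidedAt: string; // ISO 8601
}

// Same action names as the availableActions list in the gateway example.
const KNOWN_ACTIONS = new Set<string>(["restart", "scale", "rollback"]);

// Assumed rules, evaluated in order; the first match wins.
function evaluatePolicy(req: ActionRequest): Decision {
  if (!KNOWN_ACTIONS.has(req.action)) return "deny"; // default-deny unlisted actions
  if (req.environment === "production") return "review"; // human approval for prod writes
  return "allow";
}

// Records every decision so that denials and reviews are auditable too.
function decideAndAudit(req: ActionRequest, log: AuditEntry[], now: Date = new Date()): Decision {
  const decision = evaluatePolicy(req);
  log.push({ request: req, decision, decidedAt: now.toISOString() });
  return decision;
}
```

A "review" result would pause the ACTION step until a human operator approves or rejects it.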
- Run the Browser Agent with read-only access to dashboards where possible
- Use service accounts with minimal permissions for actions
- Never hardcode credentials (use environment variables or secret managers)
Every mission is logged with:
- Full step-by-step timeline
- Screenshots at each critical action
- Timestamps and correlation IDs
- Outcome (success/failure/approval status)
- WebSocket support for real-time mission streaming (replace polling)
- Multi-mission management (run multiple missions in parallel)
- Real AutoRCA-Core integration (graph-based RCA engine)
- Real Secure-MCP-Gateway integration (policy enforcement + approvals)
- Dashboard adapters for Grafana, Kibana, Datadog, PagerDuty
- LLM-based mission planning (parse natural language prompts into action plans)
- Approval UI for human-in-the-loop interventions
- Persistence layer (save missions to database for historical review)
- Agent telemetry (track success rates, time-to-resolution, etc.)
- Docker Compose setup for one-command deployment
# Build all workspaces
npm run build
# Start production backend
cd backend && npm start
# Serve frontend (use a static server or integrate with backend)
cd frontend && npm run preview

# Backend
cd backend && npm run type-check
# Frontend
cd frontend && npm run type-check

1. Define the mission logic in backend/src/browser/browserAgent.ts
2. Add new step types to backend/src/types/mission.ts if needed
3. Update UI to handle new step types in frontend/src/components/CommandConsole.tsx
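When a new step type is added, an exhaustive switch lets the TypeScript compiler flag any UI code that has not been updated. The sketch below assumes a hypothetical `VERIFICATION` step type and a hypothetical `stepBadge` helper; neither is part of the project.

```typescript
// Existing step types plus one hypothetical addition, VERIFICATION.
type StepType = "OBSERVATION" | "ACTION" | "RCA" | "REMEDIATION" | "VERIFICATION";

// Accepts only `never`, so a call with an unhandled StepType fails to compile.
function assertNever(value: never): never {
  throw new Error(`Unhandled step type: ${String(value)}`);
}

// Maps a step type to the label shown in the timeline.
function stepBadge(type: StepType): string {
  switch (type) {
    case "OBSERVATION":
      return "Observation";
    case "ACTION":
      return "Action";
    case "RCA":
      return "Root Cause";
    case "REMEDIATION":
      return "Remediation";
    case "VERIFICATION":
      return "Verification";
    default:
      // Removing any case above turns this line into a compile error.
      return assertNever(type);
  }
}
```

With this pattern, forgetting to handle a new member of `StepType` in the UI is caught by the type-check scripts rather than at runtime.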
Contributions are welcome! This is an open-source reference architecture for AI-powered autonomous operations.
Areas for contribution:
- Real dashboard adapters (Grafana, Kibana, Datadog, etc.)
- Additional mission templates (database troubleshooting, network diagnostics, etc.)
- Improved LLM-based mission planning
- Security hardening (sandboxing, secrets management)
- Testing (unit tests, integration tests, E2E tests)
Please open an issue or PR on GitHub.
MIT License - see LICENSE for details.
Part of the Autonomous Operations Ecosystem:
- AutoRCA-Core (ADAPT-RCA) — Agentic root cause analysis engine
- Secure-MCP-Gateway — Security-first MCP gateway for ops tools
- awesome-autonomous-ops — Curated list of AI-powered ops tools
Built by Nik Kale as part of a broader vision for AI-powered autonomous reliability engineering.
Inspired by the need for transparent, auditable, and safe AI agents in production operations.
Questions or feedback? Open an issue or reach out via GitHub Discussions.
Like this project? Give it a ⭐ and share with your SRE/DevOps team!