A Kubernetes-based truly self-healing system that automatically detects anomalies across multiple signals, diagnoses root causes through metric correlation, executes multi-step remediation strategies, and learns from past recovery outcomes — with a React dashboard for real-time monitoring.
The Self-Healing System monitors application health via Prometheus metrics, detects anomalies using sliding-window analysis across memory, CPU, and error rates, correlates signals to diagnose root causes, selects intelligent multi-step recovery strategies, and learns from historical outcomes to improve future decisions.
┌──────────────────────────┐
│ Learning Loop │ ✅ BUILT — records outcomes, recommends strategies
│ (improves over time) │
└──────────▲───────────────┘
│
┌──────────┴───────────────┐
│ Remediation Engine │ ✅ BUILT — multi-step strategies per root cause
│ (ScaleUp → Restart) │
└──────────▲───────────────┘
│
┌──────────┴───────────────┐
│ Root Cause Diagnosis │ ✅ BUILT — correlates memory + CPU + errors + traffic
│ (signal correlation) │
└──────────▲───────────────┘
│
┌──────────┴───────────────┐
│ Anomaly Detection │ ✅ BUILT — 3 rules: Memory, CPU, Error Rate
│ (sliding windows) │
└──────────▲───────────────┘
│
┌──────────┴───────────────┐
│ Observability Layer │ ✅ BUILT — Prometheus (4 metrics) + health checks
│ (metrics + health) │
└──────────────────────────┘
- Multi-Signal Anomaly Detection — 3 sliding-window rules: OLS regression-based memory leak detection (R² confidence gating), CPU spike detection, and high error rate detection
- Root Cause Diagnosis — Correlates 4 metrics (memory, CPU, error rate, request rate) to classify failures: ResourceExhaustion, TrafficOverload, ApplicationError, DependencyFailure, MemoryLeakSuspected, HighCpuUsage
- Strategic Remediation — Multi-step recovery strategies (e.g., ScaleUp → wait → RestartPod) selected per root cause
- Learning Loop — Records every recovery outcome and recommends the best strategy based on historical success rates
- Full Persistence — Failure events, recovery actions, and metrics stored in PostgreSQL
- Real-Time Dashboard — React 19 + Tailwind CSS v4 + Recharts with live metrics, failures, and recoveries
- Stress Testing — Built-in endpoints to simulate memory leaks and CPU exhaustion
- Comprehensive API — 20 REST endpoints across TaskApi and Analyzer services
| Document | Description |
|---|---|
| DemoApp/README.md | Full setup guide — architecture, installation, configuration, troubleshooting |
| DemoApp/docs/API.md | Complete API reference — all 18 endpoints, request/response examples, data models |
| DemoApp/install.sh | Automated Linux install script — installs all prerequisites and deploys to K8s |
| DemoApp/run.sh | Run script — starts port-forwards and frontend dev server |
cd DemoApp
chmod +x install.sh run.sh
# Install everything (Docker, Kind, kubectl, .NET, Node.js, K8s cluster)
./install.sh
# Start all services
./run.shSee DemoApp/README.md for manual installation and detailed configuration.
┌─────────────┐ ┌─────────────┐ ┌──────────────┐ ┌────────────┐
│ React UI │───▶│ TaskApi │◀───│ Analyzer │───▶│ Kubernetes │
│ (Vite) │ │ (.NET 8) │ │ (.NET 9) │ │ API │
└─────────────┘ └──────┬──────┘ └───────┬───────┘ └────────────┘
│ │
┌──────▼──────┐ ┌───────▼───────┐
│ PostgreSQL │ │ Prometheus │
│ 16 │ │ (Scraping) │
└─────────────┘ └───────────────┘
| Component | Tech Stack | Purpose |
|---|---|---|
| TaskApi | ASP.NET Core 8, EF Core, Npgsql, prometheus-net | REST API, data persistence, Prometheus metrics |
| Analyzer | ASP.NET Core 9, KubernetesClient | Anomaly detection, recovery orchestration |
| Frontend | React 19, Vite, Tailwind CSS v4, Recharts, Radix UI | Dashboard, monitoring, task management |
| PostgreSQL | PostgreSQL 16 | Tasks, metrics, failures, recoveries |
| Prometheus | prom/prometheus | Metrics scraping (5s interval) |
| Grafana | Grafana 11.0.0 | Metrics visualization |
| Kubernetes | Kind (Kubernetes in Docker) | Container orchestration |
The UML and ER diagrams document the full system design and data model.
The database schema and entity relationships are visualized in multiple ER diagram versions:
If you want to apply the self-healing approach in your own system, use this demo as a reference and follow these steps.
- Implement a
/healthendpoint that returns503when the service is unhealthy. - Expose
/metricsfor Prometheus scraping.
Reference files:
Use stress endpoints to validate healing behavior during load or failure scenarios.
Reference file:
- Configure liveness and readiness probes to call
/health. - Kubernetes will restart unhealthy pods automatically.
Reference file:
- Install Prometheus.
- Add a ServiceMonitor to scrape
/metrics.
Reference files:
- DemoApp/TaskApi/Kubernetes/service.yaml
- DemoApp/TaskApi/Kubernetes/service-monitor.yaml
- DemoApp/TaskApi/Kubernetes/prometheus.yaml
Use Docker Compose to run the API and database locally.
Reference file:
- Replace the Task API endpoints with your own business logic.
- Keep the health checks and metrics as part of your service boundary.
If you need a full walkthrough, follow DemoApp/README.md.
If you want a quick starting point, you can scaffold a minimal self-healing setup into your own service and replace the business logic.
Program.cs (core wiring)
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Prometheus;
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddControllers();
builder.Services.AddHealthChecks()
.AddCheck("self", () => HealthCheckResult.Healthy("OK"));
var app = builder.Build();
app.UseRouting();
app.UseHttpMetrics();
app.MapControllers();
app.MapHealthChecks("/health");
app.MapMetrics();
app.Run();HealthController.cs
using Microsoft.AspNetCore.Mvc;
using System.Diagnostics;
namespace TaskApi.Controllers;
[ApiController]
[Route("health")]
public class HealthController : ControllerBase
{
[HttpGet]
public IActionResult Health()
{
var process = Process.GetCurrentProcess();
var memoryMb = process.WorkingSet64 / (1024 * 1024);
var threadCount = process.Threads.Count;
if (memoryMb > 500)
return StatusCode(503, $"Unhealthy: High memory usage {memoryMb} MB");
if (threadCount > 200)
return StatusCode(503, $"Unhealthy: Too many threads {threadCount}");
return Ok(new { status = "OK", memoryMb, threadCount });
}
}deployment.yaml (probes only)
containers:
- name: your-service
image: your-image:tag
ports:
- containerPort: 8080
name: http
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 5
periodSeconds: 5service.yaml (for Prometheus scrape)
apiVersion: v1
kind: Service
metadata:
name: your-service
labels:
app: your-service
spec:
selector:
app: your-service
ports:
- name: http
port: 80
targetPort: 8080apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: your-service
labels:
app: prometheus
spec:
selector:
matchLabels:
app: your-service
endpoints:
- port: http
path: /metrics
interval: 5s- Keep
/healthand/metricsas stable endpoints. - Swap in your own controllers, DB, and domain models.
If you want the full working example, clone and follow DemoApp/README.md.
This project is licensed under the terms specified in the LICENSE file.
GitHub: @mukund58




