This project demonstrates a self-healing Kubernetes infrastructure built on monitoring and alerting tools.
The setup detects failures automatically and triggers recovery actions to keep downtime to a minimum.

---
## 📌 Features
- Automated Monitoring with Prometheus
- Visual Dashboards with Grafana
- Alerting Mechanism via Alertmanager
- Custom Self-Healing Scripts for issue remediation
- Helm-based Deployment for modular configuration
- Configurable Alert Rules for different severity levels
---
## 📂 Project Structure

    monitoring/
    ├── alertmanager-config.yaml   # Alertmanager configuration (routes, receivers)
    ├── grafana-values.yaml        # Helm values for Grafana
    └── prometheus-values.yaml     # Helm values for Prometheus
    charts/
    └── self-healing/              # Helm chart for self-healing components
    scripts/
    └── remediation.sh             # Self-healing script triggered by alerts
    README.md                      # Project documentation

---
## 🚀 Getting Started

Clone the repository:

    git clone https://github.com/your-username/Self-Healing-Infra.git
    cd Self-Healing-Infra

Add the Helm repositories:

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo add grafana https://grafana.github.io/helm-charts
    helm repo update

Install Prometheus and Grafana:

    helm install prometheus prometheus-community/kube-prometheus-stack -f monitoring/prometheus-values.yaml
    helm install grafana grafana/grafana -f monitoring/grafana-values.yaml

Apply the Alertmanager configuration:

    kubectl apply -f monitoring/alertmanager-config.yaml

Deploy the remediation job:

    kubectl apply -f scripts/remediation-job.yaml

---
## 🌐 Accessing the Dashboards

Once the services are exposed on your machine (for example with `kubectl port-forward`), the UIs are available at:

- Grafana: http://localhost:3000
  Default credentials: username `admin`, password `admin`
- Prometheus: http://localhost:9090
- Alertmanager: http://localhost:9093
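
To confirm that the three UIs are actually reachable, a small check against each component's health endpoint can help. The sketch below is not part of the repository; it assumes the localhost URLs listed above and uses the standard health paths (`/api/health` for Grafana, `/-/healthy` for Prometheus and Alertmanager). The names `HEALTH_URLS`, `is_healthy`, and `check_all` are hypothetical.

```python
import urllib.request

# Health endpoints for each component, built on the localhost URLs listed above.
HEALTH_URLS = {
    "Grafana": "http://localhost:3000/api/health",
    "Prometheus": "http://localhost:9090/-/healthy",
    "Alertmanager": "http://localhost:9093/-/healthy",
}


def is_healthy(url, timeout=3.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False


def check_all():
    """Map each component name to True (reachable) or False (unreachable)."""
    return {name: is_healthy(url) for name, url in HEALTH_URLS.items()}
```

Calling `check_all()` returns a dictionary keyed by component name, which makes it easy to spot a service that has not been exposed yet.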

---
## 🚨 Example Alert Rules
- High CPU Usage: fires when CPU usage stays above 80% for 5 minutes.
- Pod CrashLoopBackOff: fires when a pod keeps restarting.
- Node Down: fires when a node has been unreachable for 2 minutes.
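
One way to express these three alerts is a `PrometheusRule` resource, which the Prometheus Operator in kube-prometheus-stack can load. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which `kubectl` accepts. The PromQL expressions, the severity labels, the resource name `self-healing-rules`, and the `release: prometheus` label are assumptions for illustration, not the rules shipped with this project; the crash-loop rule has no `for` clause because the list above gives no duration for it.

```python
import json


def build_alert_rules(release="prometheus"):
    """Return a PrometheusRule manifest (as a dict) covering the three alerts."""
    rules = [
        {
            "alert": "HighCPUUsage",
            # Assumed expression: node CPU utilisation from node-exporter metrics.
            "expr": '100 - (avg by (instance) '
                    '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80',
            "for": "5m",
            "labels": {"severity": "warning"},
        },
        {
            "alert": "PodCrashLoopBackOff",
            # Assumed expression: kube-state-metrics waiting reason.
            "expr": 'kube_pod_container_status_waiting_reason'
                    '{reason="CrashLoopBackOff"} == 1',
            "labels": {"severity": "critical"},
        },
        {
            "alert": "NodeDown",
            # Assumed expression: node Ready condition from kube-state-metrics.
            "expr": 'kube_node_status_condition'
                    '{condition="Ready",status="true"} == 0',
            "for": "2m",
            "labels": {"severity": "critical"},
        },
    ]
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        # Assumption: the operator selects rules by the Helm release label.
        "metadata": {"name": "self-healing-rules", "labels": {"release": release}},
        "spec": {"groups": [{"name": "self-healing", "rules": rules}]},
    }


if __name__ == "__main__":
    print(json.dumps(build_alert_rules(), indent=2))
```

If this were saved as a hypothetical `alert_rules.py`, the output could be piped straight into the cluster with `python alert_rules.py | kubectl apply -f -`.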

---
## ⚙️ How It Works
1. Prometheus detects a metric breach.
2. Alertmanager sends an alert.
3. The alert triggers the webhook receiver.
4. The webhook runs `remediation.sh` to resolve the issue.
5. The system returns to a healthy state automatically.
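
The webhook step in this flow can be sketched as a small HTTP server that accepts Alertmanager's webhook payload and runs the remediation script once per firing alert. This is a minimal illustration rather than the project's actual receiver: the port `5001`, the names `commands_for`, `AlertHandler`, and `serve`, and the positional arguments passed to `scripts/remediation.sh` (alert name, namespace, pod) are all assumptions.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

REMEDIATION_SCRIPT = "scripts/remediation.sh"  # path from the project structure


def commands_for(payload):
    """Build one remediation command per firing alert in a webhook payload."""
    commands = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no remediation
        labels = alert.get("labels", {})
        # Assumption: remediation.sh takes alert name, namespace, and pod.
        commands.append([
            REMEDIATION_SCRIPT,
            labels.get("alertname", "unknown"),
            labels.get("namespace", ""),
            labels.get("pod", ""),
        ])
    return commands


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON body sent by Alertmanager.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for command in commands_for(payload):
            # check=False: a failed remediation must not crash the receiver.
            subprocess.run(command, check=False)
        self.send_response(200)
        self.end_headers()


def serve(port=5001):
    """Listen for Alertmanager webhooks until interrupted."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the listener; the Alertmanager webhook receiver in `monitoring/alertmanager-config.yaml` would then need to point at whatever address this server is reachable on. Keeping the payload-to-command mapping in `commands_for` makes that logic testable without starting a server or running the script.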

---
## 🛠 Tech Stack
- Kubernetes
- Prometheus
- Grafana
- Alertmanager
- Helm
- Bash/Python for remediation scripts

---
## 📜 License
MIT License © 2025 Aditya

---
## 🤝 Contributing
Pull requests are welcome! Please update tests as appropriate and follow the existing coding style.

---
## 👤 Author
- Author: Aditya Pawar
- Email: adipawar47@gmail.com
- GitHub: https://github.com/adip47
