Skip to content

adip47/Self-Healing-Infra

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Self-Healing Infrastructure with Kubernetes, Prometheus, Grafana, and Alertmanager

Self Healing Infra

This project demonstrates a self-healing Kubernetes infrastructure using monitoring and alerting tools.  
The setup automatically detects failures and triggers recovery actions, ensuring minimal downtime.

---

## 📌 Features
- Automated Monitoring with Prometheus
- Visual Dashboards with Grafana
- Alerting Mechanism via Alertmanager
- Custom Self-Healing Scripts for issue remediation
- Helm-based Deployment for modular configuration
- Configurable Alert Rules for different severity levels

---

## 📂 Project Structure


monitoring/
│── alertmanager-config.yaml       # Alertmanager configuration (routes, receivers)
│── grafana-values.yaml            # Helm values for Grafana
│── prometheus-values.yaml         # Helm values for Prometheus
charts/
│── self-healing/                  # Helm chart for self-healing components
scripts/
│── remediation.sh                  # Self-healing script triggered by alerts
README.md                           # Project documentation


---

🚀 Deployment Steps

1️⃣ Clone the Repository

git clone https://github.com/your-username/Self-Healing-Infra.git
cd Self-Healing-Infra

2️⃣ Install Prometheus & Grafana

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack -f monitoring/prometheus-values.yaml
helm install grafana grafana/grafana -f monitoring/grafana-values.yaml

3️⃣ Configure Alertmanager

kubectl apply -f monitoring/alertmanager-config.yaml

4️⃣ Deploy Self-Healing Script

kubectl apply -f scripts/remediation-job.yaml

📊 Dashboards & Monitoring


⚠️ Alerting Examples

  • High CPU Usage: If CPU usage exceeds 80% for 5 minutes.
  • Pod CrashLoopBackOff: If a pod is restarting repeatedly.
  • Node Down: If a node becomes unreachable for 2 minutes.

🔄 Self-Healing Flow

  1. Prometheus detects a metric breach.
  2. Alertmanager sends an alert.
  3. Alert triggers Webhook Receiver.
  4. Webhook runs remediation.sh to resolve the issue.
  5. System returns to a healthy state automatically.

🛠️ Tech Stack

  • Kubernetes
  • Prometheus
  • Grafana
  • Alertmanager
  • Helm
  • Bash/Python for remediation scripts

📜 License

MIT License © 2025 Aditya


🤝 Contributions

Pull requests are welcome! Please make sure to update tests as appropriate and follow the existing coding style.


📧 Contact

Author: Aditya Pawar Email: (adipawar47@gmail.com) GitHub: (https://github.com/adip47)

About

This project implements a Self-Healing Infrastructure designed to automatically detect, respond, and recover from system failures without human intervention. By combining cloud-native tooling, event-driven automation, and observability-driven intelligence, it ensures high availability, scalability, and operational resilience.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors