# Self-Healing Infrastructure with Kubernetes, Prometheus, Grafana, and Alertmanager

This project demonstrates a self-healing Kubernetes infrastructure built on monitoring and alerting tools.  
Prometheus detects failures, Alertmanager routes the resulting alerts, and a remediation script attempts recovery without manual intervention, with the goal of keeping downtime low.

---

## 📌 Features
- Automated Monitoring with Prometheus
- Visual Dashboards with Grafana
- Alerting Mechanism via Alertmanager
- Custom Self-Healing Scripts for issue remediation
- Helm-based Deployment for modular configuration
- Configurable Alert Rules for different severity levels

---

## 📂 Project Structure


    monitoring/
    │── alertmanager-config.yaml       # Alertmanager configuration (routes, receivers)
    │── grafana-values.yaml            # Helm values for Grafana
    │── prometheus-values.yaml         # Helm values for Prometheus
    charts/
    │── self-healing/                  # Helm chart for self-healing components
    scripts/
    │── remediation.sh                 # Self-healing script triggered by alerts
    README.md                          # Project documentation


---

## 🚀 Deployment Steps

### 1️⃣ Clone the Repository

    git clone https://github.com/your-username/Self-Healing-Infra.git
    cd Self-Healing-Infra

### 2️⃣ Install Prometheus & Grafana

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo add grafana https://grafana.github.io/helm-charts
    helm repo update

    helm install prometheus prometheus-community/kube-prometheus-stack -f monitoring/prometheus-values.yaml
    helm install grafana grafana/grafana -f monitoring/grafana-values.yaml

### 3️⃣ Configure Alertmanager

    kubectl apply -f monitoring/alertmanager-config.yaml

### 4️⃣ Deploy Self-Healing Script

    kubectl apply -f scripts/remediation-job.yaml

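## 📊 Dashboards & Monitoring

The URLs below assume each service is reachable on localhost, for example through `kubectl port-forward`.

- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- Alertmanager: http://localhost:9093

Default Grafana credentials:
```
Username: admin
Password: admin
```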
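
As a quick way to confirm that the three endpoints above respond, the following Python sketch probes each tool's health endpoint. It assumes the services are exposed on the localhost ports listed in this section; `HEALTH_ENDPOINTS`, `is_healthy`, and `check_all` are illustrative names and not part of this project.

```python
import http.client
import urllib.error
import urllib.request

# Base URLs are the ones listed in this README. The paths are the health
# endpoints each tool exposes (/api/health for Grafana, /-/healthy for
# Prometheus and Alertmanager).
HEALTH_ENDPOINTS = {
    "grafana": "http://localhost:3000/api/health",
    "prometheus": "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
}


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or malformed URL.
        return False


def check_all(endpoints=HEALTH_ENDPOINTS, probe=is_healthy) -> dict:
    """Probe every endpoint and map each service name to a boolean."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling `check_all()` returns a dictionary such as `{"grafana": True, ...}`; a `False` entry usually means the service is not running or not forwarded to that port.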

## ⚠️ Alerting Examples

- High CPU Usage: fires when CPU usage stays above 80% for 5 minutes.
- Pod CrashLoopBackOff: fires when a pod is restarting repeatedly.
- Node Down: fires when a node has been unreachable for 2 minutes.
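
Prometheus evaluates the hold time itself through the `for` field of an alerting rule. To illustrate what a condition such as "above 80% for 5 minutes" means, here is a minimal Python sketch of the same idea. The function name `breach_sustained` and the sample format are assumptions made for illustration; this is not how Prometheus is implemented.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


def breach_sustained(
    samples: Sequence[Sample], threshold: float, hold_seconds: float
) -> bool:
    """Return True if the most recent unbroken run of samples above the
    threshold spans at least hold_seconds. Samples must be time-ordered."""
    run_start: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = timestamp  # a new breach begins here
        else:
            run_start = None  # any dip below the threshold resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= hold_seconds
```

For example, CPU samples taken every 60 seconds that all read 85% over a 300-second window satisfy `breach_sustained(samples, 80.0, 300)`, while a single dip to 70% in the middle resets the timer and the check fails.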

## 🔄 Self-Healing Flow

1. Prometheus detects a metric breach and fires an alert.
2. Alertmanager receives the alert and routes it.
3. The alert is delivered to a webhook receiver.
4. The webhook runs `remediation.sh` to resolve the issue.
5. The system returns to a healthy state automatically.
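
The flow above can be sketched in Python as the decision step inside a webhook receiver. The sketch relies on the JSON payload Alertmanager sends to webhook receivers (an `alerts` list whose entries carry `status` and `labels`). The alert names, the `TARGET_LABEL` mapping, and the argument order passed to `remediation.sh` are assumptions for illustration; the project's actual script interface may differ.

```python
import json
from typing import Dict, List

# Hypothetical mapping from alert name to the label that identifies the
# resource to repair. These alert names are assumptions, not project values.
TARGET_LABEL = {
    "HighCpuUsage": "pod",
    "PodCrashLooping": "pod",
    "NodeDown": "node",
}


def plan_remediation(payload: Dict) -> List[List[str]]:
    """Build one remediation.sh invocation per firing, recognized alert."""
    commands = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no action
        labels = alert.get("labels", {})
        name = labels.get("alertname", "")
        target_label = TARGET_LABEL.get(name)
        if target_label is None or target_label not in labels:
            continue  # unknown alert or missing target: skip rather than guess
        # Assumed CLI shape: remediation.sh <alertname> <target> [namespace]
        command = ["scripts/remediation.sh", name, labels[target_label]]
        if "namespace" in labels:
            command.append(labels["namespace"])
        commands.append(command)
    return commands


def plan_from_body(body: bytes) -> List[List[str]]:
    """Decode the raw HTTP request body sent by Alertmanager and plan actions."""
    return plan_remediation(json.loads(body))
```

A receiver would call `plan_from_body` on each POST body and pass every returned argument list to `subprocess.run`. Keeping the planning step free of side effects makes it easy to test without a cluster.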

## 🛠️ Tech Stack

- Kubernetes
- Prometheus
- Grafana
- Alertmanager
- Helm
- Bash/Python for remediation scripts

## 📜 License

See the `LICENSE` file in the repository root.

## 🤝 Contributions

Pull requests are welcome! Please make sure to update tests as appropriate and follow the existing coding style.

## 📧 Contact

- Author: Aditya Pawar
- Email: adipawar47@gmail.com
- GitHub: https://github.com/adip47
