A comprehensive demo and learning observability stack that provides metrics collection, log aggregation, distributed tracing, and alerting capabilities using industry-standard open-source tools.
Disclaimer: This project is intended for demonstration, experimentation, and educational purposes only. It is NOT production ready. It runs all components in single containers with minimal configuration and without hardening (no auth, no TLS, single-node Elasticsearch, in-container Prometheus storage, no HA, no backup/restore strategy). Before any production use you must implement security, scaling, persistence, resilience, and operational safeguards.
- **Metrics Collection**: Prometheus with custom alerting rules
- **Visualization**: Pre-configured Grafana dashboards
- **Log Aggregation**: Elasticsearch + Kibana for centralized logging
- **Distributed Tracing**: Jaeger for request tracing
- **Alerting**: AlertManager with webhook integrations
- **Data Pipeline**: OpenTelemetry Collector for data processing
- **System Monitoring**: Node Exporter for host metrics
- **Easy Management**: Convenient shell script for operations
- **Kubernetes Ready**: Kustomize manifests for deploying the full stack + sample app (Kind or any cluster)
| Component | Purpose | Port | UI/API |
|---|---|---|---|
| Prometheus | Metrics collection and storage | 9090 | http://localhost:9090 |
| Grafana | Metrics visualization and dashboards | 3000 | http://localhost:3000 |
| Elasticsearch | Log storage and search | 9200 | http://localhost:9200 |
| Kibana | Log visualization and analysis | 5601 | http://localhost:5601 |
| Jaeger | Distributed tracing | 16686 | http://localhost:16686 |
| OpenTelemetry Collector | Data pipeline and processing | 4317/4318 | - |
| AlertManager | Alert management and routing | 9093 | http://localhost:9093 |
| Node Exporter | System metrics collection | 9100 | - |
| Kafka | Log pipeline buffering | 29092 | - |
| Kafka UI | Inspect Kafka topics | 8085 | http://localhost:8085 |
| Kafka JMX Exporter | Kafka metrics for Prometheus | 5556 | http://localhost:5556/metrics |
This toolkit is configured for development/testing environments. For production use, please review and implement the security measures outlined in SECURITY.md.
- Docker Engine 20.10+
- Docker Compose 2.0+ (either `docker-compose` or `docker compose`)
- At least 4GB of available RAM
- 10GB of free disk space

Note: This toolkit supports both the standalone `docker-compose` binary and the newer `docker compose` plugin. The management script will automatically detect which version is available.
1. Clone the repository:

   ```bash
   git clone https://github.com/vigneshragupathy/observability-toolkit.git
   cd observability-toolkit
   ```

2. Copy environment configuration (optional):

   ```bash
   cp .env.example .env
   # Edit the .env file to customize your environment
   ```

3. Start the stack using the management script (recommended):

   ```bash
   ./manage-stack.sh start
   ```

4. Or use Docker Compose directly (alternative to step 3; choose one):

   ```bash
   # Using docker-compose (standalone)
   docker-compose up -d

   # OR using docker compose (plugin)
   docker compose up -d
   ```
After starting the stack, you can access the following services:
- Grafana Dashboard: http://localhost:3000 (admin/admin)
- Prometheus Metrics: http://localhost:9090
- Jaeger Tracing: http://localhost:16686
- Kibana Logs: http://localhost:5601
- AlertManager: http://localhost:9093
Note: These URLs are only accessible when the stack is running locally.
A sample FastAPI + OpenTelemetry app lives under `o11y-playground/o11y-python`.
It runs in its own directory and just needs to share the Docker network named `observability`
so it can reach the toolkit's OpenTelemetry Collector at `otel-collector:4317`.
It can be started from that directory with `docker compose up -d --build`, or with the helper script. Run it separately (after starting the stack):

```bash
cd o11y-playground/o11y-python
chmod +x run.sh   # first time only
./run.sh up       # build & start
./run.sh traffic  # optional sample load
```

Stop it:

```bash
./run.sh down
```

Endpoints: `/`, `/work`, `/error` (http://localhost:8000)
These generate traces (Jaeger), metrics (Prometheus/Grafana), and logs (Kibana) independently of the core compose file.
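The app's actual code lives in the repository; as a rough illustration of how a FastAPI service like it can be wired to the Collector, the sketch below uses the `opentelemetry-instrumentation-fastapi` and `opentelemetry-exporter-otlp` packages. The service name and route are placeholders, not the shipped app.

```python
# minimal_app.py -- illustrative sketch only, not the bundled o11y-python app
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Export spans to the toolkit's Collector: otel-collector:4317 inside the
# shared "observability" Docker network, localhost:4317 from the host.
provider = TracerProvider(resource=Resource.create({"service.name": "demo-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # creates a span for every incoming request

@app.get("/work")
def work():
    return {"status": "ok"}
```

Run it with any ASGI server (for example `uvicorn minimal_app:app`) on the `observability` network and the spans show up in Jaeger under the configured service name.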
You can also deploy the same observability toolkit to a Kubernetes cluster (tested with Kind) with namespace separation and auto-provisioned Grafana dashboards.

Quick Kind demo:

```bash
cd kubernetes/kind
./setup.sh   # creates kind cluster + applies kustomize
```

Generic cluster:

```bash
cd kubernetes
./deploy.sh --wait
```

Then port-forward (example):

```bash
kubectl -n observability port-forward svc/grafana 3000:3000 &
kubectl -n observability port-forward svc/prometheus 9090:9090 &
kubectl -n observability port-forward svc/jaeger-query 16686:16686 &
kubectl -n observability port-forward svc/kibana 5601:5601 &
```

Kubernetes docs, build modes (external vs in-cluster Kaniko), and dashboard provisioning details live in `kubernetes/README.md`.
Kafka is enabled by default to demonstrate a decoupled log ingestion flow:
- Applications send logs to the OpenTelemetry Collector (OTLP) as usual.
- The Collector pipeline `logs_produce` publishes log records to the Kafka topic `otel-logs` in OTLP JSON encoding.
- A second Collector pipeline (`logs_consume`) consumes from Kafka and forwards to Elasticsearch.
- Kibana visualizes logs stored in Elasticsearch with no change required by applications.
Benefits demonstrated:
- Decouples ingestion from indexing (burst smoothing, backpressure handling concept).
- Provides a tap point to add stream processors / enrichment later.
- Shows how the Collector can both produce to and consume from Kafka.
Start the stack (Kafka already included):

```bash
./manage-stack.sh start
```

Opt out (no Kafka buffering; logs go straight to Elasticsearch):

```bash
./manage-stack.sh start --no-kafka
```

Inspect the topic:

```bash
open http://localhost:8085   # Kafka UI
```

Produce a sample log burst (using the demo app traffic command):

```bash
cd o11y-playground/o11y-python
./run.sh traffic
```

If you disable Kafka, the Collector config still defines Kafka pipelines; without the broker they will error. For a cleaner no-Kafka run, use `--no-kafka`, which suppresses starting the broker/UI (logs may show Kafka exporter connection retries until you adjust the Collector config). Future improvement: conditional Collector config templating.
Topic & encoding details:

- Topic: `otel-logs`
- Encoding: `otlp_json` (human-inspectable payloads)
- Consumer group: `otel-collector-log-consumer`
If Kafka is down, the `logs_produce` pipeline retries (see the exporter retry settings) and you may see backpressure in the Collector logs.
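To peek at the `otlp_json` payloads yourself (for instance to prototype the stream-processing tap point mentioned above), a throwaway consumer against the host listener works. This is a minimal sketch, assuming the `kafka-python` package and the `localhost:29092` listener from the component table:

```python
# inspect_otel_logs.py -- illustrative sketch (assumes `pip install kafka-python`)
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "otel-logs",                       # topic fed by the Collector's logs_produce pipeline
    bootstrap_servers="localhost:29092",
    auto_offset_reset="earliest",
    group_id="readme-inspector",       # separate group so the Collector's consumer is unaffected
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),  # otlp_json encoding
)

for message in consumer:
    # Each message is an OTLP logs export request in JSON form.
    for resource_logs in message.value.get("resourceLogs", []):
        for scope_logs in resource_logs.get("scopeLogs", []):
            for record in scope_logs.get("logRecords", []):
                print(record.get("severityText"), record.get("body"))
```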
Grafana auto-loads dashboard JSON files from `config/grafana/dashboards/` via provisioning (see `config/grafana/provisioning/dashboards/dashboards.yml`). Included demo dashboards:
| Dashboard Title | UID | File | Highlights |
|---|---|---|---|
| Observability Stack Overview | `obs-overview` | `observability-overview.json` | System CPU %, Memory %, Service availability table, HTTP request rate example |
| Node Exporter Overview | `node-exporter-overview` | `node-exporter-overview.json` | CPU (avg & per-core), Memory, Load, Filesystem %, Disk IO, Network throughput, Uptime |
| Kafka Overview | `kafka-overview` | `kafka-overview.json` | Topic message/byte rates, partition count, consumer lag, under-replicated partitions, log flow from Kafka to Elasticsearch |
If a dashboard doesn't appear:

- Ensure the file exists under `config/grafana/dashboards/`.
- Restart Grafana: `docker compose restart grafana` (or `./manage-stack.sh restart`).
- Check logs: `docker compose logs grafana | grep -i provisioning`.
To add your own:

- Create/export a dashboard JSON in the Grafana UI.
- Save it into `config/grafana/dashboards/` (plain dashboard JSON, not wrapped).
- Set a unique `uid` to avoid clashes.
- Restart (or wait for the `updateIntervalSeconds` reload).
```
config/
├── prometheus/
│   ├── prometheus.yml              # Prometheus configuration
│   └── rules/
│       └── alerts.yml              # Alerting rules
├── otel/
│   └── otel-collector-config.yaml  # OpenTelemetry Collector config
├── alertmanager/
│   └── alertmanager.yml            # AlertManager configuration
└── grafana/
    ├── provisioning/
    │   ├── datasources/            # Auto-configured data sources
    │   └── dashboards/             # Dashboard provisioning
    └── dashboards/                 # Dashboard JSON files
```
The `manage-stack.sh` script provides convenient management commands:

```bash
# Start the entire stack
./manage-stack.sh start

# Stop the stack
./manage-stack.sh stop

# Restart the stack
./manage-stack.sh restart

# Check status of all services
./manage-stack.sh status

# View logs (all services or specific service)
./manage-stack.sh logs
./manage-stack.sh logs prometheus

# Clean up everything (removes containers and volumes)
./manage-stack.sh cleanup

# Show help
./manage-stack.sh help
```

Prometheus scrapes metrics from:
- Application services (when deployed)
- System metrics via Node Exporter
- OpenTelemetry Collector metrics
- Custom exporters
- OpenTelemetry Collector receives logs via OTLP
- If the Kafka profile is enabled: logs are first published to Kafka (topic `otel-logs`), then consumed and sent to Elasticsearch
- If Kafka is not enabled (baseline): logs go directly to Elasticsearch
- Kibana provides log visualization and search
- OpenTelemetry Collector receives traces via OTLP
- Traces are exported to Jaeger
- Jaeger UI provides trace visualization and analysis
- Prometheus evaluates alerting rules
- AlertManager handles alert routing and notifications
- Configured webhooks for integration with external systems
- Expose metrics at a `/metrics` endpoint (see the sketch below)
- Prometheus will auto-discover services in the `observability` network
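As an illustration of the first point (not part of the toolkit), a Python service can expose such an endpoint with the `prometheus_client` library; the metric names below are invented for the example:

```python
# metrics_demo.py -- illustrative sketch (assumes `pip install prometheus-client`)
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metrics for the example; name yours after your own service.
REQUESTS = Counter("demo_requests_total", "Total demo requests handled")
LATENCY = Histogram("demo_request_duration_seconds", "Demo request latency")

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics on port 8000 for Prometheus to scrape
    while True:
        with LATENCY.time():        # observe how long the simulated work takes
            time.sleep(random.random())
        REQUESTS.inc()
```

Prometheus still needs a matching scrape job for the new target (see "Update Prometheus configuration" below).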
- Send logs to `http://otel-collector:4318/v1/logs` (HTTP; see the example below)
- Or to `otel-collector:4317` (gRPC)
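For a quick smoke test without an SDK, you can POST a single log record in OTLP/JSON form to the HTTP endpoint. This is a minimal hand-rolled sketch assuming the `requests` package and a run from the host (use `otel-collector:4318` from inside the `observability` network); real applications would normally let the OpenTelemetry SDK or a logging handler build this payload:

```python
# send_test_log.py -- minimal OTLP/JSON log sketch, not a full SDK setup
import time
import requests

log_payload = {
    "resourceLogs": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": "readme-demo"}}
        ]},
        "scopeLogs": [{
            "scope": {"name": "readme-demo"},
            "logRecords": [{
                "timeUnixNano": str(time.time_ns()),
                "severityText": "INFO",
                "body": {"stringValue": "hello from the README example"},
            }],
        }],
    }]
}

resp = requests.post("http://localhost:4318/v1/logs", json=log_payload, timeout=5)
resp.raise_for_status()  # a 200 means the Collector accepted the record
```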
- Send traces to `http://otel-collector:4318/v1/traces` (HTTP)
- Or to `otel-collector:4317` (gRPC; see the example below)
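With the Python SDK, manually created spans reach the same gRPC endpoint. A minimal sketch assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp` packages (the span and service names are placeholders; if you omit the `endpoint` argument, the exporter falls back to the `OTEL_EXPORTER_OTLP_*` environment variables shown in the next section):

```python
# trace_demo.py -- illustrative sketch of manual span export over OTLP/gRPC
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "readme-demo"}))
provider.add_span_processor(
    # otel-collector:4317 inside the Docker network, localhost:4317 from the host
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("demo-operation") as span:
    span.set_attribute("demo.iteration", 1)  # shows up as a span attribute in Jaeger
```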
```bash
# OpenTelemetry configuration
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=your-service-name
OTEL_RESOURCE_ATTRIBUTES=service.version=1.0.0,environment=dev

# Prometheus metrics
PROMETHEUS_METRICS_PORT=9090
PROMETHEUS_METRICS_PATH=/metrics
```
1. Update Prometheus configuration (`config/prometheus/prometheus.yml`):

   ```yaml
   scrape_configs:
     - job_name: 'your-service'
       static_configs:
         - targets: ['your-service:port']
   ```

2. Add alerting rules (`config/prometheus/rules/alerts.yml`):

   ```yaml
   - alert: YourServiceDown
     expr: up{job="your-service"} == 0
     for: 1m
     annotations:
       summary: "Your service is down"
   ```
3. Restart Prometheus:

   ```bash
   # Using the management script
   ./manage-stack.sh restart

   # OR using Docker Compose directly
   docker-compose restart prometheus   # or: docker compose restart prometheus
   ```
- Create dashboard JSON files in `config/grafana/dashboards/`
- Restart Grafana or wait for auto-reload:

  ```bash
  # Using the management script
  ./manage-stack.sh restart

  # OR restart just Grafana
  docker-compose restart grafana   # or: docker compose restart grafana
  ```
- Elasticsearch: Adjust JVM heap size via `ES_JAVA_OPTS`
- Prometheus: Configure retention period and storage
- Grafana: Set up external database for production use
- Security: Enable authentication and TLS
- Persistence: Use external volumes for data
- Scaling: Use external managed services for production
- Backup: Implement regular backup strategies
- Database connection pool exhaustion
- High error rates (>10%)
- High response times (>2s)
- High memory usage (>90%)
- High CPU usage (>80%)
- Service availability
Update `config/alertmanager/alertmanager.yml` to add your webhook endpoints:

```yaml
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'
```
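If you route alerts to a generic `webhook_configs` receiver instead, the target only has to accept AlertManager's JSON POSTs. A minimal sketch of such a receiver (a hypothetical service using FastAPI; the `/alerts` path and port are arbitrary choices, not part of the toolkit):

```python
# alert_webhook.py -- illustrative AlertManager webhook receiver
# run with: uvicorn alert_webhook:app --port 5001
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/alerts")
async def receive_alerts(request: Request):
    payload = await request.json()
    # AlertManager batches alerts; each entry carries labels/annotations from the rule.
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "unknown")
        summary = alert.get("annotations", {}).get("summary", "")
        print(f"[{alert.get('status')}] {name}: {summary}")
    return {"received": len(payload.get("alerts", []))}
```

Point a `webhook_configs` URL in `alertmanager.yml` at the receiver's address to try it out.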
Common issues:

- Port conflicts: Ensure ports 3000, 5601, 9090, 9093, 9200, 16686 are available
- Memory issues: Increase Docker memory allocation (minimum 4GB recommended)
- Permission issues: Ensure proper file permissions in config directories
View service logs:

```bash
# View logs for a specific service
./manage-stack.sh logs elasticsearch

# View all logs
./manage-stack.sh logs
```

Check service health:

```bash
# Check service status
./manage-stack.sh status

# Manual health checks
curl http://localhost:9090/-/healthy        # Prometheus
curl http://localhost:3000/api/health       # Grafana
curl http://localhost:9200/_cluster/health  # Elasticsearch
```
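The same checks can be scripted; the helper below is an illustrative sketch (not shipped with the toolkit) that polls the endpoints above with `requests`:

```python
# check_stack.py -- illustrative health-check helper
import requests

ENDPOINTS = {
    "Prometheus": "http://localhost:9090/-/healthy",
    "Grafana": "http://localhost:3000/api/health",
    "Elasticsearch": "http://localhost:9200/_cluster/health",
}

for name, url in ENDPOINTS.items():
    try:
        status = requests.get(url, timeout=3).status_code
        print(f"{name}: {'OK' if status == 200 else f'HTTP {status}'}")
    except requests.RequestException as exc:
        print(f"{name}: unreachable ({exc})")
```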
Further reading:

- Prometheus Documentation
- Grafana Documentation
- Jaeger Documentation
- OpenTelemetry Documentation
- Elasticsearch Documentation
We welcome contributions! Please see our Contributing Guide for details on:
- How to submit bug reports and feature requests
- Development setup and testing procedures
- Code style and documentation standards
- Pull request process
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Security is important to us. Please review our Security Policy for:
- Reporting security vulnerabilities
- Production security considerations
- Best practices and recommendations
- Documentation: Check this README and the Contributing Guide
- Issues: Open an issue on GitHub for bug reports
- Discussions: Use GitHub Discussions for questions and ideas
This project uses several excellent open-source tools:
- Prometheus - Metrics collection and alerting
- Grafana - Metrics visualization
- Elasticsearch - Search and analytics engine
- Kibana - Data visualization
- Jaeger - Distributed tracing
- OpenTelemetry - Observability framework
This project is actively maintained. We aim to:
- Keep dependencies updated
- Add new observability tools as they become stable
- Improve documentation and examples
- Enhance security and production readiness
- Evolve Kubernetes deployment (Ingress, persistence, security hardening, optional operator-based stack)
When adding new components or configurations:
- Update this README
- Test with the management script
- Ensure proper service discovery configuration
- Add appropriate alerting rules
- If applicable, mirror changes in `kubernetes/` manifests & docs