AI-driven Platform Operations for OpenShift Container Platform (OCP) and Cloud-Native Network Functions (CNFs).
AIOps NextGen is a unified observability and intelligence platform that provides:
- Multi-cluster fleet management and monitoring
- AI-powered insights with domain expert personas
- Real-time GPU and CNF telemetry
- Automated anomaly detection and root cause analysis
- Natural language queries across metrics, traces, and logs
Target platform: OpenShift Container Platform 4.16+ (x86_64, ARM64), designed around these principles:
- Air-Gapped Ready: Designed to work in environments without external internet access
- On-Premises First: All core components run on-premises within OpenShift
- Local LLM Preferred: Primary AI via vLLM with locally hosted models (Llama, Mistral, Qwen); external AI APIs (Gemini, Claude, ChatGPT) are supported as an optional alternative
- Self-Contained Storage: MinIO or OpenShift Data Foundation for object storage
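
The "Local LLM Preferred" and "Air-Gapped Ready" principles together imply a provider-selection rule. The following is a minimal sketch of one way such a rule could look, not the project's actual LLM Router; the `LLMProvider` class, the `select_provider` function, and both endpoint URLs are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LLMProvider:
    """Hypothetical description of one LLM endpoint."""
    name: str        # illustrative label, e.g. "vllm-local"
    base_url: str    # OpenAI-compatible endpoint (assumed URL)
    external: bool   # True if the endpoint lives outside the cluster


def select_provider(
    providers: list[LLMProvider], air_gapped: bool
) -> Optional[LLMProvider]:
    """Prefer a local provider; fall back to an external one only
    when the environment is not air-gapped."""
    local = [p for p in providers if not p.external]
    if local:
        return local[0]
    if air_gapped:
        return None  # external calls are not allowed
    return providers[0] if providers else None


# Assumed example endpoints, not taken from the project configuration.
PROVIDERS = [
    LLMProvider("external-api", "https://api.example.com/v1", external=True),
    LLMProvider("vllm-local", "http://vllm.aiops-nextgen.svc:8000/v1", external=False),
]
```

With this rule, `select_provider(PROVIDERS, air_gapped=True)` picks the local vLLM entry even though the external one is listed first, and an air-gapped deployment with only external providers configured gets `None` rather than an outbound call.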
```
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                        USERS                                        │
│                        Operators · SREs · Platform Engineers                        │
└──────────────────────────────────────────┬──────────────────────────────────────────┘
                                           │
                                           ▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                 PRESENTATION LAYER                                  │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │                          Frontend (React + TypeScript)                          │ │
│ │       Fleet Dashboard │ GPU Monitoring │ AI Chat │ Observability Explorer       │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────┬──────────────────────────────────────────┘
                                           │ HTTPS / WSS
                                           ▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                    ACCESS LAYER                                     │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │                     API Gateway (FastAPI + OpenShift OAuth)                     │ │
│ │        REST API │ WebSocket Proxy │ MCP Protocol │ Rate Limiting │ RBAC         │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
└───────────┬────────────────────┬────────────────────┬────────────────────┬──────────┘
            │                    │                    │                    │
            ▼                    ▼                    ▼                    ▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                    SERVICE LAYER                                    │
│                                                                                     │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │ Cluster         │  │ Observability   │  │ Intelligence    │  │ Real-Time       │ │
│  │ Registry        │  │ Collector       │  │ Engine          │  │ Streaming       │ │
│  │                 │  │                 │  │                 │  │                 │ │
│  │ • Fleet CRUD    │  │ • Prometheus    │  │ • LLM Router    │  │ • WebSocket Hub │ │
│  │ • Health Mon.   │  │ • Tempo Traces  │  │ • AI Personas   │  │ • Event Routing │ │
│  │ • Credentials   │  │ • Loki Logs     │  │ • Anomaly Det.  │  │ • Subscriptions │ │
│  │ • Capabilities  │  │ • GPU Telemetry │  │ • RCA Engine    │  │ • Backpressure  │ │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘ │
│           │                    │                    │                    │          │
└───────────┼────────────────────┼────────────────────┼────────────────────┼──────────┘
            │                    │                    │                    │
            ▼                    ▼                    ▼                    ▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                     DATA LAYER                                      │
│                                                                                     │
│  ┌─────────────────────┐  ┌─────────────────────┐  ┌─────────────────────────────┐  │
│  │ PostgreSQL          │  │ Redis               │  │ Object Storage (MinIO/ODF)  │  │
│  │                     │  │                     │  │                             │  │
│  │ • clusters schema   │  │ • DB 0: PubSub      │  │ • aiops-reports bucket      │  │
│  │ • intelligence      │  │ • DB 1: Rate Limit  │  │ • aiops-attachments bucket  │  │
│  │   schema            │  │ • DB 2: Cache       │  │                             │  │
│  │                     │  │ • DB 3: Sessions    │  │                             │  │
│  └─────────────────────┘  └─────────────────────┘  └─────────────────────────────┘  │
│                                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────┘
                                           │
                                           ▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                              ON-PREMISES INTEGRATIONS                               │
│                                                                                     │
│  ┌──────────────────────────────────────┐  ┌─────────────────────────────────────┐  │
│  │            SPOKE CLUSTERS            │  │            LLM INFERENCE            │  │
│  │                                      │  │                                     │  │
│  │ ┌──────────┐  ┌─────────┐  ┌───────┐ │  │ ┌─────────────────────────────────┐ │  │
│  │ │Prometheus│  │  Tempo  │  │ Loki  │ │  │ │           vLLM Server           │ │  │
│  │ │ Metrics  │  │ Traces  │  │ Logs  │ │  │ │ • Llama 3.x / Mistral / Qwen    │ │  │
│  │ └──────────┘  └─────────┘  └───────┘ │  │ │ • GPU Accelerated (A100/H100)   │ │  │
│  │                                      │  │ │ • OpenAI-Compatible API         │ │  │
│  │    100+ OCP Clusters (Hub-Spoke)     │  │ └─────────────────────────────────┘ │  │
│  └──────────────────────────────────────┘  └─────────────────────────────────────┘  │
│                                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────┘
```
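
The data layer assigns each Redis logical database a single purpose (DB 0 to DB 3). A small sketch of how a service might encode that allocation so connection URLs cannot drift from it; `RedisDB`, `redis_url`, and the default host name `redis` are hypothetical and not taken from the shared package.

```python
from enum import IntEnum


class RedisDB(IntEnum):
    """Logical Redis databases, numbered as in the data-layer diagram."""
    PUBSUB = 0
    RATE_LIMIT = 1
    CACHE = 2
    SESSIONS = 3


def redis_url(db: RedisDB, host: str = "redis", port: int = 6379) -> str:
    """Build a redis:// URL selecting the given logical database.

    The host default is an assumption for illustration only.
    """
    return f"redis://{host}:{port}/{int(db)}"
```

For example, `redis_url(RedisDB.CACHE)` yields `redis://redis:6379/2`. One caveat when reading the allocation: Redis Pub/Sub channels are not scoped to a database index, so the "DB 0: PubSub" assignment separates connections by convention rather than isolating channels.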
```
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                    REQUEST FLOWS                                    │
└─────────────────────────────────────────────────────────────────────────────────────┘

1. USER QUERY FLOW (Metrics/Traces/Logs)

┌──────┐     ┌─────────┐     ┌───────────┐     ┌───────────────┐     ┌─────────┐
│ User │────▶│ Frontend│────▶│API Gateway│────▶│ Observability │────▶│  Spoke  │
│      │◀────│         │◀────│           │◀────│   Collector   │◀────│ Cluster │
└──────┘     └─────────┘     └───────────┘     └───────────────┘     └─────────┘
                                                       │
                                                       ▼
                                                ┌─────────────┐
                                                │Cluster Reg. │ (get endpoints)
                                                └─────────────┘

2. AI CHAT FLOW (Natural Language → Tool Calls → Response)

┌──────┐     ┌─────────┐     ┌───────────┐     ┌───────────────┐
│ User │────▶│ Frontend│────▶│API Gateway│────▶│ Intelligence  │
│      │     │         │     │           │     │    Engine     │
└──────┘     └─────────┘     └───────────┘     └───────┬───────┘
    ▲                                                  │
    │ SSE Stream                                       ▼
    │                                           ┌─────────────┐
    │                                           │ LLM Provider│
    │                                           │(vLLM/ExtAPI)│
    │                                           └──────┬──────┘
    │                                                  │
    │                                   Tool Calls     ▼
    │         ┌────────────────────────────────────────────┐
    │         │                                            │
    │         ▼                                            ▼
    │ ┌───────────────┐                           ┌─────────────────┐
    └─│ Observability │ (query_metrics,           │ Cluster Registry│
      │   Collector   │  search_traces)           │ (list_clusters) │
      └───────────────┘                           └─────────────────┘

3. REAL-TIME EVENT FLOW (Push Updates)

┌─────────────────┐     ┌───────────┐     ┌──────────────┐     ┌─────────┐
│ Cluster Registry│────▶│   Redis   │────▶│  Real-Time   │────▶│ Frontend│
│ Obs. Collector  │     │  PubSub   │     │  Streaming   │ WS  │         │
│ Intel. Engine   │     │           │     │              │────▶│         │
└─────────────────┘     └───────────┘     └──────────────┘     └─────────┘
   (Publishers)                               (Router)        (Subscriber)

Events: CLUSTER_STATUS_CHANGED, ALERT_FIRED, GPU_UPDATE, ANOMALY_DETECTED, etc.

4. ANOMALY DETECTION & RCA FLOW

┌───────────────┐     ┌───────────────┐     ┌───────────┐     ┌──────────────┐
│ Observability │────▶│ Intelligence  │────▶│   Redis   │────▶│  Real-Time   │
│   Collector   │     │    Engine     │     │  PubSub   │     │  Streaming   │
│   (metrics)   │     │ (detect+RCA)  │     │           │     │              │
└───────────────┘     └───────────────┘     └───────────┘     └──────┬───────┘
                              │                                      │
                              ▼                                      ▼
                       ┌─────────────┐                        ┌─────────────┐
                       │ LLM Provider│                        │  Frontend   │
                       │  (explain)  │                        │ (alert UI)  │
                       └─────────────┘                        └─────────────┘
```
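
The real-time and anomaly flows both hand events to Redis Pub/Sub for delivery to the frontend. As a rough sketch of what a serializable event envelope for those flows might look like: the event type names come from the list in flow 3, while the `Event` class, its field names, and the example cluster ID are assumptions made for illustration rather than the shared data models.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EventType(str, Enum):
    """Event names listed in the real-time event flow."""
    CLUSTER_STATUS_CHANGED = "CLUSTER_STATUS_CHANGED"
    ALERT_FIRED = "ALERT_FIRED"
    GPU_UPDATE = "GPU_UPDATE"
    ANOMALY_DETECTED = "ANOMALY_DETECTED"


def _utc_now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class Event:
    """Hypothetical envelope; field names are illustrative."""
    type: EventType
    cluster_id: str
    payload: dict
    timestamp: str = field(default_factory=_utc_now)

    def to_json(self) -> str:
        """Serialize for publishing on a Pub/Sub channel."""
        return json.dumps({
            "type": self.type.value,
            "cluster_id": self.cluster_id,
            "payload": self.payload,
            "timestamp": self.timestamp,
        })

    @classmethod
    def from_json(cls, raw: str) -> "Event":
        """Rebuild an event on the subscriber side."""
        data = json.loads(raw)
        return cls(
            type=EventType(data["type"]),
            cluster_id=data["cluster_id"],
            payload=data["payload"],
            timestamp=data["timestamp"],
        )
```

A publisher would call `Event(EventType.GPU_UPDATE, "spoke-01", {"gpu_util": 0.93}).to_json()` and the streaming service would call `Event.from_json(...)` on the received message. Rejecting unknown `type` values at parse time (the `EventType(...)` lookup raises `ValueError`) keeps malformed events from reaching WebSocket clients.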
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React 18, TypeScript, Vite, Tailwind | Modern SPA |
| Gateway | FastAPI, OpenShift OAuth | Auth, routing, rate limiting |
| Services | Python 3.11+, FastAPI, SQLAlchemy | Microservices |
| AI/LLM | vLLM (preferred); Gemini/Claude/ChatGPT (optional) | LLM inference |
| Data | PostgreSQL 15, Redis 7, MinIO | Persistence, cache, objects |
| Observability | OpenTelemetry, Prometheus, Tempo, Loki | Telemetry |
| Deployment | Helm, Kustomize, OpenShift 4.16+ | Container orchestration |
```
aiops-nextgen/
├── README.md                           # This file
├── LICENSE                             # MIT License
├── specs/                              # Component specifications
│   ├── 00-overview.md                  # Architecture overview
│   ├── 01-data-models.md               # Shared data models
│   ├── 02-cluster-registry.md          # Cluster Registry Service
│   ├── 03-observability-collector.md   # Observability Collector
│   ├── 04-intelligence-engine.md       # AI/LLM Engine
│   ├── 05-realtime-streaming.md        # Real-time Streaming
│   ├── 06-api-gateway.md               # API Gateway
│   ├── 07-frontend.md                  # Frontend Application
│   ├── 08-integration-matrix.md        # Integration contracts
│   └── 09-deployment.md                # OpenShift deployment
├── deploy/                             # Deployment manifests
│   ├── helm/                           # Helm charts (future)
│   └── openshift/                      # Kustomize overlays for OpenShift
└── src/                                # Source code
    ├── shared/                         # Shared Python package (models, db, redis, config)
    ├── cluster-registry/               # Fleet management service
    ├── observability-collector/        # Metrics federation service
    ├── intelligence-engine/            # AI/LLM service
    ├── realtime-streaming/             # WebSocket service
    ├── api-gateway/                    # Entry point service
    ├── frontend/                       # React SPA (pending)
    ├── docker-compose.yml              # Local development stack
    └── development-plan.md             # Implementation roadmap
```
| Phase | Focus | Status |
|---|---|---|
| 1 | Foundation & Data Layer (shared models, PostgreSQL, Redis) | ✅ Complete |
| 2 | Cluster Registry (fleet CRUD, health monitoring, events) | ✅ Complete |
| 3 | Observability Collector (metrics, alerts, GPU telemetry) | ✅ Complete |
| 4 | Intelligence Engine (LLM, personas, chat, tool calling) | ✅ Complete |
| 5 | Real-time Streaming & API Gateway | ✅ Complete |
| 6 | Frontend (React SPA) | Pending |
See src/development-plan.md for detailed task tracking.
| Environment | CPU | Memory | Storage |
|---|---|---|---|
| Development | 2.6 cores | 4.4 Gi | 11 Gi |
| Production (HA) | 8.2 cores | 14.5 Gi | 55 Gi |
| + Local LLM (3B) | +4 cores | +16 Gi | +50 Gi |
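
The "+ Local LLM (3B)" row reads as an increment on top of either environment. Assuming it is purely additive, the combined footprint can be worked out as in this short sketch; the dictionary keys and the `total_with_local_llm` function are illustrative names, and `Decimal` is used only to keep the one-decimal figures exact.

```python
from decimal import Decimal

# Base requirements copied from the table above.
BASE = {
    "development": {
        "cpu_cores": Decimal("2.6"),
        "memory_gi": Decimal("4.4"),
        "storage_gi": Decimal("11"),
    },
    "production_ha": {
        "cpu_cores": Decimal("8.2"),
        "memory_gi": Decimal("14.5"),
        "storage_gi": Decimal("55"),
    },
}

# Increment for a locally hosted 3B model, from the last table row.
LOCAL_LLM_3B = {
    "cpu_cores": Decimal("4"),
    "memory_gi": Decimal("16"),
    "storage_gi": Decimal("50"),
}


def total_with_local_llm(environment: str) -> dict:
    """Add the local-LLM increment to an environment's base footprint
    (assumes the increment is simply additive)."""
    base = BASE[environment]
    return {key: base[key] + LOCAL_LLM_3B[key] for key in base}
```

Under that assumption, development with a local 3B model needs 6.6 cores, 20.4 Gi of memory, and 61 Gi of storage, and production (HA) needs 12.2 cores, 30.5 Gi, and 105 Gi.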
```bash
# Clone repository
git clone https://github.com/open-experiments/aiops-nextgen.git
cd aiops-nextgen

# Start infrastructure (PostgreSQL + Redis)
cd src && docker-compose up -d postgresql redis

# Start services (each in a separate terminal, from the repository root)
cd src/cluster-registry && uvicorn app.main:app --reload --port 8001
cd src/observability-collector && uvicorn app.main:app --reload --port 8002
cd src/intelligence-engine && uvicorn app.main:app --reload --port 8003
cd src/realtime-streaming && uvicorn app.main:app --reload --port 8004
cd src/api-gateway && uvicorn app.main:app --reload --port 8000
```

```bash
# Login to OpenShift
oc login --token=<token> --server=<api-server>

# Create namespace and deploy
oc new-project aiops-nextgen
oc apply -k deploy/openshift/

# Verify deployment
oc get pods -n aiops-nextgen
```

| Service | Port | Health Check |
|---|---|---|
| API Gateway | 8000 | GET /health, GET /ready |
| Cluster Registry | 8001 | GET /health, GET /ready |
| Observability Collector | 8002 | GET /health, GET /ready |
| Intelligence Engine | 8003 | GET /health, GET /ready |
| Real-Time Streaming | 8004 | GET /health, GET /ready |
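
Since every service exposes the same two endpoints, a local stack can be checked with a few lines of standard-library Python. This is a sketch only: the ports and paths come from the table above, while the `SERVICES` mapping, the function names, the `localhost` default, and the assumption that HTTP 200 means healthy are illustrative.

```python
import urllib.request

# Service name -> port, as listed in the table above.
SERVICES = {
    "api-gateway": 8000,
    "cluster-registry": 8001,
    "observability-collector": 8002,
    "intelligence-engine": 8003,
    "realtime-streaming": 8004,
}


def probe_urls(host: str = "localhost") -> dict[str, list[str]]:
    """Return the /health and /ready URLs for each service."""
    return {
        name: [f"http://{host}:{port}/health", f"http://{host}:{port}/ready"]
        for name, port in SERVICES.items()
    }


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200.

    Treating 200 as healthy is an assumption; connection errors,
    timeouts, and HTTP error statuses all count as down.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError and HTTPError are OSError subclasses
        return False
```

A quick status report is then a loop over `probe_urls().items()` that prints `is_up(url)` for each URL. On OpenShift the same two paths would more naturally be wired into liveness and readiness probes than polled by a script.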