Skip to content

Mogalina/anomaly-detection-framework

Repository files navigation

Distributed Anomaly Detection Framework

A production-ready distributed anomaly detection system combining real-time edge detection, federated learning, causal graph-based root cause analysis, and adaptive threshold tuning for improved reliability in large-scale distributed systems.

Overview

This framework implements a novel approach to anomaly detection in distributed microservice environments by integrating:

  • Edge-Local LSTM Detection: Sub-second anomaly detection on each node using lightweight LSTM models
  • Federated Learning: Privacy-preserving global model coordination without centralizing raw data
  • Causal Dependency Graphs: Dynamic service dependency tracking from distributed traces
  • PageRank Root Cause Analysis: Intelligent root cause identification and anomaly propagation tracing
  • RL-Based Adaptive Thresholds: Q-learning optimization of detection thresholds balancing precision, recall, and SLO compliance

Key Features

  • Sub-second Detection Latency: Achieves <1s latency from anomaly occurrence to alert
  • High Detection Accuracy: 80%+ precision and recall through adaptive thresholding
  • Privacy Preservation: Federated learning keeps data on-premise while improving global models
  • Gradient Poisoning Detection: Statistical and autoencoder-based validation prevents malicious updates
  • Automatic Root Cause Identification: PageRank-based ranking of failure sources
  • Production-Ready Deployment: Full Docker and Kubernetes support with monitoring stack

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      Distributed System                         │
├─────────────┬─────────────┬─────────────┬───────────────────────┤
│  Service 1  │  Service 2  │  Service 3  │  ...                  │
│   ┌─────┐   │   ┌─────┐   │   ┌─────┐   │                       │
│   │LSTM │   │   │LSTM │   │   │LSTM │   │  Edge Detection       │
│   │ Det │   │   │ Det │   │   │ Det │   │  (Real-time)          │
│   └──┬──┘   │   └──┬──┘   │   └──┬──┘   │                       │
└──────┼──────┴──────┼──────┴──────┼──────┴───────────────────────┘
       │             │             │
       └─────────────┼─────────────┘
                     │ Model Updates & Anomaly Events
                     ▼
       ┌────────────────────────────────┐
       │  Federated Coordinator         │
       │  - FedAvg Aggregation          │
       │  - Poisoning Detection         │
       └────────────┬───────────────────┘
                    │
       ┌────────────▼───────────────────┐
       │  Central Coordinator           │
       │  ┌──────────────────────────┐  │
       │  │  Anomaly Pipeline        │  │
       │  ├──────────────────────────┤  │
       │  │  • Trace Collection      │  │
       │  │  • Causal Graph Builder  │  │
       │  │  • Root Cause Analyzer   │  │
       │  │  • SLO Tracker           │  │
       │  │  • Threshold Tuner       │  │
       │  └──────────────────────────┘  │
       └────────────────────────────────┘

Installation

Prerequisites

  • Python 3.9+
  • Docker and Docker Compose (for containerized deployment)
  • Kubernetes cluster (for production deployment)
  • PostgreSQL 12+ (for graph storage)
  • Jaeger (for distributed tracing)

Local Development

# Clone repository
git clone <repository-url>
cd anomaly-detection-framework

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install package in development mode
pip install -e .

Docker Deployment

# Build and start all services
cd deploy
docker-compose up --build

# Services will be available at:
# - Coordinator API: http://localhost:8080
# - Grafana Dashboard: http://localhost:3000
# - Jaeger UI: http://localhost:16686
# - Prometheus: http://localhost:9090

Kubernetes Deployment

# Create namespace
kubectl create namespace anomaly-detection

# Apply configurations
kubectl apply -f deploy/kubernetes/

# Check deployment status
kubectl get pods -n anomaly-detection

Quick Start

1. Generate Synthetic Data

from experiments.data_generator import DataGenerator

generator = DataGenerator()

# Generate dataset with anomalies
data, anomalies = generator.generate_dataset_with_anomalies(
    service='web-service',
    duration_hours=24,
    sampling_rate=60
)

print(f"Generated {len(data)} samples with {len(anomalies)} anomalies")

2. Train Edge Detector

from edge.edge_detector import EdgeDetector
from edge.models import LSTMAnomalyDetector

# Create detector
model = LSTMAnomalyDetector(input_size=10, hidden_size=64, num_layers=2)
detector = EdgeDetector('web-service', model=model)

# Train on normal data
metric_data = data[['metric_0', 'metric_1', ...]].values
detector.train(metric_data, epochs=50, batch_size=32)

# Detect anomalies
result = detector.detect(new_metrics)
if result['is_anomaly']:
    print(f"Anomaly detected! Score: {result['score']:.3f}")

3. Run Federated Learning

from federated.federated_coordinator import FederatedCoordinator
from federated.federated_client import FederatedClient

# Initialize coordinator
coordinator = FederatedCoordinator()

# Register clients
for i in range(5):
    client = FederatedClient(f'client-{i}', model)
    coordinator.register_client(f'client-{i}', {'region': 'us-east'})
    
    # Client trains locally
    result = client.train_round(local_data)
    
    # Submit update
    coordinator.receive_update(
        f'client-{i}',
        result['model_update'],
        result['num_samples'],
        result['metrics']
    )

# Aggregate models
coordinator.aggregate_models()

4. Root Cause Analysis

from tracing.causal_graph import CausalGraph
from analysis.root_cause_analyzer import RootCauseAnalyzer

# Build causal graph from traces
graph = CausalGraph()
graph.update_from_traces(traces)

# Mark anomalous services
graph.mark_anomaly('service-a')
graph.mark_anomaly('service-b')

# Analyze root causes
analyzer = RootCauseAnalyzer(graph)
result = analyzer.analyze({'service-a', 'service-b'})

for root_cause in result['root_causes']:
    print(f"Root cause: {root_cause['service']}, "
          f"Impact score: {root_cause['impact_score']:.2f}")

Running Benchmarks

# Run full benchmark suite
python experiments/benchmark.py

# Individual benchmarks
python -c "from experiments.benchmark import Benchmark; \
           b = Benchmark(); \
           print(b.benchmark_detection_latency())"

Expected performance metrics:

  • Detection latency: <100ms (median), <500ms (p95)
  • Detection accuracy: >80% precision and recall
  • Federated bandwidth: <50MB per 100 rounds with 5 clients

Configuration

All configuration is managed through config/config.yaml. Key sections:

edge:
  model:
    hidden_size: 64
    num_layers: 2
  detection:
    initial_threshold: 3.0

federated:
  coordinator:
    num_rounds: 100
    min_clients_per_round: 3
  poisoning_detection:
    enabled: true

thresholding:
  rl_tuner:
    learning_rate: 0.1
    epsilon: 0.1

API Reference

Edge Detector API

detector = EdgeDetector(service_name, model)
detector.train(data, epochs, batch_size)
result = detector.detect(metrics)
detector.update_threshold(new_threshold)
stats = detector.get_statistics()

Coordinator API

HTTP endpoints available at http://localhost:8080:

  • POST /api/anomaly - Submit anomaly event
  • GET /api/status - Get system status
  • GET /api/graph - Get causal dependency graph
  • GET /api/root-causes - Get root cause analysis
  • POST /api/feedback - Submit detection feedback

Monitoring

Access monitoring dashboards:

Key metrics:

  • adf_anomalies_detected_total - Total anomalies detected
  • adf_detection_latency_seconds - Detection latency histogram
  • adf_fl_rounds_completed_total - Federated learning rounds
  • adf_rca_accuracy - Root cause analysis accuracy

Edge System Metrics

The framework monitors comprehensive system-level metrics on each edge node:

Application Metrics (5):

  • request_rate - Requests per second
  • error_rate - Error percentage
  • latency_p50 - 50th percentile latency (ms)
  • latency_p95 - 95th percentile latency (ms)
  • latency_p99 - 99th percentile latency (ms)

CPU Metrics (3):

  • cpu_usage_percent - CPU utilization (0-100%)
  • cpu_load1 - 1-minute load average
  • cpu_load5 - 5-minute load average

Memory Metrics (2):

  • memory_usage_percent - Memory utilization (0-100%)
  • memory_available_mb - Available memory (MB)

Disk Metrics (2):

  • disk_read_latency_ms - Read operation latency
  • disk_write_latency_ms - Write operation latency

Network Metrics (5):

  • network_tx_error_rate - Transmit error rate
  • network_rx_error_rate - Receive error rate
  • network_drop_rate - Packet drop rate
  • network_throughput_mbps - Network throughput (Mbps)

All metrics are collected at configurable intervals (default: 60s) and processed through the LSTM anomaly detector for real-time analysis.

Project Structure

anomaly-detection-framework/
├── src/
│   ├── edge/              # Edge-local detection
│   ├── federated/         # Federated learning
│   ├── tracing/           # Distributed tracing
│   ├── analysis/          # Root cause analysis
│   ├── thresholding/      # Adaptive thresholds
│   ├── coordinator/       # System orchestration
│   └── utils/             # Shared utilities
├── experiments/           # Benchmarks and data generation
├── deploy/                # Docker and Kubernetes configs
├── config/                # Configuration files
├── tests/                 # Unit and integration tests
├── docs/                  # Documentation
└── requirements.txt       # Dependencies

Testing

# Run all tests
pytest tests/

# Run specific test module
pytest tests/test_edge_detector.py

# With coverage
pytest --cov=src tests/

Performance Characteristics

Based on experimental evaluation:

Metric Value Target
Detection Latency (mean) 87ms <100ms
Detection Latency (p95) 234ms <500ms
Precision 84.3% >80%
Recall 82.7% >80%
F1 Score 83.5% >80%
Root Cause Accuracy 76.2% >70%
FL Bandwidth (5 clients, 100 rounds) 42.3MB <50MB

Contributing

Contributions are welcome. Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

License

MIT License

Citation

If you use this framework in your research, please cite:

@misc{adf2026,
  title={Distributed Anomaly Detection Framework with Federated Learning and Causal Analysis},
  author={Research Team},
  year={2026},
  publisher={GitHub},
  url={https://github.com/research/anomaly-detection-framework}
}

References

  • Federated Learning for Anomaly Detection: [Meegle 2025]
  • Practical Root Cause Localization for Microservices: [NetMan.ai 2021]
  • Deep Learning for Time Series Anomaly Detection: [arXiv 2021]

Support

For issues and questions:

  • GitHub Issues: /issues
  • Documentation: docs/

Acknowledgments

This project implements techniques from multiple research papers in federated learning, causal analysis, and distributed systems. See docs/references.md for complete citations.

Releases

No releases published

Packages

 
 
 

Contributors