86 changes: 82 additions & 4 deletions experiments/README.md
@@ -1,14 +1,18 @@
# Semian Experimental Resource
# Semian Experimental Resources

This directory contains an experimental resource adapter for running complex experiments with Semian.
This directory contains experimental resource adapters for running complex experiments with Semian.

## Overview

The `ExperimentalResource` class simulates a distributed service with multiple endpoints, each with configurable latencies following statistical distributions. This allows for testing various failure scenarios and performance characteristics.
Two resource types are available:

1. **ExperimentalResource** - Simulates a distributed service with multiple endpoints, each with configurable latencies following statistical distributions. Ideal for testing various failure scenarios and performance characteristics with synthetic traffic.

2. **TrafficReplayExperimentalResource** - Replays real production traffic patterns from Grafana exports, allowing you to test how your system would behave during actual incidents by simulating the exact latency patterns observed in production.

## Features

### Current Implementation
### ExperimentalResource (Synthetic Traffic)

1. **Multiple Endpoints**: Configure any number of endpoints, each with its own fixed latency
2. **Statistical Distributions**: Latencies are assigned based on statistical distributions
@@ -26,8 +30,20 @@ The `ExperimentalResource` class simulates a distributed service with multiple e
- **Error rate changes**: Modify error rate for the entire service
- **Gradual ramp-up**: Both degradations support gradual transitions over time
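
One way to picture a gradual ramp-up is linear interpolation between a baseline value and a target value. The sketch below is an illustration only: `ramped_value` and its parameters are hypothetical names, and the linear shape is an assumption, not something this README states about `ExperimentalResource`.

```ruby
# Hypothetical helper (not part of Semian): the value `elapsed` seconds into a
# ramp that moves linearly from `from` to `to` over `ramp_duration` seconds.
def ramped_value(from:, to:, elapsed:, ramp_duration:)
  # A zero-length ramp means the change applies immediately.
  return to if ramp_duration <= 0

  # Fraction of the ramp completed, clamped to [0.0, 1.0].
  progress = (elapsed.to_f / ramp_duration).clamp(0.0, 1.0)
  from + (to - from) * progress
end

# Error rate ramping from 1% to 20% over 10 seconds, sampled halfway through.
halfway = ramped_value(from: 0.01, to: 0.20, elapsed: 5, ramp_duration: 10)
puts "Error rate at 5s: #{(halfway * 100).round(1)}%"
```

The same helper would apply to a latency degradation by passing latencies instead of error rates.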

### TrafficReplayExperimentalResource (Production Traffic Replay)

1. **Real Traffic Patterns**: Load and replay latency patterns from Grafana JSON exports
2. **Time-Based Simulation**: Simulates requests as though an incident were happening in real time
   - Matches request latencies to timeline offsets
   - Uses elapsed time since service start to find corresponding latencies
3. **Automatic Completion**: Stops accepting requests once the timeline is exceeded
4. **Request Timeouts**: Configure a maximum timeout for requests
5. **Simple Interface**: No need to configure endpoints, distributions, or error rates - everything comes from the log file
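
The interaction between automatic completion and request timeouts can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the resource's actual implementation: `simulate_request`, `ReplayTimeoutError`, `ReplayCompleteError`, and the injectable `sleeper` are all hypothetical, and it assumes a request whose replayed latency exceeds the timeout waits for the timeout and then fails.

```ruby
# Hypothetical error classes, for illustration only.
class ReplayTimeoutError < StandardError; end
class ReplayCompleteError < StandardError; end

# Simulates a single request. `latency` is in seconds, or nil when the
# replay timeline has been exhausted. `sleeper` is injectable so the
# behavior can be exercised without real waiting.
def simulate_request(latency, timeout:, sleeper: ->(seconds) { sleep(seconds) })
  raise ReplayCompleteError, "timeline exceeded" if latency.nil?

  if latency > timeout
    sleeper.call(timeout) # wait only as long as the timeout allows
    raise ReplayTimeoutError, "request exceeded #{timeout}s timeout"
  end

  sleeper.call(latency)
  block_given? ? yield(latency) : latency
end

# A 2ms request under a 100ms timeout completes normally.
simulate_request(0.002, timeout: 0.1) do |latency|
  puts "served in #{(latency * 1000).round(2)}ms"
end
```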

## Usage

### Synthetic Traffic Generation

See `example_with_circuit_breaker.rb` for usage:

```
bundle exec ruby example_with_circuit_breaker.rb
```
Output:

![](./example_output.png)

### Traffic Replay Mode

The traffic replay feature allows you to simulate real production incidents by replaying latency patterns from Grafana exports.

#### How It Works

1. Export traffic data from Grafana as JSON (one JSON object per line)
2. Initialize the resource with the `traffic_log_path` parameter
3. The service will simulate latencies based on elapsed time since initialization
4. When a request comes in at time T seconds after service start, it uses the latency from the log entry at offset T
5. When the service has been running longer than the log timeline, it stops accepting requests
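
The lookup in steps 3 to 5 can be sketched as below. This is a hedged illustration rather than the resource's real code: `ReplayTimeline` is a hypothetical class, and it assumes that "the log entry at offset T" means the latest entry whose offset is at or before T.

```ruby
# Hypothetical class (not part of Semian): maps elapsed seconds to a latency.
class ReplayTimeline
  # entries: non-empty array of [offset_seconds, latency_seconds], sorted by offset.
  def initialize(entries)
    raise ArgumentError, "entries must not be empty" if entries.empty?

    @entries = entries
  end

  # Length of the timeline in seconds (offset of the last entry).
  def duration
    @entries.last[0]
  end

  # Returns the latency of the latest entry whose offset is <= elapsed,
  # or nil once elapsed runs past the end of the timeline.
  def latency_at(elapsed)
    return nil if elapsed > duration

    # Binary search for the first entry that starts after `elapsed`.
    idx = @entries.bsearch_index { |offset, _latency| offset > elapsed }
    idx = idx.nil? ? @entries.size - 1 : idx - 1
    @entries[[idx, 0].max][1]
  end
end

timeline = ReplayTimeline.new([[0.0, 0.0025], [0.5, 0.0032], [1.0, 0.0028]])
timeline.latency_at(0.7) # uses the entry at offset 0.5
timeline.latency_at(1.5) # nil: the timeline has been exceeded
```

In a running service, `elapsed` would typically come from a monotonic clock reading taken at initialization, so that wall-clock adjustments do not shift the replay.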

#### Required JSON Format

Each line in the JSON file should be a complete JSON object with:
- `timestamp`: ISO8601 timestamp (e.g., `"2025-10-02T16:19:30.814890047Z"`)
- `attrs.db.sql.total_duration_ms`: Database latency in milliseconds

Example:
```json
{"timestamp": "2025-10-02T16:19:30.814890047Z", "attrs.db.sql.total_duration_ms": 2.5, "attrs.db.sql.total_count": 1}
{"timestamp": "2025-10-02T16:19:31.314890047Z", "attrs.db.sql.total_duration_ms": 5.8, "attrs.db.sql.total_count": 2}
{"timestamp": "2025-10-02T16:19:31.814890047Z", "attrs.db.sql.total_duration_ms": 12.3, "attrs.db.sql.total_count": 3}
```

If a log entry doesn't have `attrs.db.sql.total_duration_ms`, its latency is treated as 0 ms.
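
As a rough sketch of how a log in this format could be read, the snippet below parses JSON-lines text into `[offset_seconds, latency_seconds]` pairs, with offsets measured from the earliest timestamp. `parse_traffic_log` is a hypothetical helper written for this illustration and is not the resource's actual loader.

```ruby
require "json"
require "time"

# Hypothetical parser (not part of Semian): turns JSON-lines text into
# [offset_seconds, latency_seconds] pairs sorted by time.
def parse_traffic_log(text)
  rows = text.each_line.map(&:strip).reject(&:empty?).map do |line|
    record = JSON.parse(line)
    timestamp = Time.iso8601(record.fetch("timestamp"))
    # A missing duration field is treated as 0 ms.
    latency_ms = record.fetch("attrs.db.sql.total_duration_ms", 0).to_f
    [timestamp, latency_ms / 1000.0]
  end

  rows.sort_by!(&:first)
  start = rows.first&.first
  rows.map { |timestamp, latency| [timestamp - start, latency] }
end

log = <<~LOG
  {"timestamp": "2025-10-02T16:19:30.814890047Z", "attrs.db.sql.total_duration_ms": 2.5}
  {"timestamp": "2025-10-02T16:19:31.314890047Z"}
LOG

entries = parse_traffic_log(log)
entries.each do |offset, latency|
  puts "offset=#{offset.round(3)}s latency=#{(latency * 1000).round(2)}ms"
end
```

Converting milliseconds to seconds here matches the usage example, where the yielded latency is multiplied by 1000 before printing.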

#### Example Usage

```ruby
resource = Semian::Experiments::TrafficReplayExperimentalResource.new(
  name: "my_service",
  traffic_log_path: "path/to/grafana_export.json",
  timeout: 30.0,
  semian: {
    circuit_breaker: true,
    success_threshold: 2,
    error_threshold: 3,
    error_threshold_timeout: 10,
  }
)

# Make requests - they'll be served with latencies from the log
begin
  resource.request do |latency|
    puts "Request completed with latency: #{(latency * 1000).round(2)}ms"
  end
rescue Semian::Experiments::TrafficReplayExperimentalResource::TrafficReplayCompleteError
  puts "Traffic replay completed!"
end
```

#### Running the Example

A complete example with a sample traffic log is provided:

```bash
bundle exec ruby example_with_traffic_replay.rb sample_traffic_log.json
```

The sample log simulates a 12-second incident where latency spikes from ~2ms to over 300ms and then recovers.
Binary file modified experiments/example_output.png
103 changes: 103 additions & 0 deletions experiments/example_with_traffic_replay.rb
@@ -0,0 +1,103 @@
#!/usr/bin/env ruby
# frozen_string_literal: true

require_relative "traffic_replay_experimental_resource"

# Example usage of TrafficReplayExperimentalResource

puts "=== Semian ExperimentalResource - Traffic Replay Example ==="
puts

# Example 1: Create a resource with traffic replay from Grafana export
puts "Example: Using traffic replay from Grafana export"
puts "-" * 60

# To use this example, you need a Grafana export JSON file
# where each line is a JSON object with:
# - "timestamp": ISO8601 timestamp
# - "attrs.db.sql.total_duration_ms": latency in milliseconds

traffic_log_path = ARGV[0] || "path/to/grafana_export.json"

unless File.exist?(traffic_log_path)
  puts "ERROR: Traffic log file not found: #{traffic_log_path}"
  puts
  puts "Usage: ruby #{__FILE__} <path_to_grafana_export.json>"
  puts
  puts "The JSON file should contain one JSON object per line, with fields:"
  puts ' - "timestamp": ISO8601 timestamp (e.g., "2025-10-02T16:19:30.814890047Z")'
  puts ' - "attrs.db.sql.total_duration_ms": latency in milliseconds'
  puts
  # Shown on a single line because the log format is one JSON object per line
  puts "Example JSON line:"
  puts '{"timestamp": "2025-10-02T16:19:30.814890047Z", "attrs.db.sql.total_duration_ms": 5.2, ... other fields ...}'
  exit 1
end

# Initialized before the begin block so the Interrupt handler can always report it
request_count = 0

begin
  # Create resource with traffic replay
  resource = Semian::Experiments::TrafficReplayExperimentalResource.new(
    name: "my_service",
    traffic_log_path: traffic_log_path, # Path to Grafana JSON export
    timeout: 0.1, # 100ms timeout
    semian: {
      circuit_breaker: true,
      success_threshold: 2,
      error_threshold: 3,
      error_threshold_timeout: 10,
      error_timeout: 0.2,
    },
  )

  puts
  puts "Resource created successfully!"
  puts "Starting to process requests..."
  puts "Press Ctrl+C to stop"
  puts

  # Make requests continuously until the timeline is exhausted
  loop do
    result = resource.request do |latency|
      request_count += 1
      puts "[#{Time.now.strftime("%H:%M:%S")}] Request ##{request_count} - " \
        "Latency: #{(latency * 1000).round(2)}ms"
      { latency: latency, request_number: request_count }
    end

    # Small delay between requests to avoid overwhelming the output
    sleep(0.1)
  rescue Semian::Experiments::TrafficReplayExperimentalResource::TrafficReplayCompleteError => e
    puts
    puts "Traffic replay completed!"
    puts "Total requests processed: #{request_count}"
    break
  rescue Semian::Experiments::TrafficReplayExperimentalResource::CircuitOpenError => e
    puts "[#{Time.now.strftime("%H:%M:%S")}] Circuit breaker is OPEN - #{e.message}"
    sleep(1)
  rescue Semian::Experiments::TrafficReplayExperimentalResource::TimeoutError => e
    puts "[#{Time.now.strftime("%H:%M:%S")}] Timeout: #{e.message}"
    sleep(1)
  rescue Semian::Experiments::TrafficReplayExperimentalResource::ResourceBusyError => e
    puts "[#{Time.now.strftime("%H:%M:%S")}] Resource busy: #{e.message}"
    sleep(1)
  rescue => e
    puts "[#{Time.now.strftime("%H:%M:%S")}] Error: #{e.class} - #{e.message}"
    sleep(0.5)
  end

  puts
  puts "=== Replay Complete ==="
rescue ArgumentError => e
  puts "ERROR: #{e.message}"
  exit(1)
rescue Interrupt
  puts
  puts
  puts "=== Interrupted by user ==="
  puts "Total requests processed: #{request_count}"
  exit(0)
end
26 changes: 26 additions & 0 deletions experiments/sample_traffic_log.json
@@ -0,0 +1,26 @@
{"timestamp": "2025-10-02T16:19:30.000000000Z", "attrs.db.sql.total_duration_ms": 2.5, "attrs.db.sql.total_count": 1}
{"timestamp": "2025-10-02T16:19:30.500000000Z", "attrs.db.sql.total_duration_ms": 3.2, "attrs.db.sql.total_count": 1}
{"timestamp": "2025-10-02T16:19:31.000000000Z", "attrs.db.sql.total_duration_ms": 2.8, "attrs.db.sql.total_count": 1}
{"timestamp": "2025-10-02T16:19:31.500000000Z", "attrs.db.sql.total_duration_ms": 5.1, "attrs.db.sql.total_count": 2}
{"timestamp": "2025-10-02T16:19:32.000000000Z", "attrs.db.sql.total_duration_ms": 8.5, "attrs.db.sql.total_count": 2}
{"timestamp": "2025-10-02T16:19:32.500000000Z", "attrs.db.sql.total_duration_ms": 15.3, "attrs.db.sql.total_count": 3}
{"timestamp": "2025-10-02T16:19:33.000000000Z", "attrs.db.sql.total_duration_ms": 25.7, "attrs.db.sql.total_count": 4}
{"timestamp": "2025-10-02T16:19:33.500000000Z", "attrs.db.sql.total_duration_ms": 45.2, "attrs.db.sql.total_count": 5}
{"timestamp": "2025-10-02T16:19:34.000000000Z", "attrs.db.sql.total_duration_ms": 78.4, "attrs.db.sql.total_count": 8}
{"timestamp": "2025-10-02T16:19:34.500000000Z", "attrs.db.sql.total_duration_ms": 125.6, "attrs.db.sql.total_count": 10}
{"timestamp": "2025-10-02T16:19:35.000000000Z", "attrs.db.sql.total_duration_ms": 187.3, "attrs.db.sql.total_count": 15}
{"timestamp": "2025-10-02T16:19:35.500000000Z", "attrs.db.sql.total_duration_ms": 245.8, "attrs.db.sql.total_count": 18}
{"timestamp": "2025-10-02T16:19:36.000000000Z", "attrs.db.sql.total_duration_ms": 298.2, "attrs.db.sql.total_count": 20}
{"timestamp": "2025-10-02T16:19:36.500000000Z", "attrs.db.sql.total_duration_ms": 312.5, "attrs.db.sql.total_count": 22}
{"timestamp": "2025-10-02T16:19:37.000000000Z", "attrs.db.sql.total_duration_ms": 287.6, "attrs.db.sql.total_count": 20}
{"timestamp": "2025-10-02T16:19:37.500000000Z", "attrs.db.sql.total_duration_ms": 234.3, "attrs.db.sql.total_count": 18}
{"timestamp": "2025-10-02T16:19:38.000000000Z", "attrs.db.sql.total_duration_ms": 178.9, "attrs.db.sql.total_count": 15}
{"timestamp": "2025-10-02T16:19:38.500000000Z", "attrs.db.sql.total_duration_ms": 125.4, "attrs.db.sql.total_count": 12}
{"timestamp": "2025-10-02T16:19:39.000000000Z", "attrs.db.sql.total_duration_ms": 82.1, "attrs.db.sql.total_count": 8}
{"timestamp": "2025-10-02T16:19:39.500000000Z", "attrs.db.sql.total_duration_ms": 45.7, "attrs.db.sql.total_count": 5}
{"timestamp": "2025-10-02T16:19:40.000000000Z", "attrs.db.sql.total_duration_ms": 25.3, "attrs.db.sql.total_count": 3}
{"timestamp": "2025-10-02T16:19:40.500000000Z", "attrs.db.sql.total_duration_ms": 12.8, "attrs.db.sql.total_count": 2}
{"timestamp": "2025-10-02T16:19:41.000000000Z", "attrs.db.sql.total_duration_ms": 6.5, "attrs.db.sql.total_count": 1}
{"timestamp": "2025-10-02T16:19:41.500000000Z", "attrs.db.sql.total_duration_ms": 3.8, "attrs.db.sql.total_count": 1}
{"timestamp": "2025-10-02T16:19:42.000000000Z", "attrs.db.sql.total_duration_ms": 2.7, "attrs.db.sql.total_count": 1}
