Shopify · adriangudas · Oct 8, 2025 · Oct 15, 2025
diff --git a/experiments/README.md b/experiments/README.md
@@ -1,14 +1,18 @@
-# Semian Experimental Resource
+# Semian Experimental Resources
 
-This directory contains an experimental resource adapter for running complex experiments with Semian.
+This directory contains experimental resource adapters for running complex experiments with Semian.
 
 ## Overview
 
-The `ExperimentalResource` class simulates a distributed service with multiple endpoints, each with configurable latencies following statistical distributions. This allows for testing various failure scenarios and performance characteristics.
+Two resource types are available:
+
+1. **ExperimentalResource** - Simulates a distributed service with multiple endpoints, each with configurable latencies following statistical distributions. Ideal for testing various failure scenarios and performance characteristics with synthetic traffic.
+
+2. **TrafficReplayExperimentalResource** - Replays real production traffic patterns from Grafana exports, allowing you to test how your system would behave during actual incidents by simulating the exact latency patterns observed in production.
 
 ## Features
 
-### Current Implementation
+### ExperimentalResource (Synthetic Traffic)
 
 1. **Multiple Endpoints**: Configure any number of endpoints, each with its own fixed latency
 2. **Statistical Distributions**: Latencies are assigned based on statistical distributions
@@ -26,8 +30,20 @@ The `ExperimentalResource` class simulates a distributed service with multiple e
    - **Error rate changes**: Modify error rate for the entire service
    - **Gradual ramp-up**: Both degradations support gradual transitions over time
 
+### TrafficReplayExperimentalResource (Production Traffic Replay)
+
+1. **Real Traffic Patterns**: Load and replay latency patterns from Grafana JSON exports
+2. **Time-Based Simulation**: Simulates requests as though an incident were happening in real-time
+   - Matches request latencies to timeline offsets
+   - Uses elapsed time since service start to find corresponding latencies
+3. **Automatic Completion**: Stops accepting requests once the log timeline is exhausted
+4. **Request Timeouts**: Configure a maximum timeout for requests
+5. **Simple Interface**: No need to configure endpoints, distributions, or error rates - everything comes from the log file
+
 ## Usage
 
+### Synthetic Traffic Generation
+
 See `example_with_circuit_breaker.rb` for usage:
 
 ```
@@ -38,3 +54,65 @@ bundle exec ruby example_with_circuit_breaker.rb
 Output:
 
 ![](./example_output.png)
+
+### Traffic Replay Mode
+
+The traffic replay feature allows you to simulate real production incidents by replaying latency patterns from Grafana exports.
+
+#### How It Works
+
+1. Export traffic data from Grafana as JSON (one JSON object per line)
+2. Initialize the resource with `traffic_log_path` parameter
+3. The service will simulate latencies based on elapsed time since initialization
+4. When a request comes in at time T seconds after service start, it uses the latency from the log entry at offset T
+5. Once the service has been running longer than the log timeline, it stops accepting requests and raises `TrafficReplayCompleteError`
+
+#### Required JSON Format
+
+Each line in the JSON file should be a complete JSON object with:
+- `timestamp`: ISO8601 timestamp (e.g., `"2025-10-02T16:19:30.814890047Z"`)
+- `attrs.db.sql.total_duration_ms`: Database latency in milliseconds
+
+Example:
+```json
+{"timestamp": "2025-10-02T16:19:30.814890047Z", "attrs.db.sql.total_duration_ms": 2.5, "attrs.db.sql.total_count": 1}
+{"timestamp": "2025-10-02T16:19:31.314890047Z", "attrs.db.sql.total_duration_ms": 5.8, "attrs.db.sql.total_count": 2}
+{"timestamp": "2025-10-02T16:19:31.814890047Z", "attrs.db.sql.total_duration_ms": 12.3, "attrs.db.sql.total_count": 3}
+```
+
+If a log entry doesn't have `attrs.db.sql.total_duration_ms`, it's treated as 0ms latency.
+
+#### Example Usage
+
+```ruby
+resource = Semian::Experiments::TrafficReplayExperimentalResource.new(
+  name: "my_service",
+  traffic_log_path: "path/to/grafana_export.json",
+  timeout: 30.0, # seconds
+  semian: {
+    circuit_breaker: true,
+    success_threshold: 2,
+    error_threshold: 3,
+    error_threshold_timeout: 10,
+  }
+)
+
+# Make requests - they'll be served with latencies from the log
+begin
+  resource.request do |latency|
+    puts "Request completed with latency: #{(latency * 1000).round(2)}ms"
+  end
+rescue Semian::Experiments::TrafficReplayExperimentalResource::TrafficReplayCompleteError
+  puts "Traffic replay completed!"
+end
+```
+
+#### Running the Example
+
+A complete example with a sample traffic log is provided:
+
+```bash
+bundle exec ruby example_with_traffic_replay.rb sample_traffic_log.json
+```
+
+The sample log simulates a 12-second incident where latency spikes from ~2ms to over 300ms and then recovers.
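
Review note: the "How It Works" steps in the README describe a lookup from elapsed time to a logged latency, but the diff does not show `traffic_replay_experimental_resource.rb` itself. The sketch below is one way that lookup could work, not the PR's implementation. The helper names `build_timeline` and `latency_at` are hypothetical, and the "latest entry at or before the elapsed time" matching rule is an assumption; the README only says the entry "at offset T" is used.

```ruby
require "json"
require "time"

# Three entries copied from the README's "Required JSON Format" example.
SAMPLE_LINES = [
  '{"timestamp": "2025-10-02T16:19:30.814890047Z", "attrs.db.sql.total_duration_ms": 2.5, "attrs.db.sql.total_count": 1}',
  '{"timestamp": "2025-10-02T16:19:31.314890047Z", "attrs.db.sql.total_duration_ms": 5.8, "attrs.db.sql.total_count": 2}',
  '{"timestamp": "2025-10-02T16:19:31.814890047Z", "attrs.db.sql.total_duration_ms": 12.3, "attrs.db.sql.total_count": 3}',
].freeze

# Hypothetical helper: turns JSON lines into [offset_seconds, latency_seconds]
# pairs, sorted by offset from the earliest timestamp in the log.
def build_timeline(lines)
  entries = lines.map(&:strip).reject(&:empty?).map do |line|
    record = JSON.parse(line)
    timestamp = Time.iso8601(record.fetch("timestamp"))
    # Entries without a duration count as 0 ms, as the README states.
    latency_ms = record.fetch("attrs.db.sql.total_duration_ms", 0).to_f
    [timestamp, latency_ms / 1000.0]
  end
  raise ArgumentError, "traffic log has no entries" if entries.empty?

  entries.sort_by!(&:first)
  start = entries.first.first
  entries.map { |timestamp, latency| [timestamp - start, latency] }
end

# Hypothetical helper: returns the latency (in seconds) of the latest entry at
# or before `elapsed`, or nil once `elapsed` is past the end of the timeline.
# The "at or before" rule is an assumption made for this sketch.
def latency_at(timeline, elapsed)
  return nil if elapsed > timeline.last.first

  entry = timeline.reverse_each.find { |offset, _latency| offset <= elapsed }
  entry&.last
end

timeline = build_timeline(SAMPLE_LINES)
latency = latency_at(timeline, 0.7) # a request arriving 0.7s after start
puts "Latency at 0.7s: #{(latency * 1000).round(2)}ms"
puts "Replay finished" if latency_at(timeline, 2.0).nil?
```

In a real resource, the `nil` case would presumably map to raising `TrafficReplayCompleteError`, and `elapsed` would come from a monotonic clock such as `Process.clock_gettime(Process::CLOCK_MONOTONIC)` captured at initialization.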
diff --git a/experiments/example_output.png b/experiments/example_output.png
diff --git a/experiments/example_with_traffic_replay.rb b/experiments/example_with_traffic_replay.rb
@@ -0,0 +1,103 @@
+#!/usr/bin/env ruby
+# frozen_string_literal: true
+
+require_relative "traffic_replay_experimental_resource"
+
+# Example usage of TrafficReplayExperimentalResource with a Grafana traffic log
+
+puts "=== Semian ExperimentalResource - Traffic Replay Example ==="
+puts
+
+# Create a resource with traffic replay from a Grafana export
+puts "Example: Using traffic replay from Grafana export"
+puts "-" * 60
+
+# To use this example, you need a Grafana export JSON file
+# where each line is a JSON object with:
+# - "timestamp": ISO8601 timestamp
+# - "attrs.db.sql.total_duration_ms": latency in milliseconds
+
+traffic_log_path = ARGV[0] || "path/to/grafana_export.json"
+
+unless File.exist?(traffic_log_path)
+  puts "ERROR: Traffic log file not found: #{traffic_log_path}"
+  puts
+  puts "Usage: ruby #{__FILE__} <path_to_grafana_export.json>"
+  puts
+  puts "The JSON file should contain one JSON object per line, with fields:"
+  puts '  - "timestamp": ISO8601 timestamp (e.g., "2025-10-02T16:19:30.814890047Z")'
+  puts '  - "attrs.db.sql.total_duration_ms": latency in milliseconds'
+  puts
+  puts "Example JSON line:"
+  puts "{"
+  puts '  "timestamp": "2025-10-02T16:19:30.814890047Z",'
+  puts '  "attrs.db.sql.total_duration_ms": 5.2,'
+  puts "  ... other fields ..."
+  puts "}"
+  exit 1
+end
+
+begin
+  # Create resource with traffic replay
+  resource = Semian::Experiments::TrafficReplayExperimentalResource.new(
+    name: "my_service",
+    traffic_log_path: traffic_log_path, # Path to Grafana JSON export
+    timeout: 0.1, # 100ms timeout
+    semian: {
+      circuit_breaker: true,
+      success_threshold: 2,
+      error_threshold: 3,
+      error_threshold_timeout: 10,
+      error_timeout: 0.2,
+    },
+  )
+
+  puts
+  puts "Resource created successfully!"
+  puts "Starting to process requests..."
+  puts "Press Ctrl+C to stop"
+  puts
+
+  # Make requests continuously until the timeline is exhausted
+  request_count = 0
+  loop do
+    resource.request do |latency|
+      request_count += 1
+      puts "[#{Time.now.strftime("%H:%M:%S")}] Request ##{request_count} - " \
+        "Latency: #{(latency * 1000).round(2)}ms"
+      { latency: latency, request_number: request_count }
+    end
+
+    # Small delay between requests to avoid overwhelming the output
+    sleep(0.1)
+  rescue Semian::Experiments::TrafficReplayExperimentalResource::TrafficReplayCompleteError => e
+    puts
+    puts "Traffic replay completed!"
+    puts "Total requests processed: #{request_count}"
+    break
+  rescue Semian::Experiments::TrafficReplayExperimentalResource::CircuitOpenError => e
+    puts "[#{Time.now.strftime("%H:%M:%S")}] Circuit breaker is OPEN - #{e.message}"
+    sleep(1)
+  rescue Semian::Experiments::TrafficReplayExperimentalResource::TimeoutError => e
+    puts "[#{Time.now.strftime("%H:%M:%S")}] Timeout: #{e.message}"
+    sleep(1)
+  rescue Semian::Experiments::TrafficReplayExperimentalResource::ResourceBusyError => e
+    puts "[#{Time.now.strftime("%H:%M:%S")}] Resource busy: #{e.message}"
+    sleep(1)
+  rescue => e
+    puts "[#{Time.now.strftime("%H:%M:%S")}] Error: #{e.class} - #{e.message}"
+    sleep(0.5)
+  end
+
+  puts
+  puts "=== Replay Complete ==="
+rescue ArgumentError => e
+  puts "ERROR: #{e.message}"
+  exit(1)
+rescue Interrupt
+  puts
+  puts
+  puts "=== Interrupted by user ==="
+  puts "Total requests processed: #{request_count}"
+  exit(0)
+end
diff --git a/experiments/sample_traffic_log.json b/experiments/sample_traffic_log.json
@@ -0,0 +1,26 @@
+{"timestamp": "2025-10-02T16:19:30.000000000Z", "attrs.db.sql.total_duration_ms": 2.5, "attrs.db.sql.total_count": 1}
+{"timestamp": "2025-10-02T16:19:30.500000000Z", "attrs.db.sql.total_duration_ms": 3.2, "attrs.db.sql.total_count": 1}
+{"timestamp": "2025-10-02T16:19:31.000000000Z", "attrs.db.sql.total_duration_ms": 2.8, "attrs.db.sql.total_count": 1}
+{"timestamp": "2025-10-02T16:19:31.500000000Z", "attrs.db.sql.total_duration_ms": 5.1, "attrs.db.sql.total_count": 2}
+{"timestamp": "2025-10-02T16:19:32.000000000Z", "attrs.db.sql.total_duration_ms": 8.5, "attrs.db.sql.total_count": 2}
+{"timestamp": "2025-10-02T16:19:32.500000000Z", "attrs.db.sql.total_duration_ms": 15.3, "attrs.db.sql.total_count": 3}
+{"timestamp": "2025-10-02T16:19:33.000000000Z", "attrs.db.sql.total_duration_ms": 25.7, "attrs.db.sql.total_count": 4}
+{"timestamp": "2025-10-02T16:19:33.500000000Z", "attrs.db.sql.total_duration_ms": 45.2, "attrs.db.sql.total_count": 5}
+{"timestamp": "2025-10-02T16:19:34.000000000Z", "attrs.db.sql.total_duration_ms": 78.4, "attrs.db.sql.total_count": 8}
+{"timestamp": "2025-10-02T16:19:34.500000000Z", "attrs.db.sql.total_duration_ms": 125.6, "attrs.db.sql.total_count": 10}
+{"timestamp": "2025-10-02T16:19:35.000000000Z", "attrs.db.sql.total_duration_ms": 187.3, "attrs.db.sql.total_count": 15}
+{"timestamp": "2025-10-02T16:19:35.500000000Z", "attrs.db.sql.total_duration_ms": 245.8, "attrs.db.sql.total_count": 18}
+{"timestamp": "2025-10-02T16:19:36.000000000Z", "attrs.db.sql.total_duration_ms": 298.2, "attrs.db.sql.total_count": 20}
+{"timestamp": "2025-10-02T16:19:36.500000000Z", "attrs.db.sql.total_duration_ms": 312.5, "attrs.db.sql.total_count": 22}
+{"timestamp": "2025-10-02T16:19:37.000000000Z", "attrs.db.sql.total_duration_ms": 287.6, "attrs.db.sql.total_count": 20}
+{"timestamp": "2025-10-02T16:19:37.500000000Z", "attrs.db.sql.total_duration_ms": 234.3, "attrs.db.sql.total_count": 18}
+{"timestamp": "2025-10-02T16:19:38.000000000Z", "attrs.db.sql.total_duration_ms": 178.9, "attrs.db.sql.total_count": 15}
+{"timestamp": "2025-10-02T16:19:38.500000000Z", "attrs.db.sql.total_duration_ms": 125.4, "attrs.db.sql.total_count": 12}
+{"timestamp": "2025-10-02T16:19:39.000000000Z", "attrs.db.sql.total_duration_ms": 82.1, "attrs.db.sql.total_count": 8}
+{"timestamp": "2025-10-02T16:19:39.500000000Z", "attrs.db.sql.total_duration_ms": 45.7, "attrs.db.sql.total_count": 5}
+{"timestamp": "2025-10-02T16:19:40.000000000Z", "attrs.db.sql.total_duration_ms": 25.3, "attrs.db.sql.total_count": 3}
+{"timestamp": "2025-10-02T16:19:40.500000000Z", "attrs.db.sql.total_duration_ms": 12.8, "attrs.db.sql.total_count": 2}
+{"timestamp": "2025-10-02T16:19:41.000000000Z", "attrs.db.sql.total_duration_ms": 6.5, "attrs.db.sql.total_count": 1}
+{"timestamp": "2025-10-02T16:19:41.500000000Z", "attrs.db.sql.total_duration_ms": 3.8, "attrs.db.sql.total_count": 1}
+{"timestamp": "2025-10-02T16:19:42.000000000Z", "attrs.db.sql.total_duration_ms": 2.7, "attrs.db.sql.total_count": 1}
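
Review note: since the replay depends entirely on the log's format, a pre-flight check can catch malformed exports before a run. The sketch below is not part of the PR; `summarize_traffic_log` is a hypothetical helper that assumes only the two fields the README documents. It reports the entry count, the timeline length, the peak latency, and how many entries lack a duration (and so would replay as 0ms).

```ruby
require "json"
require "time"

# First, peak, and last entries copied from sample_traffic_log.json.
LOG_EXCERPT = [
  '{"timestamp": "2025-10-02T16:19:30.000000000Z", "attrs.db.sql.total_duration_ms": 2.5, "attrs.db.sql.total_count": 1}',
  '{"timestamp": "2025-10-02T16:19:36.500000000Z", "attrs.db.sql.total_duration_ms": 312.5, "attrs.db.sql.total_count": 22}',
  '{"timestamp": "2025-10-02T16:19:42.000000000Z", "attrs.db.sql.total_duration_ms": 2.7, "attrs.db.sql.total_count": 1}',
].freeze

# Hypothetical pre-flight check: validates each line and summarizes the log.
# Raises ArgumentError with a line number on invalid JSON or a missing
# timestamp; Time.iso8601 raises ArgumentError on a malformed timestamp.
def summarize_traffic_log(lines)
  timestamps = []
  durations_ms = []
  missing = 0

  lines.each_with_index do |line, index|
    next if line.strip.empty?

    begin
      record = JSON.parse(line)
    rescue JSON::ParserError
      raise ArgumentError, "line #{index + 1}: invalid JSON"
    end
    raise ArgumentError, "line #{index + 1}: missing timestamp" unless record.key?("timestamp")

    timestamps << Time.iso8601(record["timestamp"])
    if record.key?("attrs.db.sql.total_duration_ms")
      durations_ms << record["attrs.db.sql.total_duration_ms"].to_f
    else
      missing += 1
      durations_ms << 0.0 # replayed as 0 ms per the README
    end
  end
  raise ArgumentError, "traffic log has no entries" if timestamps.empty?

  {
    entries: timestamps.size,
    duration_seconds: timestamps.max - timestamps.min,
    peak_latency_ms: durations_ms.max,
    missing_durations: missing,
  }
end

summary = summarize_traffic_log(LOG_EXCERPT)
puts "#{summary[:entries]} entries over #{summary[:duration_seconds]}s, " \
  "peak #{summary[:peak_latency_ms]}ms, #{summary[:missing_durations]} without a duration"
```

Against the full file, the call would be `summarize_traffic_log(File.readlines("sample_traffic_log.json"))`; the excerpt above already spans the 12-second timeline and the 312.5ms peak that the README describes.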
+