updates to ab test

sebastiancrossa · sebastiancrossa · commit 13b3e82a18d5 · 2025-11-13T13:58:01.000-06:00
diff --git a/evaluations/ab-tests.mdx b/evaluations/ab-tests.mdx
@@ -5,13 +5,15 @@ description: "Run weighted A/B tests on models, prompts, or any variants in your
 
 ## Overview
 
-`ze.choose()` enables A/B testing by making weighted random selections between different variants (models, prompts, parameters, etc.) and automatically tracking which variant was chosen for each execution.
+`ze.choose()` enables A/B testing by making a weighted random selection between variants (models, prompts, parameters, etc.), limiting each experiment to a fixed number of days, and recording the chosen variant on the active span, trace, or session for downstream analytics.
 
 **Key features:**
 - Weighted random selection between variants
+- Experiment timeboxing via `duration_days`
 - Automatic tracking of choices within spans, traces, or sessions
 - Consistency caching — same entity always gets the same variant
-- Built-in validation of weights and variant keys
+- Built-in validation of weights, variant keys, and defaults
+- Automatic fallback to a default variant once an experiment completes
 
 ## Basic Usage
 
@@ -22,11 +24,13 @@ ze.init()
 
 # Must be called within a span, trace, or session context
 with ze.span("my_operation"):
-    # Choose between two models with 70/30 split
+    # Choose between two models with 70/30 split for 14 days
     model = ze.choose(
         "model_selection",
         variants={"fast": "gpt-4o-mini", "powerful": "gpt-4o"},
-        weights={"fast": 0.7, "powerful": 0.3}
+        weights={"fast": 0.7, "powerful": 0.3},
+        duration_days=14,
+        default_variant="fast"  # optional fallback after day 14
     )
     
     # Use the selected model
@@ -40,11 +44,37 @@ with ze.span("my_operation"):
 | `name` | `str` | Yes | Name of the A/B test (e.g., "model_selection", "prompt_variant") |
 | `variants` | `Dict[str, Any]` | Yes | Dictionary mapping variant keys to their values |
 | `weights` | `Dict[str, float]` | Yes | Dictionary mapping variant keys to selection probabilities (must sum to ~1.0) |
+| `duration_days` | `int` | Yes | Number of days the experiment should run; must be > 0 |
+| `default_variant` | `str` | No | Variant key to serve automatically once the experiment ends (defaults to the first key in `variants` if omitted) |
 
 ## Returns
 
 Returns the **value** from the selected variant (not the key).
 
+## Experiment Lifecycle & Defaults
+
+- `duration_days` timeboxes the experiment. Once the backend marks it completed, `ze.choose()` automatically serves the `default_variant`.
+- If `default_variant` is omitted, the first key in `variants` becomes the fallback.
+- While an experiment is active, the same entity (span/trace/session) receives a cached, consistent variant choice.
+
+## Tracking Signals
+
+Attach success signals to the same span where `ze.choose()` runs so dashboards can correlate outcomes with the chosen variant:
+
+```python
+with ze.span("recommendation_flow") as span:
+    model = ze.choose(
+        "reco_models_v2",
+        variants={"mini": "gpt-4o-mini", "full": "gpt-4o"},
+        weights={"mini": 0.6, "full": 0.4},
+        duration_days=21,
+        default_variant="mini",
+    )
+
+    score = run_inference(model)  # run_inference: placeholder for your own inference/scoring function
+    ze.set_signal(span, {"conversion_success": score > 0.75})
+```
+
 ## Complete Example
 
 ```python
@@ -65,19 +95,28 @@ with ze.span("model_ab_test", tags={"feature": "model_comparison"}):
         weights={
             "mini": 0.7,  # 70% traffic
             "full": 0.3   # 30% traffic
-        }
+        },
+        duration_days=14,
+        default_variant="mini"
     )
     
     # The selected model is automatically tracked
     response = client.chat.completions.create(
         model=selected_model,
         messages=[{"role": "user", "content": "Hello!"}]
     )
+
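+    # Attach a success signal to this span; the enclosing `with ze.span(...)` must bind it via `as span`
+    rating = evaluate_response(response)  # evaluate_response: placeholder for your own scoring function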
+    ze.set_signal(span, {"response_quality": rating >= 0.7})
 ```
 
 ## Important Notes
 
 - **Context Required**: Must be called within an active `ze.span()`, trace, or session
-- **Consistency**: Same entity (span/trace/session) always receives the same variant
+- **Consistency**: Same entity (span/trace/session) always receives the same variant while the test runs
 - **Weight Validation**: Weights should sum to 1.0 (warns if not within 0.95-1.05)
+- **Duration Required**: `duration_days` must be > 0; experiments stop after this window
+- **Fallback Behavior**: Once the backend reports the test as completed, `default_variant` (or the first key in `variants` if none was set) is used automatically
+- **Signal Analytics**: Use `ze.set_signal()` on the same span to compare variant impact in the dashboard
 - **Key Matching**: Variant keys and weight keys must match exactly