---
title: "A/B Tests"
description: "Run weighted A/B tests on models, prompts, or any variants in your code."
---

## Overview

`ze.choose()` enables A/B testing by making weighted random selections between different variants (models, prompts, parameters, etc.) and automatically tracking which variant was chosen for each execution.

**Key features:**
- Weighted random selection between variants
- Automatic tracking of choices within spans, traces, or sessions
- Consistency caching — same entity always gets the same variant
- Built-in validation of weights and variant keys

## Basic Usage

```python
import zeroeval as ze

ze.init()

# Must be called within a span, trace, or session context
with ze.span("my_operation"):
    # Choose between two models with a 70/30 split
    model = ze.choose(
        "model_selection",
        variants={"fast": "gpt-4o-mini", "powerful": "gpt-4o"},
        weights={"fast": 0.7, "powerful": 0.3}
    )

    # Use the selected model
    # model will be either "gpt-4o-mini" (70% chance) or "gpt-4o" (30% chance)
```
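Conceptually, the selection behaves like a weighted random draw over the variant keys. A minimal sketch in plain Python (illustrative only, not ZeroEval's actual implementation):

```python
import random
from collections import Counter

variants = {"fast": "gpt-4o-mini", "powerful": "gpt-4o"}
weights = {"fast": 0.7, "powerful": 0.3}

def pick(variants, weights):
    # Weighted random pick over the variant keys, returning the mapped value
    keys = list(variants)
    key = random.choices(keys, weights=[weights[k] for k in keys], k=1)[0]
    return variants[key]

# Over many draws, the split approaches the configured 70/30
counts = Counter(pick(variants, weights) for _ in range(10_000))
print(counts)  # roughly 7,000 "gpt-4o-mini" and 3,000 "gpt-4o"
```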

## Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `name` | `str` | Yes | Name of the A/B test (e.g., "model_selection", "prompt_variant") |
| `variants` | `Dict[str, Any]` | Yes | Dictionary mapping variant keys to their values |
| `weights` | `Dict[str, float]` | Yes | Dictionary mapping variant keys to selection probabilities (must sum to ~1.0) |
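The validation implied by the table can be sketched as follows (a hypothetical `validate` helper for illustration, not the library's actual check):

```python
def validate(variants, weights, tol=0.05):
    # Variant keys and weight keys must match exactly
    if set(variants) != set(weights):
        raise KeyError("variant keys and weight keys must match")
    # Weights should sum to approximately 1.0
    total = sum(weights.values())
    if not (1.0 - tol) <= total <= (1.0 + tol):
        raise ValueError(f"weights sum to {total}, expected ~1.0")

validate({"fast": "gpt-4o-mini", "powerful": "gpt-4o"},
         {"fast": 0.7, "powerful": 0.3})  # passes: keys match, sum is 1.0
```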

## Returns

Returns the **value** from the selected variant (not the key).
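Concretely, if the weighted selection lands on the `"fast"` key from the earlier example, the caller receives the mapped value:

```python
variants = {"fast": "gpt-4o-mini", "powerful": "gpt-4o"}

# Suppose the weighted selection lands on the "fast" key;
# the caller receives the mapped value, never the key itself.
selected_key = "fast"
result = variants[selected_key]

print(result)  # "gpt-4o-mini", not "fast"
```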

## Complete Example

```python
import zeroeval as ze
import openai

ze.init()
client = openai.OpenAI()

with ze.span("model_ab_test", tags={"feature": "model_comparison"}):
    # A/B test between two models
    selected_model = ze.choose(
        "model_selection",
        variants={
            "mini": "gpt-4o-mini",
            "full": "gpt-4o"
        },
        weights={
            "mini": 0.7,  # 70% traffic
            "full": 0.3   # 30% traffic
        }
    )

    # The selected model is automatically tracked
    response = client.chat.completions.create(
        model=selected_model,
        messages=[{"role": "user", "content": "Hello!"}]
    )
```

## Important Notes

- **Context Required**: Must be called within an active `ze.span()`, trace, or session
- **Consistency**: Same entity (span/trace/session) always receives the same variant
- **Weight Validation**: Weights should sum to 1.0 (warns if not within 0.95-1.05)
- **Key Matching**: Variant keys and weight keys must match exactly
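The consistency behavior can be sketched with a simple cache keyed on the entity, so that repeated calls within the same span, trace, or session return the same variant (a hypothetical `cached_choose`, not ZeroEval's implementation):

```python
import random

_cache = {}

def cached_choose(entity_id, name, variants, weights):
    # Illustrative consistency cache: the first call for an entity
    # picks a variant at random; later calls reuse that pick.
    if (entity_id, name) not in _cache:
        keys = list(variants)
        _cache[(entity_id, name)] = random.choices(
            keys, weights=[weights[k] for k in keys], k=1
        )[0]
    return variants[_cache[(entity_id, name)]]

variants = {"fast": "gpt-4o-mini", "powerful": "gpt-4o"}
weights = {"fast": 0.7, "powerful": 0.3}

first = cached_choose("span-123", "model_selection", variants, weights)
second = cached_choose("span-123", "model_selection", variants, weights)
assert first == second  # same entity, same variant every time
```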