Commit cf2d2e0: revamped ab tests section (parent 5f22f1f)

evaluations/ab-tests.mdx (68 additions, 19 deletions)

---
title: "A/B Tests"
description: "Run weighted A/B tests on models, prompts, or any variants in your code."
---

## Overview

`ze.choose()` enables A/B testing by making weighted random selections between different variants (models, prompts, parameters, etc.) and automatically tracking which variant was chosen for each execution.

**Key features:**

- Weighted random selection between variants
- Automatic tracking of choices within spans, traces, or sessions
- Consistency caching: the same entity always gets the same variant
- Built-in validation of weights and variant keys

## Basic Usage

```python
import zeroeval as ze

ze.init()

# ze.choose() must be called within a span, trace, or session context
with ze.span("my_operation"):
    # Choose between two models with a 70/30 split
    model = ze.choose(
        "model_selection",
        variants={"fast": "gpt-4o-mini", "powerful": "gpt-4o"},
        weights={"fast": 0.7, "powerful": 0.3}
    )

    # Use the selected model:
    # `model` is either "gpt-4o-mini" (70% chance) or "gpt-4o" (30% chance)
```
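
Conceptually, the weighted selection above behaves like Python's `random.choices`. The following self-contained sketch (a hypothetical `weighted_choice` helper, not ZeroEval's implementation, and without the tracking or caching) shows the mechanics:

```python
import random

def weighted_choice(variants, weights):
    """Pick one variant value according to the given weights.

    Minimal sketch of weighted selection; ZeroEval's actual
    implementation also records the choice and caches it per entity.
    """
    keys = list(variants)
    selected = random.choices(keys, weights=[weights[k] for k in keys], k=1)[0]
    return variants[selected]

# Over many draws, "fast" is picked roughly 70% of the time
counts = {"gpt-4o-mini": 0, "gpt-4o": 0}
for _ in range(10_000):
    value = weighted_choice(
        {"fast": "gpt-4o-mini", "powerful": "gpt-4o"},
        {"fast": 0.7, "powerful": 0.3},
    )
    counts[value] += 1
```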

## Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `name` | `str` | Yes | Name of the A/B test (e.g. `"model_selection"`, `"prompt_variant"`) |
| `variants` | `Dict[str, Any]` | Yes | Dictionary mapping variant keys to their values |
| `weights` | `Dict[str, float]` | Yes | Dictionary mapping variant keys to selection probabilities (must sum to ~1.0) |
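
The validation rules in the table can be sketched in plain Python. This is a hypothetical `validate_ab_test` helper illustrating the documented behavior (mismatched keys are an error, an off-by-more-than-0.05 weight sum only warns), not ZeroEval's actual code:

```python
import warnings

def validate_ab_test(variants, weights, tolerance=0.05):
    """Validate an A/B test configuration.

    Hypothetical illustration of the rules documented above,
    not ZeroEval's actual implementation.
    """
    if set(variants) != set(weights):
        raise ValueError("variant keys and weight keys must match exactly")
    total = sum(weights.values())
    if not (1.0 - tolerance <= total <= 1.0 + tolerance):
        warnings.warn(f"weights sum to {total}, expected ~1.0")

validate_ab_test(
    {"fast": "gpt-4o-mini", "powerful": "gpt-4o"},
    {"fast": 0.7, "powerful": 0.3},
)  # passes silently
```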

## Returns

Returns the **value** from the selected variant (not the key).

## Complete Example

```python
import zeroeval as ze
import openai

ze.init()
client = openai.OpenAI()

with ze.span("model_ab_test", tags={"feature": "model_comparison"}):
    # A/B test between two models
    selected_model = ze.choose(
        "model_selection",
        variants={
            "mini": "gpt-4o-mini",
            "full": "gpt-4o"
        },
        weights={
            "mini": 0.7,  # 70% of traffic
            "full": 0.3   # 30% of traffic
        }
    )

    # The selected model is automatically tracked on the span
    response = client.chat.completions.create(
        model=selected_model,
        messages=[{"role": "user", "content": "Hello!"}]
    )
```

## Important Notes

- **Context Required**: must be called within an active `ze.span()`, trace, or session
- **Consistency**: the same entity (span/trace/session) always receives the same variant
- **Weight Validation**: weights should sum to 1.0 (a warning is raised if the sum falls outside 0.95-1.05)
- **Key Matching**: variant keys and weight keys must match exactly
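
The consistency guarantee can be approximated by hashing the entity id onto the cumulative weight distribution. This is a sketch of one way such caching could work, not ZeroEval's actual mechanism:

```python
import hashlib

def consistent_choice(entity_id, name, variants, weights):
    """Map an entity deterministically onto a weighted variant.

    Sketch only: hashes (entity_id, test name) to a point in [0, 1]
    and walks the cumulative weights, so the same entity always
    lands on the same variant.
    """
    digest = hashlib.sha256(f"{entity_id}:{name}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for key, weight in weights.items():
        cumulative += weight
        if point <= cumulative:
            return variants[key]
    return variants[key]  # guard against floating-point shortfall

variants = {"fast": "gpt-4o-mini", "powerful": "gpt-4o"}
weights = {"fast": 0.7, "powerful": 0.3}

# Repeated calls for the same span id always return the same model
first = consistent_choice("span-123", "model_selection", variants, weights)
second = consistent_choice("span-123", "model_selection", variants, weights)
```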
