
Commit ef0a5e2

update docs with new pages
1 parent 90730c0 commit ef0a5e2

File tree

8 files changed

+221
-523
lines changed

Lines changed: 1 addition & 1 deletion

```diff
@@ -9,7 +9,7 @@ Calibrated LLM judges are AI evaluators that watch your traces, sessions, or spans
 
 ## When to use
 
-Use a calibrated judge when you want consistent, scalable evaluation of:
+Use a behavior when you want consistent, scalable evaluation of:
 
 - Hallucinations, safety/policy violations
 - Response quality (helpfulness, tone, structure)
```
behaviors/pull-evaluations.mdx

Lines changed: 168 additions & 0 deletions

---
title: "Pulling Evaluations"
description: "Retrieve judge evaluations via SDK or REST API"
---

Retrieve judge evaluations programmatically for reporting, analysis, or integration into your own workflows.

## Finding your IDs

Before making API calls, you'll need these identifiers:

| ID | Where to find it |
|---|---|
| **Project ID** | Settings → Project, or in any URL after `/projects/` |
| **Judge ID** | Click a judge in the dashboard; the ID is in the URL (`/judges/{judge_id}`) |
| **Span ID** | In trace details, or returned by your instrumentation code |
## Python SDK

### Get evaluations by judge

Fetch all evaluations for a specific judge with pagination and optional filters.

```python
import zeroeval as ze

ze.init(api_key="YOUR_API_KEY")

response = ze.get_behavior_evaluations(
    project_id="your-project-id",
    judge_id="your-judge-id",
    limit=100,
    offset=0,
)

print(f"Total: {response['total']}")
for eval in response["evaluations"]:
    print(f"Span: {eval['span_id']}")
    print(f"Result: {'PASS' if eval['evaluation_result'] else 'FAIL'}")
    print(f"Score: {eval.get('score')}")  # For scored judges
    print(f"Reason: {eval['evaluation_reason']}")
```
**Optional filters:**

```python
response = ze.get_behavior_evaluations(
    project_id="your-project-id",
    judge_id="your-judge-id",
    limit=100,
    offset=0,
    start_date="2025-01-01T00:00:00Z",
    end_date="2025-01-31T23:59:59Z",
    evaluation_result=True,  # Only passing evaluations
    feedback_state="with_user_feedback",  # Only calibrated items
)
```
### Get evaluations by span

Fetch all judge evaluations for a specific span (useful when a span has been evaluated by multiple judges).

```python
response = ze.get_span_evaluations(
    project_id="your-project-id",
    span_id="your-span-id",
)

for eval in response["evaluations"]:
    print(f"Judge: {eval['judge_name']}")
    print(f"Result: {'PASS' if eval['evaluation_result'] else 'FAIL'}")
    if eval.get('evaluation_type') == 'scored':
        print(f"Score: {eval['score']} / {eval['score_max']}")
```
## REST API

Use these endpoints directly with your API key in the `Authorization` header.

### Get evaluations by judge

```bash
curl -X GET "https://api.zeroeval.com/projects/{project_id}/judges/{judge_id}/evaluations?limit=100&offset=0" \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY"
```

**Query parameters:**

| Parameter | Type | Description |
|---|---|---|
| `limit` | int | Results per page (1-500, default 100) |
| `offset` | int | Pagination offset (default 0) |
| `start_date` | string | Filter by date (ISO 8601) |
| `end_date` | string | Filter by date (ISO 8601) |
| `evaluation_result` | bool | `true` for passing, `false` for failing |
| `feedback_state` | string | `with_user_feedback` or `without_user_feedback` |
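The endpoint and query parameters above can also be assembled in plain Python before sending with an HTTP client such as `requests`; a minimal sketch, assuming the `ZEROEVAL_API_KEY` environment variable is set (the `build_request` helper is illustrative, not part of any SDK):

```python
import os

API_BASE = "https://api.zeroeval.com"

def build_request(project_id: str, judge_id: str, *, limit: int = 100,
                  offset: int = 0, **filters) -> dict:
    """Assemble the URL, headers, and query params for the by-judge endpoint."""
    return {
        "url": f"{API_BASE}/projects/{project_id}/judges/{judge_id}/evaluations",
        "params": {"limit": limit, "offset": offset, **filters},
        "headers": {"Authorization": f"Bearer {os.environ.get('ZEROEVAL_API_KEY', '')}"},
    }

req = build_request("your-project-id", "your-judge-id", evaluation_result="true")
# Send with e.g.: requests.get(req["url"], params=req["params"], headers=req["headers"])
```

Keeping the request assembly separate from the actual HTTP call makes the filter handling easy to unit test without hitting the network.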
### Get evaluations by span

```bash
curl -X GET "https://api.zeroeval.com/projects/{project_id}/spans/{span_id}/evaluations" \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY"
```
## Response format

### Judge evaluations response

```json
{
  "evaluations": [...],
  "total": 142,
  "limit": 100,
  "offset": 0
}
```
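The `total` and `offset` fields are enough to decide whether another page remains; a small illustrative helper (not part of the SDK):

```python
def has_more(response: dict) -> bool:
    """True when evaluations beyond the current page remain on the server."""
    return response["offset"] + len(response["evaluations"]) < response["total"]

page = {"evaluations": [{}] * 100, "total": 142, "limit": 100, "offset": 0}
print(has_more(page))  # True: 42 evaluations remain after this page
```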
### Span evaluations response

```json
{
  "span_id": "abc-123",
  "evaluations": [...]
}
```
### Evaluation object

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique evaluation ID |
| `span_id` | string | The evaluated span |
| `evaluation_result` | bool | Pass (`true`) or fail (`false`) |
| `evaluation_reason` | string | Judge's reasoning |
| `confidence_score` | float | Model confidence (0-1) |
| `score` | float \| null | Numeric score (scored judges only) |
| `score_min` | float \| null | Minimum possible score |
| `score_max` | float \| null | Maximum possible score |
| `pass_threshold` | float \| null | Score required to pass |
| `model_used` | string | LLM model that ran the evaluation |
| `created_at` | string | ISO 8601 timestamp |
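The fields above are enough for simple offline aggregates such as pass rate and mean score; a sketch over fetched evaluation objects (the sample records are made up for illustration):

```python
def summarize(evaluations: list) -> dict:
    """Pass rate and mean score over a list of evaluation objects."""
    total = len(evaluations)
    passed = sum(1 for e in evaluations if e["evaluation_result"])
    # Binary judges have no score; skip null scores when averaging
    scores = [e["score"] for e in evaluations if e.get("score") is not None]
    return {
        "total": total,
        "pass_rate": passed / total if total else 0.0,
        "mean_score": sum(scores) / len(scores) if scores else None,
    }

sample = [
    {"evaluation_result": True, "score": 4.0},
    {"evaluation_result": False, "score": 2.0},
    {"evaluation_result": True, "score": None},  # binary judge, no score
]
print(summarize(sample))
```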
## Pagination example

For large result sets, paginate through all evaluations:

```python
all_evaluations = []
offset = 0
limit = 100

while True:
    response = ze.get_behavior_evaluations(
        project_id="your-project-id",
        judge_id="your-judge-id",
        limit=limit,
        offset=offset,
    )

    all_evaluations.extend(response["evaluations"])

    if len(response["evaluations"]) < limit:
        break

    offset += limit

print(f"Fetched {len(all_evaluations)} total evaluations")
```
File renamed without changes.

docs.json

Lines changed: 4 additions & 5 deletions
```diff
@@ -66,19 +66,18 @@
       ]
     },
     {
-      "group": "Calibrated Judges",
+      "group": "Behaviors",
       "icon": "gavel",
       "pages": [
-        "calibrated-judges/introduction",
-        "calibrated-judges/setup"
+        "behaviors/introduction",
+        "behaviors/setup",
+        "behaviors/pull-evaluations"
       ]
     },
     {
       "group": "Experiments",
       "icon": "flask",
       "pages": [
-        "evaluations/datasets",
-        "evaluations/experiments",
         "evaluations/ab-tests",
         "evaluations/prompt-management"
       ]
```
