
Commit be2a040

Merge pull request #3 from zeroeval/seb/ze-132-streamline-behavior-tuning
Updated docs
2 parents ef0a5e2 + 429ccc1 commit be2a040

File tree

6 files changed: +138 -14 lines changed


docs.json: 5 additions & 4 deletions

```diff
@@ -66,12 +66,13 @@
       ]
     },
     {
-      "group": "Behaviors",
+      "group": "Judges",
       "icon": "gavel",
       "pages": [
-        "behaviors/introduction",
-        "behaviors/setup",
-        "behaviors/pull-evaluations"
+        "judges/introduction",
+        "judges/setup",
+        "judges/submit-feedback",
+        "judges/pull-evaluations"
       ]
     },
     {
```
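For reference, applying the hunk above leaves the renamed navigation group looking like this (reconstructed from the diff; indentation is approximate):

```json
{
  "group": "Judges",
  "icon": "gavel",
  "pages": [
    "judges/introduction",
    "judges/setup",
    "judges/submit-feedback",
    "judges/pull-evaluations"
  ]
}
```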

index.mdx: 2 additions & 2 deletions

```diff
@@ -19,9 +19,9 @@ description: "Start improving your AI applications with ZeroEval"
     performance
   </Card>
   <Card
-    title="Behaviors"
+    title="Judges"
     icon="gavel"
-    href="/behaviors/introduction"
+    href="/judges/introduction"
   >
     Get reliable AI evaluation with judges that are calibrated to human
     preferences
```
Lines changed: 3 additions & 3 deletions

```diff
@@ -5,12 +5,12 @@ description: "Continuously evaluate your production traffic with judges that lea
 
 <video src="/videos/calibrated-judge.mp4" controls muted playsInline loop preload="metadata" />
 
-Calibrated LLM judges are AI evaluators that watch your traces, sessions, or spans and score behavior according to criteria you define. They get better over time the more you refine and correct their evaluations.
+Calibrated LLM judges are AI evaluators that watch your traces, sessions, or spans and score outputs according to criteria you define. They get better over time the more you refine and correct their evaluations.
 
 ## When to use
 
-Use a behavior when you want consistent, scalable evaluation of:
+Use a judge when you want consistent, scalable evaluation of:
 
 - Hallucinations, safety/policy violations
 - Response quality (helpfulness, tone, structure)
-- Latency, cost, and error patterns tied to behaviors
+- Latency, cost, and error patterns tied to specific criteria
```
Lines changed: 7 additions & 3 deletions

````diff
@@ -26,7 +26,7 @@ import zeroeval as ze
 
 ze.init(api_key="YOUR_API_KEY")
 
-response = ze.get_behavior_evaluations(
+response = ze.get_judge_evaluations(
     project_id="your-project-id",
     judge_id="your-judge-id",
     limit=100,
@@ -44,7 +44,7 @@ for eval in response["evaluations"]:
 **Optional filters:**
 
 ```python
-response = ze.get_behavior_evaluations(
+response = ze.get_judge_evaluations(
     project_id="your-project-id",
     judge_id="your-judge-id",
     limit=100,
@@ -150,7 +150,7 @@ offset = 0
 limit = 100
 
 while True:
-    response = ze.get_behavior_evaluations(
+    response = ze.get_judge_evaluations(
         project_id="your-project-id",
         judge_id="your-judge-id",
         limit=limit,
@@ -166,3 +166,7 @@ while True:
 
 print(f"Fetched {len(all_evaluations)} total evaluations")
 ```
+
+## Related
+
+- [Submitting Feedback](/judges/submit-feedback) - Programmatically submit feedback for judge evaluations
````
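The loop being renamed in the last hunk is a standard offset-pagination pattern: fetch pages of `limit` items until a short page signals the end. A minimal, self-contained sketch of that pattern, with a stub standing in for `ze.get_judge_evaluations` (the stub and its data are hypothetical, not part of the ZeroEval SDK):

```python
def paginate(fetch, limit=100):
    """Collect all items from an offset-paginated endpoint.

    `fetch(limit=..., offset=...)` must return a dict with an
    "evaluations" list, mirroring the response shape in the docs above.
    """
    all_items = []
    offset = 0
    while True:
        page = fetch(limit=limit, offset=offset)["evaluations"]
        all_items.extend(page)
        if len(page) < limit:  # short (or empty) page: nothing left
            break
        offset += limit
    return all_items


# Hypothetical stub in place of ze.get_judge_evaluations:
def fake_fetch(limit, offset):
    data = [{"span_id": str(i)} for i in range(250)]
    return {"evaluations": data[offset:offset + limit]}

evals = paginate(fake_fetch, limit=100)
print(len(evals))  # → 250
```

Stopping on `len(page) < limit` rather than an empty page saves one request when the total is not an exact multiple of `limit`.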
Lines changed: 2 additions & 2 deletions

```diff
@@ -7,8 +7,8 @@ description: "Create and calibrate an AI judge in minutes"
 
 ## Creating a judge (<5 mins)
 
-1. Go to [Monitoring → Judges → New Judge](https://app.zeroeval.com/monitoring/signal-automations).
-2. Sepcify the behaviour that you want to track from your production traffic.
+1. Go to [Monitoring → Judges → New Judge](https://app.zeroeval.com/monitoring/judges).
+2. Specify the criteria that you want to evaluate from your production traffic.
 3. Tweak the prompt of the judge until it matches what you are looking for!
 
 That's it! Historical and future traces will be scored automatically and shown in the dashboard.
```

judges/submit-feedback.mdx: 119 additions & 0 deletions (new file)

---
title: "Submitting Feedback"
description: "Programmatically submit feedback for judge evaluations via SDK"
---

## Overview

When calibrating judges, you can submit feedback programmatically using the SDK. This is useful for:

- Bulk feedback submission from automated pipelines
- Integration with custom review workflows
- Syncing feedback from external labeling tools

## Important: Using the Correct IDs

Judge evaluations involve two related spans:

| ID | Description |
|---|---|
| **Source Span ID** | The original LLM call that was evaluated |
| **Judge Call Span ID** | The span created when the judge ran its evaluation |

When submitting feedback, always include the `judge_id` parameter to ensure feedback is correctly associated with the judge evaluation.

## Python SDK

### From the UI (Recommended)

The easiest way to get the correct IDs is from the Judge Evaluation modal:

1. Open a judge evaluation in the dashboard
2. Expand the "SDK Integration" section
3. Click "Copy" to copy the pre-filled Python code
4. Paste and customize the generated code

### Manual Submission

```python
from zeroeval import ZeroEval

client = ZeroEval()

# Submit feedback for a judge evaluation
client.send_feedback(
    prompt_slug="your-judge-task-slug",  # The task/prompt associated with the judge
    completion_id="span-id-here",        # The span ID from the evaluation
    thumbs_up=True,                      # True = correct, False = incorrect
    reason="Optional explanation",
    judge_id="automation-id-here",       # Required for judge feedback
)
```

### Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `prompt_slug` | str | Yes | The task slug associated with the judge |
| `completion_id` | str | Yes | The span ID being evaluated |
| `thumbs_up` | bool | Yes | `True` if judge was correct, `False` if wrong |
| `reason` | str | No | Explanation of the feedback |
| `judge_id` | str | Yes* | The judge automation ID (*required for judge feedback) |

## REST API

```bash
curl -X POST "https://api.zeroeval.com/v1/prompts/{task_slug}/completions/{span_id}/feedback" \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "thumbs_up": true,
    "reason": "Judge correctly identified the issue",
    "judge_id": "automation-uuid-here"
  }'
```
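The REST call above can also be prepared from Python without the SDK. A minimal standard-library sketch that only builds the request URL and JSON body (sending it, e.g. with `urllib.request` or `requests`, is left out; the slug, span, and judge values are placeholders):

```python
import json
from urllib.parse import quote


def build_feedback_request(task_slug, span_id, thumbs_up,
                           reason=None, judge_id=None):
    """Build the URL and JSON body for the feedback endpoint shown above."""
    url = (
        "https://api.zeroeval.com/v1/prompts/"
        f"{quote(task_slug)}/completions/{quote(span_id)}/feedback"
    )
    body = {"thumbs_up": thumbs_up}
    if reason is not None:          # optional fields are omitted, not null
        body["reason"] = reason
    if judge_id is not None:
        body["judge_id"] = judge_id
    return url, json.dumps(body)


url, body = build_feedback_request(
    "my-task", "span-123", True,
    reason="Judge correctly identified the issue",
    judge_id="automation-uuid",
)
print(url)  # → https://api.zeroeval.com/v1/prompts/my-task/completions/span-123/feedback
```

`quote()` guards against slugs or span IDs containing characters that are not URL-safe; omitting the optional fields entirely (rather than sending `null`) matches the shape of the curl payload above.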

## Finding Your IDs

| ID | Where to Find It |
|---|---|
| **Task Slug** | In the judge settings, or the URL when editing the judge's prompt |
| **Span ID** | In the evaluation modal, or via `get_judge_evaluations()` response |
| **Judge ID** | In the URL when viewing a judge (`/judges/{judge_id}`) |

## Bulk Feedback Submission

To submit feedback on multiple evaluations, iterate through them:

```python
from zeroeval import ZeroEval

client = ZeroEval()

# Get evaluations to review
evaluations = client.get_judge_evaluations(
    project_id="your-project-id",
    judge_id="your-judge-id",
    limit=100,
)

# Submit feedback for each
for eval in evaluations["evaluations"]:
    # Your logic to determine if the evaluation was correct
    is_correct = your_review_logic(eval)

    client.send_feedback(
        prompt_slug="your-judge-task-slug",
        completion_id=eval["span_id"],
        thumbs_up=is_correct,
        reason="Automated review",
        judge_id="your-judge-id",
    )
```

## Related

- [Pulling Evaluations](/judges/pull-evaluations) - Retrieve judge evaluations programmatically
- [Judge Setup](/judges/setup) - Configure and deploy judges