
Commit be2a040

Merge pull request #3 from zeroeval/seb/ze-132-streamline-behavior-tuning
Updated docs
2 parents ef0a5e2 + 429ccc1 commit be2a040

File tree

6 files changed: +138 -14 lines changed


docs.json: 5 additions & 4 deletions

```diff
@@ -66,12 +66,13 @@
       ]
     },
     {
-      "group": "Behaviors",
+      "group": "Judges",
       "icon": "gavel",
       "pages": [
-        "behaviors/introduction",
-        "behaviors/setup",
-        "behaviors/pull-evaluations"
+        "judges/introduction",
+        "judges/setup",
+        "judges/submit-feedback",
+        "judges/pull-evaluations"
       ]
     },
     {
```
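For reference, applying the hunk above leaves the renamed navigation group looking like this (reconstructed from the diff; indentation is approximate):

```json
{
  "group": "Judges",
  "icon": "gavel",
  "pages": [
    "judges/introduction",
    "judges/setup",
    "judges/submit-feedback",
    "judges/pull-evaluations"
  ]
}
```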

index.mdx: 2 additions & 2 deletions

```diff
@@ -19,9 +19,9 @@ description: "Start improving your AI applications with ZeroEval"
     performance
   </Card>
   <Card
-    title="Behaviors"
+    title="Judges"
     icon="gavel"
-    href="/behaviors/introduction"
+    href="/judges/introduction"
   >
     Get reliable AI evaluation with judges that are calibrated to human
     preferences
```
Lines changed: 3 additions & 3 deletions

```diff
@@ -5,12 +5,12 @@ description: "Continuously evaluate your production traffic with judges that lea
 
 <video src="/videos/calibrated-judge.mp4" controls muted playsInline loop preload="metadata" />
 
-Calibrated LLM judges are AI evaluators that watch your traces, sessions, or spans and score behavior according to criteria you define. They get better over time the more you refine and correct their evaluations.
+Calibrated LLM judges are AI evaluators that watch your traces, sessions, or spans and score outputs according to criteria you define. They get better over time the more you refine and correct their evaluations.
 
 ## When to use
 
-Use a behavior when you want consistent, scalable evaluation of:
+Use a judge when you want consistent, scalable evaluation of:
 
 - Hallucinations, safety/policy violations
 - Response quality (helpfulness, tone, structure)
-- Latency, cost, and error patterns tied to behaviors
+- Latency, cost, and error patterns tied to specific criteria
```
Lines changed: 7 additions & 3 deletions

````diff
@@ -26,7 +26,7 @@ import zeroeval as ze
 
 ze.init(api_key="YOUR_API_KEY")
 
-response = ze.get_behavior_evaluations(
+response = ze.get_judge_evaluations(
     project_id="your-project-id",
     judge_id="your-judge-id",
     limit=100,
@@ -44,7 +44,7 @@ for eval in response["evaluations"]:
 **Optional filters:**
 
 ```python
-response = ze.get_behavior_evaluations(
+response = ze.get_judge_evaluations(
     project_id="your-project-id",
     judge_id="your-judge-id",
     limit=100,
@@ -150,7 +150,7 @@ offset = 0
 limit = 100
 
 while True:
-    response = ze.get_behavior_evaluations(
+    response = ze.get_judge_evaluations(
         project_id="your-project-id",
         judge_id="your-judge-id",
         limit=limit,
@@ -166,3 +166,7 @@ while True:
 
 print(f"Fetched {len(all_evaluations)} total evaluations")
 ```
+
+## Related
+
+- [Submitting Feedback](/judges/submit-feedback) - Programmatically submit feedback for judge evaluations
````
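The loop being renamed in the last hunk is a standard offset-pagination pattern: fetch pages of `limit` items until a short page signals the end. A minimal, self-contained sketch of that pattern, with a stub standing in for `ze.get_judge_evaluations` (the stub and its data are hypothetical, not part of the ZeroEval SDK):

```python
def paginate(fetch, limit=100):
    """Collect all items from an offset-paginated endpoint.

    `fetch(limit=..., offset=...)` must return a dict with an
    "evaluations" list, mirroring the response shape in the docs above.
    """
    all_items = []
    offset = 0
    while True:
        page = fetch(limit=limit, offset=offset)["evaluations"]
        all_items.extend(page)
        if len(page) < limit:  # short (or empty) page: nothing left
            break
        offset += limit
    return all_items


# Hypothetical stub in place of ze.get_judge_evaluations:
def fake_fetch(limit, offset):
    data = [{"span_id": str(i)} for i in range(250)]
    return {"evaluations": data[offset:offset + limit]}

evals = paginate(fake_fetch, limit=100)
print(len(evals))  # → 250
```

Stopping on `len(page) < limit` rather than an empty page saves one request when the total is not an exact multiple of `limit`.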
Lines changed: 2 additions & 2 deletions

```diff
@@ -7,8 +7,8 @@ description: "Create and calibrate an AI judge in minutes"
 
 ## Creating a judge (<5 mins)
 
-1. Go to [Monitoring → Judges → New Judge](https://app.zeroeval.com/monitoring/signal-automations).
-2. Sepcify the behaviour that you want to track from your production traffic.
+1. Go to [Monitoring → Judges → New Judge](https://app.zeroeval.com/monitoring/judges).
+2. Specify the criteria that you want to evaluate from your production traffic.
 3. Tweak the prompt of the judge until it matches what you are looking for!
 
 That's it! Historical and future traces will be scored automatically and shown in the dashboard.
```

judges/submit-feedback.mdx: 119 additions & 0 deletions (new file)

---
title: "Submitting Feedback"
description: "Programmatically submit feedback for judge evaluations via SDK"
---

## Overview

When calibrating judges, you can submit feedback programmatically using the SDK. This is useful for:

- Bulk feedback submission from automated pipelines
- Integration with custom review workflows
- Syncing feedback from external labeling tools

## Important: Using the Correct IDs

Judge evaluations involve two related spans:

| ID | Description |
|---|---|
| **Source Span ID** | The original LLM call that was evaluated |
| **Judge Call Span ID** | The span created when the judge ran its evaluation |

When submitting feedback, always include the `judge_id` parameter to ensure feedback is correctly associated with the judge evaluation.

## Python SDK

### From the UI (Recommended)

The easiest way to get the correct IDs is from the Judge Evaluation modal:

1. Open a judge evaluation in the dashboard
2. Expand the "SDK Integration" section
3. Click "Copy" to copy the pre-filled Python code
4. Paste and customize the generated code

### Manual Submission

```python
from zeroeval import ZeroEval

client = ZeroEval()

# Submit feedback for a judge evaluation
client.send_feedback(
    prompt_slug="your-judge-task-slug",  # The task/prompt associated with the judge
    completion_id="span-id-here",        # The span ID from the evaluation
    thumbs_up=True,                      # True = correct, False = incorrect
    reason="Optional explanation",
    judge_id="automation-id-here",       # Required for judge feedback
)
```

### Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `prompt_slug` | str | Yes | The task slug associated with the judge |
| `completion_id` | str | Yes | The span ID being evaluated |
| `thumbs_up` | bool | Yes | `True` if judge was correct, `False` if wrong |
| `reason` | str | No | Explanation of the feedback |
| `judge_id` | str | Yes* | The judge automation ID (*required for judge feedback) |

## REST API

```bash
curl -X POST "https://api.zeroeval.com/v1/prompts/{task_slug}/completions/{span_id}/feedback" \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "thumbs_up": true,
    "reason": "Judge correctly identified the issue",
    "judge_id": "automation-uuid-here"
  }'
```
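The REST call above can also be prepared from Python without the SDK. A minimal standard-library sketch that only builds the request URL and JSON body (sending it, e.g. with `urllib.request` or `requests`, is left out; the slug, span, and judge values are placeholders):

```python
import json
from urllib.parse import quote


def build_feedback_request(task_slug, span_id, thumbs_up,
                           reason=None, judge_id=None):
    """Build the URL and JSON body for the feedback endpoint shown above."""
    url = (
        "https://api.zeroeval.com/v1/prompts/"
        f"{quote(task_slug)}/completions/{quote(span_id)}/feedback"
    )
    body = {"thumbs_up": thumbs_up}
    if reason is not None:          # optional fields are omitted, not null
        body["reason"] = reason
    if judge_id is not None:
        body["judge_id"] = judge_id
    return url, json.dumps(body)


url, body = build_feedback_request(
    "my-task", "span-123", True,
    reason="Judge correctly identified the issue",
    judge_id="automation-uuid",
)
print(url)  # → https://api.zeroeval.com/v1/prompts/my-task/completions/span-123/feedback
```

`quote()` guards against slugs or span IDs containing characters that are not URL-safe; omitting the optional fields entirely (rather than sending `null`) matches the shape of the curl payload above.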

## Finding Your IDs

| ID | Where to Find It |
|---|---|
| **Task Slug** | In the judge settings, or the URL when editing the judge's prompt |
| **Span ID** | In the evaluation modal, or via `get_judge_evaluations()` response |
| **Judge ID** | In the URL when viewing a judge (`/judges/{judge_id}`) |

## Bulk Feedback Submission

To submit feedback on multiple evaluations, iterate through them:

```python
from zeroeval import ZeroEval

client = ZeroEval()

# Get evaluations to review
evaluations = client.get_judge_evaluations(
    project_id="your-project-id",
    judge_id="your-judge-id",
    limit=100,
)

# Submit feedback for each
for eval in evaluations["evaluations"]:
    # Your logic to determine if the evaluation was correct
    is_correct = your_review_logic(eval)

    client.send_feedback(
        prompt_slug="your-judge-task-slug",
        completion_id=eval["span_id"],
        thumbs_up=is_correct,
        reason="Automated review",
        judge_id="your-judge-id",
    )
```

## Related

- [Pulling Evaluations](/judges/pull-evaluations) - Retrieve judge evaluations programmatically
- [Judge Setup](/judges/setup) - Configure and deploy judges