
Commit ef0a5e2

update docs with new pages
1 parent 90730c0 commit ef0a5e2

File tree

8 files changed

+221
-523
lines changed

Lines changed: 1 addition & 1 deletion

```diff
@@ -9,7 +9,7 @@ Calibrated LLM judges are AI evaluators that watch your traces, sessions, or spans
 
 ## When to use
 
-Use a calibrated judge when you want consistent, scalable evaluation of:
+Use a behavior when you want consistent, scalable evaluation of:
 
 - Hallucinations, safety/policy violations
 - Response quality (helpfulness, tone, structure)
```
behaviors/pull-evaluations.mdx

Lines changed: 168 additions & 0 deletions

---
title: "Pulling Evaluations"
description: "Retrieve judge evaluations via SDK or REST API"
---

Retrieve judge evaluations programmatically for reporting, analysis, or integration into your own workflows.

## Finding your IDs

Before making API calls, you'll need these identifiers:

| ID | Where to find it |
|---|---|
| **Project ID** | Settings → Project, or in any URL after `/projects/` |
| **Judge ID** | Click a judge in the dashboard; the ID is in the URL (`/judges/{judge_id}`) |
| **Span ID** | In trace details, or returned by your instrumentation code |
## Python SDK

### Get evaluations by judge

Fetch all evaluations for a specific judge with pagination and optional filters.

```python
import zeroeval as ze

ze.init(api_key="YOUR_API_KEY")

response = ze.get_behavior_evaluations(
    project_id="your-project-id",
    judge_id="your-judge-id",
    limit=100,
    offset=0,
)

print(f"Total: {response['total']}")
for eval in response["evaluations"]:
    print(f"Span: {eval['span_id']}")
    print(f"Result: {'PASS' if eval['evaluation_result'] else 'FAIL'}")
    print(f"Score: {eval.get('score')}")  # For scored judges
    print(f"Reason: {eval['evaluation_reason']}")
```
**Optional filters:**

```python
response = ze.get_behavior_evaluations(
    project_id="your-project-id",
    judge_id="your-judge-id",
    limit=100,
    offset=0,
    start_date="2025-01-01T00:00:00Z",
    end_date="2025-01-31T23:59:59Z",
    evaluation_result=True,  # Only passing evaluations
    feedback_state="with_user_feedback",  # Only calibrated items
)
```
### Get evaluations by span

Fetch all judge evaluations for a specific span (useful when a span has been evaluated by multiple judges).

```python
response = ze.get_span_evaluations(
    project_id="your-project-id",
    span_id="your-span-id",
)

for eval in response["evaluations"]:
    print(f"Judge: {eval['judge_name']}")
    print(f"Result: {'PASS' if eval['evaluation_result'] else 'FAIL'}")
    if eval.get('evaluation_type') == 'scored':
        print(f"Score: {eval['score']} / {eval['score_max']}")
```
## REST API

Use these endpoints directly with your API key in the `Authorization` header.

### Get evaluations by judge

```bash
curl -X GET "https://api.zeroeval.com/projects/{project_id}/judges/{judge_id}/evaluations?limit=100&offset=0" \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY"
```

**Query parameters:**

| Parameter | Type | Description |
|---|---|---|
| `limit` | int | Results per page (1-500, default 100) |
| `offset` | int | Pagination offset (default 0) |
| `start_date` | string | Filter by date (ISO 8601) |
| `end_date` | string | Filter by date (ISO 8601) |
| `evaluation_result` | bool | `true` for passing, `false` for failing |
| `feedback_state` | string | `with_user_feedback` or `without_user_feedback` |
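The endpoint and query parameters above can also be assembled in plain Python before sending with an HTTP client such as `requests`; a minimal sketch, assuming the `ZEROEVAL_API_KEY` environment variable is set (the `build_request` helper is illustrative, not part of any SDK):

```python
import os

API_BASE = "https://api.zeroeval.com"

def build_request(project_id: str, judge_id: str, *, limit: int = 100,
                  offset: int = 0, **filters) -> dict:
    """Assemble the URL, headers, and query params for the by-judge endpoint."""
    return {
        "url": f"{API_BASE}/projects/{project_id}/judges/{judge_id}/evaluations",
        "params": {"limit": limit, "offset": offset, **filters},
        "headers": {"Authorization": f"Bearer {os.environ.get('ZEROEVAL_API_KEY', '')}"},
    }

req = build_request("your-project-id", "your-judge-id", evaluation_result="true")
# Send with e.g.: requests.get(req["url"], params=req["params"], headers=req["headers"])
```

Keeping the request assembly separate from the actual HTTP call makes the filter handling easy to unit test without hitting the network.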
### Get evaluations by span

```bash
curl -X GET "https://api.zeroeval.com/projects/{project_id}/spans/{span_id}/evaluations" \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY"
```
## Response format

### Judge evaluations response

```json
{
  "evaluations": [...],
  "total": 142,
  "limit": 100,
  "offset": 0
}
```
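The `total` and `offset` fields are enough to decide whether another page remains; a small illustrative helper (not part of the SDK):

```python
def has_more(response: dict) -> bool:
    """True when evaluations beyond the current page remain on the server."""
    return response["offset"] + len(response["evaluations"]) < response["total"]

page = {"evaluations": [{}] * 100, "total": 142, "limit": 100, "offset": 0}
print(has_more(page))  # True: 42 evaluations remain after this page
```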
### Span evaluations response

```json
{
  "span_id": "abc-123",
  "evaluations": [...]
}
```
### Evaluation object

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique evaluation ID |
| `span_id` | string | The evaluated span |
| `evaluation_result` | bool | Pass (`true`) or fail (`false`) |
| `evaluation_reason` | string | Judge's reasoning |
| `confidence_score` | float | Model confidence (0-1) |
| `score` | float \| null | Numeric score (scored judges only) |
| `score_min` | float \| null | Minimum possible score |
| `score_max` | float \| null | Maximum possible score |
| `pass_threshold` | float \| null | Score required to pass |
| `model_used` | string | LLM model that ran the evaluation |
| `created_at` | string | ISO 8601 timestamp |
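The fields above are enough for simple offline aggregates such as pass rate and mean score; a sketch over fetched evaluation objects (the sample records are made up for illustration):

```python
def summarize(evaluations: list) -> dict:
    """Pass rate and mean score over a list of evaluation objects."""
    total = len(evaluations)
    passed = sum(1 for e in evaluations if e["evaluation_result"])
    # Binary judges have no score; skip null scores when averaging
    scores = [e["score"] for e in evaluations if e.get("score") is not None]
    return {
        "total": total,
        "pass_rate": passed / total if total else 0.0,
        "mean_score": sum(scores) / len(scores) if scores else None,
    }

sample = [
    {"evaluation_result": True, "score": 4.0},
    {"evaluation_result": False, "score": 2.0},
    {"evaluation_result": True, "score": None},  # binary judge, no score
]
print(summarize(sample))
```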
## Pagination example

For large result sets, paginate through all evaluations:

```python
all_evaluations = []
offset = 0
limit = 100

while True:
    response = ze.get_behavior_evaluations(
        project_id="your-project-id",
        judge_id="your-judge-id",
        limit=limit,
        offset=offset,
    )

    all_evaluations.extend(response["evaluations"])

    if len(response["evaluations"]) < limit:
        break

    offset += limit

print(f"Fetched {len(all_evaluations)} total evaluations")
```
File renamed without changes.

docs.json

Lines changed: 4 additions & 5 deletions
```diff
@@ -66,19 +66,18 @@
       ]
     },
     {
-      "group": "Calibrated Judges",
+      "group": "Behaviors",
       "icon": "gavel",
       "pages": [
-        "calibrated-judges/introduction",
-        "calibrated-judges/setup"
+        "behaviors/introduction",
+        "behaviors/setup",
+        "behaviors/pull-evaluations"
       ]
     },
     {
       "group": "Experiments",
       "icon": "flask",
       "pages": [
-        "evaluations/datasets",
-        "evaluations/experiments",
         "evaluations/ab-tests",
         "evaluations/prompt-management"
       ]
```
