---
title: "Pulling Evaluations"
description: "Retrieve judge evaluations via SDK or REST API"
---

Retrieve judge evaluations programmatically for reporting, analysis, or integration into your own workflows.

## Finding your IDs

Before making API calls, you'll need these identifiers:

| ID | Where to find it |
|---|---|
| **Project ID** | Settings → Project, or in any URL after `/projects/` |
| **Judge ID** | Click a judge in the dashboard; the ID is in the URL (`/judges/{judge_id}`) |
| **Span ID** | In trace details, or returned by your instrumentation code |

## Python SDK

### Get evaluations by judge

Fetch all evaluations for a specific judge with pagination and optional filters.

```python
import zeroeval as ze

ze.init(api_key="YOUR_API_KEY")

response = ze.get_behavior_evaluations(
    project_id="your-project-id",
    judge_id="your-judge-id",
    limit=100,
    offset=0,
)

print(f"Total: {response['total']}")
for evaluation in response["evaluations"]:
    print(f"Span: {evaluation['span_id']}")
    print(f"Result: {'PASS' if evaluation['evaluation_result'] else 'FAIL'}")
    print(f"Score: {evaluation.get('score')}")  # For scored judges
    print(f"Reason: {evaluation['evaluation_reason']}")
```

**Optional filters:**

```python
response = ze.get_behavior_evaluations(
    project_id="your-project-id",
    judge_id="your-judge-id",
    limit=100,
    offset=0,
    start_date="2025-01-01T00:00:00Z",
    end_date="2025-01-31T23:59:59Z",
    evaluation_result=True,               # Only passing evaluations
    feedback_state="with_user_feedback",  # Only calibrated items
)
```

### Get evaluations by span

Fetch all judge evaluations for a specific span (useful when a span has been evaluated by multiple judges).

```python
response = ze.get_span_evaluations(
    project_id="your-project-id",
    span_id="your-span-id",
)

for evaluation in response["evaluations"]:
    print(f"Judge: {evaluation['judge_name']}")
    print(f"Result: {'PASS' if evaluation['evaluation_result'] else 'FAIL'}")
    if evaluation.get("evaluation_type") == "scored":
        print(f"Score: {evaluation['score']} / {evaluation['score_max']}")
```

## REST API

Use these endpoints directly with your API key in the `Authorization` header.

### Get evaluations by judge

```bash
curl -X GET "https://api.zeroeval.com/projects/{project_id}/judges/{judge_id}/evaluations?limit=100&offset=0" \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY"
```

**Query parameters:**

| Parameter | Type | Description |
|---|---|---|
| `limit` | int | Results per page (1-500, default 100) |
| `offset` | int | Pagination offset (default 0) |
| `start_date` | string | Filter by date (ISO 8601) |
| `end_date` | string | Filter by date (ISO 8601) |
| `evaluation_result` | bool | `true` for passing, `false` for failing |
| `feedback_state` | string | `with_user_feedback` or `without_user_feedback` |
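The filters in the table above combine into a single query string. A minimal sketch of building one with Python's standard library (parameter names come from the table; the base URL matches the curl example, with the `{project_id}`/`{judge_id}` placeholders left for you to fill in):

```python
from urllib.parse import urlencode

base = "https://api.zeroeval.com/projects/{project_id}/judges/{judge_id}/evaluations"

# Query for failing evaluations from January 2025.
params = {
    "limit": 100,
    "offset": 0,
    "start_date": "2025-01-01T00:00:00Z",
    "end_date": "2025-01-31T23:59:59Z",
    "evaluation_result": "false",  # send booleans as lowercase strings
}

url = f"{base}?{urlencode(params)}"
print(url)
```

`urlencode` percent-escapes the colons in the ISO 8601 timestamps for you, so the dates can be written as-is.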
### Get evaluations by span

```bash
curl -X GET "https://api.zeroeval.com/projects/{project_id}/spans/{span_id}/evaluations" \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY"
```

## Response format

### Judge evaluations response

```json
{
  "evaluations": [...],
  "total": 142,
  "limit": 100,
  "offset": 0
}
```

### Span evaluations response

```json
{
  "span_id": "abc-123",
  "evaluations": [...]
}
```

### Evaluation object

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique evaluation ID |
| `span_id` | string | The evaluated span |
| `evaluation_result` | bool | Pass (`true`) or fail (`false`) |
| `evaluation_reason` | string | Judge's reasoning |
| `confidence_score` | float | Model confidence (0-1) |
| `score` | float \| null | Numeric score (scored judges only) |
| `score_min` | float \| null | Minimum possible score |
| `score_max` | float \| null | Maximum possible score |
| `pass_threshold` | float \| null | Score required to pass |
| `model_used` | string | LLM model that ran the evaluation |
| `created_at` | string | ISO 8601 timestamp |
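These fields are enough to compute simple aggregate metrics client-side. A sketch, assuming only the field names from the table above (the records here are made-up sample data, not real API output):

```python
# Hypothetical records shaped like the evaluation object above.
evaluations = [
    {"evaluation_result": True, "score": 4.0, "score_max": 5.0},
    {"evaluation_result": False, "score": 2.0, "score_max": 5.0},
    {"evaluation_result": True, "score": 5.0, "score_max": 5.0},
]

# Pass rate across all evaluations.
pass_rate = sum(e["evaluation_result"] for e in evaluations) / len(evaluations)

# Mean score over scored judges only (score is null/None for binary judges).
scored = [e["score"] for e in evaluations if e.get("score") is not None]
mean_score = sum(scored) / len(scored) if scored else None

print(f"Pass rate: {pass_rate:.0%}")    # Pass rate: 67%
print(f"Mean score: {mean_score:.2f}")  # Mean score: 3.67
```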
## Pagination example

For large result sets, paginate through all evaluations:

```python
all_evaluations = []
offset = 0
limit = 100

while True:
    response = ze.get_behavior_evaluations(
        project_id="your-project-id",
        judge_id="your-judge-id",
        limit=limit,
        offset=offset,
    )

    all_evaluations.extend(response["evaluations"])

    if len(response["evaluations"]) < limit:
        break

    offset += limit

print(f"Fetched {len(all_evaluations)} total evaluations")
```
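If you'd rather stream results than hold them all in memory, the same loop can be wrapped in a generator. A sketch under one assumption: `fetch_page` stands in for a call like `ze.get_behavior_evaluations` with the project and judge IDs already bound (stubbed here with fake data so the control flow is testable):

```python
def iter_evaluations(fetch_page, limit=100):
    """Yield evaluations one at a time, fetching pages lazily."""
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        yield from page["evaluations"]
        if len(page["evaluations"]) < limit:
            break
        offset += limit

# Stub standing in for the bound SDK call; returns 250 fake evaluations.
def fake_fetch(limit, offset):
    items = [{"id": i} for i in range(250)]
    return {"evaluations": items[offset:offset + limit]}

count = sum(1 for _ in iter_evaluations(fake_fetch))
print(count)  # 250
```

The generator stops after the first short page, so it makes the same number of requests as the loop above while letting you process each evaluation as it arrives.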