Commit 4e25e92

updates to docs
1 parent 25442de commit 4e25e92

File tree

3 files changed (+252, -4 lines)

docs.json

Lines changed: 2 additions & 4 deletions

```diff
@@ -71,17 +71,15 @@
       "pages": [
         "judges/introduction",
         "judges/setup",
+        "judges/multimodal-evaluation",
         "judges/submit-feedback",
         "judges/pull-evaluations"
       ]
     },
     {
       "group": "Experiments",
       "icon": "flask",
-      "pages": [
-        "evaluations/ab-tests",
-        "evaluations/prompt-management"
-      ]
+      "pages": ["evaluations/ab-tests", "evaluations/prompt-management"]
     },
     {
       "group": "LLM Gateway",
```

judges/multimodal-evaluation.mdx

Lines changed: 155 additions & 0 deletions

---
title: "Multimodal Evaluation"
description: "Evaluate screenshots and images with LLM judges"
---

LLM judges can evaluate spans that contain images alongside text. This is useful for browser agents, UI testing, visual QA, and any workflow where you need to assess visual output.

## How it works

1. **Attach images to spans** using the SDK's `add_screenshot()` or `add_image()` methods
2. **Images are stored in S3** during span ingestion (base64 data is stripped from the span)
3. **Judges fetch images** when evaluating the span and send them to a vision-capable LLM
4. **Evaluation results** appear in the dashboard like any other judge evaluation

The LLM sees both the span's text data (input/output) and any attached images, giving it full context for evaluation.
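The attachment examples in this guide pass variables like `desktop_base64` that are assumed to already exist. Producing that value takes only the standard library; the file path and variable names below are illustrative:

```python
import base64

def image_to_base64(path: str) -> str:
    """Read an image file from disk and return its raw base64 encoding."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# If the image bytes are already in memory (e.g. returned by a headless
# browser), encode them directly:
screenshot_bytes = b"\x89PNG\r\n\x1a\n"  # placeholder bytes, not a real image
desktop_base64 = base64.b64encode(screenshot_bytes).decode("ascii")
```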
## Attaching images to spans

### Screenshots with viewport context

For browser agents or responsive testing, use `add_screenshot()` to capture different viewports:

```python
import zeroeval as ze

with ze.span(name="homepage_test", tags={"test_type": "visual"}) as span:
    # Desktop viewport
    span.add_screenshot(
        base64_data=desktop_base64,
        viewport="desktop",
        width=1920,
        height=1080,
        label="Homepage - Desktop"
    )

    # Mobile viewport
    span.add_screenshot(
        base64_data=mobile_base64,
        viewport="mobile",
        width=375,
        height=812,
        label="Homepage - Mobile"
    )

    span.set_io(
        input_data="Load homepage and capture screenshots",
        output_data="Captured 2 viewport screenshots"
    )
```
### Generic images

For charts, diagrams, or UI component states, use `add_image()`:

```python
with ze.span(name="button_hover_test") as span:
    span.add_image(
        base64_data=before_hover_base64,
        label="Button - Default State"
    )

    span.add_image(
        base64_data=after_hover_base64,
        label="Button - Hover State"
    )

    span.set_io(
        input_data="Test button hover interaction",
        output_data="Button changes color on hover"
    )
```
## Creating a multimodal judge

Multimodal judges work like regular judges, but with criteria that reference the attached images. The judge prompt should describe what to look for in the visual content.

### Example: UI consistency judge

```
Evaluate whether the UI renders correctly across viewports.

Check for:
- Layout breaks or overlapping elements
- Text that's too small to read on mobile
- Missing or broken images
- Inconsistent spacing between viewports

Score 1 if all viewports render correctly, 0 if there are visual issues.
```

### Example: Brand compliance judge

```
Check if the page follows brand guidelines.

Look for:
- Correct logo placement and sizing
- Brand colors used consistently
- Proper typography hierarchy
- Appropriate whitespace

Score 1 for full compliance, 0 for violations.
```

### Example: Accessibility judge

```
Evaluate visual accessibility of the interface.

Check:
- Sufficient color contrast
- Text size readability
- Clear visual hierarchy
- Button/link affordances

Score 1 if accessible, 0 if there are issues. Include specific problems in the reasoning.
```
## Filtering spans for multimodal evaluation

Use tags to identify which spans should be evaluated by your multimodal judge:

```python
# Tag spans that have screenshots
with ze.span(name="browser_test", tags={"has_screenshots": "true"}) as span:
    span.add_screenshot(...)
```

Then configure your judge to evaluate only spans matching that tag. This prevents the judge from running on text-only spans where multimodal evaluation doesn't apply.
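Conceptually, the judge's tag filter is just a predicate over each span's tags. This sketch illustrates the matching behavior only; it is not the actual judge configuration API:

```python
def matches_judge_filter(span_tags: dict[str, str]) -> bool:
    """Sketch of a tag filter: only spans tagged as having screenshots qualify."""
    return span_tags.get("has_screenshots") == "true"

# A tagged browser span qualifies; an untagged text-only span is skipped.
browser_span_qualifies = matches_judge_filter({"has_screenshots": "true"})
text_span_qualifies = matches_judge_filter({"test_type": "text_only"})
```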
## Supported image formats

- JPEG
- PNG
- WebP
- GIF

Images are validated during ingestion. The maximum size is 5MB per image, with up to 10 images per span.
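To avoid ingestion rejections, you can pre-check payloads client-side against the documented limits (5MB per image, 10 images per span). This helper is illustrative and not part of the SDK:

```python
import base64

MAX_IMAGE_BYTES = 5 * 1024 * 1024   # documented limit: 5MB per image
MAX_IMAGES_PER_SPAN = 10            # documented limit: 10 images per span

def within_size_limit(base64_data: str) -> bool:
    """Check the decoded payload size before attaching an image to a span."""
    if base64_data.startswith("data:"):            # strip a data-URL prefix
        base64_data = base64_data.split(",", 1)[1]
    return len(base64.b64decode(base64_data)) <= MAX_IMAGE_BYTES
```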
## Viewing images in the dashboard

Screenshots appear in two places:

1. **Span details view** - Images show in the Data tab with viewport labels and dimensions
2. **Judge evaluation modal** - When reviewing an evaluation, you'll see the images the judge analyzed

Images display with their labels, viewport type (for screenshots), and dimensions when available.

## Model support

Multimodal evaluation currently uses Gemini models, which support image inputs. When you create a judge, ZeroEval automatically handles the image formatting for the model.

<Note>
Multimodal evaluation works best with specific, measurable criteria. Vague prompts like "does this look good?" will produce inconsistent results. Be explicit about what visual properties to check.
</Note>

tracing/sdks/python/reference.mdx

Lines changed: 95 additions & 0 deletions

@@ -484,6 +484,101 @@ def set_error(

- `message` (str): Error message
- `stack` (str, optional): Stack trace information
##### `add_screenshot()`

Attach a screenshot to the span for visual evaluation by LLM judges. Screenshots are uploaded to S3 during ingestion and can be evaluated alongside text data.

```python
def add_screenshot(
    self,
    base64_data: str,
    viewport: str = "desktop",
    width: Optional[int] = None,
    height: Optional[int] = None,
    label: Optional[str] = None
) -> None
```

**Parameters:**
- `self`: The Span instance
- `base64_data` (str): Base64-encoded image data. Accepts raw base64 or data URL format (`data:image/png;base64,...`)
- `viewport` (str, optional): Viewport type - `"desktop"`, `"mobile"`, or `"tablet"`. Defaults to `"desktop"`
- `width` (int, optional): Image width in pixels
- `height` (int, optional): Image height in pixels
- `label` (str, optional): Human-readable description of the screenshot

**Example:**
```python
import zeroeval as ze

with ze.span(name="browser_test", tags={"test": "visual"}) as span:
    # Capture and attach a desktop screenshot
    span.add_screenshot(
        base64_data=desktop_screenshot_base64,
        viewport="desktop",
        width=1920,
        height=1080,
        label="Homepage - Desktop"
    )

    # Also capture mobile view
    span.add_screenshot(
        base64_data=mobile_screenshot_base64,
        viewport="mobile",
        width=375,
        height=812,
        label="Homepage - iPhone"
    )

    span.set_io(
        input_data="Navigate to homepage",
        output_data="Captured viewport screenshots"
    )
```
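Since `base64_data` accepts either raw base64 or a data URL, a small normalization helper (illustrative, not part of the SDK) keeps call sites uniform before passing data to `add_screenshot()` or `add_image()`:

```python
import base64

def normalize_base64(data: str) -> str:
    """Return raw base64, stripping a data-URL prefix such as
    'data:image/png;base64,...' if one is present."""
    if data.startswith("data:"):
        data = data.split(",", 1)[1]
    return data

# Both forms normalize to the same raw base64 string:
raw = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")
as_data_url = normalize_base64("data:image/png;base64," + raw)
```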
##### `add_image()`

Attach a generic image to the span for visual evaluation. Use this for non-screenshot images like charts, diagrams, or UI component states.

```python
def add_image(
    self,
    base64_data: str,
    label: Optional[str] = None,
    metadata: Optional[dict[str, Any]] = None
) -> None
```

**Parameters:**
- `self`: The Span instance
- `base64_data` (str): Base64-encoded image data. Accepts raw base64 or data URL format
- `label` (str, optional): Human-readable description of the image
- `metadata` (dict, optional): Additional metadata to store with the image

**Example:**
```python
import zeroeval as ze

with ze.span(name="chart_generation") as span:
    # Generate a chart and attach it
    chart_base64 = generate_chart(data)

    span.add_image(
        base64_data=chart_base64,
        label="Monthly Revenue Chart",
        metadata={"chart_type": "bar", "data_points": 12}
    )

    span.set_io(
        input_data="Generate revenue chart for Q4",
        output_data="Chart generated with 12 data points"
    )
```

<Note>
Images attached to spans can be evaluated by LLM judges configured for multimodal evaluation. See the [Multimodal Evaluation](/judges/multimodal-evaluation) guide for setup instructions.
</Note>
## Context Functions

### `get_current_span()`
