---
title: "Multimodal Evaluation"
description: "Evaluate screenshots and images with LLM judges"
---

LLM judges can evaluate spans that contain images alongside text. This is useful for browser agents, UI testing, visual QA, and any workflow where you need to assess visual output.

## How it works

1. **Attach images to spans** using the SDK's `add_screenshot()` or `add_image()` methods
2. **Images are stored in S3** during span ingestion (base64 data is stripped from the span)
3. **Judges fetch images** when evaluating the span and send them to a vision-capable LLM
4. **Evaluation results** appear in the dashboard like any other judge evaluation

The LLM sees both the span's text data (input/output) and any attached images, giving it full context for evaluation.
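
Step 2 of the pipeline can be sketched as a simple transformation, assuming a dict-shaped span with an `images` list. The function name and storage-key scheme here are illustrative, not the actual ingestion code:

```python
def strip_images(span: dict) -> tuple[dict, list[str]]:
    """Pull base64 payloads out of a span, leaving lightweight
    storage references behind for judges to resolve later."""
    payloads = []
    for i, image in enumerate(span.get("images", []), start=1):
        payloads.append(image.pop("base64_data"))  # this blob goes to S3
        image["storage_key"] = f"spans/{span['id']}/img_{i}"  # hypothetical key layout
    return span, payloads

span = {"id": "abc123", "images": [{"base64_data": "aGVsbG8=", "label": "Homepage"}]}
slim_span, blobs = strip_images(span)
```

The span that judges later read keeps only the reference and metadata; the heavy base64 payload lives in object storage.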

## Attaching images to spans

### Screenshots with viewport context

For browser agents or responsive testing, use `add_screenshot()` to capture different viewports:

```python
import zeroeval as ze

# desktop_base64 / mobile_base64: base64-encoded image data
# captured by your browser tooling
with ze.span(name="homepage_test", tags={"test_type": "visual"}) as span:
    # Desktop viewport
    span.add_screenshot(
        base64_data=desktop_base64,
        viewport="desktop",
        width=1920,
        height=1080,
        label="Homepage - Desktop"
    )

    # Mobile viewport
    span.add_screenshot(
        base64_data=mobile_base64,
        viewport="mobile",
        width=375,
        height=812,
        label="Homepage - Mobile"
    )

    span.set_io(
        input_data="Load homepage and capture screenshots",
        output_data="Captured 2 viewport screenshots"
    )
```
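
The `desktop_base64` and `mobile_base64` values above are plain base64 strings. A minimal sketch of producing one from raw screenshot bytes (for example, the bytes returned by Playwright's `page.screenshot()`), using only the standard library:

```python
import base64

def to_base64(image_bytes: bytes) -> str:
    """Encode raw image bytes into the base64 string the SDK expects."""
    return base64.b64encode(image_bytes).decode("ascii")

# Placeholder bytes standing in for real PNG data
desktop_base64 = to_base64(b"\x89PNG\r\n\x1a\nfake-image-data")
```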

### Generic images

For charts, diagrams, or UI component states, use `add_image()`:

```python
with ze.span(name="button_hover_test") as span:
    span.add_image(
        base64_data=before_hover_base64,
        label="Button - Default State"
    )

    span.add_image(
        base64_data=after_hover_base64,
        label="Button - Hover State"
    )

    span.set_io(
        input_data="Test button hover interaction",
        output_data="Button changes color on hover"
    )
```

## Creating a multimodal judge

Multimodal judges work like regular judges, but with criteria that reference attached images. The judge prompt should describe what to look for in the visual content.

### Example: UI consistency judge

```
Evaluate whether the UI renders correctly across viewports.

Check for:
- Layout breaks or overlapping elements
- Text that's too small to read on mobile
- Missing or broken images
- Inconsistent spacing between viewports

Score 1 if all viewports render correctly, 0 if there are visual issues.
```

### Example: Brand compliance judge

```
Check if the page follows brand guidelines.

Look for:
- Correct logo placement and sizing
- Brand colors used consistently
- Proper typography hierarchy
- Appropriate whitespace

Score 1 for full compliance, 0 for violations.
```

### Example: Accessibility judge

```
Evaluate visual accessibility of the interface.

Check:
- Sufficient color contrast
- Text size readability
- Clear visual hierarchy
- Button/link affordances

Score 1 if accessible, 0 if there are issues. Include specific problems in the reasoning.
```
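
Conceptually, a judge combines criteria like these with the span's text I/O and its images into one multimodal request. A rough sketch of that assembly, assuming a generic content-parts payload (the function name and wire format are illustrative, not ZeroEval's actual API):

```python
def build_judge_request(criteria: str, span: dict, images: list[str]) -> list[dict]:
    """Combine judge criteria, span text, and base64 images into one message."""
    content = [{
        "type": "text",
        "text": f"{criteria}\n\nInput: {span['input']}\nOutput: {span['output']}",
    }]
    for b64 in images:
        content.append({"type": "image", "data": b64})
    return [{"role": "user", "content": content}]

messages = build_judge_request(
    "Score 1 if accessible, 0 if there are issues.",
    {"input": "Load homepage", "output": "Rendered OK"},
    ["aGVsbG8="],
)
```

Because the images ride alongside the text in the same message, the model can ground its score in both at once.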

## Filtering spans for multimodal evaluation

Use tags to identify which spans should be evaluated by your multimodal judge:

```python
# Tag spans that have screenshots
with ze.span(name="browser_test", tags={"has_screenshots": "true"}) as span:
    span.add_screenshot(...)
```

Then configure your judge to only evaluate spans matching that tag. This prevents the judge from running on text-only spans where multimodal evaluation doesn't apply.
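
The tag filter amounts to a simple predicate over span tags. Illustratively, with dict-shaped spans:

```python
spans = [
    {"name": "browser_test", "tags": {"has_screenshots": "true"}},
    {"name": "api_call", "tags": {}},
]

# Only spans tagged for screenshots reach the multimodal judge
multimodal_spans = [s for s in spans if s["tags"].get("has_screenshots") == "true"]
```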

## Supported image formats

- JPEG
- PNG
- WebP
- GIF

Images are validated during ingestion. The maximum size is 5MB per image, with up to 10 images per span.
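
These limits can also be checked client-side before upload. A sketch mirroring the stated constraints (the server performs its own validation during ingestion, so this is a convenience, not the actual validation code):

```python
ALLOWED_FORMATS = {"jpeg", "png", "webp", "gif"}
MAX_BYTES = 5 * 1024 * 1024   # 5MB per image
MAX_IMAGES = 10               # per span

def validate_images(images: list[tuple[str, bytes]]) -> None:
    """images: (format, raw_bytes) pairs attached to one span."""
    if len(images) > MAX_IMAGES:
        raise ValueError(f"at most {MAX_IMAGES} images per span")
    for fmt, data in images:
        if fmt.lower() not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {fmt}")
        if len(data) > MAX_BYTES:
            raise ValueError("image exceeds 5MB limit")
```

Failing fast in the client avoids a round trip for spans that ingestion would reject anyway.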

## Viewing images in the dashboard

Images appear in two places:

1. **Span details view** - Images show in the Data tab with viewport labels and dimensions
2. **Judge evaluation modal** - When reviewing an evaluation, you'll see the images the judge analyzed

Images display with their labels, viewport type (for screenshots), and dimensions when available.

## Model support

Multimodal evaluation currently uses Gemini models, which support image inputs. When you create a judge, ZeroEval automatically handles the image formatting for the model.

<Note>
Multimodal evaluation works best with specific, measurable criteria. Vague prompts like "does this look good?" will produce inconsistent results. Be explicit about what visual properties to check.
</Note>