Image Assertions¶
pytest-aitest supports asserting on images returned by MCP tools. This is useful when your tools produce visual output — screenshots, charts, diagrams, or any image content.
Overview¶
There are two approaches to image assertions:
| Approach | Use Case | API |
|---|---|---|
| Structural | Check images exist, size, type | `result.tool_images_for()` |
| AI-Graded | Vision LLM evaluates image quality | `llm_assert_image` |
Prerequisites¶
Your MCP tool must return images as `ImageContentBlock` in the MCP response. PydanticAI converts these to `BinaryContent` objects, which pytest-aitest extracts into `ImageContent` objects.
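For reference, a server-side tool that satisfies this might look like the following minimal sketch using the MCP Python SDK's `FastMCP` helper; the server name and the `capture_worksheet_png` helper are illustrative placeholders, not part of pytest-aitest.

```python
# Minimal sketch of an MCP tool that returns an image. The Image return value
# is sent back as image content in the MCP response.
from mcp.server.fastmcp import FastMCP, Image

mcp = FastMCP("excel")  # illustrative server name


@mcp.tool()
def screenshot() -> Image:
    """Capture the current worksheet as a PNG."""
    png_bytes = capture_worksheet_png()  # hypothetical helper for illustration
    return Image(data=png_bytes, format="png")
```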
Checking If Images Were Returned¶
Use `result.tool_images_for(tool_name)` to get all images returned by a specific tool:
```python
async def test_screenshot_captured(aitest_run, agent):
    result = await aitest_run(agent, "Take a screenshot of the worksheet")

    # Get all images from the "screenshot" tool
    screenshots = result.tool_images_for("screenshot")

    # At least one screenshot was taken
    assert len(screenshots) > 0

    # Check image properties
    assert screenshots[-1].media_type == "image/png"
    assert len(screenshots[-1].data) > 1000  # Reasonable image size
```
ImageContent Properties¶
| Property | Type | Description |
|---|---|---|
| `data` | `bytes` | Raw image bytes |
| `media_type` | `str` | MIME type (e.g., `"image/png"`) |
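These properties are also handy outside assertions. For example, a short fragment (assuming `result` from `aitest_run`, as in the examples above) that dumps the most recent image to disk while debugging a failing test; the file name is an arbitrary choice here:

```python
# Save the last screenshot locally to inspect why an assertion failed.
screenshots = result.tool_images_for("screenshot")
if screenshots:
    extension = screenshots[-1].media_type.split("/")[-1]  # e.g. "png"
    with open(f"debug_screenshot.{extension}", "wb") as f:
        f.write(screenshots[-1].data)
```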
AI-Graded Image Evaluation¶
Use the `llm_assert_image` fixture to have a vision-capable LLM evaluate an image against plain-English criteria:
```python
async def test_dashboard_layout(aitest_run, agent, llm_assert_image):
    result = await aitest_run(agent, "Create a dashboard with 4 charts")

    screenshots = result.tool_images_for("screenshot")
    assert len(screenshots) > 0

    # Vision LLM judges the screenshot
    assert llm_assert_image(
        screenshots[-1],
        "Shows 4 charts arranged without overlapping, each with a descriptive title"
    )
```
How It Works¶
`llm_assert_image` uses pydantic-evals' `judge_output()`, which natively supports multimodal content. The image is sent to a vision-capable model along with your criterion, and the model evaluates whether the criterion is met.
Accepted Input Types¶
`llm_assert_image` accepts:

- `ImageContent` from `result.tool_images_for()` (recommended)
- Raw `bytes` with an optional `media_type` parameter (default: `image/png`)
```python
# From tool_images_for (recommended)
screenshots = result.tool_images_for("screenshot")
assert llm_assert_image(screenshots[-1], "shows a bar chart")

# From raw bytes
with open("screenshot.png", "rb") as f:
    assert llm_assert_image(f.read(), "shows a bar chart")

# With custom media type
assert llm_assert_image(jpeg_bytes, "shows a table", media_type="image/jpeg")
```
Vision Model Configuration¶
Command-Line Options¶
```bash
# Dedicated vision model (recommended for cost control)
pytest --llm-vision-model=azure/gpt-4o

# Falls back to --llm-model if --llm-vision-model not set
pytest --llm-model=azure/gpt-4o

# Falls back to --aitest-summary-model if neither set
pytest --aitest-summary-model=azure/gpt-4o
```
Model Requirements¶
The vision model must support image input. Recommended models:
| Provider | Models |
|---|---|
| OpenAI | gpt-4o, gpt-4o-mini |
| Anthropic | claude-sonnet-4, claude-haiku-4 |
| Azure | azure/gpt-4o |
| Google | gemini-2.0-flash |
Writing Effective Image Criteria¶
Good Criteria¶
- ✅ `"Shows 4 charts in a 2x2 grid layout"`
- ✅ `"Contains a bar chart with labeled axes"`
- ✅ `"No elements overlap each other"`
- ✅ `"Has a title at the top of the page"`
- ✅ `"Data table has at least 5 rows of content"`
Less Effective Criteria¶
- ❌ `"The chart is blue"` — too specific, may fail on theme changes
- ❌ `"Revenue is $1,234,567"` — exact values are hard to read from images (see the rewrite below)
- ❌ `"Looks professional"` — too subjective, inconsistent results
Tips¶
- Focus on structural properties (layout, count, presence)
- Avoid exact values (hard to OCR reliably)
- Be specific but flexible about visual properties
- Combine with text assertions for comprehensive coverage (see the sketch below)
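For instance, a test can run cheap structural checks first so they catch most regressions before the vision call. A sketch reusing only the fixtures and result methods shown on this page (the prompt and tool names are illustrative):

```python
async def test_report_page(aitest_run, agent, llm_assert_image):
    result = await aitest_run(agent, "Build a one-page sales report with a chart")
    assert result.success

    # Cheap structural checks first: the tool ran and produced an image.
    assert result.tool_was_called("chart")
    screenshots = result.tool_images_for("screenshot")
    assert len(screenshots) > 0
    assert screenshots[-1].media_type == "image/png"

    # One AI-graded check on the final screenshot.
    assert llm_assert_image(
        screenshots[-1],
        "Shows a single page with a title and one chart with labeled axes",
    )
```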
Complete Example: A/B Testing with Screenshots¶
"""A/B test: Does a screenshot tool improve dashboard quality?"""
import pytest
from pytest_aitest import Agent, Provider
CONTROL = Agent(
name="without-screenshot",
provider=Provider(model="azure/gpt-4o"),
mcp_servers=[excel_server],
allowed_tools=["file", "worksheet", "range", "table", "chart"],
)
EXPERIMENT = Agent(
name="with-screenshot",
provider=Provider(model="azure/gpt-4o"),
mcp_servers=[excel_server],
allowed_tools=["file", "worksheet", "range", "table", "chart", "screenshot"],
)
@pytest.mark.parametrize("agent", [CONTROL, EXPERIMENT], ids=lambda a: a.name)
async def test_dashboard(aitest_run, agent, llm_assert_image):
result = await aitest_run(agent, "Create a dashboard with 4 charts")
assert result.success
# Both variants should create charts
assert result.tool_was_called("chart")
# Experiment variant: verify visual quality
if agent.name == "with-screenshot":
screenshots = result.tool_images_for("screenshot")
if screenshots:
assert llm_assert_image(
screenshots[-1],
"Shows 4 charts with no overlapping elements"
)
HTML Reports¶
When tools return images, the HTML report shows inline thumbnails next to the tool call. This makes it easy to visually compare results across agents.
Cost Awareness¶
Vision model calls are more expensive than text-only calls. A single `llm_assert_image` call with a screenshot typically costs:
- GPT-4o: ~$0.01-0.03 per image
- Claude Sonnet: ~$0.01-0.02 per image
- GPT-4o-mini: ~$0.001-0.005 per image
Use `--llm-vision-model` to select a cost-appropriate model for your CI budget.
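For example, one workable split (model names here are illustrative, using only the options documented above) is to keep the primary model for agent runs while routing image grading to a cheaper vision-capable model:

```bash
# Agent runs on gpt-4o; image grading uses the cheaper gpt-4o-mini.
pytest --llm-model=azure/gpt-4o --llm-vision-model=azure/gpt-4o-mini
```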