Image Assertions

pytest-aitest supports asserting on images returned by MCP tools. This is useful when your tools produce visual output — screenshots, charts, diagrams, or any image content.

Overview

There are two approaches to image assertions:

Approach    Use Case                             Fixture
Structural  Check images exist, size, type       result.tool_images_for()
AI-Graded   Vision LLM evaluates image quality   llm_assert_image

Prerequisites

Your MCP tool must return images as ImageContentBlock in the MCP response. PydanticAI converts these to BinaryContent objects, which pytest-aitest extracts into ImageContent objects.
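For illustration, here is one way a server built with FastMCP from the official MCP Python SDK can return an image: the Image helper is serialized as an image content block in the tool result. This is a minimal sketch, not code from pytest-aitest; render_worksheet_png is a hypothetical rendering function.

from mcp.server.fastmcp import FastMCP, Image

mcp = FastMCP("excel")

@mcp.tool()
def screenshot() -> Image:
    """Capture the current worksheet as a PNG."""
    png_bytes = render_worksheet_png()  # hypothetical rendering helper
    return Image(data=png_bytes, format="png")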

Checking If Images Were Returned

Use result.tool_images_for(tool_name) to get all images returned by a specific tool:

async def test_screenshot_captured(aitest_run, agent):
    result = await aitest_run(agent, "Take a screenshot of the worksheet")

    # Get all images from the "screenshot" tool
    screenshots = result.tool_images_for("screenshot")

    # At least one screenshot was taken
    assert len(screenshots) > 0

    # Check image properties
    assert screenshots[-1].media_type == "image/png"
    assert len(screenshots[-1].data) > 1000  # Reasonable image size

ImageContent Properties

Property    Type   Description
data        bytes  Raw image bytes
media_type  str    MIME type (e.g., "image/png")
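These properties are handy for debugging. For example, you might dump the last screenshot to disk to see exactly what the vision model will be judging; the file name below is arbitrary.

screenshots = result.tool_images_for("screenshot")
image = screenshots[-1]

# Derive a file extension from the MIME type, e.g. "image/png" -> "png"
extension = image.media_type.split("/")[-1]
with open(f"debug_screenshot.{extension}", "wb") as f:
    f.write(image.data)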

AI-Graded Image Evaluation

Use the llm_assert_image fixture to have a vision-capable LLM evaluate an image against plain-English criteria:

async def test_dashboard_layout(aitest_run, agent, llm_assert_image):
    result = await aitest_run(agent, "Create a dashboard with 4 charts")

    screenshots = result.tool_images_for("screenshot")
    assert len(screenshots) > 0

    # Vision LLM judges the screenshot
    assert llm_assert_image(
        screenshots[-1],
        "Shows 4 charts arranged without overlapping, each with a descriptive title"
    )

How It Works

llm_assert_image uses pydantic-evals' judge_output(), which natively supports multimodal content. The image is sent to a vision-capable model along with your criterion, and the model evaluates whether the criterion is met.
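Conceptually, the fixture does something like the sketch below. This is not the plugin's actual source; the judge_output import path, its signature, and the pass_ field on the grading result are assumptions based on pydantic-evals, shown only to illustrate the flow.

from pydantic_ai import BinaryContent
from pydantic_evals.evaluators.llm_as_a_judge import judge_output  # assumed import path

async def image_meets_criterion(image, criterion: str, model: str) -> bool:
    # Wrap the raw bytes so the judge model receives them as image input
    content = BinaryContent(data=image.data, media_type=image.media_type)
    grading = await judge_output(content, criterion, model=model)
    return grading.pass_  # assumed field name on the grading result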

Accepted Input Types

llm_assert_image accepts:

  • ImageContent from result.tool_images_for() (recommended)
  • Raw bytes with optional media_type parameter (default: image/png)

# From tool_images_for (recommended)
screenshots = result.tool_images_for("screenshot")
assert llm_assert_image(screenshots[-1], "shows a bar chart")

# From raw bytes
with open("screenshot.png", "rb") as f:
    assert llm_assert_image(f.read(), "shows a bar chart")

# With custom media type
assert llm_assert_image(jpeg_bytes, "shows a table", media_type="image/jpeg")

Vision Model Configuration

Command-Line Options

# Dedicated vision model (recommended for cost control)
pytest --llm-vision-model=azure/gpt-4o

# Falls back to --llm-model if --llm-vision-model not set
pytest --llm-model=azure/gpt-4o

# Falls back to --aitest-summary-model if neither set
pytest --aitest-summary-model=azure/gpt-4o

Model Requirements

The vision model must support image input. Recommended models:

Provider   Models
OpenAI     gpt-4o, gpt-4o-mini
Anthropic  claude-sonnet-4, claude-haiku-4
Azure      azure/gpt-4o
Google     gemini-2.0-flash

Writing Effective Image Criteria

Good Criteria

  • "Shows 4 charts in a 2x2 grid layout"
  • "Contains a bar chart with labeled axes"
  • "No elements overlap each other"
  • "Has a title at the top of the page"
  • "Data table has at least 5 rows of content"

Less Effective Criteria

  • "The chart is blue" — too specific, may fail on theme changes
  • "Revenue is $1,234,567" — exact values are hard to read from images
  • "Looks professional" — too subjective, inconsistent results

Tips

  • Focus on structural properties (layout, count, presence)
  • Avoid exact values (hard to OCR reliably)
  • Be specific but flexible about visual properties
  • Combine with structural and text assertions for comprehensive coverage (see the sketch below)
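Putting these tips together, a single test can pair structural checks with one focused vision criterion. This sketch reuses only the fixtures and result methods shown above; the prompt and test name are illustrative.

async def test_sales_chart(aitest_run, agent, llm_assert_image):
    result = await aitest_run(agent, "Add a bar chart of sales by region")
    assert result.success

    # Structural: the chart tool ran and at least one screenshot exists
    assert result.tool_was_called("chart")
    screenshots = result.tool_images_for("screenshot")
    assert len(screenshots) > 0

    # AI-graded: one focused, layout-level criterion
    assert llm_assert_image(
        screenshots[-1],
        "Contains a bar chart with labeled axes and a title",
    )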

Complete Example: A/B Testing with Screenshots

"""A/B test: Does a screenshot tool improve dashboard quality?"""

import pytest
from pytest_aitest import Agent, Provider

# NOTE: excel_server is an MCP server definition assumed to be configured
# elsewhere in the test suite.
CONTROL = Agent(
    name="without-screenshot",
    provider=Provider(model="azure/gpt-4o"),
    mcp_servers=[excel_server],
    allowed_tools=["file", "worksheet", "range", "table", "chart"],
)

EXPERIMENT = Agent(
    name="with-screenshot",
    provider=Provider(model="azure/gpt-4o"),
    mcp_servers=[excel_server],
    allowed_tools=["file", "worksheet", "range", "table", "chart", "screenshot"],
)

@pytest.mark.parametrize("agent", [CONTROL, EXPERIMENT], ids=lambda a: a.name)
async def test_dashboard(aitest_run, agent, llm_assert_image):
    result = await aitest_run(agent, "Create a dashboard with 4 charts")
    assert result.success

    # Both variants should create charts
    assert result.tool_was_called("chart")

    # Experiment variant: verify visual quality
    if agent.name == "with-screenshot":
        screenshots = result.tool_images_for("screenshot")
        if screenshots:
            assert llm_assert_image(
                screenshots[-1],
                "Shows 4 charts with no overlapping elements"
            )

HTML Reports

When tools return images, the HTML report shows inline thumbnails next to the tool call. This makes it easy to visually compare results across agents.

Cost Awareness

Vision model calls are more expensive than text-only calls. A single llm_assert_image call with a screenshot typically costs:

  • GPT-4o: ~$0.01-0.03 per image
  • Claude Sonnet: ~$0.01-0.02 per image
  • GPT-4o-mini: ~$0.001-0.005 per image

Use --llm-vision-model to select a cost-appropriate model for your CI budget.