AgentResult

Validate agent behavior using AgentResult properties and methods.

Properties

| Property | Type | Description |
|----------|------|-------------|
| success | bool | Did the agent complete without errors? |
| final_response | str | The agent's final text response |
| turns | list[Turn] | All execution turns |
| duration_ms | float | Total execution time |
| token_usage | dict[str, int] | Prompt and completion token counts |
| cost_usd | float | Estimated cost in USD |
| error | str \| None | Error message if failed |
| clarification_stats | ClarificationStats \| None | Clarification detection stats (when enabled) |
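
A minimal sketch of reading these properties in a test, using the aitest_run and agent fixtures shown in the examples further down this page:

async def test_result_properties(aitest_run, agent):
    result = await aitest_run(agent, "What's my checking balance?")

    assert result.success, f"Agent failed: {result.error}"
    assert result.final_response                  # non-empty final text
    assert len(result.turns) >= 1                 # at least one execution turn
    assert result.duration_ms < 30000             # completed in under 30 seconds
    print(f"Tokens: {result.token_usage}, cost: ${result.cost_usd:.4f}")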

Clarification Detection

Detect when the agent asks for clarification instead of acting autonomously. Uses an LLM judge to classify responses.

The judge performs a simple YES/NO classification, so a cheap model like gpt-5-mini is sufficient. Unlike --aitest-summary-model (which generates complex analysis), the judge doesn't need a capable model.

Configuration

from pytest_aitest import Agent, Provider, ClarificationDetection, ClarificationLevel

agent = Agent(
    provider=Provider(model="azure/gpt-5-mini"),
    mcp_servers=[server],
    clarification_detection=ClarificationDetection(
        enabled=True,
        level=ClarificationLevel.ERROR,       # INFO, WARNING, or ERROR
        judge_model="azure/gpt-5-mini",       # None = use agent's model
    ),
)

Assertions

# Did the agent ask for clarification?
assert not result.asked_for_clarification

# How many times?
assert result.clarification_count == 0

# Detailed stats
if result.clarification_stats:
    print(f"Count: {result.clarification_stats.count}")
    print(f"Turns: {result.clarification_stats.turn_indices}")
    print(f"Examples: {result.clarification_stats.examples}")

Tool Assertions

tool_was_called

Check if a tool was invoked:

# Basic check - was it called at all?
assert result.tool_was_called("get_balance")

# Check specific call count
assert result.tool_call_count("get_balance") == 2

tool_call_count

Get number of tool invocations:

count = result.tool_call_count("get_balance")
assert count >= 1
assert count <= 5

tool_call_arg

Get an argument from the first call to a tool:

# Get argument from first call
account = result.tool_call_arg("get_balance", "account")
assert account == "checking"

# For multiple calls, use tool_calls_for and index manually
calls = result.tool_calls_for("get_balance")
if len(calls) > 1:
    second_account = calls[1].arguments.get("account")

tool_calls_for

Get all calls to a specific tool:

calls = result.tool_calls_for("get_balance")

for call in calls:
    print(f"Called with: {call.arguments}")
    print(f"Result: {call.result}")

tool_images_for

Get all images returned by a specific tool:

screenshots = result.tool_images_for("screenshot")

for img in screenshots:
    print(f"Type: {img.media_type}, Size: {len(img.data)} bytes")

Returns a list of ImageContent objects. Each has:

| Property | Type | Description |
|----------|------|-------------|
| data | bytes | Raw image bytes |
| media_type | str | MIME type (e.g., "image/png") |
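
For example, returned screenshots can be written to disk for manual inspection (the file names below are illustrative):

from pathlib import Path

for i, img in enumerate(result.tool_images_for("screenshot")):
    assert img.media_type == "image/png"                # assumes the tool returns PNGs
    Path(f"screenshot_{i}.png").write_bytes(img.data)   # save for debugging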

ToolCall Fields

| Field | Type | Description |
|-------|------|-------------|
| name | str | Tool name |
| arguments | dict | Arguments passed to the tool |
| result | str \| None | Text result (or description for images) |
| error | str \| None | Error message if failed |
| duration_ms | float \| None | Call duration |
| image_content | bytes \| None | Raw image data (if tool returned image) |
| image_media_type | str \| None | Image MIME type (if tool returned image) |
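
A short sketch of checking these fields on the calls returned by tool_calls_for:

for call in result.tool_calls_for("get_balance"):
    assert call.name == "get_balance"
    assert call.error is None, f"Tool call failed: {call.error}"
    if call.duration_ms is not None:
        assert call.duration_ms < 5000                  # each call under 5 seconds
    if call.image_content is not None:
        print(f"Returned image of type {call.image_media_type}")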

Output Assertions

Check Response Content

# Case-insensitive content check
assert "checking" in result.final_response.lower()

# Multiple conditions
response = result.final_response.lower()
assert "balance" in response
assert "1,500" in response or "1500" in response

Check for Absence

# Ensure no errors mentioned
assert "error" not in result.final_response.lower()
assert "failed" not in result.final_response.lower()

Performance Assertions

Execution Time

# Check total execution time
assert result.duration_ms < 30000  # Under 30 seconds

Token Usage

# Check total token consumption
total = result.token_usage.get("prompt", 0) + result.token_usage.get("completion", 0)
assert total < 5000

# Detailed breakdown
print(f"Prompt tokens: {result.token_usage.get('prompt', 0)}")
print(f"Completion tokens: {result.token_usage.get('completion', 0)}")

Cost

# Check estimated cost
assert result.cost_usd < 0.10  # Under 10 cents

Error Handling

Check for Success

# Basic success check
assert result.success

# With error message on failure
assert result.success, f"Agent failed: {result.error}"

Inspect Errors

if not result.success:
    print(f"Error: {result.error}")

    # Check last turn for details
    last_turn = result.turns[-1]
    print(f"Last message: {last_turn.content}")

AI-Powered Assertions

For semantic validation, use the built-in llm_assert fixture (powered by the pydantic-evals LLM judge):

async def test_response_quality(aitest_run, agent, llm_assert):
    """Use the llm_assert fixture for semantic validation."""
    result = await aitest_run(agent, "What's my checking balance?")

    assert result.success
    assert llm_assert(result.final_response, "mentions account balance")

Complete Examples

Testing Tool Selection

async def test_correct_tool_selection(aitest_run, agent):
    result = await aitest_run(agent, "What's my checking balance?")

    assert result.success
    assert result.tool_was_called("get_balance")
    assert not result.tool_was_called("transfer")

    account = result.tool_call_arg("get_balance", "account")
    assert account.lower() == "checking"

Testing Multi-Step Workflow

async def test_multi_step_balances(aitest_run, agent):
    result = await aitest_run(
        agent,
        "Show me both my checking and savings balances"
    )

    assert result.success
    assert result.tool_call_count("get_balance") >= 2 or result.tool_was_called("get_all_balances")

    response = result.final_response.lower()
    assert "checking" in response
    assert "savings" in response