AgentResult

Validate agent behavior using AgentResult properties and methods.

Properties

| Property | Type | Description |
|----------|------|-------------|
| success | bool | Did the agent complete without errors? |
| final_response | str | The agent's final text response |
| turns | list[Turn] | All execution turns |
| duration_ms | float | Total execution time |
| token_usage | dict[str, int] | Prompt and completion token counts |
| cost_usd | float | Estimated cost in USD |
| error | str \| None | Error message if failed |
| clarification_stats | ClarificationStats \| None | Clarification detection stats (when enabled) |
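
A minimal sketch of reading these properties in a test, using the aitest_run and agent fixtures shown in the examples further down this page:

async def test_result_properties(aitest_run, agent):
    result = await aitest_run(agent, "What's my checking balance?")

    assert result.success, f"Agent failed: {result.error}"
    assert result.final_response                  # non-empty final text
    assert len(result.turns) >= 1                 # at least one execution turn
    assert result.duration_ms < 30000             # completed in under 30 seconds
    print(f"Tokens: {result.token_usage}, cost: ${result.cost_usd:.4f}")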

Clarification Detection

Detect when the agent asks for clarification instead of acting autonomously. Uses an LLM judge to classify responses.

The judge performs a simple YES/NO classification, so a cheap model like gpt-5-mini is sufficient. Unlike --aitest-summary-model (which generates complex analysis), the judge doesn't need a capable model.

Configuration

from pytest_aitest import Agent, Provider, ClarificationDetection, ClarificationLevel

agent = Agent(
    provider=Provider(model="azure/gpt-5-mini"),
    mcp_servers=[server],
    clarification_detection=ClarificationDetection(
        enabled=True,
        level=ClarificationLevel.ERROR,       # INFO, WARNING, or ERROR
        judge_model="azure/gpt-5-mini",       # None = use agent's model
    ),
)

Assertions

# Did the agent ask for clarification?
assert not result.asked_for_clarification

# How many times?
assert result.clarification_count == 0

# Detailed stats
if result.clarification_stats:
    print(f"Count: {result.clarification_stats.count}")
    print(f"Turns: {result.clarification_stats.turn_indices}")
    print(f"Examples: {result.clarification_stats.examples}")

Tool Assertions

tool_was_called

Check if a tool was invoked:

# Basic check - was it called at all?
assert result.tool_was_called("get_balance")

# Check specific call count
assert result.tool_call_count("get_balance") == 2

tool_call_count

Get number of tool invocations:

count = result.tool_call_count("get_balance")
assert count >= 1
assert count <= 5

tool_call_arg

Get an argument from the first call to a tool:

# Get argument from first call
account = result.tool_call_arg("get_balance", "account")
assert account == "checking"

# For multiple calls, use tool_calls_for and index manually
calls = result.tool_calls_for("get_balance")
if len(calls) > 1:
    second_account = calls[1].arguments.get("account")

tool_calls_for

Get all calls to a specific tool:

calls = result.tool_calls_for("get_balance")

for call in calls:
    print(f"Called with: {call.arguments}")
    print(f"Result: {call.result}")

tool_images_for

Get all images returned by a specific tool:

screenshots = result.tool_images_for("screenshot")

for img in screenshots:
    print(f"Type: {img.media_type}, Size: {len(img.data)} bytes")

Returns a list of ImageContent objects. Each has:

| Property | Type | Description |
|----------|------|-------------|
| data | bytes | Raw image bytes |
| media_type | str | MIME type (e.g., "image/png") |
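
For example, returned screenshots can be written to disk for manual inspection (the file names below are illustrative):

from pathlib import Path

for i, img in enumerate(result.tool_images_for("screenshot")):
    assert img.media_type == "image/png"                # assumes the tool returns PNGs
    Path(f"screenshot_{i}.png").write_bytes(img.data)   # save for debugging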

ToolCall Fields

| Field | Type | Description |
|-------|------|-------------|
| name | str | Tool name |
| arguments | dict | Arguments passed to the tool |
| result | str \| None | Text result (or description for images) |
| error | str \| None | Error message if failed |
| duration_ms | float \| None | Call duration |
| image_content | bytes \| None | Raw image data (if tool returned image) |
| image_media_type | str \| None | Image MIME type (if tool returned image) |
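
A short sketch of checking these fields on the calls returned by tool_calls_for:

for call in result.tool_calls_for("get_balance"):
    assert call.name == "get_balance"
    assert call.error is None, f"Tool call failed: {call.error}"
    if call.duration_ms is not None:
        assert call.duration_ms < 5000                  # each call under 5 seconds
    if call.image_content is not None:
        print(f"Returned image of type {call.image_media_type}")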

Output Assertions

Check Response Content

# Case-insensitive content check
assert "checking" in result.final_response.lower()

# Multiple conditions
response = result.final_response.lower()
assert "balance" in response
assert "1,500" in response or "1500" in response

Check for Absence

# Ensure no errors mentioned
assert "error" not in result.final_response.lower()
assert "failed" not in result.final_response.lower()

Performance Assertions

Execution Time

# Check total execution time
assert result.duration_ms < 30000  # Under 30 seconds

Token Usage

# Check total token consumption
total = result.token_usage.get("prompt", 0) + result.token_usage.get("completion", 0)
assert total < 5000

# Detailed breakdown
print(f"Prompt tokens: {result.token_usage.get('prompt', 0)}")
print(f"Completion tokens: {result.token_usage.get('completion', 0)}")

Cost

# Check estimated cost
assert result.cost_usd < 0.10  # Under 10 cents

Error Handling

Check for Success

# Basic success check
assert result.success

# With error message on failure
assert result.success, f"Agent failed: {result.error}"

Inspect Errors

if not result.success:
    print(f"Error: {result.error}")

    # Check last turn for details
    last_turn = result.turns[-1]
    print(f"Last message: {last_turn.content}")

AI-Powered Assertions

For semantic validation, use the built-in llm_assert fixture (powered by the pydantic-evals LLM judge):

async def test_response_quality(aitest_run, agent, llm_assert):
    """Use the llm_assert fixture for semantic validation."""
    result = await aitest_run(agent, "What's my checking balance?")

    assert result.success
    assert llm_assert(result.final_response, "mentions account balance")

Complete Examples

Testing Tool Selection

async def test_correct_tool_selection(aitest_run, agent):
    result = await aitest_run(agent, "What's my checking balance?")

    assert result.success
    assert result.tool_was_called("get_balance")
    assert not result.tool_was_called("transfer")

    account = result.tool_call_arg("get_balance", "account")
    assert account.lower() == "checking"

Testing Multi-Step Workflow

async def test_multi_step_balances(aitest_run, agent):
    result = await aitest_run(
        agent,
        "Show me both my checking and savings balances"
    )

    assert result.success
    assert result.tool_call_count("get_balance") >= 2 or result.tool_was_called("get_all_balances")

    response = result.final_response.lower()
    assert "checking" in response
    assert "savings" in response