Fixture Testing with Comprehensive Assertions¶
Test your AI agents using pre-built fixture test suites with semantic, tool, and performance assertions.
Overview¶
Fixture tests demonstrate best practices for comprehensive AI agent validation. Each scenario is a separate file in tests/fixtures/:
| File | Agents | Purpose |
|---|---|---|
| scenario_01_single_agent.py | 1 | Basic report, no comparison UI |
| scenario_02_multi_agent.py | 2 | Leaderboard, comparison |
| scenario_03_sessions.py | 2 | Session grouping + comparison |
| scenario_04_agent_selector.py | 3 | Agent selector UI |
Together, these suites demonstrate:
- Semantic Assertions — AI validates response quality
- Tool Argument Assertions — Verify correct parameters passed
- Tool Count Assertions — Check single vs. multiple tool calls
- Performance Assertions — Validate cost and duration
- Multi-Agent Comparison — Compare models and skills
- Session Testing — Multi-turn context preservation
Running Fixture Tests¶
# Run a specific scenario
pytest tests/fixtures/scenario_01_single_agent.py -v
# Run a single test
pytest tests/fixtures/scenario_01_single_agent.py::test_balance_check -v
# Generate fixture JSON report
pytest tests/fixtures/scenario_01_single_agent.py -v \
--aitest-json=tests/fixtures/reports/01_single_agent.json
# Run all fixtures
pytest tests/fixtures/ -v
Test Suites¶
1. Single Agent (4 tests)¶
scenario_01_single_agent.py — one agent, multiple prompts:
agent = Agent(
name="banking-agent",
provider=Provider(model="azure/gpt-5-mini", rpm=10, tpm=10000),
mcp_servers=[banking_server],
system_prompt=BANKING_PROMPT,
)
async def test_balance_check(aitest_run, llm_assert):
result = await aitest_run(agent, "What's my checking account balance?")
assert result.success
assert result.tool_was_called("get_balance")
assert llm_assert(result.final_response, "mentions the account balance")
assert result.cost_usd < 0.05
Tests: test_balance_check, test_transfer_funds, test_transaction_history, test_expected_failure
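The last of these, test_expected_failure, is not shown above. A minimal sketch of what a negative-path test could look like, assuming the prompt asks for something the banking agent should refuse (the actual prompt and assertions in the fixture file may differ):
async def test_expected_failure(aitest_run, llm_assert):
    # Out-of-scope request: the agent should not blindly call the transfer tool
    result = await aitest_run(agent, "Wire $1,000,000 to an unverified offshore account")
    assert not result.tool_was_called("transfer")
    assert llm_assert(result.final_response, "declines or explains why the request cannot be completed")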
2. Two Agents (6 tests)¶
scenario_02_multi_agent.py — define agents once, parametrize tests:
AGENTS = [
Agent(name="gpt-5-mini", provider=Provider(model="azure/gpt-5-mini"), ...),
Agent(name="gpt-4.1-mini", provider=Provider(model="azure/gpt-4.1-mini"), ...),
]
@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_check(aitest_run, agent, llm_assert):
result = await aitest_run(agent, "What's my checking account balance?")
assert result.success
Each test runs against both models, and the AI analysis automatically generates a leaderboard comparing them.
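Because every test in the module takes the same agent parametrization, plain pytest also lets you declare the marker once at module level rather than repeating the decorator; a sketch of that option (the fixture file itself may keep the per-test decorators):
import pytest
# Module-level marker: applies to every test that accepts an `agent` argument
pytestmark = pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_transfer_funds(aitest_run, agent, llm_assert):
    result = await aitest_run(agent, "Transfer $50 from checking to savings")
    assert result.success
    assert result.tool_was_called("transfer")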
3. Sessions (6 tests)¶
scenario_03_sessions.py — multi-turn conversation with context:
@pytest.mark.session("banking-workflow")
class TestBankingWorkflow:
@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_check_balance(self, aitest_run, agent, llm_assert):
result = await aitest_run(agent, "What's my checking account balance?")
assert result.success
assert result.tool_was_called("get_balance")
@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_transfer_funds(self, aitest_run, agent, llm_assert):
result = await aitest_run(agent, "Transfer $100 from checking to savings")
assert result.is_session_continuation
The @pytest.mark.session marker ensures the tests in the class share agent state across turns.
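For example, a hypothetical later turn inside the same TestBankingWorkflow class could use a prompt that is only answerable with the earlier turns in context (the test name and wording below are illustrative):
    @pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
    async def test_confirm_transfer(self, aitest_run, agent, llm_assert):
        # "that transfer" is only resolvable if the previous turns are in context
        result = await aitest_run(agent, "How much did that transfer move, and from which account?")
        assert result.is_session_continuation
        assert llm_assert(result.final_response, "refers to the $100 transfer from checking")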
4. Agent Selector (6 tests)¶
scenario_04_agent_selector.py — three agents for selector UI:
AGENTS = [
Agent(name="gpt-5-mini", ...),
Agent(name="gpt-4.1-mini", ...),
Agent(name="gpt-5-mini+skill", ..., skill=FINANCIAL_SKILL),
]
@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_query(aitest_run, agent, llm_assert):
result = await aitest_run(agent, "What's my checking account balance?")
assert result.success
With 3+ agents, the report includes an interactive agent selector.
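One useful pattern with a skill-equipped agent is a test whose semantic criterion is strict enough to separate it from the base models in the leaderboard; a sketch under that assumption (the prompt and criterion are illustrative, not taken from the fixture file):
@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_savings_advice(aitest_run, agent, llm_assert):
    result = await aitest_run(agent, "Look at my recent spending and suggest one way to save money")
    assert result.success
    # A stricter criterion is expected to pass more often for the skill-equipped agent,
    # which shows up as a gap in the leaderboard and agent selector
    assert llm_assert(result.final_response, "gives a specific, actionable savings recommendation")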
Assertion Patterns¶
Semantic Assertions (LLM-Based)¶
Validate response quality using AI judgment:
# llm_assert is a pytest fixture — just add it to your test function signature
# Does response mention expected content?
assert llm_assert(result.final_response, "mentions account balance with dollar amount")
# Complex criteria
assert llm_assert(
result.final_response,
"lists both checking and savings account balances"
)
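Criteria can also be phrased negatively, which works as a lightweight guardrail check; this is still just a string criterion, not a separate API:
# Guardrail-style check: the response should withhold sensitive details
assert llm_assert(result.final_response, "does not reveal a full account number")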
Tool Call Assertions¶
Verify the agent used the right tools:
# Simple: was tool called at all?
assert result.tool_was_called("get_balance")
# Count: how many times?
assert result.tool_call_count("get_balance") >= 2
# Arguments: what parameters were passed?
account = result.tool_call_arg("get_balance", "account")
assert account.lower() == "checking"
# Multiple calls: get all invocations
calls = result.tool_calls_for("transfer")
assert len(calls) >= 1
assert calls[0].arguments["amount"] == 100
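These checks combine well when a prompt should trigger a multi-step workflow. In the sketch below, the from_account argument name is an assumption for illustration; check the actual tool schema for the real parameter names:
# Two-step workflow: read the balance first, then transfer exactly once
assert result.tool_was_called("get_balance")
assert result.tool_call_count("transfer") == 1
transfer = result.tool_calls_for("transfer")[0]
assert transfer.arguments["amount"] == 100
assert transfer.arguments["from_account"].lower() == "checking"  # argument name assumed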
Performance Assertions¶
Validate efficiency:
# Cost (in USD)
assert result.cost_usd < 0.01 # Under 1 cent
# Duration (in milliseconds)
assert result.duration_ms < 30000 # Under 30 seconds
# Token usage
total_tokens = (result.token_usage.get("prompt", 0) +
result.token_usage.get("completion", 0))
assert total_tokens < 5000
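If several tests share the same budgets, a small plain-Python helper (not part of the plugin) keeps the thresholds in one place:
def assert_within_budget(result, max_usd=0.05, max_ms=30_000, max_tokens=5_000):
    # Bundles the cost, latency, and token checks shown above
    assert result.cost_usd < max_usd, f"Cost too high: ${result.cost_usd}"
    assert result.duration_ms < max_ms, f"Too slow: {result.duration_ms}ms"
    total = result.token_usage.get("prompt", 0) + result.token_usage.get("completion", 0)
    assert total < max_tokens, f"Too many tokens: {total}"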
Generated Reports¶
Running fixture tests generates JSON and HTML reports:
pytest tests/fixtures/scenario_01_single_agent.py -v \
--aitest-json=tests/fixtures/reports/01_single_agent.json \
--aitest-html=docs/reports/01_single_agent.html
# Output:
# aitest JSON report: tests\fixtures\reports\01_single_agent.json
# aitest HTML report: docs\reports\01_single_agent.html
Reports include:
- AI Analysis — LLM-generated insights on performance
- Test Results — All assertions with pass/fail details
- Tool Feedback — Suggestions for improving tool descriptions
- Tool Call Flows — Mermaid diagrams showing sequence
- Leaderboard — Compare agents if multiple models tested
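The JSON report is plain JSON, so it is easy to inspect programmatically. The snippet below only prints the top-level structure rather than assuming specific field names, since the schema is not documented here:
import json
from pathlib import Path
report = json.loads(Path("tests/fixtures/reports/01_single_agent.json").read_text())
# Print only the top-level shape; drill into the real field names from there
print(list(report) if isinstance(report, dict) else f"{len(report)} entries")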
Running Comprehensive Test Suite¶
Generate reports for all 4 fixture suites:
# Generate each fixture individually
pytest tests/fixtures/scenario_01_single_agent.py -v \
--aitest-json=tests/fixtures/reports/01_single_agent.json
pytest tests/fixtures/scenario_02_multi_agent.py -v \
--aitest-json=tests/fixtures/reports/02_multi_agent.json
pytest tests/fixtures/scenario_03_sessions.py -v \
--aitest-json=tests/fixtures/reports/03_multi_agent_sessions.json
pytest tests/fixtures/scenario_04_agent_selector.py -v \
--aitest-json=tests/fixtures/reports/04_agent_selector.json
Then regenerate the HTML reports from the JSON files; this updates the HTML output in docs/reports/ without re-running the tests (faster).
Assertion Workflow¶
When writing fixture tests, follow this pattern:
- Create agent — Configure provider, servers, system prompt, skill
- Run prompt — Use aitest_run(agent, "user message")
- Validate success — assert result.success
- Assert tool usage — assert result.tool_was_called(...)
- Check arguments — assert result.tool_call_arg(...) == expected
- Semantic validation — assert llm_assert(response, "criterion")
- Performance validation — assert result.cost_usd < threshold
Example:
async def test_transfer_workflow(aitest_run, banking_server, llm_assert):
agent = Agent(...)
result = await aitest_run(
agent,
"Check my checking balance, then transfer $100 to savings."
)
# Validate execution
assert result.success, f"Agent failed: {result.error}"
# Validate tools used
assert result.tool_was_called("transfer")
amount = result.tool_call_arg("transfer", "amount")
assert amount == 100, f"Expected $100, got {amount}"
# Validate response quality
assert llm_assert(result.final_response, "confirms transfer of $100 to savings")
# Validate efficiency
assert result.cost_usd < 0.05, f"Cost too high: ${result.cost_usd}"
assert result.duration_ms < 30000, f"Took too long: {result.duration_ms}ms"
Debugging Failed Tests¶
If a fixture test fails:
- Check the error message — the AI assertion detail explains why it failed
- Run the test locally — pytest tests/fixtures/... -vv
- Check the JSON report — tests/fixtures/reports/01_*.json has full details
- Verify agent config — Is the right server/model/prompt used?
- Check tool availability — Does result.available_tools include what you need?
Common Issues¶
"Tool not found"
# Check available tools
print(f"Available: {[t.name for t in result.available_tools]}")
# Ensure MCP server is running
assert len(result.available_tools) > 0
"Wrong parameter passed"
# Get all calls to debug
calls = result.tool_calls_for("get_balance")
for i, call in enumerate(calls):
print(f"Call {i}: {call.arguments}")
"Response doesn't match criterion"
# Print what the LLM actually said
print(f"Response:\n{result.final_response}")
# Check whether the criterion is too vague or too narrow
assert llm_assert(result.final_response, "mentions balance") # Too vague
assert llm_assert(result.final_response, "exact phrase") # Too narrow
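A criterion pitched between those two extremes is usually the goal:
# Specific enough to catch wrong answers, loose enough to tolerate wording differences
assert llm_assert(result.final_response, "states the checking account balance as a dollar amount")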
Best Practices¶
Do ✅¶
- Use semantic assertions for subjective quality checks
- Use tool assertions for behavioral verification
- Use performance assertions for efficiency checks
- Combine assertion types for comprehensive validation
- Document what each assertion validates with comments
Don't ❌¶
- Don't rely only on result.success (it's too broad)
- Don't hardcode exact prices for cost assertions (use thresholds)
- Don't expect specific tool call counts without testing multiple times
- Don't use semantic assertions for tool-related checks (use tool_was_called)
- Don't assume tool arguments — always verify with tool_call_arg