Complete Example: Tying It All Together¶
This guide walks through the hero test suite — a comprehensive example demonstrating all pytest-aitest capabilities in a single, cohesive banking scenario.
Generate the Report
Run `pytest tests/showcase/ -v --aitest-html=report.html` to generate the hero report.
The Scenario: Personal Finance Assistant¶
The hero test uses a Banking MCP server that simulates a personal finance application with:
- 2 accounts: checking ($1,500), savings ($3,000)
- 6 tools: `get_balance`, `get_all_balances`, `transfer`, `deposit`, `withdraw`, `get_transactions`
This realistic scenario lets us test how well an LLM can understand and coordinate multiple tools.
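The banking server itself is not shown in this guide. For orientation, here is a minimal sketch of how such a server could be built with the MCP Python SDK's FastMCP helper; the actual showcase server has more tools and richer behavior, and its implementation may differ:

```python
# banking_server.py (sketch) -- a stand-in for the showcase server, not its actual code.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("banking")
accounts = {"checking": 1500.0, "savings": 3000.0}

@mcp.tool()
def get_balance(account: str) -> float:
    """Return the balance of a single account."""
    return accounts[account]

@mcp.tool()
def transfer(from_account: str, to_account: str, amount: float) -> str:
    """Move money between accounts, rejecting overdrafts."""
    if accounts[from_account] < amount:
        return "Error: insufficient funds"
    accounts[from_account] -= amount
    accounts[to_account] += amount
    return f"Transferred ${amount:.2f} from {from_account} to {to_account}"

if __name__ == "__main__":
    mcp.run()  # serves the tools over stdio by default
```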
Project Structure¶
```
tests/showcase/
├── test_hero.py                 # The comprehensive test suite
├── conftest.py                  # Shared fixtures
├── prompts/                     # System prompts for comparison
│   ├── concise.md
│   ├── detailed.md
│   └── friendly.md
└── skills/
    └── financial-advisor/       # Domain knowledge skill
        ├── SKILL.md
        └── references/
            └── budgeting-guide.md
```
Running the Hero Tests¶
```bash
# Run all showcase tests with HTML report
pytest tests/showcase/ -v --aitest-html=docs/demo/hero-report.html

# Run a specific test class
pytest tests/showcase/test_hero.py::TestModelComparison -v
```
1. Basic Tool Usage¶
The simplest tests verify the agent can use individual tools correctly.
```python
class TestBasicOperations:
    """Basic single-tool operations demonstrating core functionality."""

    async def test_check_single_balance(self, aitest_run, banking_server):
        """Check balance of one account - simplest possible test."""
        agent = Agent(
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE,
            max_turns=8,
        )
        result = await aitest_run(agent, "What's my checking account balance?")
        assert result.success
        assert result.tool_was_called("get_balance")
```
What this tests:
- Can the LLM understand the user's intent?
- Does it select the correct tool (`get_balance`)?
- Does it pass valid parameters?
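The parameter question is hard to answer from `tool_was_called` alone, but you can usually get at it indirectly: the checking account is seeded with $1,500, so a semantic check on the final answer catches most parameter mistakes. A sketch of lines you could append to `test_check_single_balance` (it would also need the `llm_assert` fixture in its signature):

```python
# The seeded checking balance is $1,500, so a wrong account or bad argument
# almost always shows up as a wrong number in the reply.
assert result.tool_call_count("get_balance") >= 1
assert llm_assert(result.final_response, "states the checking balance is $1,500")
```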
2. Multi-Tool Workflows¶
Complex operations require coordinating multiple tools in sequence.
```python
class TestMultiToolWorkflows:
    """Complex workflows requiring coordination of multiple tools."""

    async def test_transfer_and_verify(self, aitest_run, llm_assert, banking_server):
        """Transfer money and verify the result with balance check."""
        agent = Agent(
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE,
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "Transfer $100 from checking to savings, then show me my new balances.",
        )
        assert result.success

        # Should use multiple tools
        assert result.tool_was_called("transfer")
        assert result.tool_was_called("get_all_balances") or result.tool_was_called("get_balance")

        assert llm_assert(
            result.final_response,
            "shows updated balances after transfer",
        )
```
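A refinement worth considering here: a model that retries can end up transferring twice. The `tool_call_count` helper pins that down; a sketch of extra assertions you could append to the test above:

```python
# Exactly one transfer (no accidental duplicates), plus at least one balance
# lookup afterwards so the agent can report the new state.
assert result.tool_call_count("transfer") == 1
assert result.tool_call_count("get_balance") + result.tool_call_count("get_all_balances") >= 1
```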
What this tests:
- Can the LLM break down a complex request?
- Does it call the right tools in sequence?
- Does it synthesize information coherently?
3. Session Continuity (Multi-Turn)¶
Session tests verify the agent maintains context across multiple turns.
```python
@pytest.mark.session("savings-planning")
class TestSavingsPlanningSession:
    """Multi-turn session: Planning savings transfers.

    Tests that the agent remembers context across turns:
    - Turn 1: Check balances and discuss savings plan
    - Turn 2: Reference the plan (must remember context)
    - Turn 3: Verify the result
    """

    async def test_01_establish_context(self, aitest_run, llm_assert, banking_server):
        """First turn: check balances and discuss savings goals."""
        agent = Agent(
            name="savings-01",
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE,
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "I want to save more money. Can you check my accounts and suggest "
            "how much I could transfer to savings each month?",
        )
        assert result.success
        assert result.tool_was_called("get_all_balances") or result.tool_was_called("get_balance")

    async def test_02_reference_without_naming(self, aitest_run, llm_assert, banking_server):
        """Second turn: reference previous context."""
        agent = Agent(
            name="savings-02",
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE,
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "That sounds good. Let's start by moving $200 to savings right now.",
        )
        assert result.success

        # Agent should remember context and execute the transfer
        assert result.tool_was_called("transfer")
```
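The class docstring also promises a third turn that verifies the result. It is not shown above; here is a sketch of how it might look, following the same pattern (the test name and user phrasing are illustrative, not taken from the suite):

```python
class TestSavingsPlanningSession:  # continuation of the session class above
    async def test_03_verify_result(self, aitest_run, llm_assert, banking_server):
        """Third turn: confirm the transfer from turn 2 actually landed."""
        agent = Agent(
            name="savings-03",
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE,
            max_turns=8,
        )
        result = await aitest_run(agent, "Can you confirm my new savings balance?")
        assert result.success
        assert result.tool_was_called("get_balance") or result.tool_was_called("get_all_balances")
        assert llm_assert(result.final_response, "reports an updated savings balance")
```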
What this tests:
- Does the agent maintain conversation context?
- Can it resolve references to previous turns?
- Does session state persist correctly?
4. Model Comparison¶
Compare how different LLMs perform on the same task.
```python
BENCHMARK_MODELS = ["gpt-5-mini", "gpt-4.1-mini"]


class TestModelComparison:
    """Compare how different models handle complex financial advice."""

    @pytest.mark.parametrize("model", BENCHMARK_MODELS)
    async def test_financial_advice_quality(self, aitest_run, llm_assert, banking_server, model: str):
        """Compare models on providing comprehensive financial advice."""
        agent = Agent(
            provider=Provider(model=f"azure/{model}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE,
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "I want to reach my vacation savings goal faster. Analyze my current "
            "financial situation and recommend a concrete savings plan.",
        )
        assert result.success
        assert len(result.all_tool_calls) >= 1
        assert llm_assert(
            result.final_response,
            "provides actionable savings recommendations based on financial data",
        )
```
What this tests:
- Which model gives better financial advice?
- Which model uses tools more efficiently?
- Cost vs quality tradeoffs
The report automatically generates a model comparison table showing:
- Pass/fail rates per model
- Token usage and costs
- AI-generated recommendations
5. System Prompt Comparison¶
Compare how different system prompts affect behavior.
First, define prompts in YAML files:
```yaml
# prompts/concise.md
name: PROMPT_CONCISE
version: "1.0"
description: Brief, to-the-point financial advice
system_prompt: |
  You are a personal finance assistant. Be concise and direct.
  Give specific numbers and actionable advice in 2-3 sentences.
```

```yaml
# prompts/detailed.md
name: PROMPT_DETAILED
version: "1.0"
description: Thorough financial analysis with explanations
system_prompt: |
  You are a personal finance assistant. Provide comprehensive analysis.
  Explain your reasoning, show calculations, and consider multiple scenarios.
```
Then parametrize tests with them:
```python
from pathlib import Path

from pytest_aitest import load_prompts

ADVISOR_PROMPTS = load_prompts(Path(__file__).parent / "prompts")


class TestPromptComparison:
    """Compare how different prompt styles affect financial advice."""

    @pytest.mark.parametrize("prompt", ADVISOR_PROMPTS, ids=lambda p: p.name)
    async def test_advice_style_comparison(self, aitest_run, llm_assert, banking_server, prompt):
        """Compare concise vs detailed vs friendly advisory styles."""
        agent = Agent(
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=prompt.system_prompt,
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "I'm worried about my spending. Can you check my accounts "
            "and give me advice on managing my money better?",
        )
        assert result.success
        assert result.tool_was_called("get_all_balances") or result.tool_was_called("get_balance")
```
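To check that each style actually behaves as advertised, you could add style-specific semantic assertions at the end of the test; a sketch using the same `llm_assert` fixture and the prompt names from the YAML files (the assertion wording is illustrative):

```python
# Append to test_advice_style_comparison: verify each prompt produces the style it promises.
if prompt.name == "PROMPT_CONCISE":
    assert llm_assert(result.final_response, "is brief and gives a direct, specific recommendation")
elif prompt.name == "PROMPT_DETAILED":
    assert llm_assert(result.final_response, "explains its reasoning and walks through calculations")
```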
What this tests:
- Which prompt style produces better advice?
- How does verbosity affect user experience?
- Are there quality vs token tradeoffs?
6. Skill Integration¶
Skills inject domain knowledge into the agent's context.
First, create a skill at `skills/financial-advisor/SKILL.md`:
```markdown
---
name: financial-advisor
description: Financial planning and budgeting expertise
version: 1.0.0
---

# Financial Advisor Skill

You are an expert financial advisor with deep knowledge of:

## Emergency Fund Guidelines

- Minimum: 3 months of expenses
- Recommended: 6 months of expenses
- High-risk professions: 9-12 months

## Budget Allocation (50/30/20 Rule)

- 50% Needs: rent, utilities, groceries, minimum debt payments
- 30% Wants: entertainment, dining out, subscriptions
- 20% Savings: emergency fund, retirement, goals
```
Then use it in tests:
```python
class TestSkillEnhancement:
    """Test how skills improve financial advice quality."""

    async def test_with_financial_skill(
        self, aitest_run, llm_assert, banking_server, financial_advisor_skill
    ):
        """Agent with financial advisor skill should give better advice."""
        agent = Agent(
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE,
            skill=financial_advisor_skill,  # <-- Inject domain knowledge
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "I have $1500 in checking. Should I keep it there or move some to savings? "
            "What's a good emergency fund target for someone like me?",
        )
        assert result.success
        assert llm_assert(
            result.final_response,
            "provides financial advice about emergency funds or savings allocation",
        )
```
What this tests:
- Does domain knowledge improve advice quality?
- Does the agent apply the skill's guidelines?
- Are recommendations more concrete with skills?
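One way to make the first question measurable is an A/B variant of the same test, run with and without the skill, so the report can contrast the two side by side. A sketch that reuses only the fixtures and `Agent` arguments shown above (the test name and question are illustrative); it would sit inside `TestSkillEnhancement`:

```python
@pytest.mark.parametrize("use_skill", [False, True], ids=["baseline", "with-skill"])
async def test_skill_ab_comparison(
    self, aitest_run, llm_assert, banking_server, financial_advisor_skill, use_skill
):
    """Contrast advice quality with and without the injected domain knowledge."""
    kwargs = {"skill": financial_advisor_skill} if use_skill else {}
    agent = Agent(
        provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
        mcp_servers=[banking_server],
        system_prompt=BANKING_PROMPT_BASE,
        max_turns=8,
        **kwargs,  # the baseline simply omits the skill argument
    )
    result = await aitest_run(
        agent,
        "What's a good emergency fund target for me, given my current balances?",
    )
    assert result.success
    assert llm_assert(result.final_response, "gives a concrete emergency fund recommendation")
```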
7. Error Handling¶
Test graceful recovery from invalid operations.
```python
class TestErrorHandling:
    """Test graceful handling of edge cases and errors."""

    async def test_insufficient_funds_recovery(self, aitest_run, llm_assert, banking_server):
        """Agent should handle insufficient funds gracefully."""
        agent = Agent(
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE + " If an operation fails, explain why and suggest alternatives.",
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "Transfer $50,000 from my checking to savings.",  # Way more than available!
        )
        assert result.success
        assert result.tool_was_called("transfer")
        assert llm_assert(
            result.final_response,
            "explains insufficient funds or suggests an alternative amount",
        )
```
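Two negative assertions can tighten this further, using only the helpers already shown: make sure the agent did not route around the failure with a withdrawal, and that it did not retry the transfer in a loop. A sketch of lines you could append to the test above:

```python
# The agent should explain the failure, not work around it or hammer the tool.
assert not result.tool_was_called("withdraw")
assert result.tool_call_count("transfer") <= 2   # at most one retry
```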
What this tests:
- Does the agent attempt the operation?
- Does it handle tool errors gracefully?
- Does it provide helpful error messages?
Key Takeaways¶
Test Structure Best Practices¶
- **One server, many tests** — Reuse the same MCP server across test classes
- **Named agents for sessions** — Use `name="session-01"` to track multi-turn state
- **Semantic assertions** — Use `llm_assert` for behavior verification
- **Parametrize for comparisons** — Use `@pytest.mark.parametrize` for model/prompt grids (see the sketch after this list)
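The last point generalizes: parametrize decorators stack, so the model list and prompt list from earlier sections combine into a single comparison grid. A sketch reusing `BENCHMARK_MODELS` and `ADVISOR_PROMPTS` as defined above (the class name and question are illustrative):

```python
class TestModelPromptGrid:
    """Every benchmark model crossed with every advisor prompt."""

    @pytest.mark.parametrize("model", BENCHMARK_MODELS)
    @pytest.mark.parametrize("prompt", ADVISOR_PROMPTS, ids=lambda p: p.name)
    async def test_grid(self, aitest_run, banking_server, model, prompt):
        agent = Agent(
            provider=Provider(model=f"azure/{model}"),
            mcp_servers=[banking_server],
            system_prompt=prompt.system_prompt,
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "Check my accounts and suggest how much I could move to savings this month.",
        )
        assert result.success
```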
Assertion Patterns¶
| Assertion | Purpose |
|---|---|
| `result.success` | Agent completed without errors |
| `result.tool_was_called("name")` | Specific tool was invoked |
| `result.tool_call_count("name") >= N` | Tool called at least N times |
| `len(result.all_tool_calls) >= N` | Total tool calls threshold |
| `llm_assert(response, condition)` | Semantic response validation |
Report Features¶
The generated report includes:
- Model comparison tables with cost/quality analysis
- Prompt comparison showing style differences
- Session flow diagrams for multi-turn tests
- AI-powered insights suggesting improvements
- Failure analysis with root cause identification
Next Steps¶
- Test MCP Servers — Deep dive into MCP server configuration
- Generate Reports — Customize report output
- Configuration Reference — All available options