Complete Example: Tying It All Together¶
This guide walks through the hero test suite — a comprehensive example demonstrating all pytest-aitest capabilities in a single, cohesive banking scenario.
Generate the Report
Run `pytest tests/showcase/ -v --aitest-html=report.html` to generate the hero report.
The Scenario: Personal Finance Assistant¶
The hero test uses a Banking MCP server that simulates a personal finance application with:
- 2 accounts: checking ($1,500), savings ($3,000)
- 6 tools: `get_balance`, `get_all_balances`, `transfer`, `deposit`, `withdraw`, `get_transactions`
This realistic scenario lets us test how well an LLM can understand and coordinate multiple tools.
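The banking server itself is not shown in this guide. For orientation, here is a minimal sketch of how such a server could be built with the MCP Python SDK's FastMCP helper; the actual showcase server has more tools and richer behavior, and its implementation may differ:

```python
# banking_server.py (sketch) -- a stand-in for the showcase server, not its actual code.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("banking")
accounts = {"checking": 1500.0, "savings": 3000.0}

@mcp.tool()
def get_balance(account: str) -> float:
    """Return the balance of a single account."""
    return accounts[account]

@mcp.tool()
def transfer(from_account: str, to_account: str, amount: float) -> str:
    """Move money between accounts, rejecting overdrafts."""
    if accounts[from_account] < amount:
        return "Error: insufficient funds"
    accounts[from_account] -= amount
    accounts[to_account] += amount
    return f"Transferred ${amount:.2f} from {from_account} to {to_account}"

if __name__ == "__main__":
    mcp.run()  # serves the tools over stdio by default
```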
Project Structure¶
```
tests/showcase/
├── test_hero.py                 # The comprehensive test suite
├── conftest.py                  # Shared fixtures
├── prompts/                     # System prompts for comparison
│   ├── concise.md
│   ├── detailed.md
│   └── friendly.md
└── skills/
    └── financial-advisor/       # Domain knowledge skill
        ├── SKILL.md
        └── references/
            └── budgeting-guide.md
```
Running the Hero Tests¶
```bash
# Run all showcase tests with HTML report
pytest tests/showcase/ -v --aitest-html=docs/demo/hero-report.html

# Run a specific test class
pytest tests/showcase/test_hero.py::TestModelComparison -v
```
1. Basic Tool Usage¶
The simplest tests verify the agent can use individual tools correctly.
```python
class TestBasicOperations:
    """Basic single-tool operations demonstrating core functionality."""

    async def test_check_single_balance(self, aitest_run, banking_server):
        """Check balance of one account - simplest possible test."""
        agent = Agent(
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE,
            max_turns=8,
        )
        result = await aitest_run(agent, "What's my checking account balance?")
        assert result.success
        assert result.tool_was_called("get_balance")
```
What this tests:
- Can the LLM understand the user's intent?
- Does it select the correct tool (`get_balance`)?
- Does it pass valid parameters?
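The parameter question is hard to answer from `tool_was_called` alone, but you can usually get at it indirectly: the checking account is seeded with $1,500, so a semantic check on the final answer catches most parameter mistakes. A sketch of lines you could append to `test_check_single_balance` (it would also need the `llm_assert` fixture in its signature):

```python
# The seeded checking balance is $1,500, so a wrong account or bad argument
# almost always shows up as a wrong number in the reply.
assert result.tool_call_count("get_balance") >= 1
assert llm_assert(result.final_response, "states the checking balance is $1,500")
```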
2. Multi-Tool Workflows¶
Complex operations require coordinating multiple tools in sequence.
```python
class TestMultiToolWorkflows:
    """Complex workflows requiring coordination of multiple tools."""

    async def test_transfer_and_verify(self, aitest_run, llm_assert, banking_server):
        """Transfer money and verify the result with balance check."""
        agent = Agent(
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE,
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "Transfer $100 from checking to savings, then show me my new balances.",
        )
        assert result.success

        # Should use multiple tools
        assert result.tool_was_called("transfer")
        assert result.tool_was_called("get_all_balances") or result.tool_was_called("get_balance")

        assert llm_assert(
            result.final_response,
            "shows updated balances after transfer",
        )
```
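A refinement worth considering here: a model that retries can end up transferring twice. The `tool_call_count` helper pins that down; a sketch of extra assertions you could append to the test above:

```python
# Exactly one transfer (no accidental duplicates), plus at least one balance
# lookup afterwards so the agent can report the new state.
assert result.tool_call_count("transfer") == 1
assert result.tool_call_count("get_balance") + result.tool_call_count("get_all_balances") >= 1
```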
What this tests:
- Can the LLM break down a complex request?
- Does it call the right tools in sequence?
- Does it synthesize information coherently?
3. Session Continuity (Multi-Turn)¶
Session tests verify the agent maintains context across multiple turns.
```python
@pytest.mark.session("savings-planning")
class TestSavingsPlanningSession:
    """Multi-turn session: Planning savings transfers.

    Tests that the agent remembers context across turns:
    - Turn 1: Check balances and discuss savings plan
    - Turn 2: Reference the plan (must remember context)
    - Turn 3: Verify the result
    """

    async def test_01_establish_context(self, aitest_run, llm_assert, banking_server):
        """First turn: check balances and discuss savings goals."""
        agent = Agent(
            name="savings-01",
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE,
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "I want to save more money. Can you check my accounts and suggest "
            "how much I could transfer to savings each month?",
        )
        assert result.success
        assert result.tool_was_called("get_all_balances") or result.tool_was_called("get_balance")

    async def test_02_reference_without_naming(self, aitest_run, llm_assert, banking_server):
        """Second turn: reference previous context."""
        agent = Agent(
            name="savings-02",
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE,
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "That sounds good. Let's start by moving $200 to savings right now.",
        )
        assert result.success

        # Agent should remember context and execute the transfer
        assert result.tool_was_called("transfer")
```
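The class docstring also promises a third turn that verifies the result. It is not shown above; here is a sketch of how it might look, following the same pattern (the test name and user phrasing are illustrative, not taken from the suite):

```python
class TestSavingsPlanningSession:  # continuation of the session class above
    async def test_03_verify_result(self, aitest_run, llm_assert, banking_server):
        """Third turn: confirm the transfer from turn 2 actually landed."""
        agent = Agent(
            name="savings-03",
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE,
            max_turns=8,
        )
        result = await aitest_run(agent, "Can you confirm my new savings balance?")
        assert result.success
        assert result.tool_was_called("get_balance") or result.tool_was_called("get_all_balances")
        assert llm_assert(result.final_response, "reports an updated savings balance")
```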
What this tests:
- Does the agent maintain conversation context?
- Can it resolve references to previous turns?
- Does session state persist correctly?
4. Model Comparison¶
Compare how different LLMs perform on the same task.
```python
BENCHMARK_MODELS = ["gpt-5-mini", "gpt-4.1-mini"]


class TestModelComparison:
    """Compare how different models handle complex financial advice."""

    @pytest.mark.parametrize("model", BENCHMARK_MODELS)
    async def test_financial_advice_quality(self, aitest_run, llm_assert, banking_server, model: str):
        """Compare models on providing comprehensive financial advice."""
        agent = Agent(
            provider=Provider(model=f"azure/{model}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE,
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "I want to reach my vacation savings goal faster. Analyze my current "
            "financial situation and recommend a concrete savings plan.",
        )
        assert result.success
        assert len(result.all_tool_calls) >= 1
        assert llm_assert(
            result.final_response,
            "provides actionable savings recommendations based on financial data",
        )
```
What this tests:
- Which model gives better financial advice?
- Which model uses tools more efficiently?
- Cost vs quality tradeoffs
The report automatically generates a model comparison table showing:
- Pass/fail rates per model
- Token usage and costs
- AI-generated recommendations
5. System Prompt Comparison¶
Compare how different system prompts affect behavior.
First, define prompts in YAML files:
```yaml
# prompts/concise.md
name: PROMPT_CONCISE
version: "1.0"
description: Brief, to-the-point financial advice
system_prompt: |
  You are a personal finance assistant. Be concise and direct.
  Give specific numbers and actionable advice in 2-3 sentences.
```

```yaml
# prompts/detailed.md
name: PROMPT_DETAILED
version: "1.0"
description: Thorough financial analysis with explanations
system_prompt: |
  You are a personal finance assistant. Provide comprehensive analysis.
  Explain your reasoning, show calculations, and consider multiple scenarios.
```
Then parametrize tests with them:
```python
from pathlib import Path

from pytest_aitest import load_prompts

ADVISOR_PROMPTS = load_prompts(Path(__file__).parent / "prompts")


class TestPromptComparison:
    """Compare how different prompt styles affect financial advice."""

    @pytest.mark.parametrize("prompt", ADVISOR_PROMPTS, ids=lambda p: p.name)
    async def test_advice_style_comparison(self, aitest_run, llm_assert, banking_server, prompt):
        """Compare concise vs detailed vs friendly advisory styles."""
        agent = Agent(
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=prompt.system_prompt,
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "I'm worried about my spending. Can you check my accounts "
            "and give me advice on managing my money better?",
        )
        assert result.success
        assert result.tool_was_called("get_all_balances") or result.tool_was_called("get_balance")
```
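To check that each style actually behaves as advertised, you could add style-specific semantic assertions at the end of the test; a sketch using the same `llm_assert` fixture and the prompt names from the YAML files (the assertion wording is illustrative):

```python
# Append to test_advice_style_comparison: verify each prompt produces the style it promises.
if prompt.name == "PROMPT_CONCISE":
    assert llm_assert(result.final_response, "is brief and gives a direct, specific recommendation")
elif prompt.name == "PROMPT_DETAILED":
    assert llm_assert(result.final_response, "explains its reasoning and walks through calculations")
```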
What this tests:
- Which prompt style produces better advice?
- How does verbosity affect user experience?
- Are there quality vs token tradeoffs?
6. Skill Integration¶
Skills inject domain knowledge into the agent's context.
First, create a skill at `skills/financial-advisor/SKILL.md`:
```markdown
---
name: financial-advisor
description: Financial planning and budgeting expertise
version: 1.0.0
---

# Financial Advisor Skill

You are an expert financial advisor with deep knowledge of:

## Emergency Fund Guidelines

- Minimum: 3 months of expenses
- Recommended: 6 months of expenses
- High-risk professions: 9-12 months

## Budget Allocation (50/30/20 Rule)

- 50% Needs: rent, utilities, groceries, minimum debt payments
- 30% Wants: entertainment, dining out, subscriptions
- 20% Savings: emergency fund, retirement, goals
```
Then use it in tests:
```python
class TestSkillEnhancement:
    """Test how skills improve financial advice quality."""

    async def test_with_financial_skill(
        self, aitest_run, llm_assert, banking_server, financial_advisor_skill
    ):
        """Agent with financial advisor skill should give better advice."""
        agent = Agent(
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE,
            skill=financial_advisor_skill,  # <-- Inject domain knowledge
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "I have $1500 in checking. Should I keep it there or move some to savings? "
            "What's a good emergency fund target for someone like me?",
        )
        assert result.success
        assert llm_assert(
            result.final_response,
            "provides financial advice about emergency funds or savings allocation",
        )
```
What this tests:
- Does domain knowledge improve advice quality?
- Does the agent apply the skill's guidelines?
- Are recommendations more concrete with skills?
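One way to make the first question measurable is an A/B variant of the same test, run with and without the skill, so the report can contrast the two side by side. A sketch that reuses only the fixtures and `Agent` arguments shown above (the test name and question are illustrative); it would sit inside `TestSkillEnhancement`:

```python
@pytest.mark.parametrize("use_skill", [False, True], ids=["baseline", "with-skill"])
async def test_skill_ab_comparison(
    self, aitest_run, llm_assert, banking_server, financial_advisor_skill, use_skill
):
    """Contrast advice quality with and without the injected domain knowledge."""
    kwargs = {"skill": financial_advisor_skill} if use_skill else {}
    agent = Agent(
        provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
        mcp_servers=[banking_server],
        system_prompt=BANKING_PROMPT_BASE,
        max_turns=8,
        **kwargs,  # the baseline simply omits the skill argument
    )
    result = await aitest_run(
        agent,
        "What's a good emergency fund target for me, given my current balances?",
    )
    assert result.success
    assert llm_assert(result.final_response, "gives a concrete emergency fund recommendation")
```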
7. Error Handling¶
Test graceful recovery from invalid operations.
```python
class TestErrorHandling:
    """Test graceful handling of edge cases and errors."""

    async def test_insufficient_funds_recovery(self, aitest_run, llm_assert, banking_server):
        """Agent should handle insufficient funds gracefully."""
        agent = Agent(
            provider=Provider(model=f"azure/{DEFAULT_MODEL}"),
            mcp_servers=[banking_server],
            system_prompt=BANKING_PROMPT_BASE + " If an operation fails, explain why and suggest alternatives.",
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "Transfer $50,000 from my checking to savings.",  # Way more than available!
        )
        assert result.success
        assert result.tool_was_called("transfer")
        assert llm_assert(
            result.final_response,
            "explains insufficient funds or suggests an alternative amount",
        )
```
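Two negative assertions can tighten this further, using only the helpers already shown: make sure the agent did not route around the failure with a withdrawal, and that it did not retry the transfer in a loop. A sketch of lines you could append to the test above:

```python
# The agent should explain the failure, not work around it or hammer the tool.
assert not result.tool_was_called("withdraw")
assert result.tool_call_count("transfer") <= 2   # at most one retry
```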
What this tests:
- Does the agent attempt the operation?
- Does it handle tool errors gracefully?
- Does it provide helpful error messages?
Key Takeaways¶
Test Structure Best Practices¶
- **One server, many tests** — Reuse the same MCP server across test classes
- **Named agents for sessions** — Use `name="session-01"` to track multi-turn state
- **Semantic assertions** — Use `llm_assert` for behavior verification
- **Parametrize for comparisons** — Use `@pytest.mark.parametrize` for model/prompt grids (see the sketch after this list)
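The last point generalizes: parametrize decorators stack, so the model list and prompt list from earlier sections combine into a single comparison grid. A sketch reusing `BENCHMARK_MODELS` and `ADVISOR_PROMPTS` as defined above (the class name and question are illustrative):

```python
class TestModelPromptGrid:
    """Every benchmark model crossed with every advisor prompt."""

    @pytest.mark.parametrize("model", BENCHMARK_MODELS)
    @pytest.mark.parametrize("prompt", ADVISOR_PROMPTS, ids=lambda p: p.name)
    async def test_grid(self, aitest_run, banking_server, model, prompt):
        agent = Agent(
            provider=Provider(model=f"azure/{model}"),
            mcp_servers=[banking_server],
            system_prompt=prompt.system_prompt,
            max_turns=8,
        )
        result = await aitest_run(
            agent,
            "Check my accounts and suggest how much I could move to savings this month.",
        )
        assert result.success
```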
Assertion Patterns¶
| Assertion | Purpose |
|---|---|
| `result.success` | Agent completed without errors |
| `result.tool_was_called("name")` | Specific tool was invoked |
| `result.tool_call_count("name") >= N` | Tool called at least N times |
| `len(result.all_tool_calls) >= N` | Total tool calls threshold |
| `llm_assert(response, condition)` | Semantic response validation |
Report Features¶
The generated report includes:
- Model comparison tables with cost/quality analysis
- Prompt comparison showing style differences
- Session flow diagrams for multi-turn tests
- AI-powered insights suggesting improvements
- Failure analysis with root cause identification
Next Steps¶
- Test MCP Servers — Deep dive into MCP server configuration
- Generate Reports — Customize report output
- Configuration Reference — All available options