Complete Example: Tying It All Together¶
This guide walks through the hero test suite — a comprehensive example demonstrating all pytest-skill-engineering capabilities in a single, cohesive banking scenario.
Generate the Report
Run `pytest tests/showcase/ -v --aitest-html=report.html` to generate the hero report.
The Scenario: Personal Finance Assistant¶
The hero test uses a Banking MCP server that simulates a personal finance application with:
- 2 accounts: checking ($1,500), savings ($3,000)
- 6 tools: `get_balance`, `get_all_balances`, `transfer`, `deposit`, `withdraw`, `get_transactions`
This realistic scenario lets us test how well an LLM can understand and coordinate multiple tools.
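The showcase server's implementation isn't reproduced in this guide, but a minimal sketch of what its tools might look like helps ground the tests that follow. The sketch below uses the official MCP Python SDK's `FastMCP` class; the starting balances match the scenario description, while the function bodies are illustrative assumptions rather than the actual showcase code.

```python
# Hypothetical sketch of the Banking MCP server -- only the tool names and
# starting balances come from the documented scenario; the rest is assumed.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("banking")

# Illustrative starting balances from the scenario description.
ACCOUNTS = {"checking": 1500.0, "savings": 3000.0}


@mcp.tool()
def get_balance(account: str) -> float:
    """Return the balance of a single account."""
    return ACCOUNTS[account]


@mcp.tool()
def transfer(from_account: str, to_account: str, amount: float) -> str:
    """Move money between accounts, rejecting overdrafts."""
    if amount > ACCOUNTS[from_account]:
        return f"Insufficient funds: {from_account} holds ${ACCOUNTS[from_account]:.2f}"
    ACCOUNTS[from_account] -= amount
    ACCOUNTS[to_account] += amount
    return f"Transferred ${amount:.2f} from {from_account} to {to_account}"
```

The remaining tools (`get_all_balances`, `deposit`, `withdraw`, `get_transactions`) follow the same pattern.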
Project Structure¶
```
tests/showcase/
├── test_hero.py                 # The comprehensive test suite
├── conftest.py                  # Shared fixtures
├── agents/                      # Agent instruction files for comparison
│   ├── concise.agent.md
│   ├── detailed.agent.md
│   └── friendly.agent.md
└── skills/
    └── financial-advisor/       # Domain knowledge skill
        ├── SKILL.md
        └── references/
            └── budgeting-guide.md
```
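The `copilot_eval` and `llm_assert` fixtures used throughout the tests are presumably provided by the pytest-skill-engineering plugin itself; `conftest.py` holds the scenario-specific pieces. Its contents aren't reproduced in this guide, but a minimal sketch might look like the following (the prompt wording and the `financial_advisor_skill` fixture body are assumptions, not the showcase suite's actual code):

```python
# tests/showcase/conftest.py -- hypothetical sketch; the real fixture bodies may differ.
from pathlib import Path

import pytest

# Base system prompt shared by the agents under test (wording is illustrative).
BANKING_PROMPT_BASE = (
    "You are a personal finance assistant with access to the user's bank accounts. "
    "Use the available tools to check balances and move money when asked."
)


@pytest.fixture
def financial_advisor_skill() -> Path:
    """Path to the financial-advisor skill directory used by the skill tests."""
    return Path(__file__).parent / "skills" / "financial-advisor"
```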
Running the Hero Tests¶
```bash
# Run all showcase tests with HTML report
pytest tests/showcase/ -v --aitest-html=docs/demo/hero-report.html

# Run a specific test class
pytest tests/showcase/test_hero.py::TestModelComparison -v
```
1. Basic Tool Usage¶
The simplest tests verify the agent can use individual tools correctly.
```python
class TestBasicOperations:
    """Basic single-tool operations demonstrating core functionality."""

    async def test_check_single_balance(self, copilot_eval):
        """Check balance of one account - simplest possible test."""
        agent = CopilotEval(
            name="banking-v1",
            instructions=BANKING_PROMPT_BASE,
        )
        result = await copilot_eval(agent, "What's my checking account balance?")

        assert result.success
        assert result.tool_was_called("get_balance")
```
What this tests:

- Can the LLM understand the user's intent?
- Does it select the correct tool (`get_balance`)?
- Does it pass valid parameters? (The sketch below shows one way to verify this.)
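To check the parameters explicitly, you can inspect the recorded tool calls. The snippet below would be appended inside `test_check_single_balance`; the `.name` and `.arguments` attributes on the tool-call records are assumptions about the result object, so adapt them to the actual API.

```python
# Optional follow-up check -- attribute names on tool-call records are assumed.
balance_calls = [call for call in result.all_tool_calls if call.name == "get_balance"]
assert balance_calls, "expected at least one get_balance call"
assert balance_calls[0].arguments.get("account") == "checking"
```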
2. Multi-Tool Workflows¶
Complex operations require coordinating multiple tools in sequence.
```python
class TestMultiToolWorkflows:
    """Complex workflows requiring coordination of multiple tools."""

    async def test_transfer_and_verify(self, copilot_eval, llm_assert):
        """Transfer money and verify the result with balance check."""
        agent = CopilotEval(
            name="banking-v1",
            instructions=BANKING_PROMPT_BASE,
        )
        result = await copilot_eval(
            agent,
            "Transfer $100 from checking to savings, then show me my new balances.",
        )

        assert result.success

        # Should use multiple tools
        assert result.tool_was_called("transfer")
        assert result.tool_was_called("get_all_balances") or result.tool_was_called("get_balance")

        assert llm_assert(
            result.final_response,
            "shows updated balances after transfer",
        )
```
What this tests:
- Can the LLM break down a complex request?
- Does it call the right tools in sequence?
- Does it synthesize information coherently?
3. Multi-Turn Context¶
Context across turns is provided naturally through the conversation history in the LLM session. Each test is independent — use explicit context in your prompts to test multi-step scenarios.
```python
async def test_plan_then_execute(self, copilot_eval, llm_assert):
    """Test that the agent can plan and then execute a savings transfer."""
    agent = CopilotEval(
        name="savings-planner",
        instructions=BANKING_PROMPT_BASE,
    )
    result = await copilot_eval(
        agent,
        "Check my account balances, identify the best opportunity to save more "
        "money, then transfer $200 to my savings account.",
    )

    assert result.success
    assert result.tool_was_called("transfer")
    assert llm_assert(
        result.final_response,
        "shows updated balances after transfer",
    )
```
What this tests:
- Can the LLM handle a compound instruction?
- Does it use tools in the right order?
- Does it synthesize the result coherently?
4. Model Comparison¶
Compare how different LLMs perform on the same task.
```python
BENCHMARK_MODELS = ["gpt-5-mini", "gpt-4.1-mini"]


class TestModelComparison:
    """Compare how different models handle complex financial advice."""

    @pytest.mark.parametrize("model", BENCHMARK_MODELS)
    async def test_financial_advice_quality(self, copilot_eval, llm_assert, model: str):
        """Compare models on providing comprehensive financial advice."""
        agent = CopilotEval(
            name="banking-v1",
            model=model,
            instructions=BANKING_PROMPT_BASE,
        )
        result = await copilot_eval(
            agent,
            "I want to reach my vacation savings goal faster. Analyze my current "
            "financial situation and recommend a concrete savings plan.",
        )

        assert result.success
        assert len(result.all_tool_calls) >= 1
        assert llm_assert(
            result.final_response,
            "provides actionable savings recommendations based on financial data",
        )
```
What this tests:
- Which model gives better financial advice?
- Which model uses tools more efficiently?
- Cost vs quality tradeoffs
The report automatically generates a model comparison table showing:
- Pass/fail rates per model
- Token usage and costs
- AI-generated recommendations
5. Agent Instruction Comparison¶
Compare how different agent instruction styles affect behavior.
Store each style as an .agent.md file:
```markdown
---
name: AGENT_CONCISE
description: Brief, to-the-point financial advice
---

You are a personal finance assistant. Be concise and direct.
Give specific numbers and actionable advice in 2-3 sentences.
```

```markdown
---
name: AGENT_DETAILED
description: Thorough financial analysis with explanations
---

You are a personal finance assistant. Provide comprehensive analysis.
Explain your reasoning, show calculations, and consider multiple scenarios.
```
Then parametrize tests over them:
```python
from pathlib import Path

import pytest

from pytest_skill_engineering.copilot import CopilotEval
from pytest_skill_engineering.core.evals import load_custom_agent

AGENT_PATHS = list((Path(__file__).parent / "agents").glob("*.agent.md"))


class TestPromptComparison:
    """Compare how different agent instruction styles affect financial advice."""

    @pytest.mark.parametrize("agent_path", AGENT_PATHS, ids=lambda p: p.stem.replace(".agent", ""))
    async def test_advice_style_comparison(self, copilot_eval, llm_assert, agent_path):
        """Compare concise vs detailed vs friendly advisory styles."""
        agent = CopilotEval(
            name=agent_path.stem,
            custom_agents=[load_custom_agent(agent_path)],
        )
        result = await copilot_eval(
            agent,
            "I'm worried about my spending. Can you check my accounts "
            "and give me advice on managing my money better?",
        )

        assert result.success
        assert result.tool_was_called("get_all_balances") or result.tool_was_called("get_balance")
```
What this tests:
- Which prompt style produces better advice?
- How does verbosity affect user experience?
- Are there quality vs token tradeoffs?
6. Skill Integration¶
Skills inject domain knowledge into the agent's context.
First, create a skill:
```markdown
---
name: financial-advisor
description: Financial planning and budgeting expertise
version: 1.0.0
---

# Financial Advisor Skill

You are an expert financial advisor with deep knowledge of:

## Emergency Fund Guidelines

- Minimum: 3 months of expenses
- Recommended: 6 months of expenses
- High-risk professions: 9-12 months

## Budget Allocation (50/30/20 Rule)

- 50% Needs: rent, utilities, groceries, minimum debt payments
- 30% Wants: entertainment, dining out, subscriptions
- 20% Savings: emergency fund, retirement, goals
```
Then use it in tests:
```python
class TestSkillEnhancement:
    """Test how skills improve financial advice quality."""

    async def test_with_financial_skill(
        self, copilot_eval, llm_assert, financial_advisor_skill
    ):
        """Eval with financial advisor skill should give better advice."""
        agent = CopilotEval(
            name="banking-v1",
            instructions=BANKING_PROMPT_BASE,
            skill_directories=["tests/showcase/skills/financial-advisor"],
        )
        result = await copilot_eval(
            agent,
            "I have $1500 in checking. Should I keep it there or move some to savings? "
            "What's a good emergency fund target for someone like me?",
        )

        assert result.success
        assert llm_assert(
            result.final_response,
            "provides financial advice about emergency funds or savings allocation",
        )
```
What this tests:
- Does domain knowledge improve advice quality?
- Does the agent apply the skill's guidelines?
- Are recommendations more concrete with skills?
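One natural extension, not part of the showcase suite, is to run the same prompt with and without the skill so the report contrasts the two runs side by side. The sketch below reuses only the APIs shown above and assumes `skill_directories` accepts an empty list to disable skills:

```python
# Hypothetical with/without-skill comparison; assumes skill_directories=[] disables skills.
SKILL_DIR = "tests/showcase/skills/financial-advisor"


class TestSkillAblation:
    """Run the same prompt with and without the financial-advisor skill."""

    @pytest.mark.parametrize("skills", [[], [SKILL_DIR]], ids=["baseline", "with-skill"])
    async def test_emergency_fund_advice(self, copilot_eval, llm_assert, skills):
        agent = CopilotEval(
            name="banking-v1",
            instructions=BANKING_PROMPT_BASE,
            skill_directories=skills,
        )
        result = await copilot_eval(agent, "How big should my emergency fund be?")

        assert result.success
        assert llm_assert(
            result.final_response,
            "gives a concrete emergency fund target in months of expenses",
        )
```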
7. Error Handling¶
Test graceful recovery from invalid operations.
```python
class TestErrorHandling:
    """Test graceful handling of edge cases and errors."""

    async def test_insufficient_funds_recovery(self, copilot_eval, llm_assert):
        """Eval should handle insufficient funds gracefully."""
        instructions = BANKING_PROMPT_BASE + " If an operation fails, explain why and suggest alternatives."
        agent = CopilotEval(
            name="banking-v1",
            instructions=instructions,
        )
        result = await copilot_eval(
            agent,
            "Transfer $50,000 from my checking to savings.",  # Way more than available!
        )

        assert result.success
        assert result.tool_was_called("transfer")
        assert llm_assert(
            result.final_response,
            "explains insufficient funds or suggests an alternative amount",
        )
```
What this tests:
- Does the agent attempt the operation?
- Does it handle tool errors gracefully?
- Does it provide helpful error messages?
Key Takeaways¶
Test Structure Best Practices¶
- One scenario, many tests — Reuse the same test scenario across test classes
- Named agents for sessions — Use `name="session-01"` to track multi-turn state
- Semantic assertions — Use `llm_assert` for behavior verification
- Parametrize for comparisons — Use `@pytest.mark.parametrize` for model/prompt grids (see the sketch after this list)
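A model/prompt grid is just stacked `parametrize` decorators; pytest expands every combination into its own test case. A minimal sketch, reusing the `BENCHMARK_MODELS` and `AGENT_PATHS` definitions from above (the prompt text is illustrative):

```python
# Hypothetical model x prompt-style grid built from the definitions shown earlier.
class TestGridComparison:
    @pytest.mark.parametrize("model", BENCHMARK_MODELS)
    @pytest.mark.parametrize("agent_path", AGENT_PATHS, ids=lambda p: p.stem.replace(".agent", ""))
    async def test_style_and_model_grid(self, copilot_eval, model, agent_path):
        agent = CopilotEval(
            name=f"{agent_path.stem}-{model}",
            model=model,
            custom_agents=[load_custom_agent(agent_path)],
        )
        result = await copilot_eval(agent, "Give me a quick summary of my finances.")

        assert result.success
```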
Assertion Patterns¶
| Assertion | Purpose |
|---|---|
| `result.success` | Eval completed without errors |
| `result.tool_was_called("name")` | Specific tool was invoked |
| `result.tool_call_count("name") >= N` | Tool called at least N times |
| `len(result.all_tool_calls) >= N` | Total tool calls threshold |
| `llm_assert(response, condition)` | Semantic response validation |
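These patterns compose naturally in a single test. A condensed sketch, with an illustrative prompt and made-up thresholds:

```python
# Illustrative combination of the assertion patterns above.
async def test_transfer_assertion_patterns(copilot_eval, llm_assert):
    agent = CopilotEval(name="banking-v1", instructions=BANKING_PROMPT_BASE)
    result = await copilot_eval(agent, "Move $100 to savings and show both balances.")

    assert result.success                              # eval completed without errors
    assert result.tool_was_called("transfer")          # the expected tool was invoked
    assert result.tool_call_count("transfer") >= 1     # called at least N times
    assert len(result.all_tool_calls) >= 2             # total tool-call threshold
    assert llm_assert(result.final_response, "reports both balances after the transfer")
```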
Report Features¶
The generated report includes:
- Model comparison tables with cost/quality analysis
- Prompt comparison showing style differences
- Session flow diagrams for multi-turn tests
- AI-powered insights suggesting improvements
- Failure analysis with root cause identification
Next Steps¶
- Test MCP Servers — Deep dive into MCP server configuration
- Generate Reports — Customize report output
- Configuration Reference — All available options