Skip to content

Architecture

How pytest-skill-engineering executes tests and dispatches tools.

Overview

┌─────────────────────────────────────────────────────────┐
│                     pytest-skill-engineering                        │
├─────────────────────────────────────────────────────────┤
│  Test: "What's my checking balance?"                      │
│                         │                                │
│                         ▼                                │
│  ┌─────────────────────────────────────────────────┐    │
│  │              EvalEngine                         │    │
│  │  ┌──────────┐   ┌─────────┐    ┌─────────────┐  │    │
│  │  │PydanticAI│◄──►│  Tool   │◄──►│ MCP/CLI     │  │    │
│  │  │  (LLM)   │   │Dispatch │    │ Servers     │  │    │
│  │  └──────────┘   └─────────┘    └─────────────┘  │    │
│  └─────────────────────────────────────────────────┘    │
│                         │                                │
│                         ▼                                │
│  EvalResult { turns, tool_calls, final_response }      │
└─────────────────────────────────────────────────────────┘

The Eval Execution Loop

When you call await copilot_eval(agent, "prompt"), here's what happens:

1. Server Startup

All MCP and CLI servers defined in the agent are started as subprocesses:

agent = CopilotEval(
    name="banking-test",
    instructions="You are a banking assistant.",
)

Servers remain running for the duration of the test session.

2. Tool Discovery

The engine queries each server for its available tools:

  • MCP servers: Uses the MCP protocol's tools/list method
  • CLI servers: Reads the tool definitions from the server wrapper

Tools are exposed to PydanticAI via native MCP toolsets.

3. LLM Loop

The engine enters a turn-based loop:

Turn 1: Send prompt + tool definitions to LLM
        LLM responds: "I'll check the balance" + tool_call(get_balance, account="checking")

Turn 2: Execute tool, send result to LLM
        LLM responds: "Your checking balance is $1,500.00"

Done: No more tool calls, return final response

The loop continues until: - The LLM responds without requesting tool calls (success) - Maximum turns reached (configurable via max_turns) - An error occurs

4. Tool Dispatch

When the LLM requests a tool call:

  1. Engine finds which server owns the tool
  2. Sends the call to that server (MCP protocol or CLI execution)
  3. Captures the result
  4. Returns it to the LLM in the next turn

5. Result Collection

Every turn is recorded in the EvalResult:

result = await copilot_eval(agent, "What's my checking balance?")

result.turns          # List of all conversation turns
result.all_tool_calls # All tool calls made
result.final_response # The LLM's final text response
result.success        # True if completed without errors

MCP vs CLI Servers

Both server types provide tools, but work differently:

MCP Servers

Native MCP protocol over stdio:

from pytest_skill_engineering import MCPServer

MCPServer(
    command=["python", "my_server.py"],
)
  • Tools defined via @server.tool() decorator
  • Full MCP protocol support
  • Bidirectional communication

CLI Servers

Command-line tools wrapped as callable tools:

from pytest_skill_engineering import CLIServer

CLIServer(
    command="git",
    tool_prefix="git",  # Creates "git_execute" tool
)

The LLM calls it like: git_execute(args="status --porcelain")

  • Stdout captured as tool result
  • Simple wrapper for existing CLIs

Skill Injection

When an agent has a skill, it's injected into the system prompt:

agent = CopilotEval(
    name="assistant",
    instructions="You are a helpful assistant.",
    skill_directories=["skills/financial-advisor"],
)

The skill content is prepended to the agent's instructions, giving the LLM domain knowledge before it sees the user's request.

Rate Limiting & Retries

PydanticAI handles transient failures automatically via its built-in retry mechanism:

  • 429 Too Many Requests: Automatic retry with backoff
  • Connection errors: Automatic retry
  • API errors: Automatic retry for transient failures

The retries field (default: 1) controls the maximum number of retries PydanticAI attempts when a tool call returns an error. Increase this value for agents that interact with unreliable tools or external services:

CopilotEval(
    name="resilient-agent",
    instructions="You are a helpful assistant.",
    retries=3,  # Allow up to 3 retries on tool errors
)

Test Iterations

LLM responses are non-deterministic. Running a test once tells you whether it passed that time, not whether the configuration is reliable. The --aitest-iterations=N CLI option reruns each test N times and aggregates the results.

Under the hood, pytest_generate_tests parametrizes every copilot_eval test with _aitest_iteration values 1..N. The report generator groups iterations by agent + test and computes an iteration pass rate.

pytest tests/ --aitest-iterations=5

Reports show per-test iteration breakdowns including pass count, pass rate, total duration, total tokens, and total cost.