
A/B Testing MCP Servers

Compare different MCP server implementations to find what works best.

Why A/B Test Servers?

Your MCP server's tool descriptions, schemas, and response formats are the API that LLMs interact with. Small changes can have big impacts:

  • Did your refactor break tool discoverability?
  • Does the new description improve tool selection?
  • Is the v2 output format easier for LLMs to parse?

A/B testing answers these questions with data.

Basic Server Comparison

Compare two versions of your MCP server:

import pytest

from pytest_aitest import Agent, Provider, MCPServer

# Two versions to compare
banking_v1 = MCPServer(command=["python", "banking_v1.py"])
banking_v2 = MCPServer(command=["python", "banking_v2.py"])

AGENTS = [
    Agent(
        name="banking-v1",
        provider=Provider(model="azure/gpt-5-mini"),
        mcp_servers=[banking_v1],
    ),
    Agent(
        name="banking-v2",
        provider=Provider(model="azure/gpt-5-mini"),
        mcp_servers=[banking_v2],
    ),
]

@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_query(aitest_run, agent):
    result = await aitest_run(agent, "What's my checking balance?")
    assert result.success
    assert result.tool_was_called("get_balance")
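
Run it like any other pytest suite (the file name here is hypothetical):

pytest test_banking_ab.py -v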

The report shows which server performs better.

What the Report Reveals

  • Pass rate: does the new server break anything?
  • Tool selection: is the LLM picking the right tools?
  • Tool call count: is the new server more efficient?
  • Token usage: does better tool output reduce LLM tokens?
  • Duration: is response time affected?
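
If you also want hard guard rails in the tests themselves, you can assert on these metrics per run. This is a minimal sketch: tool_calls, total_tokens, and duration_seconds are assumed attribute names, not confirmed API, so adjust them to whatever your result object actually exposes.

@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_query_budget(aitest_run, agent):
    result = await aitest_run(agent, "What's my checking balance?")
    assert result.success
    assert result.tool_was_called("get_balance")
    # NOTE: the attribute names below are assumptions -- check your version.
    assert len(result.tool_calls) <= 3       # efficiency: no tool-call loops
    assert result.total_tokens < 5_000       # token budget
    assert result.duration_seconds < 30      # latency budget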

Common A/B Testing Scenarios

Iterating on Tool Descriptions

Test whether a clearer description improves tool usage:

# v1: Vague description
# get_balance: "Gets balance data"

# v2: Clear description with examples  
# get_balance: "Get current balance for a bank account. Example: get_balance('checking')"

@pytest.mark.parametrize("agent", [agent_v1, agent_v2], ids=["vague", "clear"])
async def test_tool_discovery(aitest_run, agent):
    result = await aitest_run(agent, "I need to check how much money I have")
    assert result.tool_was_called("get_balance")
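
For completeness, agent_v1 and agent_v2 above could be wired up like this; the server script names are hypothetical stand-ins for the two description variants:

# Hypothetical script names -- one server per description variant.
server_vague = MCPServer(command=["python", "banking_vague_desc.py"])
server_clear = MCPServer(command=["python", "banking_clear_desc.py"])

agent_v1 = Agent(
    name="vague-descriptions",
    provider=Provider(model="azure/gpt-5-mini"),
    mcp_servers=[server_vague],
)
agent_v2 = Agent(
    name="clear-descriptions",
    provider=Provider(model="azure/gpt-5-mini"),
    mcp_servers=[server_clear],
)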

Comparing Implementations

Test your server against an open-source alternative:

my_server = MCPServer(command=["python", "my_server.py"])
reference = MCPServer(command=["npx", "-y", "@org/reference-server"])

AGENTS = [
    Agent(name="my-implementation", mcp_servers=[my_server], ...),
    Agent(name="reference-implementation", mcp_servers=[reference], ...),
]
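
A test can then run the same prompt against both and let the report do the comparison (the prompt and tool name below are assumptions about what the two servers expose):

@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_same_prompt_both_implementations(aitest_run, agent):
    result = await aitest_run(agent, "List my recent transactions")
    assert result.success
    # Assumes both servers expose a tool with this name.
    assert result.tool_was_called("list_transactions")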

Testing Backend Changes

Verify a database migration doesn't affect LLM interactions:

server_sqlite = MCPServer(
    command=["python", "server.py"],
    env={"DATABASE_URL": "sqlite:///test.db"},
)

server_postgres = MCPServer(
    command=["python", "server.py"],
    env={"DATABASE_URL": "postgresql://localhost/test"},
)
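
The rest of the comparison could look like this sketch, assuming both backends expose the same get_balance tool and you reuse the same model for both agents:

AGENTS = [
    Agent(
        name="sqlite-backend",
        provider=Provider(model="azure/gpt-5-mini"),
        mcp_servers=[server_sqlite],
    ),
    Agent(
        name="postgres-backend",
        provider=Provider(model="azure/gpt-5-mini"),
        mcp_servers=[server_postgres],
    ),
]

@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_backend_parity(aitest_run, agent):
    # The LLM-facing behaviour should be identical regardless of backend.
    result = await aitest_run(agent, "What's my checking balance?")
    assert result.success
    assert result.tool_was_called("get_balance")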

Evaluating Schema Changes

Test whether a new input schema is clearer:

# v1: Single "query" parameter
# v2: Separate "account" and "type" parameters

@pytest.mark.parametrize("agent", [agent_v1, agent_v2])
async def test_ambiguous_query(aitest_run, agent):
    # This query is ambiguous - does the LLM handle it correctly?
    result = await aitest_run(agent, "How much do I have in checking?")
    assert result.success
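
For illustration, the two input schemas under comparison might look roughly like this as JSON Schema; the property names and descriptions are assumptions, not taken from a real server:

# v1: one free-form parameter -- the LLM has to pack everything into "query".
schema_v1 = {
    "type": "object",
    "properties": {"query": {"type": "string"}},
    "required": ["query"],
}

# v2: explicit parameters -- less guesswork for the model.
schema_v2 = {
    "type": "object",
    "properties": {
        "account": {"type": "string", "description": "Account name, e.g. 'checking'"},
        "type": {"type": "string", "description": "Balance type, e.g. 'available' or 'pending'"},
    },
    "required": ["account"],
}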

Multi-Dimensional Comparison

Test servers across multiple models to find interactions:

MODELS = ["gpt-5-mini", "gpt-4.1"]
SERVERS = {"v1": banking_v1, "v2": banking_v2}

AGENTS = [
    Agent(
        name=f"{server_name}-{model}",
        provider=Provider(model=f"azure/{model}"),
        mcp_servers=[server],
    )
    for server_name, server in SERVERS.items()
    for model in MODELS
]

# 2 servers × 2 models = 4 configurations

This reveals interactions like:

  • "v2 works great with gpt-4.1 but fails with gpt-5-mini"
  • "gpt-5-mini needs better tool descriptions to match gpt-4.1 performance"

AI Insights for Server Comparison

When you run with --aitest-summary-model, the report includes:

🔧 MCP TOOL FEEDBACK

banking-v1/get_balance — 60% success rate
Current: "Gets balance data"
Issue: LLM often calls get_all_balances instead
Suggested: "Get CURRENT balance for a specific bank account.
            For all accounts at once, use get_all_balances."

banking-v2/get_balance — 95% success rate  
Description is clear and well-targeted.
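
For example (the summary model value and file name here are just examples; use whichever model your provider setup supports):

pytest test_banking_ab.py --aitest-summary-model=azure/gpt-5-mini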

Best Practices

  1. Use the same model — Isolate the server variable by using identical providers

  2. Test edge cases — Include ambiguous prompts that stress-test descriptions

  3. Run multiple times — LLM responses vary; run enough tests to see patterns (see the sketch after this list)

  4. Check token usage — Better descriptions might cost more but improve accuracy

  5. Name servers clearly — Use descriptive names that appear in reports (v1, v2, sqlite, postgres)
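
For point 3, a plain-pytest way to get repeated runs is an extra parametrize axis, as in this sketch:

# Run each agent 5 times so flaky behaviour shows up as a pattern, not noise.
@pytest.mark.parametrize("run", range(5), ids=lambda i: f"run{i}")
@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_query_repeated(aitest_run, agent, run):
    result = await aitest_run(agent, "What's my checking balance?")
    assert result.success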

Next Steps

📁 Real Example: test_ab_servers.py — Server version comparison and tool description impact testing