# A/B Testing MCP Servers
Compare different MCP server implementations to find what works best.
## Why A/B Test Servers?
Your MCP server's tool descriptions, schemas, and response formats are the API that LLMs interact with. Small changes can have big impacts:
- Did your refactor break tool discoverability?
- Does the new description improve tool selection?
- Is the v2 output format easier for LLMs to parse?
A/B testing answers these questions with data.
## Basic Server Comparison
Compare two versions of your MCP server:
```python
import pytest

from pytest_aitest import Agent, Provider, MCPServer

# Two versions to compare
banking_v1 = MCPServer(command=["python", "banking_v1.py"])
banking_v2 = MCPServer(command=["python", "banking_v2.py"])

AGENTS = [
    Agent(
        name="banking-v1",
        provider=Provider(model="azure/gpt-5-mini"),
        mcp_servers=[banking_v1],
    ),
    Agent(
        name="banking-v2",
        provider=Provider(model="azure/gpt-5-mini"),
        mcp_servers=[banking_v2],
    ),
]


@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_query(aitest_run, agent):
    result = await aitest_run(agent, "What's my checking balance?")
    assert result.success
    assert result.tool_was_called("get_balance")
```
The report shows which server performs better.
## What the Report Reveals
| Metric | What It Tells You |
|---|---|
| Pass rate | Does the new server break anything? |
| Tool selection | Is the LLM picking the right tools? |
| Tool call count | Is the new server more efficient? |
| Token usage | Does better tool output reduce LLM tokens? |
| Duration | Is response time affected? |
## Common A/B Testing Scenarios

### Iterating on Tool Descriptions
Test whether a clearer description improves tool usage:
```python
# v1: Vague description
#   get_balance: "Gets balance data"
# v2: Clear description with examples
#   get_balance: "Get current balance for a bank account. Example: get_balance('checking')"


@pytest.mark.parametrize("agent", [agent_v1, agent_v2], ids=["vague", "clear"])
async def test_tool_discovery(aitest_run, agent):
    result = await aitest_run(agent, "I need to check how much money I have")
    assert result.tool_was_called("get_balance")
```
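The `agent_v1` and `agent_v2` objects above are wired up just like the agents in the basic comparison. A minimal sketch, assuming the two server scripts differ only in their `get_balance` description (the script and agent names are placeholders):

```python
server_v1 = MCPServer(command=["python", "banking_v1.py"])  # vague description
server_v2 = MCPServer(command=["python", "banking_v2.py"])  # clear description

agent_v1 = Agent(
    name="vague-description",
    provider=Provider(model="azure/gpt-5-mini"),
    mcp_servers=[server_v1],
)
agent_v2 = Agent(
    name="clear-description",
    provider=Provider(model="azure/gpt-5-mini"),
    mcp_servers=[server_v2],
)
```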
### Comparing Implementations
Test your server against an open-source alternative:
```python
my_server = MCPServer(command=["python", "my_server.py"])
reference = MCPServer(command=["npx", "-y", "@org/reference-server"])

AGENTS = [
    Agent(name="my-implementation", mcp_servers=[my_server], ...),
    Agent(name="reference-implementation", mcp_servers=[reference], ...),
]
```
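A parametrized test then runs identical prompts against both implementations. A sketch, assuming both servers expose a `get_balance` tool:

```python
@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_query_parity(aitest_run, agent):
    # Same prompt for both servers; adjust the tool name to whatever
    # the reference implementation actually exposes.
    result = await aitest_run(agent, "What's my checking balance?")
    assert result.success
    assert result.tool_was_called("get_balance")
```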
### Testing Backend Changes
Verify a database migration doesn't affect LLM interactions:
```python
server_sqlite = MCPServer(
    command=["python", "server.py"],
    env={"DATABASE_URL": "sqlite:///test.db"},
)
server_postgres = MCPServer(
    command=["python", "server.py"],
    env={"DATABASE_URL": "postgresql://localhost/test"},
)
```
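Wrap each backend in its own agent and run the same assertions against both. A minimal sketch; the model id and tool name are placeholders:

```python
BACKEND_AGENTS = [
    Agent(
        name=backend,
        provider=Provider(model="azure/gpt-5-mini"),
        mcp_servers=[server],
    )
    for backend, server in {"sqlite": server_sqlite, "postgres": server_postgres}.items()
]


@pytest.mark.parametrize("agent", BACKEND_AGENTS, ids=lambda a: a.name)
async def test_backend_parity(aitest_run, agent):
    # Both backends should look identical from the LLM's point of view.
    result = await aitest_run(agent, "What's my checking balance?")
    assert result.success
    assert result.tool_was_called("get_balance")
```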
### Evaluating Schema Changes
Test whether a new input schema is clearer:
```python
# v1: Single "query" parameter
# v2: Separate "account" and "type" parameters


@pytest.mark.parametrize("agent", [agent_v1, agent_v2])
async def test_ambiguous_query(aitest_run, agent):
    # This query is ambiguous - does the LLM handle it correctly?
    result = await aitest_run(agent, "How much do I have in checking?")
    assert result.success
```
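Concretely, what changes between v1 and v2 is the tool's input schema (MCP tools describe their inputs as JSON Schema). An illustrative sketch of the two shapes; the field names are hypothetical, and your server framework would normally generate these for you:

```python
schema_v1 = {  # v1: one free-form parameter the LLM has to guess how to fill
    "type": "object",
    "properties": {
        "query": {"type": "string", "description": "Free-form balance query"},
    },
    "required": ["query"],
}

schema_v2 = {  # v2: explicit, separately documented parameters
    "type": "object",
    "properties": {
        "account": {"type": "string", "description": "Account name, e.g. 'checking'"},
        "type": {"type": "string", "enum": ["available", "pending"]},
    },
    "required": ["account"],
}
```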
## Multi-Dimensional Comparison
Test servers across multiple models to find interactions:
```python
MODELS = ["gpt-5-mini", "gpt-4.1"]
SERVERS = {"v1": banking_v1, "v2": banking_v2}

AGENTS = [
    Agent(
        name=f"{server_name}-{model}",
        provider=Provider(model=f"azure/{model}"),
        mcp_servers=[server],
    )
    for server_name, server in SERVERS.items()
    for model in MODELS
]
# 2 servers × 2 models = 4 configurations
```
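A test parametrized over this list runs once per (server, model) pair:

```python
@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_query_matrix(aitest_run, agent):
    # Runs four times: v1/v2 × gpt-5-mini/gpt-4.1
    result = await aitest_run(agent, "What's my checking balance?")
    assert result.success
    assert result.tool_was_called("get_balance")
```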
This reveals interactions like:

- "v2 works great with gpt-4.1 but fails with gpt-5-mini"
- "gpt-5-mini needs better tool descriptions to match gpt-4.1 performance"
## AI Insights for Server Comparison

When you run with `--aitest-summary-model`, the report includes:
```
🔧 MCP TOOL FEEDBACK

banking-v1/get_balance — 60% success rate
  Current:   "Gets balance data"
  Issue:     LLM often calls get_all_balances instead
  Suggested: "Get CURRENT balance for a specific bank account.
              For all accounts at once, use get_all_balances."

banking-v2/get_balance — 95% success rate
  Description is clear and well-targeted.
```
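Feeding the suggested wording back into the server is usually a small change. As one example (an assumption, since your server may use any framework), a server built with the MCP Python SDK's `FastMCP` helper exposes the function docstring as the tool description the LLM sees:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("banking")


@mcp.tool()
def get_balance(account: str) -> str:
    """Get CURRENT balance for a specific bank account.

    For all accounts at once, use get_all_balances.
    """
    return fetch_balance(account)  # fetch_balance is a placeholder for your real lookup
```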
## Best Practices
- **Use the same model** — Isolate the server variable by using identical providers
- **Test edge cases** — Include ambiguous prompts that stress-test descriptions
- **Run multiple times** — LLM responses vary; run enough tests to see patterns (see the sketch after this list)
- **Check token usage** — Better descriptions might cost more but improve accuracy
- **Name servers clearly** — Use descriptive names that appear in reports (`v1`, `v2`, `sqlite`, `postgres`)
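For the repeated runs, one plain-pytest approach is to stack a dummy repeat parameter on top of the agent parametrization (a sketch; five repetitions is an arbitrary choice):

```python
@pytest.mark.parametrize("repeat", range(5), ids=lambda i: f"run{i}")
@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_query_repeated(aitest_run, agent, repeat):
    # The repeat index only forces extra runs; look for patterns in the
    # report rather than reacting to a single flaky failure.
    result = await aitest_run(agent, "What's my checking balance?")
    assert result.tool_was_called("get_balance")
```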
## Next Steps
- Comparing Configurations — More comparison patterns
- Generate Reports — Get AI insights on your comparison
📁 Real Example: `test_ab_servers.py` — Server version comparison and tool description impact testing