Recommended for Deployment
gpt-4.1-mini
Delivers a 100% pass rate at ~40% lower total cost than the alternative, with faster responses, fewer tokens, and consistent multi-turn tool usage.
100% Pass Rate
$0.001419 Total Cost
3,005 Tokens
Comparative Analysis
Why the winner wins
- Lower realized cost: Achieves the same 100% pass rate at ~40% lower total cost than gpt-5-mini ($0.001419 vs $0.002354 across identical tests).
- Token efficiency: Uses ~28% fewer tokens (3,005 vs 4,180), indicating tighter reasoning and less verbose responses without sacrificing correctness.
- Faster execution: Consistently lower durations per turn, improving perceived latency in multi-turn sessions.
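The quoted savings follow directly from the report's own numbers; a quick arithmetic check (figures copied from above):

```python
# Verify the ~40% cost and ~28% token savings claimed in the comparison.
cost_winner, cost_alt = 0.001419, 0.002354    # gpt-4.1-mini vs gpt-5-mini
tokens_winner, tokens_alt = 3005, 4180

cost_saving = 1 - cost_winner / cost_alt       # fraction saved on cost
token_saving = 1 - tokens_winner / tokens_alt  # fraction saved on tokens
print(f"{cost_saving:.1%}, {token_saving:.1%}")
# prints "39.7%, 28.1%"
```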
Notable patterns
- Equivalent tool correctness: Both agents correctly chained tools across a 3-turn session (get_balance → transfer → get_all_balances) with no retries or confusion.
- Verbosity differences: gpt-5-mini tended to add longer follow-ups and prompts, increasing token usage and cost despite identical outcomes.
- Stable session context: Neither agent exhibited context drift across turns; balances and actions remained coherent.
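The 3-turn chain described above can be reproduced with minimal local stand-ins for the banking_server tools. The implementations, account names, and starting balances below are illustrative assumptions, not the server's actual code:

```python
# Hypothetical in-memory stand-ins for the banking_server MCP tools,
# sketching the get_balance -> transfer -> get_all_balances chain.
balances = {"checking": 1500.0, "savings": 3000.0}

def get_balance(account: str) -> float:
    """Return the balance of a single account."""
    return balances[account]

def transfer(src: str, dst: str, amount: float) -> None:
    """Move `amount` from src to dst, rejecting overdrafts."""
    if balances[src] < amount:
        raise ValueError("insufficient funds")
    balances[src] -= amount
    balances[dst] += amount

def get_all_balances() -> dict:
    """Return every account balance plus the derived total."""
    return {"accounts": dict(balances), "total": sum(balances.values())}

# Turn 1: check funds; turn 2: move money; turn 3: verify session state.
assert get_balance("checking") == 1500.0
transfer("checking", "savings", 500.0)
print(get_all_balances())
# prints {'accounts': {'checking': 1000.0, 'savings': 3500.0}, 'total': 4500.0}
```

Because the total is derived from the per-account balances on every call, the verification turn stays consistent with the transfer, matching the "no context drift" pattern noted above.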
Alternatives
- gpt-5-mini: Same pass rate and correct tool usage, but higher cost per test and more verbose outputs. Viable if model-specific features are needed; otherwise not cost-optimal.
🔧 MCP Tool Feedback
banking_server
Overall, tools are clear and reliably discoverable. Agents selected the correct tool each time with valid parameters.
| Tool | Status | Calls | Issues |
|------|--------|-------|--------|
| get_balance | ✅ | 2 | Working well |
| transfer | ✅ | 2 | Working well |
| get_all_balances | ✅ | 2 | Working well |
💡 Optimizations
| # | Optimization | Priority | Estimated Savings |
|---|--------------|----------|-------------------|
| 1 | Trim conversational follow-ups | recommended | ~15% cost reduction |
| 2 | Compact tool responses | suggestion | ~20–30% fewer tool-response tokens |
1. Trim conversational follow-ups (recommended)
- Current: Agents often append open-ended follow-up questions after completing the task.
- Change: In the system prompt, add: "After completing the user's request successfully, provide the result succinctly and do not ask follow-up questions unless explicitly requested."
- Impact: ~15% cost reduction from fewer generated tokens per turn.
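The prompt change can be applied as a one-line addition, sketched below with an OpenAI-style messages list; SYSTEM_PROMPT is a placeholder for the agent's real instructions, not the prompt actually used in these tests:

```python
# Append the no-follow-ups directive to a hypothetical base system prompt.
SYSTEM_PROMPT = "You are a banking assistant with access to banking_server tools."

NO_FOLLOW_UPS = (
    "After completing the user's request successfully, provide the result "
    "succinctly and do not ask follow-up questions unless explicitly requested."
)

messages = [
    {"role": "system", "content": f"{SYSTEM_PROMPT} {NO_FOLLOW_UPS}"},
    {"role": "user", "content": "Transfer $500 from checking to savings."},
]
```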
2. Compact tool responses (suggestion)
- Current: Tool JSON includes both raw values and formatted strings plus descriptive messages.
- Change: Return only fields required for the response text (omit redundant formatted strings and messages).
- Impact: ~20–30% fewer tool-response tokens, compounding savings in multi-turn sessions.
📦 Tool Response Optimization
get_all_balances (from banking_server)
- Current response size: ~90–110 tokens
- Issues found: Redundant formatted fields and total_formatted duplicate information the agent can derive or format itself.
- Suggested optimization: Remove formatted strings and return numeric balances only.
- Estimated savings: ~30 tokens per call (~25% reduction)
Example current vs optimized:

```json
// Current
{
  "accounts": {
    "checking": {"balance": 1500.0, "formatted": "$1,500.00"},
    "savings": {"balance": 3000.0, "formatted": "$3,000.00"}
  },
  "total": 4500.0,
  "total_formatted": "$4,500.00"
}
```

```json
// Optimized
{
  "accounts": {
    "checking": 1500.0,
    "savings": 3000.0
  },
  "total": 4500.0
}
```
This optimization preserves all necessary information while reducing token overhead for every verification step.
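If the server can't be changed directly, the same reduction can be applied as a post-processing step on the tool result. The helper below is a hypothetical sketch that converts the verbose shape above into the compact one, assuming the current field names (balance, formatted, total_formatted):

```python
import json

def compact_balances(response: dict) -> dict:
    """Drop formatted strings from a verbose get_all_balances payload,
    keeping only the numeric values the agent actually needs."""
    return {
        "accounts": {
            name: acct["balance"]
            for name, acct in response["accounts"].items()
        },
        "total": response["total"],
    }

current = {
    "accounts": {
        "checking": {"balance": 1500.0, "formatted": "$1,500.00"},
        "savings": {"balance": 3000.0, "formatted": "$3,000.00"},
    },
    "total": 4500.0,
    "total_formatted": "$4,500.00",
}

optimized = compact_balances(current)
print(json.dumps(optimized))
# prints {"accounts": {"checking": 1500.0, "savings": 3000.0}, "total": 4500.0}
```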