Two agents compared side-by-side.

βœ— 1 Failed
February 07, 2026 at 07:20 PM Β· πŸ“„ scenario_02_multi_agent.py Β· 6 tests Β· 55.8s Β· 738–2,232 tok Β· πŸ§ͺ $-0.019128 Β· πŸ€– $0.0237 Β· πŸ’° $0.004565
πŸ€– AI Analysis

Recommended for Deploy: gpt-5-mini

Achieves a 100% pass rate while correctly executing multi-step tool flows and error recovery. Slightly higher cost than the alternative, but the only configuration that consistently follows the required tool-first behavior in failure scenarios.

  • Pass Rate: 100%
  • Total Cost: $0.003314
  • Tokens: 4,589
  • Total Tests: 6
  • Failures: 1
  • Agents: 2
  • Avg Turns: 3.2

Comparative Analysis

Why the winner wins

  • Only agent with correct error-path tool usage: gpt-5-mini is the only configuration that attempted the withdraw tool even when funds were insufficient, satisfying the test’s requirement to exercise error recovery logic.
  • Reliable multi-step chaining: Successfully chains transfer β†’ get_all_balances without hesitation or re-asking the user, matching the intended MCP workflow.
  • Higher correctness outweighs marginal cost: Although gpt-4.1-mini is cheaper in absolute terms, its 67% pass rate makes it unsuitable for deployment where correctness is mandatory.

Notable patterns

  • Cheaper model avoided β€œfailing” tools: gpt-4.1-mini inferred insufficiency from get_balance and responded conversationally instead of calling withdraw, indicating a bias toward helpful explanation over tool execution.
  • Error tests expose prompt–model interaction: The insufficient-funds scenario is the only place where the models diverge, suggesting that tool-error paths are the critical discriminator.

Alternatives

  • gpt-4.1-mini: Lower total cost, but failed a core error-recovery test because it did not call the required tool. Not recommended unless the prompt is adjusted to force tool invocation on withdrawal intents.

❌ Failure Analysis

Failure Summary

gpt-4.1-mini (1 failure)

| Test | Root Cause | Fix |
|---|---|---|
| Insufficient funds β€” tests error recovery | Model chose to explain insufficiency instead of invoking withdraw | Update system prompt to mandate calling withdraw for any withdrawal request, even if funds appear insufficient |

Insufficient funds β€” tests error recovery (gpt-4.1-mini)

  • Problem: The agent responded with a balance explanation and options without ever calling the withdraw tool.
  • Root Cause: The model inferred the failure outcome after calling get_balance and optimized for a helpful natural-language response rather than executing the requested action via the tool.
  • Behavioral Mechanism: The absence of explicit language like β€œalways attempt the requested banking action via tools” allows the model to short-circuit. Phrases implying helpfulness or user guidance (e.g., offering alternatives) prime it to stop after explanation instead of acting.
  • Fix: Add an explicit instruction to the system prompt:

    β€œFor any user request to withdraw funds, you MUST call the withdraw tool exactly once, even if you expect it to fail due to insufficient funds.”
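The quoted rule could be wired into the agent setup roughly as follows. This is a minimal sketch: the `build_system_prompt` helper and the base prompt text are hypothetical, and only the quoted rule itself comes from this report.

```python
# Sketch: appending the mandated tool-first rule to an agent's system prompt.
# BASE_PROMPT and the helper are illustrative assumptions.

BASE_PROMPT = "You are a banking assistant with access to MCP banking tools."

TOOL_FIRST_RULE = (
    "For any user request to withdraw funds, you MUST call the withdraw tool "
    "exactly once, even if you expect it to fail due to insufficient funds."
)

def build_system_prompt(base: str = BASE_PROMPT) -> str:
    """Combine the base prompt with the explicit tool-first rule."""
    return f"{base}\n\n{TOOL_FIRST_RULE}"

print(build_system_prompt())
```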

πŸ”§ MCP Tool Feedback

banking_server

Overall, tools are discoverable and consistently named. Error responses are clear, enabling the agent to recover and explain outcomes.

| Tool | Status | Calls | Issues |
|---|---|---|---|
| get_balance | βœ… | 3 | Working well |
| get_all_balances | βœ… | 3 | Working well |
| transfer | βœ… | 2 | Working well |
| withdraw | βœ… | 1 | Error surfaced correctly |

πŸ’‘ Optimizations

| # | Optimization | Priority | Estimated Savings |
|---|---|---|---|
| 1 | Force tool-first behavior for withdrawals | recommended | Prevents 33% failure rate |
| 2 | Trim unused fields from balance responses | suggestion | ~15% token reduction |

1. Force tool-first behavior for withdrawals (recommended)

  • Current: Models may infer failure from balances and skip calling withdraw.
  • Change: Explicitly require a withdraw tool call for any withdrawal intent.
  • Impact: Eliminates the observed failure mode; improves pass rate from 67% to 100% for gpt-4.1-mini.

2. Trim unused fields from balance responses (suggestion)

  • Current: Tools return both raw and formatted totals that are not always referenced.
  • Change: Remove aggregate totals when per-account balances are sufficient.
  • Impact: ~15% token reduction in multi-balance responses.

πŸ“¦ Tool Response Optimization

get_all_balances (from banking_server)

  • Current response size: ~80 tokens
  • Issues found: Includes total and total_formatted even when the agent only lists per-account balances.
  • Suggested optimization: Make totals optional via a parameter (e.g., include_total=false).
  • Estimated savings: ~15 tokens per call (~18% reduction)

Example current vs optimized:

// Current (~80 tokens)
{
  "accounts": {
    "checking": {"balance": 1400.0, "formatted": "$1,400.00"},
    "savings": {"balance": 3100.0, "formatted": "$3,100.00"}
  },
  "total": 4500.0,
  "total_formatted": "$4,500.00"
}

// Optimized (~65 tokens, include_total=false)
{
  "accounts": {
    "checking": {"balance": 1400.0, "formatted": "$1,400.00"},
    "savings": {"balance": 3100.0, "formatted": "$3,100.00"}
  }
}
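On the server side, the suggested include_total parameter could look roughly like this. This is a minimal sketch with illustrative account data, not the actual banking_server implementation.

```python
# Sketch of get_all_balances with an opt-out for aggregate totals.
# ACCOUNTS and the handler shape are illustrative assumptions.

ACCOUNTS = {"checking": 1400.0, "savings": 3100.0}

def get_all_balances(include_total: bool = True) -> dict:
    """Return per-account balances; emit totals only when requested."""
    response = {
        "accounts": {
            name: {"balance": bal, "formatted": f"${bal:,.2f}"}
            for name, bal in ACCOUNTS.items()
        }
    }
    if include_total:
        total = sum(ACCOUNTS.values())
        response["total"] = total
        response["total_formatted"] = f"${total:,.2f}"
    return response

# With include_total=False the aggregate fields are dropped entirely,
# trimming the response for agents that only list per-account balances.
slim = get_all_balances(include_total=False)
```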

πŸ† Agent Leaderboard

| Rank | Agent | Tests | Pass Rate | Tokens | Cost | Duration |
|---|---|---|---|---|---|---|
| πŸ₯‡ | gpt-5-mini | 3/3 | 100% | 4,589 | $0.003314 | 35.3s |
| πŸ₯ˆ | gpt-4.1-mini | 2/3 | 67% | 2,495 | $0.001251 | 20.5s |

πŸ“‹ Test Results

6 / 6 tests
πŸ“‹ tests/fixtures/scenario_02_multi_agent.py (3 tests) Β· gpt-5-mini: 3/3 Β· gpt-4.1-mini: 2/3

βœ… Basic balance query β€” all agents should pass.
Total 1,660 tok Β· Total $0.000664 Β· Ξ” +25% Β· Ξ” +59% Β· Ξ” +8%
  • gpt-5-mini: βœ… 11.0s
  • gpt-4.1-mini: βœ… 6.9s

βœ… Transfer with verification β€” tests multi-step tool use.
Total 2,427 tok Β· Total $0.001545 Β· Ξ” +45% Β· Ξ” +24% Β· Ξ” +93%
  • gpt-5-mini: βœ… 10.0s
  • gpt-4.1-mini: βœ… 8.1s

⚑ Insufficient funds β€” tests error recovery.
Total 2,997 tok Β· Total $0.002356 Β· Ξ” +192% Β· Ξ” +161% Β· Ξ” +419%
  • gpt-5-mini: βœ… 14.3s
  • gpt-4.1-mini: ❌ 5.5s