Recommended for Deploy
gpt-5-mini
Achieves a 100% pass rate while correctly executing multi-step tool flows and error recovery. Slightly higher cost than the alternative, but the only configuration that consistently follows the required tool-first behavior in failure scenarios.
- Pass Rate: 100%
- Total Cost: $0.003314
- Tokens: 4,589
Comparative Analysis
Why the winner wins
- Only agent with correct error-path tool usage: gpt-5-mini is the only configuration that attempted the withdraw tool even when funds were insufficient, satisfying the test's requirement to exercise error recovery logic.
- Reliable multi-step chaining: Successfully chains transfer → get_all_balances without hesitation or re-asking the user, matching the intended MCP workflow.
- Higher correctness outweighs marginal cost: Although gpt-4.1-mini is cheaper in absolute terms, its 67% pass rate makes it unsuitable for deployment where correctness is mandatory.
Notable patterns
- Cheaper model avoided "failing" tools: gpt-4.1-mini inferred insufficiency from get_balance and responded conversationally instead of calling withdraw, indicating a bias toward helpful explanation over tool execution.
- Error tests expose prompt–model interaction: The insufficient-funds scenario is the only place where the models diverge, suggesting that tool-error paths are the critical discriminator.
Alternatives
- gpt-4.1-mini: Lower total cost, but failed a core error-recovery test because it did not call the required tool. Not recommended unless the prompt is adjusted to force tool invocation on withdrawal intents.
❌ Failure Analysis
Failure Summary
gpt-4.1-mini (1 failure)
| Test | Root Cause | Fix |
| --- | --- | --- |
| Insufficient funds – tests error recovery | Model chose to explain insufficiency instead of invoking withdraw | Update system prompt to mandate calling withdraw for any withdrawal request, even if funds appear insufficient |
Insufficient funds – tests error recovery (gpt-4.1-mini)
- Problem: The agent responded with a balance explanation and options without ever calling the withdraw tool.
- Root Cause: The model inferred the failure outcome after calling get_balance and optimized for a helpful natural-language response rather than executing the requested action via the tool.
- Behavioral Mechanism: The absence of explicit language like "always attempt the requested banking action via tools" allows the model to short-circuit. Phrases implying helpfulness or user guidance (e.g., offering alternatives) prime it to stop after explanation instead of acting.
- Fix: Add an explicit instruction to the system prompt:
"For any user request to withdraw funds, you MUST call the withdraw tool exactly once, even if you expect it to fail due to insufficient funds."
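As a minimal sketch of how this fix could be wired in, the rule can be appended to the agent's system prompt before the session starts. The variable names and base prompt below are assumptions for illustration, not the project's actual configuration:

```python
# Hypothetical sketch: appending the mandated tool-first rule to the system prompt.
# BASE_PROMPT is an assumed placeholder; only the quoted rule comes from the fix above.
BASE_PROMPT = "You are a banking assistant with access to MCP banking tools."

TOOL_FIRST_RULE = (
    "For any user request to withdraw funds, you MUST call the withdraw tool "
    "exactly once, even if you expect it to fail due to insufficient funds."
)

system_prompt = f"{BASE_PROMPT}\n\n{TOOL_FIRST_RULE}"
```

Placing the rule at the end of the prompt keeps it close to the conversation turn, where instruction-following tends to be strongest.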
🔧 MCP Tool Feedback
banking_server
Overall, tools are discoverable and consistently named. Error responses are clear, enabling the agent to recover and explain outcomes.
| Tool | Status | Calls | Issues |
| --- | --- | --- | --- |
| get_balance | ✅ | 3 | Working well |
| get_all_balances | ✅ | 3 | Working well |
| transfer | ✅ | 2 | Working well |
| withdraw | ✅ | 1 | Error surfaced correctly |
💡 Optimizations
| # | Optimization | Priority | Estimated Savings |
| --- | --- | --- | --- |
| 1 | Force tool-first behavior for withdrawals | recommended | Prevents 33% failure rate |
| 2 | Trim unused fields from balance responses | suggestion | ~15% token reduction |
1. Force tool-first behavior for withdrawals (recommended)
- Current: Models may infer failure from balances and skip calling withdraw.
- Change: Explicitly require a withdraw tool call for any withdrawal intent.
- Impact: Eliminates the observed failure mode; expected to lift gpt-4.1-mini's pass rate from 67% to 100%.
2. Trim unused fields from balance responses (suggestion)
- Current: Tools return both raw and formatted totals that are not always referenced.
- Change: Remove aggregate totals when per-account balances are sufficient.
- Impact: ~15% token reduction in multi-balance responses.
📦 Tool Response Optimization
get_all_balances (from banking_server)
- Current response size: ~80 tokens
- Issues found: Includes total and total_formatted even when the agent only lists per-account balances.
- Suggested optimization: Make totals optional via a parameter (e.g., include_total=false).
- Estimated savings: ~15 tokens per call (~18% reduction)
Example current vs optimized:
// Current (~80 tokens)
{
"accounts": {
"checking": {"balance": 1400.0, "formatted": "$1,400.00"},
"savings": {"balance": 3100.0, "formatted": "$3,100.00"}
},
"total": 4500.0,
"total_formatted": "$4,500.00"
}
// Optimized (~65 tokens)
{
"accounts": {
"checking": {"formatted": "$1,400.00"},
"savings": {"formatted": "$3,100.00"}
}
}
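A server-side sketch of the suggested change, assuming the banking_server tools are implemented in Python. The function signature and the include_total parameter name are assumptions drawn from the suggestion above, not the server's actual API:

```python
# Hypothetical sketch: get_all_balances handler with an opt-out flag for totals.
# `include_total` defaults to True so existing callers see no behavior change.
def get_all_balances(accounts: dict, include_total: bool = True) -> dict:
    response = {
        "accounts": {
            name: {"balance": bal, "formatted": f"${bal:,.2f}"}
            for name, bal in accounts.items()
        }
    }
    if include_total:
        total = sum(accounts.values())
        response["total"] = total
        response["total_formatted"] = f"${total:,.2f}"
    return response

# Agents that only list per-account balances can request the slimmer payload:
slim = get_all_balances({"checking": 1400.0, "savings": 3100.0}, include_total=False)
```

Defaulting include_total to True keeps the change backward compatible; only prompts updated to pass include_total=false realize the token savings.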