Two agents compared side-by-side.

βœ— 1 Failed
February 07, 2026 at 07:20 PM Β· πŸ“„ scenario_02_multi_agent.py Β· 6 tests Β· 55.8s Β· 738–2,232 tok Β· πŸ§ͺ $-0.019128 Β· πŸ€– $0.0237 Β· πŸ’° $0.004565
πŸ€– AI Analysis

Recommended for Deploy: gpt-5-mini

Achieves a 100% pass rate while correctly executing multi-step tool flows and error recovery. Slightly higher cost than the alternative, but the only configuration that consistently follows the required tool-first behavior in failure scenarios.

  • Pass Rate: 100%
  • Total Cost: $0.003314
  • Tokens: 4,589
  • Total Tests: 6
  • Failures: 1
  • Agents: 2
  • Avg Turns: 3.2

Comparative Analysis

Why the winner wins

  • Only agent with correct error-path tool usage: gpt-5-mini is the only configuration that attempted the withdraw tool even when funds were insufficient, satisfying the test’s requirement to exercise error recovery logic.
  • Reliable multi-step chaining: Successfully chains transfer β†’ get_all_balances without hesitation or re-asking the user, matching the intended MCP workflow.
  • Higher correctness outweighs marginal cost: Although gpt-4.1-mini is cheaper in absolute terms, its 67% pass rate makes it unsuitable for deployment where correctness is mandatory.

Notable patterns

  • Cheaper model avoided β€œfailing” tools: gpt-4.1-mini inferred insufficiency from get_balance and responded conversationally instead of calling withdraw, indicating a bias toward helpful explanation over tool execution.
  • Error tests expose prompt–model interaction: The insufficient-funds scenario is the only place where the models diverge, suggesting that tool-error paths are the critical discriminator.

Alternatives

  • gpt-4.1-mini: Lower total cost, but failed a core error-recovery test because it did not call the required tool. Not recommended unless the prompt is adjusted to force tool invocation on withdrawal intents.

❌ Failure Analysis

Failure Summary

gpt-4.1-mini (1 failure)

| Test | Root Cause | Fix |
|---|---|---|
| Insufficient funds β€” tests error recovery | Model chose to explain insufficiency instead of invoking withdraw | Update system prompt to mandate calling withdraw for any withdrawal request, even if funds appear insufficient |

Insufficient funds β€” tests error recovery (gpt-4.1-mini)

  • Problem: The agent responded with a balance explanation and options without ever calling the withdraw tool.
  • Root Cause: The model inferred the failure outcome after calling get_balance and optimized for a helpful natural-language response rather than executing the requested action via the tool.
  • Behavioral Mechanism: The absence of explicit language like β€œalways attempt the requested banking action via tools” allows the model to short-circuit. Phrases implying helpfulness or user guidance (e.g., offering alternatives) prime it to stop after explanation instead of acting.
  • Fix: Add an explicit instruction to the system prompt:

    β€œFor any user request to withdraw funds, you MUST call the withdraw tool exactly once, even if you expect it to fail due to insufficient funds.”
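The quoted rule could be wired into the agent setup roughly as follows. This is a minimal sketch: the `build_system_prompt` helper and the base prompt text are hypothetical, and only the quoted rule itself comes from this report.

```python
# Sketch: appending the mandated tool-first rule to an agent's system prompt.
# BASE_PROMPT and the helper are illustrative assumptions.

BASE_PROMPT = "You are a banking assistant with access to MCP banking tools."

TOOL_FIRST_RULE = (
    "For any user request to withdraw funds, you MUST call the withdraw tool "
    "exactly once, even if you expect it to fail due to insufficient funds."
)

def build_system_prompt(base: str = BASE_PROMPT) -> str:
    """Combine the base prompt with the explicit tool-first rule."""
    return f"{base}\n\n{TOOL_FIRST_RULE}"

print(build_system_prompt())
```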

πŸ”§ MCP Tool Feedback

banking_server

Overall, tools are discoverable and consistently named. Error responses are clear, enabling the agent to recover and explain outcomes.

| Tool | Status | Calls | Issues |
|---|---|---|---|
| get_balance | βœ… | 3 | Working well |
| get_all_balances | βœ… | 3 | Working well |
| transfer | βœ… | 2 | Working well |
| withdraw | βœ… | 1 | Error surfaced correctly |

πŸ’‘ Optimizations

| # | Optimization | Priority | Estimated Savings |
|---|---|---|---|
| 1 | Force tool-first behavior for withdrawals | recommended | Prevents 33% failure rate |
| 2 | Trim unused fields from balance responses | suggestion | ~15% token reduction |

1. Force tool-first behavior for withdrawals (recommended)

  • Current: Models may infer failure from balances and skip calling withdraw.
  • Change: Explicitly require a withdraw tool call for any withdrawal intent.
  • Impact: Eliminates the observed failure mode; improves pass rate from 67% to 100% for gpt-4.1-mini.

2. Trim unused fields from balance responses (suggestion)

  • Current: Tools return both raw and formatted totals that are not always referenced.
  • Change: Remove aggregate totals when per-account balances are sufficient.
  • Impact: ~15% token reduction in multi-balance responses.

πŸ“¦ Tool Response Optimization

get_all_balances (from banking_server)

  • Current response size: ~80 tokens
  • Issues found: Includes total and total_formatted even when the agent only lists per-account balances.
  • Suggested optimization: Make totals optional via a parameter (e.g., include_total=false).
  • Estimated savings: ~15 tokens per call (~18% reduction)

Example current vs optimized:

// Current (~80 tokens)
{
  "accounts": {
    "checking": {"balance": 1400.0, "formatted": "$1,400.00"},
    "savings": {"balance": 3100.0, "formatted": "$3,100.00"}
  },
  "total": 4500.0,
  "total_formatted": "$4,500.00"
}

// Optimized (~65 tokens, include_total=false)
{
  "accounts": {
    "checking": {"balance": 1400.0, "formatted": "$1,400.00"},
    "savings": {"balance": 3100.0, "formatted": "$3,100.00"}
  }
}
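On the server side, the suggested include_total parameter could look roughly like this. This is a minimal sketch with illustrative account data, not the actual banking_server implementation.

```python
# Sketch of get_all_balances with an opt-out for aggregate totals.
# ACCOUNTS and the handler shape are illustrative assumptions.

ACCOUNTS = {"checking": 1400.0, "savings": 3100.0}

def get_all_balances(include_total: bool = True) -> dict:
    """Return per-account balances; emit totals only when requested."""
    response = {
        "accounts": {
            name: {"balance": bal, "formatted": f"${bal:,.2f}"}
            for name, bal in ACCOUNTS.items()
        }
    }
    if include_total:
        total = sum(ACCOUNTS.values())
        response["total"] = total
        response["total_formatted"] = f"${total:,.2f}"
    return response

# With include_total=False the aggregate fields are dropped entirely,
# trimming the response for agents that only list per-account balances.
slim = get_all_balances(include_total=False)
```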

πŸ† Agent Leaderboard

| Rank | Agent | Tests | Pass Rate | Tokens | Cost | Duration |
|---|---|---|---|---|---|---|
| πŸ₯‡ | gpt-5-mini | 3/3 | 100% | 4,589 | $0.003314 | 35.3s |
| πŸ₯ˆ | gpt-4.1-mini | 2/3 | 67% | 2,495 | $0.001251 | 20.5s |

πŸ“‹ Test Results

6 / 6 tests
πŸ“‹ tests/fixtures/scenario_02_multi_agent.py (3 tests) Β· gpt-5-mini: 3/3 Β· gpt-4.1-mini: 2/3

βœ… Basic balance query β€” all agents should pass.
Total 1,660 tok Β· Total $0.000664 Β· Ξ” +25% Β· Ξ” +59% Β· Ξ” +8%
  • gpt-5-mini: βœ… 11.0s
  • gpt-4.1-mini: βœ… 6.9s

βœ… Transfer with verification β€” tests multi-step tool use.
Total 2,427 tok Β· Total $0.001545 Β· Ξ” +45% Β· Ξ” +24% Β· Ξ” +93%
  • gpt-5-mini: βœ… 10.0s
  • gpt-4.1-mini: βœ… 8.1s

⚑ Insufficient funds β€” tests error recovery.
Total 2,997 tok Β· Total $0.002356 Β· Ξ” +192% Β· Ξ” +161% Β· Ξ” +419%
  • gpt-5-mini: βœ… 14.3s
  • gpt-4.1-mini: ❌ 5.5s