Core banking tests — parametrized across benchmark agents.

All Passed
February 15, 2026 at 12:03 PM · 📄 test_hero.py · 36 tests · 2750.9s · 828–3,496 tok · 🧪 N/A · 💰 N/A
🤖 AI Analysis

Recommended for Deploy: gpt-5-mini
Achieves a flawless 100% pass rate with perfect iteration consistency, reliable tool usage across single- and multi-turn workflows, and zero recorded cost. Strong balance of speed, correctness, and prompt-following makes it the safest default deployment.
100% Pass Rate · $0.000000 Total Cost · 15,318 Tokens · 36 Total Tests · 0 Failures · 10 Agents · 3.3 Avg Turns

Comparative Analysis

Why the winner wins

  • Perfect reliability: gpt-5-mini passed 100% of tests and 100% of iterations (10/10), including error handling and multi-turn context carryover.
  • Tool discipline: Consistently called the correct MCP tools first (get_balance, get_all_balances, transfer) and never hallucinated balances, exactly following the default system prompt.
  • Robust edge-case handling: In insufficient-funds scenarios, it gracefully explained failures and suggested valid alternatives without retry loops or unsafe assumptions.
  • Prompt versatility: Maintained correctness across default, concise, detailed, friendly, and skill-augmented configurations.

Consistency (iterations only)

  • No flakiness detected: All agents, including the winner, achieved 100% iteration pass rates. No test fell below full iteration success.

Notable patterns

  • Model vs verbosity trade-off: gpt-4.1 often used fewer tokens in advisory-style prompts, but did not outperform gpt-5-mini on capability or reliability.
  • Prompt-driven verbosity, not correctness: “Detailed” and “friendly” prompts significantly increased token usage but did not degrade tool usage or accuracy.
  • Skill amplification: The financial-advisor skill reliably injected structured, principle-based advice (emergency fund, 50/30/20 rule) without interfering with tool calls.
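As a quick illustration of the 50/30/20 rule the financial-advisor skill cites (a common budgeting heuristic, not part of the test suite): allocate 50% of after-tax income to needs, 30% to wants, and 20% to savings. A minimal sketch:

```python
def split_50_30_20(income: float) -> dict:
    """Split after-tax income per the 50/30/20 budgeting rule."""
    return {
        "needs": round(income * 0.50, 2),
        "wants": round(income * 0.30, 2),
        "savings": round(income * 0.20, 2),
    }

# Example: a $4,000 monthly after-tax income.
print(split_50_30_20(4000.0))  # {'needs': 2000.0, 'wants': 1200.0, 'savings': 800.0}
```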

Alternatives

  • gpt-4.1: Equal pass rate and iteration consistency, with lower token usage on many prompts. A strong alternative if token efficiency becomes the deciding factor.
  • Prompt variants (concise/friendly/detailed): All viable. Choose based on UX needs rather than correctness—none introduced failures.
  • Skill-augmented agents: Both models with the financial-advisor skill performed well; use when higher-quality financial guidance is desired.

🔧 MCP Tool Feedback

MCPServerStdio

Overall, tools were easy to discover and were used correctly and consistently. Error messages were clear enough for agents to recover gracefully.

Tool               Status  Calls  Issues
get_balance                Many   Working well
get_all_balances           Many   Working well
transfer                   Many   Clear error on insufficient funds enabled good recovery
get_transactions           Few    Returned empty data cleanly
deposit            ⚠️       0      Not exercised by tests
withdraw           ⚠️       0      Not exercised by tests

Note: deposit and withdraw were never called. This is a test coverage observation, not a defect.

📝 System Prompt Feedback

default (effective)

  • Behavioral impact: The explicit directive “ALWAYS call the appropriate tool first” strongly anchored the agents to correct tool usage and prevented hallucination.
  • Problem: None observed.
  • Suggested change: None — this prompt is a solid baseline.

concise / detailed / friendly (effective, model-agnostic)

  • Behavioral impact:
      • concise: Encouraged direct answers and minimal verbosity without skipping tools.
      • detailed: Increased deliberation and explanation length but did not cause hesitation or permission-seeking.
      • friendly: Added warmth and occasional extra tool calls (e.g., get_transactions) without harming correctness.
  • Problem: Token bloat in detailed/friendly variants.
  • Suggested change: Add a single constraint line to verbose prompts:

    “Do not call additional tools unless they materially change the recommendation.”
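One way to apply that change is to append the constraint only to the verbose variants when assembling the prompt set. This is a hypothetical sketch: the prompt texts below are placeholders, not the benchmark's actual system prompts.

```python
# The constraint line suggested above.
TOOL_CONSTRAINT = (
    "Do not call additional tools unless they materially change the recommendation."
)

# Placeholder prompt texts (assumptions, not the real benchmark prompts).
PROMPTS = {
    "concise": "Answer banking questions directly and briefly.",
    "detailed": "Explain your reasoning step by step for each banking question.",
    "friendly": "Be warm and conversational while helping with banking questions.",
}

# Only the verbose variants get the extra constraint line.
for name in ("detailed", "friendly"):
    PROMPTS[name] = f"{PROMPTS[name]}\n\n{TOOL_CONSTRAINT}"

print(PROMPTS["detailed"].endswith(TOOL_CONSTRAINT))  # True
```

Keeping the constraint out of the concise variant preserves its already-minimal token profile.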

📚 Skill Feedback

financial-advisor (positive)

  • Usage rate: High — principles were consistently reflected in responses.
  • Token cost: Moderate increase due to structured explanations.
  • Problem: None observed; content aligned well with user questions.
  • Suggested change: Consider a short “Quick take” section at the top of the skill instructions to encourage brief summaries before deep dives.

💡 Optimizations

#  Optimization                                   Priority     Estimated Savings
1  Limit optional tool calls in advisory prompts  recommended  10–15% cost reduction
2  Trim verbose prompt variants                   suggestion   10% fewer tokens
3  Add tests for deposit/withdraw                 info         Coverage improvement

1. Limit optional tool calls in advisory prompts (recommended)

  • Current: Friendly and detailed prompts sometimes trigger extra, non-essential tool calls.
  • Change: Explicitly restrict tool usage to cases where results affect recommendations.
  • Impact: ~10–15% cost reduction from fewer tool calls and shorter responses.

2. Trim verbose prompt variants (suggestion)

  • Current: Detailed/friendly prompts encourage long explanations by default.
  • Change: Add a “summary first, details on request” instruction.
  • Impact: ~10% fewer tokens per advisory response.

3. Add tests for deposit/withdraw (info)

  • Current: These tools are unexercised.
  • Change: Add one happy-path and one error-path test for each.
  • Impact: Improved confidence and coverage; no direct cost savings.
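A sketch of what that deposit/withdraw coverage could look like. The real suite drives these tools through the MCP server; this stand-in models the expected tool behavior directly so the happy-path and error-path shapes are concrete. The `Bank` class and its error message are assumptions, not the benchmark's API.

```python
class InsufficientFunds(Exception):
    """Raised when a withdrawal exceeds the account balance."""


class Bank:
    """In-memory stand-in for the banking backend behind the MCP tools."""

    def __init__(self, balances):
        self.balances = dict(balances)

    def deposit(self, account, amount):
        self.balances[account] += amount
        return self.balances[account]

    def withdraw(self, account, amount):
        if amount > self.balances[account]:
            raise InsufficientFunds(f"balance is {self.balances[account]}")
        self.balances[account] -= amount
        return self.balances[account]


def test_deposit_happy_path():
    bank = Bank({"checking": 100.0})
    assert bank.deposit("checking", 50.0) == 150.0


def test_withdraw_insufficient_funds():
    bank = Bank({"checking": 100.0})
    try:
        bank.withdraw("checking", 500.0)
    except InsufficientFunds as err:
        assert "balance is 100.0" in str(err)
    else:
        raise AssertionError("expected InsufficientFunds")
```

In the actual suite these would be parametrized across agents like the existing tests, with the agent invoking the deposit/withdraw tools rather than calling a backend directly.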

🏆 Agent Leaderboard

Rank  Agent                           Tests  Pass Rate  Tokens  Cost  Duration
🥇     gpt-4.1                         10/10  100%       10,577  N/A   58.8s
🥈     gpt-5-mini                      10/10  100%       15,318  N/A   1277.4s
🥉     gpt-4.1 + concise               2/2    100%       1,661   N/A   11.7s
4     gpt-4.1 + detailed              2/2    100%       2,291   N/A   14.5s
5     gpt-4.1 + financial-advisor     2/2    100%       1,750   N/A   19.4s
6     gpt-4.1 + friendly              2/2    100%       2,083   N/A   10.9s
7     gpt-5-mini + concise            2/2    100%       3,937   N/A   25.6s
8     gpt-5-mini + detailed           2/2    100%       6,812   N/A   46.3s
9     gpt-5-mini + financial-advisor  2/2    100%       5,678   N/A   1240.9s
10    gpt-5-mini + friendly           2/2    100%       5,242   N/A   45.3s

📋 Test Results

36 / 36 tests
🔗 Core banking tests — parametrized across benchmark agents. (3 tests) — 3/3 · 3/3

  • Check account balance.
    Total 4,047 tok · Total N/A · Δ +18% · Δ +44% · Δ +0%
    gpt-4.1: 8.8s · gpt-5-mini: 12.7s
  • Transfer funds between accounts.
    Total 4,762 tok · Total N/A · Δ +28% · Δ +36% · Δ +0%
    gpt-4.1: 8.4s · gpt-5-mini: 11.5s
  • Handle insufficient funds gracefully.
    Total 6,960 tok · Total N/A · Δ +155% · Δ +5135% · Δ +0%
    gpt-4.1: 23.6s · gpt-5-mini: 1232.9s
🔗 Multi-turn session: savings transfer workflow. (2 tests) — 2/2 · 2/2

  • First turn: check account balances.
    Total 4,340 tok · Total N/A · Δ +19% · Δ +16% · Δ +0%
    gpt-4.1: 10.6s · gpt-5-mini: 9.1s
  • Second turn: transfer based on previous context.
    Total 5,786 tok · Total N/A · Δ +15% · Δ +50% · Δ +0%
    gpt-4.1: 7.4s · gpt-5-mini: 11.1s
📋 Compare how different prompt styles affect responses. (1 test) — 0/0 · 0/0

  • Compare advisory styles across prompts.
📋 Test how skills improve advice quality. (1 test) — 0/0 · 0/0

  • Agent with financial advisor skill gives better advice.