Core banking tests — parametrized across benchmark agents.

All Passed
February 23, 2026 at 11:17 AM
📄 test_hero.py · 18 tests · 225.4s · 830–3,146 tok
🧪 $0.0454 · 🤖 $0.0484 · 💰 $0.0938
🤖 AI Analysis

Recommended for Deploy: default

Achieves a 100% pass rate across the largest test surface (12 tests) with reliable tool-first behavior and consistent multi-turn handling, at a total cost of $0.023589.

  • Pass Rate: 100%
  • Total Cost: $0.023589
  • Tokens: 17,014
  • Total Tests: 18
  • Failures: 0
  • Eval Configs: 4
  • Avg Turns: 3.3

Comparative Analysis

Why the winner wins

  • Coverage advantage: The winner (default) passed 12 tests versus 2 tests each for concise, friendly, and detailed, demonstrating robustness across core banking, multi‑turn sessions, advisory styles, and skill-augmented advice.
  • Tool discipline: Consistently follows “tool-first” instructions (balances before advice, transfers via tools, graceful handling of insufficient funds) with no deviations.
  • Session reliability: Correctly carries context across multi-turn workflows (balance → transfer) without redundant clarifications or missed tool calls.
  • Cost-effective at scale: While per-test cost is not directly comparable across configs due to different test counts, the winner maintains low total cost while handling the widest scenario set.

Notable patterns

  • Concise prompts minimize tokens: concise produces the lowest token usage per advisory response while still calling the necessary tools.
  • Detailed prompts expand reasoning, not tool usage: detailed adds explanatory depth without increasing tool coverage—useful for UX, not for operational accuracy.
  • Model parity on correctness: Both gpt-5-mini and gpt-4.1 achieve perfect pass rates; differences are primarily in verbosity and cost, not correctness.

Alternatives

  • concise: Lowest token footprint and cost on advisory tests; trade-off is reduced explanation depth. Suitable for cost-sensitive deployments focused on quick guidance.
  • friendly: Strong UX tone with moderate token cost; useful for consumer-facing experiences where encouragement matters.
  • detailed: Rich explanations with higher token cost; best for educational contexts. Not necessary for transactional banking flows.

🔧 MCP Tool Feedback

MCPServerStdio

Overall assessment: High discoverability and correct usage. Tools are clearly named and consistently invoked correctly across models and prompts.

Tool              Status  Calls  Issues
get_balance       ✅      Many   Working well
get_all_balances  ✅      Many   Working well
transfer          ✅      Many   Working well, including error handling
get_transactions  ✅      Few    Working well
deposit           ⚠️      0      Not covered by tests
withdraw          ⚠️      0      Not covered by tests

Note: deposit and withdraw were not called, but no tests asserted their usage. This is a coverage observation, not a failure.
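A test closing this coverage gap could assert on behavior like the following sketch, which uses an in-memory stand-in for the MCP server's account state. The function signatures and the overdraft error are assumptions, mirrored from how transfer handles insufficient funds in the tested flows; they are not taken from the actual server.

```python
# Hypothetical stand-in for the MCP server's account state.
balances = {"checking": 1500.0, "savings": 3000.0}

def deposit(account: str, amount: float) -> float:
    """Add funds to an account and return the new balance."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    balances[account] += amount
    return balances[account]

def withdraw(account: str, amount: float) -> float:
    """Remove funds, rejecting overdrafts the way transfer does."""
    if amount > balances[account]:
        raise ValueError("insufficient funds")
    balances[account] -= amount
    return balances[account]
```

Adding two tests along these lines would lift both tools out of the "not covered" column without changing the server.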

📋 MCP Prompt Template Feedback

default (clear)

  • Description clarity: Clear and explicit about mandatory tool usage.
  • Arguments: N/A (instructional prompt).
  • Issue: None observed.
  • Suggested change: None.

🤖 Custom Agent Feedback

default (effective)

  • Description match: Matches expected behavior exactly—transactional, accurate, tool-driven.
  • Token count: 17,014
  • Behavioral impact: Strongly primes immediate tool invocation and avoids hallucination.
  • Gap analysis: None.
  • Suggested change: None.

concise (effective)

  • Description match: Behavior is brief, direct, and action-oriented.
  • Token count: 2,982
  • Behavioral impact: Language like “Be brief and direct” suppresses verbosity while preserving correct tool calls.
  • Gap analysis: None.
  • Suggested change: None.

detailed (effective)

  • Description match: Provides comprehensive explanations and context as instructed.
  • Token count: 4,284
  • Behavioral impact: Phrases like “explain the reasoning” and “consider multiple perspectives” increase response length without affecting correctness.
  • Gap analysis: None.
  • Suggested change: None.

friendly (effective)

  • Description match: Encouraging tone with celebratory language.
  • Token count: 3,774
  • Behavioral impact: Positive framing increases verbosity slightly but does not interfere with tool usage.
  • Gap analysis: None.
  • Suggested change: None.

📚 Skill Feedback

financial-advisor (positive)

  • Usage rate: Referenced in all skill-augmented advisory responses.
  • Token cost: Included in advisory tests (up to ~3k tokens total).
  • Problem: None—skill content directly shaped advice structure and priorities.
  • Suggested change: None. The “Always call get_all_balances first” rule is especially effective.

💡 Optimizations

#  Optimization                               Priority     Estimated Savings
1  Trim unused tool response fields           recommended  10–15% token reduction
2  Align advisory verbosity to prompt intent  suggestion   5–10% cost reduction

1. Trim unused tool response fields (recommended)

  • Current: Tool responses include both raw and formatted values plus explanatory messages.
  • Change: Remove fields not used in agent responses (e.g., duplicate message strings when balances are already restated).
  • Impact: 10–15% cost reduction from smaller tool payloads.

2. Align advisory verbosity to prompt intent (suggestion)

  • Current: friendly and detailed sometimes exceed practical length for banking dashboards.
  • Change: Add an optional instruction line: “End with a 3-bullet action summary.”
  • Impact: 5–10% cost reduction with improved scannability.
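A minimal sketch of how the optional line could be wired into a prompt template. The builder function and base prompt text here are illustrative placeholders, not the actual eval config contents:

```python
# Hypothetical prompt builder; BASE_FRIENDLY and build_prompt are
# illustrative names, not part of the real eval configs.
ACTION_SUMMARY = "End with a 3-bullet action summary."

BASE_FRIENDLY = (
    "You are a friendly banking advisor. Encourage the user and "
    "explain their options warmly."
)

def build_prompt(base: str, capped: bool = False) -> str:
    """Optionally append the verbosity-capping instruction line."""
    return f"{base}\n\n{ACTION_SUMMARY}" if capped else base
```

Keeping the line optional preserves the current friendly/detailed behavior for contexts where longer answers are wanted.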

📦 Tool Response Optimization

transfer (from MCPServerStdio)

  • Current response size: ~80–100 tokens per call
  • Issues found: Redundant fields (amount_formatted + numeric amount; verbose message)
  • Suggested optimization: Keep numeric values and new balances only.
  • Estimated savings: ~30 tokens per call (~30–35% reduction)

Example current vs optimized:

// Current
{
  "transaction_id": "TX0001",
  "type": "transfer",
  "from_account": "checking",
  "to_account": "savings",
  "amount": 100.0,
  "amount_formatted": "$100.00",
  "new_balance_from": 1400.0,
  "new_balance_to": 3100.0,
  "message": "Successfully transferred $100.00 from checking to savings."
}

// Optimized
{
  "transaction_id": "TX0001",
  "from": "checking",
  "to": "savings",
  "amount": 100.0,
  "new_from": 1400.0,
  "new_to": 3100.0
}
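The trimming above can be sketched as a small server-side post-processing step. The function name is illustrative; the field names match the example payloads shown here, but the real server's internals are an assumption:

```python
def optimize_transfer_response(resp: dict) -> dict:
    """Keep numeric values and new balances only, dropping the
    formatted amount and the message string that restates them."""
    return {
        "transaction_id": resp["transaction_id"],
        "from": resp["from_account"],
        "to": resp["to_account"],
        "amount": resp["amount"],
        "new_from": resp["new_balance_from"],
        "new_to": resp["new_balance_to"],
    }
```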

🏆 Eval Leaderboard

Rank  Eval      Tests  Pass Rate  Tokens  Cost       Duration
🥇    default   12/12  100%       17,014  $0.0236    133.8s
🥈    concise   2/2    100%       2,982   $0.004776  24.5s
🥉    friendly  2/2    100%       3,774   $0.007690  33.1s
4     detailed  2/2    100%       4,284   $0.009326  34.0s

📋 Test Results

18 / 18 tests passed

🔗 Core banking tests — parametrized across benchmark agents. (3 tests · default 3/3 · concise 0/0)
  • Check account balance. · 2,043 tok · $0.002522 · default: 19.4s
  • Transfer funds between accounts. · 2,273 tok · $0.003012 · default: 14.5s
  • Handle insufficient funds gracefully. · 2,760 tok · $0.003981 · default: 34.6s

🔗 Multi-turn session: savings transfer workflow. (2 tests · default 2/2 · concise 0/0)
  • First turn: check account balances. · 2,159 tok · $0.002707 · default: 14.3s
  • Second turn: transfer based on previous context. · 2,856 tok · $0.003543 · default: 15.3s

📋 Compare how different prompt styles affect responses. (1 test · default 0/0 · concise 1/1)
  • Compare advisory styles across prompts. · 2,982 tok · $0.004776 · concise: 24.5s

📋 Test how skills improve advice quality. (1 test · default 1/1 · concise 0/0)
  • Eval with financial advisor skill gives better advice. · 4,923 tok · $0.007823 · default: 35.7s