Recommended for Deployment
gpt-5-mini
Delivers a perfect pass rate with reliable tool usage and the lowest total cost among agents that handled multi-step financial queries correctly.
- Pass rate: 100%
- Total cost: $0.003737
- Tokens: 3,448
Comparative Analysis
Why the winner wins
- 100% pass rate at lower realized cost than the skill-augmented variant while still correctly chaining `get_all_balances` for advice.
- Decisive tool usage: immediately fetches balances without asking permission, avoiding extra turns and failures.
- Balanced verbosity: provides actionable guidance without the token bloat seen in the skill variant.
Notable patterns
- Skill increases quality but costs more: gpt-5-mini + financial-advisor adds structured advice and options, but at higher cost and token count.
- Permission-seeking hurts reliability: gpt-4.1-mini asked the user to provide balances or grant permission instead of calling the available tools, causing the only failure.
- Tool descriptions were clear enough: failures stemmed from agent behavior, not tool ambiguity.
Alternatives
- gpt-5-mini + financial-advisor: Same pass rate with richer financial guidance; trade-off is higher cost and tokens.
- gpt-4.1-mini: Lowest cost per run, but unreliable due to permission-seeking behavior; not recommended for autonomous tool-first workflows.
❌ Failure Analysis
Failure Summary
gpt-4.1-mini (1 failure)
| Test | Root Cause | Fix |
| --- | --- | --- |
| Financial advice — tests differentiation between agents (skill vs no skill). | Agent asked for user-provided balances instead of calling available tools. | Instruct the agent to proactively fetch balances when advice depends on account data. |
Financial advice — tests differentiation between agents (skill vs no skill). (gpt-4.1-mini)
- Problem: The agent did not call `get_all_balances` or `get_balance`, causing the assertion failure.
- Root Cause: The system prompt allowed or encouraged deference to the user for data retrieval.
- Behavioral Mechanism: Language implying optionality (“If you want, I can retrieve…”) primed the model into permission-seeking mode, delaying action and skipping tool calls.
- Fix: Add an explicit instruction: “When a user asks for advice that depends on account data, always call the appropriate balance tool immediately without asking for permission.” (See the prompt sketch below.)
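A minimal sketch of how that rule could be folded into the agent's system prompt. The `build_system_prompt` helper and `BASE_PROMPT` placeholder are hypothetical and not part of the evaluated configuration; only the quoted instruction comes from the fix above.

```python
# Hypothetical wiring: keep the existing base prompt and append the tool-first rule.
BASE_PROMPT = "You are a banking assistant with access to account tools."  # placeholder, not the real prompt

TOOL_FIRST_RULE = (
    "When a user asks for advice that depends on account data, always call the "
    "appropriate balance tool (get_balance or get_all_balances) immediately, "
    "without asking for permission."
)

def build_system_prompt(base: str = BASE_PROMPT) -> str:
    """Return the base prompt with the explicit tool-first instruction appended."""
    return f"{base}\n\n{TOOL_FIRST_RULE}"
```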
🔧 MCP Tool Feedback
banking_server
Overall, tools are discoverable and consistently used by compliant agents.
| Tool | Status | Calls | Issues |
| --- | --- | --- | --- |
| get_balance | ✅ | 3 | Working well |
| get_all_balances | ✅ | 2 | Working well |
📝 System Prompt Feedback
default prompt (mixed)
- Token count: Not provided
- Behavioral impact: Neutral wording allows model discretion; with gpt-4.1-mini this led to permission-seeking.
- Problem: Lacks a hard requirement to act autonomously.
- Suggested change: “Do not ask the user to provide information that can be retrieved via tools. Fetch required data first, then respond.”
financial-advisor (effective with gpt-5-mini)
- Token count: Not provided
- Behavioral impact: Encourages structured, proactive financial guidance.
- Problem: Higher verbosity than necessary for tests.
- Suggested change: Remove optional exploratory prompts unless the user asks for deeper planning.
📚 Skill Feedback
financial-advisor (positive)
- Usage rate: High in advice responses
- Token cost: Increased vs no-skill agent
- Problem: Some boilerplate suggestions are not required for test success.
- Suggested change: Split the skill into “core rules” and “extended guidance,” and load the extended guidance only when requested (see the loader sketch below).
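One possible shape for that split, sketched with a hypothetical file layout and loader; none of these paths or function names exist in the current skill package.

```python
from pathlib import Path

# Hypothetical layout for the split skill:
#   skills/financial-advisor/core_rules.md         -- always loaded
#   skills/financial-advisor/extended_guidance.md  -- loaded only on explicit request
SKILL_DIR = Path("skills/financial-advisor")  # assumed location

def load_financial_advisor_skill(wants_deep_planning: bool) -> str:
    """Always include the core rules; append extended guidance only when the user asks for deeper planning."""
    text = (SKILL_DIR / "core_rules.md").read_text()
    if wants_deep_planning:
        text += "\n\n" + (SKILL_DIR / "extended_guidance.md").read_text()
    return text
```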
💡 Optimizations
| # | Optimization | Priority | Estimated Savings |
| --- | --- | --- | --- |
| 1 | Enforce autonomous tool-first behavior | recommended | Avoids failures; fewer retries |
| 2 | Trim skill verbosity | suggestion | ~15% cost reduction |
| 3 | Compact tool responses | info | ~10–20 tokens per call |
1. Enforce autonomous tool-first behavior (recommended)
- Current: Some agents ask permission before calling tools.
- Change: Add explicit system instruction to always fetch required data first.
- Impact: Reliability improvement; prevents test failures.
2. Trim skill verbosity (suggestion)
- Current: Financial-advisor skill includes optional planning prompts.
- Change: Remove or gate extended advice.
- Impact: ~15% cost reduction.
3. Compact tool responses (info)
- Current: Tool responses include both raw and formatted fields.
- Change: Return only one representation where possible.
- Impact: ~10–20 token savings per call.
📦 Tool Response Optimization
get_all_balances (from banking_server)
- Current response size: Each account carries both a numeric balance and a pre-formatted currency string, plus a duplicated formatted total.
- Issues found: Redundant `formatted` fields increase token usage.
- Suggested optimization: Return numeric balances only and format in the assistant (see the formatting sketch after the example below).
- Estimated savings: ~20 tokens per call
Example current vs optimized:
Current:

```json
{
  "accounts": {
    "checking": {"balance": 1500.0, "formatted": "$1,500.00"},
    "savings": {"balance": 3000.0, "formatted": "$3,000.00"}
  },
  "total": 4500.0,
  "total_formatted": "$4,500.00"
}
```

Optimized:

```json
{
  "accounts": {
    "checking": 1500.0,
    "savings": 3000.0
  },
  "total": 4500.0
}
```
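If the `formatted` fields are dropped, the assistant (or client code) can reproduce the display strings from the numeric values alone. A minimal sketch, assuming the optimized payload shape above and standard US-dollar formatting; the `format_usd` helper is hypothetical.

```python
def format_usd(amount: float) -> str:
    """Render a numeric balance as a US-dollar string, e.g. 1500.0 -> "$1,500.00"."""
    return f"${amount:,.2f}"

# Example payload matching the optimized shape above (values are illustrative).
payload = {"accounts": {"checking": 1500.0, "savings": 3000.0}, "total": 4500.0}

for name, balance in payload["accounts"].items():
    print(f"{name}: {format_usd(balance)}")
print(f"total: {format_usd(payload['total'])}")
```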