Three agents for testing the agent selector UI.

1 Failed

February 07, 2026 at 07:22 PM · 📄 scenario_04_agent_selector.py · 6 tests · 75.7s · 377–3,235 tok · 🧪 $-0.016951 · 🤖 $0.0261 · 💰 $0.009104
🤖 AI Analysis

Recommended for Deploy: gpt-5-mini

Delivers a perfect pass rate with reliable tool usage and the lowest total cost among agents that handled multi-step financial queries correctly.

  • Winner: 100% pass rate · $0.003737 total cost · 3,448 tokens
  • Run: 6 total tests · 1 failure · 3 agents · 2.8 avg turns

Comparative Analysis

Why the winner wins

  • 100% pass rate at lower realized cost than the skill-augmented variant while still correctly chaining get_all_balances for advice.
  • Decisive tool usage: immediately fetches balances without asking permission, avoiding extra turns and failures.
  • Balanced verbosity: provides actionable guidance without the token bloat seen in the skill variant.

Notable patterns

  • Skill increases quality but costs more: gpt-5-mini + financial-advisor adds structured advice and options, but at higher cost and token usage.
  • Permission-seeking hurts reliability: gpt-4.1-mini asked the user to provide balances or permission instead of calling tools, causing the only failure.
  • Tool descriptions were clear enough: failures stemmed from agent behavior, not tool ambiguity.

Alternatives

  • gpt-5-mini + financial-advisor: Same pass rate with richer financial guidance; trade-off is higher cost and tokens.
  • gpt-4.1-mini: Lowest cost per run, but unreliable due to permission-seeking behavior; not recommended for autonomous tool-first workflows.

❌ Failure Analysis

Failure Summary

gpt-4.1-mini (1 failure)

  • Test: Financial advice — tests differentiation between agents (skill vs no skill).
  • Root Cause: Agent asked for user-provided balances instead of calling available tools.
  • Fix: Instruct the agent to proactively fetch balances when advice depends on account data.

Financial advice — tests differentiation between agents (skill vs no skill). (gpt-4.1-mini)

  • Problem: The agent did not call get_all_balances or get_balance, causing the assertion failure (a sketch of such a check appears after this list).
  • Root Cause: The system prompt allowed or encouraged deference to the user for data retrieval.
  • Behavioral Mechanism: Language implying optionality (“If you want, I can retrieve…”) primed the model into permission-seeking mode, delaying action and skipping tool calls.
  • Fix: Add explicit instruction:

    “When a user asks for advice that depends on account data, always call the appropriate balance tool immediately without asking for permission.”
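
A minimal sketch of the kind of check that failed here, assuming the harness exposes the conversation as a list of message dicts; the transcript shape and helper name are hypothetical, not the harness's actual API:

# Hypothetical sketch: the transcript format is an assumption, not the
# harness's real API. It illustrates the assertion that failed for
# gpt-4.1-mini: no balance tool call appeared in the conversation.
BALANCE_TOOLS = {"get_balance", "get_all_balances"}

def assert_balance_tool_called(transcript: list[dict]) -> None:
    """Fail if the agent never invoked a balance tool."""
    called = {
        msg["tool_name"]
        for msg in transcript
        if msg.get("role") == "tool_call"
    }
    assert called & BALANCE_TOOLS, (
        "Agent asked the user for balances instead of calling "
        f"one of {sorted(BALANCE_TOOLS)}"
    )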

🔧 MCP Tool Feedback

banking_server

Overall, tools are discoverable and consistently used by compliant agents.

Tool              Calls  Status        Issues
get_balance       3      Working well  none
get_all_balances  2      Working well  none

📝 System Prompt Feedback

default prompt (mixed)

  • Token count: Not provided
  • Behavioral impact: Neutral wording allows model discretion; with gpt-4.1-mini this led to permission-seeking.
  • Problem: Lacks a hard requirement to act autonomously.
  • Suggested change (see the sketch after this list):

    “Do not ask the user to provide information that can be retrieved via tools. Fetch required data first, then respond.”
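
A minimal sketch of wiring that clause into the default prompt; the constant names are hypothetical, not taken from the actual fixture:

# Hypothetical sketch: prompt constants are assumptions, not the
# fixture's real definitions.
AUTONOMY_CLAUSE = (
    "Do not ask the user to provide information that can be retrieved "
    "via tools. Fetch required data first, then respond."
)

BASE_PROMPT = "You are a helpful banking assistant."

# Put the hard requirement first so it reads as a rule, not a suggestion.
SYSTEM_PROMPT = f"{AUTONOMY_CLAUSE}\n\n{BASE_PROMPT}"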

financial-advisor (effective with gpt-5-mini)

  • Token count: Not provided
  • Behavioral impact: Encourages structured, proactive financial guidance.
  • Problem: Higher verbosity than necessary for tests.
  • Suggested change: Remove optional exploratory prompts unless user asks for deeper planning.

📚 Skill Feedback

financial-advisor (positive)

  • Usage rate: High in advice responses
  • Token cost: Increased vs no-skill agent
  • Problem: Some boilerplate suggestions not required for test success.
  • Suggested change: Split the skill into “core rules” and “extended guidance,” and load extended guidance only when requested (see the sketch below).
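
A minimal sketch of that split, assuming skills are loaded as plain text; the file layout and loader function are hypothetical:

# Hypothetical sketch: the skill file layout and loader are assumptions.
# Core rules always load; extended guidance loads only on request.
from pathlib import Path

SKILL_DIR = Path("skills/financial-advisor")

def load_skill(include_extended: bool = False) -> str:
    """Return the skill text, gating the verbose half behind a flag."""
    text = (SKILL_DIR / "core_rules.md").read_text()
    if include_extended:  # e.g. the user explicitly asks for deeper planning
        text += "\n\n" + (SKILL_DIR / "extended_guidance.md").read_text()
    return text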

💡 Optimizations

#  Optimization                            Priority     Estimated Savings
1  Enforce autonomous tool-first behavior  recommended  Avoids failures; fewer retries
2  Trim skill verbosity                    suggestion   ~15% cost reduction
3  Compact tool responses                  info         ~10–20 tokens per call

1. Enforce autonomous tool-first behavior (recommended)

  • Current: Some agents ask permission before calling tools.
  • Change: Add explicit system instruction to always fetch required data first.
  • Impact: Reliability improvement; prevents test failures.

2. Trim skill verbosity (suggestion)

  • Current: Financial-advisor skill includes optional planning prompts.
  • Change: Remove or gate extended advice.
  • Impact: ~15% cost reduction.

3. Compact tool responses (info)

  • Current: Tool responses include both raw and formatted fields.
  • Change: Return only one representation where possible.
  • Impact: ~10–20 token savings per call.

📦 Tool Response Optimization

get_all_balances (from banking_server)

  • Current response: includes duplicate numeric and pre-formatted string fields.
  • Issues found: Redundant formatted fields increase tokens.
  • Suggested optimization: Return numeric balances only; format in the assistant (see the sketch after the example below).
  • Estimated savings: ~20 tokens per call

Example current vs optimized:

// Current
{
  "accounts": {
    "checking": {"balance": 1500.0, "formatted": "$1,500.00"},
    "savings": {"balance": 3000.0, "formatted": "$3,000.00"}
  },
  "total": 4500.0,
  "total_formatted": "$4,500.00"
}

// Optimized
{
  "accounts": {
    "checking": 1500.0,
    "savings": 3000.0
  },
  "total": 4500.0
}
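
A minimal sketch of the corresponding handler change on banking_server, assuming a Python tool function and an in-memory account store (both hypothetical); only the response shape matches the optimized example above:

# Hypothetical sketch: the account store and function shape are
# assumptions; MCP tool registration is omitted. The point is the
# payload: numeric values only, no pre-formatted "$4,500.00" duplicates.
ACCOUNTS = {"checking": 1500.0, "savings": 3000.0}

def get_all_balances() -> dict:
    """Return numeric balances only; the assistant formats for display."""
    return {
        "accounts": dict(ACCOUNTS),
        "total": sum(ACCOUNTS.values()),
    }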

🏆 Agent Leaderboard

Agent                             Tests  Pass Rate  Tokens  Cost       Duration
🥇 gpt-5-mini                      2/2    100%       3,448   $0.003737  31.1s
🥈 gpt-5-mini + financial-advisor  2/2    100%       4,908   $0.004823  33.3s
🥉 gpt-4.1-mini                    1/2    50%        1,113   $0.000544  11.3s

📋 Test Results

📋 tests/fixtures/scenario_04_agent_selector.py (2 tests × 3 agents = 6 runs)

Basic balance query — all agents should pass.
  Total: 2,595 tok · $0.001224
  gpt-5-mini: 11.1s · gpt-5-mini + financial-advisor: 8.5s

Financial advice — tests differentiation between agents (skill vs no skill).
  Total: 5,761 tok · $0.007336
  gpt-5-mini: 20.1s · gpt-5-mini + financial-advisor: 24.8s