Single agent tests - basic report without comparison UI.

✗ 1 Failed
February 07, 2026 at 07:19 PM · 📄 scenario_01_single_agent.py · 4 tests · 37.2s · 509–1,322 tok · 🧪 $-0.016434 · 🤖 $0.0186 · 💰 $0.002150
🤖 AI Analysis
Recommended for Deploy
banking-agent
Reliable single-agent setup that passes core banking workflows with correct tool usage at very low cost. One failure is attributable to an artificially strict turn limit, not tool or prompt correctness.
Pass Rate: 75%
Total Cost: $0.002150
Tokens: 3,786
Total Tests: 4
Failures: 1
Agents: 1
Avg Turns: 2.8

❌ Failure Analysis

Failure Summary

banking-agent (1 failure)

Test | Root Cause | Fix
Test that fails due to turn limit - for report variety. | max_turns=1 prevents multi-step execution | Increase turn limit or allow tool chaining in one turn

Test that fails due to turn limit - for report variety. (banking-agent)

  • Problem: The user requested a compound workflow: check balances, transfer funds, then show updated balances and transaction history.
  • Root Cause: The test enforces max_turns=1, but the workflow inherently requires multiple tool calls and at least one follow-up response. The agent correctly initiated the first step (get_all_balances) but was cut off before completing the sequence.
  • Behavioral Mechanism: Not prompt-induced. The failure is caused by the test harness constraint, not by hesitation or permission-seeking language in the system prompt.
  • Fix: Set max_turns >= 3 in the test configuration, or refactor the test to assert partial progress when max_turns=1.
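The effect of the turn limit can be illustrated with a small simulation. This is only a sketch: `COMPOUND_PLAN`, `RunResult`, and `run_with_turn_limit` are hypothetical names, not part of the test harness, and the three-turn plan is an assumption chosen to match the suggested `max_turns >= 3`.

```python
from dataclasses import dataclass, field

# Hypothetical plan for the compound request. Each inner list holds the
# tool calls the agent would issue in one turn (an assumption, not
# something the report states).
COMPOUND_PLAN = [
    ["get_all_balances"],                      # turn 1: check balances
    ["transfer"],                              # turn 2: move the funds
    ["get_all_balances", "get_transactions"],  # turn 3: updated state
]


@dataclass
class RunResult:
    tools_called: list = field(default_factory=list)
    turns_used: int = 0
    completed: bool = False


def run_with_turn_limit(plan, max_turns):
    """Execute at most max_turns turns of the plan and report progress."""
    result = RunResult()
    for turn in plan[:max_turns]:
        result.tools_called.extend(turn)
        result.turns_used += 1
    # The workflow only counts as complete if every planned turn ran.
    result.completed = result.turns_used == len(plan)
    return result


limited = run_with_turn_limit(COMPOUND_PLAN, max_turns=1)
full = run_with_turn_limit(COMPOUND_PLAN, max_turns=3)

# A partial-progress assertion for the max_turns=1 case: the first step
# was started correctly even though the workflow could not finish.
assert limited.tools_called[0] == "get_all_balances"
assert not limited.completed
assert full.completed
```

Under these assumptions, the `max_turns=1` run stops after `get_all_balances`, which matches the observed failure, while the same plan completes once three turns are allowed.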

🔧 MCP Tool Feedback

banking-mcp

Overall, tools are clearly named and correctly invoked. The agent consistently selects the right tool without confusion.

Tool | Status | Calls | Issues
get_balance | ✅ | 1 | Working well
transfer | ✅ | 1 | Working well
get_transactions | ✅ | 1 | Working well
get_all_balances | ✅ | 1 | Working well

💡 Optimizations

# | Optimization | Priority | Estimated Savings
1 | Increase turn limit for compound tasks | recommended | Prevents the one observed failure (25% of tests)
2 | Trim verbose assistant follow-ups | suggestion | ~10–15% token reduction

1. Increase turn limit for compound tasks (recommended)

  • Current: Complex requests are tested with max_turns=1, causing premature failure.
  • Change: Set max_turns to at least 3 for workflows involving multiple tool calls.
  • Impact: Eliminates the only observed failure. The cost of the other tests is unchanged; the compound test itself will use more tokens once it is allowed to run to completion.

2. Trim verbose assistant follow-ups (suggestion)

  • Current: After successful tool calls, the assistant offers multiple optional next steps, adding tokens.
  • Change: Replace multi-option follow-ups with a single concise question, e.g., “Anything else I can help with?”
  • Impact: ~10–15% cost reduction through fewer output tokens.

📦 Tool Response Optimization

transfer (from banking-mcp)

  • Current response size: ~70–80 tokens
  • Issues found: Includes redundant formatted fields and a human-readable message that the assistant restates anyway.
  • Suggested optimization: Remove message, type, and the pre-formatted amount_formatted field when tests do not require them.
  • Estimated savings: ~25 tokens per call (~30% reduction)

Example current vs optimized:

// Current (~75 tokens)
{
  "transaction_id": "TX0001",
  "type": "transfer",
  "from_account": "checking",
  "to_account": "savings",
  "amount": 200,
  "amount_formatted": "$200.00",
  "new_balance_from": 1300.0,
  "new_balance_to": 3200.0,
  "message": "Successfully transferred $200.00 from checking to savings."
}

// Optimized (~50 tokens)
{
  "transaction_id": "TX0001",
  "from_account": "checking",
  "to_account": "savings",
  "amount": 200,
  "new_balance_from": 1300.0,
  "new_balance_to": 3200.0
}
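
The trimming shown above could be applied on the server side with a small filter. The following is a minimal sketch, not the banking-mcp implementation: `REDUNDANT_FIELDS` and `slim_transfer_response` are hypothetical names, and the field list is read off the difference between the two examples.

```python
import json

# Fields dropped in the optimized example (assumed from the comparison
# above; not taken from the banking-mcp source).
REDUNDANT_FIELDS = frozenset({"type", "amount_formatted", "message"})


def slim_transfer_response(response: dict) -> dict:
    """Return a copy of the response without the redundant fields."""
    return {k: v for k, v in response.items() if k not in REDUNDANT_FIELDS}


current = {
    "transaction_id": "TX0001",
    "type": "transfer",
    "from_account": "checking",
    "to_account": "savings",
    "amount": 200,
    "amount_formatted": "$200.00",
    "new_balance_from": 1300.0,
    "new_balance_to": 3200.0,
    "message": "Successfully transferred $200.00 from checking to savings.",
}

optimized = slim_transfer_response(current)

# Character counts are only a rough proxy for tokens; an exact figure
# would need the tokenizer of the model under test.
saved_chars = len(json.dumps(current)) - len(json.dumps(optimized))
print(f"Serialized payload is {saved_chars} characters shorter")
```

Filtering by a deny-list keeps any fields added later, so a test that depends on a new field would not break silently. An allow-list would be stricter if the response schema is fixed.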

📋 Test Results

4 / 4 tests
📋 tests/fixtures/scenario_01_single_agent.py (4 tests)
✅ Basic balance check - should pass.
13.7s · 1 🔧 · 931 tok · $0.000338
✅ Transfer money - tests the transfer tool.
7.1s · 1 🔧 · 1,024 tok · $0.000401
✅ View transactions - multiple tool calls possible.
11.9s · 1 🔧 · 1,322 tok · $0.001134
❌ Test that fails due to turn limit - for report variety.
4.5s · 1 🔧 · 509 tok · $0.000278
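
The headline figures can be cross-checked against the per-test rows. The sketch below assumes a hypothetical `ResultRow` record and `summarize` helper; only the numbers come from the report. It reproduces the 75% pass rate and the 3,786-token total. The rounded per-test costs sum to $0.002151, one millionth of a dollar above the reported $0.002150, which is presumably a rounding effect in the displayed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultRow:
    """One per-test row from the report (hypothetical structure)."""
    name: str
    passed: bool
    tokens: int
    cost_usd: float


ROWS = [
    ResultRow("Basic balance check", True, 931, 0.000338),
    ResultRow("Transfer money", True, 1024, 0.000401),
    ResultRow("View transactions", True, 1322, 0.001134),
    ResultRow("Test that fails due to turn limit", False, 509, 0.000278),
]


def summarize(rows):
    """Aggregate per-test rows into the report's summary figures."""
    passed = sum(1 for r in rows if r.passed)
    return {
        "pass_rate_pct": 100 * passed / len(rows),
        "failures": len(rows) - passed,
        "total_tokens": sum(r.tokens for r in rows),
        # Round to the six decimals the report displays.
        "total_cost_usd": round(sum(r.cost_usd for r in rows), 6),
    }


summary = summarize(ROWS)
print(summary)
```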