banking-agent (1 failure)
| Test | Root Cause | Fix |
|---|---|---|
| Test that fails due to turn limit β for report variety. | max_turns=1 prevents multi-step execution | Increase turn limit or allow tool chaining in one turn |
max_turns=1, but the workflow inherently requires multiple tool calls and at least one follow-up response. The agent correctly initiated the first step (get_all_balances) but was cut off before completing the sequence.max_turns >= 3, or refactor the test to assert partial progress when max_turns=1.Overall, tools are clearly named and correctly invoked. The agent consistently selects the right tool without confusion.
| Tool | Status | Calls | Issues |
|---|---|---|---|
| get_balance | β | 1 | Working well |
| transfer | β | 1 | Working well |
| get_transactions | β | 1 | Working well |
| get_all_balances | β | 1 | Working well |
| # | Optimization | Priority | Estimated Savings |
|---|---|---|---|
| 1 | Increase turn limit for compound tasks | recommended | Prevents 25% test failure rate |
| 2 | Trim verbose assistant follow-ups | suggestion | ~10β15% token reduction |
max_turns=1, causing premature failure.max_turns to at least 3 for workflows involving multiple tool calls.message and pre-formatted fields when not required by tests.Example current vs optimized:
// Current (~75 tokens)
{
"transaction_id": "TX0001",
"type": "transfer",
"from_account": "checking",
"to_account": "savings",
"amount": 200,
"amount_formatted": "$200.00",
"new_balance_from": 1300.0,
"new_balance_to": 3200.0,
"message": "Successfully transferred $200.00 from checking to savings."
}
// Optimized (~50 tokens)
{
"transaction_id": "TX0001",
"from_account": "checking",
"to_account": "savings",
"amount": 200,
"new_balance_from": 1300.0,
"new_balance_to": 3200.0
}