Three agents for testing the agent selector UI.

1 Failed

February 07, 2026 at 07:22 PM · 📄 scenario_04_agent_selector.py · 6 tests · 75.7s · 377–3,235 tok · 🧪 $-0.016951 · 🤖 $0.0261 · 💰 $0.009104
🤖 AI Analysis

Recommended for Deploy: gpt-5-mini

Delivers a perfect pass rate with reliable tool usage and the lowest total cost among agents that handled multi-step financial queries correctly.

  • Winner: 100% pass rate · $0.003737 total cost · 3,448 tokens
  • Run: 6 total tests · 1 failure · 3 agents · 2.8 avg turns

Comparative Analysis

Why the winner wins

  • 100% pass rate at lower realized cost than the skill-augmented variant while still correctly chaining get_all_balances for advice.
  • Decisive tool usage: immediately fetches balances without asking permission, avoiding extra turns and failures.
  • Balanced verbosity: provides actionable guidance without the token bloat seen in the skill variant.

Notable patterns

  • Skill increases quality but costs more: gpt-5-mini + financial-advisor adds structured advice and options, but at higher cost and token usage.
  • Permission-seeking hurts reliability: gpt-4.1-mini asked the user to provide balances or permission instead of calling tools, causing the only failure.
  • Tool descriptions were clear enough: failures stemmed from agent behavior, not tool ambiguity.

Alternatives

  • gpt-5-mini + financial-advisor: Same pass rate with richer financial guidance; trade-off is higher cost and tokens.
  • gpt-4.1-mini: Lowest cost per run, but unreliable due to permission-seeking behavior; not recommended for autonomous tool-first workflows.

❌ Failure Analysis

Failure Summary

gpt-4.1-mini (1 failure)

  • Test: Financial advice — tests differentiation between agents (skill vs no skill).
  • Root Cause: Agent asked for user-provided balances instead of calling available tools.
  • Fix: Instruct the agent to proactively fetch balances when advice depends on account data.

Financial advice — tests differentiation between agents (skill vs no skill). (gpt-4.1-mini)

  • Problem: The agent did not call get_all_balances or get_balance, causing the assertion failure (a sketch of such a check appears after this list).
  • Root Cause: The system prompt allowed or encouraged deference to the user for data retrieval.
  • Behavioral Mechanism: Language implying optionality (“If you want, I can retrieve…”) primed the model into permission-seeking mode, delaying action and skipping tool calls.
  • Fix: Add explicit instruction:

    “When a user asks for advice that depends on account data, always call the appropriate balance tool immediately without asking for permission.”
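
A minimal sketch of the kind of check that failed here, assuming the harness exposes the conversation as a list of message dicts; the transcript shape and helper name are hypothetical, not the harness's actual API:

# Hypothetical sketch: the transcript format is an assumption, not the
# harness's real API. It illustrates the assertion that failed for
# gpt-4.1-mini: no balance tool call appeared in the conversation.
BALANCE_TOOLS = {"get_balance", "get_all_balances"}

def assert_balance_tool_called(transcript: list[dict]) -> None:
    """Fail if the agent never invoked a balance tool."""
    called = {
        msg["tool_name"]
        for msg in transcript
        if msg.get("role") == "tool_call"
    }
    assert called & BALANCE_TOOLS, (
        "Agent asked the user for balances instead of calling "
        f"one of {sorted(BALANCE_TOOLS)}"
    )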

🔧 MCP Tool Feedback

banking_server

Overall, tools are discoverable and consistently used by compliant agents.

Tool              Calls  Status        Issues
get_balance       3      Working well  none
get_all_balances  2      Working well  none

📝 System Prompt Feedback

default prompt (mixed)

  • Token count: Not provided
  • Behavioral impact: Neutral wording allows model discretion; with gpt-4.1-mini this led to permission-seeking.
  • Problem: Lacks a hard requirement to act autonomously.
  • Suggested change (see the sketch after this list):

    “Do not ask the user to provide information that can be retrieved via tools. Fetch required data first, then respond.”
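
A minimal sketch of wiring that clause into the default prompt; the constant names are hypothetical, not taken from the actual fixture:

# Hypothetical sketch: prompt constants are assumptions, not the
# fixture's real definitions.
AUTONOMY_CLAUSE = (
    "Do not ask the user to provide information that can be retrieved "
    "via tools. Fetch required data first, then respond."
)

BASE_PROMPT = "You are a helpful banking assistant."

# Put the hard requirement first so it reads as a rule, not a suggestion.
SYSTEM_PROMPT = f"{AUTONOMY_CLAUSE}\n\n{BASE_PROMPT}"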

financial-advisor (effective with gpt-5-mini)

  • Token count: Not provided
  • Behavioral impact: Encourages structured, proactive financial guidance.
  • Problem: Higher verbosity than necessary for tests.
  • Suggested change: Remove optional exploratory prompts unless user asks for deeper planning.

📚 Skill Feedback

financial-advisor (positive)

  • Usage rate: High in advice responses
  • Token cost: Increased vs no-skill agent
  • Problem: Some boilerplate suggestions not required for test success.
  • Suggested change: Split the skill into “core rules” and “extended guidance,” and load extended guidance only when requested (see the sketch below).
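
A minimal sketch of that split, assuming skills are loaded as plain text; the file layout and loader function are hypothetical:

# Hypothetical sketch: the skill file layout and loader are assumptions.
# Core rules always load; extended guidance loads only on request.
from pathlib import Path

SKILL_DIR = Path("skills/financial-advisor")

def load_skill(include_extended: bool = False) -> str:
    """Return the skill text, gating the verbose half behind a flag."""
    text = (SKILL_DIR / "core_rules.md").read_text()
    if include_extended:  # e.g. the user explicitly asks for deeper planning
        text += "\n\n" + (SKILL_DIR / "extended_guidance.md").read_text()
    return text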

💡 Optimizations

#  Optimization                            Priority     Estimated Savings
1  Enforce autonomous tool-first behavior  recommended  Avoids failures; fewer retries
2  Trim skill verbosity                    suggestion   ~15% cost reduction
3  Compact tool responses                  info         ~10–20 tokens per call

1. Enforce autonomous tool-first behavior (recommended)

  • Current: Some agents ask permission before calling tools.
  • Change: Add explicit system instruction to always fetch required data first.
  • Impact: Reliability improvement; prevents test failures.

2. Trim skill verbosity (suggestion)

  • Current: Financial-advisor skill includes optional planning prompts.
  • Change: Remove or gate extended advice.
  • Impact: ~15% cost reduction.

3. Compact tool responses (info)

  • Current: Tool responses include both raw and formatted fields.
  • Change: Return only one representation where possible.
  • Impact: ~10–20 token savings per call.

📦 Tool Response Optimization

get_all_balances (from banking_server)

  • Current response: includes duplicate numeric and pre-formatted string fields.
  • Issues found: Redundant formatted fields increase tokens.
  • Suggested optimization: Return numeric balances only; format in the assistant (see the sketch after the example below).
  • Estimated savings: ~20 tokens per call

Example current vs optimized:

// Current
{
  "accounts": {
    "checking": {"balance": 1500.0, "formatted": "$1,500.00"},
    "savings": {"balance": 3000.0, "formatted": "$3,000.00"}
  },
  "total": 4500.0,
  "total_formatted": "$4,500.00"
}

// Optimized
{
  "accounts": {
    "checking": 1500.0,
    "savings": 3000.0
  },
  "total": 4500.0
}
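
A minimal sketch of the corresponding handler change on banking_server, assuming a Python tool function and an in-memory account store (both hypothetical); only the response shape matches the optimized example above:

# Hypothetical sketch: the account store and function shape are
# assumptions; MCP tool registration is omitted. The point is the
# payload: numeric values only, no pre-formatted "$4,500.00" duplicates.
ACCOUNTS = {"checking": 1500.0, "savings": 3000.0}

def get_all_balances() -> dict:
    """Return numeric balances only; the assistant formats for display."""
    return {
        "accounts": dict(ACCOUNTS),
        "total": sum(ACCOUNTS.values()),
    }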

🏆 Agent Leaderboard

Agent                             Tests  Pass Rate  Tokens  Cost       Duration
🥇 gpt-5-mini                      2/2    100%       3,448   $0.003737  31.1s
🥈 gpt-5-mini + financial-advisor  2/2    100%       4,908   $0.004823  33.3s
🥉 gpt-4.1-mini                    1/2    50%        1,113   $0.000544  11.3s

📋 Test Results

📋 tests/fixtures/scenario_04_agent_selector.py (2 tests × 3 agents = 6 runs)

Basic balance query — all agents should pass.
  Total: 2,595 tok · $0.001224
  gpt-5-mini: 11.1s · gpt-5-mini + financial-advisor: 8.5s

Financial advice — tests differentiation between agents (skill vs no skill).
  Total: 5,761 tok · $0.007336
  gpt-5-mini: 20.1s · gpt-5-mini + financial-advisor: 24.8s