Core banking tests — parametrized across benchmark agents.

All Passed
February 23, 2026 at 11:17 AM
📄 test_hero.py · 18 tests · 225.4s · 830–3,146 tok
🧪 $0.0454 · 🤖 $0.0484 · 💰 $0.0938
🤖 AI Analysis

Recommended for Deploy: default

Achieves a 100% pass rate across the largest test surface (12 tests) with reliable tool-first behavior and consistent multi-turn handling, at a total cost of $0.023589.

  • Pass Rate: 100%
  • Total Cost: $0.023589
  • Tokens: 17,014
  • Total Tests: 18
  • Failures: 0
  • Eval Configs: 4
  • Avg Turns: 3.3

Comparative Analysis

Why the winner wins

  • Coverage advantage: The winner (default) passed 12 tests versus 2 tests each for concise, friendly, and detailed, demonstrating robustness across core banking, multi‑turn sessions, advisory styles, and skill-augmented advice.
  • Tool discipline: Consistently follows “tool-first” instructions (balances before advice, transfers via tools, graceful handling of insufficient funds) with no deviations.
  • Session reliability: Correctly carries context across multi-turn workflows (balance → transfer) without redundant clarifications or missed tool calls.
  • Cost-effective at scale: While per-test cost is not directly comparable across configs due to different test counts, the winner maintains low total cost while handling the widest scenario set.

Notable patterns

  • Concise prompts minimize tokens: concise produces the lowest token usage per advisory response while still calling the necessary tools.
  • Detailed prompts expand reasoning, not tool usage: detailed adds explanatory depth without increasing tool coverage—useful for UX, not for operational accuracy.
  • Model parity on correctness: Both gpt-5-mini and gpt-4.1 achieve perfect pass rates; differences are primarily in verbosity and cost, not correctness.

Alternatives

  • concise: Lowest token footprint and cost on advisory tests; trade-off is reduced explanation depth. Suitable for cost-sensitive deployments focused on quick guidance.
  • friendly: Strong UX tone with moderate token cost; useful for consumer-facing experiences where encouragement matters.
  • detailed: Rich explanations with higher token cost; best for educational contexts. Not necessary for transactional banking flows.

🔧 MCP Tool Feedback

MCPServerStdio

Overall assessment: High discoverability and correct usage. Tools are clearly named and consistently invoked correctly across models and prompts.

Tool              Status  Calls  Issues
get_balance       ✅      Many   Working well
get_all_balances  ✅      Many   Working well
transfer          ✅      Many   Working well, including error handling
get_transactions  ✅      Few    Working well
deposit           ⚠️      0      Not covered by tests
withdraw          ⚠️      0      Not covered by tests

Note: deposit and withdraw were not called, but no tests asserted their usage. This is a coverage observation, not a failure.
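A test closing this coverage gap could assert on behavior like the following sketch, which uses an in-memory stand-in for the MCP server's account state. The function signatures and the overdraft error are assumptions, mirrored from how transfer handles insufficient funds in the tested flows; they are not taken from the actual server.

```python
# Hypothetical stand-in for the MCP server's account state.
balances = {"checking": 1500.0, "savings": 3000.0}

def deposit(account: str, amount: float) -> float:
    """Add funds to an account and return the new balance."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    balances[account] += amount
    return balances[account]

def withdraw(account: str, amount: float) -> float:
    """Remove funds, rejecting overdrafts the way transfer does."""
    if amount > balances[account]:
        raise ValueError("insufficient funds")
    balances[account] -= amount
    return balances[account]
```

Adding two tests along these lines would lift both tools out of the "not covered" column without changing the server.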

📋 MCP Prompt Template Feedback

default (clear)

  • Description clarity: Clear and explicit about mandatory tool usage.
  • Arguments: N/A (instructional prompt).
  • Issue: None observed.
  • Suggested change: None.

🤖 Custom Agent Feedback

default (effective)

  • Description match: Matches expected behavior exactly—transactional, accurate, tool-driven.
  • Token count: 17,014
  • Behavioral impact: Strongly primes immediate tool invocation and avoids hallucination.
  • Gap analysis: None.
  • Suggested change: None.

concise (effective)

  • Description match: Behavior is brief, direct, and action-oriented.
  • Token count: 2,982
  • Behavioral impact: Language like “Be brief and direct” suppresses verbosity while preserving correct tool calls.
  • Gap analysis: None.
  • Suggested change: None.

detailed (effective)

  • Description match: Provides comprehensive explanations and context as instructed.
  • Token count: 4,284
  • Behavioral impact: Phrases like “explain the reasoning” and “consider multiple perspectives” increase response length without affecting correctness.
  • Gap analysis: None.
  • Suggested change: None.

friendly (effective)

  • Description match: Encouraging tone with celebratory language.
  • Token count: 3,774
  • Behavioral impact: Positive framing increases verbosity slightly but does not interfere with tool usage.
  • Gap analysis: None.
  • Suggested change: None.

📚 Skill Feedback

financial-advisor (positive)

  • Usage rate: Referenced in all skill-augmented advisory responses.
  • Token cost: Included in advisory tests (up to ~3k tokens total).
  • Problem: None—skill content directly shaped advice structure and priorities.
  • Suggested change: None. The “Always call get_all_balances first” rule is especially effective.

💡 Optimizations

#  Optimization                               Priority     Estimated Savings
1  Trim unused tool response fields           recommended  10–15% token reduction
2  Align advisory verbosity to prompt intent  suggestion   5–10% cost reduction

1. Trim unused tool response fields (recommended)

  • Current: Tool responses include both raw and formatted values plus explanatory messages.
  • Change: Remove fields not used in agent responses (e.g., duplicate message strings when balances are already restated).
  • Impact: 10–15% cost reduction from smaller tool payloads.

2. Align advisory verbosity to prompt intent (suggestion)

  • Current: friendly and detailed sometimes exceed practical length for banking dashboards.
  • Change: Add an optional instruction line: “End with a 3-bullet action summary.”
  • Impact: 5–10% cost reduction with improved scannability.
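A minimal sketch of how the optional line could be wired into a prompt template. The builder function and base prompt text here are illustrative placeholders, not the actual eval config contents:

```python
# Hypothetical prompt builder; BASE_FRIENDLY and build_prompt are
# illustrative names, not part of the real eval configs.
ACTION_SUMMARY = "End with a 3-bullet action summary."

BASE_FRIENDLY = (
    "You are a friendly banking advisor. Encourage the user and "
    "explain their options warmly."
)

def build_prompt(base: str, capped: bool = False) -> str:
    """Optionally append the verbosity-capping instruction line."""
    return f"{base}\n\n{ACTION_SUMMARY}" if capped else base
```

Keeping the line optional preserves the current friendly/detailed behavior for contexts where longer answers are wanted.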

📦 Tool Response Optimization

transfer (from MCPServerStdio)

  • Current response size: ~80–100 tokens per call
  • Issues found: Redundant fields (amount_formatted + numeric amount; verbose message)
  • Suggested optimization: Keep numeric values and new balances only.
  • Estimated savings: ~30 tokens per call (~30–35% reduction)

Example current vs optimized:

// Current
{
  "transaction_id": "TX0001",
  "type": "transfer",
  "from_account": "checking",
  "to_account": "savings",
  "amount": 100.0,
  "amount_formatted": "$100.00",
  "new_balance_from": 1400.0,
  "new_balance_to": 3100.0,
  "message": "Successfully transferred $100.00 from checking to savings."
}

// Optimized
{
  "transaction_id": "TX0001",
  "from": "checking",
  "to": "savings",
  "amount": 100.0,
  "new_from": 1400.0,
  "new_to": 3100.0
}
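The trimming above can be sketched as a small server-side post-processing step. The function name is illustrative; the field names match the example payloads shown here, but the real server's internals are an assumption:

```python
def optimize_transfer_response(resp: dict) -> dict:
    """Keep numeric values and new balances only, dropping the
    formatted amount and the message string that restates them."""
    return {
        "transaction_id": resp["transaction_id"],
        "from": resp["from_account"],
        "to": resp["to_account"],
        "amount": resp["amount"],
        "new_from": resp["new_balance_from"],
        "new_to": resp["new_balance_to"],
    }
```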

🏆 Eval Leaderboard

Rank  Eval      Tests  Pass Rate  Tokens  Cost       Duration
🥇    default   12/12  100%       17,014  $0.0236    133.8s
🥈    concise   2/2    100%       2,982   $0.004776  24.5s
🥉    friendly  2/2    100%       3,774   $0.007690  33.1s
4     detailed  2/2    100%       4,284   $0.009326  34.0s

📋 Test Results

18 / 18 tests passed

🔗 Core banking tests — parametrized across benchmark agents. (3 tests · default 3/3 · concise 0/0)
  • Check account balance. · 2,043 tok · $0.002522 · default: 19.4s
  • Transfer funds between accounts. · 2,273 tok · $0.003012 · default: 14.5s
  • Handle insufficient funds gracefully. · 2,760 tok · $0.003981 · default: 34.6s

🔗 Multi-turn session: savings transfer workflow. (2 tests · default 2/2 · concise 0/0)
  • First turn: check account balances. · 2,159 tok · $0.002707 · default: 14.3s
  • Second turn: transfer based on previous context. · 2,856 tok · $0.003543 · default: 15.3s

📋 Compare how different prompt styles affect responses. (1 test · default 0/0 · concise 1/1)
  • Compare advisory styles across prompts. · 2,982 tok · $0.004776 · concise: 24.5s

📋 Test how skills improve advice quality. (1 test · default 1/1 · concise 0/0)
  • Eval with financial advisor skill gives better advice. · 4,923 tok · $0.007823 · default: 35.7s