Compare concise vs verbose instructions.

✗ 1 Failed
February 11, 2026 at 03:03 PM · 📄 test_instructions.py · 4 tests · 104.1s · 🧪 $0.3034 · 🤖 $0.0223 · 💰 $0.3257

🤖 AI Analysis

🎯 Executive Summary

| Metric | Value |
| --- | --- |
| Best Configuration | style-concise / gpt-5.2-codex |
| Pass Rate | 3/4 tests (75%) |
| Total Cost | $0.3257 |
| Total Tokens | 0 reported |
| Recommendation | 🟡 Improve |

Verdict: Concise instructions reliably complete file-creation tasks at lower cost. Domain-specific instructions triggered plan-only behavior and failed to produce files.

Configuration Scorecard

| Agent | Pass Rate | Cost | Avg Duration | Verdict |
| --- | --- | --- | --- | --- |
| style-concise | 1/1 (100%) | $0.0457 | 17.6s | 🏆 Best |
| style-verbose | 1/1 (100%) | $0.0585 | 27.5s | ⚠️ Slower, costlier |
| restricted | 1/1 (100%) | $0.0771 | 13.3s | ✅ Correctly constrained |
| fastapi-expert | 0/1 (0%) | $0.1444 | 45.7s | ❌ Not ready |

❌ Failure Analysis

Results Matrix

| Test | style-concise | style-verbose | restricted | fastapi-expert | Failure Type |
| --- | --- | --- | --- | --- | --- |
| Create calculator file | ✅ | ✅ | - | - | - |
| Restricted file creation | - | - | ✅ | - | - |
| Domain-specific REST API | - | - | - | ❌ | Plan-only output |

Failure: Agent with domain-specific instructions follows them

| Aspect | Detail |
| --- | --- |
| Agent | fastapi-expert (gpt-5.2-codex) |
| Task | Create a simple REST API with GET /health returning {status: 'ok'} |
| Expected | At least one Python file (e.g., main.py) created in workspace |
| Actual | No Python files created; agent wrote a plan to an external session-state path |
| Root Cause | Instructions induced planning mode without explicit execution; agent waited for a "start" signal and wrote outside the test workspace |
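
The "Expected" condition in the table above (at least one Python file in the workspace) can be sketched as a small check. This is a hypothetical helper, not the assertion used in test_instructions.py, which this report does not show; the function names and the recursive search are assumptions.

```python
from pathlib import Path


def python_files_in(workspace):
    """Return the sorted list of .py files found anywhere under workspace.

    Hypothetical helper: the real test's check is not shown in this report.
    """
    root = Path(workspace)
    return sorted(p for p in root.rglob("*.py") if p.is_file())


def created_python_file(workspace):
    """True when at least one Python file exists in the workspace."""
    return len(python_files_in(workspace)) > 0
```

A plan written to a path outside the workspace, as in the failing run, leaves `created_python_file` returning False no matter how detailed the plan is.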

Fix (exact instruction rewrite): You are a FastAPI expert. Do not enter planning-only mode. Always implement immediately. Create all files in the current working directory. Do not wait for confirmation words like "start".

Expected impact: Eliminate plan-only stalls; +25% pass rate and ~30–45s time savings on API tasks.

📝 Instruction Effectiveness

| Instruction Style | Tests | Pass Rate | Avg Cost | Avg Duration | Assessment |
| --- | --- | --- | --- | --- | --- |
| concise | 1 | 100% | $0.0457 | 17.6s | ✅ Effective |
| verbose | 1 | 100% | $0.0585 | 27.5s | ⚠️ Higher cost, no benefit |
| restricted | 1 | 100% | $0.0771 | 13.3s | ✅ Correct behavior |
| domain-expert | 1 | 0% | $0.1444 | 45.7s | ❌ Ineffective |

Problematic Instructions

Problem: Domain-specific instructions triggered planning without execution and wrote files outside the workspace.

Suggested replacement: Build a minimal FastAPI app now. Create main.py in the current directory with a GET /health endpoint. Return {"status": "ok"}. Do not create plans or ask for confirmation.

🔧 Tool Usage

Tool Proficiency Matrix

| Tool | Total Calls | Success | Issues |
| --- | --- | --- | --- |
| powershell | 6 | ✅ 6/6 | Used for planning output in failure |
| report_intent | 7 | ✅ 7/7 | - |
| apply_patch | 1 | ✅ 1/1 | - |
| view | 3 | ✅ 3/3 | - |
| glob | 1 | ⚠️ 0/1 | Unnecessary scan |

Efficiency Analysis

| Metric | Value | Assessment |
| --- | --- | --- |
| Avg tools per test | 4.5 | ⚠️ Slightly high |
| Unnecessary tool calls | 2 | glob + directory listings in failure |
| Failed tool calls | 0 | ✅ |

💡 Optimizations

| Priority | Change | Expected Impact |
| --- | --- | --- |
| 🔴 Critical | Disable plan-only behavior in domain instructions | Fix 1 failure (+25% pass rate) |
| 🟡 Recommended | Prefer concise instructions by default | ~22% cost reduction |
| 🟢 Nice to have | Remove redundant directory scans | Faster runs, fewer tokens |

Details:

1. Critical: Enforce Immediate Implementation
   - Current: Agent writes plans and waits for "start".
   - Change: Add explicit "implement immediately" clause (see rewrite above).
   - Impact: Prevents zero-file failures; saves ~$0.10–$0.15/test in wasted planning.
2. Recommended: Default to Concise Instructions
   - Current: Verbose adds docstrings/comments with no test benefit.
   - Change: Use concise for baseline tasks.
   - Impact: Lower cost and faster completion with identical correctness.
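
The headline impact figures can be re-derived from the per-agent numbers in this report. The sketch below is only an arithmetic check; the variable names are illustrative, and the costs and pass counts are copied from the scorecard.

```python
# Per-agent cost in USD, copied from the Configuration Scorecard.
costs = {
    "style-concise": 0.0457,
    "style-verbose": 0.0585,
    "restricted": 0.0771,
    "fastapi-expert": 0.1444,
}

# The four per-agent costs should add up to the reported total.
total_cost = round(sum(costs.values()), 4)

# Relative saving from choosing concise over verbose on the same task.
concise_saving = (
    costs["style-verbose"] - costs["style-concise"]
) / costs["style-verbose"]

# Fixing the single failing test out of four raises the pass rate by a quarter.
passed, total = 3, 4
pass_rate_gain = (passed + 1) / total - passed / total

print(f"total cost: ${total_cost:.4f}")                       # total cost: $0.3257
print(f"concise vs verbose saving: {concise_saving:.1%}")     # 21.9%
print(f"pass-rate gain: {pass_rate_gain:.0%}")                # 25%
```

The 21.9% saving is what the table rounds to "~22% cost reduction". Because each style ran only one test, that figure rests on a single pair of runs.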

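🏆 Agent Leaderboard

| Rank | Agent | Tests | Pass Rate | Tokens | Cost | Duration |
| --- | --- | --- | --- | --- | --- | --- |
| 🥇 | style-concise | 1/1 | 100% | 0 | $0.0457 | 17.6s |
| 🥈 | style-verbose | 1/1 | 100% | 0 | $0.0585 | 27.5s |
| 🥉 | restricted | 1/1 | 100% | 0 | $0.0771 | 13.3s |
| 4 | fastapi-expert | 0/1 | 0% | 0 | $0.1444 | 45.7s |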

📋 Test Results

4 / 4 tests

📋 Compare concise vs verbose instructions. (1 test): 1/1, 1/1
✅ Both instruction styles should successfully create a file.
Total 0 tok · Total $0.1042 · Δ +0% · Δ +56% · Δ +28%
style-concise: ✅ 17.6s
style-verbose: ✅ 27.5s

🔗 Test that instructions constrain agent behavior. (2 tests): 0/0, 0/0
✅ Agent with restricted instructions stays within bounds.
style-concise: -
style-verbose: -
❌ Agent with domain-specific instructions follows them.
style-concise: -
style-verbose: -