Optimizing Instructions with AI¶
optimize_instruction() closes the test→optimize→test loop.
When a test fails — the agent ignored an instruction or produced unexpected output — call optimize_instruction() to get a concrete, LLM-generated suggestion for improving the instruction. Drop the suggestion into pytest.fail() so the test failure message includes a ready-to-use fix.
The Loop¶
This is test-driven prompt engineering: your tests define the standard; the optimizer helps you reach it.
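One pass through the loop might look like the sketch below. It sits inside an async test body and assumes the copilot_run and tmp_path fixtures plus the imports shown in Basic Usage; the re-run with suggestion.instruction is the "optimize" half of the cycle, shown inline here only to make the loop visible in one place.

```python
# Sketch of one test -> optimize -> re-test pass (assumes the copilot_run and
# tmp_path fixtures and the imports from Basic Usage below).
agent = CopilotAgent(instructions="Write Python code.", working_directory=str(tmp_path))
task = "Create math.py with add(a, b) and subtract(a, b)."

result = await copilot_run(agent, task)
if '"""' not in result.file("math.py"):
    suggestion = await optimize_instruction(
        agent.instructions or "",
        result,
        "Agent should add Google-style docstrings to every function.",
    )
    # Close the loop: re-run with the suggested instruction and re-check.
    retry_agent = CopilotAgent(
        instructions=suggestion.instruction,
        working_directory=str(tmp_path),
    )
    retry = await copilot_run(retry_agent, task)
    assert '"""' in retry.file("math.py")
```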
Basic Usage¶
```python
import pytest

from pytest_codingagents import CopilotAgent, optimize_instruction


async def test_docstring_instruction(copilot_run, tmp_path):
    agent = CopilotAgent(
        instructions="Write Python code.",
        working_directory=str(tmp_path),
    )
    result = await copilot_run(agent, "Create math.py with add(a, b) and subtract(a, b).")

    if '"""' not in result.file("math.py"):
        suggestion = await optimize_instruction(
            agent.instructions or "",
            result,
            "Agent should add Google-style docstrings to every function.",
        )
        pytest.fail(f"No docstrings found.\n\n{suggestion}")
```
The failure message will look like:
```
FAILED test_math.py::test_docstring_instruction
No docstrings found.

💡 Suggested instruction:
Write Python code. Add Google-style docstrings to every function.
The docstring should describe what the function does, its parameters (Args:),
and its return value (Returns:).

Changes: Added explicit docstring format mandate with Args/Returns sections.

Reasoning: The original instruction did not mention documentation. The agent
produced code without docstrings because there was no requirement to add them.
```
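The formatted block above comes from dropping the suggestion object straight into the f-string. If you would rather compose your own failure message, the instruction, changes, and reasoning fields documented in the API reference below are available individually; a minimal sketch:

```python
# Build a custom failure message from the InstructionSuggestion fields.
pytest.fail(
    "No docstrings found.\n\n"
    f"Try instead: {suggestion.instruction}\n"
    f"What changed: {suggestion.changes}\n"
    f"Why: {suggestion.reasoning}"
)
```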
With A/B Testing¶
Pair optimize_instruction() with ab_run to test the fix before committing:
```python
import pytest

from pytest_codingagents import CopilotAgent, optimize_instruction


async def test_docstring_instruction_iterates(ab_run, tmp_path):
    baseline = CopilotAgent(instructions="Write Python code.")
    treatment = CopilotAgent(
        instructions="Write Python code. Add Google-style docstrings to every function."
    )

    b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b).")

    assert b.success and t.success

    if '"""' not in t.file("math.py"):
        suggestion = await optimize_instruction(
            treatment.instructions or "",
            t,
            "Treatment agent should add docstrings — treatment instruction did not work.",
        )
        pytest.fail(f"Treatment still no docstrings.\n\n{suggestion}")

    # Confirm the baseline does NOT have docstrings (differential assertion).
    assert '"""' not in b.file("math.py"), "Baseline unexpectedly has docstrings"
```
API Reference¶
optimize_instruction(current_instruction: str, result: CopilotResult, criterion: str, *, model: str | Model = 'azure/gpt-5.2-chat') -> InstructionSuggestion  async ¶
Analyze a result and suggest an improved instruction.
Uses pydantic-ai structured output to analyze the gap between a current instruction and the agent's observed behavior, returning a concrete, actionable improvement.
Designed to drop into pytest.fail() so the failure message contains a ready-to-use fix.
Model strings follow the same provider/model format used by pytest-aitest. Azure Entra ID auth is handled automatically when AZURE_API_BASE or AZURE_OPENAI_ENDPOINT is set.
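In practice that means exporting the endpoint before the test session starts; a minimal sketch with a placeholder value (substitute your own resource):

```python
import os

# Placeholder endpoint; with this set, Entra ID auth is picked up automatically.
os.environ.setdefault("AZURE_OPENAI_ENDPOINT", "https://<your-resource>.openai.azure.com")
```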
Example::

    result = await copilot_run(agent, task)
    if '"""' not in result.file("main.py"):
        suggestion = await optimize_instruction(
            agent.instructions or "",
            result,
            "Agent should add docstrings to all functions.",
        )
        pytest.fail(f"No docstrings found.\n\n{suggestion}")
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| current_instruction | str | The agent's current instruction text. | required |
| result | CopilotResult | The CopilotResult from the agent run to analyze. | required |
| criterion | str | What the agent should have done: the test expectation in plain English (e.g. "Agent should add Google-style docstrings to every function."). | required |
| model | str \| Model | Provider/model string (e.g. 'azure/gpt-5.2-chat'). | 'azure/gpt-5.2-chat' |
Returns:

| Name | Type | Description |
|---|---|---|
|  | InstructionSuggestion | An InstructionSuggestion with the improved instruction, the reasoning behind it, and a one-sentence summary of the changes. |
InstructionSuggestion(instruction: str, reasoning: str, changes: str)  dataclass ¶
A suggested improvement to a Copilot agent instruction.
Returned by optimize_instruction(). Designed to drop into pytest.fail() so the failure message includes an actionable fix.
Attributes:

| Name | Type | Description |
|---|---|---|
| instruction | str | The improved instruction text to use instead. |
| reasoning | str | Explanation of why this change would close the gap. |
| changes | str | Short description of what was changed (one sentence). |
Example::

    suggestion = await optimize_instruction(
        agent.instructions or "",
        result,
        "Agent should add docstrings to all functions.",
    )
    pytest.fail(f"No docstrings found.\n\n{suggestion}")
Choosing a Model¶
optimize_instruction() defaults to azure/gpt-5.2-chat, the same default shown in the signature above.
Override it with the model keyword argument:

```python
suggestion = await optimize_instruction(
    agent.instructions or "",
    result,
    "Agent should use type hints.",
    model="anthropic/claude-3-haiku-20240307",
)
```
Any LiteLLM-compatible model string works.
The Criterion¶
Write the criterion as a plain-English statement of what the agent should have done:
| Situation | Good criterion |
|---|---|
| Missing docstrings | "Agent should add Google-style docstrings to every function." |
| Wrong framework | "Agent should use FastAPI, not Flask." |
| Missing type hints | "All function signatures must include type annotations." |
| No error handling | "All I/O operations must be wrapped in try/except." |
The more specific the criterion, the more actionable the suggestion.
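As a quick illustration, here is the same failing run with a vague criterion versus a specific one (both calls are sketches; only the criterion string differs):

```python
# Vague criterion: the optimizer can only suggest something generic.
await optimize_instruction(agent.instructions or "", result, "Code should be better documented.")

# Specific criterion: the optimizer can name the exact fix.
await optimize_instruction(
    agent.instructions or "",
    result,
    "Agent should add Google-style docstrings with Args: and Returns: sections to every function.",
)
```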