Optimizing Instructions with AI

optimize_instruction() closes the test→optimize→test loop.

When a test fails — the agent ignored an instruction or produced unexpected output — call optimize_instruction() to get a concrete, LLM-generated suggestion for improving the instruction. Drop the suggestion into pytest.fail() so the test failure message includes a ready-to-use fix.

The Loop

write test → run → fail → optimize → update instruction → run → pass

This is test-driven prompt engineering: your tests define the standard; the optimizer helps you reach it.

Basic Usage

import pytest
from pytest_codingagents import CopilotAgent, optimize_instruction


async def test_docstring_instruction(copilot_run, tmp_path):
    agent = CopilotAgent(
        instructions="Write Python code.",
        working_directory=str(tmp_path),
    )

    result = await copilot_run(agent, "Create math.py with add(a, b) and subtract(a, b).")

    if '"""' not in result.file("math.py"):
        suggestion = await optimize_instruction(
            agent.instructions or "",
            result,
            "Agent should add Google-style docstrings to every function.",
        )
        pytest.fail(f"No docstrings found.\n\n{suggestion}")

The failure message will look like:

FAILED test_math.py::test_docstring_instruction

No docstrings found.

💡 Suggested instruction:

  Write Python code. Add Google-style docstrings to every function.
  The docstring should describe what the function does, its parameters (Args:),
  and its return value (Returns:).

  Changes: Added explicit docstring format mandate with Args/Returns sections.
  Reasoning: The original instruction did not mention documentation. The agent
  produced code without docstrings because there was no requirement to add them.
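
To close the loop, paste the suggested instruction back into the agent and re-run the test. A minimal sketch, assuming you adopt the suggested wording from the failure message above as-is:

async def test_docstring_instruction(copilot_run, tmp_path):
    agent = CopilotAgent(
        instructions=(
            "Write Python code. Add Google-style docstrings to every function. "
            "The docstring should describe what the function does, its parameters "
            "(Args:), and its return value (Returns:)."
        ),
        working_directory=str(tmp_path),
    )

    result = await copilot_run(agent, "Create math.py with add(a, b) and subtract(a, b).")

    # The instruction now names the requirement, so this should hold.
    assert '"""' in result.file("math.py")

If you also want evidence that the new wording (and not something else) changed the behavior, the A/B pattern below compares it against the original instruction.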

With A/B Testing

Pair optimize_instruction() with ab_run to test the fix before committing:

import pytest
from pytest_codingagents import CopilotAgent, optimize_instruction


async def test_docstring_instruction_iterates(ab_run):
    baseline = CopilotAgent(instructions="Write Python code.")
    treatment = CopilotAgent(
        instructions="Write Python code. Add Google-style docstrings to every function."
    )

    b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b).")

    assert b.success and t.success

    if '"""' not in t.file("math.py"):
        suggestion = await optimize_instruction(
            treatment.instructions or "",
            t,
            "Treatment agent should add docstrings — treatment instruction did not work.",
        )
        pytest.fail(f"Treatment still no docstrings.\n\n{suggestion}")

    # Confirm baseline does NOT have docstrings (differential assertion)
    assert '"""' not in b.file("math.py"), "Baseline unexpectedly has docstrings"

API Reference

optimize_instruction(current_instruction: str, result: CopilotResult, criterion: str, *, model: str | Model = 'azure/gpt-5.2-chat') -> InstructionSuggestion async

Analyze a result and suggest an improved instruction.

Uses pydantic-ai structured output to analyze the gap between a current instruction and the agent's observed behavior, returning a concrete, actionable improvement.

Designed to drop into pytest.fail() so the failure message contains a ready-to-use fix.

Model strings follow the same provider/model format used by pytest-aitest. Azure Entra ID auth is handled automatically when AZURE_API_BASE or AZURE_OPENAI_ENDPOINT is set.

Example:

result = await copilot_run(agent, task)
if '"""' not in result.file("main.py"):
    suggestion = await optimize_instruction(
        agent.instructions or "",
        result,
        "Agent should add docstrings to all functions.",
    )
    pytest.fail(f"No docstrings found.\n\n{suggestion}")

Parameters:

current_instruction (str, required)
    The agent's current instruction text.

result (CopilotResult, required)
    The CopilotResult from the (failed) run.

criterion (str, required)
    What the agent should have done: the test expectation in plain English
    (e.g. "Always write docstrings").

model (str | Model, default 'azure/gpt-5.2-chat')
    Provider/model string (e.g. "azure/gpt-5.2-chat", "openai/gpt-4o-mini") or a
    pre-configured pydantic-ai Model object.

Returns:

InstructionSuggestion
    An InstructionSuggestion with the improved instruction.


InstructionSuggestion(instruction: str, reasoning: str, changes: str) dataclass

A suggested improvement to a Copilot agent instruction.

Returned by optimize_instruction(). Designed to drop into pytest.fail() so the failure message includes an actionable fix.

Attributes:

instruction (str)
    The improved instruction text to use instead.

reasoning (str)
    Explanation of why this change would close the gap.

changes (str)
    Short description of what was changed (one sentence).

Example:

suggestion = await optimize_instruction(
    agent.instructions or "",
    result,
    "Agent should add docstrings to all functions.",
)
pytest.fail(f"No docstrings found.\n\n{suggestion}")

Choosing a Model

optimize_instruction() defaults to azure/gpt-5.2-chat, the same default shown in the API reference above.

Override with the model keyword argument:

suggestion = await optimize_instruction(
    agent.instructions or "",
    result,
    "Agent should use type hints.",
    model="anthropic:claude-3-haiku-20240307",
)

Model strings follow the same provider/model format used by pytest-aitest; a pre-configured pydantic-ai Model object also works.
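
When you need custom credentials or a non-default endpoint, you can construct the Model yourself. A rough sketch, assuming pydantic-ai's OpenAIModel class (the exact import path may vary between pydantic-ai versions):

from pydantic_ai.models.openai import OpenAIModel

# Pass a pre-configured pydantic-ai Model instead of a provider/model string.
suggestion = await optimize_instruction(
    agent.instructions or "",
    result,
    "Agent should use type hints.",
    model=OpenAIModel("gpt-4o-mini"),
)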

The Criterion

Write the criterion as a plain-English statement of what the agent should have done:

Situation            Good criterion
Missing docstrings   "Agent should add Google-style docstrings to every function."
Wrong framework      "Agent should use FastAPI, not Flask."
Missing type hints   "All function signatures must include type annotations."
No error handling    "All I/O operations must be wrapped in try/except."

The more specific the criterion, the more actionable the suggestion.
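
For instance, assuming the same result object from Basic Usage, the two calls below differ only in how specific the criterion is; the second gives the optimizer a concrete gap to close:

# Vague: the optimizer has to guess what "better" means.
await optimize_instruction(agent.instructions or "", result, "Write better code.")

# Specific: the optimizer knows exactly which requirement was missed.
await optimize_instruction(
    agent.instructions or "",
    result,
    "All function signatures must include type annotations.",
)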