Test Coding Agents¶
pytest-skill-engineering can test real coding agents like GitHub Copilot — not just synthetic agents backed by MCP servers.
Install¶
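A typical install command (a sketch only — this assumes Copilot support ships as an optional `copilot` extra; check your package index for the exact name):

```shell
pip install "pytest-skill-engineering[copilot]"
```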
This installs the github-copilot-sdk package alongside pytest-skill-engineering.
Quick Start¶
```python
import pytest

from pytest_skill_engineering.copilot import CopilotEval


@pytest.mark.copilot
async def test_creates_module(copilot_eval, tmp_path):
    agent = CopilotEval(
        name="coder",
        instructions="Create production-quality Python code.",
        working_directory=str(tmp_path),
    )
    result = await copilot_eval(
        agent,
        "Create calculator.py with add, subtract, multiply, divide functions.",
    )
    assert result.success
    assert (tmp_path / "calculator.py").exists()
```
CopilotEval Configuration¶
CopilotEval is the configuration object for Copilot SDK sessions:
```python
from pytest_skill_engineering.copilot import CopilotEval

agent = CopilotEval(
    name="my-agent",                      # Required: unique agent name
    instructions="Your instructions.",    # System prompt for the agent
    model="gpt-5.2",                      # Optional: model override
    working_directory=str(tmp_path),      # Working directory for file ops
    max_turns=25,                         # Max conversation turns
    timeout_s=300.0,                      # Timeout in seconds
    excluded_tools=["run_in_terminal"],   # Tools to block
    skill_directories=["./skills"],       # Skill directories to load
    reasoning_effort="high",              # Reasoning effort level
    custom_agents=[                       # Custom subagents
        {
            "name": "test-writer",
            "prompt": "Write pytest tests.",
            "description": "Writes unit tests.",
        }
    ],
)
```
Fixtures¶
copilot_eval¶
Runs a single Copilot agent against a task.

It returns a `CopilotResult` with:

- `result.success` — Whether the session completed without errors
- `result.error` — Error message if failed
- `result.final_response` — Agent's final text response
- `result.all_tool_calls` — List of `ToolCall` objects
- `result.tool_was_called("name")` — Check if a tool was called
- `result.tool_names_called` — Set of tool names used
- `result.file("path")` — Read a file from the working directory
- `result.files_created` / `result.files_modified` — File tracking
- `result.usage` — Token usage info
- `result.total_cost_usd` — Estimated cost
- `result.subagent_invocations` — Custom agent dispatch events
- `result.reasoning_traces` — Reasoning effort traces
- `result.raw_events` — Full SDK event stream
ab_run¶
Runs two agents against the same task in isolated directories:
```python
@pytest.mark.copilot
async def test_ab_comparison(ab_run, tmp_path):
    baseline = CopilotEval(name="baseline", instructions="Write minimal code.")
    treatment = CopilotEval(name="treatment", instructions="Write documented code.")

    b, t = await ab_run(baseline, treatment, "Create calculator.py with add and subtract.")

    assert b.success and t.success
    assert '"""' in t.file("calculator.py")  # Treatment has docstrings
```
Custom Agents¶
Define subagents that the main agent can delegate to:
```python
agent = CopilotEval(
    name="orchestrator",
    instructions="Delegate test writing to the test-writer agent.",
    custom_agents=[
        {
            "name": "test-writer",
            "prompt": "Write pytest tests for the given code.",
            "description": "Writes unit tests.",
            "tools": ["create_file", "read_file"],  # Optional tool restriction
        }
    ],
)
```
Loading from a file¶
Use load_custom_agent() to load a .agent.md file into a custom agent dict:
```python
import pytest

from pytest_skill_engineering import load_custom_agent
from pytest_skill_engineering.copilot import CopilotEval

test_writer = load_custom_agent(".github/agents/test-writer.agent.md")


@pytest.mark.copilot
async def test_orchestrator_delegates_test_writing(copilot_eval):
    agent = CopilotEval(
        name="orchestrator",
        instructions="Delegate test writing to the test-writer agent.",
        custom_agents=[test_writer],
    )
    result = await copilot_eval(agent, "Write unit tests for calculator.py")
    assert result.success
```
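For reference, such a file might look like this (a hypothetical sketch — it assumes YAML frontmatter supplies the `name`, `description`, and `tools` fields and the markdown body becomes the prompt; consult the Copilot custom-agent docs for the exact schema):

```markdown
---
name: test-writer
description: Writes unit tests.
tools: ["create_file", "read_file"]
---
Write pytest tests for the given code.
```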
Loading all agents from a directory¶
Use load_custom_agents() to load all .agent.md files from a directory:
```python
import pytest

from pytest_skill_engineering import load_custom_agents
from pytest_skill_engineering.copilot import CopilotEval

# Load all sub-agents except the orchestrator
subagents = load_custom_agents(
    ".github/agents/",
    exclude={"orchestrator"},
)


@pytest.mark.copilot
async def test_full_agent_team(copilot_eval):
    agent = CopilotEval(
        name="orchestrator",
        instructions="Delegate tasks to the appropriate specialist.",
        custom_agents=subagents,
    )
    result = await copilot_eval(agent, "Create and test a calculator module.")
    assert result.success
```
Both functions are available directly from `pytest_skill_engineering`.
Asserting on result.subagent_invocations¶
CopilotResult.subagent_invocations tracks which sub-agents were dispatched:
```python
async def test_correct_subagent_is_invoked(copilot_eval):
    agents = load_custom_agents(".github/agents/")
    agent = CopilotEval(
        name="orchestrator",
        instructions="Use specialist agents for each task.",
        custom_agents=agents,
    )
    result = await copilot_eval(agent, "Write unit tests for the billing module.")

    invoked = [s.eval_name for s in result.subagent_invocations]
    assert "test-writer" in invoked
```
Testing Skills¶
Skills are domain knowledge packages loaded from a directory containing a SKILL.md file. Use skill_directories to inject a skill into a Copilot session — this is the right way to test Copilot skills, as it exercises the same loading path end users experience.
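A skill directory is assumed to look roughly like this (layout inferred from the example below; the `references/` folder is optional supporting material):

```
skills/my-skill/
├── SKILL.md
└── references/
```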
```python
from pytest_skill_engineering.copilot import CopilotEval


async def test_skill_presents_scenarios(copilot_eval):
    agent = CopilotEval(
        name="with-skill",
        skill_directories=["skills/my-skill"],  # loads SKILL.md + references/
        max_turns=10,
    )
    result = await copilot_eval(agent, "What can you help me with?")
    assert result.success
    assert "scenario-a" in result.final_response.lower()
```
Comparing with and without skill¶
```python
async def test_skill_improves_routing(copilot_eval):
    without = CopilotEval(name="no-skill", max_turns=10)
    with_skill = CopilotEval(
        name="with-skill",
        skill_directories=["skills/my-skill"],
        max_turns=10,
    )
    r_without = await copilot_eval(without, "Get the ACR baseline for TPID 12345.")
    r_with = await copilot_eval(with_skill, "Get the ACR baseline for TPID 12345.")

    # Skill should cause the agent to call the right tool
    assert r_with.tool_was_called("ExecuteQueries")
```
When to use CopilotEval vs Eval + Skill¶
| | `CopilotEval` + `skill_directories` | `Eval` + `Skill.from_path()` |
|---|---|---|
| What runs the agent | Real GitHub Copilot (CLI SDK) | PydanticAI synthetic loop |
| Skill loading | Native Copilot skill loading | Injected as virtual reference tools |
| MCP auth | Handled by Copilot CLI (OAuth cached) | Managed by test process (token required) |
| Use when | Testing a Copilot skill end-to-end | Testing MCP servers / tool descriptions |

Rule of thumb: If you built a `SKILL.md` for Copilot users, test it with `CopilotEval`. If you're testing whether your MCP server tools are discoverable and usable, use `Eval`.
Skill Directories¶
Load skill files that inject domain knowledge into the agent:
```python
agent = CopilotEval(
    name="with-skills",
    instructions="Apply all standards from your skills.",
    skill_directories=["./skills"],
)
```
Tool Restrictions¶
Block specific tools to control agent behavior:
```python
agent = CopilotEval(
    name="no-terminal",
    instructions="Create files only.",
    excluded_tools=["run_in_terminal"],
)
result = await copilot_eval(agent, "Create hello.py")
assert not result.tool_was_called("run_in_terminal")
```
Reporting¶
Copilot test results flow into the same HTML report. The report auto-detects whether tests used the Copilot SDK and adapts the AI analysis accordingly.
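A minimal invocation that produces the report (assuming the `copilot` marker used throughout this page and the `--aitest-html` flag shown in the next section):

```shell
pytest tests/ -m copilot --aitest-html=report.html
```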
Copilot as Model Provider¶
You can use Copilot-accessible models for all LLM calls in aitest — judge assertions, AI insights, scoring, and prompt optimization — without needing a separate Azure or OpenAI subscription.
Use model names directly (no prefix needed):
```shell
# AI insights report
pytest tests/ --aitest-summary-model=gpt-5-mini --aitest-html=report.html

# LLM assertions and scoring
pytest tests/ --llm-model=copilot/gpt-5-mini
```
This routes calls through the Copilot SDK, authenticated via gh auth login or GITHUB_TOKEN. Available models are whatever your Copilot subscription provides (e.g., gpt-5-mini, gpt-5.2, claude-opus-4.5).
Prompt Optimization with Copilot¶
optimize_instruction works with any model provider, including Copilot:
```python
from pytest_skill_engineering import optimize_instruction


async def test_optimize_system_prompt(optimize_instruction):
    result = await optimize_instruction(
        instruction="You are a helpful assistant.",
        test_cases=[
            {"prompt": "Transfer $100 to savings", "expected": "uses transfer tool"},
        ],
        judge_model="copilot/gpt-5-mini",
    )
    print(result.improved_instruction)
```
Integration tests: integration_judge_model fixture¶
When writing integration tests that need an auxiliary LLM (for judge assertions, optimizer, etc.), use the integration_judge_model fixture instead of hard-coding a provider. It fails loudly if no provider is reachable.
```python
# tests/integration/conftest.py or your conftest
# The fixture is provided by pytest_skill_engineering and probes providers automatically.
# Override with env var if needed:
#   AITEST_INTEGRATION_JUDGE_MODEL=copilot/gpt-5-mini pytest ...


async def test_my_optimizer(optimize_instruction, integration_judge_model):
    result = await optimize_instruction(
        instruction="...",
        test_cases=[...],
        judge_model=integration_judge_model,  # discovered at runtime
    )
    assert result.improved_instruction
```
Provider probe order (first reachable wins):

1. `AITEST_INTEGRATION_JUDGE_MODEL` env var override
2. Azure (`AZURE_API_BASE` or `AZURE_OPENAI_ENDPOINT` + auth)
3. OpenAI (`OPENAI_API_KEY`)
4. Copilot (`gh auth login` or `GITHUB_TOKEN`)
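The probe order above can be sketched in plain Python (an illustrative simplification, not the plugin's actual code; the `AZURE_API_KEY` auth check and the returned model names are assumptions for the example):

```python
def pick_judge_model(env: dict) -> str:
    """Return a judge model id following the documented probe order."""
    # 1. Explicit override always wins.
    if env.get("AITEST_INTEGRATION_JUDGE_MODEL"):
        return env["AITEST_INTEGRATION_JUDGE_MODEL"]
    # 2. Azure: an endpoint plus auth (key name here is an assumption).
    if (env.get("AZURE_API_BASE") or env.get("AZURE_OPENAI_ENDPOINT")) and env.get("AZURE_API_KEY"):
        return "azure/gpt-5-mini"  # placeholder deployment name
    # 3. OpenAI API key.
    if env.get("OPENAI_API_KEY"):
        return "gpt-5-mini"
    # 4. Copilot auth (gh auth login token or GITHUB_TOKEN).
    if env.get("GITHUB_TOKEN"):
        return "copilot/gpt-5-mini"
    # Nothing reachable: fail loudly, mirroring the fixture's behavior.
    raise RuntimeError(
        "No LLM provider available. Set AITEST_INTEGRATION_JUDGE_MODEL "
        "or configure Azure/OpenAI/Copilot credentials."
    )
```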
The test fails (not skips) if none are available:

```
FAILED - No LLM provider available. Set AITEST_INTEGRATION_JUDGE_MODEL or configure Azure/OpenAI/Copilot credentials.
```
Force a specific provider:
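For example, pinning the judge model through the override env var described above:

```shell
AITEST_INTEGRATION_JUDGE_MODEL=copilot/gpt-5-mini pytest tests/
```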