Test Coding Agents¶
pytest-skill-engineering can test real coding agents like GitHub Copilot — not just synthetic agents backed by MCP servers.
Install¶
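A typical install command (a sketch only — this assumes Copilot support ships as an optional `copilot` extra; check your package index for the exact name):

```shell
pip install "pytest-skill-engineering[copilot]"
```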
This installs the github-copilot-sdk package alongside pytest-skill-engineering.
Quick Start¶
```python
import pytest

from pytest_skill_engineering.copilot import CopilotEval


@pytest.mark.copilot
async def test_creates_module(copilot_eval, tmp_path):
    agent = CopilotEval(
        name="coder",
        instructions="Create production-quality Python code.",
        working_directory=str(tmp_path),
    )
    result = await copilot_eval(
        agent,
        "Create calculator.py with add, subtract, multiply, divide functions.",
    )
    assert result.success
    assert (tmp_path / "calculator.py").exists()
```
CopilotEval Configuration¶
CopilotEval is the configuration object for Copilot SDK sessions:
```python
from pytest_skill_engineering.copilot import CopilotEval

agent = CopilotEval(
    name="my-agent",                      # Required: unique agent name
    instructions="Your instructions.",    # System prompt for the agent
    model="gpt-5.2",                      # Optional: model override
    working_directory=str(tmp_path),      # Working directory for file ops
    max_turns=25,                         # Max conversation turns
    timeout_s=300.0,                      # Timeout in seconds
    excluded_tools=["run_in_terminal"],   # Tools to block
    skill_directories=["./skills"],       # Skill directories to load
    reasoning_effort="high",              # Reasoning effort level
    custom_agents=[                       # Custom subagents
        {
            "name": "test-writer",
            "prompt": "Write pytest tests.",
            "description": "Writes unit tests.",
        }
    ],
)
```
Fixtures¶
copilot_eval¶
Runs a single Copilot agent against a task.

It returns a `CopilotResult` with:

- `result.success` — Whether the session completed without errors
- `result.error` — Error message if failed
- `result.final_response` — Agent's final text response
- `result.all_tool_calls` — List of `ToolCall` objects
- `result.tool_was_called("name")` — Check if a tool was called
- `result.tool_names_called` — Set of tool names used
- `result.file("path")` — Read a file from the working directory
- `result.files_created` / `result.files_modified` — File tracking
- `result.usage` — Token usage info
- `result.total_cost_usd` — Estimated cost
- `result.subagent_invocations` — Custom agent dispatch events
- `result.reasoning_traces` — Reasoning effort traces
- `result.raw_events` — Full SDK event stream
ab_run¶
Runs two agents against the same task in isolated directories:
```python
@pytest.mark.copilot
async def test_ab_comparison(ab_run, tmp_path):
    baseline = CopilotEval(name="baseline", instructions="Write minimal code.")
    treatment = CopilotEval(name="treatment", instructions="Write documented code.")

    b, t = await ab_run(baseline, treatment, "Create calculator.py with add and subtract.")

    assert b.success and t.success
    assert '"""' in t.file("calculator.py")  # Treatment has docstrings
```
Custom Agents¶
Define subagents that the main agent can delegate to:
```python
agent = CopilotEval(
    name="orchestrator",
    instructions="Delegate test writing to the test-writer agent.",
    custom_agents=[
        {
            "name": "test-writer",
            "prompt": "Write pytest tests for the given code.",
            "description": "Writes unit tests.",
            "tools": ["create_file", "read_file"],  # Optional tool restriction
        }
    ],
)
```
Loading from a file¶
Use load_custom_agent() to load a .agent.md file into a custom agent dict:
```python
import pytest

from pytest_skill_engineering import load_custom_agent
from pytest_skill_engineering.copilot import CopilotEval

test_writer = load_custom_agent(".github/agents/test-writer.agent.md")


@pytest.mark.copilot
async def test_orchestrator_delegates_test_writing(copilot_eval):
    agent = CopilotEval(
        name="orchestrator",
        instructions="Delegate test writing to the test-writer agent.",
        custom_agents=[test_writer],
    )
    result = await copilot_eval(agent, "Write unit tests for calculator.py")
    assert result.success
```
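For reference, such a file might look like this (a hypothetical sketch — it assumes YAML frontmatter supplies the `name`, `description`, and `tools` fields and the markdown body becomes the prompt; consult the Copilot custom-agent docs for the exact schema):

```markdown
---
name: test-writer
description: Writes unit tests.
tools: ["create_file", "read_file"]
---
Write pytest tests for the given code.
```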
Loading all agents from a directory¶
Use load_custom_agents() to load all .agent.md files from a directory:
```python
import pytest

from pytest_skill_engineering import load_custom_agents
from pytest_skill_engineering.copilot import CopilotEval

# Load all sub-agents except the orchestrator
subagents = load_custom_agents(
    ".github/agents/",
    exclude={"orchestrator"},
)


@pytest.mark.copilot
async def test_full_agent_team(copilot_eval):
    agent = CopilotEval(
        name="orchestrator",
        instructions="Delegate tasks to the appropriate specialist.",
        custom_agents=subagents,
    )
    result = await copilot_eval(agent, "Create and test a calculator module.")
    assert result.success
```
Both functions are available directly from `pytest_skill_engineering`.
Asserting on result.subagent_invocations¶
CopilotResult.subagent_invocations tracks which sub-agents were dispatched:
```python
async def test_correct_subagent_is_invoked(copilot_eval):
    agents = load_custom_agents(".github/agents/")
    agent = CopilotEval(
        name="orchestrator",
        instructions="Use specialist agents for each task.",
        custom_agents=agents,
    )
    result = await copilot_eval(agent, "Write unit tests for the billing module.")

    invoked = [s.eval_name for s in result.subagent_invocations]
    assert "test-writer" in invoked
```
Testing Skills¶
Skills are domain knowledge packages loaded from a directory containing a SKILL.md file. Use skill_directories to inject a skill into a Copilot session — this is the right way to test Copilot skills, as it exercises the same loading path end users experience.
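A skill directory is assumed to look roughly like this (layout inferred from the example below; the `references/` folder is optional supporting material):

```
skills/my-skill/
├── SKILL.md
└── references/
```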
```python
from pytest_skill_engineering.copilot import CopilotEval


async def test_skill_presents_scenarios(copilot_eval):
    agent = CopilotEval(
        name="with-skill",
        skill_directories=["skills/my-skill"],  # loads SKILL.md + references/
        max_turns=10,
    )
    result = await copilot_eval(agent, "What can you help me with?")
    assert result.success
    assert "scenario-a" in result.final_response.lower()
```
Comparing with and without skill¶
```python
async def test_skill_improves_routing(copilot_eval):
    without = CopilotEval(name="no-skill", max_turns=10)
    with_skill = CopilotEval(
        name="with-skill",
        skill_directories=["skills/my-skill"],
        max_turns=10,
    )
    r_without = await copilot_eval(without, "Get the ACR baseline for TPID 12345.")
    r_with = await copilot_eval(with_skill, "Get the ACR baseline for TPID 12345.")

    # Skill should cause the agent to call the right tool
    assert r_with.tool_was_called("ExecuteQueries")
```
When to use CopilotEval vs Eval + Skill¶
| | `CopilotEval` + `skill_directories` | `Eval` + `Skill.from_path()` |
|---|---|---|
| What runs the agent | Real GitHub Copilot (CLI SDK) | PydanticAI synthetic loop |
| Skill loading | Native Copilot skill loading | Injected as virtual reference tools |
| MCP auth | Handled by Copilot CLI (OAuth cached) | Managed by test process (token required) |
| Use when | Testing a Copilot skill end-to-end | Testing MCP servers / tool descriptions |

Rule of thumb: If you built a `SKILL.md` for Copilot users, test it with `CopilotEval`. If you're testing whether your MCP server tools are discoverable and usable, use `Eval`.
Skill Directories¶
Load skill files that inject domain knowledge into the agent:
```python
agent = CopilotEval(
    name="with-skills",
    instructions="Apply all standards from your skills.",
    skill_directories=["./skills"],
)
```
Tool Restrictions¶
Block specific tools to control agent behavior:
```python
agent = CopilotEval(
    name="no-terminal",
    instructions="Create files only.",
    excluded_tools=["run_in_terminal"],
)
result = await copilot_eval(agent, "Create hello.py")
assert not result.tool_was_called("run_in_terminal")
```
Reporting¶
Copilot test results flow into the same HTML report. The report auto-detects whether tests used the Copilot SDK and adapts the AI analysis accordingly.
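A minimal invocation that produces the report (assuming the `copilot` marker used throughout this page and the `--aitest-html` flag shown in the next section):

```shell
pytest tests/ -m copilot --aitest-html=report.html
```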
Copilot as Model Provider¶
You can use Copilot-accessible models for all LLM calls in aitest — judge assertions, AI insights, scoring, and prompt optimization — without needing a separate Azure or OpenAI subscription.
Use model names directly (no prefix needed):
```shell
# AI insights report
pytest tests/ --aitest-summary-model=gpt-5-mini --aitest-html=report.html

# LLM assertions and scoring
pytest tests/ --llm-model=copilot/gpt-5-mini
```
This routes calls through the Copilot SDK, authenticated via gh auth login or GITHUB_TOKEN. Available models are whatever your Copilot subscription provides (e.g., gpt-5-mini, gpt-5.2, claude-opus-4.5).
Prompt Optimization with Copilot¶
optimize_instruction works with any model provider, including Copilot:
```python
from pytest_skill_engineering import optimize_instruction


async def test_optimize_system_prompt(optimize_instruction):
    result = await optimize_instruction(
        instruction="You are a helpful assistant.",
        test_cases=[
            {"prompt": "Transfer $100 to savings", "expected": "uses transfer tool"},
        ],
        judge_model="copilot/gpt-5-mini",
    )
    print(result.improved_instruction)
```
Integration tests: integration_judge_model fixture¶
When writing integration tests that need an auxiliary LLM (for judge assertions, optimizer, etc.), use the integration_judge_model fixture instead of hard-coding a provider. It fails loudly if no provider is reachable.
```python
# tests/integration/conftest.py or your conftest
# The fixture is provided by pytest_skill_engineering and probes providers automatically.
# Override with env var if needed:
#   AITEST_INTEGRATION_JUDGE_MODEL=copilot/gpt-5-mini pytest ...


async def test_my_optimizer(optimize_instruction, integration_judge_model):
    result = await optimize_instruction(
        instruction="...",
        test_cases=[...],
        judge_model=integration_judge_model,  # discovered at runtime
    )
    assert result.improved_instruction
```
Provider probe order (first reachable wins):

1. `AITEST_INTEGRATION_JUDGE_MODEL` env var override
2. Azure (`AZURE_API_BASE` or `AZURE_OPENAI_ENDPOINT` + auth)
3. OpenAI (`OPENAI_API_KEY`)
4. Copilot (`gh auth login` or `GITHUB_TOKEN`)
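The probe order above can be sketched in plain Python (an illustrative simplification, not the plugin's actual code; the `AZURE_API_KEY` auth check and the returned model names are assumptions for the example):

```python
def pick_judge_model(env: dict) -> str:
    """Return a judge model id following the documented probe order."""
    # 1. Explicit override always wins.
    if env.get("AITEST_INTEGRATION_JUDGE_MODEL"):
        return env["AITEST_INTEGRATION_JUDGE_MODEL"]
    # 2. Azure: an endpoint plus auth (key name here is an assumption).
    if (env.get("AZURE_API_BASE") or env.get("AZURE_OPENAI_ENDPOINT")) and env.get("AZURE_API_KEY"):
        return "azure/gpt-5-mini"  # placeholder deployment name
    # 3. OpenAI API key.
    if env.get("OPENAI_API_KEY"):
        return "gpt-5-mini"
    # 4. Copilot auth (gh auth login token or GITHUB_TOKEN).
    if env.get("GITHUB_TOKEN"):
        return "copilot/gpt-5-mini"
    # Nothing reachable: fail loudly, mirroring the fixture's behavior.
    raise RuntimeError(
        "No LLM provider available. Set AITEST_INTEGRATION_JUDGE_MODEL "
        "or configure Azure/OpenAI/Copilot credentials."
    )
```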
The test fails (not skips) if none are available:

```
FAILED - No LLM provider available. Set AITEST_INTEGRATION_JUDGE_MODEL or configure Azure/OpenAI/Copilot credentials.
```
Force a specific provider:
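For example, pinning the judge model through the override env var described above:

```shell
AITEST_INTEGRATION_JUDGE_MODEL=copilot/gpt-5-mini pytest tests/
```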