Evals¶
The core concept in pytest-skill-engineering.
What is an Eval?¶
An Eval (specifically CopilotEval) is a test configuration that bundles everything needed to run a test using the real GitHub Copilot coding agent.
```python
from pytest_skill_engineering.copilot import CopilotEval

agent = CopilotEval(
    name="banking-test",
    instructions="You are a banking assistant.",
    skill_directories=["skills/banking-advisor"],  # Optional
    max_turns=10,
)
```
The Eval is NOT What You Test¶
You don't test evals. You USE evals to test:
| Target | Question |
|---|---|
| MCP Server | Can Copilot understand and use my tools? |
| Agent Skill | Does this domain knowledge improve performance? |
| Custom Agent | Do these .agent.md instructions trigger proper subagent dispatch? |
| Tool Descriptions | Can Copilot discover and use tools correctly? |
The Eval is the test harness that bundles the GitHub Copilot coding agent with the configuration you want to evaluate.
CopilotEval Components¶
| Component | Required | Example |
|---|---|---|
| Name | ✓ | "banking-test" |
| Instructions | Optional | "You are a helpful assistant." |
| Skills | Optional | skill_directories=["skills/banking"] |
| Custom Agents | Optional | custom_agents=[load_custom_agent("agents/reviewer.agent.md")] |
| Model | Optional | model="gpt-5.2" (defaults to Copilot's active model) |
| Working Directory | Optional | working_directory=str(tmp_path) |
Eval Leaderboard¶
When you test multiple evals, the report shows an Eval Leaderboard.
This happens automatically — no configuration needed. Just parametrize your tests:
```python
import pytest

from pytest_skill_engineering import Skill
from pytest_skill_engineering.copilot import CopilotEval

SKILLS = {
    "v1": Skill.from_path("skills/financial-advisor-v1"),
    "v2": Skill.from_path("skills/financial-advisor-v2"),
}

@pytest.mark.parametrize("skill_name,skill", SKILLS.items())
async def test_banking(copilot_eval, skill_name, skill):
    agent = CopilotEval(
        name=f"banking-{skill_name}",
        model="gpt-5-mini",
        skill=skill,
    )
    result = await copilot_eval(agent, "What's my checking balance?")
    assert result.success
```
The report shows:
| Eval | Pass Rate | Cost |
|---|---|---|
| gpt-5-mini (v1) | 100% | $0.002 |
| gpt-5-mini (v2) | 100% | $0.004 |
Winning Criteria¶
Winning Eval = Highest pass rate → Lowest cost (tiebreaker)
- Pass rate (primary) — 100% beats 95%, always
- Cost (tiebreaker) — Among equal pass rates, cheaper wins
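The ranking can be sketched in plain Python (illustrative data and field names, not the library's internal code): sort descending by pass rate, then ascending by cost.

```python
# Sketch of the leaderboard's winning criteria: highest pass rate wins,
# with lowest cost as the tiebreaker. Data is illustrative.
evals = [
    {"name": "gpt-5-mini (v1)", "pass_rate": 1.0, "cost": 0.002},
    {"name": "gpt-5-mini (v2)", "pass_rate": 1.0, "cost": 0.004},
    {"name": "gpt-4.1 (v1)", "pass_rate": 0.95, "cost": 0.001},
]

# Negate pass_rate so that a single ascending sort handles both criteria.
winner = min(evals, key=lambda e: (-e["pass_rate"], e["cost"]))
print(winner["name"])  # → gpt-5-mini (v1)
```

Note that the cheap-but-imperfect `gpt-4.1 (v1)` loses: cost only matters once pass rates are equal.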
Dimension Detection¶
The AI analysis detects what varies between evals to provide targeted feedback:
| What Varies | AI Feedback Focuses On |
|---|---|
| Model | Which model works best with your tools |
| Skill | Whether domain knowledge helps |
| Custom Agent | Which .agent.md instructions produce better behavior |
| Server | Which implementation is more reliable |
This is for AI analysis only — the leaderboard always appears when multiple evals are tested.
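Conceptually, detecting the varying dimension amounts to finding the configuration fields that take more than one distinct value across the evals under test. A minimal sketch (field names are illustrative, not the library's API):

```python
# Two eval configs that differ only in which skill they load.
configs = [
    {"model": "gpt-5-mini", "skill": "advisor-v1"},
    {"model": "gpt-5-mini", "skill": "advisor-v2"},
]

# A field "varies" when its values across configs are not all identical.
varying = [
    field
    for field in configs[0]
    if len({cfg[field] for cfg in configs}) > 1
]
print(varying)  # → ['skill']
```

With only `skill` varying, feedback can focus on whether the domain knowledge helps, rather than on model choice.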
Examples¶
Compare Models¶
```python
MODELS = ["gpt-5-mini", "gpt-4.1"]

@pytest.mark.parametrize("model", MODELS)
async def test_with_model(copilot_eval, model):
    agent = CopilotEval(
        name=f"banking-{model}",
        model=model,
    )
    result = await copilot_eval(agent, "What's my checking balance?")
    assert result.success
```
Compare Custom Agent Versions¶
A/B test two versions of a .agent.md file to find which instructions work better:
```python
from pathlib import Path

import pytest

from pytest_skill_engineering.copilot import CopilotEval
from pytest_skill_engineering.core.evals import load_custom_agent

AGENT_VERSIONS = {
    path.stem: load_custom_agent(path)
    for path in Path(".github/agents").glob("reviewer-*.agent.md")
}

@pytest.mark.parametrize("name,agent_def", AGENT_VERSIONS.items())
async def test_reviewer(copilot_eval, name, agent_def):
    agent = CopilotEval(
        name=f"reviewer-{name}",
        custom_agents=[agent_def],
    )
    result = await copilot_eval(agent, "Review src/auth.py for security issues")
    assert result.success
```
Compare Multiple Dimensions¶
```python
MODELS = ["gpt-5-mini", "gpt-4.1"]
SKILLS = {
    "v1": Skill.from_path("skills/advisor-v1"),
    "v2": Skill.from_path("skills/advisor-v2"),
}

@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("skill_name,skill", SKILLS.items())
async def test_combinations(copilot_eval, model, skill_name, skill):
    agent = CopilotEval(
        name=f"banking-{model}-{skill_name}",
        model=model,
        skill=skill,
    )
    result = await copilot_eval(agent, "What's my checking balance?")
    assert result.success
```
Custom Agents¶
A custom agent is a specialized sub-agent defined in a .agent.md or .md file with YAML frontmatter (name, description, tools) and a markdown prompt body.
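For example, a minimal reviewer agent file might look like this (all field values are illustrative):

```markdown
---
name: reviewer
description: Reviews code changes for security issues
tools: ["read", "grep"]
---

You are a security-focused code reviewer. Examine the requested
files and report any vulnerabilities you find.
```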
```python
from pytest_skill_engineering.copilot import CopilotEval
from pytest_skill_engineering.core.evals import load_custom_agent

# Load and test a custom agent via CopilotEval
agent = CopilotEval(
    name="reviewer-test",
    custom_agents=[load_custom_agent(".github/agents/reviewer.agent.md")],
)
```
The `tools` frontmatter field maps to `allowed_tools`, restricting which tools the agent can call. See Custom Agents for a full guide.
Next Steps¶
- Choosing a Test Harness — `Eval` vs `CopilotEval`: full trade-off guide
- Comparing Configurations — More comparison patterns
- A/B Testing Servers — Test server versions
- AI Analysis — What the AI evaluation produces