Evals

The core concept in pytest-skill-engineering.

What is an Eval?

An Eval (specifically CopilotEval) is a test configuration that bundles everything needed to run a test using the real GitHub Copilot coding agent.

CopilotEval = Copilot Agent + Skills + Custom Agents + Instructions
from pytest_skill_engineering.copilot import CopilotEval

agent = CopilotEval(
    name="banking-test",
    instructions="You are a banking assistant.",
    skill_directories=["skills/banking-advisor"],  # Optional
    max_turns=10,
)

The Eval is NOT What You Test

You don't test evals. You USE evals to test:

| Target | Question |
|---|---|
| MCP Server | Can Copilot understand and use my tools? |
| Agent Skill | Does this domain knowledge improve performance? |
| Custom Agent | Do these `.agent.md` instructions trigger proper subagent dispatch? |
| Tool Descriptions | Can Copilot discover and use tools correctly? |

The Eval is the test harness that bundles the GitHub Copilot coding agent with the configuration you want to evaluate.

CopilotEval Components

| Component | Required | Example |
|---|---|---|
| Name | Yes | `"banking-test"` |
| Instructions | Optional | `"You are a helpful assistant."` |
| Skills | Optional | `skill_directories=["skills/banking"]` |
| Custom Agents | Optional | `custom_agents=[load_custom_agent("agents/reviewer.agent.md")]` |
| Model | Optional | `model="gpt-5.2"` (defaults to Copilot's active model) |
| Working Directory | Optional | `working_directory=str(tmp_path)` |
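Putting the optional components together, a fully configured eval might look like the sketch below. The paths and the `agents/reviewer.agent.md` file are placeholders, not files shipped with the plugin:

```python
from pytest_skill_engineering.copilot import CopilotEval
from pytest_skill_engineering.core.evals import load_custom_agent

# All optional components set at once; only `name` is required.
agent = CopilotEval(
    name="banking-full",
    instructions="You are a banking assistant.",
    skill_directories=["skills/banking"],                          # placeholder path
    custom_agents=[load_custom_agent("agents/reviewer.agent.md")], # placeholder path
    model="gpt-5.2",
    working_directory="/tmp/eval-workdir",                         # placeholder path
)
```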

Eval Leaderboard

When you test multiple evals, the report shows an Eval Leaderboard.

This happens automatically — no configuration needed. Just parametrize your tests:

from pathlib import Path
import pytest
from pytest_skill_engineering.copilot import CopilotEval
from pytest_skill_engineering import Skill

SKILLS = {
    "v1": Skill.from_path("skills/financial-advisor-v1"),
    "v2": Skill.from_path("skills/financial-advisor-v2"),
}

@pytest.mark.parametrize("skill_name,skill", SKILLS.items())
async def test_banking(copilot_eval, skill_name, skill):
    agent = CopilotEval(
        name=f"banking-{skill_name}",
        model="gpt-5-mini",
        skill=skill,
    )
    result = await copilot_eval(agent, "What's my checking balance?")
    assert result.success

The report shows:

| Eval | Pass Rate | Cost |
|---|---|---|
| gpt-5-mini (v1) | 100% | $0.002 |
| gpt-5-mini (v2) | 100% | $0.004 |

Winning Criteria

Winning Eval = Highest pass rate → Lowest cost (tiebreaker)

  1. Pass rate (primary) — 100% beats 95%, always
  2. Cost (tiebreaker) — Among equal pass rates, cheaper wins
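The two-step ranking amounts to a lexicographic sort: pass rate descending, then cost ascending. A minimal sketch (the `EvalResult` class and `rank` helper here are illustrative, not part of the plugin's API):

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    name: str
    pass_rate: float  # fraction of passing runs, 0.0-1.0
    cost: float       # total cost in dollars

def rank(results):
    # Highest pass rate first; among equal pass rates, lowest cost first.
    return sorted(results, key=lambda r: (-r.pass_rate, r.cost))

results = [
    EvalResult("gpt-5-mini (v2)", 1.0, 0.004),
    EvalResult("gpt-5-mini (v1)", 1.0, 0.002),
]
print(rank(results)[0].name)  # "gpt-5-mini (v1)" — equal pass rate, lower cost wins
```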

Dimension Detection

The AI analysis detects what varies between evals to provide targeted feedback:

| What Varies | AI Feedback Focuses On |
|---|---|
| Model | Which model works best with your tools |
| Skill | Whether domain knowledge helps |
| Custom Agent | Which `.agent.md` instructions produce better behavior |
| Server | Which implementation is more reliable |

This is for AI analysis only — the leaderboard always appears when multiple evals are tested.
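Conceptually, dimension detection comes down to finding which configuration fields differ across the evals under test. A rough sketch of the idea, not the plugin's actual implementation:

```python
def varying_dimensions(configs):
    """Return the config keys whose values differ across evals."""
    keys = set().union(*configs)  # union of all keys seen in any config
    return {k for k in keys if len({c.get(k) for c in configs}) > 1}

# Two evals that differ only in which skill version they load:
configs = [
    {"model": "gpt-5-mini", "skill": "advisor-v1"},
    {"model": "gpt-5-mini", "skill": "advisor-v2"},
]
print(varying_dimensions(configs))  # {'skill'} — feedback focuses on the skill
```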

Examples

Compare Models

MODELS = ["gpt-5-mini", "gpt-4.1"]

@pytest.mark.parametrize("model", MODELS)
async def test_with_model(copilot_eval, model):
    agent = CopilotEval(
        name=f"banking-{model}",
        model=model,
    )
    result = await copilot_eval(agent, "What's my checking balance?")
    assert result.success

Compare Custom Agent Versions

A/B test two versions of a .agent.md file to find which instructions work better:

from pathlib import Path
from pytest_skill_engineering.copilot import CopilotEval
from pytest_skill_engineering.core.evals import load_custom_agent

AGENT_VERSIONS = {
    path.stem: load_custom_agent(path)
    for path in Path(".github/agents").glob("reviewer-*.agent.md")
}

@pytest.mark.parametrize("name,agent_def", AGENT_VERSIONS.items())
async def test_reviewer(copilot_eval, name, agent_def):
    agent = CopilotEval(
        name=f"reviewer-{name}",
        custom_agents=[agent_def],
    )
    result = await copilot_eval(agent, "Review src/auth.py for security issues")
    assert result.success

Compare Multiple Dimensions

MODELS = ["gpt-5-mini", "gpt-4.1"]
SKILLS = {
    "v1": Skill.from_path("skills/advisor-v1"),
    "v2": Skill.from_path("skills/advisor-v2"),
}

@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("skill_name,skill", SKILLS.items())
async def test_combinations(copilot_eval, model, skill_name, skill):
    agent = CopilotEval(
        name=f"banking-{model}-{skill_name}",
        model=model,
        skill=skill,
    )
    result = await copilot_eval(agent, "What's my checking balance?")
    assert result.success

Custom Agents

A custom agent is a specialized sub-agent defined in a .agent.md or .md file with YAML frontmatter (name, description, tools) and a markdown prompt body.

from pytest_skill_engineering.copilot import CopilotEval
from pytest_skill_engineering.core.evals import load_custom_agent

# Load and test a custom agent via CopilotEval
agent = CopilotEval(
    name="reviewer-test",
    custom_agents=[load_custom_agent(".github/agents/reviewer.agent.md")],
)

The tools frontmatter field maps to allowed_tools — restricting which tools the agent can call. See Custom Agents for a full guide.
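To illustrate the file layout, here is a rough sketch of how the frontmatter and prompt body separate. The `split_frontmatter` helper is hypothetical, written only to show the format; it is not the library's parser:

```python
import re

# A minimal .agent.md file: YAML frontmatter between --- fences, then the prompt body.
AGENT_MD = """\
---
name: reviewer
description: Reviews code for security issues
tools: ['read', 'grep']
---
You are a meticulous security reviewer. Flag unsafe patterns.
"""

def split_frontmatter(text):
    """Split an .agent.md file into (frontmatter fields, markdown body)."""
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", text, re.DOTALL)
    if not match:
        return {}, text
    frontmatter, body = match.groups()
    fields = {}
    for line in frontmatter.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields, body

fields, body = split_frontmatter(AGENT_MD)
print(fields["name"])   # reviewer
print(fields["tools"])  # the value that would map to allowed_tools
```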

Next Steps