# Eval Skills
An Eval Skill is a domain knowledge module following the agentskills.io specification. Skills provide:
- Instructions — Domain knowledge and behavioral guidelines for the agent
- References — On-demand documents the agent can look up
## Creating a Skill

A skill is a directory with a `SKILL.md` file:

```
financial-advisor/
├── SKILL.md              # Instructions (required)
└── references/           # On-demand lookup docs (optional)
    └── budgeting-guide.md
```
### SKILL.md Format

```markdown
---
name: financial-advisor
description: Guidelines for personal finance management
---

# Financial Advisor Guidelines

## Budget Analysis

- Follow the 50/30/20 rule: 50% needs, 30% wants, 20% savings
- Emergency fund should cover 3-6 months of expenses
- Track spending categories: housing, food, transport, entertainment

## Red Flags

- Savings below 10% of income
- No emergency fund
- High-interest debt accumulating

For detailed budgeting advice, use the reference document.
```
## Skill References

References are documents the agent can look up on demand rather than keeping in context at all times. When a skill has a `references/` directory, two virtual tools are automatically injected:

| Tool | Description |
|---|---|
| `list_skill_references` | Lists available reference documents |
| `read_skill_reference` | Reads a specific document by filename |
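Conceptually, these behave like thin filesystem helpers over the skill's `references/` directory. A minimal sketch of the assumed behavior (the real tools are injected by the framework; these signatures are illustrative only):

```python
from pathlib import Path

REFERENCES = Path("skills/financial-advisor/references")

def list_skill_references() -> list[str]:
    # Assumed behavior: enumerate the documents available for lookup.
    return sorted(p.name for p in REFERENCES.iterdir() if p.is_file())

def read_skill_reference(filename: str) -> str:
    # Assumed behavior: return one document's full text by filename.
    return (REFERENCES / filename).read_text()
```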
### Example Reference Document

```markdown
# Budgeting Guide

## The 50/30/20 Rule

- 50% Needs: rent, utilities, groceries, insurance
- 30% Wants: dining out, entertainment, shopping
- 20% Savings: emergency fund, investments, debt payoff

## Building an Emergency Fund

- Start with a $1,000 mini-fund
- Build up to 3 months of expenses
- Keep it in a high-yield savings account
- Don't invest the emergency fund

...
```
## How the Eval Uses References

When you tell the agent to "use the reference document for budgeting advice", it will:

1. Call `list_skill_references()` → sees `budgeting-guide.md`
2. Call `read_skill_reference(filename="budgeting-guide.md")` → gets the document content
3. Use that content to formulate a detailed response
This keeps your base prompt lean while providing detailed information when needed.
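A quick way to exercise this path in a test is to prompt for the reference explicitly. A sketch using only the `CopilotEval` agent and `copilot_eval` fixture shown elsewhere on this page; the prompt wording is illustrative:

```python
from pytest_skill_engineering.copilot import CopilotEval

agent = CopilotEval(
    name="with-skill",
    skill_directories=["skills/financial-advisor"],
)

async def test_reference_lookup(copilot_eval):
    # Nudge the agent toward the reference tools explicitly.
    result = await copilot_eval(
        agent,
        "Use the reference document for budgeting advice: "
        "how do I build an emergency fund?",
    )
    assert result.success
```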
## When to Use References vs Instructions
| Use Instructions (SKILL.md) | Use References |
|---|---|
| Core decision logic | Detailed lookup tables |
| Always-needed context | Supplementary details |
| Short, critical rules | Long documentation |
| < 500 tokens | > 500 tokens per doc |
Example: Put budget analysis rules in `SKILL.md`, but detailed budgeting breakdowns in `references/budgeting-guide.md`.
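If you want a rough guardrail for that ~500-token threshold, a tokenizer check works. A sketch, with `tiktoken` as an assumed tokenizer (any tokenizer gives a usable estimate):

```python
from pathlib import Path

import tiktoken  # assumption: any tokenizer works for a rough estimate

enc = tiktoken.get_encoding("cl100k_base")
skill_md = Path("skills/financial-advisor/SKILL.md").read_text()

# Heuristic from the table above: keep SKILL.md under ~500 tokens.
if len(enc.encode(skill_md)) > 500:
    print("SKILL.md is getting heavy; consider moving detail into references/")
```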
## Using a Skill

```python
from pytest_skill_engineering.copilot import CopilotEval

agent = CopilotEval(
    name="with-skill",
    skill_directories=["skills/financial-advisor"],
)
```
## Testing Skill Effectiveness
Compare agents with and without skills:
```python
import pytest

from pytest_skill_engineering.copilot import CopilotEval

agent_without_skill = CopilotEval(
    name="without-skill",
    instructions="You are a banking assistant.",
)

agent_with_skill = CopilotEval(
    name="with-skill",
    instructions="You are a banking assistant.",
    skill_directories=["skills/financial-advisor"],
)

AGENTS = [agent_without_skill, agent_with_skill]


@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_financial_advice(copilot_eval, agent):
    """Does the skill improve financial recommendations?"""
    result = await copilot_eval(
        agent,
        "I have $5,000 to allocate. How should I split it between needs, savings, and wants?",
    )
    assert result.success
```
The report shows whether the skill improves performance.
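Beyond `result.success`, you can also assert on the response content. A sketch, assuming the skill's 50/30/20 guidance should surface in the answer (`result.final_response` is the same field used in the Copilot example below):

```python
# Inside test_financial_advice, after `assert result.success`:
response = result.final_response.lower()

# With the skill loaded, the 50/30/20 guidance should shape the answer
# ($2,500 needs / $1,500 wants / $1,000 savings for $5,000).
assert "50/30/20" in response or "2,500" in response
```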
## Next Steps
- Comparing Configurations — Systematic testing patterns
- Multi-Turn Sessions — Conversations with context
📁 Real Examples:

- `pydantic/test_05_skills.py` — Skill loading, metadata, and before/after comparisons
## Copilot Skills

Are you testing a skill for GitHub Copilot? Use `CopilotEval` with `skill_directories` instead — not `Eval` + `Skill.from_path()`.
The `Eval` + `Skill` approach above tests whether a generic LLM can leverage skill content via injected tools. It does not test how GitHub Copilot itself loads and uses the skill.

When your skill is built for Copilot (e.g. distributed via `npx skills add`), you want the real Copilot agent to load it — exactly as end users will experience it:
```python
from pytest_skill_engineering.copilot import CopilotEval

async def test_skill_presents_scenarios(copilot_eval):
    agent = CopilotEval(
        name="with-skill",
        skill_directories=["skills/my-skill"],  # loads SKILL.md + references/
        max_turns=10,
    )

    result = await copilot_eval(agent, "What can you help me with?")

    assert result.success
    assert "baseline" in result.final_response.lower()
```
Copilot loads the skill natively — no synthetic tool injection. MCP servers configured in `~/.copilot/mcp-config.json` (or via the session's `mcp_servers`) are available automatically.
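As a sketch of shape only (the exact schema depends on your Copilot CLI version, and the server name and command here are hypothetical), a minimal `~/.copilot/mcp-config.json` could be generated like this:

```python
import json
from pathlib import Path

# Hypothetical server entry; substitute your own MCP server command.
config = {
    "mcpServers": {
        "my-server": {
            "command": "npx",
            "args": ["-y", "my-mcp-server"],
        }
    }
}

path = Path.home() / ".copilot" / "mcp-config.json"
path.write_text(json.dumps(config, indent=2))
```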
See Test Coding Agents for a full example.