# Eval Skills
An Eval Skill is a domain knowledge module following the agentskills.io specification. Skills provide:
- Instructions — Domain knowledge and behavioral guidelines for the agent
- References — On-demand documents the agent can look up
## Creating a Skill

A skill is a directory with a `SKILL.md` file:

```
financial-advisor/
├── SKILL.md              # Instructions (required)
└── references/           # On-demand lookup docs (optional)
    └── budgeting-guide.md
```
### SKILL.md Format

```markdown
---
name: financial-advisor
description: Guidelines for personal finance management
---

# Financial Advisor Guidelines

## Budget Analysis

- Follow the 50/30/20 rule: 50% needs, 30% wants, 20% savings
- Emergency fund should cover 3-6 months of expenses
- Track spending categories: housing, food, transport, entertainment

## Red Flags

- Savings below 10% of income
- No emergency fund
- High-interest debt accumulating

For detailed budgeting advice, use the reference document.
```
## Skill References

References are documents the agent can look up on demand rather than keeping in context at all times. When a skill has a `references/` directory, two virtual tools are automatically injected:

| Tool | Description |
|---|---|
| `list_skill_references` | Lists available reference documents |
| `read_skill_reference` | Reads a specific document by filename |
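Conceptually, these behave like thin filesystem helpers over the skill's `references/` directory. A minimal sketch of the assumed behavior (the real tools are injected by the framework; these signatures are illustrative only):

```python
from pathlib import Path

REFERENCES = Path("skills/financial-advisor/references")

def list_skill_references() -> list[str]:
    # Assumed behavior: enumerate the documents available for lookup.
    return sorted(p.name for p in REFERENCES.iterdir() if p.is_file())

def read_skill_reference(filename: str) -> str:
    # Assumed behavior: return one document's full text by filename.
    return (REFERENCES / filename).read_text()
```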
### Example Reference Document

```markdown
# Budgeting Guide

## The 50/30/20 Rule

- 50% Needs: rent, utilities, groceries, insurance
- 30% Wants: dining out, entertainment, shopping
- 20% Savings: emergency fund, investments, debt payoff

## Building an Emergency Fund

- Start with a $1,000 mini-fund
- Build up to 3 months of expenses
- Keep it in a high-yield savings account
- Don't invest the emergency fund

...
```
## How the Eval Uses References

When you tell the agent to "use the reference document for budgeting advice", it will:

1. Call `list_skill_references()` → sees `budgeting-guide.md`
2. Call `read_skill_reference(filename="budgeting-guide.md")` → gets the document content
3. Use that content to formulate a detailed response
This keeps your base prompt lean while providing detailed information when needed.
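A quick way to exercise this path in a test is to prompt for the reference explicitly. A sketch using only the `CopilotEval` agent and `copilot_eval` fixture shown elsewhere on this page; the prompt wording is illustrative:

```python
from pytest_skill_engineering.copilot import CopilotEval

agent = CopilotEval(
    name="with-skill",
    skill_directories=["skills/financial-advisor"],
)

async def test_reference_lookup(copilot_eval):
    # Nudge the agent toward the reference tools explicitly.
    result = await copilot_eval(
        agent,
        "Use the reference document for budgeting advice: "
        "how do I build an emergency fund?",
    )
    assert result.success
```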
## When to Use References vs Instructions
| Use Instructions (SKILL.md) | Use References |
|---|---|
| Core decision logic | Detailed lookup tables |
| Always-needed context | Supplementary details |
| Short, critical rules | Long documentation |
| < 500 tokens | > 500 tokens per doc |
Example: Put budget analysis rules in `SKILL.md`, but detailed budgeting breakdowns in `references/budgeting-guide.md`.
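If you want a rough guardrail for that ~500-token threshold, a tokenizer check works. A sketch, with `tiktoken` as an assumed tokenizer (any tokenizer gives a usable estimate):

```python
from pathlib import Path

import tiktoken  # assumption: any tokenizer works for a rough estimate

enc = tiktoken.get_encoding("cl100k_base")
skill_md = Path("skills/financial-advisor/SKILL.md").read_text()

# Heuristic from the table above: keep SKILL.md under ~500 tokens.
if len(enc.encode(skill_md)) > 500:
    print("SKILL.md is getting heavy; consider moving detail into references/")
```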
## Using a Skill

```python
from pytest_skill_engineering.copilot import CopilotEval

agent = CopilotEval(
    name="with-skill",
    skill_directories=["skills/financial-advisor"],
)
```
## Testing Skill Effectiveness
Compare agents with and without skills:
```python
import pytest

from pytest_skill_engineering.copilot import CopilotEval

agent_without_skill = CopilotEval(
    name="without-skill",
    instructions="You are a banking assistant.",
)

agent_with_skill = CopilotEval(
    name="with-skill",
    instructions="You are a banking assistant.",
    skill_directories=["skills/financial-advisor"],
)

AGENTS = [agent_without_skill, agent_with_skill]


@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_financial_advice(copilot_eval, agent):
    """Does the skill improve financial recommendations?"""
    result = await copilot_eval(
        agent,
        "I have $5,000 to allocate. How should I split it between needs, savings, and wants?",
    )
    assert result.success
```
The report shows whether the skill improves performance.
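Beyond `result.success`, you can also assert on the response content. A sketch, assuming the skill's 50/30/20 guidance should surface in the answer (`result.final_response` is the same field used in the Copilot example below):

```python
# Inside test_financial_advice, after `assert result.success`:
response = result.final_response.lower()

# With the skill loaded, the 50/30/20 guidance should shape the answer
# ($2,500 needs / $1,500 wants / $1,000 savings for $5,000).
assert "50/30/20" in response or "2,500" in response
```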
## Next Steps
- Comparing Configurations — Systematic testing patterns
- Multi-Turn Sessions — Conversations with context
📁 Real Examples:

- `pydantic/test_05_skills.py` — Skill loading, metadata, and before/after comparisons
## Copilot Skills

Are you testing a skill for GitHub Copilot? Use `CopilotEval` with `skill_directories` instead — not `Eval` + `Skill.from_path()`.
The `Eval` + `Skill` approach above tests whether a generic LLM can leverage skill content via injected tools. It does not test how GitHub Copilot itself loads and uses the skill.

When your skill is built for Copilot (e.g. distributed via `npx skills add`), you want the real Copilot agent to load it — exactly as end users will experience it:
```python
from pytest_skill_engineering.copilot import CopilotEval

async def test_skill_presents_scenarios(copilot_eval):
    agent = CopilotEval(
        name="with-skill",
        skill_directories=["skills/my-skill"],  # loads SKILL.md + references/
        max_turns=10,
    )

    result = await copilot_eval(agent, "What can you help me with?")

    assert result.success
    assert "baseline" in result.final_response.lower()
```
Copilot loads the skill natively — no synthetic tool injection. MCP servers configured in `~/.copilot/mcp-config.json` (or via the session's `mcp_servers`) are available automatically.
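As a sketch of shape only (the exact schema depends on your Copilot CLI version, and the server name and command here are hypothetical), a minimal `~/.copilot/mcp-config.json` could be generated like this:

```python
import json
from pathlib import Path

# Hypothetical server entry; substitute your own MCP server command.
config = {
    "mcpServers": {
        "my-server": {
            "command": "npx",
            "args": ["-y", "my-mcp-server"],
        }
    }
}

path = Path.home() / ".copilot" / "mcp-config.json"
path.write_text(json.dumps(config, indent=2))
```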
See Test Coding Agents for a full example.