# Comparing Configurations
The power of pytest-skill-engineering lies in comparing different configurations to find what works best: models, skill versions, or custom agent files.
## Pattern 1: Explicit Configurations
Define agents with meaningful names when testing distinct approaches:
```python
import pytest

from pytest_skill_engineering.copilot import CopilotEval
from pytest_skill_engineering.core.evals import load_custom_agent

# Compare: no skill vs with skill
agent_baseline = CopilotEval(
    name="baseline",
    instructions="You are a banking assistant.",
)
agent_with_skill = CopilotEval(
    name="with-skill",
    instructions="You are a banking assistant.",
    skill_directories=["skills/financial-advisor"],
)

# Compare: two versions of a custom agent file
agent_v1 = CopilotEval(
    name="advisor-v1",
    custom_agents=[load_custom_agent(".github/agents/advisor-v1.agent.md")],
)
agent_v2 = CopilotEval(
    name="advisor-v2",
    custom_agents=[load_custom_agent(".github/agents/advisor-v2.agent.md")],
)

AGENTS = [agent_baseline, agent_with_skill, agent_v1, agent_v2]


@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_query(copilot_eval, agent):
    """Which configuration handles balance queries best?"""
    result = await copilot_eval(agent, "What's my checking balance?")
    assert result.success
```
Use explicit configurations when:
- Testing conceptually different approaches (baseline vs skill, v1 vs v2)
- The names themselves carry meaning ("baseline", "with-skill")
- You want full control over each configuration
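The `ids=lambda a: a.name` hook above is plain pytest: each agent's name becomes the readable suffix of a test ID. A minimal sketch (using a hypothetical stand-in dataclass rather than `CopilotEval`) shows the IDs you would see in the test run:

```python
from dataclasses import dataclass


# Hypothetical stand-in for CopilotEval, just to illustrate id derivation
@dataclass
class Agent:
    name: str


AGENTS = [Agent("baseline"), Agent("with-skill"), Agent("advisor-v1"), Agent("advisor-v2")]

# pytest renders each parametrized case as test_balance_query[<id>]
test_ids = [f"test_balance_query[{agent.name}]" for agent in AGENTS]
print(test_ids[0])  # test_balance_query[baseline]
```

Meaningful names pay off here: a failure report reading `test_balance_query[advisor-v2]` tells you which configuration broke at a glance.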
## Pattern 2: Generated Configurations
Generate configurations from every combination of dimensions for systematic testing:
```python
from pathlib import Path

import pytest

from pytest_skill_engineering.copilot import CopilotEval

MODELS = ["gpt-5-mini", "gpt-4.1"]
# .name keeps the full directory name (a .stem would truncate "v1.2" to "v1")
SKILL_VERSIONS = [path.name for path in Path("skills").iterdir() if path.is_dir()]

# Generate all combinations: 2 models × N skill versions
AGENTS = [
    CopilotEval(
        name=f"{model}-{skill_name}",
        model=model,
        skill_directories=[f"skills/{skill_name}"],
    )
    for model in MODELS
    for skill_name in SKILL_VERSIONS
]


@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_query(copilot_eval, agent):
    """Test with different model/skill combinations."""
    result = await copilot_eval(agent, "What's my checking balance?")
    assert result.success
```
Use generated configurations when:
- Testing all combinations systematically
- Looking for interactions (e.g., "skill v2 works with gpt-4.1 but fails with gpt-5-mini")
- Comparing multiple dimensions at once
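To see what the comprehension above produces, here is a quick sketch assuming a hypothetical `skills/` directory containing `v1/` and `v2/`:

```python
from itertools import product

MODELS = ["gpt-5-mini", "gpt-4.1"]
SKILL_VERSIONS = ["v1", "v2"]  # assumption: skills/ holds v1/ and v2/

# The same cross product the comprehension builds, as (name, model, skill dir) rows
configs = [
    (f"{model}-{skill}", model, f"skills/{skill}")
    for model, skill in product(MODELS, SKILL_VERSIONS)
]
for name, model, skill_dir in configs:
    print(name, model, skill_dir)
```

Two models times two skill versions yields four agents, with names like `gpt-5-mini-v1`; adding a third dimension (say, temperature) just means one more `for` clause, so the agent count multiplies accordingly.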
## What the Report Shows
The report shows an Eval Leaderboard (auto-detected when multiple agents are tested):
| Eval | Pass Rate | Tokens | Cost |
|---|---|---|---|
| gpt-5-mini-v2 | 100% | 747 | $0.002 |
| gpt-4.1-v2 | 100% | 560 | $0.008 |
| gpt-5-mini-v1 | 90% | 1,203 | $0.004 |
| gpt-4.1-v1 | 90% | 892 | $0.012 |
**Winning eval:** the configuration with the highest pass rate; ties are broken by lowest cost.
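The ranking rule can be sketched as a plain sort, using the leaderboard above as data (numbers copied from the report, not recomputed):

```python
rows = [
    {"eval": "gpt-5-mini-v2", "pass_rate": 1.00, "cost": 0.002},
    {"eval": "gpt-4.1-v2",    "pass_rate": 1.00, "cost": 0.008},
    {"eval": "gpt-5-mini-v1", "pass_rate": 0.90, "cost": 0.004},
    {"eval": "gpt-4.1-v1",    "pass_rate": 0.90, "cost": 0.012},
]

# Highest pass rate first; lowest cost breaks ties
leaderboard = sorted(rows, key=lambda r: (-r["pass_rate"], r["cost"]))
winner = leaderboard[0]["eval"]
print(winner)  # gpt-5-mini-v2
```

Here both `v2` configurations pass 100% of cases, so cost decides: `gpt-5-mini-v2` wins at a quarter of `gpt-4.1-v2`'s cost.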
This helps you answer:
- "Does skill v2 outperform v1?"
- "Can I use a cheaper model with my tools?"
- "Which custom agent instructions produce better behavior?"
## Next Steps
- Custom Agents — A/B test agent instruction files
- A/B Testing Servers — Compare server implementations
- Multi-Turn Sessions — Test conversations with context
📁 **Real Examples:**

- `pydantic/test_01_basic.py` — Single agent workflows
- `pydantic/test_04_matrix.py` — Multi-dimension comparison