API Reference

Auto-generated API documentation from source code.

Copilot Types

CopilotEval(name: str = 'copilot', model: str | None = None, reasoning_effort: Literal['low', 'medium', 'high', 'xhigh'] | None = None, instructions: str | None = None, system_message_mode: Literal['append', 'replace'] = 'append', working_directory: str | None = None, allowed_tools: list[str] | None = None, excluded_tools: list[str] | None = None, max_turns: int = 25, timeout_s: float = 300.0, max_retries: int = 2, retry_delay_s: float = 5.0, auto_confirm: bool = True, mcp_servers: dict[str, Any] = dict(), custom_agents: list[dict[str, Any]] = list(), skill_directories: list[str] = list(), disabled_skills: list[str] = list(), extra_config: dict[str, Any] = dict(), active_agent: str = '', hooks: dict[str, Any] = dict(), persona: 'Persona' = (lambda: _default_persona())()) dataclass

Configuration for a GitHub Copilot agent test.

Maps to the Copilot SDK's SessionConfig. User-facing field names are kept intuitive (e.g. instructions), while build_session_config() maps them to the SDK's actual system_message TypedDict.

The SDK's SessionConfig has no maxTurns field — turn limits are enforced externally by the runner via timeout_s.

Example

Minimal

CopilotEval()

With instructions and model

CopilotEval(
    name="security-reviewer",
    model="claude-sonnet-4",
    instructions="Review code for security vulnerabilities.",
)

With custom tools and references

CopilotEval(
    name="file-creator",
    instructions="Create files as requested.",
    working_directory="/tmp/workspace",
    allowed_tools=["create_file", "read_file"],
)

build_session_config() -> dict[str, Any]

Build a SessionConfig dict for the Copilot SDK.

Returns a dict compatible with CopilotClient.create_session(). Only includes non-None/non-default fields to avoid overriding SDK defaults.

SDK field mapping (Python snake_case TypedDict keys):

  • instructions → system_message: {mode, content}
  • allowed_tools → available_tools
  • excluded_tools → excluded_tools
  • reasoning_effort → reasoning_effort
  • working_directory → working_directory
  • mcp_servers → mcp_servers
  • custom_agents → custom_agents
  • skill_directories → skill_directories
  • disabled_skills → disabled_skills

Note: max_turns is NOT part of SessionConfig — the runner enforces turn limits externally.
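The mapping above can be sketched as follows. This is an illustrative stand-in for build_session_config(), not the real implementation, and only a few of the mapped fields are shown:

```python
# Hypothetical sketch of the documented field mapping: only non-None fields
# are emitted, so SDK defaults are never overridden.
def sketch_session_config(instructions=None, allowed_tools=None, reasoning_effort=None):
    config = {}
    if instructions is not None:
        # instructions -> system_message: {mode, content}
        config["system_message"] = {"mode": "append", "content": instructions}
    if allowed_tools is not None:
        # allowed_tools -> available_tools
        config["available_tools"] = allowed_tools
    if reasoning_effort is not None:
        config["reasoning_effort"] = reasoning_effort
    return config

cfg = sketch_session_config(instructions="Review for security.", allowed_tools=["read_file"])
assert cfg == {
    "system_message": {"mode": "append", "content": "Review for security."},
    "available_tools": ["read_file"],
}
```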

from_copilot_config(path: str | Path = '.', **overrides: Any) -> 'CopilotEval' classmethod

Load a CopilotEval from a directory containing Copilot config files.

Looks for the following files under path:

  • .github/copilot-instructions.md → instructions
  • .github/agents/*.agent.md → custom_agents

Point path at any directory — your production project, a dedicated test fixture project, or a shared team config repo. Any keyword argument overrides the loaded value.

Parameters:

Name Type Description Default
path str | Path

Root directory to load config from. Defaults to the current working directory.

'.'
**overrides Any

Override any CopilotEval field after loading, e.g. model="claude-opus-4.5".

{}

Returns:

Type Description
'CopilotEval'

A CopilotEval initialised from the discovered config files.

Example::

# Load from the current project (production config as baseline)
baseline = CopilotEval.from_copilot_config()

# A/B test: same config, one instruction changed
treatment = CopilotEval.from_copilot_config(
    instructions="Always add type hints.",
)

# Load from a dedicated test-fixture project
agent = CopilotEval.from_copilot_config("tests/fixtures/strict-agent")

# Load from a shared team agent library
agent = CopilotEval.from_copilot_config("/shared/team/copilot-config")

from_plugin(path: str | Path, *, model: str = '', persona: 'Persona | None' = None, instructions: str = '', working_directory: str = '', name: str = '') -> 'CopilotEval' classmethod

Create a CopilotEval from a plugin directory.

Loads the plugin's agents, skills, MCP servers, and instructions, then constructs a CopilotEval with all components wired together.

Uses load_plugin() (pytest_skill_engineering.core.plugin) to discover the plugin structure. The persona is auto-detected from the plugin path (ClaudeCodePersona for .claude/ paths, VSCodePersona otherwise) unless explicitly overridden.

Parameters:

Name Type Description Default
path str | Path

Path to the plugin directory (may contain plugin.json, or be a .github/ / .claude/ project config directory).

required
model str

Model override for the eval.

''
persona 'Persona | None'

IDE persona override. Auto-detected when None.

None
instructions str

Additional instructions to append to the plugin's discovered instructions.

''
working_directory str

Override the working directory.

''
name str

Override the eval name (defaults to plugin metadata name).

''

Returns:

Type Description
'CopilotEval'

A CopilotEval initialised from the plugin.

Example::

agent = CopilotEval.from_plugin("my-plugin/")

agent = CopilotEval.from_plugin(
    ".claude/",
    model="claude-sonnet-4",
    instructions="Focus on security reviews.",
)

from_claude_config(path: str | Path = '.', *, model: str = '', persona: 'Persona | None' = None, instructions: str = '', working_directory: str = '', name: str = 'claude-code-eval') -> 'CopilotEval' classmethod

Create a CopilotEval from a Claude Code project directory.

Scans for:

  • CLAUDE.md (project root) and .claude/CLAUDE.md → instructions
  • .claude/agents/*.md → custom agents
  • .claude/skills/ → skill directories (subdirs with SKILL.md)
  • .mcp.json → MCP server configs

Parameters:

Name Type Description Default
path str | Path

Root of the Claude Code project (default: current directory).

'.'
model str

Model override for the eval.

''
persona 'Persona | None'

IDE persona override. Defaults to ClaudeCodePersona.

None
instructions str

Additional instructions to append to the discovered CLAUDE.md content.

''
working_directory str

Override the working directory.

''
name str

Override the eval name.

'claude-code-eval'

Returns:

Type Description
'CopilotEval'

A CopilotEval initialised from the discovered config files.

Example::

# Load from the current project
agent = CopilotEval.from_claude_config()

# Load from a specific directory with model override
agent = CopilotEval.from_claude_config(
    "tests/fixtures/claude-project",
    model="claude-sonnet-4",
)

CopilotResult(turns: list[Turn] = list(), success: bool = True, error: str | None = None, duration_ms: float = 0.0, usage: list[UsageInfo] = list(), reasoning_traces: list[str] = list(), subagent_invocations: list[SubagentInvocation] = list(), permission_requested: bool = False, permissions: list[dict[str, Any]] = list(), model_used: str | None = None, total_premium_requests: float = 0.0, raw_events: list[Any] = list(), agent: CopilotEval | None = None) dataclass

Result of running a prompt against GitHub Copilot.

Captures the full event stream from the SDK, including tool calls, reasoning traces, subagent routing, permissions, and token usage.

Example

result = await copilot_eval(agent, "Create hello.py")
assert result.success
assert result.tool_was_called("create_file")
assert "hello" in result.final_response.lower()

final_response: str | None property

Get the last assistant response.

all_responses: list[str] property

Get all assistant responses.

all_tool_calls: list[ToolCall] property

Get all tool calls across all turns.

tool_names_called: set[str] property

Get set of all tool names that were called.

total_input_tokens: int property

Total input tokens across all model turns.

total_output_tokens: int property

Total output tokens across all model turns.

total_tokens: int property

Total tokens (input + output) across all model turns.

token_usage: dict[str, int] property

Token usage dict compatible with pytest-skill-engineering's EvalResult.

Keys use short names (prompt, completion, total) to match the format pytest-skill-engineering reads in its collector and generator.
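The documented dict shape can be sketched as a per-turn sum. The per-turn field names (input/output) are illustrative assumptions; only the short output keys (prompt, completion, total) come from the description above:

```python
# Sketch: sum per-turn usage into the short-key dict described above.
# The "input"/"output" names on each turn are assumed for illustration.
usage = [{"input": 120, "output": 30}, {"input": 200, "output": 45}]
prompt = sum(u["input"] for u in usage)
completion = sum(u["output"] for u in usage)
token_usage = {"prompt": prompt, "completion": completion, "total": prompt + completion}
assert token_usage == {"prompt": 320, "completion": 75, "total": 395}
```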

working_directory: Path property

Working directory where the agent operated.

Resolved from agent.working_directory when set; falls back to the current working directory.

tool_was_called(name: str) -> bool

Check if a specific tool was called.

tool_call_count(name: str) -> int

Count how many times a specific tool was called.

tool_calls_for(name: str) -> list[ToolCall]

Get all calls to a specific tool.
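A minimal sketch of how these three helpers relate, using a stand-in call list rather than the real ToolCall type:

```python
from dataclasses import dataclass

# Stand-in for the real ToolCall dataclass; only the name field matters here.
@dataclass
class FakeCall:
    name: str

calls = [FakeCall("read_file"), FakeCall("create_file"), FakeCall("read_file")]

def tool_was_called(name: str) -> bool:
    return any(c.name == name for c in calls)

def tool_call_count(name: str) -> int:
    return sum(c.name == name for c in calls)

def tool_calls_for(name: str) -> list:
    return [c for c in calls if c.name == name]

assert tool_was_called("read_file")
assert tool_call_count("read_file") == 2
assert tool_calls_for("create_file") == [FakeCall("create_file")]
```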

tool_was_called_with(name: str, **expected_args: Any) -> bool

Check if a tool was called with specific argument values.

Returns True if at least one call to the named tool has all the specified argument key-value pairs.

Parameters:

Name Type Description Default
name str

Tool name to check.

required
**expected_args Any

Expected argument key-value pairs.

{}

Example::

assert result.tool_was_called_with("size_vm", region="westeurope", cores=8)
assert result.tool_was_called_with("get_balance", account="checking")

file(path: str) -> str

Read the content of a file relative to the working directory.

Parameters:

Name Type Description Default
path str

Relative file path (e.g. "main.py" or "src/utils.py").

required

Returns:

Type Description
str

File content as a string.

Raises:

Type Description
FileNotFoundError

If the file does not exist.

file_exists(path: str) -> bool

Check whether a file exists in the working directory.

Parameters:

Name Type Description Default
path str

Relative file path.

required

Returns:

Type Description
bool

True if the file exists, False otherwise.
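The path-resolution semantics of file() and file_exists() can be sketched with stand-in functions over a temporary working directory (the file name and content are illustrative, not part of the real API):

```python
import tempfile
from pathlib import Path

# Self-contained sketch: both helpers resolve paths relative to the
# working directory where the agent operated.
with tempfile.TemporaryDirectory() as wd:
    working_directory = Path(wd)
    (working_directory / "hello.py").write_text("print('hi')\n")

    def file_exists(rel: str) -> bool:
        return (working_directory / rel).is_file()

    def file(rel: str) -> str:
        return (working_directory / rel).read_text()

    assert file_exists("hello.py")
    assert not file_exists("missing.py")
    content = file("hello.py")
    assert "print" in content
```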

files_matching(pattern: str = '**/*') -> list[Path]

Find files matching a glob pattern in the working directory.

Parameters:

Name Type Description Default
pattern str

Glob pattern relative to the working directory. Defaults to "**/*" (all files recursively).

'**/*'

Returns:

Type Description
list[Path]

Sorted list of matching Path objects (files only, no directories).

Example::

# All Python files created by the agent
py_files = result.files_matching("**/*.py")
assert py_files, "No Python files were created"

# Top-level test files
test_files = result.files_matching("test_*.py")

Result Types

EvalResult(turns: list[Turn], success: bool, error: str | None = None, duration_ms: float = 0.0, token_usage: dict[str, int] = dict(), cost_usd: float = 0.0, _messages: list[Any] = list(), session_context_count: int = 0, assertions: list[Assertion] = list(), available_tools: list[ToolInfo] = list(), skill_info: SkillInfo | None = None, effective_system_prompt: str = '', mcp_prompts: list[MCPPrompt] = list(), prompt_name: str | None = None, custom_agent_info: CustomAgentInfo | None = None, premium_requests: float = 0.0, instruction_files: list[InstructionFileInfo] = list(), clarification_stats: ClarificationStats | None = None) dataclass

Result of running an agent with rich inspection capabilities.

Example

result = await eval_run(agent, "Hello!")
assert result.success
assert "hello" in result.final_response.lower()
assert result.tool_was_called("read_file")

Session continuity: pass messages to next test

next_result = await eval_run(agent, "Follow up", messages=result.messages)

messages: list[Any] property

Get full conversation messages for session continuity.

Use this to pass conversation history to the next test in a session:

result = await eval_run(agent, "First message")
next_result = await eval_run(agent, "Continue", messages=result.messages)

is_session_continuation: bool property

Check if this result is part of a multi-turn session.

Returns True if prior messages were passed via the messages parameter.

final_response: str property

Get the last assistant response.

all_responses: list[str] property

Get all assistant responses.

all_tool_calls: list[ToolCall] property

Get all tool calls across all turns.

tool_names_called: set[str] property

Get set of all tool names that were called.

asked_for_clarification: bool property

Check if the agent asked for clarification instead of acting.

Returns True if clarification detection was enabled AND the agent asked at least one clarifying question.

Example

result = await eval_run(agent, "Check my balance")
assert not result.asked_for_clarification

clarification_count: int property

Number of times the agent asked for clarification.

tool_context: str property

Summarise tool calls and their results as plain text.

Use this as the context argument for llm_score so the judge can see what tools were called and what data they returned.

Example::

score = llm_score(
    result.final_response,
    TOOL_QUALITY_RUBRIC,
    context=result.tool_context,
)

tool_was_called(name: str) -> bool

Check if a specific tool was called.

tool_was_called_from_server(server_name: str, tool_name: str) -> bool

Check if a specific tool from a named MCP server was called.

Useful for plugin testing where tools may be namespaced by server. Checks both the plain tool name and the server-prefixed form (server_name_tool_name).

Parameters:

Name Type Description Default
server_name str

Name of the MCP server (e.g., "filesystem").

required
tool_name str

Name of the tool (e.g., "read_file").

required

Returns:

Type Description
bool

True if either tool_name or server_name_tool_name was called.

Example::

assert result.tool_was_called_from_server("banking", "get_balance")

tool_call_count(name: str) -> int

Count how many times a specific tool was called.

tool_calls_for(name: str) -> list[ToolCall]

Get all calls to a specific tool.

tool_call_arg(tool_name: str, arg_name: str) -> Any

Get argument value from the first call to a tool.

Parameters:

Name Type Description Default
tool_name str

Name of the tool

required
arg_name str

Name of the argument

required

Returns:

Type Description
Any

Argument value or None if not found
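The "first call wins, None if absent" behaviour described above can be sketched over a stand-in call list (the tool and argument names are illustrative):

```python
# Stand-in call records; the real API exposes ToolCall objects.
calls = [
    {"name": "size_vm", "arguments": {"region": "westeurope", "cores": 8}},
    {"name": "size_vm", "arguments": {"region": "eastus", "cores": 4}},
]

def tool_call_arg(tool_name: str, arg_name: str):
    # Return the argument from the FIRST matching call, or None.
    for call in calls:
        if call["name"] == tool_name:
            return call["arguments"].get(arg_name)
    return None

assert tool_call_arg("size_vm", "region") == "westeurope"  # first call wins
assert tool_call_arg("size_vm", "missing") is None
```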

tool_was_called_with(name: str, **expected_args: Any) -> bool

Check if a tool was called with specific argument values.

Returns True if at least one call to the named tool has all the specified argument key-value pairs.

Parameters:

Name Type Description Default
name str

Tool name to check.

required
**expected_args Any

Expected argument key-value pairs.

{}

Example::

assert result.tool_was_called_with("size_vm", region="westeurope", cores=8)
assert result.tool_was_called_with("get_balance", account="checking")

tool_images_for(name: str) -> list[ImageContent]

Get all images returned by a specific tool.

Parameters:

Name Type Description Default
name str

Name of the tool (e.g., "screenshot")

required

Returns:

Type Description
list[ImageContent]

List of ImageContent objects from tool calls that returned images.

Example

screenshots = result.tool_images_for("screenshot")
assert len(screenshots) > 0
assert screenshots[-1].media_type == "image/png"

Turn(role: str, content: str, tool_calls: list[ToolCall] = list()) dataclass

A single conversational turn.

text: str property

Get the text content of this turn.

ToolCall(name: str, arguments: dict[str, Any], result: str | None = None, error: str | None = None, duration_ms: float | None = None, image_content: bytes | None = None, image_media_type: str | None = None) dataclass

A tool call made by the agent.

ClarificationStats(count: int = 0, turn_indices: list[int] = list(), examples: list[str] = list()) dataclass

Statistics about clarification requests detected during execution.

Tracks when the agent asks for user input instead of executing the task. Only populated when clarification_detection is enabled on the agent.

Example

result = await eval_run(agent, "Check my balance")
if result.clarification_stats:
    print(f"Eval asked {result.clarification_stats.count} question(s)")

ToolInfo(name: str, description: str, input_schema: dict[str, Any], server_name: str) dataclass

Metadata about an MCP tool for AI analysis.

Captures the tool's description and schema as exposed to the LLM, enabling the AI to analyze whether tool descriptions are clear and suggest improvements.

SkillInfo(name: str, description: str, instruction_content: str, reference_names: list[str] = list()) dataclass

Metadata about a skill for AI analysis.

Captures the skill's instruction content and references, enabling the AI to analyze skill effectiveness and suggest improvements.

SubagentInvocation(name: str, status: str, duration_ms: float | None = None) dataclass

A subagent invocation observed during agent execution.

Tracks when an orchestrator agent dispatches work to a named sub-agent, along with the final status and duration of that invocation.

Example

result = await copilot_eval(agent, "Build and test the project")
assert any(s.name == "coder" for s in result.subagent_invocations)
assert all(s.status == "completed" for s in result.subagent_invocations)

Scoring Types


ScoreResult(scores: dict[str, int], total: int, max_total: int, weighted_score: float, reasoning: str) dataclass

Structured result from a multi-dimension LLM evaluation.

Attributes:

Name Type Description
scores dict[str, int]

Per-dimension scores keyed by dimension name.

total int

Sum of all dimension scores.

max_total int

Maximum possible total score.

weighted_score float

Weighted composite score (0.0 – 1.0).

reasoning str

Free-text explanation from the judge.

assert_score(result: ScoreResult, *, min_total: int | None = None, min_pct: float | None = None, min_dimensions: dict[str, int] | None = None) -> None

Assert that judge scores meet minimum thresholds.

Parameters:

Name Type Description Default
result ScoreResult

ScoreResult from an LLMScore evaluation.

required
min_total int | None

Minimum total score (sum of all dimensions).

None
min_pct float | None

Minimum weighted percentage (0.0 – 1.0).

None
min_dimensions dict[str, int] | None

Per-dimension minimum scores keyed by name.

None

Raises:

Type Description
AssertionError

If any threshold is not met.
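The three threshold checks can be sketched as a small stand-in that operates on raw scores instead of a ScoreResult; the function name and messages are illustrative, only the threshold semantics come from the parameter descriptions above:

```python
# Hypothetical mimic of assert_score's documented checks.
def check_thresholds(scores, max_total, *, min_total=None, min_pct=None, min_dimensions=None):
    total = sum(scores.values())
    if min_total is not None:
        assert total >= min_total, f"total {total} < {min_total}"
    if min_pct is not None:
        # Weighted percentage check, 0.0 - 1.0.
        assert total / max_total >= min_pct
    if min_dimensions is not None:
        for dim, floor in min_dimensions.items():
            assert scores[dim] >= floor, f"{dim} below {floor}"

# total 9/10 passes every threshold below.
check_thresholds(
    {"accuracy": 4, "completeness": 5},
    max_total=10,
    min_total=8,
    min_pct=0.8,
    min_dimensions={"accuracy": 3},
)
```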

LLMScore(model: str)

Callable that evaluates content against a multi-dimension rubric.

Uses the Copilot SDK with structured prompting to extract per-dimension scores from a judge LLM.

Example::

def test_plan_quality(llm_score):
    rubric = [
        ScoringDimension("accuracy", "Factually correct", max_score=5),
        ScoringDimension("completeness", "Covers all points", max_score=5),
    ]
    result = llm_score(plan_text, rubric)
    assert result.total >= 7

__call__(content: str, rubric: list[ScoringDimension], *, content_label: str = 'content', context: str | None = None) -> ScoreResult

Evaluate content against a multi-dimension rubric.

Parameters:

Name Type Description Default
content str

The text to evaluate.

required
rubric list[ScoringDimension]

List of ScoringDimension definitions.

required
content_label str

How to describe the content to the judge (e.g. "implementation plan", "code review").

'content'
context str | None

Optional background context for the judge (e.g. the original task prompt, source code).

None

Returns:

Type Description
ScoreResult

ScoreResult with per-dimension scores and reasoning.

async_score(content: str, rubric: list[ScoringDimension], *, content_label: str = 'content', context: str | None = None) -> ScoreResult async

Async variant for use in async test functions.

Same parameters as __call__.

Skill Types

Skill(path: Path, metadata: SkillMetadata, content: str, references: dict[str, str] = dict(), scripts: dict[str, str] = dict(), assets: tuple[str, ...] = ()) dataclass

An Eval Skill loaded from a SKILL.md file.

Skills provide domain knowledge to agents by:

  1. Prepending instructions to the system prompt
  2. Optionally providing reference documents via virtual tools

Example

skill = Skill.from_path(Path("skills/my-skill"))
agent = Eval(provider=provider, skill=skill)

name: str property

Skill name from metadata.

description: str property

Skill description from metadata.

has_references: bool property

Whether this skill has reference documents.

has_scripts: bool property

Whether this skill has executable scripts.

has_assets: bool property

Whether this skill has asset files.

assets_dir: Path | None property

Path to assets directory, if it exists.

from_path(path: Path | str) -> Skill classmethod

Load a skill from a directory containing SKILL.md.

Parameters:

Name Type Description Default
path Path | str

Path to skill directory or SKILL.md file

required

Returns:

Type Description
Skill

Loaded Skill instance

Raises:

Type Description
SkillError

If skill cannot be loaded or is invalid

SkillMetadata(name: str, description: str, version: str | None = None, license: str | None = None, tags: tuple[str, ...] = (), compatibility: str | None = None, metadata_entries: tuple[tuple[str, str], ...] = (), allowed_tools: tuple[str, ...] = ()) dataclass

Metadata from SKILL.md frontmatter.

Required fields per the agentskills.io spec:

  • name: lowercase letters and hyphens only, 1-64 chars
  • description: what the skill does, max 1024 chars

Optional fields:

  • version: semantic version string
  • license: SPDX license identifier
  • tags: list of categorization tags
  • compatibility: environment requirements, max 500 chars
  • metadata: arbitrary key-value pairs (stored as a tuple of tuples for frozen compat)
  • allowed_tools: tool names the skill is designed to work with

metadata_dict: dict[str, str] property

Return metadata entries as a dict for convenient access.

__post_init__() -> None

Validate metadata per agentskills.io spec.
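The documented constraints can be sketched as a small validator. The regex and error messages are assumptions; only the rules themselves (lowercase letters and hyphens, 1-64 chars; description at most 1024 chars) come from the spec summary above:

```python
import re

# Illustrative validator for the documented frontmatter rules.
NAME_RE = re.compile(r"^[a-z-]{1,64}$")

def validate(name: str, description: str) -> None:
    if not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid skill name: {name!r}")
    if not description or len(description) > 1024:
        raise ValueError("description must be 1-1024 characters")

validate("my-skill", "Formats commit messages.")  # passes silently
```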

load_skill(path: Path | str) -> Skill

Load a skill from a path.

Convenience function wrapping Skill.from_path().

Parameters:

Name Type Description Default
path Path | str

Path to skill directory or SKILL.md file

required

Returns:

Type Description
Skill

Loaded Skill instance

Custom Agent Types

load_custom_agent(path: Path | str, *, overrides: dict[str, Any] | None = None) -> dict[str, Any]

Load a .agent.md or .md file into a custom agent dict.

Parses YAML frontmatter with PyYAML for structured metadata (description, tools, handoffs, maturity, etc.) and uses the markdown body as the agent's prompt.

Parameters:

Name Type Description Default
path Path | str

Path to the .agent.md or .md file.

required
overrides dict[str, Any] | None

Additional fields to merge into the result. Use this to set tools, mcp_servers, or infer fields that aren't in the agent file itself.

None

Returns:

Type Description
dict[str, Any]

Dict with the keys:

  • name (str): Derived from the filename.
  • prompt (str): Markdown body after frontmatter.
  • description (str): From frontmatter, empty if absent.
  • metadata (dict): Full parsed frontmatter dict.

Compatible with CopilotEval.custom_agents and Eval.from_agent_file().

Raises:

Type Description
FileNotFoundError

If the file does not exist.

ValueError

If the file has no content after frontmatter stripping.
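The parse described above can be sketched without PyYAML for a single-key frontmatter. The splitter and the example agent are illustrative, not the library's actual parser:

```python
# Illustrative .agent.md parse: YAML frontmatter between "---" fences,
# markdown body as the agent's prompt. (Hand-rolled splitter, not PyYAML.)
text = """---
description: Reviews code for security issues.
---
You are a security reviewer. Flag unsafe patterns.
"""
_, frontmatter, body = text.split("---", 2)
meta = dict(line.split(": ", 1) for line in frontmatter.strip().splitlines())
agent = {
    "name": "security-reviewer",          # derived from the filename
    "prompt": body.strip(),
    "description": meta.get("description", ""),
    "metadata": meta,
}
assert agent["description"].startswith("Reviews")
```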

load_custom_agents(directory: Path | str, *, include: set[str] | None = None, exclude: set[str] | None = None, overrides: dict[str, dict[str, Any]] | None = None) -> list[dict[str, Any]]

Load all .agent.md files from a directory.

Parameters:

Name Type Description Default
directory Path | str

Path to directory containing .agent.md files.

required
include set[str] | None

If set, only load agents with these names. Names are derived from filenames (e.g. task-researcher.agent.md → task-researcher).

None
exclude set[str] | None

Agent names to skip.

None
overrides dict[str, dict[str, Any]] | None

Per-agent override dicts keyed by agent name. Merged into each matching agent's config.

None

Returns:

Type Description
list[dict[str, Any]]

List of custom agent dicts, sorted by name.

Raises:

Type Description
FileNotFoundError

If the directory does not exist.

Plugin Types

Plugin(metadata: PluginMetadata, path: Path, agents: list[dict[str, Any]] = list(), skills: list[Skill] = list(), mcp_servers: dict[str, dict[str, Any]] = dict(), hooks: list[HookDefinition] = list(), instructions: str = '', extensions: list[Path] = list()) dataclass

A loaded plugin with all its components resolved.

Created by load_plugin(). Contains everything discovered from the plugin directory: agents, skills, MCP servers, hooks, instructions, and extensions.

PluginMetadata(name: str, version: str = '', description: str = '', author: str = '') dataclass

Plugin manifest metadata.

Extracted from plugin.json or inferred from the directory name.

HookDefinition(event: str, command: str, pattern: str = '') dataclass

A lifecycle hook from a plugin's hooks configuration.

Hooks allow plugins to execute shell commands at specific lifecycle events.

Example hooks.json::

[
    {
        "event": "tool.execution_complete",
        "command": "npm run lint",
        "pattern": "*.ts"
    }
]

Plugin Loading

load_plugin(path: str | Path) -> Plugin

Load a plugin from a directory containing plugin.json.

Supports both GitHub Copilot CLI and Claude Code plugin formats. Discovers and loads: custom agents, skills, MCP servers, hooks, instructions, and extensions.

For .github/ directories, discovers agents from agents/ and instructions from copilot-instructions.md.

For .claude/ directories, discovers agents from agents/ and instructions from CLAUDE.md.

Parameters:

Name Type Description Default
path str | Path

Path to the plugin directory (must contain plugin.json, or be a .github/ / .claude/ project config directory)

required

Returns:

Type Description
Plugin

Plugin with all components resolved

Raises:

Type Description
FileNotFoundError

If plugin.json doesn't exist (for non-project dirs)

ValueError

If plugin.json is invalid
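The manifest-discovery step can be sketched as follows. The fallback matches the PluginMetadata note above ("inferred from the directory name"); everything else, including the example plugin, is illustrative:

```python
import json
import tempfile
from pathlib import Path

# Sketch of manifest discovery: plugin.json supplies the metadata,
# with the directory name as a fallback.
with tempfile.TemporaryDirectory() as d:
    root = Path(d) / "my-plugin"
    root.mkdir()
    (root / "plugin.json").write_text(json.dumps({"name": "my-plugin", "version": "1.0.0"}))

    manifest_path = root / "plugin.json"
    if manifest_path.is_file():
        meta = json.loads(manifest_path.read_text())
    else:
        meta = {"name": root.name}  # inferred from the directory name

    assert meta["name"] == "my-plugin"
```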