Model Comparison
Compare how different models perform the same task.
Example
import pytest
from pytest_codingagents import CopilotAgent

MODELS = ["claude-sonnet-4", "gpt-4.1"]


@pytest.mark.parametrize("model", MODELS)
async def test_fibonacci(copilot_run, tmp_path, model):
    agent = CopilotAgent(
        name=f"model-{model}",
        model=model,
        instructions="Write clean Python code.",
        working_directory=str(tmp_path),
    )
    result = await copilot_run(agent, "Create fibonacci.py with a fibonacci function")
    assert result.success
    assert (tmp_path / "fibonacci.py").exists()

    # Compare token usage across models
    print(f"{model}: {result.total_tokens} tokens, ${result.total_cost_usd:.4f}")
What To Look For
- Success rate — Which model completes the task reliably?
- Token usage — Which model is most efficient?
- Tool calls — Which model uses tools appropriately?
- Reasoning traces — How does each model think through the problem?
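The first two criteria can be turned into numbers once several runs per model have been collected. The following is a minimal sketch, assuming you copy `success`, `total_tokens`, and `total_cost_usd` from each result into a record of your own. `RunRecord` and `summarize` are hypothetical helpers for illustration, not part of pytest_codingagents, and the sample values are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """Hypothetical record of one agent run; not part of pytest_codingagents."""
    model: str
    success: bool
    total_tokens: int
    total_cost_usd: float


def summarize(records):
    """Group runs by model and compute success rate, mean tokens, and mean cost."""
    by_model = defaultdict(list)
    for record in records:
        by_model[record.model].append(record)

    summary = {}
    for model, runs in by_model.items():
        summary[model] = {
            "runs": len(runs),
            # True counts as 1 and False as 0, so this is the fraction of successful runs.
            "success_rate": sum(r.success for r in runs) / len(runs),
            "mean_tokens": mean(r.total_tokens for r in runs),
            "mean_cost_usd": mean(r.total_cost_usd for r in runs),
        }
    return summary


# Illustrative numbers only, not measured results.
records = [
    RunRecord("model-a", True, 1200, 0.012),
    RunRecord("model-a", False, 1800, 0.018),
    RunRecord("model-b", True, 900, 0.020),
]
for model, stats in summarize(records).items():
    print(f"{model}: {stats['success_rate']:.0%} success, {stats['mean_tokens']:.0f} mean tokens")
```

A single run per model says little about reliability, so a summary like this is most useful when each model has been run several times on the same prompt.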