Multi-Turn Sessions

So far, each test is independent—the agent has no memory between tests. Sessions let multiple tests share conversation history, simulating real multi-turn interactions.

CopilotEval and sessions

CopilotEval supports sessions using a context-in-prompt pattern — prior conversation turns are injected as context in each new prompt. This works for most workflows but differs from Eval, which maintains full turn-based message history via PydanticAI. The Copilot SDK accepts string prompts only (send_and_wait(prompt: str)), so true stateful multi-turn sessions are not available. See copilot/test_06_sessions.py for the Copilot pattern.
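The context-in-prompt pattern can be sketched in a few lines (a minimal illustration, not the actual CopilotEval internals; the transcript format below is an assumption made for this example):

```python
# Minimal sketch of the context-in-prompt pattern. This is NOT the real
# CopilotEval implementation; the transcript layout is an assumption.

def build_prompt(history: list[tuple[str, str]], new_message: str) -> str:
    """Fold prior (user, assistant) turns into a single string prompt."""
    if not history:
        return new_message
    lines = ["Previous conversation:"]
    for user_turn, assistant_turn in history:
        lines.append(f"User: {user_turn}")
        lines.append(f"Assistant: {assistant_turn}")
    lines.append(f"Current request: {new_message}")
    return "\n".join(lines)

# First turn: no history, so the prompt is just the message.
first = build_prompt([], "What's my checking account balance?")

# Second turn: the prior exchange rides along inside the string prompt,
# since send_and_wait() only accepts a plain string.
second = build_prompt(
    [("What's my checking account balance?", "Your checking balance is $1,500.00.")],
    "Transfer $200 to savings",
)
```

Because everything travels as one string, context is bounded by the prompt budget rather than by a true message history, which is the key difference from Eval's turn-based sessions.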

Why Sessions?

Real coding agents don't answer single questions. Users have conversations:

  1. "What's my checking account balance?"
  2. "Transfer $200 to savings" ← Requires remembering the accounts
  3. "What are my new balances?" ← Requires remembering the transfer

Without sessions, test 2 would fail—the agent doesn't know which accounts were discussed.

Defining a Session

Use the @pytest.mark.session marker:

import pytest
from pytest_skill_engineering.copilot import CopilotEval

banking_agent = CopilotEval(
    name="banking",
    instructions="You are a banking assistant.",
)

@pytest.mark.session("banking-chat")
class TestBankingConversation:
    """Tests run in order, sharing conversation history."""

    async def test_initial_query(self, copilot_eval):
        """First message - establishes context."""
        result = await copilot_eval(banking_agent, "What's my checking account balance?")
        assert result.success

    async def test_followup(self, copilot_eval):
        """Second message - uses context from first."""
        result = await copilot_eval(banking_agent, "Transfer $200 to savings")
        assert result.success
        # Agent remembers we were talking about checking

    async def test_verification(self, copilot_eval):
        """Third message - builds on full conversation."""
        result = await copilot_eval(banking_agent, "What are my new balances?")
        assert result.success

Key points:

  • Tests in a session run in order (top to bottom)
  • Each test sees the full conversation history from previous tests
  • The session name ("banking-chat") groups related tests

Not compatible with pytest-xdist

Sessions require sequential test execution to maintain conversation order. Don't use -n auto or other parallel execution flags with session tests.
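If your suite mixes session and non-session tests, one way to guard against accidental parallel runs is to skip session tests inside xdist workers. A conftest.py sketch (this hook is not part of pytest-skill-engineering; it relies on the PYTEST_XDIST_WORKER environment variable that pytest-xdist sets for each worker process):

```python
# conftest.py sketch: skip session tests when running under pytest-xdist.
# NOT part of pytest-skill-engineering; an assumption built on the
# PYTEST_XDIST_WORKER env var that xdist sets in worker processes.
import os

import pytest


def running_under_xdist() -> bool:
    """True when this process is a pytest-xdist worker."""
    return "PYTEST_XDIST_WORKER" in os.environ


def pytest_collection_modifyitems(config, items):
    if not running_under_xdist():
        return
    skip = pytest.mark.skip(reason="session tests require sequential execution")
    for item in items:
        if item.get_closest_marker("session"):
            item.add_marker(skip)
```

Alternatively, deselect by marker name on the parallel run (pytest -m "not session" -n auto) and run the session tests in a separate sequential invocation.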

Session Context Flow

test_initial_query
    User: "What's my checking account balance?"
    Eval: "Your checking balance is $1,500.00..."
    ↓ context passed to next test

test_followup  
    [Previous messages included]
    User: "Transfer $200 to savings"
    Eval: "Done! Transferred $200 from checking to savings..."
    ↓ context passed to next test

test_verification
    [All previous messages included]
    User: "What are my new balances?"
    Eval: "Checking: $1,300, Savings: $3,200..."

When to Use Sessions

Scenario                         Use Session?
Single Q&A tests                 No
Multi-turn conversation          Yes
Workflow with multiple steps     Yes
Independent feature tests        No
Testing context retention        Yes

Sessions with Parametrize

You can combine sessions with model comparison:

@pytest.mark.session("shopping-flow")
@pytest.mark.parametrize("model", ["gpt-5-mini", "gpt-4.1"])
class TestShoppingWorkflow:
    """Test the same conversation flow with different models."""

    async def test_browse(self, copilot_eval, model):
        agent = CopilotEval(
            name=f"shop-{model}",
            model=model,
            instructions="You are a shopping assistant.",
        )
        result = await copilot_eval(agent, "Show me running shoes")
        assert result.success

    async def test_select(self, copilot_eval, model):
        agent = CopilotEval(
            name=f"shop-{model}",
            model=model,
            instructions="You are a shopping assistant.",
        )
        result = await copilot_eval(agent, "I'll take the Nike ones")
        assert result.success

This creates two separate session flows:

  • shopping-flow[gpt-5-mini]: browse → select (with gpt-5-mini)
  • shopping-flow[gpt-4.1]: browse → select (with gpt-4.1)

The report shows each session as a complete flow with all turns visualized.
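The two tests above construct an identical agent twice. If you prefer to build each model's agent once per run, a small cache works (a sketch only; make_agent stands in for the CopilotEval constructor, which isn't importable here):

```python
# Sketch: build one agent per model and reuse it across the tests in that
# model's session flow. `make_agent` is a stand-in for CopilotEval(...) so
# the example is self-contained.
_agents: dict[str, object] = {}


def agent_for(model: str, make_agent=lambda m: {"name": f"shop-{m}", "model": m}):
    """Return a cached agent for `model`, constructing it on first use."""
    if model not in _agents:
        _agents[model] = make_agent(model)
    return _agents[model]
```

Each test then calls agent_for(model) instead of repeating the constructor; whether sharing the instance matters depends on how copilot_eval keys session context, so treat this purely as a way to avoid duplicated setup code.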

Next Steps

📁 Real Example: pydantic/test_06_sessions.py — Banking workflow with session continuity