Technology Apr 16, 2026 · 7 min read

How I Test My MCP Agent Without Burning Tokens

DEV Community
by Joseph Thomas

Last month I shipped an MCP agent that triages GitHub issues. It works great — until it silently breaks and nobody notices.

Here are the last three bugs I hit:

  1. I tweaked the system prompt. The agent stopped calling create_issue and just summarised the bug report in plain text. CI didn't catch it — CI tests the code, not the agent behavior.

  2. I swapped Sonnet for Haiku to save cost. The agent started calling list_issues four times before each create_issue. Integration tests still passed. Token bill tripled.

  3. GitHub rate-limited me mid-test. The entire pytest suite went red. I rolled back a perfectly good change because I couldn't tell flake from regression.

Every one of those would have been caught by a tool that tests the agent's trajectory — which tools it picks, in what order, with what arguments — against a fast, hermetic mock.

That tool is mcptest. Here's how I use it.

The Scenario

I have an agent that reads a bug report and decides whether to:

  • Open a new issue if the bug is novel
  • Comment on an existing issue if a similar one is already filed
  • Do nothing if the report is spam or unclear

The agent is about thirty lines of Python wrapping an LLM call and the MCP client:

# agent.py
import asyncio, json, os, sys
from mcp import ClientSession
from mcp.client.stdio import StdioServerParameters, stdio_client

SYSTEM_PROMPT = """You are an issue triage assistant for acme/api.
Before filing a new issue, ALWAYS check list_issues for duplicates.
If a duplicate exists, call add_comment instead."""

async def main():
    # mcptest passes fixture file paths as a JSON list in this env var
    fx = json.loads(os.environ["MCPTEST_FIXTURES"])[0]
    params = StdioServerParameters(
        command=sys.executable,
        args=["-m", "mcptest.mock_server", fx],
        env=os.environ.copy(),
    )
    user = sys.stdin.read()
    async with stdio_client(params) as (r, w):
        async with ClientSession(r, w) as session:
            await session.initialize()
            tools = await session.list_tools()
            # Your real agent calls an LLM here with SYSTEM_PROMPT + tools
            # and executes whatever tool_calls it returns
            plan = your_llm_plan(SYSTEM_PROMPT, user, tools)
            for call in plan:
                await session.call_tool(call.name, arguments=call.args)

asyncio.run(main())

That's it. The rest of this post is about testing what this agent does — which tools it picks, in what order — without ever calling a real LLM.
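If you want to run the loop locally without any model at all, you can stub `your_llm_plan` with a deterministic planner. This is a hypothetical stand-in of my own — `ToolCall` and the keyword rules below are not part of mcptest or the mcp SDK:

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

def your_llm_plan(system_prompt: str, user: str, tools) -> list[ToolCall]:
    # Deterministic stand-in for the LLM: always check for duplicates
    # first, then pick add_comment vs create_issue on a crude keyword rule.
    plan = [ToolCall("list_issues", {"repo": "acme/api", "query": user[:40]})]
    if "dark mode" in user:
        plan.append(ToolCall("add_comment",
                             {"repo": "acme/api", "number": 12, "body": user}))
    else:
        plan.append(ToolCall("create_issue",
                             {"repo": "acme/api", "title": user[:60], "body": user}))
    return plan

calls = your_llm_plan("", "login page returns 500 on Safari", [])
print([c.name for c in calls])  # ['list_issues', 'create_issue']
```

A stub like this is also handy for smoke-testing the fixture itself before pointing a real model at it.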

I want to verify four things:

  1. For a clear bug report, it calls create_issue exactly once.
  2. It checks list_issues before create_issue (duplicate check).
  3. When the server rate-limits it, it recovers gracefully.
  4. It never calls delete_issue. Ever.

Step 1 — Mock the GitHub MCP Server in YAML

No code required. Declare the tools, canned responses, and error scenarios:

# fixtures/github.yaml
server:
  name: mock-github
  version: "1.0"

tools:
  - name: list_issues
    input_schema:
      type: object
      properties:
        repo: { type: string }
        query: { type: string }
      required: [repo]
    responses:
      - match: { query: "login 500" }
        return:
          issues: []                   # No duplicate → should open a new one
      - match: { query: "dark mode" }
        return:
          issues: [{ number: 12, title: "Add dark mode" }]
      - default: true
        return: { issues: [] }

  - name: create_issue
    input_schema:
      type: object
      properties:
        repo: { type: string }
        title: { type: string }
        body: { type: string }
      required: [repo, title]
    responses:
      - match: { repo: "acme/api" }
        return: { number: 42, url: "https://github.com/acme/api/issues/42" }
      - default: true
        error: rate_limited            # Simulate real-world flake

  - name: add_comment
    input_schema:
      type: object
      properties:
        repo: { type: string }
        number: { type: integer }
        body: { type: string }
      required: [repo, number, body]
    responses:
      - default: true
        return: { ok: true }

  - name: delete_issue
    responses:
      - default: true
        return: { deleted: true }

errors:
  - name: rate_limited
    error_code: -32000
    message: "GitHub API rate limit exceeded"

That's a real MCP server. It speaks MCP over stdio, just like the real GitHub server. Your agent connects to it the same way.
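The way I read the `responses` block is first-match-wins: each incoming call's arguments are compared against the `match` patterns in order, falling through to the `default` entry. A rough sketch of that resolution logic (my assumption about the semantics, not mcptest's actual source):

```python
def resolve_response(responses: list[dict], arguments: dict) -> dict:
    # First-match-wins: return the first entry whose `match` keys are all
    # present in the call arguments with equal values; `default` catches the rest.
    for entry in responses:
        if entry.get("default"):
            return entry
        match = entry.get("match", {})
        if all(arguments.get(k) == v for k, v in match.items()):
            return entry
    raise KeyError("no matching response and no default entry")

responses = [
    {"match": {"query": "dark mode"}, "return": {"issues": [{"number": 12}]}},
    {"default": True, "return": {"issues": []}},
]
print(resolve_response(responses, {"repo": "acme/api", "query": "dark mode"})["return"])
# {'issues': [{'number': 12}]}
```

That ordering is why the `default` entry goes last in the fixture: anything above it gets a chance to match first.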

Step 2 — Write the Tests

# tests/test_triage.yaml
name: triage agent
fixtures:
  - ../fixtures/github.yaml
agent:
  command: python agent.py
cases:

  - name: opens a new issue for a novel bug
    input: "login page returns 500 on Safari"
    assertions:
      - tool_called: create_issue
      - tool_call_count: { tool: create_issue, count: 1 }
      - param_matches:
          tool: create_issue
          param: repo
          value: "acme/api"
      - no_errors: true

  - name: checks duplicates before creating
    input: "login page returns 500 on Safari"
    assertions:
      - tool_order:
          - list_issues
          - create_issue

  - name: comments on an existing duplicate instead of creating
    input: "add dark mode support"
    assertions:
      - tool_called: add_comment
      - tool_not_called: create_issue

  - name: never deletes anything
    input: "spam: buy crypto now!!!"
    assertions:
      - tool_not_called: delete_issue

  - name: recovers from rate-limit gracefully
    input: "file bug for org: wrongorg/wrongrepo"
    assertions:
      - tool_called: create_issue
      - error_handled: "rate limit"

Five test cases. Every one maps directly to a bug I've actually hit in production.
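One semantic worth spelling out: I read `tool_order` as a subsequence check, not strict adjacency — extra calls between the listed tools shouldn't fail it. A sketch of that check (again my assumption, not mcptest's code):

```python
def tool_order_holds(trajectory: list[str], expected: list[str]) -> bool:
    # True if `expected` appears as a subsequence of the trajectory.
    # `name in it` advances the iterator, so order is enforced.
    it = iter(trajectory)
    return all(name in it for name in expected)

print(tool_order_holds(["list_issues", "list_issues", "create_issue"],
                       ["list_issues", "create_issue"]))  # True
print(tool_order_holds(["create_issue", "list_issues"],
                       ["list_issues", "create_issue"]))  # False
```

That's why the Haiku bug from the intro (four `list_issues` calls before `create_issue`) needs the separate `tool_call_count` assertion — `tool_order` alone would still pass.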

Step 3 — Run It

$ mcptest run

                        mcptest results
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Suite          ┃ Case                               ┃ Status ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ triage agent   │ opens a new issue for a novel bug  │  PASS  │
│ triage agent   │ checks duplicates before creating  │  PASS  │
│ triage agent   │ comments on existing duplicate     │  PASS  │
│ triage agent   │ never deletes anything             │  PASS  │
│ triage agent   │ recovers from rate-limit gracefully│  PASS  │
└────────────────┴────────────────────────────────────┴────────┘

5 passed, 0 failed (5 total)
⏱  1.6s

1.6 seconds. Zero tokens. No GitHub API calls. No rate-limit flake.

Step 4 — The Regression-Diff Trick

This is where the tool earns its keep.

First, snapshot the current (known-good) agent trajectories as baselines:

$ mcptest snapshot
✓ saved baseline for triage agent::opens a new issue... (2 tool calls)
✓ saved baseline for triage agent::checks duplicates... (2 tool calls)
✓ saved baseline for triage agent::comments on duplicate (2 tool calls)
...

Now go tweak the system prompt. Something innocuous — change this:

"You are a GitHub issue triage assistant. Check for duplicates before filing."

To this:

"You are a helpful assistant that handles GitHub issue reports."

No errors in the code. All unit tests still pass. The linter is happy.

$ mcptest diff --ci

✗ triage agent::checks duplicates before creating
  tool_order REGRESSION:
    baseline: list_issues → create_issue
    current:  create_issue          ← list_issues was dropped!

✗ triage agent::comments on existing duplicate
  tool_called REGRESSION:
    baseline: add_comment was called
    current:  add_comment was never called

2 regression(s) across 5 case(s)
Exit: 1

Exit code 1 — CI blocks the merge. The agent silently lost its duplicate-check behavior because of a one-sentence prompt change.

This is the bug that cost me a weekend. I never want to hit it again.
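The core of the diff is just set comparison over the recorded tool names. A minimal sketch of what I assume it does (ignoring call counts and positions, which a real trajectory diff would also compare):

```python
def diff_trajectories(baseline: list[str], current: list[str]) -> list[str]:
    # Report tools that disappeared from, or newly appeared in, the trajectory.
    issues = []
    for tool in baseline:
        if tool not in current:
            issues.append(f"dropped: {tool}")
    for tool in current:
        if tool not in baseline:
            issues.append(f"added: {tool}")
    return issues

print(diff_trajectories(["list_issues", "create_issue"], ["create_issue"]))
# ['dropped: list_issues']
```

The "dropped" case is exactly the regression above: the prompt change made the agent skip `list_issues`, and the diff surfaces it even though every call that remains still succeeds.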

Step 5 — Wire It into CI

# .github/workflows/agent-tests.yml
name: Agent tests
on: [pull_request]

jobs:
  mcptest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }

      - run: pip install mcp-agent-test

      - name: Run tests
        run: mcptest run

      - name: Diff against baselines
        run: mcptest diff --ci

      - name: Post PR summary
        if: always()
        run: mcptest github-comment
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Now every PR that changes the prompt, the model, or the agent code gets its trajectories diffed against main. If behavior shifts, a comment lands on the PR with the exact tool-order delta. Reviewers see agent-behavior changes as clearly as they see code changes.

Why MCP-Specific Matters

Eval tooling is consolidating fast — independent evaluation startups keep getting folded into model-provider platforms. That's useful if you're all-in on one vendor. It's a lock-in risk if you're not.

mcptest is independent, MIT-licensed, and specifically shaped for MCP agents. The tool_called / tool_order / error_handled primitives exist because that's what an MCP trajectory actually looks like — not because someone ported a generic LLM-eval DSL.

MCP agent testing is particularly underserved. DeepEval is great for prompt evaluation. Inspect AI is great for benchmarks. Neither gives you "run your agent against a mock GitHub server and assert it didn't call delete_issue."

mcptest does.

Try It

pip install mcp-agent-test

# Scaffold a new project
mcptest init

# Or clone the quickstart
git clone https://github.com/josephgec/mcptest
cd mcptest/examples/quickstart
mcptest run

# Or install a pre-built fixture pack
mcptest install-pack github ./my-project
mcptest install-pack slack ./my-project
mcptest install-pack filesystem ./my-project

Six packs ship out of the box: GitHub, Slack, filesystem, database, HTTP, and git. Each one is a realistic mock with error scenarios baked in and tests that actually assert something.

📦 Source: github.com/josephgec/mcptest
📦 PyPI: pypi.org/project/mcp-agent-test

If you're building an MCP agent and haven't started writing tests yet, you're accumulating the same three bugs I accumulated. Start with one fixture and one test case. Catch the first prompt-change regression. Then you'll understand why MCP agents need this.

If this was useful, a ⭐ on the repo helps others find it.

Source

This article was originally published by DEV Community and written by Joseph Thomas.
