Technology Apr 22, 2026 · 22 min read

Claude Code for the Outer Loop: An AI SRE Playbook to Reduce On-Call Toil


DEV Community
by Manveer Chawla

It is 2:13am. PagerDuty fires for checkout-service, p95 past threshold for four minutes. You open Datadog, find the wrong dashboard, then the right one, then the CI tool for recent deploys, then Jira for open incidents, then #incidents in Slack to check whether a co-worker is already in the war room. Eight minutes in, you have a working hypothesis.

That is not incident response. That is a context-loading tax the on-call pays before the work begins.

Coding agents, such as Claude Code, are eating the inner loop. The outer loop is a different story. Operational work (incident response, runbook execution, SLO investigation, on-call handoffs) still looks almost identical to how it looked five years ago. The gap is not the model. It is the infrastructure to run agentic tools across a team, against production, with the auth, scope, and audit guarantees an SRE program needs.

This article is about the execution layer. The data substrate underneath is the other half of the problem, and I've written about it on the ClickHouse blog.

TL;DR

  • Claude Code already works in the outer loop. The interface, the reasoning, the tool-call contract all transfer. What changes is the data sources.
  • Five workflows prove it. Incident triage, runbook execution, postmortem drafting, SLO investigation, on-call handoffs. Every one of them is Claude-shaped.
  • The auth, scope, and audit gap is the bottleneck. The MCP servers for most SaaS tools already exist. The problem is that when every engineer wires their own connection, you inherit inconsistent authorization, over-scoped credentials, and no audit trail. Useful to one person at best. A data exposure incident at worst.
  • The gap is an MCP runtime, not a model. Managed auth, hosted compute, tool-level governance, persistent audit logs. Until something provides all four, outer-loop AI stays a party trick.
  • An MCP runtime is more than an MCP gateway. A gateway routes MCP tools under one URL. An MCP runtime adds the compute that runs them, the auth that scopes them, and the audit trail that makes them safe in production. Arcade.dev is an MCP runtime with a gateway inside it.

Five AI SRE workflows and the MCP servers that power them

If you only read one thing in this article, read this table.

| # | Workflow | MCP servers | What Claude Code does | What on-call does |
|---|----------|-------------|-----------------------|-------------------|
| 1 | Incident triage | PagerDuty, Datadog, Slack, Jira, GitHub | Pulls the PagerDuty payload, correlates Datadog signals in the window, checks recent deploys, scans Jira and #incidents, drafts a war room post | Decides the next move |
| 2 | Runbook execution | Confluence, Kubernetes, GitHub | Parses the Confluence doc into steps, lays out the diagnostic sequence with commands and expected output, proposes any write command | Runs the steps, approves every write |
| 3 | Postmortem drafting | Slack, PagerDuty, Datadog, Confluence | Reconstructs the timeline from Slack, PagerDuty, Datadog, and the deploy log, fills the team template with source-linked evidence | Writes the root cause and action items |
| 4 | SLO investigation | Datadog, PagerDuty, Snowflake, Confluence | Finds the burn inflection, correlates deploys, config changes, traffic shifts, and upstream incidents, ranks hypotheses with linked evidence | Evaluates hypotheses, decides action items |
| 5 | On-call handoff | PagerDuty, Datadog, Slack, Zendesk | Assembles the shift briefing from pages, active incidents, baking deploys, SLO burn, and open action items, delivers it as a Slack DM | Reviews, adds color, signs off |

Workflow 1: Incident triage is mostly archaeology

Scenario

The manual triage above is a parallelism problem, not a skill problem. One engineer, five tools, sequential context loads. Every on-call engineer I know tells the same story: "I spent the first ten minutes figuring out what was happening."

What Claude Code does

Hand the alert to Claude Code: "Triage this alert, correlating it with Datadog metrics, service logs, and deployment history. Scan Slack history for other correlated failures."

Claude Code returns the alert context in two sentences, the top three correlated signals with direct Datadog links, and the deploys most likely to matter by service-graph proximity with commit SHAs and authors. Two to three minutes end to end, running while you are opening the laptop. Grafana's team reported a 3.5x reduction in time to root cause using a similar pattern.
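The deploy-ranking half of that output is simple enough to sketch. A toy version of the correlation, assuming a hypothetical deploy feed and a precomputed service-graph distance map (hops from the alerting service); the real signal Claude Code works from is richer, but the ordering logic looks like this:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Deploy:
    service: str
    sha: str
    author: str
    deployed_at: datetime

def rank_candidate_deploys(deploys, alert_start,
                           window=timedelta(hours=2), service_distance=None):
    """Rank deploys inside the lookback window, closest service-graph
    neighbor first, most recent deploy breaking ties."""
    service_distance = service_distance or {}
    candidates = [d for d in deploys
                  if alert_start - window <= d.deployed_at <= alert_start]
    return sorted(
        candidates,
        key=lambda d: (service_distance.get(d.service, 99),  # unknown services last
                       alert_start - d.deployed_at),         # then by recency
    )
```

The point of the sketch: a deploy to the alerting service ten minutes before the page outranks a deploy to a distant dependency, which matches how an experienced on-call reads the same data.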

What on-call does

By the time the on-call moves from the alert on their phone to opening their laptop, Claude Code's initial analysis is waiting. They read the summary, validate it against the dashboards, cross-reference the ranked deploys against what they know shipped recently, and decide the next move. They also catch the failure modes: the correlation that is spurious, the deploy the service graph does not know about, the #incidents thread that was noise. Claude Code compresses the archaeology. The on-call judges it.

The auth, scope, and audit gap

PagerDuty, Datadog, Slack, Jira, and GitHub all ship MCP servers. The problem is running them across a team, not building them.

If the setup is not configured consistently for every engineer on the rotation, the workflow breaks on the shift that needs it most. Misconfigured permissions lead to inconclusive analysis, and inconclusive analysis at 3am is worse than no analysis at all. Engineers who wire up their own connections often grant themselves broader scopes than the workflow needs, and the next access review turns into cleanup nobody planned for. The failure mode that matters most: if tool access is not scoped properly, a diagnostic step can inadvertently trigger a write action, mutate state in production, and turn the triage itself into the incident. Consistent setup, scoped credentials, and read-only enforcement are properties of the MCP runtime, not the individual engineer's configuration.

Workflow 2: Runbook execution at 3am

Scenario

Mature teams maintain their runbooks. The ones in constant use stay fresh because people fix them after every incident. The rot lives in two quieter places. Runbooks that fire once a quarter drift between uses, and nobody notices until the next 3am page reveals that half the commands point at deprecated tools and renamed clusters. And new engineers on the rotation often do not know which runbook applies to the alert in front of them. Finding the right doc at 3am is its own skill, and it takes months on the rotation to build.

"Runbooks are a lie we tell ourselves."

During my time leading reliability at Confluent and Dropbox, I saw this pattern play out across very different stacks. It is not an organization-specific problem. It is the law of prioritization playing out: the runbooks that fire often get the attention, and the ones that fire rarely do not.

What Claude Code does

Finding the right runbook. Once triage narrows the problem, the on-call needs to know which runbook applies and what to run. Point Claude Code at the alert. It matches the metadata (service, symptom, tag) against the runbook index, surfaces the top candidate, and lays out the diagnostic sequence with exact commands, the systems they target, and expected output for each step.

Keeping runbooks fresh. Most mature teams run quality weeks or reliability sprints to refresh runbooks. At Confluent, we did this quarterly. Claude Code makes the sprint cheaper because staging is a safe environment for it: replay every runbook against staging in a batch, flag the commands pointing at deprecated tools and renamed clusters, regenerate steps against current infra. The rot that accumulated since the last review gets caught in hours instead of weeks.
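The parse-and-classify step can be sketched in a few lines. This toy version treats `$ `-prefixed lines as commands and a verb list as the write heuristic; a real classifier needs more than a regex, but the read/write split is the part that matters, because it is what "proposes any write command" hangs on:

```python
import re

# Verbs that mutate state; anything matching these requires explicit approval.
# The list is illustrative, not exhaustive.
WRITE_VERBS = re.compile(r"\b(delete|apply|scale|rollout|restart|patch|drain|cordon)\b")

def parse_runbook(markdown: str):
    """Split a runbook doc into steps and classify each command as
    read-only (safe to propose) or a write (must be approved by on-call)."""
    steps = []
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("$ "):
            cmd = line[2:]
            steps.append({
                "command": cmd,
                "requires_approval": bool(WRITE_VERBS.search(cmd)),
            })
    return steps
```

A missed write verb fails open, which is why the next section argues enforcement has to live in the runtime, not in a heuristic like this one.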

What on-call does

The on-call runs the steps. Claude Code lays out the plan, the engineer executes it. Opening unbounded production access to a coding agent does not pass the sniff test for any reliability org I have worked with, and should not. The engineer confirms Claude Code picked the right runbook, runs each diagnostic in their own terminal with their own scoped credentials, and tracks pass/fail as they go. When Claude Code picks the wrong runbook, the on-call re-points it, and that correction feeds the index for the next page.

The auth, scope, and audit gap

If Claude Code does not execute against production directly, enforcement becomes the whole game. The runbook has to be scoped to the user running it, the environment it targets, and the actions the current step actually needs. A step that is safe in staging is dangerous in prod. A step that is safe for a senior SRE is catastrophic for a new joiner still learning the cluster. Without tool-level governance that understands user, environment, and action together, you are back to trusting every engineer to read carefully at 3am, which is exactly the failure mode the runbook was supposed to prevent. Finding the right runbook and enforcing the right scopes are two different problems. Claude Code solves the first. The MCP runtime solves the second, with governance scoped per user, per environment, and per action. Both have to work, and neither replaces the other.
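Mechanically, governance that understands user, environment, and action together is a policy table consulted at call time. A toy sketch, with roles, policy entries, and the `authorize` function all illustrative rather than any runtime's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    user: str
    environment: str  # "staging" or "prod"
    tool: str
    action: str       # "read" or "write"

# Hypothetical policy table: (role, environment) -> allowed actions.
# The same step is a write in both places; only the policy differs.
POLICY = {
    ("senior-sre", "prod"):    {"read", "write"},
    ("senior-sre", "staging"): {"read", "write"},
    ("new-joiner", "prod"):    {"read"},
    ("new-joiner", "staging"): {"read", "write"},
}

def authorize(call: ToolCall, roles: dict) -> bool:
    """Allow the call only if the caller's role permits this action in this
    environment. Default-deny: unknown users and environments get nothing."""
    role = roles.get(call.user)
    return call.action in POLICY.get((role, call.environment), set())
```

The default-deny fallback is the design choice worth copying: a missing policy entry blocks the call instead of trusting a tired engineer to read carefully at 3am.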

Workflow 3: Postmortem drafting rots at the archaeology step

Scenario

The incident resolved at 4pm. The retro is Thursday. Someone has to write the draft. The hard part is not the thinking. It is the archaeology: Slack scrollback, PagerDuty timeline, Datadog graphs, deploy history, team template. The incident.io team puts manual reconstruction at 60 to 90 minutes per incident. That matches every team I have run.

Most postmortems get drafted badly at the last minute. The retro starts from a weak foundation, and the same incident class comes back six months later.

What Claude Code does

Type into Claude Code: "Draft the postmortem for INC-4729 using the team template." Claude Code assembles the archaeology. It pulls the Slack transcript, the PagerDuty timeline, the Datadog panels from the incident dashboard, and the deploy log for every service touched. It drops each of those into the team template with source links, so every timeline entry traces back to the panel, commit, or message it came from.

The draft stops at archaeology. Timeline, impact, affected services, evidence. The root cause, contributing factors, and action items fields are left structurally empty. Teams that let AI draft those turn every retro into a cleanup exercise. Zalando's team reported hallucination rates as high as 40 percent in early AI-drafted postmortem analysis, and the lesson is not better prompting. It is to keep anything causal out of the draft.
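The core of the assembly is a merge: events from each tool, normalized onto one clock, each carrying the link it came from. A minimal sketch, with the event shape and example links being assumptions rather than any tool's real payload format:

```python
from datetime import datetime

def build_timeline(*sources):
    """Merge per-tool event lists into one chronological timeline.
    Every entry keeps its source and link, so each line of the draft
    traces back to the panel, commit, or message it came from."""
    merged = [event for events in sources for event in events]
    return sorted(merged, key=lambda e: e["ts"])
```

Usage with two hypothetical feeds:

```python
slack = [{"ts": datetime(2026, 4, 22, 2, 20), "text": "rolling back checkout",
          "source": "slack", "link": "https://example.slack.com/archives/C01/p100"}]
pages = [{"ts": datetime(2026, 4, 22, 2, 13), "text": "p95 threshold breached",
          "source": "pagerduty", "link": "https://example.pagerduty.com/incidents/INC-4729"}]
timeline = build_timeline(slack, pages)
```

Note what the merge does not contain: no root-cause field, no causal ordering beyond timestamps. That is the "structurally empty" discipline in code form.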

What on-call does

The on-call and the retro group review the draft. They are not rewriting it. They correct timeline entries that are wrong, add the signal the archaeology missed (a customer report that came through email, a related incident three days earlier, the deploy two sprints ago that introduced the latent bug), and spend their time on the part that matters: running the 5 whys, pressure-testing the root cause, deciding action items.

The leverage is strongest on the long tail. In my experience, eighty to ninety percent of incidents a mature team handles are high-volume, low-priority events where the archaeology is mechanical and the writeup feels mundane. That is where teams cut corners, and where repeat incidents quietly accumulate. Claude Code absorbs the mundane work so the high-judgment work gets attention on every incident, not just the big ones.

The auth, scope, and audit gap

The tools the draft pulls from carry the most sensitive data in the company. #incidents has customer PII and vendor secrets. The deploy log has commit messages that sometimes leak security context. Datadog dashboards expose traffic patterns across the fleet. The engineer who set up the Slack connector usually has broader workspace read than the postmortem role needs, and the draft ends up citing messages it had no business reading.

Scoping has to happen at the tool layer, not the prompt layer. Which channels the draft can read, which dashboards it can fetch, which tables it can query, all bounded by policy and tied to the user triggering the workflow. Then a provenance trail in a persistent log, showing what the AI accessed, when, and under whose identity. That is the half compliance will ask about, and the half that decides whether the workflow survives its first security review.

Workflow 4: SLO investigation and error budget reviews

Scenario

At Confluent, my team reviewed our availability SLO every Monday. We pulled the week's incidents, measured their impact on the SLO and the customer SLA, and mapped the root causes from each postmortem back to services and themes. The goal was to see whether the week's error budget had been spent on one repeat problem or scattered across five unrelated ones.

Most of the prep was manual correlation: error budget delta, matched to PagerDuty incident, matched to Datadog regression, matched to deploy history, matched to the postmortem, matched to the theme bucket. One SRE typically spent four to six hours on that pipeline before the meeting started. The thinking happened in the review. The prep was legwork.

What Claude Code does

Ask Claude Code to prep the Monday review. It pulls the SLO and SLA deltas, fetches every PagerDuty incident in the window, joins each to the Datadog regression that matches in time and service, pulls the postmortem from Confluence, and extracts the root cause section. It groups root causes into themes using the team's existing taxonomy and hands back a structured brief: error budget delta, the incidents that account for it, the themes, and the open questions the postmortems did not resolve.

What Claude Code does not do is quantify how much of the burn each incident "caused" in percentage terms. That is causal analysis current models do poorly, and a made-up percentage in a metrics review is worse than no number.
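The arithmetic the brief does report is plain budget math, no causal attribution required. For example, a 99.9 percent availability SLO over a 30-day window allows 43.2 minutes of downtime:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed bad minutes in the window for a given SLO target,
    e.g. 0.999 over 30 days -> 43.2 minutes."""
    return window_days * 24 * 60 * (1 - slo)

def budget_consumed(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget spent, as reported in the Monday brief."""
    return bad_minutes / error_budget_minutes(slo, window_days)
```

Reporting "21.6 bad minutes, 50 percent of budget" is mechanical and safe. Splitting those minutes into per-incident causal percentages is the part the brief leaves to the humans in the room.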

The AI hunts. The human decides.

What on-call does

The SRE running the review reads the brief, validates the incident-to-regression matches (Claude Code will get some wrong), writes the causal story the AI refused to guess at, decides which themes warrant action items, and raises the open questions in the meeting. Four hours of prep becomes thirty minutes of review and correction.

The auth, scope, and audit gap

Warehouse-backed workflows are the ones SRE teams have held off on the longest, and the reason is scope. You cannot hand Claude Code unrestricted warehouse access and hope prompt engineering keeps it away from PII. You cannot give it unbounded query budgets and wait to see a five-thousand-dollar scan on next month's bill. Scope enforcement at the MCP runtime layer is what changes the math: this task queries these tables and not others, costs less than fifty dollars, never touches prod write paths. Without that, the workflow stays a prototype and never makes the rotation.

Workflow 5: On-call handoffs lose the context nobody wrote down

Scenario

Handoffs are the most undervalued ritual in SRE work because the incidents they prevent never get counted. Handoff quality tracks how tired the outgoing engineer is, which means handoffs are worst on the shifts that had the most incidents, which is when they matter most. The non-obvious cost: the morning incident where the new on-call did not know a deploy was still baking, and ends up paging the previous on-call at 8am to ask what happened overnight.

What Claude Code does

Claude Code generates the briefing at the rotation boundary, without anyone triggering it. It pulls the last 24 hours of pages with resolution notes, active incidents, baking deploys, SLOs that crossed a burn threshold, unresolved #incidents threads, Zendesk escalations, and customer reports that came in through the on-call email alias. It lists open action items assigned to the rotation. It delivers the briefing as a Slack DM with a copy in the team's handoff Confluence doc.
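Once the data is pulled, the briefing itself is plain formatting. A minimal sketch of the layout, with the section names and the Slack-style `*bold*` markers as assumptions about one team's conventions rather than any API:

```python
def format_handoff_briefing(pages, active_incidents, baking_deploys,
                            slo_burns, open_actions):
    """Assemble the shift-boundary briefing text for a Slack DM.
    Section order mirrors what the incoming on-call needs first."""
    sections = [
        ("Pages (last 24h)", pages),
        ("Active incidents", active_incidents),
        ("Baking deploys", baking_deploys),
        ("SLO burn alerts", slo_burns),
        ("Open action items", open_actions),
    ]
    lines = []
    for title, items in sections:
        lines.append(f"*{title}*")
        if items:
            lines.extend(f"  - {item}" for item in items)
        else:
            lines.append("  - none")  # an explicit "none" beats a missing section
    return "\n".join(lines)
```

The explicit "none" line is deliberate: a quiet shift should produce a briefing that says so, not a briefing with holes the incoming on-call has to interpret.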

What on-call does

The outgoing engineer adds the color only they can add: what they think is a false alarm, which customer report to watch, which deploy they are nervous about, which alert they silenced and why. That is the handoff knowledge that lives in the outgoing engineer's head and nowhere else. Claude Code assembles the facts. The on-call provides the judgment.

The auth, scope, and audit gap

The briefing fires at 5pm whether anyone is logged in or not, which means it needs a credential that lives outside any single engineer's session. Dotfiles on a closed laptop do not qualify. A scheduled workflow without a persistent service identity is not a workflow. It is a cron job that silently stops running the next time someone rotates off the team. Persistent service identity is a property of the MCP runtime, not the engineer's laptop.

Claude Code is a companion, not an autonomous AI SRE

Five workflows, one pattern. Claude Code reads, correlates, drafts, and waits. The human decides.

Most of the AI SRE market is betting the other way. Traversal, Resolve, Anyshift, and others are building toward autonomous agents that page, remediate, and close incidents on their own. I am skeptical. A model's output is a function of its capability and the context it is given. Current models can do the archaeology reliably. They cannot reliably be given enough scoped context and the right tools to remediate production unsupervised. That is a context and tooling gap, not a model gap, and I would rather ship the shape that already works.

Claude Code runs when you ask. It stops when the next step needs judgment. It never pages, rolls back, or closes an incident on its own.

A companion also dodges the procurement fight that stalls autonomous rollouts. You are not replacing a role or adding an on-call tier. You are pointing the tool your team already uses at data sources they already trust, with an MCP runtime that scopes what it can do. The security review goes from "new vendor, new risk" to "scoped tools inside an existing agent."

Every workflow in this article starts as a prompt and grows into a skill. The triage prompt, the runbook dispatcher, the postmortem drafter, the SLO prep pipeline, the handoff briefing: each one begins as something one engineer types once, and becomes a packaged skill every engineer on the rotation invokes the same way. The skill keeps getting sharper because the team keeps editing it: a new data source here, a tighter prompt there, a correction after an incident surfaces a blind spot. One person's trick becomes team infrastructure, and the infrastructure compounds.

Reliability comes from running a proper reliability program, and a proper program is mostly operational work around rituals: triage, runbooks, postmortems, SLO reviews, handoffs. Claude Code earns its keep by making the rituals cheap enough to happen on every shift, not just the ones where someone has the energy for them.

What an AI SRE needs from its MCP tool integration layer

Every workflow above needs the same four things.

  1. Managed authentication and authorization across tools. OAuth flows for every connected tool, credentials refreshed automatically, scoped per user, reachable from any device including a phone at 3am.
  2. Managed compute, always on, team-wide. Tools run on shared infrastructure, cloud-hosted or on-prem, with the same behavior whether the trigger came from a laptop, a phone, a webhook, or a cron job.
  3. Tool- and agent-level governance. Per-tool permission policies, per-task cost budgets, and per-query data access limits enforced where the call happens, not where the model proposes it. This is the difference between a workflow security will approve and one they kill on sight.
  4. Persistent audit logs. Every tool call logged with triggering user, arguments, response, and timestamp, in a log the agent cannot modify. Without this you cannot retro the AI, and you cannot trust it.
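Item 4 is the easiest one to under-build: a log the agent can rewrite is not an audit trail. One way to make tampering evident is a hash chain, where each record embeds the hash of its predecessor, so editing any past record breaks verification. A minimal in-memory sketch, not a substitute for a real append-only store:

```python
import hashlib
import json
import time

class AuditLog:
    """Append-only, hash-chained tool-call log. Each record's hash covers
    its contents plus the previous record's hash, so any later edit to
    any record makes verify() fail."""

    GENESIS = "0" * 64

    def __init__(self):
        self.records = []
        self._last_hash = self.GENESIS

    def append(self, user, tool, args, response):
        record = {
            "ts": time.time(),
            "user": user,
            "tool": tool,
            "args": args,
            "response": response,
            "prev": self._last_hash,
        }
        self._last_hash = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        record["hash"] = self._last_hash
        self.records.append(record)

    def verify(self) -> bool:
        prev = self.GENESIS
        for r in self.records:
            body = {k: v for k, v in r.items() if k != "hash"}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if body["prev"] != prev or digest != r["hash"]:
                return False
            prev = r["hash"]
        return True
```

A production version would also write the chain head somewhere the agent cannot reach, since a chain the writer fully controls can be silently re-chained.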

Arcade: an MCP runtime for AI SRE workflows

Arcade is an MCP runtime built to close exactly this gap. Managed OAuth handles every connected tool, with credentials that refresh automatically and never touch the language model. Every tool call runs on behalf of the user who triggered it, so native permissions in PagerDuty, Datadog, and Snowflake apply exactly as they would outside the agent. You connect PagerDuty once, and every Claude Code session on your team picks it up at the right scope.

The runtime runs tools on hosted workers, deployable in your cloud or on-prem, and enforces per-tool policies where the call happens, not where the model proposes it. The same workflow triggered from a phone, a laptop, or a cron job executes on shared infrastructure. Policies fire at the MCP runtime layer: "this workflow queries these Snowflake tables and not others," "this workflow can propose PagerDuty actions but cannot execute without approval," "this workflow has a $25 query budget."

Every tool call lands in an OpenTelemetry-compatible run log with triggering user, arguments, response, and timestamp. It drops straight into the observability pipeline your platform team already runs. When your postmortem asks what Claude Code did during the incident, you have the answer. When compliance asks for every query this AI ran against the warehouse last quarter, you have the answer.

Prebuilt tools ship for PagerDuty, Datadog, Slack, Jira, Confluence, GitHub, Snowflake, and more. You can also bring your own MCP servers into the runtime: the PagerDuty, Datadog, Snowflake, and Kubernetes servers linked in the table above drop in as-is and inherit the same managed auth, policy enforcement, and audit logs as the prebuilt ones. You extend your existing MCP investment instead of replacing it.

You can build this without Arcade, and the reason not to is the same reason you did not write your own CI system: the work is real, the edge cases are ugly, and it is not where your reliability differentiation lives. A mature team can hand-roll managed OAuth, stand up hosted workers, wire per-tool policy enforcement, and ship a tamper-evident audit log. A few platform teams I know started down that path and concluded it was too costly to own, or simply not where they wanted to spend their reliability budget.

Reducing on-call toil is where SRE leverage lives

The outer loop has not caught up to the inner loop because the infrastructure to run agentic tools safely against production systems has been missing. A coding assistant only needs your repo and your editor. An operational assistant needs managed identity, hosted compute, enforced governance, and an audit trail, because it reaches into systems where mistakes page the CTO.

The SRE teams that figure this out over the next year will pull away from the ones that do not, the same way the teams that adopted Claude Code for inner-loop work in 2024 pulled away from the teams that waited. The inner loop is solved. The outer loop is where the leverage lives now, sitting on a data substrate that is its own design problem.

Claude Code does not replace the on-call. It just lets them start on page 5 instead of page 1.

Frequently asked questions

What is an AI SRE?

An AI SRE is an AI assistant that helps site reliability engineers with operational work: incident triage, runbook execution, postmortem drafting, SLO investigation, and on-call handoffs. Most practical AI SRE deployments today run as companions that read, correlate, and draft while a human engineer decides the next move, rather than as autonomous agents that page, remediate, and close incidents on their own.

What is the difference between an MCP gateway and an MCP runtime?

An MCP gateway routes MCP tools under a single URL so any MCP client can call them. An MCP runtime goes further: it adds the compute that runs the tools, managed authentication, per-tool permission enforcement, and persistent audit logs. A gateway is routing infrastructure. A runtime is production infrastructure. Arcade is an MCP runtime with a gateway inside it.

Can Claude Code replace an on-call engineer?

No. Claude Code works best as a companion to the on-call engineer, not a replacement. It compresses the archaeology (pulling alerts, correlating signals, drafting summaries) so the engineer starts with context already loaded. Every decision that requires judgment (rolling back a deploy, paging a co-worker, closing an incident) stays with the human.

How do I use Claude Code for incident triage?

Point Claude Code at the alert with a prompt like "Triage this alert, correlated with Datadog metrics, service logs, and deployment history. Scan Slack for correlated failures." With MCP servers for PagerDuty, Datadog, Slack, and GitHub wired into an MCP runtime, Claude Code returns a summary, the top correlated signals, candidate deploys, and a draft war room post in two to three minutes.

Is it safe to let Claude Code execute runbooks in production?

Claude Code should not execute against production directly. The safer pattern is for Claude Code to parse the runbook, lay out the diagnostic sequence, and propose commands, while the on-call engineer runs each step in their own terminal with their own scoped credentials. Unbounded production access for any coding agent should not pass a reliability review.

What MCP servers do I need for AI SRE workflows?

The core set covers the tools already in an SRE rotation: PagerDuty, Datadog, Slack, and GitHub for incident triage; Confluence and Kubernetes for runbook execution; Snowflake for SLO investigation; Zendesk for on-call handoffs. Each has a production-ready MCP server that can run inside an MCP runtime like Arcade, which handles managed auth, policies, and audit logs across all of them.

How does Arcade work with Claude Code?

Arcade is an MCP runtime that manages OAuth, per-tool permission policies, and audit logs for every tool Claude Code calls. You connect PagerDuty, Datadog, or Snowflake once, and every Claude Code session on your team picks up the tools at the right scope. Arcade also runs bring-your-own MCP servers, so existing integrations work as-is.

What is the difference between AI SRE tools like Traversal and using Claude Code with an MCP runtime?

Traversal, Resolve, and Anyshift are building autonomous agents that page, remediate, and close incidents on their own. Claude Code with an MCP runtime takes the companion approach: read, correlate, draft, and wait for the engineer to decide. The companion pattern ships today. The autonomous bet does not.

Does the observability store underneath matter as much as the MCP runtime above?

Yes. An AI agent runs 10 to 30 queries per investigation, and most observability stores weren't built to serve that pattern at the retention and cardinality an SRE needs. The MCP runtime handles the execution layer; the observability store handles the cognitive substrate. Both matter. I've written about the substrate side here.

Source

This article was originally published by DEV Community and written by Manveer Chawla.
