Overview
I finished the DR Toolkit thinking I had covered the important parts of disaster recovery: runbooks, RTO/RPO targets, post-mortems. Then I mapped out the actual incident lifecycle and realized everything I built sits at the edges. The middle part (detecting the incident, correlating signals across regions, finding the root cause while the primary region is actively failing) was not covered. That gap is what this series is about.
In the BuildWithAI: DR Toolkit on AWS series, I walked through building six AI-powered tools that automate the tedious parts of DR planning, all running on serverless AWS in ap-southeast-1. Those tools handle what you do before an incident and what you do after. But none of them touch the part in between: the actual incident response.
This series covers that middle phase using AWS DevOps Agent. The demo app is PayLedger, a multi-region serverless payment ledger built specifically for this blog. It is not a real product and contains no real user data. Part 1 maps out the gap, introduces DevOps Agent, and walks through the architecture. Part 2 covers the full setup and the actual demo, including what the agent's investigation looked like when I ran three real faults against it.
The DR Lifecycle, Mapped Out
| Phase | What happens | Covered by |
|---|---|---|
| Prepare | Runbooks, RTO/RPO targets, DR strategy, checklists | DR Toolkit |
| Detect | Alarm fires, SNS notifies DevOps Agent, health check fails, DNS fails over | CloudWatch + Route 53 + SNS |
| Investigate | Root cause analysis, cross-region signal correlation | AWS DevOps Agent |
| Recover | Apply fix, bring the unhealthy region back up, validate failback | Human + runbook |
| Learn | Prevention recommendations, operational improvements | DevOps Agent |
The DR Toolkit is solid for Prepare. CloudWatch and Route 53 handle Detect: alarms fire and Route 53 failover routes traffic to the healthy region automatically. But Investigate is the phase with no real tooling unless someone builds it themselves: figuring out why a service in the primary region is down, correlating signals across services, and giving the team the information it needs to bring that region back up.
That is what AWS DevOps Agent targets.
What is AWS DevOps Agent?
AWS DevOps Agent is a frontier agent for cloud operations. "Frontier agent" is AWS's term for autonomous systems that work independently, scale across concurrent tasks, and run persistently without constant human oversight. It starts working the moment an alarm fires, no manual trigger needed.
Three capabilities:
Autonomous incident response. When an alert comes in, the agent starts investigating immediately. It correlates signals across services and regions. If multiple alarms fire from the same root cause, it identifies them as related rather than treating each one separately. Root cause categories it investigates: system changes, input anomalies, resource limits, component failures, and dependency issues.
Proactive incident prevention. After an investigation, the agent recommends improvements in four areas: observability, infrastructure optimization, deployment pipeline, and application resilience.
On-demand SRE tasks. Conversational chat against your actual infrastructure. You can ask about resource state, alarm status, or deployment history without switching consoles.
The service uses a dual-console architecture. The AWS Console is for admin setup (Agent Space creation, integrations). A separate Agent Space web app is for day-to-day work (investigations, topology, prevention, chat).
More on features: AWS DevOps Agent features and About AWS DevOps Agent
A Note on Region Availability
As of this writing, AWS DevOps Agent is not available in ap-southeast-1 (Singapore) at GA. Supported regions are: us-east-1, us-west-2, eu-central-1, eu-west-1, ap-southeast-2, ap-northeast-1. AWS may add support for more regions in the future, so it is worth checking the supported regions page before you start.
The two closest for SEA builders are ap-southeast-2 (Sydney) and ap-northeast-1 (Tokyo). For this demo I used ap-southeast-2, but you can use any supported region you prefer. The Agent Space and its investigation data live there. Your workload stays wherever it is. Cross-region monitoring means the agent discovers and monitors resources across any linked AWS account regardless of region.
The Agent Space region is where your investigation data is stored, not where your app runs. For this demo, a single Agent Space in ap-southeast-2 monitors resources in both ap-southeast-1 and ap-northeast-1.
Reference: AWS DevOps Agent Supported Regions
The Demo App: PayLedger
Note: PayLedger is a demo project built solely for this blog series. It is not affiliated with any real business, does not process real transactions, and contains no personally identifiable information. All data is synthetic and generated by a seed script.
A payment ledger is a practical choice for a DR demo because the requirements are clear. Any outage means transactions fail and balances go stale. The multi-region setup is the right response to that, not over-engineering.
PayLedger has four endpoints: record a transaction, list recent transactions, get the current balance, and a health check. Deployed to two regions with Route 53 active-passive failover and DynamoDB Global Tables for data replication.
```
payledger.yourdomain.com (CloudFront + S3)
        |
   Next.js UI
   (balance, transactions, region indicator)
        | calls
        v
api-payledger.yourdomain.com
        |
   Route 53 (failover routing)
   |-- PRIMARY   -> ap-southeast-1 (Singapore)
   +-- SECONDARY -> ap-northeast-1 (Tokyo)

ap-southeast-1 (Singapore)              ap-northeast-1 (Tokyo)
+-- API Gateway                         +-- API Gateway
+-- Lambda: createTransaction           +-- Lambda: createTransaction
+-- Lambda: listTransactions            +-- Lambda: listTransactions
+-- Lambda: getBalance                  +-- Lambda: getBalance
+-- Lambda: health                      +-- Lambda: health
+-- Lambda: devopsAgentTrigger          +-- Lambda: devopsAgentTrigger
+-- DynamoDB <-- Global Table -->       +-- DynamoDB (replica)
+-- SNS Topic (alarm notifications)     +-- SNS Topic (alarm notifications)
+-- CloudWatch alarms                   +-- CloudWatch alarms

ap-southeast-2 (Sydney)
+-- AWS DevOps Agent
    +-- Agent Space
    +-- Slack (optional)
    +-- GitHub (optional)
```
| Layer | Service | Notes |
|---|---|---|
| Frontend | Next.js (static) + S3 + CloudFront | payledger.yourdomain.com |
| DNS | Route 53 | Failover routing + health checks |
| Compute | Lambda (Python 3.12) | 5 functions per region |
| API | API Gateway (HTTP API, regional) | Custom domain per region |
| Database | DynamoDB Global Tables | Multi-region replication |
| Observability | CloudWatch | Alarms in both regions |
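To make the health-check contract concrete, here is a minimal sketch of what a /health Lambda handler could look like. This is hypothetical code, not PayLedger's actual implementation (which lives in the repo); it assumes the handler only needs to report liveness and the serving region.

```python
# Hypothetical sketch of a /health Lambda handler. The real PayLedger
# handler lives in the repo; the response shape here is an assumption.
import json
import os


def handler(event, context):
    """Return 200 with the serving region, so both Route 53 health
    checks and the frontend's region indicator can use this endpoint."""
    region = os.environ.get("AWS_REGION", "unknown")
    body = {"status": "ok", "region": region, "service": "payledger"}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }
```

Because the fault scenarios break things like IAM access and environment variables, a real health check would likely also touch DynamoDB so those faults surface as failed checks rather than silent errors.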
Route 53 checks /health every 10 seconds. If the health check fails twice (around 20 seconds), DNS fails over to Tokyo automatically. Traffic routes to the healthy region while the team investigates and works to restore the primary. The frontend polls /health every 5 seconds and shows which region is serving: green for Singapore (PRIMARY), amber for Tokyo (FAILOVER).
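Under the hood, failover routing is just two record sets with the same name and different Failover roles, where only the primary carries the health check. As a hedged sketch (domain names, set identifiers, and the health check ID are placeholders, not PayLedger's real values), the change batch you would send through boto3 looks like this:

```python
# Sketch of the Route 53 failover record pair behind the API domain.
# All names and IDs are placeholders for illustration.
def failover_change_batch(domain, primary_target, secondary_target,
                          health_check_id):
    """Build a ChangeBatch with a PRIMARY/SECONDARY CNAME pair.
    Only the PRIMARY record references the health check; when it
    fails, Route 53 starts answering with the SECONDARY target."""
    def record(failover, target, set_id, hc=None):
        r = {
            "Name": domain,
            "Type": "CNAME",
            "SetIdentifier": set_id,
            "Failover": failover,
            "TTL": 60,
            "ResourceRecords": [{"Value": target}],
        }
        if hc:
            r["HealthCheckId"] = hc
        return r

    return {"Changes": [
        {"Action": "UPSERT",
         "ResourceRecordSet": record("PRIMARY", primary_target,
                                     "payledger-primary", health_check_id)},
        {"Action": "UPSERT",
         "ResourceRecordSet": record("SECONDARY", secondary_target,
                                     "payledger-secondary")},
    ]}

# Applying it requires AWS credentials and a real hosted zone:
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z...", ChangeBatch=failover_change_batch(...))
```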
DynamoDB Global Tables replicate data between both regions. After failover, the balance and transaction history are intact in Tokyo. Same data, just a different region serving it. That is the whole point of the architecture.
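For a sense of why replication "just works" here: a Global Table replicates ordinary writes, so the item a Lambda puts in Singapore shows up in the Tokyo replica with no extra code. The schema below is a hypothetical sketch, not PayLedger's actual table design; stamping the recording region is one way to keep replicated writes traceable after a failover.

```python
# Hypothetical item shape for one ledger entry; the real schema is
# defined in the repo. A normal put_item on the Global Table in
# ap-southeast-1 replicates automatically to ap-northeast-1.
import uuid
from datetime import datetime, timezone
from decimal import Decimal


def build_transaction_item(account_id, amount, region):
    """Build a DynamoDB item for one transaction, stamped with the
    region that recorded it."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "pk": f"ACCOUNT#{account_id}",
        "sk": f"TXN#{now}#{uuid.uuid4().hex[:8]}",
        "amount": Decimal(str(amount)),  # DynamoDB numbers are Decimals
        "recordedIn": region,
    }
```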
How the Demo Works
When faults are injected into ap-southeast-1, the health check starts failing. Route 53 detects the failure and routes traffic to ap-northeast-1 within around 20 seconds. Users continue to be served from Tokyo while DevOps Agent investigates in the background. Once the agent identifies the root causes and the team applies the fixes, the primary region recovers and Route 53 fails back.
This is the core of the DR story: failover keeps the service running; the investigation tells you what broke so you can fix it.
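The Detect wiring in each region boils down to alarms whose action is the SNS topic that notifies DevOps Agent. As a hedged sketch of one such alarm (function names, thresholds, and the topic ARN are illustrative assumptions, not PayLedger's real configuration):

```python
# Sketch of one Detect-side alarm per region: Lambda errors -> SNS,
# where the SNS topic is what notifies DevOps Agent. Names and
# thresholds are illustrative.
def error_alarm_kwargs(function_name, topic_arn):
    """Build put_metric_alarm kwargs for a Lambda Errors alarm that
    publishes to the region's alarm topic when it trips."""
    return {
        "AlarmName": f"payledger-{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }

# Creating it requires AWS credentials:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **error_alarm_kwargs("createTransaction", "arn:aws:sns:..."))
```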
Three Fault Scenarios
In Part 2, I inject three faults against the primary region using fault.py, a Python script for fault injection and restoration. Each represents a common real-world serverless incident.
| # | Fault | How it breaks | Root cause category |
|---|---|---|---|
| 1 | IAM permission denied | Role swapped to fault role with no DynamoDB access | System change |
| 2 | Lambda throttling | Reserved concurrency = 0, 429 before function runs | Resource limits |
| 3 | Missing environment variable | TABLE_NAME removed, KeyError at module load | Code/config change |
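The full fault.py is in the repo; as a hedged sketch of what fault 2 amounts to, here is the throttle mechanic in isolation. Setting reserved concurrency to 0 makes every invocation throttle before the function code runs, and deleting the setting restores normal scaling. Function names are placeholders.

```python
# Sketch of fault #2 (Lambda throttling) only; the real fault.py in the
# repo handles all three faults plus restoration. Names are placeholders.
def inject_throttle(lambda_client, function_name):
    """Reserved concurrency of 0 means every invocation is rejected
    with a 429 before the function code ever runs."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=0,
    )


def restore_throttle(lambda_client, function_name):
    """Removing the reserved-concurrency setting returns the function
    to the region's shared concurrency pool."""
    lambda_client.delete_function_concurrency(FunctionName=function_name)

# Usage (requires AWS credentials):
# import boto3
# client = boto3.client("lambda", region_name="ap-southeast-1")
# inject_throttle(client, "payledger-getBalance")
```

This fault is nasty precisely because nothing is wrong with the code or the data: the function is simply never allowed to run, which is what makes "resource limits" a distinct root cause category for the agent to identify.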
What makes this interesting: all three run simultaneously using python scripts/fault.py inject (the default mode assigns one distinct fault per service). One alarm fires in ap-southeast-1, three different root causes show up in the investigation, and DevOps Agent has to untangle all of them in a single run. That is a harder test than running each fault separately.
Where This Fits in the DR Lifecycle
The DR Toolkit covered the Prepare phase. This series covers Investigate and Recover. The part that happens after the alarm fires.
DevOps Agent does not need the DR Toolkit to investigate. It reads your topology, correlates signals across services, identifies root causes, and posts findings to Slack on its own. AWS DevOps Agent is capable enough to detect, investigate, root cause, and even generate post-mortem inputs without any external tool.
The connection here is context: if you want to give the agent extra architecture knowledge upfront, you can optionally load a runbook generated by the DR Toolkit as a Custom Skill.
What's Next?
In Part 2, we'll get our hands dirty with the full setup and the demo: deploying PayLedger to both regions, configuring Route 53 failover, setting up the Agent Space, and then running the faults. I'll walk through the actual investigation the agent ran: the timeline, the findings, the root cause, and what it concluded about mitigation.
Try it / Fork it:
- PayLedger Repo: github.com/romarcablao/payledger-aws-devops-agent
References:
This article was originally published by DEV Community and written by Romar Cablao.
Read original article on DEV Community

