Technology May 03, 2026 · 14 min read

I Injected Three Faults. The Agent Found All of Them.

DEV Community
by Romar Cablao

Overview

Let's get our hands dirty. This part covers the full setup and the actual demo: deploy PayLedger to both regions, wire up Route 53 failover, configure the Agent Space, inject three simultaneous faults, and walk through exactly what the agent found.

Quick recap from Part 1: PayLedger is a demo payment ledger deployed to ap-southeast-1 (primary) and ap-northeast-1 (secondary) with Route 53 failover, DynamoDB Global Tables, and a Next.js frontend showing which region is serving. DevOps Agent sits in ap-southeast-2 monitoring both. If you haven't read the first part, you can check it out here:

Before You Start

| Requirement | Notes |
| --- | --- |
| AWS account | IAM admin permissions |
| Domain in Route 53 | Hosted zone for the custom domain |
| Serverless Framework v4 | npm install -g serverless |
| Python 3.12 | Lambda runtime |
| ACM certificates | In both apse1 and apne1 for the API subdomain |

New customers get a 2-month free trial for AWS DevOps Agent. After that, billing is per second when the agent is active. Support credits vary by tier.

Reference: AWS DevOps Agent Pricing

Step 1: Create the Agent Space

Create an Agent Space

Before deploying anything in your workload regions, set up the Agent Space first. The webhook credentials produced here are needed later when you wire up alarm forwarding.

Switch to ap-southeast-2 in the AWS Console. Navigate to AWS DevOps Agent and create a new Agent Space. AWS creates the required IAM roles automatically:

  • DevOpsAgentRole-AgentSpace uses AIDevOpsAgentAccessPolicy
  • DevOpsAgentRole-WebappAdmin uses AIDevOpsOperatorAppAccessPolicy

Link your AWS account. Both workload regions (apse1 and apne1) are in the same account, so a single association gives the agent visibility into both.

Once the Agent Space is up, grab the webhook URL and HMAC key from the integrations page. You'll use them in Step 5.

Reference: What are DevOps Agent Spaces?

Step 2: Deploy to Both Regions

Copy .env.example to .env and fill in your values, then run:

bash scripts/setup.sh --step deploy-backend

This deploys to ap-southeast-1 first (which creates the DynamoDB table), then ap-northeast-1 (which skips table creation via a CloudFormation Condition). API Gateway IDs are auto-discovered from CloudFormation and written back to .env. No manual copy-pasting.

If you prefer to run the deploys individually:

# Primary (creates the DynamoDB table)
npx serverless deploy --stage dev --region ap-southeast-1

# Secondary (skips DynamoDB creation via CloudFormation Condition)
npx serverless deploy --stage dev --region ap-northeast-1

Verify both health endpoints are up:

curl https://<APSE1_ID>.execute-api.ap-southeast-1.amazonaws.com/health
# {"status": "healthy", "region": "ap-southeast-1", "service": "payledger", "timestamp": "..."}

curl https://<APNE1_ID>.execute-api.ap-northeast-1.amazonaws.com/health
# {"status": "healthy", "region": "ap-northeast-1", "service": "payledger", "timestamp": "..."}
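
If you prefer a scripted check over two curls, a minimal probe could look like the sketch below. The endpoint URLs are placeholders (substitute your API Gateway IDs from .env), and the response parsing assumes the /health payload shape shown above:

```python
import json
import sys
import urllib.request

# Placeholder endpoints: substitute the API Gateway IDs written to .env.
ENDPOINTS = [
    "https://<APSE1_ID>.execute-api.ap-southeast-1.amazonaws.com/health",
    "https://<APNE1_ID>.execute-api.ap-northeast-1.amazonaws.com/health",
]

def parse_health(body: str) -> tuple:
    """Return (is_healthy, region) from a /health JSON payload."""
    data = json.loads(body)
    return data.get("status") == "healthy", data.get("region", "unknown")

def check(url: str) -> None:
    with urllib.request.urlopen(url, timeout=5) as resp:
        healthy, region = parse_health(resp.read().decode())
        print(f"{region}: {'OK' if healthy else 'UNHEALTHY'}")

if __name__ == "__main__" and "--run" in sys.argv:  # pass --run to probe for real
    for endpoint in ENDPOINTS:
        check(endpoint)
```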

Step 3: Enable DynamoDB Global Table

bash scripts/setup.sh --step setup-global-table

This adds the ap-northeast-1 replica and polls until it reaches ACTIVE status (typically 2-5 minutes). Under the hood it runs update-table with replica-updates Create={RegionName=ap-northeast-1} and waits.
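
In boto3 terms, the same operation is roughly the sketch below. The table name "payledger-dev" is a placeholder (the real name comes from your .env); the polling logic mirrors what the script does, not its exact code:

```python
import time

def replica_update(region: str) -> list:
    """The ReplicaUpdates payload that adds one region to a global table."""
    return [{"Create": {"RegionName": region}}]

def add_replica_and_wait(table: str, home_region: str, replica_region: str,
                         poll_seconds: int = 15) -> None:
    """Add a replica and poll until it reports ACTIVE (typically 2-5 minutes)."""
    import boto3  # imported here so the pure helper above stays dependency-free
    ddb = boto3.client("dynamodb", region_name=home_region)
    ddb.update_table(TableName=table, ReplicaUpdates=replica_update(replica_region))
    while True:
        desc = ddb.describe_table(TableName=table)["Table"]
        statuses = {r["RegionName"]: r.get("ReplicaStatus")
                    for r in desc.get("Replicas", [])}
        if statuses.get(replica_region) == "ACTIVE":
            return
        time.sleep(poll_seconds)

# Live call (placeholder table name), commented out:
# add_replica_and_wait("payledger-dev", "ap-southeast-1", "ap-northeast-1")
```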

Seed some transactions so the UI has data to show:

python scripts/seed_transactions.py

Reference: Amazon DynamoDB Global Tables

Step 4: Configure Custom Domains and Route 53 Failover

Two sub-steps here. Before running them, make sure ACM certificates exist in both regions covering the API subdomain and the failover domain.

# Create API GW custom domains + Alias A records in Route 53
bash scripts/setup.sh --step setup-custom-domains

# Create Route 53 health checks + PRIMARY/SECONDARY failover CNAME records
bash scripts/setup.sh --step setup-route53

setup-custom-domains creates the regional custom domains (apse1-api-payledger.yourdomain.com, apne1-api-payledger.yourdomain.com) and registers both with the failover domain (api-payledger.yourdomain.com) so API Gateway accepts the Host header from either path.

setup-route53 creates health checks (10s interval, FailureThreshold 2) and the PRIMARY/SECONDARY CNAME failover pair. It polls until both health checks pass before returning.

After setup, all traffic to api-payledger.yourdomain.com goes to Singapore. If the health check fails twice (around 20 seconds), Route 53 fails over to Tokyo automatically.
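
The PRIMARY/SECONDARY record pair that setup-route53 creates can be sketched as a Route 53 ChangeBatch. This is a hedged reconstruction, not the script's actual code; the SetIdentifier values and domains are illustrative:

```python
def failover_change_batch(domain: str, primary_target: str, secondary_target: str,
                          primary_hc_id: str, secondary_hc_id: str) -> dict:
    """Build the ChangeBatch for a PRIMARY/SECONDARY failover CNAME pair."""
    def record(role: str, target: str, health_check_id: str) -> dict:
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": domain,
                "Type": "CNAME",
                "TTL": 60,
                "SetIdentifier": f"payledger-{role.lower()}",
                "Failover": role,
                "HealthCheckId": health_check_id,
                "ResourceRecords": [{"Value": target}],
            },
        }
    return {"Changes": [
        record("PRIMARY", primary_target, primary_hc_id),
        record("SECONDARY", secondary_target, secondary_hc_id),
    ]}

# Applied with: boto3.client("route53").change_resource_record_sets(
#     HostedZoneId=zone_id, ChangeBatch=failover_change_batch(...))
```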

# Verify, should hit primary
curl https://api-payledger.yourdomain.com/health
# {"status": "healthy", "region": "ap-southeast-1", "service": "payledger", "timestamp": "..."}

Reference: Amazon Route 53 Failover Routing

Step 5: Store the DevOps Agent Webhook Credentials

The alarm notification flow uses a webhook: CloudWatch Alarm → SNS Topic → devopsAgentTrigger Lambda → DevOps Agent webhook. The setup.sh script handles this via the setup-webhook step, which stores the webhook URL and HMAC key from the DevOps Agent console in Secrets Manager.

bash scripts/setup.sh --step setup-webhook

You'll need the webhook URL and HMAC key from your Agent Space in the DevOps Agent console. Set them in your .env file first:

DEVOPS_AGENT_WEBHOOK_URL=https://event-ai.ap-southeast-2.api.aws/webhook/generic/your-webhook-id
DEVOPS_AGENT_HMAC_KEY=your-hmac-key-here
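
The HMAC key exists so the devopsAgentTrigger Lambda can sign each forwarded alarm payload. The exact header name and canonicalization are defined by the DevOps Agent webhook, so treat this as a generic HMAC-SHA256 signing sketch rather than the product's documented scheme:

```python
import hashlib
import hmac
import json

def sign_payload(payload: dict, hmac_key: str):
    """Serialize a payload and compute its HMAC-SHA256 hex signature."""
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode()
    signature = hmac.new(hmac_key.encode(), body, hashlib.sha256).hexdigest()
    return body, signature

def verify_signature(body: bytes, signature: str, hmac_key: str) -> bool:
    """Constant-time check of a received signature against the shared key."""
    expected = hmac.new(hmac_key.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```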

Step 6: Deploy the Frontend

bash scripts/setup.sh --step deploy-frontend

This provisions the S3 bucket and CloudFront distribution if they don't exist, registers FRONTEND_DOMAIN in Route 53, builds the Next.js app, syncs the output to S3, and invalidates the CloudFront cache. If you just want to run it locally without the cloud provisioning:

bash scripts/setup.sh --step deploy-frontend --local
# Writes frontend/.env.local only. Run with: npm run dev --prefix frontend

The UI polls /health every 5 seconds. Green banner = Singapore (PRIMARY). Amber banner = Tokyo (FAILOVER). When the region changes, a "Failover detected" banner appears automatically.
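
The failover banner logic reduces to comparing the region between polls. A minimal sketch of that state machine (illustrative, not the frontend's actual TypeScript):

```python
import json
from typing import Optional

def parse_region(body: str) -> str:
    """Extract the serving region from a /health JSON payload."""
    return json.loads(body).get("region", "unknown")

def detect_failover(previous: Optional[str], current: str) -> Optional[str]:
    """Return a banner message when the serving region changes between polls."""
    if previous is not None and current != previous:
        return f"Failover detected: {previous} -> {current}"
    return None
```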

Topology - Healthy State

Step 7: Verify Topology

After linking the account, DevOps Agent builds the topology automatically from CloudFormation stacks. Serverless Framework deploys via CloudFormation, so all resources in both regions are discovered without manual setup.

Three views in the web app: System view (account/region boundaries), Container view (CloudFormation stacks), Resource view (full resource graph with cross-region DynamoDB relationship).

The topology is powered by the Agent Space Understanding learned skill. It auto-generates when integrations are configured and powers the Topology page.

AWS DevOps Agent - PayLedger Topology

Reference: What is a DevOps Agent Topology?

Step 8: Verify the Full Stack

Run the verify step to confirm all endpoints are reachable through the failover URL before injecting any faults:

bash scripts/setup.sh --step verify

This runs health checks against both regional endpoints directly, then tests all four endpoints through the Route 53 failover URL including a POST to /transactions. All checks should pass and return 2xx before you continue.

Optional Integrations

The Agent Space works without these, but they make findings easier to consume.

Slack

  1. AWS DevOps Agent console -> Settings -> Communications -> Slack -> Register (OAuth)
  2. Agent Space -> Capabilities -> Communications -> Slack -> select channel -> Create

The Agent Space web app shows all investigation findings regardless. Slack is useful if you want findings posted to a channel without keeping the web app open.

Reference: Connecting Slack

GitHub

  1. Agent Space -> Capabilities -> Pipeline -> Connect -> GitHub
  2. Install the AWS DevOps Agent GitHub App on your account
  3. Grant access to the payledger-aws-devops-agent repository

The agent investigates all three faults without GitHub. The value it adds is deployment correlation. For config-related faults, the agent can correlate errors with recent config changes and deployment history.

Reference: Connecting GitHub

The Demo: Three Faults at Once

With everything set up, I ran python scripts/fault.py inject. The default mode assigns one distinct fault per service simultaneously:

python scripts/fault.py inject
# health       -> throttle   (reserved concurrency = 0)
# transactions -> envvar     (TABLE_NAME removed)
# balance      -> iam        (role swapped to fault-iam, no DynamoDB access)
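
Under the hood, the inject step boils down to three Lambda control-plane calls. This is a hedged sketch, not the repo's fault.py: the function names come from the article's findings, and the fault role ARN is a placeholder you'd resolve from the payledger-dev-fault-iam role:

```python
# Which fault goes to which function (names as reported in the findings).
FAULTS = {
    "payledger-dev-health": "throttle",
    "payledger-dev-listTransactions": "envvar",
    "payledger-dev-getBalance": "iam",
}

def inject_fault(function_name: str, fault: str, fault_role_arn: str = "") -> None:
    import boto3  # live AWS call; region matches the primary deployment
    lam = boto3.client("lambda", region_name="ap-southeast-1")
    if fault == "throttle":
        # Reserved concurrency 0 blocks every invocation before it starts.
        lam.put_function_concurrency(FunctionName=function_name,
                                     ReservedConcurrentExecutions=0)
    elif fault == "envvar":
        # Clearing env vars makes the import-time TABLE_NAME read raise KeyError.
        lam.update_function_configuration(FunctionName=function_name,
                                          Environment={"Variables": {}})
    elif fault == "iam":
        # A role without DynamoDB access turns every query into AccessDenied.
        lam.update_function_configuration(FunctionName=function_name,
                                          Role=fault_role_arn)
```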

The CloudWatch 5xx alarm for ap-southeast-1 fired at 21:30:02 local time (13:30:02 UTC, a couple of minutes after the injection; the investigation timestamps below are in UTC). Route 53 detected the failing health checks and routed traffic to ap-northeast-1. PayLedger continued serving from Tokyo. DevOps Agent started investigating automatically.

Here is the full failover in action. You can see the region indicator shift from Singapore to Tokyo in real time:

The Investigation

The alarm triggered at 21:30:02. The investigation completed at 21:37:05. Total time: 7 minutes and 3 seconds.

Investigation Timeline

The agent opened by reading two things before making a single AWS API call: the Agent Space Understanding skill and the PayLedger component reference file, both auto-generated learned skills from the connected account. Before any CloudWatch or CloudTrail queries had returned, the agent already had context about the service architecture.

Screenshot: Investigation timeline: start, skill reads, first observations

From there it split into three parallel tracks:

  • Lambda logs: 11 tool calls over 1 minute, comparing a baseline window (13:00-13:05 UTC) against the incident window
  • CloudTrail changes: 19 tool calls over 2 minutes 4 seconds, pulling config change events for the account and region
  • Lambda metrics: 7 tool calls over 1 minute 43 seconds, error counts, throttle counts, duration, and invocation counts per function

Screenshot: Investigation timeline: logs, metrics, audit trail

By +2m16s, findings were coming back from all three tracks simultaneously.

Findings

Finding 1: listTransactions Lambda missing TABLE_NAME causing init crash

Every invocation of payledger-dev-listTransactions failed during module initialization. The agent pulled the actual log entry from CloudWatch:

[2026-05-02T13:28:06.250Z] [ERROR] KeyError: 'TABLE_NAME'
Traceback (most recent call last):
  File "/var/task/functions/list_transactions.py", line 29, in <module>
    TABLE_NAME = os.environ["TABLE_NAME"]
INIT_REPORT Phase: init  Status: error  Error Type: Runtime.Unknown

26 error records in the incident window, zero in baseline. It confirmed the missing variable by inspecting the live function configuration directly: ALLOWED_ORIGINS, POWERTOOLS_SERVICE_NAME, LOG_LEVEL, REGION were all present. No TABLE_NAME. The function was never initializing. Every cold start failed before the handler could run.
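
The import-time read is what turned a missing variable into a hard init crash. A common alternative, shown here as an illustrative pattern rather than the repo's code, is to fail fast with an explicit message so the log points straight at the missing configuration:

```python
import os

# The failing pattern: an import-time read, so a missing variable raises a
# bare KeyError during init and the handler never runs (INIT_REPORT error):
#     TABLE_NAME = os.environ["TABLE_NAME"]

def require_env(name: str) -> str:
    """Fail fast at init with an explicit message instead of a bare KeyError."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value
```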

Finding 2: getBalance Lambda using fault-iam role with no DynamoDB permissions

The function was assigned payledger-dev-fault-iam, which only has AWSLambdaBasicExecutionRole. Every DynamoDB query returned AccessDeniedException. The function handled the exception gracefully, so the Lambda Errors metric showed 0. API Gateway still recorded the 500s. The agent caught this by looking at both metrics separately rather than relying on either one alone.
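
Why the Lambda Errors metric stayed at 0 is easy to see from the handler shape. An illustrative sketch (not the repo's actual handler): because the DynamoDB exception is caught and mapped to a 500, the invocation itself succeeds, so only API Gateway's 5xx count reflects the failure:

```python
def get_balance_handler(query_balance) -> dict:
    """Catch the downstream failure and map it to a 500 response.
    The invocation completes normally, so Lambda's Errors metric stays
    at 0 while API Gateway still records the 5xx."""
    try:
        return {"statusCode": 200, "body": str(query_balance())}
    except Exception as exc:  # e.g. botocore ClientError: AccessDeniedException
        return {"statusCode": 500, "body": f"internal error: {exc}"}
```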

Finding 3: health function throttled to zero

Reserved concurrency had been set to 0, blocking all invocations before execution. 11 throttles at 13:27, 79 throttles at 13:28. Invocation count at 13:28 dropped to only 20 from the normal 90-100 per minute. The function had zero errors when it did execute, confirming it was a concurrency limit, not a code problem.

The Accounting

The agent reconciled the numbers before writing the final report:

| Source | Errors | Share |
| --- | --- | --- |
| health (reserved concurrency = 0) | 90 (11 + 79) | 90% |
| listTransactions (missing TABLE_NAME) | 5 | 5% |
| getBalance (wrong IAM role) | 5 | 5% |
| Total | 100 | 100% |
100 5xx errors, all accounted for.
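
The reconciliation is simple arithmetic, which is exactly why it's a useful sanity check. Recomputing it from the counts in the findings:

```python
# Recompute the agent's error reconciliation from the reported counts.
errors_by_source = {
    "health (throttles, 13:27 + 13:28)": 11 + 79,
    "listTransactions (missing TABLE_NAME)": 5,
    "getBalance (wrong IAM role)": 5,
}
total_5xx = sum(errors_by_source.values())
shares = {k: round(100 * v / total_5xx) for k, v in errors_by_source.items()}
print(total_5xx)  # 100
print(shares)     # 90 / 5 / 5 percent
```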

Root Cause

CloudTrail confirmed the trigger. All three configuration changes happened within a 2-second window:

  1. PutFunctionConcurrency on payledger-dev-health. Reserved concurrency set to 0 (13:27:54Z)
  2. UpdateFunctionConfiguration on payledger-dev-listTransactions. All environment variables cleared (13:27:55Z)
  3. UpdateFunctionConfiguration on payledger-dev-getBalance. Execution role changed to payledger-dev-fault-iam, env vars cleared (13:27:56Z)
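
You can replay the same audit-trail query yourself. A sketch of the CloudTrail lookup the agent's findings imply, with a pure parameter builder and the live call commented out (times match the 13:27-13:30 UTC window above):

```python
from datetime import datetime, timezone

def cloudtrail_window(event_name: str, start: datetime, end: datetime) -> dict:
    """Kwargs for cloudtrail lookup_events, filtered to one control-plane event."""
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventName", "AttributeValue": event_name},
        ],
        "StartTime": start,
        "EndTime": end,
    }

# Live query, commented out:
# import boto3
# ct = boto3.client("cloudtrail", region_name="ap-southeast-1")
# params = cloudtrail_window("UpdateFunctionConfiguration",
#                            datetime(2026, 5, 2, 13, 27, tzinfo=timezone.utc),
#                            datetime(2026, 5, 2, 13, 30, tzinfo=timezone.utc))
# for e in ct.lookup_events(**params)["Events"]:
#     print(e["EventTime"], e["EventName"])
```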

The root cause statement from the agent:

"The role name 'payledger-dev-fault-iam', the use of Boto3 scripting, and the rapid self-recovery at 13:29:00Z strongly indicate this was a deliberate chaos engineering / fault injection exercise rather than an accidental misconfiguration."

That last line is the notable part: the agent identified the devopsAgentTrigger Lambda in the stack, weighed the evidence, and flagged the fault as intentional. It was right.

Screenshot: Root Cause

Mitigation Plan

The agent returned: no mitigation action required.

Two things happened in parallel during this incident. Route 53 detected the failing health checks and automatically failed over to ap-northeast-1 within 20 seconds, so the service kept running throughout. That part required no intervention. On the primary region side, the faults were reversed at 13:29:00 UTC when fault.py restore ran, 2 minutes after injection. The agent saw the 5xx errors drop to 0, matched it against the CloudTrail restore events, and concluded there was nothing left to fix.

"This was a controlled chaos engineering exercise to test system resilience. The incident self-recovered at 13:29:00 UTC, indicating the configurations were reverted as part of the planned test. Since this was intentional testing and the system has already recovered, no immediate operational mitigation is required."

A system that generates restore commands for changes that have already been reverted would be wrong. The agent recognized self-recovery and didn't produce output that didn't apply.

Screenshot: Mitigation plan tab

Here is the full AWS DevOps Agent investigation in action:

Observations

The agent built its own context before touching a single API. It started by reading the Agent Space Understanding skill, which auto-generates from your connected account and maps resources, request paths, and service relationships. Before any CloudWatch or CloudTrail queries had returned, it already had the architecture context to make sense of what it was about to find.

Three root causes from one alarm. A single 5xx alarm triggered. The agent identified three distinct failure mechanisms, attributed the exact error count to each (90 throttles, 5 init crashes, 5 IAM errors), and traced all three to the same 2-second injection window in CloudTrail. That correlation is not obvious when a throttle, a KeyError, and an AccessDeniedException don't look like they came from the same event.

The empty mitigation plan was the correct answer. My expectation was restore commands. Instead the agent returned "no mitigation action required." Route 53 had already kept the service running via automatic failover, and the primary region faults had been reversed by fault.py restore. The agent recognized both facts in the metrics and CloudTrail, and declined to produce output that didn't apply. Knowing when not to act is more useful than inventing remediation work that no longer needs doing.

It identified the test as intentional. Not just "three things broke." The agent concluded this was fault injection, named the evidence (role name, Boto3 scripting, 2-minute self-recovery), and assessed it correctly. That was not something I scripted or hinted at.

Restoring the Stack

After the demo, restore all faults:

# Restore all faults at once
python scripts/fault.py restore

# Or restore individually
python scripts/restore_fault_iam.py --stage dev
python scripts/restore_fault_throttle.py --stage dev
python scripts/restore_fault_envvar.py --stage dev

# Wait around 60s for health checks to pass
curl https://api-payledger.yourdomain.com/health
# {"status": "healthy", "region": "ap-southeast-1"}

Once the health checks recover, Route 53 routes traffic back to ap-southeast-1. The primary region is restored.

Wrapping Up

The DR Toolkit series covered Prepare. This series covered the middle: a multi-region demo app with real failover, three simultaneous faults, and AWS DevOps Agent investigating all of them from a single alarm trigger. The agent identified the root cause, recognized the service had already recovered, and correctly concluded no action was needed, because the evidence from logs, metrics, and CloudTrail told it this was an injected fault, not a real incident.

Route 53 kept the service running by routing to the healthy region. DevOps Agent used that time to find exactly what broke in the primary region. That is the relationship between the two: one buys you time, the other uses it.

The Agent Space Understanding skill was the most visible differentiator in this investigation. It auto-generated from the connected account and gave the agent architecture context before the first API call. No manual input required.

AWS DevOps Agent handles the full investigation loop on its own: topology discovery, root cause analysis, and Slack notification. If you have a previous DR Toolkit runbook, you can optionally load it as a Custom Skill to give the agent extra context. If you haven't seen the DR Toolkit series: BuildWithAI: DR Toolkit on AWS.

Try it / Fork it:

PayLedger Repo: github.com/romarcablao/payledger-aws-devops-agent

romarcablao / payledger-aws-devops-agent

DevOpsAgent: Beyond the Runbook

PayLedger — Multi-Region Serverless Payment Ledger

AWS DevOps Agent Topology

Multi-region serverless payment ledger for recording transactions and viewing balances with active-passive failover. Deployed across ap-southeast-1 (Singapore, primary) and ap-northeast-1 (Tokyo, secondary) using AWS Lambda, DynamoDB Global Tables, and Route 53 failover routing.

Built as a demonstration platform for disaster recovery testing with AWS DevOps Agent.

Note: PayLedger is a demo project. It is not affiliated with any real business, does not process real transactions, and contains no personally identifiable information.

Kiro AWS DevOps Agent AWS Lambda Amazon DynamoDB Amazon Route 53 Amazon CloudFront Next.js Python

Architecture

                    payledger.yourdomain.com (CloudFront + S3)
                              │
                         Next.js static UI (balance, transactions, region indicator)
                              │
                              ▼
                    api-payledger.yourdomain.com
                              │
                    Route 53 failover routing
                    ├── PRIMARY   ──▶ apse1-api-payledger.yourdomain.com  ← health check
                    └── SECONDARY ──▶ apne1-api-payledger.yourdomain.com  ← health check
                    TTL: 60s | health check: 10s interval, 2 failures to trip
                              │
               ┌──────────────┴──────────────┐
               │                             │
    ap-southeast-1 (Singapore)     ap-northeast-1 (Tokyo)
    ├── API Gateway (regional)     ├── API Gateway (regional)
    ├── Lambda: createTransaction  ├── Lambda: createTransaction
    ├── Lambda: listTransactions   ├── Lambda:

This article was originally published by DEV Community and written by Romar Cablao.