- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- Also by me: Thinking in Go — Book 1: Go Programming + Book 2: Hexagonal Architecture
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You're interviewing for a senior AI engineer role in 2026. The interviewer is polite, on time, has your resume open. You've walked through the RAG pipeline you shipped last year. The retrieval flow. The eval harness you wrote in a weekend. They nod.
Then they lean forward.
"Walk me through how you'd know your LLM feature is broken before your users tell you."
90% of candidates flunk this question. Not because they don't know the answer. Because the answer they give is the one they'd have given for a normal web service, and this is not a normal web service.
The PwC 2026 AI Agent Survey found that 79% of organizations running agents in production cannot trace agent failures systematically. The interviewer has read that number. They are sitting across from you to find out which side of it you're on.
What the weak answer sounds like
The candidate pauses. They've thought about this. They answer confidently.
"We monitor latency p95 and p99. Error rates on 4xx and 5xx. Token usage per request. We have alerts if the 5xx rate crosses 1% or if p99 latency goes over two seconds. We also have a weekly review of a sample of user-reported complaints."
The interviewer writes a note and moves on. You won't get a callback.
That answer isn't wrong. It's insufficient. Every single failure mode specific to LLM systems returns HTTP 200. Every one of them keeps p99 latency flat. Every one of them stays inside the per-request token cap. Traditional APM is blind to hallucinations, retrieval drift, tool-call misfires, prompt-injection compromise, model-version swaps, and fallback-tier degradation.
The candidate answered the question they know how to answer. Not the one that was asked.
What the strong answer sounds like
Here's what the interviewer wants to hear. Not the exact words. The exact shape.
"Latency and error rates don't see LLM failures because every failure I care about returns 200 and stays inside the SLO on every span. What I run is five things, in layers."
Then the candidate walks through them. One at a time. Concrete. Named.
1. Canary evals against every production model ID
A fixed set of 10 to 20 prompts, run on an hourly schedule, against every model the service calls. Scored by a judge from a different vendor to avoid self-preference bias. A rolling-mean drop of more than 0.15 sustained across two windows pages the on-call.
This is what would have caught the Anthropic three-bug cascade from August to September 2025. The provider silently reshuffled routing, the TPU config drifted, and the XLA compiler miscompiled for a subset of traffic. No server-side errors. User reports started within hours. Aggregated quality graphs would have shown the break within the first hour.
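The alert rule above can be sketched in a few lines. This is a minimal illustration, not a production scheduler: the baseline score, window size of 10 canary runs, and the `should_page` name are all assumptions for the sketch; only the 0.15 drop sustained across two windows comes from the text.

```python
from statistics import mean

# Canary alert rule sketch: page only when the rolling-mean judge score
# sits more than 0.15 below baseline for two consecutive windows.
DROP_THRESHOLD = 0.15
WINDOW = 10  # canary runs per rolling window (illustrative)

def should_page(baseline: float, scores: list[float]) -> bool:
    """True when the last two rolling windows both sit >0.15 below baseline."""
    if len(scores) < WINDOW + 1:
        return False  # not enough history for two overlapping windows
    window_means = [
        mean(scores[i : i + WINDOW]) for i in range(len(scores) - WINDOW + 1)
    ]
    return all(baseline - m > DROP_THRESHOLD for m in window_means[-2:])

# A steady canary stays quiet; a sustained drop pages.
healthy = [0.82] * 20
degraded = [0.82] * 8 + [0.60] * 12
print(should_page(0.82, healthy))   # False
print(should_page(0.82, degraded))  # True
```

Requiring two consecutive windows is what keeps a single noisy judge run from paging anyone at 2 AM.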
2. An online judge sampled on real traffic, sliced
The candidate says "sampled" and "sliced" in the same sentence. That's the signal.
Not every trace gets judged. That would double the cost of every request. A controlled sample (1 to 5%, weighted toward the slices you care about) runs through a faithfulness judge, a relevance judge, a safety judge. The scores stream back as span attributes. Dashboards and alerts look at the scores per slice: per customer tier, per query recency, per entity type, per product area.
The reason a 1% aggregate hallucination rate hides a 40% rate on the last-seven-days slice is that the aggregate is the wrong number to look at. The strong answer names which slices matter for the feature the candidate shipped.
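The sampled-and-sliced pattern reduces to two small pieces: a per-slice sampling decision at request time, and per-slice aggregation of the judge scores. The slice names, weights, and default rate below are illustrative assumptions; the 1 to 5% range and slice-level aggregation come from the text.

```python
import random
from collections import defaultdict

# Per-slice sample rates: oversample the slices you care about most.
SAMPLE_RATES = {
    "enterprise": 0.05,   # high-value tier, judged more often
    "free": 0.01,
    "recent_docs": 0.05,  # e.g. queries touching last-7-days content
}

def maybe_judge(slice_name: str, rng: random.Random) -> bool:
    """Decide at request time whether this trace enters the judge pipeline."""
    return rng.random() < SAMPLE_RATES.get(slice_name, 0.01)

def per_slice_means(scored: list[tuple[str, float]]) -> dict[str, float]:
    """Aggregate judge scores per slice -- the numbers you actually alert on."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for slice_name, score in scored:
        buckets[slice_name].append(score)
    return {s: sum(v) / len(v) for s, v in buckets.items()}

scores = [("enterprise", 0.9), ("recent_docs", 0.4), ("recent_docs", 0.5)]
print(per_slice_means(scores))  # enterprise healthy, recent_docs degraded
```

An aggregate mean over those three scores looks tolerable; the per-slice view shows the recent-docs regression immediately.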
3. Drift detection on retrieval relevance
For any RAG path, a context-relevance eval runs on a rolling production sample. Each retrieved chunk gets graded against the user query. Mean top-1 and mean top-5 relevance tracked over a 7-day rolling window. A provider embedding-model swap, a chunking change, or a user-query distribution shift all show up here before they show up in user complaints.
The candidate mentions the Ten Failure Modes of RAG Nobody Talks About piece, or the equivalent. Bonus.
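A sketch of the drift check, under stated assumptions: the per-chunk relevance grades would come from the context-relevance eval, and the 0.1 drift tolerance and fixed baseline here are illustrative; the top-1/top-5 tracking over a rolling window is from the text.

```python
from statistics import mean

def sample_relevance(chunk_grades: list[float]) -> tuple[float, float]:
    """chunk_grades: judge scores for retrieved chunks, in retriever rank order."""
    return chunk_grades[0], mean(chunk_grades[:5])  # top-1, mean of top-5

def drifted(window: list[tuple[float, float]],
            baseline_top1: float, tolerance: float = 0.1) -> bool:
    """True when mean top-1 relevance over the rolling window falls below baseline."""
    window_top1 = mean(t1 for t1, _ in window)
    return baseline_top1 - window_top1 > tolerance

# Simulated rolling window after an embedding-model swap degraded retrieval.
samples = [sample_relevance([0.6, 0.5, 0.4, 0.3, 0.2]) for _ in range(50)]
print(drifted(samples, baseline_top1=0.85))  # True: top-1 slid from 0.85 to 0.60
```

Nothing in this signal depends on user complaints; the retriever grades itself against the queries it actually served.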
4. Cost-per-tenant alerting, not cost-per-request
Per-request budget caps catch single runaway calls. They miss the failure shape where one tenant finds a cheap way to ask expensive questions, or a new tool-call loop multiplies every turn by 400. The $47,000 LangChain agent loop passed every per-request check for 11 days.
The instrument is cost-per-tenant-per-hour with a rolling baseline. Alert when a tenant exceeds 3x their 7-day mean for 10 minutes. Cache-hit ratio on the side, alert when it drops below 50% of baseline — that's the cheapest cost-regression detector you will ever wire up.
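The 3x-over-7-day-mean rule is simple enough to sketch directly. The per-minute spend projection and the example dollar figures are assumptions for illustration; the multiplier, the 7-day baseline, and the 10-minute sustain come from the text.

```python
from statistics import mean

MULTIPLIER = 3.0      # alert when a tenant exceeds 3x its own baseline
SUSTAIN_MINUTES = 10  # ... sustained for 10 minutes

def tenant_alert(hourly_costs_7d: list[float],
                 recent_per_minute: list[float]) -> bool:
    """hourly_costs_7d: the tenant's hourly spend over the last 7 days.
    recent_per_minute: the tenant's per-minute spend, most recent last."""
    threshold = MULTIPLIER * mean(hourly_costs_7d)  # vs. 7-day mean hourly spend
    recent = recent_per_minute[-SUSTAIN_MINUTES:]
    # Every one of the last 10 minutes must project over the hourly threshold.
    return len(recent) >= SUSTAIN_MINUTES and all(m * 60 > threshold for m in recent)

normal_week = [2.0] * (24 * 7)             # ~$2/hour baseline
runaway = [0.50] * 10                      # $0.50/min -> $30/hour projected
print(tenant_alert(normal_week, runaway))  # True: 30 > 3 * 2
```

Note what a per-request cap never sees here: each individual call in the runaway window can still be cheap.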
5. Fallback-tier eval in steady state
A senior answer includes this one. A mid-level answer doesn't.
Your primary provider has a fallback. Maybe two. When the primary browns out, you route to the secondary, and you assume the secondary works because you wired it up six months ago and the integration test passed. That's a bet, not a signal.
The strong answer: "The online judge runs on fallback traffic even in steady state." A synthetic slice of real prompts gets routed through the fallback chain every hour. Scores compared to primary. A tertiary tier that scores 0.55 on a Tuesday afternoon is a tertiary tier that will fail the company during the outage that forces it into use.
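The hourly comparison can be sketched as a tier-by-tier gap check. The tier names and the 0.15 gap threshold are assumptions for the sketch; running the judge on fallback traffic in steady state is the point from the text.

```python
MAX_GAP = 0.15  # acceptable judge-score gap between primary and a fallback tier

def degraded_tiers(tier_scores: dict[str, float]) -> list[str]:
    """Return fallback tiers whose hourly judge score lags primary by > MAX_GAP."""
    primary = tier_scores["primary"]
    return [
        tier for tier, score in tier_scores.items()
        if tier != "primary" and primary - score > MAX_GAP
    ]

# One hourly run over the synthetic slice, judged per tier.
hourly_run = {"primary": 0.84, "secondary": 0.80, "tertiary": 0.55}
print(degraded_tiers(hourly_run))  # ['tertiary'] -- fix it before the outage
```

The output of this check on a quiet Tuesday is exactly the list of tiers that will hurt you during the next failover.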
The span attribute that ties it together
At some point the candidate says the phrase app.feature.owner or its equivalent. Every span carries an attribute identifying which team owns the feature the call belongs to. When the alert fires at 2 AM, the runbook routes by that attribute. No triage by the on-call trying to figure out which team owns a trace.
That one attribute is the difference between a system that can be operated and one that cannot.
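Routing by that attribute is a one-line lookup once the attribute exists on every span. The team names and paging channels below are invented for illustration; `app.feature.owner` is the attribute named in the text, and the fallback-to-triage behavior is an assumption about sensible defaults.

```python
OWNER_ATTR = "app.feature.owner"

# Illustrative ownership map: attribute value -> paging channel.
ONCALL = {
    "search-team": "#oncall-search",
    "billing-team": "#oncall-billing",
}

def route_alert(span_attributes: dict[str, str]) -> str:
    """Pick the paging channel from the span's ownership attribute."""
    owner = span_attributes.get(OWNER_ATTR)
    return ONCALL.get(owner, "#oncall-triage")  # unknown owner -> triage queue

span = {"app.feature.owner": "search-team"}
print(route_alert(span))  # #oncall-search
```

The runbook logic stays trivial precisely because the hard part was done at instrumentation time, when every span was stamped with its owner.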
What the interviewer is listening for
Put yourself on the other side of the table. The interviewer is not grading your answer for completeness. They're grading it for whether you have operated a system, not merely built one.
- Do you say sampled, not all traffic? Shows you've thought about cost.
- Do you slice your evals? Shows you've seen a global dashboard hide a slice-level regression.
- Do you alert on baselines, not absolute thresholds? Shows you know absolute thresholds break the week after launch when traffic shape changes.
- Do you mention steady-state eval of fallbacks? Shows you've been on-call during an outage that failed over into a degraded path and made things worse.
- Do you name ownership attributes on spans? Shows you've been the poor on-call who got paged for a trace you couldn't route.
Each of those is a scar. The interviewer has the same scars. They're checking for the match.
Why this question works for the interviewer
A lot of senior AI hires land, ship an MVP, and leave a mess the next engineer has to instrument. The mess is expensive. The PwC 2026 survey number (79% cannot systematically trace agent failures) is the shape of that mess at industry scale.
The question separates:
- People who have built a cool demo, once.
- People who have run one for six months with real users, real cost, real regressions.
The first group talks about the model, the prompt, the framework. The second group talks about the spans, the judges, the baselines, the slice alerts, and the 2 AM page they learned from.
You want to be hiring the second group. You want to be the second group.
If you're the one preparing
Take the five layers above. For each, be able to answer three follow-ups without hesitating:
- "Give me a concrete signal name and the threshold you'd set."
- "What's the first time you saw this signal save you?"
- "What's the failure mode your signal doesn't catch?"
The third one is the one that lands the senior title. Nothing you build catches everything. Knowing what your instrumentation cannot see, and being honest about it in the room, is the move that tells the interviewer you've been in the chair.
The five layers (canary evals, online judge on sliced traffic, retrieval-relevance drift, cost-per-tenant, fallback-tier steady-state eval) and the ownership-attribute pattern are the map of the book I finished last month. Chapter 1 frames what APM cannot see. Chapters 8 through 11 cover evals in detail. Chapter 17 covers alerting and drift. Chapter 18 is the incident-response and production-readiness checklist, including the owner-attribute pattern.
Go into the interview with the map in your head. You'll answer the question in the shape the interviewer is waiting for, and the rest of the loop gets easier.
If this was useful
This post is the short version of an argument the book makes across 18 chapters. The interview question is a shortcut for detecting whether a candidate has the map. Every layer named above is a chapter.
This article was originally published by DEV Community and written by Gabriel Anhaia.
