We're at the end of the series. Nine chapters of mechanism. One chapter of opinion.
Building the Auth Gateway took roughly two years from "what if NGINX did the auth?" to "this thing handles every authenticated request in production." A lot of what's in the previous chapters wasn't obvious to us at the start. This is the post-mortem on our own architecture: what worked, what hurt, what we'd build earlier, and what we'd warn the next team about.
What worked
A few decisions held up cleanly. We'd make all of them again.
auth_request as the primitive
NGINX's auth_request directive is, with no exaggeration, the single highest-leverage design choice in the platform. One directive, well understood, supported across NGINX versions. We don't need a service mesh. We don't need a custom Envoy filter. We don't need a Lua module compiled into NGINX.
If you can do your auth in HTTP-status terms (200/401/403), auth_request is the right tool. If you can't, you probably want a sidecar or mesh-level enforcement and this whole architecture doesn't apply.
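To make the 200/401/403 contract concrete, here is a minimal sketch of the kind of /auth handler that auth_request points at. The handler shape, the verifyToken stand-in, and the orders:read permission are illustrative placeholders, not our production code.

```go
package main

import (
	"net/http"
	"strings"
)

// verifyToken is a placeholder: the real Auth Service validates a JWT and
// returns the permissions it carries.
func verifyToken(authz string) (map[string]bool, bool) {
	if !strings.HasPrefix(authz, "Bearer ") {
		return nil, false
	}
	return map[string]bool{"orders:read": true}, true // pretend the token grants this
}

func authHandler(w http.ResponseWriter, r *http.Request) {
	perms, ok := verifyToken(r.Header.Get("Authorization"))
	if !ok {
		w.WriteHeader(http.StatusUnauthorized) // 401: NGINX rejects the request
		return
	}
	// In the real service the required permission comes from the endpoint metadata lookup.
	if !perms["orders:read"] {
		w.WriteHeader(http.StatusForbidden) // 403: authenticated, but not authorized
		return
	}
	w.WriteHeader(http.StatusOK) // 200: NGINX proxies the request to the upstream service
}

func main() {
	http.HandleFunc("/auth", authHandler)
	http.ListenAndServe(":8080", nil)
}
```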
Endpoint metadata as data
Storing endpoint type and required permissions in Postgres, refreshed via Pub/Sub, was the right call. It means:
- We can change auth without redeploying the gateway.
- We can audit auth ("what protects this URL?") with a SQL query.
- Admin tooling and the gateway share a single contract.
The cost — a small DB lookup at boot, an in-memory trie, a refresh mechanism — was tiny compared to the operational flexibility we got back.
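For a rough sense of what "endpoint metadata as data" looks like at boot, here is a sketch that assumes a hypothetical endpoint_metadata table; the real schema, column names, and trie code differ.

```go
package authmeta

import (
	"context"
	"database/sql"

	"github.com/lib/pq"
)

// EndpointMeta is one row of the hypothetical endpoint_metadata table.
type EndpointMeta struct {
	Method      string
	PathPattern string
	Type        string   // e.g. "public", "authenticated", "admin"
	Permissions []string // required permissions, stored as a Postgres text[]
}

// loadEndpoints runs once at boot (and again on every refresh kick); the caller
// builds the in-memory trie from the rows it returns.
func loadEndpoints(ctx context.Context, db *sql.DB) ([]EndpointMeta, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT method, path_pattern, endpoint_type, required_permissions
		   FROM endpoint_metadata`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var out []EndpointMeta
	for rows.Next() {
		var m EndpointMeta
		if err := rows.Scan(&m.Method, &m.PathPattern, &m.Type, pq.Array(&m.Permissions)); err != nil {
			return nil, err
		}
		out = append(out, m)
	}
	return out, rows.Err()
}
```

The "what protects this URL?" audit is a plain query against the same table, which is exactly the point.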
One structured log line per decision
AUTH_DECISION is the contract between the Auth Service and oncall. Every field, every time, every request. A year of operations later, this is the artifact we reference most often. Every alert we've built points at it. Every incident postmortem references it.
Resist the temptation to add INFO/DEBUG lines around it. Resist the temptation to omit fields when they're "not relevant." One line. Same shape. Forever.
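For illustration, here is roughly what emitting that one line can look like with Go's log/slog. The field names and enum values below are made up for the example; the real contract is whatever your versioned schema doc says.

```go
package authlog

import "log/slog"

// Decision is the fixed field set for one AUTH_DECISION line (illustrative names).
type Decision struct {
	RequestID string
	Tenant    string
	Slug      string
	Path      string
	Outcome   string // e.g. "allow" | "deny_unauthenticated" | "deny_forbidden"
	Reason    string // an enum value, never a free-form string
	LatencyMS float64
}

// Emit writes every field, every time, even when a value is empty or "not relevant".
func Emit(l *slog.Logger, d Decision) {
	l.Info("AUTH_DECISION",
		"request_id", d.RequestID,
		"tenant", d.Tenant,
		"slug", d.Slug,
		"path", d.Path,
		"outcome", d.Outcome,
		"reason", d.Reason,
		"latency_ms", d.LatencyMS,
	)
}
```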
Fail-closed by default at the edge
error_page 502 503 504 = @auth_unavailable; was a one-line change that defines our security posture. When the Auth Service is unhealthy, NGINX returns 503 to the client instead of letting the request through. We've had a few incidents where this caused brief platform-wide outages. We have never regretted the choice.
The principle: the cost of a 5-minute outage on rare occasions is much, much less than the cost of one cross-tenant data leak ever.
Caches in the auth process, not in NGINX
auth_request is intentionally not cacheable, and we leaned into that. Every cache lives inside the Auth Service: JWT verify, RSA keys, route lookup, policy bitmap, revocation map, SA versions. Each is invalidated through its own channel. The gateway's hot path makes zero Redis calls in steady state.
This kept the architecture honest. The auth pod is the unit of correctness. Scale it, monitor it, debug it as one thing.
Pub/Sub-driven trie reload
Push-based invalidation for the endpoint trie was the right shape. Periodic-only would have given us a ~30 minute window where new admin routes were unprotected. Pub/Sub-only would have been brittle (events get lost). Both, with periodic as the safety net, gives us seconds of staleness in the common case and bounded staleness even when the message is lost.
Most caches we'd default to TTL. The trie was worth the special case.
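The shape of that loop is simple enough to sketch. Here notify stands in for the real Pub/Sub subscription, reload rebuilds the trie, and the 30-minute ticker is the safety-net window mentioned above; all three are placeholders rather than the production code.

```go
package triereload

import (
	"context"
	"time"
)

func runReloadLoop(ctx context.Context, notify <-chan struct{}, reload func(context.Context) error) {
	ticker := time.NewTicker(30 * time.Minute) // periodic safety net for lost messages
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-notify: // Pub/Sub event: a route changed, reload within seconds
		case <-ticker.C: // periodic refresh bounds staleness even if the event was lost
		}
		if err := reload(ctx); err != nil {
			continue // keep serving the previous trie; the next kick or tick retries
		}
	}
}
```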
The bitmap fast path
Encoding permissions as bit indexes paid off. Smaller tokens, faster checks, cleaner metrics. The legacy path we kept around for safety has earned its keep — version skew is real, and fall-through is graceful.
Shadow mode for two months before flipping the switch was the right rollout pattern. Catching three real bugs in shadow with zero impact on production is the gold standard for a sensitive change like authorization logic.
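A compressed sketch of both ideas, the bit check and the shadow comparison, with illustrative names (hasBit, checkShadowed, and the mismatch log label are made up for this example):

```go
package permbits

import "log/slog"

// hasBit reports whether permission index idx is set in the token's bitmap.
func hasBit(bitmap []byte, idx int) bool {
	byteIdx := idx / 8
	if byteIdx >= len(bitmap) {
		return false // older tokens may carry a shorter bitmap; fall through gracefully
	}
	return bitmap[byteIdx]&(1<<uint(idx%8)) != 0
}

// checkShadowed answers from the legacy string path while comparing against the
// bitmap path and logging any disagreement.
func checkShadowed(l *slog.Logger, legacyPerms map[string]bool, bitmap []byte, perm string, idx int) bool {
	legacy := legacyPerms[perm]
	fast := hasBit(bitmap, idx)
	if legacy != fast {
		l.Warn("PERM_SHADOW_MISMATCH", "perm", perm, "legacy", legacy, "bitmap", fast)
	}
	return legacy // shadow mode: the old path still decides
}
```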
What hurt
Now the harder list. Things that cost us time, sleep, or trust.
Tenant resolution living in two places
NGINX resolves the tenant. The Auth Service also checks tenant binding (token tenant matches request tenant). The two places do different checks for a reason — but the reason isn't obvious, and we've watched several engineers add tenant logic to a third place because they didn't realize it was already covered.
What we'd do differently: write a single tenant-resolution doc that explicitly enumerates which layer owns what and what each layer assumes about the others. A "tenancy contract" page. We have it now (Chapter 5 is a recovered version of it); we should have had it on day one.
The "first segment is the slug" rule
For a long time, the Auth Service split the URI on / and treated the first segment as the service slug. This worked until services started nesting under each other or grouping routes under shared prefixes. We had to retrofit X-Service-Slug and X-Request-Path headers — backward-compatibly, with fallback to the old rule. The retrofit is fine; it took longer than it should have because the old rule was buried in three places.
What we'd do differently: explicit slug headers from day one. Don't infer slugs from the URI structure. The path inside a service is the service's business; the slug is a separate concern.
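A sketch of the resolution order we would start with today: the explicit header first, the first-segment rule only as a legacy fallback. The header names are the ones above; the helper itself is illustrative.

```go
package slugresolve

import (
	"net/http"
	"strings"
)

func resolveSlug(r *http.Request) string {
	// New contract: the gateway sets X-Service-Slug explicitly per route.
	if slug := r.Header.Get("X-Service-Slug"); slug != "" {
		return slug
	}
	// Legacy fallback: treat the first path segment as the slug.
	// X-Request-Path carries the original URI when the subrequest path is rewritten.
	path := r.Header.Get("X-Request-Path")
	if path == "" {
		path = r.URL.Path
	}
	parts := strings.SplitN(strings.TrimPrefix(path, "/"), "/", 2)
	return parts[0]
}
```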
Migration to the bitmap took longer than expected
The bitmap fast path was a six-week project that took five months. The math was straightforward. What ate the time:
- Coordinating bit-index assignments with the token issuer team (different repo, different rollout cadence).
- Fixture data in our test suites was hardcoded with old permission strings; updating it for the bitmap registry was a long tail of small PRs.
- The shadow comparison logic exposed three subtle bugs (Chapter 6) that each required investigation.
What we'd do differently: assume cross-team auth changes take 3x as long as you estimate. Build the shadow harness first, then the new path. The shadow harness paid for itself five times over.
Cache invalidation was an afterthought
The first version of the JWT cache was a map with time.AfterFunc evictors. We covered it in Chapter 8. It seemed fine. It fell over in production within a week.
The lesson generalizes: a cache without a written-down invalidation channel is a memory leak. Every cache should have:
- A bounded size (entries or bytes).
- A clear invalidation event ("token expired", "trie reloaded", "revocation event consumed").
- A staleness window we can articulate ("up to 30 seconds late").
If you can't write those three down, don't add the cache.
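For reference, a minimal cache that can answer all three questions: bounded by entry count, invalidated by an explicit call, and never older than maxAge. It is a sketch under those assumptions, not our production cache.

```go
package boundedcache

import (
	"sync"
	"time"
)

type entry struct {
	val     any
	addedAt time.Time
}

type Cache struct {
	mu      sync.Mutex
	maxSize int           // rule 1: a bounded number of entries
	maxAge  time.Duration // rule 3: the staleness window you can state out loud
	items   map[string]entry
}

func New(maxSize int, maxAge time.Duration) *Cache {
	return &Cache{maxSize: maxSize, maxAge: maxAge, items: map[string]entry{}}
}

func (c *Cache) Get(key string) (any, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok || time.Since(e.addedAt) > c.maxAge {
		delete(c.items, key)
		return nil, false
	}
	return e.val, true
}

func (c *Cache) Put(key string, val any) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.items) >= c.maxSize {
		c.items = map[string]entry{} // crude bound: reset rather than grow without limit
	}
	c.items[key] = entry{val: val, addedAt: time.Now()}
}

// Invalidate is rule 2: an explicit event ("token revoked", "trie reloaded")
// removes the entry instead of waiting for the TTL.
func (c *Cache) Invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.items, key)
}
```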
The default tenant we shipped on day one
For the first quarter we had a default tenant. "If no X-Tenant-ID and no host match, fall through to default-tenant." It was added because it made local dev easier.
It cost us in two ways. First, removing it took longer than building it — every misconfigured client started 400'ing once we removed the fallback. Second, while it was live, it produced exactly one near-miss data leak (a service-account request without a tenant header writing into the wrong tenant). We caught it before it left staging.
Shipping that default tenant was the worst single decision in the whole project. We'd remove it from every future system before it ever boots.
Per-pod alert spam, twice
Twice we shipped alert code that fired per-request rather than per-state. The first time was during a Redis outage in our second month (lit up Slack with ~10k messages in 90 seconds). The second was during an RSA misconfig rollout (a few hundred messages per minute per pod, fleet-wide).
Both were the same bug: alerting from a request handler instead of from a state-transition observer. Both were "fixed" with atomic.Bool swaps. Now we apply that pattern aggressively.
What we'd do differently: write the alert dedup helper first, before the first alert. Have it baked into the codebase before there's anything to alert on.
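The helper really is about that small. A sketch, with notify standing in for whatever actually sends the Slack message:

```go
package alertdedup

import "sync/atomic"

type Alert struct {
	firing atomic.Bool
	notify func(msg string) // e.g. a Slack webhook call
}

func New(notify func(string)) *Alert { return &Alert{notify: notify} }

// Fire sends at most one message per unhealthy episode, no matter how many
// requests observe the failure.
func (a *Alert) Fire(msg string) {
	if a.firing.CompareAndSwap(false, true) {
		a.notify(msg)
	}
}

// Resolve re-arms the alert once the condition clears.
func (a *Alert) Resolve(msg string) {
	if a.firing.CompareAndSwap(true, false) {
		a.notify(msg)
	}
}
```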
What we'd build earlier
In the order we'd add them:
1. The structured AUTH_DECISION log
On day one. Even before fancy auth logic. The log structure outlives every other choice — every dashboard, every alert, every postmortem reads from it. Build the contract first.
2. Slack alert dedup helper
Before the first alert. Five lines of code to wrap an atomic.Bool around a Slack send. Ship it before you have anything to alert on.
3. Fail-closed posture
Before the gateway sees a single request in production. Don't even try permissive defaults. The "we'll tighten it later" path becomes "we shipped a permissive-by-default thing for 18 months." Just ship it tight.
4. Endpoint metadata in DB
Skip the YAML-of-routes phase. Skip the in-code decorator phase. Go straight to the database table with a refresh mechanism. The transitional architectures cost more to migrate off than they cost to skip.
5. Gap probe on revocation streams
The probe (Chapter 7) catches data loss between the stream and consumers. It costs almost nothing to run. Without it you don't know if you're losing events; you just hope.
6. Shadow harness for sensitive changes
Comparing old-vs-new in production with the new path muted is a powerful pattern. Build the harness as a reusable thing. We re-implemented variants of it for three different rollouts before realizing it should be a library.
7. Tenancy contract document
One page that states which layer resolves the tenant, which layer validates token-tenant binding, which layer scopes queries, and what the failure modes are. Required reading before anyone touches request handling. Should have existed before the gateway shipped.
The maturity progression
Looking back, the gateway evolved through identifiable stages. They're worth naming because if you're starting fresh, knowing the destination shape lets you skip steps.
v1 — per-service auth libs. Where most teams are. Each service has its own JWT decode, its own permission check. Inconsistent, drift-prone, slow to fix CVEs. Don't stay here.
v2 — auth_request + minimal /auth. A simple gateway that decodes a token and returns 200/401. Static list of "open" routes. Enough to centralize the decision; not enough to scale.
v3 — trie + classification + Pub/Sub. Endpoint metadata in a DB. Trie in memory. Refresh kicks. Now adding a route doesn't require a redeploy.
v4 — revocation + caching. Logout works. Admin disable works. Each cache layer in place. Hot path is sub-millisecond.
v5 — bitmap + structured logs + degraded mode. The mature gateway. Fast, observable, alertable, recoverable.
Most of the value lives between v2 and v3. If you're at v1, that's the migration to plan for. v3 to v5 is iteration; v1 to v3 is the project.
Five pieces of advice for teams building this
If you're starting from scratch with a similar problem, here's what I'd hand off in five bullets:
1. Start with auth_request. Don't shop architectures.
Service mesh, custom Envoy filter, Lua plugin, sidecar — they all promise more flexibility. They all cost more in operations. auth_request is enough for the 90% case, and the 10% is rarely worth the complexity.
2. Make the gateway HA before anything else.
Two replicas minimum, HPA, graceful shutdown, retries to the upstream auth pod, circuit-breaker semantics in NGINX, fail-closed posture. If any one of these is missing, the gateway will take down your platform during an otherwise normal degraded event. This isn't optional.
3. The log is the API.
The AUTH_DECISION log is a public contract with everyone who ever debugs your gateway. Treat it like a schema. Don't change field names without a migration. Don't add free-form strings to enum fields. Have one version-controlled doc that defines every field and every value of every enum.
4. Cache invalidation has to be explicit.
Every cache: bounded size, explicit invalidation channel, articulable staleness window. If a cache doesn't have all three, it's a bug-in-waiting. We learned this twice.
5. Build observability before you build features.
Dashboards, alerts, trace context, the structured log — all of these come before you ship the cool feature you're excited about. A clever new permission model that you can't observe is worse than a boring permission model you can.
What we'd build next
A few things on our list that didn't fit this series:
- NGINX OpenTelemetry module compiled in. Right now NGINX traces are limited; the Auth Service has full spans, but the NGINX hop is a black box from the trace's point of view. Worth fixing.
- Per-tenant rate limiting. Currently we rely on upstream services. The gateway is the natural place.
- WAF integration. We have an external WAF. Closer integration so WAF events show up in AUTH_DECISION would help triage.
- Token introspection cache. Some integrations issue opaque tokens that we have to introspect with the issuer. Caching that lookup is its own caching problem; we haven't tackled it.
- A formal "tenancy contract" page. Yes, the same one I told you to build on day one. We're catching up.
Each of these is a future series, probably.
Final architecture
For posterity, the picture of where we ended up:
Everything in this picture has been earned by an outage, a postmortem, or a near-miss. None of it is decoration. If you're building something similar and one of the boxes seems extra to you, it's because you haven't had the incident that justifies it yet.
Closing
Centralizing auth at the edge is one of those decisions that looks obviously correct in hindsight and is genuinely hard to convince a team to invest in beforehand. The wins are diffuse — slightly less drift, slightly fewer CVEs, slightly faster security responses. The pain is concentrated and visible — one new service to operate, one extra hop, one more place that has to be HA.
But every six months we look back and the gateway has paid for itself again. A library upgrade we did once instead of thirty times. A revocation feature that shipped in a week instead of being negotiated across teams. A multi-tenant isolation guarantee we can actually defend in audits.
If you take one thing from this series, take this: auth is not a problem you solve once and ignore. It's a problem you solve somewhere, well, and operate with care. Pick that somewhere to be the edge, build it small and observable, and the rest of your platform gets to focus on actual product work.
Thanks for reading. If you build one of these — or are stuck somewhere mid-build — drop a comment. The hardest part of operating an Auth Gateway is realizing that other people have built the same thing and hit the same rocks. There's no reason for each team to find them independently.