We're at the end of the series. Nine chapters of mechanism. One chapter of opinion.
Building the Auth Gateway took roughly two years from "what if NGINX did the auth?" to "this thing handles every authenticated request in production." A lot of what's in the previous chapters wasn't obvious to us at the start. This is the post-mortem on our own architecture: what worked, what hurt, what we'd build earlier, and what we'd warn the next team about.
What worked
A few decisions held up cleanly. We'd make all of them again.
auth_request as the primitive
NGINX's auth_request directive is, with no exaggeration, the single highest-leverage design choice in the platform. One directive, well understood, supported across NGINX versions. We don't need a service mesh. We don't need a custom Envoy filter. We don't need a Lua module compiled into NGINX.
If you can do your auth in HTTP-status terms (200/401/403), auth_request is the right tool. If you can't, you probably want a sidecar or mesh-level enforcement and this whole architecture doesn't apply.
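To make the 200/401/403 contract concrete, here is a minimal sketch of the kind of /auth handler that auth_request points at. The handler shape, the verifyToken stand-in, and the orders:read permission are illustrative placeholders, not our production code.

```go
package main

import (
	"net/http"
	"strings"
)

// verifyToken is a placeholder: the real Auth Service validates a JWT and
// returns the permissions it carries.
func verifyToken(authz string) (map[string]bool, bool) {
	if !strings.HasPrefix(authz, "Bearer ") {
		return nil, false
	}
	return map[string]bool{"orders:read": true}, true // pretend the token grants this
}

func authHandler(w http.ResponseWriter, r *http.Request) {
	perms, ok := verifyToken(r.Header.Get("Authorization"))
	if !ok {
		w.WriteHeader(http.StatusUnauthorized) // 401: NGINX rejects the request
		return
	}
	// In the real service the required permission comes from the endpoint metadata lookup.
	if !perms["orders:read"] {
		w.WriteHeader(http.StatusForbidden) // 403: authenticated, but not authorized
		return
	}
	w.WriteHeader(http.StatusOK) // 200: NGINX proxies the request to the upstream service
}

func main() {
	http.HandleFunc("/auth", authHandler)
	http.ListenAndServe(":8080", nil)
}
```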
Endpoint metadata as data
Storing endpoint type and required permissions in Postgres, refreshed via Pub/Sub, was the right call. It means:
- We can change auth without redeploying the gateway.
- We can audit auth ("what protects this URL?") with a SQL query.
- Admin tooling and the gateway share a single contract.
The cost — a small DB lookup at boot, an in-memory trie, a refresh mechanism — was tiny compared to the operational flexibility we got back.
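For a rough sense of what "endpoint metadata as data" looks like at boot, here is a sketch that assumes a hypothetical endpoint_metadata table; the real schema, column names, and trie code differ.

```go
package authmeta

import (
	"context"
	"database/sql"

	"github.com/lib/pq"
)

// EndpointMeta is one row of the hypothetical endpoint_metadata table.
type EndpointMeta struct {
	Method      string
	PathPattern string
	Type        string   // e.g. "public", "authenticated", "admin"
	Permissions []string // required permissions, stored as a Postgres text[]
}

// loadEndpoints runs once at boot (and again on every refresh kick); the caller
// builds the in-memory trie from the rows it returns.
func loadEndpoints(ctx context.Context, db *sql.DB) ([]EndpointMeta, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT method, path_pattern, endpoint_type, required_permissions
		   FROM endpoint_metadata`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var out []EndpointMeta
	for rows.Next() {
		var m EndpointMeta
		if err := rows.Scan(&m.Method, &m.PathPattern, &m.Type, pq.Array(&m.Permissions)); err != nil {
			return nil, err
		}
		out = append(out, m)
	}
	return out, rows.Err()
}
```

The "what protects this URL?" audit is a plain query against the same table, which is exactly the point.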
One structured log line per decision
AUTH_DECISION is the contract between the Auth Service and oncall. Every field, every time, every request. A year of operations later, this is the artifact we reference most often. Every alert we've built points at it. Every incident postmortem references it.
Resist the temptation to add INFO/DEBUG lines around it. Resist the temptation to omit fields when they're "not relevant." One line. Same shape. Forever.
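For illustration, here is roughly what emitting that one line can look like with Go's log/slog. The field names and enum values below are made up for the example; the real contract is whatever your versioned schema doc says.

```go
package authlog

import "log/slog"

// Decision is the fixed field set for one AUTH_DECISION line (illustrative names).
type Decision struct {
	RequestID string
	Tenant    string
	Slug      string
	Path      string
	Outcome   string // e.g. "allow" | "deny_unauthenticated" | "deny_forbidden"
	Reason    string // an enum value, never a free-form string
	LatencyMS float64
}

// Emit writes every field, every time, even when a value is empty or "not relevant".
func Emit(l *slog.Logger, d Decision) {
	l.Info("AUTH_DECISION",
		"request_id", d.RequestID,
		"tenant", d.Tenant,
		"slug", d.Slug,
		"path", d.Path,
		"outcome", d.Outcome,
		"reason", d.Reason,
		"latency_ms", d.LatencyMS,
	)
}
```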
Fail-closed by default at the edge
error_page 502 503 504 = @auth_unavailable; was a one-line change that defines our security posture. When the Auth Service is unhealthy, NGINX returns 503 to the client instead of letting the request through. We've had a few incidents where this caused brief platform-wide outages. We have never regretted the choice.
The principle: the cost of a 5-minute outage on rare occasions is much, much less than the cost of one cross-tenant data leak ever.
Caches in the auth process, not in NGINX
auth_request is intentionally not cacheable, and we leaned into that. Every cache lives inside the Auth Service: JWT verify, RSA keys, route lookup, policy bitmap, revocation map, SA versions. Each is invalidated through its own channel. The gateway's hot path makes zero Redis calls in steady state.
This kept the architecture honest. The auth pod is the unit of correctness. Scale it, monitor it, debug it as one thing.
Pub/Sub-driven trie reload
Push-based invalidation for the endpoint trie was the right shape. Periodic-only would have given us a ~30 minute window where new admin routes were unprotected. Pub/Sub-only would have been brittle (events get lost). Both, with periodic as the safety net, gives us seconds of staleness in the common case and bounded staleness even when the message is lost.
Most caches we'd default to TTL. The trie was worth the special case.
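The shape of that loop is simple enough to sketch. Here notify stands in for the real Pub/Sub subscription, reload rebuilds the trie, and the 30-minute ticker is the safety-net window mentioned above; all three are placeholders rather than the production code.

```go
package triereload

import (
	"context"
	"time"
)

func runReloadLoop(ctx context.Context, notify <-chan struct{}, reload func(context.Context) error) {
	ticker := time.NewTicker(30 * time.Minute) // periodic safety net for lost messages
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-notify: // Pub/Sub event: a route changed, reload within seconds
		case <-ticker.C: // periodic refresh bounds staleness even if the event was lost
		}
		if err := reload(ctx); err != nil {
			continue // keep serving the previous trie; the next kick or tick retries
		}
	}
}
```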
The bitmap fast path
Encoding permissions as bit indexes paid off. Smaller tokens, faster checks, cleaner metrics. The legacy path we kept around for safety has earned its keep — version skew is real, and fall-through is graceful.
Shadow mode for two months before flipping the switch was the right rollout pattern. Catching three real bugs in shadow with zero impact on production is the gold standard for a sensitive change like authorization logic.
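A compressed sketch of both ideas, the bit check and the shadow comparison, with illustrative names (hasBit, checkShadowed, and the mismatch log label are made up for this example):

```go
package permbits

import "log/slog"

// hasBit reports whether permission index idx is set in the token's bitmap.
func hasBit(bitmap []byte, idx int) bool {
	byteIdx := idx / 8
	if byteIdx >= len(bitmap) {
		return false // older tokens may carry a shorter bitmap; fall through gracefully
	}
	return bitmap[byteIdx]&(1<<uint(idx%8)) != 0
}

// checkShadowed answers from the legacy string path while comparing against the
// bitmap path and logging any disagreement.
func checkShadowed(l *slog.Logger, legacyPerms map[string]bool, bitmap []byte, perm string, idx int) bool {
	legacy := legacyPerms[perm]
	fast := hasBit(bitmap, idx)
	if legacy != fast {
		l.Warn("PERM_SHADOW_MISMATCH", "perm", perm, "legacy", legacy, "bitmap", fast)
	}
	return legacy // shadow mode: the old path still decides
}
```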
What hurt
Now the harder list. Things that cost us time, sleep, or trust.
Tenant resolution living in two places
NGINX resolves the tenant. The Auth Service also checks tenant binding (token tenant matches request tenant). The two places do different checks for a reason — but the reason isn't obvious, and we've watched several engineers add tenant logic to a third place because they didn't realize it was already covered.
What we'd do differently: write a single tenant-resolution doc that explicitly enumerates which layer owns what and what each layer assumes about the others. A "tenancy contract" page. We have it now (Chapter 5 is a recovered version of it); we should have had it on day one.
The "first segment is the slug" rule
For a long time, the Auth Service split the URI on / and treated the first segment as the service slug. This worked until services started nesting under each other or grouping routes under shared prefixes. We had to retrofit X-Service-Slug and X-Request-Path headers — backward-compatibly, with fallback to the old rule. The retrofit is fine; it took longer than it should have because the old rule was buried in three places.
What we'd do differently: explicit slug headers from day one. Don't infer slugs from the URI structure. The path inside a service is the service's business; the slug is a separate concern.
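A sketch of the resolution order we would start with today: the explicit header first, the first-segment rule only as a legacy fallback. The header names are the ones above; the helper itself is illustrative.

```go
package slugresolve

import (
	"net/http"
	"strings"
)

func resolveSlug(r *http.Request) string {
	// New contract: the gateway sets X-Service-Slug explicitly per route.
	if slug := r.Header.Get("X-Service-Slug"); slug != "" {
		return slug
	}
	// Legacy fallback: treat the first path segment as the slug.
	// X-Request-Path carries the original URI when the subrequest path is rewritten.
	path := r.Header.Get("X-Request-Path")
	if path == "" {
		path = r.URL.Path
	}
	parts := strings.SplitN(strings.TrimPrefix(path, "/"), "/", 2)
	return parts[0]
}
```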
Migration to the bitmap took longer than expected
The bitmap fast path was a six-week project that took five months. The math was straightforward. What ate the time:
- Coordinating bit-index assignments with the token issuer team (different repo, different rollout cadence).
- Fixture data in our test suites was hardcoded with old permission strings; updating it for the bitmap registry was a long tail of small PRs.
- The shadow comparison logic exposed three subtle bugs (Chapter 6) that each required investigation.
What we'd do differently: assume cross-team auth changes take 3x as long as you estimate. Build the shadow harness first, then the new path. The shadow harness paid for itself five times over.
Cache invalidation was an afterthought
The first version of the JWT cache was a map with time.AfterFunc evictors. We covered it in Chapter 8. It seemed fine. It fell over in production within a week.
The lesson generalizes: a cache without a written-down invalidation channel is a memory leak. Every cache should have:
- A bounded size (entries or bytes).
- A clear invalidation event ("token expired", "trie reloaded", "revocation event consumed").
- A staleness window we can articulate ("up to 30 seconds late").
If you can't write those three down, don't add the cache.
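For reference, a minimal cache that can answer all three questions: bounded by entry count, invalidated by an explicit call, and never older than maxAge. It is a sketch under those assumptions, not our production cache.

```go
package boundedcache

import (
	"sync"
	"time"
)

type entry struct {
	val     any
	addedAt time.Time
}

type Cache struct {
	mu      sync.Mutex
	maxSize int           // rule 1: a bounded number of entries
	maxAge  time.Duration // rule 3: the staleness window you can state out loud
	items   map[string]entry
}

func New(maxSize int, maxAge time.Duration) *Cache {
	return &Cache{maxSize: maxSize, maxAge: maxAge, items: map[string]entry{}}
}

func (c *Cache) Get(key string) (any, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok || time.Since(e.addedAt) > c.maxAge {
		delete(c.items, key)
		return nil, false
	}
	return e.val, true
}

func (c *Cache) Put(key string, val any) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.items) >= c.maxSize {
		c.items = map[string]entry{} // crude bound: reset rather than grow without limit
	}
	c.items[key] = entry{val: val, addedAt: time.Now()}
}

// Invalidate is rule 2: an explicit event ("token revoked", "trie reloaded")
// removes the entry instead of waiting for the TTL.
func (c *Cache) Invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.items, key)
}
```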
The default tenant we shipped on day one
For the first quarter we had a default tenant. "If no X-Tenant-ID and no host match, fall through to default-tenant." It was added because it made local dev easier.
It cost us in two ways. First, removing it took longer than building it — every misconfigured client started 400'ing once we removed the fallback. Second, while it was live, it produced exactly one near-miss data leak (a service-account request without a tenant header writing into the wrong tenant). We caught it before it left staging.
Shipping that default tenant was the worst single decision in the whole project. We'd remove it from every future system before it ever boots.
Per-pod alert spam, twice
Twice we shipped alert code that fired per-request rather than per-state. The first time was during a Redis outage in our second month (lit up Slack with ~10k messages in 90 seconds). The second was during an RSA misconfig rollout (a few hundred messages per minute per pod, fleet-wide).
Both were the same bug: alerting from a request handler instead of from a state-transition observer. Both were "fixed" with atomic.Bool swaps. Now we apply that pattern aggressively.
What we'd do differently: write the alert dedup helper first, before the first alert. Have it baked into the codebase before there's anything to alert on.
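The helper really is about that small. A sketch, with notify standing in for whatever actually sends the Slack message:

```go
package alertdedup

import "sync/atomic"

type Alert struct {
	firing atomic.Bool
	notify func(msg string) // e.g. a Slack webhook call
}

func New(notify func(string)) *Alert { return &Alert{notify: notify} }

// Fire sends at most one message per unhealthy episode, no matter how many
// requests observe the failure.
func (a *Alert) Fire(msg string) {
	if a.firing.CompareAndSwap(false, true) {
		a.notify(msg)
	}
}

// Resolve re-arms the alert once the condition clears.
func (a *Alert) Resolve(msg string) {
	if a.firing.CompareAndSwap(true, false) {
		a.notify(msg)
	}
}
```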
What we'd build earlier
In the order we'd add them:
1. The structured AUTH_DECISION log
On day one. Even before fancy auth logic. The log structure outlives every other choice — every dashboard, every alert, every postmortem reads from it. Build the contract first.
2. Slack alert dedup helper
Before the first alert. Five lines of code to wrap an atomic.Bool around a Slack send. Ship it before you have anything to alert on.
3. Fail-closed posture
Before the gateway sees a single request in production. Don't even try permissive defaults. The "we'll tighten it later" path becomes "we shipped a permissive-by-default thing for 18 months." Just ship it tight.
4. Endpoint metadata in DB
Skip the YAML-of-routes phase. Skip the in-code decorator phase. Go straight to the database table with a refresh mechanism. The transitional architectures cost more to migrate off than they cost to skip.
5. Gap probe on revocation streams
The probe (Chapter 7) catches data loss between the stream and consumers. It costs almost nothing to run. Without it you don't know if you're losing events; you just hope.
6. Shadow harness for sensitive changes
Comparing old-vs-new in production with the new path muted is a powerful pattern. Build the harness as a reusable thing. We re-implemented variants of it for three different rollouts before realizing it should be a library.
7. Tenancy contract document
One page that states which layer resolves the tenant, which layer validates token-tenant binding, which layer scopes queries, and what the failure modes are. Required reading before anyone touches request handling. Should have existed before the gateway shipped.
The maturity progression
Looking back, the gateway evolved through identifiable stages. They're worth naming because if you're starting fresh, knowing the destination shape lets you skip steps.
v1 — per-service auth libs. Where most teams are. Each service has its own JWT decode, its own permission check. Inconsistent, drift-prone, slow to fix CVEs. Don't stay here.
v2 — auth_request + minimal /auth. A simple gateway that decodes a token and returns 200/401. Static list of "open" routes. Enough to centralize the decision; not enough to scale.
v3 — trie + classification + Pub/Sub. Endpoint metadata in a DB. Trie in memory. Refresh kicks. Now adding a route doesn't require a redeploy.
v4 — revocation + caching. Logout works. Admin disable works. Each cache layer in place. Hot path is sub-millisecond.
v5 — bitmap + structured logs + degraded mode. The mature gateway. Fast, observable, alertable, recoverable.
Most of the value lives between v2 and v3. If you're at v1, that's the migration to plan for. v3 to v5 is iteration; v1 to v3 is the project.
Five pieces of advice for teams building this
If you're starting from scratch with a similar problem, here's what I'd hand off in five bullets:
1. Start with auth_request. Don't shop architectures.
Service mesh, custom Envoy filter, Lua plugin, sidecar — they all promise more flexibility. They all cost more in operations. auth_request is enough for the 90% case, and the 10% is rarely worth the complexity.
2. Make the gateway HA before anything else.
Two replicas minimum, HPA, graceful shutdown, retries to the upstream auth pod, circuit-breaker semantics in NGINX, fail-closed posture. If any one of these is missing, the gateway will take down your platform during an otherwise normal degraded event. This isn't optional.
3. The log is the API.
The AUTH_DECISION log is a public contract with everyone who ever debugs your gateway. Treat it like a schema. Don't change field names without a migration. Don't add free-form strings to enum fields. Have one version-controlled doc that defines every field and every value of every enum.
4. Cache invalidation has to be explicit.
Every cache: bounded size, explicit invalidation channel, articulable staleness window. If a cache doesn't have all three, it's a bug-in-waiting. We learned this twice.
5. Build observability before you build features.
Dashboards, alerts, trace context, the structured log — all of these come before you ship the cool feature you're excited about. A clever new permission model that you can't observe is worse than a boring permission model you can.
What we'd build next
A few things on our list that didn't fit this series:
- NGINX OpenTelemetry module compiled in. Right now NGINX traces are limited; the Auth Service has full spans, but the NGINX hop is a black box from the trace's point of view. Worth fixing.
- Per-tenant rate limiting. Currently we rely on upstream services. The gateway is the natural place.
- WAF integration. We have an external WAF. Closer integration so WAF events show up in AUTH_DECISION would help triage.
- Token introspection cache. Some integrations issue opaque tokens that we have to introspect with the issuer. Caching that lookup is its own caching problem; we haven't tackled it.
- A formal "tenancy contract" page. Yes, the same one I told you to build on day one. We're catching up.
Each of these is a future series, probably.
Final architecture
For posterity, the picture of where we ended up:
Everything in this picture has been earned by an outage, a postmortem, or a near-miss. None of it is decoration. If you're building something similar and one of the boxes seems extra to you, it's because you haven't had the incident that justifies it yet.
Closing
Centralizing auth at the edge is one of those decisions that looks obviously correct in hindsight and is genuinely hard to convince a team to invest in beforehand. The wins are diffuse — slightly less drift, slightly fewer CVEs, slightly faster security responses. The pain is concentrated and visible — one new service to operate, one extra hop, one more place that has to be HA.
But every six months we look back and the gateway has paid for itself again. A library upgrade we did once instead of thirty times. A revocation feature that shipped in a week instead of being negotiated across teams. A multi-tenant isolation guarantee we can actually defend in audits.
If you take one thing from this series, take this: auth is not a problem you solve once and ignore. It's a problem you solve somewhere, well, and operate with care. Pick that somewhere to be the edge, build it small and observable, and the rest of your platform gets to focus on actual product work.
Thanks for reading. If you build one of these — or are stuck somewhere mid-build — drop a comment. The hardest part of operating an Auth Gateway is realizing that other people have built the same thing and hit the same rocks. There's no reason for each team to find them independently.