Technology Apr 18, 2026 · 12 min read

Rust vs Go for AI Infrastructure in 2026: Here's What the Benchmarks Actually Say


DEV Community
by Gabriel Anhaia

You have already picked Python for your model serving. The question in 2026 is what sits in front of it: the gateway, the router, the span collector, the streaming layer. This is not a research question. You benchmark it.

Most of the Rust-vs-Go argument on Hacker News treats the comparison like a team sport. Rust-shaped people point at the TechEmpower numbers and declare victory. Go-shaped people point at Kubernetes and declare victory. Neither side is answering the question you actually have: whether the language delta matters on the specific workloads that sit between a user and a model in a production AI product.

The honest answer is narrower than either camp admits. Rust wins raw throughput and tail latency. Go wins developer velocity, deployment story, and the mental overhead of shipping something that has to be read by a rotating team next year. And on the workloads AI infrastructure actually runs, the gap is smaller than the synthetic benchmarks suggest.

This post walks through the five workloads that matter: LLM gateway throughput, token streaming, OpenTelemetry span processing, vector similarity hot paths, and prompt templating under concurrency. Where a public benchmark exists, it is cited. Where one does not, the text says "rough estimate" and explains what it is estimating from. Nothing is made up.

The workloads, and why these five

The serving layer of an AI product is not one workload. It is at least five.

  1. Gateway / reverse proxy — accepts a request, checks a key, picks a vendor, forwards, streams back.
  2. Token streaming — long-lived SSE or WebSocket connections, many concurrent, each trickling bytes.
  3. OpenTelemetry span processing — ingest, batch, sanitize, forward spans at high cardinality.
  4. Vector similarity hot path — the per-query CPU work around an ANN index, plus hybrid-search reranking.
  5. Prompt templating under concurrency — string assembly, tokenization, context packing on every request.

Training, fine-tuning, and inference kernels are not on this list. Those are Python's home and no sane person is porting them. The argument here is about the tier above the model.

Gateway throughput

The TechEmpower benchmarks are the most cited public numbers for web-framework throughput. On the plaintext and JSON serialization rounds (latest results at techempower.com/benchmarks), Rust frameworks like actix-web and axum consistently sit near the top of the table. Go frameworks like fiber and the stdlib net/http sit further down, often at 50–70% of the top Rust result, depending on the round and the hardware.

That is real. It is also not your workload.

A TechEmpower "plaintext" round measures how many times per second a process can respond to a no-op HTTP request. An LLM gateway spends the overwhelming majority of its wall time waiting on an upstream model call that takes 800–2000ms. The local CPU cost of terminating the request is a rounding error against that. If your gateway can do 50k no-op RPS in Rust vs 25k in Go, but every real request spends 1200ms waiting on Anthropic, the throughput you observe in production is bottlenecked by connection pool size and upstream concurrency, not by which language terminated the TLS.

The place the language gap actually bites is the tail. Go's GC is excellent, but it is a GC. Rust has no stop-the-world pauses because it has no GC. For a p99.9 SLO measured in milliseconds, Rust's tail is flatter. Rough estimate: in a gateway doing 5k RPS with streaming fanout, you will see Go tail latency wobble 2–5ms under GC pressure; Rust will not. For most AI products, where model latency dwarfs everything, this does not matter. For frontier products with hard tail-latency commitments, it does.

Verdict on gateways. Go is fine. Rust is better if tail latency is a product requirement. The choice is dominated by the team, not the language.

Token streaming

This is the one most people get wrong about both languages.

The workload is: one process holding N concurrent long-lived connections, each streaming a slow byte trickle from an upstream model to a downstream client. Memory per connection matters. Context-switch cost matters. The C10K problem, basically, applied to SSE.

Go was designed for this. Goroutines cost ~2KB at start, the scheduler multiplexes them onto OS threads for you, and the stdlib net/http handles SSE without ceremony. You can hold 10k concurrent streaming connections on a single 2-core 4GB container and not break a sweat. This is well-documented in Cloudflare's engineering posts about their Go-based proxies (blog.cloudflare.com) and in Discord's writeup of their Go-to-Rust migration for a specific service (discord.com/blog/why-discord-is-switching-from-go-to-rust).

Rust does this too, via tokio. The async story is more explicit, the compile times are worse, and the abstraction cost of pinning, borrow checking across await points, and lifetime management in a connection struct is real. You pay for it once, in developer time. Rough estimate: a mid-level engineer building a streaming SSE proxy hits a working first cut in Go in a day, and in Rust in two to three days. The Rust version, once it compiles, will use less memory per connection. Not an order of magnitude less. Call it 30–50% less, heavily dependent on what you pack into each connection's state.

The Discord post is the honest reference point here. They switched one service from Go to Rust because a specific GC pause in a specific code path was hurting a specific user-facing metric. They did not switch their whole stack. They would not have switched at all if the metric had been less sensitive. Read the post; it is a good lesson in what the language delta actually buys you.

Verdict on streaming. Go is the default. Rust is the answer when you have measured a GC pause in the streaming path and it is hurting something you care about. If you have not measured it, you do not have the problem.

OpenTelemetry span processing

The reference OpenTelemetry Collector is Go (open-telemetry/opentelemetry-collector). There is also a Rust collector effort (open-telemetry/opentelemetry-rust) but it is not a drop-in replacement for the Go collector; it is the Rust SDK, not the full collector pipeline.

The Collector is the span-processing workload in practice. It ingests OTLP over gRPC, runs spans through a chain of processors (batch, filter, attribute sanitization, sampler), and forwards them to exporters. Throughput depends on the processor chain more than the language.

Published numbers from the collector project's own benchmarks (github.com/open-telemetry/opentelemetry-collector/tree/main/testbed) show the Go collector sustaining spans in the tens-of-thousands-per-second range on modest hardware with a realistic processor chain. Rough estimate: a hand-written Rust equivalent with a similar processor chain would gain 30–60% on raw throughput, mostly from zero-copy deserialization and less GC pressure. That is meaningful if you are ingesting a million spans per second. It is a rounding error if you are ingesting ten thousand.

The other side of this workload is memory. A span processor that batches in-memory before flushing has to decide how much to hold. The Go collector's GC behavior under bursty ingest has been the subject of multiple issues and tuning guides. Rust's memory story is more predictable. Again, rough estimate: at steady state the Rust version will sit at maybe 40–60% of the Go version's RSS, under the same ingest rate, with a flatter variance.

Verdict on OTel. Ship the Go collector unless your span ingest is genuinely at hyperscale. Most teams who think they are there, are not. If you are, Rust is the right answer and there are people writing the plumbing for it.

Vector similarity hot path

The top vector databases are written in the languages you would expect. Qdrant is Rust (github.com/qdrant/qdrant). Weaviate is Go (github.com/weaviate/weaviate). Milvus is Go with C++ kernels (github.com/milvus-io/milvus). LanceDB is Rust.

The benchmark question here is split. The per-query ANN search kernel is SIMD-heavy numeric code. Rust and C++ both have excellent SIMD stories. Go's SIMD story is getting better via the assembler and the recently proposed simd package, but the ecosystem is behind. If the workload is dominated by the ANN kernel — a high-QPS search against a large index, no reranking, no filtering — Rust will beat Go on per-query latency, probably 2–3x on a cold kernel. That is a rough estimate based on the consistent pattern in numeric-code microbenchmarks.

But most production RAG workloads are not dominated by the ANN kernel. They are dominated by the glue: embedding the query, filtering by metadata, calling the vector DB over a network, reranking with a cross-encoder, assembling the final context. The ANN search is frequently less than 20% of the total latency. Go is fine for the glue. Rust is not required.

The ANN-benchmarks project (github.com/erikbern/ann-benchmarks) publishes standardized numbers across vector indices. The Rust-implemented indices (Qdrant's HNSW, for example) consistently post competitive recall/QPS curves. So do the Go-based ones. The choice of vector DB should turn on the feature set and the operational story, not the implementation language.

Verdict on vector search. The engine you pick is usually not the one you write. Rust is the right answer if you are building the engine. Go is the right answer if you are building the surrounding product.

Prompt templating under concurrency

This is the least-discussed workload and the one most teams under-engineer.

On every request, the gateway has to: load a template, interpolate user variables, sanitize, tokenize for budget checks, pack context windows, serialize to the vendor's JSON schema. That is a pile of string work, often running in the critical path. Under concurrent load, the allocator matters.

Go's garbage collector has to deal with the string churn. Rust has no GC; String allocations go through the normal allocator and drop when they go out of scope. In microbenchmarks of pure string assembly, Rust typically runs 1.5–3x faster than Go, with a flatter memory profile. Rough estimate from public Criterion vs testing.B numbers in various language-shootout repos; these are not AI-specific but the pattern holds.

What this translates to in a real gateway is a few milliseconds of CPU time per request, and a few hundred bytes of transient allocation. At 1k RPS, both languages handle it without breathing hard. At 50k RPS with complex template logic and heavy context assembly, you might measure the difference in p99. You are unlikely to measure it in p50.
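Buffer reuse is the cheap win in either language. A Go sketch using sync.Pool so steady-state templating stops allocating a fresh buffer per request; the template shape here is a toy, not a real prompt format.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// bufPool reuses strings.Builder values across requests, so the
// steady-state cost is the string copy, not a fresh buffer per request.
var bufPool = sync.Pool{
	New: func() any { return new(strings.Builder) },
}

func renderPrompt(system, user string) string {
	b := bufPool.Get().(*strings.Builder)
	defer func() {
		b.Reset() // detach the backing array before returning to the pool
		bufPool.Put(b)
	}()
	b.WriteString("System: ")
	b.WriteString(system)
	b.WriteString("\nUser: ")
	b.WriteString(user)
	return b.String()
}

func main() {
	fmt.Println(renderPrompt("be terse", "hi"))
}
```

The Rust equivalent of this trick is a `String::with_capacity` kept in per-worker state; the point is that both languages reward the same discipline.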

For tokenization specifically, both ecosystems have production-grade options for the same encodings: github.com/tiktoken-go/tokenizer is a Go port of tiktoken's BPE schemes, while tiktoken-rs and Hugging Face's tokenizers run the optimized Rust implementations natively. Either way, the tokenizer is rarely the per-request bottleneck, and the language at the call site is not what decides it.

Verdict on templating. Not a language-choice workload. Write it carefully in either language. If you are allocating on every request instead of reusing buffers, your problem is not Go vs Rust.

What this actually adds up to

Five workloads, five verdicts, and a pattern.

Rust wins when:

  • Tail latency under 5ms is a product requirement.
  • Span ingest or streaming is at hyperscale (hundreds of thousands per second sustained).
  • You are writing the CPU-bound kernel yourself — a vector index, a rerank scorer, a SIMD-heavy filter.
  • The operational story can absorb longer compile cycles and a smaller hiring pool.

Go wins when:

  • The team needs to onboard new engineers quickly.
  • The deployment story is Kubernetes-heavy, because the Go-in-Kubernetes ecosystem (client-go, operators, controller-runtime) is decisive.
  • Most of the real latency is in the upstream model call, which describes every AI product that uses a hosted LLM.
  • You are building the surrounding product, not the engine.

The HackerNews framing of this debate, that Rust is objectively superior and Go is what happens when you pick the inferior option, does not survive contact with the workloads. On an LLM gateway spending 95% of its wall time in an Anthropic call, the 2x raw-throughput advantage of Rust is a rounding error. On a streaming proxy that has to be rewritten by a new hire in six months, the compile-time advantage of Go is the product feature.

There is also a middle answer that a lot of teams end up at: Go for the serving layer, Rust for the single hottest path inside it, bridged by gRPC or an FFI call. Discord's migration is one shape of this. The OpenTelemetry project's Go collector sitting alongside Rust SDK work is another. The right question is not "which language" but "which workload, and is the language the bottleneck."

Pick the one that matches your team and your failure modes. Measure before you migrate. If your production evidence does not show GC pauses in the path you care about, you do not have a Go-to-Rust problem. If it does, you have a migration to plan.

The benchmarks tell you what the language can do. Your traces tell you whether the language is the thing to fix.

If this was useful

The observability side of this story, the traces, the tail-latency histograms, the span attributes that tell you whether you have a language problem or an upstream problem, is the subject of the book I just finished. It is language-agnostic on the surface and has examples in Go. If you are picking a stack for the AI serving layer and want a field guide for instrumenting whichever one you pick, that book is the most direct thing I can point you at.

Observability for LLM Applications — the book

Thinking in Go — 2-book series on Go programming and hexagonal architecture

Source

This article was originally published by DEV Community and written by Gabriel Anhaia.
