Your Datadog bill crossed $50K in March. Your finance team flagged it. Your engineering team has been eyeing the LLM traces line on the invoice for a year. The number on that line is not the problem on its own. The problem is the slope. Spans are growing faster than users, because every agent call fans out into four tool spans and a retrieval span and an embedding span, and the invoice tracks the fan-out.
Here is the 2026 migration path teams are actually taking, and the one invisible cost that keeps some of them paying.
## Why the bill grew that fast
An LLM feature looks cheap at the model layer and expensive at the telemetry layer. A single user message to an agent produces, on a normal day, one chat span, three to five `execute_tool` spans, a retrieval span with a half-dozen child embedding spans, and a parent span to hold them together. Ten spans per user turn is a floor, not a ceiling.
Datadog's APM list price, as documented on their pricing page, is $31 per host per month for Pro and $40 for Enterprise, plus $1.27 per million ingested spans and $2.55 per million indexed spans at their published rates. Datadog LLM Observability is a separate SKU that meters per LLM span. When a product that was billed at one span per request starts emitting ten, the invoice grows tenfold without a single new user arriving.
The teams I have talked to over the last six months all describe the same curve. A fintech that shipped an agent in Q3 2025 watched their Datadog bill go from $18K to $62K per year across two quarters. A European SaaS vendor hit an $80K renewal quote and asked their platform team for an alternative. None of this is Datadog doing anything wrong. It is a pricing model designed for HTTP services meeting a workload that emits an order of magnitude more spans per unit of user value.
## The stack teams are migrating to
The 2026 self-hosted stack is narrow and boring in a good way. Three containers, one afternoon, zero SaaS contracts.
- OpenTelemetry Collector ingests OTLP over gRPC or HTTP, batches, drops what you do not need, and forwards.
- ClickHouse stores the spans. Columnar, compressed, queryable in SQL.
- Grafana reads ClickHouse via the official datasource plugin and draws the dashboards.
Optional fourth container: Langfuse, which was acquired by ClickHouse in January 2026. Langfuse already ran on ClickHouse as its trace store, so the acquisition is a clarification of the architecture rather than a change to it. OSS Langfuse is MIT, runs in its own container, and points at the same ClickHouse instance the Collector writes to. You get a polished trace explorer on top of your own data.
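The three-container shape can be sketched as a compose file. This is a minimal sketch, not a production manifest: the image tags, port mappings, and volume names are assumptions (pin your own versions), and Langfuse is omitted for brevity.

```yaml
# docker-compose.yaml — minimal sketch of the stack.
# Image tags, ports, and volume names are assumptions; pin versions for production.
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.116.0
    command: ["--config=/etc/otel-collector.yaml"]
    volumes:
      - ./otel-collector.yaml:/etc/otel-collector.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    environment:
      - CH_PASSWORD=${CH_PASSWORD}
    depends_on:
      - clickhouse

  clickhouse:
    image: clickhouse/clickhouse-server:25.3
    environment:
      - CLICKHOUSE_DB=otel
      - CLICKHOUSE_USER=otel
      - CLICKHOUSE_PASSWORD=${CH_PASSWORD}
    volumes:
      - ch-data:/var/lib/clickhouse

  grafana:
    image: grafana/grafana:11.5.0
    environment:
      - GF_INSTALL_PLUGINS=grafana-clickhouse-datasource
      - CH_PASSWORD=${CH_PASSWORD}
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"

volumes:
  ch-data:
```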
The reason this shape dominates 2026 is OpenTelemetry's GenAI semantic conventions v1.37, which stabilized late 2025 and is now emitted natively by OpenAI's Python SDK, Anthropic's SDK, the Traceloop SDK, and by Datadog LLM Observability itself. Every tool in the category reads the same wire format. Migration is a DNS change, not a rewrite.
## The Collector config
This is the whole pipeline. OTLP in, ClickHouse out.
```yaml
# otel-collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    timeout: 5s
    send_batch_size: 10000
  # Keep every error, keep every slow trace,
  # sample 5% of the rest. Saves 80% of volume.
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 2000}
      - name: sample
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000?dial_timeout=10s
    database: otel
    username: otel
    password: ${env:CH_PASSWORD}
    traces_table_name: otel_traces
    create_schema: true
    ttl: 720h
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [clickhouse]
```
Two things worth calling out. First, the Collector you want is `otel/opentelemetry-collector-contrib`, not the base distribution; the ClickHouse exporter lives in contrib only. Second, `tail_sampling` is what makes this stack financially viable at scale. You keep every error, every slow trace, and a representative sample of the rest. The rows you drop are the ones you would not have looked at anyway.
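The "saves 80%" claim is easy to sanity-check. A back-of-envelope sketch, where the 1% error rate and 4% slow rate are assumed workload characteristics, not measurements:

```python
# Back-of-envelope for tail-sampling retention.
# error_rate and slow_rate are assumed workload characteristics.
def retained_fraction(error_rate: float, slow_rate: float, sample_pct: float) -> float:
    """Fraction of traces kept: all errors, all slow traces,
    plus a probabilistic sample of everything else."""
    rest = 1.0 - error_rate - slow_rate
    return error_rate + slow_rate + rest * (sample_pct / 100.0)

kept = retained_fraction(error_rate=0.01, slow_rate=0.04, sample_pct=5)
print(f"kept: {kept:.2%}")  # roughly 10% kept, 90% of volume dropped
```

At plausible error and latency rates, the 5% probabilistic policy dominates and the pipeline drops around 90% of spans, which is where the "80% savings" figure comes from with margin to spare.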
## The Grafana datasource
File-based provisioning. Drop one YAML under grafana/provisioning/datasources/ and the instance boots with the datasource wired.
```yaml
# grafana/provisioning/datasources/clickhouse.yaml
apiVersion: 1
datasources:
  - name: ClickHouse
    type: grafana-clickhouse-datasource
    access: proxy
    uid: clickhouse-otel
    isDefault: true
    jsonData:
      host: clickhouse
      port: 9000
      protocol: native
      defaultDatabase: otel
      username: otel
    secureJsonData:
      password: ${CH_PASSWORD}
```
Install the `grafana-clickhouse-datasource` plugin via `GF_INSTALL_PLUGINS` in the Grafana container's env, and the datasource is ready on first boot. One query later, you are drawing token spend by model.
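That one query, as a sketch. The column names match what the contrib ClickHouse exporter creates by default; the `gen_ai.*` keys are the OpenTelemetry GenAI semantic-convention attribute names, and you should adjust them if your SDK emits older token attribute names:

```sql
-- Token usage by model per hour, against the exporter's default otel_traces table.
-- Attribute keys follow the OTel GenAI semantic conventions.
SELECT
    toStartOfHour(Timestamp) AS hour,
    SpanAttributes['gen_ai.request.model'] AS model,
    sum(toUInt64OrZero(SpanAttributes['gen_ai.usage.input_tokens']))  AS input_tokens,
    sum(toUInt64OrZero(SpanAttributes['gen_ai.usage.output_tokens'])) AS output_tokens
FROM otel.otel_traces
WHERE SpanAttributes['gen_ai.request.model'] != ''
  AND Timestamp >= now() - INTERVAL 7 DAY
GROUP BY hour, model
ORDER BY hour DESC, model;
```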
## The cost math, with real numbers
Two workloads. Both assume 30 days of retention, prompts and completions captured, tail sampling at 5% for the self-hosted path.
Workload A: 10M spans per month. This is a small-to-medium LLM product. One agentic feature, a few thousand active users, moderate tool use.
| Line item | Datadog (list, per month) | Self-hosted (per month) |
|---|---|---|
| Ingest / host | ~$12,700 ingest + ~$4,800 host (10 hosts Pro) | $0 |
| LLM Obs per-span | ~$9,000 (metered) | $0 |
| Storage | included | ~$60 (EBS) |
| Compute | included | ~$140 (one m7i.2xlarge) |
| Backups to S3 | included | ~$15 |
| Annual total | ~$312,000 | ~$2,600 |
Datadog list prices are public. Real Datadog bills at this volume are usually negotiated 20–40% below list, which gets the annual closer to $180K–$250K. Even at the discounted number, the gap is two orders of magnitude.
Workload B: 100M spans per month. Larger product, multiple agent surfaces, heavy tool use.
| Line item | Datadog (list, per month) | Self-hosted (per month) |
|---|---|---|
| Ingest / host | ~$127K ingest + ~$20K host (40 hosts) | $0 |
| LLM Obs per-span | ~$90K (metered) | $0 |
| Storage | included | ~$400 (EBS) |
| Compute | included | ~$900 (three-node CH cluster) |
| Backups + egress | included | ~$150 |
| Annual total | ~$2.85M | ~$17.5K |
The self-hosted numbers come from teams I have talked to running roughly this shape at AWS list prices, with tail sampling doing most of the heavy lifting. Without tail sampling, you push 20x the data into ClickHouse and the storage line grows, but the compute line stays remarkably flat because ClickHouse compresses spans at 8–12x on the wire.
You will notice the self-hosted column has no "Datadog salesperson discount." It also has no renewal cycle. The invoice is from AWS, and the AWS invoice does not care whether your span volume doubled this quarter.
## What Datadog still does better
The dollar column is one slide. A complete migration plan has three more.
Alert routing and PagerDuty integration. Datadog has fifteen years of accumulated alerting UX: monitor composition, flaky-alert suppression, and an entire library of pre-built integrations with PagerDuty, Opsgenie, and Slack with context-aware message formatting. Grafana Alerting is competent in 2026 and covers the 80% case, but the last 20% (multi-condition composite monitors with suppression windows and PagerDuty context payloads) is where Datadog still wins. Budget a week of your platform team's time to port alerts, and a quarter of follow-up maintenance to reach parity on the sharp-edged cases.
SaaS UX for non-engineers. Your product manager and your CFO both have Datadog logins. They will not have ClickHouse logins. Grafana is the replacement surface for most of what they were looking at, but Grafana dashboards optimized for engineers are not the same artifact as Datadog dashboards optimized for execs. You will build two sets of dashboards in Grafana, one per audience, and it will be more work than you estimated.
The ecosystem of integrations. Datadog has agents for every piece of infrastructure you have ever deployed. Every new AWS service ships with Datadog support in its first quarter. Self-hosting means you own the instrumentation for anything outside the OpenTelemetry contrib repo, and the contrib repo is good but not exhaustive.
Incident retrospective tooling. Datadog's Watchdog and anomaly detection have real signal. Grafana's built-in anomaly detection is weaker. Teams that migrate and still want this end up adding a Grafana ML plugin or a separate Metrics Advisor pipeline, which is another stateful service to run.
Those are the things that keep some teams paying Datadog after they priced the alternative. The teams that migrate anyway do it because the dollar gap is bigger than the UX gap, and because the UX gap shrinks every release.
## What you give up: the ops burden
This is the line on the migration plan that teams underestimate. Running three stateful services is real work.
ClickHouse upgrades. ClickHouse ships fast. Pin the version, upgrade on a schedule, read the release notes when you cross LTS branches. Minor bumps inside 25.3 are safe. Crossing 24.8 to 25.3 is not automatic. Budget an engineer day per quarter for upgrades.
Backups. `BACKUP TABLE` to S3 works. Nobody sets it up on day one. The first outage is also the first time most teams learn their backup cron never ran. Wire it up before you need it.
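Wiring it up is one statement plus a scheduler. A sketch of the ClickHouse side; the bucket URL, path, and credential names are placeholders, and in practice you would vary the destination path per run:

```sql
-- One full backup of the trace table to S3; schedule it (cron, systemd timer)
-- and vary the destination path per run. Bucket URL and credentials are placeholders.
BACKUP TABLE otel.otel_traces
  TO S3('https://my-backup-bucket.s3.amazonaws.com/clickhouse/2026-04-01/',
        'AWS_KEY_ID', 'AWS_SECRET_KEY');
```

Then restore into a scratch database on a schedule, because a backup you have never restored is a hypothesis, not a backup.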
Disk planning. At 100M spans per month with prompts captured, you are looking at 200–400 GB per month of compressed data. EBS throughput matters; a small gp3 volume will bottleneck ClickHouse on ingest spikes. Provision intentionally.
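The 200–400 GB figure falls out of simple arithmetic. A sketch, where the average raw span size and the compression ratio are assumptions you should replace with your own measurements:

```python
# Rough disk-planning arithmetic for a month of spans.
# avg_span_kb and compression are assumed values, not measurements.
def monthly_compressed_gb(spans_per_month: float,
                          avg_span_kb: float,
                          compression: float) -> float:
    """Compressed GB written per month: raw volume divided by compression ratio."""
    raw_gb = spans_per_month * avg_span_kb / 1024 / 1024
    return raw_gb / compression

# 100M spans/month, ~30 KB raw per span with prompts captured, ~10x compression.
gb = monthly_compressed_gb(100e6, avg_span_kb=30, compression=10)
print(f"{gb:.0f} GB/month compressed")  # ~286 GB, inside the 200-400 GB band
```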
On-call for the observability stack itself. This is the cost that keeps some of the teams I have talked to paying for SaaS. When your observability is hosted, the observability company is on-call for their own stack. When you self-host, you are on-call for it. An outage at 03:00 on the ClickHouse instance means the engineer who would have been debugging production is now debugging the thing they use to debug production. That is a bad place to stand.
The teams that handle this well have a platform team of three or more engineers with one of them owning the observability stack as a named responsibility. The teams that handle this badly are the ones that rolled the stack because it looked cheap on a Thursday, and still have an unupgraded ClickHouse 25.3 running a year later because nobody owns it.
## Who should migrate
Three conditions, all three required.
You have a platform team. Not a fractional engineer, not "a backend engineer who handles infra on Fridays." A named person or team whose job includes keeping stateful services alive. If your company is under 20 engineers and has no platform function, keep paying Datadog and spend that engineering capacity on the product.
Your volume is predictable and growing. The break-even against Datadog LLM Observability lands somewhere around 2–5M spans per month on list pricing, earlier against Langfuse Cloud. Below that, you are paying your platform engineer more than a vendor would charge. Above it, the math flips and keeps flipping.
Data sovereignty is a constraint, not a preference. A regulated industry, a customer contract that forbids prompts leaving your VPC, an internal policy about user content. If your reason for self-hosting is "I do not like SaaS," you will hate this stack in six months.
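The break-even figure in the second condition is reproducible from the shape of the two cost curves: vendor cost scales with span volume, self-hosting is roughly flat infrastructure plus people time. A sketch, where all three rates are assumptions rather than quoted prices:

```python
# Break-even span volume between a metered vendor and a flat self-hosted stack.
# All three inputs are assumed rates, not quoted prices.
def break_even_spans_millions(vendor_per_million: float,
                              infra_monthly: float,
                              people_monthly: float) -> float:
    """Span volume (millions/month) at which metered vendor cost
    equals flat self-hosted cost (infra + fractional engineer time)."""
    return (infra_monthly + people_monthly) / vendor_per_million

# ~$900 per million LLM spans, ~$250/mo infra, ~$2,500/mo of engineer time.
be = break_even_spans_millions(900, infra_monthly=250, people_monthly=2500)
print(f"break-even ~= {be:.1f}M spans/month")  # ~3.1M, inside the 2-5M band
```

Move any input and the break-even moves with it, which is why the honest range is 2–5M rather than a single number.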
## Who should not migrate
If you are a team of 12 engineers, with no platform function, whose Datadog bill is $40K a year and whose product is growing, the arithmetic of this migration does not work. You will spend the annual savings on platform engineering time in the first quarter, and you will learn that the Datadog bill you were trying to avoid is the smallest line item in the real cost of running observability.
The honest version of the recommendation: migrate when the Datadog bill is bigger than the cost of a half-time engineer, and you have the half-time engineer. Otherwise, negotiate with Datadog on the renewal.
## The migration, step by step
If the three conditions hold, here is the shape the migration takes. Budget two sprints.
Week one. Stand up the compose stack in a staging environment. OTel Collector, ClickHouse, Grafana, Langfuse on top if you want the polished trace UI. Point a single service at it. Verify spans land. Write the first four panels: token usage per model, cost per hour, latency percentiles, error rate per operation. Chapter 15 of the observability book walks through the exact queries.
Week two. Point 10% of production traffic at the new Collector in parallel with Datadog. Run both for a week. Compare dashboards side by side. Find the metrics that look different and figure out why (usually a missing attribute or a sampling config drift). Port your top ten Datadog monitors to Grafana Alerting.
Week three. Cut over 100% of traffic. Keep Datadog running for the LLM traces for one more billing cycle as a safety net. Write the runbook for the three failure modes that matter: ClickHouse disk fills up, Collector OOMs, Grafana cannot reach the datasource.
Week four. Cancel the LLM Observability SKU on Datadog. Watch the next invoice.
## The full decision tree

```text
Is your Datadog LLM spend > $30K/year?
  No  → negotiate your renewal, stop reading.
  Yes → continue.

Do you have a platform team with one named owner for this?
  No  → keep paying Datadog.
  Yes → continue.

Is your span volume growing faster than your headcount?
  No  → keep paying Datadog; the math won't flip.
  Yes → continue.

Is data sovereignty a real constraint?
  Yes → migrate now.
  No  → migrate when the bill exceeds a half-time engineer's loaded cost.
```
The migration is not a question of whether self-hosted is possible. It has been possible since 2023. The question in 2026 is whether the operational maturity of your team matches the operational demands of a stateful columnar store.
If it does, the stack is three containers, the config is above, and the book has the rest.
## If this was useful
Chapter 15 of Observability for LLM Applications is the full version of this post — the compose file, the ClickHouse schema walkthrough, the four panels with their SQL, the operator notes that come from having run the stack past the easy months. Chapter 16 picks up where this one stops, on cost tracking and token accounting once the telemetry is in your own warehouse. The rest of Part IV compares Langfuse, LangSmith, Phoenix, Braintrust, DeepEval, and Helicone against the same decision matrix this post uses for Datadog.
- Book: Observability for LLM Applications — paperback and hardcover now; ebook April 22.
- Thinking in Go: 2-book series on Go programming and hexagonal architecture.
- Hermes IDE: hermes-ide.com — the IDE for developers shipping with Claude Code and other AI tools.
- Me: xgabriel.com · github.com/gabrielanhaia.
This article was originally published on DEV Community and written by Gabriel Anhaia.
