How I added LLM fallback to my OpenAI app in 10 minutes
You're running a production app on OpenAI. One Tuesday morning the API goes down, your app starts returning 500s, and you spend an hour refreshing status.openai.com.
There's a better setup. Here's how to add provider fallback to any OpenAI-SDK app without rewriting anything.
The problem with single-provider setups
When you call OpenAI directly, you have one point of failure:
from openai import OpenAI

client = OpenAI(api_key="sk-...")

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarise this text..."}],
)
If OpenAI returns a 500 or a 429, your user sees an error. You have no fallback, no visibility into what failed, and no easy way to route to a cheaper provider when you don't need GPT-4 quality.
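The usual workaround is to hand-roll the fallback yourself: one client per provider, a hard-coded chain, and a retry loop around every call site. A minimal sketch of what that looks like with the OpenAI Python SDK (the second provider entry, its base URL, and its model name are placeholders, not real endpoints):

import openai
from openai import OpenAI

# You end up maintaining a client per provider and a hand-written chain.
fallback_chain = [
    (OpenAI(api_key="sk-..."), "gpt-4o-mini"),
    (OpenAI(api_key="other-key...", base_url="https://other-provider.example/v1"), "some-model"),
]

def complete_with_manual_fallback(messages):
    last_error = None
    for client, model in fallback_chain:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except (openai.RateLimitError, openai.APIStatusError, openai.APITimeoutError) as exc:
            last_error = exc  # try the next provider in the chain
    raise last_error

It works, but every service you own needs a copy of it, and you still have no record of which provider actually served a given request.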
The fix: two lines and a gateway
InferBridge is an OpenAI-compatible API gateway. You point the OpenAI SDK at it instead of OpenAI directly. It handles routing, fallback, and per-request observability — without touching your application logic.
Step 1: Get an InferBridge key (run once)
# Create an account — returns your InferBridge key exactly once, save it.
curl -X POST https://api.inferbridge.dev/v1/users \
-H 'Content-Type: application/json' \
-d '{"email":"you@example.com"}'
# {"api_key": "ib_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", ...}
Step 2: Register your existing OpenAI key
curl -X POST https://api.inferbridge.dev/v1/keys \
-H 'Authorization: Bearer ib_xxx...' \
-H 'Content-Type: application/json' \
-d '{"provider":"openai","api_key":"sk-..."}'
Your key is Fernet-encrypted at rest. InferBridge never logs request content and never marks up inference — your key goes directly to the provider.
Step 3: Change two lines in your app
from openai import OpenAI

client = OpenAI(
    api_key="ib_xxx...",                        # ← was sk-...
    base_url="https://api.inferbridge.dev/v1",  # ← new
)

resp = client.chat.completions.create(
    model="ib/balanced",                        # ← was "gpt-4o-mini"
    messages=[{"role": "user", "content": "Summarise this text..."}],
)
That's it. Your app now has fallback.
What the routing tiers actually do
InferBridge uses explicit routing tiers instead of magic auto-classification:
| Tier | Chain | Use when |
|---|---|---|
| ib/cheap | Groq → DeepSeek → Together → Sarvam → OpenAI | High volume, cost-sensitive, quality flexible |
| ib/balanced | OpenAI → Sarvam → Anthropic | Default for most production apps |
| ib/premium | Anthropic → OpenAI | Complex reasoning, quality-critical |
The router intersects the tier with the provider keys you've registered. So if you only have an OpenAI key, ib/cheap routes to OpenAI. Register a Groq key (free tier available) and the same request code now hits Groq first — no code change.
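Adding that Groq key is the same call as Step 2. Assuming the provider identifier is simply "groq" (check the docs for the exact value; the gsk_ key is a placeholder), it looks like this:

curl -X POST https://api.inferbridge.dev/v1/keys \
  -H 'Authorization: Bearer ib_xxx...' \
  -H 'Content-Type: application/json' \
  -d '{"provider":"groq","api_key":"gsk_..."}'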
What fallback looks like in practice
A 500 from OpenAI on ib/balanced is invisible to your app. You get a clean 200 with a normal OpenAI-shaped response. The only signal is in the inferbridge block appended to the response body:
{
  "id": "chatcmpl-...",
  "choices": [...],
  "usage": {...},
  "inferbridge": {
    "provider": "anthropic",
    "model": "claude-3-5-haiku-20241022",
    "mode": "ib/balanced",
    "cache_hit": false,
    "latency_ms": 834,
    "cost_usd": "0.000041",
    "residency_actual": "global",
    "request_id": "abc123"
  }
}
provider: "anthropic" tells you OpenAI failed and Anthropic served it. Your application code didn't change. Your user saw nothing.
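If you want to log fallbacks from your own code, the block is reachable through the SDK's response object. A sketch, assuming your openai-python version keeps unknown response fields (its pydantic models allow extras; verify on your version):

resp = client.chat.completions.create(
    model="ib/balanced",
    messages=[{"role": "user", "content": "Summarise this text..."}],
)

# Extra (non-OpenAI) fields survive on the pydantic model; model_extra may be
# None on SDK versions that strip unknown keys, hence the fallback to {}.
ib = (resp.model_extra or {}).get("inferbridge", {})
if ib.get("provider") != "openai":
    print(f"served by {ib.get('provider')} via {ib.get('mode')}, "
          f"latency {ib.get('latency_ms')} ms")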
If every candidate in the chain fails, you get a clean error:
- All 429s → 429 rate_limit_error with a Retry-After header
- Mixed 5xx/timeouts → 502 provider_error or 504 gateway_timeout
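Because the gateway speaks the OpenAI wire format, those terminal errors surface through the SDK as the usual exception types. A hedged sketch of handling the exhausted-chain case (log_and_alert is your own hook, not part of any SDK; client is the one from Step 3):

import time
import openai

try:
    resp = client.chat.completions.create(model="ib/balanced", messages=messages)
except openai.RateLimitError as exc:
    # Every provider in the chain returned 429; honour the Retry-After header.
    wait = int(exc.response.headers.get("Retry-After", "5"))
    time.sleep(wait)
except openai.APIStatusError as exc:
    # 502 provider_error / 504 gateway_timeout: the whole chain failed.
    log_and_alert(exc.status_code)  # hypothetical alerting hook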
Observability you get for free
Every request is logged. Two endpoints give you visibility without a dashboard:
# Aggregated stats
GET /v1/stats
# → totals, cache_hit_rate, breakdown by provider/mode/status
# Paginated request log
GET /v1/logs
# → per-request: provider, model, cost_usd, latency_ms, status, request_id
status can be success, fallback_success, cache_hit, or error. Filter for fallback_success to see exactly when and how often your primary provider is failing.
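Both endpoints take the same ib_ key you already use for inference. The query parameter for filtering by status is an assumption here (the docs have the exact name), but the calls look like:

# Aggregated stats
curl -H 'Authorization: Bearer ib_xxx...' https://api.inferbridge.dev/v1/stats

# Request log, filtered to fallbacks (parameter name assumed; see docs)
curl -H 'Authorization: Bearer ib_xxx...' \
  'https://api.inferbridge.dev/v1/logs?status=fallback_success'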
Optional: add caching for repeated prompts
For deterministic prompts (classification, extraction, templated queries) you can opt into exact-match caching with one header:
resp = client.chat.completions.create(
    model="ib/balanced",
    messages=[...],
    extra_headers={
        "X-InferBridge-Cache": "true",
        "X-InferBridge-Cache-TTL": "3600",  # seconds
    },
)
The cache key is a SHA-256 hash over provider + model + messages + determinism params. A cache hit returns cache_hit: true in the inferbridge block and costs zero tokens.
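You can verify this from your own code by sending the same request twice and reading the inferbridge block, the same way as in the fallback example above (again assuming your SDK version exposes extra response fields):

def cached_call():
    return client.chat.completions.create(
        model="ib/balanced",
        messages=[{"role": "user", "content": "Classify: 'great product'"}],
        extra_headers={"X-InferBridge-Cache": "true", "X-InferBridge-Cache-TTL": "3600"},
    )

first = cached_call()   # miss: goes to a provider, costs tokens
second = cached_call()  # identical request inside the TTL: should be a hit
print((second.model_extra or {}).get("inferbridge", {}).get("cache_hit"))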
What's not built yet (be honest with yourself)
InferBridge is early. Before you adopt it, know the gaps:
- No dashboard UI — observability is JSON endpoints only
- Streaming requests bypass the cache
- No embeddings endpoint
- No vision inputs
- No streaming tool use / function calling
If those are blockers for your use case, it's not the right fit yet.
Try it
The free tier is unlimited BYOK, no credit card required.
- Docs: inferbridge.dev/docs
- Migration guide: inferbridge.dev/docs/migration-from-openai
If you run into anything broken or confusing, hello@inferbridge.dev goes to a real inbox.