TL;DR
My coding-interview prep app has a "Generate Visualization" button. Click it on any algorithm problem and Claude Sonnet 4.5 produces a self-contained interactive widget that teaches it — a sliding-window expanding, a two-pointer racing, a DP table filling cell by cell. (That's what the GIF above is showing.)
With the naive implementation, each click cost me about $0.08. Workable for Pro subscribers; ruinous if I let free-tier users click freely.
Through five cost decisions — tiering the call path, prompt caching, output capping, a Haiku gatekeeper, and a Groq fallback on regenerations — I got the per-click cost to $0.029 and pushed free-tier users to a $0-marginal-cost path entirely.
If you're building anything with a "Generate with AI" button in a freemium product, these are the moves that matter before you ship.
The product problem
I'm building Crackly — a DSA interview prep tool. 474 problems, each with an "AI Visual" panel. Press "Generate Visualization" and Claude generates a custom HTML+JS widget that animates the algorithm end-to-end: you watch the two-pointer sweep, the sliding window expand, the recursion tree unfold.
It's the best feature in the product. It's also the most expensive.
A naive build looks like this:
- User opens a problem page.
- Claude Sonnet 4.5 is called with the problem + a spec for how to generate the visualization.
- ~15 seconds later, the visualization renders in an iframe.
- Cost: ~$0.08 per click at full API pricing (mostly from output tokens — a naive visualization runs ~4,000 output tokens).
For Pro users on a $49/quarter plan, that's fine. A Pro user generates maybe 20 of these per month, costing me $1.60 in Claude on $16.33/mo of revenue. ~90% gross margin.
For free users, this is an existential problem. At just 1,000 free DAUs generating one visualization each per day, that's $80/day, $2,400/mo burned by a cohort paying me $0.
The first question any founder shipping an AI feature should ask: what's the marginal cost of a free user's most expensive action? For me it was $0.08, and it was eating my runway.
Here's how I got it down.
Decision 1: Tier the call path. Free users never trigger Claude.
The biggest wins in AI product engineering come from not making the call, not making it cheaper.
I split the "Generate Visualization" button into two code paths:
Free user clicks → Check DB for cached visualization for this exercise. If hit, serve it (~10ms). If miss, show a tasteful "Generating visualizations is a Pro feature — here's a preview of a similar one" state. No Claude call. $0 marginal.
Pro user clicks → Check cache first. If hit, serve it. If miss, generate fresh with Claude, store in cache for the next user, serve.
This is obvious in retrospect but I didn't build it this way at first. My v1 ran Claude on every click regardless of tier. The metrics dashboard told me within a day that this would end me.
The important property: free users benefit from the cache Pro users warm. Every Pro user who generates a fresh visualization populates the DB; every free user who lands on that same problem afterward gets it free. Pro users subsidize free coverage without knowing it, and the DB gets organically richer over time.
No batch job required. No pre-generated library. The cache grows naturally from user behavior.
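The two code paths reduce to a small handler. A sketch, with a plain dict standing in for the cache table and the Sonnet call injected as a function (both are stand-ins, not my real implementation):

```python
# Two-path "Generate Visualization" handler: cache-first for everyone,
# fresh inference only for Pro users on a cache miss.
def handle_generate(user_tier: str, exercise_id: str, db: dict, generate):
    cached = db.get(exercise_id)
    if cached is not None:
        return {"source": "cache", "html": cached}   # ~10ms, $0 marginal

    if user_tier == "free":
        # Free-user cache miss: no model call, show the upsell state.
        return {"source": "paywall", "html": None}

    # Pro cache miss: generate fresh, warm the cache for everyone after.
    html = generate(exercise_id)
    db[exercise_id] = html
    return {"source": "fresh", "html": html}
```

The key property lives in that last branch: the Pro user's `db[exercise_id] = html` write is what makes every subsequent free-user visit to the same problem a $0 cache hit.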
Decision 2: Prompt caching — 90% off input tokens
Every "Generate Visualization" call uses the same ~2,800-token system prompt. It defines the HTML output contract, the styling rules, the safety constraints, eight example outputs. The only thing that varies between calls is the problem description (~200 tokens).
Anthropic's prompt caching charges ~10% of the full input-token price on cache reads (with a 25% premium on cache writes). The cache has a 5-minute TTL by default, with an optional 1-hour TTL available at a higher write price.
Without caching:
2,800 tokens × $3.00/MTok = $0.0084 per call (input side)
With caching:
First call (cache write): 2,800 × $3.75/MTok = $0.0105
Calls 2..N (cache read): 2,800 × $0.30/MTok = $0.00084 — 10x cheaper
The practical wrinkle: the 5-minute TTL means the cache only stays warm if calls arrive frequently. For organic traffic on a not-yet-launched product, I barely got cache hits on quiet days — the cache would expire between clicks.
Fix: a shared worker that pools requests within TTL windows, so back-to-back calls within 5 minutes share the warm cache. On high-traffic Pro days, cache hit rate climbs to ~85%, and per-call input cost trends toward the $0.00084 floor.
If your system prompt is >500 tokens and identical across calls, this is your biggest free lunch. Cache it.
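In Anthropic's Messages API, the cache boundary is marked with a `cache_control` block on the static system prompt; only the small per-problem message varies between calls. A sketch of the request shape — `SYSTEM_SPEC` and the model id are placeholders for my real values:

```python
# Request builder for anthropic.Anthropic().messages.create(**kwargs).
# The big, identical system prompt is marked cacheable; the tiny
# problem description is the only uncached, varying part.
SYSTEM_SPEC = "…(~2,800-token output contract, styling rules, example widgets)…"

def build_request(problem_description: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 3500,
        "system": [{
            "type": "text",
            "text": SYSTEM_SPEC,
            "cache_control": {"type": "ephemeral"},  # cached for the TTL window
        }],
        "messages": [{"role": "user", "content": problem_description}],
    }
```

Everything above the `cache_control` marker is billed at the cache-read rate on a warm hit, which is where the 10x input discount comes from.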
Decision 3: Output capping — the sneakiest cost sink
Output tokens cost 5x as much as input tokens on Sonnet 4.5: $15/MTok out vs. $3/MTok in. And models have a bad habit of filling whatever budget you give them.
My first prompt said "generate a complete visualization." No cap, no structural constraint. Typical responses came back at ~4,000 output tokens because Claude kept adding elaborate comments, explanatory headers, <details> sections, inline docstrings — none of which the iframe needed.
I switched to this constraint in the system prompt:
Generate exactly the HTML/JS body. No HTML page wrapper (no <!DOCTYPE>,
<html>, <head>, <body>). No comments. No explanation. Output must be
under 3,500 tokens. If approaching the limit, truncate visual polish
before truncating logic.
Plus max_tokens: 3500 on the API call — a hard cap. If the model tries to exceed it, the response gets truncated mid-output and I detect that server-side and retry with a tighter instruction.
Typical response length dropped from ~4,000 tokens to ~1,850 tokens. A 54% reduction on the expensive side.
Before: 4,000 × $15/MTok = $0.060 per call (output)
After: 1,850 × $15/MTok = $0.0278 per call
Savings: $0.032 per call
At 1,000 Pro-user generations per month, that's $32/mo saved forever. The change was one paragraph in the prompt. Cap your outputs — it doesn't just save money, it forces you to define what you actually need, which produces better outputs anyway.
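The server-side guard for the hard cap looks roughly like this. The Anthropic response reports `stop_reason == "max_tokens"` when the cap truncated the output mid-stream; `call_model` is a stand-in for the real API call, and the tightening suffix is illustrative:

```python
# Retry-with-tighter-instruction guard for max_tokens truncation.
TIGHTEN_SUFFIX = "\nPrevious attempt exceeded the budget. Be drastically more concise."

def generate_capped(prompt: str, call_model, max_retries: int = 1) -> str:
    for _attempt in range(max_retries + 1):
        text, stop_reason = call_model(prompt)   # returns (output, stop_reason)
        if stop_reason != "max_tokens":
            return text                          # finished within budget
        prompt += TIGHTEN_SUFFIX                 # truncated: ask for less, retry
    raise RuntimeError("output still truncated after retries")
```

Truncated HTML is worse than no HTML — a half-emitted widget breaks the iframe — so detecting `max_tokens` and retrying is not optional once you set a hard cap.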
Decision 4: Groq fallback on regenerations
Pro users have a "Regenerate" button if they don't like the first output. Each regenerate is another call. If I let Pro users regenerate unlimited times at full Sonnet price, my cost per Pro user goes from $1.60/mo to unbounded.
What I built:
- First generation: Claude Sonnet 4.5. High quality. Persisted in the cache.
- Regenerations 1–5: Groq (Llama 3.3 70B). Free tier: 14,400 req/day. Ephemeral — not persisted in DB.
- Regeneration 6+: Rate-limited. "You've hit today's regeneration limit. Back tomorrow."
The insight: tier your inference quality to the user's need at that moment. First-generation quality is what the user judges the product by. Regeneration quality is incrementally useful — they already have a decent visualization, they're asking for a second take. A "good enough" viz from Groq is fine.
Route the first gen to expensive-but-reliable, regenerations to free-but-decent. The same design pattern applies to many AI features: your first inference run is your hero surface; your retries, refinements, and exploratory calls can route to cheaper infrastructure without the user ever noticing.
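The routing rule reduces to a small function. A sketch — the provider names are labels for the two backends, and the per-day counter reset is assumed to happen elsewhere:

```python
# First generation -> Sonnet (persisted); regens 1-5 -> Groq (ephemeral);
# regen 6+ -> rate-limited until tomorrow.
DAILY_REGEN_LIMIT = 5

def route_generation(regen_count_today: int) -> dict:
    if regen_count_today == 0:
        return {"provider": "claude-sonnet-4.5", "persist": True}
    if regen_count_today <= DAILY_REGEN_LIMIT:
        return {"provider": "groq-llama-3.3-70b", "persist": False}
    return {"provider": None, "persist": False, "error": "regen limit reached"}
```

Note the `persist` flag: only the Sonnet output warms the shared cache, so a Groq regeneration can never overwrite the high-quality version other users will see.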
Decision 5: Haiku gatekeeper (offline batch only)
This one runs exactly once per problem, offline, during seed-script runs.
Not every algorithm benefits from visualization. "Return a constant" doesn't. "Sum two integers" doesn't. I didn't want to manually tag 474 problems for visualization-worthiness.
A two-pass filter does the work:
- Haiku pass ($0.25/MTok in, $1.25/MTok out): "Given this problem description, is an interactive visualization likely to help a learner understand the algorithm? Reply YES or NO with a one-sentence reason."
- Sonnet pass: runs only if Haiku said YES.
Haiku filters out ~13% of the 474 problems — roughly 60 — as "not visualization-worthy." At ~$0.04 of Sonnet output cost per call, the filter saves a couple of dollars on a full-catalog batch and — more importantly — stops the system from generating awkward animations of trivial operations.
This pattern is underused. Flash / Haiku / GPT-4o-mini are nearly free at classification tasks. If your expensive model is making judgment calls that could be offloaded, offload them.
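The seed-script loop is the whole pattern. A sketch with the Haiku and Sonnet calls injected as functions (stand-ins for the real API calls); `parse_verdict` handles the YES/NO reply format the prompt asks for:

```python
# Two-pass gatekeeper: cheap classifier decides whether the expensive call runs.
def parse_verdict(haiku_reply: str) -> bool:
    """Worthy only if the reply leads with YES (per the prompt's contract)."""
    return haiku_reply.strip().upper().startswith("YES")

def seed_visualizations(problems, classify, generate):
    generated, skipped = [], []
    for p in problems:
        if parse_verdict(classify(p)):           # Haiku pass, ~free
            generated.append((p, generate(p)))   # Sonnet pass, expensive
        else:
            skipped.append(p)                    # no expensive call made
    return generated, skipped
```

Forcing the classifier to lead with YES/NO (reason second) is deliberate: parsing stays a prefix check instead of a second LLM call to interpret the first one.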
The per-click economics today
Putting it all together, here's what a Pro user's "Generate Visualization" click costs me after all the engineering:
| Component | Amount |
|---|---|
| Input tokens (2,800 cached at ~90% hit rate) | $0.00084 |
| Output tokens (1,850 × $15/MTok) | $0.02775 |
| Network/infra overhead (Cloud Run + DB write) | $0.001 |
| Per-click total | ~$0.029 |
A typical Pro user generates ~20 visualizations per month. That's $0.58/mo in Claude cost per Pro user, on a $16.33/mo subscription. Gross margin per Pro user: ~96%.
Free users cost me $0 in Claude — they see the cache or see the upgrade nudge.
| Cost curve | Before engineering | After engineering |
|---|---|---|
| Per-click cost | $0.08 | $0.029 |
| Per-Pro-user monthly cost | $1.60 | $0.58 |
| Per-Pro-user gross margin | ~90% | ~96% |
| Free-user marginal cost | $0.08 | $0 |
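The tables' totals reproduce directly from the earlier decisions. A quick sanity-check sketch (the $0.001 infra line is my rough estimate, as above):

```python
# Per-click and per-Pro-user economics, recomputed from Decisions 2 and 3.
cached_input  = 2_800 * 0.30 / 1_000_000    # warm cache read: $0.00084
capped_output = 1_850 * 15.00 / 1_000_000   # capped output:   $0.02775
infra = 0.001                               # Cloud Run + DB write (estimate)
per_click = cached_input + capped_output + infra

monthly_pro_cost = 20 * per_click            # ~20 generations per Pro user
gross_margin = 1 - monthly_pro_cost / 16.33  # $49/quarter ≈ $16.33/mo
```

Keeping this arithmetic in a script (rather than a spreadsheet you forget about) means a pricing change from any provider immediately shows you the new margin.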
A 6-point margin improvement on an AI feature's hero surface is the difference between a company that can pour money into growth and one that can't. At 1,000 Pro users, the $1.02/user monthly saving is roughly $1,020/mo of gross margin I wasn't paying attention to before.
The framework I'd use on any freemium AI product
If you're shipping a "Generate with AI" button, run this checklist before launch:
What's the marginal cost of one click at naive settings? Know this number. If it's >$0.02 and you have free-tier users clicking it, you have a unit-economics problem to solve before you scale.
Tier the call path by user segment. Free users should usually see cached outputs or a graceful paywall state, not trigger fresh inference. Pro users get the live call. This isn't user-hostile — it's the only way freemium AI math works.
What tokens are identical across calls? Those go in a cached system prompt. Prompt caching is the single biggest lever on input cost at scale.
What's your output actually using? If your responses often exceed ~2,000 tokens, your prompt isn't constraining output enough. Cap it in both the instruction and max_tokens.
Is there a cheap model that can route or retry the expensive one? Haiku / Flash / Mini are nearly free at classification. Offload judgment calls and filters.
Can some of your calls run on a free tier? Groq, Gemini free tier, Cerebras. Not for your hero feature — but regenerations, retries, warm-up passes, exploratory runs? Yes. Tier your inference quality to the user's moment-of-need.
What % of your LTV per user does per-user click cost consume? If a single user's typical use of the feature can eat >10% of their subscription price, you've got a hidden margin killer.
None of this is secret. These are rarely applied rigorously because "make it work first" correctly precedes "make it cheap." But the gap between naive and cost-engineered is 3-5x on AI features — which is the difference between a feature you can give away and one you have to meter.
What I'm building this for
All of this cost engineering is in service of Crackly — a DSA interview prep tool I'm shipping at crack-ly.com. The "Generate Visualization" button is one feature among many; the whole product is designed around the principle that teaching should be expensive thinking, not expensive generation. The AI works hardest at the moment you most need help. Everywhere else, it stays out of your way.
It's in private beta. Free tier forever. If you're prepping coding interviews in the next six months, try the free tier — and play with the live "AI Visual" demo embedded on the landing page. If you're building with LLMs and want to compare notes on cost engineering, my inbox is open: admin@crack-ly.com.
Follow me on X at @jobcrackly for more building-in-public from this project.
This article was originally published by DEV Community and written by Crackly.