
LLM Cost Benchmark: API vs. Self-Hosted

Compare LLM API pricing vs. self-hosted TCO. Use clear thresholds and a step-by-step calculator to find your break-even point and true cost per million tokens.

Short answer: when to use API vs self-hosted

APIs win for low-to-medium volume and fast time-to-market. Self-hosting wins when you serve a lot of tokens or need strict data control. Rough thresholds to test: under $50k per year, use an API; between $50k and $500k per year, mix API and self-hosting; above $500k per year, plan on self-hosting if you can operate GPUs well.
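
If you want that rule of thumb in code form, here is a minimal sketch; the function name and the hard cutoffs are just this article's rough thresholds, not industry constants.

```python
def deployment_recommendation(annual_api_spend_usd: float) -> str:
    """Map projected annual API spend to a deployment strategy.

    Thresholds are the rough rules of thumb from this article;
    treat them as starting points, not hard cutoffs.
    """
    if annual_api_spend_usd < 50_000:
        return "API: cheaper and faster to ship at this volume"
    if annual_api_spend_usd <= 500_000:
        return "Hybrid: self-host high-volume paths, use an API for the rest"
    return "Self-host: likely pays off if you can operate GPUs well"


print(deployment_recommendation(120_000))  # -> Hybrid: ...
```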

Why this matters

LLM inference cost affects your product roadmap and burn rate. A wrong bet can double your cloud bill or force a last-minute re-architecture. This guide gives a clear way to compare per-token API costs to the full total cost of ownership for private inference.

What we compare

  • API-based inference: pay per token to a hosted provider such as OpenAI.
  • Self-hosted inference: rent or buy GPUs, run model servers, and cover ops.

What you need to estimate

  • Annual token volume (input+output tokens).
  • Target latency and model family (smaller models are cheaper).
  • Hardware costs: GPUs, servers, colocation or cloud instances.
  • Operating costs: power, cooling, staff, software licenses.
  • API prices for input and output tokens.
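
One way to keep these estimates organized is a small data structure. Here is a minimal sketch; the field names are illustrative, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    """The estimates you need before comparing API vs. self-hosted."""
    annual_tokens: float             # input + output tokens per year
    target_latency_ms: float         # drives model size and GPU count
    gpu_unit_cost_usd: float         # purchase price per GPU (0 if renting)
    gpu_hourly_usd: float            # cloud/colo hourly rate (0 if buying)
    yearly_opex_usd: float           # power, cooling, staff, licenses
    api_input_price_per_1m: float    # provider rate for input tokens
    api_output_price_per_1m: float   # provider rate for output tokens
```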

How to calculate TCO for self-hosting

Follow these steps; a minimal calculator sketch follows the list. The method mirrors the one NVIDIA uses in its benchmarking guide and other TCO posts.

  1. Pick a target model and throughput requirement.
  2. Estimate the number of GPUs or servers required to hit that throughput.
  3. Add capital expenses: GPU cost or hourly cloud price. Depreciate hardware over 3–4 years.
  4. Add operating expenses: hosting, power, networking, software licenses, and one or more operator salaries.
  5. Divide total yearly cost by total tokens served per year to get cost per million tokens.
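
Here are steps 3–5 as one small function. Every number in the usage lines is a hypothetical figure chosen for illustration, not a benchmark result.

```python
def self_host_cost_per_1m_tokens(
    num_gpus: int,
    gpu_unit_cost_usd: float,   # step 3: capex per GPU
    depreciation_years: float,  # 3-4 years is typical
    yearly_opex_usd: float,     # step 4: hosting, power, licenses, salaries
    annual_tokens: float,       # step 5: total tokens served per year
) -> float:
    """Total yearly cost divided by yearly volume, per 1M tokens."""
    yearly_capex = num_gpus * gpu_unit_cost_usd / depreciation_years
    yearly_total = yearly_capex + yearly_opex_usd
    return yearly_total / annual_tokens * 1_000_000


# Hypothetical: 8 GPUs at $30k each, 4-year depreciation,
# $250k/year opex, 20 billion tokens served per year.
cost = self_host_cost_per_1m_tokens(8, 30_000, 4, 250_000, 20e9)
print(f"${cost:.2f} per 1M tokens")  # -> $15.50 per 1M tokens
```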

For a worked example and formulas see the NVIDIA LLM inference benchmarking post.

API pricing: simple but not the full picture

APIs charge separately for input and output tokens, and output tokens typically cost several times more than input tokens. That split matters if your app uses long prompts or Retrieval-Augmented Generation (RAG).

APIs remove ops work, scale instantly, and bundle in reliability and ongoing model improvements. But they add recurring variable spend and possible vendor limits. See the discussion of public vs. private inference costs in this Medium analysis.

Typical cost buckets to include

  • GPUs: the largest single line item for self-hosting. Use market rates or cloud hourly prices.
  • Servers and chassis: CPU, memory, and networking costs.
  • Hosting and power: colocation, electricity, cooling.
  • Software and licenses: frameworks, enterprise SDKs, model licensing.
  • Personnel: SRE, MLOps, security.
  • Opportunity cost: engineering time spent on model updates and maintenance instead of product work.

Several practical breakdowns and numbers are available from community posts and vendor studies, such as LLM Total Cost of Ownership 2025 and Dell's white paper on inferencing TCO, "Understanding the Total Cost of Inferencing Large...".

Simple per-token math (example)

Pick numbers you control and run this simple formula.

  1. Annual tokens = tokens per request * requests per year.
  2. Self-host yearly cost = (capex/years) + yearly hosting + software + ops salary.
  3. Self-host cost per 1M tokens = (self-host yearly cost / annual tokens) * 1,000,000.

For APIs: cost per 1M tokens = the provider's input and output rates blended by your traffic mix. Many providers list separate input and output prices; factor in both, weighted by how many tokens fall on each side. NVIDIA's guide shows how this split plays out in practice (NVIDIA blog).
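
Here is that blending as a tiny function; the rates and the 80/20 split in the example are hypothetical.

```python
def api_cost_per_1m_tokens(
    input_price_per_1m: float,
    output_price_per_1m: float,
    input_fraction: float,  # share of your total tokens that are input
) -> float:
    """Blend separate input/output rates into one per-1M-token cost."""
    return (input_fraction * input_price_per_1m
            + (1 - input_fraction) * output_price_per_1m)


# Hypothetical rates: $1 per 1M input, $3 per 1M output. A RAG-heavy app
# where 80% of tokens are prompt/context blends to $1.40 per 1M.
print(f"{api_cost_per_1m_tokens(1.0, 3.0, input_fraction=0.8):.2f}")  # -> 1.40
```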

Real-world scenarios and break-even rules

Use these as quick rules of thumb. They come from industry research, community hardware builds, and cost studies.

  • Under $50k/year: API is usually cheaper and faster to ship. No hardware risk.
  • $50k to $500k/year: consider hybrid. Run cheaper, smaller models locally for high-volume paths and use APIs for quality-critical or complex tasks. See the hybrid model in Ptolemay's TCO analysis.
  • Above $500k/year: self-hosting typically pays off if you can maintain operations. Many analyses and industry reports converge on this threshold, though your specific numbers matter.

Common gotchas

Ignoring staff and ops costs. Hardware is easy to estimate; people and reliability are not.

Overfitting to peak load. Design for average sustained throughput, or use autoscaling in the cloud.

Underestimating prompt length. Long prompts increase input token costs on APIs and increase compute on self-hosting.

Licensing and model updates. Model weights and enterprise SDKs can add surprises.

Hardware examples and ranges

Community builds and marketplace prices give ranges; for DIY on-premises builds, prices vary by GPU and configuration. See community hardware options at sanj.dev. Vendor reports like Dell's provide enterprise assumptions for 70B-class models (Dell paper).

Tier         Typical GPU        Hourly or CapEx
Budget       RTX 3090 / 4090    Low upfront, limited scale
Enterprise   H100 / A100        High CapEx or $/hr cloud

Example break-even calculation (short)

Imagine 1 billion tokens/year, split evenly between input and output. The API charges $1 per 1M input tokens and $3 per 1M output tokens, which blends to $2 per 1M. The API cost would be 1,000 × $2 = $2,000/year.

Self-hosting yearly cost including ops might be $100k. That works out to $100,000 / 1,000 = $100 per 1M tokens, which is far higher.

This simple example demonstrates that at low token volumes, APIs win. For very high volumes, however, the ratio flips in favor of self-hosting.
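
You can also solve for the break-even volume directly. A small sketch, using the hypothetical $100k/year self-host cost and $2/1M blended API rate from the example above:

```python
def break_even_annual_tokens(
    self_host_yearly_cost_usd: float,
    api_blended_price_per_1m: float,
) -> float:
    """Annual token volume at which self-hosting matches the API bill."""
    return self_host_yearly_cost_usd / api_blended_price_per_1m * 1_000_000


tokens = break_even_annual_tokens(100_000, 2.0)
print(f"{tokens:,.0f} tokens/year")  # -> 50,000,000,000 (50 billion)
```

At those volumes the self-host side would itself grow (more GPUs, more staff), so treat the output as a first-pass estimate and iterate.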

Where to get current numbers

API prices change fast, and model improvements shift compute cost. Track provider pricing pages for per-token rates, cloud marketplaces for GPU hourly prices, and vendor benchmarking posts such as the NVIDIA guide and Dell white paper cited above.

Decision checklist

  • Estimate tokens per user and users per year.
  • Decide acceptable latency and model size.
  • Compute API annual cost and self-host TCO with ops and licenses.
  • Run a 12-month sensitivity: what if tokens double? (A sketch follows this list.)
  • Start hybrid if unsure: cheap local models for bulk, API for quality.
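
A sensitivity sweep can be as simple as a loop. The figures below carry over the hypothetical numbers from the break-even example:

```python
# Hypothetical baseline, carried over from the break-even example above.
base_tokens = 1e9             # tokens/year
api_rate_per_1m = 2.0         # blended $/1M tokens
self_host_yearly = 100_000.0  # $/year; roughly flat until you add GPUs

for multiplier in (1, 2, 4):
    tokens = base_tokens * multiplier
    api_cost = tokens / 1e6 * api_rate_per_1m
    winner = "API" if api_cost < self_host_yearly else "self-host"
    print(f"{tokens:>14,.0f} tokens: API ${api_cost:>9,.0f} "
          f"vs self-host ${self_host_yearly:,.0f} -> {winner}")
```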

Next steps and tools

Download a spreadsheet TCO model or use an interactive calculator to plug in your numbers. A good model includes GPU hourly rates, depreciation, power, colocation, and staff. For a concise primer on pitfalls, read the community post Deploying LLMs in production: lessons.

Closing thought

APIs simplify life and are cost-efficient at low to medium scale. Self-hosting is a capital and ops play that pays off at scale. Use the thresholds above, plug in your product's numbers, and let the math decide. Ship fast, then optimize.
