Unit Economics

The Breakeven Math on Self-Hosting vs API Inference

Self-hosting AI models promises dramatically lower inference costs but requires significant engineering investment and infrastructure overhead. This guide walks through the complete breakeven calculation — including hidden costs — so you can make the switch at the right time.

SaaS Science Team · June 14, 2026 · 7 min read

The financial case for self-hosting AI models is superficially compelling. A tier-3 API provider charges $0.002 per 1,000 tokens for a capable open-weight model. Running that same model on a dedicated GPU cluster might cost $0.0005 per 1,000 tokens — a 75% reduction. For a product spending $100,000/month on API inference, that is a $75,000/month saving.

This calculation is correct and also incomplete. It omits the engineering overhead required to run the infrastructure, the operational risk of owning an inference stack, the potential quality gap for the specific tasks the product requires, and the opportunity cost of the engineering time diverted to infrastructure. When these costs are included, the breakeven shifts substantially higher than the nominal calculation suggests — and for most AI-native SaaS companies below a certain scale, self-hosting is not economically rational.


The Complete Self-Hosting Cost Model

The full cost of self-hosted inference has four components. The nominal calculation includes only the first.

Cost 1: GPU Infrastructure (Usually Quoted)

GPU infrastructure is the most visible self-hosting cost because it appears directly on cloud bills. The per-hour cost varies by GPU type and instance class:

| GPU Type | Use Case | On-Demand $/hr | Reserved $/hr (1yr) |
|---|---|---|---|
| A10G | 7B–13B models | $1.50–$2.50 | $0.90–$1.50 |
| A100 80GB | 70B models (quantized) | $3.00–$6.00 | $1.80–$3.60 |
| H100 | 70B+ models, high throughput | $4.00–$8.00 | $2.50–$5.00 |

A minimal production inference deployment for a 13B parameter model needs at least 2 GPUs for redundancy. Monthly infrastructure cost at the low end of reserved A10G pricing: 2 GPUs × $0.90/hour × 720 hours = $1,296/month.

At higher throughput requirements, the GPU count scales. A production deployment handling 1,000 requests/minute for a 13B parameter model typically requires 4–8 GPUs.
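The GPU arithmetic above can be sketched as a small helper. This is a minimal illustration that assumes GPUs run around the clock for a 720-hour month, as in the text; the function name and the choice of A10G reserved rates for the 4–8 GPU case are assumptions made for the example.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours, the figure used in the text


def monthly_gpu_cost(gpu_count: int, hourly_rate: float,
                     hours_per_month: int = HOURS_PER_MONTH) -> float:
    """Monthly cost of running gpu_count GPUs continuously."""
    return round(gpu_count * hourly_rate * hours_per_month, 2)


# Minimal redundant deployment: 2 reserved A10Gs at the low end of the range.
minimal = monthly_gpu_cost(2, 0.90)

# Higher-throughput case from the text: 4-8 GPUs.
# Assumption: still A10G reserved pricing, $0.90-$1.50/hour.
scaled_low = monthly_gpu_cost(4, 0.90)
scaled_high = monthly_gpu_cost(8, 1.50)

print(f"Minimal deployment: ${minimal:,.0f}/month")
print(f"4-8 GPU deployment: ${scaled_low:,.0f}-${scaled_high:,.0f}/month")
```

Under these assumptions the 4–8 GPU deployment costs several times the minimal one, before any engineering cost is counted.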

Cost 2: Engineering Overhead (Most Often Omitted)

Running self-hosted inference requires ongoing engineering work that is invisible in the infrastructure comparison:

Deployment and updates: When new model versions are released, or when a production issue requires a rollback, an engineer must manage the deployment. This is not a one-time cost — model providers release updates regularly, and production deployments must be maintained.

Monitoring and incident response: Self-hosted inference creates a new category of production incidents — GPU health, model performance degradation, throughput bottlenecks, and capacity saturation. These incidents require engineers to respond, often outside business hours.

Performance optimization: Unlike API inference where the provider handles batching, quantization, and inference optimization, self-hosted deployments must be configured and tuned. Achieving competitive inference throughput requires expertise in serving frameworks (vLLM, TensorRT-LLM) that most product engineering teams do not have by default.

Capacity planning and scaling: Self-hosted inference requires anticipating demand and provisioning capacity proactively. Auto-scaling is possible but requires engineering to implement correctly.

The aggregate engineering overhead for a well-run self-hosted inference deployment is typically 0.5–1.5 FTE-equivalents of ongoing work. At a loaded engineer cost of $200,000/year, this represents $100,000–$300,000/year in overhead — an enormous cost that the nominal breakeven calculation ignores.
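As a rough sketch of that overhead range, the snippet below converts an FTE fraction into annual and monthly cost. The $200,000 loaded cost comes from the text; the function name is illustrative.

```python
LOADED_ENGINEER_COST = 200_000  # USD per year, the figure assumed in the text


def engineering_overhead(fte: float,
                         loaded_cost: float = LOADED_ENGINEER_COST) -> tuple[float, float]:
    """Return (annual, monthly) engineering overhead for a given FTE fraction."""
    annual = fte * loaded_cost
    return annual, annual / 12


for fte in (0.5, 1.0, 1.5):
    annual, monthly = engineering_overhead(fte)
    print(f"{fte:.1f} FTE: ${annual:,.0f}/year, about ${monthly:,.0f}/month")
```

The monthly figure is the one to set against a monthly API bill.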

Cost 3: Operational Risk Premium

With self-hosted inference, the consequences of a failure fall entirely on your team. Provider downtime affects all customers of that provider simultaneously; it is a systemic event that the provider typically resolves quickly. Self-hosted downtime affects only your customers, and resolving it is your team's responsibility.

The operational risk premium should be calculated as:

  • Expected incidents per year × average MTTR (time to resolve) × hourly revenue impact

For a product with $1M ARR, the hourly revenue impact is approximately $110 ($1,000,000 / 8,760 hours ≈ $114, rounded down). If a self-hosted deployment has 6 more incidents per year than an API deployment, at 2 hours MTTR each, the risk premium is 6 × 2 × $110 = $1,320/year, which is modest relative to the infrastructure saving. But for a product at $10M ARR with complex incident response, the operational risk premium can be material.
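The risk premium formula is simple enough to express directly. The sketch below assumes revenue is spread evenly across all 8,760 hours of the year, which is a simplification; the function names are illustrative.

```python
HOURS_PER_YEAR = 8_760  # 365 days x 24 hours


def hourly_revenue(arr: float) -> float:
    """Average revenue per hour, assuming revenue is spread evenly over the year."""
    return arr / HOURS_PER_YEAR


def risk_premium(extra_incidents: float, mttr_hours: float,
                 hourly_revenue_impact: float) -> float:
    """Expected annual cost of the additional incidents."""
    return extra_incidents * mttr_hours * hourly_revenue_impact


# The text rounds the $1M ARR hourly impact to $110.
print(f"Rounded figure: ${risk_premium(6, 2, 110):,.0f}/year")

# Unrounded hourly impact for $1M and $10M ARR, same incident assumptions.
for arr in (1_000_000, 10_000_000):
    premium = risk_premium(6, 2, hourly_revenue(arr))
    print(f"ARR ${arr:,}: about ${premium:,.0f}/year")
```

The premium scales linearly with ARR, incident count, and MTTR, so longer or more frequent outages raise it quickly.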

Cost 4: Opportunity Cost

Engineering time spent on self-hosting infrastructure is not spent on product features. This opportunity cost is difficult to quantify precisely but should be considered when evaluating the self-hosting decision timeline.

For early-stage companies in active feature development, the opportunity cost of diverting engineering to infrastructure is typically higher than the cost saving from self-hosting. For growth-stage companies with a mature product and dedicated platform teams, the opportunity cost is lower.

The Real Breakeven Calculation

The full breakeven equation:

API monthly cost
vs.
Monthly GPU infrastructure cost
+ (Annual engineer overhead cost / 12)
+ (Annual operational risk premium / 12)

Self-hosting is economical when:
API monthly cost > monthly GPU cost + monthly engineer overhead + monthly operational risk premium

At a typical loaded engineer cost of $200,000/year (≈$16,667/month overhead) and assuming the engineer cost is fully allocated to the self-hosting work:

| API Monthly Spend | GPU Cost | Net Saving (Nominal) | Net Saving (Full) |
|---|---|---|---|
| $10,000 | $2,500 | $7,500 | -$9,167 |
| $20,000 | $5,000 | $15,000 | -$1,667 |
| $40,000 | $10,000 | $30,000 | $13,333 |
| $80,000 | $20,000 | $60,000 | $43,333 |
| $150,000 | $35,000 | $115,000 | $98,333 |

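With these assumptions, the full-cost saving turns positive somewhere between $20,000 and $40,000/month in API spend, but the margin there is thin and the table still excludes the operational risk premium and opportunity cost. The practical economic threshold for self-hosting is therefore in the $40,000–$80,000/month API spend range, not the $10,000–$20,000 range that the nominal calculation suggests.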
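The table can be reproduced with a few lines of code, which also makes it easy to substitute your own figures. This sketch assumes one fully allocated engineer at $200,000/year and, like the table, leaves the risk premium at zero unless one is supplied. The `breakeven_api_spend` helper additionally assumes GPU cost is a fixed fraction of API spend (25% in the first four rows); that ratio is an assumption for illustration.

```python
ENGINEER_MONTHLY = 200_000 / 12  # one fully allocated engineer, about $16,667/month

# (API monthly spend, GPU monthly cost) pairs taken from the table.
SCENARIOS = [
    (10_000, 2_500),
    (20_000, 5_000),
    (40_000, 10_000),
    (80_000, 20_000),
    (150_000, 35_000),
]


def net_savings(api_cost: float, gpu_cost: float,
                engineer_monthly: float = ENGINEER_MONTHLY,
                risk_monthly: float = 0.0) -> tuple[float, float]:
    """Return (nominal, full) monthly saving from self-hosting."""
    nominal = api_cost - gpu_cost
    full = nominal - engineer_monthly - risk_monthly
    return nominal, full


def breakeven_api_spend(fixed_monthly_overhead: float, gpu_ratio: float) -> float:
    """API spend at which the full saving reaches zero, if GPU cost = gpu_ratio x API spend."""
    return fixed_monthly_overhead / (1 - gpu_ratio)


for api_cost, gpu_cost in SCENARIOS:
    nominal, full = net_savings(api_cost, gpu_cost)
    print(f"API ${api_cost:>7,}  nominal ${nominal:>8,.0f}  full ${full:>8,.0f}")

print(f"Zero-saving point at a 25% GPU ratio: ${breakeven_api_spend(ENGINEER_MONTHLY, 0.25):,.0f}/month")
```

Raising `engineer_monthly` to 1.5 FTE or adding a non-zero `risk_monthly` pushes the zero-saving point higher, which is why the practical threshold sits above the arithmetic one.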

The Hybrid Architecture: Capturing Most of the Saving at Lower Risk

The pure self-hosting decision is binary and high-commitment. The hybrid architecture captures most of the cost saving at lower risk and lower engineering investment:

Identify the high-volume, low-complexity workloads: For most AI-native SaaS products, 60–70% of inference volume consists of relatively simple tasks that open-weight models can handle well — classification, extraction, format conversion, summarization of provided text. These workloads are the self-hosting candidates.

Benchmark open-weight model quality on these workloads: Before committing to self-hosting, validate that the target open-weight model matches the API inference quality on the specific tasks involved. A structured quality evaluation with 200–500 representative samples reveals the quality gap before it appears in production.

Self-host only the validated workloads: Run high-volume, quality-validated workloads on self-hosted infrastructure. Continue to use API inference for complex, reasoning-intensive tasks where frontier model quality is required.

This hybrid approach captures approximately 50–70% of the theoretical self-hosting saving (because simple tasks often represent 60–70% of volume) with significantly lower engineering overhead than full self-hosting — because the infrastructure is sized for a narrower workload and management complexity is lower.
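A minimal sketch of the routing step in a hybrid setup follows. It assumes each request already carries a task-type label from the calling code; the `InferenceRequest` class, the task labels, and the backend names are all hypothetical, and a real deployment would include only workloads that passed the quality benchmark.

```python
from dataclasses import dataclass

# Task types the text lists as self-hosting candidates. In practice this set
# should contain only workloads that passed a quality benchmark.
SELF_HOSTED_TASKS = {"classification", "extraction", "format_conversion", "summarization"}


@dataclass
class InferenceRequest:
    task_type: str  # hypothetical label attached by the calling code
    prompt: str


def route(request: InferenceRequest) -> str:
    """Return 'self_hosted' for validated simple tasks and 'api' for everything else."""
    if request.task_type in SELF_HOSTED_TASKS:
        return "self_hosted"
    return "api"  # default to the API provider for anything unvalidated


def self_hosted_share(requests: list[InferenceRequest]) -> float:
    """Fraction of requests that would be served by the self-hosted model."""
    if not requests:
        return 0.0
    hosted = sum(1 for r in requests if route(r) == "self_hosted")
    return hosted / len(requests)


sample = [
    InferenceRequest("classification", "Label this support ticket"),
    InferenceRequest("extraction", "Pull the invoice total"),
    InferenceRequest("multi_step_reasoning", "Plan a database migration"),
]
print([route(r) for r in sample])
print(f"{self_hosted_share(sample):.0%} of the sample routed to self-hosted")
```

Defaulting unknown task types to the API keeps quality risk low: a workload moves to self-hosted infrastructure only after it has been explicitly added to the validated set.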

According to SaaS Capital's research on AI-native SaaS infrastructure economics, companies using hybrid architectures achieve 30–45% reduction in inference COGS versus pure API inference, compared to 60–70% for full self-hosting — but at 40–50% of the engineering overhead.

For context on the API cost management alternative to self-hosting, see Negotiating Committed-Spend Discounts With Model Providers. For how self-hosting economics appear in the broader cost picture, see AI-Native SaaS Open Source Model Self-Hosting and AI-Native SaaS Gross Margin Decomposition.

When to Initiate the Evaluation

The right time to begin a structured self-hosting evaluation — not commitment, evaluation — is when:

  1. Monthly API spend consistently exceeds $30,000–$40,000/month
  2. The product's core inference workloads are stable and well-understood
  3. Engineering capacity exists for a 4–8 week evaluation project
  4. At least one engineer on the team has ML infrastructure familiarity (or can be hired)

The evaluation should produce a go/no-go recommendation with full cost modeling (including engineering overhead) before any infrastructure investment is made.

Conclusion

The breakeven on self-hosting versus API inference is real but substantially higher than the nominal calculation suggests. Including engineering overhead, the economic threshold is $50,000–$80,000/month in API costs for a full self-hosting commitment. Below that threshold, committed-spend contracts with API providers are typically the higher-ROI cost reduction.

The hybrid architecture — self-hosting high-volume simple workloads while maintaining API inference for complex tasks — offers a practical middle path that captures meaningful cost savings at lower engineering investment and lower operational risk than full self-hosting.

Make the decision with the full cost model. The partial calculation creates the illusion of an obvious answer when the full answer is considerably more nuanced.


Frequently Asked Questions

What does it cost to self-host an AI model?
Self-hosting costs have four components: (1) GPU infrastructure: the most visible cost. A single H100 GPU runs roughly $4.00–$8.00/hour on demand, or $2.50–$5.00/hour on a one-year reservation (about $1,800–$3,600/month at 720 hours). A production inference cluster for a 7B parameter model requires 2–4 GPUs minimum; for a 70B parameter model, 8–16 GPUs are required. (2) Engineering overhead: a dedicated ML infrastructure engineer (or a portion of an engineer's time) is required to manage model deployments, performance optimization, scaling, and incident response. This is often the largest cost component. (3) Operational infrastructure: load balancers, monitoring, logging, and storage for model weights and inference logs. Typically $500–$2,000/month on top of GPU costs. (4) Opportunity cost: engineering time spent on infrastructure is not spent on product features. This is the most frequently omitted cost in breakeven calculations.
How do you calculate the nominal breakeven between self-hosting and API inference?
The nominal breakeven calculation: Monthly API cost vs. Monthly self-hosting cost. If monthly API spend is $20,000 and monthly self-hosting cost (GPU + infrastructure) is $8,000, the nominal saving is $12,000/month, or $144,000/year. This calculation ignores engineering overhead. Adding a dedicated ML infrastructure engineer at $200,000/year in loaded cost changes the picture: net self-hosting saving is $144,000 - $200,000 = -$56,000/year. Self-hosting at $20,000/month in API costs is not economically rational when engineering overhead is included. The breakeven shifts to approximately $50,000–$80,000/month in API costs when engineering overhead is properly accounted for.
What is the capability gap between open-weight and proprietary frontier models?
The capability gap between open-weight models (that can be self-hosted) and proprietary frontier models varies significantly by task type: (1) Simple tasks (classification, extraction, format conversion, keyword extraction) — open-weight models at 7B–13B parameters perform at or near frontier model quality. The capability gap is minimal. (2) Moderate tasks (summarization, Q&A on provided context, translation, code generation for common patterns) — open-weight models at 13B–70B parameters approach frontier model quality. A small quality gap exists but may be acceptable. (3) Complex tasks (multi-step reasoning, nuanced analysis, complex code generation, long-context synthesis) — frontier models have a meaningful quality advantage over open-weight models at any size currently available for self-hosting. The quality gap is a real business risk for products where task complexity is high.
What is the hybrid self-hosting architecture?
The hybrid architecture self-hosts inference for tasks where open-weight models are adequate and uses API inference for tasks requiring frontier model capability. The routing logic is based on task type: (1) Classify each incoming request by complexity tier (simple, moderate, complex). (2) Route simple requests to the self-hosted model. (3) Route complex requests to the API provider. The classification step itself should be cheap (a rule-based system or a very small model). The hybrid architecture captures 50–70% of the theoretical self-hosting savings (since simple tasks often represent 60–70% of inference volume) while preserving frontier model quality for tasks that require it.
What are the operational risks of self-hosted inference?
Self-hosted inference creates operational risks that API inference avoids: (1) Availability risk — the company is responsible for infrastructure availability. GPU hardware failures, software bugs, and capacity exhaustion are incidents that the engineering team must respond to, rather than an incident at the provider's infrastructure. (2) Model update risk — when a new model version is released, the self-hosted deployment must be updated manually, with testing to ensure quality is maintained. API inference automatically benefits from provider improvements. (3) Scaling risk — unexpected traffic spikes require provisioning additional GPU capacity quickly. Self-hosted environments have longer provisioning times than API inference, which scales instantly. (4) Security risk — model weights stored on self-hosted infrastructure represent a security asset that must be protected. Provider-hosted models do not create this asset.
How does latency compare between self-hosted and API inference?
Latency comparison depends heavily on the specific self-hosted infrastructure and the specific API provider. General patterns: (1) Cold start latency — API inference is typically faster for first requests because providers maintain warm model instances. Self-hosted inference may have higher cold start latency if capacity management includes spinning down idle instances. (2) Steady-state latency — at consistent request volumes, self-hosted inference can be lower latency than API inference because there is no network round trip to the provider. For applications where sub-100ms response times are critical, self-hosting on geographically close infrastructure can provide a meaningful advantage. (3) Tail latency — self-hosted infrastructure typically has better tail latency control because there is no competition from other tenants for the same hardware, as exists in multi-tenant API infrastructure.
When is the decision to self-host clearly premature?
Self-hosting is premature when: (1) Monthly API spend is below $20,000 — the savings do not cover engineering overhead at this volume. (2) The team does not have ML infrastructure expertise — self-hosting without in-house expertise creates operational risk that exceeds the cost saving. (3) The product is in active feature development — the opportunity cost of ML infrastructure engineering during feature development is typically higher than the cost saving from self-hosting. (4) Product market fit is not yet established — infrastructure investment in a direction that may change with product direction is capital poorly deployed. The right time to begin evaluating self-hosting is when API spend consistently exceeds $30,000–$40,000/month and the product's core architecture is stable.
What are the steps to evaluating self-hosting in practice?
Evaluation steps: (1) Identify candidate workloads — which inference workloads are high volume, repeated, and could be served by an open-weight model without quality degradation? (2) Benchmark model quality — run candidate open-weight models on a representative sample of production queries. Calculate quality metrics relevant to your product (accuracy, user satisfaction, output format compliance). (3) Calculate the infrastructure cost — provision a test environment, run representative load, calculate the GPU cost to serve the candidate workload at production volume. (4) Include engineering overhead — estimate the ongoing engineering hours required to maintain the self-hosted deployment. Price this at the full loaded cost of the engineers involved. (5) Calculate the net saving — compare (API cost for candidate workloads) vs. (GPU cost + engineering overhead). If positive and above a 50% ROI threshold, the evaluation supports self-hosting.
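The final go/no-go check from these steps might be sketched as follows. The text does not define ROI or an acceptable quality gap, so this example assumes ROI means net monthly saving divided by total self-hosting cost, and uses a hypothetical 2-point maximum gap in pass rate. All inputs are illustrative, not measured results.

```python
def pass_rate(results: list[bool]) -> float:
    """Share of benchmark samples judged acceptable."""
    return sum(results) / len(results) if results else 0.0


def evaluate(api_cost: float, gpu_cost: float, engineering_cost: float,
             open_weight_pass_rate: float, api_pass_rate: float,
             max_quality_gap: float = 0.02,  # assumption: tolerated pass-rate gap
             min_roi: float = 0.50) -> dict:
    """Combine the quality gate and the cost comparison into a go/no-go result."""
    self_hosting_cost = gpu_cost + engineering_cost
    if self_hosting_cost <= 0:
        raise ValueError("self-hosting cost must be positive")
    net_saving = api_cost - self_hosting_cost
    roi = net_saving / self_hosting_cost  # assumption: ROI = saving / self-hosting cost
    quality_ok = (api_pass_rate - open_weight_pass_rate) <= max_quality_gap
    return {
        "net_saving": net_saving,
        "roi": roi,
        "quality_ok": quality_ok,
        "go": quality_ok and net_saving > 0 and roi >= min_roi,
    }


# Illustrative inputs: 200 benchmark samples per model, monthly costs in USD.
open_results = [True] * 192 + [False] * 8
api_results = [True] * 194 + [False] * 6
result = evaluate(60_000, 15_000, 16_667,
                  pass_rate(open_results), pass_rate(api_results))
print(result)
```

Only the costs of the candidate workloads belong in `api_cost`; spend on tasks that will stay on the API should be excluded from the comparison.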
