The Breakeven Math on Self-Hosting vs API Inference
Self-hosting AI models promises dramatically lower inference costs but requires significant engineering investment and infrastructure overhead. This guide walks through the complete breakeven calculation — including hidden costs — so you can make the switch at the right time.
The financial case for self-hosting AI models is superficially compelling. A tier-3 API provider charges $0.002 per 1,000 tokens for a capable open-weight model. Running that same model on a dedicated GPU cluster might cost $0.0005 per 1,000 tokens — a 75% reduction. For a product spending $100,000/month on API inference, that is a $75,000/month saving.
This calculation is correct and also incomplete. It omits the engineering overhead required to run the infrastructure, the operational risk of owning an inference stack, the potential quality gap for the specific tasks the product requires, and the opportunity cost of the engineering time diverted to infrastructure. When these costs are included, the breakeven shifts substantially higher than the nominal calculation suggests — and for most AI-native SaaS companies below a certain scale, self-hosting is not economically rational.
The Complete Self-Hosting Cost Model
The full cost of self-hosted inference has four components. The nominal calculation includes only the first.
Cost 1: GPU Infrastructure (Usually Quoted)
GPU infrastructure is the most visible self-hosting cost because it appears directly on cloud bills. The per-hour cost varies by GPU type and instance class:
| GPU Type | Use Case | On-Demand $/hr | Reserved $/hr (1yr) |
|---|---|---|---|
| A10G | 7B–13B models | $1.50–$2.50 | $0.90–$1.50 |
| A100 80GB | 70B models (quantized) | $3.00–$6.00 | $1.80–$3.60 |
| H100 | 70B+ models, high throughput | $4.00–$8.00 | $2.50–$5.00 |
A minimal production inference deployment for a 13B parameter model requires at minimum 2 GPUs for redundancy. Monthly infrastructure cost: 2 GPUs × $0.90/hour × 720 hours = $1,296/month at reserved A10G pricing.
At higher throughput requirements, the GPU count scales. A production deployment handling 1,000 requests/minute for a 13B parameter model typically requires 4–8 GPUs.
Cost 2: Engineering Overhead (Most Often Omitted)
Running self-hosted inference requires ongoing engineering work that is invisible in the infrastructure comparison:
Deployment and updates: When new model versions are released, or when a production issue requires a rollback, an engineer must manage the deployment. This is not a one-time cost — model providers release updates regularly, and production deployments must be maintained.
Monitoring and incident response: Self-hosted inference creates a new category of production incidents — GPU health, model performance degradation, throughput bottlenecks, and capacity saturation. These incidents require engineers to respond, often outside business hours.
Performance optimization: Unlike API inference where the provider handles batching, quantization, and inference optimization, self-hosted deployments must be configured and tuned. Achieving competitive inference throughput requires expertise in serving frameworks (vLLM, TensorRT-LLM) that most product engineering teams do not have by default.
Capacity planning and scaling: Self-hosted inference requires anticipating demand and provisioning capacity proactively. Auto-scaling is possible but requires engineering to implement correctly.
The aggregate engineering overhead for a well-run self-hosted inference deployment is typically 0.5–1.5 FTE-equivalents of ongoing work. At a loaded engineer cost of $200,000/year, this represents $100,000–$300,000/year in overhead — an enormous cost that the nominal breakeven calculation ignores.
Cost 3: Operational Risk Premium
Self-hosted inference has a higher blast radius when things go wrong than API inference. Provider downtime affects all customers of that provider simultaneously — a systemic event that is typically resolved quickly. Self-hosted downtime affects only your customers and is your team's responsibility to resolve.
The operational risk premium should be calculated as:
- Expected incidents per year × average MTTR (time to resolve) × hourly revenue impact
For a product with $1M ARR, an hourly revenue impact of approximately $110. If a self-hosted deployment has 6 more incidents per year than an API deployment, at 2 hours MTTR each, the risk premium is $1,320/year — modest relative to the infrastructure saving. But for a product at $10M ARR with complex incident response, the operational risk premium can be material.
Cost 4: Opportunity Cost
Engineering time spent on self-hosting infrastructure is not spent on product features. This opportunity cost is difficult to quantify precisely but should be considered when evaluating the self-hosting decision timeline.
For early-stage companies in active feature development, the opportunity cost of diverting engineering to infrastructure is typically higher than the cost saving from self-hosting. For growth-stage companies with a mature product and dedicated platform teams, the opportunity cost is lower.
The Real Breakeven Calculation
The full breakeven equation:
API monthly cost
vs.
GPU infrastructure cost
+ (Engineer overhead cost / 12)
+ Operational risk premium / 12
Self-hosting is economical when:
API monthly cost > GPU + engineer overhead + operational risk
At a typical loaded engineer cost of $200,000/year (≈$16,667/month overhead) and assuming the engineer cost is fully allocated to the self-hosting work:
| API Monthly Spend | GPU Cost | Net Saving (Nominal) | Net Saving (Full) |
|---|---|---|---|
| $10,000 | $2,500 | $7,500 | -$9,167 |
| $20,000 | $5,000 | $15,000 | -$1,667 |
| $40,000 | $10,000 | $30,000 | $13,333 |
| $80,000 | $20,000 | $60,000 | $43,333 |
| $150,000 | $35,000 | $115,000 | $98,333 |
The economic threshold for self-hosting is in the $40,000–$80,000/month API spend range, not the $10,000–$20,000 range that the nominal calculation suggests.
The Hybrid Architecture: Capturing Most of the Saving at Lower Risk
The pure self-hosting decision is binary and high-commitment. The hybrid architecture captures most of the cost saving at lower risk and lower engineering investment:
Identify the high-volume, low-complexity workloads: For most AI-native SaaS products, 60–70% of inference volume consists of relatively simple tasks that open-weight models can handle well — classification, extraction, format conversion, summarization of provided text. These workloads are the self-hosting candidates.
Benchmark open-weight model quality on these workloads: Before committing to self-hosting, validate that the target open-weight model matches the API inference quality on the specific tasks involved. A structured quality evaluation with 200–500 representative samples reveals the quality gap before it appears in production.
Self-host only the validated workloads: Run high-volume, quality-validated workloads on self-hosted infrastructure. Continue to use API inference for complex, reasoning-intensive tasks where frontier model quality is required.
This hybrid approach captures approximately 50–70% of the theoretical self-hosting saving (because simple tasks often represent 60–70% of volume) with significantly lower engineering overhead than full self-hosting — because the infrastructure is sized for a narrower workload and management complexity is lower.
According to SaaS Capital's research on AI-native SaaS infrastructure economics, companies using hybrid architectures achieve 30–45% reduction in inference COGS versus pure API inference, compared to 60–70% for full self-hosting — but at 40–50% of the engineering overhead.
For context on the API cost management alternative to self-hosting, see Negotiating Committed-Spend Discounts With Model Providers. For how self-hosting economics appear in the broader cost picture, see AI-Native SaaS Open Source Model Self-Hosting and AI-Native SaaS Gross Margin Decomposition.
When to Initiate the Evaluation
The right time to begin a structured self-hosting evaluation — not commitment, evaluation — is when:
- Monthly API spend consistently exceeds $30,000–$40,000/month
- The product's core inference workloads are stable and well-understood
- Engineering capacity exists for a 4–8 week evaluation project
- At least one engineer on the team has ML infrastructure familiarity (or can be hired)
The evaluation should produce a go/no-go recommendation with full cost modeling (including engineering overhead) before any infrastructure investment is made.
Conclusion
The breakeven on self-hosting versus API inference is real but substantially higher than the nominal calculation suggests. Including engineering overhead, the economic threshold is $50,000–$80,000/month in API costs for a full self-hosting commitment. Below that threshold, committed-spend contracts with API providers are typically the higher-ROI cost reduction.
The hybrid architecture — self-hosting high-volume simple workloads while maintaining API inference for complex tasks — offers a practical middle path that captures meaningful cost savings at lower engineering investment and lower operational risk than full self-hosting.
Make the decision with the full cost model. The partial calculation creates the illusion of an obvious answer when the full answer is considerably more nuanced.
See Your Growth Ceiling Now
Calculate when your SaaS growth will plateau — free, no signup required.
Frequently Asked Questions
What does it cost to self-host an AI model?
How do you calculate the nominal breakeven between self-hosting and API inference?
What is the capability gap between open-weight and proprietary frontier models?
What is the hybrid self-hosting architecture?
What are the operational risks of self-hosted inference?
How does latency compare between self-hosted and API inference?
When is the decision to self-host clearly premature?
What are the steps to evaluating self-hosting in practice?
Related Posts
A Per-Feature Gross Margin Tracking System for AI Products
Aggregate gross margin hides which AI features are profitable and which are destroying it. This guide covers how to build a per-feature gross margin tracking system that reveals the cost structure of each AI capability in your product and informs pricing tier design.
7 min readAllocating AI Inference Costs Back to Individual Customers
AI inference costs pooled at the company level create invisible margin problems. Here is a complete system for allocating inference costs to individual customers, surfacing unprofitable accounts, and pricing corrective action before margins erode.
9 min readSetting Per-Account Token Budgets Before Margins Erode
Token budgets at the per-account level prevent high-usage customers from consuming margin generated by the rest of the customer base. This guide covers how to design, implement, and communicate per-account token budgets that protect gross margin without creating customer friction.
7 min read