Webhook Delivery Infrastructure Cost in SaaS at Scale

Webhook delivery looks cheap at small scale and becomes a significant infrastructure investment at production scale. Here are the cost model, the reliability requirements, and the build-vs-buy decision framework for SaaS teams.

SaaS Science Team | May 31, 2026 | 10 min read
Tags: webhooks, infrastructure costs, API events, reliability engineering, SaaS scaling

Webhooks are the invisible infrastructure of modern SaaS integrations. When everything works, developers never think about them — events flow silently from your system to theirs, triggering downstream processes that keep their products running. When webhooks fail, the impact surfaces hours or days later in a customer's data pipeline, payment reconciliation, or user notification system.

This visibility gap (webhooks are critical but invisible until they break) leads fast-growing SaaS companies to systematically underinvest in webhook infrastructure. The cost model is poorly understood, the reliability requirements are higher than they appear, and the build-vs-buy decision is made without the full cost picture.

What Webhook Delivery Actually Requires

Delivering a webhook is not just an HTTP POST request. A production-grade webhook delivery system requires:

Event capture and queueing: Every event that should trigger a webhook must be durably captured and enqueued before any delivery attempt. If event capture fails, the webhook is never delivered — and there is no way to know what was missed. This requires a durable message queue (SQS, Kafka, RabbitMQ, or equivalent) with at-least-once delivery semantics.
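One common way to make capture durable is the transactional outbox pattern: write the event in the same database transaction as the business change, then let a separate worker move it to the queue. The following is a minimal sketch under stated assumptions, using SQLite as a stand-in for the application database. The `orders` and `outbox` tables and all function names are hypothetical.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one business table and one outbox table.
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_type TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " enqueued INTEGER NOT NULL DEFAULT 0)"
    )


def mark_order_paid(conn: sqlite3.Connection, order_id: int) -> None:
    # "with conn" commits both writes together, or rolls both back on error,
    # so the event cannot be lost if the business change succeeded.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO orders (id, status) VALUES (?, 'paid')",
            (order_id,),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order.paid", json.dumps({"order_id": order_id})),
        )


def drain_outbox(conn: sqlite3.Connection, enqueue: Callable[[str, dict], None]) -> int:
    """Push pending outbox rows to the queue; returns how many were sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE enqueued = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        enqueue(event_type, json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET enqueued = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If the worker crashes after `enqueue` but before the row is marked, the event is enqueued again on the next run. That is the at-least-once behavior described above, not a bug in the sketch.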

Fan-out to destinations: A single event may need to be delivered to multiple customer endpoints — each customer who has registered a webhook for this event type gets their own delivery. Fan-out from one event to N deliveries requires per-destination queue management.
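Fan-out can be pictured as expanding one event into one delivery record per matching subscription. This is a simplified in-memory sketch; the `Subscription` and `Delivery` types and their fields are illustrative assumptions, and a real system would persist each delivery to a per-destination queue.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Subscription:
    customer_id: str
    url: str
    event_types: FrozenSet[str]  # event types this endpoint registered for


@dataclass(frozen=True)
class Delivery:
    event_id: str
    customer_id: str
    url: str


def fan_out(event_id: str, event_type: str, subscriptions: Iterable[Subscription]) -> List[Delivery]:
    """Return one Delivery per subscription registered for this event type."""
    return [
        Delivery(event_id, sub.customer_id, sub.url)
        for sub in subscriptions
        if event_type in sub.event_types
    ]
```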

HTTP delivery with timeout handling: Each delivery attempt makes an outbound HTTP POST to the customer's endpoint with the event payload. The delivery system must handle: connection timeouts (the customer's server does not respond in time), HTTP errors (the server returns a 4xx or 5xx), SSL certificate errors, and successful delivery confirmation (2xx response).
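The outcome categories listed above can be made explicit in code so that retry logic and delivery logs share one vocabulary. The sketch below classifies the result of a single attempt from either a status code or a raised exception. The `Outcome` names are assumptions, and treating every non-2xx response (including redirects) as a failure is a simplification.

```python
import socket
import ssl
from enum import Enum
from typing import Optional


class Outcome(Enum):
    DELIVERED = "delivered"
    TIMEOUT = "timeout"
    TLS_ERROR = "tls_error"
    CONNECTION_ERROR = "connection_error"
    HTTP_ERROR = "http_error"


def classify_attempt(status_code: Optional[int], error: Optional[BaseException]) -> Outcome:
    """Map one delivery attempt to an outcome category."""
    if error is not None:
        # Check TLS first: ssl.SSLError is also an OSError.
        if isinstance(error, ssl.SSLError):
            return Outcome.TLS_ERROR
        if isinstance(error, (socket.timeout, TimeoutError)):
            return Outcome.TIMEOUT
        return Outcome.CONNECTION_ERROR
    if status_code is not None and 200 <= status_code < 300:
        return Outcome.DELIVERED
    # Assumption: anything other than 2xx counts as a failed delivery.
    return Outcome.HTTP_ERROR
```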

Retry logic for failures: When delivery fails, the system must retry according to a defined schedule. Exponential backoff (retry after 5 minutes, 30 minutes, 2 hours, etc.) is standard. Retries require maintaining retry state per delivery attempt — another queue and database write per attempt.
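A retry schedule like this reduces to a lookup from the number of failed attempts to the next delay. The sketch below uses the six-step schedule discussed later in this article (5 min, 30 min, 2 h, 6 h, 24 h, 72 h) and assumes each delay is measured from the failure that preceded it. The function name and the `None` convention for "retries exhausted" are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Delay to wait after the Nth consecutive failed attempt (N starts at 1).
RETRY_DELAYS = [
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=6),
    timedelta(hours=24),
    timedelta(hours=72),
]


def next_retry_at(failed_attempts: int, now: datetime) -> Optional[datetime]:
    """Return when to retry, or None if the event should be dead-lettered."""
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    index = failed_attempts - 1
    if index >= len(RETRY_DELAYS):
        return None  # retry budget exhausted
    return now + RETRY_DELAYS[index]
```

Production systems often add random jitter to each delay so that many failed deliveries to the same endpoint do not all retry at the same instant. That is omitted here to keep the schedule easy to read.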

Dead-letter management: Events that exceed the maximum retry attempts must be preserved in a dead-letter queue rather than silently dropped. Customers need the ability to inspect and replay dead-lettered events after fixing their endpoints.
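The inspect-and-replay behavior can be sketched with a small in-memory structure. This is only an illustration of the interface. The class and method names are hypothetical, and a real dead-letter store would need to be durable.

```python
from typing import Callable, Dict, List


class DeadLetterQueue:
    """Holds events that exhausted their retries, grouped by destination."""

    def __init__(self) -> None:
        self._events: Dict[str, List[dict]] = {}

    def park(self, destination: str, event: dict) -> None:
        self._events.setdefault(destination, []).append(event)

    def inspect(self, destination: str) -> List[dict]:
        # Return a copy so callers cannot mutate the stored list.
        return list(self._events.get(destination, []))

    def replay(self, destination: str, deliver: Callable[[dict], bool]) -> int:
        """Re-attempt every parked event; keep the ones that still fail."""
        pending = self._events.pop(destination, [])
        still_failing = [event for event in pending if not deliver(event)]
        if still_failing:
            self._events[destination] = still_failing
        return len(pending) - len(still_failing)
```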

Delivery logging and observability: Every delivery attempt — successful or failed — must be logged with timestamp, response code, response body, and delivery latency. These logs are the primary debugging tool for customers investigating webhook delivery issues.

Customer-facing delivery dashboard: Customers need visibility into their webhook delivery status. This is a product feature, not just an operational tool — it requires a UI for browsing delivery history, filtering by event type or date, inspecting event payloads, and replaying failed deliveries.

The gap between "basic webhook delivery" (fire-and-forget HTTP POST) and "production webhook infrastructure" (all of the above) is where most underinvestment occurs.

The Cost Curve by Scale

Webhook infrastructure cost is non-linear with event volume. Small-scale webhook delivery is nearly free; large-scale delivery requires infrastructure that competes with your core product for engineering investment.

Under 100,000 events/day: A simple implementation — a job queue (Sidekiq, Celery, BullMQ) triggering HTTP POSTs with basic retry logic — handles this volume on existing infrastructure. Additional cost is minimal: $50–$200/month in additional queue capacity, a small amount of database storage for delivery logs. Engineering investment is 2–4 weeks for initial implementation.

100,000–10 million events/day: At this scale, shared job queues are insufficient — a surge of webhook events can starve other background jobs. Dedicated webhook queue infrastructure is required. Per-destination rate limiting becomes necessary to prevent a single misbehaving customer's endpoint from consuming all queue capacity. Delivery logs require a dedicated logging store (not the application database) to avoid query performance degradation. Engineering investment: 4–12 weeks, plus ongoing maintenance.
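Per-destination rate limiting is often implemented with a token bucket per endpoint. The sketch below is one such approach, not the only one. The class names, the rate, and the capacity are illustrative assumptions, and a multi-worker deployment would keep bucket state in a shared store rather than in process memory.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float, now: float) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full so short bursts are allowed
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class DestinationLimiter:
    """One independent bucket per destination, created on first use."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, destination: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        bucket = self._buckets.get(destination)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self._buckets[destination] = bucket
        return bucket.allow(now)
```

Because each destination has its own bucket, a customer whose endpoint is flooded with events is throttled without slowing deliveries to anyone else.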

10 million+ events/day: Full dedicated infrastructure: a message broker (Kafka is common at this scale), per-destination fan-out with independent queue management, multi-region delivery for geographic latency reduction, and a dedicated observability stack. Engineering investment is 6–18 months of platform team capacity, and the ongoing maintenance burden is significant. This is where the build-vs-buy decision becomes most acute.

Infrastructure Cost Components

For a production webhook system delivering 1 million events/day with a 5% initial failure rate:

| Component | Monthly cost (estimate) |
| --- | --- |
| Message queue (SQS, Kafka, etc.) | $200–$800 |
| Compute (fan-out workers, delivery workers) | $300–$1,200 |
| Database/log storage (delivery history, 30-day retention) | $100–$400 |
| Monitoring and alerting | $50–$200 |
| Network egress (outbound HTTP) | $50–$150 |
| Total | $700–$2,750/month |
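As a quick check on these estimates, the component ranges can be summed and converted into a cost per million events. The figures below are copied from the estimates above. The 30-day month and the function names are assumptions made for this sketch.

```python
from typing import Dict, Tuple

# (low, high) monthly estimates in USD, taken from the cost components above.
COMPONENTS: Dict[str, Tuple[int, int]] = {
    "Message queue": (200, 800),
    "Compute": (300, 1200),
    "Database/log storage": (100, 400),
    "Monitoring and alerting": (50, 200),
    "Network egress": (50, 150),
}


def monthly_total(components: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the low and high ends of every component range."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


def cost_per_million_events(monthly_cost: float, events_per_day: int, days: int = 30) -> float:
    """Infrastructure cost per million events, assuming a 30-day month."""
    millions_per_month = events_per_day * days / 1_000_000
    return monthly_cost / millions_per_month
```

With these inputs the ranges sum to $700–$2,750 per month, which works out to roughly $23–$92 per million events at 1 million events/day.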

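At 10 million events/day, multiply by roughly 10–15x, producing about $7,000–$41,000/month. Queue infrastructure gets cheaper per event at volume, but the dedicated message broker, multi-region delivery, and observability stack needed at this scale more than offset those savings.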

These costs are infrastructure-only. They do not include the engineering cost of building and maintaining the system, customer support load from delivery failures, or the product investment required for the customer-facing delivery dashboard.

The Reliability Math

The expected reliability of a webhook delivery system depends on:

  1. The initial delivery success rate
  2. The retry strategy
  3. The maximum retry window

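For a system with a 95% initial success rate and two retries (three attempts in total, with retries 5 and 30 minutes after the preceding failure), and assuming each retry succeeds only about half the time because an endpoint that just failed is often still failing: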

  • After attempt 1: 95% delivered
  • After attempt 2 (5 min retry): 97.75% delivered
  • After attempt 3 (30 min retry): 98.89% delivered

At this retry depth, 1.11% of events fail permanently — at 1 million events/day, that is 11,100 failed deliveries per day. For payment events, compliance notifications, or inventory updates, this failure rate is unacceptable.
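The arithmetic behind these percentages is a running product of per-attempt failure rates. The sketch below reproduces the figures under an assumption that is back-solved from them rather than stated in the source: a 95% success rate on the first attempt, then roughly 55% and 51% on the two retries.

```python
from typing import List, Sequence


def cumulative_delivery(success_rates: Sequence[float]) -> List[float]:
    """Fraction of events delivered after each attempt.

    success_rates[i] is the chance that attempt i succeeds,
    given that every earlier attempt failed.
    """
    delivered = []
    remaining = 1.0  # fraction of events still undelivered
    for rate in success_rates:
        remaining *= 1.0 - rate
        delivered.append(1.0 - remaining)
    return delivered


# Assumed per-attempt rates chosen to match the percentages in the text.
article_rates = [0.95, 0.55, 0.507]
independent_rates = [0.95, 0.95, 0.95]
```

If retries were fully independent of the first failure (`independent_rates`), three attempts would leave only about 0.0125% of events undelivered. The much higher 1.11% reflects correlated failures: an endpoint that is down for one attempt is likely to be down for the next.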

Increasing to 6 retry attempts over 72 hours (exponential backoff: 5 min, 30 min, 2h, 6h, 24h, 72h):

  • Reduces permanent failure rate to under 0.1% for most endpoint availability patterns
  • But increases infrastructure cost (6 retry attempts × failure rate × retry queue cost)
  • And increases the time window within which events can be delayed (a customer's server being down for 48 hours means their events are delayed, not dropped)

The reliability design choice depends on what the customer is doing with webhooks. If customers are using webhooks for real-time notifications, delivery delay is more damaging than delivery loss. If customers are using webhooks for data reconciliation, delivery completeness (low loss rate) is more important than delivery speed.

Build vs. Buy Decision Framework

Specialized webhook delivery platforms (Hookdeck, Svix, Zeplo, Convoy) have emerged specifically because in-house webhook infrastructure at scale is expensive to build and maintain. The build-vs-buy decision should be made explicitly rather than defaulting to build because "webhooks are simple."

Build in-house when:

  • Event volume is under 1 million events/day and expected to stay there
  • The webhook feature is genuinely core to your product differentiation (you need deep control)
  • Your engineering team has queue infrastructure expertise and the bandwidth to maintain the system
  • The cost of a third-party platform exceeds the cost of in-house development (usually not true until very high volume)

Buy a specialized platform when:

  • Event volume is high or rapidly growing
  • Engineering team opportunity cost is high (time spent on webhook infrastructure is time not spent on core product)
  • You need customer-facing debugging tools (delivery dashboards, replay functionality) quickly — these take months to build in-house
  • You need features like per-destination rate limiting, custom retry policies per event type, or multi-region delivery

Third-party webhook platforms typically cost $200–$2,000/month depending on volume, plus a per-event fee. Compare against the fully-loaded cost of in-house development (engineering hours at your team's fully-loaded cost × weeks of development) plus ongoing maintenance to evaluate.
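The comparison described here can be written down as two small formulas. This is a rough sketch, not a pricing model. Every parameter is a hypothetical input you would supply from your own numbers, and amortizing the build cost over a fixed number of months is an assumption of the sketch.

```python
def in_house_monthly_cost(
    build_weeks: float,
    weekly_engineering_cost: float,
    amortization_months: float,
    maintenance_monthly: float,
    infrastructure_monthly: float,
) -> float:
    """Monthly cost of building in-house, with the build spread over time."""
    build_cost = build_weeks * weekly_engineering_cost
    return build_cost / amortization_months + maintenance_monthly + infrastructure_monthly


def platform_monthly_cost(base_fee: float, per_event_fee: float, events_per_month: float) -> float:
    """Monthly cost of a third-party platform: flat fee plus usage."""
    return base_fee + per_event_fee * events_per_month
```

The build side is dominated by engineering time rather than infrastructure, which is why the result is sensitive to how many weeks you assume and how long you amortize them.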

The Customer-Facing Experience Problem

The most underestimated cost in webhook infrastructure is the customer-facing debugging experience. Customers who build production systems on your webhooks inevitably encounter delivery failures. When they do, they need:

  1. Delivery status visibility: Which events were delivered successfully, which failed, and when?
  2. Failure diagnosis: Why did specific deliveries fail? Was it a timeout, a 500 error, a TLS error?
  3. Payload inspection: What was the exact payload that was (or was not) delivered?
  4. Replay capability: After fixing their endpoint, can they re-deliver the failed events?
  5. Retry configuration: Can they configure their retry preferences (timeout duration, maximum retry window)?

Building a customer-facing delivery dashboard with these capabilities is a 4–8 week product investment. It is also one of the highest-impact developer experience investments a webhook-enabled API company can make, because it converts a support ticket ("my webhook isn't working, help me debug it") into a self-service resolution.

For the connection between webhook infrastructure and your API cost model, see our SaaS rate limiting unit economics and SDK maintenance burden analysis.

Webhook Security Requirements

Webhook delivery at scale introduces security requirements that are often addressed late:

Signature verification: Every webhook payload should be signed with an HMAC signature (using a customer-specific secret) so customers can verify that the payload came from your system and was not tampered with. Stripe's webhook signature verification using Stripe-Signature headers is the reference implementation.
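A minimal sketch of HMAC signing and verification using only the Python standard library follows. It is loosely modeled on the timestamp-plus-body scheme that Stripe documents, but it is not Stripe's implementation. The function names, the `timestamp.body` message layout, and the 300-second tolerance are assumptions for illustration.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, timestamp: int, body: bytes) -> str:
    """Sender side: HMAC-SHA256 over 'timestamp.body', hex encoded."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_signature(
    secret: bytes,
    timestamp: int,
    body: bytes,
    signature: str,
    now: int,
    tolerance_seconds: int = 300,
) -> bool:
    """Receiver side: reject stale timestamps, then compare in constant time."""
    if abs(now - timestamp) > tolerance_seconds:
        return False  # too old or too far in the future: possible replay
    expected = sign_payload(secret, timestamp, body)
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Including the timestamp in the signed message lets the receiver reject replayed requests, and `hmac.compare_digest` avoids leaking information through comparison timing. The receiver must verify against the raw request bytes, not a re-serialized copy of the parsed JSON.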

Secret rotation: Customers need the ability to rotate their webhook signing secrets without downtime — a new secret should be accepted alongside the old secret for a defined transition window.
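One way to support a transition window is for the sender to sign each payload with every secret that is still active, so a receiver holding either the old or the new secret can verify. The sketch below illustrates that idea. The `SigningSecret` type, its `expires_at` field, and the function names are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SigningSecret:
    value: bytes
    expires_at: Optional[float] = None  # None means the current secret, no expiry


def _sign(secret: bytes, body: bytes) -> str:
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def signatures_for(secrets: List[SigningSecret], body: bytes, now: float) -> List[str]:
    """Sender side: one signature per secret still inside its window."""
    return [
        _sign(s.value, body)
        for s in secrets
        if s.expires_at is None or now < s.expires_at
    ]


def verify_any(secret: bytes, body: bytes, signatures: List[str]) -> bool:
    """Receiver side: accept if any supplied signature matches our secret."""
    expected = _sign(secret, body).encode()
    # Compare against every candidate before deciding, in constant time each.
    results = [hmac.compare_digest(expected, sig.encode()) for sig in signatures]
    return any(results)
```

Once the old secret's `expires_at` passes, the sender stops producing its signature, and receivers that have not switched to the new secret begin to fail verification. That is the intended end of the window.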

TLS requirement: Webhook endpoints should be required to use HTTPS. Delivering webhooks to HTTP endpoints exposes event payloads to interception. Enforce TLS at the delivery layer.

Payload encryption: For high-sensitivity event types (PII, financial data), consider end-to-end payload encryption using the customer's public key, so even interception at the network layer does not expose event content.

FAQ

Q: How should webhook failures be surfaced to customers? A: Proactively. Do not wait for customers to notice that their downstream systems have missing data. Implement three things: an email notification when a customer's webhook endpoint has 4+ consecutive failures, an in-app notification when failure thresholds are exceeded, and an automated daily health check that pings customer endpoints (with a notification if the endpoint has become unreachable). Customers who learn about webhook failures from you, rather than from their own systems, can fix the endpoint before the downstream impact spreads, and tend to be more satisfied as a result.
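The consecutive-failure threshold can be tracked with a simple per-endpoint counter. This is a minimal sketch assuming the threshold of 4 mentioned above. The class name and the "notify once when the threshold is crossed" behavior are illustrative choices.

```python
from typing import Dict


class EndpointHealth:
    """Tracks consecutive delivery failures per endpoint."""

    def __init__(self, threshold: int = 4) -> None:
        self.threshold = threshold
        self._failures: Dict[str, int] = {}

    def record(self, endpoint: str, success: bool) -> bool:
        """Record one attempt; return True when a notification should be sent.

        Returns True only on the attempt that reaches the threshold, so a
        long outage triggers one notification rather than one per failure.
        """
        if success:
            self._failures[endpoint] = 0  # any success resets the streak
            return False
        count = self._failures.get(endpoint, 0) + 1
        self._failures[endpoint] = count
        return count == self.threshold
```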

Q: What is the correct timeout for webhook delivery? A: A delivery timeout of 15–30 seconds is typical. Timeouts below 5 seconds generate failures from legitimate endpoints under transient load. Timeouts above 60 seconds let a few slow endpoints tie up queue workers for a minute or more per attempt. A common choice is a 15-second timeout, with documentation recommending that customers respond immediately (200 OK) and process the event asynchronously.

Q: How do you handle webhook delivery for high-volume event types? A: High-volume event types (e.g., page view events, session events, click events) should either be batched (accumulate events and deliver in bulk per webhook call rather than one call per event) or not delivered via webhook at all (routed to a data stream like Kafka or Kinesis instead). Delivering millions of individual events via webhook is inefficient for both vendor and customer; batching or event streaming is more appropriate.
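Batching amounts to grouping events into fixed-size chunks and sending one webhook call per chunk. The sketch below shows only the grouping step. The function name and the size-based flush are assumptions, and a real batcher would usually also flush on a time limit so that small batches are not held back indefinitely.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_events(events: Iterable[T], max_batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most max_batch_size events, preserving order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch: List[T] = []
    for event in events:
        batch.append(event)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch
```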

Taking Webhook Infrastructure Seriously

The moment a customer builds a production system that depends on your webhooks, webhook delivery becomes critical infrastructure. The cost of failure is not the infrastructure cost of a missed delivery — it is the downstream impact on the customer's business, the support ticket, the escalation, and the potential churn.

Webhook infrastructure designed seriously (with durable queuing, exponential backoff retry, dead-letter preservation, signature verification, and customer-facing debugging tools) is a meaningful engineering investment. But it is also a competitive differentiator. Developers often choose APIs with reliable webhook delivery over APIs with superior features but unreliable events. Reliability is a feature.

For the broader context of API infrastructure economics, see our SaaS API rate card design guide and developer community cost model.

Frequently Asked Questions

What does it cost to deliver webhooks at scale?
Webhook delivery cost depends on event volume and retry rate. At 1 million events/day with a 5% failure rate requiring up to 3 retry attempts, you are delivering 1 million initial events plus up to 150,000 retries, each requiring an outbound HTTP request, queue processing, and logging. Infrastructure cost at this scale is typically $700–$2,750/month, excluding engineering time. At 100 million events/day, the infrastructure cost is $50,000–$200,000/month and requires dedicated engineering.
What is the reliability requirement for webhook delivery?
Production webhook consumers expect at-least-once delivery (every event is delivered at least once, possibly more). Most SaaS webhook systems target 99.9% delivery reliability with retry logic for failed attempts. Some high-stakes use cases (payment events, compliance notifications) require exactly-once semantics with idempotency guarantees, which are significantly more expensive to implement.
What is the difference between at-least-once and exactly-once webhook delivery?
At-least-once delivery guarantees that every event is attempted at least once and retried until it succeeds (or retry limit is reached). This is the standard for most webhook systems — recipients must be idempotent to handle duplicate delivery. Exactly-once delivery guarantees that each event is delivered exactly once, never duplicated. Exactly-once requires distributed coordination (idempotency keys, distributed transactions) and is significantly more expensive to implement correctly.
Should we build our own webhook infrastructure or use a third-party service?
Build your own if event volume is under 1 million events/day, your engineering team has queue infrastructure expertise, and the webhook feature is core to your product experience (meaning you need deep control over the implementation). Use a third-party service (Hookdeck, Svix, Zeplo, QStash) if event volume is high, your engineering team's time is better spent on core product, or you need features like customer-facing delivery dashboards, per-destination rate limiting, and replay tools that take months to build in-house.
What are the most common webhook delivery failure modes?
The five most common webhook delivery failure modes are: endpoint timeout (the recipient's server takes longer than the delivery timeout to respond), endpoint unavailability (DNS resolution failure or TCP connection refused), HTTP error response (the recipient's server returns 4xx or 5xx), TLS certificate error (expired or invalid certificate on the recipient's endpoint), and payload rejection (the recipient validates the payload and returns a structured error indicating a validation failure).
How do you handle webhook delivery to customers who have downtime?
Implement exponential backoff retry logic with a maximum retry window. A common pattern is: retry after 5 minutes, 30 minutes, 2 hours, 6 hours, 24 hours, and 72 hours before marking the event as failed. Notify customers when their webhook endpoint has been failing for a defined threshold (e.g., 4+ consecutive failures) so they can investigate before events are permanently dropped. Dead-letter queues preserve events that exceeded the retry window for manual replay.
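
Because at-least-once delivery can produce duplicates, receivers are expected to be idempotent. A minimal sketch of a receiver that deduplicates by event ID follows. The class name and the in-memory set are assumptions, and a real receiver would store seen IDs durably, usually with an expiry.

```python
from typing import Callable, Set


class IdempotentReceiver:
    """Runs the handler at most once per event ID."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()

    def receive(self, event_id: str, payload: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        if event_id in self._seen:
            return False  # duplicate delivery: acknowledge without reprocessing
        self._handler(payload)
        # Mark as seen only after the handler succeeds, so a failed
        # attempt can be retried by the sender.
        self._seen.add(event_id)
        return True
```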
