SaaS Pricing Test Statistical Power: How Many Visitors You Need
Calculate the sample size required to detect pricing test effects with statistical confidence. Covers power analysis fundamentals, the effect of baseline conversion rate on required sample, runtime estimation, and the cost of underpowered tests.
Most pricing tests in SaaS are underpowered. The team decides to test a hypothesis, splits traffic 50/50, and checks results after two weeks. If results "look significant" they call it — often with a sample half the size required to reliably detect the effect they hypothesized. If results look flat, they conclude the test "showed nothing."
Both conclusions are likely wrong. The first is a false positive from peeking; the second is a false negative from insufficient power. Underpowered tests do not just fail to detect effects — they actively mislead, producing false confidence in conclusions that have no statistical basis.
The Power Calculation, Explained
Statistical power analysis connects four quantities. Fix any three, and the fourth is determined:
- Power (1 - β): the probability of detecting a real effect. Typically set at 0.80 (80%) or 0.90 (90%).
- Significance level (α): the false positive rate. Typically set at 0.05 (5%) for a two-sided test, which places 0.025 in each tail.
- Effect size (MDE): the minimum difference in the primary metric that the test must be able to detect.
- Sample size (n): the number of observations per variant required.
The relationship between these quantities is approximately:
n ≈ (Z_α + Z_β)² × 2σ² / δ²
Where:
- Z_α is the Z-score for the significance level (1.65 for α=0.05 one-sided, 1.96 for two-sided)
- Z_β is the Z-score for power (0.84 for 80% power, 1.28 for 90% power)
- σ² is the variance of the metric
- δ is the minimum detectable effect (absolute, not relative)
For conversion rate tests, variance is a function of the baseline rate: σ² = p(1-p), where p is the baseline conversion probability.
The critical insight: variance is maximized at p=0.50 and decreases as conversion rates move toward 0 or 1, so low conversion rates (2–5%) do have low variance. But the absolute MDE shrinks faster: a 20% relative lift on 2% is only 0.4 percentage points. For a fixed relative lift r, δ = r × p, so n is proportional to p(1-p) / (r × p)² = (1-p) / (r² × p), which grows roughly as 1/p as the baseline falls. This is why low-conversion-rate tests require more visitors.
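The approximation above can be turned into a few lines of Python. This is a minimal sketch, not a replacement for a dedicated power calculator: the function name `sample_size_per_variant` and its parameters are illustrative, and it uses the baseline variance p(1-p) for both arms, exactly as the formula is written here.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80, two_sided=True):
    """Approximate visitors per variant for a conversion rate test.

    Implements n ≈ (Z_α + Z_β)² × 2σ² / δ² with σ² = p(1 - p),
    where δ is the absolute lift (baseline × relative_lift).
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    variance = baseline * (1 - baseline)
    delta = baseline * relative_lift  # convert relative MDE to absolute
    return ceil((z_alpha + z_beta) ** 2 * 2 * variance / delta ** 2)


# Required sample grows quickly as the baseline conversion rate falls.
for p in (0.02, 0.05, 0.10):
    print(f"baseline {p:.0%}: {sample_size_per_variant(p, 0.20):,} per variant")
```

Calculators that use the separate variance of each arm return somewhat larger numbers, so treat the output as a planning estimate.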
Sample Size Tables for SaaS Pricing Tests
For two-proportion z-tests at 80% power, 95% confidence (two-tailed), detecting a 20% relative lift. Figures are computed from the approximation above with σ² = p(1-p) at the baseline rate, rounded to the nearest 10:
| Baseline Conversion Rate | Required Sample per Variant | Total Test Sample (2 Variants) |
|---|---|---|
| 1% | 38,850 | 77,700 |
| 2% | 19,230 | 38,460 |
| 3% | 12,690 | 25,380 |
| 5% | 7,460 | 14,920 |
| 8% | 4,510 | 9,020 |
| 10% | 3,530 | 7,060 |
| 15% | 2,220 | 4,440 |
| 20% | 1,570 | 3,140 |
For a 15% relative lift (smaller MDE, requires more sample):
| Baseline Conversion Rate | Required Sample per Variant | Total Test Sample (2 Variants) |
|---|---|---|
| 3% | 22,560 | 45,120 |
| 5% | 13,260 | 26,520 |
| 10% | 6,280 | 12,560 |
| 15% | 3,950 | 7,900 |
For pricing tests where the primary metric is revenue per visitor (continuous, not binary), the sample size calculation uses a t-test rather than a z-test for proportions. The required sample is typically 30–50% higher than for conversion rate tests because revenue per visitor has higher variance (driven by the distribution of plan prices and the presence of $0 outcomes from non-converters).
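To see where the extra sample comes from, compare the squared coefficient of variation (variance divided by mean squared) of the two metrics, since required sample at a fixed relative MDE is proportional to it. The sketch below is illustrative only: the plan prices and plan mix are hypothetical, it models revenue as a single first payment per converter, and it uses the same normal approximation as the formula above rather than a full t-test power calculation.

```python
def revenue_sample_multiplier(conversion_rate, plan_prices, plan_mix):
    """Ratio of the sample required by a revenue-per-visitor test to the
    sample required by a conversion rate test, at the same relative MDE,
    power, and significance level.

    For a fixed relative lift, n is proportional to variance / mean²,
    so the ratio of the two squared CVs is the ratio of required samples.
    """
    mean_price = sum(w * price for price, w in zip(plan_prices, plan_mix))
    mean_price_sq = sum(w * price ** 2 for price, w in zip(plan_prices, plan_mix))

    p = conversion_rate
    # Revenue per visitor is a plan price with probability p, otherwise $0.
    rpv_mean = p * mean_price
    rpv_var = p * mean_price_sq - rpv_mean ** 2
    cv2_revenue = rpv_var / rpv_mean ** 2

    # Conversion is a Bernoulli outcome: mean p, variance p(1 - p).
    cv2_conversion = (1 - p) / p
    return cv2_revenue / cv2_conversion


# Hypothetical plans: $29 / $79 / $149, chosen by 50% / 35% / 15% of buyers.
multiplier = revenue_sample_multiplier(0.03, (29, 79, 149), (0.50, 0.35, 0.15))
print(f"Revenue test needs about {multiplier:.2f}x the conversion test sample")
```

With this hypothetical mix the multiplier comes out at roughly 1.44. A wider price spread, or revenue measured over several billing periods, pushes it higher; a single-price product brings it back to 1.0.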
Setting the Right MDE
The minimum detectable effect is not the effect you expect to see — it is the smallest effect you would bother implementing. This distinction matters:
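Expected effect: what your hypothesis predicts the change will produce. Use it as a feasibility check against your power calculation: if the expected effect is smaller than the MDE the test is powered for, the test is unlikely to produce an actionable result.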
MDE (minimum actionable threshold): the smallest improvement that changes your decision. If a 5% revenue lift would be worth implementing but a 3% lift would be too small to justify the ongoing maintenance of a changed pricing page, your MDE is 5%.
Common MDE choices for SaaS pricing tests:
- Conversion rate tests: 10–20% relative lift (e.g., 3% → 3.3% is 10% relative)
- Revenue per visitor tests: 8–15% relative lift
- Plan mix shift tests: 15–25% relative change in a plan's selection rate
- Annual vs. monthly adoption: 20–30% relative change in annual plan selection rate
Setting MDE too small (e.g., 2% relative lift) produces impractically large sample sizes and impossibly long test runtimes. Setting MDE too large (e.g., 40% relative lift) means you only detect very large effects and miss moderate-but-real improvements. The 10–20% range is appropriate for most conversion rate and revenue per visitor tests.
Runtime Planning and Traffic Constraints
Test runtime is determined by:
Runtime (weeks) = Required total sample ÷ Weekly pricing page visitors
For a test requiring 12,000 total visitors on a page receiving 800 visitors/week: runtime = 15 weeks.
Pricing test runtime planning must account for:
Weekly traffic to the specific test URL. Homepage traffic does not count if the test is on the pricing page. Use your web analytics to extract pricing page-specific unique visitor counts, not site-wide sessions.
Conversion event frequency. For tests where the primary metric is revenue per visitor, you need sufficient conversions to estimate the revenue distribution. A page with 800 visitors/week and 2% conversion produces 16 paid conversions per week. At 15 weeks, that is 240 total conversions — enough for most conversion rate tests but borderline for revenue per visitor tests where variance is higher.
Seasonality exclusion. Do not plan tests through Q4 end-of-year if B2B buyers freeze budgets, or through your product's peak season if conversion rates are anomalous. The test window should represent normal operating conditions.
Analysis and rollout period. A test that completes in week 12 but requires two more weeks to analyze and implement the winning variant has a 14-week total cycle. Budget accordingly.
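The runtime arithmetic above is easy to script so it can be rerun whenever traffic changes. A minimal sketch, assuming a hypothetical helper `plan_runtime` and a two-week analysis buffer as the default:

```python
from math import ceil


def plan_runtime(total_sample, weekly_visitors, conversion_rate, analysis_weeks=2):
    """Estimate test runtime and the conversions the test will accumulate."""
    # Round up: a partial week still has to be run in full.
    test_weeks = ceil(total_sample / weekly_visitors)
    return {
        "test_weeks": test_weeks,
        "total_cycle_weeks": test_weeks + analysis_weeks,
        "conversions_per_week": weekly_visitors * conversion_rate,
        "total_conversions": test_weeks * weekly_visitors * conversion_rate,
    }


# The example from the text: 12,000 total visitors, 800 visitors/week, 2% conversion.
plan = plan_runtime(total_sample=12_000, weekly_visitors=800, conversion_rate=0.02)
print(plan)
```

For the inputs shown, this reproduces the 15-week runtime and roughly 240 conversions discussed above, plus a 17-week total cycle once the analysis buffer is included.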
The Cost of Underpowered Tests
Running an underpowered test is not harmless. The costs:
False negatives (missed real effects): A test with 40% power has a 60% chance of concluding "no effect" when a real effect exists. If your pricing change genuinely improves revenue per visitor by 15%, a test with 40% power will fail to detect it 60% of the time. You will shelve the winning variant based on a test that was not powerful enough to find it.
False positives from early stopping: A test with insufficient sample that is stopped the first time results cross p=0.05 has an actual false positive rate far above 5%. The inflation is driven by the number of interim looks rather than by the power level itself: with repeated checks over the life of a test, the false positive rate can reach 15–30%. You will implement variants that don't actually work.
Institutional erosion: After several underpowered tests produce contradictory results, teams lose confidence in A/B testing. The correct response is better test design, not abandoning the practice — but the institutional damage from years of underpowered tests is real.
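The early-stopping inflation described above can be checked with a small simulation. The sketch below runs A/A tests, where both arms share the same true conversion rate so every "significant" result is a false positive, and compares stopping at the first significant peek with testing once at the planned sample. All parameters (10 peeks, 1,000 visitors per arm, 10% conversion, the seed) are arbitrary assumptions chosen to keep it fast.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_peeking(n_per_arm=1000, peeks=10, rate=0.10, sims=1000, alpha=0.05, seed=7):
    """Simulate A/A tests (no true effect) under two stopping rules.

    Returns (fpr_with_peeking, fpr_single_look): the share of simulations
    declared significant when stopping at the first peek with p < alpha,
    versus when testing once at the planned sample size.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = n_per_arm // peeks
    peeking_hits = final_hits = 0

    for _ in range(sims):
        conv_a = conv_b = 0
        crossed_early = False
        significant = False
        for look in range(1, peeks + 1):
            # Add one batch of visitors to each arm; both have the same true rate.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            n = look * step
            pooled = (conv_a + conv_b) / (2 * n)
            se = sqrt(2 * pooled * (1 - pooled) / n)
            # Two-proportion z-test on the data collected so far.
            significant = se > 0 and abs(conv_a - conv_b) / n / se > z_crit
            crossed_early = crossed_early or significant
        peeking_hits += crossed_early
        final_hits += significant  # result at the final look only
    return peeking_hits / sims, final_hits / sims


fpr_peeking, fpr_single = simulate_peeking()
print(f"False positive rate with peeking: {fpr_peeking:.1%}")
print(f"False positive rate, single look: {fpr_single:.1%}")
```

The single-look rate should land near the nominal 5%, while the peeking rate is typically several times higher. Exact values depend on the seed and the number of peeks.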
The standard from ProfitWell and Bessemer Venture Partners' published benchmarking guides is consistent: before running any pricing test, calculate the required sample size using your baseline conversion rate and target MDE. If the required runtime exceeds 12–16 weeks given your traffic, accept a larger MDE (smaller precision requirement) rather than running the test underpowered.
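When traffic caps the runtime, the same approximation can be rearranged to show which MDE is actually achievable. A minimal sketch, assuming a 50/50 split, a two-sided test, and a hypothetical helper named `detectable_relative_lift`:

```python
from math import sqrt
from statistics import NormalDist


def detectable_relative_lift(baseline, weekly_visitors, max_weeks, alpha=0.05, power=0.80):
    """Smallest relative lift detectable when total traffic is capped.

    Rearranges n ≈ (Z_α + Z_β)² × 2p(1 - p) / δ² to solve for δ,
    then expresses δ relative to the baseline rate.
    """
    n_per_variant = weekly_visitors * max_weeks / 2  # 50/50 traffic split
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    delta = z_sum * sqrt(2 * baseline * (1 - baseline) / n_per_variant)
    return delta / baseline


# Example: 3% baseline, 800 pricing page visitors/week, 16-week ceiling.
mde = detectable_relative_lift(0.03, weekly_visitors=800, max_weeks=16)
print(f"Achievable MDE: {mde:.0%} relative lift")
```

For these example inputs the achievable MDE is about 28% relative. If that is larger than any effect the change could plausibly produce, the test is not worth running at current traffic.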
Practical Workflow for Power-Correct Testing
- Establish baseline. Pull 4 weeks of pricing page conversion data. Calculate mean conversion rate, standard deviation, and weekly visitor count.
- Set MDE. Define the smallest revenue per visitor improvement worth implementing.
- Calculate sample size. Use Evan Miller's online sample size calculator (for proportions) or a t-test power calculator (for continuous metrics like revenue per visitor). Run at 80% power, 95% two-tailed significance.
- Estimate runtime. Divide required total sample by weekly visitors. If runtime exceeds 16 weeks, either increase traffic before testing or increase MDE.
- Pre-register stopping rule. Record the required sample size, expected end date, and the rule: stop only when sample target is reached, not when results look significant.
- Run test. Do not look at results until the sample target is reached.
- Analyze. Report primary metric with 95% confidence interval. Secondary metrics for context. Document in test log.
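For the analysis step, the confidence interval on a conversion rate difference can be computed directly. The sketch below uses the simple Wald interval, which is one common choice rather than the only valid one; the function name and the visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for the absolute difference in conversion
    rate (variant B minus control A).

    Returns (difference, lower_bound, upper_bound).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Standard error of the difference uses each arm's own variance.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se


# Hypothetical final counts after reaching the pre-registered sample.
diff, low, high = lift_confidence_interval(conv_a=380, n_a=12_700, conv_b=460, n_b=12_700)
print(f"Lift: {diff:.2%} (95% CI {low:.2%} to {high:.2%})")
```

Reporting the interval rather than a bare p-value shows how large or small the true effect could plausibly be, which is what the implementation decision depends on.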
This connects directly to pricing A/B test design — statistical power is one component of a rigorous test design, alongside metric selection, assignment mechanism, and contamination controls. And the metrics you are powering your test to detect — revenue per visitor, conversion rate, plan mix — connect to the pricing page conversion data that establishes your baseline.
Conclusion
Statistical power is the foundational requirement for any pricing test that is worth running. An underpowered test is not a weak test — it is a test that will actively mislead through false negatives and inflated false positives.
The investment in correct power analysis takes 20 minutes and requires a baseline conversion rate, an MDE, and a power calculator. That 20 minutes of discipline is the difference between a test that produces reliable conclusions and one that produces noise with a confident-looking p-value attached.
Calculate power before launch, commit to the sample size, do not peek, and report with confidence intervals. The pricing decisions that result will be built on evidence strong enough to stand up to scrutiny — from your board, your team, and your own re-examination in 12 months.