SaaS Pricing Test Statistical Power: How Many Visitors You Need

Calculate the sample size required to detect pricing test effects with statistical confidence. Covers power analysis fundamentals, the effect of baseline conversion rate on required sample, runtime estimation, and the cost of underpowered tests.

SaaS Science Team · May 31, 2026 · 8 min read
Tags: statistical power, saas pricing, a/b testing, sample size, pricing experiments

Most pricing tests in SaaS are underpowered. The team decides to test a hypothesis, splits traffic 50/50, and checks results after two weeks. If results "look significant" they call it — often with a sample half the size required to reliably detect the effect they hypothesized. If results look flat, they conclude the test "showed nothing."

Both conclusions are likely wrong. The first is a false positive from peeking; the second is a false negative from insufficient power. Underpowered tests do not just fail to detect effects — they actively mislead, producing false confidence in conclusions that have no statistical basis.

The Power Calculation, Explained

Statistical power analysis connects four quantities. Fix any three, and the fourth is determined:

  1. Power (1 - β): the probability of detecting a real effect. Typically set at 0.80 (80%) or 0.90 (90%).
  2. Significance level (α): the false positive rate. Typically set at 0.05 (5%) for a two-sided test, which puts 0.025 in each tail.
  3. Effect size (MDE): the minimum difference in the primary metric that the test must be able to detect.
  4. Sample size (n): the number of observations per variant required.

The relationship between these quantities is approximately:

n ≈ (Z_α + Z_β)² × 2σ² / δ²

Where:

  • Z_α is the Z-score for the significance level (1.65 for α=0.05 one-sided, 1.96 for two-sided)
  • Z_β is the Z-score for power (0.84 for 80% power, 1.28 for 90% power)
  • σ² is the variance of the metric
  • δ is the minimum detectable effect (absolute, not relative)

For conversion rate tests, variance is a function of the baseline rate: σ² = p(1-p), where p is the baseline conversion probability.

The critical insight: variance is maximized at p=0.50 and shrinks as conversion rates move toward 0 or 1. At low conversion rates (2–5%) the variance is small, but the absolute MDE is smaller still: a 20% relative lift on a 2% baseline is only 0.4 percentage points. Required sample scales with σ²/δ², and δ² shrinks faster than σ² as p falls, so the signal-to-noise ratio gets worse. This is why low-conversion-rate tests require more visitors.
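The formula above can be turned into a small calculator. This is a minimal sketch of the approximation as written in this section, not a replacement for a validated tool; the function name and defaults are illustrative choices, and it uses the baseline variance p(1-p) for both variants, so dedicated calculators may return slightly different figures.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80, two_sided=True):
    """Approximate visitors per variant: n = (Z_alpha + Z_beta)^2 * 2*sigma^2 / delta^2."""
    z = NormalDist()  # standard normal distribution
    tail = alpha / 2 if two_sided else alpha
    z_alpha = z.inv_cdf(1 - tail)          # about 1.96 for alpha=0.05 two-sided
    z_beta = z.inv_cdf(power)              # about 0.84 for 80% power
    variance = baseline * (1 - baseline)   # sigma^2 = p(1 - p)
    delta = baseline * relative_lift       # absolute MDE, not relative
    return ceil((z_alpha + z_beta) ** 2 * 2 * variance / delta ** 2)


# Same 20% relative lift, different baselines: lower baselines need more visitors.
for p in (0.02, 0.05, 0.10):
    print(f"baseline {p:.0%}: {sample_size_per_variant(p, 0.20):,} visitors per variant")
```

Because δ is defined relative to the baseline, the required sample grows roughly in proportion to (1-p)/p, which is the low-conversion penalty described above.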

Sample Size Tables for SaaS Pricing Tests

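For two-proportion z-tests at 80% power, 95% confidence (two-tailed), detecting a 20% relative lift, the approximation above gives roughly:

| Baseline Conversion Rate | Required Sample per Variant | Total Test Sample |
|---|---|---|
| 1% | 38,800 | 77,600 |
| 2% | 19,200 | 38,400 |
| 3% | 12,700 | 25,400 |
| 5% | 7,400 | 14,800 |
| 8% | 4,500 | 9,000 |
| 10% | 3,500 | 7,000 |
| 15% | 2,200 | 4,400 |
| 20% | 1,600 | 3,200 |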

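For a 15% relative lift (smaller MDE, requires more sample), the same approximation gives roughly:

| Baseline Conversion Rate | Required Sample per Variant | Total Test Sample |
|---|---|---|
| 3% | 22,500 | 45,000 |
| 5% | 13,200 | 26,400 |
| 10% | 6,300 | 12,600 |
| 15% | 3,900 | 7,800 |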

For pricing tests where the primary metric is revenue per visitor (continuous, not binary), the sample size calculation uses a t-test rather than a z-test for proportions. The required sample is typically 30–50% higher than for conversion rate tests because revenue per visitor has higher variance (driven by the distribution of plan prices and the presence of $0 outcomes from non-converters).
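To see where the extra sample comes from, here is a rough sketch for a continuous metric. It reuses the normal approximation from the formula above rather than an exact t-test (at samples this large the two are nearly identical). The plan mix is hypothetical: 2% of visitors buy a $29 plan, 1% buy a $99 plan, and everyone else pays $0.

```python
from math import ceil
from statistics import NormalDist


def rpv_moments(plan_mix):
    """Mean and standard deviation of revenue per visitor.

    plan_mix maps price -> probability that a visitor buys at that price;
    the remaining visitors are non-converters contributing $0.
    """
    mean = sum(price * prob for price, prob in plan_mix.items())
    second_moment = sum(price ** 2 * prob for price, prob in plan_mix.items())
    return mean, (second_moment - mean ** 2) ** 0.5


def rpv_sample_size(mean_rpv, sd_rpv, relative_lift, alpha=0.05, power=0.80):
    """Visitors per variant for a two-sided test on revenue per visitor."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    delta = mean_rpv * relative_lift  # absolute lift in dollars per visitor
    return ceil(z_sum ** 2 * 2 * sd_rpv ** 2 / delta ** 2)


mean, sd = rpv_moments({29: 0.02, 99: 0.01})  # hypothetical plan mix
print(f"mean RPV ${mean:.2f}, standard deviation ${sd:.2f}")
print(f"{rpv_sample_size(mean, sd, 0.20):,} visitors per variant for a 20% lift")
```

With this particular mix the requirement comes out around 40% above the pure conversion-rate calculation at a 3% baseline, which is in line with the 30–50% range. A wider spread of plan prices pushes it higher.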

Setting the Right MDE

The minimum detectable effect is not the effect you expect to see — it is the smallest effect you would bother implementing. This distinction matters:

Expected effect: what your hypothesis predicts the change will produce. Use this in your power calculation to verify the test is feasible.

MDE (minimum actionable threshold): the smallest improvement that changes your decision. If a 5% revenue lift would be worth implementing but a 3% lift would be too small to justify the ongoing maintenance of a changed pricing page, your MDE is 5%.

Common MDE choices for SaaS pricing tests:

  • Conversion rate tests: 10–20% relative lift (e.g., 3% → 3.3% is 10% relative)
  • Revenue per visitor tests: 8–15% relative lift
  • Plan mix shift tests: 15–25% relative change in a plan's selection rate
  • Annual vs. monthly adoption: 20–30% relative change in annual plan selection rate

Setting MDE too small (e.g., 2% relative lift) produces impractically large sample sizes and impossibly long test runtimes. Setting MDE too large (e.g., 40% relative lift) means you only detect very large effects and miss moderate-but-real improvements. The 10–20% range is appropriate for most conversion rate and revenue per visitor tests.
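The trade-off is easier to see with numbers. The sketch below applies the approximation from the power calculation section at a 3% baseline (an illustrative figure) and prints the required sample for several MDE choices; the function name is an assumption made for this example.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate per-variant sample for a two-sided conversion rate test."""
    z = NormalDist()
    z_sum_sq = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
    delta = baseline * relative_mde  # absolute MDE
    return ceil(z_sum_sq * 2 * baseline * (1 - baseline) / delta ** 2)


# Sample size scales with 1 / MDE^2: halving the MDE quadruples the sample.
for mde in (0.02, 0.05, 0.10, 0.20, 0.40):
    print(f"MDE {mde:.0%}: {visitors_per_variant(0.03, mde):,} visitors per variant")
```

The inverse-square relationship is why a 2% MDE is impractical for most pricing pages: it needs 100 times the sample of a 20% MDE.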

Runtime Planning and Traffic Constraints

Test runtime is determined by:

Runtime (weeks) = Required total sample ÷ Weekly pricing page visitors

For a test requiring 12,000 total visitors on a page receiving 800 visitors/week: runtime = 15 weeks.

Pricing test runtime planning must account for:

Weekly traffic to the specific test URL. Homepage traffic does not count if the test is on the pricing page. Use your web analytics to extract pricing page-specific unique visitor counts, not site-wide sessions.

Conversion event frequency. For tests where the primary metric is revenue per visitor, you need sufficient conversions to estimate the revenue distribution. A page with 800 visitors/week and 2% conversion produces 16 paid conversions per week. At 15 weeks, that is 240 total conversions — enough for most conversion rate tests but borderline for revenue per visitor tests where variance is higher.

Seasonality exclusion. Do not plan tests through Q4 end-of-year if B2B buyers freeze budgets, or through your product's peak season if conversion rates are anomalous. The test window should represent normal operating conditions.

Analysis and implementation period. A test that completes in week 12 but requires two more weeks to analyze and implement the winning variant has a 14-week total cycle. Budget accordingly.
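These planning inputs fit in a few lines of code. The sketch below reproduces the 12,000-visitor example from this section; the function name and the two-week buffer default are illustrative assumptions.

```python
from math import ceil


def plan_runtime(total_sample, weekly_visitors, conversion_rate, buffer_weeks=2):
    """Estimate test length, expected conversions, and total cycle time."""
    test_weeks = ceil(total_sample / weekly_visitors)  # round up to whole weeks
    return {
        "test_weeks": test_weeks,
        "expected_conversions": round(test_weeks * weekly_visitors * conversion_rate),
        "total_cycle_weeks": test_weeks + buffer_weeks,  # analysis + rollout time
    }


# 12,000 total visitors needed, 800 pricing page visitors per week, 2% conversion
print(plan_runtime(12_000, 800, 0.02))
```

Rounding up to whole weeks also keeps each variant exposed to complete weekly cycles, which avoids weekday effects skewing a partial final week.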

The Cost of Underpowered Tests

Running an underpowered test is not harmless. The costs:

False negatives (missed real effects): A test with 40% power has a 60% chance of concluding "no effect" when a real effect exists. If your pricing change genuinely improves revenue per visitor by 15%, a test with 40% power will fail to detect it 60% of the time. You will shelve the winning variant based on a test that was not powerful enough to find it.
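You can check how much power a planned test actually has by inverting the sample size formula. This is a sketch under the same normal approximation used earlier; it ignores the negligible chance of a significant result in the wrong direction, and the example numbers are hypothetical.

```python
from statistics import NormalDist


def achieved_power(baseline, relative_lift, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided conversion rate test at a given sample."""
    z = NormalDist()
    delta = baseline * relative_lift                              # true absolute lift
    se = (2 * baseline * (1 - baseline) / n_per_variant) ** 0.5   # std. error of the difference
    z_alpha = z.inv_cdf(1 - alpha / 2)
    return z.cdf(delta / se - z_alpha)


# A 3% baseline and a true 20% relative lift, at different sample sizes
for n in (2_000, 4_000, 8_000, 12_700):
    print(f"{n:>6,} per variant: power = {achieved_power(0.03, 0.20, n):.0%}")
```

At 4,000 visitors per variant this example has roughly 35% power, so a team stopping there would miss a real 20% lift about two times out of three.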

False positives from early stopping: An undersized test that is checked repeatedly and stopped the first time results cross p=0.05 has an actual false positive rate far above 5%. Depending on how often results are checked, the false positive rate can reach 15–30%. You will implement variants that don't actually work.
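A small simulation makes the early-stopping problem concrete. The sketch below runs A/A tests, where both variants share the same true conversion rate, so every "significant" result is a false positive. All parameters (10 peeks, 2,000 visitors per arm, 5% conversion, 400 simulated tests) are arbitrary illustration values.

```python
import random
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, z_crit):
    """Two-sided pooled z-test for two proportions with n visitors per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled <= 0 or pooled >= 1:
        return False  # no variation yet, nothing to test
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    return abs(conv_a - conv_b) / n / se > z_crit


def simulate_aa_tests(peeks=10, n_final=2000, p=0.05, sims=400, seed=1):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)  # two-sided alpha = 0.05
    step = n_final // peeks
    peeking_hits = final_hits = 0
    for _ in range(sims):
        a = b = 0
        any_look = last_look = False
        for look in range(1, peeks + 1):
            a += sum(rng.random() < p for _ in range(step))
            b += sum(rng.random() < p for _ in range(step))
            last_look = is_significant(a, b, look * step, z_crit)
            any_look = any_look or last_look
        peeking_hits += any_look   # would have stopped at some peek
        final_hits += last_look    # significant at the planned sample only
    return peeking_hits / sims, final_hits / sims


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"stop at first significant look: {peeking_rate:.1%}")
print(f"single look at planned sample:  {fixed_rate:.1%}")
```

Both rates are computed on the same simulated data, so the gap between them comes entirely from the stopping rule. The single-look rate should sit near the nominal 5%, while the peeking rate typically lands well above it.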

Institutional erosion: After several underpowered tests produce contradictory results, teams lose confidence in A/B testing. The correct response is better test design, not abandoning the practice — but the institutional damage from years of underpowered tests is real.

The standard from ProfitWell and Bessemer Venture Partners' published benchmarking guides is consistent: before running any pricing test, calculate the required sample size using your baseline conversion rate and target MDE. If the required runtime exceeds 12–16 weeks given your traffic, accept a larger MDE (a lower precision requirement) rather than running the test underpowered.

Practical Workflow for Power-Correct Testing

  1. Establish baseline. Pull 4 weeks of pricing page conversion data. Calculate mean conversion rate, standard deviation, and weekly visitor count.
  2. Set MDE. Define the smallest revenue per visitor improvement worth implementing.
  3. Calculate sample size. Use Evan Miller's online sample size calculator (for proportions) or a t-test power calculator (for continuous metrics like revenue per visitor). Run at 80% power, 95% two-tailed significance.
  4. Estimate runtime. Divide required total sample by weekly visitors. If runtime exceeds 16 weeks, either increase traffic before testing or increase MDE.
  5. Pre-register stopping rule. Record the required sample size, expected end date, and the rule: stop only when sample target is reached, not when results look significant.
  6. Run test. Do not look at results until the sample target is reached.
  7. Analyze. Report primary metric with 95% confidence interval. Secondary metrics for context. Document in test log.
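For the analysis step, the confidence interval on the difference in conversion rates can be computed directly. This is a minimal sketch using the standard normal-approximation (Wald) interval; the function name is illustrative and the conversion counts are hypothetical.

```python
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Absolute difference in conversion rate (B minus A) with a Wald interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se


# Hypothetical results: control 380 of 12,700 visitors, variant 460 of 12,700
diff, low, high = lift_confidence_interval(380, 12_700, 460, 12_700)
print(f"absolute lift: {diff:.2%} (95% CI {low:.2%} to {high:.2%})")
```

Reporting the interval rather than only a p-value shows how large or small the true effect could plausibly be, which matters when comparing it against the MDE you set in step 2.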

This connects directly to pricing A/B test design — statistical power is one component of a rigorous test design, alongside metric selection, assignment mechanism, and contamination controls. And the metrics you are powering your test to detect — revenue per visitor, conversion rate, plan mix — connect to the pricing page conversion data that establishes your baseline.

Conclusion

Statistical power is the foundational requirement for any pricing test that is worth running. An underpowered test is not a weak test — it is a test that will actively mislead through false negatives and inflated false positives.

The investment in correct power analysis takes 20 minutes and requires a baseline conversion rate, an MDE, and a power calculator. That 20 minutes of discipline is the difference between a test that produces reliable conclusions and one that produces noise with a confident-looking p-value attached.

Calculate power before launch, commit to the sample size, do not peek, and report with confidence intervals. The pricing decisions that result will be built on evidence strong enough to stand up to scrutiny — from your board, your team, and your own re-examination in 12 months.

Frequently Asked Questions

What is statistical power in an A/B test?
Statistical power is the probability that your test will detect a real effect when one actually exists. A test with 80% power has a 20% chance of concluding 'no difference' even when there is a real difference. Standard practice is to design tests for 80% power, though 90% power is preferable for high-stakes pricing decisions. Higher power requires larger sample sizes.
What sample size do I need for a SaaS pricing test?
At 3% baseline conversion, detecting a 20% relative lift (3% → 3.6%) with 80% power at α=0.05 (two-sided) requires approximately 12,700 visitors per variant. At 10% baseline conversion, the same relative lift (10% → 12%) requires approximately 3,500 visitors per variant. Use a validated online calculator (Evan Miller's tool is the standard) rather than estimating by feel; exact figures vary slightly with the formula a calculator uses.
What is minimum detectable effect (MDE)?
MDE is the smallest difference in the primary metric that would be considered practically significant: worth implementing even if the true effect is no larger. For pricing tests, a 5% relative improvement in revenue per visitor is typically considered the minimum worth detecting. MDE is set before the test starts and directly determines the required sample size: smaller MDE = larger sample required.
What happens if you run an underpowered test?
An underpowered test has two bad outcomes: (1) False negative: a real effect exists but the test is inconclusive, so you implement nothing when you should have. (2) Premature false positive: if you peek at results early and stop when they look significant, noise is mistaken for a real effect at a rate well above the nominal 5%. Both outcomes are harmful. Underpowered tests produce wasted experiment time and misleading conclusions.
How do you calculate the required runtime for a pricing test?
Runtime (weeks) = Sample size per variant × 2 ÷ Weekly visitors to the pricing page. If you need 6,000 visitors per variant and receive 1,500 pricing page visitors per week, the test runs for 8 weeks. If 1,500 visitors/week is too slow for your experimental cadence, either increase traffic (through ads or promotion) or accept a larger MDE (only larger effects are detectable, but the runtime is shorter).
Can you run multiple pricing tests simultaneously?
No, unless using a fully factorial design that explicitly tests all variant combinations. Running two tests simultaneously (different variables, different test cells) means users can end up in any combination of variants. Without factorial design, you cannot attribute effects to individual variables, and interaction effects between tests inflate variance and distort results. Sequential testing is the standard practice.
What is the difference between 95% confidence and 80% power?
They address different error types. Confidence level (95%) controls false positives: when there is no real effect, the test will correctly decline to declare one 95% of the time. It does not mean that a significant result is real with 95% probability. Power (80%) controls false negatives: when a real effect at least as large as the MDE exists, the test will detect it 80% of the time. Both must be set to control the quality of decisions. A test that uses a 95% confidence threshold but was only designed for 40% power will frequently miss real effects.
How does seasonality affect test runtime?
If your pricing page traffic is significantly higher during certain periods (end of fiscal year, Product Hunt launches, holiday campaigns), you should either run your test during a representative period or run both variants simultaneously through the high and low periods so that seasonal swings affect both equally. Tests run entirely during seasonal peaks can produce effect estimates that do not hold under normal conditions.
