Sizing Experiments at Low Traffic: Choosing a Minimum Detectable Effect You Can Actually Reach

How SaaS teams with limited traffic set realistic minimum detectable effects, avoid underpowered experiments, and make confident product decisions without enterprise-scale sample sizes.

SaaS Science Team | June 14, 2026 | 8 min read
Tags: minimum detectable effect, experimentation, sample size, low traffic, statistical power, saas

The sample size calculator is the most uncomfortable tool in the SaaS experimenter's kit. You enter your baseline conversion rate, your desired significance level, and the improvement you want to detect. The calculator returns a number, say 12,000 users per variant, and your monthly unique visitors are 8,000. With two variants you need 24,000 users, so the math says your experiment will take three months. You have a two-week sprint cycle. Something has to give.

Most teams handle this tension by either ignoring it (running the experiment anyway and calling the result significant when it is not) or abandoning quantitative experimentation entirely ("we just don't have the traffic to run A/B tests"). Both are wrong. The correct response is to understand what the minimum detectable effect actually means, set it at a level your traffic can reach, and choose the experiment design that maximizes what you can learn with the users you have.

Amplitude's product benchmarks show that the median B2B SaaS company has between 5,000 and 25,000 monthly active users. At the low end of that range, classical fixed-horizon A/B testing is genuinely difficult for anything smaller than a 10-15% relative improvement. But smaller effects are often still detectable with different methods.

What the MDE Actually Means — and Why Most Teams Set It Wrong

The minimum detectable effect is not the improvement you expect to see. It is the smallest improvement you care about detecting. Setting the MDE requires answering a business question: "If this variant improved [metric] by X%, would that change our product direction?"

Teams that set the MDE too small are not being rigorous — they are being impractical. An experiment designed to detect a 1% improvement in 14-day retention might require 200,000 users per variant. Unless your product has enterprise-scale traffic on the specific page being tested, this experiment cannot be run. But a 1% improvement in 14-day retention, while statistically real, may not justify the engineering cost of the change — particularly if the uncertainty interval around a "significant" result still spans from 0.1% to 1.9%.

The correct MDE is the threshold at which the improvement would be economically meaningful. For most SaaS metrics:

| Metric | Economically meaningful improvement | Typical MDE range |
| --- | --- | --- |
| Signup conversion rate | >5% relative (e.g., 10% → 10.5%) | 5-15% relative |
| Trial-to-paid conversion | >3% relative (e.g., 25% → 25.75%) | 5-12% relative |
| 30-day retention | >2 pp absolute | 2-5 pp absolute |
| Feature adoption rate | >5% relative | 5-20% relative |
| Time-to-first-value | >10% reduction | 10-25% reduction |

The second column reflects what actually moves ARR for a typical SaaS business. The third column reflects what traffic levels between 5,000 and 50,000 monthly uniques can actually detect in a 14-28 day experiment window. These ranges overlap — which means the MDE problem is solvable if teams are willing to be honest about the threshold.

Calculating Required Sample Size From First Principles

Rather than using a black-box calculator, understanding the formula gives teams the insight to optimize their experiment designs.

For a two-sample proportions test at 80% power and 5% significance, the per-variant sample size is approximately:

n ≈ 16 × σ² / δ²

Where σ² is the per-user variance of the metric and δ is the MDE in absolute terms. The constant 16 is the rounded value of 2 × (1.96 + 0.84)² ≈ 15.7, which comes from the z-scores for 5% two-sided significance and 80% power. The key insight from this formula is that the required sample size scales with the inverse square of the MDE: cutting the MDE in half quadruples the required sample, and doubling the MDE reduces it to one quarter.

This is why the single most effective intervention for low-traffic teams is reconsidering the MDE — not chasing more traffic or extending experiment duration.
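The scaling can be checked with a few lines of Python. This is a minimal sketch of the approximation above, not a replacement for a full power calculator; the function name `required_n_per_variant` and the 35% baseline are illustrative assumptions.

```python
def required_n_per_variant(sigma2: float, delta: float) -> float:
    """Approximate per-variant sample size at 80% power and 5% significance.

    Implements n ≈ 16 × σ² / δ², where delta is the absolute MDE.
    """
    return 16 * sigma2 / delta ** 2


# Illustrative assumption: a conversion metric with a 35% baseline.
baseline = 0.35
sigma2 = baseline * (1 - baseline)  # Bernoulli variance of a yes/no metric

for relative_mde in (0.05, 0.10, 0.20):
    delta = baseline * relative_mde  # convert the relative MDE to absolute
    n = required_n_per_variant(sigma2, delta)
    print(f"{relative_mde:.0%} relative MDE -> about {n:,.0f} users per variant")
```

Each doubling of the MDE in the loop divides the required sample by four, which is the whole argument for revisiting the MDE first.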

The formula also shows that reducing variance (σ²) has a linear effect on sample size. This is where CUPED and other variance reduction techniques provide value.

Example: A 14-day activation rate experiment

  • Baseline activation rate: 35%
  • Available unique visitors: 8,000 per 14-day period (4,000 per variant)
  • Per-user variance of the activation indicator: σ² = 0.35 × (1 - 0.35) = 0.2275
  • With 4,000 users per variant, detectable δ ≈ √(16 × 0.2275 / 4,000) ≈ 0.030, which is 3.0 pp absolute ≈ 8.6% relative

This team can detect an 8-9% relative improvement in activation rate with adequate power in a 14-day experiment. If the experiment is testing an onboarding flow change, that is a meaningful threshold — an 8-9% improvement in activation is worth building.
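The arithmetic in this example can be reproduced, and the power claim checked, with the standard library alone. The sketch below uses exact z-scores instead of the rounded constant 16 and, like the formula above, assumes the baseline variance applies to both arms. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def detectable_delta(p: float, n_per_variant: int,
                     alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest absolute lift detectable for a proportion with baseline p."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = STD_NORMAL.inv_cdf(power)
    return (z_alpha + z_beta) * sqrt(2 * p * (1 - p) / n_per_variant)


def approx_power(p: float, delta: float, n_per_variant: int,
                 alpha: float = 0.05) -> float:
    """Approximate power to detect an absolute lift of delta (normal approximation)."""
    standard_error = sqrt(2 * p * (1 - p) / n_per_variant)
    return STD_NORMAL.cdf(delta / standard_error - STD_NORMAL.inv_cdf(1 - alpha / 2))


delta = detectable_delta(p=0.35, n_per_variant=4000)
print(f"Detectable lift: {delta * 100:.2f} pp absolute, {delta / 0.35:.1%} relative")

# What happens if the team insists on a 1 pp MDE with the same traffic?
print(f"Power at a 1 pp true lift: {approx_power(0.35, 0.01, 4000):.0%}")
```

The second print is the useful one in practice: it shows how far below 80% power an experiment falls when the MDE is set smaller than the traffic supports.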

Variance Reduction: Getting More Power Without More Users

CUPED (Controlled-experiment Using Pre-Experiment Data) is the most practical variance reduction tool available. The concept is straightforward: use each user's pre-experiment behavior as a covariate to reduce the residual variance of the treatment effect estimate.

If you are running an experiment on day-14 retention, and you have data on each user's day-7 engagement from before the experiment started, including day-7 engagement as a CUPED covariate will typically reduce the variance of the day-14 retention estimate by 25-40%. This is equivalent to having 33-67% more effective sample size — without a single additional user.
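The conversion from variance reduction to effective sample size is a one-line calculation, sketched below. The helper names are illustrative. The correlation figure relies on the standard CUPED result that a single covariate with correlation ρ to the outcome removes a fraction ρ² of the variance.

```python
from math import sqrt


def effective_sample_multiplier(variance_reduction: float) -> float:
    """How many times larger the sample effectively becomes.

    Required sample size is proportional to variance, so removing a
    fraction r of the variance is equivalent to multiplying n by 1 / (1 - r).
    """
    return 1 / (1 - variance_reduction)


def correlation_needed(variance_reduction: float) -> float:
    """Covariate-outcome correlation required for a given variance reduction."""
    return sqrt(variance_reduction)


for r in (0.25, 0.40):
    gain = effective_sample_multiplier(r) - 1
    rho = correlation_needed(r)
    print(f"{r:.0%} less variance -> {gain:.0%} more effective sample "
          f"(needs correlation of about {rho:.2f})")
```

A covariate that correlates at roughly 0.5 to 0.63 with the outcome is enough to reach the 25-40% range quoted above.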

CUPED is implemented natively in Statsig and Optimizely, and it can be calculated manually in any analytics tool that supports linear regression. The manual calculation for a simple two-group comparison:

  1. Fit a regression on the pooled data from both groups: Y_i = α + β × X_i + ε_i (where Y is the metric measured during the experiment and X is the pre-experiment covariate)
  2. Calculate adjusted outcomes: Y_i_adjusted = Y_i - β × X_i
  3. Run the experiment analysis on Y_adjusted instead of Y

The variance of Y_adjusted is always less than or equal to the variance of Y, and the treatment effect estimate on Y_adjusted is unbiased.
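The three steps can be sketched in plain Python on simulated data. Everything about the data here is an assumption made for illustration: the covariate distribution, the 0.8 slope, the noise level, and the true lift of 0.5. The sketch also subtracts β × (X_i - mean(X)) instead of β × X_i, a common variant that keeps the overall mean of the metric unchanged without affecting the difference between groups.

```python
import random
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return y_i - beta * (x_i - mean(x)), with beta = cov(x, y) / var(x)."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    beta = cov_xy / variance(x)  # slope of the pooled regression of y on x
    return [yi - beta * (xi - x_bar) for xi, yi in zip(x, y)]


def diff_in_means(values, treated):
    """Treatment mean minus control mean."""
    t = [v for v, is_treated in zip(values, treated) if is_treated]
    c = [v for v, is_treated in zip(values, treated) if not is_treated]
    return mean(t) - mean(c)


# Simulated experiment (all numbers are illustrative assumptions).
random.seed(7)
n = 2000
x = [random.gauss(10, 3) for _ in range(n)]   # pre-experiment engagement
treated = [i % 2 == 1 for i in range(n)]      # alternating assignment
y = [0.8 * xi + random.gauss(0, 2) + (0.5 if t else 0.0)
     for xi, t in zip(x, treated)]            # outcome with a true lift of 0.5

y_adj = cuped_adjust(y, x)                    # beta estimated on pooled data

print(f"Variance before: {variance(y):.2f}, after: {variance(y_adj):.2f}")
print(f"Estimated lift before: {diff_in_means(y, treated):.3f}, "
      f"after: {diff_in_means(y_adj, treated):.3f}")
```

Because X is measured before assignment, it is independent of the treatment, which is why subtracting a multiple of it shrinks the noise without biasing the estimated lift.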

For teams running product analytics instrumentation, having clean pre-experiment event data is the prerequisite for CUPED. If your instrumentation is incomplete, variance reduction is unavailable.

Sequential Testing for Small-Traffic Environments

Fixed-horizon testing is ideal when you can commit to running an experiment for its full planned duration without peeking at results. For low-traffic SaaS teams with short planning cycles, this commitment is often impractical.

The mSPRT (mixture Sequential Probability Ratio Test) allows teams to:

  • Monitor results continuously during the experiment
  • Stop early for business reasons (a launch deadline, a guardrail breach) without inflating false positive rates
  • Reach a valid conclusion whenever the accumulated evidence crosses a predetermined threshold

The tradeoff is a modest efficiency cost — mSPRT requires approximately 5-10% more samples than a fixed-horizon test run to its planned duration. For low-traffic teams, this is usually acceptable.
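To make the "predetermined threshold" concrete, here is a minimal sketch of one common closed form of the mSPRT: a normal approximation with known variance σ² and a N(0, τ²) mixing distribution over the true difference. The choice of τ² is a tuning assumption, the function names are hypothetical, and commercial platforms may implement the test differently.

```python
from math import exp, log


def msprt_log_lambda(n: int, mean_diff: float, sigma2: float, tau2: float) -> float:
    """Log mixture likelihood ratio after n users per arm (null: no difference)."""
    denom = 2 * sigma2 + n * tau2
    shrink = 0.5 * log(2 * sigma2 / denom)
    evidence = (n ** 2) * tau2 * mean_diff ** 2 / (4 * sigma2 * denom)
    return shrink + evidence


def always_valid_p_values(control, treatment, sigma2: float, tau2: float):
    """Running p-value after each pair of users; it can be checked at any time."""
    p_value, sum_c, sum_t, history = 1.0, 0.0, 0.0, []
    for n, (c, t) in enumerate(zip(control, treatment), start=1):
        sum_c += c
        sum_t += t
        log_lam = msprt_log_lambda(n, (sum_t - sum_c) / n, sigma2, tau2)
        # The p-value is 1 / (largest likelihood ratio seen so far), capped at 1.
        p_value = min(p_value, exp(-log_lam))
        history.append(p_value)
    return history


# Toy streams: a strong, consistent difference between arms.
control = [0, 1, 0, 0] * 100      # 25% conversion
treatment = [1, 1, 0, 0] * 100    # 50% conversion
p_values = always_valid_p_values(control, treatment, sigma2=0.25, tau2=0.01)
crossed = next((i + 1 for i, p in enumerate(p_values) if p < 0.05), None)
print(f"First look with p < 0.05: after {crossed} users per arm")
```

Because the running p-value never increases, stopping the first time it drops below 0.05 keeps the false positive rate at or below 5% under the model's assumptions, no matter how often the team looks.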

When to use sequential testing:

  • Experiment duration would exceed 28 days with fixed-horizon design
  • Launch deadlines may require early decisions
  • Guardrail metric monitoring requires early stopping capability (see guardrail metrics)

When to use fixed-horizon testing:

  • Duration is under 21 days and the team can commit to not acting on early results
  • Primary metric has high day-of-week variance (sequential tests can struggle with periodic patterns)

The Holdout Strategy for Feature Clusters

When individual features are too small to experiment on (because each would need a prohibitively large sample), the holdout design tests the cumulative effect of an entire product direction.

The mechanics: randomly assign 5-10% of users to a permanent holdout group that does not receive any new features in a defined area (say, the onboarding flow) for a quarter. The remaining 90-95% receives all new features as they ship. At the end of the quarter, compare the holdout group to the non-holdout group on key outcomes (retention, conversion, revenue).
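One way to make the holdout assignment permanent is to derive it from a hash of the user ID, so the same user lands in the same group on every request without a lookup table. The sketch below is an assumption about implementation, not a description of any particular platform; the salt string and function name are hypothetical.

```python
import hashlib


def in_holdout(user_id: str, salt: str = "onboarding-holdout",
               holdout_pct: float = 0.05) -> bool:
    """Deterministically place roughly holdout_pct of users in the holdout.

    The salt names the product area and period, so a new holdout can be
    drawn next quarter by changing it.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return bucket < holdout_pct


# The assignment is stable across calls and close to the target share.
user_ids = [f"user-{i}" for i in range(20000)]
share = sum(in_holdout(u) for u in user_ids) / len(user_ids)
print(f"Holdout share: {share:.1%}")
```

Feature flags for the protected area then check `in_holdout` before enabling anything new for a user.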

The holdout strategy has two advantages for low-traffic teams:

  1. The cumulative effect of multiple features is often large enough to be detectable even when individual effects are not
  2. It separates product direction validation from individual feature validation — a useful distinction when the team is uncertain whether a whole roadmap direction is working
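Whether the cumulative effect is detectable depends on the size of the holdout, because the smaller group dominates the standard error. The sketch below extends the two-sample approximation to an unequal split. The 50,000 quarterly users and the 40% retention baseline are illustrative assumptions, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def detectable_delta_unequal(p: float, n_holdout: int, n_exposed: int,
                             alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest detectable absolute difference in a proportion, unequal groups."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sqrt(p * (1 - p) * (1 / n_holdout + 1 / n_exposed))


quarterly_users = 50_000   # assumption for illustration
baseline = 0.40            # assumed 30-day retention

for holdout_pct in (0.05, 0.10):
    n_holdout = int(quarterly_users * holdout_pct)
    n_exposed = quarterly_users - n_holdout
    delta = detectable_delta_unequal(baseline, n_holdout, n_exposed)
    print(f"{holdout_pct:.0%} holdout -> detectable difference of "
          f"about {delta * 100:.1f} pp")
```

Running the numbers before the quarter starts shows whether a 5% holdout is enough or whether 10% is needed to see the effect size the team cares about.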

The limitation is that holdouts cannot attribute effects to specific features. When a holdout result shows that the new onboarding direction improved 30-day retention by 4 pp, the team does not know which specific changes drove the improvement. This is acceptable when the goal is directional validation rather than feature-level attribution.

For teams connecting experimentation to their growth roadmap, the growth experiments velocity playbook covers how to sequence holdouts and individual experiments across a quarter.

Conclusion

The traffic constraint is real, but it is not a barrier to evidence-based product development — it is a parameter that shapes which methods are appropriate. Setting a realistic MDE, applying variance reduction, and choosing experiment designs calibrated to available sample sizes produce teams that make better decisions than teams waiting for more traffic to run the experiments they wish they could run.

The SaasDash experiment sizing tool calculates achievable MDEs based on your actual traffic levels and metric variances, and recommends the most efficient design for each experiment type. If your team has been running underpowered experiments without realizing it, the audit report will show you which past results were below 80% power — an important baseline for understanding which of your historical "findings" are reliable.

Frequently Asked Questions

What is the minimum detectable effect (MDE)?
The minimum detectable effect is the smallest true difference between control and treatment that an experiment is designed to detect with adequate statistical power. It is set before the experiment launches and determines the required sample size. An experiment designed to detect a 2% improvement requires a much larger sample than one designed to detect a 10% improvement. Setting the MDE too small for available traffic produces experiments that are underpowered — they cannot reliably distinguish real effects from noise.
How does low traffic affect experiment design?
Low traffic means each user is a scarce resource. If your experiment page receives 2,000 unique visitors per month, even a 50/50 split gives you only 1,000 users per variant per month. To detect a 5% relative improvement in a metric with 30% baseline conversion at 80% power and 5% significance, you need roughly 15,000 users per variant (n ≈ 16 × 0.21 / 0.015²). That means a 15-month experiment, which is practically impossible. The solution is either to increase the MDE to a level reachable with available traffic, or to use a more efficient experiment design.
What is statistical power and why does it matter?
Statistical power is the probability that an experiment correctly detects a true effect when one exists. A test with 80% power has a 20% chance of missing a real improvement — a false negative. Underpowered experiments are particularly dangerous in SaaS because teams run them to 'completion,' get a null result, and conclude the variant did not work, when in fact the experiment was never large enough to detect the improvement even if it was real.
What is CUPED and how does it reduce required sample size?
CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique developed at Microsoft. It uses pre-experiment observations of the primary metric, or of a covariate correlated with it, to reduce the variance of the treatment effect estimate, effectively increasing statistical power without adding more users. For most SaaS experiments with reliable pre-experiment data, CUPED reduces the variance, and therefore the required sample size, by 25-40%. It is available in several modern experimentation platforms.
When should a SaaS team use Bayesian experimentation instead of frequentist A/B testing?
Bayesian inference is most useful when the team cannot wait for a fixed-horizon experiment to complete, when prior knowledge about the likely effect size is strong, or when the cost of different types of errors is asymmetric. A Bayesian approach produces a probability that the variant is better than control — a more intuitive metric for business decisions — and allows valid inference at any sample size, though with wider uncertainty intervals at smaller samples.
What is a holdout experiment and when is it appropriate?
A holdout experiment keeps a small fraction of users (typically 5-10%) permanently in the control condition for an extended period, while the rest receive new features. This tests the cumulative effect of a product direction over weeks or months rather than evaluating individual features in isolation. Holdouts are most useful for measuring the aggregate impact of a feature cluster when individual experiments are too underpowered to detect their individual contributions.
