Sizing Experiments at Low Traffic: Choosing a Minimum Detectable Effect You Can Actually Reach
How SaaS teams with limited traffic set realistic minimum detectable effects, avoid underpowered experiments, and make confident product decisions without enterprise-scale sample sizes.
The sample size calculator is the most uncomfortable tool in the SaaS experimenter's kit. You enter your baseline conversion rate, your desired significance level, and the improvement you want to detect. The calculator returns a number, say 12,000 users per variant, and your monthly unique visitors are 8,000. The math says a two-variant experiment will take three months. You have a two-week sprint cycle. Something has to give.
Most teams handle this tension by either ignoring it (running the experiment anyway and calling the result significant when it is not) or abandoning quantitative experimentation entirely ("we just don't have the traffic to run A/B tests"). Both are wrong. The correct response is to understand what the minimum detectable effect actually means, set it at a level your traffic can reach, and choose the experiment design that maximizes what you can learn with the users you have.
Amplitude's product benchmarks show that the median B2B SaaS company has between 5,000 and 25,000 monthly active users. At the low end of that range, classical fixed-horizon A/B testing is genuinely difficult for anything smaller than a 10-15% relative improvement. But smaller effects are often still detectable with different methods.
What the MDE Actually Means — and Why Most Teams Set It Wrong
The minimum detectable effect is not the improvement you expect to see. It is the smallest improvement you care about detecting. Setting the MDE requires answering a business question: "If this variant improved [metric] by X%, would that change our product direction?"
Teams that set the MDE too small are not being rigorous — they are being impractical. An experiment designed to detect a 1% improvement in 14-day retention might require 200,000 users per variant. Unless your product has enterprise-scale traffic on the specific page being tested, this experiment cannot be run. But a 1% improvement in 14-day retention, while statistically real, may not justify the engineering cost of the change — particularly if the uncertainty interval around a "significant" result still spans from 0.1% to 1.9%.
The correct MDE is the threshold at which the improvement would be economically meaningful. For most SaaS metrics:
| Metric | Economically Meaningful Improvement | Typical MDE Range |
|---|---|---|
| Signup conversion rate | >5% relative (e.g., 10% → 10.5%) | 5-15% relative |
| Trial-to-paid conversion | >3% relative (e.g., 25% → 25.75%) | 5-12% relative |
| 30-day retention | >2 pp absolute | 2-5 pp absolute |
| Feature adoption rate | >5% relative | 5-20% relative |
| Time-to-first-value | >10% reduction | 10-25% reduction |
The second column reflects what actually moves ARR for a typical SaaS business. The third column reflects what traffic levels between 5,000 and 50,000 monthly uniques can actually detect in a 14-28 day experiment window. These ranges overlap — which means the MDE problem is solvable if teams are willing to be honest about the threshold.
Calculating Required Sample Size From First Principles
Rather than using a black-box calculator, understanding the formula gives teams the insight to optimize their experiment designs.
For a two-sample proportions test at 80% power and 5% significance, the per-variant sample size is approximately:
n ≈ 16 × σ² / δ²
Where σ² is the per-user variance of the metric and δ is the MDE in absolute terms. The key insight from this formula is that the required sample size scales with the inverse square of the MDE: cutting the MDE in half quadruples the required sample, and doubling the MDE reduces the required sample to one quarter.
This is why the single most effective intervention for low-traffic teams is reconsidering the MDE — not chasing more traffic or extending experiment duration.
The formula also shows that reducing variance (σ²) has a linear effect on sample size. This is where CUPED and other variance reduction techniques provide value.
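The rule of thumb above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a full power calculation: the function name `samples_per_variant` and the 35% baseline are assumptions chosen for the example.

```python
import math

def samples_per_variant(sigma_sq: float, delta: float) -> int:
    """Approximate per-variant sample size at 80% power and 5% significance.

    sigma_sq: per-user variance of the metric
    delta:    minimum detectable effect, in absolute terms
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    return math.ceil(16 * sigma_sq / delta ** 2)

# Illustrative baseline for a yes/no metric (assumption for this sketch).
baseline = 0.35
sigma_sq = baseline * (1 - baseline)  # Bernoulli variance p * (1 - p)

# Each halving of the MDE should roughly quadruple the required sample.
for delta in (0.04, 0.02, 0.01):
    n = samples_per_variant(sigma_sq, delta)
    print(f"MDE {delta:.2f} absolute -> about {n:,} users per variant")
```

Running the loop makes the inverse-square relationship concrete: each step down in MDE multiplies the required sample by roughly four, while a variance reduction would shrink it only proportionally.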
Example: A 14-day activation rate experiment
- Baseline activation rate: 35%
- Available unique visitors: 8,000 per 14-day period (4,000 per variant)
- Per-user variance of the activation outcome (a yes/no metric): σ² = 0.35 × (1 - 0.35) = 0.2275
- With 4,000 users per variant, detectable δ ≈ √(16 × 0.2275 / 4,000) ≈ 0.030, or 3.0 pp absolute ≈ 8.6% relative
This team can detect an 8-9% relative improvement in activation rate with adequate power in a 14-day experiment. If the experiment is testing an onboarding flow change, that is a meaningful threshold — an 8-9% improvement in activation is worth building.
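The same calculation can be run in reverse for any traffic level: start from the users you have and solve for the effect you can detect. The sketch below assumes the 16 × σ² / δ² approximation from earlier. The helper name `detectable_effect` is hypothetical, and the longer windows assume traffic keeps arriving at the same rate.

```python
import math

def detectable_effect(sigma_sq: float, n_per_variant: int) -> float:
    """Smallest absolute effect detectable at 80% power and 5% significance."""
    if n_per_variant <= 0:
        raise ValueError("n_per_variant must be positive")
    return math.sqrt(16 * sigma_sq / n_per_variant)

baseline = 0.35                       # baseline activation rate from the example
sigma_sq = baseline * (1 - baseline)  # per-user Bernoulli variance
users_per_variant_14d = 4000          # 8,000 visitors split across two variants

# Assumption: traffic accrues linearly, so a 28-day window doubles the sample.
for days in (14, 28, 42):
    n = users_per_variant_14d * days // 14
    delta = detectable_effect(sigma_sq, n)
    print(f"{days} days: {delta * 100:.1f} pp absolute, "
          f"{delta / baseline * 100:.1f}% relative")
```

Because the detectable effect shrinks with the square root of the sample, doubling the experiment window improves the MDE by a factor of about 1.4, not 2.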
Variance Reduction: Getting More Power Without More Users
CUPED (Controlled-experiment Using Pre-Experiment Data) is the most practical variance reduction tool available. The concept is straightforward: use each user's pre-experiment behavior as a covariate to reduce the residual variance of the treatment effect estimate.
If you are running an experiment on day-14 retention, and you have data on each user's engagement in the seven days before the experiment started, including that pre-period engagement as a CUPED covariate will typically reduce the variance of the day-14 retention estimate by 25-40%. A reduction of that size is equivalent to having 33-67% more effective sample size (1 / (1 - 0.25) ≈ 1.33 and 1 / (1 - 0.40) ≈ 1.67), without a single additional user.
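The conversion from variance reduction to effective sample size follows directly from the sample size formula, since n is proportional to σ². A short sketch (both function names are hypothetical):

```python
import math

def effective_sample_multiplier(variance_reduction: float) -> float:
    """How many times larger the sample behaves after cutting variance.

    variance_reduction is a fraction in [0, 1), e.g. 0.25 for a 25% cut.
    """
    if not 0 <= variance_reduction < 1:
        raise ValueError("variance_reduction must be in [0, 1)")
    return 1 / (1 - variance_reduction)

def mde_shrink_factor(variance_reduction: float) -> float:
    """Factor by which the detectable effect shrinks at a fixed sample size."""
    if not 0 <= variance_reduction < 1:
        raise ValueError("variance_reduction must be in [0, 1)")
    return math.sqrt(1 - variance_reduction)

for cut in (0.25, 0.40):
    gain = (effective_sample_multiplier(cut) - 1) * 100
    shrink = mde_shrink_factor(cut)
    print(f"{cut:.0%} less variance: {gain:.0f}% more effective sample, "
          f"MDE multiplied by {shrink:.2f}")
```

Note that the MDE improves more slowly than the effective sample: a 40% variance cut shrinks the detectable effect by a bit over a fifth.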
CUPED is implemented natively in Statsig and Optimizely, and it can be calculated manually in any analytics tool that supports linear regression. The manual calculation for a simple two-group comparison:
- Fit a regression on the pooled data from both groups: Y_i = α + β × X_i + ε_i (where Y is the metric measured during the experiment and X is the pre-experiment covariate)
- Adjust each outcome: Y_i_adjusted = Y_i - β × (X_i - mean(X))
- Run the experiment analysis on Y_adjusted instead of Y
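With β set to the least-squares slope, the variance of Y_adjusted is always less than or equal to the variance of Y. Because X is measured before assignment and is therefore unaffected by the treatment, the treatment effect estimate on Y_adjusted remains unbiased.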
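The manual steps can be sketched with the Python standard library alone. Everything in the example is synthetic: the data, the true lift of 0.5, and the function name `cuped_adjust` are assumptions for illustration. The sketch subtracts the covariate's mean before adjusting, a common convention that leaves the overall average unchanged and does not affect the difference between groups.

```python
import random
from statistics import fmean, variance

def cuped_adjust(y, x):
    """Adjust outcomes y using pre-experiment covariate x (pooled over arms)."""
    x_bar, y_bar = fmean(x), fmean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar)
                 for xi, yi in zip(x, y)) / (len(x) - 1)
    beta = cov_xy / variance(x)  # least-squares slope of y on x
    return [yi - beta * (xi - x_bar) for xi, yi in zip(x, y)]

# Synthetic experiment: the outcome depends on pre-period engagement.
rng = random.Random(7)
n = 2000
x = [rng.gauss(10, 3) for _ in range(n)]   # pre-experiment engagement
treated = [i % 2 == 0 for i in range(n)]   # alternating assignment
y = [0.8 * xi + rng.gauss(0, 2) + (0.5 if t else 0.0)
     for xi, t in zip(x, treated)]

y_adj = cuped_adjust(y, x)

def lift(values):
    """Difference in means, treatment minus control."""
    t = [v for v, is_t in zip(values, treated) if is_t]
    c = [v for v, is_t in zip(values, treated) if not is_t]
    return fmean(t) - fmean(c)

print(f"variance raw: {variance(y):.2f}, adjusted: {variance(y_adj):.2f}")
print(f"lift raw: {lift(y):.3f}, adjusted: {lift(y_adj):.3f}")
```

The stronger the correlation between the covariate and the outcome, the larger the variance cut; a covariate with no predictive value leaves the variance essentially unchanged.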
For teams running product analytics instrumentation, having clean pre-experiment event data is the prerequisite for CUPED. If your instrumentation is incomplete, variance reduction is unavailable.
Sequential Testing for Small-Traffic Environments
Fixed-horizon testing is ideal when you can commit to running an experiment for its full planned duration without peeking at results. For low-traffic SaaS teams with short planning cycles, this commitment is often impractical.
The mSPRT (mixture Sequential Probability Ratio Test) allows teams to:
- Monitor results continuously during the experiment
- Stop early for business reasons (a launch deadline, a guardrail breach) without inflating false positive rates
- Reach a valid conclusion whenever the accumulated evidence crosses a predetermined threshold
The tradeoff is a modest efficiency cost — mSPRT requires approximately 5-10% more samples than a fixed-horizon test run to its planned duration. For low-traffic teams, this is usually acceptable.
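As a rough illustration of how an mSPRT decision rule works, the sketch below uses a normal approximation with equal-sized arms and a normal mixing distribution over effect sizes. The mixing variance `tau_sq`, the checkpoint data, and the function names are all assumptions for the example; production platforms choose these details differently.

```python
import math

def msprt_likelihood_ratio(mean_diff, n_per_arm, sigma_sq, tau_sq):
    """Mixture likelihood ratio against H0: no difference between arms.

    mean_diff: observed difference in means (treatment minus control)
    n_per_arm: users observed so far in each arm (assumed equal)
    sigma_sq:  per-user variance of the metric
    tau_sq:    variance of the normal mixing distribution over effects
    """
    v = 2 * sigma_sq / n_per_arm  # variance of the observed difference
    shrink = math.sqrt(v / (v + tau_sq))
    exponent = (mean_diff ** 2 / 2) * (1 / v - 1 / (v + tau_sq))
    return shrink * math.exp(exponent)

def always_valid_p_values(checkpoints, sigma_sq, tau_sq):
    """Running p-values that stay valid under continuous monitoring."""
    p, history = 1.0, []
    for n_per_arm, mean_diff in checkpoints:
        ratio = msprt_likelihood_ratio(mean_diff, n_per_arm, sigma_sq, tau_sq)
        p = min(p, 1 / ratio)  # p-value can only decrease over time
        history.append(p)
    return history

# Hypothetical interim looks: (users per arm, observed absolute lift).
looks = [(1000, 0.010), (2000, 0.030), (3000, 0.045), (4000, 0.050)]
sigma_sq = 0.35 * (1 - 0.35)  # 35% baseline conversion
tau_sq = 0.03 ** 2            # assumed prior scale of plausible effects

for (n, diff), p in zip(looks, always_valid_p_values(looks, sigma_sq, tau_sq)):
    verdict = "stop: significant" if p < 0.05 else "continue"
    print(f"n={n} per arm, lift={diff:.3f}, p={p:.4f} -> {verdict}")
```

The team can check this p-value at every look and stop the first time it drops below the significance level, which is exactly what a fixed-horizon test does not permit.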
When to use sequential testing:
- Experiment duration would exceed 28 days with fixed-horizon design
- Launch deadlines may require early decisions
- Guardrail metric monitoring requires early stopping capability (see guardrail metrics)
When to use fixed-horizon testing:
- Duration is under 21 days and the team can commit to not acting on early results
- Primary metric has high day-of-week variance (sequential tests can struggle with periodic patterns)
The Holdout Strategy for Feature Clusters
When individual features are too small to experiment on (because each would need a prohibitively large sample), the holdout design tests the cumulative effect of an entire product direction.
The mechanics: randomly assign 5-10% of users to a permanent holdout group that does not receive any new features in a defined area (say, the onboarding flow) for a quarter. The remaining 90-95% receives all new features as they ship. At the end of the quarter, compare the holdout group to the non-holdout group on key outcomes (retention, conversion, revenue).
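One way to implement a stable holdout assignment is to hash each user ID with a holdout-specific salt, so a user lands in the same group on every request without a lookup table. This is a sketch under assumptions: the salt string, the 5% default, and the function name `in_holdout` are hypothetical.

```python
import hashlib

def in_holdout(user_id: str,
               salt: str = "onboarding-holdout-q1",
               holdout_pct: float = 0.05) -> bool:
    """Deterministically place roughly holdout_pct of users in the holdout."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < holdout_pct

# Quick sanity check on synthetic IDs: the share should be close to 5%.
ids = [f"user-{i}" for i in range(20000)]
share = sum(in_holdout(u) for u in ids) / len(ids)
print(f"holdout share: {share:.3%}")
```

Changing the salt for each new holdout period reshuffles users, which avoids keeping the same people excluded from new features quarter after quarter.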
The holdout strategy has two advantages for low-traffic teams:
- The cumulative effect of multiple features is often large enough to be detectable even when individual effects are not
- It separates product direction validation from individual feature validation — a useful distinction when the team is uncertain whether a whole roadmap direction is working
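Because the holdout split is deliberately unequal, the equal-split sample size shortcut does not apply directly. A rough sketch of the detectable effect for unequal groups follows, using the same 80% power and 5% significance assumptions. The 40% retention baseline and the 24,000 quarterly users are hypothetical figures.

```python
import math

Z_ALPHA = 1.96  # two-sided 5% significance
Z_POWER = 0.84  # 80% power

def holdout_mde(sigma_sq: float, n_holdout: int, n_exposed: int) -> float:
    """Smallest absolute difference detectable between unequal groups."""
    if n_holdout <= 0 or n_exposed <= 0:
        raise ValueError("group sizes must be positive")
    std_err = math.sqrt(sigma_sq * (1 / n_holdout + 1 / n_exposed))
    return (Z_ALPHA + Z_POWER) * std_err

baseline = 0.40                       # hypothetical 30-day retention
sigma_sq = baseline * (1 - baseline)  # per-user Bernoulli variance
quarterly_users = 24000               # hypothetical: 8,000 per month

for pct in (0.05, 0.10):
    n_holdout = round(quarterly_users * pct)
    n_exposed = quarterly_users - n_holdout
    mde = holdout_mde(sigma_sq, n_holdout, n_exposed)
    print(f"{pct:.0%} holdout: detectable difference of {mde * 100:.1f} pp")
```

The smaller group dominates the standard error, so moving from a 5% to a 10% holdout tightens the detectable difference noticeably, at the cost of withholding features from more users.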
The limitation is that holdouts cannot attribute effects to specific features. When a holdout result shows that the new onboarding direction improved 30-day retention by 4 pp, the team does not know which specific changes drove the improvement. This is acceptable when the goal is directional validation rather than feature-level attribution.
For teams connecting experimentation to their growth roadmap, the growth experiments velocity playbook covers how to sequence holdouts and individual experiments across a quarter.
Conclusion
The traffic constraint is real, but it is not a barrier to evidence-based product development — it is a parameter that shapes which methods are appropriate. Setting a realistic MDE, applying variance reduction, and choosing experiment designs calibrated to available sample sizes produce teams that make better decisions than teams waiting for more traffic to run the experiments they wish they could run.
The SaasDash experiment sizing tool calculates achievable MDEs based on your actual traffic levels and metric variances, and recommends the most efficient design for each experiment type. If your team has been running underpowered experiments without realizing it, the audit report will show you which past results were below 80% power — an important baseline for understanding which of your historical "findings" are reliable.