Controlling False Positives When You Run Dozens of SaaS Experiments a Quarter

A practical guide to multiple comparisons correction, experiment sequencing, and false discovery rate control for SaaS teams running high-velocity experimentation programs.

SaaS Science Team · June 14, 2026 · 9 min read
Tags: false positives, experimentation, statistical rigor, multiple comparisons, a/b testing, saas growth

High-velocity experimentation is widely considered a competitive advantage. Amplitude's 2024 data shows that the top quartile of SaaS product teams runs more than 40 experiments per quarter. At that pace, teams ship improvements faster, learn more, and make fewer large bets based on intuition. The problem is that high velocity amplifies a statistical problem that almost nobody discusses honestly: when you run many tests, some of your wins are not wins.

At a standard significance threshold of p<0.05, a variant with no real effect still has a 5% chance of being declared a winner. Run 50 experiments per quarter at that threshold and, if most of the variants have no real effect, you should expect roughly 2-3 of your declared winners to be noise. Over a year, that is 8-12 false positive results baked into your product: features that shipped because they "won" an experiment but do not actually help customers. This is the multiple comparisons problem, and it is not a theoretical concern. It is a practical one with measurable consequences.
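The arithmetic behind those figures can be sketched in a few lines. This is a minimal illustration that assumes every tested variant has no real effect and that the tests are independent; the function names are made up for this post.

```python
def expected_false_positives(n_null_tests: int, alpha: float = 0.05) -> float:
    """Expected number of false winners among tests where the variant does nothing."""
    return n_null_tests * alpha


def prob_at_least_one_false_positive(n_null_tests: int, alpha: float = 0.05) -> float:
    """Chance of one or more false winners, assuming the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_null_tests


per_quarter = expected_false_positives(50)        # 50 experiments in a quarter
per_year = expected_false_positives(200)          # four such quarters
any_false_win = prob_at_least_one_false_positive(50)

print(f"Expected false winners per quarter: {per_quarter:.1f}")
print(f"Expected false winners per year: {per_year:.1f}")
print(f"Chance of at least one false winner in a quarter: {any_false_win:.0%}")
```

In a real program some variants do have an effect, so the number of true nulls is lower than the number of experiments and these figures are an upper bound.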

This post covers the correction methods that work at SaaS scale: when to use Bonferroni versus Benjamini-Hochberg, how to pre-register experiments to prevent post-hoc rationalization, and when sequential testing designs are the right tool.

Understanding What False Positives Cost

Before discussing how to control false positive rates, it is worth quantifying what they cost in a SaaS context. The cost has two components: the direct cost of building and shipping a change that does nothing, and the signal degradation cost of a false conclusion entering the team's understanding of the product.

Direct cost: When a false positive ships, the team expends engineering effort to implement a change that has no real effect. For a feature that requires two engineer-weeks to implement, at a fully-loaded cost of $15,000 per engineer-week, a single false positive costs $30,000 in development. A portfolio of 10 false positives per year — entirely plausible at 50 experiments per quarter — costs $300,000 in wasted engineering before accounting for maintenance burden.

Signal degradation cost: This is harder to quantify but potentially larger. Each false positive pollutes the team's causal model of the product. When the "winning" feature from Q2 is cited as evidence that users prefer shorter onboarding, and that claim influences three subsequent design decisions, the false positive has propagated through the team's entire knowledge base.

Ron Kohavi, Diane Tang, and Ya Xu, in Trustworthy Online Controlled Experiments, found that at large technology companies, the replication rate for internally-run A/B test winners was roughly 60-70% — meaning 30-40% of declared winners did not hold up when re-tested. For smaller SaaS companies with less statistical infrastructure, the replication rate is likely lower.

The Two Correction Approaches and When to Use Each

Bonferroni Correction: When One False Positive Is Unacceptable

The Bonferroni correction is the conservative option. It divides the significance threshold by the number of tests, which holds the family-wise error rate (the probability that at least one test in the family produces a false positive) at or below 5%.

For 10 concurrent tests: alpha = 0.05 / 10 = 0.005

The practical implication is that individual tests need larger sample sizes to achieve significance at the corrected threshold. A test that would require 5,000 users per variant at p<0.05 requires approximately 8,500 users per variant at p<0.005, assuming 80% power and the same minimum detectable effect.
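The sample size penalty can be estimated from the usual normal-approximation formula, in which the required sample scales with the square of the sum of the two critical values. The sketch below assumes a two-sided test and 80% power; both are assumptions of this example, and `sample_size_inflation` is a hypothetical helper.

```python
from statistics import NormalDist


def sample_size_inflation(alpha_base: float, alpha_corrected: float,
                          power: float = 0.80) -> float:
    """Ratio of required sample sizes when alpha is tightened, holding the
    minimum detectable effect and power fixed (two-sided z-test)."""
    z = NormalDist().inv_cdf
    z_power = z(power)
    base = (z(1 - alpha_base / 2) + z_power) ** 2
    corrected = (z(1 - alpha_corrected / 2) + z_power) ** 2
    return corrected / base


n_tests = 10
ratio = sample_size_inflation(0.05, 0.05 / n_tests)
users_needed = round(5000 * ratio)

print(f"Sample size multiplier: {ratio:.2f}")
print(f"Users per variant after correction: {users_needed}")
```

The multiplier depends on the power you plan for, so a team targeting 90% power will see a somewhat different figure.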

Use Bonferroni when:

  • The experiments share a common population and influence each other's results
  • A single false positive would lead to a high-cost decision (pricing change, infrastructure rebuild)
  • The experiments are testing variations on a single hypothesis (multivariate testing)

Bonferroni limitations: The correction remains valid under any dependence between tests, but it becomes more conservative than necessary when tests are positively correlated. That is common in a SaaS product, where a change to onboarding affects the population available for retention experiments. In that situation Bonferroni overcorrects, reducing power more than the 5% family-wise target requires.

Benjamini-Hochberg Procedure: When You Want to Maximize Discoveries

The Benjamini-Hochberg (BH) procedure controls the False Discovery Rate (FDR) — not the probability of any false positive, but the expected proportion of positive results that are false. At FDR = 10%, among all experiments the team declares significant, roughly 10% are expected to be noise.

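The BH procedure works by ranking the m test p-values from smallest to largest and finding the largest rank k whose p-value is at most (k/m) × FDR. The k smallest p-values are then declared significant. Because FDR is a less strict error criterion than the family-wise error rate, BH is substantially more powerful than Bonferroni at the same nominal level, allowing more true discoveries to be detected.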
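A minimal sketch of the step-up rule, written in plain Python so the ranking logic is visible. The p-values are invented for illustration, and `benjamini_hochberg` is a hypothetical helper rather than any platform's API.

```python
def benjamini_hochberg(p_values: list[float], fdr: float = 0.10) -> list[bool]:
    """Return a reject/keep flag per test, in the same order as p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p

    # Find the largest rank k with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    # Everything at or below that rank is declared significant, even if an
    # individual p-value missed its own threshold along the way.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Illustrative p-values from ten experiments (made up for this example).
p_values = [0.041, 0.001, 0.205, 0.008, 0.062, 0.039, 0.074, 0.042, 0.212, 0.216]

bh_flags = benjamini_hochberg(p_values, fdr=0.10)
bonferroni_flags = [p <= 0.10 / len(p_values) for p in p_values]

print("BH discoveries:", sum(bh_flags))
print("Bonferroni discoveries:", sum(bonferroni_flags))
```

On this made-up batch, BH keeps more results than Bonferroni at the same nominal level, which is the power difference described above. For production use, an established statistics library is a safer choice than a hand-rolled function.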

Use Benjamini-Hochberg when:

  • Experiments are largely independent (testing different features, different user segments, different funnel stages)
  • The team runs high volumes of experiments and wants to maximize the rate of true discoveries
  • A modest false discovery rate (10-20%) is acceptable because downstream validation processes exist

For most SaaS teams running 20-50 experiments per quarter on independent product areas, BH at FDR=10% is the right default. It captures more real improvements than Bonferroni while bounding the false discovery proportion to an acceptable level.

The experiment design rigor that makes these corrections effective is covered in detail in SaaS Pricing A/B Test Design Rigor and the retention experiment design framework.

Pre-Registration: The Non-Statistical Intervention With the Highest ROI

Statistical corrections fix one source of false positives — the multiple comparisons problem. They do not fix a separate source that is often larger in practice: post-hoc analysis flexibility, or what researchers call "researcher degrees of freedom."

The pattern is familiar: the experiment runs, the primary metric does not reach significance, but the PM notices that a particular user segment showed a strong positive effect. The segment result is reported as the finding. The next experiment targets that segment. The statistical machinery looks rigorous, but the "discovery" of the segment was not pre-specified — it was found by searching through the data after results were available.

Pre-registration eliminates this by committing to the analysis plan before the data is collected. The pre-registration document records:

  1. Hypothesis: Exactly what you expect the variant to do and why
  2. Primary metric: The single metric that determines ship/no-ship
  3. Sample size: Calculated from the minimum detectable effect and power requirements
  4. Experiment duration: Start and end date, not subject to change based on results
  5. Segmentation rules: Any subgroup analyses that will be reported, specified in advance
  6. Guardrail metrics: Pre-specified thresholds that block shipping (see guardrail metrics framework)

The last item connects directly to the guardrail framework. Pre-registration and guardrails together create an experiment design that is robust to both statistical noise and motivated reasoning.
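One way to make the commitment hard to revise quietly is to store the plan as an immutable record. The sketch below is only an illustration: the class, field names, and example values are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class PreRegistration:
    hypothesis: str
    primary_metric: str
    sample_size_per_variant: int
    start_date: date
    end_date: date
    segments: tuple = ()                            # subgroups declared in advance
    guardrails: dict = field(default_factory=dict)  # metric name -> blocking threshold

    def planned_days(self) -> int:
        return (self.end_date - self.start_date).days

    def is_confirmatory(self, segment: str) -> bool:
        """Only pre-specified segments count as findings; the rest are exploratory."""
        return segment in self.segments


# Hypothetical example plan.
plan = PreRegistration(
    hypothesis="Shorter onboarding raises week-1 activation",
    primary_metric="week1_activation_rate",
    sample_size_per_variant=8500,
    start_date=date(2026, 7, 1),
    end_date=date(2026, 7, 29),
    segments=("new_signups",),
    guardrails={"support_tickets_per_user": 0.05},
)

print(plan.planned_days(), plan.is_confirmatory("enterprise"))
```

A frozen record does not stop someone from writing a new plan after the fact, so the record still needs a timestamp from a system the analyst does not control.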

Sequential Testing: Monitoring Results Without Inflating Error Rates

Fixed-horizon experiments — run for a pre-specified duration, analyze once at the end — are the statistical gold standard. They are also often impractical. Teams want to know if experiments are working before the 28-day window closes. Guardrail breaches may require early stopping. Launches with hard deadlines may force early reads.

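The traditional "peek early, decide early" approach inflates false positive rates. An experiment checked once at the halfway point and again at the end, stopping whenever p<0.05, has a false positive rate of roughly 8% rather than 5%, because the test has been given two chances to cross the threshold. With five evenly spaced looks the rate is about 14%, close to three times the nominal rate, and it keeps climbing with more frequent checks.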
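A short simulation makes the inflation concrete. The sketch below assumes equally sized batches of data between looks, no true effect, and a two-sided z-test at each look; `peeking_false_positive_rate` is a hypothetical name.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(n_looks: int, n_sims: int = 20000,
                                alpha: float = 0.05, seed: int = 0) -> float:
    """Share of null experiments declared significant at any of n_looks looks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            # Each batch contributes a standard normal increment under the null.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / look ** 0.5  # z-statistic on all data so far
            if abs(z) > z_crit:
                false_positives += 1
                break  # the team stops and ships at the first "significant" look
    return false_positives / n_sims


rate_one_look = peeking_false_positive_rate(1)
rate_two_looks = peeking_false_positive_rate(2)
rate_five_looks = peeking_false_positive_rate(5)

print(f"1 look: {rate_one_look:.3f}")
print(f"2 looks: {rate_two_looks:.3f}")
print(f"5 looks: {rate_five_looks:.3f}")
```

With a single look the simulated rate sits near the nominal 5%, and it grows with each additional look.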

Sequential testing designs solve this by specifying a stopping rule in advance that accounts for multiple looks.

Sequential Probability Ratio Test (SPRT): Tests at each observation whether the data is more consistent with the null hypothesis or the alternative. Reaches a definitive conclusion faster than fixed-horizon tests on average, while maintaining the pre-specified error rates. Requires specifying the minimum detectable effect in advance.
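To show the mechanics, here is a minimal sketch of Wald's SPRT for a single conversion rate tested against a known baseline. That is a simplification of a two-arm A/B test, chosen to keep the example short. The function name and return labels are hypothetical, and the stopping boundaries are Wald's standard approximations.

```python
import math


def sprt_bernoulli(observations, p0: float, p1: float,
                   alpha: float = 0.05, beta: float = 0.20):
    """Sequentially test H0: rate = p0 against H1: rate = p1.

    Returns (decision, n_used), where decision is "accept_h1",
    "accept_h0", or "continue" if the data ran out first.
    """
    upper = math.log((1 - beta) / alpha)   # cross above: evidence favors H1
    lower = math.log(beta / (1 - alpha))   # cross below: evidence favors H0
    log_likelihood_ratio = 0.0
    n = 0
    for n, converted in enumerate(observations, start=1):
        if converted:
            log_likelihood_ratio += math.log(p1 / p0)
        else:
            log_likelihood_ratio += math.log((1 - p1) / (1 - p0))
        if log_likelihood_ratio >= upper:
            return "accept_h1", n
        if log_likelihood_ratio <= lower:
            return "accept_h0", n
    return "continue", n


# Baseline 10% conversion; minimum detectable effect lifts it to 15%.
print(sprt_bernoulli([1, 0, 0, 1, 0], p0=0.10, p1=0.15))
```

The minimum detectable effect enters through `p1`, which is why it has to be fixed before the experiment starts.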

mSPRT (mixture SPRT): A variant of the SPRT that averages the likelihood ratio over a range of plausible effect sizes. It was adapted for online experimentation by researchers at Stanford and Optimizely. It produces always-valid p-values that can be read at any point during the experiment, allowing teams to stop when business conditions require it without inflating false positive rates. Implemented natively in Optimizely, Statsig, and several other modern experimentation platforms.
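For intuition, the mixture likelihood ratio has a closed form when observations are normal with known variance and the effect size is given a normal mixing distribution. The sketch below uses that textbook form on a stream of treatment-minus-control differences. It is an illustration under those assumptions, not a description of how any platform implements it, and `always_valid_p_values`, `sigma2`, and `tau2` are names chosen for this example.

```python
import math


def always_valid_p_values(differences, sigma2: float, tau2: float = 1.0,
                          theta0: float = 0.0) -> list[float]:
    """Always-valid p-values from a normal-mixture SPRT.

    differences: stream of observations assumed N(theta, sigma2)
    sigma2:      known variance of each observation
    tau2:        variance of the normal mixing distribution over theta
    theta0:      effect under the null hypothesis
    """
    p_value = 1.0
    running_sum = 0.0
    p_values = []
    for n, d in enumerate(differences, start=1):
        running_sum += d - theta0
        denom = sigma2 + n * tau2
        # Log of the mixture likelihood ratio after n observations.
        log_lr = (0.5 * math.log(sigma2 / denom)
                  + tau2 * running_sum ** 2 / (2 * sigma2 * denom))
        # The p-value is the running minimum of 1 / likelihood ratio.
        p_value = min(p_value, math.exp(-log_lr))
        p_values.append(p_value)
    return p_values


p_effect = always_valid_p_values([0.5] * 200, sigma2=1.0)   # steady positive lift
p_null = always_valid_p_values([0.0] * 200, sigma2=1.0)     # no lift at all

print(f"Final p-value with effect: {p_effect[-1]:.2e}")
print(f"Final p-value without effect: {p_null[-1]:.2f}")
```

Because the p-value only ever moves downward, a team can check it as often as it likes and stop the first time it crosses the threshold.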

The tradeoff is efficiency: sequential tests are slightly less efficient than fixed-horizon tests when run to completion, requiring 5-10% more samples on average. For most SaaS teams, this overhead is negligible compared to the value of being able to make early decisions safely.

For teams also managing traffic constraints, the minimum detectable effect framework for low-traffic SaaS covers how these sequential methods interact with sample size planning.

Building a Practical False-Positive Control Policy

The theory is useful, but what matters in practice is a written policy that governs how experiments are analyzed before they run. A policy that can be implemented in a day:

Rule | Details
--- | ---
Pre-registration required | Hypothesis, primary metric, sample size committed 48h before launch
Multiple experiments on same population | Apply BH correction at FDR=10%
Pricing or high-cost experiments | Apply Bonferroni correction
Early stopping for guardrails | Permitted; use mSPRT if available, otherwise require >50% of planned duration elapsed
Segment analyses | Only pre-specified segments reported as findings; all others exploratory
Minimum experiment duration | 14 days, regardless of significance, to capture weekly seasonality

The minimum duration rule addresses a subtle source of misleading results that corrections do not fix: day-of-week effects. Users who sign up on Monday behave differently from users who sign up on Friday. With concurrent randomization both arms see the same mix of days, but an experiment that covers only part of a week over-represents some days. If the variant's effect differs by day, the measured lift will not match what the product sees over a normal week. Running for two full weeks removes most day-of-week confounding.
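Parts of a policy like this can be encoded as simple checks that run before an analysis is accepted. The sketch below covers three of the rules. The function names and return labels are hypothetical, and the precedence between the correction rules (high-cost first) is an assumption of this example.

```python
MIN_DAYS = 14  # minimum experiment duration from the policy


def choose_correction(high_cost: bool, shares_population: bool) -> str:
    """Pick the multiple-comparisons correction for an experiment."""
    if high_cost:                 # pricing or other high-cost decisions
        return "bonferroni"
    if shares_population:         # several experiments on the same population
        return "benjamini_hochberg_fdr_10"
    return "none"


def guardrail_stop_allowed(days_elapsed: int, planned_days: int,
                           has_msprt: bool) -> bool:
    """Early stop on a guardrail breach: allowed with mSPRT, otherwise only
    after more than half of the planned duration has elapsed."""
    return has_msprt or days_elapsed > 0.5 * planned_days


def duration_ok(planned_days: int) -> bool:
    """Planned duration must cover at least the policy minimum."""
    return planned_days >= MIN_DAYS


print(choose_correction(high_cost=False, shares_population=True))
print(guardrail_stop_allowed(days_elapsed=10, planned_days=28, has_msprt=False))
print(duration_ok(planned_days=10))
```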

Conclusion

False positive control is not a constraint on experimentation velocity — it is what makes experimentation results trustworthy at high velocity. A team that runs 50 experiments per quarter with rigorous pre-registration, appropriate multiple comparisons corrections, and sequential testing for early-stopping scenarios will ship fewer features than a team that ignores these practices. But the features it ships will actually work, and the knowledge base it accumulates will be reliable.

The SaasDash experimentation tools include built-in pre-registration templates and automated Benjamini-Hochberg correction for multi-experiment analysis batches. If your team is currently analyzing experiments without these controls, the experiment audit tool in the dashboard can scan your last 12 months of results and flag which declared winners may warrant re-testing.

Frequently Asked Questions

What is a false positive in A/B testing?
A false positive (also called a Type I error) occurs when a statistical test reports a significant result, suggesting that the experiment variant caused a real improvement, when in fact the observed difference was due to random sampling variation. At a significance threshold of p<0.05, a false positive will occur 5% of the time even when the treatment has zero actual effect. This rate compounds rapidly when multiple experiments are run.
Why does running many experiments increase false positive rates?
Each test at p<0.05 has a 5% chance of producing a false positive when the null hypothesis is true. Running 20 independent tests means approximately one will produce a false positive by chance. Running 50 tests produces roughly 2-3. If you treat each test result independently without correction, you will accumulate a body of 'winning' experiments where some fraction of the wins are purely statistical noise. This is the multiple comparisons problem.
What is the Bonferroni correction?
The Bonferroni correction adjusts the significance threshold by dividing it by the number of comparisons. If you run 10 experiments simultaneously, you test each at p<0.005 instead of p<0.05. This controls the family-wise error rate: the probability of even one false positive across the entire set of tests. It is conservative, which reduces statistical power, but is the right approach when a single false positive would lead to a costly product change.
What is the Benjamini-Hochberg procedure?
The Benjamini-Hochberg (BH) procedure controls the False Discovery Rate (FDR) — the expected proportion of significant results that are false positives — rather than the probability of any false positive. It is less conservative than Bonferroni, preserving more statistical power. For a team running many independent experiments, BH is appropriate when the cost of occasional false positives is acceptable and the goal is to maximize the number of true improvements discovered.
What is pre-registration and why does it reduce false positives?
Pre-registration means documenting the experiment hypothesis, primary metric, sample size, and analysis plan before launching the test. This prevents post-hoc changes to the analysis that exploit random variation in the data — a practice sometimes called p-hacking or HARKing (Hypothesizing After Results are Known). Teams that pre-register consistently report lower rates of experiment results that fail to replicate.
When should a SaaS team use sequential testing instead of fixed-horizon testing?
Sequential testing is appropriate when the team cannot commit to a fixed experiment duration in advance — for example, when a launch deadline might require an early decision, or when guardrail metric breaches might require early stopping. Sequential tests like SPRT and mSPRT maintain the nominal false positive rate even when results are examined multiple times. The tradeoff is that they require slightly larger sample sizes than fixed-horizon tests run to completion.
