Controlling False Positives When You Run Dozens of SaaS Experiments a Quarter
A practical guide to multiple comparisons correction, experiment sequencing, and false discovery rate control for SaaS teams running high-velocity experimentation programs.
High-velocity experimentation is widely considered a competitive advantage. Amplitude's 2024 data shows that the top quartile of SaaS product teams runs more than 40 experiments per quarter. At that pace, teams ship improvements faster, learn more, and make fewer large bets based on intuition. The problem is that high velocity amplifies a statistical problem that almost nobody discusses honestly: when you run many tests, some of your wins are not wins.
At a standard significance threshold of p<0.05, a variant with no real effect still has a 5% chance of being declared a winner. Run 50 experiments per quarter at that threshold and, if most variants have little or no real effect, expect roughly 2-3 of your declared winners to be noise. Over a year, that is 8-12 false positive results baked into your product: features that shipped because they "won" in an experiment but do not actually help customers. This is the multiple comparisons problem, and it is not a theoretical concern. It is a practical one with measurable consequences.
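The arithmetic behind those numbers can be reproduced in a few lines. This is a minimal sketch that assumes every test is a true null and that the tests are independent; the function names are illustrative, not from any library.

```python
def expected_false_positives(n_null_tests: int, alpha: float = 0.05) -> float:
    """Expected number of false winners among tests whose variant has no effect."""
    return n_null_tests * alpha


def prob_at_least_one_false_positive(n_null_tests: int, alpha: float = 0.05) -> float:
    """Chance of one or more false winners, assuming independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_null_tests


if __name__ == "__main__":
    for n in (1, 10, 50):
        expected = expected_false_positives(n)
        any_fp = prob_at_least_one_false_positive(n)
        print(f"{n:>2} tests: expected false positives = {expected:.1f}, "
              f"P(at least one) = {any_fp:.1%}")
```

Under these assumptions, 50 tests give an expected 2.5 false winners per quarter and a better than 90% chance of at least one.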
This post covers the correction methods that work at SaaS scale: when to use Bonferroni versus Benjamini-Hochberg, how to pre-register experiments to prevent post-hoc rationalization, and when sequential testing designs are the right tool.
Understanding What False Positives Cost
Before discussing how to control false positive rates, it is worth quantifying what they cost in a SaaS context. The cost has two components: the direct cost of the false positive and the opportunity cost of not investigating why the "winning" feature failed to move downstream metrics.
Direct cost: When a false positive ships, the team expends engineering effort to implement a change that has no real effect. For a feature that requires two engineer-weeks to implement, at a fully-loaded cost of $15,000 per engineer-week, a single false positive costs $30,000 in development. A portfolio of 10 false positives per year — entirely plausible at 50 experiments per quarter — costs $300,000 in wasted engineering before accounting for maintenance burden.
Signal degradation cost: This is harder to quantify but potentially larger. Each false positive pollutes the team's causal model of the product. When the "winning" feature from Q2 is cited as evidence that users prefer shorter onboarding, and that claim influences three subsequent design decisions, the false positive has propagated through the team's entire knowledge base.
Ron Kohavi, Diane Tang, and Ya Xu, in Trustworthy Online Controlled Experiments, found that at large technology companies, the replication rate for internally-run A/B test winners was roughly 60-70% — meaning 30-40% of declared winners did not hold up when re-tested. For smaller SaaS companies with less statistical infrastructure, the replication rate is likely lower.
The Two Correction Approaches and When to Use Each
Bonferroni Correction: When One False Positive Is Unacceptable
The Bonferroni correction is the conservative option. It divides the significance threshold by the number of tests, so that the probability of even one false positive across the whole family of tests (the family-wise error rate) stays at or below 5%.
For 10 concurrent tests: alpha = 0.05 / 10 = 0.005
The practical implication is that individual tests need larger sample sizes to achieve significance at the corrected threshold. A test that would require 5,000 users per variant at p<0.05 now requires approximately 8,500 users per variant at p<0.005, assuming a two-sided test at 80% power.
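The sample-size penalty can be estimated with the standard normal-approximation formula, in which required sample size scales with the square of the sum of the two critical z-values. The sketch below assumes a two-sided test at 80% power and uses only the Python standard library; the function names are illustrative.

```python
from statistics import NormalDist


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def sample_size_inflation(alpha_old: float, alpha_new: float, power: float = 0.80) -> float:
    """Ratio of required sample sizes when moving from alpha_old to alpha_new.

    Assumes a two-sided z-test, where n is proportional to
    (z_{1 - alpha/2} + z_{power}) ** 2 for a fixed effect size and variance.
    """
    z = NormalDist().inv_cdf
    z_power = z(power)
    old = (z(1 - alpha_old / 2) + z_power) ** 2
    new = (z(1 - alpha_new / 2) + z_power) ** 2
    return new / old


if __name__ == "__main__":
    corrected = bonferroni_alpha(0.05, 10)
    ratio = sample_size_inflation(0.05, corrected)
    print(f"corrected alpha = {corrected}")
    print(f"users per variant: 5000 -> about {5000 * ratio:.0f}")
```

A different power target or a one-sided test changes the ratio, so treat the output as a planning estimate.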
Use Bonferroni when:
- The experiments share a common population and influence each other's results
- A single false positive would lead to a high-cost decision (pricing change, infrastructure rebuild)
- The experiments are testing variations on a single hypothesis (multivariate testing)
Bonferroni limitations: The correction is valid whether or not the tests are independent, but that guarantee comes at a price. In a SaaS product, tests are often positively correlated, because a change to onboarding affects the population available for retention experiments. For correlated tests, Bonferroni is more conservative than it needs to be: it overcorrects, reducing power more than necessary.
Benjamini-Hochberg Procedure: When You Want to Maximize Discoveries
The Benjamini-Hochberg (BH) procedure controls the False Discovery Rate (FDR) — not the probability of any false positive, but the expected proportion of positive results that are false. At FDR = 10%, among all experiments the team declares significant, roughly 10% are expected to be noise.
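The BH procedure works by ranking all test p-values from smallest to largest and comparing each to a threshold that grows with its rank: the k-th smallest of m p-values is compared to (k/m) × FDR, and every test up to the largest rank that passes is declared significant. At the same nominal level it is substantially more powerful than Bonferroni, allowing more true discoveries to be detected, because it controls a less strict error criterion.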
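As an illustration of the ranking step, here is a minimal standard-library sketch of the BH step-up rule. The function name and the example p-values are made up for demonstration; a production pipeline would more likely call an established statistics package.

```python
def benjamini_hochberg(p_values: list[float], fdr: float = 0.10) -> list[bool]:
    """Return a reject/keep flag per test, in the original input order.

    Step-up rule: find the largest rank k with p_(k) <= (k / m) * fdr,
    then reject every hypothesis ranked 1..k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank  # keep the largest passing rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values from ten experiments in one analysis batch.
    batch = [0.001, 0.008, 0.039, 0.041, 0.042, 0.061, 0.074, 0.205, 0.212, 0.216]
    bh = benjamini_hochberg(batch, fdr=0.10)
    bonferroni = [p <= 0.10 / len(batch) for p in batch]
    print("BH discoveries:        ", sum(bh))
    print("Bonferroni discoveries:", sum(bonferroni))
```

Note the step-up behavior: the third and fourth p-values miss their own rank thresholds but are still rejected, because the fifth one passes.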
Use Benjamini-Hochberg when:
- Experiments are largely independent (testing different features, different user segments, different funnel stages)
- The team runs high volumes of experiments and wants to maximize the rate of true discoveries
- A modest false discovery rate (10-20%) is acceptable because downstream validation processes exist
For most SaaS teams running 20-50 experiments per quarter on independent product areas, BH at FDR=10% is the right default. It captures more real improvements than Bonferroni while bounding the false discovery proportion to an acceptable level.
The experiment design rigor that makes these corrections effective is covered in detail in SaaS Pricing A/B Test Design Rigor and the retention experiment design framework.
Pre-Registration: The Non-Statistical Intervention With the Highest ROI
Statistical corrections fix one source of false positives — the multiple comparisons problem. They do not fix a separate source that is often larger in practice: post-hoc analysis flexibility, or what researchers call "researcher degrees of freedom."
The pattern is familiar: the experiment runs, the primary metric does not reach significance, but the PM notices that a particular user segment showed a strong positive effect. The segment result is reported as the finding. The next experiment targets that segment. The statistical machinery looks rigorous, but the "discovery" of the segment was not pre-specified — it was found by searching through the data after results were available.
Pre-registration eliminates this by committing to the analysis plan before the data is collected. The pre-registration document records:
- Hypothesis: Exactly what you expect the variant to do and why
- Primary metric: The single metric that determines ship/no-ship
- Sample size: Calculated from the minimum detectable effect and power requirements
- Experiment duration: Start and end date, not subject to change based on results
- Segmentation rules: Any subgroup analyses that will be reported, specified in advance
- Guardrail metrics: Pre-specified thresholds that block shipping (see guardrail metrics framework)
The last item connects directly to the guardrail framework. Pre-registration and guardrails together create an experiment design that is robust to both statistical noise and motivated reasoning.
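One way to make the plan hard to edit after the fact is to store it as an immutable record and log a hash of it before launch. The sketch below is a hypothetical structure: the class name, field names, and example values are assumptions for illustration, not a SaasDash or platform schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class PreRegistration:
    """Hypothetical pre-registration record; frozen so fields cannot be reassigned."""
    hypothesis: str
    primary_metric: str
    sample_size_per_variant: int
    start_date: date
    end_date: date
    segments: tuple[str, ...] = ()                     # subgroups declared in advance
    guardrails: tuple[tuple[str, float], ...] = ()     # (metric, blocking threshold)

    def fingerprint(self) -> str:
        """SHA-256 of the plan; record it before launch to detect later edits."""
        payload = json.dumps(asdict(self), default=str, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def is_prespecified(self, segment: str) -> bool:
        """True only for segments listed before data collection started."""
        return segment in self.segments


if __name__ == "__main__":
    plan = PreRegistration(
        hypothesis="Shorter onboarding raises 7-day activation",
        primary_metric="activation_rate_7d",
        sample_size_per_variant=5000,
        start_date=date(2025, 1, 6),
        end_date=date(2025, 2, 3),
        segments=("enterprise",),
        guardrails=(("support_tickets_per_user", 0.05),),
    )
    print(plan.fingerprint()[:12], plan.is_prespecified("smb"))
```

Any segment for which `is_prespecified` returns False would be reported as exploratory rather than as a finding.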
Sequential Testing: Monitoring Results Without Inflating Error Rates
Fixed-horizon experiments — run for a pre-specified duration, analyze once at the end — are the statistical gold standard. They are also often impractical. Teams want to know if experiments are working before the 28-day window closes. Guardrail breaches may require early stopping. Launches with hard deadlines may force early reads.
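The traditional "peek early, decide early" approach inflates false positive rates. An experiment checked once at the halfway point and again at the end, with a p<0.05 rule at each look, has a false positive rate of roughly 8% rather than 5%, because the test has been given two chances to cross the threshold. Checking daily over a multi-week test can push the rate to three to five times the nominal level.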
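How the false positive rate grows with the number of looks can be checked with a short simulation. This sketch assumes equally spaced looks, no true effect, and a z-test with a 1.96 cutoff at every look; the function name and defaults are illustrative.

```python
import random
from math import sqrt


def peeking_false_positive_rate(n_looks: int, n_trials: int = 20000,
                                z_crit: float = 1.96, seed: int = 0) -> float:
    """Share of null experiments that cross |z| > z_crit at any of n_looks looks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            # Each equal-sized batch contributes a standard normal increment under H0.
            cumulative += rng.gauss(0.0, 1.0)
            # z-statistic on all data so far: the sum of `look` increments has variance `look`.
            if abs(cumulative / sqrt(look)) > z_crit:
                hits += 1
                break  # the team would have stopped and declared a winner here
    return hits / n_trials


if __name__ == "__main__":
    for looks in (1, 2, 5, 10):
        print(f"{looks:>2} looks: false positive rate ~ {peeking_false_positive_rate(looks):.3f}")
```

The simulated rates should land near the known theoretical values of about 5% for one look, 8% for two, 14% for five, and 19% for ten.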
Sequential testing designs solve this by specifying a stopping rule in advance that accounts for multiple looks.
Sequential Probability Ratio Test (SPRT): Tests at each observation whether the data is more consistent with the null hypothesis or the alternative. Reaches a definitive conclusion faster than fixed-horizon tests on average, while maintaining the pre-specified error rates. Requires specifying the minimum detectable effect in advance.
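A minimal sketch of Wald's SPRT for a single conversion rate shows the mechanics. It is simplified relative to a two-arm A/B test: it compares a stream of conversions against two fixed rates, `p0` (baseline) and `p1` (baseline plus the minimum detectable effect), both of which are assumed inputs. The function name is illustrative.

```python
from math import log


def sprt_bernoulli(observations, p0: float, p1: float,
                   alpha: float = 0.05, beta: float = 0.20):
    """Wald SPRT for H0: rate = p0 versus H1: rate = p1 on 0/1 outcomes.

    Returns (decision, n_used), where decision is "accept_h1",
    "accept_h0", or "continue" if the data ran out before a boundary was hit.
    """
    upper = log((1 - beta) / alpha)   # cross this: evidence favors H1
    lower = log(beta / (1 - alpha))   # cross this: evidence favors H0
    llr = 0.0                         # running log-likelihood ratio
    n = 0
    for converted in observations:
        n += 1
        if converted:
            llr += log(p1 / p0)
        else:
            llr += log((1 - p1) / (1 - p0))
        if llr >= upper:
            return ("accept_h1", n)
        if llr <= lower:
            return ("accept_h0", n)
    return ("continue", n)


if __name__ == "__main__":
    # Hypothetical stream: 0 = no conversion, 1 = conversion.
    stream = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    print(sprt_bernoulli(stream, p0=0.10, p1=0.20))
```

The boundaries use Wald's approximations, so the realized error rates are close to, not exactly, `alpha` and `beta`.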
mSPRT (mixture SPRT): A variant of the SPRT adapted for online experimentation by researchers at Stanford and Optimizely. It produces p-values that remain valid at any point during the experiment, allowing teams to stop when business conditions require it without inflating false positive rates. Sequential methods of this kind are built into Optimizely, Statsig, and several other modern experimentation platforms.
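To give a sense of how an always-valid p-value is computed, the sketch below implements the normal-mixture form of the mSPRT on a stream of treatment-minus-control differences. It assumes the variance of each difference (`sigma2`) is known and uses a normal mixing prior with variance `tau2` centered on the null; both are inputs a real platform would estimate or tune. It is not the implementation used by any particular vendor.

```python
from math import exp, log


def msprt_p_values(differences, sigma2: float, tau2: float = 1.0,
                   theta0: float = 0.0) -> list[float]:
    """Always-valid p-values after each observation (normal-mixture mSPRT).

    differences: per-pair treatment-minus-control values (assumed i.i.d. normal).
    sigma2:      assumed known variance of one difference.
    tau2:        variance of the normal mixing distribution over effect sizes.
    """
    p = 1.0
    total = 0.0
    p_values = []
    for n, d in enumerate(differences, start=1):
        total += d
        mean = total / n
        denom = sigma2 + n * tau2
        # Log of the mixture likelihood ratio; kept in log space to avoid overflow.
        log_lambda = 0.5 * log(sigma2 / denom) + (
            n * n * tau2 * (mean - theta0) ** 2 / (2 * sigma2 * denom)
        )
        # The p-value is the running minimum of 1 / Lambda_n, capped at 1.
        p = min(p, exp(-log_lambda))
        p_values.append(p)
    return p_values


if __name__ == "__main__":
    no_effect = msprt_p_values([0.0] * 50, sigma2=1.0)
    strong_effect = msprt_p_values([1.0] * 50, sigma2=1.0)
    print(no_effect[-1], strong_effect[-1])
```

Because the p-value sequence never increases, a team can stop the first time it drops below the chosen threshold, or at any later point, without a separate peeking correction.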
The tradeoff is efficiency: sequential tests are slightly less efficient than fixed-horizon tests when run to completion, requiring 5-10% more samples on average. For most SaaS teams, this overhead is negligible compared to the value of being able to make early decisions safely.
For teams also managing traffic constraints, the minimum detectable effect framework for low-traffic SaaS covers how these sequential methods interact with sample size planning.
Building a Practical False-Positive Control Policy
The theory is useful, but what matters in practice is a written policy that governs how experiments are analyzed before they run. A policy that can be implemented in a day:
| Rule | Details |
|---|---|
| Pre-registration required | Hypothesis, primary metric, sample size committed 48h before launch |
| Multiple experiments on same population | Apply BH correction at FDR=10% |
| Pricing or high-cost experiments | Apply Bonferroni correction |
| Early stopping for guardrails | Permitted; use mSPRT if available, otherwise require >50% of planned duration elapsed |
| Segment analyses | Only pre-specified segments reported as findings; all others exploratory |
| Minimum experiment duration | 14 days, regardless of significance, to capture weekly seasonality |
The minimum duration rule addresses a subtle false positive source that corrections do not fix: day-of-week effects. Users who sign up on Monday behave differently from users who sign up on Friday. An experiment that runs for only part of a week measures the variant's effect on an unrepresentative mix of users, and an early read taken after a few days can show an inflated or deflated effect that disappears once the full weekly cycle is included. Two full weeks covers each day of the week twice and removes most of this distortion.
Conclusion
False positive control is not a constraint on experimentation velocity — it is what makes experimentation results trustworthy at high velocity. A team that runs 50 experiments per quarter with rigorous pre-registration, appropriate multiple comparisons corrections, and sequential testing for early-stopping scenarios will ship fewer features than a team that ignores these practices. But the features it ships will actually work, and the knowledge base it accumulates will be reliable.
The SaasDash experimentation tools include built-in pre-registration templates and automated Benjamini-Hochberg correction for multi-experiment analysis batches. If your team is currently analyzing experiments without these controls, the experiment audit tool in the dashboard can scan your last 12 months of results and flag which declared winners may warrant re-testing.