This is the third post in our Product Experimentation Pitfalls blog series written by Optimizely’s Director of Product Management, Jon Noronha. See here for more information on this 5-part series.
Analyzing an A/B test is deceptively simple. You count up how many conversions happened in each variation, you divide by the number of users exposed to it, and then you compare the conversion rates. Then you run it all through a stats formula online and see if it comes out >95% probability. Rinse and repeat for each variation and metric and presto, you’ve got a scorecard of winners and losers. What could possibly go wrong?
A whole lot, it turns out.
Having gone through this process on hundreds of experiments myself, I’ve come to learn two very costly lessons that I’d like to share with you:
- Statistics is the most confusing science on earth
- Misinterpreting an experiment is worse than not running it at all
Here’s an example from early in my experimentation career. We had launched a big redesign of our search results page with four different variants, and we were hoping to see an increase in engagement with no drop in core metrics like performance or ad revenue. So we worked with a data scientist to configure the test, and she advised us to run the experiment for three weeks with our standard battery of around 25 core metrics, like “homepage CTR” and “mobile ad revenue.” She also insisted, “Don’t peek at the results until it’s done, and loop me in when you’re ready to analyze it.” In retrospect, this was like telling a child, “Don’t eat the tasty marshmallow sitting in front of you.”
Needless to say, we peeked at the results. About four days into the test, we had our weekly experiment review meeting (a must for any team building a product experimentation muscle). In the meeting, we’d have all the PMs and engineering managers review the tests we were running and decide next steps. We decided to peek at this test even though it wasn’t done yet. And…uh-oh…the results were very bad.
The scorecard had lit up like a Christmas tree, with metrics showing up as extreme winners and losers. Somehow our redesign had tanked mobile revenue but boosted it on desktop…but only in China? We were ruining site performance on the homepage, but we’d made the search flow way faster. And two nearly identical variants were having almost exactly opposite effects on metrics that seemed hopelessly unrelated to the change we’d made.
What followed was the kind of FUD explosion that every product manager dreads.
My boss decided we had to stop the test before we took down the whole site. This left the entire project, six months of hard work by our whole team, in limbo. We spent the next two weeks desperately trying to figure out how our long-awaited redesign could have had such unexpected effects. Our engineers scrambled to profile the new design and figure out what could be making it slow. Our analyst sliced and diced the data 10 different ways to make sense of the changes. I had our designer frantically churning out whole new versions of the redesign, scrapping all of our big ideas so we could release something more practical. And in the middle of it all, I remember the desperate uncertainty. Why in the world were the numbers going haywire?
Finally, the data scientist found me wandering the halls, dejected and frantic. She asked me what was wrong and I told her what happened. “But you didn’t run it for the full three weeks,” she said, “and you didn’t invite me to the metrics review!” Uh…oops. So I sheepishly apologized and then pulled up our measly four days of data to show her. And that’s how I learned about the devastating power of false positives.
It turns out, the data we were seeing was completely a fluke. And furthermore, it was completely expected from the way we’d run the test. We’d made two critical errors that I, as a stats novice, was utterly unaware of.
The first mistake was making multiple comparisons without correcting for them. It turns out that the standard t-test, the basic algorithm behind the significance calculation in A/B testing, was only designed to test two variants and a single metric. 95% significance means there’s a 5% chance that you’ll get a significant result when there wasn’t really a difference, which is a pretty low risk. But we had four variants and 25 metrics, which meant we were making 100 comparisons instead of one. Each of those comparisons had a 5% chance of being a false positive, which means that you should expect about five of them to show a change even when nothing had happened. We were chasing ghosts all along.
The multiple comparisons problem, famously illustrated by Randall Patrick Munroe via xkcd.
This can feel like voodoo. How could gathering more data possibly be worse? Isn’t more data always better? The key factor to understand is the risk of a false positive. Because we’re dealing with random behavior, we can never be completely certain – but there’s a temptation to treat “95% significant” as if it means “100% definitely true.” I like to think of false positives as a game of Russian roulette, where the “bullet” is a misleading conclusion that sends your team scrambling after bad data. An A/B test with a single metric is like a revolver with 20 chambers, and a 1/20 chance of hitting a false positive. But testing lots of metrics is like spinning this revolver over and over and shooting each time. If you repeat this 100 times, you’re almost sure to be misled on at least a few of the metrics. This multiple comparisons problem is one of the most pervasive, least-understood challenges in experimentation – and one that had directly torpedoed our project because we didn’t account for it.
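To put rough numbers on the revolver analogy, here is a minimal sketch of the arithmetic. It assumes the 100 comparisons are independent, which real metrics usually are not (mobile revenue and total revenue move together, for instance), so treat the output as a ballpark rather than an exact figure. The function name is made up for illustration.

```python
import math


def false_positive_odds(n_comparisons=100, alpha=0.05):
    """Summarize false positives when no variant has any real effect.

    Assumes every comparison is independent (a simplification).
    """
    # Average number of comparisons that look "significant" by pure chance.
    expected = n_comparisons * alpha

    def prob_exactly(k):
        # Binomial probability of exactly k false positives.
        return (math.comb(n_comparisons, k)
                * alpha ** k
                * (1 - alpha) ** (n_comparisons - k))

    at_least_one = 1 - prob_exactly(0)
    at_least_three = 1 - sum(prob_exactly(k) for k in range(3))
    return expected, at_least_one, at_least_three


if __name__ == "__main__":
    expected, one_plus, three_plus = false_positive_odds()
    print(f"Expected false positives: {expected:.1f}")
    print(f"Chance of at least one:   {one_plus:.1%}")
    print(f"Chance of at least three: {three_plus:.1%}")
```

Under these assumptions, a scorecard with 100 comparisons has roughly a 99% chance of showing at least one phantom winner or loser.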
To make matters worse, we’d also made the second mistake of peeking at a fixed-horizon test. This problem stems from taking a statistic invented in 1908 and applying it to the Internet era. Fun fact: the t-test was actually invented by a chemist working for the Guinness brewery in Dublin. It made sense in that context: he would brew beer with two recipes, let it ferment for a few weeks, and then test it out at the end. In the meantime, there wasn’t much to do other than to watch it bubble. Therefore the math was designed to give a valid result at the end of the test only, not along the way.
With traditional t-tests, the p-value will oscillate over time. If you take action when it happens to peak, you risk working off bad information.
That’s great for brewing, but it falls apart with online A/B testing because you can refresh the numbers whenever you want. In our case, that flexibility meant we were going crazy over numbers that weren’t fully “cooked.” With fixed-horizon testing, statistical significance can fluctuate wildly until you reach the predefined end date of a test, which means it’s dangerously easy to misinterpret the results. And it’s not just a problem of peeking too soon. Even after you’ve waited for some predefined sample size, you’re still not supposed to peek at the results as the experiment continues to run because they’ll keep on fluctuating up and down.
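The effect of peeking is easy to reproduce with a small simulation. The sketch below runs A/A tests, where both arms share the same true conversion rate and every "significant" result is therefore a false positive. It compares an experimenter who acts on any significant peek with one who looks only at the end. The two-proportion z-test, the 20% conversion rate, and the peeking schedule are illustrative assumptions, not the setup from the story above.

```python
import math
import random


def two_sided_p_value(conv_a, conv_b, n):
    """Two-proportion z-test p-value for two arms with n users each."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = math.sqrt(2 * pooled * (1 - pooled) / n)
    if std_err == 0:
        return 1.0  # no variation observed yet, so nothing to detect
    z = (conv_a - conv_b) / n / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_peeking(n_sims=300, n_users=2000, peek_every=100,
                     rate=0.2, alpha=0.05, seed=0):
    """Run A/A tests and return (peeking_fp_rate, final_only_fp_rate).

    n_users should be a multiple of peek_every so that the last peek
    coincides with the planned end of the test.
    """
    rng = random.Random(seed)
    fooled_by_peeking = 0
    fooled_at_end = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        ever_significant = False
        p_value = 1.0
        for user in range(1, n_users + 1):
            # Both arms convert at the same true rate: there is no real effect.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if user % peek_every == 0:
                p_value = two_sided_p_value(conv_a, conv_b, user)
                ever_significant = ever_significant or p_value < alpha
        fooled_by_peeking += ever_significant
        fooled_at_end += p_value < alpha
    return fooled_by_peeking / n_sims, fooled_at_end / n_sims


if __name__ == "__main__":
    peeking, final_only = simulate_peeking()
    print(f"False positive rate when acting on any peek: {peeking:.1%}")
    print(f"False positive rate when looking only at the end: {final_only:.1%}")
```

Because the final look is also one of the peeks, the peeking rate can never be lower than the final-only rate. In practice it tends to be several times higher, and it grows with the number of peeks.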
What’s so frustrating about these problems is that they don’t announce themselves obviously. Nothing pops up in your analytics saying “don’t worry, this is just a false positive.” Instead, you see a seemingly-credible, bright-red report popping up to tell you that revenue is plunging and IT’S ALL YOUR FAULT. These false alarms are incredibly costly. They send your designers and developers back to the drawing board, frantically retooling a feature to correct a problem that isn’t really there and abandoning features that might actually have a positive impact. False positives can quickly negate the entire value of experimentation by injecting uncertainty and fear into a process that’s meant to provide clarity and confidence.
The good news is that these traps are entirely avoidable. There are two ways to prevent false positives. The most common one is the step I skipped: having an experienced statistician or data scientist analyze every test. This is how companies like Airbnb, Netflix, and Booking.com approach experimentation. They deliberately hire larger-than-average data science teams so that every product manager can be paired with an impartial analyst to parse the results.
These professionals can spot problems like false positives and guide a team to the right conclusions without overreacting.
But unfortunately, this kind of expertise is hard to scale, especially because naive experimenters, like my younger self, just don’t know what they’re missing. The idea that just peeking at the results could somehow invalidate them, or that measuring more data can make your data less valid, is so profoundly counterintuitive that most experimenters will just dismiss it entirely, at least until they’re shaken by an experience like mine.
Given this challenge, modern experimentation platforms are increasingly adopting the opposite approach. Instead of expecting the experimenter to understand the math, they change the math to match the experimenter’s intuitions. For example, you can apply techniques like Bonferroni correction for multiple hypotheses or the Benjamini-Hochberg procedure for reducing the false discovery rate. This was our approach at Optimizely. After years of trying to teach our users Statistics 101, we decided to throw out all the math and start from scratch. We partnered with a team of statisticians from Stanford University to build a new model called Stats Engine to modernize the standard t-test. Stats Engine uses modern techniques like sequential testing and false discovery rate control to generate results that don’t suffer from peeking or multiple comparisons problems.
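For a feel of what the two named corrections do, here is a minimal textbook sketch of Bonferroni and Benjamini-Hochberg in plain Python. It is not Stats Engine or any platform's actual implementation, the function names are made up for illustration, and the p-values in the usage example are invented.

```python
def bonferroni(p_values, alpha=0.05):
    """Return a reject flag per p-value, controlling the family-wise error rate."""
    # Split the error budget evenly across all comparisons.
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Return a reject flag per p-value, controlling the false discovery rate."""
    m = len(p_values)
    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    # Invented p-values for ten metrics on one scorecard.
    p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                0.060, 0.074, 0.205, 0.212, 0.216]
    print("Uncorrected:       ", sum(p < 0.05 for p in p_values))
    print("Bonferroni:        ", sum(bonferroni(p_values)))
    print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)))
```

Bonferroni is the stricter of the two: it guards against even a single false positive across the whole scorecard, at the cost of missing more real effects. Benjamini-Hochberg instead limits the expected share of false positives among the results flagged as significant, so it usually flags more.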
For more on addressing these challenges in practice, I recommend this great blog post from the data science team at the BBC. After years of battling false positives, they decided to switch to an experimentation platform with built-in corrections for multiple comparisons and peeking. For a more technical treatment, check out this conference presentation from KDD.