# Why you should choose sequential testing to accelerate your experimentation program

When evaluating statistical significance in digital experiments, the statistical method you choose influences the interpretation, speed, and robustness of results. The best approach depends on the specifics of the experiment, the information available, and the desired balance between speed and certainty.

Fully Bayesian methods offer a probability-based perspective, integrating prior knowledge with current data. But at the scale of large online experiments, that flexibility cuts both ways: if you make the wrong choice in the critical first step of selecting which statistical distribution to set as your prior, your online experiment is going to be as slow as molasses.

Sequential methods can accelerate decision-making by allowing early stopping. The statistical power of sequential testing shines when discovering difficult-to-find, minuscule effects, and it is lightning-fast at detecting blockbuster effects. Read on to learn which test method you should pick and how that choice affects your test results.

## Why sequential testing is superior to fully Bayesian statistics

Fully Bayesian statistics have a different aim than the frequentist statistics that underlie Stats Engine, so they cannot be directly compared.

Bayesian experiments are all about combining two sources of information: what you believed about the situation before you observed the data at hand, and what the data themselves have to say about the situation.

The first source is expressed as a prior probability distribution. Prior means what you understood before you observed the data. The second source is expressed through the likelihood. The same likelihood is used in fixed-horizon, frequentist statistics.

Therefore, an experimenter starting with a good guess can reach a decision much faster than an experimenter using a frequentist method. However, if that initial guess is poorly chosen, the test can either take an extremely long time or yield very high error rates.
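To make the prior-versus-likelihood trade-off concrete, here is a minimal sketch of a conjugate Beta-Binomial update for a conversion rate. The conversion counts and prior parameters are illustrative assumptions, not data from any real experiment:

```python
# Sketch: how the prior interacts with the data in a Beta-Binomial model
# of a conversion rate. All rates, counts, and priors below are made up.

def posterior(alpha_prior, beta_prior, conversions, visitors):
    """Conjugate Beta update: Beta(alpha + successes, beta + failures)."""
    return alpha_prior + conversions, beta_prior + (visitors - conversions)

def posterior_mean(alpha, beta):
    return alpha / (alpha + beta)

# Observed data: 60 conversions out of 1,000 visitors (6% empirical rate).
conversions, visitors = 60, 1000

# A well-chosen prior centered near the truth barely shifts the estimate...
good = posterior(6, 94, conversions, visitors)    # prior mean ~6%, weak
# ...while a confidently wrong prior drags the posterior toward 30%.
bad = posterior(300, 700, conversions, visitors)  # prior mean 30%, strong

print(round(posterior_mean(*good), 3))  # stays at the observed 6%
print(round(posterior_mean(*bad), 3))   # pulled far above 6%
```

With the weak, well-centered prior the posterior mean stays at 6%; with the strong, misplaced prior it lands at 18%, tripling the estimated lift from pure prior misspecification.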

## The flawed Bayesian claim about "No mistakes"

A fully Bayesian testing procedure does not claim to control the frequency of false positives. Instead, it sets a goal for the expected loss and quantifies the risk of choosing one variant over the other.

Bayesian methods are prone to detecting more winners simply because they do not incorporate strong false discovery rate control, particularly for experiments with continuous monitoring.

Although the Frequentist methods underlying Stats Engine are less flexible for directly incorporating prior information, they offer error guarantees that hold no matter the situation or prior knowledge of the experimenter. For example, Stats Engine offers strong control of the false discovery rate for any experiment, whereas Bayesian methods may perform better or worse depending on the exact choice of the prior distribution.

It takes more evidence to produce a significant result with Stats Engine, which in exchange allows experimenters to peek as many times as they desire over the life of an experiment. Further, Stats Engine is designed to compute statistical significance continuously, so an experiment can be concluded as soon as enough evidence has accumulated.

How much power is gained or lost when using a sequential approach with Stats Engine versus an experiment based on traditional fixed horizon testing **depends on the performance of the specific experiment.**

Traditional test statistics require a **pre-determined sample size**. Once set, you may not deviate from it: you cannot call the test early or let it run longer. With traditional statistics, detecting smaller, more subtle true effects requires running a longer experiment.
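A short sketch of the fixed-horizon sample-size calculation makes this cost visible. The baseline rate, MDE, alpha, and power below are illustrative assumptions:

```python
# Sketch: the pre-determined sample size a fixed-horizon two-proportion
# z-test demands. Baseline rate, MDE, alpha, and power are assumptions.
import math
from statistics import NormalDist

def sample_size_per_group(p_baseline, mde_abs, alpha=0.05, power=0.8):
    """Visitors per group for a two-sided test of two proportions."""
    p_variant = p_baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Halving the MDE roughly quadruples the required sample per group.
print(sample_size_per_group(0.05, 0.02))  # 5% baseline, 2-point MDE
print(sample_size_per_group(0.05, 0.01))  # 5% baseline, 1-point MDE
```

The quadratic blow-up in the denominator is exactly why chasing subtle effects with a fixed horizon means committing to a much longer experiment up front.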

## The real role of sequential testing's blockbuster boost and sensitive signal superpower

There are several major time-saving advantages of sequential analysis.

First, an experiment requires fewer samples with group sequential tests when the difference between the treatment and control groups is large, meaning when the actual uplift is larger than the minimum detectable effect (MDE) set at the initiation of the experiment. In such cases, the experiment can be stopped early before reaching the pre-specified sample size.

For example, **if the lift of your test is 5 percentage points larger than your chosen MDE**, Stats Engine will run as fast as fixed-horizon statistics. As soon as the improvement exceeds the MDE by as much as 7.5 percentage points, Stats Engine is almost 75% faster than a test run with traditional methods. For larger experiments (>50,000 visitors), the gains are even greater: Stats Engine can determine a conclusive experiment up to 2.5 times as fast.

Another scenario that requires fewer samples in sequential experimentation is when the **conversion rate of the control group is less than 10%**. In this case, sequential testing can reduce the number of observations required for a successful experiment by 50% or more.

## Want to see if sequential testing can fail? Don’t give it a lift

Sequential testing is faster both when there are tiny effects to find and when there are giant uplifts to find. If there is no effect at all, a sequential test takes longer to conclude (reach statistical significance).

So, there are times when you might choose a traditional statistical test over a sequential design. First, recall that a Type I error means concluding there is a difference between the test variant and the control/baseline when no real difference exists.

Doing statistics old school gives you Type I error guarantees at a single point in time: the pre-specified sample size. That's it. That's the only prize you win.

Now, that's super useful for clinical trials. Ask yourself: Are we designing orphan drugs for a Phase 1 clinical trial that needs to meet hyper-stringent mandates by the FDA? Last I checked we are not in that business. We have different, but no less complex, scientific rigors to contend with.

Sequential tests offer you Type I error protection for the entire length of the experiment, for any amount of traffic. Doesn't that sound a lot more flexible?

## Then, what in the world is so great about the "sequential" aspect of any sequential method?

###### Size of sample

- A fixed sample size is difficult to honor in practice. Resources and schedules rapidly change and shift the analysis timeline.
- Sample size calculations are often completely unmanageable for complex models and depend on many unknown parameters (Noordzij et al., 2010).
- A workaround is to wait a week or more. But then the analyst doesn't know the expected power of the test or what the confidence interval should be. The analyst could keep waiting longer, but then their chances of making a Type I error (i.e., crying wolf) inflate like a hot air balloon (Armitage et al, 1969, 1991, 1993).
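The "crying wolf" inflation from repeated looks is easy to demonstrate by simulation. This is a minimal sketch: the peek schedule, trial count, and alpha are illustrative assumptions, and the data are A/A (no true effect):

```python
# Sketch: simulating how repeated peeks inflate the Type I error of a
# fixed-horizon z-test. Peek schedule, trial count, and alpha are assumptions.
import math
import random

def peeking_false_positive_rate(n_trials=2000, n_max=1000, peeks=10, seed=7):
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided critical value for a nominal 5% test
    checkpoints = [n_max * (i + 1) // peeks for i in range(peeks)]
    false_positives = 0
    for _ in range(n_trials):
        total, n, rejected = 0.0, 0, False
        for checkpoint in checkpoints:
            while n < checkpoint:
                total += rng.gauss(0, 1)  # A/A data: the true effect is zero
                n += 1
            z = total / math.sqrt(n)  # z-statistic at this peek
            if abs(z) > z_crit:
                rejected = True  # stop early and (wrongly) declare a winner
                break
        false_positives += rejected
    return false_positives / n_trials

# With 10 peeks, the realized error rate lands well above the nominal 5%.
print(peeking_false_positive_rate())
</antml```

Each extra look gives noise another chance to cross the significance threshold, which is the inflation Armitage et al. quantified and the problem sequential corrections exist to solve.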

###### Size of effect

- If the effect size of the test variation is large, then we can detect it with less data than we initially thought necessary (Bojinov & Gupta, 2022).
- At the same time, companies want the ability to precisely estimate the effect even if it is relatively small. **It is impossible to satisfy both objectives with traditional, fixed sample-size statistical tests.**
- If the sample size is small, then the experiment identifies large negative effects early but is underpowered to identify small, interesting distinctions between the test variation and the baseline.
- If the sample size is large so that the experiment can detect small effects, then there is a high risk of exposure to large negative effects for a dangerous length of time where the user experience deteriorates beyond repair.

###### The Burger King factor -- Have it your way

- Conveniently, the experimenter doesn't have to commit to a fixed sample size ahead of time. Instead, the experimenter can collect data until they are satisfied.
- Sequential tests let you monitor your experiments AND stop the experiment when the uncertainty around the projected performance of the test variations stabilizes (for example, the confidence interval narrows, and interpretation is easier).
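A "stop when the uncertainty stabilizes" rule can be sketched in a few lines. This is not Optimizely's stopping rule; the target interval width, z-multiplier, and simulated traffic below are illustrative assumptions:

```python
# Sketch: collect Bernoulli outcomes until the confidence interval around
# the estimated rate is narrow enough to interpret. Target width, z value,
# and the simulated traffic stream are assumptions, not a production rule.
import math
import random

def run_until_stable(data_stream, half_width_target=0.02, z=1.96,
                     min_n=200, max_n=50_000):
    """Return (n, estimate, CI half-width) once the interval stabilizes."""
    successes = 0
    for n, outcome in enumerate(data_stream, start=1):
        successes += outcome
        if n >= min_n:
            p_hat = successes / n
            half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
            if half_width <= half_width_target or n >= max_n:
                return n, p_hat, half_width

rng = random.Random(42)
# Simulated visitors converting at a true rate of 10%.
stream = (1 if rng.random() < 0.10 else 0 for _ in iter(int, 1))
n, p_hat, hw = run_until_stable(stream)
print(n, round(p_hat, 3), round(hw, 4))
```

The experiment simply runs until the interval is tight enough to act on, rather than until a precomputed visitor count is reached.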

###### The Taco Bell factor -- Live más

- Sequential tests allow continuous monitoring. With continuous monitoring, you can manage your experiments automatically and algorithmically. A/B tests then serve as quality-control gatekeepers for controlled rollouts of new features and changes, letting experimentation scale as part of your experimentation culture.

## What makes Optimizely's sequential method interesting?

Stats Engine deploys a novel algorithm called the mixture sequential probability ratio test (mSPRT).

**After every visitor, it compares how much more indicative the data are of a non-zero improvement than of no improvement at all.** This is the relative plausibility of the variation(s) compared to the baseline.

The mSPRT is a special type of statistical test that improves upon the sequential probability ratio test (SPRT), first proposed by the statistician Abraham Wald in 1945. That OG sequential probability ratio test was designed to test an exact, specific value of the lift from a single variation against a single control, by comparing the likelihood of that specific non-zero improvement versus zero improvement over the baseline.

Specifically, Optimizely's mSPRT algorithm averages the ordinary SPRT across a range of possible improvements (that is, alternative lift values).
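For intuition, here is a minimal sketch of an mSPRT-style always-valid p-value for a stream of normal observations with known variance, using the closed-form mixture likelihood ratio under a normal mixing distribution. The mixing variance, the data, and every parameter are illustrative assumptions; this is not Optimizely's production implementation:

```python
# Sketch: mSPRT-style always-valid p-values for normal data with known
# variance sigma2, mixing the alternative over theta ~ N(theta0, tau2).
# All parameters and data are illustrative assumptions.
import math
import random

def msprt_p_values(observations, sigma2=1.0, tau2=1.0, theta0=0.0):
    """Mixture likelihood ratio with closed form
      Lambda_n = sqrt(sigma2 / (sigma2 + n*tau2))
                 * exp(n^2 * tau2 * (xbar - theta0)^2
                       / (2 * sigma2 * (sigma2 + n*tau2)))
    and always-valid p-value p_n = min(p_{n-1}, 1 / Lambda_n).
    """
    p, total, p_values = 1.0, 0.0, []
    for n, x in enumerate(observations, start=1):
        total += x
        xbar = total / n
        log_lr = (0.5 * math.log(sigma2 / (sigma2 + n * tau2))
                  + n * n * tau2 * (xbar - theta0) ** 2
                  / (2 * sigma2 * (sigma2 + n * tau2)))
        p = min(p, math.exp(-log_lr))  # p-values only ever shrink
        p_values.append(p)
    return p_values

rng = random.Random(0)
# A real effect (mean 0.5) drives the always-valid p-value toward zero.
effect = msprt_p_values(rng.gauss(0.5, 1.0) for _ in range(500))
print(effect[-1])
```

Because the p-value sequence is monotone non-increasing and valid at every step, the experimenter can check it after every visitor without inflating the error rate.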

Optimizely’s Stats Engine also employs a flavor of the empirical Bayesian technique. It blends the best of frequentist and Bayesian methods while maintaining the always-valid guarantee for continuous monitoring of experiment results.

Stats Engine requires more evidence to produce a significant result, which allows experimenters to peek as many times as they desire over the life of an experiment. It also controls your false-positive rate at all times, regardless of when or how often you peek, and further adjusts for situations where your experiment has multiple comparisons (i.e., multiple metrics and variations).

Controlling the False Discovery Rate offers a way to increase power while **maintaining a principled bound on error**. Said another way, the False Discovery Rate is the expected share of declared winners that are actually false alarms, i.e., the chance of crying wolf over an innocuous finding. Therefore, Stats Engine permits continuous monitoring of results with always-valid outcomes by controlling the false-positive rate at all times, regardless of when or how often the experimenter peeks at the results.
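The classic procedure for bounding the False Discovery Rate across many metrics and variations is the Benjamini-Hochberg step-up rule; a minimal sketch is below. This illustrates FDR control generally, not Stats Engine's exact internal procedure, and the p-values are made up:

```python
# Sketch: the classic Benjamini-Hochberg step-up procedure for controlling
# the False Discovery Rate across multiple comparisons. The p-values are
# illustrative; this is not Stats Engine's exact internal procedure.

def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at FDR level q."""
    m = len(p_values)
    ranked = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(ranked, start=1):
        # Find the largest rank whose p-value clears the BH line q*rank/m.
        if p_values[idx] <= q * rank / m:
            cutoff = rank
    return sorted(ranked[:cutoff])

# Five metrics: two genuinely significant, three noise.
p_values = [0.001, 0.008, 0.039, 0.041, 0.45]
print(benjamini_hochberg(p_values))  # → [0, 1]
```

Only the first two metrics survive, even though two others are below the naive 0.05 line: that is the principled bound on error in action.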

**Remember...**

When choosing between a Bayesian test and a sequential test, it is important to consider the specific needs of your situation.

Bayesian A/B methods are well-suited for situations where you have prior information about the parameters of interest and want to update your beliefs about those parameters in light of new data. Stopping a Bayesian test early means you’ll likely accept a null or negative result.

Sequential testing can help you evaluate the consistency and dominance of a variation's performance over the other (or lack thereof).

Where else can you read more about the different testing methods? Well, here are some relevant posts to get you started: