Optipedia Logo

Optimization Glossary

Multi-Armed Bandit

What is the Multi-Armed Bandit Problem?

In marketing terms, a multi-armed bandit solution is a ‘smarter’ or more complex version of A/B testing that uses machine learning algorithms to dynamically allocate traffic to variations that are performing well, while allocating less traffic to variations that are underperforming. In theory, multi-armed bandits should produce faster results since there is no need to wait for a single winning variation.

The term "multi-armed bandit" comes from a hypothetical experiment where a person must choose between multiple actions (i.e. slot machines, the "one-armed bandits"), each with an unknown payout. The goal is to determine the best or most profitable outcome through a series of choices. At the beginning of the experiment, when odds and payouts are unknown, the gambler must determine which machine to pull, in which order and how many times. This is the “multi-armed bandit problem.”

Multi-Armed Bandit Examples

One real-world example of a multi-armed bandit problem is when a news website has to make a decision about which articles to display to a visitor. With no information about the visitor, all click outcomes are unknown. The first question is, which articles will get the most clicks? And in which order should they appear? The website’s goal is to maximize engagement, but they have many pieces of content from which to choose, and they lack data that would help them to pursue a specific strategy.

The news website has a similar problem in choosing which ads to display to its visitors. In this case, they want to maximize advertising revenue, but they may be lacking enough information about the visitor to pursue a specific advertising strategy. Similar to the issue with news articles, they typically have a large number of ads from which to choose. Which ads will drive maximum revenue for their news site?

The website needs to make a series of decisions, each with unknown outcome and ‘payout.’

Multi-Armed Bandit Solutions

There are many different solutions that computer scientists have developed to tackled the multi-armed bandit problem. Below is a list of some of the most commonly used multi-armed bandit solutions:


This is an algorithm for continuously balancing exploration with exploitation. (In ‘greedy’ experiments, the lever with highest known payout is always pulled except when a random action is taken). A randomly chosen arm is pulled a fraction ε of the time. The other 1-ε of the time, the arm with highest known payout is pulled.

Upper Confidence Bound

This strategy is based on the Optimism in the Face of Uncertainty principle, and assumes that the unknown mean payoffs of each arm will be as high as possible, based on observable data.

Thompson Sampling (Bayesian)

With this randomized probability matching strategy, the number of pulls for a given lever should match its actual probability of being the optimal lever.

What is a Contextual Bandit?

In a real-world scenario, we sometimes have data that can help inform decision-making when choosing between the various actions in a multi-armed bandit situation. This information is the ‘contextual bandit,’ the context and environment in which the experiment occurs.

In website optimization, contextual bandits rely on incoming user context data as it can be used to help make better algorithmic decisions in real time.

For example, you can use a contextual bandit to select a piece of content or ad to display on your website to optimize for click-through rate. The context is any historical or current information you have about the user, such as previously visited pages, past purchase information, device information, or geolocation.

If you are a news portal, and you know that a person coming to your site has read Entertainment articles in the past, you can select your top Entertainment article to display at the top of the page for them to see. If you can see that your user is in Miami, you can display the local weather or other relevant information.

Multi-Armed Bandits & A/B Testing

In deciding whether to use multi-armed bandits instead of A/B testing, you must weigh the tradeoff of exploitation vs. exploration (sometimes known as ‘earn or learn’.)

With A/B testing, you have a limited period of pure exploration where you allocate traffic in equal numbers to Version A and Version B. Once you declare a winner, you move into a long period of exploitation, where 100% of users go to the winning variation. One issue with this approach is that you waste resources on the losing variation while trying to gather data and learn which is the winner.

With multi-armed bandit testing, the tests are adaptive, and include periods of exploration and exploitation at the same time. They move traffic gradually towards winning variations, instead of forcing you to wait to declare a winner at the end of an experiment. This process is faster and more efficient because less time is spent on sending traffic to obviously inferior variations.

One of the main disadvantages to multi-armed bandit testing is its computational complexity. Simply put, it is just more difficult and resource-intensive to run multi-armed bandit tests.

There are some known situations where multi-armed bandit testing typically works best:

Headlines and Short-Term Campaigns

The opportunity cost for waiting for the results of A/B tests make bandit algorithms a better choice for short-lived pieces of content like headline testing for new articles or holiday promotions.

Long-term Dynamic Changes

When the item being tested changes significantly enough to invalidate the results of an A/B test over time, multi-armed bandits provide an alternative to repeatedly retesting by continuously exploring.


Targeting is another example of a long-term use of bandit algorithms. If certain types of users are more common than others, the multi-armed bandit can applying learned targeting rules sooner for more common users, while continuing to experiment on less common users.

Automation for Scale

If you have multiple components to continuously optimize, the multi-armed bandit approach gives you a framework to partially automate the optimization process for low-risk problems which can be too costly to analyze individually.

How Optimizely Uses Multi-Armed Bandits

Optimizely’s Stats Accelerator can be described as a multi-armed bandit.This is because it helps users algorithmically capture more value from their experiments, either by reducing the time to statistical significance or by increasing the number of conversions gathered. To do this, Stats Accelerator monitors ongoing experiments and automatically adjusts traffic distribution among variations – just like multi-armed bandits.

Stats Accelerator applies one of the following two algorithms or optimization strategies:

Option 1: Accelerate Learnings

The Accelerate Learnings algorithm seeks to reduce experiment duration by showing more visitors the variations that have a better chance of reaching statistical significance. Accelerate Learnings attempts to discover as many significant variations as possible.

Accelerate Learnings maximizes the number of learnings from experiments in a given timeframe, so you spend less time waiting for results.

Option 2: Accelerate Impact

The Accelerate Impact algorithm seeks to maximize the payoff of the experiment by showing more visitors the leading variation. Accelerate Impact helps you exploit as much value from the leading variation as possible during the experiment lifecycle, so you avoid the opportunity cost of showing sub-optimal experiences.


Download our FREE Testing Toolkit for A/B testing ideas, planning worksheets, presentation templates, and more!

Get It Now