Optimization glossary

Multi-armed bandit

What is the multi-armed bandit problem?

In marketing terms, a multi-armed bandit solution is a ‘smarter’ or more complex version of A/B testing that uses machine learning algorithms to dynamically allocate traffic to variations that are performing well, while allocating less traffic to variations that are underperforming.

The term "multi-armed bandit" comes from a hypothetical experiment where a person must choose between multiple actions (i.e., slot machines, the "one-armed bandits"), each with an unknown payout. The goal is to determine the best or most profitable outcome through a series of choices. At the beginning of the experiment, when odds and payouts are unknown, the gambler must determine which machine to pull, in which order and how many times. This is the “multi-armed bandit problem.”

Multi-armed bandit examples

One real-world example of a multi-armed bandit problem is when a news website has to make a decision about which articles to display to a visitor. With no information about the visitor, all click outcomes are unknown. The first question is, which articles will get the most clicks? And in which order should they appear? The website’s goal is to maximize engagement, but they have many pieces of content from which to choose, and they lack data that would help them to pursue a specific strategy.

The news website has a similar problem in choosing which ads to display to its visitors. In this case, they want to maximize advertising revenue, but they may be lacking enough information about the visitor to pursue a specific advertising strategy. Similar to the issue with news articles, they typically have a large number of ads from which to choose. Which ads will drive maximum revenue for their news site?

The website needs to make a series of decisions, each with unknown outcome and ‘payout.’

Multi-armed bandit solutions

There are many different solutions that computer scientists have developed to tackle the multi-armed bandit problem. Below is a list of some of the most commonly used multi-armed bandit solutions:


This is an algorithm for continuously balancing exploration with exploitation. (In ‘greedy’ experiments, the lever with highest known payout is always pulled except when a random action is taken). A randomly chosen arm is pulled a fraction ε of the time. The other 1-ε of the time, the arm with highest known payout is pulled.

Upper confidence bound

This strategy is based on the Optimism in the Face of Uncertainty principle, and assumes that the unknown mean payoffs of each arm will be as high as possible, based on observable data.

Thompson Sampling (Bayesian)

With this randomized probability matching strategy, the number of pulls for a given lever should match its actual probability of being the optimal lever.

What is a contextual bandit?

In a real-world scenario, we sometimes have data that can help inform decision-making when choosing between the various actions in a multi-armed bandit situation. This information is the ‘contextual bandit,’ the context and environment in which the experiment occurs.

In website optimization, contextual bandits rely on incoming user context data as it can be used to help make better algorithmic decisions in real time.

For example, you can use a contextual bandit to select a piece of content or ad to display on your website to optimize for click-through rate. The context is any historical or current information you have about the user, such as previously visited pages, past purchase information, device information or geolocation.

If you are a news portal, and you know that a person coming to your site has read entertainment articles in the past, you can select your top entertainment article to display at the top of the page for them to see. If you can see that your user is in Miami, you can display the local weather or other relevant information.

Multi-armed bandits and A/B testing

In deciding whether to use multi-armed bandits instead of A/B testing, you must weigh the tradeoff of exploitation vs. exploration (sometimes known as ‘earn or learn’.)

With A/B testing, you have a limited period of pure exploration where you allocate traffic in equal numbers to Version A and Version B. Once you declare a winner, you move into a long period of exploitation, where 100% of users go to the winning variation. One issue with this approach is that you waste resources on the losing variation while trying to gather data and learn which is the winner.

With multi-armed bandit testing, the tests are adaptive, and include periods of exploration and exploitation at the same time. They move traffic gradually towards winning variations, instead of forcing you to wait to declare a winner at the end of an experiment. This process is faster and more efficient because less time is spent on sending traffic to obviously inferior variations.

One of the main disadvantages to multi-armed bandit testing is its computational complexity. Simply put, it is just more difficult and resource-intensive to run multi-armed bandit tests.

There are some known situations where multi-armed bandit testing typically works best:

Headlines and short-term campaigns

The opportunity cost for waiting for the results of A/B tests make bandit algorithms a better choice for short-lived pieces of content like headline testing for new articles or holiday promotions.

Long-term dynamic changes

When the item being tested changes significantly enough to invalidate the results of an A/B test over time, multi-armed bandits provide an alternative to repeatedly retesting by continuously exploring.


Targeting is another example of a long-term use of bandit algorithms. If certain types of users are more common than others, the multi-armed bandit can apply learned targeting rules sooner for more common users, while continuing to experiment on less common users.

Automation for scale

If you have multiple components to continuously optimize, the multi-armed bandit approach gives you a framework to partially automate the optimization process for low-risk problems, which can be too costly to analyze individually.

How Optimizely uses multi-armed bandits

Optimizely uses a few multi-armed bandit algorithms to intelligently change the traffic allocation across variations to achieve a goal. Depending on your goal, you choose between the objectives:

  • Find statistically significant variation as soon as possible – Stats Accelerator
    • Reduces the experiment’s duration by showing more visitors the variation that has a better chance of reaching statistical significance.
    • Maximizes the number of learnings from experiments in a timeframe, so you spend less time waiting for results.
    • Attempts to discover as many significant variations as possible.
  • Maximize reward and minimize regret – Multi-armed Bandit (MAB)
    • Allows you to exploit as much value from the leading variation as possible during the experiment lifecycle, so you avoid the cost of showing sub-optimal experiences.
    • Does not generate statistical significance.
    • Uses Thompson Sampling algorithm for binary metrics.
    • Uses Epsilon-greedy algorithm for numeric metrics.