Multi-armed bandit

What is the multi-armed bandit problem?

In marketing terms, a multi-armed bandit solution is a ‘smarter’ or more complex version of A/B testing that uses machine learning algorithms to dynamically allocate traffic to variations that are performing well, while allocating less traffic to variations that are underperforming.

The term "multi-armed bandit" comes from a hypothetical experiment where a person must choose between multiple actions (i.e., slot machines, the "one-armed bandits"), each with an unknown payout. The goal is to determine the best or most profitable outcome through a series of choices. At the beginning of the experiment, when odds and payouts are unknown, the gambler must determine which machine to pull, in which order and how many times. This is the “multi-armed bandit problem.”

Multi-armed bandit examples

One real-world example of a multi-armed bandit problem is when a news website has to make a decision about which articles to display to a visitor. With no information about the visitor, all click outcomes are unknown. The first question is, which articles will get the most clicks? And in which order should they appear? The website’s goal is to maximize engagement, but they have many pieces of content from which to choose, and they lack data that would help them to pursue a specific strategy.

The news website has a similar problem in choosing which ads to display to its visitors. In this case, they want to maximize advertising revenue, but they may be lacking enough information about the visitor to pursue a specific advertising strategy. Similar to the issue with news articles, they typically have a large number of ads from which to choose. Which ads will drive maximum revenue for their news site?

The website needs to make a series of decisions, each with unknown outcome and ‘payout.’

Multi-armed bandit solutions

There are many different solutions that computer scientists have developed to tackle the multi-armed bandit problem. Below is a list of some of the most commonly used multi-armed bandit solutions:

Epsilon-greedy

This is an algorithm for continuously balancing exploration with exploitation. (In ‘greedy’ experiments, the lever with highest known payout is always pulled except when a random action is taken). A randomly chosen arm is pulled a fraction ε of the time. The other 1-ε of the time, the arm with highest known payout is pulled.
Upper confidence bound

This strategy is based on the Optimism in the Face of Uncertainty principle, and assumes that the unknown mean payoffs of each arm will be as high as possible, based on observable data.
Thompson Sampling (Bayesian)

With this randomized probability matching strategy, the number of pulls for a given lever should match its actual probability of being the optimal lever.

How contextual bandits differ from standard multi-armed bandits

A contextual bandit is an advanced personalization algorithm that enhances the multi-armed bandit approach by incorporating user-specific data. While traditional multi-armed bandits help identify winning variations, contextual bandits determine which variation works best for each unique visitor.

The "context" refers to visitor-specific information like device type, location, past behavior, or purchase history. This data allows the machine learning model to make smarter decisions about which content to display to maximize conversions.

How contextual bandits work:

Learning period: The model starts with 100% exploration, randomly assigning variations to visitors to gather diverse data for predictions.
Balancing exploration and exploitation: Once enough visitor behavior data is collected, the model exploits (serving personalized variations). It dynamically adjusts exploration/exploitation rates as it receives more events.
Continuous adaptation: The model maintains some exploration (maximum 95% exploitation) to ensure continuous learning and avoid missing opportunities

Instead of manually creating complex targeting rules for different user segments, contextual bandits automatically learn these relationships and serve the most relevant experience to each visitor in real time. This approach eliminates guesswork and delivers true 1:1 personalization at scale—critical in today's competitive environment where attention spans are limited and personalization expectations are high.

The key difference between contextual bandits and traditional multi-armed bandits is context. While standard multi-armed bandits seek a single winning variation across all users, contextual bandits identify the best variation for each user based on their specific attributes.

A/B testing: Fixed traffic splits with one-size-fits-all winner selection
Multi-armed bandits: Dynamic traffic allocation seeking one overall "best" variation
Contextual bandits: Personalized experiences based on user context (device, location, behavior)

Multi-armed bandits vs. A/B testing

In deciding whether to use multi-armed bandits instead of A/B testing, you must weigh the tradeoff of exploitation vs. exploration (sometimes known as ‘earn or learn’.)

The key to understanding the difference between traditional A/B testing and multi-armed bandits is the concept of exploitation versus exploration:

Exploration: Showing different variations to visitors to learn which performs best
Exploitation: Showing the best-performing variation to maximize conversions

With A/B testing, you have a limited period of pure exploration where you allocate traffic in equal numbers to Version A and Version B. Once you declare a winner, you move into a long period of exploitation, where 100% of users go to the winning variation. One issue with this approach is that you waste resources on the losing variation while trying to gather data and learn which is the winner.

With multi-armed bandit testing, the tests are adaptive, and include periods of exploration and exploitation at the same time. They move traffic gradually towards winning variations, instead of forcing you to wait to declare a winner at the end of an experiment. This process is faster and more efficient because less time is spent on sending traffic to obviously inferior variations.

One of the main disadvantages to multi-armed bandit testing is its computational complexity. Simply put, it is just more difficult and resource-intensive to run multi-armed bandit tests.

Metric	A/B testing	Mutli-armed bandit
Time to implement winner	2-4 weeks on average	Can begin within days
Traffic required	Higher (fixed allocation)	Lower (adaptive allocation)
Opportunity cost	High (well-established p-values)~5-15% conversion loss during test period	~1-5% conversion loss during test period
Statistical rigor	High (well-established p-values)	Moderate (focuses on regret minimization)
Best for test duration	Long-term (weeks/months)	Short-term (days/weeks)
Computational complexity	Low	Moderate to high

When to use multi-armed bandits

There are some known situations where multi-armed bandit testing typically works best:

Headlines and short-term campaigns: The opportunity cost for waiting for the results of A/B tests make bandit algorithms a better choice for short-lived pieces of content like headline testing for new articles or holiday promotions.
Long-term dynamic changes: When the item being tested changes significantly enough to invalidate the results of an A/B test over time, multi-armed bandits provide an alternative to repeatedly retesting by continuously exploring.
Targeting: Targeting is another example of a long-term use of bandit algorithms. If certain types of users are more common than others, the multi-armed bandit can apply learned targeting rules sooner for more common users, while continuing to experiment on less common users.
Automation for scale: If you have multiple components to continuously optimize, the multi-armed bandit approach gives you a framework to partially automate the optimization process for low-risk problems, which can be too costly to analyze individually.

When to use contextual bandits

Contextual bandits deliver substantial business value in these scenarios:

Personalization at scale: When you need to provide truly personalized experiences for every user instead of one-size-fits-all approaches, delivering the right content to the right person at the right time.
Dynamic adaptation: When you need a system that serves the best variation in every session, even as user preferences evolve.
Elimination of opportunity costs: Unlike A/B tests that require weeks or months to reach statistical significance, contextual bandits start optimizing immediately, reducing exposure to underperforming variations in real time.
Low maintenance optimization: For pages where content doesn't change too frequently. As time goes on, the ML model gets sharper with the data it collects, making this a "set it and forget it" optimization that can be left running continuously.

How Optimizely uses multi-armed bandits

Optimizely Web Experimentation and Feature Experimentation use a few multi-armed bandit algorithms to intelligently change the traffic allocation across variations to achieve a goal. Depending on your goal, you choose between the objectives:

1. Stats accelerator

Find statistically significant variation as soon as possible.

Reduces the experiment’s duration by showing more visitors the variation that has a better chance of reaching statistical significance.
Maximizes the number of learnings from experiments in a timeframe, so you spend less time waiting for results.
Attempts to discover as many significant variations as possible.

2. Multi-armed bandit (MAB)

Maximize reward and minimize regret.

Allows you to exploit as much value from the leading variation as possible during the experiment lifecycle, so you avoid the cost of showing sub-optimal experiences.
Does not generate statistical significance.
Uses Thompson Sampling algorithm for binary metrics.
Uses Epsilon-greedy algorithm for numeric metrics.

3. Contextual bandit

Deliver truly personalized experiences.

Advanced tree-based machine learning models that match experiences to individual user contexts
Provides 1:1 personalization at scale without manual segmentation
Best for Creating lasting personalization programs that adapt to changing user behaviors