As we work with our customers to help them understand the results and learnings from an experiment, one of the main questions we hear is “why am I not reaching significance?” or, more specifically, “why is this experiment inconclusive?” When I hear this question, my first check is to look at which metrics are set within Optimizely to measure the success of the experiment and, most importantly, which one is set as the primary metric. The primary metric is meant to be the metric weighted most heavily among the experiment’s metrics when declaring a winner.
The work to find conclusive results starts way before pressing start on your experiment. It starts before doing results analysis. It starts before test design. It starts as you identify the metrics you can influence through experimentation and understand how those metrics interact with each other.
For most businesses, revenue and conversions are the KPIs in focus to improve through experimentation. But there are other metrics that could (and should) be the focus for individual experiments. The metrics below your most important KPIs are also the ones closest to where you are experimenting. These metrics are the behaviors you can most confidently measure and improve for an experiment. Moving the needle on these behaviors has downstream impact on those top-line KPIs.
If you only focus on revenue as the primary metric for all experiments, you are sure to lose out on wins, learnings, and opportunities to iterate towards impacting revenue. You will undoubtedly call experiments unsuccessful when in fact they positively impact user behaviors and influence revenue in the long run. If you don’t set up your experiments properly to uncover impact on these behaviors, you are also missing the chance to learn what impacts revenue.
But wait. We only care about driving revenue from our experiments. Why wouldn’t we measure success by revenue?
You should measure revenue! In fact you should measure it for every experiment as a secondary metric if that’s important to your business! And use that in balance with other metrics to determine success. However, what you can’t control for every experiment is the impact that it will directly have on revenue.
Your metric definition for an individual experiment should fit into these three buckets:
Let’s imagine you have seen dropping engagement with your homepage, which is an issue because it’s the main entry point for your ecommerce site. A hypothesis you have may be:
‘If we used a carousel instead of a static hero image, then we will increase purchases, because we are providing our users more offers and product messaging upon entry.’
You need to consider that there are a lot of steps, messages, and behaviors that happen for your users between that change and the final conversion, and that you are not controlling for them. As an example, a promotional ad campaign could have driven a user to the homepage, and mismatched messaging later in the path to purchase could then pull that user away from converting.
Your primary metric (and the main metric in your hypothesis) should always be the behavior closest to the change you are making in the variation you are employing. The hypothesis should actually read:
‘If we use a carousel instead of a static hero image, then we will increase hero real estate clicks and product page views, because we are providing our users more offers and product messaging upon entry.’
Now let’s imagine a different scenario. You’ve optimized the top part of your ecommerce funnel well. But you’re struggling in the checkout step now. Your hypothesis is:
‘If we collapsed form field sections on the checkout page, then we will increase purchases, because the section titles present an indication of all the info we will require from users above the fold.’
This maps well to the metric flow chart above! Since we are experimenting at the last step (the only purpose of that page is to convert users), using purchases or revenue as the primary metric makes sense. It’s the behavior you are most likely to impact from the changes you are making in the variation.
But for our up-the-funnel experiments (those further away from purchase or your final conversion), should we not expect to see revenue impact?
“Should” may not be the right word. We always hope that revenue will be impacted by your prioritized experiments. There just needs to be an understanding that you may not be able to confidently measure direct impact on revenue in every experiment.
However, if you can progress the behaviors that ladder up and lead your users closer to the final conversion that drives revenue, you are impacting revenue. If you measure those leading behaviors first and make statistically significant improvements on them, you can continue to shift your focus to experiments that are closer to revenue.
That makes sense. But again. REVENUE, REVENUE, REVENUE.
Alright, fair. The above is conceptual. But we looked across all customer experiments and actually saw this to be true! We found that when revenue is set as the primary metric in Optimizely, it reached the project’s statistical significance level 10% of the time compared to when all other goal types (page view events, click events, custom events) are set as the primary metric. Even though we want to maximize revenue in our experimentation, it is not always within our complete control to do so in every experiment, and experiment data backs this up.
We’ve also heard from a handful of our customers that they see the same type of statistical significance split across goal types when they analyze their own testing backlog from a quarter or year. That said, the best programs do measure revenue and other downstream metrics for every experiment in order to understand the incremental impact of experimentation on those key KPIs.
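If you want to run that kind of backlog check on your own program, a rough sketch might look like the one below. It assumes you can export each past experiment as a pair of primary metric type and whether it reached significance. The `backlog` rows here are made up purely for illustration, and the type labels are hypothetical rather than anything Optimizely defines.

```python
from collections import defaultdict

# Made-up backlog export for illustration only:
# (primary_metric_type, reached_significance). Labels are hypothetical.
backlog = [
    ("revenue", False),
    ("revenue", False),
    ("click", True),
    ("click", False),
    ("page_view", True),
]


def significance_rate_by_type(rows):
    """Share of experiments reaching significance, per primary metric type."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for metric_type, reached in rows:
        totals[metric_type] += 1
        hits[metric_type] += int(reached)
    return {metric_type: hits[metric_type] / totals[metric_type] for metric_type in totals}


rates = significance_rate_by_type(backlog)
```

With a real export in place of the sample rows, `rates` gives you one number per goal type that you can compare quarter over quarter.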
How should you weigh the primary versus secondary metric(s) for determining individual experiment success?
A good practice in your test planning is to discuss as a group what trade-offs you are willing to make on performance between the primary and secondary metric(s) – in this case revenue. It’s been an interesting learning to find that this differs across the industry. Some programs take a statistically significant improvement on the primary metric alone as the success factor for an experiment. Some programs state that there must be statistically significant improvement on secondary metrics (e.g. revenue, purchases) for any experiment to be considered a win.
You can set up a decision framework ahead of an experiment (or for the program at large) to create a consensus approach on how to handle these scenarios. This framework may change over time, but it can speed up decision making and the actions you take based on results.
There are two key parts to a best-in-class decision framework: it defines a reasonable delta of impact on revenue AND it states whether a statistically significant win on the primary metric is necessary. The delta could have both positive and negative boundaries. This is where Optimizely’s confidence intervals come in handy. The confidence intervals on the Optimizely results page give you a clear indication of where the “true improvement” on revenue will lie if you implement the winner. Ensure that the interval does not extend beyond your reasonable delta.
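To make the two parts concrete, here is a minimal sketch of such a framework written as a small function. Everything in it is an assumption for illustration: the `MetricResult` fields, the `is_win` name, and the 2% tolerated revenue loss are hypothetical and are not part of Optimizely’s product or API. It only checks the negative boundary of the delta; a positive boundary could be added the same way.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """One metric's result; field names are illustrative, not Optimizely's API."""
    significant: bool   # reached the project's significance level
    improvement: float  # observed relative improvement, e.g. 0.04 for +4%
    ci_low: float       # lower bound of the confidence interval
    ci_high: float      # upper bound of the confidence interval


def is_win(primary, revenue, max_revenue_loss=-0.02, require_primary_win=True):
    """Apply a pre-agreed framework: primary metric rule plus a revenue guardrail."""
    primary_win = primary.significant and primary.improvement > 0
    if require_primary_win and not primary_win:
        return False
    # Guardrail: the revenue interval must not reach below the tolerated loss.
    return revenue.ci_low >= max_revenue_loss


# Hypothetical numbers: clicks improved significantly, revenue is flat but bounded.
clicks = MetricResult(significant=True, improvement=0.08, ci_low=0.02, ci_high=0.14)
revenue = MetricResult(significant=False, improvement=0.01, ci_low=-0.01, ci_high=0.03)
decision = is_win(clicks, revenue)
```

Here `decision` is `True` because the primary metric won and the lower bound of the revenue interval stays inside the agreed delta. Writing the rule down this way, whatever thresholds your team picks, is what lets you agree on the call before the results come in.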
What are other ways we can measure the success of our program?
The most overlooked part of standing up an experimentation program is how the program measures itself. We call these measures ‘operational metrics’, and they look at the overall program rather than at specific tests. These metrics are behaviors that we know are strong indicators of a healthy program. If we believe that our methodology is sound and in turn generating learnings that improve our knowledge of our customers, these types of metrics are good indicators of success:
- Velocity – The # of experiments started per week, month, quarter, etc.
- Conclusive Rate – The % of experiments reaching a statistically significant state.
- Win Rate – The % of experiments reaching a positive statistically significant state.
- Learning Rate – The % of experiments that created an actionable learning.
- Reusability Rate – The % of experiments that inform other business initiatives.
- Iteration Rate – The % of experiments that are iterated on as a next step.
There are many other operational metrics you can dream up, and these could be sliced by any metadata you are keeping on your program (e.g. type of variation strategy, source of idea, etc.) to illustrate the impact the program is having on your business.
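As a rough illustration of how these operational metrics could be computed, the sketch below works from a simple experiment log. The log entries are made up, and every field name (`month`, `source`, `winner`, and so on) is an assumption about how you might track your program, not a format Optimizely provides.

```python
from collections import Counter

# Made-up experiment log for illustration; every field name is an assumption.
experiments = [
    {"month": "2024-01", "source": "analytics", "significant": True,
     "winner": True, "learning": True, "reused": False, "iterated": True},
    {"month": "2024-01", "source": "analytics", "significant": True,
     "winner": False, "learning": True, "reused": True, "iterated": False},
    {"month": "2024-02", "source": "customer_feedback", "significant": False,
     "winner": False, "learning": True, "reused": False, "iterated": True},
    {"month": "2024-02", "source": "customer_feedback", "significant": False,
     "winner": False, "learning": False, "reused": False, "iterated": False},
]


def share(rows, flag):
    """Fraction of experiments where the given boolean flag is set."""
    return sum(1 for r in rows if r[flag]) / len(rows) if rows else 0.0


def operational_metrics(rows):
    """Program-level metrics; 'winner' means a positive significant result."""
    return {
        "velocity_by_month": dict(Counter(r["month"] for r in rows)),
        "conclusive_rate": share(rows, "significant"),
        "win_rate": share(rows, "winner"),
        "learning_rate": share(rows, "learning"),
        "reusability_rate": share(rows, "reused"),
        "iteration_rate": share(rows, "iterated"),
    }


program = operational_metrics(experiments)
# Slice by metadata by filtering the log first, here by the source of the idea.
analytics_only = operational_metrics(
    [r for r in experiments if r["source"] == "analytics"]
)
```

Filtering the log before calling `operational_metrics` is one simple way to slice by metadata; any other field you keep on your experiments could be used the same way.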
Every experiment is different! You may not follow these principles to a T, but your program should have a strong and consistent point of view on how to define primary metrics to better understand your experiment results and learnings. Let us know how you might approach this differently, what you’ve seen to be a success in defining your experiment metrics and how you have approached analyzing revenue impact in the comments below!