June 16, 2022

Planning your experiment metrics: why success criteria are key

We have outlined tips and an experiment example to help you think through your success criteria and the metrics you use to measure an individual experiment's success. Read more below.

Optimizely Team


Back in 2019, we wrote a blog post advocating against using revenue as your primary metric for every experiment. And generally, we still feel this approach is the best way to consistently gain statistically significant results and clearer learnings.

But a lot changes over the years (stating the obvious!). As we continued to build experimentation programs with customers who focus on different metrics, we realized that the post above fell short. The ethos was correct, but the guidance was lacking and not applicable to everyone.

One thing we clearly missed was thinking past the primary metric. It's not as simple as choosing a single metric and anchoring an experiment's success solely against it. Our products and experiences are complex. Often, we are balancing multiple behaviors to measure the success of a variation. Metrics also often influence each other, and different stakeholders may care about different metrics. We must think about success criteria, not just a single metric.

But what do we mean by success criteria? The obvious part is clearly defining the metrics you will use to validate the hypothesis. What is often overlooked is setting, upfront, what acceptable performance is for each metric to deem the experiment a win. Is it a statistically significant uplift on one metric? Two metrics? You may have multiple primary success metrics in that case! It could also mean declaring an inconclusive result acceptable for secondary or monitoring metrics.
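
To make that idea concrete, here is a minimal sketch of writing criteria down before the experiment runs, as data the team can review rather than prose buried in a brief. The metric names and roles below are hypothetical, not an Optimizely API:

```python
# A hypothetical upfront success-criteria spec: every metric gets a role
# and an agreed definition of acceptable performance before launch.
success_criteria = {
    "checkout_starts":     {"role": "primary",    "win": "stat sig uplift"},
    "revenue_per_visitor": {"role": "primary",    "win": "stat sig uplift"},
    "bounce_rate":         {"role": "monitoring", "win": "inconclusive is acceptable"},
}
```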

Below, we have outlined tips and an experiment example to help you think about your success criteria and the metrics you may apply to measure an individual experiment's success!

Revenue is important

You are managing the experimentation program at your company. The end of the year comes around and leadership asks you, “How did we do?” As we know, (most) sound decisions at a company are made based on financial impact. Experimentation should be no different.

Making sure you track revenue and/or your final conversion metric on every experiment is paramount, even if the experiment is multiple steps away from that conversion. This always-on metric allows you to more confidently aggregate performance at the end of a given time period.

Tip: Revenue should be the primary or secondary metric most of the time, paired with a second primary metric that speaks to the specifics of the behavior and explains why revenue would be up, down, or flat.
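
As a sketch of what that always-on tracking can look like, here is one way to record a purchase, assuming the Optimizely Python SDK's track() call and its reserved revenue event tag (recorded in cents). The SDK key, event key, and visitor ID are placeholders:

```python
# A minimal sketch of always-on revenue tracking with the Optimizely
# Python SDK. "purchase" is a hypothetical event key that would need to
# exist in your project; the reserved "revenue" tag is counted in cents.
from optimizely import optimizely

optimizely_client = optimizely.Optimizely(sdk_key="YOUR_SDK_KEY")  # placeholder

# Fire the revenue event for every experiment, even when the tested
# experience sits several steps before checkout.
optimizely_client.track(
    event_key="purchase",          # hypothetical event key
    user_id="visitor-123",         # placeholder visitor id
    event_tags={"revenue": 4200},  # $42.00, recorded in cents
)
```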

Traffic volume

Experimenting on a low-traffic site or page can be a bear. You may look to take bigger swings in your variations, then realize you need more developers (an always-on challenge!). You may look to run more targeted experiments, then realize you are not set up to scale personalization like that. Low traffic is an experimenter's Pandora's box.

Lower-traffic sites or pages are where we strongly advocate for our original position on revenue: it should not be a primary success criterion. Microconversions (e.g., a CTA click) or vanity metrics (e.g., bounce rate) are ideal in these scenarios. They show progressive movement of your visitors toward that revenue/final conversion in experiments that will likely take much longer to show a down-funnel impact.

Tip: If constrained on traffic, focus on smaller conversions and behaviors as your primary metric of success, leaving revenue/final conversion as a metric you just track.
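
To see why, a back-of-the-envelope power calculation helps. The sketch below uses the standard two-proportion sample-size formula at 95% confidence and 80% power; the baseline rates are made up for illustration:

```python
# Rough visitors-per-variation needed to detect a relative lift with a
# two-proportion z-test (alpha = 0.05 two-sided, power = 0.80).
from scipy.stats import norm

def visitors_per_variation(baseline, relative_lift, alpha=0.05, power=0.80):
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# A rare final conversion (2% purchase rate) vs. a common microconversion
# (20% CTA click rate), both powered to detect a 5% relative lift:
print(round(visitors_per_variation(0.02, 0.05)))  # ~315,000 per variation
print(round(visitors_per_variation(0.20, 0.05)))  # ~25,600 per variation
```

On a page seeing 5,000 visitors a week, the revenue test above would need well over two years to conclude; the microconversion test, around ten weeks.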

Bigger bets

What has been awesome to see in the experimentation space over the last decade is how many teams now use experiments to protect their big ideas. If you are going to invest the time, money, and resources to build that big feature, why not vet it before releasing it to the masses?

Oftentimes when you are ready to roll out a bigger bet, you have iteration options readily available based on what you learn: more user testing you can draw on, or design reviews. Using revenue/final conversion as your primary success metric can help indicate whether you need to take that big idea back before releasing it in full. You may use this metric as a positive indicator: if there is a statistically significant uplift, roll it out in full. Even an inconclusive result shows that the feature did no harm. And of course, a statistically significant loss indicates you need to go back to one of those iteration options.

Tip: When using experimentation to validate the bigger feature bets you are making, focus on a primary success metric of revenue or the conversion(s) that are closest to driving revenue.
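
As a sketch, that rollout logic can be written as a simple decision rule on the final-conversion readout. The function and labels here are ours, not a prescribed Optimizely workflow:

```python
# Hypothetical decision rule for a big-bet experiment, following the logic
# above: a stat sig uplift ships, an inconclusive result ships (no harm),
# and a stat sig loss goes back for iteration.
def rollout_decision(relative_lift: float, stat_sig: bool) -> str:
    if stat_sig and relative_lift > 0:
        return "roll out in full"            # validated win
    if stat_sig and relative_lift < 0:
        return "go back and iterate"         # measurable harm
    return "roll out in full (did no harm)"  # inconclusive result

print(rollout_decision(0.03, True))    # roll out in full
print(rollout_decision(-0.02, True))   # go back and iterate
print(rollout_decision(0.01, False))   # roll out in full (did no harm)
```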

----------------------

Let’s look at a recent example and how a product manager designed their success criteria. This product manager oversees the product detail page (PDP) for the website of a massive online retailer.

Hypothesis: If we move the product ratings closer to the product title, then we will improve both add-to-cart rate and revenue.

Success Criteria:

Metric                        Definition                     Success Criteria
-----------------------------------------------------------------------------
Add-to-Cart Rate (primary)    Increase in rate               Stat sig uplift
Revenue (primary)             Increase in revenue/visitor    Stat sig uplift
Conversion Rate               Increase in rate               Uplift or inconclusive
Exit Rate                     Decrease in rate               Decrease or inconclusive
Color & Size Clicks           Increase in rate               Uplift or inconclusive

There are a few great teaching points from this example:

  • Defining multiple metrics as the primary success criteria, and noting them clearly in the hypothesis statement
  • Validating a change that would span thousands of products against revenue success before starting the build work
  • Including additional metrics beyond the primary success criteria to inform iteration and answer “why did we get this result on the primary metrics?”
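
Building on the upfront-spec idea from earlier, here is a hypothetical sketch of encoding this PDP table as data and checking a result readout against it, so “did we win?” is answered by the criteria the team agreed to upfront. All names and numbers are illustrative:

```python
# Encode the PDP success criteria as data and evaluate a readout against
# them. Roles, directions, and results below are illustrative only.
from dataclasses import dataclass

@dataclass
class Criterion:
    metric: str
    role: str             # "primary" or "secondary"
    direction: str        # "increase" or "decrease"
    needs_stat_sig: bool  # primaries require stat sig; others may be inconclusive

CRITERIA = [
    Criterion("add_to_cart_rate",    "primary",   "increase", True),
    Criterion("revenue_per_visitor", "primary",   "increase", True),
    Criterion("conversion_rate",     "secondary", "increase", False),
    Criterion("exit_rate",           "secondary", "decrease", False),
    Criterion("color_size_clicks",   "secondary", "increase", False),
]

def met(c: Criterion, lift: float, stat_sig: bool) -> bool:
    """True when the observed lift satisfies this criterion."""
    right_way = lift > 0 if c.direction == "increase" else lift < 0
    if c.needs_stat_sig:
        return right_way and stat_sig
    # Secondary metrics pass when moving the right way or inconclusive.
    return right_way or not stat_sig

# Example readout: metric -> (relative lift, reached stat sig?)
results = {
    "add_to_cart_rate":    (0.04,  True),
    "revenue_per_visitor": (0.02,  True),
    "conversion_rate":     (0.01,  False),
    "exit_rate":           (-0.01, False),
    "color_size_clicks":   (0.03,  False),
}

win = all(met(c, *results[c.metric]) for c in CRITERIA)
print("Win: roll out." if win else "Not a win: iterate.")
```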

 

----------------------

 

How do you and your team set primary metrics and success criteria for your experiments? Are there frameworks you rely on, based on the metrics you are looking to improve?

About the author