This post originally appeared on the BBC Data Science Blog.
When speaking of optimization, most of us will think about increasing conversions and revenue in e-commerce, otherwise known as CRO (conversion rate optimization). More and more though, media and services brands are using experimentation as a means for increasing customer engagement and fostering loyalty; focusing on retention instead of just acquisition. Aside from targeted messaging to reduce churn, loyalty is gained by making the product or service a joy to use. With so much content available now on so many different platforms, products need to be first class in order to grow and retain their user base and the BBC is no exception to this.
“We need to continue to do brilliant things on our traditional channels and services. But also work to be truly outstanding in the digital space — where our audiences increasingly are, where we face huge competition for their attention, and where we have the chance to serve them in incredible new ways”
Tony Hall, BBC Director General
Experimentation is one of the many research methods at our disposal for putting the audience at the heart of our digital products. A/B testing enables us to increase engagement through data-driven product development and to ensure that we are creating the best possible experiences for the audience.
When Success isn’t Conversions
In e-commerce, the choice of primary metric in A/B tests is fairly straightforward as generally conversions and/or value must increase to see a return on investment. When looking to improve audience engagement through experimentation, the choice of metric can be a little bit more complex. Each product has a set of KPIs that experiments aim to increase, but we also focus on driving metrics closer to the changes such as increasing views of a specific episode or article. The BBC seeks to increase the time an audience member spends with us so that they are getting as much value as possible from the licence fee. This is therefore where we focus our effort and what our experimentation programme is aiming to increase overall.
Reporting within Optimizely
While our primary metric for most experiments should be time spent, we are currently using proxies for it to conclude tests. We reliably track time spent in our analytics, but don’t currently have this metric set up within Optimizely. For example, on iPlayer we track plays and completes within Optimizely and use them to conclude the test, and then measure time spent through our analytics integration. We do this because Optimizely’s Stats Engine provides a more advanced mechanism for checking the statistical significance of tests, for two reasons:
1) Sequential testing
Stats Engine enables us to continuously look at the results of our tests and make valid decisions about whether to conclude the test or leave it to run for longer. For the same analysis in our analytics tool we would have to calculate test duration before starting and only check significance at the end of the committed test duration.
The problem with this calculation is that you need to know what you are expecting the result of your test to be before you perform it. If you are focused on increasing revenue, you may know what uplift you require the variant to cause in order for you to invest in developing the changes. When the aim of your test is to increase engagement and learn about your audience, the minimum detectable change can sometimes be calculated but sometimes is just a stab in the dark. If the change is overestimated, the test will be too short and you could miss a smaller but still real uplift; if the change is underestimated, your test is inefficient because you have run it for longer than you need to establish the results. The calculation for engagement metric tests is also more complex than for conversion rates, as the variance of the number of events fired per browser needs to be considered, which can add even more uncertainty.
2) False Discovery Rate Control
At the BBC we run our experiments to 95% significance. This means that when a change has no real effect, there is still a 1 in 20 chance that a given comparison will come out as statistically significant: a false positive. If we measure several metrics across several variants in our analytics tool, the chance of seeing at least one false positive is therefore much higher than 5%. Instead of controlling the error rate for a single comparison, Optimizely uses False Discovery Rate Control, a method that keeps the expected proportion of false positives among the significant results below 5% for the whole experiment.
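To make the multiple comparisons problem concrete, here is a minimal Python sketch. It first computes the chance of at least one false positive across several independent comparisons, then applies the textbook Benjamini-Hochberg procedure, one standard way of controlling the false discovery rate. This is an illustration only: it is not Optimizely's implementation, and the p-values are made up.

```python
def family_wise_error(alpha: float, comparisons: int) -> float:
    """Chance of at least one false positive across independent comparisons
    when none of the changes has a real effect."""
    return 1 - (1 - alpha) ** comparisons


def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Return, for each p-value, whether it is declared significant
    while controlling the false discovery rate at level q."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Made-up p-values for 3 metrics across 2 variants.
p_values = [0.003, 0.012, 0.030, 0.041, 0.20, 0.62]
print(round(family_wise_error(0.05, len(p_values)), 3))
print(benjamini_hochberg(p_values))
```

With these numbers, four comparisons would pass a naive 5% threshold, but only the two smallest p-values survive the correction.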
Reporting on Experiments from our Analytics Tool
While we keep our main metrics for the test within Optimizely, we have set up an integration with our analytics tool so that we can take advantage of the detailed time spent tracking we have there. We also use it to dive deeper into test results and understand what the main drivers behind the results were, through looking at metrics like:
· Article views broken down by category
· Plays of episodes which have trailers
· Use of onward journeys
To integrate with our analytics, we send an event at the point of experiment entry with the experiment and variant names.
For conversion rates, we can simply plug the number of browsers and unique conversions into an Excel calculator to check for a statistically significant difference. However, most of our metrics are total counts or continuous variables, where the variance between browsers is a contributing factor to significance. As we often have over 1 million browsers in each of our tests, our files are too large for Excel, so we use R to calculate significance.
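The conversion rate check needs only four numbers, which is why a spreadsheet is enough. As a rough sketch of what such a calculator does, the Python below runs a standard two-proportion z-test. We are assuming this is the kind of test the calculator performs; the function name and the input numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rate."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical numbers: unique conversions and browsers per variant.
z, p = two_proportion_z_test(conv_a=10_000, n_a=100_000, conv_b=10_400, n_b=100_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

No per-browser data is required here, because the variance of a conversion rate follows directly from the rate itself. That is not true of counts or time spent, which is where the per-browser files come in.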
Key Features of Calculating Significance Using R
Don’t use segments
A common pitfall in setting up a test report in analytics is to organise your data into segments based on which experience each browser is in. Depending on your options for setting up segments, you may be including actions that the browser sent before activating the test, and therefore including irrelevant data. To get around this, we apply the test variant label to all events after it is first sent and use it as a parameter in the report, so that only actions after test activation are included.
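The idea can be sketched in a few lines of Python. The event layout below is hypothetical (it is not our analytics schema): each event carries a variant label that is empty until the browser has entered the experiment, and only labelled events are counted.

```python
from collections import defaultdict

# Hypothetical event log: (browser_id, timestamp, event_name, variant_label).
# variant_label is None until the browser has entered the experiment.
events = [
    ("b1", 1, "article_view", None),  # before activation: must be excluded
    ("b1", 2, "experiment_entry", "control"),
    ("b1", 3, "article_view", "control"),
    ("b2", 1, "experiment_entry", "variant"),
    ("b2", 2, "article_view", "variant"),
    ("b2", 3, "article_view", "variant"),
]


def views_per_browser(events, metric="article_view"):
    """Count metric events per (variant, browser), using only events that
    carry a variant label, i.e. events sent after test activation."""
    counts = defaultdict(int)
    for browser_id, _timestamp, name, variant in events:
        if variant is None:
            continue  # sent before the browser entered the experiment
        key = (variant, browser_id)
        counts[key] += 0  # register every activated browser, even with zero views
        if name == metric:
            counts[key] += 1
    return dict(counts)


print(views_per_browser(events))
```

A segment built on "browsers who were in control" would have counted both of b1's article views. Filtering on the label counts only the one that happened after activation.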
Although we filter internal IPs and anything that looks like a bot from our results automatically, there are still often cases of extreme behaviour. In ordinary analytics reporting, these will make hardly any difference and can be ignored. When we are comparing the mean of one experience to another, they can affect the overall results and significance. For this reason, we remove any data points that are more than 3 standard deviations from the mean. As our data isn’t normally distributed, we also check what percentage of the data this removes: if it is more than 5%, we instead remove outliers manually.
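As a minimal sketch of this rule, assuming one value per browser, the Python below drops points beyond 3 standard deviations and reports whether the 5% threshold was exceeded. The function name and the data are made up for illustration.

```python
from statistics import mean, pstdev


def remove_outliers(values, n_sd=3.0, max_removed=0.05):
    """Drop points more than n_sd standard deviations from the mean.

    Returns (kept_values, share_removed, needs_manual_review)."""
    centre, spread = mean(values), pstdev(values)
    kept = [v for v in values if abs(v - centre) <= n_sd * spread]
    share_removed = 1 - len(kept) / len(values)
    return kept, share_removed, share_removed > max_removed


# Made-up minutes spent per browser: 49 ordinary values and one extreme one.
minutes = [i % 10 for i in range(49)] + [900]
kept, share, manual = remove_outliers(minutes)
print(len(kept), f"{share:.0%}", manual)
```

The third return value is the signal to stop and inspect the data by hand rather than trust the automatic cut.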
As shown above, our data can often be heavily skewed towards zero: for example, when looking at views of an article about hockey, where most browsers have none. As we often have over a million visitors in each variant, the sample means are approximately normally distributed even though the underlying data is not, so we can still use a t-test rather than the Mann-Whitney-Wilcoxon test that would be recommended for smaller sample sizes.
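For illustration, here is a small Python sketch of a two-sample t-test on zero-heavy data. It uses Welch's unequal-variance form and a normal approximation for the p-value, which is reasonable only because the samples are large. Both choices are our assumptions for the sketch, and the data is made up; it is not the R script we run.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t_test(sample_a, sample_b):
    """Return (t, approximate two-sided p-value) for a difference in means.

    The p-value uses the normal approximation to the t distribution,
    which is only reasonable for large samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    std_err = sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    t = (mean(sample_b) - mean(sample_a)) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p_value


# Made-up article views per browser: most browsers have zero views.
control = [0] * 800 + [1] * 150 + [5] * 50
variant = [0] * 780 + [1] * 160 + [5] * 60
t, p = welch_t_test(control, variant)
print(f"t = {t:.2f}, p = {p:.3f}")
```

Unlike the conversion rate check, this needs the value for every browser, because the variance has to be estimated from the data.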
Calculate test duration
Although wherever possible we use related metrics that are measurable within Optimizely for the conclusion of our experiments, there are some cases where we can only conclude based on time spent. An example of this sort of test is changing the live playback experience within iPlayer Radio, where time spent streaming live radio would be the only success metric. For these cases we ensure that we have calculated test duration before we begin the experiment. This is done by calculating the standard deviation of time spent per browser for a week’s worth of data — usually just the previous week, unless there has been a big event on that would mean browsers were behaving differently. We then calculate the length of time a test should run in weeks because our products have different uses at the weekend compared to during the week, which means that only running for part of the week would not give us the full picture. We only check the results for significance after the number of weeks specified by the test duration calculator.
Calculation for test duration for a t-test
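The calculation can be sketched in Python using the standard sample size formula for comparing two means, n = 2(z_α/2 + z_power)² σ² / δ² browsers per variant, rounded up to whole weeks. The 80% power, the equal traffic split and all input numbers below are assumptions made for the example, not our production settings.

```python
from math import ceil
from statistics import NormalDist


def browsers_per_variant(std_dev, min_detectable_diff, alpha=0.05, power=0.8):
    """Sample size per variant for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * std_dev ** 2 / min_detectable_diff ** 2)


def duration_in_weeks(std_dev, min_detectable_diff, weekly_browsers, variants=2, **kwargs):
    """Whole weeks needed, rounded up so weekdays and weekends are covered equally."""
    needed = variants * browsers_per_variant(std_dev, min_detectable_diff, **kwargs)
    return ceil(needed / weekly_browsers)


# Made-up inputs: standard deviation of 30 minutes per browser, a minimum
# detectable difference of 0.5 minutes, 100,000 browsers entering per week.
print(browsers_per_variant(30, 0.5))
print(duration_in_weeks(30, 0.5, weekly_browsers=100_000))
```

Because the minimum detectable difference is squared in the denominator, halving it roughly quadruples the number of browsers required. This is why a poor guess at the expected change is so costly.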
Currently we are using a data download from our analytics to run through R, but we are making this more efficient by creating an Alteryx workflow so that the data can be accessed through an API call and automatically processed at the click of a button. This will enable business analysts and product managers to calculate significance themselves instead of having to rely on the availability of an analyst.
We are also sending information about the test variant someone is in to Redshift to enable us to analyse results broken down by segments created by our data science team. This may sometimes result in a follow-up test where we target our variants based on segments of our audience.
We are also currently exploring ways of getting time spent data into Optimizely so that we can make full use of its Stats Engine for concluding tests and measure the overall impact of experimentation on our main KPI. However, as so much BBC consumption is long-form audio and video, it would require something akin to AV heartbeats for us to accurately measure it. For now, we’ll rely on our analytics integration to help us enrich our reporting and understand engagement.
Our next reporting challenge will be measuring the success of editorial testing. We are currently building functionality into Optimizely to enable editors to set up tests for images from within their editorial tool, and this will eventually be extended to titles and headlines. We aim to have a high run rate for editorial testing and want editors to be able to see not only the results of individual experiments but also the value added through testing overall.