What is A/A testing?
A/A testing uses A/B testing to test two identical versions of an experiment baseline against each other. The typical purpose of running an A/A calibration test is to validate your experiment setup. Specifically, an A/A test is a data reliability/quality assurance procedure to evaluate the implementation of all your experiment comparisons. It’s recommended to run A/A calibration tests on a semi-regular basis. The general rule of thumb is to run them quarterly. In most cases, the majority of your A/A calibration test results should show that the conversion improvement between the identical baseline pages is statistically inconclusive.
Why test identical pages?
In some cases, you may want to run an A/A test to monitor on-page conversions: tracking the number of conversions on the page under test gives you the baseline conversion rate before you begin an A/B or multivariate test.
In most other cases, the A/A test is a method of double-checking the effectiveness and accuracy of the A/B testing software. You should look to see if the software reports that there is a statistically significant (>95% statistical significance) difference between the control and variation. If the software reports that there is a statistically significant difference, that’s a problem. You will want to check that the software is correctly implemented on your website or mobile app.
Calibration test data can also provide insight into your experimentation program. Using an A/A calibration test is a great way to measure your analytics setup. Running the same variant twice in the same experiment can give you a benchmark KPI to track against. The test data should show what your average conversion rate to beat is.
Things to keep in mind with A/A testing:
When running an A/A test, it’s important to note that finding a difference in conversion rate between identical baseline pages is always a possibility. The statistical significance of your results is a probability, not a certainty. This isn’t necessarily a poor reflection on the A/B testing platform, as there is always an element of randomness when it comes to testing.
The same holds for any A/B test. At a statistical significance level of 95%, about 1 in 20 comparisons with no real underlying difference will still appear significant purely by chance. In most cases, your A/A test should report that the conversion improvement between the control and variation is statistically inconclusive, because the underlying truth is that there isn’t one to find.
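The 1-in-20 figure can be checked with a small simulation. The sketch below is illustrative only: it assumes a classical fixed-horizon two-proportion z-test, which is not necessarily what any particular A/B testing platform uses, and the function names, traffic numbers, and 10% conversion rate are all hypothetical. Both arms draw from the same true conversion rate, so every "significant" result is a false positive.

```python
import math
import random


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value for a difference in conversion rate."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # zero (or all) conversions in both arms: nothing to compare
    z = (conv_a / n_a - conv_b / n_b) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(runs=500, visitors_per_arm=1000, true_rate=0.10,
                      alpha=0.05, seed=42):
    """Return the share of simulated A/A tests that look significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        # Both arms share the same true conversion rate (identical pages).
        conv_a = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
        conv_b = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
        p_value = two_sided_p_value(conv_a, visitors_per_arm,
                                    conv_b, visitors_per_arm)
        if p_value < alpha:
            false_positives += 1
    return false_positives / runs


if __name__ == "__main__":
    print(f"Share of significant A/A tests: {simulate_aa_tests():.3f}")
```

Under these assumptions the printed share should land in the neighborhood of `alpha`, which is why a single significant A/A result is not, on its own, evidence of a broken setup.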
How does A/A testing impact conversion rates?
Because no actual change is made to the different versions in the experiment, it should not impact conversion rates. If the majority of your A/A calibration test results show a (significant) difference in conversion rates, this could indicate an issue with your experiment implementation, such as misconfigured targeting rules. Make sure to check all the targeting rules and documentation to prevent any false positives.
Should you add a second baseline to an A/B test, creating an A/A/B test?
And what about duplicate baselines and duplicate test variations, like an A/B/A/B test? These are common questions. One way to validate an A/B test could be to add a duplicate of the A variant to the experiment.
But no. You should never, ever do this. A/A calibration tests need to reside in their own separate space, their own experiment. One should assess a large distribution of A/A calibration test outputs, instead of judging performance on a single experiment that tests a single baseline versus another single baseline.
When you combine multiple baselines with test variations, you are needlessly penalizing the performance of your test variations. Said another way, multiple baselines combined with test variations will cannibalize experiment results.
For an A/B/A/B test, spackling more baselines onto an experiment does not make for a more secure or certain experience. Further, two or more baselines combined with any number of test variations expose the experimenter to a high risk of confirmation bias: they are giving the expected outcome too much importance. Optimizely discourages anyone from adding a second baseline alongside test variations, as it is often a highly misguided attempt by experimenters to shield themselves against errors.
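One way to read the "penalty" described above is as a traffic cost, and that reading is an assumption of this sketch rather than something the text spells out. With a fixed number of visitors, every extra baseline shrinks each arm, which widens the uncertainty around any comparison between two arms. The sketch uses the textbook standard error for the difference of two proportions; the function name, traffic total, and 5% baseline rate are hypothetical.

```python
import math


def standard_error_of_difference(total_visitors, arms, conversion_rate):
    """Standard error of the difference between two equally sized arms."""
    visitors_per_arm = total_visitors / arms
    variance_per_arm = conversion_rate * (1 - conversion_rate) / visitors_per_arm
    # The variances of the two compared arms add up.
    return math.sqrt(2 * variance_per_arm)


if __name__ == "__main__":
    total, rate = 60_000, 0.05  # hypothetical traffic and baseline rate
    two_arm = standard_error_of_difference(total, 2, rate)
    for label, arms in [("A/B", 2), ("A/A/B", 3), ("A/B/A/B", 4)]:
        se = standard_error_of_difference(total, arms, rate)
        print(f"{label}: {total // arms} visitors per arm, "
              f"standard error {se:.5f} ({se / two_arm:.2f}x the A/B case)")
```

The ratio depends only on the number of arms, not on the traffic or the conversion rate: it grows with the square root of the arm count, so the same real effect takes longer to detect.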
Preventing false positives in A/B testing tools, and why it’s important
Running experiments can be great for optimizing conversion rates or impacting other business-critical metrics. But if you can’t rely on the software to accurately keep track of test results, this defeats the purpose of having testing software to begin with. The results need to be:
Trustworthy - Can you trust that the test results are accurate and reflect reality?
Accurate - Making sure sample sizes are large enough and results are stable is key.
Significant results - Are the results for variant B meaningfully and consistently different from the A variant?
A/B testing and experimentation software, which allows you to run more than just A/B tests, is meant to give marketers trust in their test results. Running an A/A test tackles the first two of these points, so you know that the third, significant results, is accurate and can be trusted.
How A/A test data can help your analytics tool and vice versa
Using an A/A test is a great way to measure your analytics setup. Running the same variant twice in the same experiment gives you a benchmark KPI to track against. The test data should show the average conversion rate you need to beat.
How does your analytics tool play into that? Your analytics tool, likely Google Analytics, should already be tracking your conversion rates. So if you’re running an A/A test to measure benchmark metrics, shouldn’t those be (nearly) the same? Correct!
A/A testing is a common practice for validating a tool against itself, but also against other vendors’ tools. If you already know your Google Analytics conversion rates are being accurately tracked, your A/A test should show (nearly) the same rates.
Help! My A/B test tools and analytics tools are showing different conversion rates after an A/A test
Make sure you run some common troubleshooting steps:
Check the sample size of your test. Although this test should not reach statistical significance, because there is no real difference between the two variants to measure, it’s still important to run the test on a sizable number of visitors to validate its accuracy.
Check the targeting rules for both tools. Most experimentation tools have to run at the top of the page head, or run server-side, while your analytics tool might run in something like Google Tag Manager, so the rules that determine which pages each tool fires on could differ. Make sure to test and check setups and coverage across both.
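The two checks above can be sketched as a quick comparison of what each tool reports for the same period. This is a minimal illustration, not a feature of any specific product: the `ToolReport` class, the `compare_reports` function, the 5% tolerance, and the sample numbers are all hypothetical. A large gap in visitor counts hints at a targeting or coverage difference, while a gap in conversion rate despite similar coverage hints at a tracking difference.

```python
from dataclasses import dataclass


@dataclass
class ToolReport:
    """Totals exported from one tool for the A/A test period."""
    name: str
    visitors: int
    conversions: int

    @property
    def conversion_rate(self):
        return self.conversions / self.visitors


def relative_gap(value, reference):
    """Relative difference between two positive numbers, measured against reference."""
    return abs(value - reference) / reference


def compare_reports(experiment, analytics, tolerance=0.05):
    """Compare coverage and conversion rate between the two tools."""
    visitor_gap = relative_gap(experiment.visitors, analytics.visitors)
    rate_gap = relative_gap(experiment.conversion_rate, analytics.conversion_rate)
    return {
        "visitor_gap": visitor_gap,
        "rate_gap": rate_gap,
        "coverage_ok": visitor_gap <= tolerance,  # do both tools fire on the same pages?
        "rates_ok": rate_gap <= tolerance,        # do both tools count conversions alike?
    }


if __name__ == "__main__":
    experiment = ToolReport("experimentation tool", visitors=10_000, conversions=410)
    analytics = ToolReport("analytics tool", visitors=12_500, conversions=500)
    for key, value in compare_reports(experiment, analytics).items():
        print(f"{key}: {value}")
```

In this made-up example the conversion rates are close but the visitor counts are not, which would point toward the targeting rules rather than the conversion tracking.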
Good minimum sample sizes for A/A tests
Large sample sizes are not always needed for A/A calibration tests, because you’re not actually changing anything in the variants. For instance, running an A/A calibration test on the homepage is an excellent idea, as this is among the top-visited pages for many websites and could quickly help identify any issues with your setup. Using a less important landing page is also an option, but always take external factors into account. If traffic on this page fluctuates a lot, for instance because of paid budgets, it might not be the best page to run the test on. You’re looking for a page with stable conversion rates to benchmark against.
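A rough way to compare candidate pages for stability is to look at how much the daily conversion rate moves around its average. The sketch below uses the coefficient of variation (standard deviation divided by mean) as that measure; this choice of metric, the function name, and the sample data are assumptions made for illustration, and the text above does not prescribe a particular stability measure or cutoff.

```python
from statistics import mean, pstdev


def daily_rate_variation(daily_visitors, daily_conversions):
    """Coefficient of variation (std dev / mean) of the daily conversion rate."""
    rates = [
        conversions / visitors
        for visitors, conversions in zip(daily_visitors, daily_conversions)
        if visitors > 0  # skip days without traffic
    ]
    average = mean(rates)
    if average == 0:
        raise ValueError("no conversions recorded; variation is undefined")
    return pstdev(rates) / average


if __name__ == "__main__":
    # Hypothetical week of data for two candidate pages.
    homepage = daily_rate_variation(
        [5000, 5200, 4900, 5100, 5050, 4800, 4950],
        [250, 262, 243, 257, 251, 238, 249],
    )
    paid_landing_page = daily_rate_variation(
        [800, 3000, 700, 2800, 750, 3200, 900],
        [40, 60, 36, 50, 39, 58, 47],
    )
    print(f"Homepage variation: {homepage:.2f}")
    print(f"Paid landing page variation: {paid_landing_page:.2f}")
```

A lower value means a steadier conversion rate from day to day, which makes the page a better candidate to benchmark against.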
Optimizely Experiment stats engine and A/A testing:
When running an A/A test with Web or Feature Experimentation, in most cases, you can expect the results from the test to be inconclusive, meaning the conversion difference between identical variations will not reach statistical significance. In fact, the share of A/A tests showing inconclusive results will be at least as high as the significance threshold set in your Project Settings (90% by default).
In some cases, however, you might see that one variation is outperforming another or a winner is declared for one of your goals. The conclusive result of this experiment occurs purely by chance, and should happen in at most 10% of cases if you have set your significance threshold to 90%. If your significance threshold is higher (say 95%), your chances of encountering a conclusive A/A test are even lower (5%).
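When you review a batch of A/A tests, it helps to know how many conclusive results chance alone could plausibly produce. The sketch below treats each A/A test as an independent trial whose chance of a conclusive result equals one minus the significance threshold. Both the independence and the use of that upper bound as the exact rate are simplifying assumptions, and the function name is hypothetical.

```python
from math import comb


def prob_at_least(conclusive, tests, false_positive_rate):
    """P(X >= conclusive) for X ~ Binomial(tests, false_positive_rate)."""
    return sum(
        comb(tests, k)
        * false_positive_rate ** k
        * (1 - false_positive_rate) ** (tests - k)
        for k in range(conclusive, tests + 1)
    )


if __name__ == "__main__":
    tests = 20
    false_positive_rate = 1 - 0.90  # 90% significance threshold
    print(f"Expected conclusive results: {tests * false_positive_rate:.1f}")
    for conclusive in (1, 3, 6):
        chance = prob_at_least(conclusive, tests, false_positive_rate)
        print(f"P(at least {conclusive} conclusive out of {tests}): {chance:.4f}")
```

If the number of conclusive A/A results you actually observe has a very small probability under this model, that is a stronger signal of an implementation problem than any single conclusive test.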