Statistical significance is a measure of how unusual your experiment results would be if there were actually no difference in performance between your variation and baseline and the discrepancy in lift was due to random chance alone.
Website owners, marketers, and advertisers have recently become interested in making sure their A/B test experiments (e.g., conversion rate A/B testing, ad copy changes, email subject line tweaks) reach statistical significance before drawing conclusions.
On the Optimizely results page, a significance level of 90% or above is a statement of statistical surprise: something very unusual has happened if, in fact, there is no difference between a variation and the baseline. However, if your experiment fails to meet or exceed your chosen significance threshold, that does not show there is no treatment effect. It only means the experiment has not yet gathered enough evidence to detect one.
Statistical significance is most practically used in hypothesis testing. For example, you want to know whether changing the color of a button on your website from red to green will result in more people clicking on it. The assumption that the color change makes no difference to the click rate is called your “null hypothesis,” and the current red button serves as your experiment baseline. The claim that the green button changes the click rate is known as your “alternative hypothesis.”
To interpret the observed difference in a statistical significance test, you will want to pay attention to two outputs: the p-value and the confidence interval.
The p-value is the probability of seeing evidence as strong as or stronger than what you observed in favor of a difference in performance between your variation and baseline, calculated assuming there actually is no difference between them and any lift observed is due entirely to random chance. P-values do not communicate how large or small your effect size is or how important the result might be.
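As a rough illustration of the definition above, the following sketch computes a two-sided p-value for a difference in conversion rates using a classical pooled two-proportion z-test. This is a textbook fixed-sample method chosen for illustration, not a description of how Optimizely computes its results, and the function name and the visitor and conversion counts are made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test).

    conv_a / n_a: conversions and visitors for the baseline (hypothetical inputs).
    conv_b / n_b: conversions and visitors for the variation.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both groups share one conversion rate,
    # so the standard error is built from the pooled rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    # Probability of a z-score at least this extreme in either direction,
    # assuming there is truly no difference.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Illustrative numbers only: 4% vs. 5% conversion on 5,000 visitors each.
p = two_proportion_p_value(200, 5000, 250, 5000)
print(f"p-value: {p:.4f}")
```

A small p-value says the observed gap would be surprising if the two versions truly performed the same. It says nothing about how large or how valuable the gap is.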
A confidence interval is an estimated range of values that is likely, but not guaranteed, to include the unknown true value summarizing your target population. If the experiment were replicated numerous times, intervals built the same way would contain that true value in roughly the stated percentage of replications. An interval consists of a point estimate (a single value derived from your statistical model of choice) and a margin of error around that point estimate. Best practice is to report confidence intervals alongside your statistical significance results, as they offer information about the observed effect size of your experiment.
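To make the point estimate and margin of error concrete, here is a minimal sketch that builds a normal-approximation (Wald) confidence interval for the difference between two conversion rates. The function name, the 90% default, and the sample counts are assumptions for illustration. Other interval methods exist, and this is not presented as the one Optimizely uses.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.90):
    """Wald interval for (variation rate - baseline rate).

    Returns (point_estimate, lower_bound, upper_bound).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    point_estimate = rate_b - rate_a
    # Unpooled standard error: each group contributes its own variance.
    standard_error = sqrt(
        rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b
    )
    # Critical value for a two-sided interval at the requested confidence.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin_of_error = z_crit * standard_error
    return (
        point_estimate,
        point_estimate - margin_of_error,
        point_estimate + margin_of_error,
    )


# Illustrative numbers only: 4% vs. 5% conversion on 5,000 visitors each.
estimate, lower, upper = diff_confidence_interval(200, 5000, 250, 5000)
print(f"difference: {estimate:.4f}, interval: [{lower:.4f}, {upper:.4f}]")
```

If the interval excludes zero, the data are inconsistent with "no difference" at that confidence level. The width of the interval shows how precisely the effect size has been estimated.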
Your metrics and numbers can fluctuate wildly from day to day. Statistical analysis provides a sound mathematical foundation for making business decisions and controlling the rate of false positives. Whether a result reaches statistical significance depends on two key variables: sample size and effect size.
Sample size refers to how large the sample for your experiment is. The larger your sample size, the more confident you can be in the result of the experiment (assuming that it is a randomized sample). If you are running tests on a website, the more traffic your site receives, the sooner you will have a large enough data set to determine if there are statistically significant results. Sampling error grows as the sample shrinks, so results from a sample that is too small are unreliable.
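The effect of sample size can be sketched with the normal-approximation margin of error for a single conversion rate. The 5% rate, the sample sizes, and the function name below are hypothetical values picked for illustration.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(rate, n, confidence=0.90):
    """Normal-approximation margin of error for one observed conversion rate."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_crit * sqrt(rate * (1 - rate) / n)


# The same observed 5% rate is far less certain with 100 visitors
# than with 10,000 visitors.
for visitors in (100, 1_000, 10_000):
    print(visitors, round(margin_of_error(0.05, visitors), 4))
```

Because the margin shrinks with the square root of the sample size, quadrupling traffic only halves the uncertainty.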
Effect size refers to the magnitude of the difference in outcomes between the two sample sets and communicates the practical significance of your results.
Beyond these two factors, a key thing to remember is the importance of randomized sampling. If traffic to a website is split evenly between two pages, but the sampling isn’t random, it can introduce errors due to differences in behavior of the sampled population.
For example, if 100 people visit a website and all the men are shown one version of a page and all the women are shown a different version, then a valid comparison between the two is not possible, even if the traffic is split 50-50, because the difference in demographics, rather than the page itself, could explain the variations in the data. A truly random sample is needed before a statistically significant result can be attributed to the change being tested.
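One common way to keep assignment independent of visitor traits is to bucket visitors by a hash of an identifier. The sketch below is a hypothetical illustration of that idea. The function name, the experiment key, and the visitor ID format are all assumptions, and it is not a description of how any particular testing platform assigns traffic.

```python
import hashlib


def assign_variation(visitor_id, experiment_key, variations=("baseline", "variation")):
    """Deterministically assign a visitor to a variation via a hash.

    The hash depends only on the IDs, not on gender, device, or any other
    trait, so the split behaves like a random one while staying stable
    for a returning visitor.
    """
    key = f"{experiment_key}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variations)
    return variations[bucket]


# Hypothetical visitor IDs; count how the split comes out.
counts = {"baseline": 0, "variation": 0}
for i in range(10_000):
    counts[assign_variation(f"visitor-{i}", "button-color-test")] += 1
print(counts)
```

Including the experiment key in the hash means the same visitor can land in different groups across different experiments, which avoids carrying one experiment's split into the next.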
In the pharmaceutical industry, researchers use statistical test results from clinical trials to evaluate new drugs. Research findings from significance testing indicate drug effectiveness, which can drive investor funding and make or break a product.
A strict set of guidelines is required to get valid results from experiments run with classical statistics: set a minimum detectable effect and sample size in advance, don’t peek at results, and don’t test too many goals or variations at the same time. These guidelines can be cumbersome and, if not followed carefully, can produce severely distorted and dubious results.
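Setting a minimum detectable effect and sample size in advance can be sketched with the standard normal-approximation formula for comparing two proportions. The function name, the default significance level and power, and the example rates are assumptions made for illustration. Dedicated sample size calculators may use slightly different formulas.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, mde_relative, alpha=0.10, power=0.80):
    """Approximate visitors needed per variation for a two-sided test.

    baseline_rate: current conversion rate, e.g. 0.05 for 5%.
    mde_relative: minimum detectable effect as a relative lift, e.g. 0.20 for +20%.
    alpha: tolerated false positive rate (0.10 corresponds to 90% significance).
    power: chance of detecting the effect if it is really there.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Illustrative numbers only: 5% baseline, smaller lifts need far more traffic.
for lift in (0.20, 0.10, 0.05):
    print(f"{lift:.0%} relative lift:", sample_size_per_variation(0.05, lift))
```

In a classical fixed-sample test, this number is decided before the experiment starts and the result is read only once it has been reached.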
Fortunately, you can easily determine the statistical significance of your experiments using Stats Engine, the advanced statistical model built into Optimizely. Stats Engine operates by combining sequential testing and false discovery rate control to give you trustworthy results faster, regardless of sample size. Updating in real time, Stats Engine computes always-valid inference, boosting your confidence in making the right decision for your company and helping you avoid pitfalls along the way.
Stats Engine was created to address the common pitfalls of classical statistics so that you can test more in less time. It adjusts values as needed in real time and delivers reliable results quickly, helping you make statistically sound decisions.
Start running your tests with Optimizely today and be confident in your decisions.