AB Testing And Statistical Significance

When you or your client wants to test a completely new element, and the result may affect sales or conversions, an AB test is usually the best approach. Unfortunately, AB testing needs a lot of visitors to work properly, so we often end up making decisions based on the results of the first few days. This is a bad approach: even though the new element might initially perform better, you might find that in the long run the original delivers more conversions.

You may choose to test anything on a page, from single elements (like headings, text content, images, calls to action, offers, colour) to complete layout changes. The aims of these tests can differ:

  • Trying to increase ROI by purposely optimizing a page to get more sales
  • Trying to increase sign-ups/user interaction by adding new elements to the page
  • Making sure you don’t negatively affect a page by introducing some brand changes

As you can see from the last one, the aim of a test isn’t always to find a winner. Sometimes it is good enough to make sure that B is not worse than A.

Statistical Significance

Statistical significance is, roughly, the point at which your tool can tell you with reasonable confidence that one element really is better than the other. It is usually expressed as a confidence level: the higher its value, the lower the chance that the test result happened by chance. Loosely speaking, if you have a confidence level of 90%, then a difference this large would show up only about 10% of the time if the two elements actually performed the same. To explain this, imagine this scenario:

You have two tables at home. You grab a coin and flip it 100 times on the first table: 53 heads. You then do the same thing on the second table and get 48 heads. Does this mean that you are more likely to get heads when flipping a coin on the first table? Of course not! The difference is completely random, and the gap in the share of heads would shrink towards nothing if you flipped the coin a million times per table.
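
To see this randomness for yourself, here is a minimal Python sketch of the two-table experiment. The function names (`count_heads`, `simulate_tables`) are made up for this illustration, and the coin is assumed to be perfectly fair. Each run will most likely give a different pair of numbers, even though nothing about the "tables" differs.

```python
import random


def count_heads(flips, rng):
    """Flip a fair coin `flips` times and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(flips))


def simulate_tables(flips=100, seed=None):
    """Return the heads count for two 'tables' that use the same fair coin.

    `seed` is only there to make a run repeatable; leave it as None
    to get different results every time.
    """
    rng = random.Random(seed)
    return count_heads(flips, rng), count_heads(flips, rng)


if __name__ == "__main__":
    # Repeat the experiment a few times: the gap between the tables
    # changes from run to run, because it is nothing but chance.
    for run in range(5):
        table_1, table_2 = simulate_tables()
        print(f"run {run + 1}: table 1 = {table_1} heads, table 2 = {table_2} heads")
```

Calling `simulate_tables(flips=1_000_000)` instead should give two counts that are very close to each other as a share of the total flips.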

Similarly, when doing an AB test, we need to keep in mind that there is always some randomness in the results. Although one of the options might initially seem like the better one, you might later find out that its advantage was completely random.

Several big sites like Google, Amazon and Firefox are constantly using AB testing to launch big changes. The more traffic a site gets, the faster a test will reach statistical significance.

How is it calculated?

First off, you will need quite a large sample for tests to become statistically valid. The most common confidence level used is 95%. When this value is reached, we can generally assume that the test result is statistically valid.

While there are different ways to calculate it, the simplest (which does not consider the sample size) is as follows: the difference between the two results must be larger than the square root of the sum of the two results. For example, if A got 25 conversions and B got 29, then the total is 54.

The difference between the two values is 4. The square root of 54 is about 7.35.

Since the difference (4) is smaller than this value, the result is not statistically valid. In reality, this check is not enough, since you must also consider the size of the sample, which can make an enormous difference. There are many formulas that you can use, and different companies have different levels at which they can be confident about a change they just tested.
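
The quick check above can be sketched in a few lines of Python. The function name `passes_rule_of_thumb` is made up for this illustration, and the check is only the rough rule described here, not a full significance test, since it ignores the sample size.

```python
import math


def passes_rule_of_thumb(conversions_a, conversions_b):
    """Rough check: is the difference larger than the square root of the sum?

    Only the two conversion counts are used; the number of visitors
    is ignored, which is why this is just a first sanity check.
    """
    difference = abs(conversions_a - conversions_b)
    threshold = math.sqrt(conversions_a + conversions_b)
    return difference > threshold


# The example from the text: A got 25 conversions, B got 29.
print(passes_rule_of_thumb(25, 29))  # False: 4 is not larger than ~7.35
```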

I won’t go into the details of the formulas myself; you can have a look at the wikiHow page, which does a good job of explaining them. To measure statistical significance, you will definitely need the total number of visitors and the total number of conversions for each item you’re testing.
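
As an illustration of how visitors and conversions come together, here is a minimal sketch of one common approach, the two-proportion z-test. This is my own choice of formula for the example and not necessarily the one the wikiHow page describes. The function name and the visitor numbers (1,000 per variant) are made up; only the 25 and 29 conversions come from the example above.

```python
import math


def two_proportion_z_test(visitors_a, conversions_a, visitors_b, conversions_b):
    """Return (z, p_value) for the difference between two conversion rates.

    Uses the pooled two-proportion z-test with a two-sided p-value.
    A p-value below 0.05 corresponds to the usual 95% confidence level.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Conversion rate of both variants combined (assumes no real difference).
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        # No conversions at all (or everyone converted): nothing to compare.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical traffic: 1,000 visitors per variant, 25 vs 29 conversions.
z, p_value = two_proportion_z_test(1000, 25, 1000, 29)
print(f"z = {z:.2f}, p-value = {p_value:.2f}")
print("significant at 95%" if p_value < 0.05 else "not significant at 95%")
```

With these made-up visitor numbers the p-value comes out well above 0.05, so the test agrees with the quick check: 29 against 25 conversions is not enough to call a winner.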

I suggest you play around with the following pages to learn more:


