Appropriate normality tests for small samples

So far, I've been using the Shapiro-Wilk statistic in order to test normality assumptions in small samples.

Could you please recommend another technique?

The fBasics package in R (part of Rmetrics) includes several normality tests, covering many of the popular frequentist tests -- Kolmogorov-Smirnov, Shapiro-Wilk, Jarque–Bera, and D'Agostino -- along with a wrapper for the normality tests in the nortest package -- Anderson–Darling, Cramer–von Mises, Lilliefors (Kolmogorov-Smirnov), Pearson chi–square, and Shapiro–Francia. The package documentation also provides all the important references. Here is a demo that shows how to use the tests from nortest.

One approach, if you have the time, is to use more than one test and check for agreement. The tests vary in a number of ways, so it isn't entirely straightforward to choose "the best". What do other researchers in your field use? This can vary and it may be best to stick with the accepted methods so that others will accept your work. I frequently use the Jarque-Bera test, partly for that reason, and Anderson–Darling for comparison.
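As a rough sketch of the "more than one test" idea, the snippet below runs Shapiro-Wilk from base R and, if the nortest package happens to be installed, three of its tests on the same sample, then lines up the p-values. The sample `x`, the seed, and the 5% level are placeholders I made up for illustration; substitute your own data.

```r
# Sketch: run several normality tests on one small sample and compare p-values.
set.seed(1)        # arbitrary seed, only so the example is reproducible
x <- rnorm(25)     # hypothetical small sample; replace with your data

# Shapiro-Wilk ships with base R (stats package)
p_values <- c(shapiro_wilk = shapiro.test(x)$p.value)

# nortest is optional here; skip its tests if the package is not installed
if (requireNamespace("nortest", quietly = TRUE)) {
  p_values <- c(
    p_values,
    anderson_darling = nortest::ad.test(x)$p.value,
    lilliefors       = nortest::lillie.test(x)$p.value,
    shapiro_francia  = nortest::sf.test(x)$p.value
  )
}

print(round(p_values, 3))

# Which tests reject at the chosen level? Disagreement is worth a closer look.
alpha <- 0.05
print(p_values < alpha)
```

Agreement between tests is reassuring but not proof of anything; the tests are sensitive to different kinds of departure, so some disagreement is expected.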

You can look at "Comparison of Tests for Univariate Normality" (Seier 2002) and "A comparison of various tests of normality" (Yazici; Yolacan 2007) for a comparison and discussion of the issues.

It's also trivial to test these methods for comparison in R, thanks to all the distribution functions. Here's a simple example with simulated data (I won't print out the results to save space), although a fuller exposition would be required:

library(fBasics); library(ggplot2)

# normal distribution: draw a sample of 200, plot it, and test it
x1 <- rnorm(1e+06)
x1.samp <- sample(x1, 200)
qplot(x1.samp, geom="histogram")
jarqueberaTest(x1.samp)

# Cauchy distribution: same procedure
x2 <- rcauchy(1e+06)
x2.samp <- sample(x2, 200)
qplot(x2.samp, geom="histogram")
jarqueberaTest(x2.samp)

Once you have the results from the various tests over different distributions, you can compare which were the most effective. For instance, the p-value for the Jarque-Bera test above returned 0.276 for the normal distribution (failing to reject normality) and < 2.2e-16 for the Cauchy (rejecting the null hypothesis).
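A single sample per distribution says little about how effective a test is, so one way to make the comparison more systematic is to repeat the draw many times and record how often the test rejects. The sketch below does this for Shapiro-Wilk only, using base R; the sample size of 20, the 2000 replications, and the helper name `rejection_rate` are my own choices for illustration, not anything from the packages mentioned above.

```r
# Sketch: estimate how often Shapiro-Wilk rejects at the 5% level
# for small samples drawn from a normal and from a Cauchy distribution.
set.seed(42)        # arbitrary seed
n_obs  <- 20        # assumed "small" sample size
n_reps <- 2000      # simulated samples per distribution
alpha  <- 0.05

# rdist is any random generator taking the sample size as its first argument
rejection_rate <- function(rdist) {
  p <- replicate(n_reps, shapiro.test(rdist(n_obs))$p.value)
  mean(p < alpha)
}

rates <- c(normal = rejection_rate(rnorm),    # estimates the type I error rate
           cauchy = rejection_rate(rcauchy))  # estimates power against Cauchy
print(rates)
```

The same function could be wrapped around any other test that returns a p-value, which gives a like-for-like comparison at the sample size you actually have.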

For normality, Shapiro-Wilk actually has good power in fairly small samples.

The main competitor in studies that I have seen is the more general Anderson-Darling, which does fairly well, but I wouldn't say it was better. If you can clarify what alternatives interest you, possibly a better statistic would be more obvious. [edit: if you estimate parameters, the A-D test should be adjusted for that.]

[I strongly recommend against considering Jarque-Bera in small samples (which is probably better known as Bowman-Shenton in statistical circles - they studied the small-sample distribution). The asymptotic joint distribution of skewness and kurtosis is nothing like the small-sample distribution - in the same way a banana doesn't look much like an orange. It also has very low power against some interesting alternatives - for example it is powerless to pick up a symmetric bimodal distribution that has kurtosis close to that of a normal distribution.]
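To see the gap between the asymptotic and small-sample behaviour for yourself, one option is to simulate the statistic under normality and compare the chi-square(2) critical value with the simulated one. The sketch below hand-codes the usual Jarque-Bera form, n/6 * (S^2 + (K - 3)^2 / 4), with moment estimators of skewness S and kurtosis K; the function name `jb_stat`, the sample size of 20, and the replication count are assumptions of mine.

```r
# Jarque-Bera statistic from moment estimators of skewness and kurtosis
jb_stat <- function(x) {
  n    <- length(x)
  d    <- x - mean(x)
  m2   <- mean(d^2)               # second central moment (divides by n)
  skew <- mean(d^3) / m2^1.5
  kurt <- mean(d^4) / m2^2
  n / 6 * (skew^2 + (kurt - 3)^2 / 4)
}

set.seed(7)         # arbitrary seed
n_obs  <- 20        # assumed small sample size
n_reps <- 5000
alpha  <- 0.05

# Critical value implied by the asymptotic chi-square(2) distribution
crit_asymptotic <- qchisq(1 - alpha, df = 2)

# Simulated null distribution of the statistic at this sample size
jb_null <- replicate(n_reps, jb_stat(rnorm(n_obs)))

# Actual rejection rate under normality when the asymptotic cutoff is used;
# compare this with the nominal alpha
size_asymptotic <- mean(jb_null > crit_asymptotic)
print(size_asymptotic)

# Simulated critical value for this n, next to the asymptotic one
crit_simulated <- unname(quantile(jb_null, 1 - alpha))
print(c(asymptotic = crit_asymptotic, simulated = crit_simulated))
```

If the two cutoffs differ noticeably at your sample size, p-values based on the asymptotic distribution should not be taken at face value.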

Frequently people test goodness of fit for what turn out to be not-particularly-good reasons, or they're answering a question other than the one that they actually want to answer.

For example, you almost certainly already know your data aren't really normal (not exactly), so there's no point in trying to answer a question you know the answer to - and the hypothesis test doesn't actually answer it anyway.

Given you know you don't have exact normality already, your hypothesis test of normality is really giving you an answer to a question closer to "is my sample size large enough to pick up the amount of non-normality that I have", while the real question you're interested in answering is usually closer to "what is the impact of this non-normality on these other things I'm interested in?". The hypothesis test is measuring sample size, while the question you're interested in answering is not very dependent on sample size.
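A small simulation makes the "measuring sample size" point concrete. In the sketch below the data always come from the same mildly non-normal distribution (a t with 5 degrees of freedom, picked arbitrarily for illustration), and only the sample size changes; the sample sizes and replication count are likewise my own assumptions.

```r
# Sketch: same departure from normality, different sample sizes.
set.seed(123)       # arbitrary seed
n_reps <- 500
alpha  <- 0.05
sample_sizes <- c(20, 200, 2000)   # shapiro.test accepts 3 to 5000 observations

# Fraction of simulated samples in which Shapiro-Wilk rejects, for each n
reject_by_n <- sapply(sample_sizes, function(n) {
  p <- replicate(n_reps, shapiro.test(rt(n, df = 5))$p.value)
  mean(p < alpha)
})
names(reject_by_n) <- sample_sizes
print(reject_by_n)
```

The amount of non-normality is identical in every column, so any change in the rejection rate across columns reflects the sample size rather than how much the non-normality matters for your downstream analysis.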

There are times when testing of normality makes some sense, but those situations almost never occur with small samples.

Why are you testing normality?

There is a whole Wikipedia category on normality tests including:

  • the Anderson-Darling test, popular amongst statisticians; and
  • the Jarque-Bera test, popular amongst econometricians.

I think A-D is probably the best of them.

For completeness, econometricians also like the Kiefer and Salmon test from their 1983 paper in Economics Letters -- it sums 'normalized' expressions of skewness and kurtosis which is then chi-square distributed. I have an old C++ version I wrote during grad school I could translate into R.

Edit: And here is a recent paper by Bierens (re-)deriving Jarque-Bera and Kiefer-Salmon.

Edit 2: I looked over the old code, and it seems that Jarque-Bera and Kiefer-Salmon really are the same test.

In fact, the Kiefer-Salmon test and the Jarque-Bera test are critically different, as shown in several places, most recently in "Moment Tests for Standardized Error Distributions: A Simple Robust Approach" by Yi-Ting Chen. The Kiefer-Salmon test is by construction robust to ARCH-type error structures, unlike the standard Jarque-Bera test. The paper by Yi-Ting Chen develops and discusses what I think are likely to be the best tests around at the moment.

For sample sizes <30 subjects, Shapiro-Wilk is considered to have robust power. Be careful when adjusting the significance level of the test, since doing so may increase the risk of a type II error! [1]