BCSS: Myths and Nonsense about the t-test

The subject of t-tests and regression has generated more myths and superstitious practices than any area of research I know of. I'd like to spread some reassurance by debunking some of the common ones, starting with this one:

Myth: "My data are non-parametric so I can't use a t-test"

Ah - that poor misused word “parametric”! A parameter is a property of a distribution. If we really knew the distribution of the weight of everyone in the population, we would know the average weight. However, there is little need to bother people by weighing them all. We can just weigh a representative sample and use this to estimate the average weight. A statistic is an estimate of a parameter calculated from a sample.

So parametric statistical procedures estimate some property of the distribution using data from a sample. From this, you can see that data cannot be parametric or nonparametric. Data do not estimate anything. Statistical procedures can be parametric or nonparametric, depending on whether they do, or do not, estimate some property of the population.

Indeed, one popular test, the Wilcoxon Mann-Whitney, estimates a very useful parameter: the probability that a person in one group will score higher than a person in the other group. So it's a parametric procedure.
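To make that parameter concrete, here is a minimal sketch that estimates it by comparing every person in one group with every person in the other. The function name `prob_superiority`, the scores, and the choice to count a tie as half a win are all assumptions made for illustration.

```python
from itertools import product


def prob_superiority(group_a, group_b):
    """Share of cross-group pairs in which the person from group_a scores higher.

    Ties are counted as half a win (an assumed convention for this sketch).
    """
    wins = 0.0
    for a, b in product(group_a, group_b):
        if a > b:
            wins += 1.0
        elif a == b:
            wins += 0.5
    return wins / (len(group_a) * len(group_b))


# Made-up scores, for illustration only
treated = [12, 15, 11, 18, 14]
control = [10, 13, 9, 12, 11]
print(prob_superiority(treated, control))  # 0.84
```

A value of 0.5 would mean that neither group tends to score higher than the other.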

Moral: stop using the words "parametric" and "non-parametric". They are just confusing everyone.

Small samples

One persistent misconception is that you cannot use the t-test on small samples (when pressed, people mutter something about “less than 30” but aren’t sure). Actually, you can. The t-test performs well in samples as small as N=2 (J. de Winter, 2013)! Indeed, with very, very small samples, the Wilcoxon Mann-Whitney test is unable to detect a significant difference, while the t-test is able to do so (Altman & Bland, 2009).
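The contrast can be sketched with two made-up groups of two values each (taking N=2 to mean two observations per group is my assumption here). The t-test part uses the pooled-variance form and a closed-form p-value that only holds for 2 degrees of freedom. The rank-sum part enumerates every possible assignment of ranks and assumes no tied values. Both function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def t_test_two_per_group(a, b):
    """Pooled-variance t-test for two groups of exactly two values each.

    With 2 + 2 observations there are 2 degrees of freedom, and for df = 2
    the two-sided p-value has the closed form 1 - |t| / sqrt(t^2 + 2).
    """
    if len(a) != 2 or len(b) != 2:
        raise ValueError("this shortcut only covers two values per group")
    pooled_var = (variance(a) + variance(b)) / 2
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / 2 + 1 / 2))
    return t, 1 - abs(t) / sqrt(t * t + 2)


def exact_rank_sum_p(a, b):
    """Exact two-sided Wilcoxon Mann-Whitney p-value (assumes no tied values)."""
    pooled = sorted(a + b)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n_a, n_all = len(a), len(pooled)
    expected = n_a * (n_all + 1) / 2  # rank sum expected under the null
    observed = abs(sum(rank[x] for x in a) - expected)
    # Under the null, every choice of n_a ranks for the first group is equally likely.
    sums = [sum(c) for c in combinations(range(1, n_all + 1), n_a)]
    return sum(abs(s - expected) >= observed for s in sums) / len(sums)


group_a = [10.0, 10.5]  # made-up values
group_b = [20.0, 20.4]
t, p_t = t_test_two_per_group(group_a, group_b)
p_wmw = exact_rank_sum_p(group_a, group_b)
print(f"t = {t:.1f}, t-test p = {p_t:.4f}, Wilcoxon Mann-Whitney p = {p_wmw:.3f}")
```

With two values per group there are only six possible rank arrangements, so the two-sided rank-sum p-value cannot fall below 2/6, however far apart the groups are.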

Can I use a t-test with ordinal scales?

A t-test compares two means. It follows that you can use it when the mean is a meaningful data summary. So you can certainly use it to compare body mass index or blood pressure between two groups.

You will read some authors denouncing the use of the t-test with data such as attitude, aptitude and mood scales. These scales, they argue, are ordinal, and so we have no business comparing mean values (though they don’t imply that we shouldn’t report mean values, which seems to be a logical consequence of their point of view).

In fact, it often does make sense to talk of mean scores on these measures, which are created by summing (or averaging) a number of items which are themselves ordinal. The resulting scale does tend to take on the properties of a continuous numeric scale. Here, for example, are the scores of 87 patients on an illness stigma scale that has 23 items:

You can see that the scores form a continuous distribution that falls quite closely along the 45° reference line, indicating that they follow a normal distribution (the four lowest scores at the bottom are the only exceptions).

In a case like this, the argument for summarising the data using mean scores and comparing groups using a t-test is pretty convincing.
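The effect of summing can be sketched with simulated data. Only the numbers of patients and items come from the example above. The 4-point response scale, the response probabilities, and the independence of the items are assumptions made for illustration, and real scale items are usually correlated.

```python
import random
from statistics import mean

rng = random.Random(2015)           # fixed seed so the run is repeatable
N_PATIENTS, N_ITEMS = 87, 23        # sizes taken from the stigma example
POINTS = [1, 2, 3, 4]               # hypothetical 4-point response scale
WEIGHTS = [0.45, 0.30, 0.15, 0.10]  # hypothetical skewed response pattern

# One row per patient, one column per item (items simulated as independent).
responses = [rng.choices(POINTS, weights=WEIGHTS, k=N_ITEMS)
             for _ in range(N_PATIENTS)]
totals = [sum(row) for row in responses]
first_item = [row[0] for row in responses]

print("distinct values on one item: ", len(set(first_item)))
print("distinct values on the total:", len(set(totals)))
print("mean total score:", round(mean(totals), 1))
```

A single item can take at most four values, while the total spreads over many more, which is what lets it behave like a continuous score.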

But should you use means and a t-test to compare scores on individual items? Well, looking at the items from the stigma scale above, they follow discrete distributions that indicate that the gaps between each scale point are different for each item. Here are four of the items, plotted against the normal distribution:

Notice how summing these rather unpromising-looking individual items gives us a scale with properties (continuous, normal-ish) that the items themselves lack. But can we test for differences between the items using a t-test?

The t-test will work with the individual items too

In fact, as I pointed out above, the t-test is robust to very large departures from normality. Simulation studies comparing the t-test and the Wilcoxon Mann-Whitney test on items scored on 5-point scales have given heartening results. In most scenarios, the two tests had similar power to detect differences between groups. The false-positive error rate for both tests was near to 5% for most situations, and never higher than 8% in even the most extreme situations. However, when the samples differed markedly in the shape of their score distribution, the Wilcoxon Mann-Whitney test outperformed the t-test (J. C. de Winter & Dodou, 2010).
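A simulation of this kind can be sketched in a few lines. This is not the design used in the cited study: the response probabilities, the group size of 10, and the number of runs are assumptions, and only the t-test is simulated. Both groups are drawn from the same skewed 5-point distribution, so every significant result is a false positive. The value 2.101 is the tabulated two-sided 5% critical value of t with 18 degrees of freedom.

```python
import random
from math import sqrt
from statistics import mean, variance

rng = random.Random(1)                    # fixed seed so the run is repeatable
POINTS = [1, 2, 3, 4, 5]
WEIGHTS = [0.40, 0.30, 0.15, 0.10, 0.05]  # hypothetical skewed item
N_PER_GROUP = 10
T_CRIT = 2.101  # two-sided 5% critical value of t with 18 degrees of freedom


def significant(a, b):
    """Pooled-variance t-test at the 5% level for two groups of 10."""
    pooled_var = (variance(a) + variance(b)) / 2  # valid because group sizes are equal
    if pooled_var == 0:
        return False  # every response identical: no evidence of a difference
    t = (mean(a) - mean(b)) / sqrt(pooled_var * 2 / N_PER_GROUP)
    return abs(t) > T_CRIT


def draw():
    """Draw one group of responses from the shared null distribution."""
    return rng.choices(POINTS, weights=WEIGHTS, k=N_PER_GROUP)


N_SIMS = 4000
false_positives = sum(significant(draw(), draw()) for _ in range(N_SIMS))
rate = false_positives / N_SIMS
print(f"false-positive rate: {rate:.3f}")
```

A rate close to 0.05 in a run like this is in line with the robustness described above, though one scenario says nothing about power or about groups whose distributions differ in shape.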

So when can you use a t-test?

The answer is "more often than you thought". It's a very robust test, and it tests a useful hypothesis. With small samples and odd-shaped distributions, it's wise to cross-check by running a Wilcoxon Mann-Whitney test. If the two disagree, remember that they test different hypotheses: the t-test tests for a difference in means, while the Wilcoxon Mann-Whitney tests whether a person in one group is more likely than not to score higher than a person in the other group. There can be reasons why one is significant and the other isn't.

But that's for a separate post.

Bibliography and useful reading

Altman, D. G., & Bland, J. M. (2009). Parametric v non-parametric methods for data analysis. BMJ, 338, a3167. doi:10.1136/bmj.a3167
Conroy, R. M. (2012). What hypotheses do “nonparametric” two-group tests actually test? The Stata Journal, 12(2), 1–9.
de Winter, J. (2013). Using the Student’s t-test with extremely small sample sizes. Practical Assessment, Research & Evaluation, 18(10), 1–12.
de Winter, J. C., & Dodou, D. (2010). Five-point Likert items: t test versus Mann-Whitney-Wilcoxon. Practical Assessment, Research & Evaluation, 15(11), 1–12.
Fagerland, M. W. (2012). t-tests, non-parametric tests, and large studies--a paradox of statistical practice? BMC Medical Research Methodology, 12, 78. doi:10.1186/1471-2288-12-78
Fagerland, M. W., Sandvik, L., & Mowinckel, P. (2011). Parametric methods outperformed non-parametric methods in comparisons of discrete numerical variables. BMC Medical Research Methodology, 11(1), 44. doi:10.1186/1471-2288-11-44
Rasch, D., & Teuscher, F. (2007). How robust are tests for two independent samples? Journal of Statistical Planning and Inference, 137(8), 2706–2720.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83.
Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. The British Journal of Mathematical and Statistical Psychology, 57(Pt 1), 173–181. doi:10.1348/000711004849222

1 comment:

B. Weaver, 17 February 2016 at 13:56
Geoff Norman's article 'Likert scales, levels of measurement and the "laws" of statistics' is another one you could add to your reference list.

http://www.ncbi.nlm.nih.gov/pubmed/20146096
http://link.springer.com/article/10.1007%2Fs10459-010-9222-y#/page-1

Monday 2 February 2015
