Thursday, 19 November 2015

10 Tips for successful research - 2. Be able to state your research question

16 words or less

This may sound completely obvious, but before you undertake any research you should be able to state a single, clear research question that will add to our current knowledge. I challenge our postgrads to state their research question in 16 words or less. A question should begin with a W (who, what, why, when and how – and yes, I know that 'how' doesn't begin with a w…). And it should end with a question mark. 

This is surprisingly hard to do. I ask people to imagine that someone without a background in research has asked them what they are studying. They want to impress this person with how cool and relevant their research is (I try to suggest that there is a potential romantic interest here). So how do you describe your research question in clear simple terms?

If you can state the research question clearly, in simple language, in one sentence, you will be able to work out what data are needed to answer the question, and to identify a suitable study methodology to gather these data. 

But without that initial step in place there is no way of deciding on an appropriate study methodology. 

It's worth spending time trying to phrase the question exactly right. It is the single most important step in your research. When you come up with the exact question you want to ask, make that the title of your research project.

From xkcd

Things that are not research questions

Remember that a research question is a question. "I'm interested in patient litigation" is not a question, nor is "We have data on 120 patients on our deliberate self harm register" or "I'm planning to do an analysis of patient outcomes using the TILDA dataset".


Write the introduction section to your paper before you finalise the methodology. This should have three sections: what we already know, what we don't know, and what you decided to do. If these are clear in your head – and properly referenced – all is well.

Thursday, 5 November 2015

RStudio – an interface for R for people who hate the R interface

No-one would call the R interface pretty. In fact, it's the sort of interface that strikes terror into the heart of new users. The trouble is that many people find themselves trying to use R because you can do something in R that you cannot do in any other package. These users find that R is just so different to anything else they have used that they can spend days – weeks, even – just trying to figure out how to get their data into the blasted thing.

There have been a few attempts to improve the user experience, though the feeling in the R community seems to be that R does statistics, and that a nice user interface is a low priority. 

I've been experimenting with RStudio recently. As an interface, I've found it much easier to work with than anything I tried previously. Here's what it looks like:

The bottom left shows you your output. On the top left, you can see that I'm browsing a small table, and on the top right you can see the contents of my R workspace. I like this, because R's ability to have multiple datasets available at once is a strength. Being able to browse them and inspect them is pretty useful. The bottom right shows a very useful pane that you can use to manage files, plots and packages. Clicking a package name opens the help file in the help tab.

Command tips appear as you type a command – no, it doesn't give you dialogues for commands, but the tips are very useful.  

Will this make R as easy as Stata? No, clearly. But it makes it a lot easier. And for that, you may well be grateful.

RStudio is also under pretty active development, and has improved noticeably over the couple of months I've been using it. Worth a try, then.

Thursday, 29 October 2015

Which formula for the confidence interval for a proportion?

Stata presents a bewildering array of options for the confidence interval for a proportion. Which one should you use? 

By default, Stata uses the "exact" confidence interval. This name is a bit misleading (this interval is also called the Clopper-Pearson confidence interval, which makes fewer implied claims!). The exact confidence interval is exact only in the sense that it is never too narrow. In other words, the probability of the true proportion lying within the "exact" confidence interval is at least 95%. However, this means that in most cases the interval is wider than it needs to be. 

For an apparently simple problem, finding a formula that will give 95% confidence intervals for a proportion has turned out to be surprisingly difficult to crack! The problem is that events are whole numbers, while proportions are continuous. Imagine you have a 25% real prevalence of smoking in your population, and you have a sample size of 107. Your sample cannot have a 25% prevalence of smoking, because, well, that would be 26·75 people. So some sample sizes are "lucky" because they can actually show lots of proportions, and some proportions are "lucky" because they can turn up in lots of sample sizes. You begin to see the problem?

Solutions from research

There have been quite a few studies that have used computer simulation to examine the performance of different confidence interval formulas. The recommended alternatives are the Wilson or Jeffreys confidence intervals for samples of less than 100 and the Agresti-Coull interval for samples of 100 or more. This gives the best trade-off between confidence intervals whose coverage falls below 95% and confidence intervals that are too wide. 
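
The kind of check these studies perform can be sketched in Stata. The block below is an illustration, not a reproduction of any published simulation: for the example above (true prevalence 25%, sample size 107) it adds up the binomial probability of every possible count whose Wilson interval contains 25%, which gives the actual coverage of the nominal 95% interval. It assumes Stata 14 or later, where the immediate command is written cii proportions, and the scalar name cover is arbitrary.

```stata
* Actual coverage of the nominal 95% Wilson interval when the true
* proportion is 0.25 and the sample size is 107 (values from the text)
scalar cover = 0
forvalues k = 0/107 {
    quietly cii proportions 107 `k', wilson
    * add P(K = k) if this sample's interval contains the true value
    if r(lb) <= 0.25 & 0.25 <= r(ub) {
        scalar cover = cover + binomialp(107, `k', 0.25)
    }
}
display "Wilson coverage at n = 107, p = 0.25: " %6.4f cover
```

Swapping wilson for exact, jeffreys or agresti, or changing the 107 and 0.25, shows how the coverage moves around with the formula, the sample size and the true proportion.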

What about the textbook formula that SPSS uses?

One option that Stata does not offer you is the formula you find in textbooks, which simply uses the standard error of the proportion to create a confidence interval. This is known as the normal approximation interval, and it is used by SPSS. If you calculate the confidence interval for 2 events out of a sample of 23 using the normal approximation, the confidence interval is -4% to 21%. That's right: SPSS is suggesting that the true event rate could be minus four percent. Quite clearly this is wrong, as there is no such thing as minus four percent. However, the confidence interval also includes a figure which is obviously wrong. If we have observed two cases, then the true value cannot be zero percent either. Less obviously, the upper end of the confidence interval is also very wrong. Using Wilson's formula gives a confidence interval of 

. cii 23 2, wil

                                                         ------ Wilson ------
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
             |         23    .0869565    .0587534          .02418    .2679598

2.4% to 26.8%. The "exact" method gives an interval that is slightly wider:

. cii 23 2, exact

                                                         -- Binomial Exact --
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
             |         23    .0869565    .0587534          .01071    .2803793

at 1.1% to 28.0%. 
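
To see where the negative lower limit comes from, the textbook calculation can be reproduced with a few lines of arithmetic. This is a sketch, not SPSS output. The first interval uses the usual normal multiplier. The second rests on an assumption: that SPSS treats the 0/1 variable like any other and builds a t-based interval around its mean, using the n-1 standard deviation. The scalar names are arbitrary.

```stata
* Normal-approximation ("textbook") interval for 2 events in 23
scalar phat = 2/23
scalar se_z = sqrt(phat*(1 - phat)/23)
display "z-based: " %6.3f (phat - invnormal(0.975)*se_z) ///
        " to "      %6.3f (phat + invnormal(0.975)*se_z)

* Assumed SPSS-style version: t multiplier and the n-1 standard deviation
scalar se_t = sqrt(phat*(1 - phat)/22)
display "t-based: " %6.3f (phat - invttail(22, 0.025)*se_t) ///
        " to "      %6.3f (phat + invttail(22, 0.025)*se_t)
```

The z-based version gives roughly -2.8% to 20.2%. The t-based version gives about -3.8% to 21.2%, which rounds to the figures quoted above. Either way the lower limit is below zero.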

So never calculate a binomial confidence interval by hand or using SPSS!

Skip to this bit for the answer

For such an apparently simple problem, the issue of the confidence interval for a proportion is mathematically pretty complex. Mercifully, a Stata user just has to remember three things: 
  1. the "exact" interval is conservative, but has at least a 95% chance of including the true value; 
  2. for N < 100, Wilson or Jeffreys is less conservative and closest to an average chance of 95% coverage; 
  3. and for N of 100 or more, Agresti-Coull is the best bet. 
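
In practice each of these is a single option. The commands below reuse the 2-in-23 example and assume Stata 14 or later, where the immediate form is cii proportions; in earlier versions the same options follow cii 23 2, as in the output shown earlier.

```stata
cii proportions 23 2, exact      // Clopper-Pearson, the default
cii proportions 23 2, wilson     // smaller samples
cii proportions 23 2, jeffreys   // smaller samples, alternative
cii proportions 23 2, agresti    // Agresti-Coull, samples of 100 or more
```

With a dataset in memory, the same options work on a 0/1 variable, for example ci proportions smoker, wilson (smoker is a hypothetical variable name).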

Tuesday, 20 October 2015

Missing data: never, never use 99!

The myth of the 99

My heart sinks every time I come across data with 9, 99, 999, 9999 and other real numbers used to indicate missing data. There are still supervisors (though they are getting pretty old by now) who advise students to do this. It's a myth particularly prevalent among SPSS users. 


The use of actual numbers as missing values takes us way back to the seventies, when computers were run by punched card. They looked like this:

And yes, that's SPSS on those cards!
When you were building a dataset, you had to tell the computer what kind of variable each variable was, and how much storage space it needed. Numeric variables could only contain numbers, and so researchers had a problem: what happened when the information was missing? 
The SPSS solution was to declare one of the numbers to be a missing value. This overcame the problem of storing missing values in a column of numeric data, but it also opened the floodgates to a lot of really risky calculations. Because if you forgot to tell SPSS about your missing value, then the information would be treated as real.

Myth: you must have a numeric missing value

Every modern statistics package since Bob Dylan was alive in a meaningful sense has been able to handle missing values. By this I mean that if you leave a blank, it is correctly interpreted. It will automatically be assigned a missing value by the package. 
Let me repeat: you do not have to have special numeric missing values.

Missing – just leave it blank

So if you have missing data, just leave it blank. Your stats package knows what to do. If you need to know why the data were missing, then create a separate variable that codes the reasons. If the reasons are worth analysing, then they are worth coding properly. 
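
If you inherit a dataset where 99 has already been used, the damage is easy to demonstrate and to repair in Stata. The toy data below are invented for illustration; mvdecode is the built-in command that turns a chosen number into a proper missing value.

```stata
clear
input id age
1 34
2 99
3 41
4 .
5 27
end

* The 99 is treated as a real age: the mean comes out at 50.25
summarize age

* Recode 99 to Stata's missing value, then summarise again
mvdecode age, mv(99)
summarize age
```

After mvdecode the mean is 34, based on the three people whose age is actually known. The blank entered as a full stop for person 4 was handled correctly all along.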

Friday, 16 October 2015

Stata tips : get a display of Stata's graph colours

Stata graphs allow you to specify the colo(u)rs used for various graph items. Stata has about 50 named colours, so it's hard to remember exactly what each colour looks like. In addition, Stata uses the names gs0 to gs16 for seventeen shades of grey, starting with gs0 (black) and ending with gs16 (white). These are useful for producing more-or-less evenly-spaced gradations of grey. 
To see a plot of all the available Stata colours, you need to install the vgsg package. This package contains useful resources that accompany Mitchell's excellent book A visual guide to Stata graphics. Amongst these resources there is a simple command that makes a colour chart. 

You install the package like this:

. net from

Click the link to install the package. Once you have installed it, you can issue the command 

. vgcolormap

to print a palette of the available colours. I printed one in colour and pasted it inside my copy of Mitchell's book. Here is the graph:

Notice that Stata has 17 shades of grey (biostats people can't cope with fifty). These are named gs0 to gs16. Of course, gs0 looks black, but if you look carefully, Dougal, you'll see it's actually a very, very, very, very dark grey*. And gs16 is simply white. 

And there is a second user-written command, by Seth T. Lirette, that I rather like for its elegant output:

. ssc install hue
. hue
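
Once you have picked colours from the chart, they go into graph options such as mcolor(). A minimal sketch, using the auto dataset that ships with Stata and two colour names chosen purely for illustration (gs10, a mid grey, and the named colour navy):

```stata
sysuse auto, clear
twoway (scatter mpg weight if foreign == 0, mcolor(gs10))   ///
       (scatter mpg weight if foreign == 1, mcolor(navy)),  ///
       legend(order(1 "Domestic" 2 "Foreign"))
```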

*For non-Irish people, this is a Father Ted joke. Don't worry about it.

10 Steps to successful research : 1 – Know the current state of knowledge

The research process begins by identifying a gap in our knowledge or our understanding. It doesn't matter whether we're talking about scientific research or real life. In real life, for example, you might be going to Lisbon and you need to find a good, cheap hotel near the city centre. Or you might need to figure out how to make carrot soup. But hold onto this idea: research fills a gap in our knowledge. If you don't have a knowledge gap, you're not doing research, you're just noodling around on the internet. 

Scientific research involves adding to knowledge. In order to do this, you must know the current state of knowledge, the current theoretical approaches and current best practice in terms of measurement. 

All biostats people have the experience of the person who comes in with a great research idea that looks like this:

The person: I have sixteen patients with rapid cycling mood disorder
Me: So what are you going to research?
The person: The patients with rapid cycling mood disorder
Me: No, I meant what question are you going to research. What do we not know about rapid cycling mood disorder?
The person: Oh…

Of course, those sixteen patients are a research opportunity. But they aren't a research project until we can find a question that will add to our knowledge, and that can be answered with sixteen patients. Often our job supporting student research is to help the student identify the research opportunities in their environment and then to see if any of these opportunities can be used to study a question that we need answered. 

The introduction to your research paper should do three things:
1. It should outline the current state of knowledge.
2. It should identify a gap in that knowledge and
3. It should state the research question in clear, simple language.

Being able to write the first two sections is critical. There will be no step 3 – no research question – without the first two steps. 

But what about a great research question that just sort of pops into your head, I hear you ask? 

Two things: first, this question may have a well-known answer. You need to know the literature to avoid duplicating work already done.
The second is to do with connectedness. Research is like a jigsaw. The best contributions are made by people who find the edge of the work in progress and join up with it. Science advances because each piece of research links into the existing body of knowledge like a jigsaw piece. 

So find out where the edge of our knowledge is. That's where you need to go to work.

Monday, 2 February 2015

Myths and Nonsense about the t-test

The subject of t-tests and regression has generated more myths and superstitious practices than any area of research I know of. I'd like to spread some reassurance by debunking some of the common ones. Starting with

Myth: "My data are non-parametric so I can't use a t-test"

Ah - that poor misused word “parametric”! A parameter is a property of a distribution. If we really knew the distribution of the weight of everyone in the population, we would know the average weight. However, there is little need to bother people by weighing them all. We can just weigh a representative sample and use this to estimate the average weight. A statistic is an estimate of a parameter calculated from a sample.

So parametric statistical procedures estimate some property of the distribution using data from a sample. From this, you can see that data cannot be parametric or nonparametric. Data do not estimate anything. Statistical procedures can be parametric or nonparametric, depending on whether they do, or do not, estimate some property of the population.
Indeed, one popular test, the Wilcoxon Mann-Whitney, estimates a very useful parameter: the probability that a person in one group will score higher than a person in the other group. So it's a parametric procedure. 
Moral: stop using the words "parametric" and "non-parametric". They are just confusing everyone. 

Small samples

One persistent misconception is that you cannot use the t-test on small samples (when pressed, people mutter something about “less than 30” but aren’t sure). Actually, you can. And the t-test performs well in samples as small as N=2 (J. de Winter, 2013)! Indeed, with very, very small samples, the Wilcoxon Mann-Whitney test is unable to detect a significant difference, while the t-test is able to do so (Altman & Bland, 2009). 

Can I use a t-test with ordinal scales?

A t-test compares two means. It follows that you can use it when the mean is a meaningful data summary. So you can certainly use it to compare body mass index or blood pressure between two groups. 

You will read some authors denouncing the use of the t-test with data such as attitude, aptitude and mood scales. These scales, they argue, are ordinal, and so we have no business comparing mean values (though they don’t imply that we shouldn’t report mean values, which seems to be a logical consequence of their point of view).

In fact, it often does make sense to talk of mean scores on these measures, which are created by summing (or averaging) a number of items which are themselves ordinal. The resulting scale does tend to take on the properties of a continuous numeric scale. Here, for example, are the scores of 87 patients on an illness stigma scale that has 23 items:

You can see that the scores form a continuous distribution that falls quite closely along the 45° reference line, indicating that they follow a normal distribution (the four lowest scores at the bottom are the only exceptions).

In a case like this, the argument for summarising the data using mean scores and comparing groups using a t-test is pretty convincing. 
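
The effect of summing items can be illustrated with simulated data. This is a sketch under loud assumptions: the 23 items below are invented, independent and uniformly distributed over 1 to 5, which real questionnaire items are not, so a real scale will look less tidy. The variable names are arbitrary; 87 and 23 simply echo the stigma scale example.

```stata
clear
set seed 2015
set obs 87

* 23 hypothetical items, each scored 1 to 5
forvalues i = 1/23 {
    generate item`i' = ceil(5*runiform())
}

* Sum the items into a scale score
egen total = rowtotal(item1-item23)

qnorm item1    // a single item: five discrete steps
qnorm total    // the summed scale: much closer to the reference line
```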

But should you use means and a t-test to compare scores on individual items? Well, looking at the items from the stigma scale above, they follow discrete distributions that indicate that the gaps between each scale point are different for each item. Here are four of the items, plotted against the normal distribution:

Notice how summing these rather unpromising-looking individual items gives us a scale with properties (continuous, normal-ish) that the items themselves lack. But can we test for differences between the items using a t-test?

The t-test will work with the individual items too

In fact, as I pointed out above, the t-test is robust to very significant departures from normality. Simulation studies comparing the t-test and the Wilcoxon Mann-Whitney test on items scored on 5-point scales have given heartening results. In most scenarios, the two tests had a similar power to detect differences between groups. The false-positive error rate for both tests was near to 5% for most situations, and never higher than 8% in even the most extreme situations. However, when the samples differed markedly in the shape of their score distribution, the Wilcoxon Mann-Whitney test outperformed the t-test (J. C. de Winter & Dodou, 2010). 

So when can you use a t-test?

The answer is "more often than you thought". It's a very robust test, and it tests a useful hypothesis. With small samples and odd-shaped distributions, it's wise to cross-check by running a Wilcoxon Mann-Whitney test, but if they disagree, remember that they test different hypotheses: the t-test tests for differences in means, while the Wilcoxon Mann-Whitney tests the hypothesis that a person in one group will score higher than a person in the other group. There can be reasons why one is significant and the other isn't.

But that's for a separate post.
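
In the meantime, the cross-check itself takes two commands in Stata. The example uses the auto dataset that ships with Stata, not the stigma data; the porder option asks ranksum to report the estimated probability that an observation from the first group is higher than one from the second.

```stata
sysuse auto, clear
ttest mpg, by(foreign)             // difference in mean mpg
ranksum mpg, by(foreign) porder    // rank-sum test plus P(domestic > foreign)
```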

Bibliography and useful reading
  1. Altman, D. G., & Bland, J. M. (2009). Parametric v non-parametric methods for data analysis. BMJ, 338, a3167. doi:10.1136/bmj.a3167
  2. Conroy, R. M. (2012). What hypotheses do “nonparametric” two-group tests actually test? The Stata Journal, 12(2), 1–9.
  3. de Winter, J. (2013). Using the Student’s t-test with extremely small sample sizes. Practical Assessment, Research & Evaluation, 18(10), 1–12.
  4. de Winter, J. C., & Dodou, D. (2010). Five-point Likert items: t test versus Mann-Whitney-Wilcoxon. Practical Assessment, Research & Evaluation, 15(11), 1–12.
  5. Fagerland, M. W. (2012). t-tests, non-parametric tests, and large studies--a paradox of statistical practice? BMC Medical Research Methodology, 12, 78. doi:10.1186/1471-2288-12-78
  6. Fagerland, M. W., Sandvik, L., & Mowinckel, P. (2011). Parametric methods outperformed non-parametric methods in comparisons of discrete numerical variables. BMC Medical Research Methodology, 11(1), 44. doi:10.1186/1471-2288-11-44
  7. Rasch, D., & Teuscher, F. (2007). How robust are tests for two independent samples? Journal of Statistical Planning and Inference, 137(8), 2706–2720.
  8. Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83.
  9. Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. The British Journal of Mathematical and Statistical Psychology, 57(Pt 1), 173–181. doi:10.1348/000711004849222

Thursday, 22 January 2015

Making graphs of tables

Seeing tables as graphs

We often put tables into papers by reflex. Making them is a dull activity because, I suspect, there is the sense that no-one reads them. And there’s a very good reason for this: while tables are a very good resource, they are lousy communicators. 

Tables : lousy communicators

Here is a table of hair and eye colour

. use "Hair and eye colour.dta"
(Hair and Eye Colour, Caithness, from Tocher (1908))

. tabulate eye_colour hair_colour [fweight = freq]

           |                      Hair colour
Eye colour |      Fair        Red     Medium       Dark      Black |     Total
      Blue |       326         38        241        110          3 |       718 
     Light |       688        116        584        188          4 |     1,580 
    Medium |       343         84        909        412         26 |     1,774 
      Dark |        98         48        403        681         85 |     1,315 
     Total |     1,455        286      2,137      1,391        118 |     5,387 

You have to be pretty determined to make any sense of the table. Indeed, to do so requires somehow digesting the information from 20 numbers, most of which are three-digit numbers. This is pretty much guaranteed to be beyond the working memory capacity of the average human.

And no, percentages don’t help much:

. tabulate eye_colour hair_colour [fweight = freq], column nofreq 

           |                      Hair colour
Eye colour |      Fair        Red     Medium       Dark      Black |     Total
      Blue |     22.41      13.29      11.28       7.91       2.54 |     13.33 
     Light |     47.29      40.56      27.33      13.52       3.39 |     29.33 
    Medium |     23.57      29.37      42.54      29.62      22.03 |     32.93 
      Dark |      6.74      16.78      18.86      48.96      72.03 |     24.41 
     Total |    100.00     100.00     100.00     100.00     100.00 |    100.00 

Stacked bar charts

Here, instead, is what happens when we graph the data

catplot  eye_colour hair_colour [fw=freq], name(catplot,replace) ///
asyvars stack percent(hair) legend(rows(1) stack)

The stacked bar chart shows the trend of dark-to-light running from top left to bottom right. This shows the breakdown of eye colour within each hair colour, but tells us nothing about the distribution of hair colour. 

This is done with Nick Cox’s command catplot. Download it from the SSC archive:

. ssc install catplot

Spineplots (mosaic plots)

Spine plots (also called mosaic plots) are a very effective way of visualising tables. Unlike stacked bar charts, you may not have heard of spine plots. 
A spineplot will show both the distribution of hair colour, and the distribution of eye colour within hair colour:

spineplot  eye_colour hair_colour [fw=freq], percent

The hair colours are shown as columns, and we can see that red hair and black hair are much rarer in this population (Scotland, early 20th century) than fair, medium and dark. And the relationship with eye colour is now very evident – the colour changes from bottom left (fair hair, light or blue eyes) to the top right (dark or black hair, dark eyes). 

Do you need a graph rather than a table?

The tables above contain the relationship but they don’t show it. And even if you are determined to find it, there are simply too many numbers in the table for any normal person to hold them all in working memory and make sense of the pattern. 

The spine plot, on the other hand, shows the relationship with little work needed on the part of the reader. It doesn’t record the exact percentages. If you needed simply to record the exact percentages for reference, then a table is better, but if you wanted to communicate a pattern, then there’s no question: the graph wins hands-down.

This is done with Nick Cox’s command spineplot. Download it from the SSC archive:

. ssc install spineplot
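
If you don't have the original data file, the counts in the first table can be typed in directly. The sketch below rebuilds the dataset from those 20 cell counts; the numeric codes and the label names eye and hair are my own choices, and the last line assumes spineplot has been installed as above.

```stata
clear
input eye_colour hair_colour freq
1 1 326
1 2  38
1 3 241
1 4 110
1 5   3
2 1 688
2 2 116
2 3 584
2 4 188
2 5   4
3 1 343
3 2  84
3 3 909
3 4 412
3 5  26
4 1  98
4 2  48
4 3 403
4 4 681
4 5  85
end

* Attach the category names used in the table
label define eye  1 "Blue" 2 "Light" 3 "Medium" 4 "Dark"
label define hair 1 "Fair" 2 "Red" 3 "Medium" 4 "Dark" 5 "Black"
label values eye_colour eye
label values hair_colour hair

* Same table and spineplot as in the post
tabulate eye_colour hair_colour [fweight = freq]
spineplot eye_colour hair_colour [fw = freq], percent
```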