Saturday, 22 August 2015

How robust is empirical science?

I'd like to begin this post by admitting that I am writing far outside my own research experience. Nevertheless, the paper that prompted it ("Likelihood of null effects of large NHLBI clinical trials has increased over time" by Robert Kaplan and Veronica Irvin) sends such a potentially alarming message that I think it is worth publicising.

Kaplan and Irvin looked at all "large" research trials funded by the NHLBI (the US National Heart, Lung and Blood Institute) between 1970 and 2012 ("large" here was precisely defined in their paper). These trials examined a variety of drugs and dietary supplements for preventing cardiovascular disease. The authors recorded whether each trial had a positive outcome (statistical evidence that the treatment was successful), a negative outcome (statistical evidence that the treatment was ineffective) or a null outcome (no evidence either way).

There were 30 studies prior to 2000, of which 17 produced a positive outcome, and 25 afterwards, of which only 2 produced a positive outcome. This is a large decline in the number of positive outcomes, and the authors discussed various possible causes.
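To get a feel for how stark this contrast is, we can run a one-sided Fisher exact test on the 2x2 table of counts (17 positive of 30 pre-2000 trials versus 2 positive of 25 post-2000 trials). This is my own back-of-the-envelope check, not a calculation from the paper, but it shows the gap is far too large to be a chance fluctuation:

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]:
    probability, under the hypergeometric null, that the first row
    contains at least `a` successes."""
    n1 = a + b          # first-row total (pre-2000 trials)
    k = a + c           # total successes (positive outcomes)
    n = a + b + c + d   # grand total
    denom = comb(n, n1)
    return sum(
        comb(k, x) * comb(n - k, n1 - x)
        for x in range(a, min(n1, k) + 1)
    ) / denom

# 17 positive of 30 pre-2000 trials vs 2 positive of 25 post-2000 trials
p = fisher_one_sided(17, 13, 2, 23)
print(f"one-sided p = {p:.2e}")
```

The resulting p-value is well below 0.001, so whatever is driving the decline, it is not sampling noise.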

Two possibilities that might, a priori, explain the decline are

  1. researchers prior to 2000 were under more pressure to report positive outcomes because these are preferred by drug companies, and 
  2. placebo-controlled designs were used less often prior to 2000.
However, possibility 1 was ruled out because the proportion of trials sponsored by drug companies was essentially the same in both periods, and possibility 2 was ruled out for similar reasons.

Kaplan and Irvin did however advance quite a compelling reason to explain the discrepancy. Before I tell you what it was, it is worth saying something about the culture in which such experiments take place.

Imagine you are a scientist who is testing the efficacy of a drug to combat high blood pressure (say). You set up your clinical trials complete with a control group to whom you administer only a placebo, gather your data, and analyse it. This may take a long time and you are heavily invested in your results. Naturally you are hoping that the results will demonstrate that your new drug will prove effective in reducing blood pressure.

But maybe that doesn't happen. Oh dear what could you do? Doesn't it seem rather lame to simply report that you found no effect either way (or, worse, a negative effect)? Since you have gathered all this data why not look at it again - after all, it might have some significance. Maybe your drug has had some unexpected positive effect and, when you find it, you can report a positive outcome.

What's wrong with that? Isn't it perfectly fair since your data did demonstrate some form of efficacy?

No, it's not fair, for several reasons. One reason is that your experimental procedures were perhaps not well-tailored to assessing the result you did, in fact, find. Perhaps a more important reason, though, is that any such conclusion comes only with a statistical likelihood and, under the procedure you actually followed, you gave yourself many opportunities to "get lucky" with a statistically significant conclusion.

It has become increasingly recognised that such researcher bias should be minimised. In the year 2000 the NHLBI introduced a mechanism to prevent researchers from gaming the system in the way I described above. They began to require researchers to register in advance that they were conducting a clinical trial and what hypothesis they were testing. So once the data are gathered the researchers cannot change their research question.

Pre-registration of trials is what Kaplan and Irvin believe explains the drop in positive outcomes since 2000. They could easily be right and, at the very least, their work should be scrutinised and critiqued.

Now, just for the moment, let's assume they have hit upon a significant finding. What would this mean? Surely it means that, for the 30 trials conducted prior to 2000, rather than a fraction of 57% having positive conclusions (17 out of 30), the true fraction should be closer to 8% (the fraction of post-2000 trials that were positive). Since we have absolutely no idea which of the positive-outcome trials lie in this much smaller set, we should doubt 30 years of NHLBI-funded research. Worse than that, thousands of patients have been administered treatments whose efficacy is unproven. I'm not suggesting a witch-hunt against researchers who have acted in good faith, but surely there are lessons to be learnt.

One lesson to learn is that we should value trials with a negative or null outcome just as much as we value those with a positive outcome. In particular, the bias towards publishing only positive outcomes must disappear. This is increasingly accepted but, as Ben Goldacre has demonstrated in his book Bad Pharma, there is still a very long way to go. In fact Goldacre shows that pharmaceutical companies have actively concealed studies with null outcomes and cherry-picked only those studies that shine a favourable light on the drugs they promote.

But there is another lesson, one with potentially much wider implications. Many disciplines conduct their research studies by the "formulate hypothesis, gather data, look for statistical conclusion" methodology. In fact hardly a discipline is untouched by this methodology and, in most cases, they are years behind the medical disciplines in recognising what can contribute to researcher bias. It is therefore no overstatement to say that such disciplines may well have a track record of generating dodgy research results.

This shocking conclusion should be taken to heart by every research institution and, in particular, it should be taken to heart by our universities, which claim to have a mission to seek out and disseminate truth. In my opinion it is now incumbent on the relevant disciplines (most of them) to at least conduct an analogue of the work carried out by Kaplan and Irvin. We need to establish the extent of the problem (if indeed there is a problem), and we need to repeat, with all the new rigour we now know about, many previous investigations, even if their conclusions have been accepted for years.

It is not enough to aggregate the results of several investigations to raise confidence in their conclusions (meta-studies). We will often need to begin afresh. Well, at least that will give us all something to do.

1 comment:

  1. My friend and colleague Professor Jeff Miller sent me some interesting thoughts that advanced another possible cause for the proportional change in positive findings. He writes:

    Yes, that is an interesting blog post about the Kaplan and Irvin article, which I had not seen. Playing devil's advocate, though, I would argue your post does not give enough attention to the possibility that the "research landscape" can change in ways that make it tougher to find new positive results (examples below). Kaplan and Irvin give some credence to this possibility in their discussion, but I think it might deserve even more credence than they give it. I must admit, however, that my impressions may not be valid since I have no first-hand experience in this research area.

    One way in which the research landscape might change is that the pre-2000 and post-2000 "patient populations" might be different. For example, treatments discovered prior to 2000 might effectively control blood pressure for many patients, and these "treatable" patients may be handled rather routinely. In that case, they may no longer be candidates for research studies post-2000. This could mean that the post-2000 patients had inherently more difficult medical problems. If so, it would not be surprising that there would be a lower rate of success in finding effective treatments for them. Kaplan and Irvin make the related point, "the quality of background cardiovascular care continues to improve, making it increasingly difficult to demonstrate the incremental value of new treatments".

    Another way in which the research landscape might change is that the most promising treatments could have been studied first. After all, researchers would surely tend to start by studying the treatments that seemed most likely to succeed based on the available medical knowledge. To the extent that researchers actually could identify the best treatment bets, the early studies would have had the advantage of getting to look at those, whereas the later studies would have tended to look at the less promising treatments that no one had already examined. This would obviously lower the success rate for the later studies.

    Kaplan and Irvin considered this latter potential problem in their discussion, and they essentially dismissed it "because nearly all of the trials evaluated treatments that had been previously studied". That argument sounds somewhat plausible, but the earlier point about a changing patient population weakens the argument. Specifically, suppose that the post-2000 patients who could be helped by the pre-2000 treatments were already receiving those treatments and were therefore not considered as candidates for post-2000 studies. In that case the post-2000 patient group would include a disproportionate share of patients for whom the pre-2000 treatments were already known not to work. When these people were included in a new study of the same pre-2000 treatments, it is hardly surprising that the pre-2000 treatments would have had little effect.