Coffee, Mortality, and Data Quality

This past week several news articles discussed the link between coffee drinking and mortality, based on a New England Journal of Medicine paper titled “Association of Coffee Drinking with Total and Cause-Specific Mortality.” Science Daily, for example, said “Coffee Drinkers Have Lower Risk of Death, Study Suggests.” I was able to obtain a copy of the paper, and frankly I see several red flags in the data that were used. Given the data, the analysis may be well done, but computer geeks have a saying: “garbage in, garbage out.” I would not label the data used for this research as garbage, but I do have several concerns. While the authors recognized some of these issues in the paper, I am of the opinion that they were not well addressed and that their implications deserve further consideration. The mere mention of a problem does not answer the very important question of how that problem affects the conclusions reached in the paper.

The data in the study must first be regarded as being far from representative of the population of the United States. The actual sampling was done in only six states. But even there the data cannot be claimed to represent the people living in those six states. The data source was the NIH-AARP Diet and Health Study. The first stage of sample selection was membership in the AARP in the selected states. Current membership numbers for the AARP seem to be around 40 to 45 million persons age 50 and over. The size of this age group in the United States is about 99 million according to the 2010 Decennial Census. So the first problem is that the sampling frame, those who have elected to join the AARP, represents less than 50% of that population. Those are current numbers. The actual sample selection was done in 1996 and 1997, and I do not have membership numbers for the AARP back then. The second stage of sample selection was much more problematic. 3.5 million questionnaires were mailed out. Approximately 600 thousand were returned. This means that the response rate for the survey was on the order of 17%. Response rates at this level are usually regarded as unacceptably low by most survey organizations. Combining the two problems (less than 50% coverage times a 17% response rate), the resulting sample represents about 8% of the population of the six states. Keep in mind that this 8% were in effect self-selected into the sample. The first stage of self-selection was the choice to join the AARP, and the second stage was the choice to return the questionnaire.

But it gets worse. In the study the authors decided to exclude respondents from the analysis for several reasons. By the time they implemented these exclusions they had reduced the sample to 400,000 respondents. This then represents about 6% of the target population.
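The coverage arithmetic above can be checked with a short sketch. It uses only the rounded figures quoted in this post. It also assumes, as the estimate above implicitly does, that the national AARP share applies to the six states and that the mailing went to the whole frame. The scenario names and the `coverage` helper are mine, not the paper's.

```python
# Back-of-the-envelope coverage arithmetic using the rounded figures quoted
# in this post. Assumption: the national AARP share of the 50+ population
# also applies to the six states sampled.
MAILED = 3_500_000         # questionnaires mailed out
RETURNED = 600_000         # questionnaires returned (approximate)
ANALYZED = 400_000         # respondents left after the authors' exclusions
POP_50_PLUS = 99_000_000   # U.S. population age 50+, 2010 Census

def coverage(frame_share, respondents, mailed=MAILED):
    """Share of the target population represented:
    sampling-frame share times the fraction of mailings that were used."""
    return frame_share * respondents / mailed

response_rate = RETURNED / MAILED

# Hypothetical scenarios for the first-stage (AARP membership) share.
scenarios = {
    "AARP 40 million": 40_000_000 / POP_50_PLUS,
    "AARP 45 million": 45_000_000 / POP_50_PLUS,
    "50% upper bound": 0.50,
}

print(f"response rate: {response_rate:.1%}")
for name, share in scenarios.items():
    print(f"{name}: frame {share:.1%}, "
          f"returned {coverage(share, RETURNED):.1%}, "
          f"analyzed {coverage(share, ANALYZED):.1%}")
```

The 8% and 6% figures correspond to the 50% upper bound on the frame share. With the 40 to 45 million membership range the coverage comes out somewhat lower still.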

I find it very disconcerting to make claims about the population as a whole based on a sample selected in this manner. By way of comparison, the federal government would never consider providing estimates of monthly unemployment based on data with these types of sampling problems. So I am unsure why such selection procedures are considered acceptable when dealing with medical data. Even if all of the analysis done in the paper is sound, it is very unclear that the results can be generalized beyond those people who ended up in the analysis. Any claim that those results can or should be generalized to the population of the United States would require substantial justification. So while the sample used is very large, the biased selection procedures raise serious issues. The size of a study matters very little when biased selection procedures are used. Statisticians will frequently say that the errors due to such procedures dwarf the sampling errors and cannot be offset by increasing the size of the data set.
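To illustrate why size does not rescue a self-selected sample, here is a minimal simulation sketch. Every number in it is hypothetical and chosen only for illustration: a trait held by 30% of the population, and response probabilities of 25% for people with the trait and 10% for people without it. None of these values come from the paper.

```python
import random

# All values are hypothetical, chosen only to illustrate the mechanism.
TRUE_RATE = 0.30         # share of the population with the trait
P_RESPOND_TRAIT = 0.25   # chance of responding if you have the trait
P_RESPOND_OTHER = 0.10   # chance of responding if you do not

def self_selected_estimate(n_contacted, rng):
    """Estimate the trait rate using only the people who chose to respond."""
    responders = with_trait = 0
    for _ in range(n_contacted):
        has_trait = rng.random() < TRUE_RATE
        p_respond = P_RESPOND_TRAIT if has_trait else P_RESPOND_OTHER
        if rng.random() < p_respond:
            responders += 1
            with_trait += has_trait
    return with_trait / responders

def random_sample_estimate(n, rng):
    """Estimate the trait rate from a simple random sample of size n."""
    return sum(rng.random() < TRUE_RATE for _ in range(n)) / n

def limiting_estimate():
    """Value the self-selected estimate converges to as the mailing grows."""
    kept_trait = TRUE_RATE * P_RESPOND_TRAIT
    kept_other = (1 - TRUE_RATE) * P_RESPOND_OTHER
    return kept_trait / (kept_trait + kept_other)

rng = random.Random(0)
for n in (2_000, 20_000, 200_000):
    print(f"contacted {n:>7,}: self-selected estimate "
          f"{self_selected_estimate(n, rng):.3f}")
print(f"random sample of 1,000: {random_sample_estimate(1_000, rng):.3f}")
print(f"true rate {TRUE_RATE:.3f}, self-selected limit {limiting_estimate():.3f}")
```

Under these assumptions the self-selected estimate settles near 52% however many people are contacted, while a random sample of only 1,000 lands close to the true 30%. More data makes the biased estimate more precise. It does not make it more accurate.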

As an example of the biased nature of the resulting sample, the paper lists college graduation rates for the group (age 50 to 71) as ranging from 37% to 53% for men and from 24% to 31% for women, with the level depending on the amount of coffee they drank. By comparison, the latest Census Bureau numbers put the current share with college degrees at 28.6% for those age 45 to 64. Given the upward trend in graduation rates, even that figure should be considered an overstatement of the level of education that would have been seen in the study participants had the sample been representative of the population. The large difference indicates that the authors ended up with a group that is much better educated than the general population. Adjustments can be made for differences between the general population and the sample population, and these differences can be incorporated into the modeling methodology. However, the larger these differences are, the more difficult such adjustments become and the less clear it is that they adequately account for the differences.

Those are not the only problems with the analysis. As the authors dutifully point out, the entire study is based on self-reported coffee consumption at a single point in time. There was no measure of long-term coffee consumption available in the data set. In fact all of the characteristic variables used for the individual, other than the death information, were based on the data reported in the questionnaires at the time of the original survey. I find reporting links between coffee consumption and mortality based on such data and analysis troubling. It seems little different from looking at those who received a ticket for speeding 15 years ago and studying traffic fatalities in auto accidents years later. There may well be a link, but this just does not seem to be the way to examine the situation.

The final issue I have with studies of this magnitude is the well-documented problem of publication bias. This is where only results that are statistically significant are reported in the journals, while the studies where links were not found are not reported. As a result it is never clear whether a purported link is the result of random chance or reflects reality. Using the statistical standards employed in this report, five percent of all studies on this data set would be expected to show a statistically significant result even if the results were due to pure random chance. The website for the survey shows hundreds of studies based on this data. Studies of this size draw a large number of research interests and are therefore in a position where the publication bias issue needs to be seriously addressed before actions are taken based on the results of studies done using data such as was used here. Ideally this would be done across all of the analyses done using the data set; the issue is not specific to this one study on coffee and mortality.
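The five percent figure can be turned into a rough count. The sketch below assumes a 0.05 significance level, treats the tests as independent, and supposes that no real effect exists in any of them. Those are simplifying assumptions of mine for illustration, and the study counts in the loop are hypothetical stand-ins for "hundreds."

```python
ALPHA = 0.05  # significance level assumed for every test

def expected_false_positives(n_tests, alpha=ALPHA):
    """Expected number of 'significant' results when no real effect exists."""
    return n_tests * alpha

def prob_at_least_one(n_tests, alpha=ALPHA):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests

# Hypothetical numbers of studies run on the same data set.
for n in (1, 20, 100, 300):
    print(f"{n:>3} tests: expect {expected_false_positives(n):5.1f} "
          f"false positives, P(at least one) = {prob_at_least_one(n):.3f}")
```

With a few hundred analyses on one data set, a dozen or more chance findings would be expected under these assumptions. If only the significant ones reach the journals, a reader cannot tell them apart from real effects.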
