Good Stats Bad Stats
goodstatsbadstats.com
Last Sunday an opinion piece by Robert Samuelson in the Washington Post questioned the emphasis on getting a college education. With the increasing cost of a college education, the wisdom of sending our young people to college has come into question. The hard time recent graduates have had finding their first jobs during the recent recession has added to the controversy. The funny thing is that back in 1970 new college graduates were also having a hard time getting a job, and no one seemed to be questioning the value of a college degree then. The sad truth is that in the middle of a recession many people have problems getting and keeping a job. That is the nature of a recession.
So with today's release of the jobs report by the Bureau of Labor Statistics, I looked at one measure of the value of a college degree: the unemployment rate. The numbers are enlightening. I looked only at those over age 24, as these are the values readily available in the BLS report. The downside of using the 25-and-over age group is that it fails to fully reflect the status of recent college graduates. Samuelson cites statistics on graduation rates within six years of starting college; those would be mostly people turning 24 or 25 that year. That in itself makes the cutoff a fair value for evaluating the impact of a college degree on employment.
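For anyone who wants to pull the same comparison themselves, here is a minimal sketch using the BLS public data API. The endpoint and the two series IDs are my own assumptions, not anything taken from the BLS report, so verify them at bls.gov before trusting the output:

```python
# A minimal sketch, assuming the unauthenticated v1 BLS API endpoint and
# these two series IDs (both are my assumptions -- verify at bls.gov):
#   LNS14027660 -- unemployment rate, high school graduates, no college, 25+
#   LNS14027662 -- unemployment rate, bachelor's degree and higher, 25+
import json
import urllib.request

SERIES = {
    "LNS14027660": "High school, no college",
    "LNS14027662": "Bachelor's degree or higher",
}

payload = json.dumps({"seriesid": list(SERIES),
                      "startyear": "2011", "endyear": "2012"}).encode()
req = urllib.request.Request(
    "https://api.bls.gov/publicAPI/v1/timeseries/data/",
    data=payload, headers={"Content-type": "application/json"})

with urllib.request.urlopen(req) as resp:
    series_list = json.load(resp)["Results"]["series"]

for series in series_list:
    latest = series["data"][0]  # BLS returns the most recent month first
    print(f"{SERIES[series['seriesID']]}: {latest['value']}% "
          f"({latest['periodName']} {latest['year']})")
```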
So which group is it better to be in? I’d rather have that college degree.
As with any situation, relying on just one number can be misleading. A full evaluation would look at income and debt levels for college graduates as compared with other groups.
Samuelson failed to look at the bottom line. Instead he quotes claims that we have “dumbed down college.” He cites statistics showing that “fewer than 60 percent of freshmen graduate within six years.” He seems to think 60 percent is a bad number, but he offers no evidence for that conclusion. And one might ask: is it better to try and fail than never to try?
At the annual meeting of the American Thoracic Society, a paper was presented that claimed a link between sleep apnea and the risk of getting cancer. It gained national headlines with a claimed fivefold risk of dying from cancer for those suffering from the most severe forms of sleep apnea. Those with a moderate form of sleep apnea had double the risk of dying from cancer. View the New York Times blog post on the paper here and the Fox News report here.
I contacted one of the authors for a copy of the paper that had been presented and was given a link where, for a fee, I could obtain a copy of the paper as it is soon to be published in the American Journal of Respiratory and Critical Care Medicine. That is unacceptable. I am aware that journals need to be funded in some manner. But my position is that once the authors release their findings to the public through the media, full disclosure of the paper is required so that the users of the information, who are no longer just the doctors treating their patients, can properly evaluate the claims being made.
All is not lost, even though I was not provided with a copy of the paper. The analysis was based on data from the Wisconsin Sleep Study, whose design is well documented. In addition, some of the authors of the current paper published an earlier paper in 2008 using some of the same data; there they linked sleep apnea only to overall mortality. The Wisconsin Sleep Study has been following patients since 1988, and both papers draw on much of the same data: the earlier paper involved 18 years of followup, while the current paper involved 22 years.
The first red flag came when I looked at the abstract for the paper and found that the fivefold increased risk of dying from cancer had a confidence interval from 1.7 to 13.2. This is a very wide confidence interval around the estimated relative risk of 4.8. Such wide confidence intervals are symptomatic of very little data. Looking further, the 2008 paper showed that of the 145 people in the original study with moderate to severe sleep apnea, nine had died from cancer by that time. That number has likely increased by a few cases in the four years since the first paper. My first takeaway is that this is too little data to draw any conclusions. The authors included age, sex, obesity, smoking, and a few other variables in their models. But I saw no mention of other obvious and important factors, such as family history of cancer and heart disease.
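Even without the full paper, the reported interval by itself tells us roughly how much data sits behind the estimate. Here is a back-of-the-envelope check of my own, using the standard large-sample formula for an unadjusted log rate ratio; the paper used an adjusted model, so treat this only as a rough consistency check:

```python
# Back-of-the-envelope check: what event counts are consistent with a
# relative risk of 4.8 and a 95% CI of (1.7, 13.2)?
import math

lo, hi = 1.7, 13.2

# The CI is computed on the log scale, so the implied standard error of
# log(RR) is the half-width of the log interval divided by 1.96.
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)
print(f"implied SE of log(RR): {se:.2f}")  # roughly 0.52

# For an unadjusted log rate ratio, SE is about sqrt(1/d1 + 1/d0), with
# d1, d0 the cancer deaths in the exposed and unexposed groups. An SE
# above 0.5 forces 1/d1 + 1/d0 > 0.27, i.e. very few events somewhere.
for d1 in (5, 9, 15, 30):  # hypothetical exposed-group death counts
    d0 = 1 / (se**2 - 1/d1)  # unexposed deaths implied by that d1
    print(f"exposed deaths = {d1:2d} -> implied unexposed deaths ~ {d0:.0f}")
```

Whatever the exact split, an interval this wide is only consistent with event counts in the single digits to low dozens, which squares with the nine cancer deaths reported in the 2008 paper.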
My next comment is on the nature of the statistical testing that was done and how the variances were calculated. The approach used in the paper is what is usually called classical statistical testing. This scenario, and the associated variance calculations, assumes that the researchers decided ahead of time what data they were going to collect. That also means they decided how many years into the study they would test for the association between sleep apnea and cancer. That was apparently not done in this case. Rather, the 2008 paper was written when the association between mortality and sleep apnea first became statistically significant. I fully expect a third paper to be written in a few years, when the authors can claim statistical significance between heart disease and sleep apnea. This type of analysis goes under the name of sequential decision theory and requires a very different calculation of the associated variances than was done for this paper.
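To see why this matters, here is a small simulation of my own (not anything from the paper): collect data year by year under the null hypothesis of no effect, test after each new batch, and publish the first time the test crosses the 5 percent threshold. The nominal error rate is badly inflated.

```python
# A small simulation of why "test every year and publish when p < 0.05"
# breaks the classical calculations: with no real effect at all, peeking
# repeatedly inflates the chance of a spurious "significant" finding.
import random
import statistics

random.seed(1)

def ever_significant(n_looks=20, n_per_look=25, z_crit=1.96):
    """Accumulate pure-noise data, running a z-test after each look."""
    data = []
    for _ in range(n_looks):
        data.extend(random.gauss(0.0, 1.0) for _ in range(n_per_look))
        z = statistics.fmean(data) * len(data) ** 0.5  # sd is 1 by design
        if abs(z) > z_crit:
            return True
    return False

trials = 2000
hits = sum(ever_significant() for _ in range(trials))
print(f"false positive rate with 20 looks: {hits / trials:.1%}")  # well above 5%
```

Proper sequential methods widen the significance thresholds at each interim look precisely to hold that error rate at 5 percent, which is the different variance calculation the authors did not do.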
Beyond this, the major weakness in the paper is data availability. As the study was constructed, it can only measure mortality risk as it relates to cancer and heart disease. If, as I expect, the authors produce a third paper linking heart disease to sleep apnea, then taken together the results will only say that those with sleep apnea have a higher mortality risk. To demonstrate a specific risk for cancer, they need to go beyond the data available in the Wisconsin Sleep Study.
Last week the second annual Data Science Summit was held in Las Vegas. I was not there, and having read some of the reviews, I wish I had been able to attend. But here I want to bring to the foreground the brief blog post on the summit by David Smith at inside-r.org. He notes the maturity evident in the material at the summit and in the process identifies some often neglected issues in any kind of data analysis.
David mentions the talk by Nate Silver on dealing with uncertainty. All too often uncertainty is forgotten in data analysis. It is “give me the number,” and then the user regards that number as accurate no matter how it was obtained. When measures of uncertainty are available, anything outside the standard statistical confidence interval is looked at as being truly different, when in fact it may well be nothing more than a statistical outlier.
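A tiny simulation makes the point (my illustration, not Silver's): draw nothing but noise and see how often a value lands outside the usual 95 percent interval.

```python
# Even when nothing unusual is happening, about 1 in 20 observations
# falls outside a 95% interval, so a point outside the interval is not
# automatic proof of a real difference.
import random

random.seed(7)

n = 1000
outside = sum(abs(random.gauss(0.0, 1.0)) > 1.96 for _ in range(n))
print(f"{outside} of {n} pure-noise draws ({outside / n:.1%}) fall outside "
      "the 95% interval -- ordinary statistical outliers, not real effects")
```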
Then David points to the talk by Michael Brown, who raised issues of recall bias in data collection. This is part of a much bigger problem that the statistical community refers to as non-sampling error. Not only must the data scientist ask about the accuracy of the data in terms of variance; they must also ask whether the data itself is any good: what other sources of error are present, and is the data fit for use in the planned analysis?
Michael Chui dealt with issues of statistical literacy. I am not sure where I stand on the issue of teaching calculus versus teaching statistics. But it is very clear from what is seen in the media and in much of the professional literature that there is a dearth of statistical literacy in this country.
David’s choice of which talks and issues to highlight in his blog is very insightful. That such issues were discussed at the summit shows that the field has indeed reached maturity.