Last week the second annual Data Science Summit was held in Las Vegas. I was not there, and having read some of the reviews, I wish I had been able to attend. But here I want to bring to the foreground the brief blog post on the summit by David Smith at inside-r.org. He notes the maturity evident in the material at the summit and in the process identifies some often neglected issues in any kind of data analysis.
David mentions the talk by Nate Silver on dealing with uncertainty. All too often uncertainty is forgotten in data analysis. It is “give me the number,” and then the user regards that number as accurate no matter how it was obtained. Even when measures of uncertainty are available, anything outside of the standard statistical confidence interval is looked at as being truly different, when in fact it may well be nothing more than a statistical outlier.
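A minimal Python sketch, not taken from any of the talks, makes the outlier point concrete. It assumes a made-up population (normal, mean 0, standard deviation 1) and a hypothetical helper named fraction_outside_interval. The helper draws many samples from that one unchanging population and counts how often the approximate 95% interval around the sample mean misses the true mean. Nothing about the population ever changes, yet a small share of the intervals still miss.

```python
import random
import statistics


def fraction_outside_interval(n_samples=2000, sample_size=50, z=1.96, seed=0):
    """Share of approximate 95% intervals that miss the true mean.

    Every sample comes from the same population, so each miss is
    sampling noise and not a real difference.
    """
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    true_mean, true_sd = 0.0, 1.0      # illustrative population, an assumption
    misses = 0
    for _ in range(n_samples):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(sample_size)]
        mean = statistics.fmean(sample)
        # Normal-approximation half width: z * (sample sd / sqrt(n))
        half_width = z * statistics.stdev(sample) / sample_size ** 0.5
        if abs(mean - true_mean) > half_width:
            misses += 1
    return misses / n_samples


rate = fraction_outside_interval()
print(f"share of intervals missing the true mean: {rate:.3f}")
```

The printed share should land in the neighborhood of one in twenty, which is what a 95% interval promises. A single result outside the interval is therefore weak evidence of a real difference on its own.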
Then David points to the talk by Michael Brown, where he raised issues of recall bias in data collection. This is part of a much bigger problem that the statistical community refers to as non-sampling error. Not only must the data scientist ask about the accuracy of the data in terms of variance, but they must also ask whether the data itself is any good: what other sources of error are present, and is the data fit for use in the planned analysis?
Michael Chui dealt with issues of statistical literacy. I am not sure where I stand on the issue of teaching calculus vs teaching statistics. But it is very clear from what is seen in the media and in much of the professional literature that there is a dearth of statistical literacy in this country.
David’s choice of which talks and issues to highlight in his blog is very insightful. The discussion of such issues at the summit shows that it has indeed reached maturity.