From Kaiser Fung’s blog “Numbers Rule Your World”:
The following articles discuss the behind-the-scenes process of preparing data for analysis. They point to the “garbage in, garbage out” problem. One should always be aware of the potential hazards.
“The murky world of student-loan statistics”, Felix Salmon (link)
At the end of this post, Felix found it remarkable that the government would not have better access to the data.
The Reuters Blog post by Felix describes the typical problem with data and the challenges facing analysts who consume the data. The problem is difficult enough when ‘you own all the data’ (i.e. can examine how the data is created, aggregated, managed, etc. because you are the source). However, most analysis needs more than one pocket of data and relies on external sources of data to supplement what you might already have. The more removed an analyst is from the source, the less insight that analyst has into the quality of its data.
One of the more disturbing aspects of Felix’s post is the fact that despite knowing there are significant errors in the previously published data, the NY Fed is only going to modify its interpretation of current and future data. Thus, the longitudinal view (the view across time) will have this strange (and likely soon forgotten) jump in the amount of student loan debt. Good luck trying to do a longitudinal study using that data series.
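To make the longitudinal problem concrete, here is a minimal sketch in Python. Every number in it is made up for illustration (it is not NY Fed data), and the 20% undercount and the quarter in which the fix lands are assumptions. It shows how correcting the method only from some point forward plants an artificial jump in a series that, in truth, grew smoothly.

```python
# Hypothetical quarterly balances; illustrative numbers only, not real data.
true_values = [500, 510, 520, 530, 540, 550]

# Assumption: the source undercounted by a fixed share until a methodology
# fix, and the fix is applied only from fix_index onward (no backfill).
undercount_share = 0.20
fix_index = 3

published = [
    round(v * (1 - undercount_share)) if i < fix_index else v
    for i, v in enumerate(true_values)
]


def pct_changes(series):
    """Period-over-period percentage change, rounded to one decimal."""
    return [round(100 * (b - a) / a, 1) for a, b in zip(series, series[1:])]


true_growth = pct_changes(true_values)
published_growth = pct_changes(published)

print("published series:", published)
print("true growth %:", true_growth)
print("published growth %:", published_growth)
```

In the published series, the period that spans the fix shows growth far above its neighbours even though the underlying quantity never jumped. Anyone who picks up the series later, without knowing about the revision, would read that spike as a real event.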
A colleague responded to this by citing a CNN interview of a former GM executive discussing why GM declined. GM’s management culture (dominated by MBAs who are numbers people) made decisions based on what the data told them. When the former GM executive would bring perspectives from past experience, gut feeling and subjective judgement, he was advised that he came across as immature. My colleague commented:
“This provides a footnote to why over-reliance on data is dangerous in itself”
To rephrase/expand his point, I would say:
“Data-centric decision making is the most scientific basis for substantive decision making. HOWEVER, if you don’t understand the underlying data and its inherent flaws (known and/or unknown), you are living in a dream world.”
I think this is what he meant by ‘over-reliance’: total trust in the data in front of you, to the exclusion of everything else.
In my view, you are almost always faced with these two conditions:
- Your data stinks, or at least has some rotten parts.
- You don’t have all the data you really need or want.
Once you acknowledge those conditions, you can start examining the ‘gut feel’ and the ‘subjective judgment’ in view of the data gaps.