A colleague sent me a link to a Forbes post, “5 Cool Ways Big Data Is Changing Lives”, and I have to object to one of the entries in the post: “When Big Data Goes Bad”.
The example referred to is from when Forbes writer Kashmir Hill reported on how Target figured out a teen girl was pregnant before her father did.
Here is my issue with Raj Sabhlok’s inclusion of this as one of “5 Cool Ways Big Data Is Changing Lives”:
1) It clearly is not a ‘cool way big data is changing lives’. PERHAPS I could concede that it is changing lives, but, as implemented, I would not agree it is ‘cool’. (I guess I might concede that the predictive power is cool to data analysis folks.)
2) Labeling the Target scenario as a case where Big Data [Went] Bad misleads the reader. The problem has nothing to do with the data itself; it has to do with the business, the policies, and the implementation details of how the results are used. Those types of foibles have been around since the dawn of marketing. They are not some new phenomenon that Big Data has caused.
Don’t blame the data, or even the techniques for analyzing the data, for things you bring upon yourself based on improper or poor usage / policies.
In John D. Cook’s blog post (http://www.johndcook.com/blog/2010/12/15/big-data-is-not-enough/), he quotes Bradley Efron from an article in Significance. The quote is somewhat counter-culture (or at least thought-provoking) relative to the mainstream ‘Big Data’ mantra: given enough data, you can figure it out. Here is the quote, with John D. Cook’s emphasis added:
“In some ways I think that scientists have misled themselves into thinking that if you collect enormous amounts of data you are bound to get the right answer. You are not bound to get the right answer unless you are enormously smart. You can narrow down your questions; but enormous data sets often consist of enormous numbers of small sets of data, none of which by themselves are enough to solve the thing you are interested in, and they fit together in some complicated way.”
What struck a chord with me (a data guy) was the statement ‘and they fit together in some complicated way’. Every time we examine a data set, there are all kinds of hidden nuances that are embedded in the content, or (more often) in the metadata. Things like:
- ‘Is this everything, or just a sample?’ – If it is a sample, then how was the sample created? Does it represent a random sample, or a time-series sample?
- ‘Are any cases missing from this data set?’ – Oh, the website only logs successful transactions; if a transaction wasn’t successful, it was discarded.
- ‘Are there any procedural biases?’ – When the customer didn’t give us their loyalty card, all of the clerks just swiped their own to give them the discount.
- ‘Is there some data that was not provided due to privacy issues?’ – Oh, that extract has their birthday blanked out.
- ‘How do you know that the data you received is what was sent to you?’ – We figured out the issue – when Jimmy saved the file, he opened it up and browsed through the data before loading. It turns out his cat walked on the keyboard and changed some of the data.
- ‘How do you know that you are interpreting the content properly?’ – Hmm... this column has a bunch of ‘M’s and ‘F’s. That must mean Male and Female. (Or have you just changed the gender of every record because you mistakenly translated ‘M-Mother and F-Father’?)
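One way to guard against that last kind of mistake is to decode coded columns only from a documented codebook, and to fail loudly when the documentation is missing. Here is a minimal Python sketch of that idea. The `CODEBOOK` contents, the column name `relationship`, and the `decode_column` helper are all hypothetical, made up for illustration rather than taken from any real data set.

```python
# Hypothetical codebook, as it might come from the data provider's documentation.
CODEBOOK = {"relationship": {"M": "Mother", "F": "Father"}}


def decode_column(values, column, codebook=CODEBOOK):
    """Decode raw codes using the documented mapping; refuse to guess."""
    if column not in codebook:
        # No documentation for this column: stop rather than assume a meaning.
        raise KeyError(f"no documented codes for column {column!r}")
    mapping = codebook[column]
    unknown = sorted(set(values) - set(mapping))
    if unknown:
        # Codes the documentation does not explain are reported, not dropped.
        raise ValueError(f"undocumented codes in {column!r}: {unknown}")
    return [mapping[v] for v in values]


decoded = decode_column(["M", "F", "M"], "relationship")
print(decoded)
```

Asking the same function to decode an undocumented column (say, a `gender` column nobody wrote down) raises an error instead of quietly returning a plausible-looking answer, which is the whole point.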
All of this gets even more complicated once you start integrating data sets, and this is what Bradley Efron was getting at. These nuances are exacerbated when you try to marry data sets from different places. How do you reconcile two different sets of product codes that have their own procedural biases but essentially report on the same things?
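To make the product-code question concrete, here is a small Python sketch of one possible approach: translate each system's codes through a crosswalk to a shared product name, and keep a list of the codes the crosswalk cannot explain. Everything in it is an assumption for illustration (the sample rows, the code formats, the crosswalk dictionaries, and the `reconcile` helper), and it does nothing about procedural biases, which no mapping table can fix.

```python
from collections import Counter

# Hypothetical extracts: each system reports unit sales under its own product codes.
store_sales = [("A-100", 3), ("A-200", 5), ("A-300", 2)]
web_sales = [("W1", 4), ("W2", 1), ("W9", 7)]

# Hypothetical crosswalks from each system's code to a shared product name.
store_to_product = {"A-100": "widget", "A-200": "gadget", "A-300": "gizmo"}
web_to_product = {"W1": "widget", "W2": "gadget"}  # W9 was never mapped


def reconcile(rows, crosswalk):
    """Total units per shared product and collect the codes with no mapping."""
    totals = Counter()
    unmapped = []
    for code, units in rows:
        product = crosswalk.get(code)
        if product is None:
            unmapped.append(code)  # surface it instead of silently dropping it
        else:
            totals[product] += units
    return totals, unmapped


store_totals, store_unmapped = reconcile(store_sales, store_to_product)
web_totals, web_unmapped = reconcile(web_sales, web_to_product)

combined = store_totals + web_totals
print(dict(combined))
print("unmapped store codes:", store_unmapped)
print("unmapped web codes:", web_unmapped)
```

The unmapped list matters as much as the totals. In this made-up data, the seven units sold under `W9` never reach the combined numbers, and without that list nobody would know to ask why.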
Full article here:
A pretty good summary of use cases for ‘big data’. This always ends up being the first question people ask when they are exposed to the idea of ‘big data’: “What the heck do _we_ do that is considered Big Data?” A lot of the time they ask because their organization doesn’t currently deal with these use cases BUT SHOULD to remain competitive. Things are a-changing.