Category Archives: Data

Personal Annual Reports

I had never seen this site before, but I came across it while watching something about Information Visualization:

http://feltron.com/

I guess I probably should have known of Nicholas Felton already, judging from his high-profile ‘about me’:

He is the co-founder of Daytum.com, and currently a member of the product design team at Facebook. His work has been profiled in publications including the Wall Street Journal, Wired and Good Magazine and has been recognized as one of the 50 most influential designers in America by Fast Company.

Anyway, he has been producing ‘Personal Annual Reports’ which reflect each year’s activities.  I haven’t dug too deeply, but they struck me as quite interesting and worth a deeper dive.


Filed under Data Visualization, Uncategorized

“Big Data” to solve everything?

In a blog post (http://www.johndcook.com/blog/2010/12/15/big-data-is-not-enough/), John D. Cook quotes Bradley Efron from an article in Significance.  It is somewhat counter to (or at least thought-provoking relative to) the mainstream ‘Big Data’ mantra: given enough data, you can figure it out.  Here is the quote, with John D. Cook’s emphasis added:

“In some ways I think that scientists have misled themselves into thinking that if you collect enormous amounts of data you are bound to get the right answer. You are not bound to get the right answer unless you are enormously smart. You can narrow down your questions; but enormous data sets often consist of enormous numbers of small sets of data, none of which by themselves are enough to solve the thing you are interested in, and they fit together in some complicated way.”

What struck a chord with me (a data guy) was the statement ‘and they fit together in some complicated way’.  Every time we examine a data set, there are all kinds of hidden nuances embedded in the content, or (more often) in the metadata.  Things like the following (a rough sketch of such sanity checks appears after the list):

  • ‘Is this everything, or just a sample?’  –  If it is a sample, how was the sample created?  Does it represent a random sample, or a time-series sample?
  • ‘Are there any cases missing from this data set?’  –  Oh, the website only logs successful transactions; if it wasn’t successful, it was discarded.
  • ‘Are there any procedural biases?’ – When the customer didn’t give us their loyalty card, all of the clerks just swiped their own to give them the discount.
  • ‘Is there some data that was not provided due to privacy issues?’ – Oh, that extract has the birthdays blanked out.
  • ‘How do you know that the data you received is what was sent to you?’  – We figured out the issue: when Jimmy saved the file, he opened it up and browsed through the data before loading.  It turns out his cat walked on the keyboard and changed some of the data.
  • ‘How do you know that you are interpreting the content properly?’  –  Hmm… this column has a bunch of ‘M’s and ‘F’s… that must mean Male and Female.  (Or have you just changed the gender of all the records because you mistakenly translated ‘M-Mother and F-Father’?)
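To make those questions concrete, here is a minimal sketch in R of the kind of up-front sanity checks they imply.  The file name and column names (gender_code, txn_date) are hypothetical, and the date format is assumed to be ISO; the point is simply to interrogate a data set before trusting it.

    # Hypothetical extract; column names are invented for illustration.
    df <- read.csv("transactions.csv", stringsAsFactors = FALSE)

    # Everything or a sample?  A suspiciously round row count can hint
    # at a truncated or sampled extract.
    nrow(df)

    # What values actually appear in a coded column?  Never assume
    # 'M'/'F' means Male/Female without checking the data dictionary.
    table(df$gender_code, useNA = "ifany")

    # Missing cases?  Gaps between transaction dates may mean failed
    # transactions were silently discarded.
    dates <- as.Date(df$txn_date)   # assumes ISO yyyy-mm-dd
    range(dates)
    table(diff(sort(unique(dates))))

    # Privacy blanking shows up as missing values.
    colSums(is.na(df))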

All of this is even more complicated once you start integrating data sets, and this is what Bradley Efron was getting at.  All of these nuances are exacerbated when you start trying to marry data sets from different places.  How do you reconcile two different sets of product codes which have their own procedural biases, but essentially report on the same things?

Full article here:

http://www-stat.stanford.edu/~ckirby/brad/other/2010Significance.pdf


Filed under Big Data, Data, Data Management, Data Quality

Big Data use cases

A pretty good summary of use cases for ‘big data’.  The first question that comes up whenever people are exposed to the idea is always: “What the heck do _we_ do which is considered Big Data?”  A lot of times this is because organizations don’t currently deal with these use cases BUT SHOULD in order to remain competitive.  Things are a-changing.

http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/


Filed under Big Data, Data, Data Management, Systems Engineering

Incomplete analysis – finding patterns in noise

Kaiser Fung, author of Numbers Rule Your World, posted a blog entry about ‘Muzzling Data’: http://junkcharts.typepad.com/numbersruleyourworld/2012/03/we-sometimes-need-a-muzzle-.html

The entry discusses an automated analysis done by zip code, which projected the deviation from average lifespan for individuals in each zip code, broken out by first name. He goes on to show how and why this type of analysis is incomplete. Without a complete view of the data (i.e., what the population’s lifespan variability is overall), it is easy to find patterns in the noise of the data. He theorizes that this type of incomplete analysis might yield headlines such as:

“Your first name reduces your life expectancy!!”, or “Margaret, it’s time to become Elizabeth!”. And why not “James, if you want to live longer, become Elizabeth now!”

The analyst needs to ensure that they are not identifying patterns in noise due to an artifact of their methodology or an incomplete analysis.
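As a quick illustration of how easy it is to manufacture such headlines, here is a small R simulation (the names and numbers are invented).  Lifespans are generated with no relationship to first name at all, yet grouping by name still produces apparent ‘winners’ and ‘losers’:

    # Lifespans drawn independently of name; any 'name effect' is noise.
    set.seed(42)
    n <- 100000
    pool <- c("Margaret", "Elizabeth", "James", "John", "Mary",
              "Linda", "Robert", "Patricia", "Michael", "Barbara")
    first_name <- sample(pool, n, replace = TRUE)
    lifespan   <- rnorm(n, mean = 78, sd = 12)

    # Someone always 'wins' and someone always 'loses' the ranking.
    sort(tapply(lifespan, first_name, mean))

    # The spread of those group means matches what sampling noise
    # predicts (sd / sqrt(group size)), i.e., no real effect.
    sd(tapply(lifespan, first_name, mean))
    12 / sqrt(n / length(pool))

The complete view (the overall variability) is exactly what Fung argues the original analysis was missing.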


Filed under Data, Systems Engineering

Data Quality – Garbage In, Garbage Out

From Kaiser Fung’s blog “Numbers Rule Your World”:

The following article discusses the behind-the-scenes process of preparing data for analysis. It points to the “garbage in, garbage out” problem. One should always be aware of the potential hazards.

“The murky world of student-loan statistics”, Felix Salmon (link)

At the end of this post, Felix found it remarkable that the government would not have better access to the data.

The Reuters blog post by Felix describes the typical problems with data and the challenges facing the analysts who consume it.  The problem is difficult enough when ‘you own all the data’ (i.e., you can examine how the data is created, aggregated, managed, etc., because you are the source).  However, most analysis needs more than one pocket of data, and relies on external sources to supplement what you might already have.  The more removed an analyst is from the source, the less insight and understanding they have into its data quality.

One of the more disturbing aspects of Felix’s post is the fact that despite knowing there are significant errors in the previously published data, the NY Fed is only going to modify interpretation of current and future data.  Thus, the longitudinal view (the view across time) will have this strange (and likely soon forgotten) jump in the amount of student loan debt.  Good luck trying to do a longitudinal study using that data series.


A colleague responded to this by citing a CNN interview with a former GM executive discussing why GM declined.  GM’s management culture (dominated by MBAs who are numbers people) made decisions based on what the data told them.  When the former GM executive would bring in perspectives from past experience, gut feeling, and subjective judgment, he was advised that he came across as immature.  My colleague commented:

“This provides a footnote to why over-reliance on data is dangerous in itself”

To rephrase/expand his point, I would say:

“Data-centric decision making is the most scientific basis for substantive decision making.  HOWEVER, if you don’t understand the underlying data and its inherent flaws (known and/or unknown), you are living in a dream world.”

I think this is what he meant by ‘over-reliance’: total trust in the data in front of you, to the exclusion of everything else.

 In my view, you are almost always faced with these two conditions:

  1.  Your data stinks, or at least has some rotten parts.
  2.  You don’t have all the data which you really need/want.

Once you acknowledge those conditions, you can start examining the ‘gut feel’ and the ‘subjective judgment’ in view of the data gaps.


Filed under Data, Data Management, Data Quality

Information Graphics: Effort to Create vs. Effort to Interpret

Kaiser Fung (from Junkcharts) posted this comparison of Creation vs Interpretation on the Statistics Forum.

This resonated with me, since a colleague at work and I had been debating whether the plethora of visualization toolsets is really ‘good’ for data professionals.  My contention is that certain tools just make it easier to make horrible charts.  I think my quote was:

“This is like (possibly) giving young children access to chainsaws”

Just because you can use a given type of chart doesn’t mean you should.  Tools like this can make it even easier to generate horrible visualizations like this one: http://junkcharts.typepad.com/junk_charts/2011/04/worst-statistical-graphic-nominated.html

Sometimes making things easier to do is not always the best answer.  (“Now you too can run your own nuclear reactor, with three simple controls!”)

[Image: Kaiser Fung’s “Return on Effort Matrix”]

But I digress from Kaiser Fung’s point.  His point is that different information/statistical charts have very different ‘usability’ for the reader, along with different levels of effort for the creator.  He turned this into a simple quad chart which I think is pretty reasonable (even though I am really not a fan of the Napoleon’s March chart).  I think one central theme in most complaints about ‘bad graphics’ is either a total lack of a point, or a point so obscured by the details of the graphic that it gets lost.  The one exception is the high-effort, high-reward quadrant: graphics which try to tell several stories or illuminate several key points are often complex in nature, and these exploratory-type graphics require significant skill and background to interpret and to find the information in the ‘haystack’ of the graphic.


Filed under Data

Are you a Data Wrangler?

I came across Data Wrangler (http://vis.stanford.edu/wrangler/) today, a research project from Stanford.

From the video, it looks like a data-munger’s dream tool, allowing all of the ‘usual’ transformations we do, and enabling the user to output a script to recreate the transformation in a variety of languages. I have yet to play with it, but I think it has promise. If anyone plays with it extensively, I would love to hear how well it performs for you.

[Also from that team, Protovis (http://vis.stanford.edu/protovis/), a visualization toolset.]

It looks like Google has a somewhat similar toolset in Google Refine.  I am not sure whether Google’s tool keeps track of the transformations, enabling export of the logic in other languages, but it has its own cool features.  I like the clustering feature for transforming similar words/data values such as the following (a rough sketch of this kind of grouping appears after the examples):

FFP
Firm Fixed Price
FFP:Firm fixed price
Fixed Price
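For a sense of how that kind of grouping can work, here is a rough sketch in R using plain edit distance.  This is only an illustration of the idea, not Refine’s actual algorithm:

    # The four values from above, with case/punctuation normalized.
    values <- c("FFP", "Firm Fixed Price", "FFP:Firm fixed price", "Fixed Price")
    norm   <- gsub("[^a-z ]", "", tolower(values))

    # Pairwise edit distances, single-linkage clustering, and a cut at
    # an arbitrary threshold to propose candidate groups for review.
    d      <- adist(norm)
    groups <- cutree(hclust(as.dist(d), method = "single"), h = 6)
    split(values, groups)

    # Note: the bare acronym "FFP" lands in its own group; mapping
    # acronyms to expansions needs a lookup table, not edit distance,
    # which is one reason such tools propose groups rather than
    # merging values automatically.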

Thankfully, Google Refine is a desktop app instead of forcing you to upload your data into some unknown location with unknown protection.

Thanks to Chris Pudney for the links.

Someone at work also pointed me to http://openii.sourceforge.net/index.php



Filed under Data

The Data Scientist

[Image: “The Fourth Paradigm” book cover]

My colleague at work today pointed me to the following article, which describes the intersection of technology, data, and the scientific method.

http://www.dbta.com/Articles/Columns/Applications-Insight/The-Rise-of-the-Data-Scientist-75428.aspx

The article mentions the book “The Fourth Paradigm”, which describes this new paradigm of data-driven discovery.  I will need to put it on my long list of books to read (which I am making very slow progress through… sigh).  There is a review of the book here.

It talks about using tools such as Hadoop, MapReduce, SAS/SPSS/R to crunch through lots of scientific data to obtain meaningful information.
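For anyone who has not seen the pattern, here is a toy illustration of the map/reduce idea in base R (a word count on invented documents, nothing like Hadoop scale): each document is mapped to partial counts, and the partials are reduced into a total.

    # Invented documents for illustration.
    docs <- c("big data is not enough", "data beats opinion", "big big data")

    # Map: each document becomes its own word counts.
    map_fn <- function(doc) table(strsplit(doc, " ")[[1]])

    # Reduce: merge two partial counts into one.
    reduce_fn <- function(a, b) {
      keys <- union(names(a), names(b))
      sapply(keys, function(k) sum(a[k], b[k], na.rm = TRUE))
    }

    Reduce(reduce_fn, Map(map_fn, docs))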

It pretty well sums up where I see myself heading/positioning myself, from a ‘data-centric’ viewpoint.


Filed under Data, Data Management

Ranking and Clustering

In a recent JunkCharts post (http://junkcharts.typepad.com/junk_charts/2011/05/rank-confusion.html), the author shows a good example of transforming data (in this case, ranking/clustering data) into a more visual representation.  I definitely like the revised representation proposed by the author.

Additionally, it points out that ranking and clustering have challenges in showing ‘superiority’, as ordinal scales may not provide sufficient information to support such a conclusion.  See also this post for more discussion and an external link: http://junkcharts.typepad.com/numbersruleyourworld/2011/02/why-statisticians-ignore-rankings.html
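A tiny R example of the point (the scores are invented): the rank ladder looks evenly spaced even when the underlying gaps are wildly uneven.

    # A and B are nearly tied; D is far behind, but the ranks hide that.
    scores <- c(A = 91.2, B = 91.1, C = 90.9, D = 62.0)
    rank(-scores)                           # 1, 2, 3, 4: an even-looking ladder
    -diff(sort(scores, decreasing = TRUE))  # actual gaps: 0.1, 0.2, 28.9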


Filed under Data

Fun with R

[Image: A Great Circle (a Riemannian circle), the basis of his maps]

I came across this today on Flowing Data (a blog which I sometimes nod my head at, and (at times) frown and shake my head slowly).

Here are some interesting things you can do with R, a tool I learned to use during my Masters program at UVA. We never got into this kind of use, so it is a bit of an eye-opener.

http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/

The flight data in the FlowingData example follows the same method that Paul Butler used for visualizing Facebook connections: http://paulbutler.org/archives/visualizing-facebook-friends/
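For anyone who wants to try it before reading the full tutorial, here is a minimal sketch along the same lines.  It assumes the maps and geosphere packages are installed, and the airport coordinates are approximate:

    library(maps)       # world map outline
    library(geosphere)  # great-circle interpolation

    map("world", col = "grey80", fill = TRUE, bg = "white", lwd = 0.1)

    # One route: New York (JFK) to London (LHR), as lon/lat pairs.
    jfk <- c(-73.78, 40.64)
    lhr <- c(-0.45, 51.47)

    # Interpolate points along the great circle and draw it.
    route <- gcIntermediate(jfk, lhr, n = 100, addStartEnd = TRUE)
    lines(route, col = "steelblue", lwd = 2)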


Filed under Data, UVA