Monthly Archives: May 2011

Are you a Data Wrangler?

I came across Data Wrangler today, a research project from Stanford.

From the video, it looks like a data-munger’s dream tool, allowing all of the ‘usual’ transformations we do, and enabling the user to output a script to recreate the transformation in a variety of languages. I have yet to play with it, but I think it has promise. If anyone plays with it extensively, I would love to hear how well it performs for you.

[Also from that team: Protovis, a visualization toolset.]

It looks like Google has a somewhat similar toolset in Google Refine. I am not sure whether Google's tool keeps track of the transformations and lets you export the logic in other languages, but it has its own cool features. I like the ‘Clustering Feature’ for merging similar words or data values, such as:

Firm Fixed Price
FFP:Firm fixed price
Fixed Price
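
To give a rough feel for what that kind of clustering does, here is a minimal Python sketch. It is not Google Refine's actual algorithm, just a simple stand-in under stated assumptions: each value is reduced to a set of lowercase tokens, and values are grouped greedily when their token overlap (Jaccard similarity) passes a threshold. The function names, the 0.6 threshold, and the extra "Time and Materials" value are all hypothetical choices made for this example.

```python
import re


def tokens(value):
    """Lowercase a value and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", value.lower()))


def jaccard(a, b):
    """Share of tokens two values have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def cluster_values(values, threshold=0.6):
    """Greedy grouping: a value joins the first cluster holding a similar member."""
    clusters = []
    for value in values:
        for cluster in clusters:
            if any(jaccard(value, member) >= threshold for member in cluster):
                cluster.append(value)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([value])
    return clusters


values = [
    "Firm Fixed Price",
    "FFP:Firm fixed price",
    "Fixed Price",
    "Time and Materials",  # hypothetical extra value, not from the post
]
clusters = cluster_values(values)
for cluster in clusters:
    print(cluster)
```

With these inputs the three fixed-price variants should land in one group and the unrelated value in its own. A real tool would then let you pick one canonical spelling per group.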

Thankfully, Google Refine is a desktop app instead of forcing you to upload your data into some unknown location with unknown protection.

Thanks to Chris Pudney for the links.

Someone at work also pointed me to



Filed under Data

The Data Scientist

The Fourth Paradigm

My colleague at work today pointed me to the following article, which describes the intersection of technology, data, and the scientific method.

The article mentions the book “The Fourth Paradigm”, which describes this new paradigm of data-driven discovery. I will need to put it on my long list of books to read (which I am making very slow progress through... sigh). There is a review of the book here.

It talks about using tools such as Hadoop, MapReduce, SAS/SPSS/R to crunch through lots of scientific data to obtain meaningful information.

It pretty well sums up where I see myself heading, from a ‘data-centric’ viewpoint.


Filed under Data, Data Management

Commencement Speech

Steve Jobs at Stanford

Having just participated in our graduation proceedings this past weekend, I found it amusing that someone at work posted a link to Steve Jobs’ commencement speech at Stanford University from a few years back.

While I still find Steve Jobs a bit too pompous for my taste (on top of the fact that I had a horrible experience on a project where BDM forced us to use NeXT), I think his speech hit the mark. As we discussed in our Entrepreneurship class, a failed business venture (or getting fired from your own company) does not mean the end of your career. Your skills, knowledge, and inborn mental ability endure beyond any job you might have.

I have been laid off twice, both from companies to which I made massive contributions.  Until you have been there, I don’t think you can fathom what a mental blow it inflicts.  Experiencing that event really rocks your foundation and your belief in yourself. 

“Am I not as good as I thought?”

“Am I really that much of a troublemaker?”

“Should I be doing a different job?”

I think that Steve Jobs really got it right. To paraphrase: “Don’t be constrained by other people’s preferences, dreams, or goals.” And if someone can bounce back from being fired from their own company, it has to give you hope that those folks out of work will persevere and find their next great job.

Filed under UVA

Expensive Paper

An Expensive Piece of Paper

After a fun, hot, and sunburn-inducing weekend down at UVA, I finally received my expensive piece of paper and put it in its new home.

Looks snazzy, although the typography of my Virginia Tech diploma is nicer. Figures.

Filed under UVA

TinEye – Reverse Image Lookup



I came across this site. I didn’t explore it very far, but I am curious to dig into it and see how it works.

Filed under General

Brain Training

Lumosity - Brain Training website

One of the things I did prior to AMP (and have started to do again) is to try to stretch my brain in ways that differ from what I encounter at my job. In the past I have periodically read brain-teasers or thought exercises, but this time I went a different route.

I am not sure how I was turned on to this site, but somehow I ended up at Lumosity. They have a free trial to let you explore the site and their games, which gives you a good sense of how their ‘brain training’ actually works. I am not sure I completely buy their propaganda, but I can tell you that it is both:

   a) a diversion (i.e., fun) and

   b) a boost in confidence about your mental acuity (through improvements seen in the games)

Clearly a big jump in one’s improvement is due to practice and a deeper understanding of how the games are structured. However, I found a few games very interesting. One was the fast-food name/order matching game, which attacks one of my weaknesses – name recall. I found that after focusing on that game, I was more apt to actively attempt to remember someone’s name when I meet them, and therefore have a higher probability of remembering the name.

Again, I am not completely sold on ‘brain training’, but I think it does have some positive benefits. Give it a try and see what you think.

Filed under General, UVA

Photopic Sky Survey

Amazingly cool. The interactive 360-degree view is quite well done, and certainly a “time waster”.

The Photopic Sky Survey is a 5,000 megapixel photograph of the entire night sky stitched together from 37,440 exposures. Large in size and scope, it portrays a world far beyond the one beneath our feet and reveals our familiar Milky Way with unfamiliar clarity. When we look upon this image, we are in fact peering back in time, as much of the light—having traveled such vast distances—predates civilization itself

Thanks to Flowing Data for the post.

Filed under General

Ranking and Clustering

In a recent JunkCharts post, the author shows a good example of transforming data (in this case ranking/clustering data) into a more visual representation. I definitely like the revised representation proposed by the author.

Additionally, it points out that ranking and clustering have trouble showing ‘superiority’, as ordinal scales may not provide sufficient information to support such a conclusion. See also this post for more discussion and an external link.

Filed under Data

Fun with R

A Great Circle (a Riemannian circle), the basis of his maps

I came across this today on Flowing Data (a blog which I sometimes nod my head at, and (at times) frown and shake my head slowly).

Here are some interesting things you can do with R, a tool I learned to use during my Masters program at UVA. We never got into this kind of use, so it is a bit of an eye opener.

The flight data in the Flowing Data example follows the same method as used for Facebook connections.
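
The arcs in maps like these are great-circle paths. As a rough illustration only, here is a small Python sketch of how intermediate points along such a path could be computed. It is not the code from the Flowing Data tutorial (which works in R); it uses the standard spherical interpolation formula and assumes a perfectly spherical Earth. The function name and the sample coordinates are made up for this example.

```python
import math


def great_circle_points(lat1, lon1, lat2, lon2, n=5):
    """Return n (lat, lon) points in degrees along the great circle from
    point 1 to point 2, endpoints included. Assumes a perfect sphere,
    n >= 2, and does not handle exactly antipodal points."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    # Angular distance between the two points (haversine formula).
    h = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin((l2 - l1) / 2) ** 2)
    d = 2 * math.asin(math.sqrt(min(1.0, h)))
    if d == 0:
        return [(lat1, lon1)] * n
    points = []
    for i in range(n):
        f = i / (n - 1)  # fraction of the way along the path
        a = math.sin((1 - f) * d) / math.sin(d)
        b = math.sin(f * d) / math.sin(d)
        # Weighted sum of the two endpoints in 3D Cartesian coordinates.
        x = a * math.cos(p1) * math.cos(l1) + b * math.cos(p2) * math.cos(l2)
        y = a * math.cos(p1) * math.sin(l1) + b * math.cos(p2) * math.sin(l2)
        z = a * math.sin(p1) + b * math.sin(p2)
        lat = math.degrees(math.atan2(z, math.hypot(x, y)))
        lon = math.degrees(math.atan2(y, x))
        points.append((lat, lon))
    return points


# Hypothetical sample: a quarter of the equator, from 0E to 90E.
path = great_circle_points(0.0, 0.0, 0.0, 90.0, n=3)
for lat, lon in path:
    print(round(lat, 4), round(lon, 4))
```

Feeding the resulting points to any plotting library as a line gives the curved "flight path" look; more points per arc means a smoother curve.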

Filed under Data, UVA

Data Governance and Data Retention

Interesting article, and it is becoming relevant to my new task.
Records Management Truisms

1. You can’t keep everything forever.
2. You can’t get rid of everything tomorrow.
3. Your business is impacted when you kill content without regard to what it is. (For example, a post-9/11 investigation into translating terrorism-related recordings found the FBI had, as quoted in The New York Times, “limited storage capacities in the system [which] meant that older tapes had sometimes been deleted automatically to make room for newer materials, even if the recordings had not yet been translated.”)
4. Records retention isn’t sexy. But you have to do it … or else.
5. Information technology needs to “own” the content in its systems or discovery requests will continue to confound IT departments.
6. “Innocent” storage professionals have already been nailed for destruction of evidence after recycling media to make room for new data.
7. There are plenty of laws to guide you when considering what to save and for how long.
8. Employees aren’t very good at records retention or responding to litigation with relevant electronically stored information (ESI).
9. Data is growing rapidly, and can be discovered in all sorts of scattered locations, making your job all the more challenging.
10. Storage is a central part of IT operations and shouldn’t be viewed as an electronic dumping ground for the business.

Filed under Data