Incomplete analysis – finding patterns in noise

Kaiser Fung, author of Numbers Rule Your World, posted a blog entry about ‘Muzzling Data’: http://junkcharts.typepad.com/numbersruleyourworld/2012/03/we-sometimes-need-a-muzzle-.html

The entry talks about some automated data analysis done by zip code, which projected deviations from average lifespan for individuals in a zip code, broken out by first name. He goes on to show how and why this type of analysis is incomplete. Without a complete view of the data (i.e. what the population’s lifespan variability is overall), it is easy to find patterns in the noise. He theorizes that this type of incomplete analysis might yield headlines such as:

“Your first name reduces your life expectancy!!”, or “Margaret, it’s time to become Elizabeth!”. And why not “James, if you want to live longer, become Elizabeth now!”

The analyst needs to ensure that they are not identifying patterns in noise due to an artifact of their methodology or an incomplete analysis.



Filed under Data, Systems Engineering

‘Already read that’

Analog, Web, Kindle simul-read


Last month I was in a back-and-forth ‘did you see the article..’ scenario with a colleague.

(You know you have been there.. if you don’t get what I mean, check out this Portlandia comedy sketch.)  Everything I sent him, he claimed he already knew about, and he decided to try to one-up me.  Not having infinite monkeys to do all my reading for me (they were busy typing the works of Shakespeare on an infinite number of typewriters), I was almost always lagging behind him.

When I came across this LifeHacker post today, the video from Portlandia made me smile and I had to share it with him and the rest of the team.  Of course, I realized that it was pointless to share it with him, since he obviously had already seen it..

True to form, I am reposting the article without having actually read it.  (I think it mentions something about that in the article, but not having read it fully, I am just guessing..)


Filed under General

“Technical Debt”

This blog post showed up in my G+ feed, shared by a colleague from UVA.

http://www.jsnover.com/blog/2011/12/18/iranian-drone-hack-and-technical-debt/

The author describes “technical debt” as the short-sighted decisions made in the past on a project, which ultimately “come due” at some point in the future.  The post examines the paradigms of ‘whiz kids’ and ‘greybeards’ and the continual tug-of-war between the two camps.  While I am not sure I agree with the mass categorization, it is true that there are people who care about and plan for the future, and there are people who are focused only on the tactical problem at hand.

Essentially, the author is saying “The ghosts of past decisions will bring ruin on the project’s future”.  That has to be inspired by some Shakespearian quote..


Filed under Systems Engineering

Multi-modal analysis and Sentiment Analysis

A colleague pointed me to this Huffington Post article (which has embedded in it the TED talk “Deb Roy: The Birth Of A Word“) in a discussion about Sentiment Analysis.  I guess I have been dismissing all of the sentiment analysis discussions in the past (without really examining the ideas behind them).  I just couldn’t fathom how effective it could be — is it really better than what a few people with some time on their hands could generate?

The TED talk starts with a strange ‘experiment’: Deb Roy wired his house with overhead video and audio in each room, producing 200 terabytes of video recordings (he guesses it is probably the largest catalog of home movies).  Recording started after the birth of his son; three years later, he brought the video to MIT and they started the analysis.  The types of analysis they are performing are extremely interesting — using multi-modal analysis to show correlation.  Spatial proximity, social interaction, audio, and video all play into the analysis they performed.  Watch the video and I think you will be interested in what they are producing.

At ~12:30 in the video, he describes how one of the MIT researchers on his team made the leap from a closed, controlled environment to the public space.  Using public media (e.g. TV) as the video, and social media (e.g. Twitter) as the ‘audio’, they can start showing how the two relate, much like they did with the home movies.  Social interconnectedness was also factored into the analysis.

Maybe I need to think a bit more about whether I should be dismissing these analysis ideas..


Filed under Machine Learning

ORACLE and Big Data

A colleague pointed me at these two resources from the recently held Oracle OpenWorld, as we are discussing Big Data with our clients.


Below is the link to the OpenWorld Big Data keynote presentation, which talks about Oracle’s jump into Big Data.  It runs for 60 minutes but covers a lot of material, with a real use-case example.

 http://www.oracle.com/us/bc1176404856001-513710.html#big-data?iframe=true&width=680&height=400

Below is additional information related to Oracle’s Big Data solutions, including the Oracle Loader for Hadoop, the Oracle NoSQL Database, and Oracle R Enterprise (Oracle’s integration of the open source R statistical and graphics language).

 http://www.oracle.com/us/technologies/big-data/index.html


Filed under Uncategorized

Toolset Career Advice (from John D Cook and J.D. Long)

This came from a blog I follow casually.

http://www.johndcook.com/blog/2011/11/21/career-advice-regarding-tools/

The post summarizes a discussion by J.D. Long about ‘Advice I wish someone gave me early in my career’, with the “20/20” hindsight that one might expect.  The points made by Long, and the summary viewpoints by the poster, are pretty well written.  Here are a few that struck me as pertinent to recent discussions we have been having in my department.  I am rephrasing them based on my opinion and in light of our discussions.

  • Don’t discount Open Source – it is often the toolset which is ultimately the most transportable from job to job (and project to project).
  • “Dependence on tools that are closed license and un-scriptable will limit the scope of problems you can solve. (i.e. Excel) Use them, but build your core skills on more portable & scalable technologies.”

I had to ponder the follow-up point about closed tools a bit to see if I agreed:

“Closed source software is often not scriptable, not because it’s closed source, but because it is often written for consumers who value usability over composability.”

The author makes a point about portability, not just OS to OS, but from scale to scale (scaling up to clusters and down to mobile).  This goes hand in hand with career portability (the longevity of, and demand for, a given toolset), which is always an important consideration for a career.  I think it often comes down to whether you consider yourself:

  • A “tool jockey” – an intense and deep understanding of a given toolset or technology
  • A problem solver, with a (less intense) understanding of a variety of tools

I think that ultimately, irrespective of which camp you might self-identify with, effectiveness really boils down to good Systems Engineering process:

  • “Get really good at asking questions so you understand problems before you start solving them.”


Filed under Systems Engineering

Awesome Tool which (I am sure) uses Machine Learning

Someone recently threw out a challenge to identify the artist of a particular painting/photograph which was hanging on the wall in the movie Iron Man 2.  What she provided was this image:

[‘unknown’ art from Iron Man 2]

This appears to be two frames from the movie?  I can’t recall seeing it in the movie, but ultimately it didn’t matter, since I wasn’t going to go down that path.  (I assume she had already tried that avenue, googling ‘iron man 2 office art’ or something similar.)

 
With machine learning on the brain, I realized this was clearly a machine learning problem.  Thankfully, I remembered I had already found an existing tool to do just this — take an input image and find other images which are the same.  Here is where I first talked about TinEye.
 
All I did was pop the image into Photoshop and chop off the right side, since I expected to match a head-on image of the art and not a combined graphic.  That done, I uploaded the graphic to TinEye and voilà, it returned:
From this website, it appears to be a 1997 photograph by German photographer Ursula Schulz-Dornburg of the Esser River in Tonnay, France.  It looks like the photo was in Lichfield Studios in Notting Hill.
 
I spent a second trying to validate the information, but the artist’s site doesn’t have all her works posted.
 
Still, I think there is a fairly high likelihood that I have found the artist.
WAY TO GO MACHINE LEARNING!
 
(BTW, it took me way longer to write the blog post than to do the search..)
 


Filed under General, Machine Learning

A SQL*Loader example – preserving parent-child relationships

 

A colleague sent me a problem he was having with loading data into ORACLE while preserving parent-child relationships..  Here was my response:

From what I understand, the incoming data has no sequencing or relational values embedded in it, e.g. a field which could designate the parent-child relationship?

Embedded relational fields

If there are relational fields, I would just load the data into temp tables, and then post-process and assign the sequence numbers (the ones _you_ want) based on the previous data set’s relational values.
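
As a rough sketch (the staging tables DEPT_STG / EMP_STG, the embedded key SRC_DEPT_ID, and the sequence DEPT_SEQ below are all my own inventions, not your schema), that post-processing might look something like:

-- 1) Assign our own surrogate keys to the parents from the staging table.
INSERT INTO dept (dept_id, dept_name, src_dept_id)
SELECT dept_seq.NEXTVAL, dept_name, src_dept_id
FROM   dept_stg;

-- 2) Resolve each child to its parent via the embedded relational field.
INSERT INTO dept_emp (dept_id, emp_name)
SELECT d.dept_id, e.emp_name
FROM   emp_stg e
JOIN   dept   d ON d.src_dept_id = e.src_dept_id;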

No embedded relational fields

If there are _not_ relational fields, then I am assuming that the ‘proximity’ or ‘sequencing’ in the file actually designates the parent-child relationship.  In that case, your data might look something like:

01MFGDEPT

22JOE   SMITH

22SAM   SPADE

01SHPDEPT

22JILL  BROWN

22MARTY SPRAT

22CRAIG JONES

01FNCDEPT

22RICK  PRICE

Where 01 designates a dept record and 22 designates an employee who belongs in that department.

Because JOE SMITH immediately follows the MFGDEPT record, the parent-child relationship is assumed.  Is that correct?

(You specified type 2 and type 3 records, from which I am assuming you have one further level of embedded hierarchy.)

My assumption is that you never know how many ‘child’ records you will see?  In this case, it could be 0-N DEPT_EMP records for one DEPT record.

If you always knew that there were a fixed number of child records, you COULD do something miserable such as concatenating multiple physical records into one logical record.  This would keep ‘continuity’ from the parent to the child by slapping them all into one long record.  You would then post-process that combined table, splitting each record out into its destination tables using PL/SQL (and using PL/SQL’s ability to retain the last sequence number, to ensure the parent’s unique sequence number is applied to the child records).  A sketch of this appears after the documentation link below.

ASSEMBLING LOGICAL RECORDS FROM PHYSICAL RECORDS

http://download.oracle.com/docs/cd/E11882_01/server.112/e16536/ldr_control_file.htm#i1005509
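
For the record, a control-file fragment for that miserable approach might look like the following.  This is only a sketch: it assumes exactly two employees per department (the whole weakness of the approach) and the fixed-width layouts of the sample records above, and DEPT_WIDE_STG is a made-up staging table.

LOAD DATA
INFILE 'depts.dat'
-- glue each dept record and its (fixed!) two employee records
-- into one long logical record
CONCATENATE 3
INTO TABLE dept_wide_stg
(dept_name  POSITION(3:9)   CHAR,   -- from the 01 record
 emp1_name  POSITION(12:22) CHAR,   -- from the first 22 record
 emp2_name  POSITION(25:35) CHAR)   -- from the second 22 record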

If you can’t guarantee that, you could use SQL*Loader’s SEQUENCE facility to give each record a unique number which records the order in which the records were loaded.  You could then post-process the temp load tables, using the sequence number to find which child records came just after a parent record.

http://download.oracle.com/docs/cd/E11882_01/server.112/e16536/ldr_field_list.htm#i1008320
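
A minimal sketch of such a control file, with my own guesses at the record layouts and staging table names (depending on how SEQUENCE interacts with multiple INTO TABLE clauses, RECNUM, which simply records the physical record number in the file, may be the more reliable way to capture load order):

LOAD DATA
INFILE 'depts.dat'
INTO TABLE dept_stg
  WHEN (1:2) = '01'
  (load_seq   SEQUENCE(100,1),     -- load-order number, starting at 100
   dept_name  POSITION(3:9)  CHAR)
INTO TABLE dept_emp_stg
  WHEN (1:2) = '22'
  (load_seq   SEQUENCE(100,1),
   emp_name   POSITION(3:14) CHAR)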

For example, the data above would be loaded as:

DEPT

100 MFGDEPT

103 SHPDEPT

107 FNCDEPT

DEPT_EMP

101 JOE SMITH

102 SAM SPADE

104 JILL BROWN

105 MARTY SPRAT

106 CRAIG JONES

108 RICK PRICE

The one point to note here is that if a row in the INFILE is rejected, that sequence number is skipped.  E.g. if SHPDEPT’s DEPT record were rejected, you would not see 103 anywhere in either table.  This may throw off your logic if you are not careful.  E.g. you might think ‘I will just process the DEPT table, find each record, and find all child records between it and the next record in DEPT’.  If a DEPT record got rejected, then you would erroneously allocate its DEPT_EMP records (either to the wrong dept, or not allocate them at all).

Basically, that code would look like a giant loop which runs through all the sequence numbers and thereby creates the parent-child relationships.  You would retain the ‘last parent I encountered’ so you could assign child records to that parent.  Again, you would need to look through the SQL*Loader logs to make sure no records were rejected; otherwise the loop would incorrectly assign parent-child relationships without more intricate ‘am I missing a sequence number’ logic.
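
In rough PL/SQL, that loop might look like the sketch below (dept_stg and dept_emp_stg are the hypothetical staging tables from the control file above; dept and dept_emp are the final tables):

DECLARE
  v_parent_seq NUMBER := NULL;  -- the 'last parent I encountered'
BEGIN
  FOR rec IN (SELECT load_seq, rec_type, name
              FROM (SELECT load_seq, 'DEPT' AS rec_type, dept_name AS name
                      FROM dept_stg
                    UNION ALL
                    SELECT load_seq, 'EMP', emp_name
                      FROM dept_emp_stg)
              ORDER BY load_seq)  -- walk the records in load order
  LOOP
    IF rec.rec_type = 'DEPT' THEN
      v_parent_seq := rec.load_seq;
      INSERT INTO dept (dept_id, dept_name)
      VALUES (rec.load_seq, rec.name);
    ELSIF v_parent_seq IS NOT NULL THEN
      -- a child belongs to the most recent parent; if that parent's record
      -- was rejected, this silently mis-assigns the child (hence the need
      -- to check the SQL*Loader logs first)
      INSERT INTO dept_emp (dept_id, emp_name)
      VALUES (v_parent_seq, rec.name);
    END IF;
  END LOOP;
  COMMIT;
END;
/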


Filed under Data Management, ORACLE

Rebranding as Data Scientists

In a previous post The Data Scientist, I talked about the term and where it fits into the current paradigm.  The topic arose again this week in a post from Kaiser Fung, with an amusing twist — rebranding.

“You have to give it to the computer scientists. They are like the branding agencies of the engineering world. Everything they touch turns to PR gold. Steve Jobs, of course, was their standard bearer, with his infamous “reality distortion field”. Now, they are invading the statistician’s turf, and have already re-branded us as “data scientists”. MIT Technology Review noted this event recently”

I am amused, as Kaiser is, at how rebranding can hype ideas/terms/jobs/technology…  Here is my take:

I agree with the MIT article that it is not so much that ‘data scientists’ do anything differently than statisticians, in terms of their techniques.

However, there is a clear gap between the ‘stats folks’ and the ‘business folks’.  One group speaks math, the other speaks English.  This is the void which I think needs to be filled.  My own personal (and COMPLETELY biased) mental model of a ‘data scientist’ is the cross between statistics, data management, and systems engineering.  The systems engineering (a systemic viewpoint of SE, not a systematic viewpoint of SE) is the key to bridging the void.

This relates to the Susan Holmes statement (that >80% of a statistician’s time is spent preparing the data).  I would contend that it _should_ be more like 40% prepping and applying stats, and 60% describing/conveying what it means.

 


Filed under Uncategorized

Data Quality – Garbage In, Garbage Out

From Kaiser Fung’s blog “Numbers Rule Your World”:

The following article discusses the behind-the-scenes process of preparing data for analysis. It points to the “garbage in, garbage out” problem. One should always be aware of the potential hazards.

“The murky world of student-loan statistics”, Felix Salmon (link)

At the end of this post, Felix found it remarkable that the government would not have better access to the data.

The Reuters blog post by Felix describes the typical problem with data and the challenges facing the analysts who consume it.  The problem is difficult enough when you ‘own all the data’ (i.e. you can examine how the data is created, aggregated, managed, etc., because you are the source).  However, most analysis needs more than one pocket of data and relies on external sources to supplement what you might already have.  The more removed an analyst is from the source, the less insight and understanding they have of its data quality.

One of the more disturbing aspects of Felix’s post is that despite knowing there are significant errors in the previously published data, the NY Fed is only going to modify its interpretation of current and future data.  Thus, the longitudinal view (the view across time) will have this strange (and likely soon forgotten) jump in the amount of student loan debt.  Good luck trying to do a longitudinal study using that data series.

 

A colleague responded to this by citing a CNN interview with a former GM executive discussing why GM declined.  GM’s management culture (dominated by MBAs who are numbers people) made decisions based on what the data told them.  When the former GM executive would bring perspectives from past experience, gut feeling, and subjective judgement, he was advised that he came across as immature.  My colleague commented:

“This provides a footnote to why over-reliance on data is dangerous in itself”

To rephrase/expand his point, I would say:

“Data-centric decision making is the most scientific basis for substantive decision making.  HOWEVER, if you don’t understand the underlying data and its inherent flaws (known and/or unknown), you are living in a dream world.”

I think this is what he meant by ‘over-reliance’ — total trust in the data in front of you, to the exclusion of everything else.

 In my view, you are almost always faced with these two conditions:

  1.  Your data stinks, or at least has some rotten parts.
  2.  You don’t have all the data which you really need/want.

 Once you acknowledge those conditions, you can start examining the ‘gut feel’ and the ‘subjective judgment’ in view of the data gaps.


Filed under Data, Data Management, Data Quality