Monthly Archives: October 2011

Awesome Tool which (I am sure) uses Machine Learning

Someone recently threw out a challenge to identify the artist of a particular painting/photograph hanging on the wall in the movie Iron Man 2.  What she provided was this image:

[image: ‘unknown’ art from Iron Man 2]

This appears to be two frames from the movie?  I can’t recall seeing it in the movie, but ultimately that didn’t matter, since I wasn’t going to go down that path.  (I assume she had already tried that avenue, googling ‘iron man 2 office art’ or something similar.)

 
With machine learning on the brain, I realized this was clearly a machine learning problem.  Thankfully, I remembered I had already found an existing tool that does just this: take an input image and find other images which are the same.  Here is where I first talked about TinEye.

All I did was pop the image into Photoshop and chop off the right side, since I expected to match a head-on image of the art and not a combined graphic.  That done, I uploaded the graphic to TinEye and voilà, it returned a match.
From this website, it appears to be a 1997 photograph of Tonnay, France’s Esser River by the German photographer Ursula Schulz-Dornburg.  It looks like the photo hung in Lichfield Studios in Notting Hill.
 
I spent a second trying to validate the information, but the artist’s site doesn’t have all her works posted.
 
Still, I think there is fairly high likelihood that I have found the artist.
WAY TO GO MACHINE LEARNING!
 
(BTW, it took me way longer to write the blog post than to do the search.)
 

Filed under General, Machine Learning

A SQL*Loader example – preserving parent-child relationships

 

A colleague sent me a problem he was having with loading data into Oracle while preserving parent-child relationships.  Here was my response:

From what I understand, the incoming data has no sequencing or relational values embedded in it?  E.g. a field which could designate the parent-child relationship?

Embedded relational fields

If there are relational fields, I would just load the data into temp tables, and then post-process and assign the sequence numbers (which _you_ want) based on the previous data set’s relational values.
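For example, a rough sketch of that post-processing (all the table, column, and sequence names here are made up for illustration: dept_in, dept_emp_in, legacy_key, dept_seq, emp_seq):

-- 1) Give each parent a new sequence number, keeping its legacy key around.
INSERT INTO dept (dept_id, dept_name, legacy_key)
  SELECT dept_seq.NEXTVAL, dept_name, legacy_key
  FROM   dept_in;

-- 2) Resolve each child's embedded legacy key to its parent's new key.
INSERT INTO dept_emp (emp_id, dept_id, emp_name)
  SELECT emp_seq.NEXTVAL, d.dept_id, s.emp_name
  FROM   dept_emp_in s
  JOIN   dept d ON d.legacy_key = s.legacy_dept_key;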

No embedded relational fields

If there are _not_ relational fields, then I am assuming that the ‘proximity’ or ‘sequencing’ in the file actually designates the parent-child relationship.  In that case, your data might look something like:

01MFGDEPT
22JOE   SMITH
22SAM   SPADE
01SHPDEPT
22JILL  BROWN
22MARTY SPRAT
22CRAIG JONES
01FNCDEPT
22RICK  PRICE

Where 01 designates a dept record and 22 designates an employee who belongs in that department.

Because Joe Smith immediately follows the MFGDEPT record, the parent-child relationship is assumed.  Is that correct?

(You specified type 2 and type 3, so I am assuming you have one further level of embedded hierarchy.)

My assumption is that you never know how many ‘child’ records you will see?  In that case, there could be 0-N DEPT_EMP records for one DEPT record.

If you always knew that there were a fixed number of child records, you COULD do something miserable such as concatenating multiple physical records into one logical record.  This keeps ‘continuity’ from parent to child by slapping them all into one long record.  You would then post-process that combined table, splitting each record into its destination tables using PL/SQL (and using PL/SQL’s ability to retain the last sequence number, to ensure the parent’s unique sequence number is applied to the child records).

ASSEMBLING LOGICAL RECORDS FROM PHYSICAL RECORDS

http://download.oracle.com/docs/cd/E11882_01/server.112/e16536/ldr_control_file.htm#i1005509
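To make that concrete, here is a rough control-file sketch against the sample data above (the staging table and column names are invented, and it assumes exactly two employee records always follow each department record):

LOAD DATA
INFILE 'depts.dat'
CONCATENATE 3  -- glue every 3 physical records into 1 logical record
INTO TABLE dept_wide_stg
(
  dept_type  POSITION(1:2)   CHAR,  -- '01'
  dept_name  POSITION(3:9)   CHAR,
  emp1_type  POSITION(10:11) CHAR,  -- '22'
  emp1_name  POSITION(12:22) CHAR,
  emp2_type  POSITION(23:24) CHAR,
  emp2_name  POSITION(25:35) CHAR
)

You would then split dept_wide_stg back out into the real DEPT and DEPT_EMP tables in PL/SQL.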

If you can’t guarantee that, you could use SQL*Loader’s SEQUENCE facility to give each record a unique number reflecting the order in which the records were loaded.  You could then post-process the temp load tables, using the sequence numbers to find which child records came just after a parent record.

http://download.oracle.com/docs/cd/E11882_01/server.112/e16536/ldr_field_list.htm#i1008320
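Roughly, that load might look like the following (staging table names invented; SEQUENCE(100,1) starts at 100 and increments by 1 for each record in the file, so the numbers interleave across the two tables in load order):

LOAD DATA
INFILE 'depts.dat'
INTO TABLE dept_stg
  WHEN (1:2) = '01'
( load_seq  SEQUENCE(100,1),
  dept_name POSITION(3:9)  CHAR )
INTO TABLE dept_emp_stg
  WHEN (1:2) = '22'
( load_seq  SEQUENCE(100,1),
  emp_name  POSITION(3:13) CHAR )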

For example, the data above would be loaded as:

DEPT
100 MFGDEPT
103 SHPDEPT
107 FNCDEPT

DEPT_EMP
101 JOE SMITH
102 SAM SPADE
104 JILL BROWN
105 MARTY SPRAT
106 CRAIG JONES
108 RICK PRICE

The one point to note here is that if a row in the INFILE is rejected, that sequence number is skipped.  E.g. if SHPDEPT’s DEPT record were rejected, you would not see 103 anywhere in either table.  This may throw off your logic if you are not careful.  E.g. you might think ‘I will just process the DEPT table, find each record, and find all child records between it and the next record in DEPT’.  If a DEPT record got rejected, you would erroneously allocate the DEPT_EMP records (either to the wrong dept, or not allocate them at all).

Basically, that code would look like a giant loop which runs through all the sequence numbers and thereby creates the parent-child relationships.  You would retain the ‘last parent I encountered’ so you could assign child records to that parent.  Again, you would need to look through the SQL*Loader logs to make sure no records were rejected; otherwise the loop would incorrectly assign parent-child relationships unless it includes more intricate ‘am I missing a sequence number’ logic.
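Here is a rough PL/SQL sketch of that loop, including a naive ‘am I missing a sequence number’ check (the table and sequence names are again invented, and it assumes the load started the sequence at 100):

DECLARE
  v_dept_id  dept.dept_id%TYPE;
  v_prev_seq PLS_INTEGER := 99;  -- one less than the starting SEQUENCE value
BEGIN
  FOR rec IN (
    SELECT load_seq, 'D' AS rec_kind, dept_name AS name FROM dept_stg
    UNION ALL
    SELECT load_seq, 'E', emp_name FROM dept_emp_stg
    ORDER BY load_seq
  ) LOOP
    -- A gap just before a child row means its parent may have been rejected.
    IF rec.rec_kind = 'E' AND rec.load_seq <> v_prev_seq + 1 THEN
      RAISE_APPLICATION_ERROR(-20001,
        'Missing sequence before employee at ' || rec.load_seq ||
        '; check the SQL*Loader log for rejected records');
    END IF;
    v_prev_seq := rec.load_seq;

    IF rec.rec_kind = 'D' THEN
      INSERT INTO dept (dept_id, dept_name)
      VALUES (dept_seq.NEXTVAL, rec.name)
      RETURNING dept_id INTO v_dept_id;  -- remember the last parent seen
    ELSE
      INSERT INTO dept_emp (emp_id, dept_id, emp_name)
      VALUES (emp_seq.NEXTVAL, v_dept_id, rec.name);
    END IF;
  END LOOP;
  COMMIT;
END;
/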


Filed under Data Management, ORACLE

Rebranding as Data Scientists

In a previous post The Data Scientist, I talked about the term and where it fits into the current paradigm.  The topic arose again this week in a post from Kaiser Fung, with an amusing twist — rebranding.

“You have to give it to the computer scientists. They are like the branding agencies of the engineering world. Everything they touch turns to PR gold. Steve Jobs, of course, was their standard bearer, with his infamous “reality distortion field”. Now, they are invading the statistician’s turf, and have already re-branded us as “data scientists”. MIT Technology Review noted this event recently”

I am amused, as Kaiser is, at how rebranding can hype ideas/terms/jobs/technology.  Here is my take:

I agree with the MIT article that it is not so much that ‘data scientists’ do anything differently than statisticians, in terms of their techniques.

However, there is a clear gap from the ‘stats folks’ to the ‘business folks’.  One group speaks math, the other speaks English.  This is the void which I think needs to be filled.  My own personal (and COMPLETELY biased) mental model of a ‘data scientist’ is a cross between statistics, data management, and system engineering.  The system engineering (a systemic viewpoint of SE, not a systematic viewpoint of SE) is the key to bridging the void.

This relates to the Susan Holmes statement (that >80% of a statistician’s time is spent preparing the data).  I would contend that it _should_ be more like 40% prepping and applying stats, and 60% describing/conveying what it means.

 


Filed under Uncategorized

Data Quality – Garbage In, Garbage Out

From Kaiser Fung’s blog “Numbers Rule Your World”:

The following articles discuss the behind-the-scenes process of preparing data for analysis. They point to the “garbage in, garbage out” problem. One should always be aware of the potential hazards.

“The murky world of student-loan statistics”, Felix Salmon (link)

At the end of this post, Felix found it remarkable that the government would not have better access to the data.

The Reuters blog post by Felix describes the typical problem with data and the challenges facing analysts who consume it.  The problem is difficult enough when ‘you own all the data’ (i.e. you can examine how the data is created, aggregated, managed, etc., because you are the source).  However, most analysis needs more than one pocket of data and relies on external sources to supplement what you might already have.  The more removed an analyst is from the source, the less insight and understanding they have of its data quality.

One of the more disturbing aspects of Felix’s post is that despite knowing there are significant errors in the previously published data, the NY Fed is only going to modify its interpretation of current and future data.  Thus, the longitudinal view (the view across time) will have this strange (and likely soon forgotten) jump in the amount of student loan debt.  Good luck trying to do a longitudinal study using that data series.

 

A colleague responded to this by citing a CNN interview with a former GM executive discussing why GM declined.  GM’s management culture (dominated by MBAs who are numbers people) made decisions based on what the data told them.  When the former GM executive would bring in perspectives from past experience, gut feeling, and subjective judgment, he was advised that he came across as immature.  My colleague commented:

“This provides a footnote to why over-reliance on data is dangerous in itself”

To rephrase/expand his point, I would say:

“Data-centric decision making is the most scientific basis for substantive decision making.  HOWEVER, if you don’t understand the underlying data and its inherent flaws (known and/or unknown), you are living in a dream world.”

I think this is what he meant by ‘over-reliance’: total trust in the data in front of you, to the exclusion of everything else.

 In my view, you are almost always faced with these two conditions:

  1.  Your data stinks, or at least has some rotten parts.
  2.  You don’t have all the data which you really need/want.

 Once you acknowledge those conditions, you can start examining the ‘gut feel’ and the ‘subjective judgment’ in view of the data gaps.


Filed under Data, Data Management, Data Quality