Category Archives: Data Management

Forbes post “5 Cool Ways Big Data Is Changing Lives”

A colleague sent me a link to a Forbes post, “5 Cool Ways Big Data Is Changing Lives”, and I have to object to one of the entries in the post: “When Big Data Goes Bad”.

The example referred to is Forbes writer Kashmir Hill's report on how Target figured out a teen girl was pregnant before her father did.

Here is my issue with Raj Sabhlok's inclusion of this as one of the “5 Cool Ways Big Data Is Changing Lives”:

1) It clearly is not a 'cool way big data is changing lives.' PERHAPS I could concede that it is changing lives, but, as implemented, I would not agree that it is 'cool.' (I guess I might concede that the predictive power is cool to data analysis folks.)

2) Labeling the Target scenario as a case where Big Data [Went] Bad misleads the reader. It has nothing to do with the data itself; it has to do with the business, the policies, and the implementation details of how the results might be used. And those types of foibles have been around since the dawn of marketing, not as some new phenomenon that Big Data has caused.

Don't blame the data, or even the techniques for analyzing the data, for things you bring upon yourself through improper or poor usage and policies.


Filed under Analytics, Big Data

“Actionable Analytics”

A colleague sent me a request for information about 'actionable analytics'.  The request came from their government customer, who wanted whitepapers, research, etc. on 'actionable analytics'.

Hmm.

First, I asked "How did the question/request come about?"  Sometimes things go askew from the question to the request to the response (and even further askew when trying to gather inputs from others).

Second (and the reason I asked the first question), I think (my opinion, and yet to be refuted by my cursory research) that 'actionable analytics' is a combination of marketing hype (e.g. Gartner) and poor phrasing for an existing concept.

  • RE: hype.  Looking at Google Trends, you can see that this phrase is a recent phenomenon.  Here: http://www.google.com/trends/explore#q=actionable%20analytics&date=today%2012-m&cmpt=q is the last 12 months, and you will notice there is a peak in Jan 2013.  Lo and behold, that coincided with Gartner's publication (http://www.gartner.com/resId=2316120) on Jan 25th.  Gartner is clearly the 'loudest voice' in this discussion.  I am seeing if I can get ahold of that Gartner report.
  • RE: existing concepts.  Prior to Gartner's published report, the phrase was generally used to mean 'analytics which can be used for taking action'.  But, as this blog post (http://www.clickz.com/clickz/column/2166558/actionable-analytics) points out, the phrase is linguistically 'flawed' ('actionable' meaning 'providing grounds for a lawsuit').  Aside from that nit, the web analytics community the author refers to uses the phrase to mean 'something you can take action on', as in how to turn that information into dollars.

For me, using that phrase (as 'bad' as it is) really refers more to a 'best practice' or mindset than to some concrete 'thing'.  Its use today (by the pundits, e.g. Gartner) really tries to differentiate how future analytics should be different from the 'same old BI (analytics) from yesteryear'.

Gartner’s points are a bit more than that – not just something that enables the business to take action, but something which is approachable/digestible by ‘the masses’.  Their phrase ‘invisible’ analytics aims to point out that decision makers are rarely the back-room number crunchers building models – even in the model-heavy financial industries.  The key is to make the analytics/models easily accessible and understandable for the decision maker.

I applaud that idea.  Yet care needs to be taken.  It is easy to hide all of the complexity (and, more importantly, the assumptions) from the end users, and we can end up with what I affectionately term 'babies wielding chainsaws'.  A great example is the financial meltdown on Wall Street – the hidden risk was in the assumptions and details obscured in the models.  I don't think the general populace has the ability to understand, or even know what to question about, analytics/models.  Throw in nuances like 'correlation versus causation', and forget it – all will be lost.

Gartner argues that there needs to be increased agility around analytics, but I think that ends up being held back by the decision makers' knowledge, understanding, and maturity regarding analytics and models.  In order to shorten the analytic cycle and make better decisions, I think the education of the decision makers is one of the most important aspects.


Filed under Analytics, Data Management

“Big Data” to solve everything?

In his blog post (http://www.johndcook.com/blog/2010/12/15/big-data-is-not-enough/), John D. Cook quotes Bradley Efron from an article in Significance.  It is somewhat counter-cultural (or at least thought-provoking) relative to the mainstream 'Big Data' mantra – given enough data, you can figure it out.  Here is the quote, with John D. Cook's emphasis added:

“In some ways I think that scientists have misled themselves into thinking that if you collect enormous amounts of data you are bound to get the right answer. You are not bound to get the right answer unless you are enormously smart. You can narrow down your questions; but enormous data sets often consist of enormous numbers of small sets of data, none of which by themselves are enough to solve the thing you are interested in, and they fit together in some complicated way.”

What struck a chord with me (a data guy) was the statement ‘and they fit together in some complicated way’.  Every time we examine a data set, there are all kinds of hidden nuances that are embedded in the content, or (more often) in the metadata.  Things like:

  • 'Is this everything, or just a sample?'  –  If it is a sample, then how was the sample created?  Does it represent a random sample, or a time-series sample?
  • 'Are there cases missing from this data set?'  –  Oh, the website only logs successful transactions; if it wasn't successful, it was discarded.
  • 'Are there any procedural biases?'  –  When the customer didn't give us their loyalty card, all of the clerks just swiped their own to give them the discount.
  • 'Is there some data that was not provided due to privacy issues?'  –  Oh, that extract has their birthday blanked out.
  • 'How do you know that the data you received is what was sent to you?'  –  We figured out the issue – when Jimmy saved the file, he opened it up and browsed through the data before loading.  It turns out his cat walked on the keyboard and changed some of the data.
  • 'How do you know that you are interpreting the content properly?'  –  Hmm, this column has a bunch of 'M's and 'F's.  That must mean Male and Female.  (Or have you just changed the gender of all the data because you mistakenly translated 'M-Mother and F-Father'?)

All of this gets even more complicated once you start integrating data sets, and this is what Bradley Efron was getting at.  All of these nuances are exacerbated when you start trying to marry data sets from different places.  How do you reconcile two different sets of product codes which have their own procedural biases, but essentially report on the same things?
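To make those questions concrete, here is a minimal profiling sketch in SQL.  The table and column names (CUSTOMER_EXTRACT, GENDER_CODE, BIRTH_DATE, LOYALTY_CARD_ID, LOAD_DATE) are hypothetical stand-ins for whatever extract you actually receive; the point is simply that a few cheap queries asked up front can surface these nuances before any real analysis begins.

-- Hypothetical staging table and columns, for illustration only.
-- How complete is the extract? (everything vs. a sample; blanked-out fields)
SELECT COUNT(*)                                            AS total_rows,
       SUM(CASE WHEN birth_date IS NULL THEN 1 ELSE 0 END) AS missing_birthdays,
       MIN(load_date)                                      AS earliest_record,
       MAX(load_date)                                      AS latest_record
  FROM customer_extract;

-- What codes are actually in the data? ('M'/'F' could be Male/Female or Mother/Father)
SELECT gender_code, COUNT(*) AS cnt
  FROM customer_extract
 GROUP BY gender_code
 ORDER BY cnt DESC;

-- Any procedural bias? (a handful of loyalty cards absorbing most transactions
-- would hint at the 'clerks swiping their own card' problem)
SELECT loyalty_card_id, COUNT(*) AS txn_count
  FROM customer_extract
 GROUP BY loyalty_card_id
 ORDER BY txn_count DESC;

None of this proves the data is good, but it tells you which questions to go ask the source.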

Full article here:

http://www-stat.stanford.edu/~ckirby/brad/other/2010Significance.pdf


Filed under Big Data, Data, Data Management, Data Quality

Big Data use cases

A pretty good summary of use cases for 'big data'.  This always ends up being the first set of questions when people are exposed to the idea of 'big data': "What the heck do _we_ do which is considered Big Data?"  A lot of times this is because organizations don't currently deal with these use cases BUT SHOULD in order to remain competitive.  Things are a-changing.

http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/


Filed under Big Data, Data, Data Management, Systems Engineering

A SQL*Loader example – preserving parent-child relationships

 

A colleague sent me a problem he was having with loading data into Oracle while preserving parent-child relationships.  Here was my response:

From what I understand, the incoming data has no sequencing or relational values embedded in it?  E.g. a field which could designate the parent-child relationship?

Embedded relational fields

If there are relational fields, I would just load the data into temp tables, and then post-process and assign the sequence numbers (which _you_ want) based on the previous data set's relational values.

No embedded relational fields

If there are _not_ relational fields, then I am assuming that the 'proximity' or 'sequencing' in the file actually designates the parent-child relationship.  In that case, your data might look something like:

01MFGDEPT
22JOE   SMITH
22SAM   SPADE
01SHPDEPT
22JILL  BROWN
22MARTY SPRAT
22CRAIG JONES
01FNCDEPT
22RICK  PRICE

Where 01 designates a dept record and 22 designates an employee who belongs in that department.

Because Joe Smith immediately follows the MFGDEPT record, the parent-child relationship is assumed.  Is that correct?

(You specified type 2 and type 3, so I am assuming you have one further level of embedded hierarchy.)

My assumption is that you never know how many ‘child’ records you will see?  In this case, it could be 0-N DEPT_EMP records for one DEPT record.

If you always knew that there were a fixed number of child records, you COULD do something miserable such as concatenating multiple physical records into one logical record.  This would keep 'continuity' from the parent to the child by slapping all of those into one long record.  You would then post-process that combined table, splitting each record out into its destination tables using PL/SQL (and using PL/SQL's ability to retain the last sequence number, to ensure the parent's unique sequence number is applied to the child records).

ASSEMBLING LOGICAL RECORDS FROM PHYSICAL RECORDS

http://download.oracle.com/docs/cd/E11882_01/server.112/e16536/ldr_control_file.htm#i1005509

If you can't guarantee that, you could use SQL*Loader's SEQUENCE facility to give each record a unique number that reflects the order in which the records were loaded.  You could then post-process this by running through the temp load tables and using that number to find which child records came just after the parent record.

http://download.oracle.com/docs/cd/E11882_01/server.112/e16536/ldr_field_list.htm#i1008320
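For illustration, a minimal control-file sketch might look like the following.  The staging table names, column names, and field positions are my own assumptions (adjust them to the real record layout).  It routes each record type to its own staging table based on the leading type code and stamps every row with RECNUM – the logical record number in the file – which serves the same 'load order' purpose as the SEQUENCE facility linked above and, like it, leaves gaps for rejected records.

LOAD DATA
INFILE 'dept_emp.dat'
APPEND
-- Route department records (type 01) to the DEPT staging table
INTO TABLE dept_stage
  WHEN (1:2) = '01'
  (load_seq   RECNUM,
   dept_name  POSITION(3:9)  CHAR)
-- Route employee records (type 22) to the DEPT_EMP staging table
INTO TABLE dept_emp_stage
  WHEN (1:2) = '22'
  (load_seq    RECNUM,
   first_name  POSITION(3:8)  CHAR,
   last_name   POSITION(9:15) CHAR)

(The numbers in the example below start at 100 rather than 1, but the idea is the same: a single, file-order numbering shared across both tables.)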

For example, the data above would be loaded as:

DEPT
100 MFGDEPT
103 SHPDEPT
107 FNCDEPT

DEPT_EMP
101 JOE SMITH
102 SAM SPADE
104 JILL BROWN
105 MARTY SPRAT
106 CRAIG JONES
108 RICK PRICE

The one point to note here is that if a row in the INFILE is rejected, that sequence number is skipped.  E.g. if SHPDEPT's DEPT record was rejected, you would not see 103 anywhere in either table.  This may throw off your logic if you are not careful.  E.g. you could think 'I will just process the DEPT table, find each record, and find all child records between it and the next record in DEPT.'  If a dept record got rejected, then you would erroneously allocate the DEPT_EMP records (either to the wrong dept, or not at all).

Basically, that code would look like a giant loop that runs through all the sequence numbers and thereby creates the parent-child relationships.  You would retain the 'last parent I encountered' so you could assign child records to that parent record.  Again, you would need to look through the SQL*Loader logs to make sure no records were rejected; otherwise that loop would incorrectly assign parent-child relationships without more intricate 'am I missing a sequence number' logic.
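As a sketch of what that loop might look like in PL/SQL (table and column names carried over from the hypothetical staging sketch above, with DEPT and DEPT_EMP as the target tables):

-- Walk the staging rows in load order, remembering the most recent parent (dept)
-- and attaching each child (employee) row to it.
DECLARE
  v_last_dept_seq  dept_stage.load_seq%TYPE;
BEGIN
  FOR r IN (
    SELECT load_seq, 'D' AS rec_type, dept_name,
           NULL AS first_name, NULL AS last_name
      FROM dept_stage
    UNION ALL
    SELECT load_seq, 'E', NULL, first_name, last_name
      FROM dept_emp_stage
     ORDER BY load_seq
  ) LOOP
    IF r.rec_type = 'D' THEN
      v_last_dept_seq := r.load_seq;            -- remember the current parent
      INSERT INTO dept (dept_id, dept_name)
      VALUES (r.load_seq, r.dept_name);
    ELSIF v_last_dept_seq IS NOT NULL THEN      -- skip orphans seen before any parent
      INSERT INTO dept_emp (emp_id, dept_id, first_name, last_name)
      VALUES (r.load_seq, v_last_dept_seq, r.first_name, r.last_name);
    END IF;
  END LOOP;
  COMMIT;
END;
/

Note that this sketch carries the same caveat as above: if a DEPT record was rejected during the load, its employees would silently attach to the previous department, so checking the SQL*Loader log (or looking for gaps in load_seq) is still necessary.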


Filed under Data Management, ORACLE

Data Quality – Garbage In, Garbage Out

From Kaiser Fung’s blog “Numbers Rule Your World”:

The following article discusses the behind-the-scenes process of preparing data for analysis.  It points to the "garbage in, garbage out" problem.  One should always be aware of the potential hazards.

“The murky world of student-loan statistics”, Felix Salmon (link)

At the end of this post, Felix found it remarkable that the government would not have better access to the data.

The Reuters blog post by Felix describes the typical problem with data and the challenges facing analysts who consume the data.  The problem is difficult enough when 'you own all the data' (i.e. you can examine how the data is created, aggregated, managed, etc. because you are the source).  However, most analysis needs more than one pocket of data and relies on external sources of data to supplement what you might already have.  The more removed an analyst is from the source, the less insight and understanding they have into its data quality.

One of the more disturbing aspects of Felix's post is the fact that, despite knowing there are significant errors in the previously published data, the NY Fed is only going to modify its interpretation of current and future data.  Thus, the longitudinal view (the view across time) will have this strange (and likely soon forgotten) jump in the amount of student loan debt.  Good luck trying to do a longitudinal study using that data series.

 

A colleague responded to this by citing a CNN interview with a former GM executive discussing why GM declined.  GM's management culture (dominated by MBAs who are numbers people) made decisions based on what the data told them.  When the former GM executive would bring in perspectives from past experience, gut feeling, and subjective judgment, he was advised that he came across as immature.  My colleague commented:

“This provides a footnote to why over-reliance on data is dangerous in itself”

To rephrase/expand his point, I would say:

“Data-centric decision making is the most scientific basis for substantive decision making.  HOWEVER, if you don’t understand the underlying data and its inherent flaws (known and/or unknown), you are living in a dream world.”

I think this is what he meant by 'over-reliance' — total trust in the data in front of you to the exclusion of everything else.

 In my view, you are almost always faced with these two conditions:

  1.  Your data stinks, or at least has some rotten parts.
  2.  You don't have all the data which you really need/want.

 Once you acknowledge those conditions, you can start examining the ‘gut feel’ and the ‘subjective judgment’ in view of the data gaps.


Filed under Data, Data Management, Data Quality

The Data Scientist


My colleague at work today pointed me to the following article, which describes the intersection of technology, data, and the scientific method.

http://www.dbta.com/Articles/Columns/Applications-Insight/The-Rise-of-the-Data-Scientist-75428.aspx

The article mentions the book “The Fourth Paradigm”, which describes this new paradigm of data-driven discovery.  I will need to put it on my long list of books to read (which I am making very slow progress through... sigh).  There is a review of the book here.

It talks about using tools such as Hadoop, MapReduce, and SAS/SPSS/R to crunch through lots of scientific data and obtain meaningful information.

It pretty well sums up where I see myself heading and positioning myself, from a 'data-centric' viewpoint.


Filed under Data, Data Management