Very helpful quick-reference for articles from Nature Methods about creating visualizations.
I had never seen this site before, but I came across it while watching something about Information Visualization:
I guess I should probably know Nicholas Felton, from his high-profile “about me”:
He is the co-founder of Daytum.com, and currently a member of the product design team at Facebook. His work has been profiled in publications including the Wall Street Journal, Wired and Good Magazine and has been recognized as one of the 50 most influential designers in America by Fast Company.
Anyway, he has been producing ‘Personal Annual Reports’ which reflect each year’s activities. I haven’t dug too deeply, but they struck me as quite interesting and worth a deeper dive.
A colleague pointed me at this post, as we are working together on a project related to open data sources: http://readwrite.com/2008/04/09/where_to_find_open_data_on_the#awesm=~oARWkzqt2h9Anp
I had already heard of many of the data sources, but there were a few new ones I hadn’t. For our project, we have been looking at the OpenNYC data sets, which contain some really interesting data domains. Hopefully we can publish our analysis of the open data sets and the obstacles to integration in the near future.
(Yet another data source list here: http://blog.visual.ly/data-sources/)
A colleague sent me this link http://martinfowler.com/bliki/PolyglotPersistence.html the other day, and I appreciate the author’s point of view. In the discussion that ensued, we started talking about the topics a bit more in-depth. One item that came up is whether, in this polyglot world, every application would need to track every other’s use of every data item. My response was:
While you don’t need every application to know every other application’s use of every data item, you do need (I contend) a unified (mental) model of the business constraints/requirements of the data items which are shared (or related/referenced). This is a gap in the way the original author was describing this polyglot persistence (IMHO) – he seems to leave that as an ‘exercise for the reader’.
If you have two medical insurance applications, “Registration” and “Reimbursement”, then they need to agree on the higher-order (conceptual and/or logical) data model against which both applications would operate. For example, does Reimbursement make a check out to the patient, or to the person who heads the household? If Registration doesn’t have a concept of ‘head of household’, then how would Reimbursement be able to implement that?
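To make that concrete, here is a minimal sketch of what an agreed-upon logical model might look like. Everything in it (the `Person` and `Policy` types, the `payee_for` function, the field names) is hypothetical and only illustrates the point: Reimbursement can pay the head of household only if Registration captures one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    person_id: str
    name: str


@dataclass(frozen=True)
class Policy:
    """Shared logical model (hypothetical) that both applications agree on."""
    policy_id: str
    patient: Person
    # Stays None if Registration has no concept of 'head of household'.
    head_of_household: Optional[Person] = None


def payee_for(policy: Policy, pay_head_of_household: bool) -> Person:
    """Reimbursement's rule for whom the check is made out to."""
    if not pay_head_of_household:
        return policy.patient
    if policy.head_of_household is None:
        # The business rule cannot be implemented from the shared data.
        raise ValueError("Registration did not record a head of household")
    return policy.head_of_household
```

How each application stores these records can stay hidden; the shape of `Policy` is the part that has to be exposed and agreed upon.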
The abstraction capability of a web service _does_ enable an application to conceal the nitty-gritty of exactly how it stores its data (and so does SQL), but there has to be some exposed (and agreed-upon) data model.
I contend that this agreement at the conceptual/logical level is both the reason that the ‘Shared Database Integration’ has been so favored, and also the reason people rail against it and claim that the model is too monolithic and slow to change. I have rarely encountered an organization that is cognizant of (let alone able to effectively express) the data relationships in its business. This lack of understanding (I claim) is the root cause of the glacial speed at which data models typically evolve.
A colleague sent me a link to a Forbes post “5 Cool Ways Big Data Is Changing Lives”, and I have to object to one of the entries in the post: “When Big Data Goes Bad”
The example referred to is from when Forbes writer Kashmir Hill reported on how Target figured out a teen girl was pregnant before her father did.
Here is my issue with Raj Sabhlok’s inclusion of this as one of “5 Cool Ways Big Data Is Changing Lives”:
1) It clearly is not a ‘cool way big data is changing lives’.. PERHAPS I could concede that it is changing lives, but, as implemented, I would not agree it is ‘cool’. (I guess I might concede that the predictive power is cool to data analysis folks)
2) And labeling the Target scenario as a case where Big Data [Went] Bad misleads the reader. It has nothing to do with the data itself but has to do with the business, policies, and implementation details of how the results might be used. And those types of foibles have been around since the dawn of marketing; they are not some new phenomenon that Big Data has caused.
Don’t blame the data, or even the techniques for analyzing the data, for things you bring upon yourself based on improper or poor usage / policies.
A colleague sent me a request for information about ‘actionable analytics’. The request was from their government customer to find whitepapers, research, etc. on ‘actionable analytics’.
First, I asked “How did the question/request come about?” Sometimes things go askew from the question to the request to the response. (and even further askew when trying to gather inputs from others).
Second (and the reason I asked the first question), I think (my opinion, and yet to be refuted by my cursory research) that ‘actionable analytics’ is a combination of marketing hype (e.g. Gartner) and poor phrasing for an existing concept.
- RE: hype. Looking at Google trends, you see that this phrase is a recent phenomenon. Here: http://www.google.com/trends/explore#q=actionable%20analytics&date=today%2012-m&cmpt=q is the last 12 months, and you notice there is a peak in Jan 2013. Lo and behold, that coincided with Gartner’s publication (http://www.gartner.com/resId=2316120) on Jan 25th. Gartner is clearly the ‘loudest voice’ in this discussion. I am seeing if I can get ahold of that Gartner report.
- RE: Existing concepts. Prior to Gartner’s published report, the phrase had been used to generally mean ‘analytics which can be used for taking action’. But, as this blog post (http://www.clickz.com/clickz/column/2166558/actionable-analytics) points out, the phrase is linguistically ‘flawed’. (‘Actionable’ strictly means ‘giving grounds for a lawsuit’.) Aside from that nit, the web analytics community (the author refers to) uses that to mean ‘something you can take action on’.. as in how to turn that information into $$.
For me, using that phrase (as ‘bad’ as it is) really refers more to a ‘best practice’ or mindset versus some concrete ‘thing’. Its use today (by the pundits, e.g. Gartner) really tries to help differentiate how future analytics should be different from the ‘same old BI (analytics) from yesteryear’.
Gartner’s points are a bit more than that – not just something that enables the business to take action, but something which is approachable/digestible by ‘the masses’. Their phrase ‘invisible’ analytics aims to point out that decision makers are rarely the back-room number crunchers building models – even in the model-heavy financial industries. The key is to make the analytics/models easily accessible and understandable for the decision maker.
I applaud that idea. Yet.. care needs to be taken. It is easy to hide all of the complexity (and more importantly the assumptions) from the end users and we can end up with what I affectionately term ‘babies wielding chainsaws’. A great example is the financial meltdown on Wall Street – the hidden risk was in the assumptions and details obscured in the models. I don’t think the general populace has the ability to either understand or even know what to question about analytics/models. Throw in nuances like ‘correlation versus causation’, and forget it – all will be lost.
Gartner says that there needs to be increased agility around analytics, but I think that ends up being held back by the knowledge, understanding, and maturity of the decision makers who consume analytics and models. In order to shorten the analytic cycle and make better decisions, I think the education of the decision makers is one of the most important aspects.
An interesting post about the challenge of applying systems thinking to the Cloud.
I have a few nits about the author’s statement [my emphasis added]:
Systems thinking is difficult for those that have been educated to always apply reductionist thinking to problem solving. The idea in systems thinking is not to drill down to a root cause or a fundamental principle, but instead to continuously expand your knowledge about the system as a whole.
I would disagree that systems thinking doesn’t look for root causes. The point is that by expanding the problem, you have a higher chance of identifying the root causes. If you narrowly focus on the problem that is defined in front of you, you will rarely reach the true root causes – thus, true systems thinkers ask ‘what really is the system?’ The author does wind around and hits this point, but the original statement is a bit askew.
In John D. Cook’s blog post (http://www.johndcook.com/blog/2010/12/15/big-data-is-not-enough/) he quotes Bradley Efron in an article from Significance. It is somewhat counter-culture (or at least thought-provoking) to the mainstream ‘Big Data’ mantra – Given enough data, you can figure it out. Here is the quote, with John D. Cook’s emphasis added:
“In some ways I think that scientists have misled themselves into thinking that if you collect enormous amounts of data you are bound to get the right answer. You are not bound to get the right answer unless you are enormously smart. You can narrow down your questions; but enormous data sets often consist of enormous numbers of small sets of data, none of which by themselves are enough to solve the thing you are interested in, and they fit together in some complicated way.”
What struck a chord with me (a data guy) was the statement ‘and they fit together in some complicated way’. Every time we examine a data set, there are all kinds of hidden nuances that are embedded in the content, or (more often) in the metadata. Things like:
- ‘Is this everything, or just a sample?’ – If it is a sample, then how was the sample created? Does it represent a random sample, or a time-series sample?
- ‘Are any cases missing from this data set?’ – Oh, the website only logs successful transactions; if it wasn’t successful, it was discarded.
- ‘Are there any procedural biases?’ – When the customer didn’t give us their loyalty card, all of the clerks just swiped their own to give them the discount.
- ‘Is there some data that was not provided due to privacy issues?’ – Oh, that extract has their birthday blanked out.
- ‘How do you know that the data you received is what was sent to you?’ – We figured out the issue – when Jimmy saved the file, he opened it up and browsed through the data before loading. It turns out his cat walked on the keyboard and changed some of the data.
- ‘How do you know that you are interpreting the content properly?’ – Hmm.. this column has a bunch of ‘M and F’s.. That must mean Male and Female. (Or, have you just changed the gender of all the data because you mistakenly translated ‘M-Mother and F-Father’?)
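The last two questions lend themselves to simple automated checks. Here is a minimal sketch, assuming the sender publishes a SHA-256 digest alongside the file and that both sides share a code book for the column; the function names and the code book are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of the raw file contents."""
    return hashlib.sha256(data).hexdigest()


def matches_sender_digest(received: bytes, sender_digest: str) -> bool:
    """True only if the received bytes are exactly what the sender hashed."""
    return sha256_of(received) == sender_digest


def decode_column(values, code_book):
    """Translate coded values using an agreed code book.

    Raises instead of guessing when a code is not in the code book.
    """
    unknown = sorted({v for v in values if v not in code_book})
    if unknown:
        raise ValueError(f"codes not in code book: {unknown}")
    return [code_book[v] for v in values]
```

The digest catches the cat-on-the-keyboard case. The code book does not prove your interpretation is right, but it forces the ‘M means Mother’ assumption to be written down where someone can challenge it.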
All of this is even more complicated once you start integrating data sets, and this is what Bradley Efron was getting at. Each of these nuances is exacerbated when you try to marry data sets from different places. How do you reconcile two different sets of product codes which have their own procedural biases, but essentially report on the same things?
Full article here:
Often I like the material presented at www.informationisbeautiful.net, but a recent posting (http://www.informationisbeautiful.net/2012/rhetological-fallacies/) fell short. The posting is an ‘infographic’ (a stretch of the term) of Rhetological Fallacies. Nice. I have spent a bit of time looking at these types of things when doing Current Reality Trees / Future Reality Trees (part of Goldratt’s Theory of Constraints).
However, reading through the list, I started to see that this list was a bit incomplete or incorrect (at least in the explanations of the entries). I could even use entries in the list to refute other entries. For example:
“Gambler’s Fallacy: Assuming the history of outcomes will affect future outcomes”
Now, I know what the author was getting at, but the way this is stated is incorrect. The example given was:
“I’ve flipped this coin 10 times in a row and it’s been heads therefore the next coin flip is more likely to come up tails”
So in the example, the author is correct — those events are independent (where the probability of the subsequent flip is not dependent on the outcome of previous flips.. aka Bernoulli Trial). However, in the ‘definition’ of Gambler’s Fallacy, the author left out the critical word ‘independent’. If the events are not independent (e.g. the weather conditions observed at the start of an hour), then the future outcomes are different depending on the outcomes observed in the past. For example, we are more likely to observe rain at 2pm if we have observed rain at 1pm, with some measurable increase in probability.
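A small worked example shows the difference. It is only a sketch: the hourly weather is modeled as a two-state Markov chain, and the transition probabilities (0.7 and 0.1) are made-up numbers chosen for illustration, not measurements.

```python
def p_heads_given_history(history):
    """Fair coin: the history of flips is ignored because flips are independent."""
    return 0.5


def long_run_rain_probability(p_rain_after_rain, p_rain_after_dry):
    """Unconditional chance of rain in any given hour for a two-state chain.

    Solves pi = pi * p_rain_after_rain + (1 - pi) * p_rain_after_dry for pi.
    """
    return p_rain_after_dry / (1.0 - p_rain_after_rain + p_rain_after_dry)


# Assumed (illustrative) transition probabilities.
P_RAIN_AFTER_RAIN = 0.7  # P(rain at 2pm | rain at 1pm)
P_RAIN_AFTER_DRY = 0.1   # P(rain at 2pm | dry at 1pm)

coin_after_streak = p_heads_given_history(["H"] * 10)
rain_overall = long_run_rain_probability(P_RAIN_AFTER_RAIN, P_RAIN_AFTER_DRY)

print(f"P(heads | 10 heads in a row) = {coin_after_streak}")
print(f"P(rain in any given hour)    = {rain_overall:.2f}")
print(f"P(rain at 2pm | rain at 1pm) = {P_RAIN_AFTER_RAIN}")
```

Under these assumed numbers, rain in any given hour has a probability of about 0.25, but it rises to 0.7 once we have seen rain in the previous hour. For the coin, ten heads in a row change nothing. Letting history inform your expectations is a fallacy only in the independent case.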
Using the author’s own list, they fell prey to the ‘Composition Fallacy: Assuming that a characteristic or beliefs of some or all of a group applies to the entire group’. (That sentence needs some work.) Some events are independent, and for those the Gambler’s Fallacy is a real mistake; but not all events are independent, so a definition that dismisses every history of outcomes generalizes from some events to all of them.
There are a few others in this list which annoy me (such as ‘Appeal to Probability’), not because of the ‘idea’ behind it, but because of how it is vaguely expressed.
A pretty good summary of use cases for ‘big data’. This always ends up being the first set of questions when exposed to the idea of ‘big data’. “What the heck do _we_ do which is considered Big Data?” A lot of times this is because organizations don’t currently deal with these use cases BUT SHOULD to remain competitive. Things are a-changing.