Saturday, July 20, 2013

A Consumer's Guide To Statistics: Correlation Is Not Causation

I started off this series of articles a long while ago with how to spot a biased sample and how to interpret results of a study with an obviously biased sample - basically, don't take it at face value and all that.

Thinking about it, that should not have been the first article in the series. This should have been, as it covers one of the very first and most important rules that will be covered in any basic statistics course - correlation does not imply causation.

Some Formality

Correlation is when something occurs often in the presence of something else.

Given any observation of a living human, for instance, the beating of the heart is highly correlated with regular breathing.

More formally, attribute or event A is correlated with attribute or event B if and only if the probability of observing A with B is higher than the probability of observing A without B.
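To make that comparison concrete, here is a minimal sketch in Python. The observations are made up purely for illustration, and the function name conditional_rates is my own invention, not something from a statistics library. It just counts how often A shows up with B and how often A shows up without B.

```python
def conditional_rates(observations):
    """Take a list of (a, b) booleans and return (P(A given B), P(A given not B))."""
    with_b = [a for a, b in observations if b]
    without_b = [a for a, b in observations if not b]
    # sum() counts the True values, so each ratio is an observed frequency.
    return sum(with_b) / len(with_b), sum(without_b) / len(without_b)


# Hypothetical data: 100 observations of the pair (A happened, B happened).
observations = (
    [(True, True)] * 40
    + [(False, True)] * 10
    + [(True, False)] * 15
    + [(False, False)] * 35
)

p_with, p_without = conditional_rates(observations)
correlated = p_with > p_without  # the informal test from the definition above
print(p_with, p_without, correlated)  # 0.8 0.3 True
```

Note that nothing in this calculation says which of A or B came first, or whether either one has anything to do with producing the other.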

Causation is when something occurs as a result of something else.

As an example, imagine letting go of something a few feet above the ground. In the vast majority of cases, that thing will fall and hit the ground. You letting go of it was a cause of that, although there are other causes - gravity, for instance.

Formally speaking, there is a possibility that attribute or event A is caused by attribute or event B if and only if the probability of observing A with B is higher than the probability of observing A without B.

Wait a second...

Those definitions seem a little similar, don't they?

(They're also pretty bad for general use, and are only used here to demonstrate a point. In no way would they count as truly formal and complete definitions of either idea.)

That's because they are, and that's why it's important to be able to understand the difference. You see, statistics and probability can only suggest causality; they can never prove it. You build a case for causality by demonstrating strong correlation and then providing some reasoning as to why one of the variables might cause the other, but suggestion and proof are two very different concepts.

(There are some more advanced techniques for determining if one thing is likely to be caused by another, but they neither prove causality nor are they within the scope of this conversation.

For those of you who are interested, though, the general method involves the use of non-linear regression and determining when the function defined by such a regression is not invertible. You should be able to find stuff out there on it pretty easily.)

An Example

Let's imagine, for example, a study that shows that children with behavior problems at school are more likely to play and enjoy 'violent' video games.

Setting aside the fact that this almost definitely implies the use of a survey, and that the majority of the models used on survey data for studies like this are woefully inadequate for the purpose, what's wrong with an attached headline such as "Study Demonstrates Violent Video Games Cause Behavior Problems In Children"?

It's the wording. In no way have they demonstrated that anything causes anything else, and a reasonable case could be made that troubled children just naturally enjoy violence in games more, and also that troubled children are more likely to come from households where such games might be available for them to play.
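Here is a small simulation of that alternative explanation, as a sketch only. Every number in it is invented, and the names simulate and problem_rate are hypothetical. In this toy world, being a troubled child raises both the chance of playing violent games and the chance of behavior problems, while the games themselves have no effect on behavior at all.

```python
import random


def simulate(n=100_000, seed=0):
    """Generate (troubled, plays, problems) rows; games never influence problems."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        troubled = rng.random() < 0.2                        # assumed base rate
        plays = rng.random() < (0.7 if troubled else 0.3)     # depends on troubled only
        problems = rng.random() < (0.6 if troubled else 0.1)  # depends on troubled only
        rows.append((troubled, plays, problems))
    return rows


def problem_rate(rows, plays_value, troubled_value=None):
    """Observed rate of behavior problems among the selected children."""
    selected = [
        problems
        for troubled, plays, problems in rows
        if plays == plays_value
        and (troubled_value is None or troubled == troubled_value)
    ]
    return sum(selected) / len(selected)


rows = simulate()

# Across all children, players show more behavior problems than non-players.
overall_gap = problem_rate(rows, True) - problem_rate(rows, False)

# Within each group of children, the gap between players and non-players shrinks.
troubled_gap = problem_rate(rows, True, True) - problem_rate(rows, False, True)
untroubled_gap = problem_rate(rows, True, False) - problem_rate(rows, False, False)

print(overall_gap, troubled_gap, untroubled_gap)
```

The overall gap comes out clearly positive even though the code never lets gaming touch behavior, and it mostly disappears once you compare troubled children only with troubled children. A survey that only records gaming and behavior would see the correlation and nothing else.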

Even if neither the headline nor the writing directly implies causality, it is important not to automatically infer it yourself - this is often what the people writing the article want you to do, and you are not doing yourself or the researchers who performed the study any favors by taking arguably the most important problem in statistics and assuming it always goes one way.

General Case

As a general rule, take no causality for granted. If a study says it demonstrated that one thing causes another thing, you should immediately be on guard. If a study says it demonstrated that certain things occur with or at the same time as other things, you should not assume causality; instead, dig a little deeper, assuming you are truly interested.


And that's that: one of the most important lessons one can learn involving statistics and probability. Go out there and apply it, folks. The world will thank you for it.

Or, at least, I will.

EDIT: Also, for those of you who found your way here by random happenstance, the first article in this series is over here.
