Correlation and Causation

Yet another of the most abused mathematical concepts is the concept of *correlation*, along with the related (but different) concept of *causation*.

Correlation is actually a remarkably simple concept, which makes it all the more frustrating to see the nonsense constantly spewed in talking about it. Correlation is a *linear relationship* between two random variables.

Let's look at the pieces of that, to really nail down the meaning of the concept precisely.

What's a *random variable*?

A random variable is a rather dreadful misnomer - because it's not actually a variable. A random variable is a *function* which maps the outcome of an experiment to a numeric value. That can be as simple as mapping, say, "3 milliliters" (a measured outcome) to the number 3. It can also be much more complicated - for example, the methods used in assigning a numeric value to the quality of a chessboard position. But the thing to remember is that it's a mapping from the outcome of an experiment to a numeric value.

A random variable is typically described in terms of a probability distribution (to be more precise, a probability distribution *or* a probability density function). The details aren't appropriate for this post, but the basic idea is just that to understand a measured quantity from an experiment, you need to understand how that quantity varies. The things you can meaningfully conclude from measurements of a random variable with a normal distribution are quite different from things you can conclude from measurements of a random variable with a logarithmic distribution.

So given two independent random variables, x and y, what does it mean to say that X and Y are *correlated*? It means that given a change in x, you can with high probability predict an equivalent change in Y using a linear calculation. In informal use, we frequently drop the linear part: if when X increases, Y also increases; when X decreases, Y also decreases; and when X stays the same, then Y stays the same; then we say X and Y are correlated.

As usual, an example helps. Suppose I'm measuring the shoe sizes and heights of a group of men. In general (not always, but in general), measurements have shown that the size of your feet is proportional to your maximum adult height. So if I did a mapping between the size of feet and height in adult men, I would find a strong correlation between height and foot size.

On the other hand, if I were to measure the size of the vocabulary that people use in the course of a normal day, and compare it to the length of their ring fingers, I would find that there's no correlation to speak of. (Yes, I have actually seen a study that measured this. The hypothesis was that the length of ring fingers (at least, I think it was the ring finger) is related to the quantity of certain hormones that a fetus is exposed to in the womb, and that that hormone level also affects the development of the speech centers of the brain.)

If you're reading carefully, you'll notice that I went from a definition that said when two things are correlated, but then in the example, I said "strong" correlation. The reason is that in practice, things are rarely quite as clear as they are in theory. Even for things that really do correlate perfectly, if we're measuring them in a series of experiments, experimental error is going to create enough variation that in the measured data, the correlation won't be perfect. So we have a measurement, called the correlation coefficient, of a relation between two random variables X and Y (usually abbreviated *c_{x,y}* or *r*), which measures the *strength* of a correlation between X and Y.

c_{x,y} varies from -1 to +1. c_{x,y}=0 indicates absolutely no correlation at all between X and Y; c_{x,y}=+1 means that X and Y are perfectly correlated; c_{x,y}=-1 means that X and -Y are perfectly correlated.

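The way that we compute the correlation coefficient of a set of data is as follows. Given a set of data with two random variables X and Y, where there is a set of measurements {x_{1},...,x_{n}} of X with mean x̄ and standard deviation σ_{x}, and a series of measurements {y_{1},...,y_{n}} of Y with mean ȳ and standard deviation σ_{y}, the correlation coefficient is:

$$c_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,\sigma_x \sigma_y}$$

(Here σ_{x} and σ_{y} are the sample standard deviations, computed with n-1 in the denominator.)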

The closer the correlation coefficient is to either 1 or -1, the stronger the linear relationship between the two random variables. If it's close to +1, then it's called a *strong positive correlation*; if it's close to -1, it's called a *strong negative correlation*.
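To make the computation concrete, here is a minimal Python sketch of the usual sample formula: the sum of the products of the deviations from the means, divided by (n-1)σ_{x}σ_{y}. The function name `correlation` and the toy data are mine, purely for illustration, and the sketch assumes the sample standard deviation (n-1 in the denominator).

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient of paired measurements xs and ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator).
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    # Sum of products of deviations from the two means.
    products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return products / ((n - 1) * sd_x * sd_y)

# Toy data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(correlation(xs, [2 * x + 1 for x in xs]))   # exact increasing line: close to +1
print(correlation(xs, [-3 * x for x in xs]))      # exact decreasing line: close to -1
print(correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # y = x^2 on symmetric x: 0
```

The last line is worth a second look: y is completely determined by x there, but the relationship isn't linear, so the coefficient comes out as 0.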

One thing you'll constantly hear in discussions is "correlation does not imply causation". Causation isn't really a mathematical notion - and that's the root of that confusion. Correlation means that as one value changes, another variable changes in the same way. Causation means that when one value changes, it *causes* the other to change. There is a very big difference between causation and correlation. To give a rather commonly abused example: the majority of children with autism are diagnosed between the ages of 18 months and three years old. That's also the *same* period of time when children receive a large number of immunizations. So people see the *correlation* between receiving immunizations and the diagnosis of autism, and assume that that means that the immunizations *cause* autism. But in fact, there is no causal linkage. The causal factor in both cases is age: there is a particular age when a child's intellectual development reaches a stage when autism becomes obvious; and there is a particular age when certain vaccinations are traditionally given. It just happens that they're *roughly* the same age.

The catch - and it's a big one - is that correlation does *strongly suggest* a causal relationship. (There's a Yale professor of statistics who's famous for saying something close to "Correlation is not the same thing as causation - but it's a darn good hint!") It may not be the case that X causes Y or Y causes X - but if there's a strong correlation between them, you *should* suspect that there's a causal relationship. It may be that both X and Y are dependent on some third factor (as in the vaccine case). But too often, the mantra "correlation does not imply causation" is a hand-waving way of dismissing data that the speaker doesn't feel like dealing with.

On the other hand, you constantly see people waving around statistics that show correlations, and who insist that the correlation implies causation. (Again, look at that vaccine example!) To show causation, you need to show a *mechanism* for the cause, and demonstrate that mechanism experimentally. So when someone shows you a correlation, what you *should* do is look for a plausible causal mechanism, and see if there's any experimental data to support it. Without a demonstrable causal mechanism, you can't be sure that there's a causal relationship - it's just a correlation.

One more really interesting example that I read about when my daughter was little. There was a study published a few years ago showing a pretty strong correlation between leaving a night-light on in a child's room, and that child developing nearsightedness. This caused a big stir at the time; it got written up in newspapers, reported on by various television programs, etc. But other studies were unable to reproduce that correlation. Finally a group from Ohio State found that while they could not consistently reproduce a direct linkage between night-lights and nearsightedness, they *could* find a correlation between nearsighted parents and nearsighted children, and that there was *also* a correlation between nearsighted parents and the likelihood of having a night-light in their child's room. In other words, parents who are nearsighted are more likely to put a night-light in their children's rooms, and children are also likely to inherit nearsightedness from their parents. Correlation, but not causation: both the night-lights and the nearsightedness are caused by the parents' nearsightedness; there's no causal connection between the night-lights and the nearsightedness.

What people abusing the "correlation does not equal causation" fallacy frequently forget is that correlations can certainly be spurious. The mercury/vaccine/autism correlation is among the most obvious examples. Cases of autism started rising in the early 1990's and are still rising now. In the early 1990's, several more vaccines were added to the recommended childhood immunization schedule. The two appeared to be correlated. However, what also happened in the early 1990's is that the diagnostic criteria for autism were widened and campaigns to increase autism awareness began in earnest, and that's what really led to the great increase in the number of autism diagnoses. Of course, later epidemiological studies, in which the number of autism diagnoses continued to increase years after thimerosal was removed from childhood vaccines, pretty much put the final nail in the coffin of the mercury/thimerosal/vaccines/autism hypothesis.

My favorite example of the difference between correlation and causation is the "Ice cream causes drowning" scenario.

It can be observed that sales of ice cream from traveling ice cream trucks have a strong positive correlation with children drowning in swimming pools. Does this mean that ice cream causes drowning? Or that drowning causes more ice cream to be sold? No! There is a hidden variable (the season of the year) that affects both - during the summer children are out of school, and they spend a lot more time both in pools and buying ice cream.
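The hidden-variable story is easy to play with in a few lines of Python. The sketch below is a toy simulation with made-up numbers (the "ice cream" and "drowning" figures are arbitrary indices, not real data): both quantities depend only on whether it's summer, plus independent noise, and neither has any effect on the other.

```python
import math
import random

def correlation(xs, ys):
    """Correlation coefficient of paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical setup: roughly a quarter of the observed days fall in summer.
summer = [rng.random() < 0.25 for _ in range(4000)]
# Each index depends on the season plus its own noise, never on the other index.
ice_cream = [(50 if s else 10) + rng.gauss(0, 5) for s in summer]
drownings = [(5 if s else 1) + rng.gauss(0, 1) for s in summer]

overall = correlation(ice_cream, drownings)
summer_only = correlation(
    [i for i, s in zip(ice_cream, summer) if s],
    [d for d, s in zip(drownings, summer) if s],
)
print(f"all days:    {overall:.2f}")      # strongly positive
print(f"summer only: {summer_only:.2f}")  # near zero once the season is held fixed
```

Across all days the two indices look strongly correlated; once you hold the season fixed by looking at summer days alone, the correlation essentially vanishes. That's the signature of a common cause rather than a direct link.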

At least my example won't bring out the anti-vaxxers. ^_^

Some more examples:

1. There is a strong correlation between the number of firemen at a fire and the amount of damage done

2. There is a strong negative correlation between amount of tutoring gotten and grades received (more tutoring, worse grades)

and a tricky one

3. There is a moderate correlation between IQ and astrological sign in young children; this attenuates with age and eventually disappears.

I recently read a very interesting book about the philosophy of science (or however you say that). It stated as a hypothesis that there is no correlation without a causal linkage of some sort. Of course, not necessarily direct, but a common cause, or an even more indirect link. We can imagine it like a directed graph of causal relations. If there is correlation, the two variables must be connected on that graph.

There was one more thing I could not fully understand. The author gave an example (from Elliott Sober) about the price of bread and the sea level in Venice, both rising constantly with time over a century of measurement, and it is a correlation with no causal connection for sure. The author somehow countered this argument, showing that in fact there is no correlation between the two.

In my opinion, perhaps the most commonly misunderstood thing about correlation is the fact that the *strength* of a correlation is different from the *significance* of the correlation. Many people do not realize that it is possible for a correlation to be weak, but highly significant. (I.e. we know the connection is real, but it's just not a very strong relationship.) Less often (and usually due to abuse of data) one sees the phenomenon of a strong correlation which is not statistically significant. (I.e. what appears to be a strong correlation is in fact a statistical artifact.)
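A rough numerical sketch of the difference, in Python. The helper name `correlation_p_value` and the sample sizes are hypothetical, picked only for illustration. It uses the standard t statistic for a correlation coefficient, t = r·sqrt((n-2)/(1-r²)), and, as a simplifying assumption, approximates the t distribution by a normal one, which is only reasonable when n is fairly large.

```python
import math

def correlation_p_value(r, n):
    """Approximate two-sided p-value for the null hypothesis of zero correlation.

    r is the sample correlation coefficient and n the number of data pairs.
    The t distribution is approximated by a standard normal, so treat the
    result as a rough figure, and only for large n.
    """
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))

# The same weak correlation, measured on a small and on a large sample.
p_small_sample = correlation_p_value(0.07, 50)
p_large_sample = correlation_p_value(0.07, 2000)
print(f"r = 0.07, n = 50:   p ~ {p_small_sample:.2f}")
print(f"r = 0.07, n = 2000: p ~ {p_large_sample:.4f}")
```

The strength (r = 0.07) is identical in both cases; only the amount of data changes. With 50 pairs the result is indistinguishable from noise, while with 2000 pairs the same weak relationship is clearly significant.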

My favorite two correlation stories (actually regression...)

In a statistical textbook by Kendall and Yule (1950) they report an r-squared between "mental defectives" and BBC Radio Licenses of something like 0.9999.

In here:

http://www.talkorigins.org/faqs/c-decay.html

A creationist claims to have a regression of 1.0000 ("The computer at Flinders decided that...") but gives the number of points above and below the line.

That's an interesting one! Was there an explanation given, wherever you found it?

My guess would be an effect from the point in the year when you're born, to how early in your life you start attending school, with the result that a) children of the same age would have different amounts of schooling, and b) children at the same point in school would be at different stages of mental development. Presumably the small differences damp down over time.

Morgan: Your guess is exactly right!

School systems have birthday cutoffs, so, if you're the oldest kid in the class, you might have had a full year more of school than the youngest. At early ages, that's worth a few IQ points. At later ages, not.

"the number of autism diagnoses continued to increase years after thimerosal was removed from childhood vaccines"

I would be interested in the source of this information. To my knowledge there is no data available to make this statement (or the opposite one).

Do you mean the rate of autism diagnoses (per unit of time) continued to increase, or that people kept getting diagnosed with autism after thimerosal was removed?

"But in fact, there is no causal linkage"

It is good to distinguish facts from opinions. There are ways to prove a causal linkage, but (perhaps except for strictly controlled lab experiments) there is no way to disprove it. One can only say that there is no evidence for causal linkage.

The above does not express any opinion on whether the rise in the rate of autism diagnoses in the 90s was caused by the presence of thimerosal in children's vaccines.

"So given two independent random variables, x and y, what does it mean to say that X and Y are correlated?"

So, assuming that the x and y are the same RVs as X and Y, isn't this a silly thing to say? If rho_{X,Y} = 0 is consistent with the English saying "X and Y are correlated", then what DOES it mean to say that?

One of my favorite uses of correlation comes from the Church of the Flying Spaghetti Monster:

http://www.venganza.org/about/open-letter/

where it is proved that "... global warming, earthquakes, hurricanes, and other natural disasters are a direct effect of the shrinking numbers of Pirates since the 1800s." 🙂

"2. There is a strong negative correlation between amount of tutoring gotten and grades received (more tutoring, worse grades)"

Another common problem, confusing cause and effect. Students with bad grades are more likely to get tutoring, so it should be worse grades, more tutoring.

The case where we have a significant correlation, but no reasonable theory of causation, seems to me to be the most interesting. We don't know if it is spurious, or if we just haven't stumbled onto the probable cause. So in this case, it is best to not just dismiss the correlation, but to remain aware that some undiscovered cause might be hiding there.

re bigTom's post

Of course, another possibility is that you have a type 1 error.....

Hmm. Under what circumstances? Because I can think of two methods for disproving causality.

Simple example: two clock systems seem correlated.

In one method we discover that two different models describe the clocks. One is a biological clock, controlled by light say, and one is electronic, controlled by a dumb program (from ticks to biological period). Each model can be verified, so we prove they are different.

In the other method we discover that different parameters describe the clocks. We change the day-light period, and the biological clock period changes but not the electronic. A simple hypothesis test proves that they are different.

Change "different" to "not causally linked".

Huh? How do x and y relate to X and Y? If they're the same, or if X is drawn from distribution P(x), and Y from P(y), then it means that someone has got something horribly wrong. If x and y are independent, then Corr(x, y)=0.

Bob

Pedantic P.S. in statistics, we usually describe the r.v. in upper case letters, and values drawn from the r.v. in lower case. Of course it doesn't really matter, but it was doing my head in trying to get it right. More coffee. More coffee...

A really good one is epidemiological studies where something is compared to a dozen variables. The devil is where you set your threshold to determine when a correlation is significant. Set it too high and you might miss something, too low and you have problems with false positives.

But worse is that the probability of a false positive is cumulative over your set of variables, such that a threshold that would give you a false positive rate of 5% for one variable gives you, with a dozen variables, a false positive rate of 46% for the entire study. Meaning nearly half the time your result is bogus.
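The 46% figure is quick to check. A minimal sketch in Python, assuming the dozen tests are independent of one another and that there is no real effect anywhere (the function name is mine):

```python
def familywise_false_positive_rate(alpha, comparisons):
    """Chance of at least one false positive across independent tests.

    Each test wrongly rejects a true null hypothesis with probability alpha,
    so the chance that none of them does is (1 - alpha) ** comparisons.
    """
    return 1 - (1 - alpha) ** comparisons

rate = familywise_false_positive_rate(0.05, 12)
print(f"{rate:.1%}")  # 46.0%
```

If the twelve variables are themselves correlated, the tests are not independent and the true figure will differ, so take this as a ballpark number.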

I'd really like to find out which values of the correlation coefficient imply a "high" correlation. This seems very arbitrary in most cases, especially with the amount of high publicity "junk" studies constantly being reported on with low sample sizes and poor sample selection.

From my experience with physics experiments, I'm sort of of the view that you need a coefficient of greater than 0.9 if you want to really show a true correlation, and certainly if you want to infer causation.

Gibbon: re false positive rates:

This is only true if you are selecting IVs at random. Scientists rarely do this. Usually, they know or strongly suspect that there IS a relationship, so the type 1 error rate is really 0.

This is a very common misunderstanding of type 1 error rates. A type 1 error is saying the null hypothesis is false when it is actually true. If the null is really false, you cannot make a type 1 error.

@ObsessiveMathsFreak

This is exactly the error I was pointing out in my post above. Just because the correlation coefficient is small does not mean that the correlation is insignificant. (Of course, it is true that small correlations are generally harder to prove, since they are closer to the conventional null hypothesis of no correlation.) With large data sets, it is entirely possible to have a correlation coefficient of, say, .07, but a p-value (testing against the null hypothesis of zero correlation) of, say, .005. Small effects *can* be real!

Not only can small effects be real, but non-significant effects can be real, as well.

Jacob Cohen once wrote that he was tempted to call Null Hypothesis Significance Testing by another name...

Statistical Hypothesis Inference Testing (note the acronym)

Torbjörn,

>Each model can be verified, so we prove they are [not causally linked]

>We change the day-light period, and the biological clock period changes but not the electronic

This is what I refer to as "strictly controlled lab experiments". I would say the second method is much more convincing than the first, which depends on the assumption that the model is correct. The correctness of a model (or scientific theory) cannot be "proven"; it can only be falsified or be the best we currently have.

Well, as the far-too-often saying says, no model is true. It comes down to how well it predicts (under the circumstances), and how well it fits in with our other knowledge.

AFAIK, in non-experimental situations, one proves/disproves causality by watching for changes ('natural experiments'), or inducing them when possible, or simply by repeatedly modeling the data for differing groups over different times.

Three things that people seem to forget - first, that many things might be causally linked, but at such a weak level that the available data and models give no statistical significance. This is not a block; it merely means that one should be aware of the (statistical) power involved.

Second, many situations are not open to full experimentation, barring miracles or multi-billion dollar budgets. Therefore observational methods are used, and 'natural experiments' are sought. One reviewer of the book 'Freakonomics' said that the authors were outstanding not in terms of their theoretical economics, but in their ability to think of situations where they could get data to test hypotheses.

Third, what is a 'strong' correlation depends on the circumstances. I've seen a professor say that no (non-spurious) correlation in the social sciences is above 0.4; in laboratory experiments for non-biological sciences higher correlations can be expected and demonstrated. In measurement systems analyses, a correlation of 0.99 between two measurement systems might be too weak for practical use.

I like the analogy to sample times and SNR for signals in these cases. In that case the SHIT metaphor recognizes a type of Garbage (GIGO) due to experimental limitations.

OK. You are discussing systems where we can't easily change or observe changed parameters.

Agreed. Of course, we can choose arbitrary limits and agree that a theory that predicts data with more confidence is verified beyond reasonable doubt. Whether we call that proven or not is a matter of taste.

Hmm. Theories (analogous to the first case) usually cohere with more other theory and predict more data than hypotheses (analogous to the second case). A specific test in the second case (contradiction from data) leads to a stronger result than in the first case (denying the consequent from data), but the first case can amass more such verifications.

So I find the former more convincing (more knowledge, more verifications) when the predictions are strongly verified (physics). If the field unavoidably makes tests with less significance (biology and medicine) I would probably concur that the second method is better at some point. It is also direct and practical of course, no messing about with more or less complete models.

Oh boy, this is a bugbear of mine.

I've also found the opposite fallacy. In experimental cognitive psychology, we allow ourselves to attribute causality if we have manipulated a randomly allocated variable. So if we do a reaction time experiment, and we ask people to respond as fast as possible to a stimulus, and we randomly present that stimulus under two conditions (with or without a priming stimulus, for instance) and if mean reaction time under one condition is different from mean reaction time under the other, we can infer causality - that the manipulated variable (presence or absence of prime) caused the difference in RT.

But because usually we only allocate a discrete number of conditions, we usually end up using ANOVAs rather than correlations to compute the significance of the difference, even though the math is actually the same. And what happens is that students end up thinking that you can infer a causality from an ANOVA but not from a correlation (because, they intone, "correlation doesn't imply causality"). Then they make two mistakes: they think that if they did an ANOVA they can infer causality, even when they didn't manipulate a randomly allocated variable, and if they did a correlation they think they *can't* infer causality, *because* they did a correlation.

In a class RT experiment on mental rotations, the stimuli were the letters b and d rotated by a random number of degrees. Participants had to report whether the letter was a b or a d. The operational hypothesis was that RTs would be longer on trials where the stimulus had been rotated further. In about a third of the lab reports I marked, students correctly correlated RT with degree of rotation, found a significant relationship, but then stated that they could not infer causality because "you can't infer causality from a correlation".

You can't infer causality unless you manipulate a randomly allocated variable. Whether you can infer causality doesn't depend on the math, but on the methodology.

There is a hierarchy of flawed assumptions about probability and statistics.

Earlier this week, partly to celebrate my son's 18th birthday, I took him and his friends to a stand-up comedy competition at the famous Ice House in Pasadena.

2nd place ($50 award plus invitation to compete in a "Superbowl of Comedy" against other winners of these earlier rounds) went to a friend of mine (google "joeyfriedman") who is an MIT graduate software engineer, and is delighted that mainstream audiences now "get" geek and nerd jokes.

Anyway, one of his competitors had a routine about pretending not to understand the basics of probability, by attacking weathermen. To paraphrase one line: "He's not telling me anything at all when he says 'fifty percent chance of rain.' I mean, either it rains or it doesn't rain, right? Why bother with that report at all, why not just cut directly to Sports?"

It's funny, in part, because the audience KNOWS that he's wrong, but most could not tell you exactly WHY that's wrong.

When there is a correlation, there are three formal possibilities for causality:

A causes B, B causes A, or some other factor(s) causes both A and B.

In Elizabeth's example, she can pick one of the possibilities because she knows the causality of the other. She knows that a slow reaction time isn't causing the character to be rotated and she knows that some external influence isn't causing both character rotation and slow reaction time, because she is in control of character rotation. Her experiment is set up so that if a correlation exists, causality necessarily follows. She still needs to make the case for the causality being direct, however, as there could be any number of interacting steps between the cause and the effect.

It's not the "manipulation of a randomly allocated variable" that's the important part of the setup, it's the known causality of one of the two variables. Correlation doesn't equal causation only in cases where the variables are actually independent. In the lab, you can set things up so that no explanation other than causation is possible, so the old saying doesn't apply. That's the point of finding correlations, isn't it? Once you find a correlation "in the wild", you try to bring it into the lab where you can control one of the variables, and thereby get an independent and a dependent variable. Once you have that, you can start to address the actual mechanism.

1. Restricting correlation to linear regressions is too restrictive. There may be other functions which may provide a better fit between the independent variable (x in Mr. Chu-Carroll's example) and the dependent variable (y in Mr. Chu-Carroll's example).

2. It should be noted that statistical tests apply to the square of the correlation coefficient. As an example, Stephen Jay Gould pointed out that the authors of the Bell Curve book quoted correlation coefficients of 40%, which implies a squared correlation coefficient of 16%. Most statisticians won't get too excited about a squared correlation coefficient of 16%, even if significant.

3. A distinction should also be drawn between statistical significance and practical significance. A small effect which is statistically significant may have no practical significance.

re anonymous' post: There is a 4th possibility: Type 1 error

re SLC's post, sorry, but I gotta disagree with 2 of your assertions.

1. The correlation coefficient IS about linear relationships, that's the definition. Obviously, there can be more complex relationships, but they won't be well-captured by the correlation coef.

2. The tests apply equally well to either R square or R, you just need to adjust the values. It's like F test with 1 numerator df vs. t test. Same test. And a square correlation of .16 is a VERY big deal in psychology or most social sciences.

OTOH, your 3rd point is exactly right, and one of my pet peeves about how people misuse statistics.
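For what it's worth, the "same test" point about R and R square can be checked numerically. A small Python sketch, using the r = 0.4 from the discussion above and a sample size of 100 that I made up for illustration: the t statistic for a correlation is t = r·sqrt((n-2)/(1-r²)), the F statistic for a simple regression with one predictor is F = r²(n-2)/(1-r²), and F is exactly t squared.

```python
import math

def t_statistic(r, n):
    """t statistic (n - 2 degrees of freedom) for testing zero correlation."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def f_statistic(r, n):
    """F statistic (1 and n - 2 degrees of freedom) computed from r squared."""
    r_squared = r ** 2
    return r_squared * (n - 2) / (1 - r_squared)

r, n = 0.4, 100  # n is a hypothetical sample size
t = t_statistic(r, n)
f = f_statistic(r, n)
print(f"t = {t:.3f}, t^2 = {t * t:.3f}, F = {f:.3f}")
```

Since an F test with 1 numerator degree of freedom is equivalent to the two-sided t test on the same data, testing r or testing r² leads to the same conclusion.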

Well, isn't that saying the same thing? We know the causality of one of the two variables because we manipulated it. But in addition, we have to ensure that the variable is orthogonal to the independent variable, which is why we also require it to be randomly allocated. Otherwise we risk collinearity with some unmodelled predictor.

But of course it only tells us the direction of causality - it doesn't tell us the mechanism, and it is easy to be fooled about the mechanism.

Sorry, the above comment was by me, not "Anonymous". Forgot to fill in the box.

Re Peter

1. Most professional statisticians I have interacted with don't like regression analysis to start with. A correlation coefficient squared of 16% would be laughed at by them.

2. A correlation coefficient can be computed for other than linear relationships. As to whether it has any significance, that's another story. As an example, I have used a non-linear relationship to determine a model for the lateral position vs the longitudinal position of a vehicle during the action of making a lane change and, for most data points, gotten correlation coefficients squared of > 90%.

My favorite is that coffee is correlated with better health. Turns out unhealthy people tend to cut caffeine consumption because it often aggravates symptoms. So while one might look at the correlation and think coffee improves health, it turns out that good health improves coffee drinking!

I don't envy cognitive scientists, because they've got a damn difficult job. Whereas most scientists work under laboratory conditions and can truly control the system they're studying, the work of cognitive scientists is more like economists in that the proper control setups for their systems are almost never available. It's so difficult to get your head around the caveats that apply to their findings that even when students are taught statistics thoroughly and have correlation does not equal causation drilled into their heads, they still make mistakes.

It's no wonder the media gets it so wrong so often.

About the example involving the price of the bread and the sea rising in Venice, one could say the hidden common variable is the population growth, causing global warming and competition for resources, thus leading to the sea level rise and the inflation in the price of the bread.

Ricardo Herrmann: You forgot to include the number of pirates.

My favourite example of correlation but not causation is the claim that there is a positive correlation between the number of babies being born and the size of the stork population.

On the other hand, if X has a normal distribution with mean 0, then the correlation between X and |X| is 0. It's like half the time there is perfect positive correlation, and half the time perfect negative correlation, and they cancel out.
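A quick simulation sketch of this in Python, assuming X is normal with mean 0 and standard deviation 1 (the seed and sample size are arbitrary choices of mine):

```python
import math
import random

def correlation(xs, ys):
    """Correlation coefficient of paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
xs = [rng.gauss(0, 1) for _ in range(100_000)]

whole = correlation(xs, [abs(x) for x in xs])
positives = [x for x in xs if x > 0]
negatives = [x for x in xs if x < 0]
positive_half = correlation(positives, [abs(x) for x in positives])
negative_half = correlation(negatives, [abs(x) for x in negatives])

print(f"all of X: {whole:+.3f}")          # near 0
print(f"X > 0:    {positive_half:+.3f}")  # +1, since |X| = X there
print(f"X < 0:    {negative_half:+.3f}")  # -1, since |X| = -X there
```

|X| is completely determined by X, yet the overall coefficient is near zero, which is another reminder that the correlation coefficient only measures *linear* relationships.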