This past weekend, my friend Orac sent me a link to an interesting piece
of bad math. One of Orac's big interests is vaccination and
anti-vaccinationists. The piece is a newsletter by a group calling itself the "Sound Choice
Pharmaceutical Institute" (SCPI), which purports to show a link
between vaccinations and autism. But instead of the usual anti-vac rubbish about
thimerosal, they claim that the cause is "residual human DNA contaminants from aborted human fetal cells".
Orac, among others, has already covered the nonsense
of that from a biological/medical
perspective. What he didn't cover, and the reason he forwarded this newsletter to me, is the math:
the basis of their argument is that they discovered key change points in the
autism rate that correlate perfectly with the introduction of various vaccines.
In fact, they claim to have discovered three different inflection points:
- 1979, the year that the MMR 2 vaccine was approved in the US;
- 1988, the year that a 2nd dose of the MMR 2 was added to the recommended vaccination schedule;
- 1995, the year that the chickenpox vaccine was approved in the US.
They claim to have discovered these inflection points using "iterative hockey stick analysis".
First of all, "hockey stick analysis" isn't exactly a standard
mathematical term. So we're on shaky ground right away. They describe
hockey-stick analysis as a kind of "computational line fitting analysis". But
they never identify what the actual method is, and there's no literature on
exactly what "iterative hockey stick analysis" is. So I'm working from a best
guess. Typically, when you try to fit a line to a set of data points,
you use a technique called linear regression. The most common linear regression method is
called the "least squares" method, and their graphs look roughly like least-squares
fitting, so I'm going to assume that that's what they use.
What least squares linear regression does is pretty simple - but it takes a
bit of explanation. Suppose you've got a set of data points where you've got
good reason to believe that you've got one independent variable, and one
dependent variable. Then you can plot those points on a standard graph, with
the independent variable on the x axis, and the dependent variable on the y
axis. That gives you a scattering of points. If there really is a linear
relationship between the dependent and independent variable, and your
measurements were all perfect, with no confounding factors, then the points would
fall on the line defined by that linear relationship.
But nothing in the real world is ever perfect. Our measurements always
have some amount of error, and there are always confounding factors. So
the points never fall perfectly along a line. So we want some way of
defining the best fit to a set of data. That is, understanding that there's
noise in the data, what's the line that comes closest to describing a linear relationship?
Least squares is one simple way of defining that. The idea is that for any
candidate line, and for each data point, you take the difference
between the value the line predicts and the actual measurement. You square that
difference, and then you add up all of those squared differences. The line
where that sum is smallest is the best fit. I'll avoid going into detail about
why you square it - if you're interested, say so in the comments, and maybe I'll write a basics
post about linear regression.
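To make that concrete, here's a minimal sketch of the sum-of-squares idea in Python. The function names (`fit_line`, `sum_squared_error`) and the five data points are made up for illustration. The slope and intercept come from the standard closed-form solution for one independent variable, which is one common way of finding the minimizing line, not the only one.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution for a single independent variable.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(xs, ys, slope, intercept):
    """Add up the squared differences between the line and the measurements."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up measurements that roughly follow y = 2x + 1, with some noise.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(xs, ys)
best = sum_squared_error(xs, ys, slope, intercept)
print("best line:", slope, intercept, "sum of squares:", best)

# Any other line gives a larger sum of squared differences.
for nudge in (-0.1, 0.1):
    print("nudged slope:", sum_squared_error(xs, ys, slope + nudge, intercept))
```

Nudging the slope or the intercept away from the fitted values always makes the sum bigger, which is all that "best fit" means here.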
One big catch here is that least-squares linear regression produces a good result
only if the data really has a linear relationship. If it doesn't, then least squares
will produce a lousy fit. There are lots of other curve fitting techniques, which work in
different ways. If you want to treat your data as perfect, you can use different techniques to
progressively fit the data better and better until you have a polynomial curve which
precisely includes every datum in your data set. You can start with fitting a line to two points; for
every two points, there's a line connecting them. Then for three points, you can fit them precisely
with a quadratic curve. For four points, you can fit them with a cubic curve. And so on.
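Here's a small sketch of that exact-fit idea. It uses Lagrange interpolation, which is one standard way of building the polynomial; I'm choosing it only because it's short, and nothing above depends on that choice. The name `lagrange_eval` and the four data points are made up, and the sketch assumes that no two points share an x value.

```python
from fractions import Fraction


def lagrange_eval(points, x):
    """Evaluate, at x, the polynomial of degree <= n-1 through the n points."""
    x = Fraction(x)
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        # Basis term: equals yi at xi, and zero at every other point's x.
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= (x - xj) / Fraction(xi - xj)
        total += term
    return total


# Four arbitrary points: a cubic passes through all of them exactly.
points = [(0, 3), (1, -1), (2, 4), (3, 1)]
for px, py in points:
    print(px, py, lagrange_eval(points, px))
```

The curve hits every point with zero error, and that tells you nothing at all about whether the data really follows a cubic.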
Similarly, unless your data is perfectly linear, you can always improve a fit by
partitioning the data. Just as we can fit a curve to two points from the set, then get closer
by fitting it to three, then closer still by fitting it to four, we can fit two lines to a 2 way partition
of the data, and get a closer match; then we can get closer with three lines in a three way partition,
and four lines in a four way partition, and so on, until you have a partition for every pair of adjacent
points.
The key takeaway is that no matter what your data looks like, if
it's not perfectly linear, then you can always improve the fit by
creating a partition.
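You can see this for yourself with a quick experiment. The sketch below generates made-up data that really is linear plus noise, with no change point anywhere, and then splits it at every possible position. The helper name `line_sse`, the seed, and the data are all illustrative assumptions.

```python
import random


def line_sse(xs, ys):
    """Sum of squared errors of the least-squares line through the points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return sum((y - (my + slope * (x - mx))) ** 2 for x, y in zip(xs, ys))


random.seed(1)
xs = list(range(20))
# Truly linear data plus noise: there is no real change point here.
ys = [2.0 * x + 1.0 + random.gauss(0, 1) for x in xs]

whole = line_sse(xs, ys)
# Fit one line to the first k points and another to the rest, for every k
# that leaves at least two points on each side.
split_sse = {
    k: line_sse(xs[:k], ys[:k]) + line_sse(xs[k:], ys[k:])
    for k in range(2, len(xs) - 1)
}

print("one line:", whole)
for k, sse in split_sse.items():
    print("split before index", k, "->", sse)
```

Every split does at least as well as the single line, even though the data was generated from one straight line. An improvement from partitioning is, by itself, evidence of nothing.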
For "hockey stick analysis", what they're doing is looking for a good
place to put a partition. That's a reasonable thing to try to do, but you need
to be really careful about it - because, as I described above, you can
always find a partition that improves the fit. You need to make sure that you're actually
finding a genuine change in the basic relationship between the dependent and
independent variable, and not just noticing a random correlation.
Identifying change points like that is extremely tricky. To do it,
you need to do a lot of work. In particular, you need to try a large number
of partitions of the data, in order to show that there is one specific
partition that produces a better result than any of the others. You can't
just select one point that looks good, and see if you get a
better match by splitting there; you need to show that the
inflection point that you chose is really the best inflection point.
And even that's only a start. You also really need to go Bayesian, and figure out an estimate of the chance
of the inflection being an illusion, and show that the quality of the partition
that you found is better than what you would expect by chance.
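Here's a rough sketch of what that kind of checking looks like. To be clear about the assumptions: this is not a proper Bayesian analysis. It substitutes a simpler simulation check, which fits one line to the data, generates fake datasets from that line plus Gaussian noise (so there's no change point by construction), and counts how often the best split of a fake dataset looks at least as good as the best split of the real one. The names `line_fit`, `best_split`, and `chance_of_gain`, and both datasets, are made up for illustration.

```python
import random


def line_fit(xs, ys):
    """Least-squares fit: return (slope, intercept, sum of squared errors)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def best_split(xs, ys, min_pts=3):
    """Try every split; return (index, fractional error reduction) of the best."""
    whole = line_fit(xs, ys)[2]
    if whole == 0:
        return None, 0.0  # perfectly linear: nothing to improve
    best_k, best_gain = None, 0.0
    for k in range(min_pts, len(xs) - min_pts + 1):
        split = line_fit(xs[:k], ys[:k])[2] + line_fit(xs[k:], ys[k:])[2]
        gain = (whole - split) / whole
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k, best_gain


def chance_of_gain(xs, ys, trials=200, seed=0):
    """Fraction of no-change simulations whose best split matches the real gain."""
    rng = random.Random(seed)
    slope, intercept, sse = line_fit(xs, ys)
    sigma = (sse / (len(xs) - 2)) ** 0.5  # noise level estimated from residuals
    observed = best_split(xs, ys)[1]
    hits = 0
    for _ in range(trials):
        fake = [slope * x + intercept + rng.gauss(0, sigma) for x in xs]
        if best_split(xs, fake)[1] >= observed:
            hits += 1
    return hits / trials


rng = random.Random(42)
xs = list(range(30))
# One made-up series with no change, and one whose slope jumps after x = 15.
straight = [0.5 * x + rng.gauss(0, 1) for x in xs]
kinked = [0.5 * x + (3.0 * (x - 15) if x > 15 else 0.0) + rng.gauss(0, 1)
          for x in xs]

for name, ys in (("straight", straight), ("kinked", kinked)):
    print(name, best_split(xs, ys), chance_of_gain(xs, ys))
```

Note that the search finds a "best" split with a positive gain in both series. The only thing that separates the real change from the fake one is the comparison against what chance alone produces.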
Finding a partition point like that is, as you can see, not a simple
thing to do. You need a good supply of data: for small datasets, the
probability of finding a good-looking partition purely by chance is quite high. You need to do
a lot of careful analysis.
In general, trying to find multiple partition points is simply not
feasible unless you have a really huge quantity of data, and the slope change
is really dramatic. I'm not going to go into the details - but it's basically
just using more Bayesian analysis. You know that there's a high probability
that adding partitions to your data will increase the match quality. You need
to determine, given the expected improvement from partitioning based on the
distribution of your data, how much better of a fit you'd need to find after
partitioning for it to be reasonably certain that the change wasn't an
illusion.
Just to show that there's one genuine partition point, you need to show a
pretty significant change. (Exactly how much depends on how much data you
have, what kind of distribution it has, and how well it correlates to the line
match.) But you can't do it for small changes. To show two genuine change points
requires an extremely robust change at both points, along with showing that
non-linear matches aren't better than the multiple slope changes. To show
three inflection points is close to impossible; if the slope is
shifting that often, it's almost certainly not a linear relationship.
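The non-linear comparison is easy to illustrate. The sketch below builds made-up data from a smooth curve with no change point at all, then compares one line, the best pair of lines, and a single least-squares parabola. The helper names (`sse_line`, `solve3`, `sse_quadratic`), the curve, and the noise level are all assumptions chosen for the example; the parabola is fit by solving the usual normal equations.

```python
import random


def sse_line(xs, ys):
    """Sum of squared errors of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return sum((y - (my + slope * (x - mx))) ** 2 for x, y in zip(xs, ys))


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    sol = [0.0] * 3
    for r in (2, 1, 0):
        rest = sum(m[r][c] * sol[c] for c in range(r + 1, 3))
        sol[r] = (m[r][3] - rest) / m[r][r]
    return sol


def sse_quadratic(xs, ys):
    """Sum of squared errors of the least-squares parabola."""
    mx = sum(xs) / len(xs)
    us = [x - mx for x in xs]  # center x to keep the arithmetic well-behaved
    s = [sum(u ** k for u in us) for k in range(5)]
    a = [[s[i + j] for j in range(3)] for i in range(3)]
    b = [sum(y * u ** i for u, y in zip(us, ys)) for i in range(3)]
    c0, c1, c2 = solve3(a, b)
    return sum((y - (c0 + c1 * u + c2 * u * u)) ** 2 for u, y in zip(us, ys))


rng = random.Random(7)
xs = list(range(30))
# A smooth curve plus noise: the slope changes everywhere, not at one point.
ys = [0.1 * x * x + rng.gauss(0, 0.5) for x in xs]

one_line = sse_line(xs, ys)
two_lines = min(sse_line(xs[:k], ys[:k]) + sse_line(xs[k:], ys[k:])
                for k in range(3, len(xs) - 2))
quad = sse_quadratic(xs, ys)
print("one line:", one_line, "best two lines:", two_lines, "parabola:", quad)
```

Splitting the curve into two lines looks like a dramatic improvement over one line, which is exactly what a "change point" would look like. But the single parabola fits better still, with fewer parameters, so the apparent slope change is just a curve being approximated by line segments.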
To get down to specifics, the data set purportedly analyzed by SCPI
consists of autism rates measured over 35 years. That's just thirty-five
data points. The chances of being able to reliably identify
one slope change in a set of 35 data points are slim at best. Two?
Ridiculous. Three? Beyond ridiculous. There's just nowhere near
enough data to be able to claim that you've got three different inflection
points measured from 35 data points.
To make matters worse: the earliest data in their analysis comes from a
different source than the latest data. They've got some data from the
US Department of Education (1970->1987), and some data from the California
Department of Developmental Services (1973->1997). And those two are measuring
different things; the US DOE statistic is based on a count of the number of 19
year olds who have a diagnosis of autism (so it was data collected in 1989 through 2006);
the California DDS statistic is based on the autism diagnosis rate for children living in
California.
So - guess where one of their slope changes occurs? Go on, guess.
The slope changed in the year when they switched from mixed data to
California DDS data exclusively. Gosh, you don't think that that might be a
confounding factor, do you? And gosh, it's by far the largest (and therefore
the most likely to be real) of the three slope changes they claim to
have found.
For the third slope change, they don't even show it on the same graph. In
fact, to get it, they needed to use an entirely different dataset from
either of the two others. Which is an interesting choice, given that the CA DDS
statistic that they used for the second slope change actually appears
to show a decrease occurring around 1995. But when they switch datasets,
ignoring the one that they were using before, they find a third slope change
in 1995 - right when their other data set shows a decrease.
So... Let's summarize the problems here.
- They're using an iterative line-matching technique which is, at
- They're applying it to a dataset that is orders of
magnitude too small to be able to generate a meaningful result for a
single slope change, but they use it to identify three
different slope changes.
- They use mixed datasets that measure different things in different ways,
without any sort of meta-analysis to reconcile them.
- One of the supposed changes occurs at the point of changeover in the datasets.
- When one of their datasets shows a decrease in the slope, but another
shows an increase, they arbitrarily choose the one that shows an increase.