Before I get to the meat of the post, I want to remind you that our
DonorsChoose drive is ending in just a couple of days! A small number of readers have made extremely generous contributions, which
is very gratifying. (One person has even taken me up on my offer
of letting donors choose topics.) But the number of contributions has been very small. Please, follow the link in my sidebar, go to DonorsChoose, and make a donation. Even a few dollars can make a
big difference. And remember - if you donate one hundred dollars or more, email me a math topic that you'd like me to write about, and I'll
write you a blog article on that topic.
This post repeats a bunch of stuff that I mentioned in one of my basics posts last year on the margin of error. But given some of the awful rubbish I've heard in coverage of the coming election, I thought it was worth discussing a bit.
As the election nears, it seems like every other minute, we
hear predictions of the outcome of the election, based on polling. The
thing is, pretty much every one of those reports is wrong.
What happens is that they look at polls, and they talk about the results and what they mean. But they, like almost everyone, use the margin of error as if it means something very different than what it really does.
What you hear is that, for example, Barack Obama is leading Florida by 4 points, but the margin of error is +/- 4%, so it's really not a significant lead. What the journalists seem to think this means
is that the margin of error is a total measure of the accuracy of the poll - that the poll result is within the margin of error of the "true" result that the poll measures. So, by that interpretation,
the poll is predicting an outcome of 52/48, and the margin of error means that the actual voter preference lies somewhere between
48/52 and 56/44.
The thing is, that's not what the margin of error means. The margin
of error is a statistical measure of the probable size of the error caused by random sampling - the chance that an honestly chosen random sample happens not to represent the population.
Polls - and much of statistics in general - are based on the idea of sampling. Given a large, relatively uniform population, you can
get an amazingly accurate measure of that population by looking at a small subset of it, called a representative sample. A sample
is a randomly selected group that is intended to be a microcosm of the
entire population. In an ideal representative sample, the
sample has the same distribution of characteristics and opinions as the
population as a whole.
There's a big problem there: how can you be sure that your sample is representative? The answer is, you can't! The only way to know for certain that a sample is representative is to measure the entire population, and compare the results of doing that to the sample. But once you've measured the entire population, what's the point of looking at a sample?
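To see why sampling works at all, here's a small simulation. All the numbers are invented for illustration: a population in which 52% favor one candidate, polled repeatedly with samples of 1,000 people:

```python
import random

random.seed(1)
TRUE_SUPPORT = 0.52   # hypothetical "true" preference across the whole population
SAMPLE_SIZE = 1000

def poll():
    # Simulate asking SAMPLE_SIZE randomly chosen voters; each respondent
    # favors our candidate with probability TRUE_SUPPORT.
    hits = sum(random.random() < TRUE_SUPPORT for _ in range(SAMPLE_SIZE))
    return hits / SAMPLE_SIZE

# Run 200 independent polls of the same population.
estimates = [poll() for _ in range(200)]

# The estimates cluster tightly around the true value, even though each
# poll only looks at 1,000 out of (conceptually) millions of voters.
share_within_3pts = sum(abs(e - TRUE_SUPPORT) <= 0.03 for e in estimates) / len(estimates)
```

Run it, and you'll find that the large majority of the simulated polls land within about 3 points of the true 52% - but a few stray further out, which is exactly the randomness the margin of error quantifies.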
Fortunately, we can assess how likely it is that our sample is a good representation of the population. That's what the margin of error does - it measures the likelihood of the sample being representative of the population. It's computed by combining a
bunch of factors, by far the most important of which is the size of the sample (for a large population, the population's own size barely matters). Given those, we can assess how
certain we can be that our measurement is close to accurate. Typically, we describe that certainty by stating how large an interval
we need to define on either side of the measured statistic to be
95% certain that the "actual" value is within that interval. The size of that interval is the margin of error.
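For a poll that estimates a proportion, the usual 95% margin of error works out to roughly 1.96 × √(p(1−p)/n). A minimal sketch - the function name and the example numbers are mine, not from any particular poll:

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    # 95% margin of error for a proportion p_hat estimated from a
    # random sample of n respondents. z=1.96 is the standard normal
    # quantile corresponding to 95% confidence.
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# A poll of 600 likely voters showing 52% support for a candidate:
moe = margin_of_error(0.52, 600)   # ≈ 0.04, i.e. the familiar "+/- 4 points"
```

Notice that n is the only thing the pollster really controls here, and it sits under a square root: quadrupling the sample size only halves the margin of error, which is why polls rarely bother going much beyond a thousand or so respondents.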
So when you hear a pollster talking about a "poll of likely voters showing that Obama is ahead by 8 points with a margin of error of +/-4%", the big thing you should do is realize what they're measuring. In that case, the population isn't "the set of people who are going to vote next Tuesday" (even though that's what the journalists try to make you think); the population is "the set of people who the poll believes are likely to vote next Tuesday". So the margin of error
is a measure of how well their poll matches the population of people
who they believe are likely to vote - which is quite a different thing
from the population of people who actually do vote. In fact, it measures even less
than that: it measures only how much sampling error is in their poll due to
the chance of unintentionally selecting a non-representative sample. That's not really saying very much in an election poll.
The population being sampled by polls is likely to be quite different from the actual population of voters for a number of reasons,
and this difference produces measurement errors that almost certainly dwarf the random sampling error measured by the margin of error. For example:
- Intentional Sample Bias
- Intentional sample bias covers a variety of techniques that
pollsters use when they select people for the sample. For an extreme
example, some polls (like, I think, Zogby) try to get an equal number of
people who self-identify as Republicans and Democrats. But in most
states, the numbers of members of the two major parties are not
equal - in fact, they are often pretty dramatically uneven. A less
dramatic but still significant one is that many polls do their polling
through phone calls, and only call land-lines. Many younger people no
longer have land-lines; the exclusion of cell-phone numbers therefore
excludes some portion of the population from the sample. These kinds
of sample bias produce a significant mismatch between the population
of real voters, and the population being sampled.
- Unknown Population
- The biggest source of polling error leading up to an election is the fact
that the real population is unknown. No one is sure who's going
to vote - which means that no one is certain of what the correct
population to sample is. Pollsters try to identify a sample of people
who are likely to vote. But since the population is unknown,
they don't know if they're including people in the sample who aren't in
the actual population of voters, and they don't know if they're
excluding people from their samples who are going to vote. In
this election, this is likely to be a significant effect, because huge
numbers of people registered to vote for the first time, but no one
knows how many of those newly registered voters are likely to show
up and vote. Once again, there's a problem related to the fact that the
population that they're sampling isn't the same as the population that
the poll is trying to measure - so that error factor is outside the
margin of error.
- Phrasing Bias
- You can get significant differences in polls based on how the
question is phrased. "Who are you going to vote for?" will
likely generate different results from "Are you going to vote for Obama or McCain?", which will likely generate different results from
"Do you plan to vote for McCain or Obama?", which will generate different results from "Do you plan to vote for a Democrat or a Republican in the presidential election?". This is a well-known problem,
but it still has a significant effect.
- Dishonest Answers
- People aren't entirely trustworthy. They don't necessarily answer
questions honestly. A frequently discussed version of this is called
the Bradley effect. The Bradley effect is a phenomenon where people
are reluctant to admit to being racist. So when a pollster asks them
if they're going to vote for a black man, they'll say "yes", but when
it actually comes to voting, they'll vote for the white guy. I've heard
some people speculate on a reverse Bradley effect this year in some southern states, where people are reluctant to admit that they're going to vote for a black man, so they lie and say they're voting McCain. But the truth of the matter is, we don't know if the people answering the
polls are answering honestly. If they're not, that skews the poll results, and once again, it's not covered by the margin of error.
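To make the first of those effects concrete, here's a toy simulation of land-line-only polling. All the numbers are invented for illustration: suppose 20% of the electorate is cell-only and favors one candidate at 65%, while land-line voters split evenly:

```python
import random

random.seed(2)

# Invented numbers, purely for illustration:
LANDLINE_SHARE = 0.80     # fraction of voters reachable by land-line
LANDLINE_SUPPORT = 0.50   # candidate's support among land-line voters
CELL_SUPPORT = 0.65       # support among cell-only (often younger) voters

# The candidate's support in the *real* electorate:
true_support = LANDLINE_SHARE * LANDLINE_SUPPORT + (1 - LANDLINE_SHARE) * CELL_SUPPORT

def landline_poll(n=1000):
    # The poll can only ever reach land-line voters, so it samples
    # (perfectly randomly!) from the wrong population.
    hits = sum(random.random() < LANDLINE_SUPPORT for _ in range(n))
    return hits / n

estimate = landline_poll()
# The margin of error (~3 points at n=1000) describes the random scatter
# around the land-line figure of 50% - but the true value here is 53%,
# and that gap doesn't shrink no matter how many people the poll calls.
# It's a bias that never shows up in the reported margin of error.
```

This is the crucial distinction: the margin of error measures the scatter of the estimate around the mean of the population actually being sampled, not the gap between that population and the real electorate.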