The margin of error is *the* most widely misunderstood and misleading concept in statistics. It's positively frightening to people who actually understand what it means to see how it's commonly used in the media, in conversation, sometimes even by other scientists!

The basic idea of it is very simple. Most of the time when we're doing statistics, we're doing statistics based on a sample - that is, the entire population we're interested in is difficult to study; so what we try to do is pick a *representative subset* called a sample. If the subset is *truly* representative, then the statistics you generate using information gathered from the sample will be *the same* as information gathered from the population as a whole.

But life is never simple. We *never* have perfectly representative samples; in fact, it's *impossible* to select a perfectly representative sample. So we do our best to pick good samples, and we use probability theory to work out a prediction of how confident we can be that the statistics from our sample are representative of the entire population. That's basically what the margin of error represents: how well we *think* that the selected sample will allow us to predict things about the entire population.

The way that we compute a margin of error consists of a couple of factors:

- The *size* of the sample.
- The magnitude of *known* problems in the sample.
- The *extremity* of the statistic.

Let's look at each of those in a bit more detail to understand what they mean, and then I'll explain how we use them to compute a margin of error.

The larger a sample is, the more likely it is to be representative. The intuition behind that is pretty simple: the more individuals that we include in the sample, the less likely we are to accidentally omit some group or some trend that exists within the population. Taking a presidential election as an example: if you polled 10 randomly selected people in Manhattan, you'd probably get mostly Democrats, and a few Republicans. But Manhattan actually has a fairly sizeable group of people who vote for independents, like the Green Party. If you sampled 100 people, you'd probably get at least three or four Greens. With the smaller sample size, you'd wind up with statistics that *overstated* the number of Democrats in Manhattan, because the Green voters, who tend to be very liberal, would probably be "hidden" inside of the Democratic stat. If you sampled 1000 people, you'd be more likely to get a really good picture of NYC: you'd get the Democrats and Republicans, the Conservative Party, the Working Families Party, and so on - all groups that you'd miss in the smallest sample.
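
One way to put rough numbers on that intuition: if a group makes up a fraction q of the population, the chance that a simple random sample of n people contains nobody from that group is (1 - q)^n. The sketch below assumes, purely for illustration, that Green voters are 3% of the population (a made-up share, loosely in line with the "three or four in 100" guess above) and that people are sampled independently at random. The function name `prob_group_missing` is hypothetical.

```python
def prob_group_missing(share: float, sample_size: int) -> float:
    """Chance that a simple random sample contains nobody from a group
    that makes up `share` of the population (independent draws assumed)."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    return (1.0 - share) ** sample_size


# Illustrative assumption, not a real figure: Green voters are 3% of the population.
GREEN_SHARE = 0.03

for n in (10, 100, 1000):
    p_missing = prob_group_missing(GREEN_SHARE, n)
    print(f"n={n:>4}: P(no Green voters in sample) = {p_missing:.4f}")
```

With these assumed numbers, a sample of 10 misses the group entirely roughly three times out of four, while a sample of 100 misses it less than 5% of the time.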

When we start to look at a statistic, we start with an *expectation*: a very rough sense of what the outcome is likely to be. (Given a complete unknown, we generally start with the Bayesian assumption that in the presence of zero knowledge, you can generally assign a 50/50 split as an initial guess about the division of any population into exactly two categories.) When we work with a sample, we tend to be *less* confident about how representative that sample is the farther the measured statistic varies from the *expected* value.

Finally, sometimes we *know* that the mechanism we use for our sample is imperfect - that is, we know that our sample contains an unavoidable bias. In that case, we expand the margin of error to try to represent the reduced certainty caused by the known bias. For example, in elections, we know that *in general*, there are certain groups of people who simply *are less likely* to participate in exit polls. An exit poll simply *cannot* generate an unbiased sample, because the outcome is partially determined by *who* is willing to stop and take the poll. Another example is in polls involving things like sexuality, where because of social factors, people are *less likely* to admit to certain things. If you're trying to measure something like "What percentage of people have had extramarital affairs?", you *know* that many people are not going to tell the truth - so your result will include an expected bias.

Given those, how do we compute the margin of error? It depends a bit on how you're measuring. The easiest (and most common) case is a *percentage* based statistic, so that's what we'll stick with for this article. The margin of error is computed from the *standard error*, which is in turn derived from an *approximation* of the standard deviation. Given a sample of size P (the number of individuals sampled, not the size of the whole population) and a measured statistic of X (where X is in decimal form - so 50% means X=0.5), the standard error E is:

$$E = \sqrt{\frac{X(1-X)}{P}}$$

The way that equation is generated is beyond the scope of this article - but it's built on a couple of reasonable assumptions: the big one being that the statistic being measured has a *binomial* distribution. (For now, think of a binomial distribution as being something where randomly selected samples will generate results for a statistic that form a bell curve around the value that you would get if you could measure the statistic for the *entire* population.) If you make that assumption, we wind up with an equation in terms of the *variance* of the population (the variance is the standard deviation squared) - and then with a couple of simplifications that can be shown to not significantly alter the value of the standard error, we wind up with that equation.
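
As a quick sanity check on the formula, here is a minimal Python sketch. It treats P as the number of people actually sampled; the function name `standard_error` and the example inputs are illustrative assumptions, not something from a particular poll.

```python
import math


def standard_error(x: float, p: int) -> float:
    """Standard error E = sqrt(X * (1 - X) / P) for a proportion X
    (in decimal form) measured on a sample of P individuals."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be a proportion between 0 and 1")
    if p <= 0:
        raise ValueError("p must be a positive sample size")
    return math.sqrt(x * (1.0 - x) / p)


# Hypothetical poll: 50% answered "yes" out of 1000 people sampled.
print(f"E = {standard_error(0.5, 1000):.4f}")
```

Note that X(1 - X) is largest when X = 0.5, so for a fixed sample size a 50/50 result gives the largest standard error under this formula.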

To get from the standard error to the margin of error, we need to pick a *confidence interval*. A confidence interval is a percentage representing how certain we are that the actual statistic lies within the measured statistic +/- the margin of error. (Strictly speaking, the percentage itself is usually called the *confidence level*, and the interval is the range around the measured value; the two terms are used loosely here.) *In general*, most statistics are computed using a confidence interval of 95%. You get a confidence interval of 95% when you use roughly twice the standard error (more precisely, 1.96 times) as your margin of error. So the margin of error for most polls is 2E with a confidence of 95%. Using 2.58E as your margin of error gives you a confidence interval of 99%; using just 1E gives you a confidence interval of 68%.
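
To make the relationship concrete, the following sketch turns a standard error into a margin of error at the three confidence values mentioned above. It uses the multipliers 1, 1.96, and 2.58 (1.96 being the usual more precise value behind the "twice the standard error" rule of thumb); the function name and the example poll are hypothetical.

```python
import math

# Multipliers of the standard error for each confidence value,
# based on the normal approximation.
Z_MULTIPLIERS = {0.68: 1.0, 0.95: 1.96, 0.99: 2.58}


def margin_of_error(x: float, sample_size: int, confidence: float = 0.95) -> float:
    """Margin of error for a proportion x measured on `sample_size` people."""
    if confidence not in Z_MULTIPLIERS:
        raise ValueError(f"unsupported confidence value: {confidence}")
    std_err = math.sqrt(x * (1.0 - x) / sample_size)
    return Z_MULTIPLIERS[confidence] * std_err


# Hypothetical poll: 50% "yes" from a sample of 1000.
for level in (0.68, 0.95, 0.99):
    moe = margin_of_error(0.5, 1000, level)
    print(f"{level:.0%} confidence: +/- {moe:.1%}")
```

The same poll can honestly be reported with quite different margins of error depending on the confidence chosen, which is why a margin quoted without its confidence value tells you very little.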

There are a bunch of errors in how people generally use the margin of error:

- The most glaring error is not citing the confidence interval. You *cannot* know what a margin of error means if the confidence interval isn't specified. I don't think I can remember the last time I saw a CI quoted outside of a scientific paper.
- Many people, especially journalists, believe that the margin of error includes *all possible sources of error*. It most emphatically does not - it *only* specifies the magnitude of error introduced by *non-deliberate sampling errors*. In a scientific experiment, experimental errors and measurement errors always affect the outcome of the experiment - but the margin of error *does not* include those factors - only the sampling error. In a poll, the wording of a question and the way in which it's asked have a *huge* impact - and that is *not* part of the margin of error. (For example, if you wanted to measure support for concealed carry laws for guns and asked people "Do you believe that people should be allowed to carry concealed weapons anywhere they go, including government buildings, schools, and churches?", you'd get one result. If you asked "Do you believe that citizens have a right to carry a weapon to protect themselves and their families from criminals?", you'd get a very different answer - the phrasing of the first question is likely to bias people *against* saying yes, by deliberately invoking the image of guns in a school or a church. The phrasing of the second question is likely to generate far more "Yes" answers than the first, because it invokes the image of self-protection from rampaging bad-guys.)
- People frequently believe that the margin of error is a measure of the *quality* of a statistic - that is, that a well-designed poll will have a smaller margin of error than a poorly-designed poll. It doesn't - the MoE *only* represents sampling errors! A *great* poll with a sample size of 100 will virtually always have a considerably larger MoE than a *terrible* poll with a sample size of 1000. If you want to know the quality of a poll, you need to know more about it than just the margin of error; if you want to gauge the relative quality of two different polls, you need to know more than just the margin of error. In either case, you really need to know the sample size, how the sample was collected, and *most importantly* exactly what they measure.
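
The point about sample size is easy to check numerically. The sketch below compares the 95% margin of error for samples of 100 and 1000, and then inverts the formula to ask how many people you'd need to sample for a given margin. It assumes the worst case of a 50/50 split and a multiplier of 1.96; the function names are hypothetical, and nothing here says anything about how *well* either poll was run.

```python
import math

Z_95 = 1.96  # multiplier of the standard error for 95% confidence


def margin_of_error(x: float, sample_size: int, z: float = Z_95) -> float:
    """Margin of error for a proportion x on a sample of the given size."""
    return z * math.sqrt(x * (1.0 - x) / sample_size)


def sample_size_for(moe: float, x: float = 0.5, z: float = Z_95) -> int:
    """Smallest sample size whose margin of error is at most `moe`,
    obtained by solving moe = z * sqrt(x * (1 - x) / n) for n."""
    return math.ceil(x * (1.0 - x) * (z / moe) ** 2)


print(f"n=100:  +/- {margin_of_error(0.5, 100):.1%}")
print(f"n=1000: +/- {margin_of_error(0.5, 1000):.1%}")
print(f"sample needed for +/- 3%: {sample_size_for(0.03)}")
```

Under these assumptions the sample of 100 carries a margin of error about three times as large as the sample of 1000, regardless of how carefully either poll was designed.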

To give another political example, there are a number of different polls taken on a very regular basis of the approval ratings of the president. These polls vary quite drastically - for example, in this week's polls, the number of people who approve of the president range from 30% to 39%, with margins of error in the 3% range.

"(Given a complete unknown, we generally start with the Bayesian assumption that in the presence of zero knowledge, you can generally assign a 50/50 split as an initial guess about the division of any population into exactly two categories.) "

This isn't entirely correct or maybe not informative enough. What we start with is a uniform distribution (well not always uniform, see below) on the probability of the ratios. That is, with zero knowledge, we will say that 99/1 98/2...51/49 50/50 49/51... 1/99 all have equal probability of being the true statistic.

It is true that the expectation of the uniform distribution is 0.5, but just saying that doesn't really convey the uninformativeness of the initial distribution, especially when you consider that there are a lot of distributions with 0.5 mean that aren't uninformative.

To go beyond the basics:

A lot of bayesians actually argue that it is not the uniform distribution but the Beta prior that is the most uninformative for binomial problems. This is not a very important point because in practice both distributions give almost identical results (except for very small samples).

The benefit of the Beta prior is that it is invariant to changes of scale. Given the x statistic, the Beta prior will give you the same results if you substitute x for x^n. In a way it is more uninformative because when you are using the uniform prior you assume a linear scale or linear measuring instrument whereas the Beta prior doesn't even make that assumption.

Uninformative priors for more complicated problems can be found using transformation groups to find invariance or symmetry and making sure that results are the same when we scale or change equivalent (groups of) models.

More information can be found here or you can order the full book on amazon.

Just a point of clarification. By "p" in the equation, presumably you mean the size of the sample, not the size of the population from which the sample was drawn. Earlier in the article you used "population" to mean the latter.

Yep, p is the same sample size according to wikipedia. Nonetheless, this is a great article. Perhaps something on Bayesian statistics in the future?

Figuring the 95% CI of a binomial proportion is actually a lot trickier than you make out. The + or - 2 SE is commonly used, but not that great - see Agresti and Coull, "Approximate is better than exact for interval estimation of binomial proportions", American Statistician, v 52 (1998), pages 119-126, and refs therein, and also (using say google scholar) papers citing A and C.


BenE -

"for example, in this week's polls, the number of people who approve of the president range from 30% to 39%, with margins of error in the 3% range."

I'm quite sure more than 3% feel they've made an error. 😉

Bob O'H,

You're right, that's what I meant. And you're also right about using the uniform. I use it all the time because it is simpler to understand. What's wrong with assuming some kind of linearity in models anyways? And as I said, results are pretty much the same except for very small samples in which statistics are all over the place and only display uncertainty anyways.

Even Jaynes said it,

"A useful rule of thumb [...] is that changing the prior probability for a parameter by one power has in general about the same effect on our final conclusions as does having one more data point. [...] So, from a pragmatic standpoint, arguments about which prior probabilities correctly express a state of 'complete ignorance', like those over prior ranges, usually amount to quibbling over pretty small peanuts. From the standpoint of principle, however, they are important and need to be thought about a great deal"

The point of Jeffreys' prior is mostly a theoretical one: it shows how Bayesianism can display complete objectivity. In practice, though, it doesn't make much difference.

All right, a good explanation! I might be able to take some of this to use. Look at the comments you got and you see why people who do not have any background in stats are just buffaloed by the jargon. I always try to get across that margin of error really has little (to nothing) to do with the accuracy or precision of the data used in a poll or other statistic. It's actually kind of a best case thing in general, a "this is as good as you are going to get with this many samples..." kind of thing.

It is really hard to get across the point that you mentioned about how good an estimate is - when people see this one is +- 5% and this one +- 20% why is the second one possibly better.

I'm reviewing this in my Econometrics class right now! Thanks for the reminders.

One question: is this the same as confidence level? I study physics and have only seen confidence level used, but it seems to be measuring the same things: statistical error (aka number of points), and estimated systematic errors.

This article is really awful. There's no need to perpetuate the frequentist party line any longer. The frequentist approach is a train wreck of a theory, making probability & decision analysis about 50 times as complex as they really need to be. Bayesian inference is more powerful, and much simpler to boot. People get a mistaken idea that probability is very difficult, but that's only because of the messy (non-Bayesian) way it is taught. See "Making Hard Decisions" by Robert Clemen for a basic introduction to probability and decision analysis. See "Bayesian Data Analysis" by Gelman, Carlin, Stern, and Rubin for a more advanced text.

Wow, I'm glad Bayesianism is starting to pick up. Last time I had a probability discussion on this blog, it was me and maybe another guy against a bunch of frequentists who dismissed us and told us we were using non-objective, non-scientific mathematics. This time around it seems like the Bayesians are setting the tone of the discussion.

BenE -

A problem with using an objective prior is that there are several available, so which one do you choose?

There were several articles about Subjective and Objective Bayesianism in Bayesian Analysis last year: http://ba.stat.cmu.edu/vol01is03.php

For example, Kadane argues against objective Bayesian approaches, and Jim Berger gives a more positive overview, but one still wonders where the One True Objective Bayesian Method is.

Bob

P.S. I guess Mark is having second thoughts about writing a post about Bayesian methods now. But it's nice to see so many crawl out of the woodwork.

I believe I was one of the participants, but if so I don't recognize the message. At least my message, which is that these are different conceptions of the concept of probability, with different best use.

One reason frequentist probability can be preferred in science is that it is easy to extract from most models and to verify by observations. As I understand it, Bayesian inference can, like modal logics, say anything about anything.

But of course when it comes down to predictions that are amenable to real data this should not be a problem. And neither can frequentist models be automatically trusted. If the event happens or is expected to happen a few times, the result is of limited value.

Another reason frequentist probability can be preferred in science is that it can handle theoretical probabilities over infinite spaces. (Kolmogorov's axioms for frequentist probability vs Cox's axioms for bayesian.) As I understand it, there are real problems to define bayesian probabilities that can be used in common derivations in quantum field theory.

But certainly bayesian methods have scientific uses. It is one method to address the question of parsimony. Bayesian measures are commonly used to choose between different models in cosmology and cladistics for example.

So personally I don't think bayesian methods are non-scientific or of no value. I'm less certain about its use in models which can't be tested, like in proofs for gods. Or when it is conflated with other probabilistic conceptions. True, in most cases the both conceptions can agree on frequencies. But in other cases there are differences.

I'm reminded of Wilkins' recent discussions of species ( http://scienceblogs.com/evolvingthoughts/2007/01/species.php ). Species (probability) is a phenomenon that can be described by a concept: "A species is any lineage of organisms that is distinct from other lineages because of differences in some shared biological property" ("the extent to which something is likely to happen or be the case").

But for species (perhaps also probabilities) no single actualization, conception, can cover all uses and details. "All the various conceptions of the concept try to give the differences in shared biological properties some detail - differences in sexual reproductive mechanisms, differences in genetic structure, differences in ecological niche adaptation, and so on.

And when we look at them that way, it becomes clear why none of them are sufficient or necessary for all species: the mechanisms that keep lineages distinct evolved uniquely in every case, and so generalisations only cover some, not all, of life." (And in case of probabilities, the main uses (frequency, inference) that keep probabilities distinct are unique.)

I could as well end this with Wilkins' words: "That's the end of my idiosyncratic view. Like it or not..."

Thanks Bob for the link, I didn't know about the Society for Bayesian Analysis.

I understand that there are debates in the Bayesian approaches. These debates seem to be hard to solve because they are often more philosophical than mathematical and are rooted in the way we think about science, including all sorts of epistemological considerations.

In my opinion the problem goes farther than mathematics. I would go as far as saying that we have to include the physical laws of the universe into the equation when we are considering the objectivity of different priors. I would venture that the only way to find the best, most objective prior for any model or hypothesis that represents something that exists in this universe is to find the one that excludes any bias, not on a mathematical or number theoretic dimension, but on physical dimensions of the universe. I'm not convinced it is possible to solve this problem without thinking about the continuity of space, time, movements, and acceleration and the relative isolation these properties give to things and events and the effect this has on their probability of occurring or existing somewhere and sometime without physical disturbances to stop them.

This link with physics is a little bit like a return to the geometrical interpretation of mathematics used in ancient greece which was more grounded in concrete physical representations.

Regardless of all this philosophical babbling, the Bayesian approaches seem to allow more objectivity and more robustness than the frequentist approaches while being simpler.

Null hypothesis testing comes to mind as a nonsensical consequence of the frequentist approach. The statistics it gives are counterintuitive and can usually be manipulated into saying anything.

I think some day we will have a set of priors and applicability rules for (hopefully all) real world problems and these won't allow for any biases and number manipulation like frequentist theories.

Considering that observations must select which model is correct, I have personally much more trust in frequentist probability. It models a specific and general characteristic that is easy to extract and verify. In bayesian terms, my prior can be set high compared to a particular bayesian inference. 🙂

I remember this argument. ( http://scientopia.org/blogs/goodmath/2006/07/yet-another-crappy-bayesian-argument ) My last comment said:

"BenE:

You discuss a relative error. That is an experimental error, that has to be checked and controlled. And strictly speaking, it doesn't add up, it's the variance of the populations that decreases.

I see nothing special about relative errors and other model or experimental defects. In this case with a very tight variance relative errors become extremely important and must be controlled. Otherwise the firm limit loses its meaning, as you suggest."

So I see this argument, as the argument about not measuring frequencies with finite amounts of data, as theoretical arguments that real application shows aren't factual concerns with these models. Did you have anything to add?

Hypothesis testing, however it is done, is important to science. This method (by contradiction from data), and falsification (by denying the consequent from data), is what makes us able to reject false theories. It is the basis that distinguishes science from merely, well, suggesting models by inductive inference.

"I see nothing special about relative errors"

Except that they void any trustworthiness null hypothesis tests might have, since there is no way to remove these errors from data and often no way to even know if these errors are present in the results (without using Bayesian methods, or looking at data plots, in which case, why do these tests if the visual representation is more informative?)

"So I see this argument, as the argument about not measuring frequencies with finite amounts of data, as theoretical arguments that real application shows aren't factual concerns with these models. Did you have anything to add?"

Huh? The detrimental effects of null hypothesis testing in real applications are very very common. I see it all the time!

I've actually seen it taught in statistics classes that you shouldn't use too many data points when doing these tests because you'll always end up finding something significant! The exact number of data points to use is left as a personal choice to the researcher. What the hell is that! This means having a big and representative sample becomes a bad thing. It means that given that a researcher has enough resources to get enough data, it's his choice whether he makes his results significant or not! These tests are useless!

It happens _all the time_ in academia. University professors have a huge incentive to publish (their job is at risk) and because of the dumb trust in these statistical tests, papers that show statistical significance in rejecting null hypotheses get published even when the real effect is very small. Researchers use this flaw to fish for results when there's really nothing interesting to report.

It is especially easy to do in the social sciences where the nature of their tests makes all sorts of biases available to exploit through null hypothesis testing. I am personally familiar with this in the world of psychology and I can tell you that academics who have become pros at this kind of manipulation are the ones that are hired by universities because these institutions often rate candidates by the number of publications they have. And it's a vicious circle since these people later become the ones who rate papers to be accepted for publication. Since it is their tradition, they blindly accept papers that reject some null hypothesis even though the results are uninteresting and not useful. You wouldn't believe the crap that is published in psychology journals based on rejected null hypotheses.

Null hypothesis testing is one of the most widely exploited and blatant flaws in frequentist probabilities, but there are other more subtle flaws you can read about in Jaynes' book.

"Hypothesis testing, however it is done, is important to science. This method (by contradiction from data), and falsification (by denying the consequent from data), is what makes us able to reject false theories. It is the basis that distinguish science from merely, well, suggesting models by inductive inference."

Hypothesis testing is used in Bayesian theory and is a fine concept. It's _null_ hypothesis testing that is a symptom of the nonsense inherent in frequentism. Bayesian theory accepts that no theory is perfect (they can always be rejected with frequentist techniques if you have enough data). Since theories are _all_ just approximations, the only thing science will ever be able to attain, given physical limitations in measurements, is a best known approximation, and to find this best approximation we have to compare hypotheses against each other, _not_ against some hypothetical null which can almost always be disproved given enough data.

This wasn't an argument at all! Since I don't know much about Bayesian methods, this amounts to an argument from ignorance. And my personal trust has no bearing on the question. I must plead guilty of posting on an empty stomach, IIRC, which usually leads to an empty head. 🙂

More to the point, what is "robustness" here?

I addressed the question of tight variances and relative errors in my comment. You can't extract more information than the errors in the experiment let you. (And plotting data is essential in any good data analysis.)

I don't see your point. In this hypothesis testing you choose one hypothesis as a null, and it is tested against data for a contradiction. There are limitations here. The complement hypothesis is not verified because the null is rejected, et cetera. But this version is consistent with falsification, and we know it works.

...well, since the focus here is on 'basics' -- can someone definitively state here whether a "random" sample is absolutely required for any valid statistical sampling conclusions of a population under study ??

Can one perform scientific sampling with a 'non-random' sample ?

Are there mathematical & process adjustments that convert non-random-samples in to valid /accurate representations of the population under study ?

Ubiquitous major American public opinion polls routinely rely upon "samples" with a non-response rate greater than 50%.

That obviously seems a non-random sample.

Professional pollsters compensate by choosing a much larger initial sample than needed... polling 2,000-3,000 persons to get a desired 1,000 person response/sample for analysis.

Various mathematical & weighting adjustments (..fudge-factors ?) are also routinely employed to adjust for deficiencies in the 'randomness' of their polling samples.

Do you really need a "random-sample" for valid statistics -- or is it just nice-to-have ??

"I don't see your point. In this hypothesis testing you choose one hypothesis as a null, and it is tested against data for a contradiction. There are limitations here. The complement hypothesis is not verified because the null is rejected, et cetera. But this version is consistent with falsification, and we know it works."

I don't know what more to say. It surely doesn't work. People publish all kinds of bullshit based on it. It might be consistent with falsification in some cases but for continuous variables, it isn't consistent with anything (And that's how it's used most of the time). When you reject a null hypothesis on a continuous scale, you reject an hypothesis that is infinitesimally narrow, it has zero width. When you say some effect is different than 0 with 95% confidence, you might be 95% confident that it is not zero but you are not 95% confident that it is not 0.0000000000001 or -0.0000000000000000001. You have quite ridiculously rejected a slice of your hypothesis space that is so infinitely thin as to be nothing.

You say the complement is not verified but for a continuous space it actually is. Since the complement is pretty much all of your hypothesis space (minus that infinitesimal slice), and since the right hypothesis is by definition in there, the complement hypothesis has a probability that tends toward 100% of being true! That's how nonsensical this procedure is!

Thank you so much for your article. I NEVER hear the CI quoted either, though I'm just a novice with statistics it bothers me. I've also got textbooks on writing and speech to teach from that cherry pick strange sets of statistical concepts to include.

I'm putting together some teaching on the basics for my writing and speaking classes, and I wanted to add this information (not at the technical level you achieve here, of course).