Several people have asked me to write a few basic posts on statistics. I've
written a few basic posts on the subject - like, for example, this post on mean, median and mode. But I've never really started from the beginnings, for people
who really don't understand statistics at all.
To begin with: statistics is the mathematical analysis of aggregates. That is, it's a set of tools for looking at a large quantity of data about a population, and finding ways to measure, analyze, describe, and understand that population.
There are two main kinds of statistics: sampled statistics, and
full-population statistics. Full-population statistics are
generated from information about all members of a population; sampled statistics
are generated by drawing a representative sample - a subset of the population that should have the same pattern of properties as the full population.
My first exposure to statistics was full-population statistics, and that's
what I'm going to talk about in the first couple of posts. After that, we'll move on to sampled statistics.
The way that I learned this stuff was from my father. My father was
working for RCA on semiconductor manufacturing. They were producing circuits for
satellites and military applications. They'd do a test-run of a particular
design and manufacturing process, and then test all of the chips from that
run. They'd basically submit them to increasing stress until they failed. They'd get failure data about every chip in the manufacturing run. My father's job was to
take that data, and use it to figure out the basic failure properties of the run, and whether or not a full production run using that design and process would produce chips with the desired reliability.
One evening, he brought some work home. After dinner, he spread out a ton of
little scraps of paper all over our dining room table. I (in third or fourth grade at the time) walked in and asked him what he was doing. So he explained it to me.
The little slips were test results. They were using a test system called, if I remember correctly, a Teradyne. It printed out results on these silly little slips of paper. If you've ever watched "Space: 1999", they were like the slips that come out of the computer on that show.
Together, we went through the slips of paper, taking information off of them,
and putting them into long columns. Then we'd add up all of the information in
the column, and start doing the statistics. We did a couple of things. We computed
the mean and the standard deviation of the data; we did a linear regression;
and we computed a correlation coefficient. I'm going to explain each of those in turn.
First, we come to the mean. The mean is the average of a set of values. If you imagine a theoretical individual that behaved exactly the way the aggregate data predicts, the mean is the value you'd expect that individual to have. To compute the mean, you sum up all of the values in the dataset, and divide by the number of values. To write it formally, if your data are the n values x_1, ..., x_n, then the mean, which is usually written x̄ (pronounced "x-bar"), is defined by:

x̄ = (1/n) · Σ(i=1..n) x_i
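If it's easier to see as code, here's a quick sketch in Python (the function name and the sample numbers are just for illustration):

```python
def mean(values):
    """Sum up all of the values, and divide by how many there are."""
    return sum(values) / len(values)

print(mean([2, 4, 6, 8]))  # (2 + 4 + 6 + 8) / 4 = 5.0
```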
The mean is a tricky thing. It's not nearly as informative as you might
hope. A very typical example of what's wrong with it is an old joke: Bill Gates walks into a homeless shelter, and suddenly, the average person in the shelter is a millionaire.
To be more concrete, suppose you had a set of salaries at a small company. The receptionist makes $30K. Two tech support guys make $40K each. Two programmers make
$70K each. The technical manager makes $100K. And the CEO makes $600K. What's the mean salary? (30 + 40 + 40 + 70 + 70 + 100 + 600)/7 = 950/7 ≈ 135.7. So the mean salary is roughly $136,000. But that's more than the second-highest salary in the company! So knowing the mean salary, on its own, doesn't tell you very much.
One fix for that is called the standard deviation. The standard deviation
tells you how much variation there is in the data. If everything is
very close together, the standard deviation will be small. If the data is very
spread out, then the standard deviation will be large.
To compute the standard deviation, for each value in the
population, you take the difference between that value and the
mean. You square it, so that it's always positive. Then you take
those squared differences, and take their mean. The result is
called the variance. The standard deviation is the square root
of the variance. The standard deviation is usually written σ (sigma), so:

σ = sqrt( (1/n) · Σ(i=1..n) (x_i - x̄)² )
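Again, as a sketch in Python (this is the population standard deviation, computed exactly the way I just described):

```python
def std_dev(values):
    """Population standard deviation: the square root of the variance."""
    m = sum(values) / len(values)                                # the mean
    variance = sum((x - m) ** 2 for x in values) / len(values)  # mean of the squared differences
    return variance ** 0.5
```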
So, let's go back to our example. The following table shows, for each salary, the difference between the salary and the mean, and the square of that difference (all values are in thousands of dollars, using the mean of about 135.7):

Salary | Salary - mean | (Salary - mean)²
30     | -105.7        | 11,176
40     | -95.7         | 9,161
40     | -95.7         | 9,161
70     | -65.7         | 4,318
70     | -65.7         | 4,318
100    | -35.7         | 1,276
600    | 464.3         | 215,561
Now, we take the sum of the squares, which gives us about 254,971. Then we divide by the number of values (7), giving us a variance of about 36,424. Finally, we take the square root of that number, giving us about 190.9. So the standard deviation of the salaries is roughly $191,000. That's pretty darned big: the standard deviation is larger than the mean itself, which tells you that the salaries are very spread out.
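If you want to check the arithmetic without pushing slips of paper around, Python's standard library will do it for you; statistics.pstdev is the population standard deviation we just computed:

```python
import statistics

salaries = [30, 40, 40, 70, 70, 100, 600]  # in thousands of dollars

print(statistics.mean(salaries))    # ~135.7
print(statistics.pstdev(salaries))  # ~190.9
```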
The standard deviation has a very specific meaning for data that follows the familiar bell-curve shape (a normal distribution). For normally distributed data, about 68 percent of the values fall within the range (x̄ - σ, x̄ + σ) (which we usually say as "within one standard deviation of the mean", or even "within one sigma"); and about 95 percent of the data is within two sigmas of the mean. For data that isn't bell-shaped - like our lopsided salary example - those exact percentages don't hold, but the standard deviation is still the basic measure of how spread out the data is.
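If you want to see where those percentages come from, here's a little sketch (assuming you have numpy installed) that draws a million values from a normal distribution and counts how many land within one and two sigmas of the mean:

```python
import numpy as np

# Draw a million samples from a standard normal (bell-curve) distribution.
data = np.random.normal(loc=0.0, scale=1.0, size=1_000_000)

m, sigma = data.mean(), data.std()

# Fraction of the data within one and two standard deviations of the mean.
print(np.mean(np.abs(data - m) <= 1 * sigma))  # ~0.68
print(np.mean(np.abs(data - m) <= 2 * sigma))  # ~0.95
```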
What should you take away from this? A couple of things. First,
that these statistics are about aggregates, not individuals. Second,
that when you see someone draw a conclusion from a mean without telling you anything more than the mean, you really don't know enough to draw any
particularly meaningful conclusions about the data. To know how much the mean tells you, you need to know how the data is distributed - and the easiest way of
describing that is by the standard deviation.
Next post, I'll talk about something called linear regression, which was the next thing my dad taught me when I learned this stuff. Linear regression is a way of taking a bunch of data, and analyzing it to see if there's a simple linear relationship between some pair of attributes.