There's one topic I've been asked about multiple times, but which I've never gotten around to writing about. It happens to be one of the first math things that my dad taught me about: linear regression.

Here's the problem: you're doing an experiment. You're measuring one quantity as you vary another. You've got a good reason to believe that there's a linear relationship between the two quantities. But your measurements are full of noise, so when you plot the data on a graph, you get a scattershot. How can you figure out what line is the best match to your data, and how can you measure how good the match is?

When my dad taught me this, he was working for RCA manufacturing semiconductor chips for military and satellite applications. The focus of his work was building chips that would survive in the high-radiation environment of space - in the jargon, he was building *radiation hard* components. They'd put together a set of masks for an assembly line, and do a test run. Then they'd take the chips from that run, and they'd expose them to gamma radiation until they failed. That would give them a good way of estimating the actual radiation hardness of the run, and whether it was good enough for their customers. Based on a combination of theory and experience, they knew that the relationship they cared about was nearly linear: for a given amount of radiation, the number of specific circuitry failures was proportional to the amount of gamma exposure.

For example, here's a graph that I generated semi-randomly of data points. The distribution of the points isn't really what you'd get from real observations, but it's good enough for demonstration.

The way that we'd usually approach this is called *least square* linear regression. The idea is that what we want do do is create a line where the square of the vertical distance between the chosen line and the measured data points is a minimum.

For the purposes of this, we'll say that one quantity is the *independent value*, and we'll call that x, and the other quantity is the *dependent variable*, and we'll call that y. In theory, the dependent variable, as its name suggests *depends on* the independent variable. In fact, we don't always really know which value depends on the other, so we do our best to make an intelligent guess.

So what we want to do is find a linear equation, \(y = mx + b\) where the mean-square distance is minimal. All we need to do is find values for \(m\) (the slope of the line) and \(b\) (the point where the line crosses the \(y\) axis, also called the y intercept). And, in fact, \(b\) is relatively easy to compute once we know the slope of the line. So the real trick is to find the slope of the line.

The way that we do that is: first we compute the means of \(x\) and \(y\), which we'll call \(overline{x}\) and \(overline{y}\). Then using those, we compute the slope as:

\[ m = frac{Sigma_{i=1}^n (x-hat{x})(y-hat{y})}{Sigma_{i=1}^{n} (x-hat{x})^2}\]

Then for the y intercept: \(b = hat{y} - mhat{x}\).

In the case of this data: I set up the script so that the slope would be about 2.2 +/- 0.5. The slop in the figure is 2.54, and the y-intercept is 18.4.

Now, we want to check how good the linear relationship is. There's several different ways of doing that. The simplest is called the correlation coefficient, or \(r\).

\[ r = frac{left(Sigma (x-hat{x})right) left(Sigma (y - hat{y})right)}{

sqrt{ left(Sigma (x-hat{x})^2right) left(Sigma (y - hat{y})^2right) }

}

\]

If you look at this, it's really a check of how well the variation between the measured values and the expected values (according to the regression) match. On the top, you've got a set of products; on the bottom, you've got the square root of the same thing squared. The bottom is, essentially, just stripping the signs away. The end result is that if the correlation is perfect - that is, if the dependent variable increases linearly with the independent, then the correlation will be 1. If the dependency variable decreases linearly in opposition to the dependent, then the correlation will be -1. If there's no relationship, then the correlation will be 0.

For this particular set of data, I generated it with a linear equation with a little bit of random noise. The correlation coefficient is slighly greater than 0.95, which is exctly what you'd expect.

When you see people use linear regression, there are a few common errors that you'll see all the time.

- No matter what your data set looks like, linear regression
*will*find a line. It won't tell you "Oops, I couldn't find a match". So the fact that you fit a line means absolutely*nothing*by itself. If you're doing it right, you start off with a hypothesis based on prior plausibility for a linear relation, and you're using regression as*part of a process*to test that hypothesis. - You don't get to look at the graph before you do the analysis. What I mean by that is, if you look at the data, you'll naturally notice some patterns. Humans are pattern seekers - we're really good at noticing them. And almost any data set that you look at carefully enough will contain some patterns purely by chance. If you look at the data, and there's a particular pattern that you want to see, you'll probably find a way to look at the data that produces that pattern. For example, in the first post on this blog, I was looking at a shoddy analysis by some anti-vaxxers, who were claiming that they'd found an inflection point in the rate of autism diagnoses, and used linear regression to fit two lines - one before the inflection, one after. But that wasn't supported in the data. It was random - the data was very noisy. You could fit different lines to different sections by being selective. If you picked one time, you'd get a steeper slope before that time, and a shallower one after. But by picking different points, you could get a steeping slope after. The point is, when you're testing the data, you need to design the tests before you've seen the data, in order to keep your bias out!
- A strong correlation doesn't imply
*linear*correlation. If you fit a line to a bunch of data that's not really linear, you can still get a strong positive (or negative) correlation. Correlation is really testing whether the data is increasing the way you'd expect it to, not whether it's truly linear. Random data will have a near-zero correlation. Data where the dependent variable doesn't vary consistently with the independent will have near-zero correlation. But there are plenty of ways of getting data where the dependent and independent variables increase together that produce a strong correlation. You need to do other things to judge the strength of the fit. (I might do some more posts on this kind of thing to cover some of that.)

Takes me back to grad stats class. Might have been the last math class I ever took.

I teach my students least squares fit in the lab class for science/engineering majors that I teach. I love that stuff. It all makes so much sense.

Ok, but why least squares as opposed to other methods such as the Theil–Sen estimator that require fewer assumptions about how well-behaved the noise in your data is? (I.e. when you generated your data "semi-randomly" what noise model were you using and why do you expect that model to be realistic?)

I think it is important to note that the "linear" in linear regression refers to the fact that the model is assumed to be linear in the regression parameters (i.e. m and b in your example) but not necessarily assumes a linear relationship of the data. The model y = a x^2 + b^x + c is also linear in a, b, c as is the model y = a log(x) + b.

The solution to both such models in the least squares sense is obtained by solving a overdetermined linear system.

Why do we square the differences? I've always wondered why we just don't use the absolute values (to make them all positive). Squaring them seems to imply that large differences are much more important than small ones, but is that true? Or, if squaring is good, why not cubing? It seems very arbitrary, but I'm probably missing something. Thanks.

Why do you measure distances in the plane by square-summing the components instead of just summing their absolute values?

I can write another post getting more in depth, but the quick version for now: The square-summing approach gives you a line where the distribution of points around the line will produce a normal bell-curve.

This attitude is typical of the (very little) statistics exercises I've seen: these people rarely seem to wonder which sort of formula is likely to provide a more realistic model for the given situation, they are more concerned about picking something that allows them to do their usual math. What you should be adding to convince me is something like: "(...) and a Bell curve is roughly the distribution we expect to observe in reality".

That's part of the process of selecting a model.

In the context in which I was taught this, and which I explained in the post - that is, noisy experimental measurements - there's good reason to believe that the data should be a normal distribution which would produce a bell curve.

The most common cases where linear regression are used are similar: you have sound reasons to expect normal distribution of the data, which makes this method a good fit.

But it all comes back to the points at the end of the post. Linear regression in general, and least-squares in particular, are an analytical method that produces good results for certain types of data. If you want to do serious data analysis, you definitely need to know about more than just least squares - you need to know multiple methods, and you need to understand the properties and the assumption inherent in each method, use that knowledge to choose the method that's appropriate for the data you're analyzing, and test how well the result of your analysis works as a prediction.

We don't need to assume that the residuals / errors have a Gaussian distribution for least squares to be a useful method. Rather, we only need to assume that the errors are uncorrelated (not even independent!), have mean zero, and have identical variances (not even identically distributed!). This result is known as the Gauss-Markov theorem:

http://en.wikipedia.org/wiki/Gauss–Markov_theorem

Basically, if these assumptions hold, then the least-squares estimate gives you the best linear unbiased estimator for the regression function E[Y | X = x], which is the thing we're trying to estimate from the data. Of course, as you pointed out, there's no guarantee that the regression function *is* linear. But even if it's not, the least-squares model gives us the best possible *linear* model (thinking in terms of a Taylor expansion of the regression function E[Y | X = x] = a + b*x + O(x^2)).

From what my linear algebra prof said, I believe that's actually more intensive than least squares, but there is a method which does that.

Nassim Nicholas Taleb, among other people, has some considered criticisms of the least square linear regression, because of the un-stability (lack of robustness) of such from the action of the outliers.

The next time I perform a linear regression, I will superimpose 16 or 32 runs, each run omitting one or two outlier candidate points, chosen randomly. If the lines look like uncooked spaghetti dropped on the floor, I will know not to trust that data set to be able to give me a regression.

I more I learn about statistics, the more I see the value of the same person performing the analysis being the same person who is looking to inform a decision about an action (or inaction!) being the same person who will suffer a clawback or some other penalty if the action fails (or if the inaction leads to devastation).

Just randomly dropping outliers to see what happens is not a statistically sound practice. Depending on how you select them, it's easy to consistently grab values that are above or below the regression line, and thus be consistently skewing, which will bounce the line around.

In things like this, you really need to understand the properties of the data aggregate, in order to come up with statistically valid ways of testing.

For example, one method that's used on some data sets is to randomly divide the data into 5 partitions. Then you perform regression on each combination of four partitions. The results should be extremely close - and there are specific metrics that you can use to strictly measure that "extremely close" part.

But if the outliers were distributed inside each of the 5 partitions to cancel each other out and moderate their individual skewing effect (a possibility due to the randomness of the partitioning), wouldn't you have an effect as pernicious as "consistently grab[ing] values that are above or below the regression line" - a possible false sense of robustness in the 5 partitions method, compared to a possible false demonstration of lack of robustness in the dropping outliers method.

My (uninformed) hunch is that robustness of the least squares linear regression is an underdeveloped topic in the literature - so picking a method to detect lack of robustness on cost/benefit is not informed by the literature. Which gets back to my comment - whenever I dip my toe into statistics, I notice the rarity of the decision maker, with a possible loss on the line, performing the analysis herself or himself, where a decision about action/inaction hinges on the analysis. So unwarranted assumptions about stability abound.

We've been having a lot of debate in the office recently about OLS v. orthogonal distance regression/deming's regression (when you have variability in both x and y) - any thoughts?

[...] X Funding the preproposal. Good idea, or this way be dragons? The Real Wisdom of the Crowds Least Square Linear Regression Research [...]

Of course, we can't talk about this without throwing in some bad math:

http://www.talkorigins.org/faqs/c-decay.html

(With some bad physics thrown in for fun!)

Couple of typos (If I'm reading all this correctly). Right before you define m, you say the means of x and y are x-bar and y-bar, but in the formula for m, they're x^ and y^.

Also in the m formula, you sum with i going from 1 to n, but you don't use i anywhere. I think the xes and ys are supposed to have subscripts.

Finally, in the following paragraph, you have "slop" for "slope".

You don't *always* find a line do you? What about if your points are (-1,0), (1,0), (0,-1), and (0,1)? You get a mean of (0,0) and a slope of 0/0.

Would it be possible to do a series on estimation theory for dummies? It seems to be a hot topic at the moment, at least in some fields.