Guessing the regression line

If three or more arbitrary points in the plane are given, it is usually no longer possible to find a straight line that passes through all of them. The goal of linear regression is to approximate such a point cloud as well as possible by a straight line.

Especially common is the method of least squares: given points (x1,y1),...,(xn,yn), a straight line y = ax + b is sought. First, the vertical distance yi - (axi + b) between each point and the line is determined; squaring it yields the residuals (yi - (axi + b))². The residual sum is then formed by adding up all residuals, and a and b are chosen so that this sum becomes as small as possible. This minimization problem can be solved in general, yielding closed-form formulas for a and b.
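As a sketch, the closed-form solution of this minimization problem can be computed directly; the function name `least_squares` is purely illustrative:

```python
def least_squares(xs, ys):
    """Fit y = a*x + b by minimizing the residual sum of squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form solution of the minimization problem:
    # a = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2)
    # b = y_mean - a * x_mean
    a = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    b = y_mean - a * x_mean
    # Residual sum: the quantity that was minimized
    rss = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    return a, b, rss
```

For points lying exactly on a line, e.g. (1,3), (2,5), (3,7) on y = 2x + 1, the formulas recover a = 2 and b = 1 with residual sum 0.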

The regression line can always be calculated, even when a linear regression is not appropriate at all. In applications, unfortunately, it cannot always be decided from the context whether a linear relationship is a meaningful description or not. Measures for the quality of the fit of the regression line to the point cloud are therefore necessary. The residual sum cannot be used directly as such a measure, since its value depends on the magnitude of the x- and y-values: if, for example, different units are chosen, the residual sum changes as well. One way out is the correlation coefficient: it always yields numbers between -1 and 1, independent of sample size and of the magnitude of the x and y values. For values close to 1 or -1, the points can be fitted well by a rising or falling straight line, respectively. For values close to 0, however, fitting the point cloud by a straight line is a poor model description.
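A sketch of the empirical correlation coefficient described above; unlike the residual sum, its value does not change when the units of the data are rescaled:

```python
def correlation(xs, ys):
    """Empirical (Pearson) correlation coefficient, always in [-1, 1]."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    # Rescaling the y-values by a positive factor multiplies both sxy and
    # sqrt(syy) by that factor, so the quotient is unit-independent.
    return sxy / (sxx * syy) ** 0.5
```

Points on a rising line give exactly 1, points on a falling line exactly -1, and converting, say, meters to centimeters leaves the value unchanged.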

From a statistical point of view, linear regression is a simple form of a linear model: a random value is added to the value ax + b of the straight line, accounting for measurement errors, for example. A very common assumption is that the error follows a normal distribution with expected value 0, i.e. the observations are random variables Y1,...,Yn with Yi = axi + b + Zi, where Zi is normally distributed with expected value 0 and an unknown variance σ². If a and b are now to be estimated from a concrete sample, the resulting estimators for a and b are exactly those of the method of least squares.
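A minimal simulation of this model, with hypothetical parameters a = 2, b = 1 and σ = 0.5, illustrates that the least-squares estimates recover a and b up to random error:

```python
import random

random.seed(1)  # fixed seed for reproducibility

a_true, b_true, sigma = 2.0, 1.0, 0.5  # hypothetical model parameters
xs = [i / 10 for i in range(100)]      # fixed design points x1, ..., xn
# Observations Yi = a*xi + b + Zi with Zi ~ N(0, sigma^2)
ys = [a_true * x + b_true + random.gauss(0, sigma) for x in xs]

# Least-squares estimates for a and b
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
a_hat = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
    / sum((x - x_mean) ** 2 for x in xs)
b_hat = y_mean - a_hat * x_mean
```

With 100 observations and this noise level, a_hat and b_hat typically land close to the true values 2 and 1.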

In this model, the square of the correlation coefficient, the coefficient of determination, has an important interpretation: the fact that the yi are not all equal is due on the one hand to the different values xi that were set, and on the other hand to the normally distributed error term. The coefficient of determination gives the proportion of the variation of the yi that is explained by the xi. For values close to 1, a very large part of the variation of the yi is thus explained by the choice of the xi, i.e. by the linear regression, while for values close to 0 the variation of the yi is mainly due to the normally distributed error, and the regression hardly explains why different yi resulted.
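This interpretation can be checked numerically: for the least-squares line, the explained share of the total variation, 1 - RSS/TSS, equals the squared correlation coefficient. A sketch under that identity:

```python
def r_squared(xs, ys):
    """Coefficient of determination 1 - RSS/TSS for the least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Least-squares slope and intercept
    a = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    b = y_mean - a * x_mean
    rss = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    tss = sum((y - y_mean) ** 2 for y in ys)                   # total variation
    return 1 - rss / tss
```

For any data set, this value agrees with the square of the empirical correlation coefficient and lies between 0 and 1.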

Function of the interactive figure

In the figure below, a scatter plot of randomly generated points appears. With two mouse clicks you can try to define an appropriate regression line for the plot. You can also display the correct regression line. By comparing the residual sum of your proposal with the minimal residual sum, you can see how good your estimate was.

Current line:
Residual sum of the line:
Optimal line:
Minimal residual sum:
Empirical correlation coefficient: