If three or more points in the plane are given, it is usually no longer possible to find a line that passes through all of them. The aim of **linear regression** is to approximate such a point cloud as well as possible by a straight line.

The method of **least squares** is the standard approach here: given points (x_{1},y_{1}),...,(x_{n},y_{n}), a line y=ax+b is sought. First, for each point the vertical distance y_{i}-(ax_{i}+b) between the point and the line is calculated. This distance is then squared, yielding the residuals (y_{i}-(ax_{i}+b))^{2}. The residuals are summed, and a and b are chosen so that this sum is as small as possible. This minimization problem can be solved in general, yielding closed-form formulas for a and b.
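The closed-form solution can be sketched in a few lines of Python (a minimal illustration; the function and variable names are our own, not part of the text):

```python
def least_squares_line(xs, ys):
    """Fit y = a*x + b by least squares, using the closed-form formulas."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: a = sum((x_i - x_mean)*(y_i - y_mean)) / sum((x_i - x_mean)^2)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a = sxy / sxx            # slope
    b = y_mean - a * x_mean  # intercept
    return a, b

a, b = least_squares_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
```

For the four sample points above, the fitted line works out to y ≈ 1.94x + 0.15.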

The regression line can be calculated in any case, even when a linear regression is not meaningful. In applications, unfortunately, one cannot always tell from the circumstances whether a linear relationship is a useful description. Therefore, measures of how well the regression line approximates the point cloud are needed. The residual sum cannot be used directly as such a measure, because its value depends on the magnitude of the x and y values: if, for example, other units are chosen, the residual sum changes. One solution is the **correlation coefficient**: independently of the sample size and the magnitude of the x and y values, it always yields a number between -1 and 1. For values close to 1 or -1, the points are well approximated by a rising or falling straight line; for values near 0, fitting the point cloud with a straight line is a rather inappropriate model description.
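The unit-invariance claimed here is easy to check numerically. The following sketch computes the empirical correlation coefficient (again with illustrative names of our own):

```python
import math

def correlation(xs, ys):
    """Empirical (Pearson) correlation coefficient, always in [-1, 1]."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = correlation([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
# Rescaling both axes (e.g. changing units) leaves r unchanged,
# whereas the residual sum would change.
r_scaled = correlation([100, 200, 300, 400], [210, 390, 620, 780])
```

Both calls return the same value, close to 1, since the data are nearly on a rising line.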

From a statistical point of view, linear regression is a simple form of a linear model: a random term accounting for the measurement error is added to the line ax+b. It is very common to assume that the error is normally distributed with mean 0, i.e. that the observations are random variables Y_{1},...,Y_{n} with Y_{i}=ax_{i}+b+Z_{i}, where the Z_{i} are normally distributed with expected value 0 and unknown variance σ^{2}. If a and b are to be estimated from a concrete sample, exactly the estimators of the least-squares method result.
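This model is straightforward to simulate, which also illustrates that the least-squares estimates recover a and b up to the noise. A sketch with arbitrarily chosen parameters a = 2, b = 1, σ = 0.5 (not from the text):

```python
import random

random.seed(42)  # reproducible noise

a_true, b_true, sigma = 2.0, 1.0, 0.5  # illustrative model parameters
xs = [i / 10 for i in range(1, 51)]
# Observations Y_i = a*x_i + b + Z_i with Z_i ~ N(0, sigma^2)
ys = [a_true * x + b_true + random.gauss(0.0, sigma) for x in xs]

# Least-squares estimators for a and b
n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
a_hat = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
b_hat = y_mean - a_hat * x_mean
```

With 50 points and moderate noise, the estimates land close to the true values.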

In this model, the square of the correlation coefficient, the coefficient of determination, has an important interpretation: that the y_{i} are not all the same is due both to the different values x_{i} that were set and to the normally distributed error term. The coefficient of determination gives the proportion of the variation of the y_{i} that can be explained by the x_{i}. For values close to 1, a very large part of the variation of the y_{i} is explained by the choice of the x_{i}, i.e. by the linear regression. For values close to 0, the variation of the y_{i} is mainly due to the normally distributed error and can hardly be explained by the regression.
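This decomposition of the variation can be sketched as well: the coefficient of determination is one minus the ratio of the residual sum to the total variation of the y_{i} (a minimal sketch; names are our own):

```python
def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a = sxy / sxx
    b = y_mean - a * x_mean
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))  # residual sum
    ss_tot = sum((y - y_mean) ** 2 for y in ys)                   # total variation
    return 1.0 - ss_res / ss_tot

r2 = r_squared([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
```

For these sample points r² ≈ 0.996, i.e. almost all of the variation of the y_{i} is explained by the line; as stated above, this value equals the square of the empirical correlation coefficient.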

The figure below shows a scatter plot of randomly generated points. With two mouse clicks you can try to define an appropriate regression line for the plot. You can also display the correct regression line. By comparing the residual sum of your proposal with the minimal residual sum, you can see how good your estimate was.

Current line:

Residual sum of the line:

Optimal line:

Minimal residual sum:

Empirical correlation coefficient: