Guessing the regression line

If three or more points are given in the plane, there is usually no line that passes through all of them. The aim of linear regression is to approximate such a point cloud as well as possible by a straight line.

The method of least squares is quite common here: given the points $(x_1,y_1),\dots,(x_n,y_n)$, a line $y=ax+b$ is sought. As a first step, the vertical distance $y_i-(ax_i+b)$ between the point and the straight line is calculated for every point. This distance is then squared, which gives the squared residual $(y_i-(ax_i+b))^2$. The squared residuals are then summed, and $a$ and $b$ are determined so that this sum, the residual sum, is as small as possible. This minimization problem can be solved in general, which yields explicit formulas for $a$ and $b$.
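The explicit formulas mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this page: the function names `fit_line` and `residual_sum` and the sample points are assumptions chosen for the example. The slope is the ratio of the summed cross-deviations to the summed squared deviations of the x values, and the intercept follows from the fact that the line passes through the point of means.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing the sum of (y_i - (a*x_i + b))**2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Summed squared deviations of x and summed cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    a = sxy / sxx
    b = mean_y - a * mean_x  # the fitted line passes through (mean_x, mean_y)
    return a, b


def residual_sum(xs, ys, a, b):
    """Sum of squared vertical distances between the points and y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Illustrative sample points (made up for this example).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]

a, b = fit_line(xs, ys)
print(a, b, residual_sum(xs, ys, a, b))
```

For these four points the fit gives a slope of about 1.4, an intercept of about 0.5 and a residual sum of about 0.2. Any other choice of slope or intercept yields a larger residual sum.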

The regression line can always be calculated, even when a linear regression is not meaningful. In applications one unfortunately cannot always decide from the circumstances whether a linear relationship is a useful description. Measures of how well the regression line approximates the point cloud are therefore needed. The residual sum cannot be used directly as such a measure, because its value depends on the scale of the data: if, for example, the y values are expressed in other units, the residual sum changes. One remedy is the correlation coefficient: independently of the sample size and of the magnitude of the x and y values, it always yields a number between -1 and 1. For values close to 1 or -1, the points are well matched by a rising or falling straight line, respectively. For values near 0, a straight line is a rather inappropriate model for the point cloud.
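The difference between the two measures can be illustrated with a small Python sketch. The function names, the sample points and the change of units from metres to centimetres are assumptions made for this example. The residual sum of the best line grows by a factor of 100² when the y values are multiplied by 100, while the correlation coefficient stays the same.

```python
import math


def correlation(xs, ys):
    """Empirical correlation coefficient of the points (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def minimal_residual_sum(xs, ys):
    """Residual sum of the least squares line, computed from the fitted line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Illustrative sample points: y in metres, then the same data in centimetres.
xs = [1.0, 2.0, 3.0, 4.0]
ys_m = [2.0, 3.0, 5.0, 6.0]
ys_cm = [100.0 * y for y in ys_m]

print(minimal_residual_sum(xs, ys_m), minimal_residual_sum(xs, ys_cm))
print(correlation(xs, ys_m), correlation(xs, ys_cm))
```

The first printed pair differs by a factor of 10000, whereas the second pair agrees up to rounding.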

From a statistical point of view, linear regression is a simple form of a linear model: a random term, which accounts for the measurement error, is added to the straight line $ax+b$. It is very common to assume that the error is normally distributed with mean 0, i.e. that the observations are random variables $Y_1,\dots,Y_n$ with $Y_i=ax_i+b+Z_i$, where the $Z_i$ are normally distributed with expected value 0 and unknown variance $\sigma^2$. If $a$ and $b$ are to be estimated from a concrete sample, the resulting estimators are exactly those of the least squares method.
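A small simulation can make this model concrete. The following Python sketch draws observations $Y_i=ax_i+b+Z_i$ with normally distributed errors and then estimates $a$ and $b$ by least squares. The parameter values, the sample size, the fixed seed and all names are assumptions chosen purely for illustration.

```python
import random


def fit_line(xs, ys):
    """Least squares estimates (a, b) for the line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


# Assumed "true" parameters of the model (illustrative values only).
true_a, true_b, sigma = 2.0, 1.0, 1.0

rng = random.Random(0)  # fixed seed so that the run is reproducible
n = 2000
xs = [10.0 * i / (n - 1) for i in range(n)]  # fixed design points in [0, 10]
ys = [true_a * x + true_b + rng.gauss(0.0, sigma) for x in xs]  # Y_i = a*x_i + b + Z_i

a_hat, b_hat = fit_line(xs, ys)
print(a_hat, b_hat)
```

With this many observations the estimates should land close to the assumed values 2 and 1. With a smaller sample or a larger error variance they scatter more widely around the true parameters.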

In this model, the square of the correlation coefficient, called the coefficient of determination, has an important interpretation: the $y_i$ are not all the same because different values $x_i$ have been set and because of the normally distributed error term. The coefficient of determination gives the proportion of the variation of the $y_i$ that can be explained by the $x_i$. For values close to 1, a very large part of the variation of the $y_i$ can be explained by the variation of the $x_i$, i.e. by the linear regression. For values close to 0, the variation of the $y_i$ is mainly due to the normally distributed error and can hardly be explained by the regression.
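This interpretation rests on a decomposition of the variation of the $y_i$, which can be written out as follows. The notation is an assumption introduced here and does not appear in the text above: $\bar{y}$ denotes the mean of the $y_i$, $\hat{y}_i = ax_i+b$ the value on the least squares line at $x_i$, and $r$ the correlation coefficient. The `\text` command requires the amsmath package.

```latex
\[
\underbrace{\sum_{i=1}^{n} (y_i - \bar{y})^2}_{\text{total variation}}
= \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{\text{explained by the regression}}
+ \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{minimal residual sum}},
\qquad \hat{y}_i = a x_i + b
\]
\[
r^2
= \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]
```

The decomposition holds for the least squares line, not for an arbitrary line. A small minimal residual sum relative to the total variation therefore corresponds to a coefficient of determination close to 1.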

How the interactive figure works

In the figure below, a scatter plot of randomly generated points appears. With two mouse clicks you can try to define an appropriate regression line for the plot. You can also display the correct regression line. By comparing the residual sum of your proposal with the minimal residual sum, you can see how good your guess was.

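[Interactive figure: scatter plot of randomly generated points, with readouts for the current line, the residual sum of the current line, the optimal line, the minimal residual sum, and the empirical correlation coefficient.]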
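The comparison that the figure performs can be sketched offline in Python. This is only an illustration under stated assumptions, not the code of the figure itself: the way the points are generated, the two click positions and all function names are hypothetical. A guessed line is defined by two points, and its residual sum is compared with that of the least squares line.

```python
import random


def line_through(p, q):
    """Slope and intercept of the line through two points, e.g. two clicks."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        raise ValueError("a vertical line cannot be written as y = a*x + b")
    a = (y2 - y1) / (x2 - x1)
    return a, y1 - a * x1


def fit_line(xs, ys):
    """Least squares line (a, b) for the given points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def residual_sum(xs, ys, a, b):
    """Sum of squared vertical distances between the points and y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Randomly generated scatter plot (assumed generating rule, fixed seed).
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(30)]
ys = [0.5 * x + 2.0 + rng.gauss(0.0, 1.0) for x in xs]

guess = line_through((0.0, 1.5), (10.0, 8.0))  # hypothetical click positions
best = fit_line(xs, ys)

guess_sum = residual_sum(xs, ys, *guess)
best_sum = residual_sum(xs, ys, *best)
print("guessed line:", guess, "residual sum:", guess_sum)
print("optimal line:", best, "minimal residual sum:", best_sum)
```

The residual sum of the guessed line can never fall below the minimal residual sum, so the gap between the two numbers shows how far the guess is from the optimum.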

© TU Clausthal 2019