Guessing the regression line

If three or more arbitrary points in the plane are given, it is usually no longer possible to find a straight line that passes through all of them. The goal of linear regression is to approximate such a point cloud as well as possible by a straight line.

Especially common is the method of least squares: given points (x1,y1),...,(xn,yn), a straight line y = ax + b is sought. First, the vertical distance yi - (axi + b) between each point and the line is determined; squaring it yields the residuals (yi - (axi + b))². The residual sum is then formed by adding up all residuals, and a and b are chosen so that this sum becomes as small as possible. This minimization problem can be solved in general, yielding closed-form formulas for a and b.
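As a sketch, the closed-form solution of this minimization problem can be computed directly; the function name `least_squares` is purely illustrative:

```python
def least_squares(xs, ys):
    """Fit y = a*x + b by minimizing the residual sum of squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form solution of the minimization problem:
    # a = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2)
    # b = y_mean - a * x_mean
    a = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    b = y_mean - a * x_mean
    # Residual sum: the quantity that was minimized
    rss = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    return a, b, rss
```

For points lying exactly on a line, e.g. (1,3), (2,5), (3,7) on y = 2x + 1, the formulas recover a = 2 and b = 1 with residual sum 0.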

The regression line can always be calculated, even when a linear regression is not appropriate at all. In applications, unfortunately, it cannot always be decided from the context whether a linear relationship is a meaningful description or not. Measures for the quality of the fit of the regression line to the point cloud are therefore necessary. The residual sum cannot be used directly as such a measure, since its value depends on the magnitude of the x- and y-values: if, for example, different units are chosen, the residual sum changes as well. One way out is the correlation coefficient: it always yields numbers between -1 and 1, independent of sample size and of the magnitude of the x and y values. For values close to 1 or -1, the points can be fitted well by a rising or falling straight line, respectively. For values close to 0, however, fitting the point cloud by a straight line is a poor model description.
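A sketch of the empirical correlation coefficient described above; unlike the residual sum, its value does not change when the units of the data are rescaled:

```python
def correlation(xs, ys):
    """Empirical (Pearson) correlation coefficient, always in [-1, 1]."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    # Rescaling the y-values by a positive factor multiplies both sxy and
    # sqrt(syy) by that factor, so the quotient is unit-independent.
    return sxy / (sxx * syy) ** 0.5
```

Points on a rising line give exactly 1, points on a falling line exactly -1, and converting, say, meters to centimeters leaves the value unchanged.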

From a statistical point of view, linear regression is a simple form of a linear model: a random value is added to the value ax + b of the straight line, accounting for measurement errors, for example. A very common assumption is that the error follows a normal distribution with expected value 0, i.e. the observations are random variables Y1,...,Yn with Yi = axi + b + Zi, where Zi is normally distributed with expected value 0 and an unknown variance σ². If a and b are now to be estimated from a concrete sample, the resulting estimators for a and b are exactly those of the method of least squares.
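A minimal simulation of this model, with hypothetical parameters a = 2, b = 1 and σ = 0.5, illustrates that the least-squares estimates recover a and b up to random error:

```python
import random

random.seed(1)  # fixed seed for reproducibility

a_true, b_true, sigma = 2.0, 1.0, 0.5  # hypothetical model parameters
xs = [i / 10 for i in range(100)]      # fixed design points x1, ..., xn
# Observations Yi = a*xi + b + Zi with Zi ~ N(0, sigma^2)
ys = [a_true * x + b_true + random.gauss(0, sigma) for x in xs]

# Least-squares estimates for a and b
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
a_hat = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
    / sum((x - x_mean) ** 2 for x in xs)
b_hat = y_mean - a_hat * x_mean
```

With 100 observations and this noise level, a_hat and b_hat typically land close to the true values 2 and 1.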

In this model, the square of the correlation coefficient, the coefficient of determination, has an important interpretation: the fact that the yi are not all equal is due on the one hand to the different values xi that were set, and on the other hand to the normally distributed error term. The coefficient of determination gives the proportion of the variation of the yi that is explained by the xi. For values close to 1, a very large part of the variation of the yi is thus explained by the choice of the xi, i.e. by the linear regression, while for values close to 0 the variation of the yi is mainly due to the normally distributed error, and the regression hardly explains why different yi resulted.
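This interpretation can be checked numerically: for the least-squares line, the explained share of the total variation, 1 - RSS/TSS, equals the squared correlation coefficient. A sketch under that identity:

```python
def r_squared(xs, ys):
    """Coefficient of determination 1 - RSS/TSS for the least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Least-squares slope and intercept
    a = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    b = y_mean - a * x_mean
    rss = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    tss = sum((y - y_mean) ** 2 for y in ys)                   # total variation
    return 1 - rss / tss
```

For any data set, this value agrees with the square of the empirical correlation coefficient and lies between 0 and 1.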

Function of the interactive figure

In the figure below, a scatter plot of randomly generated points appears. With two mouse clicks you can try to define an appropriate regression line for the plot. You can also display the correct regression line. By comparing the residual sum of your proposal with the minimal residual sum, you can see how good your estimate was.

Current line:
Residual sum of the line:
Optimal line:
Minimal residual sum:
Empirical correlation coefficient: