In the simplest case, we determine whether a linear relationship exists between an independent variable x and a dependent variable y, that is, whether the data pairs (x, y) lie approximately on a straight line. Suppose we want to predict the cholesterol level from the daily time spent on physical exercise (Table 1).
Table 1: Data on levels of cholesterol in a group of people doing daily exercise for three months (from Bärlocher, 1999).
As we would expect, more physical exercise tends to be associated with lower cholesterol levels. The relationship between the two variables can be approximated by a straight line (see Fig. 1), and we could use this fitted line to predict the expected level of cholesterol for any given time of physical exercise. But even with a very strong relationship, as here, there is still some variation in the level of cholesterol that can’t be accounted for by our linear model: the level of cholesterol sometimes lies above the line and sometimes below. In simple linear regression, we take account of this unexplained variation by using a model of the following form:
C = b0 + b1*t + e
C is the level of cholesterol, t is the time of exercise, b0 and b1 are the coefficients of the line, and e represents the unexplained variation. We can’t predict the size or direction of e for an individual observation, but we can say something about how large it is likely to be.
Fitting the model
Before we can use our model to make predictions, we need to estimate the coefficients b0 and b1. We do this by fitting a line to our data, for example by using the criterion of least squares. The idea is to choose the line that minimizes the sum of the squares of the vertical distances between the observed values of the response and the values predicted by the model. The constant b0 is the intercept with the y axis, and b1 is the slope of the straight line. MaxStat carries out the required calculations, as shown in Fig. 2 for the above data.
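As a rough illustration of the least squares criterion, the following Python sketch computes b0 and b1 from the usual closed-form expressions. The function name fit_line and the data values are assumptions made for this example: the values are invented, they are not the data of Table 1, and the code is not MaxStat's implementation.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for the model y = b0 + b1 * x + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and sum of cross products of x and y
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept: the line passes through the means
    return b0, b1


# Invented illustration values (NOT the Table 1 data)
t = [0, 10, 20, 30, 40]           # time of exercise
C = [222, 197, 168, 143, 115]     # level of cholesterol

b0, b1 = fit_line(t, C)
print(f"C = {b0:.1f} + ({b1:.2f}) * t")
```

The slope is the ratio of the cross-product sum to the sum of squares of t, and the intercept follows from the fact that the least squares line always passes through the point of the two means.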
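We obtain the equation C = 223.1 - 2.71 * t. With it, we can predict the level of cholesterol for any time of daily physical exercise. The negative slope means that each additional unit of exercise time is associated with a cholesterol level that is lower by 2.71.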
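Using the fitted equation for prediction amounts to evaluating the line. A minimal sketch follows; the function name predict_cholesterol and the chosen exercise times are assumptions for illustration, and t is taken to be in the same units as in Table 1.

```python
# Coefficients of the fitted line reported in the text
B0 = 223.1    # intercept
B1 = -2.71    # slope


def predict_cholesterol(t):
    """Predicted level of cholesterol for a daily exercise time t."""
    return B0 + B1 * t


# Example exercise times chosen only for illustration
for t in (0, 10, 20):
    print(f"t = {t}: predicted C = {predict_cholesterol(t):.1f}")
```

Predictions of this kind are only trustworthy for exercise times within the range covered by the observed data.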
NOTE: WE CANNOT ASSUME THAT AN INCREASE IN PHYSICAL EXERCISE CAUSES THE LOWERING OF CHOLESTEROL LEVELS. THE REGRESSION SHOWS AN ASSOCIATION, NOT A CAUSE.
How good is the model?
Generally, we want to know how well our model describes the observed values. We use the correlation coefficient R, which can take values between -1 and 1. The coefficient gives us two important pieces of information. First, its sign matches the sign of the slope: if y increases as x increases, R is positive, and if y decreases as x increases, R is negative. In our example, the level of cholesterol decreases with increasing time of exercise, so both the slope and R are negative. Second, the square of R (R2) is called the coefficient of determination. It tells us the fraction of the overall variation in y that can be explained by the linear model. For example, with an R2 of 0.9918, the model explains 99.18% of the variation.
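The two quantities can be computed directly from the data. The sketch below is only an illustration: the function name correlation and the data values are assumptions, and the values are invented rather than taken from Table 1.

```python
import math


def correlation(x, y):
    """Correlation coefficient R between x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Invented illustration values (NOT the Table 1 data)
t = [0, 10, 20, 30, 40]           # time of exercise
C = [222, 197, 168, 143, 115]     # level of cholesterol

r = correlation(t, C)
r2 = r ** 2                       # coefficient of determination
print(f"R = {r:.4f}, R2 = {r2:.4f}")
```

Because C falls as t rises in these values, R comes out negative, while R2 is always between 0 and 1.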
In a good model, the confidence intervals (C.I.) of b0 and b1 are narrow, so that predictions from the model are precise (see Fig. 2). The slope b1 should also be significantly different from zero; otherwise the data give no evidence of a linear relationship between the two variables.
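How the confidence interval and the significance of the slope arise can be sketched as follows. This is an illustration under stated assumptions, not MaxStat's procedure: the function name slope_inference and the data values are invented, and the critical value 3.182 is the two-sided 95% value of Student's t for 3 degrees of freedom, taken from a t table.

```python
import math


def slope_inference(x, y, t_crit):
    """Slope, its standard error, t statistic and confidence interval."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual sum of squares and residual variance with n - 2 degrees of freedom
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    se_b1 = math.sqrt(s2 / sxx)   # standard error of the slope
    t_stat = b1 / se_b1           # test statistic for the hypothesis b1 = 0
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return b1, se_b1, t_stat, ci


# Invented illustration values (NOT the Table 1 data)
t = [0, 10, 20, 30, 40]
C = [222, 197, 168, 143, 115]

T_CRIT = 3.182  # assumption: 95% two-sided t value for n - 2 = 3 degrees of freedom
b1, se_b1, t_stat, ci = slope_inference(t, C, T_CRIT)
print(f"b1 = {b1:.2f}, 95% C.I. = ({ci[0]:.2f}, {ci[1]:.2f}), t = {t_stat:.1f}")
```

If the absolute value of the t statistic exceeds the critical value, or equivalently if the confidence interval does not contain zero, the slope differs significantly from zero at the chosen level.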
The approach described here assumes a linear relationship between the two variables. An examination of the residuals often indicates whether the relationship really is linear. Residuals are the differences between the measured values yi and the values of y predicted by the model. The residuals should show no trend, as illustrated in Fig. 3. They should also follow a normal distribution, which we can check with a normality test.
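To illustrate what a trend in the residuals looks like, the sketch below fits a straight line to invented data that actually follow a curve (y equal to x squared). The function names and the data are assumptions chosen for this example only.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for y = b0 + b1 * x + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def residuals(x, y, b0, b1):
    """Measured values minus the values predicted by the fitted line."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Invented data that are curved, not linear: y = x squared
x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]

b0, b1 = fit_line(x, y)
res = residuals(x, y, b0, b1)
print(res)
```

The residuals are positive at both ends and negative in the middle. Such a systematic pattern suggests that a straight line is not an adequate model, whereas residuals scattered without pattern around zero support linearity.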
One of the common assumptions underlying least squares regression analysis is that the standard deviation of the error term is constant over all values of the predictor variable. This assumption does not hold for every data set. In a weighted regression analysis, we give less weight to the less precise measurements and more weight to the more precise measurements when estimating the unknown parameters. For example, weights inversely proportional to the variance at each level of the explanatory variable yield the most precise parameter estimates possible. The main advantage of weighted over unweighted regression analysis is the ability to handle data points of varying precision. Weighted least squares can also be used efficiently for small data sets. The biggest disadvantage of weighted least squares is that the weights are rarely known exactly, so estimated weights must be used instead. The effect of using estimated weights is difficult to assess, but small variations in the weights due to estimation generally do not affect a regression analysis much. This is not the case, however, when the weights are estimated from small numbers of replicated observations.
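A weighted fit differs from the ordinary one only in that every sum is multiplied by the weight of the data point. The following sketch shows one way to write this for a straight line. The function name fit_weighted_line, the data and the variances are assumptions invented for illustration; in practice the variances would have to be estimated.

```python
def fit_weighted_line(x, y, w):
    """Weighted least squares estimates (b0, b1) for y = b0 + b1 * x + e."""
    sw = sum(w)
    # Weighted means of x and y
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Invented illustration values (NOT the Table 1 data)
t = [0, 10, 20, 30, 40]
C = [222, 197, 168, 143, 115]
variance = [1.0, 1.0, 4.0, 4.0, 16.0]    # assumed variance of each measurement

weights = [1.0 / v for v in variance]    # weights inversely proportional to variance
b0_w, b1_w = fit_weighted_line(t, C, weights)
b0_u, b1_u = fit_weighted_line(t, C, [1.0] * len(t))   # equal weights = unweighted fit

print(f"weighted:   C = {b0_w:.1f} + ({b1_w:.2f}) * t")
print(f"unweighted: C = {b0_u:.1f} + ({b1_u:.2f}) * t")
```

With equal weights the function reproduces the ordinary least squares line. With unequal weights the fitted line is drawn towards the more precise measurements.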