A multiple linear regression analysis is carried out to predict the values of a dependent variable, Y, given a set of k explanatory variables (X1, X2, …, Xk). Multiple regression is a logical extension of the principles of simple linear regression to situations in which there are several predictor variables. For instance, if we have two predictor variables, X1 and X2, then the form of the model is given by:
Y = b0 + b1*X1 + b2*X2 + e
which comprises a deterministic component involving the three regression coefficients (b0, b1 and b2) and a random component involving the residual (error) term, e.
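As a concrete sketch, the two-predictor model can be estimated by ordinary least squares. The data below are synthetic: the sample size, noise level, and "true" coefficients are assumptions made up purely for illustration.

```python
import numpy as np

# Synthetic illustration: the "true" coefficients (b0=1, b1=2, b2=-3),
# sample size, and noise scale are assumptions for this sketch.
rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
e = rng.normal(scale=0.5, size=n)       # random component (error term)
Y = 1.0 + 2.0 * X1 - 3.0 * X2 + e       # deterministic + random component

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones(n), X1, X2])

# Least-squares estimates of (b0, b1, b2).
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(b)   # close to the true values [1.0, 2.0, -3.0]
```

With a reasonable sample size and modest noise, the estimated coefficients land close to the values used to generate the data.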
The error term e is unknown because the true model is unknown. Once the model has been estimated, the regression residuals are defined as the difference between the observed and predicted values. The residuals measure the closeness of fit of the predicted values. The algorithm for estimating the regression equation (solution of the normal equations) guarantees that the residuals have a mean of zero. The variance of the residuals measures the “size” of the error, and is small if the model fits the data well.
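The zero-mean property and the residual variance can be checked numerically. Again the data are synthetic, chosen only to illustrate the definitions above.

```python
import numpy as np

# Synthetic data (one predictor plus intercept); all numbers here
# are illustrative assumptions, not values from the text.
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, Y, rcond=None)
residuals = Y - X @ b                   # observed minus predicted values

# Solving the normal equations forces the residual mean to zero
# (up to floating-point error) when the model includes an intercept.
print(residuals.mean())

# The residual variance measures the "size" of the error.
print(residuals.var(ddof=2))
```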
Statistics in multiple linear regression
The explanatory power of the regression is summarized by its R2 value, also called the coefficient of determination, which is often described as the proportion of variance explained by the regression. We need to keep in mind that a high R2 does not imply causation. The R2 value for a regression can be made arbitrarily high simply by including more and more predictors in the model. The adjusted R2 is one of several statistics that attempt to compensate for this artificial increase in accuracy.
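Both statistics follow directly from the sums of squares. A minimal sketch on synthetic data (sample size, coefficients, and noise are assumptions for illustration):

```python
import numpy as np

# Synthetic data; the coefficients and sizes are illustrative assumptions.
rng = np.random.default_rng(2)
n, k = 50, 2                            # k = number of predictors
X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(k)])
Y = X @ np.array([1.0, 0.8, -0.6]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, Y, rcond=None)
ss_res = np.sum((Y - X @ b) ** 2)       # residual sum of squares
ss_tot = np.sum((Y - Y.mean()) ** 2)    # total sum of squares

r2 = 1 - ss_res / ss_tot                         # coefficient of determination
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # penalizes extra predictors
print(r2, adj_r2)   # adjusted R2 never exceeds R2
```

The adjustment divides each sum of squares by its degrees of freedom, so adding a predictor that contributes little explanatory power can lower the adjusted R2 even as the raw R2 rises.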
The F value, which is computed from the mean squared terms, is used to assess the statistical significance of the regression equation. The advantage of the F value over R2 is that the F value takes into account the degrees of freedom, which depend on the sample size and the number of predictors in the model. A model can have a high R2 and still not be statistically significant if the sample size is not large compared with the number of predictors in the model. The F value incorporates sample size and number of predictors in an assessment of significance of the relationship.
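The mean squared terms divide each sum of squares by its degrees of freedom: k for the regression and n - k - 1 for the residuals. A sketch on synthetic data (all numbers are illustrative assumptions):

```python
import numpy as np

# Synthetic data; sample size and coefficients are illustrative assumptions.
rng = np.random.default_rng(2)
n, k = 50, 2
X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(k)])
Y = X @ np.array([1.0, 0.8, -0.6]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, Y, rcond=None)
ss_res = np.sum((Y - X @ b) ** 2)       # residual sum of squares
ss_tot = np.sum((Y - Y.mean()) ** 2)    # total sum of squares

# Mean squares: regression SS over k, residual SS over n - k - 1.
ms_reg = (ss_tot - ss_res) / k
ms_res = ss_res / (n - k - 1)
F = ms_reg / ms_res
print(F)   # compared against an F(k, n - k - 1) distribution
```

Because the residual degrees of freedom shrink as predictors are added, a small sample with many predictors can produce a high R2 but an unimpressive F value.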
Examples where multiple linear regression may be used
• To predict an individual's income given several socioeconomic characteristics.
• To predict the overall examination performance of students in ‘A’ levels, given the values of a set of exam scores at age 16.
• To estimate systolic or diastolic blood pressure, given a variety of socioeconomic and behavioral characteristics (occupation, drinking, smoking, age, etc.).
As is the case with simple linear regression and correlation, this analysis does not allow us to make causal inferences, but it does allow us to investigate how a set of explanatory variables is associated with a dependent variable of interest.
The MLR model is based on several assumptions. Provided the assumptions are satisfied, the regression coefficient estimates are optimal in the sense that they are unbiased, efficient, and consistent.
• Linearity: the relationship between the dependent variable and the explanatory variables is linear. The MLR model applies to linear relationships. If relationships are nonlinear, there are two things we can do: (i) we can transform the data to make the relationships linear, or (ii) we can use an alternative statistical model. We can use scatterplots as an exploratory step in regression to identify possible departures from linearity.
• Nonstochastic: The errors are uncorrelated with the individual explanatory variables. This assumption is checked in residuals analysis with scatterplots of the residuals against individual predictors. Violation of the assumption might suggest a transformation of the explanatory variables.
• Zero mean: The expected value of the residuals is zero. This is not a problem because the least squares method of estimating regression equations guarantees that the mean is zero.
• Normality: the error term is normally distributed. This assumption must be satisfied for conventional tests of significance of coefficients and other statistics of the regression equation to be valid.
• Non-autoregression: The residuals are random, or uncorrelated in time. This assumption is the one most likely to be violated in time series applications.
• Constant variance: The variance of the residuals is constant. In time series applications, a violation of this assumption is indicated by some organized pattern of dependence of the residuals on time. An example of violation is a pattern of residuals whose scatter (variance) increases over time. Another aspect of this assumption is that the error variance should not change systematically with the size of the predicted values.
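Several of these assumptions can be screened numerically in a residual analysis. The sketch below uses synthetic data; the Durbin-Watson statistic, which is near 2 when successive residuals are uncorrelated, is one common check for the non-autoregression assumption.

```python
import numpy as np

# Synthetic data; all sizes and coefficients are illustrative assumptions.
rng = np.random.default_rng(3)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ b

# Zero mean: guaranteed by least squares when an intercept is included.
print(resid.mean())

# Nonstochastic: residuals should be uncorrelated with each predictor.
print(np.corrcoef(resid, X[:, 1])[0, 1])

# Non-autoregression: Durbin-Watson statistic, near 2 when successive
# residuals are uncorrelated.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(dw)
```

For the normality and constant-variance assumptions, graphical checks such as a histogram of the residuals and a scatterplot of residuals against predicted values remain the usual exploratory tools.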