Problem on multiple linear regression variable selection

Section 3.3 in the book discusses ways to numerically evaluate a multiple linear regression model. An issue that often arises is whether using a subset of the available predictor variables might be better than using all of them. This can seem counter-intuitive to students who argue that we should use all the available variables even if some of them aren't particularly helpful. However, not only can some variables be unhelpful, their inclusion can sometimes make things worse. This can happen if the model overfits the sample data, so that within-sample predictions are "too specific" and don't generalize well when the model is used to predict response values for new data. The purpose of this exercise is to demonstrate the phenomenon of overfitting when some "redundant" predictor variables are included in a model.

Download the simulated data from one of the following files (in SPSS, text, and Excel format, respectively): SUBSET.SAV, SUBSET.TXT, SUBSET.XLS. There is a single response variable, Y, and nine possible predictor variables, X1, X2, X3, X4, X5, X6, X7, X8, and X9. The full sample dataset of 100 observations has been split in two, so that the first 50 observations can be used to fit the models and the remaining 50 observations can be used to see how well the models predict the response variable for new data. To facilitate this, Y1 represents the value of the response for the first 50 observations (and is missing for the rest) and Y2 represents the value of the response for the last 50 observations (and is missing for the rest).
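
For readers working outside SPSS or Excel, the split described above could be reproduced along the following lines. This is a minimal Python sketch, not part of the original exercise. It assumes that SUBSET.TXT is whitespace-delimited with a header row of variable names, and that missing responses appear as non-numeric tokens such as "." or "NA"; the actual layout of the file may differ. The function names read_subset and split_fit_holdout are hypothetical, and the four-row demo data are made up.

```python
import io
import math


def read_subset(handle):
    """Parse whitespace-delimited text with a header row into a list of dicts.

    Assumption: tokens that cannot be parsed as numbers (for example "." or
    "NA") mark missing values and are stored as NaN.
    """
    header = handle.readline().split()
    rows = []
    for line in handle:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        row = {}
        for name, token in zip(header, tokens):
            try:
                row[name] = float(token)
            except ValueError:
                row[name] = math.nan
        rows.append(row)
    return rows


def split_fit_holdout(rows):
    """Fitting rows have Y1 observed; holdout rows have Y2 observed."""
    fit = [r for r in rows if not math.isnan(r["Y1"])]
    holdout = [r for r in rows if not math.isnan(r["Y2"])]
    return fit, holdout


# Tiny made-up stand-in for SUBSET.TXT (two predictors only, four rows).
demo = io.StringIO(
    "Y1 Y2 X1 X2\n"
    "1.5 . 0.1 0.2\n"
    "2.5 . 0.3 0.4\n"
    ". 3.5 0.5 0.6\n"
    ". 4.5 0.7 0.8\n"
)
fit_rows, holdout_rows = split_fit_holdout(read_subset(demo))
print(len(fit_rows), len(holdout_rows))
```

With the real file, the io.StringIO object would be replaced by an open file handle, for example open("SUBSET.TXT"), provided the layout assumptions above hold.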

  1. Fit a multiple linear regression model using all nine X variables as the predictors and Y1 as the response. Make a note of the values of the regression standard error, s, the coefficient of determination, R2, and also adjusted R2.
  2. You should have noticed that the individual p-values for X6, X7, X8, and X9 are on the large side, suggesting that they perhaps provide redundant information about Y1 beyond that provided by X1, X2, X3, X4, and X5. Do a nested model F-test to confirm this.
  3. For the reduced model with just X1, X2, X3, X4, and X5, make a note of the values of the regression standard error, s, the coefficient of determination, R2, and also adjusted R2. You should find that s has decreased (implying increased predictive ability), R2 has decreased (which is inevitable when predictors are removed, see page 84, so this finding tells us nothing useful), and adjusted R2 has increased (implying that inclusion of X6, X7, X8, and X9 was perhaps causing some overfitting).
  4. Calculate predicted response values for the last 50 observations under both models. Then calculate squared errors (the squared differences between Y2 and the predicted response values). Finally, calculate the square root of the mean of these squared errors; this is known as the "root mean squared error" or "RMSE." You should find that the RMSE under the first (complete) model is some 5% larger than the RMSE under the second (reduced) model. In other words, we can make more accurate predictions of the response value in a new dataset by using fewer predictors in our model. This confirms that for this dataset, using all nine predictors leads to within-sample overfitting and using just the first five predictors leads to more accurate out-of-sample predictions.

Last updated: November, 2006

© 2006, Iain Pardoe, Lundquist College of Business, University of Oregon