Section 3.3 of the book discusses ways to evaluate a multiple linear regression model numerically. A question that often arises is whether a subset of the available predictor variables might work better than all of them. This can seem counter-intuitive to students who argue that we should use every available variable, even if some aren't particularly helpful. However, not only can some variables be unhelpful, their inclusion can sometimes make things worse. This happens when the model overfits the sample data: within-sample predictions become "too specific" and do not generalize well when the model is used to predict response values for new data. The purpose of this exercise is to demonstrate the overfitting that can occur when some "redundant" predictor variables are included in a model.
Download the simulated data from one of the following files (in SPSS, text, and Excel format, respectively): SUBSET.SAV, SUBSET.TXT, SUBSET.XLS. There is a single response variable, Y, and nine possible predictor variables, X1, X2, X3, X4, X5, X6, X7, X8, and X9. The full sample dataset of 100 observations has been split in two, so that the first 50 observations can be used to fit the models and the remaining 50 observations can be used to see how well the models predict the response variable for new data. To facilitate this, Y1 represents the value of the response for the first 50 observations (and is missing for the rest) and Y2 represents the value of the response for the last 50 observations (and is missing for the rest).
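The fit-on-one-half, predict-on-the-other-half comparison described above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the data are freshly simulated here (they are not the contents of the SUBSET files), the choice of X1 and X2 as the "useful" predictors is hypothetical, and root mean squared error is used as one simple measure of prediction accuracy rather than any particular statistic from the book. The helper names fit_ols, rmse, and compare_models are made up for this sketch.

```python
import numpy as np


def fit_ols(X, y):
    """Return least-squares coefficients (intercept first) for y ~ X."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def rmse(coef, X, y):
    """Root mean squared prediction error of a fitted model on (X, y)."""
    design = np.column_stack([np.ones(len(X)), X])
    resid = y - design @ coef
    return float(np.sqrt(np.mean(resid ** 2)))


def compare_models(X, y, n_fit, subset):
    """Fit the full model and a subset model on the first n_fit rows,
    then score both on the fitting rows and on the remaining rows."""
    X_fit, y_fit = X[:n_fit], y[:n_fit]
    X_new, y_new = X[n_fit:], y[n_fit:]
    candidates = {"full": list(range(X.shape[1])), "subset": list(subset)}
    results = {}
    for name, cols in candidates.items():
        coef = fit_ols(X_fit[:, cols], y_fit)
        results[name] = {
            "fit_rmse": rmse(coef, X_fit[:, cols], y_fit),
            "new_rmse": rmse(coef, X_new[:, cols], y_new),
        }
    return results


# Hypothetical stand-in data: 100 observations, nine predictors,
# with only the first two predictors actually related to the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=100)

results = compare_models(X, y, n_fit=50, subset=[0, 1])
for name, scores in results.items():
    print(f"{name:>6}: fit RMSE = {scores['fit_rmse']:.3f}, "
          f"new-data RMSE = {scores['new_rmse']:.3f}")
```

Because the subset model is nested within the full model, the full model's error on the fitting half can never be larger than the subset model's. The comparison of interest is the error on the held-out half, where extra predictors may or may not hurt in any given sample. To apply the sketch to the exercise data, the simulated X and y would be replaced by the nine predictor columns and the combined response values read from one of the SUBSET files.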
Last updated: November, 2006
© 2006, Iain Pardoe, Lundquist College of Business, University of Oregon