### Problem on multiple linear regression model building

The following problem provides another challenging dataset that
students can use to try to find their best multiple linear regression
model.

You've been asked to develop a regression model for predicting company
stock price. You have data on 100 stocks, and would like to build a
regression model for predicting logY = natural logarithm of current
stock price from 7 potential predictor variables:

- "X1" = earnings per share
- "X2" = percent growth
- "X3" = price/earnings ratio
- "X4" = last year's stock price
- "X5" = profits in $m
- "X6" = sales in $m
- "market" = stock exchange (labeled 1, 2, or 3)

The data are available in the following data files (in SPSS, text, and
Excel format, respectively): stocks.sav, stocks.txt, stocks.xls. Note that market is a qualitative
(categorical) variable with three levels. Do *not* use this
variable directly as a predictor, because its labels 1, 2, and 3 are
codes rather than quantities; instead, you will need to use two dummy
indicator variables based on this variable to model differing "market
effects." For example, your two indicator variables could be D7 (= 1
for market 2, = 0 otherwise) and D8 (= 1 for market 3, = 0 otherwise),
so that market 1 is the reference level. See Computer Help #3 for how
to create these indicator variables. This problem is focused on
model-building, not interpretation, but *if* you wanted to
interpret models that include these indicator variables, you would
plug in D7 = 0 and D8 = 0 for market 1, or D7 = 1 and D8 = 0 for
market 2, or D7 = 0 and D8 = 1 for market 3 (see pages 153-158 in
Chapter 4 of the book for another example).
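
As a minimal sketch of the coding scheme described above (not a substitute for Computer Help #3), the following Python maps a market label to the pair (D7, D8) with market 1 as the reference level. The function name `market_indicators` and the five market labels are made up for illustration; they are not taken from the stocks data files.

```python
def market_indicators(market):
    """Return (D7, D8) for a market label of 1, 2, or 3.

    Market 1 is the reference level, so it maps to (0, 0).
    """
    if market not in (1, 2, 3):
        raise ValueError(f"unexpected market label: {market!r}")
    d7 = 1 if market == 2 else 0  # D7 = 1 for market 2, 0 otherwise
    d8 = 1 if market == 3 else 0  # D8 = 1 for market 3, 0 otherwise
    return d7, d8


# Hypothetical market labels for five stocks (illustrative only).
markets = [1, 2, 3, 2, 1]
indicators = [market_indicators(m) for m in markets]
print(indicators)
```

Each stock gets at most one indicator equal to 1, and a stock in the reference market gets 0 for both.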

Build a suitable regression model. You may want to consider the
following topics in doing so:

- models with both quantitative and qualitative variables
- natural logarithm transformations of quantitative predictors
- interactions
- comparing nested models
- missing data (see section 5.2.7 on pages 215-217 in the book)
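
For the "comparing nested models" topic, one standard tool is the F-statistic built from the residual sums of squares of a reduced model and a complete model that contains it. The sketch below assumes you have already fit both models to the same observations and read off each model's residual sum of squares and residual degrees of freedom; the function name and the numbers are hypothetical and are not results from the stocks data.

```python
def nested_f_statistic(rss_reduced, rss_complete, df_reduced, df_complete):
    """F-statistic for comparing a reduced model with a complete model.

    rss_* are residual sums of squares; df_* are residual degrees of
    freedom (sample size minus number of regression parameters).
    Both models must be fit to the same set of observations.
    """
    extra_terms = df_reduced - df_complete
    if extra_terms <= 0:
        raise ValueError("the complete model must have more parameters")
    numerator = (rss_reduced - rss_complete) / extra_terms
    denominator = rss_complete / df_complete
    return numerator / denominator


# Made-up values for illustration only.
f_stat = nested_f_statistic(
    rss_reduced=12.0, rss_complete=10.0, df_reduced=82, df_complete=80
)
print(f_stat)
```

A large F-statistic, compared against an F distribution with `df_reduced - df_complete` and `df_complete` degrees of freedom, favors keeping the extra terms. Note that the comparison is valid only if both models use the same observations, which matters when some predictors have missing values.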

*[Hint: try to ensure the largest sample size possible through
careful selection of which quantitative predictors to use; a "good"
model should have R² around 0.85, a regression standard
error, s, around 0.34, and a sample size, n, of 87.]*
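
The hint suggests that the usable sample size depends on which predictors are in the model, since a stock can only be used if it has values for every predictor included. A rough way to explore this is to count complete cases for different candidate predictor sets. The sketch below assumes missing values are represented as `None`; how missing values are coded in the actual data files is not stated here, and the four rows are invented for illustration.

```python
def complete_case_count(rows, predictors):
    """Count rows that have a non-missing value for every listed predictor."""
    return sum(
        all(row.get(name) is not None for name in predictors)
        for row in rows
    )


# Invented rows (not from the stocks data); None marks a missing value.
rows = [
    {"X1": 1.2, "X2": 5.0, "X5": None},
    {"X1": 0.8, "X2": None, "X5": 30.0},
    {"X1": 2.1, "X2": 3.5, "X5": 12.0},
    {"X1": 1.5, "X2": 4.0, "X5": None},
]

for candidate in (["X1"], ["X1", "X2"], ["X1", "X2", "X5"]):
    print(candidate, complete_case_count(rows, candidate))
```

Dropping a predictor with many missing values can increase the number of stocks available to fit the model.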

*Last updated: April, 2012*

*© 2012, Iain Pardoe*