R/S-PLUS instructions

These instructions accompany Applied Regression Modeling by Iain Pardoe, 2nd edition, published by Wiley in 2012. The numbered items cross-reference the "computer help" references in the book. These instructions are based on the programming interface (command line) of R 1.40 for Mac OS X, but they should also work for other versions of R and for S-PLUS. Instructions for other statistical software packages are available separately.

Getting started and summarizing univariate data

  1. Change R's default options by selecting R > Preferences (Mac) or Edit > GUI Preferences (Windows).
  2. To open a comma-separated (CSV) data file, type mydata <- read.csv("file.csv"), where file.csv is the name of the data file (with the correct path specified if necessary). One way to then make the dataset available for analysis is to type attach(mydata). You will now be able to analyze each of the variables in the mydata dataset. To see what these variables are, type names(mydata). When you are finished with this dataset, type detach(mydata). We do not go into the other methods for working with R datasets here—see an R companion book such as Fox and Weisberg (2011) for further discussion of this.
  3. To recall a previously entered command, hit the "up" arrow (repeatedly) on your keyboard.
  4. Output appears in the main R Console Window and can be copied and pasted from R to a word processor like OpenOffice Writer or Microsoft Word. Graphs appear in separate windows and can also easily be copied and pasted to other applications.
  5. You can access help by selecting Help > R help. For example, to find out about "boxplots" click Search Engine & Keywords, type boxplots in the search box, and click on one of the resulting hits.
  6. To transform data or compute a new variable, type, for example, logX <- log(X) for the natural logarithm of X and Xsq <- X**2 for X². If you get the error message "unexpected input in ...," this means that there is a syntax error in your expression—a common mistake is to forget the multiplication symbol (*) between a number and a variable (e.g., 2*X represents 2X). Variables created like this are placed in R's "global environment." To see a list of the variables in the global environment, type ls(). Variables in the global environment take precedence over identically named variables in attached datasets. To avoid any confusion, do not create a variable (or other R "object") with the same name as a variable in an attached dataset.
  7. To create indicator (dummy) variables from a qualitative variable, type, for example, D1 <- ifelse(X=="level", yes=1, no=0), where X is the qualitative variable and level is the name of one of the categories in X. Repeat for other indicator variables (if necessary).
  8. Calculate descriptive statistics for quantitative variables by typing summary(Y), where Y is the quantitative variable. Other, more specific commands include mean(Y), median(Y), sd(Y), min(Y), max(Y), and quantile(Y, probs=c(.25,.5,.75), type=7), where probs is a numeric vector of probabilities with values in [0, 1] and type is an integer between 1 and 9 selecting one of nine quantile (percentile) algorithms (the default algorithm 7 gives slightly different results than algorithm 6 used by SPSS and Minitab).
  9. Create contingency tables or cross-tabulations for qualitative variables by typing table(X1, X2), where X1 and X2 are the qualitative variables. Calculate row sums by typing rowsum <- apply(table(X1, X2), 1, sum). Then the row percentages are 100*table(X1, X2)/rowsum. Calculate column sums by typing colsum <- apply(table(X1, X2), 2, sum). Then the column percentages are 100*t(t(table(X1, X2))/colsum) (the function t() stands for "transpose").
  10. If you have a quantitative variable and a qualitative variable, you can calculate descriptive statistics for cases grouped in different categories by typing, for example, aggregate(Y, by, FUN), where Y is the quantitative variable, by is a list of the qualitative variables (e.g., list(X1=X1, X2=X2)), and FUN is the name of the function for calculating the descriptive statistic (e.g., mean).
  11. To make a stem-and-leaf plot for a quantitative variable, type stem(Y, scale=1), where Y is the quantitative variable and scale controls the plot length.
  12. To make a histogram for a quantitative variable, type hist(Y, freq=TRUE, breaks=c(1,2,3,4)), where Y is the quantitative variable, freq is a logical value (if TRUE, the histogram represents frequencies, if FALSE, it represents probability densities), and breaks specifies how to construct the breakpoints.
  13. To make a scatterplot with two quantitative variables, type plot(X, Y), where X is the horizontal axis variable and Y is the vertical axis variable.
  14. All possible scatterplots for more than two variables can be drawn simultaneously (called a scatterplot matrix) by typing pairs(cbind(Y, X1, X2)), where Y, X1, and X2 are quantitative variables.
  15. You can mark or label cases in a scatterplot with different colors/symbols according to categories in a qualitative variable by using the points() function. For example, in a dataset of 20 observations, suppose X2 contains values 1-4 to represent four categories, and Y and X1 are two quantitative variables. Then the following code produces a scatterplot with numbers (representing the value of X2) marking the points:
    plot(X1, Y, type="n")
    for (i in 1:20) points(X1[i], Y[i], pch=as.character(X2[i]))
  16. You can identify individual cases after drawing a scatterplot by typing identify(X, Y, labels), where X is the horizontal axis variable, Y is the vertical axis variable, and labels is an optional variable giving labels for the points. After typing this command, left-click on points on the scatterplot to make labels appear; when you're finished, right-click (and select Stop if needed).
  17. To remove one or more observations from a dataset, type, for example, mydatax <- mydata[-1,], which removes observation #1, or mydatax <- mydata[-c(1,3),], which removes observations #1 and #3. To work with the new dataset, type detach(mydata) and attach(mydatax).
  18. To make a bar chart for cases in different categories, use the barplot() function.
  19. To make boxplots for cases in different categories, use the boxplot() function.
  20. To make a QQ-plot (also known as a normal probability plot) for a quantitative variable, type qqnorm(Y), where Y is a quantitative variable. To add a diagonal line to aid interpretation, type qqline(Y).
  21. To compute a confidence interval for a univariate population mean, type t.test(Y, conf.level=0.95), where Y is the variable for which you want to calculate the confidence interval, and conf.level is the confidence level of the interval.
  22. To do a hypothesis test for a univariate population mean, type t.test(Y, mu=value, alternative="two.sided"), where Y is the variable for which you want to do the test, mu is the (null) hypothesized value, and alternative is "two.sided" for two-tailed, "less" for lower-tailed, or "greater" for upper-tailed.
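The helps above can be combined into one short session. The following sketch assumes a hypothetical CSV file, grades.csv, containing a quantitative variable Y; the file name and variable names are for illustration only.

```r
# Read a hypothetical data file and attach it
mydata <- read.csv("grades.csv")
attach(mydata)
names(mydata)                            # list the variables

# Descriptive statistics for the quantitative variable Y
summary(Y)
quantile(Y, probs=c(.25,.5,.75), type=7)

# Graphical summaries
hist(Y, freq=TRUE)
qqnorm(Y); qqline(Y)

# 95% confidence interval and a two-tailed test of mean = 50
t.test(Y, conf.level=0.95)
t.test(Y, mu=50, alternative="two.sided")

detach(mydata)                           # finish with the dataset
```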

Simple linear regression

  1. To fit a simple linear regression model (i.e., find a least squares line), type model <- lm(Y ~ X), where Y is the response variable and X is the predictor variable, and then type summary(model). In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), type model <- lm(Y ~ X - 1).
  2. To add a regression line or least squares line to a scatterplot, type plot(X, Y), where X is the predictor variable and Y is the response variable, and then type lines(sort(X), fitted(model)[order(X)]), where model is the name of the fitted model object (see help #25).
  3. To find confidence intervals for the regression parameters in a simple linear regression model, type confint(model, level=0.95), where model is the name of the fitted model object (see help #25) and level is the confidence level required. This applies more generally to multiple linear regression also.
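Putting the three helps above together, a minimal simple linear regression session might look like the following; Y and X are assumed to be quantitative variables already available (e.g., from an attached dataset).

```r
# Fit the least squares line and summarize it
model <- lm(Y ~ X)
summary(model)                  # coefficients, t-tests, R-squared

# 95% confidence intervals for the regression parameters
confint(model, level=0.95)

# Scatterplot with the fitted least squares line added
plot(X, Y)
lines(sort(X), fitted(model)[order(X)])
```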

Multiple linear regression

  1. To fit a multiple linear regression model, type model <- lm(Y ~ X1 + X2), where Y is the response variable and X1 and X2 are the predictor variables, and then type summary(model). If you wish to include an interaction, either create an interaction term first using help #6 and include it in the list of model terms or, alternatively, type model <- lm(Y ~ X1 + X2 + X1:X2). In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), type model <- lm(Y ~ X1 + X2 - 1).
  2. To add a quadratic regression line to a scatterplot, first fit the quadratic regression model, model <- lm(Y ~ X + Xsq), where Y is the response variable, X is the predictor variable, and Xsq is X². To produce a smooth quadratic regression line, it might be necessary to create a "more even" version of X for use in the graph, for example, newX <- seq(min(X), max(X), length=100) and newXsq <- newX**2. Next, type plot(X, Y) and then type lines(newX, predict(model, newdata=data.frame(X=newX, Xsq=newXsq))).
  3. Categories of a qualitative variable can be thought of as defining subsets of the sample. If there is also a quantitative response and a quantitative predictor variable in the dataset, a regression model can be fit to the data to represent separate regression lines for each subset. For example, in a dataset of 20 observations, suppose X2 contains the values 1-4 to represent four categories, and Y and X1 are two quantitative variables. Create an interaction term, X1X2 <- X1*X2, and fit the model, model <- lm(Y ~ X1 + X2 + X1X2). Then the following code produces a scatterplot with numbers (representing the value of X2) instead of circles marking the points, and four separate regression lines:
    plot(X1, Y, type="n")
    for (i in 1:20) points(X1[i], Y[i], pch=as.character(X2[i]))
    for (i in 1:4) lines(X1[X2==i], fitted(model)[X2==i])
  4. To find the F-statistic and associated p-value for a nested model F-test in multiple linear regression, first fit the reduced model, for example, model1 <- lm(Y ~ X1), and the complete model, for example, model2 <- lm(Y ~ X1 + X2 + X3). Then type anova(model1, model2). The F-statistic is in the second row of the "Analysis of Variance Table" in the column headed F, while the associated p-value is in the column headed Pr(>F).
  5. To save residuals in a multiple linear regression model, type res <- residuals(model), where model is the name of the fitted model object (see help #31). They can now be used just like any other variable, for example, to construct residual plots. To save what Pardoe (2012) calls standardized residuals, type sta <- rstandard(model). To save what Pardoe (2012) calls studentized residuals, type stu <- rstudent(model).
  6. To add a loess fitted line to a scatterplot (useful for checking the zero mean regression assumption in a residual plot), type, for example, scatter.smooth(fitted(model), rstudent(model), span=0.75), where model is the name of the fitted model object (see help #31).
  7. To save leverages in a multiple linear regression model, type lev <- hatvalues(model), where model is the name of the fitted model object (see help #31). They can now be used just like any other variable, for example, to construct scatterplots.
  8. To save Cook's distances in a multiple linear regression model, type cook <- cooks.distance(model), where model is the name of the fitted model object (see help #31). They can now be used just like any other variable, for example, to construct scatterplots.
  9. To create some residual plots automatically in a multiple linear regression model, type plot(model), where model is the name of the fitted model object (see help #31). This produces a plot of residuals versus fitted values, a QQ-plot of the standardized residuals, a plot of the square root of the standardized residuals versus fitted values, and a plot of standardized residuals versus leverages (with Cook's distance thresholds of 0.5 and 1 marked)—hit Return to cycle through the plots. To create residual plots manually, first create studentized residuals (see help #35), and then construct scatterplots with these studentized residuals on the vertical axis.
  10. To create a correlation matrix of quantitative variables (useful for checking potential multicollinearity problems), type cor(cbind(Y, X1, X2)), where Y, X1, and X2 are quantitative variables.
  11. To find variance inflation factors in multiple linear regression, first install and load the car package of Fox and Weisberg (2011). Then type vif(model), where model is the name of the fitted model object (see help #31).
  12. To draw a predictor effect plot for graphically displaying the effects of transformed quantitative predictors and/or interactions between quantitative and qualitative predictors in multiple linear regression, first create a variable representing the effect, say, "X1effect" (see help #6). See Section 5.5 in Pardoe (2012) for an example.
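As a sketch of how the model fitting and diagnostic helps above fit together, the following assumes quantitative variables Y, X1, and X2 from an attached dataset (all names are for illustration only).

```r
# Fit a multiple linear regression model with an interaction
model <- lm(Y ~ X1 + X2 + X1:X2)
summary(model)

# Save residual diagnostics for use in scatterplots
stu <- rstudent(model)             # studentized residuals
lev <- hatvalues(model)            # leverages
cook <- cooks.distance(model)      # Cook's distances

# Residual plot with a loess line to check the zero mean assumption
scatter.smooth(fitted(model), stu, span=0.75)

# Automatic diagnostic plots (hit Return to cycle through)
plot(model)
```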

Last updated: May, 2012

© 2012, Iain Pardoe