R/S-PLUS instructions
These instructions accompany Applied Regression Modeling by Iain Pardoe, 2nd edition,
published by Wiley in 2012. The numbered items cross-reference the "computer help" references
in the book. These instructions are based on the programming interface (command line) of R 1.40 for
Mac OS X, but they should also work for other versions of R and also S-PLUS. Find instructions for
other statistical software packages here.
Getting started and summarizing univariate data
1. Change R's default options by selecting R > Preferences (Mac) or
Edit > GUI Preferences (Windows).
2. To open a comma-separated values (CSV) data file, type mydata <-
read.csv("file.csv"), where file.csv is the name of the data file (with the correct
path specified if necessary). One way to then make the dataset available for analysis is to type
attach(mydata). You will now be able to analyze each of the variables in the
mydata dataset. To see what these variables are, type names(mydata). When you are
finished with this dataset, type detach(mydata). We do not go into the other methods for
working with R datasets here; see an R companion book such as Fox and Weisberg (2011) for
further discussion.
3. To recall a previously entered command, hit the "up" arrow (repeatedly) on
your keyboard.
4. Output appears in the main R Console window and can be copied and
pasted from R to a word processor such as OpenOffice Writer or Microsoft Word. Graphs appear in
separate windows and can also easily be copied and pasted into other applications.
5. You can access help by selecting Help > R help. For example, to
find out about boxplots, click Search Engine & Keywords, type boxplots in the
search box, and click on one of the resulting hits.
6. To transform data or compute a new variable, type, for example,
logX <- log(X) for the natural logarithm of X and Xsq <- X**2 for
X^2. If you get the error message "unexpected input in ...," there is a syntax error
in your expression; a common mistake is to forget the multiplication symbol
(*) between a number and a variable (e.g., 2*X represents 2X). Variables created
like this are placed in R's "global environment." To see a list of the variables in
the global environment, type ls(). Variables in the global environment take precedence
over identically named variables in attached datasets. To avoid any confusion, avoid creating a
variable (or other R "object") that has the same name as a variable in an attached dataset.
7. To create indicator (dummy) variables from a qualitative variable, type,
for example, D1 <- ifelse(X=="level", yes=1, no=0), where X is the qualitative
variable and level is the name of one of the categories in X. Repeat for the other
indicator variables (if necessary).
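As a quick sketch of this step (the variable X and its category names here are made up for illustration):

```r
# Hypothetical qualitative variable with three categories
X <- c("low", "medium", "high", "low", "high")
# One indicator per category (typically omit one category's indicator in a regression)
D1 <- ifelse(X == "low", yes = 1, no = 0)
D2 <- ifelse(X == "medium", yes = 1, no = 0)
D1  # 1 0 0 1 0
```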

8. To find a percentile (critical value) for a t-, F-, or chi-squared distribution:
- For a t-distribution, type qt(p, df, lower.tail=F), where p is the one-tail
significance level (upper-tail area) and df is the degrees of freedom. For example,
qt(0.05, 29, lower.tail=F) returns the 95th percentile of the t-distribution with 29
degrees of freedom (1.699), which is the critical value for an upper-tail test with a 5%
significance level. By contrast, qt(0.025, 29, lower.tail=F) returns the 97.5th
percentile of the t-distribution with 29 degrees of freedom (2.045), which is the
critical value for a two-tail test with a 5% significance level.
- For an F-distribution, type qf(p, df1, df2, lower.tail=F), where p is the
significance level (upper-tail area), df1 is the numerator degrees of freedom, and
df2 is the denominator degrees of freedom. For example, qf(0.05, 2, 3, lower.tail=F)
returns the 95th percentile of the F-distribution with 2 numerator degrees of freedom
and 3 denominator degrees of freedom (9.552).
- For a chi-squared distribution, type qchisq(p, df, lower.tail=F), where p is the
significance level (upper-tail area) and df is the degrees of freedom. For example,
qchisq(0.05, 2, lower.tail=F) returns the 95th percentile of the chi-squared
distribution with 2 degrees of freedom (5.991).
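The three commands above can be checked directly at the R prompt; the critical values quoted in the text are reproduced here (rounded to three decimal places):

```r
# Critical values quoted in the text
round(qt(0.05, 29, lower.tail = FALSE), 3)     # t, upper-tail 5%: 1.699
round(qt(0.025, 29, lower.tail = FALSE), 3)    # t, two-tail 5%: 2.045
round(qf(0.05, 2, 3, lower.tail = FALSE), 3)   # F(2, 3), upper-tail 5%: 9.552
round(qchisq(0.05, 2, lower.tail = FALSE), 3)  # chi-squared(2), upper-tail 5%: 5.991
```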

9. To find an upper-tail area (p-value) for a t-, F-, or chi-squared distribution:
- For a t-distribution, type pt(t, df, lower.tail=F), where t is the absolute value
of the t-statistic and df is the degrees of freedom. For example,
pt(2.40, 29, lower.tail=F) returns the upper-tail area for a t-statistic of 2.40 from
the t-distribution with 29 degrees of freedom (0.012), which is the p-value for an
upper-tail test. By contrast, 2*pt(2.40, 29, lower.tail=F) returns the two-tail area
for a t-statistic of 2.40 from the t-distribution with 29 degrees of freedom (0.023),
which is the p-value for a two-tail test.
- For an F-distribution, type pf(f, df1, df2, lower.tail=F), where f is the value of
the F-statistic, df1 is the numerator degrees of freedom, and df2 is the denominator
degrees of freedom. For example, pf(51.4, 2, 3, lower.tail=F) returns the upper-tail
area (p-value) for an F-statistic of 51.4 from the F-distribution with 2 numerator
degrees of freedom and 3 denominator degrees of freedom (0.005).
- For a chi-squared distribution, type pchisq(chisq, df, lower.tail=F), where chisq
is the value of the chi-squared statistic and df is the degrees of freedom. For
example, pchisq(0.38, 2, lower.tail=F) returns the upper-tail area (p-value) for a
chi-squared statistic of 0.38 from the chi-squared distribution with 2 degrees of
freedom (0.827).
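As before, the tail areas quoted in the text can be reproduced at the R prompt:

```r
# Tail areas (p-values) quoted in the text
pt(2.40, 29, lower.tail = FALSE)        # one-tail p-value (0.012 rounded)
2 * pt(2.40, 29, lower.tail = FALSE)    # two-tail p-value (0.023 rounded)
pf(51.4, 2, 3, lower.tail = FALSE)      # F-test p-value (0.005 rounded)
pchisq(0.38, 2, lower.tail = FALSE)     # chi-squared p-value (0.827 rounded)
```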
10. Calculate descriptive statistics for quantitative variables by typing
summary(Y), where Y is the quantitative variable. Other, more specific commands
include mean(Y), median(Y), sd(Y), min(Y), max(Y), and
quantile(Y, probs=c(.25,.5,.75), type=7), where probs is a numeric vector of
probabilities with values in [0, 1] and type is an integer between 1 and 9 selecting one
of nine quantile (percentile) algorithms (the default algorithm 7 gives slightly different
results than algorithm 6, which is used by SPSS and Minitab).
11. Create contingency tables or cross-tabulations for qualitative
variables by typing table(X1, X2), where X1 and X2 are the qualitative
variables. Calculate row sums by typing rowsum <- apply(table(X1, X2), 1, sum). Then the
row percentages are 100*table(X1, X2)/rowsum. Calculate column sums by typing
colsum <- apply(table(X1, X2), 2, sum). Then the column percentages are
100*t(t(table(X1, X2))/colsum) (the function t() stands for "transpose").
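A minimal worked sketch of the row and column percentages with made-up data (the variables and categories here are hypothetical):

```r
# Hypothetical qualitative variables
X1 <- c("a", "a", "b", "b", "b", "a")
X2 <- c("x", "y", "x", "y", "y", "y")
tab <- table(X1, X2)
rowsum <- apply(tab, 1, sum)        # row totals
rowpct <- 100 * tab / rowsum        # row percentages (each row sums to 100)
colsum <- apply(tab, 2, sum)        # column totals
colpct <- 100 * t(t(tab) / colsum)  # column percentages (t() = transpose)
```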
12. If you have a quantitative variable and a qualitative variable, you can calculate
descriptive statistics for cases grouped in different categories by typing, for example,
aggregate(Y, by, FUN), where Y is the quantitative variable, by is a list
of the qualitative variables (e.g., list(X1=X1, X2=X2)), and FUN is the name of
the function for calculating the descriptive statistic (e.g., mean).
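For example (hypothetical data), group means of a quantitative Y by a qualitative X1:

```r
# Hypothetical data: a quantitative Y grouped by a qualitative X1
Y <- c(1, 2, 3, 4, 5, 6)
X1 <- c("a", "a", "b", "b", "a", "b")
means <- aggregate(Y, by = list(X1 = X1), FUN = mean)
means  # group "a": mean of 1, 2, 5; group "b": mean of 3, 4, 6
```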
13. To make a stem-and-leaf plot for a quantitative variable, type
stem(Y, scale=1), where Y is the quantitative variable and scale controls
the plot length.
14. To make a histogram for a quantitative variable, type
hist(Y, freq=TRUE, breaks=c(1,2,3,4)), where Y is the quantitative variable,
freq is a logical value (if TRUE, the histogram represents frequencies; if
FALSE, it represents probability densities), and breaks specifies how to construct
the breakpoints.
15. To make a scatterplot of two quantitative variables, type
plot(X, Y), where X is the horizontal axis variable and Y is the vertical
axis variable.
16. All possible scatterplots for more than two variables can be drawn simultaneously
(called a scatterplot matrix) by typing pairs(cbind(Y, X1, X2)), where Y,
X1, and X2 are quantitative variables.
17. You can mark or label cases in a scatterplot with different colors/symbols
according to categories in a qualitative variable by using the points() function. For
example, in a dataset of 20 observations, suppose X2 contains the values 1-4 to represent
four categories, and Y and X1 are two quantitative variables. Then the following code
produces a scatterplot with numbers (representing the value of X2) marking the
points:
plot(X1, Y, type="n")
for (i in 1:20) points(X1[i], Y[i], pch=as.character(X2[i]))
18. You can identify individual cases after drawing a scatterplot by typing
identify(X, Y, labels), where X is the horizontal axis variable, Y is the
vertical axis variable, and labels is an optional variable giving labels for the points.
After typing this command, left-click on points in the scatterplot to make labels appear; when
you're finished, right-click (and select Stop if needed).
19. To remove one or more observations from a dataset, type, for example,
mydatax <- mydata[-1,], which removes observation #1, or
mydatax <- mydata[-c(1,3),], which removes observations #1 and #3. To work with the new
dataset, type detach(mydata) and attach(mydatax).
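A small sketch of the negative-index idiom (the dataset here is made up):

```r
# Hypothetical dataset with four observations
mydata <- data.frame(Y = c(5, 7, 9, 11), X = 1:4)
mydatax <- mydata[-1, ]          # drop observation #1
mydatax2 <- mydata[-c(1, 3), ]   # drop observations #1 and #3
nrow(mydatax)    # 3
nrow(mydatax2)   # 2
```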
20. To make a bar chart for cases in different categories, use the
barplot() function.
- For frequency bar charts of one qualitative variable, type
barplot(table(X1)), where X1 is a qualitative variable.
- For frequency bar charts of two qualitative variables,
type barplot(table(X1, X2), beside=TRUE, legend.text=TRUE), where X1
and X2 are qualitative variables and beside is a logical value for whether the
chart is clustered (TRUE) or stacked (FALSE).
- The bars can also represent various summary functions of a quantitative variable. For
example, to produce a bar chart of means, type:
means <- aggregate(Y, by=list(X1=X1, X2=X2), mean)
barplot(tapply(means$x, list(means$X1, means$X2), sum), beside=TRUE)
where X1 and X2 are the qualitative variables and Y is a quantitative
variable.
21. To make boxplots for cases in different categories, use the
boxplot() function.
- For just one qualitative variable, type boxplot(Y ~ X1).
- For two qualitative variables, type boxplot(Y ~ X1 + X2), where
Y is a quantitative variable, and X1 and X2 are qualitative variables.
Alternatively, use the bwplot() function in the lattice package, which has syntax
bwplot(Y ~ X1 | X2).
22. To make a QQ-plot (also known as a normal probability plot) for a
quantitative variable, type qqnorm(Y), where Y is a quantitative variable. To add
a diagonal line to aid interpretation, type qqline(Y).
23. To compute a confidence interval for a univariate population mean, type
t.test(Y, conf.level=0.95), where Y is the variable for which you want to
calculate the confidence interval, and conf.level is the confidence level of the
interval.
24. To do a hypothesis test for a univariate population mean, type
t.test(Y, mu=value, alternative="two.sided"), where Y is the variable for which
you want to do the test, mu is the (null) hypothesized value, and alternative is
"two.sided" for a two-tail test, "less" for a lower-tail test, or "greater" for an
upper-tail test.
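The two t.test() uses above can be sketched together; the sample here is simulated, so the variable Y and its parameter values are hypothetical:

```r
# Simulated sample (hypothetical): n = 30 drawn from a population with mean 10
set.seed(1)
Y <- rnorm(30, mean = 10, sd = 2)
ci <- t.test(Y, conf.level = 0.95)$conf.int          # 95% CI for the mean
tt <- t.test(Y, mu = 10, alternative = "two.sided")  # test of H0: mu = 10
ci
tt$p.value
```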
Simple linear regression
25. To fit a simple linear regression model (i.e., find a least squares line),
type model <- lm(Y ~ X), where Y is the response variable and X is
the predictor variable, and then type summary(model). In the rare circumstance that you
wish to fit a model without an intercept term (regression through the origin), type
model <- lm(Y ~ X - 1).
26. To add a regression line or least squares line to a scatterplot,
type plot(X, Y), where X is the predictor variable and Y is the response
variable, and then type lines(sort(X), fitted(model)[order(X)]), where model is
the name of the fitted model object (see help #25).
27. To find confidence intervals for the regression parameters in a simple
linear regression model, type confint(model, level=0.95), where model is the name
of the fitted model object (see help #25) and level is the confidence level
required. This also applies more generally to multiple linear regression.

28. To find a fitted value or predicted value of Y (the response
variable) at a particular value of X (the predictor variable), type predict(model), where
model is the name of the fitted model object (see help #25). This returns the fitted or
predicted values of Y at each of the X-values in the dataset. You can also obtain a
fitted or predicted value of Y at an X-value that is not in the dataset by typing
predict(model, newdata=data.frame(X=a)), where newdata contains a data frame
with variables that have the same names as the predictors in the model (X in this case),
and a is the particular X-value of interest. This also applies more generally to
multiple linear regression.

29. To find a confidence interval for the mean of Y at a particular value of
X, type predict(model, interval="confidence"), where model is the name of the
fitted model object (see help #25). The confidence intervals for the mean of Y at each of the
X-values in the dataset are displayed as two columns headed lwr and upr. You can
also obtain a confidence interval for the mean of Y at an X-value that is not in the dataset by
typing predict(model, newdata=data.frame(X=a), interval="confidence"), where
newdata contains a data frame with variables that have the same names as the predictors in
the model (X in this case), and a is the particular X-value of interest. This also
applies more generally to multiple linear regression.

30. To find a prediction interval for an individual value of Y at a particular
value of X, type predict(model, interval="prediction"), where model is the name of
the fitted model object (see help #25). The prediction intervals for individual values of Y at
each of the X-values in the dataset are displayed as two columns headed lwr and
upr. You can also obtain a prediction interval for an individual value of Y at an
X-value that is not in the dataset by typing
predict(model, newdata=data.frame(X=a), interval="prediction"), where
newdata contains a data frame with variables that have the same names as the predictors in
the model (X in this case), and a is the particular X-value of interest. This also
applies more generally to multiple linear regression.
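Items 27-30 can be sketched together on one fitted model; the data below are simulated, so the variable names and X-value 10 are hypothetical. Note that the prediction interval is always wider than the confidence interval at the same X-value:

```r
# Hypothetical data for a simple linear regression
set.seed(1)
X <- 1:20
Y <- 3 + 2 * X + rnorm(20)
model <- lm(Y ~ X)
confint(model, level = 0.95)  # CIs for the intercept and slope
# CI for the mean of Y, and PI for an individual Y, at X = 10
cint <- predict(model, newdata = data.frame(X = 10), interval = "confidence")
pint <- predict(model, newdata = data.frame(X = 10), interval = "prediction")
cint
pint
```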
Multiple linear regression
31. To fit a multiple linear regression model, type
model <- lm(Y ~ X1 + X2), where Y is the response
variable and X1 and X2 are the predictor variables, and then
type summary(model). If you wish to include an interaction, either
create an interaction term first using help #6 and include it in the list of
model terms or, alternatively, type
model <- lm(Y ~ X1 + X2 + X1:X2). In the rare
circumstance that you wish to fit a model without an intercept term (regression
through the origin),
type model <- lm(Y ~ X1 + X2 - 1).
32. To add a quadratic regression line to a scatterplot, first fit the
quadratic regression model, model <- lm(Y ~ X + Xsq), where Y is the response
variable, X is the predictor variable, and Xsq is X^2. To produce a
smooth quadratic regression line, it might be necessary to create a "more even" version of
X for use in the graph, for example, newX <- seq(min(X), max(X), length=100) and
newXsq <- newX**2. Next, type plot(X, Y) and then type
lines(newX, predict(model, newdata=data.frame(X=newX, Xsq=newXsq))).
33. Categories of a qualitative variable can be thought of as defining subsets
of the sample. If there is also a quantitative response and a quantitative predictor variable in
the dataset, a regression model can be fit to the data to represent separate regression lines for
each subset. For example, in a dataset of 20 observations, suppose X2 contains the values
1-4 to represent four categories, and Y and X1 are two quantitative variables.
Create an interaction term, X1X2 <- X1*X2, and fit the model,
model <- lm(Y ~ X1 + X2 + X1X2). Then the following code produces a scatterplot
with numbers (representing the value of X2) instead of circles marking the points, and four
separate regression lines:
plot(X1, Y, type="n")
for (i in 1:20) points(X1[i], Y[i], pch=as.character(X2[i]))
for (i in 1:4) lines(X1[X2==i], fitted(model)[X2==i])
34. To find the F-statistic and associated p-value for a nested model F-test in
multiple linear regression, first fit the reduced model, for example,
model1 <- lm(Y ~ X1), and the complete model, for example,
model2 <- lm(Y ~ X1 + X2 + X3). Then type anova(model1, model2). The
F-statistic is in the second row of the "Analysis of Variance Table" in the column headed
F, while the associated p-value is in the column headed Pr(>F).
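The nested F-test can be sketched end to end; the data below are simulated for illustration (here X3 contributes nothing beyond X1 and X2 by construction):

```r
# Hypothetical data for a nested model F-test
set.seed(1)
n <- 30
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n)
Y <- 1 + 2 * X1 + 0.5 * X2 + rnorm(n)
model1 <- lm(Y ~ X1)             # reduced model
model2 <- lm(Y ~ X1 + X2 + X3)   # complete model
anova(model1, model2)            # F-statistic and Pr(>F) in the second row
```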
35. To save residuals in a multiple linear regression model, type
res <- residuals(model), where model is the name of the fitted model object (see
help #31). They can now be used just like any other variable, for example, to construct
residual plots. To save what Pardoe (2012) calls standardized residuals, type
sta <- rstandard(model). To save what Pardoe (2012) calls
studentized residuals, type stu <- rstudent(model).
36. To add a loess fitted line to a scatterplot (useful for checking the zero
mean regression assumption in a residual plot), type, for example,
scatter.smooth(fitted(model), rstudent(model), span=0.75), where model is the name
of the fitted model object (see help #31).
37. To save leverages in a multiple linear regression model, type
lev <- hatvalues(model), where model is the name of the fitted model object
(see help #31). They can now be used just like any other variable, for example, to construct
scatterplots.
38. To save Cook's distances in a multiple linear regression model, type
cook <- cooks.distance(model), where model is the name of the fitted model object
(see help #31). They can now be used just like any other variable, for example, to construct
scatterplots.
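Items 35, 37, and 38 can be collected in one sketch; the fitted model below uses simulated (hypothetical) data. A useful check: the leverages always sum to the number of regression coefficients.

```r
# Hypothetical fitted model for illustrating the diagnostic quantities above
set.seed(1)
X1 <- rnorm(25); X2 <- rnorm(25)
Y <- 1 + X1 - X2 + rnorm(25)
model <- lm(Y ~ X1 + X2)
res  <- residuals(model)        # ordinary residuals
sta  <- rstandard(model)        # standardized residuals
stu  <- rstudent(model)         # studentized residuals
lev  <- hatvalues(model)        # leverages (sum to the number of coefficients, 3 here)
cook <- cooks.distance(model)   # Cook's distances
```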
39. To create some residual plots automatically in a multiple linear regression
model, type plot(model), where model is the name of the fitted model object (see
help #31). This produces a plot of residuals versus fitted values, a QQ-plot of the standardized
residuals, a plot of the square root of the standardized residuals versus fitted values, and a
plot of standardized residuals versus leverages (with Cook's distance thresholds of 0.5 and 1
marked); hit Return to cycle through the plots. To create residual plots
manually, first create studentized residuals (see help #35), and then construct scatterplots with
these studentized residuals on the vertical axis.
40. To create a correlation matrix of quantitative variables (useful for
checking potential multicollinearity problems), type cor(cbind(Y, X1, X2)), where
Y, X1, and X2 are quantitative variables.
41. To find variance inflation factors in multiple linear regression, first
install and load the car package of Fox and Weisberg (2011). Then type
vif(model), where model is the name of the fitted model object (see help
#31).
42. To draw a predictor effect plot for graphically displaying the effects of
transformed quantitative predictors and/or interactions between quantitative and qualitative
predictors in multiple linear regression, first create a variable representing the effect, say,
"X1effect" (see help #6).
- If the "X1effect" variable just involves X1 (e.g.,
1 + 3X1 + 4X1^2), type plot(sort(X1), X1effect[order(X1)], type="l").
- If the "X1effect" variable involves a qualitative variable
(e.g., 1 - 2X1 + 3D2X1, where D2 is an indicator variable), type the following:
X1effect0 <- X1effect[D2==0]
X1effect1 <- X1effect[D2==1]
plot(X1, X1effect, type="n")
lines(sort(X1[D2==0]), X1effect0[order(X1[D2==0])])
lines(sort(X1[D2==1]), X1effect1[order(X1[D2==1])])
See Section 5.5 in Pardoe (2012) for an example.
Last updated: May, 2012
© 2012, Iain Pardoe