Stata instructions
These instructions accompany Applied Regression Modeling by Iain Pardoe, 2nd edition
published by Wiley in 2012. The numbered items cross-reference with the "computer help" references
in the book. These instructions are based on Stata 8 for Windows, but they should also work for other versions. Find instructions for other statistical software packages
here.
Getting started and summarizing univariate data
 Change Stata's default options by selecting ?.
 To open a Stata data file, type use "file.dta", where
file.dta is the name of the data file (with the correct path specified if necessary). You can also import data stored in other formats, such as text files exported from a spreadsheet, by selecting
File > Import.
 To recall a previously entered command, single-click it in the "Review"
window.
 Output appears in the "Stata Results" window and can be copied and
pasted from Stata to a word processor like OpenOffice Writer or Microsoft Word. Graphs appear in
separate windows and can also easily be copied using Edit > Copy Graph and then pasted to
other applications.
 You can access help by selecting Help > Contents. To
find out about a particular topic click Help > Search or to find out about a particular
Stata command click Help > Stata Command.
 To transform data or compute a new variable, type, for example,
generate logX=ln(X) for the natural logarithm of X and generate Xsq=X^2 for
X^{2}. If you get the error message "[?]," this means that there
is a syntax error in your expression—a common mistake is to forget the multiplication symbol
(*) between a number and a variable (e.g., 2*X represents 2X).
 To create indicator (dummy) variables from a qualitative variable, type,
for example, generate D1=(X=="level"), where X is the qualitative
variable and level is the name of one of the categories in X. Repeat for other
indicator variables (if necessary).
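The two steps above (transforming variables and creating indicators) can be sketched together as follows. This is only an illustration: it assumes Stata's built-in auto example dataset is available through sysuse, and the new variable names (logprice, weightsq, origin, D1) are made up for the example.

```stata
* Sketch only: the auto dataset and all new variable names are illustrative.
sysuse auto, clear

* Transformations: natural logarithm and square
generate logprice = ln(price)
generate weightsq = weight^2

* foreign is stored as a labeled number, so decode it into a string variable first
decode foreign, generate(origin)

* Indicator equal to 1 for the "Foreign" category and 0 otherwise
generate D1 = (origin == "Foreign")
tabulate origin D1
```

The final tabulate is just a check that D1 equals 1 for exactly the cases in the chosen category.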

 To find a percentile (critical value) for a t-distribution, type
display invttail(df, p), where p is the one-tail significance level (upper-tail
area) and df is the degrees of freedom. For example, display invttail(29, 0.05)
returns the 95th percentile of the t-distribution with 29 degrees of freedom (1.699), which is the
critical value for an upper-tail test with a 5% significance level. By contrast,
display invttail(29, 0.025) returns the 97.5th percentile of the t-distribution with 29
degrees of freedom (2.045), which is the critical value for a two-tail test with a 5% significance
level.
 To find a percentile (critical value) for an F-distribution, type
display invFtail(df1, df2, p), where p is the significance level (upper-tail
area), df1 is the numerator degrees of freedom, and df2 is the denominator degrees
of freedom. For example, display invFtail(2, 3, 0.05) returns the 95th percentile of the
F-distribution with 2 numerator degrees of freedom and 3 denominator degrees of freedom
(9.552).
 To find a percentile (critical value) for a chi-squared distribution, type
display invchi2tail(df, p), where p is the significance level (upper-tail
area) and df is the degrees of freedom. For example, display invchi2tail(2, 0.05) returns the 95th percentile of the chi-squared distribution with 2 degrees of
freedom (5.991).
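As a small sketch of how these functions fit together, the two-tail t critical value can be computed from a significance level held in a local macro. The macro names alpha and df are assumptions for this example. The last two lines illustrate a standard fact: the squared two-tail t critical value equals the F critical value with 1 numerator degree of freedom.

```stata
* Sketch only: the macro names alpha and df are illustrative.
local alpha = 0.05
local df = 29

* Two-tail critical value: put alpha/2 in the upper tail
display invttail(`df', `alpha'/2)

* These two results agree (t squared equals F with 1 numerator df)
display invttail(`df', `alpha'/2)^2
display invFtail(1, `df', `alpha')
```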

 To find an upper-tail area (one-tail p-value) for a t-distribution, type
display ttail(df, t), where t is the absolute value of the t-statistic and
df is the degrees of freedom. For example, display ttail(29, 2.40) returns the
upper-tail area for a t-statistic of 2.40 from the t-distribution with 29 degrees of freedom
(0.012), which is the p-value for an upper-tail test. By contrast,
display 2*ttail(29, 2.40) returns the two-tail area for a t-statistic of 2.40 from the
t-distribution with 29 degrees of freedom (0.023), which is the p-value for a two-tail test.
 To find an upper-tail area (p-value) for an F-distribution, type
display Ftail(df1, df2, f), where f is the value of the F-statistic,
df1 is the numerator degrees of freedom, and df2 is the denominator degrees of
freedom. For example, display Ftail(2, 3, 51.4) returns the upper-tail area (p-value)
for an F-statistic of 51.4 for the F-distribution with 2 numerator degrees of freedom and 3
denominator degrees of freedom (0.005).
 To find an upper-tail area (p-value) for a chi-squared distribution, type
display chi2tail(df, chisq), where chisq is the value of the chi-squared
statistic and df is the degrees of freedom. For example,
display chi2tail(2, 0.38) returns the upper-tail area (p-value)
for a chi-squared statistic of 0.38 for the chi-squared distribution with 2 degrees of freedom
(0.827).
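The reason for using the absolute value of the t-statistic can be seen in a short sketch. The scalar names tstat and dfres are assumptions for this example. With a negative t-statistic, ttail returns an area larger than 0.5, so doubling it would give a "p-value" above 1. Taking abs() first avoids this.

```stata
* Sketch only: the scalar names tstat and dfres are illustrative.
scalar tstat = -2.40
scalar dfres = 29

* Without abs(): upper-tail area for a negative statistic (larger than 0.5)
display ttail(scalar(dfres), scalar(tstat))

* Two-tail p-value, using the absolute value of the t-statistic
display 2*ttail(scalar(dfres), abs(scalar(tstat)))
```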
 Calculate descriptive statistics for quantitative variables by typing
summarize Y, where Y is the quantitative variable. Type
summarize Y, detail to include more statistics. To request specific statistics, type, for example,
tabstat Y, statistics(mean median sd min max).
 Create contingency tables or cross-tabulations for qualitative
variables by typing tabulate X1 X2, where X1 and X2 are the qualitative
variables. Calculate row percentages by typing tabulate X1 X2, row. Calculate column
percentages by typing tabulate X1 X2, column.
 If you have a quantitative variable and a qualitative variable, you can calculate
descriptive statistics for cases grouped in different categories by typing, for
example,
sort X
by X: summarize Y
where Y is the quantitative variable and X is the qualitative variable.
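As a sketch, the sort and by steps can be combined using the bysort prefix, and tabstat gives a more compact table of the same grouped statistics. The example assumes the built-in auto dataset, with mpg as the quantitative variable and foreign as the qualitative variable.

```stata
* Sketch only: uses the built-in auto example dataset.
sysuse auto, clear

* bysort sorts by foreign and then runs summarize within each category
bysort foreign: summarize mpg

* A compact alternative: one row of statistics per category
tabstat mpg, by(foreign) statistics(n mean sd min max)
```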
 To make a stem-and-leaf plot for a quantitative variable, type
stem Y, round(d), where Y is the quantitative variable and d controls
the rounding (e.g., "1" for integers, "0.1" for tenths, etc.).
 To make a histogram for a quantitative variable, type
histogram Y, bin(10), where Y is the quantitative variable and bin specifies the number of bins.
 To make a scatterplot with two quantitative variables, type
graph twoway scatter Y X, where Y is the vertical axis variable and X is
the horizontal axis variable.
 All possible scatterplots for more than two variables can be drawn simultaneously
(called a scatterplot matrix) by typing graph matrix Y X1 X2, where Y,
X1, and X2 are quantitative variables.
 You can mark or label cases in a scatterplot with different colors/symbols
according to categories in a qualitative variable by using the separate command. For
example, suppose X2 contains values 1 and 2 to represent two categories, and Y and
X1 are two quantitative variables. Then the following code produces a scatterplot with
different symbols (representing the values of X2) marking the points:
separate Y, by(X2)
graph twoway (scatter Y1 X1, mlabel(X2)) (scatter Y2 X1, mlabel(X2))
 You can identify individual cases in a scatterplot by using the
mlabel option, for example, graph twoway scatter Y X, mlabel(id), where X
is the horizontal axis variable, Y is the vertical axis variable, and id is a
variable containing labels for the points.
 To remove one or more observations from a dataset, type, for example,
drop if ID==1, which would remove the observation with ID 1.
 To make a bar chart for cases in different categories, use
graph bar.
 For frequency bar charts of one qualitative variable, type
graph bar (count) X1, over(X1), where X1 is a qualitative variable.
 For frequency bar charts of two qualitative variables,
type graph bar (count) X1, over(X1) over(X2), where X1
and X2 are qualitative variables.
 The bars can also represent various summary functions for a quantitative variable.
For example, to produce a bar chart of means, type graph bar (mean) Y, over(X1) over(X2),
where X1 and X2 are the qualitative variables and Y is a quantitative
variable.
 To make boxplots for cases in different categories, use graph box.
 For just one qualitative variable, type graph box Y, over(X1), where
Y is a quantitative variable and X1 is a qualitative variable.
 For two qualitative variables, type graph box Y, over(X1) over(X2), where
Y is a quantitative variable, and X1 and X2 are qualitative
variables.
 To make a QQ-plot (also known as a normal probability plot) for a
quantitative variable, type qnorm Y, where Y is a quantitative variable.
 To compute a confidence interval for a univariate population mean, type
ci Y, level(95), where Y is the variable for which you want to
calculate the confidence interval, and the value in parentheses after level is the
confidence level of the interval, expressed as a percentage.
 To do a hypothesis test for a univariate population mean, type
ttest Y==value, where Y is the variable for which you want to do the test and
value is the (null) hypothesized value.
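To see what ci and ttest compute, the interval and the t-statistic can be reproduced by hand from the results that summarize leaves behind in r(). This is a sketch under assumptions: it uses the built-in auto dataset, a hypothetical null value of 20, and illustrative scalar names (ybar, sey, tcrit).

```stata
* Sketch only: auto dataset, null value 20, and scalar names are illustrative.
sysuse auto, clear
quietly summarize mpg
scalar ybar = r(mean)
scalar sey = r(sd)/sqrt(r(N))
scalar tcrit = invttail(r(N) - 1, 0.025)

* 95% confidence interval for the population mean
display "lower limit: " (scalar(ybar) - scalar(tcrit)*scalar(sey))
display "upper limit: " (scalar(ybar) + scalar(tcrit)*scalar(sey))

* t-statistic for the null hypothesized value of 20
display "t-statistic: " ((scalar(ybar) - 20)/scalar(sey))

* Compare with the built-in test
ttest mpg == 20
```

Note that some later Stata releases use a slightly different syntax for ci (for example, ci means Y), so the hand calculation can be a useful check when moving between versions.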
Simple linear regression
 To fit a simple linear regression model (i.e., find a least squares line),
type regress Y X, where Y is the response variable and X is
the predictor variable. In the rare circumstance that you
wish to fit a model without an intercept term (regression through the origin), type
regress Y X, noconstant.
 To add a regression line or least squares line to a scatterplot,
type graph twoway (scatter Y X) (lfit Y X), where Y is the response variable and
X is the predictor variable.
 Stata displays 95% confidence intervals for the regression parameters in a
simple linear regression model by default. This applies more generally to multiple linear
regression also.
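The confidence intervals in the regress output can be reproduced from the stored estimates, which may help when a different confidence level or a one-sided bound is needed. The sketch below assumes the built-in auto dataset with mpg as the response and weight as the predictor. The scalar name tcrit is illustrative.

```stata
* Sketch only: auto dataset and the scalar name tcrit are illustrative.
sysuse auto, clear
regress mpg weight

* e(df_r) holds the residual degrees of freedom after regress
scalar tcrit = invttail(e(df_r), 0.025)

* 95% confidence limits for the slope, matching the regress output
display _b[weight] - scalar(tcrit)*_se[weight]
display _b[weight] + scalar(tcrit)*_se[weight]

* A different confidence level can also be requested directly
regress mpg weight, level(90)
```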

 To find a fitted value or predicted value of Y (the response
variable) at a particular value of X (the predictor variable), type predict yhat, xb after
fitting the model (see help #25). This sets variable yhat equal to the fitted or predicted
values of Y at each of the X-values in the dataset.
 You can also obtain a fitted or predicted value of Y at an X-value that is not in
the dataset by typing adjust X=a, ci level(95), where a is the particular X-value that we
are interested in. In multiple linear regression use, for example,
adjust X1=a X2=b, ci level(95).
 This applies more generally to multiple linear regression also.

 To find confidence intervals for the mean of Y at particular values of
X, first find the standard errors of estimation by typing predict see, stdp after fitting
the model (see help #25). Find the lower limits of the confidence intervals for the mean of Y at
each of the X-values in the dataset by typing generate lci = yhat-invttail(df,p)*see, where
yhat is the fitted or predicted values of Y (see computer help #28), and
invttail(df,p) is the appropriate t-percentile (see computer help #8). Find the upper limits similarly by typing generate uci = yhat+invttail(df,p)*see.
 You can also obtain a confidence interval for the mean of Y at an X-value that is
not in the dataset by typing adjust X=a, ci level(95), where a is the particular
X-value that we are interested in. In multiple linear regression use, for example,
adjust X1=a X2=b, ci level(95).
 This applies more generally to multiple linear regression also.
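The steps for confidence intervals for the mean of Y can be sketched end to end as follows. This assumes the built-in auto dataset (mpg as Y, weight as X) and takes the degrees of freedom from e(df_r), which regress stores, rather than typing them in. The scalar name tcrit is illustrative.

```stata
* Sketch only: auto dataset and the scalar name tcrit are illustrative.
sysuse auto, clear
regress mpg weight

* Fitted values and standard errors of estimation
predict double yhat, xb
predict double see, stdp

* t-percentile for 95% intervals, using the residual degrees of freedom
scalar tcrit = invttail(e(df_r), 0.025)

* Lower and upper confidence limits for the mean of Y at each X-value
generate double lci = yhat - scalar(tcrit)*see
generate double uci = yhat + scalar(tcrit)*see
list weight yhat lci uci in 1/5
```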

 To find prediction intervals for an individual value of Y at particular
values of X, first find the standard errors of prediction by typing predict sep, stdf after
fitting the model (see help #25). Find the lower limits of the prediction intervals for an
individual value of Y at each of the X-values in the dataset by typing
generate lpi = yhat-invttail(df,p)*sep, where yhat is the fitted or predicted
values of Y (see computer help #28), and invttail(df,p) is the appropriate t-percentile
(see computer help #8). Find the upper limits similarly by typing
generate upi = yhat+invttail(df,p)*sep.
 You can also obtain a prediction interval for an individual value of Y at an
X-value that is not in the dataset by typing adjust X=a, stdf ci level(95), where a is the particular
X-value that we are interested in. In multiple linear regression use, for example,
adjust X1=a X2=b, stdf ci level(95).
 This applies more generally to multiple linear regression also.
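A sketch of the prediction interval steps, again assuming the built-in auto dataset with illustrative names. The final line checks the relationship that explains why prediction intervals are wider than confidence intervals for the mean: the squared standard error of prediction equals the squared standard error of estimation plus the squared regression standard error (Root MSE in the regress output, stored in e(rmse)).

```stata
* Sketch only: auto dataset and the scalar name tcrit are illustrative.
sysuse auto, clear
regress mpg weight

predict double yhat, xb
predict double see, stdp
predict double sep, stdf

scalar tcrit = invttail(e(df_r), 0.025)

* Lower and upper prediction limits for an individual value of Y
generate double lpi = yhat - scalar(tcrit)*sep
generate double upi = yhat + scalar(tcrit)*sep
list weight yhat lpi upi in 1/5

* Check: sep^2 = see^2 + (Root MSE)^2 for every observation
assert abs(sep^2 - (see^2 + e(rmse)^2)) < 1e-8
```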
Multiple linear regression
 To fit a multiple linear regression model, type
regress Y X1 X2, where Y is the response variable and X1 and X2 are the predictor variables. In the rare
circumstance that you wish to fit a model without an intercept term (regression through the origin),
type regress Y X1 X2, noconstant.
 To add a quadratic regression line to a scatterplot, type
graph twoway (scatter Y X) (qfit Y X), where Y is the response variable and
X is the predictor variable.
 Categories of a qualitative variable can be thought of as defining subsets
of the sample. If there is also a quantitative response and a quantitative predictor variable in
the dataset, a regression model can be fit to the data to represent separate regression lines for
each subset. For example, suppose X2 contains the values 1-4 to represent four categories,
and Y and X1 are two quantitative variables. The following code produces a
scatterplot with different symbols representing the values of X2 and four separate
regression lines:
separate Y, by(X2)
graph twoway (scatter Y1 X1) (scatter Y2 X1) (scatter Y3 X1) (scatter Y4 X1) ///
(lfit Y1 X1) (lfit Y2 X1) (lfit Y3 X1) (lfit Y4 X1)
 To find the F-statistic and associated p-value for a nested model F-test in
multiple linear regression, first fit the complete model, for example,
regress Y X1 X2 X3. Then type, for example, test X2 X3 to test whether the regression parameters for both X2 and X3 are zero. Stata displays the
F-statistic and the associated p-value (labeled Prob > F).
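To see where the nested model F-statistic comes from, it can be rebuilt from the residual sums of squares of the complete and reduced models. This is a sketch under assumptions: the built-in auto dataset, price as the response, and illustrative scalar names (rssC, rssR, dfC, Ftest, Fhand). The divisor 2 is the number of parameters being tested.

```stata
* Sketch only: auto dataset and all scalar names are illustrative.
sysuse auto, clear

* Complete model, then the built-in nested model test
quietly regress price mpg weight length
scalar rssC = e(rss)
scalar dfC = e(df_r)
test weight length
scalar Ftest = r(F)

* Reduced model (drops the two predictors being tested)
quietly regress price mpg
scalar rssR = e(rss)

* F = [(RSS_reduced - RSS_complete)/2] / [RSS_complete/df_complete]
scalar Fhand = ((scalar(rssR) - scalar(rssC))/2) / (scalar(rssC)/scalar(dfC))
display scalar(Fhand)
display scalar(Ftest)
display Ftail(2, scalar(dfC), scalar(Fhand))
```

The first two displayed values should agree, and the third is the corresponding p-value.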
 To save residuals in a multiple linear regression model, type
predict res, residuals, after fitting the model (see help #31). The variable res
can now be used just like any other variable, for example, to construct residual plots. To save
what Pardoe (2012) calls standardized residuals, use rstandard in place of
residuals. To save what Pardoe (2012) calls studentized residuals, use
rstudent.
 To add a lowess fitted line to a scatterplot (useful for checking the zero
mean regression assumption in a residual plot), type, for example,
graph twoway (lowess stures yhat, bwidth(.75)) (scatter stures yhat), where stures are studentized residuals (see help #35), yhat are fitted or predicted values (see help
#28), and bwidth controls how wiggly the line is (lower means more wiggly).
 To save leverages in a multiple linear regression model, type
predict lev, leverage, after fitting the model (see help #31). The variable
lev can now be used just like any other variable, for example, to construct
scatterplots.
 To save Cook's distances in a multiple linear regression model, type
predict cook, cooksd, after fitting the model (see help #31). The variable
cook can now be used just like any other variable, for example, to construct
scatterplots.
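The commands for saving residuals, leverages, and Cook's distances can be used together to look for unusual cases. The sketch below assumes the built-in auto dataset and lists the five cases with the largest Cook's distances. The variable names res, lev, and cook follow the text, and make is the car name variable in that dataset.

```stata
* Sketch only: uses the built-in auto example dataset.
sysuse auto, clear
regress price mpg weight

predict res, residuals
predict lev, leverage
predict cook, cooksd

* Sort from largest to smallest Cook's distance and list the top five cases
gsort -cook
list make res lev cook in 1/5
```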
 To create some residual plots automatically in a multiple linear regression
model, type rvfplot after fitting the model (see help #31), which produces a plot of
residuals versus fitted values or type lvr2plot, which produces a plot of
leverage versus "normalized squared residuals." To create residual plots
manually, first create studentized residuals (see help #35), and then construct scatterplots with
these studentized residuals on the vertical axis.
 To create a correlation matrix of quantitative variables (useful for
checking potential multicollinearity problems), type correlate Y X1 X2, where
Y, X1, and X2 are quantitative variables.
 To find variance inflation factors in a multiple linear regression model,
type vif after fitting the model (see help #31).
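A variance inflation factor can also be computed by hand, which shows what vif reports: for a given predictor, regress it on the other predictors and compute 1/(1 - R-squared). The sketch assumes the built-in auto dataset. In some later Stata releases the documented form of the command is estat vif.

```stata
* Sketch only: uses the built-in auto example dataset.
sysuse auto, clear
regress price mpg weight length
vif

* Variance inflation factor for weight, computed by hand:
* regress weight on the other predictors and use that model's R-squared
quietly regress weight mpg length
display 1/(1 - e(r2))
```

The displayed value should match the entry for weight in the vif table.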
 To draw a predictor effect plot for graphically displaying the effects of
transformed quantitative predictors and/or interactions between quantitative and qualitative
predictors in multiple linear regression, first create a variable representing the effect, say, "X1effect" (see help #6).
 If the "X1effect" variable just involves X1 (e.g., 1 + 3X1 + 4X1^{2}),
type graph twoway connected X1effect X1, msymbol(none) sort.
 If the "X1effect" variable involves a qualitative variable
(e.g., 1 − 2X1 + 3D2X1, where D2 is an indicator variable), type
graph twoway connected X1effect X1, by(D2) msymbol(none) sort.
See Section 5.5 in Pardoe (2012) for an example.
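As a sketch of the first case, the effect variable can be built from the estimated coefficients of a quadratic model rather than typed in by hand. This assumes the built-in auto dataset, and the names weightsq and weighteffect are made up for the example. With only one predictor in the model, the effect is simply the fitted quadratic curve.

```stata
* Sketch only: auto dataset and new variable names are illustrative.
sysuse auto, clear
generate weightsq = weight^2
regress mpg weight weightsq

* Effect variable built from the estimated regression parameters
generate weighteffect = _b[_cons] + _b[weight]*weight + _b[weightsq]*weightsq

* Predictor effect plot: a line with no markers, sorted by the X variable
graph twoway connected weighteffect weight, msymbol(none) sort
```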
Last updated: Feb, 2016
© 2016, Iain Pardoe