Stata instructions

These instructions accompany Applied Regression Modeling by Iain Pardoe, 2nd edition published by Wiley in 2012. The numbered items cross-reference with the "computer help" references in the book. These instructions are based on Stata 8 for Windows, but they should also work for other versions. Find instructions for other statistical software packages here.

Getting started and summarizing univariate data

  1. Change Stata's default options by selecting ?.
  2. To open a Stata data file, type use "file.dta", where file.dta is the name of the data file (with the correct path specified if necessary). You can also import text or Excel data files using the Text Import Wizard by selecting File > Browse.
  3. To recall a previously entered command, single-click it in the "Review" window.
  4. Output appears in the "Stata Results" window and can be copied and pasted from Stata to a word processor like OpenOffice Writer or Microsoft Word. Graphs appear in separate windows and can also easily be copied using Edit > Copy Graph and then pasted to other applications.
  5. You can access help by selecting Help > Contents. To find out about a particular topic click Help > Search or to find out about a particular Stata command click Help > Stata Command.
  6. To transform data or compute a new variable, type, for example, generate logX=ln(X) for the natural logarithm of X and generate Xsq=X^2 for X squared. If you get the error message "[?]," this means that there is a syntax error in your expression. A common mistake is to forget the multiplication symbol (*) between a number and a variable (e.g., 2*X represents 2X).
  7. To create indicator (dummy) variables from a qualitative variable, type, for example, generate D1=(X=="level"), where X is the qualitative variable and level is the name of one of the categories in X. Repeat for other indicator variables (if necessary).
  8. Calculate descriptive statistics for quantitative variables by typing summarize Y, where Y is the quantitative variable. Type summarize Y, detail to include more statistics, such as the median and other percentiles. To request specific statistics only, type, for example, tabstat Y, statistics(mean median sd min max).
  9. Create contingency tables or cross-tabulations for qualitative variables by typing tabulate X1 X2, where X1 and X2 are the qualitative variables. Calculate row percentages by typing tabulate X1 X2, row. Calculate column percentages by typing tabulate X1 X2, column.
  10. If you have a quantitative variable and a qualitative variable, you can calculate descriptive statistics for cases grouped in different categories by typing, for example,
    sort X
    by X: summarize Y
    where Y is the quantitative variable and X is the qualitative variable.
  11. To make a stem-and-leaf plot for a quantitative variable, type stem Y, round(d), where Y is the quantitative variable and d controls the rounding (e.g., "1" for integers, "0.1" for tenths, etc.).
  12. To make a histogram for a quantitative variable, type histogram Y, bin(10), where Y is the quantitative variable and bin specifies the number of bins.
  13. To make a scatterplot with two quantitative variables, type graph twoway scatter Y X, where Y is the vertical axis variable and X is the horizontal axis variable.
  14. All possible scatterplots for more than two variables can be drawn simultaneously (called a scatterplot matrix) by typing graph matrix Y X1 X2, where Y, X1, and X2 are quantitative variables.
  15. You can mark or label cases in a scatterplot with different colors/symbols according to categories in a qualitative variable by using the separate command. For example, suppose X2 contains values 1 and 2 to represent two categories, and Y and X1 are two quantitative variables. Then the following code produces a scatterplot with different symbols (representing the values of X2) marking the points:
    separate Y, by(X2)
    graph twoway (scatter Y1 X1, mlabel(X2)) (scatter Y2 X1, mlabel(X2))
  16. You can identify individual cases in a scatterplot by using the mlabel option, for example, graph twoway scatter Y X, mlabel(id), where X is the horizontal axis variable, Y is the vertical axis variable, and id is a variable containing labels for the points.
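  17. To remove one or more observations from a dataset, type, for example, drop if ID==1, which would remove the observation with ID 1. Stata commands are case-sensitive, so drop must be typed in lowercase.
  18. To make a bar chart for cases in different categories, use graph bar. For example, graph bar (mean) Y, over(X) plots the mean of the quantitative variable Y for each category of the qualitative variable X.
  19. To make boxplots for cases in different categories, use graph box. For example, graph box Y, over(X) draws a boxplot of the quantitative variable Y for each category of the qualitative variable X.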
  20. To make a QQ-plot (also known as a normal probability plot) for a quantitative variable, type qnorm Y, where Y is a quantitative variable.
  21. To compute a confidence interval for a univariate population mean, type ci Y, level(95), where Y is the variable for which you want to calculate the confidence interval, and the value in parentheses after level is the confidence level of the interval, expressed as a percentage. More recent versions of Stata document this command as ci means Y, level(95).
  22. To do a hypothesis test for a univariate population mean, type ttest Y==value, where Y is the variable for which you want to do the test and value is the (null) hypothesized value.
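
The steps above can be tried out in a short session. The following is a minimal sketch, assuming Stata's built-in auto example dataset (loaded with sysuse auto) rather than one of the book's data files; the variables mpg, foreign, and rep78 belong to that dataset, and logmpg and mpgsq are illustrative names, not names used in the book.

```stata
* Sketch only: the auto dataset and all variable names are assumptions.
sysuse auto, clear

* Compute new variables: natural logarithm and square of mpg
generate logmpg = ln(mpg)
generate mpgsq = mpg^2

* Descriptive statistics, overall and within each category of foreign
summarize mpg, detail
sort foreign
by foreign: summarize mpg

* Cross-tabulation of two qualitative variables with row percentages
tabulate rep78 foreign, row

* Histogram with 10 bins, then a normal QQ-plot
histogram mpg, bin(10)
qnorm mpg

* Test whether the population mean of mpg equals 20
* (the output also reports a 95% confidence interval for the mean)
ttest mpg == 20
```

Each graph command replaces the previous graph window, so copy or save a graph before drawing the next one.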

Simple linear regression

  1. To fit a simple linear regression model (i.e., find a least squares line), type regress Y X, where Y is the response variable and X is the predictor variable. In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), type regress Y X, noconstant.
  2. To add a regression line or least squares line to a scatterplot, type graph twoway (scatter Y X) (lfit Y X), where Y is the response variable and X is the predictor variable.
  3. Stata displays 95% confidence intervals for the regression parameters in a simple linear regression model by default. This applies more generally to multiple linear regression also.
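
As a minimal sketch of the three items above, again assuming Stata's built-in auto example dataset (with mpg as the response and weight as the predictor, both chosen only for illustration):

```stata
* Sketch only: the auto dataset and the choice of variables are assumptions.
sysuse auto, clear

* Fit the least squares line; 95% confidence intervals for the
* regression parameters appear in the output by default
regress mpg weight

* Scatterplot with the fitted least squares line added
graph twoway (scatter mpg weight) (lfit mpg weight)

* Refit with 90% intervals instead, using the level() option of regress
regress mpg weight, level(90)
```

The level() option is not mentioned in the items above; it is shown here only to illustrate how to change the default confidence level.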

Multiple linear regression

  1. To fit a multiple linear regression model, type regress Y X1 X2, where Y is the response variable and X1 and X2 are the predictor variables. In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), type regress Y X1 X2, noconstant.
  2. To add a quadratic regression line to a scatterplot, type graph twoway (scatter Y X) (qfit Y X), where Y is the response variable and X is the predictor variable.
  3. Categories of a qualitative variable can be thought of as defining subsets of the sample. If there is also a quantitative response and a quantitative predictor variable in the dataset, a regression model can be fit to the data to represent separate regression lines for each subset. For example, suppose X2 contains the values 1-4 to represent four categories, and Y and X1 are two quantitative variables. The following code produces a scatterplot with different symbols representing the values of X2 and four separate regression lines:
    separate Y, by(X2)
    graph twoway (scatter Y1 X1) (scatter Y2 X1) (scatter Y3 X1) (scatter Y4 X1) ///
    (lfit Y1 X1) (lfit Y2 X1) (lfit Y3 X1) (lfit Y4 X1)
  4. To find the F-statistic and associated p-value for a nested model F-test in multiple linear regression, first fit the complete model, for example, regress Y X1 X2 X3. Then type, for example, test X2 X3 to test whether the regression parameters for both X2 and X3 are zero. Stata displays the F-statistic and the associated p-value (labeled Prob > F).
  5. To save residuals in a multiple linear regression model, type predict res, residuals, after fitting the model (see help #31). The variable res can now be used just like any other variable, for example, to construct residual plots. To save what Pardoe (2012) calls standardized residuals, use rstandard in place of residuals. To save what Pardoe (2012) calls studentized residuals, use rstudent.
  6. To add a lowess fitted line to a scatterplot (useful for checking the zero mean regression assumption in a residual plot), type, for example, graph twoway (lowess stures yhat, bwidth(.75)) (scatter stures yhat), where stures are studentized residuals (see help #35), yhat are fitted or predicted values (see help #28), and bwidth controls how wiggly the line is (lower means more wiggly).
  7. To save leverages in a multiple linear regression model, type predict lev, leverage, after fitting the model (see help #31). The variable lev can now be used just like any other variable, for example, to construct scatterplots.
  8. To save Cook's distances in a multiple linear regression model, type predict cook, cooksd, after fitting the model (see help #31). The variable cook can now be used just like any other variable, for example, to construct scatterplots.
  9. To create some residual plots automatically in a multiple linear regression model, type rvfplot after fitting the model (see help #31), which produces a plot of residuals versus fitted values, or type lvr2plot, which produces a plot of leverage versus "normalized squared residuals." To create residual plots manually, first create studentized residuals (see help #35), and then construct scatterplots with these studentized residuals on the vertical axis.
  10. To create a correlation matrix of quantitative variables (useful for checking potential multicollinearity problems), type correlate Y X1 X2, where Y, X1, and X2 are quantitative variables.
  11. To find variance inflation factors in a multiple linear regression model, type vif after fitting the model (see help #31). More recent versions of Stata document this command as estat vif.
  12. To draw a predictor effect plot for graphically displaying the effects of transformed quantitative predictors and/or interactions between quantitative and qualitative predictors in multiple linear regression, first create a variable representing the effect, say, "X1effect" (see help #6). See Section 5.5 in Pardoe (2012) for an example.
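
The model-fitting and diagnostic steps above can be chained together in one session. The following is a minimal sketch, assuming Stata's built-in auto example dataset; the predictors weight, length, and foreign are chosen only for illustration, and yhat, stures, lev, and cook are hypothetical names for the saved variables.

```stata
* Sketch only: the auto dataset and all variable names are assumptions.
sysuse auto, clear

* Fit the complete model
regress mpg weight length foreign

* Nested model F-test: are the parameters for length and foreign both zero?
test length foreign

* Save fitted values, studentized residuals, leverages, and Cook's distances
predict yhat, xb
predict stures, rstudent
predict lev, leverage
predict cook, cooksd

* Residual plot with a lowess line to check the zero mean assumption
graph twoway (lowess stures yhat, bwidth(.75)) (scatter stures yhat)

* Correlation matrix to check for potential multicollinearity
correlate mpg weight length
```

The predict commands use the most recently fitted model, so run them before fitting any other regression.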

Last updated: Feb, 2016

© 2016, Iain Pardoe