
Showing papers in "Stata Journal in 2010"


Journal ArticleDOI
TL;DR: In this paper, the authors present an alternative to interpreting interaction effects in terms of marginal effects: exponentiating the regression coefficients, which gives an odds ratio or incidence-rate ratio.
Abstract: When estimating a non-linear model such as [R] logit or [R] poisson, we often have two options when it comes to interpreting the regression coefficients: compute some form of marginal effect; or exponentiate the coefficients, which will give us an odds ratio or incidence-rate ratio. The marginal effect is an approximation of how much the dependent variable is expected to increase or decrease for a unit change in an explanatory variable: that is, the effect is presented on an additive scale. The exponentiated coefficients give the ratio by which the dependent variable changes for a unit change in an explanatory variable: that is, the effect is presented on a multiplicative scale. An extensive overview is given by Long and Freese (2006). Sometimes we are also interested in how the effect of one variable changes when another variable changes, namely, the interaction effect. As there is more than one way in which we can define an effect in a non-linear model, there must also be more than one way in which we can define an interaction effect. This tip deals with how to interpret these interaction effects when we want to present effects as odds ratios or incidence-rate ratios. This can be an attractive alternative to interpreting interaction effects in terms of marginal effects.
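
A minimal illustration of the kind of model the tip addresses, using the nlsw88 dataset that ships with Stata (not an example from the article itself); with factor-variable notation and the or option, the exponentiated interaction coefficient is itself a ratio of odds ratios:

    sysuse nlsw88, clear
    logit union i.south##i.collgrad, or
    * The interaction odds ratio equals the odds ratio of south among
    * college graduates divided by the odds ratio of south among
    * non-graduates: an interaction effect on the multiplicative scale.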

444 citations


Journal ArticleDOI
TL;DR: In this article, the authors compare the predictive power of different models by fitting alternative models to a training set of data, and then measuring and comparing their predictive powers by using out-of-sample prediction and somersd in a test set to produce statistically sensible confidence intervals and p-values for the differences between the predictive powers.
Abstract: Medical researchers frequently make statements that one model predicts survival better than another, and they are frequently challenged to provide rigorous statistical justification for those statements. Stata provides the estat concordance command to calculate the rank parameters Harrell's C and Somers' D as measures of the ordinal predictive power of a model. However, no confidence limits or p-values are provided to compare the predictive power of distinct models. The somersd package, downloadable from Statistical Software Components, can provide such confidence intervals, but they should not be taken seriously if they are calculated in the dataset in which the model was fit. Methods are demonstrated for fitting alternative models to a training set of data, and then measuring and comparing their predictive powers by using out-of-sample prediction and somersd in a test set to produce statistically sensible confidence intervals and p-values for the differences between the predictive powers of different models.
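
A hedged sketch of the train/test workflow the abstract describes; the 0/1 split indicator test is invented, and the somersd option spellings (cenind(), transf(), tdist) are recalled from the package documentation and should be verified locally:

    stcox x1 x2 if !test                  // model A, fit on the training half
    predict double xbA, xb
    stcox x1 x2 x3 if !test               // model B, fit on the training half
    predict double xbB, xb
    generate byte censind = 1 - _d        // right-censoring indicator for somersd
    somersd _t xbA xbB if test, cenind(censind) transf(c) tdist
    lincom xbA - xbB                      // CI and p-value for the difference in Harrell's C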

350 citations


Journal ArticleDOI
TL;DR: An iterative approach for the estimation of linear regression models with high-dimensional fixed effects with minimum memory requirements is described, and it is shown that the approach can be extended to nonlinear models and to more than two high-dimensional fixed effects.
Abstract: In this article, we describe an iterative approach for the estimation of linear regression models with high-dimensional fixed effects. This approach is computationally intensive but imposes minimum memory requirements. We show that the approach can be extended to nonlinear models and to more than two high-dimensional fixed effects.
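
Not the authors' code, but a minimal sketch of the alternating within-transformation idea behind such estimators, assuming two fixed-effect identifiers named firm and worker; by Frisch-Waugh, regressing the fully demeaned variables on each other recovers the fixed-effects slope (standard errors would still need a degrees-of-freedom correction):

    generate double ydm = y
    generate double xdm = x
    forvalues s = 1/100 {                 // real implementations iterate to convergence
        foreach v in ydm xdm {
            tempvar m
            egen double `m' = mean(`v'), by(firm)
            quietly replace `v' = `v' - `m'
            drop `m'
            tempvar m
            egen double `m' = mean(`v'), by(worker)
            quietly replace `v' = `v' - `m'
            drop `m'
        }
    }
    regress ydm xdm                       // slope matches the two-way fixed-effects estimate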

271 citations


Journal ArticleDOI
TL;DR: This article discusses a method by Erikson et al. for decomposing a total effect in a logit model into direct and indirect effects and shows how to include control variables in this decomposition, which was not allowed in the original method.
Abstract: This article discusses a method by Erikson et al. (2005) for decomposing a total effect in a logit model into direct and indirect effects. Moreover, this article extends this method in three ways. First, in the original method the variable through which the indirect effect occurs is assumed to be normally distributed. In this article the method is generalized by allowing this variable to have any distribution. Second, the original method did not provide standard errors for the estimates. In this article the bootstrap is proposed as a method of providing those. Third, I show how to include control variables in this decomposition, which was not allowed in the original method. The original method and these extensions are implemented in the ldecomp package.

202 citations


Journal ArticleDOI
TL;DR: In this article, the authors discuss the implementation of various estimators proposed to estimate quantile treatment effects (QTE), and distinguish four cases: conditional and unconditional QTE with exogenous or endogenous treatment variable.
Abstract: In this paper, we discuss the implementation of various estimators proposed to estimate quantile treatment effects (QTE). We distinguish four cases: conditional and unconditional QTE with exogenous or endogenous treatment variable. Therefore, the ivqte command covers four different estimators: the classical quantile regression estimator of Koenker and Bassett (1978) extended to heteroskedasticity consistent standard errors, the IV quantile regression estimator of Abadie, Angrist, and Imbens (2002), the estimator for unconditional QTE proposed by Firpo (2007), and the IV estimator for unconditional QTE proposed by Frolich and Melly (2007). The implemented IV procedures estimate the causal effects for the sub-population of compliers and are well-suited for binary instruments only. This command also provides analytical standard errors and various options for nonparametric estimation. As a by-product, the command locreg implements local linear and local logit estimators for mixed data (continuous, ordered discrete, unordered discrete and binary regressors).
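
A hedged usage sketch with invented variable names; the syntax below (treatment and instrument in parentheses, a quantiles() option) follows the command description in the abstract and should be checked against the ivqte help file:

    ivqte y (d = z), quantiles(0.25 0.5 0.75)   // unconditional QTE with endogenous d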

192 citations


Journal ArticleDOI
TL;DR: In this paper, the authors show how the Stata program oglm (ordinal generalized linear models) can be used to fit heterogeneous choice and related models, such as location-scale models or heteroskedastic ordered models.
Abstract: When a binary or ordinal regression model incorrectly assumes that error variances are the same for all cases, the standard errors are wrong and (unlike ordinary least squares regression) the parameter estimates are biased. Heterogeneous choice models (also known as location-scale models or heteroskedastic ordered models) explicitly specify the determinants of heteroskedasticity in an attempt to correct for it. Such models are also useful when the variance itself is of substantive interest. This article illustrates how the author's Stata program oglm (ordinal generalized linear models) can be used to fit heterogeneous choice and related models. It shows that two other models that have appeared in the literature (Allison's model for group comparisons and Hauser and Andrew's logistic response model with proportionality constraints) are special cases of a heterogeneous choice model and alternative parameterizations of it. The article further argues that heterogeneous choice models may sometimes be an attractive alternative to other ordinal regression models, such as the generalized ordered logit model fit by gologit2. Finally, the article offers guidelines on how to interpret, test, and modify heterogeneous choice models.
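
A hedged usage sketch with invented variable names; hetero() is the option spelling recalled from the oglm documentation and should be verified:

    oglm ses educ income female, hetero(female)   // variance equation depends on female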

180 citations


Journal ArticleDOI
TL;DR: The main approaches to resampling variance estimation in complex survey data: balanced repeated replication, the jackknife, and the bootstrap are discussed.
Abstract: In this article, I discuss the main approaches to resampling variance estimation in complex survey data: balanced repeated replication, the jackknife, and the bootstrap. Balanced repeated replication and the jackknife are implemented in the Stata svy suite. The bootstrap for complex survey data is implemented by the bsweights command. I describe this command and provide working examples. Editors' note. This article was submitted and accepted before the new svy bootstrap prefix was made available in the Stata 11.1 update. The variance estimation method implemented in the new svy bootstrap prefix is equivalent to the one in bs4rw; the only real difference is syntax.
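
A hedged sketch of the replicate-weight workflow the article describes; psu, stratum, wgt, and y are placeholders, and the bsweights options and svyset's bsrweight() spelling are from memory and should be verified:

    svyset psu [pweight = wgt], strata(stratum)
    bsweights bw, reps(500) n(-1)         // create 500 bootstrap replicate weights
    svyset psu [pweight = wgt], strata(stratum) bsrweight(bw*) vce(bootstrap)
    svy: mean y                           // variance estimated from the replicate weights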

146 citations


Journal ArticleDOI
TL;DR: Although identifying outliers in multivariate data is computationally intensive, the bacon command, presented in this article, quickly identifies outliers even in large datasets of tens of thousands of observations.
Abstract: Identifying outliers in multivariate data is computationally intensive. The bacon command, presented in this article, allows one to quickly identify outliers, even on large datasets of tens of thousands of observations.
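
A hedged usage sketch; the generate() and percentile() option spellings are assumptions and should be checked against the bacon help file:

    bacon x1 x2 x3, generate(outlier) percentile(0.15)
    list x1 x2 x3 if outlier == 1         // inspect the flagged observations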

145 citations


Journal ArticleDOI
TL;DR: A user-written data envelopment analysis command for Stata that allows users to conduct the standard optimization procedure and extended managerial analysis; it constructs a linear programming model based on the selected dea options.
Abstract: In this article, we introduce a user-written data envelopment analysis command for Stata. Data envelopment analysis is a linear programming method for assessing the efficiency and productivity of decision-making units.
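
A hedged usage sketch with invented input and output variables; the rts() and ort() spellings follow the abstract's mention of dea options and should be verified:

    dea labor capital = sales profit, rts(vrs) ort(in)   // input-oriented, variable returns to scale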

131 citations


Journal ArticleDOI
TL;DR: In this article, a method based on pseudovalues is proposed for direct regression modeling of the survival function, the restricted mean, and the cumulative incidence function in competing risks with right-censored data.
Abstract: We draw upon a series of articles in which a method based on pseudovalues is proposed for direct regression modeling of the survival function, the restricted mean, and the cumulative incidence function in competing risks with right-censored data. The models, once the pseudovalues have been computed, can be fit using standard generalized estimating equation software. Here we present Stata procedures for computing these pseudo-observations. An example from a bone marrow transplantation study is used to illustrate the method.
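
A hedged sketch of the pseudo-observation workflow; the stpsurv command name follows the article's description of the procedures, but its at()/generate() option spellings should be treated as assumptions:

    stset time, failure(died)
    stpsurv, at(60) generate(ps60)        // pseudo-observations for S(60)
    glm ps60 age donor, vce(robust)       // GEE with independence working correlation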

129 citations


Journal ArticleDOI
TL;DR: The meta-analysis command metaan as mentioned in this paper can be used to perform fixed- or random-effects meta-analysis, and it can report a variety of heterogeneity measures, including Cochran's Q, I2, H2M, and the between-studies variance estimate τ^2.
Abstract: This article describes the new meta-analysis command metaan, which can be used to perform fixed- or random-effects meta-analysis. Besides the standard DerSimonian and Laird approach, metaan offers a wide choice of available models: maximum likelihood, profile likelihood, restricted maximum likelihood, and a permutation model. The command reports a variety of heterogeneity measures, including Cochran's Q, I2, H2M, and the between-studies variance estimate τ^2. A forest plot and a graph of the maximum likelihood function can also be generated.
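
A hedged usage sketch, assuming one row per study with effect size es and standard error se; the reml and forest option spellings are consistent with the abstract but should be verified:

    metaan es se, reml forest             // REML random-effects model plus a forest plot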

Journal ArticleDOI
TL;DR: A new Stata program, vselect, is presented that helps users perform variable selection after performing a linear regression and provides options for stepwise methods such as forward selection and backward elimination.
Abstract: We present a new Stata program, vselect, that helps users perform variable selection after performing a linear regression. Options for stepwise methods such as forward selection and backward elimin...

Journal ArticleDOI
TL;DR: The qqvalue command uses an alternative formulation of multiple-test procedures, also used by the R function p.adjust: it outputs a variable of q-values that are equal in each observation to the minimum familywise error rate or false discovery rate that would result in the inclusion of the corresponding p-value in the discovery set.
Abstract: Multiple-test procedures are increasingly important as technology in- creases scientists' ability to make large numbers of multiple measurements, as they do in genome scans. Multiple-test procedures were originally defined to input a vector of input p-values and an uncorrected critical p-value, interpreted as a fami- lywise error rate or a false discovery rate, and to output a corrected critical p-value and a discovery set, defined as the subset of input p-values that are at or below the corrected critical p-value. A range of multiple-test procedures is implemented us- ing the smileplot package in Stata (Newson and the ALSPAC Study Team 2003, Stata Journal 3: 109-132; 2010, Stata Journal 10: 691-692). The qqvalue com- mand uses an alternative formulation of multiple-test procedures, which is also used by the R function p.adjust. qqvalue inputs a variable of p-values and out- puts a variable of q-values that are equal in each observation to the minimum familywise error rate or false discovery rate that would result in the inclusion of the corresponding p-value in the discovery set if the specified multiple-test pro- cedure was applied to the full set of input p-values. Formulas and examples are presented.
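
A hedged usage sketch, with p holding the input p-values; the method() and qvalue() option spellings are recalled from the package help and should be verified:

    qqvalue p, method(simes) qvalue(q)    // q-values for the Simes (Benjamini-Hochberg) FDR procedure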

Journal ArticleDOI
TL;DR: In this paper, the minimum covariance determinant estimator is used to estimate location parameters and multivariate scales, which can be used to robustify Mahalanobis distances and to identify outliers.
Abstract: Before implementing any multivariate statistical analysis based on empirical covariance matrices, it is important to check whether outliers are present, because their existence could induce significant biases. In this article, we present the minimum covariance determinant estimator, which is commonly used in robust statistics to estimate location parameters and multivariate scales. These estimators can be used to robustify Mahalanobis distances and to identify outliers. Verardi and Croux (2009, Stata Journal 9: 439-453; 2010, Stata Journal 10: 313) programmed this estimator in Stata and made it available with the mcd command. The implemented algorithm is relatively fast and, as we show in the simulation example section, outperforms the methods already available in Stata, such as the Hadi method.

Journal ArticleDOI
TL;DR: A new command, apcfit, is introduced that implements in Stata a method modeling age, period, and cohort as continuous variables through the use of spline functions.
Abstract: Age-period-cohort models provide a useful method for modeling incidence and mortality rates. It is well known that age-period-cohort models suffer from an identifiability problem due to the exact relationship between the variables (cohort = period − age). In 2007, Carstensen published an article advocating the use of an analysis that models age, period, and cohort as continuous variables through the use of spline functions (Carstensen, 2007, Statistics in Medicine 26: 3018-3045). Carstensen implemented his method for age-period-cohort models in the Epi package for R. In this article, a new command is introduced, apcfit, that performs the methods in Stata. The identifiability problem is overcome by forcing constraints on either the period or cohort effects. The use of the command is illustrated through an example relating to the incidence of colon cancer in Finland. The example shows how to include covariates in the analysis.

Journal ArticleDOI
TL;DR: A new Stata command, simsum, analyzes data from simulation studies, which may comprise point estimates and standard errors from several analysis methods, possibly resulting from several different simulation settings.
Abstract: A new Stata command, simsum, analyzes data from simulation studies. The data may comprise point estimates and standard errors from several analysis methods, possibly resulting from several different simulation settings.
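
A hedged usage sketch, assuming one observation per repetition-method combination with point estimate b and standard error se; the option spellings follow the command's description and should be checked against its help file:

    simsum b, true(0.5) se(se) methodvar(method) id(rep) bias cover empse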

Journal ArticleDOI
TL;DR: An improved parameterization of fixed-effects models using sum-to-zero constraints that provides estimates of fixed effects relative to mean effects within well-defined reference groups and provides standard errors for those estimates that are appropriate for shrinkage estimation.
Abstract: Availability of large, multilevel longitudinal databases in various fields, including labor economics (with workers and firms observed over time) and education research (with students and teachers observed over time), has increased the application of panel-data models with multiple levels of fixed effects. Existing software routines for fitting fixed-effects models were not designed for applications in which the primary interest is obtaining estimates of any of the fixed-effects parameters. Such routines typically report estimates of fixed effects relative to arbitrary holdout units. Contrasts to holdout units are not ideal in cases where the fixed-effects parameters are of interest because they can change capriciously, they do not correspond to the structural parameters that are typically of interest, and they are inappropriate for empirical Bayes (shrinkage) estimation. We develop an improved parameterization of fixed-effects models using sum-to-zero constraints that provides estimates of fixed effects relative to mean effects within well-defined reference groups (e.g., all firms of a given type or all teachers of a given grade) and provides standard errors for those estimates that are appropriate for shrinkage estimation. We implement our parameterization in a Stata routine called felsdvregdm by modifying the felsdvreg routine designed for fitting high-dimensional fixed-effects models. We demonstrate our routine with an example dataset from the Florida Education Data Warehouse.

Journal ArticleDOI
TL;DR: Sample skewness and kurtosis are limited by functions of sample size; as discussed by the authors, the limits, or approximations to them, have repeatedly been rediscovered over the last several decades but nevertheless seem to remain only poorly known.
Abstract: Sample skewness and kurtosis are limited by functions of sample size. The limits, or approximations to them, have repeatedly been rediscovered over the last several decades, but nevertheless seem to remain only poorly known. The limits impart bias to estimation and, in extreme cases, imply that no sample could bear exact witness to its parent distribution. The main results are explained in a tutorial review, and it is shown how Stata and Mata may be used to confirm and explore their consequences. Thiele did not use all the now-standard terminology. The names standard deviation, skewness, and kurtosis we owe to Karl Pearson, and the name variance we owe to Ronald Aylmer Fisher (David 2001). Much of the impact of moments can be traced to these two statisticians. Pearson was a vigorous proponent of using moments in distribution curve fitting. His own system of probability distributions pivots on varying skewness, measured relative to the mode. Fisher's advocacy of maximum likelihood as a superior estimation method was combined with his exposition of variance as central to statistical thinking. The many editions of Fisher's 1925 text Statistical Methods for Research Workers, and of texts that in turn drew upon its approach, have introduced several generations to the ideas of skewness and kurtosis. Much more detail on this history is given by Walker (1929), Hald (1998, 2007), and Fiori and Zenga (2009).
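
The sample-size dependence is easy to see empirically with official commands only: the most extreme sample of size n, a single 1 among n − 1 zeros, attains the bounds.

    clear
    set obs 100
    generate x = (_n == 1)
    summarize x, detail
    display "n = " r(N) "  skewness = " r(skewness) "  kurtosis = " r(kurtosis)
    * For n = 100 this gives skewness of about 9.85 and kurtosis of about 98,
    * the largest values any sample of that size can show.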

Journal ArticleDOI
TL;DR: A Stata command that estimates a probability distribution using a maximum entropy or minimum cross-entropy criterion is introduced and it is shown how this command can be used to calibrate survey data to various population totals.
Abstract: Maximum entropy and minimum cross-entropy estimation are applicable when faced with ill-posed estimation problems. I introduce a Stata command that estimates a probability distribution using a maximum entropy or minimum cross-entropy criterion and show how it can be used to calibrate survey data to various population totals.

Journal ArticleDOI
TL;DR: The authors describe two multivariate distributions, the skew-normal and the skew-t, which are flexible distributions for modeling nonnormal data.
Abstract: Nonnormal data arise often in practice, prompting the development of flexible distributions for modeling such situations. In this article, we describe two multivariate distributions, the skew-normal and the skew-t.

Journal ArticleDOI
TL;DR: In this paper, the authors present a Stata tool called ARTPEP, which is intended to project the power and events of a trial with a time-to-event outcome into the future given patient accrual figures so far and assumptions about event rates and other defining parameters.
Abstract: In 2005, Barthel, Royston, and Babiker presented a menu-driven Stata program under the generic name of ART (assessment of resources for trials) to calculate sample size and power for complex clinical trial designs with a time-to-event or binary outcome. In this article, we describe a Stata tool called ARTPEP, which is intended to project the power and events of a trial with a time-to-event outcome into the future, given patient accrual figures so far and assumptions about event rates and other defining parameters. ARTPEP has been designed to work closely with the ART program and has an associated dialog box. We illustrate the use of ARTPEP with data from a phase III trial in esophageal cancer.

Journal ArticleDOI
TL;DR: In this article, the authors describe commands for generating spatial-effect variables for monadic contagion as well as for all possible forms of contagion in dyadic data, where the unit of analysis is the pair or dyad representing an interaction or a relation between two individual units, agents, or actors.
Abstract: Spatial dependence exists whenever the expected utility of one unit of analysis is affected by the decisions or behavior made by other units of analysis. Spatial dependence is ubiquitous in social relations and interactions. Yet, there are surprisingly few social science studies accounting for spatial dependence. This holds true for settings in which researchers use monadic data, where the unit of analysis is the individual unit, agent, or actor, and even more true for dyadic data settings, where the unit of analysis is the pair or dyad representing an interaction or a relation between two individual units, agents, or actors. Dyadic data offer more complex ways of modeling spatial-effect variables than do monadic data. The commands described in this article facilitate spatial analysis by providing an easy tool for generating, with one command line, spatial-effect variables for monadic contagion as well as for all possible forms of contagion in dyadic data.

Journal ArticleDOI
TL;DR: The SPost user package (Long and Freese, 2006, Regression Models for Categorical Dependent Variables Using Stata [Stata Press]) is a suite of postestimation commands that compute additional tests for regression models for categorical dependent variables.
Abstract: The SPost user package (Long and Freese, 2006, Regression Models for Categorical Dependent Variables Using Stata [Stata Press]) is a suite of postestimation commands to compute additional tests and...

Journal ArticleDOI
TL;DR: In this article, the authors proposed a method to adjust for informative dropout in longitudinal data analysis by combining the restricted it- erative generalized least-squares method with a nested expectation-maximization algorithm.
Abstract: Many studies in various research areas have designs that involve repeated measurements over time of a continuous variable across a group of subjects. A frequent and serious problem in such studies is the occurrence of missing data. In many cases, missing data are caused by an event that leads to a premature termination of the series of repeated measurements on some subjects. When the probability of the occurrence of this event is related to the subject-specific underlying trend of the variable of interest, this missingness process is called informative censoring or informative drop-out. Standard likelihood-based methods (for example, linear mixed models) fail to give consistent estimates. In such cases, one needs to apply methods that simultaneously model the observed data and the missingness process. In this article, we review a method proposed by Touloumi et al. (1999, Statistics in Medicine 18: 1215-1233) to adjust for informative drop-out in longitudinal data analysis. We also present the jmre1 command, which can be used to fit the proposed model. The estimation method combines the restricted iterative generalized least-squares method with a nested expectation-maximization algorithm. The method is implemented mainly using Stata's matrix programming language, Mata. Our example is derived from the epidemiology of the HIV infection.

Journal ArticleDOI
TL;DR: The new margins command, available in Stata 11, greatly simplifies many postestimation computations and can be used to compute quantities such as elasticities and semielasticities with a much simpler command syntax than that previously available.
Abstract: The new margins command, available in Stata 11, greatly simplifies many postestimation computations. Discussions of this command have justifiably highlighted its capabilities to work with factor variables and expressions involving factor-variable operators, such as c.x#c.x. But you may find another aspect of margins very useful: its ability to compute quantities such as elasticities and semielasticities with a much simpler command syntax than that previously available.
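
For example, with official syntax (auto.dta ships with Stata), eyex() requests elasticities while dyex() and eydx() request the two semielasticities:

    sysuse auto, clear
    regress mpg c.weight i.foreign
    margins, eyex(weight)                 // average elasticity d ln E(mpg) / d ln weight
    margins, eydx(weight) atmeans         // semielasticity, evaluated at the means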


Journal ArticleDOI
TL;DR: The authors present a new Stata estimation program, mboxcox, that computes the normalizing scaled power transformations for a set of variables using the multivariate Box-Cox method.
Abstract: We present a new Stata estimation program, mboxcox, that computes the normalizing scaled power transformations for a set of variables. The multivariate Box-Cox method (defined in Velilla, 1993, Statistics and Probability Letters 17: 259-263; used in Weisberg, 2005, Applied Linear Regression [Wiley]) is used to determine the transformations. We demonstrate using a generated example and a real dataset.
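
A hedged usage sketch on auto.dta; the bare varlist call follows the abstract's description, and any further options should be taken from the mboxcox help file:

    sysuse auto, clear
    mboxcox price weight length           // estimate normalizing power transformations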

Journal ArticleDOI
TL;DR: Stata’s handling of dates and times is centered on daily dates; the principles are simple, leaving just the matter of identifying the syntax for converting from one form of representation to another.
Abstract: Stata’s handling of dates and times is centered on daily dates. Days are aggregated into weeks, months, quarters, half-years, and years. Days are divided into hours, minutes, and seconds. This may all sound simple in principle, leaving just the matter of identifying the syntax for converting from one form of representation to another. For an introduction, see [U] 24 Working with dates and times. For more comprehensive treatments, see [D] dates and times and [D] functions. As a matter of history, know that specific date functions were introduced in Stata 4 in 1995, replacing an earlier system based on ado-files. These date functions were much enhanced in Stata 6 in 1999 and again in Stata 10 in 2007.
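
For example, the official conversion functions work as follows:

    display %td date("21 Oct 2010", "DMY")               // string to daily date
    display %tc clock("21oct2010 14:30", "DMYhm")        // string to date-time (milliseconds)
    display %td dofc(clock("21oct2010 14:30", "DMYhm"))  // the day part of a date-time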

Journal ArticleDOI
TL;DR: In this paper, a simple method to obtain in Stata Murphy-Topel-corrected variances for a two-step estimation of a heckprobit model with endogeneity in the main equation is presented.
Abstract: We outline a fairly simple method to obtain in Stata Murphy-Topel-corrected variances for a two-step estimation of a heckprobit model with endogeneity in the main equation. The procedure uses predict's score option and the powerful matrix tool accum in Stata and builds on previous works by Hardin (2002, Stata Journal 2: 253-266) and Hole (2006, Stata Journal 6: 521-529).

Journal ArticleDOI
TL;DR: In this article, a new Stata command called screening is described for data management; it can be used to examine the content of complex narrative-text variables to identify one or more user-defined keywords.
Abstract: In this article, we describe screening, a new Stata command for data management that can be used to examine the content of complex narrative-text variables to identify one or more user-defined keywords.