
Showing papers in "Journal of Statistical Software in 2011"


Journal ArticleDOI
TL;DR: Mice adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing imputed values, specialized pooling routines, model selection tools, and diagnostic graphs.
Abstract: The R package mice imputes incomplete multivariate data by chained equations. The software mice 1.0 appeared in the year 2000 as an S-PLUS library, and in 2001 as an R package. mice 1.0 introduced predictor selection, passive imputation and automatic pooling. This article documents mice, which extends the functionality of mice 1.0 in several ways. In mice, the analysis of imputed data is made completely general, whereas the range of models under which pooling works is substantially extended. mice adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing imputed values, specialized pooling routines, model selection tools, and diagnostic graphs. Imputation of categorical data is improved in order to bypass problems caused by perfect prediction. Special attention is paid to transformations, sum scores, indices and interactions using passive imputation, and to the proper setup of the predictor matrix. mice can be downloaded from the Comprehensive R Archive Network. This article provides a hands-on, stepwise approach to solve applied incomplete data problems.

10,234 citations
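The regression step at the heart of chained-equations imputation can be sketched in a few lines of base R. This is a conceptual illustration only, not the mice API: mice iterates such steps across all incomplete variables and draws the regression parameters properly; the data and the helper `impute_once` here are invented.

```r
# One regression-imputation step in the spirit of chained equations.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
y[sample(n, 20)] <- NA                       # make 20 values of y missing

impute_once <- function(x, y) {
  miss <- is.na(y)
  fit  <- lm(y ~ x, subset = !miss)          # model y on the observed cases
  pred <- predict(fit, newdata = data.frame(x = x[miss]))
  # add residual noise so imputed values carry uncertainty, not just the mean
  y[miss] <- pred + rnorm(sum(miss), sd = summary(fit)$sigma)
  y
}

y_imp <- impute_once(x, y)
```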


Journal ArticleDOI
TL;DR: MatchIt implements a wide range of sophisticated matching methods, making it possible to greatly reduce the dependence of causal inferences on hard-to-justify, but commonly made, statistical modeling assumptions.
Abstract: MatchIt implements the suggestions of Ho, Imai, King, and Stuart (2007) for improving parametric statistical models by preprocessing data with nonparametric matching methods. MatchIt implements a wide range of sophisticated matching methods, making it possible to greatly reduce the dependence of causal inferences on hard-to-justify, but commonly made, statistical modeling assumptions. The software also easily fits into existing research practices since, after preprocessing data with MatchIt, researchers can use whatever parametric model they would have used without MatchIt, but produce inferences with substantially more robustness and less sensitivity to modeling assumptions. MatchIt is an R program, and also works seamlessly with Zelig.

3,012 citations


Journal ArticleDOI
TL;DR: The Amelia II package implements a new expectation-maximization with bootstrapping algorithm that works faster, with larger numbers of variables, and is far easier to use, than various Markov chain Monte Carlo approaches, but gives essentially the same answers.
Abstract: Amelia II is a complete R package for multiple imputation of missing data. The package implements a new expectation-maximization with bootstrapping algorithm that works faster, with larger numbers of variables, and is far easier to use, than various Markov chain Monte Carlo approaches, but gives essentially the same answers. The program also improves imputation models by allowing researchers to put Bayesian priors on individual cell values, thereby including a great deal of potentially valuable and extensive information. It also includes features to accurately impute cross-sectional datasets, individual time series, or sets of time series for different cross-sections. A full set of graphical diagnostics are also available. The program is easy to use, and the simplicity of the algorithm makes it far more robust; both a simple command line and extensive graphical user interface are included.

2,404 citations


Journal ArticleDOI
TL;DR: This paper introduces a new R package that lets you smoothly apply a split-apply-combine strategy without having to worry about the type of structure in which your data is stored.
Abstract: Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. This insight gives rise to a new R package that allows you to smoothly apply this strategy, without having to worry about the type of structure in which your data is stored. The paper includes two case studies showing how these insights make it easier to work with batting records for veteran baseball players and a large 3d array of spatio-temporal ozone measurements.

2,243 citations
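The split-apply-combine strategy itself can be demonstrated with base R alone (a generic sketch, not the package's interface; the toy batting data below are invented):

```r
# Split rows by a grouping factor, apply a summary to each piece,
# and combine the results back into a single data frame.
df <- data.frame(
  player = rep(c("a", "b", "c"), each = 4),
  hits   = c(1, 2, 0, 3,  4, 4, 5, 2,  0, 1, 1, 0)
)

pieces   <- split(df$hits, df$player)            # split
means    <- lapply(pieces, mean)                 # apply
combined <- data.frame(player    = names(means), # combine
                       mean_hits = unlist(means))
combined
```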


Journal ArticleDOI
TL;DR: The R package unmarked provides a unified modeling framework for ecological research, including tools for data exploration, model fitting, model criticism, post-hoc analysis, and model comparison.
Abstract: Ecological research uses data collection techniques that are prone to substantial and unique types of measurement error to address scientific questions about species abundance and distribution. These data collection schemes include a number of survey methods in which unmarked individuals are counted, or determined to be present, at spatially referenced sites. Examples include site occupancy sampling, repeated counts, distance sampling, removal sampling, and double observer sampling. To appropriately analyze these data, hierarchical models have been developed to separately model explanatory variables of both a latent abundance or occurrence process and a conditional detection process. Because these models have a straightforward interpretation paralleling mechanisms under which the data arose, they have recently gained immense popularity. The common hierarchical structure of these models is well-suited for a unified modeling interface. The R package unmarked provides such a unified modeling framework, including tools for data exploration, model fitting, model criticism, post-hoc analysis, and model comparison.

1,675 citations


Journal ArticleDOI
TL;DR: This work introduces a pathwise algorithm for the Cox proportional hazards model, regularized by convex combinations of ℓ1 and ℓ2 penalties (elastic net), and employs warm starts to find a solution along a regularization path.
Abstract: We introduce a pathwise algorithm for the Cox proportional hazards model, regularized by convex combinations of l1 and l2 penalties (elastic net). Our algorithm fits via cyclical coordinate descent, and employs warm starts to find a solution along a regularization path. We demonstrate the efficacy of our algorithm on real and simulated data sets, and find considerable speedup between our algorithm and competing methods.

1,579 citations
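For reference, the elastic-net penalty is a convex combination of the two norms. A sketch of one common parameterization in base R (the penalty only; the coordinate-descent fitting of the Cox model is not shown, and the function name is invented):

```r
# Elastic-net penalty: lambda * ( alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2 )
# alpha = 1 gives the lasso, alpha = 0 gives ridge.
elastic_net_penalty <- function(beta, lambda, alpha) {
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) / 2 * sum(beta^2))
}

elastic_net_penalty(c(1, -2), lambda = 0.5, alpha = 1)   # 1.5  (pure lasso)
elastic_net_penalty(c(1, -2), lambda = 0.5, alpha = 0)   # 1.25 (pure ridge)
```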


Journal ArticleDOI
TL;DR: The Rcpp package simplifies integrating C++ code with R by providing a consistent C++ class hierarchy that maps various types of R objects to dedicated C++ classes.
Abstract: The Rcpp package simplifies integrating C++ code with R. It provides a consistent C++ class hierarchy that maps various types of R objects (vectors, matrices, functions, environments, ...) to dedicated C++ classes. Object interchange between R and C++ is managed by simple, flexible and extensible concepts which include broad support for C++ Standard Template Library idioms. C++ code can either be compiled, linked and loaded on the fly, or added via packages. Flexible error and exception code handling is provided. Rcpp substantially lowers the barrier for programmers wanting to combine C++ code with R.

1,322 citations


Journal ArticleDOI
TL;DR: Matching is an R package which provides functions for multivariate and propensity score matching and for finding optimal covariate balance based on a genetic search algorithm, along with a variety of univariate and multivariate metrics to determine whether balance has actually been obtained.
Abstract: Matching is an R package which provides functions for multivariate and propensity score matching and for finding optimal covariate balance based on a genetic search algorithm. A variety of univariate and multivariate metrics to determine if balance actually has been obtained are provided. The underlying matching algorithm is written in C++, makes extensive use of system BLAS and scales efficiently with dataset size. The genetic algorithm which finds optimal balance is parallelized and can make use of multiple CPUs or a cluster of computers. A large number of options are provided which control exactly how the matching is conducted and how balance is evaluated.

1,184 citations
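The basic idea of propensity-score matching can be sketched in base R with a greedy nearest-neighbor match with replacement (illustrative only; Matching's genetic balance search and C++ core are far more elaborate, and all names and data below are invented):

```r
# Greedy 1:1 nearest-neighbor matching on an estimated propensity score.
set.seed(3)
n  <- 200
x  <- rnorm(n)                                     # a confounder
tr <- rbinom(n, 1, plogis(x))                      # treatment depends on x
ps <- fitted(glm(tr ~ x, family = binomial))       # estimated propensity score

treated  <- which(tr == 1)
controls <- which(tr == 0)
matched  <- sapply(treated, function(i)
  controls[which.min(abs(ps[controls] - ps[i]))])  # closest control on ps

# covariate imbalance before vs. after matching
before <- abs(mean(x[treated]) - mean(x[controls]))
after  <- abs(mean(x[treated]) - mean(x[matched]))
```

Matching on the propensity score shrinks the covariate gap between groups, which is exactly what the balance metrics in the package are designed to check.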


Journal ArticleDOI
TL;DR: An overview of the capabilities of the DLNM package is offered, describing the conceptual and practical steps to specify and interpret DLNMs with an example of application to real data.
Abstract: Distributed lag non-linear models (DLNMs) represent a modeling framework to flexibly describe associations showing potentially non-linear and delayed effects in time series data. This methodology rests on the definition of a crossbasis, a bi-dimensional functional space expressed by the combination of two sets of basis functions, which specify the relationships in the dimensions of predictor and lags, respectively. This framework is implemented in the R package dlnm, which provides functions to perform the broad range of models within the DLNM family and then to help interpret the results, with an emphasis on graphical representation. This paper offers an overview of the capabilities of the package, describing the conceptual and practical steps to specify and interpret DLNMs with an example of application to real data.

1,028 citations


Journal ArticleDOI
TL;DR: poLCA is a software package for the estimation of latent class and latent class regression models for polytomous outcome variables, implemented in the R statistical computing environment using expectation-maximization and Newton-Raphson algorithms to find maximum likelihood estimates of the model parameters.
Abstract: poLCA is a software package for the estimation of latent class and latent class regression models for polytomous outcome variables, implemented in the R statistical computing environment. Both models can be called using a single simple command line. The basic latent class model is a finite mixture model in which the component distributions are assumed to be multi-way cross-classification tables with all variables mutually independent. The latent class regression model further enables the researcher to estimate the effects of covariates on predicting latent class membership. poLCA uses expectation-maximization and Newton-Raphson algorithms to find maximum likelihood estimates of the model parameters.

991 citations


Journal ArticleDOI
TL;DR: The R package topicmodels provides basic infrastructure for fitting topic models based on data structures from the text mining package tm; fitted models estimate the similarity between documents, and between a set of specified keywords, using an additional layer of latent variables.
Abstract: Topic models allow the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents as well as between a set of specified keywords using an additional layer of latent variables which are referred to as topics. The R package topicmodels provides basic infrastructure for fitting topic models based on data structures from the text mining package tm. The package includes interfaces to two algorithms for fitting topic models: the variational expectation-maximization algorithm provided by David M. Blei and co-authors and an algorithm using Gibbs sampling by Xuan-Hieu Phan and co-authors.

Journal ArticleDOI
TL;DR: Ice is described, an implementation in Stata of the MICE approach to multiple imputation, and real data from an observational study in ovarian cancer is used to illustrate the most important of the many options available with ice.
Abstract: Missing data are a common occurrence in real datasets. For epidemiological and prognostic factors studies in medicine, multiple imputation is becoming the standard route to estimating models with missing covariate data under a missing-at-random assumption. We describe ice, an implementation in Stata of the MICE approach to multiple imputation. Real data from an observational study in ovarian cancer are used to illustrate the most important of the many options available with ice. We remark briefly on the new database architecture and procedures for multiple imputation introduced in releases 11 and 12 of Stata.

Journal ArticleDOI
TL;DR: This article reviews the range of Markov models and their extensions which can be fitted to panel-observed data, and their implementation in the msm package for R, which is intended to be straightforward to use, flexible and comprehensively documented.
Abstract: Panel data are observations of a continuous-time process at arbitrary times, for example, visits to a hospital to diagnose disease status. Multi-state models for such data are generally based on the Markov assumption. This article reviews the range of Markov models and their extensions which can be fitted to panel-observed data, and their implementation in the msm package for R. Transition intensities may vary between individuals, or with piecewise-constant time-dependent covariates, giving an inhomogeneous Markov model. Hidden Markov models can be used for multi-state processes which are misclassified or observed only through a noisy marker. The package is intended to be straightforward to use, flexible and comprehensively documented. Worked examples are given of the use of msm to model chronic disease progression and screening. Assessment of model fit, and potential future developments of the software, are also discussed.

Journal ArticleDOI
TL;DR: This article describes the many capabilities offered by the TraMineR toolbox for categorical sequence data and focuses more specifically on the analysis and rendering of state sequences.
Abstract: This article describes the many capabilities offered by the TraMineR toolbox for categorical sequence data. It focuses more specifically on the analysis and rendering of state sequences. Addressed features include the description of sets of sequences by means of transversal aggregated views, the computation of longitudinal characteristics of individual sequences and the measure of pairwise dissimilarities. Special emphasis is put on the multiple ways of visualizing sequences. The core element of the package is the state sequence object in which we store the set of sequences together with attributes such as the alphabet, state labels and the color palette. The functions can then easily retrieve this information to ensure presentation homogeneity across all printed and graphical displays. The article also demonstrates how TraMineR’s outcomes give access to advanced analyses such as clustering and statistical modeling of sequence data.

Journal ArticleDOI
TL;DR: The lubridate package for R is presented, which facilitates working with dates and times and introduces a conceptual framework for arithmetic with date-times in R.
Abstract: This paper presents the lubridate package for R, which facilitates working with dates and times. Date-times create various technical problems for the data analyst. The paper highlights these problems and offers practical advice on how to solve them using lubridate. The paper also introduces a conceptual framework for arithmetic with date-times in R.
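One of the problems the paper has in mind shows up already in base R's own date arithmetic: stepping by "one month" from January 31 silently rolls past the end of February (a base R illustration of the pitfall; lubridate's own functions are not shown here):

```r
# Base R "add a month" pitfall: 2011-01-31 plus one month is not a
# well-defined date, and seq() rolls it over into March.
d <- seq(as.Date("2011-01-31"), by = "month", length.out = 2)
d
format(d[2], "%m")   # the second element lands in March, not February
```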

Journal ArticleDOI
TL;DR: This introduction to the R package rgenoud is a modified version of Mebane and Sekhon (2011), published in the Journal of Statistical Software; that version contains higher resolution figures.
Abstract: This introduction to the R package rgenoud is a modified version of Mebane and Sekhon (2011), published in the Journal of Statistical Software. That version of the introduction contains higher resolution figures. genoud is an R function that combines evolutionary algorithm methods with a derivative-based (quasi-Newton) method to solve difficult optimization problems. genoud may also be used for optimization problems for which derivatives do not exist. genoud solves problems that are nonlinear or perhaps even discontinuous in the parameters of the function to be optimized. When the function to be optimized (for example, a log-likelihood) is nonlinear in the model’s parameters, the function will generally not be globally concave and may have irregularities such as saddlepoints or discontinuities. Optimization methods that rely on derivatives of the objective function may be unable to find any optimum at all. Multiple local optima may exist, so that there is no guarantee that a derivative-based method will converge to the global optimum. On the other hand, algorithms that do not use derivative information (such as pure genetic algorithms) are for many problems needlessly poor at local hill climbing. Most statistical problems are regular in a neighborhood of the solution. Therefore, for some portion of the search space, derivative information is useful. The function supports parallel processing on multiple CPUs on a single machine or a cluster of computers.

Journal ArticleDOI
TL;DR: The mi package in R has features that allow the user to get inside the imputation process and evaluate the reasonableness of the resulting models and imputations, and uses Bayesian models and weakly informative prior distributions to construct more stable estimates of imputation models.
Abstract: Our mi package in R has several features that allow the user to get inside the imputation process and evaluate the reasonableness of the resulting models and imputations. These features include: choice of predictors, models, and transformations for chained imputation models; standard and binned residual plots for checking the fit of the conditional distributions used for imputation; and plots for comparing the distributions of observed and imputed data. In addition, we use Bayesian models and weakly informative prior distributions to construct more stable estimates of imputation models. Our goal is to have a demonstration package that (a) avoids many of the practical problems that arise with existing multivariate imputation programs, and (b) demonstrates state-of-the-art diagnostics that can be applied more generally and can be incorporated into the software of others.

Journal ArticleDOI
TL;DR: MCMCpack is introduced, an R package that contains functions to perform Bayesian inference using posterior simulation for a number of statistical models, along with some useful utility functions.
Abstract: We introduce MCMCpack, an R package that contains functions to perform Bayesian inference using posterior simulation for a number of statistical models. In addition to code that can be used to fit commonly used models, MCMCpack also contains some useful utility functions, including some additional density functions and pseudo-random number generators for statistical distributions, a general purpose Metropolis sampling algorithm, and tools for visualization.

Journal ArticleDOI
TL;DR: The current investigation advances the technique by developing a computational platform integrating both statistical and IRT procedures into a single program, and a Monte Carlo simulation approach was incorporated to derive empirical criteria for various DIF statistics and effect size measures.
Abstract: Logistic regression provides a flexible framework for detecting various types of differential item functioning (DIF). Previous efforts extended the framework by using item response theory (IRT) based trait scores, and by employing an iterative process using group-specific item parameters to account for DIF in the trait scores, analogous to purification approaches used in other DIF detection frameworks. The current investigation advances the technique by developing a computational platform integrating both statistical and IRT procedures into a single program. Furthermore, a Monte Carlo simulation approach was incorporated to derive empirical criteria for various DIF statistics and effect size measures. For purposes of illustration, the procedure was applied to data from a questionnaire of anxiety symptoms for detecting DIF associated with age from the Patient-Reported Outcomes Measurement Information System.

Journal ArticleDOI
TL;DR: The R package DEoptim is described which implements the differential evolution algorithm for the global optimization of a real-valued function of areal-valued parameter vector and interfaces with C code for efficiency.
Abstract: This article describes the R package DEoptim which implements the differential evolution algorithm for the global optimization of a real-valued function of a real-valued parameter vector. The implementation of differential evolution in DEoptim interfaces with C code for efficiency. The utility of the package is illustrated via case studies in fitting a Parratt model for X-ray reflectometry data and a Markov-Switching Generalized AutoRegressive Conditional Heteroskedasticity (MSGARCH) model for the returns of the Swiss Market Index.
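The differential evolution loop itself is compact enough to sketch in pure R. This is a toy version under simplifying assumptions (one mutation strategy, fixed control parameters); DEoptim's C implementation adds many strategies and refinements, and every name below is invented:

```r
# Minimal differential evolution: mutate with a scaled difference vector,
# apply binomial crossover, and keep the trial point when it improves.
de_sketch <- function(f, lower, upper, np = 40, iters = 200,
                      cr = 0.9, scale = 0.8) {
  d   <- length(lower)
  pop <- matrix(runif(np * d, lower, upper), nrow = np, ncol = d, byrow = TRUE)
  fit <- apply(pop, 1, f)
  for (g in seq_len(iters)) {
    for (i in seq_len(np)) {
      idx   <- sample(setdiff(seq_len(np), i), 3)     # three distinct partners
      trial <- pop[idx[1], ] + scale * (pop[idx[2], ] - pop[idx[3], ])
      trial <- ifelse(runif(d) < cr, trial, pop[i, ]) # binomial crossover
      trial <- pmin(pmax(trial, lower), upper)        # respect the box bounds
      ftr   <- f(trial)
      if (ftr <= fit[i]) { pop[i, ] <- trial; fit[i] <- ftr }
    }
  }
  list(par = pop[which.min(fit), ], value = min(fit))
}

set.seed(7)
sphere <- function(x) sum(x^2)
out <- de_sketch(sphere, lower = c(-5, -5), upper = c(5, 5))
```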

Journal ArticleDOI
TL;DR: The mstate package in R is developed, which covers all steps of the analysis of multi-state models, from model building and data preparation to estimation and graphical representation of the results, and is suitable for non- and semi-parametric (Cox) models.
Abstract: Multi-state models are a very useful tool to answer a wide range of questions in survival analysis that cannot, or only in a more complicated way, be answered by classical models. They are suitable for both biomedical and other applications in which time-to-event variables are analyzed. However, they are still not frequently applied. So far, an important reason for this has been the lack of available software. To overcome this problem, we have developed the mstate package in R for the analysis of multi-state models. The package covers all steps of the analysis of multi-state models, from model building and data preparation to estimation and graphical representation of the results. It can be applied to non- and semi-parametric (Cox) models. The package is also suitable for competing risks models, as they are a special category of multi-state models. This article offers guidelines for the actual use of the software by means of an elaborate multi-state analysis of data describing post-transplant events of patients with blood cancer. The data have been provided by the EBMT (the European Group for Blood and Marrow Transplantation). Special attention will be paid to the modeling of different covariate effects (the same for all transitions or transition-specific) and different baseline hazard assumptions (different for all transitions or equal for some).

Journal ArticleDOI
TL;DR: The REALCOM-IMPUTE software performs multilevel multiple imputation, handles ordinal and unordered categorical data appropriately, and may be used either as a standalone package or in conjunction with the multilevel software MLwiN or Stata.
Abstract: Multiple imputation is becoming increasingly established as the leading practical approach to modelling partially observed data, under the assumption that the data are missing at random. However, many medical and social datasets are multilevel, and this structure should be reflected not only in the model of interest, but also in the imputation model. In particular, the imputation model should reflect the differences between level 1 variables and level 2 variables (which are constant across level 1 units). This led us to develop the REALCOM-IMPUTE software, which we describe in this article. This software performs multilevel multiple imputation, and handles ordinal and unordered categorical data appropriately. It is freely available on-line, and may be used either as a standalone package, or in conjunction with the multilevel software MLwiN or Stata.

Journal ArticleDOI
TL;DR: This work attempts to provide some diagnostic information about the function, its scaling and parameter bounds, and the solution characteristics of optimx, a wrapper to consolidate many of these choices for the optimization of functions that are mostly smooth with parameters at most bounds-constrained.
Abstract: R users can often solve optimization tasks easily using the tools in the optim function in the stats package provided by default on R installations. However, there are many other optimization and nonlinear modelling tools in R or in easily installed add-on packages. These present users with a bewildering array of choices. optimx is a wrapper to consolidate many of these choices for the optimization of functions that are mostly smooth with parameters at most bounds-constrained. We attempt to provide some diagnostic information about the function, its scaling and parameter bounds, and the solution characteristics. optimx runs a battery of methods on a given problem, thus facilitating comparative studies of optimization algorithms for the problem at hand. optimx can also be a useful pedagogical tool for demonstrating the strengths and pitfalls of different classes of optimization approaches including Newton, gradient, and derivative-free methods.
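optimx's comparative idea can be mimicked with base R's optim() alone: run the same problem through several methods and compare what each reports. A sketch of the comparison, not the optimx interface; the test function is the standard Rosenbrock banana:

```r
# Run several optim() methods on one problem and collect the best
# objective value each method reaches.
rosenbrock <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2

methods <- c("Nelder-Mead", "BFGS", "CG")
fits    <- lapply(methods, function(m) optim(c(-1.2, 1), rosenbrock, method = m))
values  <- setNames(sapply(fits, function(fit) fit$value), methods)
values   # gradient-based BFGS gets essentially to the minimum of 0
```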


Journal ArticleDOI
TL;DR: The R package Synth implements synthetic control methods for comparative case studies designed to estimate the causal effects of policy interventions and other events of interest (Abadie and Gardeazabal 2003; Abadie, Diamond, and Hainmueller 2010).
Abstract: The R package Synth implements synthetic control methods for comparative case studies designed to estimate the causal effects of policy interventions and other events of interest (Abadie and Gardeazabal 2003; Abadie, Diamond, and Hainmueller 2010). These techniques are particularly well-suited to investigate events occurring at an aggregate level (i.e., countries, cities, regions, etc.) and affecting a relatively small number of units. Benefits and features of the Synth package are illustrated using data from Abadie and Gardeazabal (2003), which examined the economic impact of the terrorist conflict in the Basque Country.

Journal ArticleDOI
TL;DR: This paper presents the SAS/STAT MI and MIANALYZE procedures, which perform inference by multiple imputation under numerous settings, and implements popular methods for creating imputations under monotone and nonmonotone (arbitrary) patterns of missing data.
Abstract: Multiple imputation provides a useful strategy for dealing with data sets that have missing values. Instead of filling in a single value for each missing value, a multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute. These multiply imputed data sets are then analyzed by using standard procedures for complete data and combining the results from these analyses. No matter which complete-data analysis is used, the process of combining results of parameter estimates and their associated standard errors from different imputed data sets is essentially the same. This process results in valid statistical inferences that properly reflect the uncertainty due to missing values. This paper reviews methods for analyzing missing data and applications of multiple imputation techniques. This paper presents the SAS/STAT MI and MIANALYZE procedures, which perform inference by multiple imputation under numerous settings. PROC MI implements popular methods for creating imputations under monotone and nonmonotone (arbitrary) patterns of missing data, and PROC MIANALYZE analyzes results from multiply imputed data sets.
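The combining step described above, Rubin's rules, is simple enough to state directly. Here is a base-R sketch of the formulas themselves (not the SAS interface; the function name is invented): the pooled estimate averages the m completed-data estimates, and the total variance adds an inflated between-imputation variance to the average within-imputation variance.

```r
# Rubin's rules for combining m completed-data analyses.
pool_rubin <- function(est, var_within) {
  m    <- length(est)
  qbar <- mean(est)                    # pooled point estimate
  ubar <- mean(var_within)             # average within-imputation variance
  b    <- var(est)                     # between-imputation variance
  tvar <- ubar + (1 + 1 / m) * b       # total variance
  c(estimate = qbar, se = sqrt(tvar))
}

pool_rubin(est = c(1.1, 0.9, 1.0), var_within = c(0.04, 0.05, 0.045))
```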

Journal ArticleDOI
TL;DR: In this article, the authors describe the R package ipw for estimating inverse probability weights, which can be used with binomial, categorical, ordinal and continuous exposure variables, and show how to use the package to fit marginal structural models through inverse probability weighting, to estimate causal effects.
Abstract: We describe the R package ipw for estimating inverse probability weights. We show how to use the package to fit marginal structural models through inverse probability weighting, to estimate causal effects. Our package can be used with data from a point treatment situation as well as with a time-varying exposure and time-varying confounders. It can be used with binomial, categorical, ordinal and continuous exposure variables.
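For a point-treatment situation, the weighting scheme can be sketched in base R with simulated data (a conceptual illustration, not the ipw API; all names and the data-generating model are invented): model the exposure, invert the fitted probabilities into weights, then fit a weighted marginal outcome model.

```r
# Point-treatment inverse probability weighting, from scratch.
set.seed(42)
n    <- 500
conf <- rnorm(n)                                  # a confounder
expo <- rbinom(n, 1, plogis(conf))                # exposure depends on conf
yout <- 2 * expo + conf + rnorm(n)                # true effect of expo is 2

ps <- fitted(glm(expo ~ conf, family = binomial)) # probability of exposure
w  <- ifelse(expo == 1, 1 / ps, 1 / (1 - ps))     # inverse probability weights

fit_msm <- lm(yout ~ expo, weights = w)           # weighted marginal model
coef(fit_msm)["expo"]                             # close to 2
```

Weighting removes the confounding that a naive unweighted comparison of exposed and unexposed subjects would suffer from.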

Journal ArticleDOI
TL;DR: In this article, the potential of the lmer function from the lme4 package in R for item response (IRT) modeling is discussed, and three broad categories of models are described: item covariate models, person covariate model, and person-by-item model.
Abstract: In this paper we elaborate on the potential of the lmer function from the lme4 package in R for item response (IRT) modeling. In line with the package, an IRT framework is described based on generalized linear mixed modeling. The aspects of the framework refer to (a) the kind of covariates -- their mode (person, item, person-by-item), and their being external vs. internal to responses, and (b) the kind of effects the covariates have -- fixed vs. random, and if random, the mode across which the effects are random (persons, items). Based on this framework, three broad categories of models are described: item covariate models, person covariate models, and person-by-item covariate models, and within each category three types of more specific models are discussed. The models in question are explained and the associated lmer code is given. Examples of models are the linear logistic test model with an error term, differential item functioning models, and local item dependency models. Because the lme4 package is for univariate generalized linear mixed models, neither the two-parameter and three-parameter models, nor the item response models for polytomous response data, can be estimated with the lmer function.

Journal ArticleDOI
TL;DR: This paper provides an introduction to a simple, yet comprehensive, set of programs for the implementation of some Bayesian nonparametric and semiparametric models in R, DPpackage.
Abstract: Data analysis sometimes requires the relaxation of parametric assumptions in order to gain modeling flexibility and robustness against mis-specification of the probability model. In the Bayesian context, this is accomplished by placing a prior distribution on a function space, such as the space of all probability distributions or the space of all regression functions. Unfortunately, posterior distributions ranging over function spaces are highly complex and hence sampling methods play a key role. This paper provides an introduction to a simple, yet comprehensive, set of programs for the implementation of some Bayesian nonparametric and semiparametric models in R, DPpackage. Currently, DPpackage includes models for marginal and conditional density estimation, receiver operating characteristic curve analysis, interval-censored data, binary regression data, item response data, longitudinal and clustered data using generalized linear mixed models, and regression data using generalized additive models. The package also contains functions to compute pseudo-Bayes factors for model comparison and for eliciting the precision parameter of the Dirichlet process prior, and a general purpose Metropolis sampling algorithm. To maximize computational efficiency, the actual sampling for each model is carried out using compiled C, C++ or Fortran code.

Journal ArticleDOI
TL;DR: This paper presents a new software package decon for R, which contains a collection of functions that use the deconvolution kernel methods to deal with the measurement error problems, and adapts the fast Fourier transform algorithm for density estimation with error-free data to the deconvolution kernel estimation.
Abstract: Data from many scientific areas often come with measurement error. Density or distribution function estimation from contaminated data and nonparametric regression with errors-in-variables are two important topics in measurement error models. In this paper, we present a new software package decon for R, which contains a collection of functions that use the deconvolution kernel methods to deal with the measurement error problems. The functions allow the errors to be either homoscedastic or heteroscedastic. To make the deconvolution estimators computationally more efficient in R, we adapt the fast Fourier transform algorithm for density estimation with error-free data to the deconvolution kernel estimation. We discuss the practical selection of the smoothing parameter in deconvolution methods and illustrate the use of the package through both simulated and real examples.