Showing papers in "Journal of Statistical Software in 2009"



Journal ArticleDOI
TL;DR: The CircStat toolbox for MATLAB provides methods for the descriptive and inferential statistical analysis of directional data; a dataset from neurophysiology is analyzed to demonstrate the capabilities of the CircStat toolbox.
Abstract: Directional data is ubiquitous in science. Due to its circular nature, such data cannot be analyzed with commonly used statistical techniques. Despite the rapid development of specialized methods for directional statistics over the last fifty years, little software is available that makes such methods easy for practitioners to use. Most importantly, MATLAB, one of the most commonly used programming languages in the biosciences, currently does not support directional statistics. To remedy this situation, we have implemented the CircStat toolbox for MATLAB, which provides methods for the descriptive and inferential statistical analysis of directional data. We cover the statistical background of the available methods and describe how to apply them to data. Finally, we analyze a dataset from neurophysiology to demonstrate the capabilities of the CircStat toolbox.

2,557 citations


Journal ArticleDOI
TL;DR: The mixtools package for R provides a set of functions for analyzing a variety of finite mixture models, including both traditional methods, such as EM algorithms for univariate and multivariate normal mixtures, and newer methods that reflect some recent research in finite mixture models.
Abstract: The mixtools package for R provides a set of functions for analyzing a variety of finite mixture models. These functions include both traditional methods, such as EM algorithms for univariate and multivariate normal mixtures, and newer methods that reflect some recent research in finite mixture models. In the latter category, mixtools provides algorithms for estimating parameters in a wide range of different mixture-of-regressions contexts, in multinomial mixtures such as those arising from discretizing continuous multivariate data, in nonparametric situations where the multivariate component densities are completely unspecified, and in semiparametric situations such as a univariate location mixture of symmetric but otherwise unspecified densities. Many of the algorithms of the mixtools package are EM algorithms or are based on EM-like ideas, so this article includes an overview of EM algorithms for finite mixture models.

1,079 citations
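
A minimal R sketch of the traditional EM route described above, assuming the normalmixEM() interface of current mixtools versions:

## Fit a two-component univariate normal mixture by EM.
library(mixtools)

set.seed(1)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))  # simulated two-component data

fit <- normalmixEM(x, k = 2)  # EM algorithm for a normal mixture
fit$lambda                    # estimated mixing proportions
fit$mu                        # estimated component means
fit$sigma                     # estimated component standard deviations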


Journal ArticleDOI
TL;DR: The dtw package allows R users to compute time series alignments mixing freely a variety of continuity constraints, restriction windows, endpoints, local distance definitions, and so on.
Abstract: Dynamic time warping is a popular technique for comparing time series, providing both a distance measure that is insensitive to local compression and stretches and the warping which optimally deforms one of the two input series onto the other. A variety of algorithms and constraints have been discussed in the literature. The dtw package provides a unification of them; it allows R users to compute time series alignments mixing freely a variety of continuity constraints, restriction windows, endpoints, local distance definitions, and so on. The package also provides functions for visualizing alignments and constraints using several classic diagram types.

833 citations
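
A minimal sketch of the package's central call; the window arguments below illustrate the freely mixable constraints the abstract mentions (defaults assumed for everything else):

## Align a short query to a longer reference under a Sakoe-Chiba band.
library(dtw)

query     <- sin(seq(0, 2 * pi, length.out = 80))
reference <- sin(seq(0, 2 * pi, length.out = 100))

alignment <- dtw(query, reference, keep.internals = TRUE,
                 window.type = "sakoechiba", window.size = 20)
alignment$distance                  # minimum cumulative alignment cost
plot(alignment, type = "threeway")  # query, reference, and warping curve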


Journal ArticleDOI
TL;DR: In this paper, the authors present the methodology of multidimensional scaling (MDS) problems solved by means of the majorization algorithm, where the objective function to be minimized is known as stress, and functions which majorize stress are elaborated.
Abstract: In this paper we present the methodology of multidimensional scaling (MDS) problems solved by means of the majorization algorithm. The objective function to be minimized is known as stress, and functions which majorize stress are elaborated. This strategy for solving MDS problems is called SMACOF; it is implemented in an R package of the same name, which is presented in this article. We extend the basic SMACOF theory in terms of configuration constraints, three-way data, unfolding models, and projection of the resulting configurations onto spheres and other quadratic surfaces. Various examples are presented to show the possibilities of the SMACOF approach offered by the corresponding package.

476 citations
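
A minimal sketch of the basic symmetric SMACOF fit, assuming the smacofSym() interface:

## Metric MDS by majorization on a distance matrix.
library(smacof)

d   <- dist(swiss)             # dissimilarities between Swiss provinces
fit <- smacofSym(d, ndim = 2)  # minimize stress by majorization
fit$stress                     # final stress value
plot(fit)                      # resulting two-dimensional configuration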


Journal ArticleDOI
TL;DR: The program implements the coarsened exact matching (CEM) algorithm, described below, which may be used alone or in combination with any existing matching method.
Abstract: This program is designed to improve causal inference via a method of matching that is widely applicable in observational data and easy to understand and use (if you understand how to draw a histogram, you will understand this method). The program implements the coarsened exact matching (CEM) algorithm, described below. CEM may be used alone or in combination with any existing matching method. This algorithm, and its statistical properties, are described in Iacus, King, and Porro (2008).

395 citations
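
A minimal sketch of the matching-then-estimation workflow; the data frame mydata and its columns treated and y are hypothetical placeholders:

## Coarsened exact matching followed by an effect estimate on the matched data.
library(cem)

## 'mydata' holds a binary treatment indicator 'treated', an outcome 'y',
## and pre-treatment covariates in the remaining columns (all hypothetical).
mat <- cem(treatment = "treated", data = mydata, drop = "y")
mat                                          # matched/unmatched counts by group
est <- att(mat, y ~ treated, data = mydata)  # treatment effect on the matched set
est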


Journal ArticleDOI
TL;DR: Substantial extensions to the effects package for R are described for constructing effect displays for multinomial and proportional-odds logit models; the package was previously limited to linear and generalized linear models.
Abstract: Based on recent work by Fox and Andersen (2006), this paper describes substantial extensions to the effects package for R to construct effect displays for multinomial and proportional-odds logit models. The package previously was limited to linear and generalized linear models. Effect displays are tabular and graphical representations of terms — typically high-order terms — in a statistical model. For polytomous logit models, effect displays depict fitted category probabilities under the model, and can include point-wise confidence envelopes for the effects. The construction of effect displays by functions in the effects package is essentially automatic. The package provides several kinds of displays for polytomous logit models.

267 citations
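
A minimal sketch of an effect display for a polytomous logit model, assuming the BEPS data used in the Fox and Andersen work still ships with the package (its location may vary by version):

## Effect display for a multinomial logit model of vote choice.
library(effects)
library(nnet)

data(BEPS)  # British Election Panel Study
mod <- multinom(vote ~ age + gender + economic.cond.national, data = BEPS)
plot(effect("economic.cond.national", mod))  # fitted category probabilities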



Journal ArticleDOI
TL;DR: In this article, a generalized version of the pool-adjacent-violators algorithm (PAVA) is proposed to minimize a separable convex function with simple chain constraints.
Abstract: In this paper we give a general framework for isotone optimization. First we discuss a generalized version of the pool-adjacent-violators algorithm (PAVA) to minimize a separable convex function with simple chain constraints. Beyond general convex functions, we extend existing PAVA implementations in terms of observation weights, approaches for tie handling, and responses from repeated measurement designs. Since isotone optimization problems can be formulated as convex programming problems with linear constraints, we then develop a primal active set method to solve such problems. This methodology is applied to specific loss functions relevant in statistics. Both approaches are implemented in the R package isotone.

211 citations
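
A minimal sketch of the generalized PAVA entry point, assuming the gpava() interface:

## Least-squares isotone regression under simple chain constraints.
library(isotone)

x <- 1:10
y <- c(1, 3, 2, 4, 3, 5, 6, 5, 8, 9)  # noisy, roughly increasing response

fit <- gpava(x, y)  # generalized pool-adjacent-violators
fit$x               # fitted monotone values
plot(fit)           # step-function fit over the data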


Journal ArticleDOI
TL;DR: The R package BB is discussed, in particular its capabilities for solving a nonlinear system of equations, and the utility of its functions is demonstrated for solving large systems of nonlinear equations, smooth nonlinear estimating equations in statistical modeling, and non-smooth estimating equations arising in rank-based regression modeling of censored failure time data.
Abstract: This introduction to the R package BB is a (slightly) modified version of Varadhan and Gilbert (2009), published in the Journal of Statistical Software. We discuss the R package BB, in particular its capabilities for solving a nonlinear system of equations. The function BBsolve in BB can be used for this purpose. We demonstrate the utility of these functions for solving: (a) large systems of nonlinear equations, (b) smooth, nonlinear estimating equations in statistical modeling, and (c) non-smooth estimating equations arising in rank-based regression modeling of censored failure time data. The function BBoptim can be used to solve smooth, box-constrained optimization problems. A main strength of BB is that, due to its low memory and storage requirements, it is ideally suited for solving high-dimensional problems with thousands of variables.

194 citations
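
A minimal sketch of the advertised use of BBsolve on a small nonlinear system (the system itself is an arbitrary illustration):

## Solve F(x) = 0 for two equations in two unknowns; the root is near (1, 1).
library(BB)

f <- function(x) {
  c(x[1]^2 + x[2]^2 - 2,        # first residual
    exp(x[1] - 1) + x[2]^3 - 2) # second residual
}

sol <- BBsolve(par = c(2, 0.5), fn = f)  # derivative-free spectral method
sol$par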


Journal ArticleDOI
TL;DR: anacor is an R package for the computation of simple and canonical correspondence analysis with missing values; the canonical correspondence analysis is specified in a rather general way by imposing covariates on the rows and/or the columns of the two-dimensional frequency table.
Abstract: This paper presents the R package anacor for the computation of simple and canonical correspondence analysis with missing values. The canonical correspondence analysis is specified in a rather general way by imposing covariates on the rows and/or the columns of the two-dimensional frequency table. The package allows for scaling methods such as standard, Benzecri, centroid, and Goodman scaling. In addition, along with well-known two- and three-dimensional joint plots including confidence ellipsoids, it offers alternative plotting possibilities in terms of transformation plots, Benzecri plots, and regression plots.
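
A minimal sketch, assuming the tocher eye-colour by hair-colour table used in the package documentation:

## Simple correspondence analysis with Benzecri scaling on both margins.
library(anacor)

data(tocher)  # eye colour by hair colour frequency table
fit <- anacor(tocher, ndim = 2, scaling = c("Benzecri", "Benzecri"))
fit                                 # singular values and chi-square decomposition
plot(fit, plot.type = "jointplot")  # joint plot of row and column scores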

Journal ArticleDOI
TL;DR: The maximum entropy bootstrap is an algorithm that creates an ensemble for time series inference and its scope is illustrated by means of several guided applications.
Abstract: The maximum entropy bootstrap is an algorithm that creates an ensemble for time series inference. Stationarity is not required, and the ensemble satisfies the ergodic theorem and the central limit theorem. The meboot R package implements this algorithm. This document introduces the procedure and illustrates its scope by means of several guided applications.
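
A minimal sketch of the core call; since meboot() does not require stationarity, a random-walk series is a fair input:

## Build a maximum entropy bootstrap ensemble for a nonstationary series.
library(meboot)

set.seed(1)
x   <- ts(cumsum(rnorm(60)))          # a random walk
ens <- meboot(x, reps = 99)$ensemble  # 99 replicates shaped like the original

matplot(ens[, 1:10], type = "l", col = "grey")  # a few replicates
lines(as.numeric(x), lwd = 2)                   # original series for comparison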

Journal ArticleDOI
TL;DR: An overview of techniques for parallel computing with R on computer clusters, on multi-core systems, and in grid computing is presented, reviewing sixteen different packages and comparing them on their state of development, the parallel technology used, as well as on usability, acceptance, and performance.
Abstract: R is a mature open-source programming language for statistical computing and graphics. Many areas of statistical research are experiencing rapid growth in the size of data sets. Methodological advances drive increased use of simulations. A common approach is to use parallel computing. This paper presents an overview of techniques for parallel computing with R on computer clusters, on multi-core systems, and in grid computing. It reviews sixteen different packages, comparing them on their state of development, the parallel technology used, as well as on usability, acceptance, and performance. Two packages (snow, Rmpi) stand out as particularly suited to general use on computer clusters. Packages for grid computing are still in development, with only one package currently available to the end user. For multi-core systems five different packages exist, but a number of issues pose challenges to early adopters. The paper concludes with ideas for further developments in high performance computing with R. Example code is available in the appendix.
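
A minimal sketch using snow, one of the two packages the review singles out for cluster work:

## Run an embarrassingly parallel task on four local socket workers.
library(snow)

cl  <- makeCluster(4, type = "SOCK")
res <- parLapply(cl, 1:100, function(i) mean(rnorm(1e5)))
stopCluster(cl)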

Journal ArticleDOI
TL;DR: An R function is implemented that uses Markov chain Monte Carlo (MCMC) algorithms to uniformly sample the feasible region of constrained linear problems and a new algorithm where an MCMC step reflects on the inequality constraints.
Abstract: An R function is implemented that uses Markov chain Monte Carlo (MCMC) algorithms to uniformly sample the feasible region of constrained linear problems. Two existing hit-and-run sampling algorithms are implemented, together with a new algorithm where an MCMC step reflects on the inequality constraints. The new algorithm is more robust compared to the hit-and-run methods, at a small cost of increased calculation time.
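
The abstract does not name the function, but it matches xsample() from the limSolve package; on that assumption, a sketch of uniform sampling over a simplex, with type = "mirror" selecting the new reflective algorithm:

## Sample x uniformly subject to sum(x) = 1 (equality) and x >= 0 (inequality).
library(limSolve)

E <- matrix(1, nrow = 1, ncol = 3)  # equality constraints:   E x = F
F <- 1
G <- diag(3)                        # inequality constraints: G x >= H
H <- rep(0, 3)

xs <- xsample(E = E, F = F, G = G, H = H,
              iter = 3000, type = "mirror")$X
colMeans(xs)  # approximately c(1/3, 1/3, 1/3)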

Journal ArticleDOI
TL;DR: In this article, the authors present methodological and practical issues of the R package homals which performs homogeneity analysis and various extensions, such as nonlinear principal component analysis, nonlinear canonical correlation analysis, and predictive models which emulate discriminant analysis and regression models.
Abstract: Homogeneity analysis combines the idea of maximizing the correlations between variables of a multivariate data set with that of optimal scaling. In this article we present methodological and practical issues of the R package homals which performs homogeneity analysis and various extensions. By setting rank constraints nonlinear principal component analysis can be performed. The variables can be partitioned into sets such that homogeneity analysis is extended to nonlinear canonical correlation analysis or to predictive models which emulate discriminant analysis and regression models. For each model the scale level of the variables can be taken into account by setting level constraints. All algorithms allow for missing values.
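
A minimal sketch of a plain homogeneity analysis (multiple correspondence analysis), assuming the homals() defaults:

## Homogeneity analysis of three categorical variables.
library(homals)

df  <- data.frame(lapply(mtcars[, c("cyl", "gear", "carb")], factor))
fit <- homals(df, ndim = 2)        # object scores and category quantifications
plot(fit, plot.type = "objplot")   # plot of the object scores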

Journal ArticleDOI
TL;DR: This introduction to the R package sets is a (slightly) modified version of Meyer and Hornik (2009a), published in the Journal of Statistical Software.
Abstract: This introduction to the R package sets is a (slightly) modified version of Meyer and Hornik (2009a), published in the Journal of Statistical Software. We present data structures and algorithms for sets and some generalizations thereof (fuzzy sets, multisets, and fuzzy multisets) available for R through the sets package. Fuzzy (multi-)sets are based on dynamically bound fuzzy logic families. Further extensions include user-definable iterators and matching functions.
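
A minimal sketch of the basic set and fuzzy-set operations:

## Ordinary sets, set operations, and a fuzzy set with membership degrees.
library(sets)

s <- set("a", "b", "c")
t <- set("b", "c", "d")
set_union(s, t)         # {"a", "b", "c", "d"}
set_intersection(s, t)  # {"b", "c"}

f <- gset(c("low", "high"), memberships = c(0.2, 0.9))  # a fuzzy set
gset_memberships(f)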

Journal ArticleDOI
TL;DR: The R package archetypes is presented, which provides an implementation of the archetypal analysis algorithm within R and different exploratory tools to analyze the algorithm during its execution and its final result.
Abstract: Archetypal analysis aims to represent observations in a multivariate data set as convex combinations of extremal points. This approach was introduced by Cutler and Breiman (1994); they defined the concrete problem, laid out the theoretical foundations, and presented an algorithm written in Fortran. In this paper we present the R package archetypes, which is available on the Comprehensive R Archive Network. The package provides an implementation of the archetypal analysis algorithm within R and different exploratory tools to analyze the algorithm during its execution and its final result. The application of the package is demonstrated on two examples.
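
A minimal sketch using the toy data set that the package's own examples rely on:

## Fit three archetypes and inspect their coordinates.
library(archetypes)

data(toy, package = "archetypes")
set.seed(1)
a <- archetypes(toy, k = 3)  # alternating least-squares estimation
parameters(a)                # coordinates of the three archetypes
xyplot(a, toy)               # data cloud with the archetype hull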

Journal ArticleDOI
TL;DR: In this paper, the authors present a package that implements and extends the method of Cobb (Cobb and Watson 1980; Cobb, Koppstein, and Chen 1983), making it easy to quantitatively fit and compare different cusp catastrophe models in a statistically principled way.
Abstract: Of the seven elementary catastrophes in catastrophe theory, the "cusp" model is the most widely applied. Most applications are however qualitative. Quantitative techniques for catastrophe modeling have been developed, but so far the limited availability of flexible software has hindered quantitative assessment. We present a package that implements and extends the method of Cobb (Cobb and Watson 1980; Cobb, Koppstein, and Chen 1983), and makes it easy to quantitatively fit and compare different cusp catastrophe models in a statistically principled way. After a short introduction to the cusp catastrophe, we demonstrate the package with two instructive examples.
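
A hedged sketch of the fitting interface, assuming the cusp() formula syntax and the rcusp() simulator shown in the package documentation:

## Fit a cusp model: state y, normal factor alpha, splitting factor beta.
library(cusp)

set.seed(1)
x1 <- runif(100, -2, 2)
x2 <- runif(100, -2, 2)
y  <- rcusp(100, alpha = x1, beta = x2)  # simulated cusp-distributed states
df <- data.frame(y, x1, x2)

fit <- cusp(y ~ y - 1, alpha ~ x1, beta ~ x2, data = df)
summary(fit)  # fit statistics, including comparison with a linear model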

Journal ArticleDOI
TL;DR: The MCPMod package provides tools for the analysis of dose finding trials, as well as a variety of tools necessary to plan an experiment to be analyzed using the MCP-Mod methodology.
Abstract: In this article the MCPMod package for the R programming environment will be introduced. It implements a recently developed methodology for the design and analysis of dose-response studies that combines aspects of multiple comparison procedures and modeling approaches (Bretz et al. 2005). The MCPMod package provides tools for the analysis of dose finding trials, as well as a variety of tools necessary to plan an experiment to be analyzed using the MCP-Mod methodology.

Journal ArticleDOI
TL;DR: The book, now in its second edition, provides an overview of this active area of research in time series econometrics and manages to be thorough (using formal notation), yet remains applied in its focus.
Abstract: The book, now in its second edition, provides an overview of this active area of research in time series econometrics. It manages to be thorough (using formal notation), yet remains applied in its focus. A number of examples are discussed, often by using datasets from the original publications. Code examples are provided throughout, frequently using the contributed packages urca and vars by the same author.

Journal ArticleDOI
TL;DR: The BiplotGUI package provides a graphical user interface for the construction of, interaction with, and manipulation of biplots in R that requires almost no knowledge of R syntax.
Abstract: Biplots simultaneously provide information on both the samples and the variables of a data matrix in two- or three-dimensional representations. The BiplotGUI package provides a graphical user interface for the construction of, interaction with, and manipulation of biplots in R. The samples are represented as points, with coordinates determined either by the choice of biplot, principal coordinate analysis or multidimensional scaling. Various transformations and dissimilarity metrics are available. Information on the original variables is incorporated by linear or non-linear calibrated axes. Goodness-of-fit measures are provided. Additional descriptors can be superimposed, including convex hulls, alpha-bags, point densities and classification regions. Amongst the interactive features are dynamic variable value prediction, zooming and point and axis drag-and-drop. Output can easily be exported to the R workspace for further manipulation. Three-dimensional biplots are incorporated via the rgl package. The user requires almost no knowledge of R syntax.

Journal ArticleDOI
TL;DR: In this article, the authors implemented the meta-analysis methodology in a Microsoft Excel add-in which is freely available and incorporates more meta-analysis models (including the iterative maximum likelihood and profile likelihood) than are usually available, while paying particular attention to the user-friendliness of the package.
Abstract: Meta-analysis is a statistical methodology that combines or integrates the results of several independent clinical trials considered by the analyst to be 'combinable' (Huque 1988). However, completeness and user-friendliness are uncommon both in specialised meta-analysis software packages and in mainstream statistical packages that have to rely on user-written commands. We implemented the meta-analysis methodology in a Microsoft Excel add-in which is freely available and incorporates more meta-analysis models (including the iterative maximum likelihood and profile likelihood) than are usually available, while paying particular attention to the user-friendliness of the package.

Journal ArticleDOI
TL;DR: The present paper provides a general SAS program for the random construction of a Williams design and the relevant procedure for randomization, and meets the practical needs of researchers in the application of Williams designs.
Abstract: A Williams design is a special and useful type of cross-over design. Balance is achieved by using only one particular Latin square if there is an even number of treatments, and by using only two appropriate squares if there is an odd number of treatments. PROC PLAN of SAS/STAT is a practical tool, not only for random construction of the Williams square, but also for randomly assigning treatment sequences to the subjects, which makes integration of the two procedures possible. The present paper provides a general SAS program for the random construction of a Williams design and the relevant procedure for randomization. Examples of three-treatment, three-period (3 x 3) and four-treatment, four-period (4 x 4) cross-over designs are given to illustrate the function of the SAS program. The results can be regenerated and replicated with the same random number seed. The general SAS program meets the practical needs of researchers in the application of Williams designs.

Journal ArticleDOI
TL;DR: A C++ template class library for the efficient and convenient implementation of very general Sequential Monte Carlo algorithms is presented and two example applications are provided: a simple particle filter for illustrative purposes and a state-of-the-art algorithm for rare event estimation.
Abstract: Sequential Monte Carlo methods are a very general class of Monte Carlo methods for sampling from sequences of distributions. Simple examples of these algorithms are used very widely in the tracking and signal processing literature. Recent developments illustrate that these techniques have much more general applicability, and can be applied very effectively to statistical inference problems. Unfortunately, these methods are often perceived as being computationally expensive and difficult to implement. This article seeks to address both of these problems. A C++ template class library for the efficient and convenient implementation of very general Sequential Monte Carlo algorithms is presented. Two example applications are provided: a simple particle filter for illustrative purposes and a state-of-the-art algorithm for rare event estimation.

Journal ArticleDOI
TL;DR: The R package LogConcDEAD (log-concave density estimation in arbitrary dimensions) is introduced; its main function computes the nonparametric maximum likelihood estimator of a log-concave density.
Abstract: In this article we introduce the R package LogConcDEAD (Log-concave density estimation in arbitrary dimensions). Its main function is to compute the nonparametric maximum likelihood estimator of a log-concave density. Functions for plotting, sampling from the density estimate and evaluating the density estimate are provided. All of the functions available in the package are illustrated using simple, reproducible examples with simulated data.
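
A minimal sketch of the package's main function on simulated bivariate data:

## Nonparametric MLE of a log-concave density in two dimensions.
library(LogConcDEAD)

set.seed(1)
x   <- matrix(rnorm(200), ncol = 2)  # 100 bivariate normal observations
fit <- mlelcd(x)                     # the log-concave maximum likelihood estimator
plot(fit)                            # contour plot of the fitted density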

Journal ArticleDOI
TL;DR: For survival modeling with microarray data, a software program is developed which can be used conveniently and interactively in the R environment and can discover multiple sets of genes by iterative forward selection rather than one large set of genes.
Abstract: Gene expression data can be associated with various clinical outcomes. In particular, these data can be of importance in discovering survival-associated genes for medical applications. As alternatives to traditional statistical methods, sophisticated methods and software programs have been developed to overcome the high-dimensional difficulty of microarray data. Nevertheless, new algorithms and software programs are needed to include practical functions such as the discovery of multiple sets of survival-associated genes and the incorporation of risk factors, and to be usable in the R environment, with which many statisticians are familiar. For survival modeling with microarray data, we have developed a software program (called rbsurv) which can be used conveniently and interactively in the R environment. This program selects survival-associated genes based on the partial likelihood of the Cox model and separates training and validation sets of samples for robustness. It can discover multiple sets of genes by iterative forward selection rather than one large set of genes. It can also allow adjustment for risk factors in microarray survival modeling. This software package, the rbsurv package, can be used to discover survival-associated genes with microarray data conveniently.

Journal ArticleDOI
TL;DR: The design framework, computational implementation and the utilization of SOCR Analyses are presented, which will help facilitate statistical learning for high school and undergraduate students.
Abstract: The web-based, Java-written SOCR (Statistical Online Computational Resource) tools have been utilized in many undergraduate and graduate level statistics courses for seven years now (Dinov 2006; Dinov et al. 2008b). It has been shown that these resources can successfully improve students' learning (Dinov et al. 2008b). First published online in 2005, SOCR Analyses is a relatively new component that concentrates on data modeling for both parametric and non-parametric data analyses with graphical model diagnostics. One of the main purposes of SOCR Analyses is to facilitate statistical learning for high school and undergraduate students. With SOCR Distributions and Experiments already implemented, SOCR Analyses and Charts complete a standard statistics curriculum. Currently, there are four core components of SOCR Analyses. Linear models included in SOCR Analyses are simple linear regression, multiple linear regression, and one-way and two-way ANOVA. Tests for sample comparisons include the t-test in the parametric category. Examples in the non-parametric category are the Wilcoxon rank sum test, the Kruskal-Wallis test, Friedman's test, the Kolmogorov-Smirnov test, and the Fligner-Killeen test. Hypothesis testing models include the contingency table, Friedman's test, and Fisher's exact test. The last component of Analyses is a utility for computing sample sizes for the normal distribution. In this article, we present the design framework, computational implementation, and utilization of SOCR Analyses.

Journal ArticleDOI
TL;DR: The main innovation is the interactive feature extraction from color images, which is then coupled with the statistical learning algorithms and intensive feedback from the user over many classification-correction iterations, resulting in a highly accurate and user-friendly solution.
Abstract: Supervised learning can be used to segment/identify regions of interest in images using both color and morphological information. A novel object identification algorithm was developed in Java to locate immune and cancer cells in images of immunohistochemically-stained lymph node tissue from a recent study published by Kohrt et al. (2005). The algorithms are also showing promise in other domains. The success of the method depends heavily on the use of color, the relative homogeneity of object appearance and on interactivity. As is often the case in segmentation, an algorithm specifically tailored to the application works better than using broader methods that work passably well on any problem. Our main innovation is the interactive feature extraction from color images. We also enable the user to improve the classification with an interactive visualization system. This is then coupled with the statistical learning algorithms and intensive feedback from the user over many classification-correction iterations, resulting in a highly accurate and user-friendly solution. The system ultimately provides the locations of every cell recognized in the entire tissue in a text file tailored to be easily imported into R (Ihaka and Gentleman 1996; R Development Core Team 2009) for further statistical analyses. This data is invaluable in the study of spatial and multidimensional relationships between cell populations and tumor structure. This system is available at http://www.GemIdent.com/ together with three demonstration videos and a manual.

Journal ArticleDOI
TL;DR: Hands-on illustrations---based on example exercises and control files provided in the package---are presented to get new users started easily.
Abstract: Package exams provides a framework for automatic generation of standardized statistical exams which is especially useful for large-scale exams. To employ the tools, users just need to supply a pool of exercises and a master file controlling the layout of the final PDF document. The exercises are specified in separate Sweave files (containing R code for data generation and LaTeX code for problem and solution description) and the master file is a LaTeX document with some additional control commands. This paper gives an overview of the main design aims and principles as well as strategies for adaptation and extension. Hands-on illustrations---based on example exercises and control files provided in the package---are presented to get new users started easily.
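
A minimal sketch, assuming the tstat.Rnw and boxplots.Rnw exercise templates that ship with the package:

## Generate three randomized PDF copies of a two-exercise exam.
library(exams)

myexam <- c("tstat.Rnw", "boxplots.Rnw")  # Sweave exercise files
odir   <- tempdir()
exams(myexam, n = 3, dir = odir)          # three exams with freshly drawn data
list.files(odir)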