
Showing papers in "Journal of Statistical Software in 2008"


Journal ArticleDOI
TL;DR: FactoMineR is an R package dedicated to multivariate data analysis, with the possibility to take into account different types of variables (quantitative or categorical), different kinds of structure on the data, and supplementary information (supplementary individuals and variables).
Abstract: In this article, we present FactoMineR, an R package dedicated to multivariate data analysis. The main feature of this package is the possibility to take into account different types of variables (quantitative or categorical), different types of structure on the data (a partition on the variables, a hierarchy on the variables, a partition on the individuals), and finally supplementary information (supplementary individuals and variables). Moreover, the dimensions issued from the different exploratory data analyses can be automatically described by quantitative and/or categorical variables. Numerous graphics are also available with various options. Finally, a graphical user interface is implemented within the Rcmdr environment in order to provide a user-friendly package.

6,472 citations
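The interface described above can be sketched with the decathlon example data shipped with FactoMineR; the column indices for the supplementary variables follow the package's documented example, and the call shape is a minimal sketch rather than a full analysis.

```r
# Minimal sketch of FactoMineR's PCA with supplementary information.
# Assumes the 'decathlon' data shipped with the package: columns 1-10
# are event results, 11-12 quantitative supplementary variables and
# column 13 a categorical supplementary variable.
library(FactoMineR)
data(decathlon)
res <- PCA(decathlon, quanti.sup = 11:12, quali.sup = 13, graph = FALSE)
# Automatic description of the first dimension by the variables,
# as mentioned in the abstract:
dimdesc(res, axes = 1)
```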


Journal ArticleDOI
TL;DR: The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R, and focuses on simplifying model training and tuning across a wide variety of modeling techniques.
Abstract: The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R. The package focuses on simplifying model training and tuning across a wide variety of modeling techniques. It also includes methods for pre-processing training data, calculating variable importance, and model visualizations. An example from computational chemistry is used to illustrate the functionality on a real data set and to benchmark the benefits of parallel processing with several types of models.

5,144 citations
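The unified training-and-tuning interface the abstract describes can be sketched as follows; the model method and resampling scheme are illustrative choices, not the paper's computational chemistry example.

```r
# Sketch of caret's unified interface: resampling-based tuning of a
# model chosen from R's rich model set. 'rpart' is an illustrative
# choice; many other methods are supported.
library(caret)
ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
fit <- train(Species ~ ., data = iris,
             method = "rpart",
             trControl = ctrl)
print(fit)   # resampling performance across tuning parameters
```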


Journal ArticleDOI
TL;DR: Two automatic forecasting algorithms implemented in the forecast package for R are described, the first based on the innovations state space models that underlie exponential smoothing methods.
Abstract: Automatic forecasts of large numbers of univariate time series are often needed in business and other contexts. We describe two automatic forecasting algorithms that have been implemented in the forecast package for R. The first is based on innovations state space models that underlie exponential smoothing methods. The second is a stepwise algorithm for forecasting with ARIMA models. The algorithms are applicable to both seasonal and non-seasonal data, and are compared and illustrated using four real time series. We also briefly describe some of the other functionality available in the forecast package.

2,825 citations
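The two automatic algorithms can be sketched on a built-in series (AirPassengers ships with base R); this is a minimal usage sketch, not the paper's four-series comparison.

```r
# The two automatic forecasting algorithms described above.
library(forecast)
fit1 <- ets(AirPassengers)         # innovations state space / exponential smoothing
fit2 <- auto.arima(AirPassengers)  # stepwise ARIMA model selection
fc <- forecast(fit1, h = 12)       # forecasts 12 periods ahead
plot(fc)
```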


Journal ArticleDOI
TL;DR: In this article, a new implementation of hurdle and zero-inflated regression models in the functions hurdle() and zeroinfl() from the package pscl is introduced, which reuses design and functionality of the basic R functions just as the underlying conceptual tools extend the classical models.
Abstract: The classical Poisson, geometric and negative binomial regression models for count data belong to the family of generalized linear models and are available at the core of the statistics toolbox in the R system for statistical computing. After reviewing the conceptual and computational features of these methods, a new implementation of hurdle and zero-inflated regression models in the functions hurdle() and zeroinfl() from the package pscl is introduced. It re-uses design and functionality of the basic R functions just as the underlying conceptual tools extend the classical models. Both hurdle and zero-inflated models are able to incorporate overdispersion and excess zeros, two problems that typically occur in count data sets in economics and the social sciences, better than their classical counterparts. Using cross-section data on the demand for medical care, it is illustrated how the classical as well as the zero-augmented models can be fitted, inspected and tested in practice.

1,971 citations
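The two-part formula interface of hurdle() and zeroinfl() can be sketched as follows; the data frame and variable names are hypothetical placeholders, not the paper's medical care data.

```r
# Sketch of pscl's two-part formula interface: the part before "|"
# models the counts, the part after models the zero (hurdle or
# inflation) component. 'medcare' and its variables are hypothetical.
library(pscl)
fm_hurdle <- hurdle(visits ~ age + income | insurance,
                    data = medcare, dist = "negbin")
fm_zinb   <- zeroinfl(visits ~ age + income | insurance,
                      data = medcare, dist = "negbin")
summary(fm_hurdle)
```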


Journal ArticleDOI
TL;DR: ergm has the capability of approximating a maximum likelihood estimator for an ERGM given a network data set; simulating new network data sets from a fitted ERGM using Markov chain Monte Carlo; and assessing how well a fitted ERGM does at capturing characteristics of a particular network data set.
Abstract: We describe some of the capabilities of the ergm package and the statistical theory underlying it. This package contains tools for accomplishing three important, and interrelated, tasks involving exponential-family random graph models (ERGMs): estimation, simulation, and goodness of fit. More precisely, ergm has the capability of approximating a maximum likelihood estimator for an ERGM given a network data set; simulating new network data sets from a fitted ERGM using Markov chain Monte Carlo; and assessing how well a fitted ERGM does at capturing characteristics of a particular network data set.

1,203 citations
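The three interrelated tasks (estimation, simulation, goodness of fit) can be sketched on the Florentine marriage network shipped with the package; the model terms are an illustrative minimal specification.

```r
# Sketch of the three tasks on data shipped with the package.
library(ergm)
data(florentine)
fit <- ergm(flomarriage ~ edges + triangle)  # approximate MLE
sims <- simulate(fit, nsim = 10)             # MCMC simulation from the fit
gof_fit <- gof(fit)                          # goodness-of-fit diagnostics
plot(gof_fit)
```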


Journal ArticleDOI
TL;DR: A computational framework is established in coin that embeds the corresponding R functionality in a common S4 class structure with associated generic functions; the computational tools thereby inherit the flexibility of the underlying theory, and conditional inference functions for important special cases can be set up easily.
Abstract: The R package coin implements a unified approach to permutation tests providing a huge class of independence tests for nominal, ordered, numeric, and censored data as well as multivariate data at mixed scales. Based on a rich and flexible conceptual framework that embeds different permutation test procedures into a common theory, a computational framework is established in coin that likewise embeds the corresponding R functionality in a common S4 class structure with associated generic functions. As a consequence, the computational tools in coin inherit the flexibility of the underlying theory and conditional inference functions for important special cases can be set up easily. Conditional versions of classical tests---such as tests for location and scale problems in two or more samples, independence in two- or three-way contingency tables, or association problems for censored, ordered categorical or multivariate data---can easily be implemented as special cases using this computational toolbox by choosing appropriate transformations of the observations. The paper gives a detailed exposition of both the internal structure of the package and the provided user interfaces along with examples on how to extend the implemented functionality.

1,189 citations
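The generic interface and its classical special cases can be sketched as follows; the data frame and variables are hypothetical placeholders, and the exact argument names for the reference distribution have varied across coin versions.

```r
# Sketch of coin's interface: a general conditional independence test
# and a classical special case via a convenience function.
# 'mydata' and its variables are hypothetical placeholders.
library(coin)
independence_test(response ~ group, data = mydata)
# Conditional (permutation) Wilcoxon test as a special case:
wilcox_test(response ~ group, data = mydata, distribution = "exact")
```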


Journal ArticleDOI
TL;DR: The tm package is presented which provides a framework for text mining applications within R and techniques for count-based analysis methods, text clustering, text classification and string kernels are presented.
Abstract: During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels.

1,057 citations
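A typical pipeline in the framework described above can be sketched as follows; the tiny corpus is illustrative, and the preprocessing options shown are a subset of what the control list accepts.

```r
# Sketch of a typical tm pipeline: build a corpus from character
# vectors, then compute a term-document matrix with preprocessing
# requested via the control list.
library(tm)
docs <- c("Text mining with R.", "The tm package for text mining.")
corpus <- Corpus(VectorSource(docs))
tdm <- TermDocumentMatrix(corpus,
                          control = list(tolower = TRUE,
                                         removePunctuation = TRUE))
inspect(tdm)   # terms as rows, documents as columns
```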



Journal ArticleDOI
TL;DR: statnet is a suite of software packages for statistical network analysis that provides a comprehensive framework for ERGM-based network modeling, including tools for model estimation, model evaluation, model-based network simulation, and network visualization.
Abstract: statnet is a suite of software packages for statistical network analysis. The packages implement recent advances in network modeling based on exponential-family random graph models (ERGM). The components of the package provide a comprehensive framework for ERGM-based network modeling, including tools for model estimation, model evaluation, model-based network simulation, and network visualization. This broad functionality is powered by a central Markov chain Monte Carlo (MCMC) algorithm. The coding is optimized for speed and robustness.

832 citations


Journal ArticleDOI
TL;DR: The R np package implements a variety of nonparametric and semiparametric kernel-based estimators that are popular among econometricians, and focuses on kernel methods appropriate for the mix of continuous, discrete, and categorical data often found in applied settings.
Abstract: We describe the R np package via a series of applications that may be of interest to applied econometricians. The np package implements a variety of nonparametric and semiparametric kernel-based estimators that are popular among econometricians. There are also procedures for nonparametric tests of significance and consistent model specification tests for parametric mean regression models and parametric quantile regression models, among others. The np package focuses on kernel methods appropriate for the mix of continuous, discrete, and categorical data often found in applied settings. Data-driven methods of bandwidth selection are emphasized throughout, though we caution the user that data-driven bandwidth selection methods can be computationally demanding.

829 citations
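The two-step pattern emphasized in the abstract (data-driven bandwidth selection, then estimation) can be sketched as follows; the data frame and variables are hypothetical placeholders, and, as the abstract cautions, the bandwidth search can be slow.

```r
# Sketch of np's two-step pattern: data-driven bandwidth selection
# followed by kernel regression on mixed data types.
# 'mydata' and its variables are hypothetical placeholders.
library(np)
bw  <- npregbw(wage ~ age + factor(education), data = mydata)
fit <- npreg(bws = bw)
summary(fit)
```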


Journal ArticleDOI
TL;DR: plm is a package for R which intends to make the estimation of linear panel models straightforward, providing functions to estimate a wide variety of models and to make (robust) inference.
Abstract: Panel data econometrics is obviously one of the main fields in the profession, but most of the models used are difficult to estimate with R. plm is a package for R which intends to make the estimation of linear panel models straightforward. plm provides functions to estimate a wide variety of models and to make (robust) inference.
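The package's interface can be sketched on the Grunfeld data shipped with plm; the fixed-effects ("within") estimator is one of the many models the abstract alludes to.

```r
# Sketch of plm's interface: a fixed-effects ("within") model with
# explicit panel indexes, on data shipped with the package.
library(plm)
data("Grunfeld", package = "plm")
fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within")
summary(fe)
```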

Journal ArticleDOI
TL;DR: The R package clValid contains functions for validating the results of a clustering analysis; the user can choose from nine clustering algorithms in existing R packages, including hierarchical, K-means, and self-organizing maps (SOM).
Abstract: The R package clValid contains functions for validating the results of a clustering analysis. There are three main types of cluster validation measures available: "internal", "stability", and "biological". The user can choose from nine clustering algorithms in existing R packages, including hierarchical, K-means, and self-organizing maps (SOM).

Journal ArticleDOI
TL;DR: A beanplot is an alternative to the boxplot for visual comparison of univariate data between groups; it makes it easy to compare different groups of data and to see whether a group contains enough observations to make it interesting from a statistical point of view.
Abstract: This introduction to the R package beanplot is a (slightly) modified version of Kampstra (2008), published in the Journal of Statistical Software. Boxplots and variants thereof are frequently used to compare univariate data. Boxplots have the disadvantage that they are not easy to explain to non-mathematicians, and that some information is not visible. A beanplot is an alternative to the boxplot for visual comparison of univariate data between groups. In a beanplot, the individual observations are shown as small lines in a one-dimensional scatter plot. Next to that, the estimated density of the distributions is visible and the average is shown. It is easy to compare different groups of data in a beanplot and to see if a group contains enough observations to make the group interesting from a statistical point of view. Anomalies in the data, such as bimodal distributions and duplicate measurements, are easily spotted in a beanplot. For groups with two subgroups (e.g., male and female), there is a special asymmetric beanplot. For easy usage, an implementation was made in R.
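A minimal call can be sketched on a data set shipped with base R (ToothGrowth), comparing a response across groups as described above.

```r
# Sketch of a basic beanplot comparing a response across groups,
# using a data set shipped with base R.
library(beanplot)
beanplot(len ~ dose, data = ToothGrowth,
         main = "Tooth growth by dose")
```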

Journal ArticleDOI
TL;DR: The statnet suite of R packages contains a wide range of functionality for the statistical analysis of social networks, including the implementation of exponential-family random graph (ERG) models.
Abstract: The statnet suite of R packages contains a wide range of functionality for the statistical analysis of social networks, including the implementation of exponential-family random graph (ERG) models. In this paper we illustrate some of the functionality of statnet through a tutorial analysis of a friendship network of 1,461 adolescents.

Journal ArticleDOI
TL;DR: The structure of the package vars and its implementation of vector autoregressive, structural vector autoregressive, and structural vector error correction models are explained in this paper; it is further possible to convert vector error correction models into their level VAR representation.
Abstract: The structure of the package vars and its implementation of vector autoregressive, structural vector autoregressive and structural vector error correction models are explained in this paper. In addition to the three cornerstone functions VAR(), SVAR() and SVEC() for estimating such models, functions for diagnostic testing, estimation of restricted models, prediction, causality analysis, impulse response analysis and forecast error variance decomposition are provided too. It is further possible to convert vector error correction models into their level VAR representation. The different methods and functions are elucidated by employing a macroeconomic data set for Canada. However, the focus in this paper is on the implementation rather than the usage of the tools at hand.
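The workflow can be sketched on the Canada data set shipped with the package, which the paper itself uses for illustration; the lag order here is an illustrative choice.

```r
# Sketch of the vars workflow on the Canada macroeconomic data set
# shipped with the package (as used in the paper's illustration).
library(vars)
data(Canada)
var_fit <- VAR(Canada, p = 2, type = "const")   # estimate a VAR(2)
serial.test(var_fit)    # diagnostic testing for serial correlation
irf(var_fit)            # impulse response analysis
fevd(var_fit)           # forecast error variance decomposition
```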

Journal ArticleDOI
TL;DR: An overview of a software package which provides support for a range of network analytic functionality within the R statistical computing environment is provided.
Abstract: Modern social network analysis---the analysis of relational data arising from social systems---is a computationally intensive area of research. Here, we provide an overview of a software package which provides support for a range of network analytic functionality within the R statistical computing environment. General categories of currently supported functionality are described, and brief examples of package syntax and usage are shown.

Journal ArticleDOI
TL;DR: The PresenceAbsence package for R provides a toolkit for selecting the optimal threshold for translating a probability surface into presence-absence maps specifically tailored to their intended use.
Abstract: The PresenceAbsence package for R provides a set of functions useful when evaluating the results of presence-absence analysis, for example, models of species distribution or the analysis of diagnostic tests. The package provides a toolkit for selecting the optimal threshold for translating a probability surface into presence-absence maps specifically tailored to their intended use. The package includes functions for calculating threshold dependent measures such as confusion matrices, percent correctly classified (PCC), sensitivity, specificity, and Kappa, and produces plots of each measure as the threshold is varied. It also includes functions to plot the Receiver Operator Characteristic (ROC) curve and calculates the associated area under the curve (AUC), a threshold independent measure of model quality. Finally, the package computes optimal thresholds by multiple criteria, and plots these optimized thresholds on the graphs.

Journal ArticleDOI
TL;DR: The package seriation is presented which provides an infrastructure for seriation with R and comprises data structures to represent linear orders as permutation vectors, a wide array of seriation methods using a consistent interface, a method to calculate the value of various loss and merit functions, and several visualization techniques which build on seriation.
Abstract: Seriation, i.e., finding a suitable linear order for a set of objects given data and a loss or merit function, is a basic problem in data analysis. Caused by the problem’s combinatorial nature, it is hard to solve for all but very small sets. Nevertheless, both exact solution methods and heuristics are available. In this paper we present the package seriation which provides an infrastructure for seriation with R. The infrastructure comprises data structures to represent linear orders as permutation vectors, a wide array of seriation methods using a consistent interface, a method to calculate the value of various loss and merit functions, and several visualization techniques which build on seriation. To illustrate how easily the package can be applied for a variety of applications, a comprehensive collection of examples is presented.

Journal ArticleDOI
TL;DR: The functionality of the Flexmix package was enhanced and concomitant variable models as well as varying and constant parameters for the component specific generalized linear regression models can be fitted.
Abstract: flexmix provides infrastructure for flexible fitting of finite mixture models in R using the expectation-maximization (EM) algorithm or one of its variants. The functionality of the package was enhanced: now concomitant variable models as well as varying and constant parameters for the component-specific generalized linear regression models can be fitted. The application of the package is demonstrated on several examples, the implementation is described, and examples are given to illustrate how new drivers for the component-specific models and the concomitant variable models can be defined.
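A minimal fit can be sketched as follows; the data frame and variables are hypothetical placeholders, and the two-component Gaussian regression mixture is an illustrative choice.

```r
# Sketch of fitting a two-component mixture of regressions with
# flexmix's EM-based interface. 'mydata' is a hypothetical placeholder.
library(flexmix)
m <- flexmix(y ~ x, data = mydata, k = 2)   # two-component mixture
summary(m)
parameters(m, component = 1)   # component-specific coefficients
```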

Journal ArticleDOI
TL;DR: The yaImpute package is introduced, an R package for nearest neighbor search and imputation that offers a suite of diagnostics for comparison among results generated from different imputation analyses and a set of functions for mapping imputation results.
Abstract: This article introduces yaImpute, an R package for nearest neighbor search and imputation. Although nearest neighbor imputation is used in a host of disciplines, the methods implemented in the yaImpute package are tailored to imputation-based forest attribute estimation and mapping. The impetus for writing yaImpute is a growing interest in nearest neighbor imputation methods for spatially explicit forest inventory, and a need within this research community for software that facilitates comparison among different nearest neighbor search algorithms and subsequent imputation techniques. yaImpute provides directives for defining the search space, subsequent distance calculation, and imputation rules for a given number of nearest neighbors. Further, the package offers a suite of diagnostics for comparison among results generated from different imputation analyses and a set of functions for mapping imputation results.

Journal ArticleDOI
TL;DR: The classes of statistics that are currently available in the ergm package are described and means for controlling the Markov chain Monte Carlo (MCMC) algorithm that the package uses for estimation are described.
Abstract: Exponential-family random graph models (ERGMs) represent the processes that govern the formation of links in networks through the terms selected by the user. The terms specify network statistics that are sufficient to represent the probability distribution over the space of networks of that size. Many classes of statistics can be used. In this article we describe the classes of statistics that are currently available in the ergm package. We also describe means for controlling the Markov chain Monte Carlo (MCMC) algorithm that the package uses for estimation. These controls affect either the proposal distribution on the sample space used by the underlying Metropolis-Hastings algorithm or the constraints on the sample space itself. Finally, we describe various other arguments to core functions of the ergm package.

Journal ArticleDOI
TL;DR: The network package provides a class which may be used for encoding complex relational structures composed of a vertex set together with any combination of undirected/directed, valued/unvalued, dyadic/hyper, and single/multiple edges; storage requirements are on the order of the number of edges involved.
Abstract: Effective memory structures for relational data within R must be capable of representing a wide range of data while keeping overhead to a minimum. The network package provides a class which may be used for encoding complex relational structures composed of a vertex set together with any combination of undirected/directed, valued/unvalued, dyadic/hyper, and single/multiple edges; storage requirements are on the order of the number of edges involved. Some simple constructor, interface, and visualization functions are provided, as well as a set of operators to facilitate employment by end users. The package also supports a C-language API, which allows developers to work directly with network objects within backend code.

Journal ArticleDOI
TL;DR: An R package, CCA, is implemented, freely available from the Comprehensive R Archive Network, to develop numerical and graphical outputs and to enable the user to handle missing values.
Abstract: Canonical correlations analysis (CCA) is an exploratory statistical method to highlight correlations between two data sets acquired on the same experimental units. The cancor() function in R (R Development Core Team 2007) performs the core of computations but further work was required to provide the user with additional tools to facilitate the interpretation of the results. We implemented an R package, CCA, freely available from the Comprehensive R Archive Network (CRAN, http://CRAN.R-project.org/), to develop numerical and graphical outputs and to enable the user to handle missing values. The CCA package also includes a regularized version of CCA to deal with data sets with more variables than units. Illustrations are given through the analysis of a data set coming from a nutrigenomic study in the mouse.
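The core call can be sketched as follows; X and Y are illustrative placeholders for two matrices measured on the same units.

```r
# Sketch of the CCA package's core workflow. X and Y are hypothetical
# placeholder matrices with the same number of rows (units).
library(CCA)
res <- cc(X, Y)   # canonical correlation analysis
res$cor           # the canonical correlations
plt.cc(res)       # graphical output provided by the package
```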

Journal ArticleDOI
TL;DR: The systemfit package provides the capability to estimate systems of linear equations within the R programming environment; it can be used for "ordinary least squares" (OLS), "seemingly unrelated regression" (SUR), and the instrumental variable (IV) methods "two-stage least squares" (2SLS) and "three-stage least squares" (3SLS), where SUR and 3SLS estimations can optionally be iterated.
Abstract: Many statistical analyses (e.g., in econometrics, biostatistics and experimental design) are based on models containing systems of structurally related equations. The systemfit package provides the capability to estimate systems of linear equations within the R programming environment. For instance, this package can be used for "ordinary least squares" (OLS), "seemingly unrelated regression" (SUR), and the instrumental variable (IV) methods "two-stage least squares" (2SLS) and "three-stage least squares" (3SLS), where SUR and 3SLS estimations can optionally be iterated. Furthermore, the systemfit package provides tools for several statistical tests. It has been tested on a variety of datasets and its reliability is demonstrated.
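Estimating a system can be sketched as follows; the equations and the data frame are hypothetical placeholders for a structurally related two-equation system.

```r
# Sketch of estimating a two-equation system by SUR with systemfit.
# The equations and 'mydata' are hypothetical placeholders.
library(systemfit)
eqDemand <- q ~ p + income
eqSupply <- q ~ p + cost
fit <- systemfit(list(demand = eqDemand, supply = eqSupply),
                 method = "SUR", data = mydata)
summary(fit)
```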

Journal ArticleDOI
TL;DR: The GEEQBOX toolbox for MATLAB analyzes correlated data via the method of generalized estimating equations (GEE) and quasi-least squares (QLS), an approach based on GEE that overcomes some limitations of GEE.
Abstract: The GEEQBOX toolbox analyzes correlated data via the method of generalized estimating equations (GEE) and quasi-least squares (QLS), an approach based on GEE that overcomes some limitations of GEE that have been noted in the literature. GEEQBOX is currently able to handle correlated data that follows a normal, Bernoulli or Poisson distribution, and that is assumed to have an AR(1), Markov, tri-diagonal, equicorrelated, unstructured or working independence correlation structure. This toolbox is for use with MATLAB.


Journal ArticleDOI
TL;DR: lawstat is an R software package that contains statistical tests and procedures utilized in various litigations on securities law, antitrust law, equal employment and discrimination, as well as in public policy and biostatistics.
Abstract: We present a new R software package lawstat that contains statistical tests and procedures that are utilized in various litigations on securities law, antitrust law, equal employment and discrimination as well as in public policy and biostatistics. Along with the well known tests such as the Bartels test, runs test, tests of homogeneity of several sample proportions, the Brunner-Munzel tests, the Lorenz curve, the Cochran-Mantel-Haenszel test and others, the package contains new distribution-free robust tests for symmetry, robust tests for normality that are more sensitive to heavy-tailed departures, measures of relative variability, Levene-type tests against trends in variances, etc. All implemented tests and methods are illustrated by simulations and real-life examples from legal cases, economics and biostatistics. Although the package is called lawstat, it presents implementation and discussion of statistical procedures and tests that are also employed in a variety of other applications, e.g., biostatistics, environmental studies, social sciences and others, in other words, all applications utilizing statistical data analysis. Hence, the name of the package should not be considered as a restriction to legal statistics. The package will be useful to applied statisticians and "quantitatively alert practitioners" of other subjects as well as an asset in teaching statistical courses.

Journal ArticleDOI
TL;DR: Latentnet is a package to fit and evaluate statistical latent position and cluster models for networks and provides a Bayesian way of assessing how many groups there are, and thus whether or not there is any clustering.
Abstract: latentnet is a package to fit and evaluate statistical latent position and cluster models for networks. Hoff, Raftery, and Handcock (2002) suggested an approach to modeling networks based on positing the existence of a latent space of characteristics of the actors. Relationships form as a function of distances between these characteristics as well as functions of observed dyadic level covariates. In latentnet social distances are represented in a Euclidean space. It also includes a variant of the extension of the latent position model to allow for clustering of the positions developed in Handcock, Raftery, and Tantrum (2007). The package implements Bayesian inference for the models based on a Markov chain Monte Carlo algorithm. It can also compute maximum likelihood estimates for the latent position model and a two-stage maximum likelihood method for the latent position cluster model. For latent position cluster models, the package provides a Bayesian way of assessing how many groups there are, and thus whether or not there is any clustering (since if the preferred number of groups is 1, there is little evidence for clustering). It also estimates which cluster each actor belongs to. These estimates are probabilistic, and provide the probability of each actor belonging to each cluster. It computes four types of point estimates for the coefficients and positions: maximum likelihood estimate, posterior mean, posterior mode and the estimator which minimizes Kullback-Leibler divergence from the posterior. You can assess the goodness-of-fit of the model via posterior predictive checks. It has a function to simulate networks from a latent position or latent position cluster model.

Journal ArticleDOI
TL;DR: The ig package is designed to analyze data from inverse Gaussian type distributions; it contains basic probabilistic functions, lifetime indicators and a random number generator from this model.
Abstract: The inverse Gaussian distribution is a positively skewed probability model that has received great attention in the last 20 years. Recently, a family that generalizes this model called inverse Gaussian type distributions has been developed. The new R package named ig has been designed to analyze data from inverse Gaussian type distributions. This package contains basic probabilistic functions, lifetime indicators and a random number generator from this model. Also, parameter estimates and diagnostics analysis can be obtained using likelihood methods by means of this package. In addition, goodness-of-fit methods are implemented in order to detect the suitability of the model to the data. The capabilities and features of the ig package are illustrated using simulated and real data sets. Furthermore, some new results related to the inverse Gaussian type distribution are also obtained. Moreover, a simulation study is conducted for evaluating the estimation method implemented in the ig package.

Journal ArticleDOI
TL;DR: In this article, the authors present an exact formula for sample sizes up to 31, six recursion formulae, and one matrix formula that can be used to calculate a two-sided p value.
Abstract: One of the most widely used goodness-of-fit tests is the two-sided one sample Kolmogorov-Smirnov (K-S) test which has been implemented by many computer statistical software packages. To calculate a two-sided p value (evaluate the cumulative sampling distribution), these packages use various methods including recursion formulae, limiting distributions, and approximations of unknown accuracy developed over thirty years ago. Based on an extensive literature search for the two-sided one sample K-S test, this paper identifies an exact formula for sample sizes up to 31, six recursion formulae, and one matrix formula that can be used to calculate a p value. To ensure accurate calculation by avoiding catastrophic cancelation and eliminating rounding error, each of these formulae is implemented in rational arithmetic. For the six recursion formulae and the matrix formula, computational experience for sample sizes up to 500 shows that computational times are increasing functions of both the sample size and the number of digits in the numerator and denominator integers of the rational number test statistic. The computational times of the seven formulae vary immensely but the Durbin recursion formula is almost always the fastest. Linear search is used to calculate the inverse of the cumulative sampling distribution (find the confidence interval half-width) and tables of calculated half-widths are presented for sample sizes up to 500. Using calculated half-widths as input, computational times for the fastest formula, the Durbin recursion formula, are given for sample sizes up to two thousand.
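For context, base R's ks.test() computes the two-sided one-sample K-S p value using the kind of floating-point methods whose accuracy the paper seeks to improve upon via rational arithmetic; a minimal call looks like this.

```r
# Base R's implementation of the two-sided one-sample K-S test,
# for comparison with the rational-arithmetic formulae the paper
# evaluates. With a small sample, an exact p value is requested.
x <- rnorm(25)
ks.test(x, "pnorm", exact = TRUE)
```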