
Showing papers in "Journal of Statistical Software in 2006"


Journal ArticleDOI
TL;DR: This paper describes the core features of the R package geepack, which implements the generalized estimating equations (GEE) approach for fitting marginal generalized linear models to clustered data, and illustrates the approach with an example of clustered binary data.
Abstract: This paper describes the core features of the R package geepack, which implements the generalized estimating equations (GEE) approach for fitting marginal generalized linear models to clustered data. Clustered data arise in many applications such as longitudinal data and repeated measures. The GEE approach focuses on models for the mean of the correlated observations within clusters without fully specifying the joint distribution of the observations. It has been widely used in statistical practice. This paper illustrates the application of the GEE approach with geepack through an example of clustered binary data.
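
As a minimal sketch of the interface (the data and variable names below are simulated for illustration, not taken from the paper):

```r
library(geepack)

## Simulated clustered binary data: 50 subjects, 4 visits each (hypothetical)
set.seed(1)
d <- data.frame(subject = rep(1:50, each = 4),
                visit   = rep(1:4, times = 50),
                trt     = rep(rbinom(50, 1, 0.5), each = 4))
d$resp <- rbinom(nrow(d), 1, plogis(-0.5 + 0.8 * d$trt + 0.1 * d$visit))

## GEE fit of the marginal mean model with an exchangeable working correlation
fit <- geeglm(resp ~ trt + visit, id = subject, data = d,
              family = binomial, corstr = "exchangeable")
summary(fit)
```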

1,785 citations


Journal Article
TL;DR: The R package ltm has been developed for the analysis of multivariate dichotomous and polytomous data using latent variable models, under the Item Response Theory approach.
Abstract: The R package ltm has been developed for the analysis of multivariate dichotomous and polytomous data using latent variable models, under the Item Response Theory approach. For dichotomous data the Rasch, the Two-Parameter Logistic, and Birnbaum's Three-Parameter models have been implemented, whereas for polytomous data Samejima's Graded Response model is available. Parameter estimates are obtained under marginal maximum likelihood using the Gauss-Hermite quadrature rule. The capabilities and features of the package are illustrated using two real data examples.
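
As a brief sketch of the interface, using the LSAT data that ships with the package:

```r
library(ltm)

## Rasch model for the five dichotomous LSAT items
fit.rasch <- rasch(LSAT)

## Two-parameter logistic model; z1 denotes the single latent trait
fit.2pl <- ltm(LSAT ~ z1)
coef(fit.2pl)
```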

835 citations


Journal ArticleDOI
TL;DR: The purpose of this paper is to present and compare the four R packages that contain SVM-related software; support vector machines are among the most popular and efficient classification and regression methods currently available.
Abstract: Being among the most popular and efficient classification and regression methods currently available, implementations of support vector machines exist in almost every popular programming language. Currently four R packages contain SVM related software. The purpose of this paper is to present and compare these implementations.
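
For instance, one of these implementations, svm() in package e1071, can be called as follows (a minimal sketch on the built-in iris data):

```r
library(e1071)

## C-classification SVM with a radial kernel on the iris data
data(iris)
fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
table(predicted = predict(fit, iris), true = iris$Species)
```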

576 citations


Journal ArticleDOI
TL;DR: Conceptual tools and their translation to computational tools in the package sandwich are discussed, enabling the computation of sandwich estimators in general parametric models.
Abstract: Sandwich covariance matrix estimators are a popular tool in applied regression modeling for performing inference that is robust to certain types of model misspecification. Suitable implementations are available in the R system for statistical computing for certain model fitting functions only (in particular lm()), but not for other standard regression functions, such as glm(), nls(), or survreg(). Therefore, conceptual tools and their translation to computational tools in the package sandwich are discussed, enabling the computation of sandwich estimators in general parametric models. Object orientation can be achieved by providing a few extractor functions (most importantly for the empirical estimating functions) from which various types of sandwich estimators can be computed.
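
A minimal sketch of this object-oriented usage, computing robust standard errors for a glm() fit (coeftest() from the lmtest package performs the inference step):

```r
library(sandwich)
library(lmtest)

## Logistic regression on the built-in mtcars data
fit <- glm(vs ~ mpg + wt, data = mtcars, family = binomial)

## Wald tests using the sandwich covariance estimator
## in place of the model-based one
coeftest(fit, vcov = sandwich)
```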

447 citations


Journal ArticleDOI
TL;DR: The R function LDheatmap() is described which produces a graphical display, as a heat map, of pairwise linkage disequilibrium measurements between single nucleotide polymorphisms within a genomic region using the grid graphics system.
Abstract: We describe the R function LDheatmap() which produces a graphical display, as a heat map, of pairwise linkage disequilibrium measurements between single nucleotide polymorphisms within a genomic region. LDheatmap() uses the grid graphics system, an alternative to the traditional R graphics system. The features of the LDheatmap() function and the use of tools from the grid package to modify heat maps are illustrated by examples.
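
A sketch of a typical call; the example data objects named below (CEUSNP genotypes and CEUDist map positions) are assumed from the package's examples:

```r
library(LDheatmap)

## CEUSNP (SNP genotypes) and CEUDist (physical map positions) are assumed
## here to be the package's example data sets.
data(CEUSNP)
data(CEUDist)

## Heat map of pairwise LD, measured by the squared correlation r
LDheatmap(CEUSNP, genetic.distances = CEUDist, LDmeasure = "r")
```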

424 citations


Journal ArticleDOI
TL;DR: Strucplot displays include hierarchical conditional plots such as mosaic, association, and sieve plots, and can be combined into more complex, specialized plots for visualizing conditional independence, GLMs, and the results of independence tests.
Abstract: This paper describes the `strucplot' framework for the visualization of multi-way contingency tables. Strucplot displays include hierarchical conditional plots such as mosaic, association, and sieve plots, and can be combined into more complex, specialized plots for visualizing conditional independence, GLMs, and the results of independence tests. The framework's modular design allows flexible customization of the plots' graphical appearance, including shading, labeling, spacing, and legend, by means of graphical appearance control (`grapcon') functions. The framework is provided by the R package vcd.
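
For instance, a residual-shaded mosaic display of the built-in Titanic table takes a single call (a minimal sketch of the high-level interface):

```r
library(vcd)

## Mosaic plot with residual-based shading for the Titanic contingency
## table; unused dimensions of the table are marginalized out.
mosaic(~ Class + Sex + Survived, data = Titanic, shade = TRUE)
```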

333 citations


Journal ArticleDOI
TL;DR: This text is one of a series of five handbooks that present an overview of how to use a major statistical software package: S-PLUS, Stata, SPSS, SAS, and R.
Abstract: This text is one of a series of five handbooks that present an overview of how to use a major statistical software package. Handbooks cover S-PLUS, Stata, SPSS, SAS, and R. Although R is not strictly speaking a statistical package, it is a currently popular statistical language that can be downloaded to one's computer from various mirror sites. It is similar in logic to the S language of the 1980s, which was later commercialized as the S-PLUS package.

218 citations


Journal ArticleDOI
TL;DR: In this article, the authors provide software for evaluating the density and distribution functions of the ratio z/w for any two jointly normal variates z, w, and give details on methods for transforming a general ratio z/w into a standard form, (a+x)/(b+y), with x and y independent standard normal and a, b non-negative constants.
Abstract: This article extends and amplifies on results from a paper of over forty years ago. It provides software for evaluating the density and distribution functions of the ratio z/w for any two jointly normal variates z, w, and provides details on methods for transforming a general ratio z/w into a standard form, (a+x)/(b+y), with x and y independent standard normal and a, b non-negative constants. It discusses handling general ratios when, in theory, none of the moments exist yet practical considerations suggest there should be approximations whose adequacy can be verified by means of the included software. These approximations show that many of the ratios of normal variates encountered in practice can themselves be taken as normally distributed. A practical rule is developed: if a < 2.256 and 4 < b, then the ratio (a+x)/(b+y) is itself approximately normally distributed with mean μ = a/(1.01b - 0.2713) and variance σ² = (a² + 1)/(b² + 0.108b - 3.795) - μ².
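
The practical rule is easy to check by simulation; the sketch below (with values chosen to satisfy a < 2.256 and 4 < b, using the mean and variance formulas quoted above) compares the approximation with empirical moments:

```r
## Normal approximation to (a + x)/(b + y) per the practical rule above
a <- 1.5; b <- 5
mu <- a / (1.01 * b - 0.2713)
s2 <- (a^2 + 1) / (b^2 + 0.108 * b - 3.795) - mu^2

## Empirical check by simulation
set.seed(1)
r <- (a + rnorm(1e6)) / (b + rnorm(1e6))
c(mean_approx = mu, mean_sim = mean(r))
c(var_approx = s2, var_sim = var(r))
```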

186 citations


Journal ArticleDOI
TL;DR: ada is an R package that implements three popular variants of boosting, together with a version of stochastic gradient boosting, which incorporates a random mechanism at each boosting step, improving performance and speed in generating the ensemble.
Abstract: Boosting is an iterative algorithm that combines simple classification rules with "mediocre" performance in terms of misclassification error rate to produce a highly accurate classification rule. Stochastic gradient boosting provides an enhancement which incorporates a random mechanism at each boosting step showing an improvement in performance and speed in generating the ensemble. ada is an R package that implements three popular variants of boosting, together with a version of stochastic gradient boosting. In addition, useful plots for data analytic purposes are provided along with an extension to the multi-class case. The algorithms are illustrated with synthetic and real data sets.
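
A minimal sketch of fitting discrete AdaBoost with ada (the data below are simulated for illustration):

```r
library(ada)

## Hypothetical two-class data
set.seed(1)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(200, sd = 0.5) > 0, "yes", "no"))

## Discrete AdaBoost with 50 boosting iterations
fit <- ada(y ~ x1 + x2, data = d, type = "discrete", iter = 50)
fit
plot(fit)  # training error by boosting iteration
```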

128 citations


Journal ArticleDOI
TL;DR: hapassoc is software for R implementing a likelihood approach to inference of haplotype and non-genetic effects in GLMs of trait associations; it offers the flexibility to specify dominant and recessive effects of genetic risk factors.
Abstract: Complex medical disorders, such as heart disease and diabetes, are thought to involve a number of genes which act in conjunction with lifestyle and environmental factors to increase disease susceptibility. Associations between complex traits and single nucleotide polymorphisms (SNPs) in candidate genomic regions can provide a useful tool for identifying genetic risk factors. However, analysis of trait associations with single SNPs ignores the potential for extra information from haplotypes, combinations of variants at multiple SNPs along a chromosome inherited from a parent. When haplotype-trait associations are of interest and haplotypes of individuals can be determined, generalized linear models (GLMs) may be used to investigate haplotype associations while adjusting for the effects of non-genetic cofactors or attributes. Unfortunately, haplotypes cannot always be determined cost-effectively when data is collected on unrelated subjects. Uncertain haplotypes may be inferred on the basis of data from single SNPs. However, subsequent analyses of risk factors must account for the resulting uncertainty in haplotype assignment in order to avoid potential errors in interpretation. To account for such uncertainty, we have developed hapassoc, software for R implementing a likelihood approach to inference of haplotype and non-genetic effects in GLMs of trait associations. We provide a description of the underlying statistical method and illustrate the use of hapassoc with examples that highlight the flexibility to specify dominant and recessive effects of genetic risk factors, a feature not shared by other software that restricts users to additive effects only. Additionally, hapassoc can accommodate missing SNP genotypes for limited numbers of subjects.
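
A sketch of the two-step workflow suggested by the package: pre.hapassoc() augments the data with compatible haplotypes, and hapassoc() fits the GLM. The function names are from the package, but the data set, column names, and arguments shown are assumptions for illustration:

```r
library(hapassoc)

## 'dat' is a hypothetical data frame: a disease trait, a non-genetic
## covariate, and unphased genotypes at several SNPs (two columns per SNP).
## Step 1 (assumed interface): expand subjects over their compatible
## haplotypes, with estimated haplotype frequencies.
haps <- pre.hapassoc(dat, numSNPs = 3)

## Step 2 (assumed interface): GLM of the trait on haplotypes plus
## cofactors; the EM-based likelihood accounts for haplotype uncertainty.
fit <- hapassoc(affected ~ attr + hAAA, haps, family = binomial())
summary(fit)
```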

116 citations



Journal ArticleDOI
TL;DR: In this article, the authors provide programs for computing six quantities of interest (probability density function, mean, variance, cumulative distribution function, quantile function and random numbers) for any truncated distribution: whether it is left truncated, right truncated or doubly truncated.
Abstract: Truncated distributions arise naturally in many practical situations. In this note, we provide programs for computing six quantities of interest (probability density function, mean, variance, cumulative distribution function, quantile function and random numbers) for any truncated distribution: whether it is left truncated, right truncated or doubly truncated. The programs are written in R: a freely downloadable statistical software.
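
The note's programs are not reproduced here, but the underlying renormalization idea is simple; a minimal sketch of our own for a standard normal distribution truncated to [lo, hi]:

```r
## Density, CDF, quantile function and random numbers for N(0,1)
## truncated to [lo, hi]: truncate, then renormalize by the mass of [lo, hi].
dtrunc <- function(x, lo, hi)
  ifelse(x < lo | x > hi, 0, dnorm(x) / (pnorm(hi) - pnorm(lo)))

ptrunc <- function(q, lo, hi) {
  q <- pmin(pmax(q, lo), hi)
  (pnorm(q) - pnorm(lo)) / (pnorm(hi) - pnorm(lo))
}

qtrunc <- function(p, lo, hi)
  qnorm(pnorm(lo) + p * (pnorm(hi) - pnorm(lo)))

rtrunc <- function(n, lo, hi) qtrunc(runif(n), lo, hi)  # inversion sampling

qtrunc(0.5, 0, 2)  # median of N(0,1) truncated to [0, 2]
```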

Journal ArticleDOI
TL;DR: This paper describes an integrated educational web-based framework for interactive distribution modeling, virtual online probability experimentation, statistical data analysis, visualization and integration, and presents evidence that SOCR resources build students' intuition and enhance their learning.
Abstract: The need for hands-on computer laboratory experience in undergraduate and graduate statistics education has been firmly established in the past decade. As a result a number of attempts have been undertaken to develop novel approaches for problem-driven statistical thinking, data analysis and result interpretation. In this paper we describe an integrated educational web-based framework for: interactive distribution modeling, virtual online probability experimentation, statistical data analysis, visualization and integration. Following years of experience in statistical teaching at all college levels using established licensed statistical software packages, like Stata, S-PLUS, R, SPSS, SAS, Systat, etc., we have attempted to engineer a new statistics education environment, the Statistics Online Computational Resource (SOCR). This resource performs many of the standard types of statistical analysis, much like other classical tools. In addition, it is designed in a plug-in object-oriented architecture and is completely platform independent, web-based, interactive, extensible and secure. Over the past 4 years we have tested, fine-tuned and reanalyzed the SOCR framework in many of our undergraduate and graduate probability and statistics courses and have evidence that SOCR resources build students' intuition and enhance their learning.

Journal ArticleDOI
TL;DR: There is a rapidly increasing number of books with titles “Something with R”, where “Something” is some area of statistics; it is good that these books use R, because R is the lingua franca of computational statistics.
Abstract: There is a rapidly increasing number of books with titles “Something with R”, where “Something” is some area of statistics. Clearly this is a good development from the point of view of JSS: statistical software gets more attention than it did in the “Without R” era. I think it is also good from a somewhat broader perspective: paying more attention to software blends applied and theoretical aspects of statistics, and illustrates the fact that statistics is properly defined as the development and study of techniques for data analysis. For those of us who are so inclined, source code for a working algorithm is a precise and reproducible way to explain what a technique actually does. And finally it is good that the books use R, and not something else, because R is the lingua franca of computational statistics.

Journal ArticleDOI
TL;DR: The package ggm has a few basic functions that find the essential graph, the induced concentration and covariance graphs, and several types of chain graphs implied by the directed acyclic graph (DAG) after grouping and reordering the variables.
Abstract: We describe some functions in the R package ggm to derive from a given Markov model, represented by a directed acyclic graph, different types of graphs induced after marginalizing over and conditioning on some of the variables. The package has a few basic functions that find the essential graph, the induced concentration and covariance graphs, and several types of chain graphs implied by the directed acyclic graph (DAG) after grouping and reordering the variables. These functions can be useful to explore the impact of latent variables or of selection effects on a chosen data generating model.

Journal ArticleDOI
TL;DR: The partitions package is a small R package of terse, efficient C code for numerical calculation of integer partitions, with support for unrestricted partitions, unequal partitions, and restricted partitions.
Abstract: This paper introduces the partitions package of R routines, for numerical calculation of integer partitions. Functionality for unrestricted partitions, unequal partitions, and restricted partitions is provided in a small package that accompanies this note; the emphasis is on terse, efficient C code. A simple combinatorial problem is solved using the package.
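
A quick sketch of the three kinds of functionality (function names as provided by the package):

```r
library(partitions)

parts(5)               # all unrestricted partitions of 5, one per column
diffparts(5)           # partitions of 5 into unequal (distinct) parts
restrictedparts(5, 3)  # partitions of 5 into at most 3 parts
```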

Journal ArticleDOI
TL;DR: In this paper, a state space model is specified similarly to a generalized linear model in R, and the time-varying terms are then marked in the formula, with special functions for specifying polynomial time trends, harmonic seasonal patterns, unstructured seasonal patterns and time-varying covariates.
Abstract: We provide a language for formulating a range of state space models with response densities within the exponential family. The described methodology is implemented in the R-package sspir. A state space model is specified similarly to a generalized linear model in R, and then the time-varying terms are marked in the formula. Special functions for specifying polynomial time trends, harmonic seasonal patterns, unstructured seasonal patterns and time-varying covariates can be used in the formula. The model is fitted to data using iterated extended Kalman filtering, but the formulation of models does not depend on the implemented method of inference. The package is demonstrated on three datasets.

Journal ArticleDOI
TL;DR: In this article, an algorithm based on a balanced binary search tree is presented for calculating concordance-discordance totals in a time of order N log N, where N is the number of observations.
Abstract: An algorithm is presented for calculating concordance-discordance totals in a time of order N log N , where N is the number of observations, using a balanced binary search tree. These totals can be used to calculate jackknife estimates and confidence limits in the same time order for a very wide range of rank statistics, including Kendall's tau, Somers' D, Harrell's c, the area under the receiver operating characteristic (ROC) curve, the Gini coefficient, and the parameters underlying the sign and rank-sum tests. A Stata package is introduced for calculating confidence intervals for these rank statistics using this algorithm, which has been implemented in the Mata compilable matrix programming language supplied with Stata.

Journal ArticleDOI
TL;DR: In this article, the authors describe graphical methods for multiple-response data within the framework of the multivariate linear model (MLM), aimed at understanding what is being tested in a multivariate test, and how factor/predictor effects are expressed across multiple response measures.
Abstract: This paper describes graphical methods for multiple-response data within the framework of the multivariate linear model (MLM), aimed at understanding what is being tested in a multivariate test, and how factor/predictor effects are expressed across multiple response measures. In particular, we describe and illustrate a collection of SAS macro programs for: (a) Data ellipses and low-rank biplots for multivariate data, (b) HE plots, showing the hypothesis and error covariance matrices for a given pair of responses, and a given effect, (c) HE plot matrices, showing all pairwise HE plots, and (d) low-rank analogs of HE plots, showing all observations, group means, and their relations to the response variables.

Journal ArticleDOI
TL;DR: This paper deals with the R-php statistical software, an environment for statistical analysis that is freely accessible through the World Wide Web and based on R; the tool could be particularly useful for teaching purposes.
Abstract: This paper deals with the R-php statistical software, an environment for statistical analysis, freely accessible through the World Wide Web and based on R. This software uses R, via PHP, as its "engine" for statistical analyses, and its design was inspired by a paper of de Leeuw (1997). R-php is based on two modules: a base module and a point-and-click module. R-php base allows the simple editing of R code in a form. R-php point-and-click allows some statistical analyses by means of a graphical user interface (GUI): to use this module it is not necessary for the user to know the R environment, since all the available analyses can be performed with the mouse. We think that this tool could be particularly useful for teaching purposes: one possible use could be in a university computer laboratory to give students a smooth introduction to R.

Journal ArticleDOI
TL;DR: An age-adjusted bootstrap-based method is developed to assess the significance of assumed asymptotic normal tests for animal carcinogenicity data and is applied to National Toxicology Program data sets to evaluate a dose-related trend of a test substance on the incidence of neoplasms.
Abstract: A computational tool for testing for a dose-related trend and/or a pairwise difference in the incidence of an occult tumor via an age-adjusted bootstrap-based poly-k test and the original poly-k test is presented in this paper. The poly-k test (Bailer and Portier 1988) is a survival-adjusted Cochran-Armitage test, which achieves robustness to effects of differential mortality across dose groups. The original poly-k test is asymptotically standard normal under the null hypothesis. However, the asymptotic normality is not valid if there is a deviation from the tumor onset distribution that is assumed in this test. Our age-adjusted bootstrap-based poly-k test assesses the significance of assumed asymptotic normal tests and investigates an empirical distribution of the original poly-k test statistic using an age-adjusted bootstrap method. A tumor of interest is an occult tumor for which the time to onset is not directly observable. Since most of the animal carcinogenicity studies are designed with a single terminal sacrifice, the present tool is applicable to rodent tumorigenicity assays that have a single terminal sacrifice. The present tool takes input information simply from a user screen and reports testing results back to the screen through a user-interface. The computational tool is implemented in C/C++ and is applied to analyze a real data set as an example. Our tool enables the FDA and the pharmaceutical industry to implement a statistical analysis of tumorigenicity data from animal bioassays via our age-adjusted bootstrap-based poly-k test and the original poly-k test which has been adopted by the National Toxicology Program as its standard statistical test.
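
The tool itself is implemented in C/C++; as an illustration of the poly-k adjustment at its core (our own sketch in R, not the authors' program), tumor-bearing animals count fully while each tumor-free animal dying at time t contributes weight (t/T)^k to the effective sample size:

```r
## Poly-3 adjusted tumor rate for one dose group (hypothetical data).
## time  = death/sacrifice time, tumor = 1 if the occult tumor was found,
## Tmax  = terminal sacrifice time, k = 3 for the standard poly-3 test.
polyk_rate <- function(time, tumor, Tmax, k = 3) {
  w <- ifelse(tumor == 1, 1, (time / Tmax)^k)  # survival-based weights
  c(adjusted_n = sum(w), rate = sum(tumor) / sum(w))
}

polyk_rate(time  = c(50, 70, 104, 104, 90, 104),
           tumor = c( 0,  1,   0,   1,  0,   1),
           Tmax  = 104)
```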

Journal ArticleDOI
TL;DR: The authors developed functions for proportional symbol mapping in R, including mathematical and perceptual scaling; an example demonstrates the new expressive power and options available in R, particularly for the visualization of conceptual point data.
Abstract: Visualization of spatial data on a map aids not only in data exploration but also in communication, imparting spatial conceptions or ideas to others. Although recent cartographic functions in R are rapidly becoming richer, proportional symbol mapping, one of the common mapping approaches, has not been packaged thus far. Based on the theories of proportional symbol mapping developed in cartography, the authors developed functions for proportional symbol mapping using R, including mathematical and perceptual scaling. An example of these functions demonstrates the new expressive power and options available in R, particularly for the visualization of conceptual point data.
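
The contrast between mathematical and perceptual scaling is easy to sketch in base R (hypothetical data; the 0.57 exponent is Flannery's classical perceptual compensation value, and the paper's own functions are not reproduced here):

```r
## Proportional circles: mathematical vs. perceptual (Flannery) scaling
set.seed(1)
x <- runif(12); y <- runif(12); v <- runif(12, 1, 100)

r_math <- sqrt(v / max(v))   # radius ~ sqrt(value): areas exactly proportional
r_perc <- (v / max(v))^0.57  # Flannery exponent compensates underestimation

op <- par(mfrow = c(1, 2), mar = c(2, 2, 2, 1))
symbols(x, y, circles = r_math / 15, inches = FALSE, main = "mathematical")
symbols(x, y, circles = r_perc / 15, inches = FALSE, main = "perceptual")
par(op)
```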

Journal ArticleDOI
TL;DR: A computer program called ITA 2.0 is described, which implements both of the algorithms available to perform an Item Tree Analysis; a concrete data set shows how the program can be used for the analysis of questionnaire data.
Abstract: Item Tree Analysis (ITA) is an explorative method of data analysis which can be used to establish a hierarchical structure on a set of dichotomous items from a questionnaire or test. There are currently two different algorithms available to perform an ITA. We describe a computer program called ITA 2.0 which implements both of these algorithms. In addition we show with a concrete data set how the program can be used for the analysis of questionnaire data.

Journal ArticleDOI
TL;DR: A set of FORTRAN subprograms is presented to compute density and cumulative distribution functions and critical values for the range ratio statistics of Dixon.
Abstract: A set of FORTRAN subprograms is presented to compute density and cumulative distribution functions and critical values for the range ratio statistics of Dixon (1951, The Annals of Mathematical Statistics). These statistics are useful for detection of outliers in small samples.

Journal ArticleDOI
TL;DR: The development and execution of a computer program that accurately calculates first- and second-stage short-run control chart factors for (X, MR) charts using the equations derived in the first paper is described.
Abstract: This paper is the second in a series of two papers that fully develops two-stage short-run (X, MR) control charts. This paper describes the development and execution of a computer program that accurately calculates first- and second-stage short-run control chart factors for (X, MR) charts using the equations derived in the first paper. The software used is Mathcad. The program accepts values for number of subgroups, α for the X chart, and α for the MR chart both above the upper control limit and below the lower control limit. Tables are generated for specific values of these inputs and the implications of the results are discussed. A numerical example illustrates the use of the program.

Journal ArticleDOI
TL;DR: The elliptic package of R routines provides numerical calculation of elliptic and related functions; the package illustrates many ideas of complex analysis numerically and visually, with a statistical application in fluid mechanics.
Abstract: This paper introduces the elliptic package of R routines, for numerical calculation of elliptic and related functions. Elliptic functions furnish interesting and instructive examples of many ideas of complex analysis, and the package illustrates these numerically and visually. A statistical application in fluid mechanics is presented.

Journal ArticleDOI
TL;DR: A computer program for estimating the Gompertz curve using the Gauss-Newton method of least squares is described in detail; it is an improved version of the program proposed in Dastidar (2005).
Abstract: A computer program for estimating the Gompertz curve using the Gauss-Newton method of least squares is described in detail. It is based on the estimation technique proposed in Reddy (1985). The program is developed using Scilab (version 3.1.1), a freely available scientific software package that can be downloaded from http://www.scilab.org/. Data are fed into the program from an external disk file, which should be in Microsoft Excel format. The output contains the sample size, tolerance limit, a list of the initial as well as the final estimates of the parameters, standard errors, the values of the Gauss-Newton equations GN1, GN2 and GN3, the number of iterations, the variance (σ²), the Durbin-Watson statistic, goodness-of-fit measures such as R² and the D value, the covariance matrix, and the residuals. It also displays a graphical output of the estimated curve vis-à-vis the observed curve. It is an improved version of the program proposed in Dastidar (2005).
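
The program itself is written in Scilab; as a rough cross-check in R, nls(), whose default algorithm is Gauss-Newton, fits the same three-parameter Gompertz curve y = a*exp(-b*exp(-c*t)) (data and starting values below are hypothetical):

```r
## Gompertz growth curve fit by Gauss-Newton least squares via nls()
set.seed(1)
t <- 1:20
y <- 100 * exp(-3 * exp(-0.3 * t)) + rnorm(20, sd = 2)

fit <- nls(y ~ a * exp(-b * exp(-c * t)),
           start = list(a = 90, b = 2, c = 0.2))
summary(fit)

plot(t, y)                # observed points
lines(t, fitted(fit))     # estimated curve vis-à-vis the observations
```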

Journal ArticleDOI
TL;DR: The exactLoglinTest package implements a sequentially rounded normal approximation and importance sampling to approximate probabilities from the conditional distribution; a Monte Carlo algorithm estimates P values from that distribution.
Abstract: This manuscript overviews exact testing of goodness of fit for log-linear models using the R package exactLoglinTest. This package evaluates model fit for Poisson log-linear models by conditioning on minimal sufficient statistics to remove nuisance parameters. A Monte Carlo algorithm is proposed to estimate P values from the resulting conditional distribution. In particular, this package implements a sequentially rounded normal approximation and importance sampling to approximate probabilities from the conditional distribution. Usually, this results in a high percentage of valid samples. However, in instances where this is not the case, a Metropolis Hastings algorithm can be implemented that makes more localized jumps within the reference set. The manuscript details how some conditional tests for binomial logit models can also be viewed as conditional Poisson log-linear models and hence can be performed via exactLoglinTest. A diverse battery of examples is considered to highlight use, features and extensions of the software. Notably, potential extensions to evaluating disclosure risk are also considered.
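
A sketch of a typical call; mcexact() is the package's main entry point, though the data, formula, and the nosim argument shown here are illustrative assumptions:

```r
library(exactLoglinTest)

## Hypothetical 2 x 3 table in data-frame (Poisson log-linear) form
d <- data.frame(y   = c(12, 5, 3, 7, 6, 10),
                row = gl(2, 3),
                col = gl(3, 1, 6))

## Monte Carlo exact goodness-of-fit test of the independence model,
## conditioning on the margins (the minimal sufficient statistics).
fit <- mcexact(y ~ row + col, data = d, nosim = 1e4)
summary(fit)
```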

Journal ArticleDOI
TL;DR: In this article, a new SAS/IML tool for performing a spatial sign based multivariate analysis of variance is introduced; the underlying methods have promising efficiency and robustness properties compared with the classical multivariate analysis of variance model.
Abstract: Recently, new nonparametric multivariate extensions of the univariate sign methods have been proposed. Randles (2000) introduced an affine invariant multivariate sign test for the multivariate location problem. Later on, Hettmansperger and Randles (2002) considered an affine equivariant multivariate median corresponding to this test. The new methods have promising efficiency and robustness properties. In this paper, we review these developments and compare them with the classical multivariate analysis of variance model. A new SAS/IML tool for performing a spatial sign based multivariate analysis of variance is introduced.

Journal ArticleDOI
TL;DR: A cross-validation method for the selection of the thresholding value in wavelet shrinkage of Oh, Kim, and Lee (2006) is reviewed, and the R package CVThresh implementing details of the calculations for the procedures are introduced.
Abstract: The core of the wavelet approach to nonparametric regression is thresholding of wavelet coefficients. This paper reviews a cross-validation method for the selection of the thresholding value in wavelet shrinkage of Oh, Kim, and Lee (2006), and introduces the R package CVThresh implementing details of the calculations for the procedures. The procedure couples conventional cross-validation with a fast imputation method, so that it overcomes the restriction of the data length to a power of 2. It can easily be applied to classical leave-one-out and K-fold cross-validation. Since the procedure is computationally fast, a level-dependent cross-validation can be developed for wavelet shrinkage of data with varying sparseness across levels.