
Showing papers in "R Journal in 2017"


Journal ArticleDOI
TL;DR: The glmmTMB package fits many types of GLMMs and extensions, including models with continuously distributed responses, but the authors focus here on count responses; among packages that fit zero-inflated mixed models, its ability to estimate the Conway-Maxwell-Poisson distribution parameterized by the mean is unique.
Abstract: Count data can be analyzed using generalized linear mixed models when observations are correlated in ways that require random effects. However, count data are often zero-inflated, containing more zeros than would be expected from the typical error distributions. We present a new package, glmmTMB, and compare it to other R packages that fit zero-inflated mixed models. The glmmTMB package fits many types of GLMMs and extensions, including models with continuously distributed responses, but here we focus on count responses. glmmTMB is faster than glmmADMB, MCMCglmm, and brms, and more flexible than INLA and mgcv for zero-inflated modeling. One unique feature of glmmTMB (among packages that fit zero-inflated mixed models) is its ability to estimate the Conway-Maxwell-Poisson distribution parameterized by the mean. Overall, its most appealing features for new users may be the combination of speed, flexibility, and its interface’s similarity to lme4.
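A minimal sketch of the lme4-like interface described above, assuming the glmmTMB package is installed; the data frame `df` and the variables `count`, `x` and `site` are hypothetical names, not from the paper:

```r
library(glmmTMB)

fit <- glmmTMB(
  count ~ x + (1 | site),  # conditional model with a random intercept, lme4-style
  ziformula = ~1,          # a single zero-inflation probability for all observations
  family = nbinom2,        # negative binomial; compois gives Conway-Maxwell-Poisson
  data = df
)
summary(fit)
```

The formula syntax mirrors lme4, which is the similarity to lme4 that the abstract highlights.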

4,497 citations


Journal ArticleDOI
TL;DR: An R package, visreg, is introduced for the convenient visualization of the relationship between an outcome and an explanatory variable via short, simple function calls; it provides pointwise confidence bands and partial residuals to allow assessment of variability as well as outliers and other deviations from modeling assumptions.
Abstract: Regression models allow one to isolate the relationship between the outcome and an explanatory variable while the other variables are held constant. Here, we introduce an R package, visreg, for the convenient visualization of this relationship via short, simple function calls. In addition to estimates of this relationship, the package also provides pointwise confidence bands and partial residuals to allow assessment of variability as well as outliers and other deviations from modeling assumptions. The package provides several options for visualizing models with interactions, including lattice plots, contour plots, and both static and interactive perspective plots. The implementation of the package is designed to be fully object-oriented and interface seamlessly with R’s rich collection of model classes, allowing a consistent interface for visualizing not only linear models, but generalized linear models, proportional hazards models, generalized additive models, robust regression models, and many more.
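As a sketch of the short function calls mentioned above (assuming the visreg package is installed; the model uses the built-in airquality data set):

```r
library(visreg)

# An ordinary linear model on a built-in data set
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

visreg(fit, "Wind")               # partial relationship for Wind, with confidence band
visreg(fit, "Wind", by = "Temp")  # lattice-style plot conditioning on Temp
```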

682 citations


Journal ArticleDOI
TL;DR: This paper provides an introduction to the imputeTS package and the algorithms and tools it provides, and gives a short overview of univariate time series imputation in R.
Abstract: The imputeTS package specializes in univariate time series imputation. It offers multiple state-of-the-art imputation algorithm implementations along with plotting functions for time series missing data statistics. While imputation in general is a well-known problem and widely covered by R packages, finding packages able to fill missing values in univariate time series is more complicated. The reason for this lies in the fact that most imputation algorithms rely on inter-attribute correlations, while univariate time series imputation instead needs to employ time dependencies. This paper provides an introduction to the imputeTS package and its provided algorithms and tools. Furthermore, it gives a short overview of univariate time series imputation in R.
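A brief sketch of the workflow, assuming the imputeTS package is installed; the function names follow the dot-style naming used around the time of the paper (later versions renamed them, e.g. na_interpolation()):

```r
library(imputeTS)

x <- tsAirgap                    # example series with missing values, shipped with the package
statsNA(x)                       # print missing-data statistics for the series
x_filled <- na.interpolation(x)  # fill the gaps by linear interpolation
plotNA.distribution(x)           # visualize where the missing values occur
```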

535 citations


Journal ArticleDOI
TL;DR: Partial dependence plots are low-dimensional graphical renderings of the prediction function that make the relationship between the outcome and predictors of interest easier to understand, and they are especially useful in explaining the output from black box models.
Abstract: Complex nonparametric models—like neural networks, random forests, and support vector machines—are more common than ever in predictive analytics, especially when dealing with large observational databases that don’t adhere to the strict assumptions imposed by traditional statistical techniques (e.g., multiple linear regression which assumes linearity, homoscedasticity, and normality). Unfortunately, it can be challenging to understand the results of such models and explain them to management. Partial dependence plots offer a simple solution. Partial dependence plots are low-dimensional graphical renderings of the prediction function so that the relationship between the outcome and predictors of interest can be more easily understood. These plots are especially useful in explaining the output from black box models. In this paper, we introduce pdp, a general R package for constructing partial dependence plots.
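A hedged sketch of constructing a partial dependence plot for a black-box model (assuming the pdp and randomForest packages are installed; the model and predictor are illustrative):

```r
library(randomForest)
library(pdp)

# A black-box model on a built-in data set (randomForest cannot handle NAs)
fit <- randomForest(Ozone ~ ., data = na.omit(airquality))

pd <- partial(fit, pred.var = "Temp")  # averaged prediction as Temp varies
plotPartial(pd)                        # render the partial dependence plot
```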

502 citations


Posted ContentDOI
TL;DR: rentrez, a package that provides an R interface to 50 NCBI databases, is presented; it is well-documented, contains an extensive suite of unit tests, and has an active user base.
Abstract: The USA National Center for Biotechnology Information (NCBI) is one of the world’s most important sources of biological information. NCBI databases like PubMed and GenBank contain millions of records describing bibliographic, genetic, genomic, and medical data. Here I present rentrez, a package which provides an R interface to 50 NCBI databases. The package is well-documented, contains an extensive suite of unit tests and has an active user base. The programmatic interface to the NCBI provided by rentrez allows researchers to query databases and download or import particular records into R sessions for subsequent analysis. The complete nature of the package, its extensive test suite and the fact that the package implements the NCBI’s usage policies all make rentrez a powerful aid to developers of new packages that perform more specific tasks.
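A short sketch of the query-then-download pattern described above (assuming the rentrez package is installed and internet access is available; the search term is illustrative):

```r
library(rentrez)

res <- entrez_search(db = "pubmed", term = "Markov chain[TITL]", retmax = 5)
res$ids  # PubMed identifiers matching the query

rec <- entrez_fetch(db = "pubmed", id = res$ids[1], rettype = "abstract")
cat(rec) # the record imported into the R session as text
```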

167 citations


Journal ArticleDOI
TL;DR: The markovchain package aims to provide S4 classes and methods to easily handle Discrete Time Markov Chains (DTMCs), filling a gap in what is currently available in the CRAN repository; an exhaustive description of the main functions included in the package is provided.
Abstract: The markovchain package aims to provide S4 classes and methods to easily handle Discrete Time Markov Chains (DTMCs), filling a gap in what is currently available in the CRAN repository. In this work, I provide an exhaustive description of the main functions included in the package, as well as hands-on examples.

Introduction

DTMCs are a notable class of stochastic processes. Although their basic theory is not overly complex, they are extremely effective for modelling categorical data sequences (Ching et al., 2008). To illustrate, notable applications can be found in linguistics (see Markov’s original paper, Markov (1907)), information theory (Google’s original search algorithm is based on Markov chain theory, Lawrence Page et al. (1999)), medicine (transitions across HIV severity states, Craig and Sendi (2002)), and economics and sociology (Jones (1997) shows an application of Markov chains to modelling social mobility). The markovchain package (Spedicato, Giorgio Alfredo, 2016) provides an efficient tool to create, manage and analyse Markov chains (MCs). Some of its main features include the possibility to: validate the input transition matrix, plot the transition matrix as a graph diagram, perform structural analysis of DTMCs (e.g. classification of transition matrices and states, analysis of the stationary distribution, etc.), and perform statistical inference (such as fitting transition matrices from various input data, and simulating stochastic process trajectories from a given DTMC). The author believes that, at the time this paper is being drafted, no other R package provides a unified infrastructure to manage DTMCs as easily as markovchain does. The package targets data scientists using DTMCs, academics, faculty instructors, as well as students of undergraduate courses on stochastic processes.
The paper is organized as follows: Section 2.2 gives a brief overview of R packages and alternative software that provide similar functionality, Section 2.3 reviews basic DTMC theory, Section 2.4 discusses the package design and structure, Section 2.5 shows how to create and manipulate homogeneous DTMCs, and Sections 2.6 and 2.7 respectively present the functions created to perform structural analysis and statistical inference on DTMCs. A brief overview of the functionality written to deal with non-homogeneous discrete time Markov chains (NHDTMCs) is provided in Section 2.8. A discussion of numerical reliability and computational performance is provided in Section 2.9. Finally, Section 2.10 draws conclusions and briefly discusses potential future developments of the package.

Analysis of existing DTMC-related software

As reviewed later in more detail, a DTMC is defined by a stochastic matrix known as a transition matrix (TM), which is a square matrix satisfying Equation 1:

  P_ij ∈ [0, 1] ∀ i, j,    ∑_j P_ij = 1.    (1)

Although defining a stochastic matrix is trivial in any mathematical or statistical software, a dedicated DTMC infrastructure can provide object-oriented methods to verify the validity of the input data (i.e. whether the input matrix is a stochastic one), as well as to perform structural analysis on DTMC objects. Various packages in the CRAN repository mention MC-related models; a few of them are reviewed below. The clickstream package (Scholz, 2016), on CRAN since 2014, aims to model website click streams using higher-order Markov chains. It provides a MarkovChain S4 class that is similar to the markovchain class.
Further, DTMCPack (Nicholson, William, 2013) and MTCM (Bessi, Alessandro, 2015) also work with DTMCs but provide even more limited functionality: the first focuses on creating simulations from a given DTMC, whilst the second contains only one function, for estimating the underlying transition matrix of a given categorical sequence. Moreover, neither appears to have been updated since 2015. The coverage of functionality provided by the markovchain package for analysing DTMCs appears to be more complete than that of the above-mentioned packages, since none of them provides methods for importing or coercing transition matrices from other objects, such as R matrices or data.frames. Furthermore, markovchain is the only package providing a quick graph-plotting facility for DTMC objects.

The R Journal Vol. 9/2, December 2017 ISSN 2073-4859

The same applies when considering the functionality used to perform structural analysis of transition matrices and to fit DTMCs from various kinds of input data. More interestingly, the FuzzyStatProb package (Pablo J. Villacorta and José L. Verdegay, 2016) gives an alternative approach for estimating the parameters of DTMCs using "fuzzy logic". This review deliberately omits packages that are not specifically focused on DTMCs. Nonetheless, it is worth noting that the depmixS4 (Visser and Speekenbrink, 2010) and HMM (Himmelmann, 2010) packages deal with Hidden Markov Models (HMMs), and that the number of R packages focused on the estimation of statistical models using the Markov Chain Monte Carlo simulation approach is considerably larger. Finally, the msm (Jackson, 2011), heemod (Antoine Filipovi et al., 2017) and TPmsm (Artur Araújo et al., 2014) packages focus on health applications of multi-state analysis using different kinds of models, including Markov-related ones.
Finally, among other well-known software used in mathematics and statistics, only Mathematica (Wolfram Research, Inc., 2013) provides routines specifically written to deal with Markov processes, to the author’s knowledge. Nevertheless, the analysis of DTMCs can easily be handled within the Matlab programming language (MATLAB, 2017) due to its well-known linear algebra capabilities.

Review of underlying theory

In this section a brief review of the theory of DTMCs is presented. Readers wishing to dive deeper can consult Cassandras (1993) and Grinstead and Snell (2012). A DTMC is a stochastic process whose domain is a discrete set of states, {s1, s2, . . . , sk}. The chain starts in a generic state at time zero and moves from one state to another in steps. Let pij be the probability that a chain currently in state si moves to state sj at the next step. The key characteristic of DTMC processes is that pij depends only on the current state, not on the earlier history of the chain. The probabilities pij for a (finite) DTMC are collected in the transition matrix introduced previously (see Equation 1). It is also possible to define the TM by column, under the constraint that the sum of the elements in each column is 1. To illustrate, a few toy examples of transition matrices are now presented. The first is the "Land of Oz" weather matrix (Kemeny et al., 1974); Equation 2 shows the transition probabilities between (R)ainy, (N)ice and (S)now weather:

        R     N     S
  R   0.50  0.25  0.25
  N   0.50  0.00  0.50
  S   0.25  0.25  0.50    (2)

Further, the Mathematica matrix in Equation 3, taken from the Mathematica 9 Computer Algebra System manual (Wolfram Research, Inc., 2013), will be used when discussing the analysis of the structural properties of DTMCs:

        A     B     C     D
  A   0.50  0.50  0.00  0.00
  B   0.50  0.50  0.00  0.00
  C   0.25  0.25  0.25  0.25
  D   0.00  0.00  0.00  1.00    (3)

Simple operations on TMs allow one to understand structural properties of DTMCs.
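The "Land of Oz" matrix in Equation 2 can be written down and validated in base R; the checks mirror the conditions in Equation 1 (the markovchain package wraps such a matrix in its S4 class, but nothing here requires it):

```r
# The "Land of Oz" weather transition matrix from Equation 2, in base R.
oz <- matrix(c(0.50, 0.25, 0.25,
               0.50, 0.00, 0.50,
               0.25, 0.25, 0.50),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("R", "N", "S"), c("R", "N", "S")))

# Validity checks corresponding to Equation 1: entries in [0, 1], rows sum to 1.
stopifnot(all(oz >= 0 & oz <= 1),
          all(abs(rowSums(oz) - 1) < 1e-12))
```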
For example, the n-th power of P is a matrix whose entries represent the probabilities that a DTMC in state si at time t will be in state sj at time t + n. In particular, if u_t is the probability vector for time t (that is, a vector whose j-th entry represents the probability that the chain will be in the j-th state at time t), then the distribution of the chain at time t + n is given by u_{t+n} = u_t P^n. The main properties of Markov chains are now presented. A state sj is reachable from state si if ∃ n such that p_ij^(n) > 0. If the converse is also true, then si and sj are said to communicate. For each MC, there always exists a unique decomposition of the state space into a sequence of disjoint subsets in which all the states within each subset communicate. Each subset is known as a communicating class of the MC. It is possible to link this decomposition to graph theory, since the communicating classes represent the strongly connected components of the graph underlying the transition matrix (Jarvis and Shier, 1999). A state sj of a DTMC is said to be absorbing if it is impossible to leave it, meaning pjj = 1. An absorbing Markov chain is a chain that contains at least one absorbing state which can be reached, not necessarily in a single step. Non-absorbing states of an absorbing MC are defined as transient states. In addition, states to which the chain returns with probability one are known as recurrent states. If a DTMC contains r ≥ 1 absorbing states, it is possible to re-arrange the order of its states, separating transient and absorbing states such that the t transient states come before the r absorbing ones. Such a re-arranged matrix is said to be in canonical form (see Equation 4), where its composition can be represented by sub-matrices.
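The n-step behaviour just described is a matrix power, which can be checked in base R on the "Land of Oz" matrix (mat_pow is a throwaway helper for this sketch, not a package function):

```r
P <- matrix(c(0.50, 0.25, 0.25,
              0.50, 0.00, 0.50,
              0.25, 0.25, 0.50), nrow = 3, byrow = TRUE)

# Naive matrix power by repeated multiplication; fine for small n.
mat_pow <- function(M, n) Reduce(`%*%`, replicate(n, M, simplify = FALSE))

P7 <- mat_pow(P, 7)  # entries: Pr(in state j at time t+7 | in state i at time t)
u0 <- c(1, 0, 0)     # chain started in the first state
u7 <- u0 %*% P7      # u_{t+7} = u_t P^7, as in the text
```

Each row of P7 is again a probability distribution, so the powers of a stochastic matrix stay stochastic.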
  ( Q_{t,t}  R_{t,r} )
  ( 0_{r,t}  I_{r,r} )    (4)

These sub-matrices are: Q (a t-square sub-matrix containing the transition probabilities across transient states), R (a nonzero t-by-r matrix containing transition probabilities from non-absorbing to absorbing states), 0 (an r-by-t zero matrix), and Ir (an r-by-r identity matrix). It is possible to use these matrices to calculate various structural properties of the DTMC. Since lim_{n→∞} Q^n = 0, it can be shown that in every absorbing matrix the probability of eventual absorption is 1, regardless of the state in which the MC is initiated. Further, in Equation 5 the fundamental matrix is presented:

  N = (I − Q)^{−1}    (5)

where the generic entry nij expresses the expected number of times the process will visit state sj, given that it started in state si. Also, the i-th entry of the vector t = N ∗ 1̄, where 1̄ is a t-sized vector of ones, expresses the expected number of steps before an absorbing DTMC, started in state si, is absorbed.
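The canonical-form quantities above can be reproduced in base R on a toy absorbing chain (a symmetric random walk on four states with absorbing ends; the example is illustrative, not from the paper):

```r
# States 1 and 4 are absorbing; states 2 and 3 are transient.
P <- matrix(c(1.0, 0.0, 0.0, 0.0,
              0.5, 0.0, 0.5, 0.0,
              0.0, 0.5, 0.0, 0.5,
              0.0, 0.0, 0.0, 1.0), nrow = 4, byrow = TRUE)

trans <- c(2, 3)
Q <- P[trans, trans]            # transitions among transient states
N <- solve(diag(nrow(Q)) - Q)   # fundamental matrix N = (I - Q)^-1
steps <- N %*% rep(1, nrow(N))  # t = N * 1: expected steps before absorption
```

For this walk, each transient state is one step from an absorbing end and one step from the other transient state, and the computation gives two expected steps from either.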

99 citations


Journal ArticleDOI
TL;DR: The GA package provides a collection of general-purpose functions for optimisation using genetic algorithms, including hybrid GAs that combine the power of genetic algorithms with the speed of a local optimiser.
Abstract: Genetic algorithms are stochastic iterative algorithms in which a population of individuals evolve by emulating the process of biological evolution and natural selection. The R package GA provides a collection of general purpose functions for optimisation using genetic algorithms. This paper describes some enhancements recently introduced in version 3 of the package. In particular, hybrid GAs have been implemented by including the option to perform local searches during the evolution. This makes it possible to combine the power of genetic algorithms with the speed of a local optimiser. Another major improvement is the provision of facilities for parallel computing. Parallelisation has been implemented using both the master-slave approach and the islands evolution model. Several examples of usage are presented, with both real-world data examples and benchmark functions, showing that often high-quality solutions can be obtained more efficiently.
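A minimal hybrid-GA sketch (assuming the GA package, version 3 or later, is installed; the fitness function is a toy example, and the argument names follow version 3 of the package):

```r
library(GA)

f <- function(x) -(x - 2)^2 + 5  # toy fitness, maximised at x = 2

res <- ga(type = "real-valued", fitness = f,
          lower = -10, upper = 10,
          optim = TRUE)  # hybrid GA: run a local search during the evolution
summary(res)
```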

83 citations


Journal ArticleDOI
TL;DR: An introduction to the mosaic package describes some of the guiding principles behind the design and provides illustrative examples of several of the most important functions it implements to help students “think with data” using R in their early course work.
Abstract: The mosaic package provides a simplified and systematic introduction to the core functionality related to descriptive statistics, visualization, modeling, and simulation-based inference required in first and second courses in statistics. This introduction to the package describes some of the guiding principles behind the design of the package and provides illustrative examples of several of the most important functions it implements. These can be combined to help students “think with data” using R in their early course work, starting with simple, yet powerful, declarative commands.
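A sketch of the declarative commands mentioned above (assuming the mosaic package is installed; KidsFeet is a data set shipped with its companion data package):

```r
library(mosaic)

mean(~ length | sex, data = KidsFeet)  # group means via the formula interface
favstats(~ length, data = KidsFeet)    # a compact table of summary statistics
```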

81 citations


Journal ArticleDOI
TL;DR: The smoof package implements a large set of test functions and test function generators for both the single- and multi-objective cases in continuous optimization, and provides functions to easily create one's own test functions.
Abstract: Benchmarking algorithms for optimization problems is usually carried out by running the algorithms under consideration on a diverse set of benchmark or test functions. A vast variety of test functions has been proposed by researchers and is used for investigations in the literature. The smoof package implements a large set of test functions and test function generators for both the single- and multi-objective case in continuous optimization and provides functions to easily create one's own test functions. Moreover, the package offers some additional helper methods, which can be used in the context of optimization.
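As a sketch of the generator interface (assuming the smoof package is installed; the generators follow the package's makeXxxFunction naming convention):

```r
library(smoof)

fn <- makeSphereFunction(dimensions = 2)  # a classic single-objective test function
fn(c(1, 1))                               # evaluate it at a point
getGlobalOptimum(fn)                      # helper returning the optimum's location and value
```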

61 citations


Journal ArticleDOI
TL;DR: This work provides explicit formulas for the implementation of the estimator of the (stratified) baseline hazard function in the presence of tied event times and obtains fast access to the baseline hazards and predictions of survival probabilities, their confidence intervals and confidence bands.
Abstract: In the presence of competing risks a prediction of the time-dynamic absolute risk of an event can be based on cause-specific Cox regression models for the event and the competing risks (Benichou and Gail, 1990). We present computationally fast and memory optimized C++ functions with an R interface for predicting the covariate specific absolute risks, their confidence intervals, and their confidence bands based on right censored time to event data. We provide explicit formulas for our implementation of the estimator of the (stratified) baseline hazard function in the presence of tied event times. As a by-product we obtain fast access to the baseline hazards (compared to survival::basehaz()) and predictions of survival probabilities, their confidence intervals and confidence bands. Confidence intervals and confidence bands are based on point-wise asymptotic expansions of the corresponding statistical functionals. The software presented here is implemented in the riskRegression package.

59 citations


Journal ArticleDOI
TL;DR: Recently added interactive visualizations to explore association rules are discussed and how easily they can be used in arulesViz via a unified interface is demonstrated.
Abstract: Association rule mining is a popular data mining method to discover interesting relationships between variables in large databases. An extensive toolbox is available in the R extension package arules. However, mining association rules often results in a vast number of found rules, leaving the analyst with the task of going through a large set of rules to identify interesting ones. Sifting manually through extensive sets of rules is time-consuming and strenuous. Visualization, and especially interactive visualization, has a long history of making large amounts of data better accessible. The R extension package arulesViz provides the most popular visualization techniques for association rules. In this paper, we discuss recently added interactive visualizations to explore association rules and demonstrate how easily they can be used in arulesViz via a unified interface. With examples, we help to guide the user in selecting appropriate visualizations and interpreting the results.
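A brief sketch of the unified plotting interface (assuming the arules and arulesViz packages are installed; the Groceries data set ships with arules, and the engine argument selects the interactive htmlwidget-based view described in the paper):

```r
library(arules)
library(arulesViz)

data(Groceries)
rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.5))

plot(rules)                                           # static scatter plot of the rule set
plot(rules, method = "graph", engine = "htmlwidget")  # interactive graph visualization
```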

Journal ArticleDOI
TL;DR: The table1() function in the furniture package streamlines much of the exploratory data analysis process, making the computation and communication of summary statistics simple and beautiful while offering significant time-savings to the researcher.
Abstract: A basic understanding of the distributions of study variables and the relationships among them is essential to inform statistical modeling. This understanding is achieved through the computation of summary statistics and exploratory data analysis. Unfortunately, this step tends to be under-emphasized in the research process, in part because of the often tedious nature of thorough exploratory data analysis. The table1() function in the furniture package streamlines much of the exploratory data analysis process, making the computation and communication of summary statistics simple and beautiful while offering significant time-savings to the researcher.


Journal ArticleDOI
TL;DR: The Rocker project provides a suite of Docker images with customized R environments for particular tasks, which can increase portability, scaling, reproducibility, and convenience for R users and developers.
Abstract: We describe the Rocker project, which provides a widely-used suite of Docker images with customized R environments for particular tasks. We discuss how this suite is organized, and how these tools can increase portability, scaling, reproducibility, and convenience of R users and developers.

Journal ArticleDOI
TL;DR: A new R package is presented for dealing with non-normality and variance heterogeneity of sample data when conducting hypothesis tests of main effects and interactions in mixed models; it builds on an existing SAS program which implements Johansen's general formulation of the Welch-James statistic with approximate degrees of freedom.
Abstract: A new R package is presented for dealing with non-normality and variance heterogeneity of sample data when conducting hypothesis tests of main effects and interactions in mixed models. The proposal builds on an existing SAS program which implements Johansen’s general formulation of the Welch-James statistic with approximate degrees of freedom, which makes it suitable for testing any linear hypothesis concerning cell means in univariate and multivariate mixed model designs when the data exhibit non-normality and non-homogeneous variance. Improved Type I error rate control is obtained by using bootstrapping to calculate an empirical critical value, whereas robustness against non-normality is achieved through trimmed means and Winsorized variances. A wrapper function eases the application of the test in common situations, such as performing omnibus tests on all effects and interactions, pairwise contrasts, and tetrad contrasts of two-way interactions. The package is demonstrated on several problems, including unbalanced univariate and multivariate designs.

Journal ArticleDOI
TL;DR: Three different approaches to visualize networks by building on the grammar of graphics framework implemented in the ggplot2 package are explored, which allow users to enhance networks with additional information on edges and nodes and convert network data objects to the more familiar data frames.
Abstract: This paper explores three different approaches to visualize networks by building on the grammar of graphics framework implemented in the ggplot2 package. The goal of each approach is to provide the user with the ability to apply the flexibility of ggplot2 to the visualization of network data, including through the mapping of network attributes to specific plot aesthetics. By incorporating networks in the ggplot2 framework, these approaches (1) allow users to enhance networks with additional information on edges and nodes, (2) give access to the strengths of ggplot2, such as layers and facets, and (3) convert network data objects to the more familiar data frames.

Journal ArticleDOI
TL;DR: The eurostat R package is introduced that provides a collection of custom tools for the Eurostat open data service, including functions to query, download, manipulate, and visualize these data sets in a smooth, automated and reproducible manner.
Abstract: The increasing availability of open statistical data resources is providing novel opportunities for research and citizen science. Efficient algorithmic tools are needed to realize the full potential of the new information resources. We introduce the eurostat R package that provides a collection of custom tools for the Eurostat open data service, including functions to query, download, manipulate, and visualize these data sets in a smooth, automated and reproducible manner. The online documentation provides detailed examples on the analysis of these spatio-temporal data collections. This work provides substantial improvements over the previously available tools, and has been extensively tested by an active user community. The eurostat R package contributes to the growing open source ecosystem dedicated to reproducible research in computational social science and digital humanities.
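A sketch of the query-download-label workflow (assuming the eurostat package is installed and internet access is available; the search term is illustrative):

```r
library(eurostat)

hits <- search_eurostat("road accidents")  # search the Eurostat table of contents
dat  <- get_eurostat(hits$code[1])         # download one data set
dat  <- label_eurostat(dat)                # replace variable codes with readable labels
head(dat)
```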

Journal ArticleDOI
TL;DR: The PSF package implements the Pattern Sequence based Forecasting (PSF) algorithm for univariate time series forecasting, which has been successfully applied to many different fields.
Abstract: This paper discusses an R package that implements the Pattern Sequence based Forecasting (PSF) algorithm, which was developed for univariate time series forecasting. This algorithm has been successfully applied to many different fields. The PSF algorithm consists of two major parts: clustering and prediction. The clustering part includes selection of the optimum number of clusters. It labels time series data with reference to such clusters. The prediction part includes functions like optimum window size selection for specific patterns and prediction of future values with reference to past pattern sequences. The PSF package consists of various functions to implement the PSF algorithm. It also contains a function which automates all other functions to obtain optimized prediction results. The aim of this package is to promote the PSF algorithm and to ease its implementation with minimal effort. This paper describes all the functions in the PSF package with their syntax. It also provides a simple example of usage. Finally, the usefulness of this package is discussed by comparing it to auto.arima and ets, well-known time series forecasting functions available in the CRAN repository.

Journal ArticleDOI
TL;DR: flan, a package providing tools for fluctuation analysis of mutant cell counts, is described; it includes functions dedicated to the distribution of final numbers of mutant cells, and implements parametric estimation and hypothesis testing, enabling inference on different sorts of data with several possible methods.
Abstract: This paper describes flan, a package providing tools for fluctuation analysis of mutant cell counts. It includes functions dedicated to the distribution of final numbers of mutant cells. Parametric estimation and hypothesis testing are also implemented, enabling inference on different sorts of data with several possible methods. An overview of the subject is proposed. The general form of mutation models is described, including the classical models as particular cases. Estimating from a model, when the data have been generated by another, induces different possible biases, which are identified and discussed. The three estimation methods available in the package are described, and their mean squared errors are compared. Finally, implementation is discussed, and a few examples of usage on real data sets are given.

Journal ArticleDOI
TL;DR: The wec package is introduced, which provides functions to apply weighted effect coding to factor variables, and to interactions between (a) a factor variable and a continuous variable and (b) two factor variables.
Abstract: Weighted effect coding refers to a specific coding matrix used to include factor variables in generalised linear regression models. With weighted effect coding, the effect for each category represents the deviation of that category from the weighted mean (which corresponds to the sample mean). This technique has particularly attractive properties when analysing observational data, which are commonly unbalanced. The wec package is introduced, which provides functions to apply weighted effect coding to factor variables, and to interactions between (a) a factor variable and a continuous variable and (b) two factor variables.
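A hedged sketch of applying weighted effect coding to a factor (assuming the wec package is installed; the data are simulated and the omitted reference category is arbitrary):

```r
library(wec)

df <- data.frame(y = rnorm(100),
                 g = factor(sample(c("a", "b", "c"), 100, replace = TRUE)))

contrasts(df$g) <- contr.wec(df$g, omitted = "a")  # weighted effect coding matrix
fit <- lm(y ~ g, data = df)  # coefficients: deviations from the weighted (sample) mean
summary(fit)
```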

Journal ArticleDOI
TL;DR: cleanNLP provides a set of fast tools for converting a textual corpus into normalized tables; its annotators include tokenization, part of speech tagging, named entity recognition, entity linking, sentiment analysis, dependency parsing, coreference resolution, and information extraction.
Abstract: The package cleanNLP provides a set of fast tools for converting a textual corpus into a set of normalized tables. The underlying natural language processing pipeline utilizes Stanford's CoreNLP library, exposing a number of annotation tasks for text written in English, French, German, and Spanish. Annotators include tokenization, part of speech tagging, named entity recognition, entity linking, sentiment analysis, dependency parsing, coreference resolution, and information extraction.

Journal ArticleDOI
TL;DR: The afmToolkit R package allows the user to automatically batch-process AFM force-distance and force-time curves, easing the basic processing of large amounts of AFM F-d/t curves at once.
Abstract: Atomic force microscopy (AFM) is widely used to measure molecular and colloidal interactions as well as mechanical properties of biomaterials. In this paper the afmToolkit R package is introduced. This package allows the user to automatically batch-process AFM force-distance and force-time curves. afmToolkit capabilities range from importing ASCII files and preprocessing the curves (contact point detection, baseline correction, etc.) to finding relevant physical information, such as Young’s modulus, adhesion energies and exponential decay for force relaxation and creep experiments. This package also contains plotting, summary and feature extraction functions. The package also comes with several data sets so the user can test the aforementioned features with ease. The package afmToolkit eases the basic processing of large amounts of AFM F-d/t curves at once. It is also flexible enough to easily incorporate new functions as they are needed and can be seen as a programming infrastructure for further algorithm development.

Journal ArticleDOI
TL;DR: Compared with other R packages that bridge to GIS, RQGIS offers a wider range of geoalgorithms, is often easier to use thanks to various convenience functions, and supports the seamless integration of Python code using reticulate from within R for improved extensibility.
Abstract: Integrating R with Geographic Information Systems (GIS) extends R’s statistical capabilities with the numerous geoprocessing and data handling tools available in a GIS. QGIS is one of the most popular open-source GIS programs, and it furthermore integrates other GIS software, such as the System for Automated Geoscientific Analyses (SAGA) GIS and the Geographic Resources Analysis Support System (GRASS) GIS, within a single software environment. This, together with the QGIS Python API, makes it a perfect candidate for console-based geoprocessing. By establishing an interface, the R package RQGIS makes it possible to use QGIS as a geoprocessing workhorse from within R. Compared to other packages building a bridge to GIS (e.g., rgrass7, RSAGA, RPyGeo), RQGIS offers a wider range of geoalgorithms, and is often easier to use due to various convenience functions. Finally, RQGIS supports the seamless integration of Python code using reticulate from within R for improved extensibility.
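A typical RQGIS session follows a search-inspect-run pattern. The sketch below assumes a local QGIS installation; the algorithm name and the file paths ("dem.tif", "slope.tif") are illustrative assumptions, and parameter names should be checked against the output of get_usage() for the installed QGIS version.

```r
# Sketch of a typical RQGIS workflow: search for a geoalgorithm,
# inspect its parameters, then run it from R.
# Paths and parameter names are illustrative assumptions.
library(RQGIS)

set_env()                                    # locate the QGIS installation
find_algorithms("slope", name_only = TRUE)   # search available geoalgorithms
get_usage("grass7:r.slope.aspect")           # list the algorithm's parameters

# run the algorithm; load_output = TRUE reads the result back into R
slope <- run_qgis("grass7:r.slope.aspect",
                  elevation = "dem.tif",     # hypothetical input raster
                  slope = "slope.tif",       # hypothetical output path
                  load_output = TRUE)
```

The convenience here is that run_qgis() handles the translation between R objects, files on disk, and the QGIS processing framework, so the same call pattern works across QGIS, SAGA, and GRASS algorithms.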

Journal ArticleDOI
TL;DR: The PGEE package includes three main functions: CVfit, which computes the cross-validated tuning parameter; PGEE, which performs simultaneous estimation and variable selection for longitudinal data with high-dimensional covariates; and MGEE, which fits unpenalized GEE to the data for comparison.
Abstract: We introduce an R package PGEE that implements the penalized generalized estimating equations (GEE) procedure proposed by Wang et al. (2012) to analyze longitudinal data with a large number of covariates. The PGEE package includes three main functions: CVfit, PGEE, and MGEE. The CVfit function computes the cross-validated tuning parameter for penalized generalized estimating equations. The function PGEE performs simultaneous estimation and variable selection for longitudinal data with high-dimensional covariates; whereas the function MGEE fits unpenalized GEE to the data for comparison. The R package PGEE is illustrated using a yeast cell-cycle gene expression data set.
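The two-step workflow above (tune, then fit) can be sketched as follows. This is a hedged sketch: the argument names (fold, lambda.vec, scale.fix) and the lam.opt component of the CVfit result are assumptions based on the package documentation and should be verified with ?CVfit and ?PGEE.

```r
# Sketch: choose lambda by cross-validation with CVfit, then fit the
# penalized GEE. Argument and component names are assumptions.
library(PGEE)

data(yeastG1)                 # yeast cell-cycle data shipped with the package
form <- y ~ . - id - 1        # regress expression on all covariates

cv <- CVfit(formula = form, id = id, data = yeastG1,
            family = gaussian(link = "identity"), fold = 4,
            lambda.vec = seq(0.01, 0.3, length.out = 10),
            scale.fix = TRUE, scale.value = 1)

fit <- PGEE(formula = form, id = id, data = yeastG1,
            family = gaussian(link = "identity"),
            corstr = "independence", lambda = cv$lam.opt)
summary(fit)
```

Replacing the PGEE call with MGEE (same formula, no lambda) yields the unpenalized fit used as the comparison baseline in the article.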

Journal ArticleDOI
TL;DR: The checkmate package provides a plethora of functions to check the type and related properties of the most frequently used R objects and variable types; these can be employed to detect unexpected input during runtime and to signal understandable and traceable errors.
Abstract: Dynamically typed programming languages like R allow programmers to write generic, flexible and concise code and to interact with the language using an interactive read-eval-print loop (REPL). However, this flexibility has its price: as the R interpreter has no information about the expected variable type, many base functions automatically convert the input instead of raising an exception. Unfortunately, this frequently leads to runtime errors deeper down the call stack, which obfuscates the original problem and renders debugging challenging. Even worse, unwanted conversions can remain undetected and skew or invalidate the results of a statistical analysis. As a remedy, assertions can be employed to detect unexpected input during runtime and to signal understandable and traceable errors. The package "checkmate" provides a plethora of functions to check the type and related properties of the most frequently used R objects and variable types. The package is mostly written in C to avoid any unnecessary performance overhead. Thus, the programmer can conveniently write concise, well-tested assertions which outperform custom R code for many applications. Furthermore, checkmate simplifies writing unit tests with the framework "testthat" by extending it with plenty of additional expectation functions, and registered C routines are available for package developers to perform assertions on arbitrary SEXPs (the internal data structure for R objects, implemented as a struct in C) in compiled code.
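The assertion style described above comes in three flavours: assert_* functions signal an informative error, test_* functions return a logical, and check_* functions return TRUE or a message string. A brief illustration:

```r
# Guard a function against silent type coercion with checkmate.
library(checkmate)

scale01 <- function(x) {
  # fail fast with a traceable error instead of silently converting input
  assert_numeric(x, min.len = 1, any.missing = FALSE, finite = TRUE)
  (x - min(x)) / (max(x) - min(x))
}

scale01(c(2, 4, 6))     # works: 0.0 0.5 1.0
test_numeric("a")       # FALSE -- no error, just a logical
check_count(-1)         # returns a message string instead of TRUE
# scale01(letters)      # would raise an error naming the offending argument
```

Because the checks run in C, such guards add negligible overhead even when placed at the top of frequently called functions.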

Journal ArticleDOI
TL;DR: The MDplot package provides plotting functions for the automated visualisation of molecular dynamics simulation output; a Bash interface is also provided, allowing simple embedding of MDplot into Bash scripts as the final analysis step.
Abstract: The MDplot package provides plotting functions to allow for automated visualisation of molecular dynamics simulation output. It is especially useful in cases where the plot generation is rather tedious due to complex file formats or when a large number of plots are generated. The graphs that are supported range from the standard, such as RMSD/RMSF (root-mean-square deviation and root-mean-square fluctuation, respectively), to the less standard, such as thermodynamic integration analysis and hydrogen bond monitoring over time. All told, they address many commonly used analyses. In this article, we set out the MDplot package's functions, give examples of the function calls, and show the associated plots. Plotting and data parsing are separated in all cases, i.e., the respective functions can be used independently. Thus, data manipulation and the integration of additional file formats are fairly easy. Currently, the loading functions support GROMOS, GROMACS, and AMBER file formats. Moreover, we also provide a Bash interface that allows simple embedding of MDplot into Bash scripts as the final analysis step. Availability: The package can be obtained in the latest major version from CRAN (https://cran.r-project.org/package=MDplot) or in the most recent version from the project's GitHub page at https://github.com/MDplot/MDplot, where feedback is also most welcome. MDplot is published under the GPL-3 license.

Journal ArticleDOI
TL;DR: The spduration package implements split-population duration regression in R, allowing for time-varying covariates; accounting for units that are immune to the outcome can significantly increase predictive performance compared to standard duration models.
Abstract: We present an implementation of split-population duration regression in the spduration (Beger et al., 2017) package for R that allows for time-varying covariates. The statistical model accounts for units that are immune to a certain outcome and are not part of the duration process the researcher is primarily interested in. We provide insights into when immune units exist and show that accounting for them can significantly increase predictive performance compared to standard duration models. The package includes estimation and several post-estimation methods for split-population Weibull and log-logistic models. We provide an empirical application to data on military coups.
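The two-equation structure of the model (a duration equation plus an at-risk/immunity equation) is reflected directly in the fitting interface. The sketch below is an assumption-laden outline of the coups application: the data set name, the add_duration() helper arguments, and the covariate choice are taken from the package documentation as best recalled and should be checked against ?spdur.

```r
# Sketch: fit a split-population Weibull model to military coups data.
# Data set, helper arguments, and covariates are illustrative assumptions.
library(spduration)

data(coups)                                   # coup attempts panel data
dat <- add_duration(coups, "succ.coup",       # build duration counters
                    unitID = "gwcode", tID = "year", freq = "year")

fit <- spdur(duration ~ polity2,              # duration equation
             atrisk ~ polity2,                # immunity / at-risk equation
             data = dat, distr = "weibull")
summary(fit)
```

The atrisk formula is what distinguishes this from a standard duration model: it estimates, per unit, the probability of ever being susceptible to the event at all.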

Journal ArticleDOI
TL;DR: Numerical experiments show that the functions of the mle.tools package are efficient for estimating biases by the Cox-Snell formula and for calculating the observed and expected Fisher information.
Abstract: Recently, Mazucheli (2017) uploaded the package mle.tools to CRAN. It can be used for bias corrections of maximum likelihood estimates through the methodology proposed by Cox and Snell (1968). The main function of the package, coxsnell.bc(), computes the bias-corrected maximum likelihood estimates. Although, in general, the bias-corrected estimators may be expected to have better sampling properties than the uncorrected estimators, analytical expressions from the formula proposed by Cox and Snell (1968) are either tedious or impossible to obtain. The purpose of this paper is twofold: first, to introduce the mle.tools package, especially the coxsnell.bc() function; second, to compare, for thirty-one continuous distributions, the bias estimates from the coxsnell.bc() function and the bias estimates from analytical expressions available in the literature. We also compare, for five distributions, the observed and expected Fisher information. Our numerical experiments show that the functions are efficient for estimating biases by the Cox-Snell formula and for calculating the observed and expected Fisher information.
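A small example of the coxsnell.bc() call pattern for an exponential sample follows. The key idea is that the density and log-density are passed as quoted R expressions, from which the package derives the Cox-Snell correction numerically; the exact argument names here are assumptions to verify against ?coxsnell.bc.

```r
# Sketch: bias-correcting the MLE of an exponential rate parameter.
# Argument names are assumptions based on the package documentation.
library(mle.tools)

set.seed(1)
x <- rexp(50, rate = 2)
lambda.hat <- 1 / mean(x)        # closed-form MLE of the rate

coxsnell.bc(density    = quote(lambda * exp(-lambda * x)),
            logdensity = quote(log(lambda) - lambda * x),
            n = length(x), parms = c("lambda"),
            mle = lambda.hat, lower = 0, upper = Inf)
```

For the exponential rate the analytical Cox-Snell correction is known, which is exactly the kind of comparison the paper carries out for its thirty-one distributions.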

Journal ArticleDOI
TL;DR: The authors’ package MCI implements the steps of market area analysis in R, with a focus on model fitting and on data preparation and processing.
Abstract: In retail location analysis, marketing research and spatial planning, the market areas of stores and/or locations are a frequent subject. Market area analyses consist of empirical observations and modeling via theoretical and/or econometric models such as the Huff Model or the Multiplicative Competitive Interaction Model. The authors’ package MCI implements the steps of market area analysis in R, with a focus on fitting the models and on data preparation and processing.

Journal ArticleDOI
TL;DR: Combined, anomalyDetection offers cyber analysts an efficient and simplified approach to breaking network events into time-segment blocks and identifying periods associated with suspected anomalies for further evaluation.
Abstract: As the number of cyber-attacks continues to grow on a daily basis, so does the delay in threat detection. For instance, in 2015, the Office of Personnel Management (OPM) discovered that approximately 21.5 million individual records of Federal employees and contractors had been stolen. On average, the time between an attack and its discovery is more than 200 days. In the case of the OPM breach, the attack had been going on for almost a year. Currently, cyber analysts inspect numerous potential incidents on a daily basis, but have neither the time nor the resources available to perform such a task. anomalyDetection aims to curtail the time frame in which anomalous cyber activities go unnoticed and to aid in the efficient discovery of these anomalous transactions among the millions of daily logged events by i) providing an efficient means for pre-processing and aggregating cyber data for analysis by employing a tabular vector transformation and handling multicollinearity concerns; ii) offering numerous built-in multivariate statistical functions, such as Mahalanobis distance, factor analysis, and principal components analysis, to identify anomalous activity; and iii) incorporating the pipe operator (%>%) to allow it to work well in the tidyverse workflow. Combined, anomalyDetection offers cyber analysts an efficient and simplified approach to breaking network events into time-segment blocks and identifying periods associated with suspected anomalies for further evaluation.
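The Mahalanobis-distance step at the heart of the approach can be illustrated with base R alone, as a minimal sketch: rows far from the multivariate centre of the data receive large distances and become anomaly candidates. This is not the package's own interface; anomalyDetection wraps this kind of computation, together with the tabular vector transformation and multicollinearity handling, in its own pipe-friendly functions.

```r
# Base-R sketch of the Mahalanobis-distance idea behind anomalyDetection:
# flag observations whose squared distance from the multivariate centre
# exceeds a chi-squared cutoff.
set.seed(42)
logs <- rbind(matrix(rnorm(200), ncol = 2),   # 100 "normal" events
              c(8, 8))                        # one injected anomaly

d2 <- mahalanobis(logs, colMeans(logs), cov(logs))

# flag observations beyond the 0.975 chi-squared quantile
suspect <- which(d2 > qchisq(0.975, df = ncol(logs)))
```

Under multivariate normality the squared distances are approximately chi-squared distributed with degrees of freedom equal to the number of columns, which motivates the cutoff used here.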