
Showing papers in "Computational Statistics in 2020"


Journal ArticleDOI
TL;DR: This work focuses on the clustering of such functional data, in order to ease their modeling and understanding, and presents a novel clustering technique based on a latent mixture model which fits the data in group-specific functional subspaces through a multivariate functional principal component analysis.
Abstract: With the emergence of numerical sensors in many aspects of everyday life, there is an increasing need for analyzing multivariate functional data. This work focuses on the clustering of such functional data, in order to ease their modeling and understanding. To this end, a novel clustering technique for multivariate functional data is presented. This method is based on a functional latent mixture model which fits the data in group-specific functional subspaces through a multivariate functional principal component analysis. A family of parsimonious models is obtained by constraining model parameters within and between groups. An EM algorithm is proposed for model inference and the choice of hyper-parameters is addressed through model selection. Numerical experiments on simulated datasets highlight the good performance of the proposed methodology compared to existing works. This algorithm is then applied to the analysis of pollution in French cities for one year.
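As a rough illustration of the FPCA-then-cluster idea only (a simplified stand-in, not the paper's latent mixture model or its EM algorithm; all names and data below are hypothetical), one can extract principal component scores from discretized curves and cluster them:

set.seed(1)
n <- 60; t_grid <- seq(0, 1, length.out = 100)
group <- rep(1:2, each = n / 2)
curves <- t(sapply(group, function(g)
  sin(2 * pi * t_grid + g) + rnorm(length(t_grid), sd = 0.2)))
scores <- prcomp(curves, rank. = 3)$x    # discretized functional PC scores
cl <- kmeans(scores, centers = 2, nstart = 20)$cluster
table(cl, group)                         # how well the simulated groups are recovered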

56 citations


Journal ArticleDOI
TL;DR: A new class of distributions called the odd log-logistic Lindley-G family is proposed, whose members can have symmetrical, right-skewed, left-skewed and reversed-J shaped densities, and decreasing, increasing, bathtub, unimodal and reversed-J shaped hazard rates.
Abstract: In this paper, a new class of distributions called the odd log-logistic Lindley-G family is proposed. Several of its statistical and reliability properties are studied in detail. Members of the proposed family can have symmetrical, right-skewed, left-skewed and reversed-J shaped densities, and decreasing, increasing, bathtub, unimodal and reversed-J shaped hazard rates. The model parameters are estimated using the maximum likelihood and Bayesian methods. A Monte Carlo simulation study is carried out to examine the bias and mean square error of the maximum likelihood and Bayesian estimators. Finally, four real data sets are analyzed to show the flexibility of the new family.
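As a hedged baseline sketch (not the proposed odd log-logistic Lindley-G family itself), the plain one-parameter Lindley building block can be fitted by maximum likelihood with optim; the data here are simulated for illustration only:

dlindley <- function(x, theta) theta^2 / (1 + theta) * (1 + x) * exp(-theta * x)
negloglik <- function(log_theta, x) -sum(log(dlindley(x, exp(log_theta))))
set.seed(1)
x <- rexp(200, rate = 0.7)               # illustrative data, not from the paper
fit <- optim(log(1), negloglik, x = x, method = "BFGS")
exp(fit$par)                             # estimated theta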

38 citations


Journal ArticleDOI
TL;DR: The R package PlackettLuce, as discussed by the authors, implements a generalization of the Plackett–Luce model for rankings data, accommodating both ties and partial rankings.
Abstract: This paper presents the R package PlackettLuce, which implements a generalization of the Plackett–Luce model for rankings data. The generalization accommodates both ties (of arbitrary order) and partial rankings (complete rankings of subsets of items). By default, the implementation adds a set of pseudo-comparisons with a hypothetical item, ensuring that the underlying network of wins and losses between items is always strongly connected. In this way, the worth of each item always has a finite maximum likelihood estimate, with finite standard error. The use of pseudo-comparisons also has a regularization effect, shrinking the estimated parameters towards equal item worth. In addition to standard methods for model summary, PlackettLuce provides a method to compute quasi standard errors for the item parameters. This provides the basis for comparison intervals that do not change with the choice of identifiability constraint placed on the item parameters. Finally, the package provides a method for model-based partitioning using covariates whose values vary between rankings, enabling the identification of subgroups of judges or settings with different item worths. The features of the package are demonstrated through application to classic and novel data sets.
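A minimal usage sketch, assuming the package's documented interface (one row per ranking, items in columns, rank values as entries, 0 for unranked items) and that the quasi-standard-error method described above is exposed as qvcalc(); the data are made up:

# install.packages("PlackettLuce")
library(PlackettLuce)
R <- matrix(c(1, 2, 0, 3,               # assumes the documented rankings-matrix interface
              2, 1, 3, 0,
              1, 0, 2, 2,               # items C and D tied
              0, 1, 2, 3),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("A", "B", "C", "D")))
mod <- PlackettLuce(as.rankings(R))
summary(mod)                            # estimated item worths
qvcalc(mod)                             # quasi standard errors for comparison intervals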

33 citations


Journal ArticleDOI
TL;DR: A new one-parameter model on the unit interval is defined, called the unit-improved second-degree Lindley distribution, some of its structural properties are obtained, and it is shown empirically to be competitive with the beta, Kumaraswamy, simplex, unit-Lindley, unit-Gamma and Topp–Leone models.
Abstract: We define a new one-parameter model on the unit interval, called the unit-improved second-degree Lindley distribution, and obtain some of its structural properties. The methods of maximum likelihood, bias-corrected maximum likelihood, moments, least squares and weighted least squares are used to estimate the unknown parameter. The finite sample performance of these methods is investigated by means of Monte Carlo simulations. Moreover, we introduce a new regression model as an alternative to the beta, unit-Lindley and simplex regression models and present a residual analysis based on Pearson and Cox–Snell residuals. The new models are shown empirically to be competitive with the beta, Kumaraswamy, simplex, unit-Lindley, unit-Gamma and Topp–Leone models by means of two real data sets. Empirical findings indicate that the proposed models can provide better fits than other competitive models when the data are close to the boundaries of the unit interval.

26 citations


Journal ArticleDOI
TL;DR: This paper provides a review of SML from a Bayesian decision theoretic point of view, arguing that many SML techniques are closely connected to making inference using the so-called Bayesian paradigm.
Abstract: Statistical Machine Learning (SML) refers to a body of algorithms and methods by which computers are allowed to discover important features of input data sets which are often very large in size. The very task of feature discovery from data is essentially the meaning of the keyword ‘learning’ in SML. Theoretical justifications for the effectiveness of the SML algorithms are underpinned by sound principles from different disciplines, such as Computer Science and Statistics. The theoretical underpinnings particularly justified by statistical inference methods are collectively termed statistical learning theory. This paper provides a review of SML from a Bayesian decision theoretic point of view, in which we argue that many SML techniques are closely connected to making inference using the so-called Bayesian paradigm. We discuss many important SML techniques such as supervised and unsupervised learning, deep learning, online learning and Gaussian processes, especially in the context of the very large data sets where these are often employed. We present a dictionary which maps the key concepts of SML between Computer Science and Statistics. We illustrate the SML techniques with three moderately large data sets, where we also discuss many practical implementation issues. The review is thus especially targeted at statisticians and computer scientists who aspire to understand and apply SML to moderately large and big data sets.

22 citations


Journal ArticleDOI
TL;DR: A comparison between different modern heuristic optimization methods applied to maximize the likelihood function for parameter estimation is presented and both the performance of heuristic methods and estimation of GGD parameters are investigated.
Abstract: The generalized gamma distribution (GGD) is a popular distribution because it is extremely flexible. Due to the structure of its density function, estimating the parameters of the GGD family by statistical point estimation techniques is a complicated task; in other words, maximizing the likelihood function of the GGD for parameter estimation is problematic. Hence, alternative approaches can be used to obtain estimators of the GGD parameters. This paper proposes an alternative parameter estimation method for the GGD based on heuristic optimization approaches such as Genetic Algorithms (GA), Differential Evolution (DE), Particle Swarm Optimization (PSO) and Simulated Annealing (SA). A comparison between these modern heuristic optimization methods, applied to maximize the likelihood function for parameter estimation, is presented, and both the performance of the heuristics and the quality of the resulting GGD parameter estimates are investigated. Simulations show that the heuristic approaches provide quite accurate estimates; in most cases, DE outperforms the other heuristics in terms of the bias of the parameter estimates. Finally, the usefulness of heuristic optimization for GGD parameter estimation is illustrated with a real data set.
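A hedged sketch of the comparison using one heuristic available in base R (optim's simulated annealing, "SANN") against a quasi-Newton method, with the Stacy parameterization f(x) = p x^(d-1) exp(-(x/a)^p) / (a^d gamma(d/p)); this is not the authors' implementation of GA, DE, PSO or SA:

dggd <- function(x, a, d, p) p * x^(d - 1) * exp(-(x / a)^p) / (a^d * gamma(d / p))
nll <- function(par, x) {                 # parameters on the log scale for positivity
  a <- exp(par[1]); d <- exp(par[2]); p <- exp(par[3])
  -sum(log(dggd(x, a, d, p)))
}
set.seed(1)
x <- rgamma(300, shape = 2, scale = 1.5)  # a GGD with p = 1 (ordinary gamma)
fit_sann <- optim(c(0, 0, 0), nll, x = x, method = "SANN")
fit_bfgs <- optim(c(0, 0, 0), nll, x = x, method = "BFGS")
rbind(SANN = exp(fit_sann$par), BFGS = exp(fit_bfgs$par))   # (a, d, p) estimates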

19 citations


Journal ArticleDOI
TL;DR: The collinearity indices $$k_{j}$$, traditionally misinterpreted as variance inflation factors, are reinterpreted in this paper and used to distinguish and quantify essential and non-essential collinearity.
Abstract: Marquardt and Snee (Am Stat 29(1):3–20, 1975), Marquardt (J Am Stat Assoc 75(369):87–91, 1980) and Snee and Marquardt (Am Stat 38(2):83–87, 1984) refer to non-essential multicollinearity as that caused by the relation with the independent term. Although it is clear that the solution is to center the independent variables in the regression model, it is unclear when this kind of collinearity exists. The goal of this study is to diagnose non-essential collinearity starting from a simple linear model. The collinearity indices $$k_{j}$$, traditionally misinterpreted as variance inflation factors, are reinterpreted in this paper and used to distinguish and quantify essential and non-essential collinearity. The results can be immediately extended to the multiple linear model. The study also offers recommendations for statistical software such as SPSS, Stata, GRETL or R to improve the diagnosis of non-essential collinearity.
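A small base-R sketch of the phenomenon on hypothetical data: the condition number of the design matrix [1, x] is inflated when x has a large mean relative to its spread (non-essential collinearity) and drops after centering, which is the diagnosis discussed above:

set.seed(1)
x <- rnorm(100, mean = 50, sd = 1)
kappa(cbind(1, x), exact = TRUE)            # large: non-essential collinearity
kappa(cbind(1, x - mean(x)), exact = TRUE)  # small after centering the regressor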

16 citations


Journal ArticleDOI
TL;DR: It is shown that the Weibull model constitutes a conjugate model for the gamma frailty, leading to explicit expressions for the moments, survival functions, hazard functions, quantiles, and mean residual lifetimes, which facilitate the parameter interpretation of prognostic inference.
Abstract: In meta-analysis of individual patient data with semi-competing risks, the joint frailty–copula model has been proposed, where frailty terms account for the between-study heterogeneity and copulas account for dependence between terminal and nonterminal event times. In the previous works, the baseline hazard functions in the joint frailty–copula model are estimated by the nonparametric model or the penalized spline model, which requires complex maximization schemes and resampling-based interval estimation. In this article, we propose the Weibull distribution for the baseline hazard functions under the joint frailty–copula model. We show that the Weibull model constitutes a conjugate model for the gamma frailty, leading to explicit expressions for the moments, survival functions, hazard functions, quantiles, and mean residual lifetimes. These results facilitate the parameter interpretation of prognostic inference. We propose a maximum likelihood estimation method and make our computer programs available in the R package, joint.Cox. We also show that the delta method is feasible to calculate interval estimates, which is a useful alternative to the resampling-based method. We conduct simulation studies to examine the accuracy of the proposed methods. Finally, we use the data on ovarian cancer patients to illustrate the proposed method.

16 citations


Journal ArticleDOI
TL;DR: Simulation results demonstrate that the proposed tests are powerful under different groupings of the investigated random vector, and an empirical application to detecting dependence in a portfolio of NASDAQ stocks illustrates the applicability and effectiveness of the provided tests.
Abstract: Inspired by the correlation matrix and based on the generalized Spearman's $$\rho$$ and Kendall's $$\tau$$ between random variables proposed in Lu et al. (J Nonparametr Stat 30(4):860–883, 2018), the $$\rho$$-matrix and $$\tau$$-matrix are suggested for multivariate data sets. The matrices are used to construct the $$\rho$$-measure and the $$\tau$$-measure among random vectors, together with their statistical estimation and asymptotic distributions under the null hypothesis of independence, which yield nonparametric tests of independence for multiple vectors. Simulation results demonstrate that the proposed tests are powerful under different groupings of the investigated random vector. An empirical application to detecting dependence in the closing prices of a portfolio of NASDAQ stocks also illustrates the applicability and effectiveness of the provided tests. The corresponding measures are also applied to characterize the strength of interdependence of that portfolio of stocks over the most recent two years.
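For orientation, the variable-level analogues of these matrices are the pairwise Spearman and Kendall correlation matrices, which base R computes directly (the paper's rho-matrix and tau-matrix generalize these to blocks of variables, i.e. random vectors; the data below are simulated):

set.seed(1)
X <- matrix(rnorm(200 * 4), ncol = 4)
round(cor(X, method = "spearman"), 2)   # pairwise Spearman rho matrix
round(cor(X, method = "kendall"), 2)    # pairwise Kendall tau matrix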

16 citations


Journal ArticleDOI
TL;DR: This paper proposes a method of uncovering latent communities using both network structural information and node attributes so that the nodes within each community not only connect to other nodes in similar patterns but also share homogeneous attributes.
Abstract: Community detection is one of the main research topics in network analysis. Most network data reveal a certain structural relationship between nodes and provide attributes describing them. Utilizing available node attributes can help uncover latent communities from an observed network. In this paper, we propose a method of uncovering latent communities using both network structural information and node attributes so that the nodes within each community not only connect to other nodes in similar patterns but also share homogeneous attributes. The proposed method transforms the graph distance of nodes to structural similarity via the Gaussian kernel function. The attribute similarity between nodes is also measured by the Gaussian kernel function. Our method takes advantage of spectral clustering by appending node attributes to the node representation obtained from the network structure. Further, the proposed method has the ability to automatically learn the degree to which different attributes contribute. The solid performance of the proposed method is demonstrated in simulated data and four real-world networks.
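A simplified sketch of the ingredients only (not the authors' exact construction; the data and the way the two similarities are combined are hypothetical): Gaussian-kernel similarities built from node "distances" and from attributes are combined and fed to standard spectral clustering:

spectral_cluster <- function(S, k) {
  D <- diag(1 / sqrt(rowSums(S)))
  L <- D %*% S %*% D                          # normalized affinity matrix
  U <- eigen(L, symmetric = TRUE)$vectors[, 1:k]
  kmeans(U / sqrt(rowSums(U^2)), centers = k, nstart = 20)$cluster
}
set.seed(1)
n <- 60; g <- rep(1:2, each = n / 2)
coords <- matrix(rnorm(2 * n, mean = rep(g, 2)), ncol = 2)  # stand-in for graph distances
attrs  <- matrix(rnorm(n, mean = 2 * g), ncol = 1)          # one node attribute
S_net  <- exp(-as.matrix(dist(coords))^2)
S_attr <- exp(-as.matrix(dist(attrs))^2)
cl <- spectral_cluster(S_net * S_attr, k = 2)               # simple elementwise combination
table(cl, g)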

13 citations


Journal ArticleDOI
TL;DR: Inference and assessment of significance are based on very high-dimensional multivariate (generalized) linear models; in contrast to commonly used marginal approaches, this provides a step towards more causally oriented inference.
Abstract: We provide a view on high-dimensional statistical inference for genome-wide association studies. It is in part a review but also covers new developments for meta-analysis with multiple studies, together with novel software in the form of the R package hierinf. Inference and assessment of significance are based on very high-dimensional multivariate (generalized) linear models; in contrast to commonly used marginal approaches, this provides a step towards more causally oriented inference.

Journal ArticleDOI
TL;DR: A Bayesian approach to estimate the parameters of ordinary differential equations (ODE) from the observed noisy data is developed, replacing the ODE constraint with a probability expression and combining it with the nonparametric data fitting procedure into a joint likelihood framework.
Abstract: We develop a Bayesian approach to estimate the parameters of ordinary differential equations (ODE) from the observed noisy data. Our method does not need to solve ODE directly. We replace the ODE constraint with a probability expression and combine it with the nonparametric data fitting procedure into a joint likelihood framework. One advantage of the proposed method is that for some ODE systems, one can obtain closed form conditional posterior distributions for all variables which substantially reduce the computational cost and facilitate the convergence process. An efficient Riemann manifold based hybrid Monte Carlo scheme is implemented to generate samples for variables whose conditional posterior distribution cannot be written in terms of closed form. Our approach can be applied to situations where the state variables are only partially observed. The usefulness of the proposed method is demonstrated through applications to both simulated and real data.

Journal ArticleDOI
TL;DR: The results from the Bayesian analysis show that posterior inference might be affected by the choice of priors used to formulate the informative priors, and that the Bayesian approach provides a satisfactory estimation strategy in terms of precision compared to the frequentist approach, accounting for uncertainty in the estimation of parameters and return levels.
Abstract: Extreme value distributions (EVDs) accounting for dependency effects are used, under both frequentist and Bayesian approaches, to analyse the extremes of annual and daily maximum wind speed at Port Elizabeth, South Africa. In the frequentist approach, the parameters of the EVDs were estimated using maximum likelihood, whereas in the Bayesian approach the Markov chain Monte Carlo technique with the Metropolis–Hastings algorithm was used. The results show that EVDs fitted considering the dependency and seasonality effects within the data series provide clear benefits in terms of improved precision in the estimation of the parameters as well as of the return levels of the distributions. The paper also discusses a method to construct informative priors empirically using historical data of the underlying process from other weather stations. The results from the Bayesian analysis show that posterior inference might be affected by the choice of priors used to formulate the informative priors. The Bayesian approach provides a satisfactory estimation strategy in terms of precision compared to the frequentist approach, accounting for uncertainty in the estimation of parameters and return levels.

Journal ArticleDOI
TL;DR: The results show that the Anderson–Darling method outperforms its competitors and that the maximum likelihood estimators depend strongly on perfect ranking for accurate estimation.
Abstract: Maximum likelihood estimation (MLE) applied to ranked set sampling (RSS) designs is usually based on the assumption of perfect ranking. However, it may suffer from a lack of efficiency when ranking errors are present. The main goal of this article is to investigate the performance of six alternative estimation methods to MLE for parameter estimation under RSS. We carry out an extensive simulation study and measure the performance of the maximum product of spacings, ordinary and weighted least-squares, Cramér–von Mises, Anderson–Darling and right-tail Anderson–Darling estimators, along with the maximum likelihood estimators, through the Kullback–Leibler divergence between the true and estimated probability density functions. Our simulation study considered eight continuous probability distributions, six sample sizes and six levels of correlation between the interest and concomitant variables. In general, our results show that the Anderson–Darling method outperforms its competitors and that the maximum likelihood estimators depend strongly on perfect ranking for accurate estimation. Finally, we present an illustrative example using a data set concerning the percentage of body fat. R code is available in the supplementary material.
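To make one of the compared criteria concrete, here is a base-R sketch of the maximum product of spacings estimator for a normal model under simple random sampling (the paper applies such criteria under RSS designs, which is not reproduced here):

mps_fit <- function(x) {
  xs <- sort(x)
  obj <- function(par) {
    F <- pnorm(xs, mean = par[1], sd = exp(par[2]))
    -sum(log(diff(c(0, F, 1)) + 1e-12))   # negative log product of spacings
  }
  est <- optim(c(mean(x), log(sd(x))), obj)$par
  c(mean = est[1], sd = exp(est[2]))
}
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)
mps_fit(x)                                # maximum product of spacings estimates
c(mean(x), sd(x) * sqrt(49 / 50))         # maximum likelihood estimates for comparison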

Journal ArticleDOI
TL;DR: This paper proposes an algorithm for sparse Principal Component Analysis for non-Gaussian data that focuses on the Poisson distribution which has been used extensively in analysing text data.
Abstract: Dimension reduction tools offer a popular approach to analysis of high-dimensional big data. In this paper, we propose an algorithm for sparse Principal Component Analysis for non-Gaussian data. Since our interest for the algorithm stems from applications in text data analysis we focus on the Poisson distribution which has been used extensively in analysing text data. In addition to sparsity our algorithm is able to effectively determine the desired number of principal components in the model (order determination). The good performance of our proposal is demonstrated with both synthetic and real data examples.

Journal ArticleDOI
TL;DR: A new generalized negative binomial thinning operator with dependent counting series is introduced; the resulting model is applied to a real data set and compared to some relevant INAR(1) models.
Abstract: In this paper, we introduce a new generalized negative binomial thinning operator with dependent counting series. Some properties of the thinning operator are derived, and a new stationary integer-valued autoregressive model based on the thinning operator is constructed. In addition, various properties of the process are determined, the unknown parameters are estimated by several methods, and the behavior of the estimators is described through numerical results. Finally, the model is applied to a real data set and compared to some relevant INAR(1) models.
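For readers unfamiliar with thinning operators, here is a base-R sketch of the standard INAR(1) recursion with binomial thinning, the operator that the paper generalizes (the parameters are arbitrary):

rinar1 <- function(n, alpha, lambda, x0 = 0) {
  x <- numeric(n); x[1] <- x0
  for (t in 2:n)
    x[t] <- rbinom(1, size = x[t - 1], prob = alpha) +  # binomial thinning of X[t-1]
            rpois(1, lambda)                            # Poisson innovation
  x
}
set.seed(1)
y <- rinar1(500, alpha = 0.4, lambda = 2)
acf(y, plot = FALSE)$acf[2]               # lag-1 autocorrelation, close to alpha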

Journal ArticleDOI
TL;DR: In this paper, a supervised topic model for multi-class classification is proposed, where learned topics are directly connected to individual classes without the need for a reference class, which can handle many classes as well as many covariates.
Abstract: Generating user interpretable multi-class predictions in data-rich environments with many classes and explanatory covariates is a daunting task. We introduce Diagonal Orthant Latent Dirichlet Allocation (DOLDA), a supervised topic model for multi-class classification that can handle many classes as well as many covariates. To handle many classes we use the recently proposed Diagonal Orthant probit model (Johndrow et al., in: Proceedings of the sixteenth international conference on artificial intelligence and statistics, 2013) together with an efficient Horseshoe prior for variable selection/shrinkage (Carvalho et al. in Biometrika 97:465–480, 2010). We propose a computationally efficient parallel Gibbs sampler for the new model. An important advantage of DOLDA is that learned topics are directly connected to individual classes without the need for a reference class. We evaluate the model’s predictive accuracy and scalability, and demonstrate DOLDA’s advantage in interpreting the generated predictions.

Journal ArticleDOI
TL;DR: A novel parametric quantile regression model for limited range response variables, which can be very useful in modeling bounded response variables at different levels (quantiles) in the presence of atypical observations is introduced.
Abstract: On the basis of a two-parameter heavy-tailed distribution, we introduce a novel parametric quantile regression model for limited range response variables, which can be very useful in modeling bounded response variables at different levels (quantiles) in the presence of atypical observations. We consider a frequentist approach to perform inferences, and the maximum likelihood method is employed to estimate the model parameters. We also propose a residual analysis to assess departures from model assumptions. Additionally, the local influence method is discussed, and the normal curvature for studying local influence on the maximum likelihood estimates is derived under a specific perturbation scheme. An application to real data is presented to show the usefulness of the new parametric quantile regression model in practice.

Journal ArticleDOI
TL;DR: This article addresses the problem of joint selection of both fixed effects and random effects with the use of several shrinkage priors in linear mixed models with a stochastic search Gibbs sampler to implement a fully Bayesian approach for variable selection.
Abstract: Recently, many shrinkage priors have been proposed and studied in linear models to address massive regression problems. However, shrinkage priors are rarely used in mixed effects models. In this article, we address the problem of joint selection of both fixed effects and random effects using several shrinkage priors in linear mixed models. The idea is to shrink small coefficients to zero while minimally shrinking large coefficients, owing to the heavy tails of the priors. The shrinkage priors can be obtained via a scale mixture of normal distributions to facilitate computation. We use a stochastic search Gibbs sampler to implement a fully Bayesian approach for variable selection. The approach is illustrated using simulated data and a real example.

Journal ArticleDOI
TL;DR: It is shown that the notion of polynomial mesh (norming set), used to provide discretizations of a compact set nearly optimal for certain approximation theoretic purposes, can also be used to obtain finitely supported near G-optimal designs forPolynomial regression.
Abstract: We show that the notion of polynomial mesh (norming set), used to provide discretizations of a compact set nearly optimal for certain approximation theoretic purposes, can also be used to obtain finitely supported near G-optimal designs for polynomial regression. We approximate such designs by a standard multiplicative algorithm, followed by measure concentration via Caratheodory-Tchakaloff compression.
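A base-R sketch of the standard multiplicative algorithm on a candidate grid for degree-2 polynomial regression on [-1, 1], without the measure-concentration step via Caratheodory-Tchakaloff compression; by the Kiefer-Wolfowitz equivalence theorem the D-optimal limit is also G-optimal:

x <- seq(-1, 1, length.out = 201)
FX <- cbind(1, x, x^2)                    # regression functions on the candidate grid
w <- rep(1 / nrow(FX), nrow(FX))          # uniform starting design
for (it in 1:500) {
  M <- crossprod(sqrt(w) * FX)            # information matrix of the design w
  d <- rowSums((FX %*% solve(M)) * FX)    # variance function d(x, w)
  w <- w * d / ncol(FX)                   # multiplicative update
}
max(d)                                    # approaches p = 3 near optimality
x[w > 0.01]                               # design mass concentrates near -1, 0 and 1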

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a decision tree algorithm that allows users to guide the algorithm through the data partitioning process, which can be used to analyze data sets containing missing values.
Abstract: The main contribution of this paper is the development of a new decision tree algorithm. The proposed approach allows users to guide the algorithm through the data partitioning process. We believe this feature has many applications but in this paper we demonstrate how to utilize this algorithm to analyse data sets containing missing values. We tested our algorithm against simulated data sets with various missing data structures and a real data set. The results demonstrate that this new classification procedure efficiently handles missing values and produces results that are slightly more accurate and more interpretable than most common procedures without any imputations or pre-processing.

Journal ArticleDOI
TL;DR: A novel mixture model is proposed to describe the distribution of extremal observations, with the anomaly type viewed as a latent variable, making it possible to cluster extreme observations and obtain an informative planar representation of anomalies using standard graph-mining tools.
Abstract: In a wide variety of situations, anomalies in the behaviour of a complex system, whose health is monitored through the observation of a random vector $$\mathbf{X}=(X_1,\ldots,X_d)$$ valued in $$\mathbb{R}^d$$, correspond to the simultaneous occurrence of extreme values for certain subgroups $$\alpha \subset \{1,\ldots,d\}$$ of variables $$X_j$$. Under the heavy-tail assumption, which is precisely appropriate for modeling these phenomena, statistical methods relying on multivariate extreme value theory have been developed in the past few years for identifying such events/subgroups. This paper takes the approach much further by means of a novel mixture model that describes the distribution of extremal observations and in which the anomaly type $$\alpha$$ is viewed as a latent variable. One may then take advantage of the model by assigning to any extreme point a posterior probability for each anomaly type $$\alpha$$, implicitly defining a similarity measure between anomalies. It is explained at length how the latter can be used to cluster extreme observations and obtain an informative planar representation of anomalies using standard graph-mining tools. The relevance and usefulness of the clustering and 2-d visual display thus designed are illustrated on simulated datasets as well as on real observations from the aeronautics application domain.

Journal ArticleDOI
TL;DR: Methods based on the Kummer confluent hypergeometric function are demonstrated to be the most computationally efficient, and a new efficient formula is derived for the probability mass function of the number of renewals by a given time.
Abstract: Convolutions of independent gamma variables are encountered in many applications such as insurance, reliability, and network engineering. Accurate and fast evaluations of their density and distribution functions are critical for such applications, but no open-source, user-friendly software implementation has been available. We review several numerical evaluations of the density and distribution of a convolution of independent gamma variables and compare them with respect to accuracy and speed. The methods based on the Kummer confluent hypergeometric function are computationally the most efficient. The benefit of employing the confluent hypergeometric function is further demonstrated by a renewal process application, where the gamma variables in the convolution are Erlang variables. We derive a new computationally efficient formula for the probability mass function of the number of renewals by a given time. The R package coga provides efficient C++-based implementations of the discussed methods and is available on CRAN.
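A base-R Monte Carlo cross-check for the distribution of a sum of independent gamma variables; the commented coga call is shown only under the assumption of its documented shape/rate vector interface:

set.seed(1)
shape <- c(1, 2, 3); rate <- c(1, 0.5, 2)
sims <- rowSums(mapply(rgamma, n = 1e5, shape = shape, rate = rate))
mean(sims <= 8)                           # Monte Carlo estimate of P(sum <= 8)
# library(coga); pcoga(8, shape = shape, rate = rate)   # exact evaluation to compare (assumed interface)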

Journal ArticleDOI
TL;DR: This paper proposes a new general stochastic optimization algorithm, combining simulated annealing with the multiple-try Metropolis algorithm, and successfully applies it to the computation of the halfspace depth of data sets which are not necessarily in general position.
Abstract: The halfspace depth is a powerful tool for nonparametric multivariate analysis. However, its computation is very challenging because it involves an infimum over infinitely many directional vectors. The exact computation of the halfspace depth is an NP-hard problem if both the sample size n and the dimension d are part of the input. Approximate algorithms often cannot obtain accurate (exact) results in high-dimensional cases within a limited time. In this paper, we propose a new general stochastic optimization algorithm, which is a combination of simulated annealing and the multiple-try Metropolis algorithm. As a by-product, the new algorithm is successfully applied to the computation of the halfspace depth of data sets which are not necessarily in general position. The simulation and real data examples indicate that the new algorithm is highly competitive with other (exact and approximate) algorithms, including simulated annealing and the quasi-Newton method, in both accuracy and efficiency, especially in high-dimensional and large-sample cases.
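As a point of reference (much cruder than the proposed annealing/multiple-try Metropolis scheme), the halfspace depth can be approximated in base R by minimizing over random directions:

halfspace_depth <- function(z, X, n_dir = 5000) {
  U <- matrix(rnorm(n_dir * ncol(X)), ncol = ncol(X))
  U <- U / sqrt(rowSums(U^2))             # random unit directions
  proj <- U %*% t(sweep(X, 2, z))         # u' (X_i - z) for every direction u
  min(rowMeans(proj >= 0))                # smallest halfspace probability found
}
set.seed(1)
X <- matrix(rnorm(400), ncol = 2)
halfspace_depth(colMeans(X), X)           # close to 1/2 near the centre
halfspace_depth(c(3, 3), X)               # close to 0 in the tail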

Journal ArticleDOI
TL;DR: Some regression models for analyzing relationships between random intervals (i.e., random variables taking intervals as outcomes) are presented and it is shown that the new model provides the most accurate results by preserving the coherency with the interval nature of the data.
Abstract: Some regression models for analyzing relationships between random intervals (i.e., random variables taking intervals as outcomes) are presented. The proposed approaches are extensions of previous existing models and they account for cross relationships between midpoints and spreads (or radii) of the intervals in a unique equation based on the interval arithmetic. The estimation problem, which can be written as a constrained minimization problem, is theoretically analyzed and empirically tested. In addition, numerically stable general expressions of the estimators are provided. The main differences between the new and the existing methods are highlighted in a real-life application, where it is shown that the new model provides the most accurate results by preserving the coherency with the interval nature of the data.

Journal ArticleDOI
TL;DR: It is shown that the corrected instability measure outperforms current instability-based measures across the whole sequence of possible k, overcoming limitations of current instability-based methods for large k.
Abstract: We improve instability-based methods for the selection of the number of clusters k in cluster analysis by developing a corrected clustering distance that corrects for the unwanted influence of the distribution of cluster sizes on cluster instability. We show that our corrected instability measure outperforms current instability-based measures across the whole sequence of possible k, overcoming limitations of current instability-based methods for large k. We also compare, for the first time, model-based and model-free approaches to determining cluster instability and find their performance to be comparable. We make our method available in the R package cstab.
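A model-free sketch of the uncorrected instability idea only (without the paper's correction for the distribution of cluster sizes; data and settings are arbitrary): cluster two bootstrap samples and measure pairwise co-clustering disagreement on the shared points:

instability <- function(X, k, B = 20) {
  n <- nrow(X)
  mean(replicate(B, {
    i1 <- sample(n, replace = TRUE); i2 <- sample(n, replace = TRUE)
    common <- intersect(i1, i2)
    c1 <- kmeans(X[i1, ], k, nstart = 10)$cluster[match(common, i1)]
    c2 <- kmeans(X[i2, ], k, nstart = 10)$cluster[match(common, i2)]
    a1 <- outer(c1, c1, "=="); a2 <- outer(c2, c2, "==")
    mean(a1[upper.tri(a1)] != a2[upper.tri(a2)])   # pairwise co-clustering disagreement
  }))
}
set.seed(1)
X <- rbind(matrix(rnorm(100, 0), ncol = 2), matrix(rnorm(100, 4), ncol = 2))
sapply(2:5, function(k) instability(X, k))         # low values suggest a good k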

Journal ArticleDOI
TL;DR: In this article, the spectral condition number plot (SCNP) is used for penalty parameter assessment for ridge-type covariance (precision) estimators, which can be applied to a broad class of ridge type estimators that employ regularization to cope with the subsequent singularity of the sample covariance matrix.
Abstract: Many modern statistical applications ask for the estimation of a covariance (or precision) matrix in settings where the number of variables is larger than the number of observations. There exists a broad class of ridge-type estimators that employs regularization to cope with the subsequent singularity of the sample covariance matrix. These estimators depend on a penalty parameter and choosing its value can be hard, in terms of being computationally unfeasible or tenable only for a restricted set of ridge-type estimators. Here we introduce a simple graphical tool, the spectral condition number plot, for informed heuristic penalty parameter assessment. The proposed tool is computationally friendly and can be employed for the full class of ridge-type covariance (precision) estimators.
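A sketch for the archetypal ridge estimator S + lambda*I, one simple member of the class considered, on simulated p > n data: plotting the spectral condition number against the penalty gives the kind of display proposed:

set.seed(1)
n <- 20; p <- 50
S <- cov(matrix(rnorm(n * p), n, p))      # singular sample covariance (p > n)
lambdas <- 10^seq(-4, 1, length.out = 50)
cond <- sapply(lambdas, function(l) kappa(S + l * diag(p), exact = TRUE))
plot(lambdas, cond, log = "xy", type = "l",
     xlab = "penalty parameter", ylab = "spectral condition number")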

Journal ArticleDOI
TL;DR: It is shown how the proposed acceptance–rejection (AR) algorithm can outperform the standard random forest algorithm (RF) and some of its variants including extremely randomized (ER) trees and smooth sigmoid surrogate (SSS) trees.
Abstract: In this paper, we propose a new random forest method based on completely randomized splitting rules with an acceptance–rejection criterion for quality control. We show how the proposed acceptance–rejection (AR) algorithm can outperform the standard random forest algorithm (RF) and some of its variants, including extremely randomized (ER) trees and smooth sigmoid surrogate (SSS) trees. Twenty datasets were analyzed to compare prediction performance, and a simulated dataset was used to assess variable selection bias. In terms of prediction accuracy for classification problems, the proposed AR algorithm performed the best, with ER being the second best. For regression problems, RF and SSS performed the best, followed by AR and then ER. However, each algorithm was the most accurate for at least one of the studies. We investigate scenarios where the AR algorithm can yield better predictive performance. In terms of variable importance, both RF and SSS demonstrated selection bias in favor of variables with many possible splits, while both ER and AR largely removed this bias.

Journal ArticleDOI
TL;DR: It is shown that the two proposed estimation methods are asymptotically equivalent to the semiparametric inverse probability weighting method.
Abstract: Zero-inflated Poisson (ZIP) regression is widely applied to model the effects of covariates on an outcome count with excess zeros. In some applications, covariates in a ZIP regression model are partially observed. Based on the imputed data generated by applying the multiple imputation (MI) schemes developed by Wang and Chen (Ann Stat 37:490–517, 2009), two methods are proposed to estimate the parameters of a ZIP regression model with covariates missing at random. One, proposed by Rubin (in: Proceedings of the survey research methods section of the American Statistical Association, 1978), consists of obtaining a unified estimate as the average of the estimates from all imputed datasets. The other, proposed by Fay (J Am Stat Assoc 91:490–498, 1996), consists of averaging the estimating scores from all imputed data sets to solve the imputed estimating equation. Moreover, it is shown that the two proposed estimation methods are asymptotically equivalent to the semiparametric inverse probability weighting method. A modified formula is proposed to estimate the variances of the MI estimators. An extensive simulation study is conducted to investigate the performance of the estimation methods. The practicality of the methodology is illustrated with a dataset from a motorcycle survey of traffic regulations.
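For context, a complete-data ZIP regression fit on simulated data using pscl::zeroinfl() (the paper's multiple-imputation handling of missing covariates is not shown; the call assumes pscl's standard count-model | zero-model formula interface):

set.seed(1)
n <- 500
x <- rnorm(n)
zero <- rbinom(n, 1, plogis(-1 + x))      # excess-zero indicator
y <- ifelse(zero == 1, 0, rpois(n, exp(0.5 + 0.8 * x)))
df <- data.frame(y = y, x = x)
# install.packages("pscl")                # assumes pscl's documented zeroinfl() interface
fit <- pscl::zeroinfl(y ~ x | x, data = df)
summary(fit)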

Journal ArticleDOI
TL;DR: This paper compares the performance of the maximum likelihood estimators from RSS-based designs with that of the corresponding SRS estimators when the data are assumed to follow a Power Lindley or a Weighted Lindley distribution.
Abstract: Ranked set sampling (RSS) has proved to be a cost-efficient alternative to simple random sampling (SRS). However, there are situations where some measurements are censored, which may not ensure the superiority of RSS over SRS. In this paper, the performance of the maximum likelihood estimators is examined when the data are assumed to follow a Power Lindley or a Weighted Lindley distribution and are collected according to the original RSS or one of its two variations (median and extreme RSS). An extensive simulation study, considering uncensored and right-censored data, and perfect and imperfect ranking, is carried out for the two mentioned distributions in order to compare the performance of the maximum likelihood estimators from RSS-based designs with that of the corresponding SRS estimators. Two illustrations are presented based on real data sets: the first involves the lifetimes of aluminum specimens, while the second deals with the amount of spray mixture deposited on the leaves of apple trees.
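A base-R sketch of how a ranked set sample is drawn under perfect ranking (one cycle of set size k; the Power Lindley and Weighted Lindley models, censoring and imperfect ranking studied in the paper are not reproduced):

rss_cycle <- function(k, rdist = rnorm, ...) {
  sapply(1:k, function(r) sort(rdist(k, ...))[r])   # r-th order statistic of the r-th set
}
set.seed(1)
rss_means <- replicate(2000, mean(rss_cycle(3)))    # RSS sample mean, one cycle, set size 3
srs_means <- replicate(2000, mean(rnorm(3)))        # SRS sample mean of the same size
c(RSS = var(rss_means), SRS = var(srs_means))       # the RSS mean is markedly more efficient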