
Showing papers in "Statistics and Its Interface in 2016"


Journal ArticleDOI
TL;DR: In this article, the authors summarized recent methodological and software developments in statistics that address the big data challenges and grouped them into three classes: subsampling-based, divide and conquer, and online updating for stream data.
Abstract: Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of standard analytic tools. They present opportunities as well as challenges to statisticians. The role of computational statisticians in scientific discovery from big data analyses has been under-recognized even by peer statisticians. This article summarizes recent methodological and software developments in statistics that address the big data challenges. Methodologies are grouped into three classes: subsampling-based, divide and conquer, and online updating for stream data. As a new contribution, the online updating approach is extended to variable selection with commonly used criteria, and their performances are assessed in a simulation study with stream data. Software packages are summarized with a focus on the open-source R environment and R packages, covering recent tools that help break the barriers of computer memory and computing power. Some of the tools are illustrated in a case study with a logistic regression for the chance of airline delay.

94 citations
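
As a rough, hedged sketch of the online-updating idea summarized above (my own illustration, not the authors' code or the R packages they review), the example below accumulates the least-squares sufficient statistics X'X and X'y block by block over a simulated data stream, so the full design matrix never has to fit in memory; the block size and data-generating setup are assumptions. The same accumulation pattern underlies online updating for generalized linear models, where score and information contributions are updated instead.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
beta_true = rng.normal(size=p)

# Accumulators for the least-squares sufficient statistics.
XtX = np.zeros((p, p))
Xty = np.zeros(p)

# Simulate a data stream arriving in blocks; only one block is in memory at a time.
for _ in range(200):
    X_block = rng.normal(size=(1000, p))
    y_block = X_block @ beta_true + rng.normal(size=1000)
    XtX += X_block.T @ X_block   # update X'X
    Xty += X_block.T @ y_block   # update X'y

# The streaming estimate equals the full-data least-squares estimate.
beta_hat = np.linalg.solve(XtX, Xty)
print(beta_hat)
```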


Journal ArticleDOI
TL;DR: In this paper, the authors proposed an automatic selection of the bandwidth of the semi-recursive kernel estimators of a regression function defined by the stochastic approximation algorithm, and showed that, using the selected bandwidth and some special stepsizes, the proposed semi-recursive estimators are very competitive with the nonrecursive estimators in terms of estimation error and much better in terms of computational cost.
Abstract: In this paper we propose an automatic selection of the bandwidth of the semi-recursive kernel estimators of a regression function defined by the stochastic approximation algorithm. We show that, using the selected bandwidth and some special stepsizes, the proposed semi-recursive estimators are very competitive with the nonrecursive ones in terms of estimation error and much better in terms of computational cost. We corroborate these theoretical results through a simulation study and a real data set.

27 citations
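
As a hedged sketch of the recursive idea (a Devroye-Wagner-type recursive kernel regression estimator, related to but not necessarily identical to the paper's semi-recursive estimator or its bandwidth selector), the estimate at a fixed point x0 is updated with O(1) work per new observation, using a per-observation bandwidth; the bandwidth sequence and simulated data below are assumptions.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(1)
x0 = 0.5               # point at which the regression function is estimated
num, den = 0.0, 0.0    # running numerator and denominator

for n in range(1, 5001):
    X = rng.uniform(0, 1)
    Y = np.sin(2 * np.pi * X) + 0.3 * rng.normal()
    h_n = n ** (-1 / 5)                       # assumed bandwidth sequence h_n = n^(-1/5)
    w = gaussian_kernel((x0 - X) / h_n) / h_n
    num += w * Y                              # recursive update: past data never revisited
    den += w

m_hat = num / den                             # estimate of E[Y | X = x0]
print(m_hat, np.sin(2 * np.pi * x0))
```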


Journal ArticleDOI
TL;DR: In this paper, the authors explored the distributional theory and corresponding properties of the zero-and-one-inflated Poisson (ZOIP) distribution and developed likelihood-based inference methods for parameters of interest.
Abstract: To model count data with excess zeros and excess ones, in their unpublished manuscript, Melkersson and Olsson (1999) extended the zero-inflated Poisson distribution to a zero-and-one-inflated Poisson (ZOIP) distribution. However, the distributional theory and corresponding properties of the ZOIP have not yet been explored, and likelihood-based inference methods for parameters of interest were not well developed. In this paper, we extensively study the ZOIP distribution by first constructing five equivalent stochastic representations for the ZOIP random variable and then deriving other important distributional properties. Maximum likelihood estimates of parameters are obtained by both the Fisher scoring and expectation-maximization algorithms. Bootstrap confidence intervals for parameters of interest and testing hypotheses under large sample sizes are provided. Simulation studies are performed and five real data sets are used to illustrate the proposed methods.

21 citations
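
For concreteness, a minimal sketch of the ZOIP probability mass function and of random generation through its three-component mixture representation (one generic form of the stochastic representations the paper studies); the parameter values are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import poisson

def zoip_pmf(k, p0, p1, lam):
    """P(X = k) for a zero-and-one-inflated Poisson with inflation probabilities p0, p1."""
    return (1 - p0 - p1) * poisson.pmf(k, lam) + p0 * (k == 0) + p1 * (k == 1)

def zoip_rvs(n, p0, p1, lam, rng):
    """Draw from the ZOIP via its mixture representation."""
    comp = rng.choice(3, size=n, p=[p0, p1, 1 - p0 - p1])
    return np.where(comp == 0, 0, np.where(comp == 1, 1, rng.poisson(lam, size=n)))

# Arbitrary example parameters: the pmf sums to (nearly) one and matches simulation.
p0, p1, lam = 0.2, 0.15, 2.5
ks = np.arange(0, 30)
print(zoip_pmf(ks, p0, p1, lam).sum())                      # ~1
x = zoip_rvs(100_000, p0, p1, lam, np.random.default_rng(2))
print(np.mean(x == 0), zoip_pmf(0, p0, p1, lam))            # empirical vs. theoretical P(X=0)
```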


Journal ArticleDOI
TL;DR: In this article, the authors proposed a new algorithm for fitting SSANOVA models to super-large sample data, which can fit nonparametric regression models to very large samples within a few seconds using a standard laptop or tablet computer.
Abstract: In the current era of big data, researchers routinely collect and analyze data of super-large sample sizes. Data-oriented statistical methods have been developed to extract information from super-large data. Smoothing spline ANOVA (SSANOVA) is a promising approach for extracting information from noisy data; however, the heavy computational cost of SSANOVA hinders its wide application. In this paper, we propose a new algorithm for fitting SSANOVA models to super-large sample data. In this algorithm, we introduce rounding parameters to make the computation scalable. To demonstrate the benefits of the rounding parameters, we present a simulation study and a real data example using electroencephalography data. Our results reveal that (using the rounding parameters) a researcher can fit nonparametric regression models to very large samples within a few seconds using a standard laptop or tablet computer.

13 citations
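
The computational gain from rounding can be sketched generically (an illustration of the idea, not the authors' SSANOVA code): rounding a covariate to a grid with spacing delta collapses the raw data to a much smaller set of unique design points with frequency weights, after which any smoother that accepts weights operates on the reduced data. The grid spacing below is an assumed rounding parameter.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.5, size=n)

delta = 0.01                              # assumed rounding parameter
x_round = np.round(x / delta) * delta     # snap each covariate value to the grid

# Collapse to unique design points with frequency weights and mean responses.
xu, inverse, counts = np.unique(x_round, return_inverse=True, return_counts=True)
y_bar = np.bincount(inverse, weights=y) / counts

print(n, "raw observations ->", xu.size, "weighted design points")
# A weighted smoother (e.g., a weighted smoothing spline) can now be fit to
# (xu, y_bar) with weights `counts`, at a cost driven by xu.size rather than n.
```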


Journal ArticleDOI
TL;DR: The limiting distribution of the monitoring statistic under the stationary long memory null hypothesis is derived, the proposed monitoring scheme is shown to be consistent against a change from stationarity to nonstationarity, and a sieve bootstrap approximation method is proposed.
Abstract: This paper adopts a moving ratio statistic to monitor persistence change in a long memory process. The limiting distribution of the monitoring statistic under the stationary long memory null hypothesis is derived. We show that the proposed monitoring scheme is consistent against a change from stationarity to nonstationarity. In particular, a sieve bootstrap approximation method is proposed. The sieve bootstrap method is used to determine the critical values for the null distribution of the monitoring statistic, which depends on the unknown long memory parameter. The empirical size, power and average run length of the proposed monitoring procedure are evaluated in a simulation study. Simulations indicate that the new monitoring procedure performs well in finite samples. Finally, we illustrate our monitoring procedure using a set of foreign exchange rate data.

13 citations
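
The sieve bootstrap step can be illustrated with a hedged, generic AR-sieve resampler (not the paper's exact implementation): fit an autoregression of order p to the observed series, then rebuild bootstrap series by recursing the fitted AR on resampled, centred residuals. The AR order, burn-in length and example series below are assumptions.

```python
import numpy as np

def ar_sieve_bootstrap(x, p, n_boot, rng):
    """Generate bootstrap replicates of a series via an AR(p) sieve."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    # Least-squares fit of AR(p): xc[t] = phi_1*xc[t-1] + ... + phi_p*xc[t-p] + e[t].
    Z = np.column_stack([xc[p - j - 1 : n - j - 1] for j in range(p)])
    phi = np.linalg.lstsq(Z, xc[p:], rcond=None)[0]
    resid = xc[p:] - Z @ phi
    resid -= resid.mean()
    reps = []
    for _ in range(n_boot):
        e = rng.choice(resid, size=n + 50, replace=True)   # 50 burn-in steps (assumed)
        xb = np.zeros(n + 50)
        for t in range(p, n + 50):
            xb[t] = phi @ xb[t - p : t][::-1] + e[t]       # recurse the fitted AR
        reps.append(xb[50:] + x.mean())
    return np.array(reps)

rng = np.random.default_rng(4)
x = np.zeros(500)                   # assumed example series: AR(1) with coefficient 0.6
for t in range(1, 500):
    x[t] = 0.6 * x[t - 1] + rng.normal()
boot = ar_sieve_bootstrap(x, p=5, n_boot=200, rng=rng)
print(boot.shape)
# Bootstrap critical values are obtained by evaluating the monitoring statistic on `boot`.
```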


Journal ArticleDOI
TL;DR: In this paper, the authors established a link between the censored nonlinear regression model and a recently studied class of symmetric distributions, which extends the normal one by the inclusion of kurtosis, called scale mixtures of normal (SMN) distributions.
Abstract: In the framework of censored nonlinear regression models, the random errors are routinely assumed to have a normal distribution, mainly for mathematical convenience. However, this approach has been criticized in the literature due to its sensitivity to deviations from the normality assumption. In practice, data such as income or viral load in AIDS studies often violate this assumption because of heavy tails. In this paper, we establish a link between the censored nonlinear regression model and a recently studied class of symmetric distributions, which extends the normal one by the inclusion of kurtosis, called scale mixtures of normal (SMN) distributions. The Student-t, Pearson type VII, slash and contaminated normal distributions, among others, are contained in this class. Choosing a member of this class can be a good alternative for modeling this kind of data, because its flexibility has been demonstrated in several applications. We develop an analytically simple and efficient EM-type algorithm for iteratively computing maximum likelihood estimates of the model parameters together with standard errors as a byproduct. The algorithm uses convenient expressions at the E-step, relying on formulae for the mean and variance of truncated SMN distributions. The usefulness of the proposed methodology is illustrated through applications to simulated and real data.

11 citations
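
As a hedged, deliberately simplified illustration of the EM machinery (plain normal errors and a linear mean rather than the paper's nonlinear model with SMN errors), the E-step below replaces each right-censored response by the conditional first and second moments of a truncated normal, and the M-step is an ordinary least-squares update followed by a variance update. The censoring point and data-generating setup are assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n, p = 2000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -1.0])
sigma_true = 1.5
y_star = X @ beta_true + sigma_true * rng.normal(size=n)
c = 2.0                               # assumed fixed right-censoring point
y = np.minimum(y_star, c)
cens = y_star > c                     # censoring indicators

beta, sigma2 = np.zeros(p), 1.0
for _ in range(200):
    mu = X @ beta
    # E-step: truncated-normal moments of y* given y* > c for censored observations.
    alpha = (c - mu[cens]) / np.sqrt(sigma2)
    lam = norm.pdf(alpha) / norm.sf(alpha)                 # inverse Mills ratio
    ey, ey2 = y.copy(), y**2
    ey[cens] = mu[cens] + np.sqrt(sigma2) * lam
    ey2[cens] = mu[cens]**2 + sigma2 + np.sqrt(sigma2) * (mu[cens] + c) * lam
    # M-step: regression on the imputed responses, then variance update.
    beta = np.linalg.solve(X.T @ X, X.T @ ey)
    fit = X @ beta
    sigma2 = np.mean(ey2 - 2 * ey * fit + fit**2)

print(beta, np.sqrt(sigma2))          # should be close to beta_true and sigma_true
```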



Journal ArticleDOI
TL;DR: Modifications are presented to efficiently identify subgroups of subjects who respond more favorably to one treatment than another based on their baseline characteristics and a measure for assessment of the predictive performance of the constructed tree is proposed.
Abstract: The tree-based methodology has been widely applied to identify predictors of health outcomes in medical studies. However, the classical tree-based approaches do not pay particular attention to treatment assignment and thus do not consider prediction in the context of treatment received. In recent years, attention has been shifting from average treatment effects to identifying moderators of treatment response, and tree-based approaches to identify subgroups of subjects with enhanced treatment responses are emerging. In this study, we extend and present modifications to one of these approaches (Zhang et al., 2010 [29]) to efficiently identify subgroups of subjects who respond more favorably to one treatment than another based on their baseline characteristics. We extend the algorithm by incorporating an automatic pruning step and propose a measure for assessment of the predictive performance of the constructed tree. We evaluate the proposed method through a simulation study and illustrate the approach using a data set from a clinical trial of treatments for alcohol dependence. This simple and efficient statistical tool can be used for developing algorithms for clinical decision making and personalized treatment for patients based on their characteristics.

10 citations


Journal ArticleDOI
TL;DR: In this paper, weighted estimating equations are proposed for simultaneous estimation of the regression parameters and the transformation function, and the resulting regression estimators are shown to be asymptotically normal with a closed form of variance-covariance matrix.
Abstract: Case-cohort designs provide a cost-effective way to conduct large cohort studies. Semiparametric transformation models, which include the proportional hazards model and the proportional odds model as special cases, are considered here for length-biased right-censored data under the case-cohort design. Weighted estimating equations, which can be used even when the censoring variables are dependent on the covariates, are proposed for simultaneous estimation of the regression parameters and the transformation function. The resulting regression estimators are shown to be asymptotically normal, with a closed form of the variance-covariance matrix that can be estimated by the plug-in method. Simulation studies show that the proposed approach performs well for practical use. An application to the Oscar data is also given to illustrate the methodology.

9 citations




Journal ArticleDOI
TL;DR: A new SVD algorithm based on the split-and-merge strategy is proposed; it possesses an embarrassingly parallel structure, can be efficiently implemented on a distributed or multicore machine, and is particularly suitable for big data problems.
Abstract: We propose a new SVD algorithm based on the split-and-merge strategy, which possesses an embarrassingly parallel structure and thus can be efficiently implemented on a distributed or multicore machine. The new algorithm can also be implemented in serial for online eigen-analysis. The new algorithm is particularly suitable for big data problems: its embarrassingly parallel structure renders it usable for feature screening, a task that has been beyond the reach of existing parallel SVD algorithms.
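
A hedged sketch of a generic row-block split-and-merge scheme (not necessarily the authors' exact algorithm): take a thin SVD of each row block, keep only the small matrix S_i V_i^T per block, and merge with one more small SVD. Because X^T X = sum_i V_i S_i^2 V_i^T, the merge step recovers the singular values and right singular vectors of the full matrix, and the block step is embarrassingly parallel.

```python
import numpy as np

def split_merge_svd(X, n_blocks):
    """Singular values and right singular vectors of X via row-block split-and-merge."""
    blocks = np.array_split(X, n_blocks, axis=0)
    reduced = []
    for Xi in blocks:                        # each block could run on a separate worker
        _, Si, Vti = np.linalg.svd(Xi, full_matrices=False)
        reduced.append(Si[:, None] * Vti)    # keep only S_i V_i^T (small: rank x p)
    B = np.vstack(reduced)                   # merge the reduced representations
    _, S, Vt = np.linalg.svd(B, full_matrices=False)
    return S, Vt

rng = np.random.default_rng(6)
X = rng.normal(size=(10_000, 20))
S_sm, Vt_sm = split_merge_svd(X, n_blocks=8)
S_full = np.linalg.svd(X, compute_uv=False)
print(np.max(np.abs(S_sm - S_full)))         # agrees with the full SVD up to rounding error
```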

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a generative gradient for pre-training CNNs by a nonparametric importance sampling scheme, which is fundamentally different from the commonly used discriminative gradient, and yet has the same computational architecture and cost as the latter.
Abstract: The convolutional neural networks (CNNs) have proven to be a powerful tool for discriminative learning. Recently researchers have also started to show interest in the generative aspects of CNNs in order to gain a deeper understanding of what they have learned and how to further improve them. This paper investigates generative modeling of CNNs. The main contributions include: (1) We construct a generative model for the CNN in the form of exponential tilting of a reference distribution. (2) We propose a generative gradient for pre-training CNNs by a non-parametric importance sampling scheme, which is fundamentally different from the commonly used discriminative gradient, and yet has the same computational architecture and cost as the latter. (3) We propose a generative visualization method for the CNNs by sampling from an explicit parametric image distribution. The proposed visualization method can directly draw synthetic samples for any given node in a trained CNN by the Hamiltonian Monte Carlo (HMC) algorithm, without resorting to any extra hold-out images. Experiments on the challenging ImageNet benchmark show that the proposed generative gradient pre-training consistently helps improve the performances of CNNs, and the proposed generative visualization method generates meaningful and varied samples of synthetic images from a large-scale deep CNN.


Journal ArticleDOI
TL;DR: In this paper, a smoothed estimator for the generalized Lorenz curve is proposed and the smoothed jackknife empirical likelihood method is used to construct confidence intervals for the generalized Lorenz curves.
Abstract: The Lorenz curve is one of the most commonly used devices for describing the inequality of income distributions. The generalized Lorenz curve is the Lorenz curve scaled by the mean of the income distribution and is itself an interesting object of study. In this paper, we define a smoothed estimator for the generalized Lorenz curve and propose a smoothed jackknife empirical likelihood method to construct confidence intervals for the generalized Lorenz curve. It is shown that Wilks’ theorem still holds for the smoothed jackknife empirical likelihood. Extensive simulation studies are conducted to compare the finite sample performances of the proposed methods with other methods based on simple random samples. Finally, the proposed methods are illustrated with a real example.
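
For concreteness, a hedged sketch of the unsmoothed empirical generalized Lorenz ordinate, GL(p) = (1/n) times the sum of the np smallest incomes (with a fractional correction); the paper's smoothing and jackknife empirical likelihood steps are not reproduced here, and the income sample is an assumption.

```python
import numpy as np

def generalized_lorenz(x, p):
    """Empirical generalized Lorenz ordinate GL(p) = (1/n) * sum of the np smallest x's."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    k = int(np.floor(n * p))
    frac = n * p - k                          # fractional part when n*p is not an integer
    partial = x[:k].sum() + (frac * x[k] if k < n else 0.0)
    return partial / n

rng = np.random.default_rng(7)
income = rng.lognormal(mean=10, sigma=0.8, size=5000)   # assumed income sample
for p in (0.25, 0.5, 0.75, 1.0):
    print(p, generalized_lorenz(income, p))
# GL(1) equals the sample mean; GL(p)/GL(1) is the ordinary Lorenz curve.
```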

Journal ArticleDOI
TL;DR: In this paper, the authors evaluate six Bayesian model assessment criteria: the Akaike information criterion (AIC) (Akaike, 1973), BIC (Schwarz, 1978), the integrated classification likelihood criterion (ICL) (Biernacki et al., 1998), the deviance information criterion (DIC), the logarithm of the pseudomarginal likelihood (LPML) (Geisser and Eddy, 1979) and the widely applicable information criterion (WAIC), and report that all six criteria suffer from difficulty in separating deeply overlapping latent features.
Abstract: In joint-modeling analyses that simultaneously consider a set of longitudinal predictors and a primary outcome, the two most frequently used response versus longitudinal-trajectory models utilize latent class (LC) and multiple shared random effects (MSRE) predictors. In practice, it is common to use one model assessment criterion to justify the use of the model. How different criteria perform under the joint longitudinal predictor-scalar outcome model is less understood. In this paper, we evaluate six Bayesian model assessment criteria: the Akaike information criterion (AIC) (Akaike, 1973), the Bayesian information criterion (BIC) (Schwarz, 1978), the integrated classification likelihood criterion (ICL) (Biernacki et al., 1998), the deviance information criterion (DIC) (Spiegelhalter et al., 2002), the logarithm of the pseudomarginal likelihood (LPML) (Geisser and Eddy, 1979) and the widely applicable information criterion (WAIC) (Watanabe, 2010). When needed, the criteria are modified, following the Bayesian principle, to accommodate the joint modeling framework that analyzes longitudinal predictors and binary health outcome data. We report our evaluation based on empirical numerical studies, exploring the relationships and similarities among these criteria. We focus on two evaluation aspects: goodness-of-fit adjusted for the complexity of the models, mostly reflected by the numbers of latent features/classes in the longitudinal trajectories that are part of the hierarchical structure in the joint models, and prediction evaluation based on both training and test samples as well as their contrasts. Our results indicate that all six criteria suffer from difficulty in separating deeply overlapping latent features, with AIC, BIC, ICL and WAIC outperforming the others in terms of correctly identifying the number of latent classes. With respect to prediction, DIC, WAIC and LPML tend to choose the models with too many latent classes, leading to better predictive performance on independent validation samples than the models chosen by the other criteria. An interesting result concerning the wrong model choice is also reported. Finally, we use the results from the simulation study to identify the suitable candidate models to link the useful features in the follicle
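
Of the six criteria compared above, WAIC has a simple generic form that is easy to state concretely. The hedged sketch below (not tied to the joint models in the paper) computes WAIC from an S-by-n matrix of pointwise log-likelihoods evaluated at S posterior draws; the toy normal-mean model is an assumption.

```python
import numpy as np
from scipy.stats import norm

def waic(loglik):
    """WAIC from an (S draws x n observations) matrix of pointwise log-likelihoods."""
    # Log pointwise predictive density via a numerically stable log-mean-exp over draws.
    m = loglik.max(axis=0)
    lppd = np.sum(m + np.log(np.mean(np.exp(loglik - m), axis=0)))
    p_waic = np.sum(loglik.var(axis=0, ddof=1))    # effective number of parameters
    return -2 * (lppd - p_waic)

# Toy check: posterior draws for the mean of y ~ N(mu, 1) under a flat prior.
rng = np.random.default_rng(8)
y = rng.normal(0.3, 1.0, size=100)
mu_draws = rng.normal(y.mean(), 1 / np.sqrt(len(y)), size=2000)
loglik = norm.logpdf(y[None, :], loc=mu_draws[:, None], scale=1.0)
print(waic(loglik))
```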

Journal ArticleDOI
TL;DR: A novel approach based on a method of optimal-partitioning (clustering) of functional data to identify prototypical outcome profiles that are distinct from one or the other treatment and outcome profiles common to the different treatments is developed.
Abstract: Understanding heterogeneity in phenotypical characteristics, symptoms manifestations and response to treatment of subjects with psychiatric illnesses is a continuing challenge in mental health research. A long-standing goal of medical studies is to identify groups of subjects characterized with a particular trait or quality and to distinguish them from other subjects in a clinically relevant way. This paper develops and illustrates a novel approach to this problem based on a method of optimal-partitioning (clustering) of functional data. The proposed method allows for the simultaneous clustering of different populations (e.g., symptoms of drug and placebo treated patients) in order to identify prototypical outcome profiles that are distinct from one or the other treatment and outcome profiles common to the different treatments. The clustering results are used to discover potential treatment effect modifiers (i.e., moderators), in particular, moderators of specific drug effects and placebo response. A depression clinical trial is used to illustrate the method.

Journal ArticleDOI
TL;DR: A corrected proof of Theorem 6 is provided, with further clarifications of the assumptions on the true latent parameter θ0,i.
Abstract: We have found a mistake in the proof of Theorem 6 in our published paper “Optimal False Discovery Rate Control for Dependent Data” [4]. We apologize to the readers and thank Professor Jens Ledet Jensen at Aarhus University for his question, which led to the identification of this mistake. We provide here a corrected proof of Theorem 6 with further clarifications of the assumptions. In the GWAS setting that we consider in the paper, the Xi’s are often Z-scores with Var(Xi) = 1 for very large sample sizes. We assume that σii = 1. We define the true latent parameter θ0,i as follows: if θ0,i = 0, Xi ∼ N(0, 1); and if θ0,i = 1, Xi ∼ N(μi, 1). We also denote the working latent parameter by θi, which is used to define the likelihood ratios f(Xi | θi = 1)/f(Xi | θi = 0) and f(X | θi = 1)/f(X | θi = 0).


Journal ArticleDOI
TL;DR: In this article, an iterative subsampling approach is proposed to improve the computational efficiency of the solution path clustering (SPC) method, which achieves clustering by concave regularization on the pairwise distances between cluster centers.
Abstract: We develop an iterative subsampling approach to improve the computational efficiency of our previous work on solution path clustering (SPC). The SPC method achieves clustering by concave regularization on the pairwise distances between cluster centers. This clustering method has the important capability to recognize noise and to provide a short path of clustering solutions; however, it is not sufficiently fast for big datasets. Thus, we propose a method that iterates between clustering a small subsample of the full data and sequentially assigning the other data points to attain orders of magnitude of computational savings. The new method preserves the ability to isolate noise, includes a solution selection mechanism that ultimately provides one clustering solution with an estimated number of clusters, and is shown to be able to extract small tight clusters from noisy data. The method's relatively minor losses in accuracy are demonstrated through simulation studies, and its ability to handle large datasets is illustrated through applications to gene expression datasets. An R package, SPClustering, for the SPC method with iterative subsampling is available at this http URL
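
The iterate-between-subsample-and-assign idea can be sketched generically, using k-means as a stand-in for the SPC step (which actually clusters by concave regularization of pairwise center distances and handles noise): cluster a small random subsample, then assign the remaining points to the nearest learned center, avoiding any computation quadratic in the full sample size. The subsample size, number of clusters and data below are assumptions.

```python
import numpy as np

def subsample_then_assign(X, n_sub, k, rng, n_iter=50):
    """Cluster a subsample (k-means stand-in), then assign all points to the nearest center."""
    idx = rng.choice(X.shape[0], size=n_sub, replace=False)
    S = X[idx]
    centers = S[rng.choice(n_sub, size=k, replace=False)]          # initial centers
    for _ in range(n_iter):                                        # plain Lloyd iterations
        d = ((S[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        lab = d.argmin(axis=1)
        centers = np.array([S[lab == j].mean(axis=0) if np.any(lab == j) else centers[j]
                            for j in range(k)])
    # Sequentially assign the full data set to the centers learned on the subsample.
    d_full = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d_full.argmin(axis=1), centers

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(loc=m, size=(20_000, 2)) for m in (0, 5, 10)])
labels, centers = subsample_then_assign(X, n_sub=1000, k=3, rng=rng)
print(centers)
```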

Journal ArticleDOI
TL;DR: A set of tools, including leverage scores and generalized information scores, is proposed to perform model diagnostics and outlier detection in large-scale reduced-rank estimation; the information score approach provides a principled way of combining the residuals and leverage scores for anomaly detection.
Abstract: Reduced-rank methods are very popular in high-dimensional multivariate analysis for conducting simultaneous dimension reduction and model estimation. However, the commonly-used reduced-rank methods are not robust, as the underlying reduced-rank structure can be easily distorted by only a few data outliers. Anomalies are bound to exist in big data problems, and in some applications they themselves could be of the primary interest. While naive residual analysis is often inadequate for outlier detection due to potential masking and swamping, robust reduced-rank estimation approaches could be computationally demanding. Under Stein's unbiased risk estimation framework, we propose a set of tools, including leverage score and generalized information score, to perform model diagnostics and outlier detection in large-scale reduced-rank estimation. The leverage scores give an exact decomposition of the so-called model degrees of freedom to the observation level, which lead to exact decomposition of many commonly-used information criteria; the resulting quantities are thus named information scores of the observations. The proposed information score approach provides a principled way of combining the residuals and leverage scores for anomaly detection. Simulation studies confirm that the proposed diagnostic tools work well. A pattern recognition example with handwritten digit images and a time series analysis example with monthly U.S. macroeconomic data further demonstrate the efficacy of the proposed approaches.


Journal ArticleDOI
TL;DR: In this paper, the authors consider an algorithm that combines “divide and conquer” (D&C) ideas previously used to design MCMC algorithms for big data with a sequential MCMC strategy.
Abstract: Bayesian computation crucially relies on Markov chain Monte Carlo (MCMC) algorithms. In the case of massive data sets, running the Metropolis-Hastings sampler to draw from the posterior distribution becomes prohibitive due to the large number of likelihood terms that need to be calculated at each iteration. In order to perform Bayesian inference for a large set of time series, we consider an algorithm that combines “divide and conquer” ideas previously used to design MCMC algorithms for big data with a sequential MCMC strategy. The performance of the method is illustrated using a large set of financial data.

Journal ArticleDOI
TL;DR: This work describes a two-stage design for a single-arm phase II trial where the primary objective is to test the rate of tumor response, defined as complete plus partial response, and the secondary objective is to estimate the rate of disease control, defined as tumor response plus stable disease.
Abstract: Many oncology phase II trials are single arm studies designed to screen novel treatments based on efficacy outcome. Efficacy is often assessed as an ordinal variable based on a level of response of solid tumors with four categories: complete response, partial response, stable disease and progression. We describe a two-stage design for a single-arm phase II trial where the primary objective is to test the rate of tumor response defined as complete plus partial response, and the secondary objective is to estimate the rate of disease control defined as tumor response plus stable disease. Since the goal is to estimate the disease control rate, the trial is not stopped for futility after the first stage if the disease control rate is promising. The new design can be generated using easy-to-use software that is available at http://cancer.unc.edu/biostatistics/program/ivanova/.

Journal ArticleDOI
Yuexiao Dong
TL;DR: In this paper, the two main groups of moment-based sufficient dimension reduction methods, the estimators for the central space and the estimators for the central mean space, are reviewed, and unified frameworks are provided for each group of estimators.
Abstract: The two main groups of moment-based sufficient dimension reduction methods are the estimators for the central space and the estimators for the central mean space. The former group includes methods such as sliced inverse regression, sliced average variance estimation and sliced average third-moment estimation, while ordinary least squares and principal Hessian directions belong to the latter group. We provide unified frameworks for each group of estimators in this short note. The central space estimators can be unified as inverse conditional cumulants, while Stein’s Lemma is used to motivate the central mean space estimators.
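
As an illustration of the first group (central space estimators), here is a hedged sketch of sliced inverse regression: standardize X, slice the data on y, and eigen-decompose the weighted covariance of the slice means of the standardized predictors; the leading eigenvectors, mapped back to the original scale, estimate directions in the central space. The number of slices and the toy model are assumptions.

```python
import numpy as np

def sir(X, y, n_slices=10, n_dir=1):
    """Sliced inverse regression directions for the central (sufficient) subspace."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    W = np.linalg.inv(np.linalg.cholesky(Sigma)).T    # whitening matrix: W' Sigma W = I
    Z = (X - mu) @ W
    order = np.argsort(y)                             # slice the data on y
    M = np.zeros((p, p))
    for s in np.array_split(order, n_slices):
        zbar = Z[s].mean(axis=0)
        M += (len(s) / n) * np.outer(zbar, zbar)      # weighted covariance of slice means
    vals, vecs = np.linalg.eigh(M)
    dirs = W @ vecs[:, ::-1][:, :n_dir]               # back-transform leading eigenvectors
    return dirs / np.linalg.norm(dirs, axis=0)

rng = np.random.default_rng(10)
X = rng.normal(size=(5000, 6))
beta = np.array([1.0, -1.0, 0, 0, 0, 0]) / np.sqrt(2)
y = (X @ beta) ** 3 + 0.5 * rng.normal(size=5000)
print(sir(X, y).ravel())                              # close to +/- beta
```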

Journal ArticleDOI
TL;DR: Cover is modified by incorporating Huber’s loss function into the estimation procedure, yielding RCover, and it is shown that, in the presence of outliers, RCover almost always outperforms the other methods tested.
Abstract: This paper considers the problem of robust covariance estimation in the so-called “large p small n” setting. Its first contribution is the proposal of a novel (non-robust) high-dimensional covariance estimation method that is based on eigenvalue regularization. The method is called Cover, short for COVariance Eigenvalue-Regularized estimation. It is fast to execute and enjoys excellent theoretical properties for the case when p is fixed. As a second contribution, this paper modifies Cover by incorporating Huber’s loss function into the estimation procedure. By design, the resulting method is robust to outliers and is called RCover. The empirical performances of Cover and RCover are tested and compared with those of existing methods via a sequence of numerical experiments. It is shown that, in the presence of outliers, RCover almost always outperforms the other methods tested.

Journal ArticleDOI
TL;DR: A Gaussian copula model with discrete margins for modeling multivariate binary responses is proposed that separates marginal effects from between-trait correlations and provides further insight into genetic association studies of multivariate traits.
Abstract: The need for analysis of multiple responses arises from many applications. In behavioral science, for example, comorbidity is a common phenomenon where multiple disorders occur in the same person. The advantage of jointly analyzing multiple correlated responses has been examined and documented. Due to the difficulties of modeling multiple responses, nonparametric tests such as generalized Kendall's Tau have been developed to assess the association between multiple responses and risk factors. These procedures have been applied to genomewide association studies of multiple complex traits. Unfortunately, those nonparametric tests only provide the significance of the association but not the magnitude. We propose a Gaussian copula model with discrete margins for modeling multivariate binary responses. This model separates marginal effects from between-trait correlations. We use a bootstrapping margins approach to constructing Wald's statistic for the association test. Although our derivation is based on the fully parametric Gaussian copula framework for simplicity, the underlying assumptions to apply our method can be weakened. The bootstrapping margins approach only requires the correct specification of the model margins. Our simulation and real data analysis demonstrate that our proposed method not only increases power over some existing association tests, but also provides further insight into genetic association studies of multivariate traits.
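
A hedged sketch of the core construction (simulation only, not the bootstrapping-margins test): correlated binary responses arise by thresholding a latent multivariate normal vector with correlation matrix R at the normal quantiles of the marginal probabilities. The marginal prevalences and latent correlation matrix below are assumptions.

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_binary(n, marg_probs, R, rng):
    """Simulate multivariate binary responses from a Gaussian copula with given margins."""
    L = np.linalg.cholesky(R)
    Z = rng.normal(size=(n, len(marg_probs))) @ L.T     # latent N(0, R) vectors
    return (Z <= norm.ppf(marg_probs)).astype(int)      # threshold at marginal quantiles

rng = np.random.default_rng(11)
marg_probs = np.array([0.3, 0.5, 0.7])                  # assumed marginal prevalences
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.4],
              [0.3, 0.4, 1.0]])                         # assumed latent correlation matrix
Y = gaussian_copula_binary(200_000, marg_probs, R, rng)
print(Y.mean(axis=0))                    # close to marg_probs: margins are preserved
print(np.corrcoef(Y, rowvar=False))      # between-trait correlations induced by R
```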

Journal ArticleDOI
TL;DR: In this paper, the mean time between failures (MTBF) in nonhomogeneous Poisson processes (NHPP) with power law intensity function for complete/incomplete observations was investigated.
Abstract: Motivated by four unsolved issues on the mean time between failures (MTBFs) in nonhomogeneous Poisson processes (NHPP) with power law intensity function for complete/incomplete observations, in this article, we first study some important properties on three new distributions (i.e., the G, inverse G, and RG distributions). Next, we develop three methods (i.e., the Lagrange multiplier, quantile-based and sampling-based methods) to establish the shortest confidence intervals for the MTBF in a single repairable system and for the MTBF ratio in two independent repairable systems; and also develop two methods (i.e., the density-based and sampling-based methods) within the framework of the critical region and p-value approaches to test hypotheses on the MTBF and the MTBF ratio. Simulation studies are performed to compare the proposed methods. Two real data sets are used to illustrate the proposed statistical methods.
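
To fix ideas, a hedged sketch of the standard closed-form MLEs for a power-law NHPP observed on (0, T] (time-truncated data) with mean function m(t) = lambda * t^beta; the instantaneous MTBF at T is the reciprocal of the fitted intensity. This only illustrates the quantities for which the paper builds shortest confidence intervals and tests, not those constructions themselves; the simulation settings are assumptions.

```python
import numpy as np

def power_law_nhpp_mle(times, T):
    """MLEs for an NHPP with mean function m(t) = lambda * t**beta, observed on (0, T]."""
    times = np.asarray(times, dtype=float)
    n = times.size
    beta_hat = n / np.sum(np.log(T / times))
    lam_hat = n / T**beta_hat
    intensity_T = lam_hat * beta_hat * T**(beta_hat - 1)
    return beta_hat, lam_hat, 1.0 / intensity_T      # instantaneous MTBF at time T

# Simulate from a power-law NHPP: given N(T) = n, the event times are distributed
# as ordered draws of T * U**(1/beta) with U uniform on (0, 1).
rng = np.random.default_rng(12)
lam_true, beta_true, T = 0.5, 1.4, 100.0
n = rng.poisson(lam_true * T**beta_true)
times = np.sort(T * rng.uniform(size=n) ** (1 / beta_true))
print(power_law_nhpp_mle(times, T))
```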

Journal ArticleDOI
TL;DR: This paper proposes using the “divide-and-conquer” strategy to construct a computationally more succinct surrogate estimating equation and shows that the method significantly reduces the computational time while enjoying the same asymptotic behavior as the original estimation method.
Abstract: Big data problems present great challenges to statistical analyses, especially from the computational side. In this paper, we consider regression estimation of high-order moments in big data problems based on the U-statistic-based functional regression model (U-FRM). The U-FRM model is a nonparametric method that allows direct estimation of higher-order moments without imposing parametric assumptions on them. Despite this modeling advantage, its estimation relies on a U-statistics-based estimating equation whose computational complexity is generally too high for big data. In this paper, we propose using the “divide-and-conquer” strategy to construct a computationally more succinct surrogate estimating equation. Through both theoretical proof and simulations, we show that our method significantly reduces the computational time while enjoying the same asymptotic behavior as the original estimation method. We then apply our method to a genomic problem to illustrate its performance on real data.
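
The generic divide-and-conquer idea for a U-statistic can be sketched as follows (an illustration of the computational strategy only, not the paper's surrogate estimating equation for the U-FRM model): compute the U-statistic within each block and average, reducing the O(n^2) pair count to a sum of much smaller within-block counts. The kernel here, h(x1, x2) = (x1 - x2)^2 / 2, gives the sample variance as its full-data U-statistic.

```python
import numpy as np
from itertools import combinations

def u_stat(x, h):
    """Degree-2 U-statistic with kernel h, averaging over all pairs."""
    return np.mean([h(x[i], x[j]) for i, j in combinations(range(len(x)), 2)])

def u_stat_dac(x, h, n_blocks, rng):
    """Divide-and-conquer surrogate: average of within-block U-statistics."""
    blocks = np.array_split(rng.permutation(len(x)), n_blocks)
    return np.mean([u_stat(x[b], h) for b in blocks])

h = lambda a, b: 0.5 * (a - b) ** 2      # kernel whose U-statistic is the sample variance

rng = np.random.default_rng(13)
x = rng.normal(scale=2.0, size=4000)
print(np.var(x, ddof=1))                           # full-data target
print(u_stat_dac(x, h, n_blocks=40, rng=rng))      # close, at a fraction of the pair count
```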

Journal ArticleDOI
TL;DR: In this paper, the authors propose a click fraud detection method that estimates the posterior malicious probability of each visitor, assuming each visitor carries a latent indicator that labels him/her as a regular or malicious user.
Abstract: This paper is concerned with the problem of click fraud detection. We assume each visitor of a website carries a latent indicator, which labels him/her as a regular or malicious user. Information such as the number of clicks, the number of page views (PVs) and the time differences between consecutive clicks is incorporated into our newly proposed statistical model. We allow those random variables to share the same distribution but with different parameters according to the visitor's type. An EM algorithm is then suggested to obtain the maximum likelihood estimator. As a result, click fraud detection can be implemented by estimating the posterior malicious probability of each visitor. Simulation studies are conducted to assess the finite sample performance. We also demonstrate the usefulness of the proposed method via an empirical analysis of a real-life example on search-engine marketing.
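
A hedged, simplified sketch of the latent-indicator idea using click counts only (the paper's model also incorporates page views and inter-click times): fit a two-component Poisson mixture by EM and read each visitor's posterior malicious probability off the E-step responsibilities. All parameter values and the simulated data below are assumptions.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(14)
# Simulated click counts: 95% regular visitors (rate 2), 5% malicious (rate 20).
z_true = rng.random(5000) < 0.05
clicks = np.where(z_true, rng.poisson(20, 5000), rng.poisson(2, 5000))

# EM for a two-component Poisson mixture.
pi, lam0, lam1 = 0.5, 1.0, 10.0                 # assumed starting values
for _ in range(200):
    # E-step: posterior probability that each visitor is malicious.
    num = pi * poisson.pmf(clicks, lam1)
    r = num / (num + (1 - pi) * poisson.pmf(clicks, lam0))
    # M-step: update the mixing proportion and the two Poisson rates.
    pi = r.mean()
    lam1 = np.sum(r * clicks) / np.sum(r)
    lam0 = np.sum((1 - r) * clicks) / np.sum(1 - r)

print(pi, lam0, lam1)                            # roughly 0.05, 2 and 20
flagged = r > 0.5                                # threshold the posterior malicious probability
print(np.mean(flagged == z_true))                # agreement with the simulated labels
```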