
Showing papers in "Statistica Sinica in 2017"


Journal ArticleDOI
TL;DR: A new penalized likelihood method for model selection of finite multivariate Gaussian mixture models is proposed and is shown to be statistically consistent in determining the number of components.
Abstract: This paper is concerned with an important issue in finite mixture modelling, the selection of the number of mixing components. We propose a new penalized likelihood method for model selection of finite multivariate Gaussian mixture models. The proposed method is shown to be statistically consistent in determining the number of components. A modified EM algorithm is developed to simultaneously select the number of components and estimate the mixing weights, i.e., the mixing probabilities, and the unknown parameters of the Gaussian distributions. Simulations and a real data analysis are presented to illustrate the performance of the proposed method.

108 citations
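A minimal sketch of the component-selection idea in the abstract above, using scikit-learn's GaussianMixture with a BIC-type penalty as a stand-in for the paper's specific penalized-likelihood criterion and modified EM (neither the penalty form nor the algorithm here is the authors'):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data from a two-component bivariate Gaussian mixture
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 2)),
               rng.normal(3.0, 1.0, size=(150, 2))])

# Fit mixtures with k = 1,...,6 components and penalize the log-likelihood
bic = []
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, covariance_type="full",
                         random_state=0).fit(X)
    bic.append(gm.bic(X))           # BIC = -2 log-likelihood + parameter penalty
k_hat = int(np.argmin(bic)) + 1     # selected number of components
print("selected number of components:", k_hat)
```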


Journal ArticleDOI
TL;DR: A new technique for consistent estimation of the number and locations of the change-points in the second-order structure of a time series using the Wild Binary Segmentation method, a technique which involves a certain randomised mechanism.
Abstract: We propose a new technique for consistent estimation of the number and locations of the change-points in the second-order structure of a time series. The core of the segmentation procedure is the Wild Binary Segmentation (WBS) method, a technique which involves a certain randomised mechanism. The advantage of WBS over the standard Binary Segmentation lies in its localisation feature, thanks to which it works in cases where the spacings between change-points are short. In addition, we do not restrict the total number of change-points a time series can have. We also ameliorate the performance of our method by combining the CUSUM statistics obtained at different scales of the wavelet periodogram, our main change-point detection statistic, which allows a rigorous estimation of the local autocovariance of a piecewise-stationary process. We provide a simulation study to examine the performance of our method for different types of scenarios. A proof of consistency is also provided. Our methodology is implemented in the R package wbsts, available from CRAN.

57 citations
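To convey the randomised-interval idea behind WBS, here is a toy sketch that detects mean changes in a univariate series by maximising CUSUM statistics over random sub-intervals; the paper's actual method targets changes in the second-order structure through the wavelet periodogram (see the R package wbsts), which this simplification does not attempt, and the threshold below is ad hoc:

```python
import numpy as np

def cusum(x, s, e):
    """Location and size of the largest mean-change CUSUM statistic on x[s:e]."""
    n = e - s
    best_b, best_val = None, -np.inf
    for b in range(s + 1, e):
        n_l, n_r = b - s, e - b
        stat = np.sqrt(n_l * n_r / n) * abs(x[s:b].mean() - x[b:e].mean())
        if stat > best_val:
            best_b, best_val = b, stat
    return best_b, best_val

def wbs(x, s, e, threshold, n_intervals=100, rng=np.random.default_rng(1)):
    """Recursively split at the strongest CUSUM found over random sub-intervals."""
    if e - s < 2:
        return []
    candidates = [(s, e)] + [tuple(sorted(rng.integers(s, e + 1, size=2)))
                             for _ in range(n_intervals)]
    b, val = max((cusum(x, a, c) for a, c in candidates if c - a >= 2),
                 key=lambda t: t[1])
    if val < threshold:
        return []
    return wbs(x, s, b, threshold) + [b] + wbs(x, b, e, threshold)

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])
print("estimated change-points:", wbs(x, 0, len(x), threshold=4.5))
```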


Journal ArticleDOI
TL;DR: The new results reveal an asymptotic conical structure in critical sample eigendirections under the spike models with distinguishable eigenvalues, when the sample size and/or the number of variables (or dimension) tend to infinity.
Abstract: The aim of this paper is to establish several deep theoretical properties of principal component analysis for multiple-component spike covariance models. Our new results reveal an asymptotic conical structure in critical sample eigendirections under the spike models with distinguishable (or indistinguishable) eigenvalues, when the sample size and/or the number of variables (or dimension) tend to infinity. The consistency of the sample eigenvectors relative to their population counterparts is determined by the ratio between the dimension and the product of the sample size with the spike size. When this ratio converges to a nonzero constant, the sample eigenvector converges to a cone, with a certain angle to its corresponding population eigenvector. In the High Dimension, Low Sample Size case, the angle between the sample eigenvector and its population counterpart converges to a limiting distribution. Several generalizations of the multi-spike covariance models are also explored, and additional theoretical results are presented.

48 citations
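A quick numerical illustration (not the paper's theory) of the conical phenomenon described above: under a single-spike covariance model, the angle between the leading sample and population eigenvectors is driven by the ratio of the dimension to the product of the sample size and the spike size.

```python
import numpy as np

rng = np.random.default_rng(0)
n, spike = 50, 200.0                      # sample size and spike eigenvalue
for d in (100, 1000, 10000):
    u = np.zeros(d)
    u[0] = 1.0                            # population leading eigenvector
    # Rows of X have covariance I_d + (spike - 1) * u u^T
    X = rng.normal(size=(n, d)) + rng.normal(size=(n, 1)) * np.sqrt(spike - 1) * u
    _, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
    cos = min(abs(Vt[0] @ u), 1.0)        # |cos(angle)| to the population eigenvector
    print(f"d={d:6d}  d/(n*spike)={d/(n*spike):5.2f}  angle={np.degrees(np.arccos(cos)):5.1f} deg")
```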


Journal ArticleDOI
TL;DR: This paper proposes a Bayesian method to estimate shape-restricted functions using Gaussian process priors and modify the basic model with a spike-and-slab prior that improves model fit when the true function is on the boundary of the constraint space.
Abstract: This paper proposes a Bayesian method to estimate shape-restricted functions using Gaussian process priors. The proposed model enforces shape restrictions by assuming that the derivatives of the functions are squares of Gaussian processes. The resulting functions, after integration, are monotonic, monotonic convex or concave, U-shaped, and S-shaped. The latter two allow estimation of extreme points and inflection points. The Gaussian process's covariance function has hyperparameters to control the smoothness of the function and the tradeoff between the data and the prior distribution. The Bayesian analysis of these hyperparameters provides a data-driven method to identify the appropriate amount of smoothing. The posterior distributions of the proposed models are consistent. We modify the basic model with a spike-and-slab prior that improves model fit when the true function is on the boundary of the constraint space. We also examine Bayesian hypothesis testing for shape restrictions and discuss its potential and limitations. We contrast our approach with existing Bayesian regression models with monotonicity and concavity and illustrate the empirical performance of the proposed models with synthetic and actual data.

33 citations
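A minimal sketch of the construction idea in the abstract above: if a function's derivative is the square of a Gaussian process, integrating it yields a monotone (nondecreasing) random function. This is a single prior draw on a grid, not the full Bayesian estimation procedure, and the squared-exponential kernel and length-scale are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
# Squared-exponential covariance for the latent GP controlling smoothness
K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 0.1) ** 2) + 1e-8 * np.eye(len(t))
g = np.linalg.cholesky(K) @ rng.normal(size=len(t))   # one GP draw
f = np.cumsum(g ** 2) * (t[1] - t[0])                 # integrate g^2 -> monotone f
assert np.all(np.diff(f) >= 0)                        # monotone by construction
print(f"f is nondecreasing, ranging from {f[0]:.3f} to {f[-1]:.3f}")
```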


Journal ArticleDOI
TL;DR: In this article, a nonparametric imputation method based on the propensity score in a general class of semiparametric models for nonignorable missing data is proposed, which does not require any model specification for the imputation but rather a general parametric model involving an unknown parameter which can be estimated consistently.
Abstract: Handling data with the missing not at random (MNAR) mechanism is still a challenging problem in statistics. In this article, we propose a nonparametric imputation method based on the propensity score in a general class of semiparametric models for nonignorable missing data. Compared with the existing imputation methods, the proposed imputation method is more flexible as it does not require any model specification for the propensity score but rather a general parametric model involving an unknown parameter which can be estimated consistently. To obtain a consistent estimator of the parametric propensity score, two approaches are proposed. One is based on a validation sample. The other is a semi-empirical likelihood (SEL) method. By incorporating auxiliary information from some calibration conditions under the MNAR assumption, we gain significant efficiency with the SEL-based estimator. We investigate the asymptotic properties of the proposed estimators based on either known or estimated propensity scores. Our empirical studies show that the resultant estimator is robust against a misspecified response model. Simulation studies and data analysis are provided to evaluate the finite sample performance of the proposed method.

30 citations
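For orientation, one widely used semiparametric propensity model of the kind the abstract alludes to (an exponential-tilting form; not necessarily the exact specification adopted in the paper) lets the missingness indicator $\delta$ depend on the possibly missing response $y$ only through a single unknown parameter $\gamma$, while the covariate part $g(\cdot)$ is left unspecified:

$$
P(\delta = 1 \mid x, y) \;=\; \frac{\exp\{g(x) + \gamma y\}}{1 + \exp\{g(x) + \gamma y\}}.
$$

In such a model $\gamma = 0$ corresponds to ignorable missingness, and the unknown tilting parameter is the kind of quantity the validation-sample and semi-empirical likelihood approaches described above are designed to estimate consistently.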


Journal ArticleDOI
TL;DR: In this paper, the authors presented a hypothesis testing method given independent samples from a number of connected populations, which is motivated by a forestry project for monitoring change in the strength of lumber.
Abstract: This paper presents a hypothesis testing method given independent samples from a number of connected populations. The method is motivated by a forestry project for monitoring change in the strength of lumber. Traditional practice has been built upon nonparametric methods which ignore the fact that these populations are connected. By pooling the information in multiple samples through a density ratio model, the proposed empirical likelihood method leads to a more efficient inference and therefore reduces the cost in applications. The new test has a classical chi-square null limiting distribution. Its power function is obtained under a class of local alternatives. The local power is found increased even when some underlying populations are unrelated to the hypothesis of interest. Simulation studies confirm that this test has better power properties than potential competitors, and is robust to model misspecification. An application example to lumber strength is included.

27 citations
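The density ratio model referred to above is commonly written as follows, linking each of the $m$ connected populations to a baseline density $f_0$ through a pre-specified vector of basis functions $q(x)$ (the particular $q$ used in the paper is not reproduced here):

$$
f_k(x) \;=\; \exp\{\alpha_k + \beta_k^{\top} q(x)\}\, f_0(x), \qquad k = 1, \dots, m.
$$

Hypotheses about the populations then typically reduce to constraints on the $(\alpha_k, \beta_k)$, and the empirical likelihood ratio test of such constraints is what yields the chi-square null limit mentioned in the abstract.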


Journal ArticleDOI
TL;DR: A unified method for two-level models, based on a weighted composite likelihood approach, that takes account of design features and provides valid inferences even for small sample sizes within level 2 units is presented.
Abstract: Multi-level models provide a convenient framework for analyzing data from survey samples with hierarchical structures. Inferential procedures that take account of survey design features are well established for single-level (or marginal) models. On the other hand, available methods that are valid for general multi-level models are somewhat limited. This paper presents a unified method for two-level models, based on a weighted composite likelihood approach, that takes account of design features and provides valid inferences even for small sample sizes within level 2 units. The proposed method has broad applicability and is straightforward to implement. Empirical studies reported here demonstrate that the method performs well in estimating the model parameters. Moreover, this research has an important implication: it provides a particular scenario that showcases the unique merit of the composite likelihood method in a setting where the likelihood method would not work.

24 citations


Journal ArticleDOI
TL;DR: The Bayesian approach to multivariate adaptive regression splines (BMARS) as an emulator for a computer model that outputs curves is discussed and modifications to traditional BMARS approaches are introduced that allow for fitting large amounts of data and allow for more efficient MCMC sampling.
Abstract: When a computer code is used to simulate a complex system, one of the fundamental tasks is to assess the sensitivity of the simulator to the different input parameters. In the case of computationally expensive simulators, this is often accomplished via a surrogate statistical model, a statistical output emulator. An effective emulator is one that provides good approximations to the computer code output for wide ranges of input values. In addition, an emulator should be able to handle large dimensional simulation output for a relevant number of inputs; it should flexibly capture heterogeneities in the variability of the response surface; it should be fast to evaluate for arbitrary combinations of input parameters, and it should provide an accurate quantification of the emulation uncertainty. In this paper we discuss the Bayesian approach to multivariate adaptive regression splines (BMARS) as an emulator for a computer model that outputs curves. We introduce modifications to traditional BMARS approaches that allow for fitting large amounts of data and allow for more efficient MCMC sampling. We emphasize the ease with which sensitivity analysis can be performed in this situation. We present a sensitivity analysis of a computer model of the deformation of a protective plate used in pressure-driven experiments. Our example serves as an illustration of the ability of BMARS emulators to fulfill all the necessities of computability, flexibility and reliable calculation on relevant measures of sensitivity.

22 citations


Journal ArticleDOI
TL;DR: An iterative approach to estimating the loading space of each regime and clustering the data points, combining eigenanalysis and the Viterbi algorithm is proposed, providing flexibility in dealing with applications in which underlying states may be changing over time.
Abstract: We consider a factor model for high-dimensional time series with regime-switching dynamics. The switching is assumed to be driven by an unobserved Markov chain; the mean, factor loading matrix, and covariance matrix of the noise process are different among the regimes. The model is an extension of the traditional factor models for time series and provides flexibility in dealing with applications in which underlying states may be changing over time. We propose an iterative approach to estimating the loading space of each regime and clustering the data points, combining eigenanalysis and the Viterbi algorithm. The theoretical properties of the procedure are investigated. Simulation results and the analysis of a data example are presented.

21 citations
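The Viterbi step used in the iterative estimation can be sketched in a few lines; the transition matrix and per-state log-likelihoods below are toy inputs rather than the factor-model quantities the paper works with.

```python
import numpy as np

def viterbi(log_pi, log_A, log_lik):
    """log_pi: (K,) initial log-probs; log_A: (K,K) transition log-probs;
    log_lik: (T,K) per-time, per-state observation log-likelihoods."""
    T, K = log_lik.shape
    delta = np.empty((T, K))
    psi = np.zeros((T, K), dtype=int)
    delta[0] = log_pi + log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A      # (from-state, to-state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_lik[t]
    states = np.empty(T, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                  # backtrack the best path
        states[t] = psi[t + 1, states[t + 1]]
    return states

# Toy example with two regimes differing only in mean
rng = np.random.default_rng(0)
z = np.repeat([0, 1, 0], 50)                        # true regime path
y = rng.normal(loc=np.where(z == 0, 0.0, 3.0))
log_lik = np.column_stack([-0.5 * (y - 0.0) ** 2, -0.5 * (y - 3.0) ** 2])
path = viterbi(np.log([0.5, 0.5]), np.log([[0.95, 0.05], [0.05, 0.95]]), log_lik)
print("decoding accuracy:", (path == z).mean())
```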


Journal ArticleDOI
TL;DR: The authors proposed a regularization method that can deal with large numbers of candidate generalized linear mixed models (GLMMs) while preserving a hierarchical structure in the effects that needs to be taken into account when performing variable selection.
Abstract: In many applications of generalized linear mixed models (GLMMs), there is a hierarchical structure in the effects that needs to be taken into account when performing variable selection. A prime example of this is when fitting mixed models to longitudinal data, where it is usual for covariates to be included as only fixed effects or as composite (fixed and random) effects. In this article, we propose the first regularization method that can deal with large numbers of candidate GLMMs while preserving this hierarchical structure: CREPE (Composite Random Effects PEnalty) for joint selection in mixed models. CREPE induces sparsity in a hierarchical manner, as the fixed effect for a covariate is shrunk to zero only if the corresponding random effect is or has already been shrunk to zero. In the setting where the number of fixed effects grows at a slower rate than the number of clusters, we show that CREPE is selection consistent for both fixed and random effects, and attains the oracle property. Simulations show that CREPE outperforms some currently available penalized methods for mixed models.

20 citations


Journal ArticleDOI
TL;DR: This work proposes a new method for estimating A0 which does not rely on the knowledge or an estimation of the standard deviation of the noise, and achieves optimal rates of convergence under the Frobenius risk.
Abstract: We propose a new pivotal method for estimating high-dimensional matrices. Assume that we observe a small set of entries or linear combinations of entries of an unknown matrix $A_0$ corrupted by noise. We propose a new method for estimating $A_0$ which does not rely on the knowledge or an estimation of the standard deviation of the noise $\sigma$. Our estimator achieves, up to a logarithmic factor, optimal rates of convergence under the Frobenius risk and, thus, has the same prediction performance as previously proposed estimators which rely on the knowledge of $\sigma$. Our method is based on the solution of a convex optimization problem which makes it computationally attractive.


Journal ArticleDOI
TL;DR: In this paper, it was shown that the support points of an optimal design lie on the edges of the design region, if this design region is a polyhedron, and under certain conditions the D-optimal designs can be constructed from the optimal designs in the marginal models with single covariates, which can be applied to a broad class of models, including the Poisson, the negative binomial as well as the proportional hazards model with both type I and random censoring.
Abstract: In this paper we consider nonlinear models with an arbitrary number of covariates for which the information additionally depends on the value of the linear predictor. We establish the general result that for many optimality criteria the support points of an optimal design lie on the edges of the design region, if this design region is a polyhedron. Based on this result we show that under certain conditions the D-optimal designs can be constructed from the D-optimal designs in the marginal models with single covariates. This can be applied to a broad class of models, which include the Poisson, the negative binomial as well as the proportional hazards model with both type I and random censoring.

Journal ArticleDOI
TL;DR: In this paper, the authors derived credible intervals for the Bayesian nonparametric estimator of Dn(l), and investigated the large n asymptotic behaviour of such an estimator.
Abstract: Given a sample of size n from a population of individuals belonging to different species with unknown proportions, a popular problem of practical interest consists in making inference on the probability Dn(l) that the (n+1)-th draw coincides with a species with frequency l in the sample, for any l=0,1,…,n. This paper contributes to the methodology of Bayesian nonparametric inference for Dn(l). Specifically, under the general framework of Gibbs-type priors we show how to derive credible intervals for the Bayesian nonparametric estimator of Dn(l), and we investigate the large n asymptotic behaviour of such an estimator. Of particular interest are special cases of our results obtained under the assumption of the two parameter Poisson-Dirichlet prior and the normalized generalized Gamma prior, which are two of the most commonly used Gibbs-type priors. With respect to these two prior assumptions, the proposed results are illustrated through a simulation study and a benchmark Expressed Sequence Tags dataset. To the best of our knowledge, this illustration provides the first comparative study between the two parameter Poisson-Dirichlet prior and the normalized generalized Gamma prior in the context of Bayesian nonparametric inference for Dn(l).
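For concreteness, under the two-parameter Poisson-Dirichlet prior PD($\sigma,\theta$) the Bayesian nonparametric estimator of Dn(l) takes the familiar predictive form below, where $k_n$ is the number of distinct species observed and $m_{l,n}$ is the number of species appearing exactly $l$ times in the sample (the paper's credible intervals and large-n asymptotics are not reproduced here):

$$
\hat D_n(0) \;=\; \frac{\theta + \sigma\, k_n}{\theta + n}, \qquad
\hat D_n(l) \;=\; \frac{(l - \sigma)\, m_{l,n}}{\theta + n}, \quad l = 1, \dots, n.
$$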

Journal ArticleDOI
Abstract: Author-accepted manuscript; the definitive version was published in Statistica Sinica, 2016, 26(1): 385-411. DOI: 10.5705/ss.2013.265.

Journal ArticleDOI
TL;DR: In this article, the tail quotient correlation coefficient (TQCC) was proposed to measure tail dependence between two random variables and an approximation theory between conditional tail probabilities was established.
Abstract: This paper first studies the theoretical properties of the tail quotient correlation coefficient (TQCC), which was proposed to measure tail dependence between two random variables. By introducing random thresholds in TQCC, an approximation theory between conditional tail probabilities is established. The new random threshold-driven TQCC can be used to test the null hypothesis of tail independence under which TQCC test statistics are shown to follow a Chi-squared distribution under two general scenarios. The TQCC is shown to be consistent under the alternative hypothesis of tail dependence with a general approximation of max-stable distribution. Second, we apply TQCC to investigate tail dependencies in a large-scale problem of daily precipitation in the continental US. Our results, from the perspective of tail dependence, reveal nonstationarity, spatial clusters, and tail dependence in precipitation across the continental US.

Journal ArticleDOI
TL;DR: In this paper, the authors propose general and flexible stepup and stepdown procedures for testing multiple hypotheses about sequential data that simultaneously control both the type I and II versions of FDP and FWER.
Abstract: The $\gamma$-FDP and $k$-FWER multiple testing error metrics, which are tail probabilities of the respective error statistics, have become popular recently as less-stringent alternatives to the FDR and FWER. We propose general and flexible stepup and stepdown procedures for testing multiple hypotheses about sequential (or streaming) data that simultaneously control both the type I and II versions of $\gamma$-FDP, or $k$-FWER. The error control holds regardless of the dependence between data streams, which may be of arbitrary size and shape. All that is needed is a test statistic for each data stream that controls the conventional type I and II error probabilities, and no information or assumptions are required about the joint distribution of the statistics or data streams. The procedures can be used with sequential, group sequential, truncated, or other sampling schemes. We give recommendations for the procedures' implementation including closed-form expressions for the needed critical values in some commonly-encountered testing situations. The proposed sequential procedures are compared with each other and with comparable fixed sample size procedures in the context of strongly positively correlated Gaussian data streams. For this setting we conclude that both the stepup and stepdown sequential procedures provide substantial savings over the fixed sample procedures in terms of expected sample size, and the stepup procedure performs slightly but consistently better than the stepdown for $\gamma$-FDP control, with the relationship reversed for $k$-FWER control.

Journal ArticleDOI
TL;DR: In this article, the authors proposed two estimators, the weighted composite estimator that minimizes the weighted combined quantile objective function across quantiles, and the weighted quantile average estimator, which is the weighted average of quantile-specific slope estimators.
Abstract: Quantile regression offers a convenient tool to access the relationship between a response and covariates in a comprehensive way and it is appealing especially in applications where interests are on the tails of the response distribution. However, due to data sparsity, the finite sample estimation at tail quantiles often suffers from high variability. To improve the tail estimation efficiency, we consider modeling multiple quantiles jointly for cases where the quantile slope coefficients tend to be constant at the tails. We propose two estimators, the weighted composite estimator that minimizes the weighted combined quantile objective function across quantiles, and the weighted quantile average estimator that is the weighted average of quantile-specific slope estimators. By using extreme value theory, we establish the asymptotic distributions of the two estimators at the tails, and propose a procedure for estimating optimal weights. We show that the optimally weighted estimators improve the efficiency over equally weighted estimators, and the efficiency gain depends on the heaviness of the tail distribution. The performance of the proposed estimators is assessed through a simulation study and the analysis of precipitation downscaling data.
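In notation introduced here for illustration (consistent with the abstract's description of a common slope across tail quantile levels $\tau_1 < \dots < \tau_K$ with quantile-specific intercepts), the two estimators can be written as

$$
\hat\beta_{\mathrm{WCE}} = \arg\min_{\beta,\, b_1, \dots, b_K} \sum_{k=1}^{K} w_k \sum_{i=1}^{n} \rho_{\tau_k}\!\big(y_i - b_k - x_i^{\top}\beta\big),
\qquad
\hat\beta_{\mathrm{WQAE}} = \sum_{k=1}^{K} w_k\, \hat\beta(\tau_k),
$$

where $\rho_\tau(u) = u\{\tau - 1(u<0)\}$ is the quantile loss, the weights $w_k$ sum to one, and $\hat\beta(\tau_k)$ is the slope estimator from a separate quantile regression at level $\tau_k$; choosing the $w_k$ optimally via extreme value theory is part of the paper's contribution.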

Journal ArticleDOI
TL;DR: In this paper, the authors consider a heteroscedastic transformation model, where the transformation belongs to a parametric family of monotone transformations, the regression and variance function are modelled nonparametrically and the error is independent of the multidimensional covariates.
Abstract: In this paper we consider a heteroscedastic transformation model, where the transformation belongs to a parametric family of monotone transformations, the regression and variance function are modelled nonparametrically and the error is independent of the multidimensional covariates. In this model, we first consider the estimation of the unknown components of the model, namely the transformation parameter, regression and variance function and the distribution of the error. We show the asymptotic normality of the proposed estimators. Second, we propose tests for the validity of the model, and establish the limiting distribution of the test statistics under the null hypothesis. A bootstrap procedure is proposed to approximate the critical values of the tests. Finally, we carry out a simulation study to verify the small sample behavior of the proposed estimators and tests.
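In the notation commonly used for such models, the heteroscedastic transformation model studied here can be written as

$$
\Lambda_{\theta}(Y) \;=\; m(X) + \sigma(X)\,\varepsilon, \qquad \varepsilon \ \text{independent of}\ X,
$$

with $\Lambda_{\theta}$ a member of a parametric family of monotone transformations (the Box-Cox family is a standard example), $m$ and $\sigma$ unknown functions estimated nonparametrically, and the distribution of $\varepsilon$ left unspecified.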

Journal ArticleDOI
TL;DR: In this paper, the authors exploit the efficient empirical likelihood method to give a unified interval for the coefficient by taking the structure of errors into account, and propose a jackknife method to reduce the computation of the empirical likelihood when the order in the AR errors is not small.
Abstract: An empirical likelihood method was proposed in Hill and Peng (2014) to construct a unified interval estimation for the coefficient in an AR(1) model, regardless of whether the sequence was stationary or near integrated. The error term, however, was assumed independent, and this method fails when the errors are dependent. Testing for a unit root in an AR(1) model has been studied in the literature for dependent errors, but existing methods cannot be used to test for a near unit root. In this paper, assuming the errors are governed by an AR(p) process, we exploit the efficient empirical likelihood method to give a unified interval for the coefficient by taking the structure of errors into account. Furthermore, a jackknife empirical likelihood method is proposed to reduce the computation of the empirical likelihood method when the order in the AR errors is not small. A simulation study is conducted to examine the finite sample behavior of the proposed methods.

Journal ArticleDOI
TL;DR: A novel sure independence screening method based on conditional distance correlation under the ultrahigh dimensional model setting that accomplishes the adjustment by conditioning on the confounding variables and is applicable to data with multivariate response.
Abstract: Detecting candidate genetic variants in genomic studies often encounters confounding problems, particularly when the data are ultrahigh dimensional. Confounding covariates, such as age and gender, not only can reduce the statistical power, but also introduce spurious genetic association. How to control for the confounders in ultrahigh dimensional data analysis is a critical and challenging issue. In this paper, we propose a novel sure independence screening method based on conditional distance correlation under the ultrahigh dimensional model setting. Our proposal accomplishes the adjustment by conditioning on the confounding variables. With the model-free feature of conditional distance correlation, our method does not need any parametric modeling assumptions and is thus quite flexible. In addition, it is applicable to data with multivariate response. We show that under some mild technical conditions, the proposed method enjoys the sure screening property even when the dimensionality is an exponential order of the sample size. The simulation studies and a data analysis demonstrate that the proposed procedure has competitive performance.
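A simplified stand-in for the screening step (illustration only): rank predictors by unconditional distance correlation with the response, using the third-party dcor package, and retain the top few. The paper's method instead conditions on the confounders via conditional distance correlation, which is not implemented here, and the retained-set size is a common sure-screening convention rather than the paper's choice.

```python
import numpy as np
import dcor   # third-party package providing distance correlation

rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)   # only predictors 0 and 1 are active

# Marginal (unconditional) distance correlation of each predictor with y
scores = np.array([dcor.distance_correlation(X[:, j], y) for j in range(p)])
keep = np.argsort(scores)[::-1][: int(n / np.log(n))]    # retain about n/log(n) predictors
print("active predictors retained:", {0, 1} <= set(keep))
```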

Journal ArticleDOI
TL;DR: In this paper, extreme analogues of Wang distortion risk measures, a family that includes the Value-at-Risk and Tail Value-at-Risk, are used to study the right tail of a real-valued random variable, and adapted estimators are proposed for the case where the random variable has a heavy-tailed distribution.
Abstract: Among the many possible ways to study the right tail of a real-valued random variable, a particularly general one is given by considering the family of its Wang distortion risk measures. This class of risk measures encompasses various interesting indicators, such as the widely used Value-at-Risk and Tail Value-at-Risk, which are especially popular in actuarial science, for instance. In this paper, we first build simple extreme analogues of Wang distortion risk measures and we show how this makes it possible to consider many standard measures of extreme risk, including the usual extreme Value-at-Risk or Tail-Value-at-Risk, as well as the recently introduced extreme Conditional Tail Moment, in a unified framework. We then introduce adapted estimators when the random variable of interest has a heavy-tailed distribution and we prove their asymptotic normality. The finite sample performance of our estimators is assessed on a simulation study and we showcase our techniques on two sets of real data.
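Using one standard convention for a nonnegative loss $X$ with survival function $\bar F$, a Wang distortion risk measure is indexed by a nondecreasing distortion function $g$ with $g(0)=0$ and $g(1)=1$:

$$
R_g(X) \;=\; \int_0^{\infty} g\big(\bar F(x)\big)\, dx,
$$

and the choices $g(s) = 1\{s > 1-\alpha\}$ and $g(s) = \min\{s/(1-\alpha),\,1\}$ recover the Value-at-Risk and Tail Value-at-Risk at level $\alpha$; the extreme analogues studied in the paper correspond, roughly, to letting the level $\alpha$ tend to one.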

Journal ArticleDOI
TL;DR: In this paper, a fully generative nonparametric approach, which relies on mixing parametric kernels such as the matrix Langevin distribution, is proposed to approximate a large class of distributions on the Stiefel manifold.
Abstract: The Stiefel manifold $V_{p,d}$ is the space of all $d \times p$ orthonormal matrices, with the $d-1$ hypersphere and the space of all orthogonal matrices constituting special cases. In modeling data lying on the Stiefel manifold, parametric distributions such as the matrix Langevin distribution are often used; however, model misspecification is a concern and it is desirable to have nonparametric alternatives. Current nonparametric methods are Frechet mean based. We take a fully generative nonparametric approach, which relies on mixing parametric kernels such as the matrix Langevin. The proposed kernel mixtures can approximate a large class of distributions on the Stiefel manifold, and we develop theory showing posterior consistency. While there exists work developing general posterior consistency results, extending these results to this particular manifold requires substantial new theory. Posterior inference is illustrated on a real-world dataset of near-Earth objects.

Journal ArticleDOI
TL;DR: In this paper, a conditional random-effects model is proposed to estimate subject-specific treatment effects through conditional random effects modeling, and apply the random forest algorithm to allocate effective treatments for individuals.
Abstract: We develop new modeling for personalized treatment for longitudinal studies involving high heterogeneity of treatment effects. Incorporating subject-specific information into the treatment assignment is crucial since different individuals can react to the same treatment very differently. We estimate unobserved subject-specific treatment effects through conditional random-effects modeling, and apply the random forest algorithm to allocate effective treatments for individuals. The advantage of our approach is that random-effects estimation does not rely on the normality assumption. In theory, we show that the proposed random-effect estimator is consistent and more efficient than the random-effect estimator that ignores correlation information from longitudinal data. Simulation studies and a data example from an HIV clinical trial also confirm that the proposed method can efficiently identify the best treatments for individual patients.

Journal ArticleDOI
TL;DR: This work revisits the problem of determining the sample size for a Gaussian process emulator and provides a data analytic tool for exact sample size calculations that goes beyond the n = 10d rule of thumb and is based on an IMSPE-related criterion.
Abstract: We revisit the problem of determining the sample size for a Gaussian process emulator and provide a data analytic tool for exact sample size calculations that goes beyond the n = 10d rule of thumb and is based on an IMSPE-related criterion. This allows us to tie sample size and prediction accuracy to the anticipated roughness of the simulated data, and to propose an experimental process for computer experiments, with extension to a robust scheme.

Journal ArticleDOI
TL;DR: In this article, the authors characterized and constructed universally optimal designs among the class of circular repeated-measurements designs when the parameters do not permit balance for carry-over effects.
Abstract: The aim of this paper is to characterize and construct universally optimal designs among the class of circular repeated-measurements designs when the parameters do not permit balance for carry-over effects. It is shown that some circular weakly neighbour balanced designs defined by Filipiak and Markiewicz (2012) are universally optimal repeated-measurements designs. These results extend the work of Magda (1980), Kunert (1984b) and Filipiak and Markiewicz (2012).

Journal ArticleDOI
TL;DR: In this paper, a Bayesian nonparametric approach is used to discover a clustering of region-specific crime time series, and the crime counts in each region are modeled using an integer-valued first-order autoregressive process.
Abstract: To aid in the efficient tasking of police and other protective measures, there is significant interest in being able to predict regions in which crimes are likely to occur. Violent crimes often exhibit both temporal and spatial characteristics, though the spatial patterns do not vary smoothly across the map and instead we see spatially disjoint areas that exhibit similar crime behaviors. It is this indeterminate inter-region correlation structure along with the low-count discrete nature of the data that motivate our proposed forecasting tool. In particular, we propose to model the crime counts in each region using an integer-valued first-order autoregressive process. We take a Bayesian nonparametric approach to flexibly discover a clustering of these region-specific time series. We also present methods for accounting for seasonality and covariates. We demonstrate our approach through an analysis of reported violent crime data in Washington D.C., collected between 2001 and 2008, and show that our forecasts outperform standard methods while additionally providing useful tools such as prediction intervals.
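The integer-valued first-order autoregressive process mentioned above is usually defined through binomial thinning; a textbook version (Poisson innovations are an assumption here, and the paper's region-specific Bayesian nonparametric clustering is not shown) is

$$
Y_t \;=\; \alpha \circ Y_{t-1} + \varepsilon_t, \qquad
\alpha \circ Y_{t-1} \mid Y_{t-1} \sim \mathrm{Binomial}(Y_{t-1}, \alpha), \qquad
\varepsilon_t \overset{\mathrm{iid}}{\sim} \mathrm{Poisson}(\lambda),
$$

so that each of the $Y_{t-1}$ counts survives to time $t$ independently with probability $\alpha$, and new counts arrive through $\varepsilon_t$.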

Journal ArticleDOI
TL;DR: In this article, the authors develop nonparametric approaches to the pAUC and pODC using normal approximation, the jackknife, and the jackknife empirical likelihood, illustrated with the Pancreatic Cancer Serum Biomarker data set.
Abstract: The receiver operating characteristic (ROC) curve is a well-known measure of the performance of a classification method. Interest may only pertain to a specific region of the curve and, in this case, the partial area under the ROC curve (pAUC) provides a useful summary measure. Related measures such as the ordinal dominance curve (ODC) and the partial area under the ODC (pODC) are frequently of interest as well. Based on a novel estimator of pAUC proposed by Wang and Chang (2011), we develop nonparametric approaches to the pAUC and pODC using normal approximation, the jackknife and the jackknife empirical likelihood. A simulation study demonstrates the flaws of the existing method and shows that the proposed methods perform well. Simulations also substantiate the consistency of our jackknife variance estimator. The Pancreatic Cancer Serum Biomarker data set is used to illustrate the proposed methods.
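In notation introduced here for illustration, with $F$ and $G$ the marker distributions in the non-diseased and diseased groups, the ROC curve is $\mathrm{ROC}(t) = \bar G\big(\bar F^{-1}(t)\big)$ for $t \in (0,1)$, and the partial area restricts attention to a false-positive-rate window $[p_0, p_1]$:

$$
\mathrm{pAUC}(p_0, p_1) \;=\; \int_{p_0}^{p_1} \bar G\big(\bar F^{-1}(t)\big)\, dt,
$$

with the ODC and pODC obtained, up to relabelling of the axes, by exchanging the roles of the two distributions.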

Journal ArticleDOI
TL;DR: In this paper, the authors conduct integrative analysis of survival data under the accelerated failure time (AFT) model, where the sparsity structures of multiple datasets are described using the homogeneity and heterogeneity models.
Abstract: For survival data with high-dimensional covariates, results generated in the analysis of a single dataset are often unsatisfactory because of the small sample size. Integrative analysis pools raw data from multiple independent studies with comparable designs, effectively increases sample size, and has better performance than meta-analysis and single-dataset analysis. In this study, we conduct integrative analysis of survival data under the accelerated failure time (AFT) model. The sparsity structures of multiple datasets are described using the homogeneity and heterogeneity models. For variable selection under the homogeneity model, we adopt group penalization approaches. For variable selection under the heterogeneity model, we use composite penalization and sparse group penalization approaches. As a major advancement from the existing studies, the asymptotic selection and estimation properties are rigorously established. Simulation study is conducted to compare different penalization methods and against alternatives. We also analyze four lung cancer prognosis datasets with gene expression measurements.

Journal ArticleDOI
TL;DR: A testing method is provided to assess first-order independence between points and marks, where first-order independence is concluded if the test statistic is insignificant and first-order dependence is concluded if the test statistic is significant.
Abstract: An important problem in statistical methods for marked point processes (MPPs) is to evaluate the relationship between points and marks, which can be developed under either the concept of independence or the concept of separability. Although both have been used, the connection between these two concepts is still unclear in the literature. The present article provides a way to evaluate such a connection, concluding that the concept of independence and the concept of separability are equivalent if the Kolmogorov consistency condition is satisfied, but not otherwise. We also provide a testing method to assess first-order independence between points and marks, where first-order independence is concluded if the test statistic is insignificant and first-order dependence is concluded if the test statistic is significant. The performance of the testing method is evaluated under simulation and case studies.