
Showing papers in "Annals of Statistics in 2018"


Journal ArticleDOI
TL;DR: It is demonstrated that in several problems such as non-convex binary classification, robust regression, and Gaussian mixture model, this result implies a complete characterization of the landscape of the empirical risk, and of the convergence properties of descent algorithms.
Abstract: Most high-dimensional estimation methods propose to minimize a cost function (empirical risk) that is a sum of losses associated with each data point (each example). In this paper, we focus on the case of nonconvex losses. Classical empirical process theory implies uniform convergence of the empirical (or sample) risk to the population risk. While under additional assumptions, uniform convergence implies consistency of the resulting M-estimator, it does not ensure that the latter can be computed efficiently. In order to capture the complexity of computing M-estimators, we study the landscape of the empirical risk, namely its stationary points and their properties. We establish uniform convergence of the gradient and Hessian of the empirical risk to their population counterparts, as soon as the number of samples becomes larger than the number of unknown parameters (modulo logarithmic factors). Consequently, good properties of the population risk can be carried to the empirical risk, and we are able to establish a one-to-one correspondence of their stationary points. We demonstrate that in several problems such as nonconvex binary classification, robust regression and Gaussian mixture model, this result implies a complete characterization of the landscape of the empirical risk, and of the convergence properties of descent algorithms. We extend our analysis to the very high-dimensional setting in which the number of parameters exceeds the number of samples, and provide a characterization of the empirical risk landscape under a nearly information-theoretically minimal condition. Namely, if the number of samples exceeds the sparsity of the parameter vector (modulo logarithmic factors), then a suitable uniform convergence result holds. We apply this result to nonconvex binary classification and robust regression in very high dimensions.

203 citations


Journal ArticleDOI
TL;DR: In this paper, the authors show that the shape of the optimal shrinker is determined by the choice of loss function and inconsistency of both eigenvalues and eigenvectors of the sample covariance matrix.
Abstract: We show that in a common high-dimensional covariance model, the choice of loss function has a profound effect on optimal estimation. In an asymptotic framework based on the Spiked Covariance model and use of orthogonally invariant estimators, we show that optimal estimation of the population covariance matrix boils down to design of an optimal shrinker η that acts elementwise on the sample eigenvalues. Indeed, to each loss function there corresponds a unique admissible eigenvalue shrinker η* dominating all other shrinkers. The shape of the optimal shrinker is determined by the choice of loss function and, crucially, by inconsistency of both eigenvalues and eigenvectors of the sample covariance matrix. Details of these phenomena and closed form formulas for the optimal eigenvalue shrinkers are worked out for a menagerie of 26 loss functions for covariance estimation found in the literature, including the Stein, Entropy, Divergence, Frechet, Bhattacharya/Matusita, Frobenius Norm, Operator Norm, Nuclear Norm and Condition Number losses.
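The elementwise recipe the abstract describes can be sketched in a few lines of Python; the shrinker `eta` below is a placeholder argument (the loss-specific optimal shrinkers derived in the paper are not reproduced here), and the hard-threshold example is purely illustrative.

```python
import numpy as np

def shrink_covariance(X, eta):
    """Orthogonally invariant covariance estimate: keep the sample
    eigenvectors and apply a scalar shrinker `eta` to each sample
    eigenvalue. `eta` is a placeholder for the loss-specific optimal
    shrinker derived in the paper."""
    n, p = X.shape
    S = X.T @ X / n                                  # sample covariance
    evals, evecs = np.linalg.eigh(S)                 # sample spectrum
    shrunk = np.array([eta(lam, p / n) for lam in evals])
    return evecs @ np.diag(shrunk) @ evecs.T

# Illustrative (not optimal) shrinker: keep eigenvalues above the
# Marchenko-Pastur bulk edge, map the rest to 1.
bulk_edge = lambda gamma: (1 + np.sqrt(gamma)) ** 2
eta_hard = lambda lam, gamma: lam if lam > bulk_edge(gamma) else 1.0
```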

169 citations


Journal ArticleDOI
TL;DR: This is the first result that gives different optimal rates for the left and right singular spaces under the same perturbation, and applications to low-rank matrix denoising and singular space estimation, high-dimensional clustering, and canonical correlation analysis are discussed.
Abstract: Perturbation bounds for singular spaces, in particular Wedin’s $\sin\Theta$ theorem, are a fundamental tool in many fields including high-dimensional statistics, machine learning and applied mathematics. In this paper, we establish separate perturbation bounds, measured in both spectral and Frobenius $\sin\Theta$ distances, for the left and right singular subspaces. Lower bounds, which show that the individual perturbation bounds are rate-optimal, are also given. The new perturbation bounds are applicable to a wide range of problems. In this paper, we consider in detail applications to low-rank matrix denoising and singular space estimation, high-dimensional clustering and canonical correlation analysis (CCA). In particular, separate matching upper and lower bounds are obtained for estimating the left and right singular spaces. To the best of our knowledge, this is the first result that gives different optimal rates for the left and right singular spaces under the same perturbation.

153 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider Gaussian designs with known or unknown population covariance and prove that the debiased estimator is asymptotically Gaussian under the nearly optimal condition $s_{0}=o(n/(\log p)^{2})$.
Abstract: Performing statistical inference in high-dimensional models is challenging because of the lack of precise information on the distribution of high-dimensional regularized estimators. Here, we consider linear regression in the high-dimensional regime $p\gg n$ and the Lasso estimator: we would like to perform inference on the parameter vector $\theta^{*}\in\mathbb{R}^{p}$. Important progress has been achieved in computing confidence intervals and $p$-values for single coordinates $\theta^{*}_{i}$, $i\in\{1,\dots,p\}$. A key role in these new inferential methods is played by a certain debiased estimator $\widehat{\theta}^{\mathrm{d}}$. Earlier work establishes that, under suitable assumptions on the design matrix, the coordinates of $\widehat{\theta}^{\mathrm{d}}$ are asymptotically Gaussian provided the true parameter vector $\theta^{*}$ is $s_{0}$-sparse with $s_{0}=o(\sqrt{n}/\log p)$. The condition $s_{0}=o(\sqrt{n}/\log p)$ is considerably stronger than the one for consistent estimation, namely $s_{0}=o(n/\log p)$. In this paper, we consider Gaussian designs with known or unknown population covariance. When the covariance is known, we prove that the debiased estimator is asymptotically Gaussian under the nearly optimal condition $s_{0}=o(n/(\log p)^{2})$. The same conclusion holds if the population covariance is unknown but can be estimated sufficiently well. For intermediate regimes, we describe the trade-off between sparsity in the coefficients $\theta^{*}$, and sparsity in the inverse covariance of the design. We further discuss several applications of our results beyond high-dimensional inference. In particular, we propose a thresholded Lasso estimator that is minimax optimal up to a factor $1+o_{n}(1)$ for i.i.d. Gaussian designs.
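As a rough illustration of the debiased estimator $\widehat{\theta}^{\mathrm{d}}$ in the known-covariance case, the following sketch uses the standard one-step correction with $M=\Sigma^{-1}$; the tuning parameter `lam` and the use of scikit-learn's Lasso are illustrative choices, not the paper's prescription.

```python
import numpy as np
from sklearn.linear_model import Lasso

def debiased_lasso(X, y, Sigma_inv, lam):
    """One-step debiasing of the Lasso with a known population precision
    matrix Sigma_inv (known-covariance Gaussian design case):
        theta_d = theta_lasso + (1/n) * Sigma_inv @ X.T @ (y - X @ theta_lasso).
    `lam` is an unspecified tuning parameter, used here only for illustration."""
    n, _ = X.shape
    theta_lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    return theta_lasso + Sigma_inv @ X.T @ (y - X @ theta_lasso) / n
```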

152 citations


Journal ArticleDOI
TL;DR: This paper addresses the important question of how large k can be, as n grows large, such that the loss of efficiency due to the divide-and-conquer algorithm is negligible.
Abstract: This paper studies hypothesis testing and parameter estimation in the context of the divide-and-conquer algorithm. In a unified likelihood based framework, we propose new test statistics and point estimators obtained by aggregating various statistics from k subsamples of size n/k, where n is the sample size. In both low dimensional and sparse high dimensional settings, we address the important question of how large k can be, as n grows large, such that the loss of efficiency due to the divide-and-conquer algorithm is negligible. In other words, the resulting estimators have the same inferential efficiencies and estimation rates as an oracle with access to the full sample. Thorough numerical results are provided to back up the theory.
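A minimal sketch of the divide-and-conquer idea, assuming the simplest aggregation rule (averaging the k subsample estimates); the paper's test statistics and aggregation schemes are more refined than this.

```python
import numpy as np

def split_and_aggregate(X, y, k, fit, seed=0):
    """Divide-and-conquer sketch: fit an estimator on each of k subsamples
    of size roughly n/k and average the resulting parameter vectors.
    `fit` maps (X_sub, y_sub) to a parameter vector; plain averaging stands
    in for the paper's aggregated statistics."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(n), k)
    estimates = [fit(X[b], y[b]) for b in blocks]
    return np.mean(estimates, axis=0)

# Example base estimator: ordinary least squares on a subsample.
ols = lambda Xs, ys: np.linalg.lstsq(Xs, ys, rcond=None)[0]
```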

149 citations


Journal ArticleDOI
TL;DR: A new concept called matrix depth is defined and a robust covariance matrix estimator is proposed that is shown to achieve minimax optimal rate under Huber's $\epsilon$-contamination model for estimating covariance/scatter matrices with various structures including bandedness and sparsity.
Abstract: Covariance matrix estimation is one of the most important problems in statistics. To accommodate the complexity of modern datasets, it is desired to have estimation procedures that not only can incorporate the structural assumptions of covariance matrices, but are also robust to outliers from arbitrary sources. In this paper, we define a new concept called matrix depth and then propose a robust covariance matrix estimator by maximizing the empirical depth function. The proposed estimator is shown to achieve minimax optimal rate under Huber’s $\varepsilon$-contamination model for estimating covariance/scatter matrices with various structures including bandedness and sparsity.

146 citations


Journal ArticleDOI
TL;DR: It is shown that two polynomial time methods, a Lasso estimator with adaptively chosen tuning parameter and a Slope estimator, adaptively achieve the exact minimax prediction and estimation rate in high-dimensional linear regression on the class of $s$-sparse target vectors in $\mathbb{R}^{p}$.
Abstract: We show that two polynomial time methods, a Lasso estimator with adaptively chosen tuning parameter and a Slope estimator, adaptively achieve the minimax prediction and $\ell_{2}$ estimation rate $(s/n)\log(p/s)$ in high-dimensional linear regression on the class of $s$-sparse vectors in $\mathbb{R}^{p}$. This is done under the Restricted Eigenvalue (RE) condition for the Lasso and under a slightly more constraining assumption on the design for the Slope. The main results have the form of sharp oracle inequalities accounting for the model misspecification error. The minimax optimal bounds are also obtained for the $\ell_{q}$ estimation errors with $1\le q\le2$ when the model is well specified. The results are nonasymptotic, and hold both in probability and in expectation. The assumptions that we impose on the design are satisfied with high probability for a large class of random matrices with independent and possibly anisotropically distributed rows. We give a comparative analysis of conditions, under which oracle bounds for the Lasso and Slope estimators can be obtained. In particular, we show that several known conditions, such as the RE condition and the sparse eigenvalue condition are equivalent if the $\ell_{2}$-norms of regressors are uniformly bounded.

137 citations


Journal ArticleDOI
TL;DR: In this paper, a unified analysis of the predictive risk of ridge regression and regularized discriminant analysis in a dense random effects model is provided. The analysis is carried out in a high-dimensional asymptotic regime where $p,n\to\infty$ and $p/n\to\gamma>0$, allowing for arbitrary covariance among the features.
Abstract: We provide a unified analysis of the predictive risk of ridge regression and regularized discriminant analysis in a dense random effects model. We work in a high-dimensional asymptotic regime where $p,n\to\infty$ and $p/n\to\gamma>0$, and allow for arbitrary covariance among the features. For both methods, we provide an explicit and efficiently computable expression for the limiting predictive risk, which depends only on the spectrum of the feature-covariance matrix, the signal strength and the aspect ratio $\gamma$. Especially in the case of regularized discriminant analysis, we find that predictive accuracy has a nuanced dependence on the eigenvalue distribution of the covariance matrix, suggesting that analyses based on the operator norm of the covariance matrix may not be sharp. Our results also uncover an exact inverse relation between the limiting predictive risk and the limiting estimation risk in high-dimensional linear models. The analysis builds on recent advances in random matrix theory.

134 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider selective inference with a randomized response and prove a selective central limit theorem that transfers procedures valid under asymptotic normality without selection to their corresponding selective counterparts.
Abstract: Inspired by sample splitting and the reusable holdout introduced in the field of differential privacy, we consider selective inference with a randomized response. We discuss two major advantages of using a randomized response for model selection. First, the selectively valid tests are more powerful after randomized selection. Second, it allows consistent estimation and weak convergence of selective inference procedures. Under independent sampling, we prove a selective (or privatized) central limit theorem that transfers procedures valid under asymptotic normality without selection to their corresponding selective counterparts. This allows selective inference in nonparametric settings. Finally, we propose a framework of inference after combining multiple randomized selection procedures. We focus on the classical asymptotic setting, leaving the interesting high-dimensional asymptotic questions for future work.

124 citations


Journal ArticleDOI
TL;DR: This work shows that greedy algorithms perform within a constant factor from the best possible subset-selection solution for a broad class of general objective functions.
Abstract: We connect high-dimensional subset selection and submodular maximization. Our results extend the work of Das and Kempe [In ICML (2011) 1057–1064] from the setting of linear regression to arbitrary objective functions. For greedy feature selection, this connection allows us to obtain strong multiplicative performance bounds on several methods without statistical modeling assumptions. We also derive recovery guarantees of this form under standard assumptions. Our work shows that greedy algorithms perform within a constant factor from the best possible subset-selection solution for a broad class of general objective functions. Our methods allow a direct control over the number of obtained features as opposed to regularization parameters that only implicitly control sparsity. Our proof technique uses the concept of weak submodularity initially defined by Das and Kempe. We draw a connection between convex analysis and submodular set function theory which may be of independent interest for other statistical learning applications that have combinatorial structure.
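For concreteness, a bare-bones version of greedy forward selection with an (uncentered, for brevity) R² objective, in the spirit of the Das–Kempe setting the abstract refers to; this is a generic sketch, not the authors' implementation.

```python
import numpy as np

def greedy_selection(X, y, k):
    """Forward greedy feature selection: at each step add the feature that
    most increases the (uncentered) R^2 of the least-squares fit on the
    selected set."""
    def r2(cols):
        b = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
        resid = y - X[:, cols] @ b
        return 1.0 - resid @ resid / (y @ y)

    selected, remaining = [], set(range(X.shape[1]))
    for _ in range(k):
        best = max(remaining, key=lambda j: r2(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```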

115 citations


Journal ArticleDOI
TL;DR: In this article, the authors derived asymptotic minimax risks of community detection in degree-corrected block models (DCBMs) and proposed a polynomial time algorithm to adaptively perform consistent and even optimal community detection.
Abstract: Community detection is a central problem of network data analysis. Given a network, the goal of community detection is to partition the network nodes into a small number of clusters, which could often help reveal interesting structures. The present paper studies community detection in Degree-Corrected Block Models (DCBMs). We first derive asymptotic minimax risks of the problem for a misclassification proportion loss under appropriate conditions. The minimax risks are shown to depend on degree-correction parameters, community sizes and average within and between community connectivities in an intuitive and interpretable way. In addition, we propose a polynomial time algorithm to adaptively perform consistent and even asymptotically optimal community detection in DCBMs.

Journal ArticleDOI
TL;DR: In this paper, a new semidefinite programming (SDP) solution to the problem of fitting the stochastic block model is derived as a relaxation of the maximum likelihood (MLE) problem.
Abstract: The stochastic block model (SBM) is a popular tool for community detection in networks, but fitting it by maximum likelihood (MLE) involves a computationally infeasible optimization problem. We propose a new semidefinite programming (SDP) solution to the problem of fitting the SBM, derived as a relaxation of the MLE. We put ours and previously proposed SDPs in a unified framework, as relaxations of the MLE over various subclasses of the SBM, which also reveals a connection to the well-known problem of sparse PCA. Our main relaxation, which we call SDP-1, is tighter than other recently proposed SDP relaxations, and thus previously established theoretical guarantees carry over. However, we show that SDP-1 exactly recovers true communities over a wider class of SBMs than those covered by current results. In particular, the assumption of strong assortativity of the SBM, implicit in consistency conditions for previously proposed SDPs, can be relaxed to weak assortativity for our approach, thus significantly broadening the class of SBMs covered by the consistency results. We also show that strong assortativity is indeed a necessary condition for exact recovery for previously proposed SDP approaches and not an artifact of the proofs. Our analysis of SDPs is based on primal-dual witness constructions, which provides some insight into the nature of the solutions of various SDPs. In particular, we show how to combine features from SDP-1 and already available SDPs to achieve the most flexibility in terms of both assortativity and block-size constraints, as our relaxation has the tendency to produce communities of similar sizes. This tendency makes it the ideal tool for fitting network histograms, a method gaining popularity in the graphon estimation literature, as we illustrate on an example of a social network of dolphins. We also provide empirical evidence that SDPs outperform spectral methods for fitting SBMs with a large number of blocks.
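A generic SDP relaxation of this type can be written in a few lines of cvxpy; the constraint set below (equal block sizes, entrywise bounds) is only meant to convey the flavor of such relaxations and is not necessarily identical to the paper's SDP-1.

```python
import numpy as np
import cvxpy as cp

def sdp_relaxation(A, K):
    """Generic SDP relaxation for fitting an SBM with K (roughly equal-sized)
    blocks: relax the cluster indicator matrix X = Z Z^T to a PSD matrix with
    entrywise and row-sum constraints, and maximize alignment with the
    adjacency matrix A. Illustrative only; the paper's SDP-1 has its own
    specific constraint set."""
    n = A.shape[0]
    X = cp.Variable((n, n), PSD=True)
    constraints = [X >= 0, X <= 1, cp.diag(X) == 1,
                   cp.sum(X, axis=1) == n / K]
    cp.Problem(cp.Maximize(cp.trace(A @ X)), constraints).solve()
    return X.value   # cluster by, e.g., running k-means on the rows
```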

Journal ArticleDOI
TL;DR: In this paper, the authors studied the statistical limits of tests for the presence of a spike, including nonspectral tests, for the Gaussian Wigner ensemble and showed that PCA achieves the optimal detection threshold for certain natural priors for the spike.
Abstract: A central problem of random matrix theory is to understand the eigenvalues of spiked random matrix models, introduced by Johnstone, in which a prominent eigenvector (or “spike”) is planted into a random matrix. These distributions form natural statistical models for principal component analysis (PCA) problems throughout the sciences. Baik, Ben Arous and Peche showed that the spiked Wishart ensemble exhibits a sharp phase transition asymptotically: when the spike strength is above a critical threshold, it is possible to detect the presence of a spike based on the top eigenvalue, and below the threshold the top eigenvalue provides no information. Such results form the basis of our understanding of when PCA can detect a low-rank signal in the presence of noise. However, under structural assumptions on the spike, not all information is necessarily contained in the spectrum. We study the statistical limits of tests for the presence of a spike, including nonspectral tests. Our results leverage Le Cam’s notion of contiguity and include: (i) For the Gaussian Wigner ensemble, we show that PCA achieves the optimal detection threshold for certain natural priors for the spike. (ii) For any non-Gaussian Wigner ensemble, PCA is sub-optimal for detection. However, an efficient variant of PCA achieves the optimal threshold (for natural priors) by pre-transforming the matrix entries. (iii) For the Gaussian Wishart ensemble, the PCA threshold is optimal for positive spikes (for natural priors) but this is not always the case for negative spikes.
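A spectral (PCA) test of the kind discussed above simply compares the top eigenvalue to the bulk edge; the sketch below assumes the noise-normalized spiked Wigner convention $M=\beta xx^{T}+W/\sqrt{n}$, under which the bulk edge is 2 and the spectral detection threshold is $\beta=1$.

```python
import numpy as np

def spiked_wigner(n, beta, seed=0):
    """Noise-normalized spiked Wigner instance M = beta * x x^T + W / sqrt(n),
    with W symmetric, roughly unit-variance Gaussian entries and |x| = 1."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n, n))
    W = (G + G.T) / np.sqrt(2)
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    return beta * np.outer(x, x) + W / np.sqrt(n)

def pca_detect(M, eps=0.05):
    """Spectral spike test: flag a spike if the top eigenvalue exceeds the
    bulk edge 2. Under this normalization the test has power only when the
    signal strength beta exceeds the critical threshold 1."""
    return np.linalg.eigvalsh(M)[-1] > 2 + eps
```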

Journal ArticleDOI
TL;DR: The main thrust of this paper is to introduce the family of Spike-and-Slab LASSO (SS-LASSO) priors, which form a continuum between the Laplace prior and the point-mass spike-and-slab prior, and establish several appealing frequentist properties of SS-LASSO priors.
Abstract: We introduce a new framework for estimation of sparse normal means, bridging the gap between popular frequentist strategies (LASSO) and popular Bayesian strategies (spike-and-slab). The main thrust of this paper is to introduce the family of Spike-and-Slab LASSO (SS-LASSO) priors, which form a continuum between the Laplace prior and the point-mass spike-and-slab prior. We establish several appealing frequentist properties of SS-LASSO priors, contrasting them with these two limiting cases. First, we adopt the penalized likelihood perspective on Bayesian modal estimation and introduce the framework of Bayesian penalty mixing with spike-and-slab priors. We show that the SS-LASSO global posterior mode is (near) minimax rate-optimal under squared error loss, similarly to the LASSO. Going further, we introduce an adaptive two-step estimator which can achieve provably sharper performance than the LASSO. Second, we show that the whole posterior keeps pace with the global mode and concentrates at the (near) minimax rate, a property that is known not to hold for the single Laplace prior. The minimax-rate optimality is obtained with a suitable class of independent product priors (for known levels of sparsity) as well as with dependent mixing priors (adapting to the unknown levels of sparsity). Up to now, the rate-optimal posterior concentration has been established only for spike-and-slab priors with a point mass at zero. Thus, the SS-LASSO priors, despite being continuous, possess similar optimality properties as the “theoretically ideal” point-mass mixtures. These results provide valuable theoretical justification for our proposed class of priors, underpinning their intuitive appeal and practical potential.
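The SS-LASSO prior is usually written as a two-component mixture of Laplace densities, which induces the nonconvex penalty sketched below; parameter names are illustrative, and the adaptive (dependent) mixing over the sparsity level is omitted.

```python
import numpy as np

def ssl_penalty(beta, lam0, lam1, theta):
    """Negative log of a Spike-and-Slab LASSO prior on a scalar coefficient:
    a mixture of a sharp Laplace 'spike' (rate lam0, large) and a diffuse
    Laplace 'slab' (rate lam1, small) with mixing weight theta. With
    lam0 = lam1 it reduces to a single Laplace prior (the LASSO penalty up
    to constants); as lam0 grows it approaches the point-mass spike-and-slab."""
    laplace = lambda b, lam: 0.5 * lam * np.exp(-lam * np.abs(b))
    return -np.log(theta * laplace(beta, lam1) + (1.0 - theta) * laplace(beta, lam0))
```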

Journal ArticleDOI
TL;DR: In this article, the performance of least squares estimators over closed convex sets is studied in shape-constrained regression models under Gaussian and sub-Gaussian noise.
Abstract: The performance of Least Squares (LS) estimators is studied in shape-constrained regression models under Gaussian and sub-Gaussian noise. General bounds on the performance of LS estimators over closed convex sets are provided. These results have the form of sharp oracle inequalities that account for the model misspecification error. In the presence of misspecification, these bounds imply that the LS estimator estimates the projection of the true parameter at the same rate as in the well-specified case. In isotonic and unimodal regression, the LS estimator achieves the nonparametric rate $n^{-2/3}$ as well as a parametric rate of order $k/n$ up to logarithmic factors, where $k$ is the number of constant pieces of the true parameter. In univariate convex regression, the LS estimator satisfies an adaptive risk bound of order $q/n$ up to logarithmic factors, where $q$ is the number of affine pieces of the true regression function. This adaptive risk bound holds for any collection of design points. While Guntuboyina and Sen [Probab. Theory Related Fields 163 (2015) 379–411] established that the nonparametric rate of convex regression is of order $n^{-4/5}$ for equispaced design points, we show that the nonparametric rate of convex regression can be as slow as $n^{-2/3}$ for some worst-case design points. This phenomenon can be explained as follows: Although convexity brings more structure than unimodality, for some worst-case design points this extra structure is uninformative and the nonparametric rates of unimodal regression and convex regression are both $n^{-2/3}$. Higher order cones, such as the cone of $\beta $-monotone sequences, are also studied.
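In the isotonic case, the least squares estimator over the monotone cone is just the isotonic projection of the data, computable by pool-adjacent-violators; a minimal example using scikit-learn, with a piecewise-constant truth to echo the $k/n$ adaptation discussed above:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Least squares over the monotone cone: in isotonic regression the LS
# estimator is the projection of the data onto the cone of nondecreasing
# sequences, computable by pool-adjacent-violators.
rng = np.random.default_rng(0)
x = np.arange(100)
theta = np.repeat([0.0, 0.5, 1.0, 1.5], 25)        # nondecreasing, k = 4 constant pieces
y = theta + 0.3 * rng.standard_normal(100)
theta_hat = IsotonicRegression().fit(x, y).predict(x)
```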

Journal ArticleDOI
TL;DR: Though this framework is completely algorithmic, it provides solutions with optimal statistical performance and controlled algorithmic complexity for a large family of nonconvex optimization problems.
Abstract: We propose a computational framework named iterative local adaptive majorize-minimization (I-LAMM) to simultaneously control algorithmic complexity and statistical error when fitting high-dimensional models. I-LAMM is a two-stage algorithmic implementation of the local linear approximation to a family of folded concave penalized quasi-likelihood. The first stage solves a convex program with a crude precision tolerance to obtain a coarse initial estimator, which is further refined in the second stage by iteratively solving a sequence of convex programs with smaller precision tolerances. Theoretically, we establish a phase transition: the first stage has a sublinear iteration complexity, while the second stage achieves an improved linear rate of convergence. Though this framework is completely algorithmic, it provides solutions with optimal statistical performance and controlled algorithmic complexity for a large family of nonconvex optimization problems. The iteration effects on statistical errors are clearly demonstrated via a contraction property. Our theory relies on a localized version of the sparse/restricted eigenvalue condition, which allows us to analyze a large family of loss and penalty functions and provide optimality guarantees under very weak assumptions (for example, I-LAMM requires much weaker minimal signal strength than other procedures). Thorough numerical results are provided to support the obtained theory.
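A schematic, not the authors' algorithm: the two-stage structure (a coarse convex stage, followed by refinements that re-solve weighted $\ell_1$ problems given by a local linear approximation of a folded concave penalty, here SCAD) can be mimicked as follows; the tolerance control that drives the paper's complexity analysis is only imitated via solver tolerances.

```python
import numpy as np
from sklearn.linear_model import Lasso

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty, used as the coordinate-wise weight in
    the local linear approximation (LLA) of the folded concave penalty."""
    t = np.abs(t)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0) / ((a - 1) * lam) * (t > lam))

def two_stage_sketch(X, y, lam, n_refine=3):
    """Coarse convex stage (plain Lasso, loose tolerance), then refinement
    stages that re-solve weighted ell_1 problems with LLA weights and a
    tighter tolerance. The weighted Lasso is solved via column rescaling."""
    beta = Lasso(alpha=lam, tol=1e-2, fit_intercept=False).fit(X, y).coef_
    for _ in range(n_refine):
        w = np.maximum(scad_deriv(beta, lam), 1e-4 * lam) / lam   # weights in (0, 1]
        gamma = Lasso(alpha=lam, tol=1e-6, fit_intercept=False).fit(X / w, y).coef_
        beta = gamma / w                                          # undo the rescaling
    return beta
```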

Journal ArticleDOI
TL;DR: In this paper, the authors prove a central limit theorem for the components of the eigenvectors corresponding to the largest eigenvalues of the normalized Laplacian matrix of a finite dimensional random dot product graph.
Abstract: We prove a central limit theorem for the components of the eigenvectors corresponding to the $d$ largest eigenvalues of the normalized Laplacian matrix of a finite dimensional random dot product graph. As a corollary, we show that for stochastic blockmodel graphs, the rows of the spectral embedding of the normalized Laplacian converge to multivariate normals and, furthermore, the mean and the covariance matrix of each row are functions of the associated vertex’s block membership. Together with prior results for the eigenvectors of the adjacency matrix, we then compare, via the Chernoff information between multivariate normal distributions, how the choice of embedding method impacts subsequent inference. We demonstrate that neither embedding method dominates with respect to the inference task of recovering the latent block assignments.
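The embedding itself is straightforward to compute; a sketch assuming a dense adjacency matrix, with the eigenvectors scaled by the square roots of their eigenvalues as in the usual Laplacian spectral embedding convention:

```python
import numpy as np

def laplacian_spectral_embedding(A, d):
    """Rows of the returned matrix are the vertex embeddings: the eigenvectors
    of L = D^{-1/2} A D^{-1/2} for its d largest eigenvalues, each scaled by
    the square root of its eigenvalue."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    evals, evecs = np.linalg.eigh(L)          # ascending order
    return evecs[:, -d:] * np.sqrt(np.abs(evals[-d:]))
```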

Journal ArticleDOI
TL;DR: In this article, the authors show that consistency of hybrid methods based on greedy equivalence search (GES) can be achieved in the classical setting with adaptive restrictions on the search space that depend on the current state of the algorithm.
Abstract: The main approaches for learning Bayesian networks can be classified as constraint-based, score-based or hybrid methods. Although high-dimensional consistency results are available for constraint-based methods like the PC algorithm, such results have not been proved for score-based or hybrid methods, and most of the hybrid methods have not even been shown to be consistent in the classical setting where the number of variables remains fixed and the sample size tends to infinity. In this paper, we show that consistency of hybrid methods based on greedy equivalence search (GES) can be achieved in the classical setting with adaptive restrictions on the search space that depend on the current state of the algorithm. Moreover, we prove consistency of GES and adaptively restricted GES (ARGES) in several sparse high-dimensional settings. ARGES scales well to sparse graphs with thousands of variables and our simulation study indicates that both GES and ARGES generally outperform the PC algorithm.

Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of controlling the false discovery rate of a set of null hypotheses in an online manner, where the statistician must decide whether to reject a null hypothesis having access only to the previous decisions.
Abstract: Multiple hypothesis testing is a core problem in statistical inference and arises in almost every scientific field. Given a set of null hypotheses $\mathcal{H}(n)=(H_{1},\ldots,H_{n})$, Benjamini and Hochberg [J. R. Stat. Soc. Ser. B. Stat. Methodol. 57 (1995) 289–300] introduced the false discovery rate ($\mathrm{FDR}$), which is the expected proportion of false positives among rejected null hypotheses, and proposed a testing procedure that controls $\mathrm{FDR}$ below a pre-assigned significance level. Nowadays $\mathrm{FDR}$ is the criterion of choice for large-scale multiple hypothesis testing. In this paper we consider the problem of controlling $\mathrm{FDR}$ in an online manner. Concretely, we consider an ordered—possibly infinite—sequence of null hypotheses $\mathcal{H}=(H_{1},H_{2},H_{3},\ldots)$ where, at each step $i$, the statistician must decide whether to reject hypothesis $H_{i}$ having access only to the previous decisions. This model was introduced by Foster and Stine [J. R. Stat. Soc. Ser. B. Stat. Methodol. 70 (2008) 429–444]. We study a class of generalized alpha investing procedures, first introduced by Aharoni and Rosset [J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 (2014) 771–794]. We prove that any rule in this class controls online $\mathrm{FDR}$, provided $p$-values corresponding to true nulls are independent of the other $p$-values. Earlier work only established $\mathrm{mFDR}$ control. Next, we obtain conditions under which generalized alpha investing controls $\mathrm{FDR}$ in the presence of general $p$-value dependencies. We also develop a modified set of procedures that allow one to control the false discovery exceedance (the tail of the proportion of false discoveries). Finally, we evaluate the performance of online procedures on both synthetic and real data, comparing them with offline approaches, such as adaptive Benjamini–Hochberg.
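To fix ideas, a schematic alpha-investing loop is sketched below; the spending and payout rules are arbitrary illustrative choices, and the FDR guarantees of the paper apply only to rules satisfying the generalized alpha-investing conditions, which this fragment does not encode.

```python
def alpha_investing(p_values, wealth=0.05, payout=0.05):
    """Schematic online testing with an alpha-investing wealth process:
    each test spends part of the current wealth, and a rejection earns a
    payout. The spending rule below is an arbitrary illustration and is not
    guaranteed to satisfy the conditions under which the paper proves FDR
    control."""
    decisions = []
    for i, p in enumerate(p_values, start=1):
        if wealth <= 0:
            decisions.append(False)
            continue
        alpha_i = wealth / (2 * i)                      # spend a fraction of wealth
        reject = p <= alpha_i
        wealth += payout if reject else -alpha_i / (1 - alpha_i)
        decisions.append(reject)
    return decisions
```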

Journal ArticleDOI
TL;DR: In this article, the authors show that the test statistic of Tibshirani et al. is asymptotically valid, as the number of samples grows and the dimension of the regression problem stays fixed.
Abstract: Recently, Tibshirani et al. [J. Amer. Statist. Assoc. 111 (2016) 600–620] proposed a method for making inferences about parameters defined by model selection, in a typical regression setting with normally distributed errors. Here, we study the large sample properties of this method, without assuming normality. We prove that the test statistic of Tibshirani et al. (2016) is asymptotically valid, as the number of samples $n$ grows and the dimension $d$ of the regression problem stays fixed. Our asymptotic result holds uniformly over a wide class of nonnormal error distributions. We also propose an efficient bootstrap version of this test that is provably (asymptotically) conservative, and in practice, often delivers shorter intervals than those from the original normality-based approach. Finally, we prove that the test statistic of Tibshirani et al. (2016) does not enjoy uniform validity in a high-dimensional setting, when the dimension $d$ is allowed to grow.

Journal ArticleDOI
TL;DR: In this paper, it was shown that the optimal local minimax rate of estimating the parameters of a mixing distribution with a given number of components is $n^{-1/(4(m-m_{0})+2)}$.
Abstract: We study the rates of estimation of finite mixing distributions, that is, the parameters of the mixture. We prove that under some regularity and strong identifiability conditions, around a given mixing distribution with $m_{0}$ components, the optimal local minimax rate of estimation of a mixing distribution with $m$ components is $n^{-1/(4(m-m_{0})+2)}$. This corrects a previous paper by Chen [Ann. Statist. 23 (1995) 221–233]. By contrast, it turns out that there are estimators with a (nonuniform) pointwise rate of estimation of $n^{-1/2}$ for all mixing distributions with a finite number of components.

Journal ArticleDOI
TL;DR: A convexified modularity maximization approach for estimating the hidden communities under DCSBM is proposed, based on a convex programming relaxation of the classical (generalized) modularity maximization formulation, followed by a novel doubly-weighted $\ell_{1}$-norm $k$-medoids procedure.
Abstract: The stochastic block model (SBM), a popular framework for studying community detection in networks, is limited by the assumption that all nodes in the same community are statistically equivalent and have equal expected degrees. The degree-corrected stochastic block model (DCSBM) is a natural extension of SBM that allows for degree heterogeneity within communities. To find the communities under DCSBM, this paper proposes a convexified modularity maximization approach, which is based on a convex programming relaxation of the classical (generalized) modularity maximization formulation, followed by a novel doubly-weighted $\ell_{1}$-norm $k$-medoids procedure. We establish nonasymptotic theoretical guarantees for approximate and perfect clustering, both of which build on a new degree-corrected density gap condition. Our approximate clustering results are insensitive to the minimum degree, and hold even in the sparse regime with bounded average degrees. In the special case of SBM, our theoretical guarantees match the best-known results of computationally feasible algorithms. Numerically, we provide an efficient implementation of our algorithm, which is applied to both synthetic and real-world networks. Experimental results show that our method enjoys competitive performance compared to the state of the art in the literature.

Journal ArticleDOI
TL;DR: It is proved that the Bayes-UCB algorithm, which relies on quantiles of posterior distributions, is asymptotically optimal when the reward distributions belong to a one-dimensional exponential family, for a large class of prior distributions.
Abstract: This paper is about index policies for minimizing (frequentist) regret in a stochastic multi-armed bandit model that are inspired by a Bayesian view on the problem. Our main contribution is to prove the asymptotic optimality of Bayes-UCB, an algorithm based on quantiles of posterior distributions, when the reward distributions belong to a one-dimensional exponential family, for a large class of prior distributions. We also show that the Bayesian literature gives new insight on what kind of exploration rates could be used in frequentist, UCB-type algorithms. Indeed, approximations of the Bayesian optimal solution or the Finite Horizon Gittins indices suggest the introduction of two algorithms, KL-UCB+ and KL-UCB-H+, whose asymptotic optimality is also established.
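A minimal Bayes-UCB sketch for Bernoulli rewards with Beta(1,1) priors, where each arm's index is the posterior quantile at a level that rises toward 1 over time; the exact quantile schedule and prior below are illustrative choices rather than the paper's recommended settings.

```python
import numpy as np
from scipy.stats import beta

def bayes_ucb_bernoulli(pull, K, T, c=0):
    """Bayes-UCB for Bernoulli bandits with independent Beta(1, 1) priors:
    at time t, play the arm whose posterior quantile at level
    1 - 1/(t * (log T)^c) is largest. `pull(k)` returns a 0/1 reward for
    arm k; ties in the first rounds are broken arbitrarily."""
    succ, fail = np.ones(K), np.ones(K)
    for t in range(1, T + 1):
        level = 1 - 1 / (t * max(np.log(T), 1.0) ** c)
        indices = beta.ppf(level, succ, fail)       # posterior quantiles
        k = int(np.argmax(indices))
        r = pull(k)
        succ[k] += r
        fail[k] += 1 - r
    return succ - 1, fail - 1                       # empirical successes/failures
```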

Journal ArticleDOI
TL;DR: In this paper, a new estimator of the (element-wise) mean of a random matrix is proposed, which admits sub-Gaussian or sub-exponential concentration around the unknown mean in the operator norm.
Abstract: Estimation of the covariance matrix has attracted a lot of attention from the statistical research community over the years, partially due to important applications such as principal component analysis. However, the frequently used empirical covariance estimator and its modifications are very sensitive to the presence of outliers in the data. As P. Huber wrote [Ann. Math. Stat. 35 (1964) 73–101], “…This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): what happens if the true distribution deviates slightly from the assumed normal one? As is now well known, the sample mean then may have a catastrophically bad performance….” Motivated by Tukey’s question, we develop a new estimator of the (element-wise) mean of a random matrix, which includes the covariance estimation problem as a special case. Assuming that the entries of a matrix possess only finite second moment, this new estimator admits sub-Gaussian or sub-exponential concentration around the unknown mean in the operator norm. We explain the key ideas behind our construction, and discuss applications to covariance estimation and matrix completion problems.

Journal ArticleDOI
TL;DR: In this paper, the authors consider the estimation of the parameters of a Gaussian Stochastic Process (GaSP), in the context of emulation (approximation) of computer models for which the outcomes are real-valued scalars.
Abstract: We consider estimation of the parameters of a Gaussian Stochastic Process (GaSP), in the context of emulation (approximation) of computer models for which the outcomes are real-valued scalars. The main focus is on estimation of the GaSP parameters through various generalized maximum likelihood methods, mostly involving finding posterior modes; this is because full Bayesian analysis in computer model emulation is typically prohibitively expensive. The posterior modes that are studied arise from objective priors, such as the reference prior. These priors have been studied in the literature for the situation of an isotropic covariance function or under the assumption of separability in the design of inputs for model runs used in the GaSP construction. In this paper, we consider more general designs (e.g., a Latin Hypercube Design) with a class of commonly used anisotropic correlation functions, which can be written as a product of isotropic correlation functions, each having an unknown range parameter and a fixed roughness parameter. We discuss properties of the objective priors and marginal likelihoods for the parameters of the GaSP and establish the posterior propriety of the GaSP parameters, but our main focus is to demonstrate that certain parameterizations result in more robust estimation of the GaSP parameters than others, and that some parameterizations that are in common use should clearly be avoided. These results are applicable to many frequently used covariance functions, for example, power exponential, Matern, rational quadratic and spherical covariance. We also generalize the results to the GaSP model with a nugget parameter. Both theoretical and numerical evidence is presented concerning the performance of the studied procedures.

Journal ArticleDOI
TL;DR: An innovative method, namely the recommendation engine of multilayers (REM), for tensor recommender systems, which utilizes the structure of a tensor response to integrate information from multiple modes, and creates an additional layer of nested latent factors to accommodate between-subjects dependency.
Abstract: Recommender systems have been widely adopted by electronic commerce and entertainment industries for individualized prediction and recommendation, which benefit consumers and improve business intelligence. In this article, we propose an innovative method, namely the recommendation engine of multilayers (REM), for tensor recommender systems. The proposed method utilizes the structure of a tensor response to integrate information from multiple modes, and creates an additional layer of nested latent factors to accommodate between-subjects dependency. One major advantage is that the proposed method is able to address the “cold-start” issue in the absence of information from new customers, new products or new contexts. Specifically, it provides more effective recommendations through sub-group information. To achieve scalable computation, we develop a new algorithm for the proposed method, which incorporates a maximum block improvement strategy into the cyclic blockwise-coordinate-descent algorithm. In theory, we investigate algorithmic properties for convergence from an arbitrary initial point and local convergence, along with the asymptotic consistency of estimated parameters. Finally, the proposed method is applied in simulations and IRI marketing data with 116 million observations of product sales. Numerical studies demonstrate that the proposed method outperforms existing competitors in the literature.

Journal ArticleDOI
TL;DR: This paper proposes a penalized multi-stage A-learning for deriving the optimal dynamic treatment regime when the number of covariates is of the non-polynomial (NP) order of the sample size, and adopts the Dantzig selector, which directly penalizes the A-learning estimating equations.
Abstract: Precision medicine is a medical paradigm that focuses on finding the most effective treatment decision based on individual patient information. For many complex diseases, such as cancer, treatment decisions need to be tailored over time according to patients' responses to previous treatments. Such an adaptive strategy is referred to as a dynamic treatment regime. A major challenge in deriving an optimal dynamic treatment regime arises when an extraordinarily large number of prognostic factors, such as patient's genetic information, demographic characteristics, medical history and clinical measurements over time are available, but not all of them are necessary for making treatment decisions. This makes variable selection an emerging need in precision medicine. In this paper, we propose a penalized multi-stage A-learning for deriving the optimal dynamic treatment regime when the number of covariates is of the non-polynomial (NP) order of the sample size. To preserve the double robustness property of the A-learning method, we adopt the Dantzig selector, which directly penalizes the A-learning estimating equations. Oracle inequalities of the proposed estimators for the parameters in the optimal dynamic treatment regime and error bounds on the difference between the value functions of the estimated optimal dynamic treatment regime and the true optimal dynamic treatment regime are established. Empirical performance of the proposed approach is evaluated by simulations and illustrated with an application to data from the STAR*D study.

Journal ArticleDOI
TL;DR: In this paper, simultaneous confidence bands are constructed for a general moment condition model with high-dimensional parameters, where the Neyman orthogonality condition is assumed to be satisfied.
Abstract: In this paper, we develop procedures to construct simultaneous confidence bands for $\tilde{p}$ potentially infinite-dimensional parameters after model selection for general moment condition models, where $\tilde{p}$ is potentially much larger than the sample size of available data, $n$. This allows us to cover settings with functional response data where each of the $\tilde{p}$ parameters is a function. The procedure is based on the construction of score functions that satisfy the Neyman orthogonality condition approximately. The proposed simultaneous confidence bands rely on uniform central limit theorems for high-dimensional vectors (and not on Donsker arguments, as we allow for $\tilde{p}\gg n$). To construct the bands, we employ a multiplier bootstrap procedure which is computationally efficient as it only involves resampling the estimated score functions (and does not require resolving the high-dimensional optimization problems). We formally apply the general theory to inference on the regression coefficient process in the distribution regression model with a logistic link, where two implementations are analyzed in detail. Simulations and an application to real data are provided to help illustrate the applicability of the results.
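The multiplier bootstrap step can be sketched as follows, assuming the (approximately Neyman-orthogonal) score functions have already been estimated and are mean-zero; studentization and the construction of the scores themselves are omitted.

```python
import numpy as np

def multiplier_bootstrap_critical_value(scores, n_boot=1000, level=0.95, seed=0):
    """Gaussian multiplier bootstrap for the sup statistic: given an n x p
    array of estimated, mean-zero score functions, repeatedly draw
    xi_i ~ N(0, 1) and record sup_j | n^{-1/2} sum_i xi_i * scores[i, j] |.
    The returned quantile is the critical value used to form simultaneous
    bands."""
    rng = np.random.default_rng(seed)
    n, _ = scores.shape
    sups = np.array([
        np.max(np.abs(rng.standard_normal(n) @ scores)) / np.sqrt(n)
        for _ in range(n_boot)
    ])
    return np.quantile(sups, level)
```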

Journal ArticleDOI
TL;DR: In this paper, bounds on estimation error rates are obtained for regularization procedures of the form \begin{equation*}\hat{f}\in\mathop{\operatorname{argmin}}_{f\in F}(\frac{1}{N}\sum_{i=1}^{N}(Y_{i}-f(X_{i}))^{2}+\lambda \Psi(f))\end{equation*} when $\Psi$ is a norm and $F$ is convex.
Abstract: We obtain bounds on estimation error rates for regularization procedures of the form \begin{equation*}\hat{f}\in\mathop{\operatorname{argmin}}_{f\in F}(\frac{1}{N}\sum_{i=1}^{N}(Y_{i}-f(X_{i}))^{2}+\lambda \Psi(f))\end{equation*} when $\Psi$ is a norm and $F$ is convex. Our approach gives a common framework that may be used in the analysis of learning problems and regularization problems alike. In particular, it sheds some light on the role various notions of sparsity have in regularization and on their connection with the size of subdifferentials of $\Psi$ in a neighborhood of the true minimizer. As “proof of concept” we extend the known estimates for the LASSO, SLOPE and trace norm regularization.

Journal ArticleDOI
TL;DR: In this paper, an intrinsic principal component analysis for smooth Riemannian manifold-valued functional data is proposed and its asymptotic properties are studied; as an application, longitudinal compositional data are analyzed by mapping them to trajectories on the sphere.
Abstract: Functional data analysis on nonlinear manifolds has drawn recent interest. Sphere-valued functional data, which are encountered, for example, as movement trajectories on the surface of the earth are an important special case. We consider an intrinsic principal component analysis for smooth Riemannian manifold-valued functional data and study its asymptotic properties. Riemannian functional principal component analysis (RFPCA) is carried out by first mapping the manifold-valued data through Riemannian logarithm maps to tangent spaces around the Frechet mean function, and then performing a classical functional principal component analysis (FPCA) on the linear tangent spaces. Representations of the Riemannian manifold-valued functions and the eigenfunctions on the original manifold are then obtained with exponential maps. The tangent-space approximation yields upper bounds to residual variances if the Riemannian manifold has nonnegative curvature. We derive a central limit theorem for the mean function, as well as root-$n$ uniform convergence rates for other model components. Our applications include a novel framework for the analysis of longitudinal compositional data, achieved by mapping longitudinal compositional data to trajectories on the sphere, illustrated with longitudinal fruit fly behavior patterns. RFPCA is shown to outperform an unrestricted FPCA in terms of trajectory recovery and prediction in applications and simulations.
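A per-time-point sketch of the tangent-space construction on the sphere, assuming the Frechet mean is already available; the actual RFPCA is a functional PCA carried out jointly across the whole trajectory, which this fragment does not attempt.

```python
import numpy as np

def sphere_log(p, x):
    """Riemannian log map on the unit sphere: the tangent vector at p
    pointing toward x, with length equal to the geodesic distance."""
    c = np.clip(np.dot(p, x), -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return np.zeros_like(p)
    return theta / np.sin(theta) * (x - c * p)

def sphere_exp(p, v):
    """Riemannian exponential map on the unit sphere."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return p.copy()
    return np.cos(norm) * p + np.sin(norm) * v / norm

def tangent_pca(mu, observations, d):
    """Map sphere-valued observations to the tangent space at the (given)
    Frechet mean mu, run ordinary PCA there, and map the leading principal
    directions back to the sphere with the exponential map."""
    V = np.array([sphere_log(mu, x) for x in observations])
    V = V - V.mean(axis=0)
    _, _, Vt = np.linalg.svd(V, full_matrices=False)
    return [sphere_exp(mu, direction) for direction in Vt[:d]]
```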