
Showing papers in "arXiv: Statistics Theory in 2016"


Journal ArticleDOI
TL;DR: Two generalizations of the network model are presented, establishing network modeling as part of the more general framework of structural equation modeling (SEM), and it is shown that, within this framework, identifiable models can be obtained in which local independence is structurally violated.
Abstract: We introduce the network model as a formal psychometric model, conceptualizing the covariance between psychometric indicators as resulting from pairwise interactions between observable variables in a network structure. This contrasts with standard psychometric models, in which the covariance between test items arises from the influence of one or more common latent variables. Here, we present two generalizations of the network model that encompass latent variable structures, establishing network modeling as part of the more general framework of Structural Equation Modeling (SEM). In the first generalization, we model the covariance structure of latent variables as a network. We term this framework Latent Network Modeling (LNM) and show that, with LNM, a unique structure of conditional independence relationships between latent variables can be obtained in an explorative manner. In the second generalization, the residual variance-covariance structure of indicators is modeled as a network. We term this generalization Residual Network Modeling (RNM) and show that, within this framework, identifiable models can be obtained in which local independence is structurally violated. These generalizations allow for a general modeling framework that can be used to fit, and compare, SEM models, network models, and the RNM and LNM generalizations. This methodology has been implemented in the free-to-use software package lvnet, which contains confirmatory model testing as well as two exploratory search algorithms: stepwise search algorithms for low-dimensional datasets and penalized maximum likelihood estimation for larger datasets. We show in simulation studies that these search algorithms perform adequately in identifying the structure of the relevant residual or latent networks. We further demonstrate the utility of these generalizations in an empirical example on a personality inventory dataset.

233 citations


Journal ArticleDOI
TL;DR: In this article, the authors show that several machine learning estimators, including square-root LASSO (Least Absolute Shrinkage and Selection Operator) and regularized logistic regression, can be represented as solutions to distributionally robust optimization problems.
Abstract: We show that several machine learning estimators, including square-root LASSO (Least Absolute Shrinkage and Selection Operator) and regularized logistic regression, can be represented as solutions to distributionally robust optimization (DRO) problems. The associated uncertainty regions are based on suitably defined Wasserstein distances. Hence, our representations allow us to view regularization as a result of introducing an artificial adversary that perturbs the empirical distribution to account for out-of-sample effects in loss estimation. In addition, we introduce RWPI (Robust Wasserstein Profile Inference), a novel inference methodology which extends the use of methods inspired by Empirical Likelihood to the setting of optimal transport costs (of which Wasserstein distances are a particular case). We use RWPI to show how to optimally select the size of uncertainty regions, and as a consequence, we are able to choose regularization parameters for these machine learning estimators without the use of cross validation. Numerical experiments are also given to validate our theoretical findings.

230 citations


Posted Content
TL;DR: Private versions of classical information-theoretical bounds, in particular those due to Le Cam, Fano, and Assouad, are developed to allow for a precise characterization of statistical rates under local privacy constraints and the development of provably (minimax) optimal estimation procedures.
Abstract: Working under a model of privacy in which data remains private even from the statistician, we study the tradeoff between privacy guarantees and the risk of the resulting statistical estimators. We develop private versions of classical information-theoretic bounds, in particular those due to Le Cam, Fano, and Assouad. These inequalities allow for a precise characterization of statistical rates under local privacy constraints and the development of provably (minimax) optimal estimation procedures. We provide a treatment of several canonical families of problems: mean estimation and median estimation, generalized linear models, and nonparametric density estimation. For all of these families, we provide lower and upper bounds that match up to constant factors, and exhibit new (optimal) privacy-preserving mechanisms and computationally efficient estimators that achieve the bounds. Additionally, we present a variety of experimental results for estimation problems involving sensitive data, including salaries, censored blog posts and articles, and drug abuse; these experiments demonstrate the importance of deriving optimal procedures.

225 citations


ReportDOI
TL;DR: In this article, a general construction of locally robust/orthogonal moment functions for GMM is given, where moment conditions have zero derivative with respect to first steps, along with debiased machine learning estimators of functionals of high dimensional conditional quantiles and of dynamic discrete choice parameters with high dimensional state variables.
Abstract: Many economic and causal parameters depend on nonparametric or high dimensional first steps. We give a general construction of locally robust/orthogonal moment functions for GMM, where moment conditions have zero derivative with respect to first steps. We show that orthogonal moment functions can be constructed by adding to identifying moments the nonparametric influence function for the effect of the first step on identifying moments. Orthogonal moments reduce model selection and regularization bias, as is very important in many applications, especially for machine learning first steps. We give debiased machine learning estimators of functionals of high dimensional conditional quantiles and of dynamic discrete choice parameters with high dimensional state variables. We show that adding to identifying moments the nonparametric influence function provides a general construction of orthogonal moments, including regularity conditions, and show that the nonparametric influence function is robust to additional unknown functions on which it depends. We give a general approach to estimating the unknown functions in the nonparametric influence function and use it to automatically debias estimators of functionals of high dimensional conditional location learners. We give a variety of new doubly robust moment equations and characterize double robustness. We give general and simple regularity conditions and apply these for asymptotic inference on functionals of high dimensional regression quantiles and dynamic discrete choice parameters with high dimensional state variables.

201 citations


Book ChapterDOI
TL;DR: In this paper, the authors review important aspects of semiparametric theory and empirical processes that arise in causal inference problems and discuss estimation and inference for causal effects under semi-parametric models, which allow parts of the data generating process to be unrestricted if they are not of particular interest.
Abstract: In this paper we review important aspects of semiparametric theory and empirical processes that arise in causal inference problems. We begin with a brief introduction to the general problem of causal inference, and go on to discuss estimation and inference for causal effects under semiparametric models, which allow parts of the data-generating process to be unrestricted if they are not of particular interest (i.e., nuisance functions). These models are very useful in causal problems because the outcome process is often complex and difficult to model, and there may only be information available about the treatment process (at best). Semiparametric theory gives a framework for benchmarking efficiency and constructing estimators in such settings. In the second part of the paper we discuss empirical process theory, which provides powerful tools for understanding the asymptotic behavior of semiparametric estimators that depend on flexible nonparametric estimators of nuisance functions. These tools are crucial for incorporating machine learning and other modern methods into causal inference analyses. We conclude by examining related extensions and future directions for work in semiparametric causal inference.

145 citations


Posted Content
TL;DR: Lower bounds on the regret in multi-armed bandit problems are revisited; the resulting bounds show that in an initial phase the regret grows almost linearly, and that the well-known logarithmic growth of the regret only holds in a final phase.
Abstract: We revisit lower bounds on the regret in the case of multi-armed bandit problems. We obtain non-asymptotic, distribution-dependent bounds and provide straightforward proofs based only on well-known properties of Kullback-Leibler divergences. These bounds show in particular that in an initial phase the regret grows almost linearly, and that the well-known logarithmic growth of the regret only holds in a final phase. The proof techniques come to the essence of the information-theoretic arguments used and they are deprived of all unnecessary complications.

119 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a transformation approach to map probability densities to a Hilbert space of functions through a continuous and invertible map, and derive representations of the densities themselves by applying the inverse map from the linear functional space to the density space.
Abstract: Functional data that are nonnegative and have a constrained integral can be considered as samples of one-dimensional density functions. Such data are ubiquitous. Due to the inherent constraints, densities do not live in a vector space and, therefore, commonly used Hilbert space based methods of functional data analysis are not applicable. To address this problem, we introduce a transformation approach, mapping probability densities to a Hilbert space of functions through a continuous and invertible map. Basic methods of functional data analysis, such as the construction of functional modes of variation, functional regression or classification, are then implemented by using representations of the densities in this linear space. Representations of the densities themselves are obtained by applying the inverse map from the linear functional space to the density space. Transformations of interest include log quantile density and log hazard transformations, among others. Rates of convergence are derived for the representations that are obtained for a general class of transformations under certain structural properties. If the subject-specific densities need to be estimated from data, these rates correspond to the optimal rates of convergence for density estimation. The proposed methods are illustrated through simulations and applications in brain imaging.

115 citations


Posted Content
TL;DR: In this article, the authors consider the problem of sampling a high-dimensional probability distribution having a density with respect to the Lebesgue measure on the Euclidean space known up to a normalization constant, and obtain nonasymptotic bounds for the convergence to stationarity in Wasserstein distance of order $2$ and total variation distance of the sampling method based on the Euler discretization of the Langevin stochastic differential equation, for both constant and decreasing step sizes.
Abstract: We consider in this paper the problem of sampling a high-dimensional probability distribution $\pi$ having a density with respect to the Lebesgue measure on $\mathbb{R}^d$, known up to a normalization constant $x \mapsto \pi(x)= \mathrm{e}^{-U(x)}/\int_{\mathbb{R}^d} \mathrm{e}^{-U(y)} \mathrm{d} y$. Such a problem naturally occurs for example in Bayesian inference and machine learning. Under the assumption that $U$ is continuously differentiable, $\nabla U$ is globally Lipschitz and $U$ is strongly convex, we obtain non-asymptotic bounds for the convergence to stationarity in Wasserstein distance of order $2$ and total variation distance of the sampling method based on the Euler discretization of the Langevin stochastic differential equation, for both constant and decreasing step sizes. The dependence on the dimension of the state space of these bounds is explicit. The convergence of an appropriately weighted empirical measure is also investigated and bounds for the mean square error and exponential deviation inequality are reported for functions which are measurable and bounded. An illustration to Bayesian inference for binary regression is presented to support our claims.

112 citations
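
The sampling method named in the abstract above, the Euler discretization of the Langevin stochastic differential equation, can be illustrated with a minimal one-dimensional sketch. The function names, the standard normal target, the step size, and the burn-in length are assumptions chosen for illustration and are not taken from the paper.

```python
import math
import random


def ula_samples(grad_u, x0, step, n_steps, rng):
    """Euler discretization of dX = -grad U(X) dt + sqrt(2) dB in one dimension.

    Each iteration applies x <- x - step * grad_u(x) + sqrt(2 * step) * z,
    where z is a fresh standard normal draw.
    """
    x = x0
    noise_scale = math.sqrt(2.0 * step)
    samples = []
    for _ in range(n_steps):
        x = x - step * grad_u(x) + noise_scale * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


def mean_and_variance(values):
    """Empirical mean and (population) variance of a list of floats."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v


# Hypothetical target: standard normal, U(x) = x^2 / 2, so grad U(x) = x.
rng = random.Random(0)
chain = ula_samples(lambda x: x, x0=5.0, step=0.1, n_steps=25000, rng=rng)
mean, var = mean_and_variance(chain[5000:])  # discard an initial burn-in
```

For this quadratic potential the recursion is a Gaussian AR(1) process whose stationary variance is 1 / (1 - step / 2), slightly above the target variance of 1, so a constant step size leaves a small bias that shrinks as the step decreases.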


Journal ArticleDOI
TL;DR: In this paper, a nonparametric test of random utility models is proposed to test the null hypothesis that a sample of cross-sectional demand distributions was generated by a population of rational consumers.
Abstract: This paper develops and implements a nonparametric test of Random Utility Models. The motivating application is to test the null hypothesis that a sample of cross-sectional demand distributions was generated by a population of rational consumers. We test a necessary and sufficient condition for this that does not rely on any restriction on unobserved heterogeneity or the number of goods. We also propose and implement a control function approach to account for endogenous expenditure. An econometric result of independent interest is a test for linear inequality constraints when these are represented as the vertices of a polyhedron rather than its faces. An empirical application to the U.K. Household Expenditure Survey illustrates computational feasibility of the method in demand problems with 5 goods.

104 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider the problem of estimating the mean outcome under an individualized treatment strategy defined as the treatment rule that maximizes the population mean outcome, where the candidate treatment rules are restricted to depend on baseline covariates.
Abstract: We consider challenges that arise in the estimation of the mean outcome under an optimal individualized treatment strategy defined as the treatment rule that maximizes the population mean outcome, where the candidate treatment rules are restricted to depend on baseline covariates. We prove a necessary and sufficient condition for the pathwise differentiability of the optimal value, a key condition needed to develop a regular and asymptotically linear (RAL) estimator of the optimal value. The stated condition is slightly more general than the previous condition implied in the literature. We then describe an approach to obtain root-$n$ rate confidence intervals for the optimal value even when the parameter is not pathwise differentiable. We provide conditions under which our estimator is RAL and asymptotically efficient when the mean outcome is pathwise differentiable. We also outline an extension of our approach to a multiple time point problem. All of our results are supported by simulations.

98 citations


Posted Content
TL;DR: In this article, the authors study hypothesis testing subject to differential privacy, specifically chi-squared tests for goodness of fit for multinomial data and independence between two categorical variables.
Abstract: Hypothesis testing is a useful statistical tool in determining whether a given model should be rejected based on a sample from the population. Sample data may contain sensitive information about individuals, such as medical information. Thus it is important to design statistical tests that guarantee the privacy of subjects in the data. In this work, we study hypothesis testing subject to differential privacy, specifically chi-squared tests for goodness of fit for multinomial data and independence between two categorical variables. We propose new tests for goodness of fit and independence testing that like the classical versions can be used to determine whether a given model should be rejected or not, and that additionally can ensure differential privacy. We give both Monte Carlo based hypothesis tests as well as hypothesis tests that more closely follow the classical chi-squared goodness of fit test and the Pearson chi-squared test for independence. Crucially, our tests account for the distribution of the noise that is injected to ensure privacy in determining significance. We show that these tests can be used to achieve desired significance levels, in sharp contrast to direct applications of classical tests to differentially private contingency tables which can result in wildly varying significance levels. Moreover, we study the statistical power of these tests. We empirically show that to achieve the same level of power as the classical non-private tests our new tests need only a relatively modest increase in sample size.

Posted Content
TL;DR: This paper proposes a bootstrap-assisted procedure to conduct simultaneous inference for high dimensional sparse linear models based on the recent de-sparsifying Lasso estimator, which allows the dimension of the parameter vector of interest to be exponentially larger than sample size.
Abstract: This paper proposes a bootstrap-assisted procedure to conduct simultaneous inference for high dimensional sparse linear models based on the recent de-sparsifying Lasso estimator (van de Geer et al. 2014). Our procedure allows the dimension of the parameter vector of interest to be exponentially larger than sample size, and it automatically accounts for the dependence within the de-sparsifying Lasso estimator. Moreover, our simultaneous testing method can be naturally coupled with the margin screening (Fan and Lv 2008) to enhance its power in sparse testing with a reduced computational cost, or with the step-down method (Romano and Wolf 2005) to provide a strong control for the family-wise error rate. In theory, we prove that our simultaneous testing procedure asymptotically achieves the pre-specified significance level, and enjoys certain optimality in terms of its power even when the model errors are non-Gaussian. Our general theory is also useful in studying the support recovery problem. To broaden the applicability, we further extend our main results to generalized linear models with convex loss functions. The effectiveness of our methods is demonstrated via simulation studies.

Posted Content
TL;DR: Central limit theorems for randomization-based causal inference are studied in this paper, where the parameters of interest are functions of a finite population and randomness comes solely from the treatment assignment.
Abstract: Frequentist inference often delivers point estimators associated with confidence intervals or sets for parameters of interest. Constructing the confidence intervals or sets requires understanding the sampling distributions of the point estimators, which, in many but not all cases, are related to asymptotic Normal distributions ensured by central limit theorems. Although previous literature has established various forms of central limit theorems for statistical inference in super population models, we still need general and convenient forms of central limit theorems for some randomization-based causal analysis of experimental data, where the parameters of interest are functions of a finite population and randomness comes solely from the treatment assignment. We use central limit theorems for sample surveys and rank statistics to establish general forms of the finite population central limit theorems that are particularly useful for proving asymptotic distributions of randomization tests under the sharp null hypothesis of zero individual causal effects, and for obtaining the asymptotic repeated sampling distributions of the causal effect estimators. The new central limit theorems hold for general experimental designs with multiple treatment levels and multiple treatment factors, and are immediately applicable for studying the asymptotic properties of many methods in causal inference, including instrumental variable, regression adjustment, rerandomization, clustered randomized experiments, and so on. Previously, the asymptotic properties of these problems were often based on heuristic arguments, which in fact rely on general forms of finite population central limit theorems that have not been established before. Our new theorems fill in this gap by providing a more solid theoretical foundation for asymptotic randomization-based causal inference.

Posted Content
TL;DR: This paper investigates the performance of Lloyd's algorithm on clustering sub-Gaussian mixtures, and extends the algorithm and its analysis to community detection and crowdsourcing, two problems that have received a lot of attention recently in statistics and machine learning.
Abstract: Clustering is a fundamental problem in statistics and machine learning. Lloyd's algorithm, proposed in 1957, is still possibly the most widely used clustering algorithm in practice due to its simplicity and empirical performance. However, there has been little theoretical investigation on the statistical and computational guarantees of Lloyd's algorithm. This paper is an attempt to bridge this gap between practice and theory. We investigate the performance of Lloyd's algorithm on clustering sub-Gaussian mixtures. Under an appropriate initialization for labels or centers, we show that Lloyd's algorithm converges to an exponentially small clustering error after an order of $\log n$ iterations, where $n$ is the sample size. The error rate is shown to be minimax optimal. For the two-mixture case, we only require the initializer to be slightly better than random guess. In addition, we extend Lloyd's algorithm and its analysis to community detection and crowdsourcing, two problems that have received a lot of attention recently in statistics and machine learning. Two variants of Lloyd's algorithm are proposed respectively for community detection and crowdsourcing. On the theoretical side, we provide statistical and computational guarantees of the two algorithms, and the results improve upon some previous signal-to-noise ratio conditions in literature for both problems. Experimental results on simulated and real data sets demonstrate that our algorithms are competitive with state-of-the-art methods.
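
For readers unfamiliar with the procedure analyzed in the abstract above, the following is a minimal sketch of the classical Lloyd iteration (alternating a nearest-center assignment step and a cluster-mean update step). The function name, the toy data, and the initial centers are hypothetical, and the sketch does not include the initialization schemes or the community detection and crowdsourcing variants studied in the paper.

```python
def lloyd(points, centers, n_iter=20):
    """Classical Lloyd iteration for points given as tuples of floats."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest center.
        labels = [
            min(range(len(centers)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(x, centers[k])))
            for x in points
        ]
        # Update step: move each center to the mean of the points assigned to it.
        new_centers = []
        for k in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # assignments and centers have stopped changing
        centers = new_centers
    return centers, labels


# Hypothetical toy data: two well-separated groups, with both initial centers
# deliberately placed inside the first group.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = lloyd(points, centers=[(0.0, 0.0), (1.0, 0.0)])
```

On this toy input the iteration recovers the two groups despite the poor starting centers; in general the outcome depends on the initialization, which is why the abstract states its guarantees under an appropriate initialization.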

Posted Content
TL;DR: In this paper, the authors give a complete characterization of the complexity of best-arm identification in one-parameter bandit problems and prove a tight lower bound on the sample complexity.
Abstract: We give a complete characterization of the complexity of best-arm identification in one-parameter bandit problems. We prove a new, tight lower bound on the sample complexity. We propose the `Track-and-Stop' strategy, which we prove to be asymptotically optimal. It consists in a new sampling rule (which tracks the optimal proportions of arm draws highlighted by the lower bound) and in a stopping rule named after Chernoff, for which we give a new analysis.

Posted Content
TL;DR: The main goal here is to show that Shapley value removes the conceptual problems that alternatives based on the ANOVA decomposition run into when the input variables are dependent.
Abstract: This paper makes the case for using Shapley value to quantify the importance of random input variables to a function. Alternatives based on the ANOVA decomposition can run into conceptual and computational problems when the input variables are dependent. Our main goal here is to show that Shapley value removes the conceptual problems. We do this with some simple examples where Shapley value leads to intuitively reasonable nearly closed form values.
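
As an illustration of the quantity discussed in the abstract above, the sketch below computes exact Shapley values by averaging marginal contributions over all orderings of the inputs. The toy setting is an assumption made here and not an example taken from the paper: two standard Gaussian inputs with correlation rho, the function f(x1, x2) = x1, and the value of a subset defined as the variance of f explained by that subset.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical toy model: x1, x2 standard Gaussian with correlation rho and
# f(x1, x2) = x1. The explained variance Var(E[f | x_S]) is 1 when x1 is in S,
# rho^2 when only x2 is in S, and 0 for the empty set.
rho = 0.5
explained = {
    frozenset(): 0.0,
    frozenset({"x1"}): 1.0,
    frozenset({"x2"}): rho ** 2,
    frozenset({"x1", "x2"}): 1.0,
}
importance = shapley_values(["x1", "x2"], lambda s: explained[frozenset(s)])
```

In this toy case the two values are 1 - rho^2 / 2 and rho^2 / 2, which sum to the total variance of f; the second input receives some importance purely through its dependence on the first. Enumerating all orderings costs a factorial number of evaluations, so this approach only suits a handful of inputs.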

Posted Content
TL;DR: A new procedure is introduced, the so-called median-of-means tournament, that achieves the optimal tradeoff between accuracy and confidence under minimal assumptions, and in particular outperforms classical methods based on empirical risk minimization.
Abstract: We consider the classical statistical learning/regression problem, when the value of a real random variable Y is to be predicted based on the observation of another random variable X. Given a class of functions F and a sample of independent copies of (X, Y ), one needs to choose a function f from F such that f(X) approximates Y as well as possible, in the mean-squared sense. We introduce a new procedure, the so-called median-of-means tournament, that achieves the optimal tradeoff between accuracy and confidence under minimal assumptions, and in particular outperforms classical methods based on empirical risk minimization.
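
The abstract above does not spell out the tournament procedure, so the sketch below shows only the simpler, standard median-of-means estimate of a mean, the idea the procedure is named after. It is not the paper's method; the function name, block count, and toy data are assumptions for illustration.

```python
import statistics


def median_of_means(values, n_blocks):
    """Split the data into equal-size blocks, average each block, take the median.

    Values left over when len(values) is not a multiple of n_blocks are dropped.
    """
    values = list(values)
    block_size = len(values) // n_blocks
    if block_size == 0:
        raise ValueError("need at least one value per block")
    block_means = [
        statistics.mean(values[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]
    return statistics.median(block_means)


# Hypothetical data: 30 well-behaved values with mean 2, one of which is
# replaced by a gross outlier.
data = [1.0, 2.0, 3.0] * 10
data[0] = 1e6
robust_estimate = median_of_means(data, n_blocks=5)
plain_mean = statistics.mean(data)
```

The single outlier corrupts only one of the five block means, so the median of the block means stays at 2, while the ordinary sample mean is pulled above 30000.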

Posted Content
TL;DR: It is shown that the classical algorithm of alternating projections (Gerchberg–Saxton) succeeds with high probability when carefully initialized, and it is conjectured that it can still succeed with probability close to 1 when no special initialization procedure is used.
Abstract: We consider a phase retrieval problem, where we want to reconstruct an $n$-dimensional vector from its phaseless scalar products with $m$ sensing vectors. We assume the sensing vectors to be independently sampled from complex normal distributions. We propose to solve this problem with the classical non-convex method of alternating projections. We show that, when $m\geq Cn$ for $C$ large enough, alternating projections succeed with high probability, provided that they are carefully initialized. We also show that there is a regime in which the stagnation points of the alternating projections method disappear, and the initialization procedure becomes useless. However, in this regime, $m$ has to be of the order of $n^2$. Finally, we conjecture from our numerical experiments that, in the regime $m=O(n)$, there are stagnation points, but the size of their attraction basin is small if $m/n$ is large enough, so alternating projections can succeed with probability close to $1$ even with no special initialization.

Journal ArticleDOI
TL;DR: A key element in this approach is to demonstrate that when the classical phase variation assumptions of Functional Data Analysis are applied to the point process case, they become equivalent to conditions interpretable through the prism of the theory of optimal transportation of measure.
Abstract: We develop a canonical framework for the study of the problem of registration of multiple point processes subjected to warping, known as the problem of separation of amplitude and phase variation. The amplitude variation of a real random function $\{Y(x):x\in[0,1]\}$ corresponds to its random oscillations in the $y$-axis, typically encapsulated by its (co)variation around a mean level. In contrast, its phase variation refers to fluctuations in the $x$-axis, often caused by random time changes. We formalise similar notions for a point process, and nonparametrically separate them based on realisations of i.i.d. copies $\{\Pi_i\}$ of the phase-varying point process. A key element in our approach is to demonstrate that when the classical phase variation assumptions of Functional Data Analysis (FDA) are applied to the point process case, they become equivalent to conditions interpretable through the prism of the theory of optimal transportation of measure. We demonstrate that these induce a natural Wasserstein geometry tailored to the warping problem, including a formal notion of bias expressing over-registration. Within this framework, we construct nonparametric estimators that tend to avoid over-registration in finite samples. We show that they consistently estimate the warp maps, consistently estimate the structural mean, and consistently register the warped point processes, even in a sparse sampling regime. We also establish convergence rates, and derive $\sqrt{n}$-consistency and a central limit theorem in the Cox process case under dense sampling, showing rate optimality of our structural mean estimator in that case.

Posted Content
TL;DR: This paper proposes to apply the penalized least-squares approach to the appropriately truncated or shrunk data and gives a robust covariance estimator with concentration inequality and optimal rate of convergence in terms of the spectral norm, when the samples only bear bounded fourth moment.
Abstract: This paper introduces a simple principle for robust high-dimensional statistical inference via an appropriate shrinkage on the data. This widens the scope of high-dimensional techniques, reducing the moment conditions from sub-exponential or sub-Gaussian distributions to merely bounded second or fourth moment. As an illustration of this principle, we focus on robust estimation of the low-rank matrix $\Theta^*$ from the trace regression model $Y=Tr (\Theta^{*T}X) +\epsilon$. It encompasses four popular problems: sparse linear models, compressed sensing, matrix completion and multi-task regression. We propose to apply a penalized least-squares approach to appropriately truncated or shrunk data. Under only a bounded $2+\delta$ moment condition on the response, the proposed robust methodology yields an estimator that possesses the same statistical error rates as previous literature with sub-Gaussian errors. For sparse linear models and multi-task regression, we further allow the design to have only bounded fourth moment and obtain the same statistical rates, again, by appropriate shrinkage of the design matrix. As a byproduct, we give a robust covariance matrix estimator and establish its concentration inequality in terms of the spectral norm when the random samples have only bounded fourth moment. Extensive simulations have been carried out to support our theories.

Posted Content
TL;DR: This paper introduces a new way to compact a continuous probability distribution into a set of representative points called support points, obtained by minimizing the energy distance; the minimization can be formulated as a difference-of-convex program, for which two algorithms efficiently generate representative point sets.
Abstract: This paper introduces a new way to compact a continuous probability distribution $F$ into a set of representative points called support points. These points are obtained by minimizing the energy distance, a statistical potential measure initially proposed by Sz\'ekely and Rizzo (2004) for testing goodness-of-fit. The energy distance has two appealing features. First, its distance-based structure allows us to exploit the duality between powers of the Euclidean distance and its Fourier transform for theoretical analysis. Using this duality, we show that support points converge in distribution to $F$, and enjoy an improved error rate over Monte Carlo for integrating a large class of functions. Second, the minimization of the energy distance can be formulated as a difference-of-convex program, which we manipulate using two algorithms to efficiently generate representative point sets. In simulation studies, support points provide improved integration performance over both Monte Carlo and a specific Quasi-Monte Carlo method. Two important applications of support points are then highlighted: (a) as a way to quantify the propagation of uncertainty in expensive simulations, and (b) as a method to optimally compact Markov chain Monte Carlo (MCMC) samples in Bayesian computation.

Posted Content
TL;DR: The fundamental limitations of statistical methods are studied, including non-spectral ones, and it is shown that inefficient procedures can work below the threshold where PCA succeeds, whereas no known efficient algorithm achieves this.
Abstract: A central problem of random matrix theory is to understand the eigenvalues of spiked random matrix models, in which a prominent eigenvector is planted into a random matrix. These distributions form natural statistical models for principal component analysis (PCA) problems throughout the sciences. Baik, Ben Arous and Peche showed that the spiked Wishart ensemble exhibits a sharp phase transition asymptotically: when the signal strength is above a critical threshold, it is possible to detect the presence of a spike based on the top eigenvalue, and below the threshold the top eigenvalue provides no information. Such results form the basis of our understanding of when PCA can detect a low-rank signal in the presence of noise. However, not all the information about the spike is necessarily contained in the spectrum. We study the fundamental limitations of statistical methods, including non-spectral ones. Our results include: I) For the Gaussian Wigner ensemble, we show that PCA achieves the optimal detection threshold for a variety of benign priors for the spike. We extend previous work on the spherically symmetric and i.i.d. Rademacher priors through an elementary, unified analysis. II) For any non-Gaussian Wigner ensemble, we show that PCA is always suboptimal for detection. However, a variant of PCA achieves the optimal threshold (for benign priors) by pre-transforming the matrix entries according to a carefully designed function. This approach has been stated before, and we give a rigorous and general analysis. III) For both the Gaussian Wishart ensemble and various synchronization problems over groups, we show that inefficient procedures can work below the threshold where PCA succeeds, whereas no known efficient algorithm achieves this. This conjectural gap between what is statistically possible and what can be done efficiently remains open.

Posted Content
TL;DR: In this paper, the authors show that the two-dimensional total variation denoiser satisfies a sharp oracle inequality that leads to near optimal rates of estimation for a large class of image models such as bi-isotonic, Hölder smooth and cartoons.
Abstract: Motivated by its practical success, we show that the two-dimensional total variation denoiser satisfies a sharp oracle inequality that leads to near optimal rates of estimation for a large class of image models such as bi-isotonic, Hölder smooth and cartoons. Our analysis hinges on properties of the unnormalized Laplacian of the two-dimensional grid such as eigenvector delocalization and spectral decay. We also present extensions to more than two dimensions as well as several other graphs.

Journal ArticleDOI
TL;DR: In this paper, the directions of arrival (DOA) of plane waves are estimated from multi-snapshot sensor array data using Sparse Bayesian Learning (SBL), where prior source amplitudes are assumed independent zero-mean complex Gaussian distributed with hyperparameters the unknown variances (i.e. the source powers).
Abstract: The directions of arrival (DOA) of plane waves are estimated from multi-snapshot sensor array data using Sparse Bayesian Learning (SBL). The prior source amplitudes are assumed independent zero-mean complex Gaussian distributed with hyperparameters the unknown variances (i.e. the source powers). For a complex Gaussian likelihood with hyperparameter the unknown noise variance, the corresponding Gaussian posterior distribution is derived. For a given number of DOAs, the hyperparameters are automatically selected by maximizing the evidence and promote sparse DOA estimates. The SBL scheme for DOA estimation is discussed and evaluated competitively against LASSO ($\ell_1$-regularization), conventional beamforming, and MUSIC.

Posted Content
TL;DR: This work bridges non-linear and non-parametric function estimation and includes single-hidden layer nets and shows that the risk is small even when the input dimension of an infinite-dimensional parameterized dictionary is much larger than the available sample size.
Abstract: Let $ f^{\star} $ be a function on $ \mathbb{R}^d $ with an assumption of a spectral norm $ v_{f^{\star}} $. For various noise settings, we show that $ \mathbb{E}\|\hat{f} - f^{\star} \|^2 \leq \left(v^4_{f^{\star}}\frac{\log d}{n}\right)^{1/3} $, where $ n $ is the sample size and $ \hat{f} $ is either a penalized least squares estimator or a greedily obtained version of such using linear combinations of sinusoidal, sigmoidal, ramp, ramp-squared or other smooth ridge functions. The candidate fits may be chosen from a continuum of functions, thus avoiding the rigidity of discretizations of the parameter space. On the other hand, if the candidate fits are chosen from a discretization, we show that $ \mathbb{E}\|\hat{f} - f^{\star} \|^2 \leq \left(v^3_{f^{\star}}\frac{\log d}{n}\right)^{2/5} $. This work bridges non-linear and non-parametric function estimation and includes single-hidden layer nets. Unlike past theory for such settings, our bound shows that the risk is small even when the input dimension $ d $ of an infinite-dimensional parameterized dictionary is much larger than the available sample size. When the dimension is larger than the cube root of the sample size, this quantity is seen to improve the more familiar risk bound of $ v_{f^{\star}}\left(\frac{d\log (n/d)}{n}\right)^{1/2} $, also investigated here.

Posted Content
TL;DR: This paper proves that when $A$ is a low-rank and incoherent matrix, the $\ell_{\infty}$ norm perturbation bound of singular vectors (or eigenvectors in the symmetric case) is smaller by a factor of $\sqrt{d_1}$ or $\sqrt{d_2}$ for left and right vectors, where $d_1$ and $d_2$ are the matrix dimensions.
Abstract: In statistics and machine learning, people are often interested in the eigenvectors (or singular vectors) of certain matrices (e.g. covariance matrices, data matrices, etc). However, those matrices are usually perturbed by noise or statistical errors, either from random sampling or structural patterns. One usually employs the Davis-Kahan $\sin \theta$ theorem to bound the difference between the eigenvectors of a matrix $A$ and those of a perturbed matrix $\widetilde{A} = A + E$, in terms of $\ell_2$ norm. In this paper, we prove that when $A$ is a low-rank and incoherent matrix, the $\ell_{\infty}$ norm perturbation bound of singular vectors (or eigenvectors in the symmetric case) is smaller by a factor of $\sqrt{d_1}$ or $\sqrt{d_2}$ for left and right vectors, where $d_1$ and $d_2$ are the matrix dimensions. The power of this new perturbation result is shown in robust covariance estimation, particularly when random variables have heavy tails. There, we propose new robust covariance estimators and establish their asymptotic properties using the newly developed perturbation bound. Our theoretical results are verified through extensive numerical experiments.

Posted Content
TL;DR: Two general model selection methods are developed to provide sieved adaptive estimators (SAE) that achieve nearly optimal rates of convergence for particular "regular" classes of convex functions, while maintaining nearly parametric rate-adaptivity to polyhedral functions in arbitrary dimensions.
Abstract: We study the problem of estimating a multivariate convex function defined on a convex body in a regression setting with random design. We are interested in optimal rates of convergence under a squared global continuous $l_2$ loss in the multivariate setting $(d\geq 2)$. One crucial fact is that the minimax risks depend heavily on the shape of the support of the regression function. It is shown that the global minimax risk is on the order of $n^{-2/(d+1)}$ when the support is sufficiently smooth, but is on the order of $n^{-4/(d+4)}$ when the support is a polytope. Such differences in rates are due to difficulties in estimating the regression function near the boundary of smooth regions. We then study the natural bounded least squares estimators (BLSE): we show that the BLSE nearly attains the optimal rates of convergence in low dimensions, while suffering rate-inefficiency in high dimensions. We show that the BLSE adapts nearly parametrically to polyhedral functions when the support is polyhedral in low dimensions by a local entropy method. We also show that the boundedness constraint cannot be dropped when risk is assessed via continuous $l_2$ loss. Given rate sub-optimality of the BLSE in higher dimensions, we further study rate-efficient adaptive estimation procedures. Two general model selection methods are developed to provide sieved adaptive estimators (SAE) that achieve nearly optimal rates of convergence for particular "regular" classes of convex functions, while maintaining nearly parametric rate-adaptivity to polyhedral functions in arbitrary dimensions. Interestingly, the uniform boundedness constraint is unnecessary when risks are measured in discrete $l_2$ norms.

Posted Content
TL;DR: In this paper, the problem of sampling a probability distribution $\pi$ having a density w.r.t. the Lebesgue measure on $\mathbb{R}^d$, known up to a normalisation factor, was considered, and non-asymptotic bounds for the convergence to stationarity in Wasserstein distances of the sampling method based on the Euler discretization of the Langevin stochastic differential equation for both constant and decreasing step sizes were obtained.
Abstract: We consider in this paper the problem of sampling a probability distribution $\pi$ having a density w.r.t. the Lebesgue measure on $\mathbb{R}^d$, known up to a normalisation factor $x \mapsto \mathrm{e}^{-U(x)} / \int_{\mathbb{R}^d} \mathrm{e}^{-U(y)}\mathrm{d}y$. Under the assumption that $U$ is continuously differentiable, $\nabla U$ is globally Lipschitz and $U$ is strongly convex, we obtain non-asymptotic bounds for the convergence to stationarity in Wasserstein distances of the sampling method based on the Euler discretization of the Langevin stochastic differential equation for both constant and decreasing step sizes. The dependence on the dimension of the state space of the obtained bounds is studied to demonstrate the applicability of this method in the high dimensional setting. The convergence of an appropriately weighted empirical measure is also investigated and bounds for the mean square error and exponential deviation inequality for Lipschitz functions are reported. Some numerical results are presented to illustrate our findings.
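The sampling scheme described here, the Euler discretization of the Langevin diffusion with a constant step size, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `ula`, the step size, and the standard Gaussian target with $U(x) = \|x\|^2/2$ are hypothetical choices that happen to satisfy the strong convexity and Lipschitz-gradient assumptions.

```python
import numpy as np

def ula(grad_U, x0, step, n_steps, rng):
    """Euler discretization of the Langevin SDE dX = -grad U(X) dt + sqrt(2) dB."""
    x = np.array(x0, dtype=float)
    chain = np.empty((n_steps, x.size))
    for k in range(n_steps):
        noise = rng.standard_normal(x.size)
        # One Euler step: gradient drift plus Gaussian noise scaled by sqrt(2 * step).
        x = x - step * grad_U(x) + np.sqrt(2.0 * step) * noise
        chain[k] = x
    return chain

def grad_U(x):
    """Gradient of the assumed potential U(x) = ||x||^2 / 2 (standard Gaussian target)."""
    return x

rng = np.random.default_rng(0)
chain = ula(grad_U, x0=np.zeros(2), step=0.05, n_steps=20000, rng=rng)
sample_mean = chain.mean(axis=0)
sample_var = chain.var(axis=0)
```

With a constant step size the chain targets a slightly biased distribution; for this Gaussian example the stationary variance is $1/(1 - \gamma/2)$ for step size $\gamma$, which is the kind of discretization error the non-asymptotic bounds quantify.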

Posted Content
TL;DR: In this paper, a new estimator of the (element-wise) mean of a random matrix, which includes the covariance estimation problem as a special case, is proposed, which admits sub-Gaussian or sub-exponential concentration around the unknown mean in the operator norm.
Abstract: Estimation of the covariance matrix has attracted a lot of attention of the statistical research community over the years, partially due to important applications such as Principal Component Analysis. However, the frequently used empirical covariance estimator (and its modifications) is very sensitive to outliers in the data. As P. J. Huber wrote in 1964, "...This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): what happens if the true distribution deviates slightly from the assumed normal one? As is now well known, the sample mean then may have a catastrophically bad performance..." Motivated by this question, we develop a new estimator of the (element-wise) mean of a random matrix, which includes the covariance estimation problem as a special case. Assuming that the entries of a matrix possess only a finite second moment, this new estimator admits sub-Gaussian or sub-exponential concentration around the unknown mean in the operator norm. We will explain the key ideas behind our construction, as well as applications to covariance estimation and matrix completion problems.

Journal ArticleDOI
TL;DR: In this paper, the authors generalize the widely used Laplace mechanism to the family of generalized Gaussian (GG) mechanisms based on the global sensitivity of statistical queries, and investigate the connections and differences between the GG mechanism and the Exponential mechanism.
Abstract: Assessment of disclosure risk is of paramount importance in the research and applications of data privacy techniques. The concept of differential privacy (DP) formalizes privacy in probabilistic terms and provides a robust concept for privacy protection without making assumptions about the background knowledge of adversaries. Practical applications of DP involve development of DP mechanisms to release results at a pre-specified privacy budget. In this paper, we generalize the widely used Laplace mechanism to the family of generalized Gaussian (GG) mechanisms based on the $l_p$ global sensitivity of statistical queries. We explore the theoretical requirement for the GG mechanism to reach DP at prespecified privacy parameters, and investigate the connections and differences between the GG mechanism and the Exponential mechanism based on the GG distribution. We also present a lower bound on the scale parameter of the Gaussian mechanism of $(\epsilon,\delta)$-probabilistic DP as a special case of the GG mechanism, and compare the statistical utility of the sanitized results in the tail probability and dispersion in the Gaussian and Laplace mechanisms. Lastly, we apply the GG mechanism in three experiments (the mildew, Czech, adult data), and compare the accuracy of sanitized results via the $l_1$ distance and Kullback-Leibler divergence, and examine how sanitization affects the prediction power of a classifier constructed with the sanitized data in the adult experiment.
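As a point of reference for the mechanism being generalized, the standard Laplace mechanism adds Laplace noise with scale equal to the $l_1$ global sensitivity divided by $\epsilon$. The sketch below shows only that baseline, not the GG mechanism from the paper; the function name `laplace_mechanism` and the example counts and sensitivity are hypothetical.

```python
import numpy as np

def laplace_mechanism(true_value, l1_sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise with scale l1_sensitivity / epsilon."""
    value = np.asarray(true_value, dtype=float)
    scale = l1_sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale, size=value.shape)

rng = np.random.default_rng(0)

# Hypothetical histogram query: changing one record moves at most one unit
# between two cells, so an l1 sensitivity of 2 is assumed here.
counts = np.array([10.0, 20.0, 30.0])
sanitized = laplace_mechanism(counts, l1_sensitivity=2.0, epsilon=1.0, rng=rng)

# Empirical check of the noise scale: the mean absolute Laplace deviation equals the scale.
draws = laplace_mechanism(np.zeros(100000), l1_sensitivity=1.0, epsilon=0.5, rng=rng)
mean_abs_noise = np.abs(draws).mean()
```

A smaller privacy budget $\epsilon$ or a larger sensitivity inflates the noise scale, which is the utility trade-off the abstract compares across the Laplace and Gaussian mechanisms.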