Showing papers in "arXiv: Statistics Theory in 2019"


Posted Content
TL;DR: This paper recovers---in a precise quantitative way---several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.
Abstract: Interpolators---estimators that achieve zero training error---have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation in high-dimensional least squares regression. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in \mathbb{R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = \Sigma^{1/2} z_i$ (with $z_i \in \mathbb{R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = \varphi(W z_i)$ (with $z_i \in \mathbb{R}^d$, $W \in \mathbb{R}^{p \times d}$ a matrix of i.i.d. entries, and $\varphi$ an activation function acting componentwise on $W z_i$). We recover---in a precise quantitative way---several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.

563 citations
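
The minimum $\ell_2$ norm interpolator studied in the abstract above can be illustrated with a small sketch. It assumes $p \ge n$ and linearly independent rows of the design matrix, in which case the interpolator is $\hat\beta = X^{\mathsf T}(XX^{\mathsf T})^{-1}y$. The function names `solve_linear_system` and `min_norm_interpolator`, the Gaussian toy data, and the sizes $n=5$, $p=20$ are illustrative assumptions, not taken from the paper.

```python
import random


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def min_norm_interpolator(X, y):
    """Return beta = X^T (X X^T)^{-1} y, the minimum l2-norm solution of X beta = y.

    Assumes p >= n and linearly independent rows of X (illustrative setting).
    """
    n, p = len(X), len(X[0])
    gram = [[sum(X[i][k] * X[j][k] for k in range(p)) for j in range(n)]
            for i in range(n)]
    alpha = solve_linear_system(gram, y)
    return [sum(alpha[i] * X[i][k] for i in range(n)) for k in range(p)]


# Toy overparametrized problem: n = 5 samples, p = 20 features.
rng = random.Random(0)
n, p = 5, 20
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
beta = min_norm_interpolator(X, y)

# Training residuals vanish up to rounding: the estimator interpolates the data.
residuals = [sum(X[i][k] * beta[k] for k in range(p)) - y[i] for i in range(n)]
```

Among all coefficient vectors with zero training error, this one lies in the row space of $X$, which is what makes its norm minimal. The sketch does not reproduce the risk asymptotics derived in the paper.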


Posted Content
TL;DR: Deep learning methods operate in regimes that defy the traditional statistical mindset, and are so rich that they can interpolate the observed labels, even if the latter are replaced by pure noise.
Abstract: Deep learning methods operate in regimes that defy the traditional statistical mindset. Neural network architectures often contain more parameters than training samples, and are so rich that they can interpolate the observed labels, even if the latter are replaced by pure noise. Despite their huge complexity, the same architectures achieve small generalization error on real data. This phenomenon has been rationalized in terms of a so-called `double descent' curve. As the model complexity increases, the test error follows the usual U-shaped curve at the beginning, first decreasing and then peaking around the interpolation threshold (when the model achieves vanishing training error). However, it descends again as model complexity exceeds this threshold. The global minimum of the test error is found above the interpolation threshold, often in the extreme overparametrization regime in which the number of parameters is much larger than the number of samples. Far from being a peculiar property of deep neural networks, elements of this behavior have been demonstrated in much simpler settings, including linear regression with random covariates. In this paper we consider the problem of learning an unknown function over the $d$-dimensional sphere $\mathbb S^{d-1}$, from $n$ i.i.d. samples $(\boldsymbol x_i, y_i)\in \mathbb S^{d-1} \times \mathbb R$, $i\le n$. We perform ridge regression on $N$ random features of the form $\sigma(\boldsymbol w_a^{\mathsf T} \boldsymbol x)$, $a\le N$. This can be equivalently described as a two-layers neural network with random first-layer weights. We compute the precise asymptotics of the test error, in the limit $N,n,d\to \infty$ with $N/d$ and $n/d$ fixed. This provides the first analytically tractable model that captures all the features of the double descent phenomenon without assuming ad hoc misspecification structures.

386 citations


Posted Content
TL;DR: A specific structure is considered that captures the behavior of nonlinear random feature models or, equivalently, two-layers neural networks with random first layer weights that agrees with the recently developed `double descent' phenomenology for overparametrized models.
Abstract: Modern machine learning models are often so complex that they achieve vanishing classification error on the training set. Max-margin linear classifiers are among the simplest classification methods that have zero training error (with linearly separable data). Despite their simplicity, their high-dimensional behavior is not yet completely understood. We assume to be given i.i.d. data $(y_i,{\boldsymbol x}_i)$, $i\le n$ with ${\boldsymbol x}_i\sim {\sf N}(0,{\boldsymbol \Sigma})$ a $p$-dimensional feature vector, and $y_i \in\{+1,-1\}$ a label whose distribution depends on a linear combination of the covariates $\langle{\boldsymbol\theta}_*,{\boldsymbol x}_i\rangle$. We consider the proportional asymptotics $n,p\to\infty$ with $p/n\to \psi$, and derive exact expressions for the limiting prediction error. Our asymptotic results match simulations already when $n,p$ are of the order of a few hundreds. We explore several choices for $({\boldsymbol \theta}_*,{\boldsymbol \Sigma})$, and show that the resulting generalization curve (test error as a function of the overparametrization $\psi=p/n$) is qualitatively different, depending on this choice. In particular we consider a specific structure of $({\boldsymbol \theta}_*,{\boldsymbol\Sigma})$ that captures the behavior of nonlinear random feature models or, equivalently, two-layers neural networks with random first layer weights. In this case, we aim at classifying data $(y_i,{\boldsymbol x}_i)$ with ${\boldsymbol x}_i\in{\mathbb R}^d$ but we do so by first embedding them in a $p$-dimensional feature space via ${\boldsymbol x}_i\mapsto\sigma({\boldsymbol W}{\boldsymbol x}_i)$ and then finding a max-margin classifier in this space. We derive exact formulas in the proportional asymptotics $p,n,d\to\infty$ with $p/d\to\psi_1$, $n/d\to\psi_2$ and observe that the test error is minimized in the highly overparametrized regime $\psi_1\gg 0$.

157 citations


Posted Content
TL;DR: By focusing on excess risk rather than parameter estimation, this work can give guarantees under weaker assumptions than in previous works and accommodate the case where the target parameter belongs to a complex nonparametric class.
Abstract: We provide non-asymptotic excess risk guarantees for statistical learning in a setting where the population risk with respect to which we evaluate the target parameter depends on an unknown nuisance parameter that must be estimated from data. We analyze a two-stage sample splitting meta-algorithm that takes as input two arbitrary estimation algorithms: one for the target parameter and one for the nuisance parameter. We show that if the population risk satisfies a condition called Neyman orthogonality, the impact of the nuisance estimation error on the excess risk bound achieved by the meta-algorithm is of second order. Our theorem is agnostic to the particular algorithms used for the target and nuisance and only makes an assumption on their individual performance. This enables the use of a plethora of existing results from statistical learning and machine learning to give new guarantees for learning with a nuisance component. Moreover, by focusing on excess risk rather than parameter estimation, we can give guarantees under weaker assumptions than in previous works and accommodate settings in which the target parameter belongs to a complex nonparametric class. We provide conditions on the metric entropy of the nuisance and target classes such that oracle rates---rates of the same order as if we knew the nuisance parameter---are achieved. We also derive new rates for specific estimation algorithms such as variance-penalized empirical risk minimization, neural network estimation and sparse high-dimensional linear model estimation. We highlight the applicability of our results in four settings of central importance: 1) heterogeneous treatment effect estimation, 2) offline policy optimization, 3) domain adaptation, and 4) learning with missing data.

122 citations


Posted Content
TL;DR: This work explores the space between marginal and exact conditional coverage guarantees, examining what types of relaxations of the conditional coverage property would alleviate some of the practical concerns with marginal coverage guarantees while still being possible to achieve in a distribution-free setting.
Abstract: We consider the problem of distribution-free predictive inference, with the goal of producing predictive coverage guarantees that hold conditionally rather than marginally. Existing methods such as conformal prediction offer marginal coverage guarantees, where predictive coverage holds on average over all possible test points, but this is not sufficient for many practical applications where we would like to know that our predictions are valid for a given individual, not merely on average over a population. On the other hand, exact conditional inference guarantees are known to be impossible without imposing assumptions on the underlying distribution. In this work we aim to explore the space in between these two, and examine what types of relaxations of the conditional coverage property would alleviate some of the practical concerns with marginal coverage guarantees while still being possible to achieve in a distribution-free setting.

107 citations
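
The marginal guarantee that the abstract above attributes to conformal prediction can be sketched with the split conformal construction. This is a minimal sketch of that baseline, not of the relaxed conditional guarantees the paper proposes. The function name `split_conformal_radius` and the toy residuals are assumptions made for illustration; the calibration residuals are assumed to come from a predictor fitted on separate data.

```python
import math


def split_conformal_radius(calibration_residuals, alpha):
    """Half-width q for intervals [f(x) - q, f(x) + q].

    If the calibration and test points are exchangeable, the interval covers
    the test response with probability at least 1 - alpha, marginally over
    the calibration and test data.
    """
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Index of the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this level: the interval is the whole line.
        return math.inf
    return scores[k - 1]


# Toy calibration residuals 1, 2, ..., 19 (hypothetical values).
toy_residuals = list(range(1, 20))
radius_90 = split_conformal_radius(toy_residuals, alpha=0.1)
radius_99 = split_conformal_radius(toy_residuals, alpha=0.01)
```

The radius is the same for every test point, which is exactly why the coverage holds only on average over test points rather than conditionally on a given individual.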


Posted Content
TL;DR: These notes survey and explore an emerging method, called the low-degree method, for predicting and understanding statistical-versus-computational tradeoffs in high-dimensional inference problems; the method posits that a certain quantity gives insight into how much computational time is required to solve a given hypothesis testing problem.
Abstract: These notes survey and explore an emerging method, which we call the low-degree method, for predicting and understanding statistical-versus-computational tradeoffs in high-dimensional inference problems. In short, the method posits that a certain quantity -- the second moment of the low-degree likelihood ratio -- gives insight into how much computational time is required to solve a given hypothesis testing problem, which can in turn be used to predict the computational hardness of a variety of statistical inference tasks. While this method originated in the study of the sum-of-squares (SoS) hierarchy of convex programs, we present a self-contained introduction that does not require knowledge of SoS. In addition to showing how to carry out predictions using the method, we include a discussion investigating both rigorous and conjectural consequences of these predictions. These notes include some new results, simplified proofs, and refined conjectures. For instance, we point out a formal connection between spectral methods and the low-degree likelihood ratio, and we give a sharp low-degree lower bound against subexponential-time algorithms for tensor PCA.

90 citations


Posted Content
TL;DR: In this article, the authors define a simple and interpretable measure of the degree of dependence between the variables, which is 0 if and only if the variables are independent and 1 if one is a measurable function of the other, and has a simple asymptotic theory under the hypothesis of independence.
Abstract: Is it possible to define a coefficient of correlation which is (a) as simple as the classical coefficients like Pearson's correlation or Spearman's correlation, and yet (b) consistently estimates some simple and interpretable measure of the degree of dependence between the variables, which is 0 if and only if the variables are independent and 1 if and only if one is a measurable function of the other, and (c) has a simple asymptotic theory under the hypothesis of independence, like the classical coefficients? This article answers this question in the affirmative, by producing such a coefficient. No assumptions are needed on the distributions of the variables. There are several coefficients in the literature that converge to 0 if and only if the variables are independent, but none that satisfy any of the other properties mentioned above.

90 citations
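
The abstract above does not state the coefficient itself. As a hedged illustration, the sketch below uses the form in which it is usually quoted when there are no ties: sort the pairs by $x$, let $r_i$ be the rank of the $i$-th response in that order, and set $\xi_n = 1 - 3\sum_{i=1}^{n-1}|r_{i+1}-r_i|/(n^2-1)$. The function name `xi_correlation` and the toy data are assumptions for illustration, and ties are not handled.

```python
def xi_correlation(x, y):
    """Rank-based dependence coefficient xi_n, assuming no ties in x or y."""
    n = len(x)
    if n < 2 or len(y) != n:
        raise ValueError("need two sequences of equal length >= 2")
    # Indices of the observations ordered by increasing x.
    order = sorted(range(n), key=lambda i: x[i])
    # Rank of each y-value among all y-values (1 = smallest).
    rank = {value: r + 1 for r, value in enumerate(sorted(y))}
    r = [rank[y[i]] for i in order]
    total = sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)


xs = list(range(1, 11))
# y is a monotone function of x: consecutive ranks differ by exactly 1.
xi_monotone = xi_correlation(xs, [v * v for v in xs])
# y is a non-monotone function of x: still clearly positive.
xi_parabola = xi_correlation(xs, [(v - 5.2) ** 2 for v in xs])
```

For a strictly monotone relationship the sum of rank differences is $n-1$, so the coefficient equals $(n-2)/(n+1)$ and tends to 1 as $n$ grows. Unlike Pearson's or Spearman's coefficient, it is not symmetric in $x$ and $y$.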


Posted Content
TL;DR: A new statistical model is proposed, the spiked transport model, which formalizes the assumption that two probability distributions differ only on a low-dimensional subspace and establishes a lower bound showing that, in the absence of such structure, the plug-in estimator is nearly rate-optimal for estimating the Wasserstein distance in high dimension.
Abstract: We propose a new statistical model, the spiked transport model, which formalizes the assumption that two probability distributions differ only on a low-dimensional subspace. We study the minimax rate of estimation for the Wasserstein distance under this model and show that this low-dimensional structure can be exploited to avoid the curse of dimensionality. As a byproduct of our minimax analysis, we establish a lower bound showing that, in the absence of such structure, the plug-in estimator is nearly rate-optimal for estimating the Wasserstein distance in high dimension. We also give evidence for a statistical-computational gap and conjecture that any computationally efficient estimator is bound to suffer from the curse of dimensionality.

83 citations


Posted Content
TL;DR: The algorithm, which is robust to outliers and heavy-tailed data, is fully data-dependent: its construction uses neither the proportion of outliers nor the target rate, and it combines recently developed tools for Median-of-Means estimators and covering semi-definite programming.
Abstract: We construct an algorithm, running in time $\tilde{\mathcal O}(N d + uK d)$, which is robust to outliers and heavy-tailed data and which achieves the subgaussian rate from [Lugosi, Mendelson] $$\sqrt{\frac{{\rm Tr}(\Sigma)}{N}}+\sqrt{\frac{\|\Sigma\|_{op}K}{N}}$$ with probability at least $1-\exp(-c_0K)-\exp(-c_1 u)$ where $\Sigma$ is the covariance matrix of the informative data, $K\in\{1, \ldots, N\}$ is some parameter (number of block means) and $u>0$ is another parameter of the algorithm. This rate is achieved when $K\geq c_1 |\mathcal O|$ where $|\mathcal O|$ is the number of outliers in the database and under the only assumption that the informative data have a second moment. The algorithm is fully data-dependent and does not use in its construction the proportion of outliers nor the rate above. Its construction combines recently developed tools for Median-of-Means estimators and covering-Semi-definite Programming [Chen, Diakonikolas, Ge] and [Peng, Tangwongsan, Zhang].

82 citations


Posted Content
TL;DR: In this paper, the authors consider the problem of learning an unknown function on the Euclidean sphere with respect to the square loss, given i.i.d. samples.
Abstract: We consider the problem of learning an unknown function $f_{\star}$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i,{\boldsymbol x}_i)\}_{i\le n}$ where ${\boldsymbol x}_i$ is a feature vector uniformly distributed on the sphere and $y_i=f_{\star}({\boldsymbol x}_i)+\varepsilon_i$. We study two popular classes of models that can be regarded as linearizations of two-layers neural networks around a random initialization: the random features model of Rahimi-Recht (RF); the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both these approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$. We consider two specific regimes: the approximation-limited regime, in which $n=\infty$ while $d$ and $N$ are large but finite; and the sample size-limited regime in which $N=\infty$ while $d$ and $n$ are large but finite. In the first regime we prove that if $d^{\ell + \delta} \le N\le d^{\ell+1-\delta}$ for small $\delta > 0$, then RF effectively fits a degree-$\ell$ polynomial in the raw features, and NT fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples is $d^{\ell + \delta} \le n \le d^{\ell +1-\delta}$, then kernel methods can fit at most a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression. Optimal prediction error is achieved for vanishing ridge regularization.

82 citations


Posted Content
TL;DR: In this article, a multivariate extension of the trimmed-mean estimator is proposed for estimating the mean of a random vector under adversarial contamination, and it is shown to have optimal performance under minimal conditions.
Abstract: We consider the problem of estimating the mean of a random vector based on i.i.d. observations and adversarial contamination. We introduce a multivariate extension of the trimmed-mean estimator and show its optimal performance under minimal conditions.
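
For orientation, the classical univariate trimmed mean that the abstract above generalizes can be sketched as follows. This is only the textbook one-dimensional version, not the multivariate estimator introduced in the paper; the function name `trimmed_mean`, the parameter `n_trim`, and the toy sample are assumptions for illustration.

```python
def trimmed_mean(samples, n_trim):
    """Average the sample after discarding the n_trim smallest and n_trim largest values."""
    data = sorted(samples)
    if n_trim < 0 or 2 * n_trim >= len(data):
        raise ValueError("n_trim must leave at least one observation")
    kept = data[n_trim:len(data) - n_trim]
    return sum(kept) / len(kept)


# One grossly corrupted observation (hypothetical data).
contaminated = [1.0, 2.0, 3.0, 4.0, 1000.0]
plain_mean = sum(contaminated) / len(contaminated)   # pulled far away by the outlier
robust_mean = trimmed_mean(contaminated, n_trim=1)   # the outlier is discarded
```

Trimming a symmetric number of extreme order statistics bounds the influence of any single corrupted point, at the cost of ignoring part of the clean data.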

Posted Content
TL;DR: A scaling limit for a DNN being trained by stochastic gradient descent is presented and it is shown that network weights are approximated by certain "ideal particles" whose distribution and dependencies are described by the mean-field model of McKean-Vlasov type.
Abstract: Understanding deep neural networks (DNNs) is a key challenge in the theory of machine learning, with potential applications to the many fields where DNNs have been successfully used. This article presents a scaling limit for a DNN being trained by stochastic gradient descent. Our networks have a fixed (but arbitrary) number $L\geq 2$ of inner layers; $N\gg 1$ neurons per layer; full connections between layers; and fixed weights (or "random features" that are not trained) near the input and output. Our results describe the evolution of the DNN during training in the limit when $N\to +\infty$, which we relate to a mean field model of McKean-Vlasov type. Specifically, we show that network weights are approximated by certain "ideal particles" whose distribution and dependencies are described by the mean-field model. A key part of the proof is to show existence and uniqueness for our McKean-Vlasov problem, which does not seem to be amenable to existing theory. Our paper extends previous work on the $L=1$ case by Mei, Montanari and Nguyen; Rotskoff and Vanden-Eijnden; and Sirignano and Spiliopoulos. We also complement recent independent work on $L>1$ by Sirignano and Spiliopoulos (who consider a less natural scaling limit) and Nguyen (who nonrigorously derives similar results).

Posted Content
TL;DR: It is demonstrated that e-values are often mathematically more tractable; in particular, in multiple testing of a single hypothesis, e-values can be merged simply by averaging them, which allows the development of efficient procedures using e-values for testing multiple hypotheses.
Abstract: Multiple testing of a single hypothesis and testing multiple hypotheses are usually done in terms of p-values. In this paper we replace p-values with their natural competitor, e-values, which are closely related to betting, Bayes factors, and likelihood ratios. We demonstrate that e-values are often mathematically more tractable; in particular, in multiple testing of a single hypothesis, e-values can be merged simply by averaging them. This allows us to develop efficient procedures using e-values for testing multiple hypotheses.
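
The averaging rule mentioned in the abstract above is short enough to sketch. An e-value is a nonnegative statistic whose expectation is at most 1 under the null hypothesis, so an arithmetic mean of e-values for the same null is again an e-value, and Markov's inequality turns the threshold $1/\alpha$ into a level-$\alpha$ test. The function names `merge_e_values` and `reject` and the numbers used are illustrative assumptions.

```python
def merge_e_values(e_values):
    """Merge e-values for the same null hypothesis by arithmetic averaging."""
    if not e_values:
        raise ValueError("need at least one e-value")
    if any(e < 0 for e in e_values):
        raise ValueError("e-values must be nonnegative")
    return sum(e_values) / len(e_values)


def reject(e_value, alpha):
    """Reject the null when e >= 1 / alpha.

    By Markov's inequality, P(E >= 1 / alpha) <= alpha * E[E] <= alpha under the null.
    """
    return e_value >= 1.0 / alpha


# Hypothetical e-values from four studies of the same null hypothesis.
merged = merge_e_values([1.0, 3.0, 50.0, 30.0])
decision = reject(merged, alpha=0.05)
```

No independence assumption is needed for the average to remain an e-value, because expectation is linear; this is the sense in which merging is simpler than for p-values.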

Posted Content
TL;DR: This paper proposes a general framework for distribution-free nonparametric testing in multi-dimensions, based on a notion of multivariate ranks defined using the theory of measure transportation, and proposes (multivariate) rank versions of distance covariance and energy statistic for testing scenarios (i) and (ii) respectively.
Abstract: In this paper, we propose a general framework for distribution-free nonparametric testing in multi-dimensions, based on a notion of multivariate ranks defined using the theory of measure transportation. Unlike other existing proposals in the literature, these multivariate ranks share a number of useful properties with the usual one-dimensional ranks; most importantly, these ranks are distribution-free. This crucial observation allows us to design nonparametric tests that are exactly distribution-free under the null hypothesis. We demonstrate the applicability of this approach by constructing exact distribution-free tests for two classical nonparametric problems: (i) testing for mutual independence between random vectors, and (ii) testing for the equality of multivariate distributions. In particular, we propose (multivariate) rank versions of distance covariance (Szekely et al., 2007) and energy statistic (Szekely and Rizzo, 2013) for testing scenarios (i) and (ii) respectively. In both these problems, we derive the asymptotic null distribution of the proposed test statistics. We further show that our tests are consistent against all fixed alternatives. Moreover, the proposed tests are tuning-free, computationally feasible and are well-defined under minimal assumptions on the underlying distributions (e.g., they do not need any moment assumptions). We also demonstrate the efficacy of these procedures via extensive simulations. In the process of analyzing the theoretical properties of our procedures, we end up proving some new results in the theory of measure transportation and in the limit theory of permutation statistics using Stein's method for exchangeable pairs, which may be of independent interest.

Posted Content
TL;DR: In this paper, the authors introduce a graphical criterion to compare the asymptotic variance provided by different covariate sets in a causal linear model and present a simple variance reducing pruning procedure for any given adjustment set, which can be applied to DAGs, CPDAGs and maximally oriented PDAGs.
Abstract: Covariate adjustment is commonly used for total causal effect estimation. In recent years, graphical criteria have been developed to identify all covariate sets that can be used for this purpose. Different valid adjustment sets typically provide causal effect estimates of varying accuracies. We introduce a graphical criterion to compare the asymptotic variance provided by certain valid adjustment sets in a causal linear model. We employ this result to develop two further graphical tools. First, we introduce a simple variance reducing pruning procedure for any given valid adjustment set. Second, we give a graphical characterization of a valid adjustment set that provides the optimal asymptotic variance among all valid adjustment sets. Our results depend only on the graphical structure and not on the specific error variances or the edge coefficients of the underlying causal linear model. They can be applied to DAGs, CPDAGs and maximally oriented PDAGs. We present simulations and a real data example to support our results and show their practical applicability.

Posted Content
TL;DR: In this article, the authors consider parameters for which the bias of the one-step estimator is the expectation of the product of the estimation errors of two nuisance functions, and propose estimators of the target parameter that are root-n consistent and asymptotically normal under approximately sparse models for the nuisance functions.
Abstract: We consider inference about a scalar parameter under a non-parametric model based on a one-step estimator computed as a plug-in estimator plus the empirical mean of an estimator of the parameter's influence function. We focus on a class of parameters that have an influence function which depends on two infinite dimensional nuisance functions and such that the bias of the one-step estimator of the parameter of interest is the expectation of the product of the estimation errors of the two nuisance functions. Our class includes many important treatment effect contrasts of interest in causal inference and econometrics, such as ATE, ATT, an integrated causal contrast with a continuous treatment, and the mean of an outcome missing not at random. We propose estimators of the target parameter that entertain approximately sparse regression models for the nuisance functions allowing for the number of potential confounders to be even larger than the sample size. By employing sample splitting, cross-fitting and $\ell_1$-regularized regression estimators of the nuisance functions based on objective functions whose directional derivatives agree with those of the parameter's influence function, we obtain estimators of the target parameter with two desirable robustness properties: (1) they are rate doubly-robust in that they are root-n consistent and asymptotically normal when both nuisance functions follow approximately sparse models, even if one function has a very non-sparse regression coefficient, so long as the other has a sufficiently sparse regression coefficient, and (2) they are model doubly-robust in that they are root-n consistent and asymptotically normal even if one of the nuisance functions does not follow an approximately sparse model so long as the other nuisance function follows an approximately sparse model with a sufficiently sparse regression coefficient.

Posted Content
TL;DR: Lower bounds for estimation under local privacy constraints are developed by showing an equivalence between private estimation and communication-restricted estimation problems, and it is shown that the minimax mean-squared error for estimating the mean of a bounded or Gaussian random vector in $d$ dimensions scales as $\frac{d}{n} \cdot \frac{d}{ \min\{\varepsilon, \varepsilon^2\}}$.
Abstract: We develop lower bounds for estimation under local privacy constraints---including differential privacy and its relaxations to approximate or Renyi differential privacy---by showing an equivalence between private estimation and communication-restricted estimation problems. Our results apply to arbitrarily interactive privacy mechanisms, and they also give sharp lower bounds for all levels of differential privacy protections, that is, privacy mechanisms with privacy levels $\varepsilon \in [0, \infty)$. As a particular consequence of our results, we show that the minimax mean-squared error for estimating the mean of a bounded or Gaussian random vector in $d$ dimensions scales as $\frac{d}{n} \cdot \frac{d}{ \min\{\varepsilon, \varepsilon^2\}}$.

Posted Content
TL;DR: A new variable selection algorithm is devised, called Feature Ordering by Conditional Independence (FOCI), which is model-free, has no tuning parameters, and is provably consistent under sparsity assumptions.
Abstract: We propose a coefficient of conditional dependence between two random variables $Y$ and $Z$ given a set of other variables $X_1,\ldots,X_p$, based on an i.i.d. sample. The coefficient has a long list of desirable properties, the most important of which is that under absolutely no distributional assumptions, it converges to a limit in $[0,1]$, where the limit is $0$ if and only if $Y$ and $Z$ are conditionally independent given $X_1,\ldots,X_p$, and is $1$ if and only if $Y$ is equal to a measurable function of $Z$ given $X_1,\ldots,X_p$. Moreover, it has a natural interpretation as a nonlinear generalization of the familiar partial $R^2$ statistic for measuring conditional dependence by regression. Using this statistic, we devise a new variable selection algorithm, called Feature Ordering by Conditional Independence (FOCI), which is model-free, has no tuning parameters, and is provably consistent under sparsity assumptions. A number of applications to synthetic and real datasets are worked out.

Posted Content
TL;DR: It is shown that spectral clustering is minimax optimal in the Gaussian Mixture Model with isotropic covariance matrix, when the number of clusters is fixed and the signal-to-noise ratio is large enough.
Abstract: Spectral clustering is one of the most popular algorithms to group high dimensional data. It is easy to implement and computationally efficient. Despite its popularity and successful applications, its theoretical properties have not been fully understood. In this paper, we show that spectral clustering is minimax optimal in the Gaussian Mixture Model with isotropic covariance matrix, when the number of clusters is fixed and the signal-to-noise ratio is large enough. Spectral gap conditions are widely assumed in the literature to analyze spectral clustering. On the contrary, these conditions are not needed to establish optimality of spectral clustering in this paper.

Posted Content
TL;DR: This work considers the problem of sampling from a target distribution, which is not necessarily logconcave, in the context of empirical risk minimization and stochastic optimization as presented in Raginsky et al. (2017).
Abstract: We consider the problem of sampling from a target distribution, which is not necessarily logconcave, in the context of empirical risk minimization and stochastic optimization as presented in Raginsky et al. (2017). Non-asymptotic analysis results are established in the $L^1$-Wasserstein distance for the behaviour of Stochastic Gradient Langevin Dynamics (SGLD) algorithms. We allow the estimation of gradients to be performed even in the presence of dependent data streams. Our convergence estimates are sharper and uniform in the number of iterations, in contrast to those in previous studies.
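
The SGLD recursion analysed in the abstract above is commonly written as $\theta_{k+1} = \theta_k - \lambda H(\theta_k, X_{k+1}) + \sqrt{2\lambda/\beta}\,\xi_{k+1}$, with $H$ a stochastic gradient estimate and $\xi_{k+1}$ standard Gaussian noise. The sketch below runs this recursion on a toy one-dimensional quadratic objective. The target, the step size, the inverse temperature, and the use of i.i.d. rather than dependent data are all simplifying assumptions for illustration, and the names `sgld` and `noisy_gradient` are hypothetical.

```python
import math
import random


def sgld(grad_estimate, theta0, step, beta, n_iter, rng):
    """Run SGLD: gradient step on a noisy gradient plus scaled Gaussian noise."""
    theta = theta0
    path = []
    noise_scale = math.sqrt(2.0 * step / beta)
    for _ in range(n_iter):
        xi = rng.gauss(0.0, 1.0)
        theta = theta - step * grad_estimate(theta, rng) + noise_scale * xi
        path.append(theta)
    return path


def noisy_gradient(theta, rng):
    """Unbiased gradient estimate of U(theta) = E[(theta - X)^2] / 2 with X ~ N(2, 1)."""
    x = rng.gauss(2.0, 1.0)  # one fresh i.i.d. data point per iteration (assumption)
    return theta - x


rng = random.Random(0)
path = sgld(noisy_gradient, theta0=-3.0, step=0.01, beta=10.0, n_iter=20000, rng=rng)

# Discard the first half as burn-in and average the rest.
tail = path[len(path) // 2:]
tail_mean = sum(tail) / len(tail)
```

In this toy problem the iterates fluctuate around the minimizer $\theta = 2$, with a spread governed mainly by $1/\beta$. The sketch says nothing about the Wasserstein bounds or the dependent-data setting treated in the paper.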

Posted Content
TL;DR: In this article, the authors propose confidence sequences, sequences of confidence intervals which are valid uniformly over time for quantiles of any distribution over a complete, fully-ordered set, based on a stream of i.i.d. observations.
Abstract: We propose confidence sequences -- sequences of confidence intervals which are valid uniformly over time -- for quantiles of any distribution over a complete, fully-ordered set, based on a stream of i.i.d. observations. We give methods both for tracking a fixed quantile and for tracking all quantiles simultaneously. Specifically, we provide explicit expressions with small constants for intervals whose widths shrink at the fastest possible $\sqrt{t^{-1} \log\log t}$ rate, along with a non-asymptotic concentration inequality for the empirical distribution function which holds uniformly over time with the same rate. The latter strengthens Smirnov's empirical process law of the iterated logarithm and extends the Dvoretzky-Kiefer-Wolfowitz inequality to hold uniformly over time. We give a new algorithm and sample complexity bound for selecting an arm with an approximately best quantile in a multi-armed bandit framework. In simulations, our method requires fewer samples than existing methods by a factor of five to fifty.

Posted Content
TL;DR: A lower bound is proved on the estimation error achieved by any convex regularizer that is invariant under permutations of the coordinates of its argument; the bound is expected to be generally tight, and tightness is proved under certain conditions.
Abstract: In high-dimensional regression, we attempt to estimate a parameter vector ${\boldsymbol \beta}_0\in{\mathbb R}^p$ from $n\lesssim p$ observations $\{(y_i,{\boldsymbol x}_i)\}_{i\le n}$ where ${\boldsymbol x}_i\in{\mathbb R}^p$ is a vector of predictors and $y_i$ is a response variable. A well-established approach uses convex regularizers to promote specific structures (e.g. sparsity) of the estimate $\widehat{\boldsymbol \beta}$, while allowing for practical algorithms. Theoretical analysis implies that convex penalization schemes have nearly optimal estimation properties in certain settings. However, in general the gaps between statistically optimal estimation (with unbounded computational resources) and convex methods are poorly understood. We show that, in general, a large gap exists between the best performance achieved by any convex regularizer and the optimal statistical error. Remarkably, we demonstrate that this gap is generic as soon as we try to incorporate very simple structural information about the empirical distribution of the entries of ${\boldsymbol \beta}_0$. Our results follow from a detailed study of standard Gaussian designs, a setting that is normally considered particularly friendly to convex regularization schemes such as the Lasso. We prove a lower bound on the estimation error achieved by any convex regularizer which is invariant under permutations of the coordinates of its argument. This bound is expected to be generally tight, and indeed we prove tightness under certain conditions. Further, it implies a gap with respect to Bayes-optimal estimation that can be precisely quantified and persists if the prior distribution of the signal ${\boldsymbol \beta}_0$ is known to the statistician. Our results provide rigorous evidence towards a broad conjecture regarding computational-statistical gaps in high-dimensional estimation.

Posted Content
TL;DR: This paper surveys recent advances in mean estimation and regression function estimation for possibly heavy-tailed data, focusing on estimators based on median-of-means techniques while also reviewing other methods such as the trimmed mean and Catoni's estimator.
Abstract: We survey some of the recent advances in mean estimation and regression function estimation. In particular, we describe sub-Gaussian mean estimators for possibly heavy-tailed data both in the univariate and multivariate settings. We focus on estimators based on median-of-means techniques but other methods such as the trimmed mean and Catoni's estimator are also reviewed. We give detailed proofs for the cornerstone results. We dedicate a section to statistical learning problems--in particular, regression function estimation--in the presence of possibly heavy-tailed data.
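The median-of-means idea named in the abstract can be illustrated with a minimal univariate sketch. The function name `median_of_means`, the contiguous equal-size blocks, and the choice to drop leftover samples are assumptions made here for illustration, not details taken from the survey.

```python
import statistics


def median_of_means(samples, num_blocks):
    """Median of the block averages of `samples` (univariate sketch).

    Assumption: samples are split into `num_blocks` contiguous blocks of
    equal size; any leftover samples at the end are simply dropped.
    """
    if not 1 <= num_blocks <= len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")
    block_size = len(samples) // num_blocks
    block_means = [
        statistics.fmean(samples[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    # A single extreme observation can spoil at most one block mean,
    # so the median of the block means is barely affected by it.
    return statistics.median(block_means)
```

For example, on `[1, 1, 1, 1, 1, 1000]` with three blocks the block means are 1, 1 and 500.5, so the estimate stays at 1 while the plain sample mean is pulled far upward by the outlier.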

Posted Content
TL;DR: This paper calls attention to a simple betting interpretation of likelihood ratios, which leads to methods that lend themselves to meta-analysis and accounting for multiple testing, and which does not encourage the fallacy that probabilistic models imply the existence of unseen alternative worlds.
Abstract: The established language for statistical testing --- significance levels, power, and p-values --- is overly complicated and deceptively conclusive. Even teachers of statistics and scientists who use statistics misinterpret the results of statistical tests, tending to misstate their meaning and exaggerate their certainty. We can communicate the meaning and limitations of statistical evidence more clearly using the language of betting. This paper calls attention to a simple betting interpretation of likelihood ratios. This interpretation leads to methods that lend themselves to meta-analysis and accounting for multiple testing. It is closely related to the interpretation of probability as frequency, but it does not encourage the fallacy that probabilistic models imply the existence of unseen alternative worlds. For more on the betting interpretation of probability, see \cite{Shafer/Vovk:2019} and the other working papers at this http URL.

Posted Content
TL;DR: Using the first moment method, the densest subgraph problems for subgraphs with fixed, but arbitrary, overlap size with the planted clique are studied, and evidence of a phase transition for the presence of Overlap Gap Property (OGP) is provided.
Abstract: In this paper we study the computational-statistical gap of the planted clique problem, where a clique of size $k$ is planted in an Erdős-Rényi graph $G(n,\frac{1}{2})$ resulting in a graph $G\left(n,\frac{1}{2},k\right)$. The goal is to recover the planted clique vertices by observing $G\left(n,\frac{1}{2},k\right)$. It is known that the clique can be recovered as long as $k \geq \left(2+\epsilon\right)\log n $ for any $\epsilon>0$, but no polynomial-time algorithm is known for this task unless $k=\Omega\left(\sqrt{n} \right)$. Following a statistical-physics inspired point of view as an attempt to understand this computational-statistical gap, we study the landscape of the "sufficiently dense" subgraphs of $G$ and their overlap with the planted clique. Using the first moment method, we study the densest subgraph problems for subgraphs with fixed, but arbitrary, overlap size with the planted clique, and provide evidence of a phase transition for the presence of Overlap Gap Property (OGP) at $k=\Theta\left(\sqrt{n}\right)$. OGP is a concept introduced originally in spin glass theory and known to suggest algorithmic hardness when it appears. We establish the presence of OGP when $k$ is a small positive power of $n$ by using a conditional second moment method. As our main technical tool, we establish the first, to the best of our knowledge, concentration results for the $K$-densest subgraph problem for the Erdős-Rényi model $G\left(n,\frac{1}{2}\right)$ when $K=n^{0.5-\epsilon}$ for arbitrary $\epsilon>0$. Finally, to study the OGP we employ a certain form of overparametrization, which is conceptually aligned with a large body of recent work in learning theory and optimization.

Posted Content
TL;DR: Empirical evidence supports the finding that minimum-norm interpolants in RKHS can exhibit this unusual non-monotonicity in sample size, and the analysis yields novel estimation and generalization guarantees for these over-parametrized models.
Abstract: We study the risk of minimum-norm interpolants of data in Reproducing Kernel Hilbert Spaces. Our upper bounds on the risk are of a multiple-descent shape for the various scalings of $d = n^{\alpha}$, $\alpha\in(0,1)$, for the input dimension $d$ and sample size $n$. Empirical evidence supports our finding that minimum-norm interpolants in RKHS can exhibit this unusual non-monotonicity in sample size; furthermore, locations of the peaks in our experiments match our theoretical predictions. Since gradient flow on appropriately initialized wide neural networks converges to a minimum-norm interpolant with respect to a certain kernel, our analysis also yields novel estimation and generalization guarantees for these over-parametrized models. At the heart of our analysis is a study of spectral properties of the random kernel matrix restricted to a filtration of eigen-spaces of the population covariance operator, which may be of independent interest.

Posted Content
TL;DR: A family of algorithms that interpolates smoothly between two existing algorithms: the polynomial-time diagonal thresholding algorithm and the $\exp(\rho n)$-time exhaustive search algorithm, demonstrating a smooth tradeoff between sparsity and runtime.
Abstract: We study the computational cost of recovering a unit-norm sparse principal component $x \in \mathbb{R}^n$ planted in a random matrix, in either the Wigner or Wishart spiked model (observing either $W + \lambda xx^\top$ with $W$ drawn from the Gaussian orthogonal ensemble, or $N$ independent samples from $\mathcal{N}(0, I_n + \beta xx^\top)$, respectively). Prior work has shown that when the signal-to-noise ratio ($\lambda$ or $\beta\sqrt{N/n}$, respectively) is a small constant and the fraction of nonzero entries in the planted vector is $\|x\|_0 / n = \rho$, it is possible to recover $x$ in polynomial time if $\rho \lesssim 1/\sqrt{n}$. While it is possible to recover $x$ in exponential time under the weaker condition $\rho \ll 1$, it is believed that polynomial-time recovery is impossible unless $\rho \lesssim 1/\sqrt{n}$. We investigate the precise amount of time required for recovery in the "possible but hard" regime $1/\sqrt{n} \ll \rho \ll 1$ by exploring the power of subexponential-time algorithms, i.e., algorithms running in time $\exp(n^\delta)$ for some constant $\delta \in (0,1)$. For any $1/\sqrt{n} \ll \rho \ll 1$, we give a recovery algorithm with runtime roughly $\exp(\rho^2 n)$, demonstrating a smooth tradeoff between sparsity and runtime. Our family of algorithms interpolates smoothly between two existing algorithms: the polynomial-time diagonal thresholding algorithm and the $\exp(\rho n)$-time exhaustive search algorithm. Furthermore, by analyzing the low-degree likelihood ratio, we give rigorous evidence suggesting that the tradeoff achieved by our algorithms is optimal.
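As a rough illustration of the polynomial-time baseline named in the abstract, the support-selection step of diagonal thresholding in the Wishart model can be sketched as follows. This is a simplified sketch under stated assumptions: the function name `diagonal_thresholding_support` is hypothetical, the data are assumed to have zero mean as in the model $\mathcal{N}(0, I_n + \beta xx^\top)$, and the support size is assumed known so that the largest diagonal entries are kept instead of applying an explicit threshold. A full procedure would typically go on to estimate $x$ from the covariance restricted to the selected coordinates; that step is omitted here.

```python
def diagonal_thresholding_support(samples, support_size):
    """Pick the coordinates with the largest empirical second moments.

    Assumptions: `samples` is a list of equal-length rows with zero mean,
    and `support_size` (the number of nonzero entries of the planted
    vector) is known in advance.
    """
    num_samples = len(samples)
    dimension = len(samples[0])
    # Diagonal of the sample covariance matrix. Coordinates in the support
    # of the planted vector have inflated variance 1 + beta * x_j^2.
    second_moments = [
        sum(row[j] ** 2 for row in samples) / num_samples
        for j in range(dimension)
    ]
    ranked = sorted(range(dimension), key=lambda j: second_moments[j], reverse=True)
    return sorted(ranked[:support_size])
```

The cost is linear in the size of the data matrix, which is why this end of the tradeoff is cheap; the subexponential-time algorithms in the paper are not reproduced here.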

Posted Content
TL;DR: Sharing Fisherian, Neymanian and Jeffreys-Bayesian interpretations, e-values and safe tests may provide a methodology acceptable to adherents of all three schools.
Abstract: We develop the theory of hypothesis testing based on the e-value, a notion of evidence that, unlike the p-value, allows for effortlessly combining results from several tests. Even in the common scenario of optional continuation, where the decision to perform a new test depends on previous test outcomes, 'safe' tests based on e-values generally preserve Type-I error guarantees. Our main result shows that e-values exist for completely general testing problems with composite null and alternatives. Their prime interpretation is in terms of gambling or investing, each e-value corresponding to a particular investment. Surprisingly, optimal 'GROW' e-values, which lead to fastest capital growth, are fully characterized by the joint information projection (JIPr) between the set of all Bayes marginal distributions on H0 and H1. Thus, optimal e-values also have an interpretation as Bayes factors, with priors given by the JIPr. We illustrate the theory using several 'classic' examples including a one-sample safe t-test and the 2 x 2 contingency table. Sharing Fisherian, Neymanian and Jeffreys-Bayesian interpretations, e-values and safe tests may provide a methodology acceptable to adherents of all three schools.
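The claim that e-values combine effortlessly can be made concrete with a minimal sketch for the simplest case: a simple null against a simple alternative for coin flips, where the likelihood ratio is an e-value. The function names and the Bernoulli setting are assumptions chosen for illustration; the GROW e-values for composite hypotheses described in the abstract are not implemented here.

```python
import math


def likelihood_ratio_e_value(outcomes, p_null, p_alt):
    """Likelihood ratio of a Bernoulli(p_alt) alternative against a
    Bernoulli(p_null) null for a sequence of 0/1 outcomes.

    Under the null its expectation is 1, so it is a valid e-value.
    """
    e_value = 1.0
    for outcome in outcomes:
        if outcome:
            e_value *= p_alt / p_null
        else:
            e_value *= (1.0 - p_alt) / (1.0 - p_null)
    return e_value


def combine_e_values(e_values):
    """Combine e-values from successive studies by multiplication."""
    return math.prod(e_values)


def reject_null(e_value, alpha):
    """Reject when the e-value reaches 1/alpha; by Markov's inequality
    this happens with probability at most alpha under the null."""
    return e_value >= 1.0 / alpha
```

In the gambling reading, each e-value is the factor by which a bet multiplies the starting capital, and multiplying e-values corresponds to reinvesting the proceeds of one study in the next.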

Journal ArticleDOI
TL;DR: Because the coefficient of variation (CV) is based on the sample mean and standard deviation, outliers can adversely affect it; this paper investigates the extent to which quantile-based measures of relative dispersion can provide appropriate summary information as an alternative to the CV.
Abstract: The coefficient of variation (CV) is commonly used to measure relative dispersion. However, since it is based on the sample mean and standard deviation, outliers can adversely affect the CV. Additionally, for skewed distributions the mean and standard deviation do not have natural interpretations and, consequently, neither does the CV. Here we investigate the extent to which quantile-based measures of relative dispersion can provide appropriate summary information as an alternative to the CV. In particular, we investigate two measures, the first being the interquartile range (in lieu of the standard deviation), divided by the median (in lieu of the mean), and the second being the median absolute deviation (MAD), divided by the median, as robust estimators of relative dispersion. In addition to comparing the influence functions of the competing estimators and their asymptotic biases and variances, we compare interval estimators using simulation studies to assess coverage.
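The three measures compared in the abstract can be sketched with the Python standard library. The function names are hypothetical, the quartiles use the default "exclusive" method of `statistics.quantiles`, and the MAD is left unscaled; the paper may use a different quantile definition or a consistency constant, so these choices are assumptions.

```python
import statistics


def coefficient_of_variation(data):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(data) / statistics.fmean(data)


def iqr_over_median(data):
    """Interquartile range divided by the median."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return (q3 - q1) / statistics.median(data)


def mad_over_median(data):
    """Median absolute deviation from the median, divided by the median."""
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data)
    return mad / med
```

On `[1, 2, 3, 4, 5, 6, 7]`, replacing the last value with 700 leaves both quantile-based measures unchanged (1.0 and 0.5 respectively) while the CV increases sharply, which is the sensitivity to outliers the abstract describes.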

Posted Content
TL;DR: Like the polynomial time estimator introduced by Hopkins, 2018, which is based on the sum-of-squares hierarchy, this estimator achieves optimal statistical efficiency in this challenging setting, but it has a significantly faster runtime and a simpler analysis.
Abstract: We propose an estimator for the mean of a random vector in $\mathbb{R}^d$ that can be computed in time $O(n^4+n^2d)$ for $n$ i.i.d.~samples and that has error bounds matching the sub-Gaussian case. The only assumptions we make about the data distribution are that it has finite mean and covariance; in particular, we make no assumptions about higher-order moments. Like the polynomial time estimator introduced by Hopkins, 2018, which is based on the sum-of-squares hierarchy, our estimator achieves optimal statistical efficiency in this challenging setting, but it has a significantly faster runtime and a simpler analysis.