Showing papers in "arXiv: Statistics Theory in 2021"


Posted Content
TL;DR: This work develops a new HSIC-based aggregated procedure that avoids having to choose a kernel, provides theoretical guarantees for it, and reports numerical studies that assess its efficiency and compare it to existing independence tests in the literature.
Abstract: Dependence measures based on reproducing kernel Hilbert spaces, also known as Hilbert-Schmidt Independence Criterion and denoted HSIC, are widely used to statistically decide whether or not two random vectors are dependent. Recently, non-parametric HSIC-based statistical tests of independence have been performed. However, these tests lead to the question of the choice of the kernels associated to the HSIC. In particular, there is as yet no method to objectively select specific kernels with theoretical guarantees in terms of first and second kind errors. One of the main contributions of this work is to develop a new HSIC-based aggregated procedure which avoids such a kernel choice, and to provide theoretical guarantees for this procedure. To achieve this, we first introduce non-asymptotic single tests based on Gaussian kernels with a given bandwidth, which are of prescribed level $\alpha \in (0,1)$. From a theoretical point of view, we upper-bound their uniform separation rate of testing over Sobolev and Nikol'skii balls. Then, we aggregate several single tests, and obtain similar upper-bounds for the uniform separation rate of the aggregated procedure over the same regularity spaces. Another main contribution is that we provide a lower-bound for the non-asymptotic minimax separation rate of testing over Sobolev balls, and deduce that the aggregated procedure is adaptive in the minimax sense over such regularity spaces. Finally, from a practical point of view, we perform numerical studies in order to assess the efficiency of our aggregated procedure and compare it to existing independence tests in the literature.

28 citations
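To make the single-bandwidth test in the abstract above concrete, here is a minimal sketch of an HSIC statistic with Gaussian kernels. It assumes the biased (V-statistic) estimator and calibrates the level by permutation; both are illustrative choices, since the abstract does not say which estimator or calibration the authors use. The function names and default bandwidths are hypothetical, and the aggregation over several bandwidths is not shown.

```python
import numpy as np

def gaussian_gram(x, bandwidth):
    """Gram matrix of a Gaussian kernel with the given bandwidth."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum(x**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def hsic_biased(K, L):
    """Biased (V-statistic) HSIC estimate: trace(K H L H) / n^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / n**2

def hsic_permutation_test(x, y, bandwidth_x=1.0, bandwidth_y=1.0,
                          n_perm=200, alpha=0.05, seed=0):
    """Single HSIC test of level alpha (permutation calibration is an assumption)."""
    rng = np.random.default_rng(seed)
    K = gaussian_gram(x, bandwidth_x)
    L = gaussian_gram(y, bandwidth_y)
    observed = hsic_biased(K, L)
    n = K.shape[0]
    null_stats = np.empty(n_perm)
    for b in range(n_perm):
        # Permuting one sample breaks any dependence between x and y.
        perm = rng.permutation(n)
        null_stats[b] = hsic_biased(K, L[np.ix_(perm, perm)])
    p_value = (1 + np.sum(null_stats >= observed)) / (n_perm + 1)
    return observed, p_value, p_value <= alpha
```

An aggregated procedure in the spirit of the abstract would run such a test over a collection of bandwidths with suitably corrected levels; how the levels are corrected is specific to the paper and is not reproduced here.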


Posted Content
TL;DR: This survey reviews recent theoretical progress on why simple gradient methods easily find near-optimal solutions to non-convex optimization problems and why, despite fitting the training data near-perfectly without any explicit effort to control model complexity, they exhibit excellent predictive accuracy.
Abstract: The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.

21 citations
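The implicit regularization mentioned in the abstract above can be seen in the simplest setting: overparametrized least squares. The sketch below assumes Gaussian synthetic data and plain gradient descent started at zero (the sizes `n`, `p` and the step count are arbitrary illustrative choices). In this setting the iterates stay in the row space of `X`, so they converge to the minimum-norm interpolating solution given by the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100  # more parameters than samples
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def gradient_descent_least_squares(X, y, n_steps=2000):
    """Plain gradient descent on 0.5 * ||X w - y||^2, started at w = 0."""
    # Step size 1 / L, where L is the largest eigenvalue of X^T X.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w -= step * X.T @ (X @ w - y)
    return w

w_gd = gradient_descent_least_squares(X, y)
w_min_norm = np.linalg.pinv(X) @ y  # minimum-norm interpolant

print("max |w_gd - w_min_norm| =", np.max(np.abs(w_gd - w_min_norm)))
print("max training residual   =", np.max(np.abs(X @ w_gd - y)))
```

No explicit penalty appears anywhere in the code; the preference for the small-norm solution comes only from the initialization and the update rule.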


Posted Content
TL;DR: In this paper, the Wasserstein distance between the actual distribution and a reference distribution with independent components is used to quantify the dependence between two random vectors of possibly different dimensions, and two additional coefficients rooted in the same ideas as the first two are proposed.
Abstract: To quantify the dependence between two random vectors of possibly different dimensions, we propose to rely on the properties of the 2-Wasserstein distance. We first propose two coefficients that are based on the Wasserstein distance between the actual distribution and a reference distribution with independent components. The coefficients are normalized to take values between 0 and 1, where 1 represents the maximal amount of dependence possible given the two multivariate margins. We then make a quasi-Gaussian assumption that yields two additional coefficients rooted in the same ideas as the first two. These different coefficients are more amenable for distributional results and admit attractive formulas in terms of the joint covariance or correlation matrix. Furthermore, maximal dependence is proved to occur at the covariance matrix with minimal von Neumann entropy given the covariance matrices of the two multivariate margins. This result also helps us revisit the RV coefficient by proposing a sharper normalisation. The two coefficients based on the quasi-Gaussian approach can be estimated easily via the empirical covariance matrix. The estimators are asymptotically normal and their asymptotic variances are explicit functions of the covariance matrix, which can thus be estimated consistently too. The results extend to the Gaussian copula case, in which case the estimators are rank-based. The results are illustrated through theoretical examples, Monte Carlo simulations, and a case study involving electroencephalography data.

16 citations


Posted Content
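TL;DR: In this article, the authors show that the test error of random features ridge regression is dominated by its approximation error and is larger than the error of kernel ridge regression as long as $N\le n^{1-\delta}$ for some $\delta>0$, while for $N\ge n^{1+\delta}$ random features achieve the same error as the corresponding kernel ridge regression.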
Abstract: Consider the classical supervised learning problem: we are given data $(y_i,{\boldsymbol x}_i)$, $i\le n$, with $y_i$ a response and ${\boldsymbol x}_i\in {\mathcal X}$ a covariates vector, and try to learn a model $f:{\mathcal X}\to{\mathbb R}$ to predict future responses. Random features methods map the covariates vector ${\boldsymbol x}_i$ to a point ${\boldsymbol \phi}({\boldsymbol x}_i)$ in a higher dimensional space ${\mathbb R}^N$, via a random featurization map ${\boldsymbol \phi}$. We study the use of random features methods in conjunction with ridge regression in the feature space ${\mathbb R}^N$. This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so called lazy training regime. We define a class of problems satisfying certain spectral conditions on the underlying kernels, and a hypercontractivity assumption on the associated eigenfunctions. These conditions are verified by classical high-dimensional examples. Under these conditions, we prove a sharp characterization of the error of random features ridge regression. In particular, we address two fundamental questions: $(1)$~What is the generalization error of KRR? $(2)$~How big $N$ should be for the random features approximation to achieve the same error as KRR? In this setting, we prove that KRR is well approximated by a projection onto the top $\ell$ eigenfunctions of the kernel, where $\ell$ depends on the sample size $n$. We show that the test error of random features ridge regression is dominated by its approximation error and is larger than the error of KRR as long as $N\le n^{1-\delta}$ for some $\delta>0$. We characterize this gap. For $N\ge n^{1+\delta}$, random features achieve the same error as the corresponding KRR, and further increasing $N$ does not lead to a significant change in test error.

16 citations
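As a rough illustration of the setup in the abstract above, the following sketch pairs ridge regression on $N$ random features with the corresponding kernel ridge regression. It assumes random Fourier features for a Gaussian kernel, which is one common featurization map and not necessarily the one analyzed in the paper; all function names are hypothetical.

```python
import numpy as np

def make_feature_map(d, n_features, bandwidth, rng):
    """Random Fourier feature map approximating a Gaussian kernel (an assumed choice)."""
    W = rng.standard_normal((d, n_features)) / bandwidth
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    # The same W and b must be reused for training and test inputs.
    return lambda X: np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def gaussian_kernel(A, B, bandwidth):
    """Exact Gaussian kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def rf_ridge_predict(Phi_train, y, Phi_test, lam):
    """Ridge regression in the N-dimensional feature space."""
    N = Phi_train.shape[1]
    coef = np.linalg.solve(Phi_train.T @ Phi_train + lam * np.eye(N),
                           Phi_train.T @ y)
    return Phi_test @ coef

def krr_predict(K_train, y, K_test_train, lam):
    """Kernel ridge regression with the same regularization parameter."""
    n = K_train.shape[0]
    return K_test_train @ np.linalg.solve(K_train + lam * np.eye(n), y)
```

Because `Phi @ Phi.T` approximates the kernel matrix as the number of features grows, the two predictors can be compared on held-out data for different values of $N$ relative to $n$.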


Posted Content
TL;DR: This paper derives non-asymptotic upper bounds for the prediction error of the empirical risk minimizer for feedforward deep neural regression and proposes a notion of network relative efficiency between two types of neural networks, which provides a quantitative measure for evaluating the relative merits of different network structures.
Abstract: In this paper, we study the properties of nonparametric least squares regression using deep neural networks. We derive non-asymptotic upper bounds for the prediction error of the empirical risk minimizer for feedforward deep neural regression. Our error bounds achieve the minimax optimal rate and significantly improve over the existing ones in the sense that they depend linearly or quadratically on the dimension d of the predictor, instead of exponentially on d. We show that the neural regression estimator can circumvent the curse of dimensionality under the assumption that the predictor is supported on an approximate low-dimensional manifold. This assumption differs from the structural condition imposed on the target regression function and is weaker and more realistic than the exact low-dimensional manifold support assumption in the existing literature. We investigate how the prediction error of the neural regression estimator depends on the structure of neural networks and propose a notion of network relative efficiency between two types of neural networks, which provides a quantitative measure for evaluating the relative merits of different network structures. Our results are derived under weaker assumptions on the data distribution, the target regression function and the neural network structure than those in the existing literature.

15 citations


Journal ArticleDOI
TL;DR: This paper introduces a class of $\gamma$-negatively dependent random samples and provides probabilistic upper bounds for their star discrepancy with explicitly stated dependence on $N$, $d$, and $\gamma$.
Abstract: We introduce a class of $\gamma$-negatively dependent random samples. We prove that this class includes, apart from Monte Carlo samples, in particular Latin hypercube samples and Latin hypercube samples padded by Monte Carlo. For a $\gamma$-negatively dependent $N$-point sample in dimension $d$ we provide probabilistic upper bounds for its star discrepancy with explicitly stated dependence on $N$, $d$, and $\gamma$. These bounds generalize the probabilistic bounds for Monte Carlo samples from [Heinrich et al., Acta Arith. 96 (2001), 279--302] and [C.~Aistleitner, J.~Complexity 27 (2011), 531--540], and they are optimal for Monte Carlo and Latin hypercube samples. In the special case of Monte Carlo samples the constants that appear in our bounds improve substantially on the constants presented in the latter paper and in [C.~Aistleitner, M.~T.~Hofer, Math. Comp.~83 (2014), 1373--1381].

14 citations
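For readers unfamiliar with the sampling schemes named in the abstract above, here is a minimal sketch of a Latin hypercube sample and of a Latin hypercube sample padded by Monte Carlo. It follows the standard textbook construction (one point per stratum in every coordinate); the function names are hypothetical and the star discrepancy itself is not computed.

```python
import numpy as np

def latin_hypercube_sample(n_points, dim, rng):
    """Draw an N-point Latin hypercube sample in [0, 1)^d."""
    # For each coordinate, a random permutation assigns one point per stratum
    # [k/N, (k+1)/N); a uniform jitter places the point within its stratum.
    strata = np.stack([rng.permutation(n_points) for _ in range(dim)], axis=1)
    jitter = rng.uniform(size=(n_points, dim))
    return (strata + jitter) / n_points

def padded_latin_hypercube_sample(n_points, lhs_dim, mc_dim, rng):
    """Latin hypercube coordinates padded with plain Monte Carlo coordinates."""
    lhs_part = latin_hypercube_sample(n_points, lhs_dim, rng)
    mc_part = rng.uniform(size=(n_points, mc_dim))
    return np.hstack([lhs_part, mc_part])
```

Each one-dimensional projection of the Latin hypercube part hits every interval of length $1/N$ exactly once, which is the stratification property that distinguishes it from plain Monte Carlo.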


Book ChapterDOI
TL;DR: It is shown how entropy and relative entropy can describe amalgamations in a simple way, while Aitchison distance requires the use of geometric means to obtain more succinct relationships, and the information monotonicity property for Aitchison distance is proved.
Abstract: Information geometry uses the formal tools of differential geometry to describe the space of probability distributions as a Riemannian manifold with an additional dual structure. The formal equivalence of compositional data with discrete probability distributions makes it possible to apply the same description to the sample space of Compositional Data Analysis (CoDA). The latter has been formally described as a Euclidean space with an orthonormal basis featuring components that are suitable combinations of the original parts. In contrast to the Euclidean metric, the information-geometric description singles out the Fisher information metric as the only one keeping the manifold’s geometric structure invariant under equivalent representations of the underlying random variables. Well-known concepts that are valid in Euclidean coordinates, e.g., the Pythagorean theorem, are generalized by information geometry to corresponding notions that hold for more general coordinates. In briefly reviewing Euclidean CoDA and, in more detail, the information-geometric approach, we show how the latter justifies the use of distance measures and divergences that so far have received little attention in CoDA as they do not fit the Euclidean geometry favored by current thinking. We also show how Shannon entropy and relative entropy can describe amalgamations in a simple way, while Aitchison distance requires the use of geometric means to obtain more succinct relationships. We proceed to prove the information monotonicity property for Aitchison distance. We close with some thoughts about new directions in CoDA where the rich structure that is provided by information geometry could be exploited.

14 citations
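The quantities named in the abstract above can be computed in a few lines. The sketch below uses the standard definition of Aitchison distance as the Euclidean distance between centered log-ratio (clr) transforms, alongside relative entropy and a simple amalgamation of parts. The helper names are hypothetical, and the example only illustrates the definitions; it does not reproduce the paper's monotonicity proof.

```python
import numpy as np

def closure(x):
    """Rescale a vector of positive parts so that it sums to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def clr(x):
    """Centered log-ratio transform: log parts minus the log geometric mean."""
    log_x = np.log(closure(x))
    return log_x - log_x.mean()

def aitchison_distance(x, y):
    """Euclidean distance between clr-transformed compositions."""
    return float(np.linalg.norm(clr(x) - clr(y)))

def relative_entropy(p, q):
    """Kullback-Leibler divergence between two compositions."""
    p, q = closure(p), closure(q)
    return float(np.sum(p * np.log(p / q)))

def amalgamate(x, groups):
    """Sum the parts within each group of indices."""
    x = closure(x)
    return np.array([x[list(g)].sum() for g in groups])

p = [0.1, 0.2, 0.3, 0.4]
q = [0.4, 0.3, 0.2, 0.1]
groups = [(0, 1), (2, 3)]
print(aitchison_distance(p, q))
print(relative_entropy(p, q),
      relative_entropy(amalgamate(p, groups), amalgamate(q, groups)))
```

The geometric mean enters through the `log_x.mean()` term in `clr`, which is why amalgamated parts (sums) do not combine as simply under Aitchison distance as they do under relative entropy.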


Posted Content
TL;DR: This work considers the problem of estimating a $d$-dimensional sub-manifold of $\mathbb{R}^D$ from a finite set of noisy samples and presents an algorithm that takes a point $r$ from the tubular neighborhood and outputs $\hat p_n\in \mathbb{R}^D$ and $\widehat{T_{\hat p_n}\mathcal{M}}$, an element in the Grassmannian $Gr(d, D)$.
Abstract: A common observation in data-driven applications is that high dimensional data has a low intrinsic dimension, at least locally. In this work, we consider the problem of estimating a $d$ dimensional sub-manifold of $\mathbb{R}^D$ from a finite set of noisy samples. Assuming that the data was sampled uniformly from a tubular neighborhood of $\mathcal{M}\in \mathcal{C}^k$, a compact manifold without boundary, we present an algorithm that takes a point $r$ from the tubular neighborhood and outputs $\hat p_n\in \mathbb{R}^D$, and $\widehat{T_{\hat p_n}\mathcal{M}}$ an element in the Grassmannian $Gr(d, D)$. We prove that as the number of samples $n\to\infty$ the point $\hat p_n$ converges to $p\in \mathcal{M}$ and $\widehat{T_{\hat p_n}\mathcal{M}}$ converges to $T_p\mathcal{M}$ (the tangent space at that point) with high probability. Furthermore, we show that the estimation yields asymptotic rates of convergence of $n^{-\frac{k}{2k + d}}$ for the point estimation and $n^{-\frac{k-1}{2k + d}}$ for the estimation of the tangent space. These rates are known to be optimal for the case of function estimation.

12 citations


Posted Content
TL;DR: Approximate Message Passing (AMP) algorithms have become extremely popular in various structured high-dimensional statistical problems over the last decade or so; this work presents the main ideas of AMP from a statistical perspective and strengthens and unifies many results in the existing literature.
Abstract: Over the last decade or so, Approximate Message Passing (AMP) algorithms have become extremely popular in various structured high-dimensional statistical problems. The fact that the origins of these techniques can be traced back to notions of belief propagation in the statistical physics literature lends a certain mystique to the area for many statisticians. Our goal in this work is to present the main ideas of AMP from a statistical perspective, to illustrate the power and flexibility of the AMP framework. Along the way, we strengthen and unify many of the results in the existing literature.

11 citations


Posted Content
TL;DR: In this paper, the authors describe conditions under which one can draw inferences about exposure effects when the exposures are misspecified. The main result is a proof of consistency under mild conditions on the errors introduced by the misspecification; consistency is achieved even if the errors are large, as long as they are sufficiently weakly dependent.
Abstract: Exposure mappings facilitate investigations of complex causal effects when units interact in experiments. Current methods assume that the exposures are correctly specified, but such an assumption cannot be verified, and its validity is often questionable. This paper describes conditions under which one can draw inferences about exposure effects when the exposures are misspecified. The main result is a proof of consistency under mild conditions on the errors introduced by the misspecification. The rate of convergence is determined by the dependence between units' specification errors, and consistency is achieved even if the errors are large as long as they are sufficiently weakly dependent. In other words, exposure effects can be precisely estimated also under misspecification as long as the units' exposures are not misspecified in the same way. The limiting distribution of the estimator is discussed. Asymptotic normality is achieved under stronger conditions than those needed for consistency. Similar conditions also facilitate conservative variance estimation.

11 citations


Posted Content
TL;DR: In this article, a data-driven method is proposed to extract stochastic dynamical systems with a non-Gaussian asymmetric (rather than symmetric) Lévy process, as well as Gaussian Brownian motion.
Abstract: Advances in data science are leading to new progress in the analysis and understanding of complex dynamics for systems with experimental and observational data. With numerous physical phenomena exhibiting bursting, flights, hopping, and intermittent features, stochastic differential equations with non-Gaussian Lévy noise are suitable to model these systems. Thus it is desirable and essential to infer such equations from available data to reasonably predict dynamical behaviors. In this work, we consider a data-driven method to extract stochastic dynamical systems with a non-Gaussian asymmetric (rather than symmetric) Lévy process, as well as Gaussian Brownian motion. We establish a theoretical framework and design a numerical algorithm to compute the asymmetric Lévy jump measure, drift and diffusion (i.e., nonlocal Kramers-Moyal formulas), hence obtaining the stochastic governing law, from noisy data. Numerical experiments on several prototypical examples confirm the efficacy and accuracy of this method. This method will become an effective tool in discovering the governing laws from available data sets and in understanding the mechanisms underlying complex random phenomena.

Posted Content
TL;DR: In this paper, the authors focus on derivation of the efficient influence function and explain how it may be used to construct statistical/machine-learning-based estimators, and discuss the requisite conditions for these estimators to perform well.
Abstract: Evaluation of treatment effects and more general estimands is typically achieved via parametric modelling, which is unsatisfactory since model misspecification is likely. Data-adaptive model building (e.g. statistical/machine learning) is commonly employed to reduce the risk of misspecification. Naive use of such methods, however, delivers estimators whose bias may shrink too slowly with sample size for inferential methods to perform well, including those based on the bootstrap. Bias arises because standard data-adaptive methods are tuned towards minimal prediction error as opposed to e.g. minimal MSE in the estimator. This may cause excess variability that is difficult to acknowledge, due to the complexity of such strategies. Building on results from non-parametric statistics, targeted learning and debiased machine learning overcome these problems by constructing estimators using the estimand's efficient influence function under the non-parametric model. These increasingly popular methodologies typically assume that the efficient influence function is given, or that the reader is familiar with its derivation. In this paper, we focus on derivation of the efficient influence function and explain how it may be used to construct statistical/machine-learning-based estimators. We discuss the requisite conditions for these estimators to perform well and use diverse examples to convey the broad applicability of the theory.

Posted Content
TL;DR: The proof of the failure of stable algorithms at values $2^{-\omega(n \log^{-1/5} n)}$ employs methods from Ramsey Theory in extremal combinatorics, and is of independent interest.
Abstract: We consider the algorithmic problem of finding a near-optimal solution for the number partitioning problem (NPP). The NPP appears in many applications, including the design of randomized controlled trials, multiprocessor scheduling, and cryptography; and is also of theoretical significance. It possesses a so-called statistical-to-computational gap: when its input $X$ has distribution $\mathcal{N}(0,I_n)$, its optimal value is $\Theta(\sqrt{n}2^{-n})$ w.h.p.; whereas the best polynomial-time algorithm achieves an objective value of only $2^{-\Theta(\log^2 n)}$, w.h.p. In this paper, we initiate the study of the nature of this gap. Inspired by insights from statistical physics, we study the landscape of NPP and establish the presence of the Overlap Gap Property (OGP), an intricate geometric property which is known to be rigorous evidence of algorithmic hardness for large classes of algorithms. By leveraging the OGP, we establish that (a) any sufficiently stable algorithm, appropriately defined, fails to find a near-optimal solution with energy below $2^{-\omega(n \log^{-1/5} n)}$; and (b) a very natural MCMC dynamics fails to find near-optimal solutions. Our simulations suggest that the state of the art algorithm achieving $2^{-\Theta(\log^2 n)}$ is indeed stable, but formally verifying this is left as an open problem. OGP regards the overlap structure of $m$-tuples of solutions achieving a certain objective value. When $m$ is constant we prove the presence of OGP in the regime $2^{-\Theta(n)}$, and the absence of it in the regime $2^{-o(n)}$. Interestingly, though, by considering overlaps with growing values of $m$ we prove the presence of the OGP up to the level $2^{-\omega(\sqrt{n\log n})}$. Our proof of the failure of stable algorithms at values $2^{-\omega(n \log^{-1/5} n)}$ employs methods from Ramsey Theory in extremal combinatorics, and is of independent interest.
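To give a feel for the polynomial-time side of the gap described in the abstract above, here is a minimal sketch of the Karmarkar-Karp largest differencing heuristic for number partitioning. Treating this heuristic as the relevant polynomial-time algorithm is an assumption made for illustration; the abstract does not name the algorithm it refers to. The function name is hypothetical.

```python
import heapq

def largest_differencing(numbers):
    """Karmarkar-Karp heuristic for number partitioning.

    Repeatedly replaces the two largest magnitudes by their difference,
    which commits them to opposite sides of the partition. Returns the
    final discrepancy |sum(A) - sum(B)| of the implied partition.
    """
    # heapq is a min-heap, so store negated magnitudes to pop the largest.
    heap = [-abs(x) for x in numbers]
    heapq.heapify(heap)
    while len(heap) > 1:
        a = -heapq.heappop(heap)
        b = -heapq.heappop(heap)
        heapq.heappush(heap, -(a - b))
    return -heap[0] if heap else 0

print(largest_differencing([8, 7, 6, 5, 4]))  # → 2
```

On this small input the heuristic returns 2 even though a perfect split exists (8 + 7 = 6 + 5 + 4), a toy version of the distance between what a fast heuristic finds and the true optimum.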

ReportDOI
TL;DR: In this paper, lower bounds on the minimax risk for estimating the regression function (i) at a point and (ii) under the infinity norm are derived, and pointwise and uniform convergence rates are calculated for the dyadic analog of the familiar Nadaraya-Watson (NW) kernel regression estimator.
Abstract: Let $i=1,\ldots,N$ index a simple random sample of units drawn from some large population. For each unit we observe the vector of regressors $X_{i}$ and, for each of the $N\left(N-1\right)$ ordered pairs of units, an outcome $Y_{ij}$. The outcomes $Y_{ij}$ and $Y_{kl}$ are independent if their indices are disjoint, but dependent otherwise (i.e., "dyadically dependent"). Let $W_{ij}=\left(X_{i}',X_{j}'\right)'$; using the sampled data we seek to construct a nonparametric estimate of the mean regression function $g\left(W_{ij}\right)\overset{def}{\equiv}\mathbb{E}\left[\left.Y_{ij}\right|X_{i},X_{j}\right].$ We present two sets of results. First, we calculate lower bounds on the minimax risk for estimating the regression function at (i) a point and (ii) under the infinity norm. Second, we calculate (i) pointwise and (ii) uniform convergence rates for the dyadic analog of the familiar Nadaraya-Watson (NW) kernel regression estimator. We show that the NW kernel regression estimator achieves the optimal rates suggested by our risk bounds when an appropriate bandwidth sequence is chosen. This optimal rate differs from the one available under iid data: the effective sample size is smaller and $d_W=\mathrm{dim}(W_{ij})$ influences the rate differently.
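A dyadic analog of the Nadaraya-Watson estimator from the abstract above can be sketched as a kernel-weighted average of $Y_{ij}$ over ordered pairs $i \ne j$. The version below assumes a product Gaussian kernel and a single scalar bandwidth, both illustrative choices that the abstract does not specify; the function name and array layout are hypothetical.

```python
import numpy as np

def dyadic_nadaraya_watson(X, Y, w, bandwidth):
    """Kernel-weighted average of Y[i, j] over ordered pairs i != j.

    X: (N, d) regressors; Y: (N, N) outcomes with finite entries
    (the diagonal is ignored); w: (2d,) evaluation point for
    W_ij = (X_i, X_j).
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    d = X.shape[1]
    # A product Gaussian kernel factorizes over the two halves of W_ij.
    k_i = np.exp(-0.5 * np.sum(((X - w[:d]) / bandwidth) ** 2, axis=1))
    k_j = np.exp(-0.5 * np.sum(((X - w[d:]) / bandwidth) ** 2, axis=1))
    weights = np.outer(k_i, k_j)
    np.fill_diagonal(weights, 0.0)  # drop the i == j terms
    return np.sum(weights * Y) / np.sum(weights)
```

Although the sum runs over $N(N-1)$ pairs, every summand involving unit $i$ shares the same $X_i$, which is the dyadic dependence that makes the effective sample size smaller than under iid data.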

Posted Content
TL;DR: In this paper, the authors review the properties of Bayesian procedures in high-dimensional models such as many normal means problems, linear regression, generalized linear models, Gaussian and non-Gaussian graphical models.
Abstract: Models with dimension more than the available sample size are now commonly used in various applications. A sensible inference is possible using a lower-dimensional structure. In regression problems with a large number of predictors, the model is often assumed to be sparse, with only a few predictors active. Interdependence between a large number of variables is succinctly described by a graphical model, where variables are represented by nodes on a graph and an edge between two nodes is used to indicate their conditional dependence given other variables. Many procedures for making inferences in the high-dimensional setting, typically using penalty functions to induce sparsity in the solution obtained by minimizing a loss function, were developed. Bayesian methods have been proposed for such problems more recently, where the prior takes care of the sparsity structure. These methods have the natural ability to also automatically quantify the uncertainty of the inference through the posterior distribution. Theoretical studies of Bayesian procedures in high-dimension have been carried out recently. Questions that arise are, whether the posterior distribution contracts near the true value of the parameter at the minimax optimal rate, whether the correct lower-dimensional structure is discovered with high posterior probability, and whether a credible region has adequate frequentist coverage. In this paper, we review these properties of Bayesian and related methods for several high-dimensional models such as many normal means problem, linear regression, generalized linear models, Gaussian and non-Gaussian graphical models. Effective computational approaches are also discussed.

Posted Content
TL;DR: In this article, the problem of recovering the hidden vertex correspondence between two edge-correlated random graphs is studied in the Gaussian model and the Erdős-Rényi model.
Abstract: This paper studies the problem of recovering the hidden vertex correspondence between two edge-correlated random graphs. We focus on the Gaussian model where the two graphs are complete graphs with correlated Gaussian weights and the Erdős-Rényi model where the two graphs are subsampled from a common parent Erdős-Rényi graph $\mathcal{G}(n,p)$. For dense graphs with $p=n^{-o(1)}$, we prove that there exists a sharp threshold, above which one can correctly match all but a vanishing fraction of vertices and below which correctly matching any positive fraction is impossible, a phenomenon known as the "all-or-nothing" phase transition. Even more strikingly, in the Gaussian setting, above the threshold all vertices can be exactly matched with high probability. In contrast, for sparse Erdős-Rényi graphs with $p=n^{-\Theta(1)}$, we show that the all-or-nothing phenomenon no longer holds and we determine the thresholds up to a constant factor. Along the way, we also derive the sharp threshold for exact recovery, sharpening the existing results in Erdős-Rényi graphs. The proof of the negative results builds upon a tight characterization of the mutual information based on the truncated second-moment computation and an "area theorem" that relates the mutual information to the integral of the reconstruction error. The positive results follow from a tight analysis of the maximum likelihood estimator that takes into account the cycle structure of the induced permutation on the edges.

Posted Content
TL;DR: The unified framework considered in this paper covers the case of linear, logistic or softmax regressions to name a few, and establishes almost sure convergence and rates of convergence of the algorithms, as well as central limit theorems for the constructed parameter estimates.
Abstract: The majority of machine learning methods can be regarded as the minimization of an unavailable risk function. To optimize the latter, given samples provided in a streaming fashion, we define a general stochastic Newton algorithm and its weighted average version. In several use cases, both implementations will be shown not to require the inversion of a Hessian estimate at each iteration, but a direct update of the estimate of the inverse Hessian instead will be favored. This generalizes a trick introduced in [2] for the specific case of logistic regression, by directly updating the estimate of the inverse Hessian. Under mild assumptions such as local strong convexity at the optimum, we establish almost sure convergence and rates of convergence of the algorithms, as well as central limit theorems for the constructed parameter estimates. The unified framework considered in this paper covers the case of linear, logistic or softmax regressions to name a few. Numerical experiments on simulated data give empirical evidence of the pertinence of the proposed methods, which outperform popular competitors particularly in the case of bad initializations.
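The idea of updating the inverse Hessian estimate directly, as described in the abstract above, can be sketched for streaming linear regression, where each sample adds a rank-one term $x x^\top$ to the Hessian estimate. The sketch assumes the Sherman-Morrison identity as the update rule and a ridge-type initial matrix; this is a recursive-least-squares style illustration with hypothetical names, not the paper's general algorithm or its weighted average version.

```python
import numpy as np

def sherman_morrison_update(A_inv, u):
    """Return (A + u u^T)^{-1} from A^{-1} for symmetric A, without inverting."""
    Au = A_inv @ u
    return A_inv - np.outer(Au, Au) / (1.0 + u @ Au)

def streaming_newton_linear_regression(stream, dim, ridge=1.0):
    """Newton-type streaming updates for least squares.

    stream yields (x, y) pairs; S_inv tracks the inverse of
    ridge * I + sum of x x^T over the samples seen so far.
    """
    theta = np.zeros(dim)
    S_inv = np.eye(dim) / ridge  # inverse of the initial matrix ridge * I
    for x, y in stream:
        S_inv = sherman_morrison_update(S_inv, x)
        # Gradient step on the squared loss, preconditioned by S_inv.
        theta = theta + (S_inv @ x) * (y - x @ theta)
    return theta, S_inv
```

Each update costs $O(d^2)$ operations instead of the $O(d^3)$ needed to invert the Hessian estimate from scratch, which is the practical benefit the abstract alludes to.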

Posted Content
TL;DR: In this paper, the authors consider statistical methods which invoke a min-max distributionally robust formulation to extract good out-of-sample performance in data-driven optimization and learning problems.
Abstract: We consider statistical methods which invoke a min-max distributionally robust formulation to extract good out-of-sample performance in data-driven optimization and learning problems. Acknowledging the distributional uncertainty in learning from limited samples, the min-max formulations introduce an adversarial inner player to explore unseen covariate data. The resulting Distributionally Robust Optimization (DRO) formulations, which include Wasserstein DRO formulations (our main focus), are specified using optimal transportation phenomena. Upon describing how these infinite-dimensional min-max problems can be approached via a finite-dimensional dual reformulation, the tutorial moves into its main component, namely, explaining a generic recipe for optimally selecting the size of the adversary's budget. This is achieved by studying the limit behavior of an optimal transport projection formulation arising from an inquiry on the smallest confidence region that includes the unknown population risk minimizer. Incidentally, this systematic prescription coincides with those in specific examples in high-dimensional statistics and results in error bounds that are free from the curse of dimensions. Equipped with this prescription, we present a central limit theorem for the DRO estimator and provide a recipe for constructing compatible confidence regions that are useful for uncertainty quantification. The rest of the tutorial is devoted to insights into the nature of the optimizers selected by the min-max formulations and additional applications of optimal transport projections.

Book ChapterDOI
TL;DR: In this paper, a profile least squares estimator is proposed to estimate the regression parameter in a monotone single index regression model, and it is shown to be \(\sqrt{n}\)-convergent and asymptotically normal.
Abstract: We consider least squares estimators of the finite regression parameter \(\boldsymbol{\alpha }\) in the single index regression model \(Y=\psi (\boldsymbol{\alpha }^T\boldsymbol{X})+\varepsilon \), where \(\boldsymbol{X}\) is a d-dimensional random vector, \({\mathbb E}(Y|\boldsymbol{X})=\psi (\boldsymbol{\alpha }^T\boldsymbol{X})\), and \(\psi \) is monotone. It has been suggested to estimate \(\boldsymbol{\alpha }\) by a profile least squares estimator, minimizing \(\sum _{i=1}^n(Y_i-\psi (\boldsymbol{\alpha }^T\boldsymbol{X}_i))^2\) over monotone \(\psi \) and \(\boldsymbol{\alpha }\) on the boundary \(\mathcal {S}_{d-1}\) of the unit ball. Although this suggestion has been around for a long time, it is still unknown whether the estimate is \(\sqrt{n}\)-convergent. We show that a profile least squares estimator, using the same pointwise least squares estimator for fixed \(\boldsymbol{\alpha }\), but using a different global sum of squares, is \(\sqrt{n}\)-convergent and asymptotically normal. The difference between the corresponding loss functions is studied and also a comparison with other methods is given.

Book Chapter
TL;DR: The theoretical approximations and practical recommendations are extended to the problem of constructing efficient quantization designs in a cube, and new construction schemes are given which provide even better coverings than the ones found numerically in the authors' earlier paper (Noonan and Zhigljavsky, SN Oper Res Forum, 2020).
Abstract: The main problem considered in this paper is construction and theoretical study of efficient n-point coverings of a d-dimensional cube [−1, 1]^d. Targeted values of d are between 5 and 50; n can be in hundreds or thousands and the designs (collections of points) are nested. This paper is a continuation of our paper (Noonan and Zhigljavsky, SN Oper Res Forum, 2020), where we have theoretically investigated several simple schemes and numerically studied many more. In this paper, we extend the theoretical constructions of (Noonan and Zhigljavsky, SN Oper Res Forum, 2020) for studying the designs that were found to be superior to the ones theoretically investigated in (Noonan and Zhigljavsky, SN Oper Res Forum, 2020). We also extend our constructions for new construction schemes that provide even better coverings (in the class of nested designs) than the ones numerically found in (Noonan and Zhigljavsky, SN Oper Res Forum, 2020). In view of a close connection of the problem of quantization to the problem of covering, we extend our theoretical approximations and practical recommendations to the problem of construction of efficient quantization designs in a cube [−1, 1]^d. In the last section, we discuss the problems of covering and quantization in a d-dimensional simplex; practical significance of this problem has been communicated to the authors by Professor Michael Vrahatis, a co-editor of the present volume.

Posted Content
TL;DR: The upper and lower error bounds for Gaussian process regression with possibly misspecified correlation functions are derived and the optimal convergence rate can be attained even if the smoothness of the imposed correlation function exceeds that of the true correlation function and the sampling scheme is quasi-uniform.
Abstract: In this work, we investigate Gaussian process regression used to recover a function based on noisy observations. We derive upper and lower error bounds for Gaussian process regression with possibly misspecified correlation functions. The optimal convergence rate can be attained even if the smoothness of the imposed correlation function exceeds that of the true correlation function and the sampling scheme is quasi-uniform. As byproducts, we also obtain convergence rates of kernel ridge regression with misspecified kernel function, where the underlying truth is a deterministic function. The convergence rates of Gaussian process regression and kernel ridge regression are closely connected, which is aligned with the relationship between sample paths of Gaussian process and the corresponding reproducing kernel Hilbert space.
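The close connection between Gaussian process regression and kernel ridge regression noted in this abstract can be checked numerically: the posterior mean with noise variance $\sigma^2$ coincides with the kernel ridge estimate whose penalty is $\lambda = \sigma^2 / n$ when the ridge objective uses the averaged squared loss. The sketch below assumes a one-dimensional input and a Gaussian kernel with a hypothetical `length_scale`; the abstract does not specify the kernel, and all function names are illustrative.

```python
import numpy as np

def gaussian_kernel(a, b, length_scale=0.3):
    """Gaussian kernel matrix between 1-D point sets a and b (assumed kernel choice)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior_mean(x, y, x_new, noise_var):
    """Posterior mean of a zero-mean GP: k(x_new, x) (K + noise_var I)^{-1} y."""
    K = gaussian_kernel(x, x)
    weights = np.linalg.solve(K + noise_var * np.eye(len(x)), y)
    return gaussian_kernel(x_new, x) @ weights

def kernel_ridge(x, y, x_new, lam):
    """Minimizer of (1/n) sum (y_i - f(x_i))^2 + lam ||f||^2 over the RKHS."""
    n = len(x)
    K = gaussian_kernel(x, x)
    coef = np.linalg.solve(K + n * lam * np.eye(n), y)
    return gaussian_kernel(x_new, x) @ coef
```

Calling `gp_posterior_mean(x, y, x_new, noise_var)` and `kernel_ridge(x, y, x_new, noise_var / len(x))` on the same data returns the same predictions up to floating-point error.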

Posted Content
TL;DR: It is shown that the context-specific information encoded by a CStree can be equivalently expressed via a collection of DAGs, and a global Markov property for CStrees is obtained which leads to a graphical criterion of model equivalence for CStrees generalizing that of Verma and Pearl for DAG models.
Abstract: We consider the problem of representing causal models that encode context-specific information for discrete data. To represent such models we use a proper subclass of staged tree models which we call CStrees. We show that the context-specific information encoded by a CStree can be equivalently expressed via a collection of DAGs. As not all staged tree models admit this property, CStrees are a subclass that provides a transparent, intuitive and compact representation of context-specific causal information. Model equivalence for CStrees also takes a simpler form than for general staged trees: We provide a characterization of the complete set of asymmetric conditional independence relations encoded by a CStree. As a consequence, we obtain a global Markov property for CStrees which leads to a graphical criterion of model equivalence for CStrees generalizing that of Verma and Pearl for DAG models. In addition, we provide a closed-form formula for the maximum likelihood estimator of a CStree and use it to show that the Bayesian information criterion is a locally consistent score function for this model class. We also give an analogous global Markov property and characterization of model equivalence for general interventions in CStrees. As examples, we apply these results to two real data sets, and examine how BIC-optimal CStrees for each provide a clear and concise representation of the learned context-specific causal structure.

Book Chapter
TL;DR: This paper shows the asymptotic equivalence of several bootstrapped processes related to the empirical copula and empirical beta copula, and investigates the finite-sample properties of resampling schemes based on the empirical (beta) copula by Monte Carlo simulation.
Abstract: The empirical beta copula is a simple but effective smoother of the empirical copula. Because it is a genuine copula, from which it is particularly easy to sample, it is reasonable to expect that resampling procedures based on the empirical beta copula are expedient and accurate. In this paper, after reviewing the literature on some bootstrap approximations for the empirical copula process, we first show the asymptotic equivalence of several bootstrapped processes related to the empirical and empirical beta copulas. Then we investigate the finite-sample properties of resampling schemes based on the empirical (beta) copula by Monte Carlo simulation. More specifically, we consider interval estimation for functionals such as the rank correlation coefficients and dependence parameters of several well-known families of copulas. Here, we construct confidence intervals using several methods and compare their accuracy and efficiency. We also compute the actual size and power of symmetry tests based on several resampling schemes for the empirical and empirical beta copulas.
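The remark that the empirical beta copula is easy to sample from can be illustrated with a short sketch. It relies on the mixture representation of the empirical beta copula: pick an observation uniformly at random, then draw each coordinate independently from a Beta(r, n + 1 − r) distribution, where r is the rank of that observation in the corresponding coordinate. The function names are hypothetical and the data are assumed to have no ties.

```python
import numpy as np

def ranks(data):
    """Column-wise ranks in 1..n (assumes no ties within a column)."""
    return np.argsort(np.argsort(data, axis=0), axis=0) + 1

def sample_empirical_beta_copula(data, size, rng):
    """Draw `size` points from the empirical beta copula of an (n, d) data array."""
    n, _ = data.shape
    R = ranks(data)
    rows = rng.integers(0, n, size=size)   # pick observations uniformly at random
    r = R[rows]                            # their rank vectors, shape (size, d)
    return rng.beta(r, n + 1 - r)          # independent Beta(r, n + 1 - r) per coordinate
```

Each margin of the resulting sample is uniform on (0, 1), while the dependence between coordinates follows the ranks of the original data.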

Posted Content
TL;DR: In this article, the question of constructing efficient multiple testing procedures in a nonparametric hidden Markov model with two states is considered, treating one of the states as an unknown null hypothesis, and a procedure based on nonparametric empirical Bayes ideas is introduced that controls the False Discovery Rate (FDR) at a user-specified level.
Abstract: Given a nonparametric Hidden Markov Model (HMM) with two states, the question of constructing efficient multiple testing procedures is considered, treating one of the states as an unknown null hypothesis. A procedure is introduced, based on nonparametric empirical Bayes ideas, that controls the False Discovery Rate (FDR) at a user-specified level. Guarantees on power are also provided, in the form of a control of the true positive rate. One of the key steps in the construction requires supremum-norm convergence of preliminary estimators of the emission densities of the HMM. We establish the existence of such estimators, with convergence at the optimal minimax rate, for the case of an HMM with $J\ge 2$ states, which is of independent interest.
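The abstract does not spell out the rejection rule, so the following is only a sketch of a common empirical Bayes thresholding step, assumed here for illustration: given estimated posterior probabilities of the null state for each observation (often called ℓ-values), reject the hypotheses with the smallest values for as long as their running average stays at or below the target level. How the ℓ-values are estimated in the HMM is outside this sketch, and the function name is hypothetical.

```python
import numpy as np

def lvalue_procedure(lvalues, level):
    """Reject the hypotheses with the smallest l-values whose running mean is <= level.

    `lvalues` are assumed posterior null probabilities, one per hypothesis.
    Returns a boolean mask of rejections.
    """
    lvalues = np.asarray(lvalues, dtype=float)
    order = np.argsort(lvalues, kind="stable")
    running_mean = np.cumsum(lvalues[order]) / np.arange(1, len(lvalues) + 1)
    # The running mean of an ascending sequence is nondecreasing,
    # so counting the entries below the level gives the largest admissible cut.
    n_reject = int(np.sum(running_mean <= level))
    reject = np.zeros(len(lvalues), dtype=bool)
    reject[order[:n_reject]] = True
    return reject
```

The average ℓ-value over the rejected set estimates the posterior expected proportion of false discoveries, which is why keeping it below the level targets FDR control.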

Posted Content
TL;DR: In this article, a general theory on rates of convergence of penalized spline estimators is developed for function estimation when the likelihood functional is concave in candidate functions, where the likelihood is interpreted in a broad sense that includes conditional likelihood, quasi-likelihood, and pseudo-likelihood.
Abstract: This paper develops a general theory on rates of convergence of penalized spline estimators for function estimation when the likelihood functional is concave in candidate functions, where the likelihood is interpreted in a broad sense that includes conditional likelihood, quasi-likelihood, and pseudo-likelihood. The theory allows all feasible combinations of the spline degree, the penalty order, and the smoothness of the unknown functions. According to this theory, the asymptotic behavior of the penalized spline estimators depends on the interplay between the spline knot number and the penalty parameter. The general theory is applied to obtain results in a variety of contexts, including regression, generalized regression such as logistic regression and Poisson regression, density estimation, conditional hazard function estimation for censored data, quantile regression, diffusion function estimation for a diffusion type process, and estimation of spectral density function of a stationary time series. For multi-dimensional function estimation, the theory (presented in the Supplementary Material) covers both penalized tensor product splines and penalized bivariate splines on triangulations.

Posted Content
TL;DR: In this article, the authors study the problem of testing the existence of a dense subhypergraph and establish sharp detection boundaries in both scenarios: (1) the edge probabilities are known; (2) the edge probabilities are unknown.
Abstract: We study the problem of testing the existence of a dense subhypergraph. The null hypothesis is an Erdős-Rényi uniform random hypergraph and the alternative hypothesis is a uniform random hypergraph that contains a dense subhypergraph. We establish sharp detection boundaries in both scenarios: (1) the edge probabilities are known; (2) the edge probabilities are unknown. In both scenarios, the sharp detection boundaries are characterized by the appropriate model parameters. Asymptotically powerful tests are provided when the model parameters fall in the detectable regions. Our results indicate that the detectable regions for general hypergraph models are dramatically different from their graph counterparts.

Posted Content
TL;DR: In this article, properties of the Fréchet mean on the infinite-dimensional Hilbert sphere are derived, including its existence and uniqueness and a root-$n$ central limit theorem (CLT) for the sample version, and intrinsic CLTs are obtained for the estimated tangent vectors and covariance operator.
Abstract: The infinite-dimensional Hilbert sphere $S^\infty$ has been widely employed to model density functions and shapes, extending the finite-dimensional counterpart. We consider the Fréchet mean as an intrinsic summary of the central tendency of data lying on $S^\infty$. To pave the way for sound statistical inference, we derive properties of the Fréchet mean on $S^\infty$ by establishing its existence and uniqueness as well as a root-$n$ central limit theorem (CLT) for the sample version, overcoming obstructions from infinite-dimensionality and lack of compactness on $S^\infty$. Intrinsic CLTs for the estimated tangent vectors and covariance operator are also obtained. Asymptotic and bootstrap hypothesis tests for the Fréchet mean based on projection and norm are then proposed and are shown to be consistent. The proposed two-sample tests are applied to make inference for daily taxi demand patterns over Manhattan modeled as densities, of which the square roots are analyzed on the Hilbert sphere. Numerical properties of the proposed hypothesis tests which utilize the spherical geometry are studied in the real data application and simulations, where we demonstrate that the tests based on the intrinsic geometry compare favorably to those based on an extrinsic or flat geometry.

Posted Content
TL;DR: In this paper, the authors studied the structural and statistical properties of the Gaussian-smoothed Wasserstein distance in high dimensions and provided asymptotic guarantees for two-sample testing and minimum distance estimation.
Abstract: Statistical distances, i.e., discrepancy measures between probability distributions, are ubiquitous in probability theory, statistics and machine learning. To combat the curse of dimensionality when estimating these distances from data, recent work has proposed smoothing out local irregularities in the measured distributions via convolution with a Gaussian kernel. Motivated by the scalability of the smooth framework to high dimensions, we conduct an in-depth study of the structural and statistical behavior of the Gaussian-smoothed $p$-Wasserstein distance $\mathsf{W}_p^{(\sigma)}$, for arbitrary $p\geq 1$. We start by showing that $\mathsf{W}_p^{(\sigma)}$ admits a metric structure that is topologically equivalent to classic $\mathsf{W}_p$ and is stable with respect to perturbations in $\sigma$. Moving to statistical questions, we explore the asymptotic properties of $\mathsf{W}_p^{(\sigma)}(\hat{\mu}_n,\mu)$, where $\hat{\mu}_n$ is the empirical distribution of $n$ i.i.d. samples from $\mu$. To that end, we prove that $\mathsf{W}_p^{(\sigma)}$ is controlled by a $p$th order smooth dual Sobolev norm $\mathsf{d}_p^{(\sigma)}$. Since $\mathsf{d}_p^{(\sigma)}(\hat{\mu}_n,\mu)$ coincides with the supremum of an empirical process indexed by Gaussian-smoothed Sobolev functions, it lends itself well to analysis via empirical process theory. We derive the limit distribution of $\sqrt{n}\mathsf{d}_p^{(\sigma)}(\hat{\mu}_n,\mu)$ in all dimensions $d$, when $\mu$ is sub-Gaussian. Through the aforementioned bound, this implies a parametric empirical convergence rate of $n^{-1/2}$ for $\mathsf{W}_p^{(\sigma)}$, contrasting the $n^{-1/d}$ rate for unsmoothed $\mathsf{W}_p$ when $d \geq 3$. As applications, we provide asymptotic guarantees for two-sample testing and minimum distance estimation. When $p=2$, we further show that $\mathsf{d}_2^{(\sigma)}$ can be expressed as a maximum mean discrepancy.
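As a rough illustration of the smoothing idea, the Gaussian-smoothed distance is the ordinary $p$-Wasserstein distance between the two measures after each is convolved with a Gaussian of standard deviation $\sigma$. In one dimension this can be approximated by Monte Carlo: add independent Gaussian noise to each sample and use the sorted-sample formula for equal-size empirical measures. This sketch is restricted to $d = 1$ and equal sample sizes, which are simplifying assumptions made here, and the function names are hypothetical.

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """p-Wasserstein distance between two equal-size 1-D empirical measures."""
    x, y = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

def smoothed_wasserstein_1d(x, y, sigma, p=1, rng=None):
    """Monte Carlo approximation of the Gaussian-smoothed distance in 1-D.

    Adding independent N(0, sigma^2) noise to a sample produces draws from
    the convolution of the underlying measure with a Gaussian kernel.
    """
    rng = np.random.default_rng() if rng is None else rng
    x_smooth = x + sigma * rng.standard_normal(x.shape)
    y_smooth = y + sigma * rng.standard_normal(y.shape)
    return wasserstein_1d(x_smooth, y_smooth, p)
```

For two distributions that differ only by a location shift, smoothing leaves the distance unchanged, which makes a shifted pair a convenient sanity check.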

Posted Content
TL;DR: A more flexible model is provided which relaxes the linearity assumption by replacing it by an arbitrary additive form and establishes statistical guarantees for the resulting estimators, which can be used to prove consistency if the dimension and the number of functional principal components diverge to infinity with the sample size.
Abstract: We consider the problem of constructing nonparametric undirected graphical models for high-dimensional functional data. Most existing statistical methods in this context assume either a Gaussian distribution on the vertices or linear conditional means. In this article we provide a more flexible model which relaxes the linearity assumption by replacing it by an arbitrary additive form. The use of functional principal components offers an estimation strategy that uses a group lasso penalty to estimate the relevant edges of the graph. We establish statistical guarantees for the resulting estimators, which can be used to prove consistency if the dimension and the number of functional principal components diverge to infinity with the sample size. We also investigate the empirical performance of our method through simulation studies and a real data application.

Posted Content
TL;DR: In this article, the spectral convergence of the graph Laplacian to the Laplace-Beltrami operator is studied when the graph affinity matrix is constructed from random samples on a manifold embedded in a possibly high dimensional space.
Abstract: This work studies the spectral convergence of the graph Laplacian to the Laplace-Beltrami operator when the graph affinity matrix is constructed from $N$ random samples on a $d$-dimensional manifold embedded in a possibly high dimensional space. By analyzing Dirichlet form convergence and constructing candidate approximate eigenfunctions via convolution with the manifold heat kernel, we prove that, with a Gaussian kernel, one can set the kernel bandwidth parameter $\epsilon \sim (\log N/ N)^{1/(d/2+2)}$ such that the eigenvalue convergence rate is $N^{-1/(d/2+2)}$ and the eigenvector convergence in 2-norm has rate $N^{-1/(d+4)}$; when $\epsilon \sim N^{-1/(d/2+3)}$, both eigenvalue and eigenvector rates are $N^{-1/(d/2+3)}$. These rates are up to a $\log N$ factor and proved for finitely many low-lying eigenvalues. The result holds for un-normalized and random-walk graph Laplacians when data are uniformly sampled on the manifold, as well as the density-corrected graph Laplacian (where the affinity matrix is normalized by the degree matrix from both sides) with non-uniformly sampled data. As an intermediate result, we prove new point-wise and Dirichlet form convergence rates for the density-corrected graph Laplacian. Numerical results are provided to verify the theory.
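The construction in this abstract can be sketched numerically for a simple manifold. The code below builds a Gaussian-kernel affinity matrix from sample points, forms the random-walk graph Laplacian, and returns its smallest eigenvalues, with the bandwidth set by the first scaling in the abstract (constant taken to be 1). The kernel normalization $\exp(-\|x_i - x_j\|^2 / (4\epsilon))$, the omission of any overall rescaling of the Laplacian, and the function names are assumptions made for illustration; the abstract does not fix these conventions.

```python
import numpy as np

def bandwidth(n_samples, dim):
    """Bandwidth epsilon = (log N / N)^(1 / (d/2 + 2)), with the constant taken as 1."""
    return (np.log(n_samples) / n_samples) ** (1.0 / (dim / 2 + 2))

def random_walk_laplacian_eigenvalues(points, eps, k=5):
    """Smallest k eigenvalues of I - D^{-1} W for a Gaussian affinity matrix W."""
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (4.0 * eps))      # assumed kernel normalization
    deg = W.sum(axis=1)
    # I - D^{-1} W is similar to the symmetric matrix I - D^{-1/2} W D^{-1/2},
    # so both share the same spectrum; the symmetric form allows eigvalsh.
    S = W / np.sqrt(np.outer(deg, deg))
    eigs = np.linalg.eigvalsh(np.eye(len(points)) - S)
    return eigs[:k]
```

For points sampled on the unit circle (so `dim=1`), the smallest eigenvalue is zero up to rounding, and the following ones appear in near-degenerate pairs, mirroring the multiplicity-two spectrum of the Laplace-Beltrami operator on the circle.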