Posted Content

On the rate of convergence in Wasserstein distance of the empirical measure

TL;DR: In this paper, the authors study the rate of convergence of the empirical measure of an $N$-sample to the underlying probability distribution on $\mathbb{R}^d$, measured in the Wasserstein distance of order $p>0$, and provide satisfying non-asymptotic $L^p$-bounds and concentration inequalities for any values of $p>0$ and $d\geq 1$.
Abstract: Let $\mu_N$ be the empirical measure associated with an $N$-sample of a given probability distribution $\mu$ on $\mathbb{R}^d$. We are interested in the rate of convergence of $\mu_N$ to $\mu$, when measured in the Wasserstein distance of order $p>0$. We provide some satisfying non-asymptotic $L^p$-bounds and concentration inequalities, for any values of $p>0$ and $d\geq 1$. We also extend the non-asymptotic $L^p$-bounds to stationary $\rho$-mixing sequences, Markov chains, and to some interacting particle systems.
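
The dimension dependence of these rates can be probed numerically. Below is a minimal sketch (not from the paper), assuming only numpy and scipy: in $d=1$, where scipy.stats.wasserstein_distance computes $W_1$ between empirical distributions, $\mathbb{E}[W_1(\mu_N,\mu)]$ for $\mu$ uniform on $[0,1]$ decays at roughly the rate $N^{-1/2}$; the fine quantile grid is a stand-in for $\mu$ itself.

    # Minimal sketch (assumes numpy and scipy): empirical W_1 rate in d = 1.
    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(0)
    # Fine quantile grid standing in for the true distribution mu = U[0,1].
    grid = (np.arange(100_000) + 0.5) / 100_000

    for n in [100, 1_000, 10_000]:
        # Average W_1(mu_N, mu) over 20 independent N-samples.
        err = np.mean([wasserstein_distance(rng.uniform(size=n), grid)
                       for _ in range(20)])
        print(f"N = {n:6d}  E[W_1] ~ {err:.4f}  N**-0.5 = {n ** -0.5:.4f}")
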
Citations
Posted Content
TL;DR: This short book reviews OT with a bias toward numerical methods and their applications in data sciences, and sheds light on the theoretical properties of OT that make it particularly useful for some of these applications.
Abstract: Optimal transport (OT) theory can be informally described using the words of the French mathematician Gaspard Monge (1746-1818): A worker with a shovel in hand has to move a large pile of sand lying on a construction site. The goal of the worker is to erect with all that sand a target pile with a prescribed shape (for example, that of a giant sand castle). Naturally, the worker wishes to minimize her total effort, quantified for instance as the total distance or time spent carrying shovelfuls of sand. Mathematicians interested in OT cast that problem as that of comparing two probability distributions, two different piles of sand of the same volume. They consider all of the many possible ways to morph, transport or reshape the first pile into the second, and associate a "global" cost to every such transport, using the "local" consideration of how much it costs to move a grain of sand from one place to another. Recent years have witnessed the spread of OT in several fields, thanks to the emergence of approximate solvers that can scale to sizes and dimensions that are relevant to data sciences. Thanks to this newfound scalability, OT is being increasingly used to unlock various problems in imaging sciences (such as color or texture processing), computer vision and graphics (for shape manipulation) or machine learning (for regression, classification and density fitting). This short book reviews OT with a bias toward numerical methods and their applications in data sciences, and sheds light on the theoretical properties of OT that make it particularly useful for some of these applications.
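
As an illustration of the discrete problem such solvers address, here is a hedged sketch assuming the POT package (imported as ot, using its documented ot.dist and ot.emd functions); it computes an exact optimal coupling between two small point clouds:

    # Hedged sketch: discrete optimal transport with the POT package.
    import numpy as np
    import ot  # https://pythonot.github.io (assumed installed)

    rng = np.random.default_rng(0)
    xs = rng.normal(size=(50, 2))            # source "pile of sand"
    xt = rng.normal(loc=3.0, size=(50, 2))   # target pile, shifted
    a = np.full(50, 1 / 50)                  # uniform source weights
    b = np.full(50, 1 / 50)                  # uniform target weights
    M = ot.dist(xs, xt)                      # squared-Euclidean cost matrix
    plan = ot.emd(a, b, M)                   # exact optimal coupling (LP)
    print("W_2^2 =", float(np.sum(plan * M)))  # total transport cost
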

1,355 citations

Posted Content
TL;DR: It is demonstrated that the distributionally robust optimization problems over Wasserstein balls can in fact be reformulated as finite convex programs—in many interesting cases even as tractable linear programs.
Abstract: We consider stochastic programs where the distribution of the uncertain parameters is only observable through a finite training dataset. Using the Wasserstein metric, we construct a ball in the space of (multivariate and non-discrete) probability distributions centered at the uniform distribution on the training samples, and we seek decisions that perform best in view of the worst-case distribution within this Wasserstein ball. The state-of-the-art methods for solving the resulting distributionally robust optimization problems rely on global optimization techniques, which quickly become computationally excruciating. In this paper we demonstrate that, under mild assumptions, the distributionally robust optimization problems over Wasserstein balls can in fact be reformulated as finite convex programs, in many interesting cases even as tractable linear programs. Leveraging recent measure concentration results, we also show that their solutions enjoy powerful finite-sample performance guarantees. Our theoretical results are exemplified in mean-risk portfolio optimization as well as uncertainty quantification.
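
To convey the flavor of such reformulations, here is a minimal sketch of a standard special case (an assumption of this note, not a quotation of the paper's general theorem): on the whole real line, the worst-case expectation of an $L$-Lipschitz loss over a $1$-Wasserstein ball of radius $\varepsilon$ around the empirical distribution reduces to the sample average plus $\varepsilon L$, so the robust program collapses to a closed form:

    # Sketch of a standard special case (assumption, not the paper's theorem):
    # worst case of an L-Lipschitz loss over a 1-Wasserstein ball of radius
    # eps around the empirical distribution on R is mean(loss) + eps * L.
    import numpy as np

    rng = np.random.default_rng(0)
    xi = rng.normal(size=200)      # training samples
    eps, L = 0.1, 1.0              # ball radius; Lipschitz constant of the loss
    loss = np.abs(xi - 0.5)        # loss(x) = |x - 0.5| is 1-Lipschitz
    print("worst-case expectation =", loss.mean() + eps * L)
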

808 citations

Posted Content
TL;DR: The paper argues that the set of distributions hedged against should be chosen to suit the application at hand, and that some choices that have been popular until recently are, for many applications, not good ones.
Abstract: Distributionally robust stochastic optimization (DRSO) is an approach to optimization under uncertainty in which, instead of assuming that there is an underlying probability distribution that is known exactly, one hedges against a chosen set of distributions. In this paper, we consider sets of distributions that are within a chosen Wasserstein distance from a nominal distribution. We argue that such a choice of sets has two advantages: (1) The resulting distributions hedged against are more reasonable than those resulting from other popular choices of sets, such as the $\Phi$-divergence ambiguity set. (2) The problem of determining the worst-case expectation has desirable tractability properties. We derive a dual reformulation of the corresponding DRSO problem and construct approximate worst-case distributions (or an exact worst-case distribution if it exists) explicitly via the first-order optimality conditions of the dual problem. Our contributions are five-fold. (i) We identify necessary and sufficient conditions for the existence of a worst-case distribution, which is naturally related to the growth rate of the objective function. (ii) We show that the worst-case distributions resulting from an appropriate Wasserstein distance have a concise structure and a clear interpretation. (iii) Using this structure, we show that data-driven DRSO problems can be approximated to any accuracy by robust optimization problems, and thereby many DRSO problems become tractable by using tools from robust optimization. (iv) To the best of our knowledge, our proof of strong duality is the first constructive proof for DRSO problems, and we show that the constructive proof technique is also useful in other contexts. (v) Our strong duality result holds in a very general setting, and we show that it can be applied to infinite-dimensional process control problems and worst-case value-at-risk analysis.
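
For orientation, the dual reformulation referred to above takes the following standard form (notation mine, not quoted from the paper): for a nominal distribution $\nu$, radius $\varepsilon$, metric $d$, and objective $\Psi$,

    \sup_{\mu :\, W_p(\mu,\nu) \le \varepsilon} \int \Psi \, d\mu
      \;=\; \min_{\lambda \ge 0} \Big\{ \lambda \varepsilon^p
            + \int \sup_{\xi} \big[ \Psi(\xi) - \lambda\, d(\xi,\zeta)^p \big] \, \nu(d\zeta) \Big\},

and the (approximate) worst-case distributions are read off from the inner suprema at an optimal $\lambda$.
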

505 citations

Posted Content
TL;DR: This paper surveys the main concepts and contributions to DRO, along with its relationships with robust optimization, risk aversion, chance-constrained optimization, and function regularization.
Abstract: The concepts of risk aversion, chance-constrained optimization, and robust optimization have developed significantly over the last decade. The statistical learning community has also witnessed rapid theoretical and applied growth by relying on these concepts. A modeling framework, called distributionally robust optimization (DRO), has recently received significant attention in both the operations research and statistical learning communities. This paper surveys the main concepts and contributions to DRO, and its relationships with robust optimization, risk aversion, chance-constrained optimization, and function regularization.

348 citations

Journal ArticleDOI
TL;DR: This work considers the fundamental question of how quickly the empirical measure obtained from $n$ independent samples from $\mu$ approaches $\mu$ in the Wasserstein distance of any order, and proves sharp asymptotic and finite-sample results for this rate of convergence for general measures on general compact metric spaces.
Abstract: The Wasserstein distance between two probability measures on a metric space is a measure of closeness with applications in statistics, probability, and machine learning. In this work, we consider the fundamental question of how quickly the empirical measure obtained from $n$ independent samples from $\mu$ approaches $\mu$ in the Wasserstein distance of any order. We prove sharp asymptotic and finite-sample results for this rate of convergence for general measures on general compact metric spaces. Our finite-sample results show the existence of multi-scale behavior, where measures can exhibit radically different rates of convergence as $n$ grows.
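
The multi-scale phenomenon can be probed numerically. The following rough sketch (not the authors' experiment) assumes the POT package for exact $W_1$ computations and approximates $W_1(\mu_n,\mu)$ by a two-sample distance against a large reference sample; the uniform measure on $[0,1]^3$ should decay like $n^{-1/3}$, while a measure on 5 atoms decays at the parametric rate $n^{-1/2}$:

    # Rough sketch (assumes the POT package): dimension-dependent W_1 rates.
    import numpy as np
    import ot

    rng = np.random.default_rng(0)
    atoms = rng.uniform(size=(5, 3))  # support of a 5-atom measure in R^3

    def w1(x, y):
        """Exact W_1 between uniform empirical measures on x and y."""
        a = np.full(len(x), 1 / len(x))
        b = np.full(len(y), 1 / len(y))
        return ot.emd2(a, b, ot.dist(x, y, metric="euclidean"))

    # Large reference samples play the role of the true measures.
    ref_cont = rng.uniform(size=(4000, 3))
    ref_disc = atoms[rng.integers(5, size=4000)]
    for n in [50, 200, 800]:
        cont = w1(rng.uniform(size=(n, 3)), ref_cont)         # ~ n**(-1/3)
        disc = w1(atoms[rng.integers(5, size=n)], ref_disc)   # ~ n**(-1/2)
        print(n, round(cont, 3), round(disc, 3))
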

317 citations

References
Book
14 Mar 1996
TL;DR: This book develops weak convergence theory for possibly non-measurable maps and the theory of empirical processes, including maximal inequalities, Glivenko-Cantelli and Donsker theorems, and statistical applications such as M- and Z-estimation, rates of convergence, and the bootstrap.
Abstract (table of contents): 1.1. Introduction; 1.2. Outer Integrals and Measurable Majorants; 1.3. Weak Convergence; 1.4. Product Spaces; 1.5. Spaces of Bounded Functions; 1.6. Spaces of Locally Bounded Functions; 1.7. The Ball Sigma-Field and Measurability of Suprema; 1.8. Hilbert Spaces; 1.9. Convergence: Almost Surely and in Probability; 1.10. Convergence: Weak, Almost Uniform, and in Probability; 1.11. Refinements; 1.12. Uniformity and Metrization; 2.1. Introduction; 2.2. Maximal Inequalities and Covering Numbers; 2.3. Symmetrization and Measurability; 2.4. Glivenko-Cantelli Theorems; 2.5. Donsker Theorems; 2.6. Uniform Entropy Numbers; 2.7. Bracketing Numbers; 2.8. Uniformity in the Underlying Distribution; 2.9. Multiplier Central Limit Theorems; 2.10. Permanence of the Donsker Property; 2.11. The Central Limit Theorem for Processes; 2.12. Partial-Sum Processes; 2.13. Other Donsker Classes; 2.14. Tail Bounds; 3.1. Introduction; 3.2. M-Estimators; 3.3. Z-Estimators; 3.4. Rates of Convergence; 3.5. Random Sample Size, Poissonization and Kac Processes; 3.6. The Bootstrap; 3.7. The Two-Sample Problem; 3.8. Independence Empirical Processes; 3.9. The Delta-Method; 3.10. Contiguity; 3.11. Convolution and Minimax Theorems; A. Appendix: A.1. Inequalities; A.2. Gaussian Processes; A.2.1. Inequalities and Gaussian Comparison; A.2.2. Exponential Bounds; A.2.3. Majorizing Measures; A.2.4. Further Results; A.3. Rademacher Processes; A.4. Isoperimetric Inequalities for Product Measures; A.5. Some Limit Theorems; A.6. More Inequalities; A.6.1. Binomial Random Variables; A.6.2. Multinomial Random Vectors; A.6.3. Rademacher Sums; Notes; References; Author Index; List of Symbols.

5,231 citations

Book
01 Mar 2003
TL;DR: This book gives a graduate-level treatment of optimal transportation, covering the Kantorovich duality, the geometry of optimal transportation, Brenier's polar factorization theorem, the Monge-Ampère equation, displacement convexity, and both metric and differential viewpoints on optimal transportation, together with transportation inequalities.
Abstract (table of contents): Introduction; The Kantorovich duality; Geometry of optimal transportation; Brenier's polar factorization theorem; The Monge-Ampère equation; Displacement interpolation and displacement convexity; Geometric and Gaussian inequalities; The metric side of optimal transportation; A differential point of view on optimal transportation; Entropy production and transportation inequalities; Problems; Bibliography; Table of short statements; Index.
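
For reference, the Kantorovich duality listed above is the statement (standard form, notation mine) that the least transport cost equals the best lower bound obtainable from pairs of potentials:

    \inf_{\pi \in \Pi(\mu,\nu)} \int c(x,y) \, d\pi(x,y)
      \;=\; \sup \Big\{ \int \varphi \, d\mu + \int \psi \, d\nu
            \;:\; \varphi(x) + \psi(y) \le c(x,y) \Big\}.
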

4,808 citations

Book
01 Jan 2001
TL;DR: This monograph studies the concentration of measure phenomenon, from isoperimetric and functional examples through concentration in product spaces, entropy methods, and transportation cost inequalities, to sharp bounds on Gaussian and empirical processes and selected applications.
Abstract (table of contents): Concentration functions and inequalities; Isoperimetric and functional examples; Concentration and geometry; Concentration in product spaces; Entropy and concentration; Transportation cost inequalities; Sharp bounds of Gaussian and empirical processes; Selected applications; References; Index.

2,324 citations

Book
01 Jan 1994
TL;DR: This monograph provides a study of mixing, the analysis of dependence between sigma-fields defined on the same underlying probability space, and of its applications in probability and statistics.
Abstract: Mixing is concerned with the analysis of dependence between sigma-fields defined on the same underlying probability space. It provides an important tool of analysis for random fields, Markov processes and central limit theorems as well as being a topic of current research interest in its own right. The aim of this monograph is to provide a study of applications of dependence in probability and statistics. It is divided into two parts, the first covering the definitions and probabilistic properties of mixing theory, the second describing mixing properties of classical processes and random fields as well as providing a detailed study of linear and Gaussian fields.
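
The $\rho$-mixing sequences to which the main abstract extends its $L^p$-bounds fit this framework. Below is a small sketch, under the assumption (standard for stationary Gaussian sequences) that for a Gaussian AR(1) chain with coefficient $a$ the $\rho$-mixing coefficients coincide with the lag-$k$ correlations $|a|^k$, so mixing decays geometrically:

    # Sketch (assumes numpy): geometric correlation decay of a Gaussian AR(1).
    import numpy as np

    rng = np.random.default_rng(0)
    a, T = 0.7, 200_000
    x = np.empty(T)
    x[0] = rng.normal() / np.sqrt(1 - a ** 2)  # start in stationarity
    for t in range(1, T):
        x[t] = a * x[t - 1] + rng.normal()

    for k in [1, 2, 5, 10]:
        corr = np.corrcoef(x[:-k], x[k:])[0, 1]  # sample lag-k correlation
        print(f"lag {k:2d}: sample corr = {corr:.3f}  a**k = {a ** k:.3f}")
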

1,487 citations

Book
25 Aug 2008
TL;DR: This book develops exponential, information, concentration, and maximal inequalities, together with the theory of Gaussian processes, and applies them to Gaussian model selection, density estimation via model selection, and statistical learning.
Abstract (table of contents): Exponential and Information Inequalities; Gaussian Processes; Gaussian Model Selection; Concentration Inequalities; Maximal Inequalities; Density Estimation via Model Selection; Statistical Learning.

1,115 citations