Author

Wenlong Mou

Bio: Wenlong Mou is an academic researcher from the University of California, Berkeley. The author has contributed to research in topics including Mathematics and Convex functions. The author has an h-index of 13 and has co-authored 25 publications receiving 413 citations.

Papers
Proceedings ArticleDOI
01 Aug 2017
TL;DR: The RRPSGD (Random Round Private Stochastic Gradient Descent) algorithm is proposed for non-convex but smooth objectives; it provably converges to a stationary point with a privacy guarantee.
Abstract: In this paper, we consider efficient differentially private empirical risk minimization from the viewpoint of optimization algorithms. For strongly convex and smooth objectives, we prove that gradient descent with output perturbation not only achieves nearly optimal utility, but also significantly improves the running time of previous state-of-the-art private optimization algorithms, for both $\epsilon$-DP and $(\epsilon, \delta)$-DP. For non-convex but smooth objectives, we propose an RRPSGD (Random Round Private Stochastic Gradient Descent) algorithm, which provably converges to a stationary point with a privacy guarantee. Besides the expected utility bounds, we also provide guarantees in high-probability form. Experiments demonstrate that our algorithm consistently outperforms existing methods in both utility and running time.

93 citations
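The output-perturbation scheme described above is simple to sketch: run plain gradient descent on the strongly convex, smooth empirical risk, then add Gaussian noise to the final iterate. The sketch below is a minimal illustration, not the paper's exact algorithm: it assumes unit-norm features, an $\ell_2$-sensitivity bound of $2/(n\lambda)$ for $\lambda$-strongly-convex ERM, and the standard Gaussian-mechanism calibration for $(\epsilon, \delta)$-DP; the paper's constants and running-time analysis differ.

```python
import numpy as np

def dp_erm_output_perturbation(X, y, lam=0.1, eps=1.0, delta=1e-5,
                               steps=200, lr=0.1, rng=None):
    """Gradient descent on L2-regularized logistic loss, then output
    perturbation: Gaussian noise added to the final iterate.

    Sketch only: the sensitivity bound 2/(n*lam) assumes unit-norm rows
    of X and lam-strong convexity; constants differ from the paper's.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # gradient of mean logistic loss (labels in {0,1}) + L2 penalty
        grad = X.T @ (1.0 / (1.0 + np.exp(-(X @ w))) - y) / n + lam * w
        w -= lr * grad
    # l2-sensitivity of the ERM minimizer under a one-record change
    sensitivity = 2.0 / (n * lam)
    # Gaussian mechanism calibration for (eps, delta)-DP
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return w + rng.normal(0.0, sigma, size=d)

# usage on synthetic unit-norm data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = (X @ rng.normal(size=5) > 0).astype(float)
w_priv = dp_erm_output_perturbation(X, y)
```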

Posted Content
TL;DR: In this article, the authors studied the generalization error of Stochastic Gradient Langevin Dynamics with non-convex objectives and proposed two theories with non-asymptotic discrete-time analysis, using stability and PAC-Bayesian results respectively.
Abstract: Algorithm-dependent generalization error bounds are central to statistical learning theory. A learning algorithm may use a large hypothesis space, but the limited number of iterations controls its model capacity and generalization error. The impact of stochastic gradient methods on generalization error for non-convex learning problems not only has important theoretical consequences, but is also critical to the generalization errors of deep learning. In this paper, we study the generalization errors of Stochastic Gradient Langevin Dynamics (SGLD) with non-convex objectives. Two theories are proposed with non-asymptotic discrete-time analysis, using stability and PAC-Bayesian results respectively. The stability-based theory obtains a bound of $O\left(\frac{1}{n}L\sqrt{\beta T_k}\right)$, where $L$ is the uniform Lipschitz parameter, $\beta$ is the inverse temperature, and $T_k$ is the aggregated step size. For the PAC-Bayesian theory, though the bound has a slower $O(1/\sqrt{n})$ rate, the contribution of each step carries an exponentially decaying factor once $\ell^2$ regularization is imposed, and the uniform Lipschitz constant is replaced by the actual norms of gradients along the trajectory. Our bounds have no implicit dependence on dimensions, norms, or other capacity measures of the parameters, which elegantly characterizes the phenomenon of "Fast Training Guarantees Generalization" in non-convex settings. This is the first algorithm-dependent result with reasonable dependence on aggregated step sizes for non-convex learning, and it has important implications for the statistical learning aspects of stochastic gradient methods in complicated models such as deep learning.

60 citations
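For reference, the SGLD iteration these bounds concern is ordinary SGD plus Gaussian noise scaled by the step size and the inverse temperature $\beta$; the bounds above are then driven by $\beta$ and the aggregated step sizes $T_k = \sum_i \eta_i$. A minimal sketch, with a toy quadratic gradient standing in for a non-convex mini-batch gradient:

```python
import numpy as np

def sgld(grad_fn, theta0, step_sizes, beta=100.0, rng=None):
    """Stochastic Gradient Langevin Dynamics:

        theta_{k+1} = theta_k - eta_k * g_k + sqrt(2 * eta_k / beta) * N(0, I),

    where g_k is a (mini-batch) gradient estimate. The generalization
    bounds above depend on beta and the aggregated step sizes sum(eta_i).
    """
    rng = np.random.default_rng(rng)
    theta = np.asarray(theta0, dtype=float).copy()
    for eta in step_sizes:
        g = grad_fn(theta)
        theta += -eta * g + np.sqrt(2.0 * eta / beta) * rng.normal(size=theta.shape)
    return theta

# usage: noisy descent on f(x) = ||x||^2 / 2, whose gradient is x
theta = sgld(lambda x: x, theta0=np.ones(3), step_sizes=[0.01] * 1000)
```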

Proceedings Article
17 Jul 2017
TL;DR: This work gives differentially private and efficient algorithms achieving strong guarantees for k-means and k-median clustering when $d = \Omega(\mathrm{polylog}(n))$, advancing the state-of-the-art result of $\sqrt{d}\cdot\mathrm{OPT} + \mathrm{poly}(\log n, d, k)$.
Abstract: We study the problem of clustering sensitive data while preserving the privacy of individuals represented in the dataset, which has broad applications in practical machine learning and data analysis tasks. Although the problem has been widely studied in the context of low-dimensional, discrete spaces, much remains unknown concerning private clustering in high-dimensional Euclidean spaces $\mathbb{R}^d$. In this work, we give differentially private and efficient algorithms achieving strong guarantees for k-means and k-median clustering when $d = \Omega(\mathrm{polylog}(n))$. Our algorithm achieves clustering loss at most $\log(n)\cdot\mathrm{OPT} + \mathrm{poly}(\log n, d, k)$, advancing the state-of-the-art result of $\sqrt{d}\cdot\mathrm{OPT} + \mathrm{poly}(\log n, d, k)$. We also study the case where the data points are $s$-sparse and show that the clustering loss can scale logarithmically with $d$, i.e., $\log(n)\cdot\mathrm{OPT} + \mathrm{poly}(\log n, \log d, k, s)$. Experiments on both synthetic and real datasets verify the effectiveness of the proposed method.

55 citations
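The paper's algorithm relies on dimension reduction to polylog-dimensional instances and is more involved than can be shown here. As a baseline illustration of what privatizing a clustering step means mechanically, the sketch below runs one Lloyd (k-means) iteration with the Gaussian mechanism applied to each cluster's point-sum and count. This is a standard textbook construction under an assumed norm bound $\|x\| \le 1$, not the paper's method, and the noise scale `sigma` is left as a free parameter rather than calibrated to a specific $(\epsilon, \delta)$.

```python
import numpy as np

def dp_lloyd_step(X, centers, sigma, rng=None):
    """One Lloyd iteration for k-means, privatized with the Gaussian
    mechanism: noise added to each cluster's point-sum and point-count.

    Baseline construction only (assumes ||x|| <= 1 so each sum has
    bounded sensitivity); NOT the paper's dimension-reduction algorithm.
    """
    rng = np.random.default_rng(rng)
    k, d = centers.shape
    # assign each point to its nearest center
    labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    new_centers = centers.copy()
    for j in range(k):
        pts = X[labels == j]
        noisy_sum = pts.sum(axis=0) + rng.normal(0.0, sigma, size=d)
        noisy_cnt = max(len(pts) + rng.normal(0.0, sigma), 1.0)
        new_centers[j] = noisy_sum / noisy_cnt
    return new_centers
```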

Posted Content
TL;DR: An RRPSGD (Random Round Private Stochastic Gradient Descent) algorithm is proposed that provably converges to a stationary point with a privacy guarantee and consistently outperforms existing methods in both utility and running time.
Abstract: In this paper, we consider efficient differentially private empirical risk minimization from the viewpoint of optimization algorithms. For strongly convex and smooth objectives, we prove that gradient descent with output perturbation not only achieves nearly optimal utility, but also significantly improves the running time of previous state-of-the-art private optimization algorithms, for both $\epsilon$-DP and $(\epsilon, \delta)$-DP. For non-convex but smooth objectives, we propose an RRPSGD (Random Round Private Stochastic Gradient Descent) algorithm, which provably converges to a stationary point with a privacy guarantee. Besides the expected utility bounds, we also provide guarantees in high-probability form. Experiments demonstrate that our algorithm consistently outperforms existing methods in both utility and running time.

48 citations
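The "random round" in RRPSGD suggests the usual device for extracting stationary-point guarantees from non-convex SGD: perturb each gradient with Gaussian noise (the source of the privacy guarantee) and output the iterate from a uniformly random round. The sketch below is a plausible reading along those lines, with the paper's privacy accounting and noise calibration omitted; treat every constant here as an assumption.

```python
import numpy as np

def rrpsgd_sketch(grad_fn, theta0, T=1000, lr=0.05, noise_std=0.1, rng=None):
    """Sketch of one plausible reading of Random Round Private SGD for
    non-convex objectives: SGD with Gaussian gradient perturbation,
    returning the iterate at a uniformly random round (the standard
    device for stationary-point guarantees). The paper's exact noise
    calibration and privacy accounting are omitted.
    """
    rng = np.random.default_rng(rng)
    theta = np.asarray(theta0, dtype=float).copy()
    stop = rng.integers(1, T + 1)   # uniformly random output round in 1..T
    for t in range(1, T + 1):
        g = grad_fn(theta) + rng.normal(0.0, noise_std, size=theta.shape)
        theta -= lr * g
        if t == stop:
            out = theta.copy()
    return out
```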

Posted Content
TL;DR: An improved analysis of the Euler-Maruyama discretization of the Langevin diffusion is presented; it does not require global contractivity, yields polynomial dependence on the time horizon, and simultaneously improves all methods based on Dalalyan's approach.
Abstract: We present an improved analysis of the Euler-Maruyama discretization of the Langevin diffusion. Our analysis does not require global contractivity, and yields polynomial dependence on the time horizon. Compared to existing approaches, we make an additional smoothness assumption, and improve the existing rate from $O(\eta)$ to $O(\eta^2)$ in terms of the KL divergence. This result matches the correct order for numerical SDEs, without suffering from exponential time dependence. When applied to algorithms for sampling and learning, this result simultaneously improves all those methods based on Dalalyan's approach.

40 citations
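The scheme being analyzed is the standard Euler-Maruyama discretization of the Langevin diffusion $dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,dB_t$, which iterates $x_{k+1} = x_k - \eta \nabla U(x_k) + \sqrt{2\eta}\,\xi_k$ with $\xi_k \sim N(0, I)$. A minimal sketch:

```python
import numpy as np

def euler_maruyama_langevin(grad_U, x0, eta, n_steps, rng=None):
    """Euler-Maruyama discretization of dX = -grad U(X) dt + sqrt(2) dB:

        x_{k+1} = x_k - eta * grad_U(x_k) + sqrt(2 * eta) * N(0, I).

    The paper's improved analysis bounds the KL divergence between the
    law of the iterates and the diffusion at O(eta^2) under an extra
    smoothness assumption.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        x += -eta * grad_U(x) + np.sqrt(2.0 * eta) * rng.normal(size=x.shape)
    return x

# usage: sample from exp(-||x||^2 / 2), i.e. U(x) = ||x||^2 / 2
sample = euler_maruyama_langevin(lambda x: x, np.zeros(2), eta=0.01, n_steps=5000)
```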


Cited by
Book
01 Jan 1991
TL;DR: The third edition of this text presents the theory of linear systems together with the local, global, and bifurcation theory of nonlinear systems.
Abstract: Series Preface * Preface to the Third Edition * 1 Linear Systems * 2 Nonlinear Systems: Local Theory * 3 Nonlinear Systems: Global Theory * 4 Nonlinear Systems: Bifurcation Theory * References * Index

1,977 citations

Proceedings Article
24 May 2019
TL;DR: This paper shows that gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet), and extends the analysis to deep residual convolutional neural networks, obtaining a similar convergence result.
Abstract: Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.

626 citations
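A toy numerical illustration of the phenomenon the paper proves: full-batch gradient descent on a sufficiently wide network drives the training loss toward zero despite non-convexity. The sketch below uses a two-layer ReLU network with a fixed random output layer, far simpler than the paper's deep-ResNet setting; every hyperparameter is an arbitrary choice for demonstration, not taken from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def train_overparam_net(X, y, width=2048, steps=500, lr=0.5, rng=None):
    """Full-batch gradient descent on a wide two-layer ReLU network with
    the output layer fixed at random signs (as in these analyses); only
    the first-layer weights W are trained. Returns the final training
    loss, which shrinks toward zero in the over-parameterized regime.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    W = rng.normal(size=(d, width)) / np.sqrt(d)               # trained layer
    a = rng.choice([-1.0, 1.0], size=width) / np.sqrt(width)   # fixed outputs
    for _ in range(steps):
        H = relu(X @ W)                 # hidden activations, shape (n, width)
        resid = H @ a - y               # prediction residuals
        # gradient of 0.5 * ||H a - y||^2 / n with respect to W
        G = X.T @ (np.outer(resid, a) * (X @ W > 0)) / n
        W -= lr * G
    return 0.5 * np.mean((relu(X @ W) @ a - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=50)
print(train_overparam_net(X, y))  # training loss shrinks toward 0
```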

Posted Content
TL;DR: This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural net with random labels leads to slower training, and a data-dependent complexity measure.
Abstract: Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: (i) Using a tighter characterization of training speed than recent papers, an explanation for why training a neural net with random labels leads to slower training, as originally observed in [Zhang et al. ICLR'17]. (ii) Generalization bound independent of network size, using a data-dependent complexity measure. Our measure distinguishes clearly between random labels and true labels on MNIST and CIFAR, as shown by experiments. Moreover, recent papers require sample complexity to increase (slowly) with the size, while our sample complexity is completely independent of the network size. (iii) Learnability of a broad class of smooth functions by 2-layer ReLU nets trained via gradient descent. The key idea is to track dynamics of training and generalization via properties of a related kernel.

476 citations
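The "related kernel" referenced above has a closed form for a two-layer ReLU net with Gaussian first-layer weights: the Gram matrix has entries $H^\infty_{ij} = \langle x_i, x_j\rangle (\pi - \theta_{ij})/(2\pi)$, where $\theta_{ij}$ is the angle between $x_i$ and $x_j$. A minimal sketch of computing it, together with the associated data-dependent quantity $y^\top (H^\infty)^{-1} y$ (treated here, up to constants, as the complexity measure; the exact normalization is an assumption of this sketch):

```python
import numpy as np

def ntk_gram_relu(X):
    """Gram matrix H_inf for a two-layer ReLU net with random Gaussian
    first-layer weights (closed form used in this line of work):

        H_ij = <x_i, x_j> * (pi - theta_ij) / (2 * pi),

    where theta_ij is the angle between x_i and x_j. Rows of X are
    assumed unit-norm.
    """
    G = X @ X.T
    theta = np.arccos(np.clip(G, -1.0, 1.0))
    return G * (np.pi - theta) / (2.0 * np.pi)

# usage: compute the kernel and the data-dependent complexity quantity
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5)); X /= np.linalg.norm(X, axis=1, keepdims=True)
H = ntk_gram_relu(X)
y = rng.choice([-1.0, 1.0], size=20)
complexity = float(y @ np.linalg.solve(H, y))  # y^T H^{-1} y, up to constants
```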