Showing papers by Huishuai Zhang published in 2020


Posted Content
TL;DR: In this paper, the authors show that the location of layer normalization is crucial to Transformer training, and that the learning rate warm-up stage can be safely removed for Pre-LN Transformers.
Abstract: The Transformer is widely used in natural language processing tasks. To train a Transformer, however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but slows down optimization and requires more hyper-parameter tuning. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as the Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach results comparable to baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.
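For concreteness, the sketch below contrasts the two layer-normalization placements the abstract refers to. It is a minimal PyTorch rendering of a single encoder block, not the authors' code; the dimensions and the feed-forward design are illustrative defaults.

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original Transformer: LayerNorm sits between the residual blocks."""
    def __init__(self, d=512, heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        # normalize AFTER adding the residual
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.ln2(x + self.ffn(x))

class PreLNBlock(nn.Module):
    """Pre-LN variant: LayerNorm moved inside each residual branch,
    which the paper shows keeps gradients well-behaved at initialization."""
    def __init__(self, d=512, heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        # normalize BEFORE each sub-layer; the residual path stays identity
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.ln2(x))
```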

373 citations


Proceedings Article
12 Jul 2020
TL;DR: It is proved with mean field theory that at initialization, for the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large, so using a large learning rate on them makes training unstable.
Abstract: The Transformer is widely used in natural language processing tasks. To train a Transformer, however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but slows down optimization and requires more hyper-parameter tuning. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as the Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach results comparable to baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.

64 citations


Journal ArticleDOI
TL;DR: In this article, a truncated gradient descent algorithm is proposed to improve robustness against outliers, where the truncation adaptively rules out, in each iteration, the contributions of samples whose measurement residuals deviate significantly from the sample median.
Abstract: Recent work has demonstrated the effectiveness of gradient descent for directly recovering the factors of low-rank matrices from random linear measurements in a globally convergent manner when initialized properly. However, the performance of existing algorithms is highly sensitive to outliers that may take arbitrary values. In this paper, we propose a truncated gradient descent algorithm to improve the robustness against outliers, where the truncation is performed to rule out the contributions of samples that deviate significantly from the sample median of measurement residuals adaptively in each iteration. We demonstrate that, when initialized in a basin of attraction close to the ground truth, the proposed algorithm converges to the ground truth at a linear rate for the Gaussian measurement model with a near-optimal number of measurements, even when a constant fraction of the measurements are arbitrarily corrupted. In addition, we propose a new truncated spectral method that ensures an initialization in the basin of attraction under slightly higher sample requirements. We finally provide numerical experiments to validate the superior performance of the proposed approach.
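A toy NumPy sketch of the median-truncation step described above, assuming measurements y_k = <A_k, UU^T> with a fraction arbitrarily corrupted. The paper's truncated spectral initialization is replaced by a random start here, and the step size and truncation constant are placeholders.

```python
import numpy as np

def median_truncated_gd(A, y, rank, step=0.2, c=3.0, iters=300, seed=0):
    """A: (m, n, n) measurement matrices; y: (m,) possibly corrupted measurements.
    Recovers a rank-`rank` factor U with M = U @ U.T (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    U = rng.standard_normal((n, rank)) / np.sqrt(n)      # crude random start
    for _ in range(iters):
        res = np.einsum('kij,ij->k', A, U @ U.T) - y     # residual per sample
        # truncation: drop samples whose residual is far from the sample median
        keep = np.abs(res) <= c * np.median(np.abs(res))
        Asym = A[keep] + np.transpose(A[keep], (0, 2, 1))
        grad = np.einsum('k,kij->ij', res[keep], Asym) @ U / keep.sum()
        U -= step * grad                                  # GD on the kept samples only
    return U
```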

33 citations


Journal ArticleDOI
TL;DR: A comprehensive understanding of algorithmic convergence with respect to data homogeneity is obtained by measuring the smoothness of the discrepancy between the local and global loss functions, and it is shown that when the data are less balanced, regularization can be used to ensure convergence, albeit at a slower rate.
Abstract: Stochastic variance reduced methods have gained a lot of interest recently for empirical risk minimization due to their appealing run-time complexity. When the data size is large and disjointly stored on different machines, it becomes imperative to distribute the implementation of such variance reduced methods. In this paper, we consider a general framework that directly distributes popular stochastic variance reduced methods in the master/slave model, by assigning outer loops to the parameter server and inner loops to worker machines. This framework is natural and easy to implement, but its theoretical convergence is not well understood. We obtain a comprehensive understanding of algorithmic convergence with respect to data homogeneity by measuring the smoothness of the discrepancy between the local and global loss functions. We establish the linear convergence of distributed versions of a family of stochastic variance reduced algorithms, including those using accelerated and recursive gradient updates, for minimizing strongly convex losses. Our theory captures how the convergence of distributed algorithms behaves as the number of machines and the size of local data vary. Furthermore, we show that when the data are less balanced, regularization can be used to ensure convergence, albeit at a slower rate. We also demonstrate that our analysis can be further extended to handle nonconvex loss functions.
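The master/worker split described above can be sketched in a few lines. This single-process simulation for ridge regression is only meant to show where the outer and inner loops live; the function names, sampling scheme, and hyper-parameters are ours, not the paper's.

```python
import numpy as np

def distributed_svrg(X_parts, y_parts, lam=0.1, step=0.05, outer=20, inner=100, seed=0):
    """X_parts/y_parts: per-machine data shards for ridge regression."""
    rng = np.random.default_rng(seed)
    d = X_parts[0].shape[1]
    w_snap = np.zeros(d)

    def grad(X, y, w):  # ridge-regression gradient on a (mini-)batch
        return X.T @ (X @ w - y) / len(y) + lam * w

    for _ in range(outer):          # outer loop: runs on the parameter server
        # server aggregates local full gradients at the snapshot point
        full_grad = np.mean([grad(X, y, w_snap) for X, y in zip(X_parts, y_parts)], axis=0)
        w = w_snap.copy()
        for _ in range(inner):      # inner loop: runs on a worker machine
            m = rng.integers(len(X_parts))
            i = rng.integers(len(y_parts[m]))
            xi, yi = X_parts[m][i:i + 1], y_parts[m][i:i + 1]
            # variance-reduced stochastic gradient
            v = grad(xi, yi, w) - grad(xi, yi, w_snap) + full_grad
            w -= step * v
        w_snap = w                  # server refreshes the snapshot
    return w_snap
```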

29 citations


Posted Content
Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, Tie-Yan Liu
TL;DR: This work establishes the optimal membership inference when the model is trained with augmented data, which inspires the authors to formulate the MI attack as a set classification problem, i.e., classifying a set of augmented instances instead of a single data point, and to design input permutation-invariant features.
Abstract: It is observed in the literature that data augmentation can significantly mitigate membership inference (MI) attacks. However, in this work, we challenge this observation by proposing new MI attacks that utilize the information of augmented data. MI attacks are widely used to measure a model's information leakage about its training set. We establish the optimal membership inference when the model is trained with augmented data, which inspires us to formulate the MI attack as a set classification problem, i.e., classifying a set of augmented instances instead of a single data point, and to design input permutation-invariant features. Empirically, we demonstrate that the proposed approach universally outperforms original methods when the model is trained with data augmentation. Furthermore, we show that the proposed approach can achieve higher MI attack success rates on models trained with some data augmentation than the existing methods achieve on models trained without data augmentation. Notably, we achieve a 70.1% MI attack success rate on CIFAR10 against a wide residual network, while the previous best approach only attains 61.9%. This suggests the privacy risk of models trained with data augmentation could be largely underestimated.
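A minimal sketch of the set-classification idea: the per-example feature is the multiset of the target model's losses on the augmented views, made order-invariant by sorting. `model_loss` and `augmentations` are assumed interfaces for illustration, not the authors' API.

```python
import numpy as np

def set_features(model_loss, x, y, augmentations):
    """Permutation-invariant feature for set-based membership inference:
    evaluate the target model's loss on every augmented view of (x, y),
    then sort, so the feature does not depend on the order of the views."""
    losses = np.array([model_loss(aug(x), y) for aug in augmentations])
    return np.sort(losses)
```

A binary classifier (member vs. non-member) is then trained on these feature vectors; sorting is one simple way to realize the permutation invariance the abstract calls for.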

28 citations


Posted Content
TL;DR: A novel adaptive optimizer named Adaptive Inertia Estimation (Adai), which uses parameter-wise adaptive inertia to accelerate training and provably favors flat minima as much as SGD.
Abstract: Adaptive Momentum Estimation (Adam), which combines Adaptive Learning Rate and Momentum, is the most popular stochastic optimizer for accelerating the training of deep neural networks. However, empirically Adam often generalizes worse than Stochastic Gradient Descent (SGD). We unveil the mystery of this behavior based on the diffusion theoretical framework. Specifically, we disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and minima selection. We prove that Adaptive Learning Rate can escape saddle points efficiently, but cannot select flat minima as SGD does. In contrast, Momentum provides a drift effect to help the training process pass through saddle points, and almost does not affect flat minima selection. This theoretically explains why SGD (with Momentum) generalizes better, while Adam generalizes worse but converges faster. Furthermore, motivated by the analysis, we design a novel adaptive optimization framework named Adaptive Inertia, which uses parameter-wise adaptive inertia to accelerate the training and provably favors flat minima as well as SGD. Our extensive experiments demonstrate that the proposed adaptive inertia method can generalize significantly better than SGD and conventional adaptive gradient methods.
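Reading from the abstract alone, one plausible skeleton of a parameter-wise adaptive-inertia update is sketched below: a second-moment estimate modulates the per-parameter momentum coefficient, so that the inertia, rather than the learning rate, is what adapts. All names and constants here are our assumptions for illustration, not the authors' algorithm.

```python
import numpy as np

def adaptive_inertia_step(theta, grad, m, v, t, lr=0.1, beta0=0.1, beta2=0.99, eps=1e-3):
    """One illustrative update: per-parameter momentum (inertia) coefficients
    are derived from a bias-corrected second-moment estimate, while the
    learning rate itself stays fixed (unlike Adam's adaptive learning rate)."""
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_hat = v / (1 - beta2 ** t)                          # bias correction
    # parameters with noisier gradients get smaller inertia, and vice versa
    beta1 = np.clip(1 - beta0 * v_hat / v_hat.mean(), 0.0, 1 - eps)
    m = beta1 * m + (1 - beta1) * grad                    # parameter-wise momentum
    theta = theta - lr * m
    return theta, m, v
```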

11 citations


Proceedings ArticleDOI
Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, Tie-Yan Liu
09 Jul 2020
TL;DR: It is shown that for differentially private convex optimization, the utility guarantee of differentially private (stochastic) gradient descent is determined by an expected curvature, which represents the average curvature over the optimization path, rather than by the minimum curvature.
Abstract: Gradient perturbation, widely used for differentially private optimization, injects noise at every iterative update to guarantee differential privacy. Previous work first determines the noise level that can satisfy the privacy requirement and then analyzes the utility of noisy gradient updates as in the non-private case. In contrast, we explore how the privacy noise affects the optimization property. We show that for differentially private convex optimization, the utility guarantee of differentially private (stochastic) gradient descent is determined by an expected curvature rather than the minimum curvature. The expected curvature, which represents the average curvature over the optimization path, is usually much larger than the minimum curvature. By using the expected curvature, we show that gradient perturbation can achieve a significantly improved utility guarantee that can theoretically justify the advantage of gradient perturbation over other perturbation methods. Finally, our extensive experiments suggest that gradient perturbation with the advanced composition method indeed outperforms other perturbation approaches by a large margin, matching our theoretical findings.
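The gradient-perturbation mechanism discussed above is, in its generic form, per-example clipping plus Gaussian noise; the sketch below shows one such step. The noise multiplier must be calibrated to the target (epsilon, delta) via a composition theorem, which is omitted here; all constants are placeholders.

```python
import numpy as np

def dp_gradient_step(w, per_example_grads, lr=0.1, clip=1.0, sigma=1.0, rng=None):
    """One gradient-perturbation step: clip each per-example gradient to
    norm `clip`, average, then add Gaussian noise scaled to the clipping
    bound so the update satisfies the per-step privacy guarantee."""
    rng = rng or np.random.default_rng()
    n = len(per_example_grads)
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    noisy = np.mean(clipped, axis=0) + rng.normal(0.0, sigma * clip / n, size=w.shape)
    return w - lr * noisy
```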

9 citations


Posted Content
04 Aug 2020
TL;DR: A framework is developed that takes state-of-the-art solvers and "robustifies" them to achieve comparable guarantees against a semi-random adversary; given a matrix that contains an (unknown) well-conditioned submatrix, the methods obtain computational and statistical guarantees as if the entire matrix were well-conditioned.
Abstract: Classical iterative algorithms for linear system solving and regression are brittle to the condition number of the data matrix. Even a semi-random adversary, constrained to only give additional consistent information, can arbitrarily hinder the resulting computational guarantees of existing solvers. We show how to overcome this barrier by developing a framework which takes state-of-the-art solvers and "robustifies" them to achieve comparable guarantees against a semi-random adversary. Given a matrix which contains an (unknown) well-conditioned submatrix, our methods obtain computational and statistical guarantees as if the entire matrix were well-conditioned. We complement our theoretical results with preliminary experimental evidence, showing that our methods are effective in practice.
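The brittleness claim is easy to reproduce: a semi-random adversary that appends rows consistent with the ground truth can still wreck the conditioning that iterative solvers depend on. The toy demo below illustrates the failure mode the paper addresses, not its robustification framework.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
A = rng.standard_normal((n, d))          # the well-conditioned "planted" block
x_true = rng.standard_normal(d)

# The adversary appends many large, nearly parallel rows.  Every new
# equation remains exactly consistent with x_true, yet the spectrum of
# the full system degrades badly.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
extra = 50.0 * np.outer(np.ones(1000), u)
A_adv = np.vstack([A, extra])
b_adv = A_adv @ x_true                    # still an exactly consistent system

print(np.linalg.cond(A))                  # ~2: gradient methods converge fast
print(np.linalg.cond(A_adv))              # roughly 100x larger: they crawl
```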

5 citations


Posted Content
21 Jul 2020
TL;DR: This paper proposes using private augmented data to sharpen the good side of membership inference while passivating its bad side, and exploits the data augmentation used in training to boost the accuracy of membership inference.
Abstract: Membership inference (MI) in machine learning decides whether a given example is in a target model's training set. It can be used in two ways: adversaries use it to steal private membership information, while legitimate users can use it to verify whether their data has been forgotten by a trained model. Therefore, MI is a double-edged sword for privacy-preserving machine learning. In this paper, we propose using private augmented data to sharpen its good side while passivating its bad side. To sharpen the good side, we exploit the data augmentation used in training to boost the accuracy of membership inference. Specifically, we compose a set of augmented instances for each sample, and the membership inference is then formulated as a set classification problem, i.e., classifying a set of augmented data points instead of one point. We design permutation-invariant features based on the losses of the augmented instances. Our approach significantly improves the MI accuracy over existing algorithms. To passivate the bad side, we apply different data augmentation methods to each legitimate user and keep the augmented data secret. We show that malicious adversaries cannot benefit from our algorithms if they are ignorant of the augmented data used in training. Extensive experiments demonstrate the superior efficacy of our algorithms. Our source code is available at an anonymous GitHub page: this https URL.
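On the defense side, the key mechanic is that each user's augmentations are derived from a secret the adversary cannot reproduce. A toy rendering, with a deliberately simple transform (circular shifts of a flat feature vector) standing in for real augmentation methods:

```python
import numpy as np

def private_views(x, user_seed, k=8):
    """Generate k augmented views of example x from a user-held secret seed.
    Without the seed, an adversary cannot recreate the views used in
    training, so loss features computed on guessed views will not match."""
    rng = np.random.default_rng(user_seed)
    shifts = rng.integers(0, len(x), size=k)
    return [np.roll(x, int(s)) for s in shifts]
```

The legitimate user can later feed these views through the same loss-based set features to verify membership (or forgetting), while an adversary ignorant of `user_seed` is left guessing augmentations.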

1 citation