Showing papers by Huishuai Zhang published in 2020


Posted Content
TL;DR: In this paper, the authors show that the location of layer normalization is crucial to Transformer training, and that the learning rate warm-up stage can be safely removed for Pre-LN Transformers.
Abstract: The Transformer is widely used in natural language processing tasks. To train a Transformer, however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but slows down optimization and requires more hyper-parameter tuning. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as the Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach results comparable to baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.
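For concreteness, the sketch below contrasts the two layer-normalization placements the abstract refers to. It is a minimal PyTorch rendering of a single encoder block, not the authors' code; the dimensions and the feed-forward design are illustrative defaults.

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original Transformer: LayerNorm sits between the residual blocks."""
    def __init__(self, d=512, heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        # normalize AFTER adding the residual
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.ln2(x + self.ffn(x))

class PreLNBlock(nn.Module):
    """Pre-LN variant: LayerNorm moved inside each residual branch,
    which the paper shows keeps gradients well-behaved at initialization."""
    def __init__(self, d=512, heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        # normalize BEFORE each sub-layer; the residual path stays identity
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.ln2(x))
```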

373 citations


Proceedings Article
12 Jul 2020
TL;DR: It is proved with mean field theory that at initialization, for the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large, so using a large learning rate on them makes training unstable.
Abstract: The Transformer is widely used in natural language processing tasks. To train a Transformer, however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but slows down optimization and requires more hyper-parameter tuning. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as the Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach results comparable to baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.

64 citations


Journal ArticleDOI
TL;DR: In this article, a truncated gradient descent algorithm is proposed to improve robustness against outliers, where the truncation adaptively rules out, in each iteration, the contributions of samples whose measurement residuals deviate significantly from the sample median.
Abstract: Recent work has demonstrated the effectiveness of gradient descent for directly recovering the factors of low-rank matrices from random linear measurements in a globally convergent manner when initialized properly. However, the performance of existing algorithms is highly sensitive to outliers that may take arbitrary values. In this paper, we propose a truncated gradient descent algorithm to improve the robustness against outliers, where the truncation is performed to rule out the contributions of samples that deviate significantly from the sample median of measurement residuals adaptively in each iteration. We demonstrate that, when initialized in a basin of attraction close to the ground truth, the proposed algorithm converges to the ground truth at a linear rate for the Gaussian measurement model with a near-optimal number of measurements, even when a constant fraction of the measurements are arbitrarily corrupted. In addition, we propose a new truncated spectral method that ensures an initialization in the basin of attraction under slightly higher sample requirements. We finally provide numerical experiments to validate the superior performance of the proposed approach.
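A toy NumPy sketch of the median-truncation step described above, assuming measurements y_k = <A_k, UU^T> with a fraction arbitrarily corrupted. The paper's truncated spectral initialization is replaced by a random start here, and the step size and truncation constant are placeholders.

```python
import numpy as np

def median_truncated_gd(A, y, rank, step=0.2, c=3.0, iters=300, seed=0):
    """A: (m, n, n) measurement matrices; y: (m,) possibly corrupted measurements.
    Recovers a rank-`rank` factor U with M = U @ U.T (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    U = rng.standard_normal((n, rank)) / np.sqrt(n)      # crude random start
    for _ in range(iters):
        res = np.einsum('kij,ij->k', A, U @ U.T) - y     # residual per sample
        # truncation: drop samples whose residual is far from the sample median
        keep = np.abs(res) <= c * np.median(np.abs(res))
        Asym = A[keep] + np.transpose(A[keep], (0, 2, 1))
        grad = np.einsum('k,kij->ij', res[keep], Asym) @ U / keep.sum()
        U -= step * grad                                  # GD on the kept samples only
    return U
```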

33 citations


Journal ArticleDOI
TL;DR: A comprehensive understanding of algorithmic convergence with respect to data homogeneity is obtained by measuring the smoothness of the discrepancy between the local and global loss functions, and it is shown that when the data are less balanced, regularization can be used to ensure convergence, albeit at a slower rate.
Abstract: Stochastic variance reduced methods have gained a lot of interest recently for empirical risk minimization due to their appealing run-time complexity. When the data size is large and disjointly stored on different machines, it becomes imperative to distribute the implementation of such variance reduced methods. In this paper, we consider a general framework that directly distributes popular stochastic variance reduced methods in the master/slave model, by assigning outer loops to the parameter server and inner loops to worker machines. This framework is natural and easy to implement, but its theoretical convergence is not well understood. We obtain a comprehensive understanding of algorithmic convergence with respect to data homogeneity by measuring the smoothness of the discrepancy between the local and global loss functions. We establish the linear convergence of distributed versions of a family of stochastic variance reduced algorithms, including those using accelerated and recursive gradient updates, for minimizing strongly convex losses. Our theory captures how the convergence of distributed algorithms behaves as the number of machines and the size of local data vary. Furthermore, we show that when the data are less balanced, regularization can be used to ensure convergence, albeit at a slower rate. We also demonstrate that our analysis can be further extended to handle nonconvex loss functions.
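The master/worker split described above can be sketched in a few lines. This single-process simulation for ridge regression is only meant to show where the outer and inner loops live; the function names, sampling scheme, and hyper-parameters are ours, not the paper's.

```python
import numpy as np

def distributed_svrg(X_parts, y_parts, lam=0.1, step=0.05, outer=20, inner=100, seed=0):
    """X_parts/y_parts: per-machine data shards for ridge regression."""
    rng = np.random.default_rng(seed)
    d = X_parts[0].shape[1]
    w_snap = np.zeros(d)

    def grad(X, y, w):  # ridge-regression gradient on a (mini-)batch
        return X.T @ (X @ w - y) / len(y) + lam * w

    for _ in range(outer):          # outer loop: runs on the parameter server
        # server aggregates local full gradients at the snapshot point
        full_grad = np.mean([grad(X, y, w_snap) for X, y in zip(X_parts, y_parts)], axis=0)
        w = w_snap.copy()
        for _ in range(inner):      # inner loop: runs on a worker machine
            m = rng.integers(len(X_parts))
            i = rng.integers(len(y_parts[m]))
            xi, yi = X_parts[m][i:i + 1], y_parts[m][i:i + 1]
            # variance-reduced stochastic gradient
            v = grad(xi, yi, w) - grad(xi, yi, w_snap) + full_grad
            w -= step * v
        w_snap = w                  # server refreshes the snapshot
    return w_snap
```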

29 citations


Posted Content
Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, Tie-Yan Liu
TL;DR: This work establishes the optimal membership inference when the model is trained with augmented data, which inspires the authors to formulate the MI attack as a set classification problem, i.e., classifying a set of augmented instances instead of a single data point, and to design input permutation-invariant features.
Abstract: It is observed in the literature that data augmentation can significantly mitigate membership inference (MI) attacks. However, in this work, we challenge this observation by proposing new MI attacks that utilize the information of augmented data. MI attacks are widely used to measure a model's information leakage about its training set. We establish the optimal membership inference when the model is trained with augmented data, which inspires us to formulate the MI attack as a set classification problem, i.e., classifying a set of augmented instances instead of a single data point, and to design input permutation-invariant features. Empirically, we demonstrate that the proposed approach universally outperforms original methods when the model is trained with data augmentation. Furthermore, we show that the proposed approach can achieve higher MI attack success rates on models trained with some data augmentation than the existing methods achieve on models trained without data augmentation. Notably, we achieve a 70.1% MI attack success rate on CIFAR10 against a wide residual network, while the previous best approach only attains 61.9%. This suggests the privacy risk of models trained with data augmentation could be largely underestimated.
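A minimal sketch of the set-classification idea: the per-example feature is the multiset of the target model's losses on the augmented views, made order-invariant by sorting. `model_loss` and `augmentations` are assumed interfaces for illustration, not the authors' API.

```python
import numpy as np

def set_features(model_loss, x, y, augmentations):
    """Permutation-invariant feature for set-based membership inference:
    evaluate the target model's loss on every augmented view of (x, y),
    then sort, so the feature does not depend on the order of the views."""
    losses = np.array([model_loss(aug(x), y) for aug in augmentations])
    return np.sort(losses)
```

A binary classifier (member vs. non-member) is then trained on these feature vectors; sorting is one simple way to realize the permutation invariance the abstract calls for.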

28 citations


Posted Content
TL;DR: A novel adaptive optimizer named Adaptive Inertia Estimation (Adai), which uses parameter-wise adaptive inertia to accelerate training and provably favors flat minima as much as SGD.
Abstract: Adaptive Momentum Estimation (Adam), which combines Adaptive Learning Rate and Momentum, is the most popular stochastic optimizer for accelerating the training of deep neural networks. However, empirically Adam often generalizes worse than Stochastic Gradient Descent (SGD). We unveil the mystery of this behavior based on the diffusion theoretical framework. Specifically, we disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and minima selection. We prove that Adaptive Learning Rate can escape saddle points efficiently, but cannot select flat minima as SGD does. In contrast, Momentum provides a drift effect to help the training process pass through saddle points, and almost does not affect flat minima selection. This theoretically explains why SGD (with Momentum) generalizes better, while Adam generalizes worse but converges faster. Furthermore, motivated by the analysis, we design a novel adaptive optimization framework named Adaptive Inertia, which uses parameter-wise adaptive inertia to accelerate the training and provably favors flat minima as well as SGD. Our extensive experiments demonstrate that the proposed adaptive inertia method can generalize significantly better than SGD and conventional adaptive gradient methods.
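Reading from the abstract alone, one plausible skeleton of a parameter-wise adaptive-inertia update is sketched below: a second-moment estimate modulates the per-parameter momentum coefficient, so that the inertia, rather than the learning rate, is what adapts. All names and constants here are our assumptions for illustration, not the authors' algorithm.

```python
import numpy as np

def adaptive_inertia_step(theta, grad, m, v, t, lr=0.1, beta0=0.1, beta2=0.99, eps=1e-3):
    """One illustrative update: per-parameter momentum (inertia) coefficients
    are derived from a bias-corrected second-moment estimate, while the
    learning rate itself stays fixed (unlike Adam's adaptive learning rate)."""
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_hat = v / (1 - beta2 ** t)                          # bias correction
    # parameters with noisier gradients get smaller inertia, and vice versa
    beta1 = np.clip(1 - beta0 * v_hat / v_hat.mean(), 0.0, 1 - eps)
    m = beta1 * m + (1 - beta1) * grad                    # parameter-wise momentum
    theta = theta - lr * m
    return theta, m, v
```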

11 citations


Proceedings ArticleDOI
Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, Tie-Yan Liu
09 Jul 2020
TL;DR: It is shown that for differentially private convex optimization, the utility guarantee of differentially private (stochastic) gradient descent is determined by an expected curvature, which represents the average curvature over the optimization path, rather than by the minimum curvature.
Abstract: Gradient perturbation, widely used for differentially private optimization, injects noise at every iterative update to guarantee differential privacy. Previous work first determines the noise level that can satisfy the privacy requirement and then analyzes the utility of noisy gradient updates as in the non-private case. In contrast, we explore how the privacy noise affects the optimization property. We show that for differentially private convex optimization, the utility guarantee of differentially private (stochastic) gradient descent is determined by an expected curvature rather than the minimum curvature. The expected curvature, which represents the average curvature over the optimization path, is usually much larger than the minimum curvature. By using the expected curvature, we show that gradient perturbation can achieve a significantly improved utility guarantee that can theoretically justify the advantage of gradient perturbation over other perturbation methods. Finally, our extensive experiments suggest that gradient perturbation with the advanced composition method indeed outperforms other perturbation approaches by a large margin, matching our theoretical findings.
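The gradient-perturbation mechanism discussed above is, in its generic form, per-example clipping plus Gaussian noise; the sketch below shows one such step. The noise multiplier must be calibrated to the target (epsilon, delta) via a composition theorem, which is omitted here; all constants are placeholders.

```python
import numpy as np

def dp_gradient_step(w, per_example_grads, lr=0.1, clip=1.0, sigma=1.0, rng=None):
    """One gradient-perturbation step: clip each per-example gradient to
    norm `clip`, average, then add Gaussian noise scaled to the clipping
    bound so the update satisfies the per-step privacy guarantee."""
    rng = rng or np.random.default_rng()
    n = len(per_example_grads)
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    noisy = np.mean(clipped, axis=0) + rng.normal(0.0, sigma * clip / n, size=w.shape)
    return w - lr * noisy
```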

9 citations


Posted Content
04 Aug 2020
TL;DR: A framework is developed that takes state-of-the-art solvers and "robustifies" them to achieve comparable guarantees against a semi-random adversary; given a matrix that contains an (unknown) well-conditioned submatrix, the methods obtain computational and statistical guarantees as if the entire matrix were well-conditioned.
Abstract: Classical iterative algorithms for linear system solving and regression are brittle to the condition number of the data matrix. Even a semi-random adversary, constrained to only give additional consistent information, can arbitrarily hinder the resulting computational guarantees of existing solvers. We show how to overcome this barrier by developing a framework which takes state-of-the-art solvers and "robustifies" them to achieve comparable guarantees against a semi-random adversary. Given a matrix which contains an (unknown) well-conditioned submatrix, our methods obtain computational and statistical guarantees as if the entire matrix were well-conditioned. We complement our theoretical results with preliminary experimental evidence, showing that our methods are effective in practice.
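The brittleness claim is easy to reproduce: a semi-random adversary that appends rows consistent with the ground truth can still wreck the conditioning that iterative solvers depend on. The toy demo below illustrates the failure mode the paper addresses, not its robustification framework.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
A = rng.standard_normal((n, d))          # the well-conditioned "planted" block
x_true = rng.standard_normal(d)

# The adversary appends many large, nearly parallel rows.  Every new
# equation remains exactly consistent with x_true, yet the spectrum of
# the full system degrades badly.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
extra = 50.0 * np.outer(np.ones(1000), u)
A_adv = np.vstack([A, extra])
b_adv = A_adv @ x_true                    # still an exactly consistent system

print(np.linalg.cond(A))                  # ~2: gradient methods converge fast
print(np.linalg.cond(A_adv))              # roughly 100x larger: they crawl
```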

5 citations


Posted Content
21 Jul 2020
TL;DR: This paper proposes using private augmented data to sharpen the good side of membership inference while passivating its bad side, and exploits the data augmentation used in training to boost the accuracy of membership inference.
Abstract: Membership inference (MI) in machine learning decides whether a given example is in a target model's training set. It can be used in two ways: adversaries use it to steal private membership information, while legitimate users can use it to verify whether their data has been forgotten by a trained model. Therefore, MI is a double-edged sword for privacy-preserving machine learning. In this paper, we propose using private augmented data to sharpen its good side while passivating its bad side. To sharpen the good side, we exploit the data augmentation used in training to boost the accuracy of membership inference. Specifically, we compose a set of augmented instances for each sample, and the membership inference is then formulated as a set classification problem, i.e., classifying a set of augmented data points instead of one point. We design permutation-invariant features based on the losses of the augmented instances. Our approach significantly improves the MI accuracy over existing algorithms. To passivate the bad side, we apply different data augmentation methods to each legitimate user and keep the augmented data secret. We show that malicious adversaries cannot benefit from our algorithms if they are ignorant of the augmented data used in training. Extensive experiments demonstrate the superior efficacy of our algorithms. Our source code is available at an anonymous GitHub page: this https URL.
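On the defense side, the key mechanic is that each user's augmentations are derived from a secret the adversary cannot reproduce. A toy rendering, with a deliberately simple transform (circular shifts of a flat feature vector) standing in for real augmentation methods:

```python
import numpy as np

def private_views(x, user_seed, k=8):
    """Generate k augmented views of example x from a user-held secret seed.
    Without the seed, an adversary cannot recreate the views used in
    training, so loss features computed on guessed views will not match."""
    rng = np.random.default_rng(user_seed)
    shifts = rng.integers(0, len(x), size=k)
    return [np.roll(x, int(s)) for s in shifts]
```

The legitimate user can later feed these views through the same loss-based set features to verify membership (or forgetting), while an adversary ignorant of `user_seed` is left guessing augmentations.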

1 citation