计量经济分析 = Econometric analysis

Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.

Longformer: The Long-Document Transformer

5分で分かる!? 有名論文ナナメ読み：Jacob Devlin et al. : BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Substantial progress has been made recently on developing provably accurate and efficient algorithms for low-rank matrix factorization via nonconvex optimization. While conventional wisdom often takes a dim view of nonconvex optimization algorithms due to their susceptibility to spurious local minima, simple iterative methods such as gradient descent have been remarkably successful in practice. The theoretical footings, however, had been largely lacking until recently. In this tutorial-style overview, we highlight the important role of statistical models in enabling efficient nonconvex optimization with performance guarantees. We review two contrasting approaches: (1) two-stage algorithms, which consist of a tailored initialization step followed by successive refinement; and (2) global landscape analysis and initialization-free algorithms. Several canonical matrix factorization problems are discussed, including but not limited to matrix sensing, phase retrieval, matrix completion, blind deconvolution, and robust principal component analysis. Special care is taken to illustrate the key technical insights underlying their analyses. This article serves as a testament that the integrated consideration of optimization and statistics leads to fruitful research findings.

Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview

Can we recover a complex signal from its Fourier magnitudes? More generally, given a set of $m$ measurements, $y_k = |\mathbf a_k^* \mathbf x|$ for $k = 1, \dots, m$, is it possible to recover $\mathbf x \in \mathbb{C}^n$ (i.e., length-$n$ complex vector)? This **generalized phase retrieval** (GPR) problem is a fundamental task in various disciplines, and has been the subject of much recent investigation. Natural nonconvex heuristics often work remarkably well for GPR in practice, but lack clear theoretical explanations. In this paper, we take a step towards bridging this gap. We prove that when the measurement vectors $\mathbf a_k$'s are generic (i.i.d. complex Gaussian) and the number of measurements is large enough ($m \ge C n \log^3 n$), with high probability, a natural least-squares formulation for GPR has the following benign geometric structure: (1) there are no spurious local minimizers, and all global minimizers are equal to the target signal $\mathbf x$, up to a global phase; and (2) the objective function has a negative curvature around each saddle point. This structure allows a number of iterative optimization methods to efficiently find a global minimizer, without special initialization. To corroborate the claim, we describe and analyze a second-order trust-region algorithm.

A Geometric Analysis of Phase Retrieval

The Transformer is widely used in natural language processing tasks. To train a Transformer however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the optimization and bring more hyper-parameter tunings. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the original-designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.

On Layer Normalization in the Transformer Architecture

We study the problem of recovering a vector $\bx\in \bbR^n$ from its magnitude measurements $y_i=|\langle \ba_i, \bx\rangle|, i=1,..., m$. Our work is along the line of the Wirtinger flow (WF) approach \citet{candes2015phase}, which solves the problem by minimizing a nonconvex loss function via a gradient algorithm and can be shown to converge to a global optimal point under good initialization. In contrast to the smooth loss function used in WF, we adopt a nonsmooth but lower-order loss function, and design a gradient-like algorithm (referred to as reshaped-WF). We show that for random Gaussian measurements, reshaped-WF enjoys geometric convergence to a global optimal point as long as the number $m$ of measurements is at the order of $\cO(n)$, where $n$ is the dimension of the unknown $\bx$. This improves the sample complexity of WF, and achieves the same sample complexity as truncated-WF \citet{chen2015solving} but without truncation at gradient step. Furthermore, reshaped-WF costs less computationally than WF, and runs faster numerically than both WF and truncated-WF. Bypassing higher-order variables in the loss function and truncations in the gradient loop, analysis of reshaped-WF is simplified.

/pdf/reshaped-wirtinger-flow-for-solving-quadratic-system-of-zgz2wy50th.pdf

Reshaped Wirtinger Flow for Solving Quadratic System of Equations

We study the problem of solving a quadratic system of equations, i.e., recovering a vector signal x ∈ R from its magnitude measurements yi = |〈ai,x〉|, i = 1, ...,m. We develop a gradient descent algorithm (referred to as RWF for reshaped Wirtinger flow) by minimizing the quadratic loss of the magnitude measurements. Comparing with Wirtinger flow (WF) (Candès et al., 2015), the loss function of RWF is nonconvex and nonsmooth, but better resembles the least-squares loss when the phase information is also available. We show that for random Gaussian measurements, RWF enjoys linear convergence to the true signal as long as the number of measurements is O(n). This improves the sample complexity of WF (O(n logn)), and achieves the same sample complexity as truncated Wirtinger flow (TWF) (Chen and Candès, 2015), but without any sophisticated truncation in the gradient loop. Furthermore, RWF costs less computationally than WF, and runs faster numerically than both WF and TWF. We further develop an incremental (stochastic) version of RWF (IRWF) and connect it with the randomized Kaczmarz method for phase retrieval. We demonstrate that IRWF outperforms existing incremental as well as batch algorithms with experiments.

/pdf/a-nonconvex-approach-for-phase-retrieval-reshaped-wirtinger-3iymotwsyn.pdf

A Nonconvex Approach for Phase Retrieval: Reshaped Wirtinger Flow and Incremental Algorithms

Solving systems of quadratic equations is a central problem in machine learning and signal processing. One important example is phase retrieval, which aims to recover a signal from only magnitudes of its linear measurements. This paper focuses on the situation when the measurements are corrupted by arbitrary outliers, for which the recently developed non-convex gradient descent Wirtinger flow (WF) and truncated Wirtinger flow (TWF) algorithms likely fail. We develop a novel median-TWF algorithm that exploits robustness of sample median to resist arbitrary outliers in the initialization and the gradient update in each iteration. We show that such a non-convex algorithm provably recovers the signal from a near-optimal number of measurements composed of i.i.d. Gaussian entries, up to a logarithmic factor, even when a constant portion of the measurements are corrupted by arbitrary outliers. We further show that median-TWF is also robust when measurements are corrupted by both arbitrary outliers and bounded noise. Our analysis of performance guarantee is accomplished by development of non-trivial concentration measures of median-related quantities, which may be of independent interest. We further provide numerical experiments to demonstrate the effectiveness of the approach.

Provable Non-convex Phase Retrieval with Outliers: Median Truncated Wirtinger Flow

/pdf/provable-non-convex-phase-retrieval-with-outliers-median-1rx9xg7hic.pdf

Huishuai Zhang

Papers

On Layer Normalization in the Transformer Architecture

Reshaped Wirtinger Flow for Solving Quadratic System of Equations

A Nonconvex Approach for Phase Retrieval: Reshaped Wirtinger Flow and Incremental Algorithms

Provable Non-convex Phase Retrieval with Outliers: Median Truncated Wirtinger Flow

Provable non-convex phase retrieval with outliers: median truncated wirtinger flow