Ashish Vaswani

Researcher at Google

Publications - 73
Citations - 70493

Ashish Vaswani is an academic researcher from Google. The author has contributed to research in topics: Machine translation & Transformer (machine learning model). The author has an h-index of 34, co-authored 70 publications receiving 35599 citations. Previous affiliations of Ashish Vaswani include Information Sciences Institute & University of Southern California.

Papers
Patent

Fully attentional computer vision

TL;DR: In this patent, a system, implemented as computer programs on one or more computers in one or more locations, is described that implements a computer vision model in which a positional local self-attention layer receives an input feature map and generates an output feature map.
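As a rough illustration of the idea, the sketch below implements a single-head local self-attention layer over a 2D feature map in plain NumPy. The function name, weight shapes, zero padding at the borders, and the additive relative positional embeddings are illustrative assumptions, not the patented design.

```python
import numpy as np

def local_self_attention_2d(x, wq, wk, wv, rel_pos, k=3):
    """Single-head local self-attention over a 2D feature map (a sketch).

    x:        (H, W, C) input feature map
    wq/wk/wv: (C, C) projection matrices for queries, keys, values
    rel_pos:  (k, k, C) relative positional embeddings added to the keys
    k:        odd spatial extent of the local neighborhood
    Returns an (H, W, C) output feature map.
    """
    H, W, C = x.shape
    pad = k // 2
    # Zero padding at the borders is a simplification; padded positions
    # still take part in the attention over edge pixels.
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    q = x @ wq                       # one query per output position
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k]           # (k, k, C) local window
            keys = patch @ wk + rel_pos            # position-aware keys
            vals = patch @ wv
            logits = (keys * q[i, j]).sum(-1) / np.sqrt(C)  # (k, k) scores
            attn = np.exp(logits - logits.max())   # numerically stable softmax
            attn /= attn.sum()
            out[i, j] = (attn[..., None] * vals).sum((0, 1))
    return out
```

A production implementation would vectorize the window extraction and use multiple heads; the nested loops here only make the per-position attention explicit.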

Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

TL;DR: It is shown that, beyond model size alone, model shape matters for downstream fine-tuning and that scaling protocols operate differently at different compute regions, which means the widely adopted T5-base and T5-large sizes are Pareto-inefficient.
Journal Article

Documentary Linguistics and Computational Linguistics: A response to Brooks

TL;DR: In mid-2012, a two-week workshop was organized in Papua New Guinea to provide training in basic techniques and technologies for language documentation, and to gain an understanding of how these technologies might be improved in the future.
Posted Content

The Efficiency Misnomer

TL;DR: In this paper, the authors thoroughly discuss common cost indicators, their advantages and disadvantages, and how they can contradict each other; they also demonstrate how incomplete reporting of cost indicators can lead to partial conclusions and a blurred or incomplete picture of the practical considerations of different models.

Simple and Efficient ways to Improve REALM

TL;DR: REALM++, as discussed by the authors, improves upon the training and inference setups and introduces a better supervision signal for improving performance, without any architectural changes, achieving 5.5% absolute accuracy gains over the baseline while being faster to train.