Showing papers in "arXiv: Computation and Language in 2019"

Posted Content•

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Michael Lewis, Luke Zettlemoyer, Veselin Stoyanov

26 Jul 2019-arXiv: Computation and Language

TL;DR: This replication study of BERT pretraining finds that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

Abstract: Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

13,994 citations

Posted Content•

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf

02 Oct 2019-arXiv: Computation and Language

TL;DR: This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.

Abstract: As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

3,877 citations
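
The triple loss named in the DistilBERT abstract (language modeling, distillation and cosine-distance terms) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the temperature of 2.0 and the equal weights are assumptions made for the example, and the inputs are toy logits and hidden vectors for a single masked position.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def masked_lm_loss(student_logits, target_index):
    """Cross-entropy between the student's prediction and the true token."""
    return -math.log(softmax(student_logits)[target_index])


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened distribution."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


def cosine_distance_loss(student_hidden, teacher_hidden):
    """One minus the cosine similarity of the two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are placeholders."""
    w_mlm, w_distill, w_cos = weights
    return (w_mlm * masked_lm_loss(student_logits, target_index)
            + w_distill * distillation_loss(student_logits, teacher_logits)
            + w_cos * cosine_distance_loss(student_hidden, teacher_hidden))
```

The distillation term is smallest when the student reproduces the teacher's distribution, and the cosine term is zero when the two hidden vectors point in the same direction, which is the alignment the abstract describes.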

Posted Content•

HuggingFace's Transformers: State-of-the-art Natural Language Processing.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Jamie Brew

09 Oct 2019-arXiv: Computation and Language

TL;DR: The Transformers library is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.

Abstract: Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. Transformers is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the-art Transformer architectures under a unified API. Backing this library is a curated collection of pretrained models made by and available for the community. Transformers is designed to be extensible by researchers, simple for practitioners, and fast and robust in industrial deployments. The library is available at this https URL.

3,463 citations

Posted Content•

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Zhilin Yang¹, Zihang Dai¹, Yiming Yang¹, Jaime G. Carbonell¹, Ruslan Salakhutdinov¹, Quoc V. Le²

Carnegie Mellon University¹, Google²

19 Jun 2019-arXiv: Computation and Language

TL;DR: XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.

Abstract: With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

3,009 citations
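
The XLNet abstract describes maximizing the expected likelihood over all permutations of the factorization order. One way to write that objective is shown below. The notation is an assumption introduced here, not taken from the listing: $\mathcal{Z}_T$ is the set of all permutations of the index sequence $[1, \dots, T]$, $z_t$ is the $t$-th element of a permutation $\mathbf{z}$, $\mathbf{z}_{<t}$ denotes its first $t-1$ elements, and $p_\theta$ is the model.

```latex
\max_{\theta} \;
\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
```

Because every position can appear both before and after every other position across permutations, each token is eventually predicted from context on both sides, while each individual factorization remains autoregressive.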

Posted Content•

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Zhenzhong Lan¹, Mingda Chen², Sebastian Goodman¹, Kevin Gimpel³, Piyush Sharma¹, Radu Soricut¹

Google¹, Toyota Technological Institute at Chicago², New York University³

26 Sep 2019-arXiv: Computation and Language

TL;DR: The authors proposed a self-supervised loss that focuses on modeling inter-sentence coherence, and showed it consistently helps downstream tasks with multi-sentence inputs, achieving state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks.

Abstract: Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at this https URL.

2,247 citations

Posted Content•

fairseq: A Fast, Extensible Toolkit for Sequence Modeling.

Myle Ott¹, Sergey Edunov¹, Alexei Baevski¹, Angela Fan¹, Sam Gross¹, Nathan Ng, David Grangier², Michael Auli¹

Facebook¹, Google²

01 Apr 2019-arXiv: Computation and Language

TL;DR: fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks, and supports distributed training across multiple GPUs and machines.

Abstract: fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision training and inference on modern GPUs. A demo video can be found at this https URL

1,650 citations

Posted Content•

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang¹, Varsha Kishore, Felix Wu¹, Kilian Q. Weinberger¹, Yoav Artzi¹

Cornell University¹

21 Apr 2019-arXiv: Computation and Language

TL;DR: This work proposes BERTScore, an automatic evaluation metric for text generation that correlates better with human judgments and provides stronger model selection performance than existing metrics.

Abstract: We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.

1,456 citations
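
The token-level similarity described in the BERTScore abstract can be illustrated with a small sketch. It assumes each token already has an embedding vector (toy 2-dimensional vectors stand in for contextual BERT embeddings here) and uses greedy matching, where each token is paired with its most similar counterpart. The function name `bertscore_from_embeddings` is hypothetical and is not the API of any released package; refinements such as importance weighting are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def bertscore_from_embeddings(candidate_vecs, reference_vecs):
    """Greedy-matching precision, recall and F1 over token embeddings."""
    # sims[i][j] = similarity of candidate token i and reference token j
    sims = [[cosine(c, r) for r in reference_vecs] for c in candidate_vecs]
    # Precision: each candidate token is matched to its best reference token.
    precision = sum(max(row) for row in sims) / len(candidate_vecs)
    # Recall: each reference token is matched to its best candidate token.
    recall = sum(
        max(sims[i][j] for i in range(len(candidate_vecs)))
        for j in range(len(reference_vecs))
    ) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1


# Toy example: the candidate has one extra token that the reference lacks.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]
print(bertscore_from_embeddings(candidate, reference))
```

In the toy example the extra candidate token lowers precision while recall stays perfect, which is the soft analogue of an exact-match metric penalizing unmatched words.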

Posted Content•

Energy and Policy Considerations for Deep Learning in NLP

Emma Strubell¹, Ananya Ganesh¹, Andrew McCallum¹

University of Massachusetts Amherst¹

05 Jun 2019-arXiv: Computation and Language

TL;DR: This paper quantifies the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP and proposes actionable recommendations to reduce costs and improve equity in NLP research and practice.

Abstract: Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks trained on abundant data. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exceptionally large computational resources that necessitate similarly substantial energy consumption. As a result these models are costly to train and develop, both financially, due to the cost of hardware and electricity or cloud compute time, and environmentally, due to the carbon footprint required to fuel modern tensor processing hardware. In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP. Based on these findings, we propose actionable recommendations to reduce costs and improve equity in NLP research and practice.

1,318 citations

Posted Content•

The Curious Case of Neural Text Degeneration

Ari Holtzman¹, Jan Buys¹, Leo Du¹, Maxwell Forbes¹, Yejin Choi²

University of Washington¹, University of Cape Town²

22 Apr 2019-arXiv: Computation and Language

TL;DR: This paper shows that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model, and proposes Nucleus Sampling, a simple but effective method to draw the best out of neural generation.

Abstract: Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive. In this paper, we reveal surprising distributional differences between human text and machine text. In addition, we find that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. Our findings motivate Nucleus Sampling, a simple but effective method to draw the best out of neural generation. By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.

1,256 citations
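
The idea of sampling from a dynamic nucleus while truncating the unreliable tail, as described in the abstract above, can be sketched in a few lines. This is an illustrative version over a plain list of next-token probabilities; the function names and the default `top_p=0.9` are assumptions chosen for the example rather than values stated in the listing.

```python
import random


def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p.

    Returns a dict mapping token index to its renormalized probability.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break  # everything after this point is the truncated tail
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}


def nucleus_sample(probs, top_p=0.9, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


# Toy distribution: the lowest-probability token falls outside the nucleus.
next_token_probs = [0.5, 0.3, 0.15, 0.05]
print(sorted(nucleus_filter(next_token_probs, top_p=0.9)))
```

The size of the kept set changes with the shape of the distribution: a peaked distribution yields a nucleus of one or two tokens, while a flat one keeps many, which is what makes the nucleus "dynamic".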

Posted Content•

Cross-lingual Language Model Pretraining.

Guillaume Lample¹, Alexis Conneau¹

Facebook¹

22 Jan 2019-arXiv: Computation and Language

TL;DR: This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.

Abstract: Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.

1,186 citations

Posted Content•

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.

Michael Lewis¹, Yinhan Liu¹, Naman Goyal¹, Marjan Ghazvininejad¹, Abdelrahman Mohamed¹, Omer Levy², Veselin Stoyanov¹, Luke Zettlemoyer¹

Facebook¹, University of Washington²

29 Oct 2019-arXiv: Computation and Language

TL;DR: BART is a denoising autoencoder for pretraining sequence-to-sequence models, which is trained by corrupting text with an arbitrary noising function, and then learning a model to reconstruct the original text.

Abstract: We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
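
The two noising operations the BART abstract singles out, sentence shuffling and span in-filling, can be sketched as below. This is a simplified illustration, not the released training code: the `<mask>` string, the function names and the caller-chosen span position are assumptions, and how span positions and lengths are sampled is deliberately left out.

```python
import random

MASK = "<mask>"  # placeholder mask symbol, assumed for this sketch


def shuffle_sentences(sentences, rng):
    """Return the sentences of a document in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def infill_span(tokens, start, length):
    """Replace tokens[start:start + length] with a single mask token.

    The model must then infer both the content and the length of the span.
    """
    return tokens[:start] + [MASK] + tokens[start + length:]


rng = random.Random(0)
document = ["BART corrupts text .", "Then it reconstructs it .", "That is the objective ."]
print(shuffle_sentences(document, rng))
print(infill_span(["the", "cat", "sat", "on", "the", "mat"], start=1, length=3))
```

During pretraining the corrupted sequence would be fed to the encoder and the original sequence used as the decoder target, which is the reconstruction step (2) in the abstract.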

Posted Content•

CTRL: A Conditional Transformer Language Model for Controllable Generation

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher

11 Sep 2019-arXiv: Computation and Language

TL;DR: CTRL is released, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior, providing more explicit control over text generation.

Abstract: Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at this https URL.

Posted Content•

Language Models as Knowledge Bases

Fabio Petroni¹, Tim Rocktäschel¹, Patrick S. H. Lewis¹, Anton Bakhtin¹, Yuxiang Wu², Alexander H. Miller¹, Sebastian Riedel¹

Facebook¹, University College London²

03 Sep 2019-arXiv: Computation and Language

TL;DR: An in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models finds that BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge.

Abstract: Recent progress in pretraining language models on large textual corpora led to a surge of improvements for downstream NLP tasks. Whilst learning linguistic knowledge, these models may also be storing relational knowledge present in the training data, and may be able to answer queries structured as "fill-in-the-blank" cloze statements. Language models have many advantages over structured knowledge bases: they require no schema engineering, allow practitioners to query about an open class of relations, are easy to extend to more data, and require no human supervision to train. We present an in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models. We find that (i) without fine-tuning, BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge, (ii) BERT also does remarkably well on open-domain question answering against a supervised baseline, and (iii) certain types of factual knowledge are learned much more readily than others by standard language model pretraining approaches. The surprisingly strong ability of these models to recall factual knowledge without any fine-tuning demonstrates their potential as unsupervised open-domain QA systems. The code to reproduce our analysis is available at this https URL.

Posted Content•

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi¹, Md. Mostofa Ali Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro

Nvidia¹

17 Sep 2019-arXiv: Computation and Language

TL;DR: This work presents a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters, and shows that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows.

Abstract: Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).
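
The 76% scaling efficiency quoted in the Megatron-LM abstract can be checked from the other figures it gives. The short calculation below assumes the usual definition of efficiency as sustained throughput divided by the throughput of perfectly linear scaling from the single-GPU baseline; the function name is introduced here for illustration only.

```python
PETA = 1e15
TERA = 1e12


def scaling_efficiency(sustained_flops, num_gpus, single_gpu_flops):
    """Sustained throughput as a fraction of ideal linear scaling."""
    return sustained_flops / (num_gpus * single_gpu_flops)


# Figures from the abstract: 15.1 PetaFLOPs on 512 GPUs, 39 TeraFLOPs per GPU baseline.
efficiency = scaling_efficiency(15.1 * PETA, 512, 39 * TERA)
print(f"{efficiency:.1%}")  # about 75.6%, which rounds to the reported 76%
```

Ideal linear scaling would give 512 × 39 TeraFLOPs ≈ 19.97 PetaFLOPs, so sustaining 15.1 PetaFLOPs corresponds to roughly three quarters of that ceiling.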

Posted Content•

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

Jingqing Zhang¹, Yao Zhao², Mohammad Saleh², Peter J. Liu²

Imperial College London¹, Google²

18 Dec 2019-arXiv: Computation and Language

TL;DR: This work proposes pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective, PEGASUS, and demonstrates it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores.

Abstract: Recent work pre-training Transformers with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks including text summarization. However, pre-training objectives tailored for abstractive text summarization have not been explored. Furthermore there is a lack of systematic evaluation across diverse domains. In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective. In PEGASUS, important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary. We evaluated our best PEGASUS model on 12 downstream summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills. Experiments demonstrate it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores. Our model also shows surprising performance on low-resource summarization, surpassing previous state-of-the-art results on 6 datasets with only 1000 examples. Finally we validated our results using human evaluation and show that our model summaries achieve human performance on multiple datasets.

Posted Content•

SpanBERT: Improving Pre-training by Representing and Predicting Spans

Mandar Joshi¹, Danqi Chen², Yinhan Liu³, Daniel S. Weld¹, Luke Zettlemoyer¹, Omer Levy³

University of Washington¹, Princeton University², Facebook³

24 Jul 2019-arXiv: Computation and Language

TL;DR: SpanBERT extends BERT by masking contiguous random spans, rather than random tokens, and training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it.

Abstract: We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT-large, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0, respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even show gains on GLUE.

Posted Content•

What Does BERT Look At? An Analysis of BERT's Attention

Kevin Clark¹, Urvashi Khandelwal¹, Omer Levy¹, Christopher D. Manning²

Stanford University¹, Facebook²

11 Jun 2019-arXiv: Computation and Language

TL;DR: It is shown that certain attention heads correspond well to linguistic notions of syntax and coreference, and an attention-based probing classifier is proposed and used to demonstrate that substantial syntactic information is captured in BERT’s attention.

Abstract: Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT. BERT's attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly attending over the whole sentence, with heads in the same layer often exhibiting similar behaviors. We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT's attention.

Posted Content•

Unsupervised Cross-lingual Representation Learning at Scale.

Alexis Conneau¹, Kartikay Khandelwal², Naman Goyal¹, Vishrav Chaudhary¹, Guillaume Wenzek¹, Francisco Guzmán³, Edouard Grave¹, Myle Ott¹, Luke Zettlemoyer¹, Veselin Stoyanov¹

Facebook¹, Microsoft², Johns Hopkins University³

05 Nov 2019-arXiv: Computation and Language

TL;DR: This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and trains a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data.

Abstract: This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.

Posted Content•

ERNIE: Enhanced Representation through Knowledge Integration

Yu Sun, Wang Shuohuan, Li Yukun, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Zhu Danxiang, Hao Tian, Hua Wu¹

Baidu¹

19 Apr 2019-arXiv: Computation and Language

TL;DR: Experimental results show that ERNIE outperforms other baseline methods, achieving new state-of-the-art results on five Chinese natural language processing tasks including natural language inference, semantic similarity, named entity recognition, sentiment analysis and question answering.

Abstract: We present a novel language representation model enhanced by knowledge called ERNIE (Enhanced Representation through kNowledge IntEgration). Inspired by the masking strategy of BERT, ERNIE is designed to learn language representation enhanced by knowledge masking strategies, which include entity-level masking and phrase-level masking. Entity-level strategy masks entities which are usually composed of multiple words. Phrase-level strategy masks the whole phrase which is composed of several words standing together as a conceptual unit. Experimental results show that ERNIE outperforms other baseline methods, achieving new state-of-the-art results on five Chinese natural language processing tasks including natural language inference, semantic similarity, named entity recognition, sentiment analysis and question answering. We also demonstrate that ERNIE has more powerful knowledge inference capacity on a cloze test.

Posted Content•

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

Jason Wei¹, Kai Zou²

Dartmouth College¹, Georgetown University²

31 Jan 2019-arXiv: Computation and Language

TL;DR: EDA consists of four simple but powerful operations: synonym replacement, random insertion, random swap, and random deletion; on five text classification tasks, it improves performance for both convolutional and recurrent neural networks.

Abstract: We present EDA: easy data augmentation techniques for boosting performance on text classification tasks. EDA consists of four simple but powerful operations: synonym replacement, random insertion, random swap, and random deletion. On five text classification tasks, we show that EDA improves performance for both convolutional and recurrent neural networks. EDA demonstrates particularly strong results for smaller datasets; on average, across five datasets, training with EDA while using only 50% of the available training set achieved the same accuracy as normal training with all available data. We also performed extensive ablation studies and suggest parameters for practical use.
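
The four operations named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the function names, the `TOY_SYNONYMS` table, and the parameters `n` and `p` are assumptions made for this example, and a real setup would need a proper synonym lexicon.

```python
import random

# Hypothetical toy synonym table, used only for illustration.
TOY_SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "joyful"]}

def synonym_replacement(words, n, synonyms, rng):
    """Replace up to n words that have an entry in `synonyms`."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(synonyms[out[i]])
    return out

def random_insertion(words, n, synonyms, rng):
    """Insert a synonym of a randomly chosen word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w in synonyms]
        if not candidates:
            break  # nothing in the sentence has a known synonym
        new_word = rng.choice(synonyms[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), new_word)
    return out

def random_swap(words, n, rng):
    """Swap the positions of two randomly chosen words, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    """Drop each word independently with probability p, keeping at least one."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "the quick dog is happy".split()
    print(synonym_replacement(sentence, 1, TOY_SYNONYMS, rng))
    print(random_insertion(sentence, 1, TOY_SYNONYMS, rng))
    print(random_swap(sentence, 1, rng))
    print(random_deletion(sentence, 0.2, rng))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which is convenient when comparing runs with and without augmentation.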

Posted Content•

DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation

Yizhe Zhang¹, Siqi Sun¹, Michel Galley¹, Yen-Chun Chen¹, Chris Brockett¹, Xiang Gao¹, Jianfeng Gao¹, Jingjing Liu¹, Bill Dolan¹•Institutions (1)

Microsoft¹

01 Nov 2019-arXiv: Computation and Language

TL;DR: The authors presented a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained transformer) trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017.

Abstract: We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain a performance close to human both in terms of automatic and human evaluation in single-turn dialogue settings. We show that conversational systems that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response generation and the development of more intelligent open-domain dialogue systems.

Posted Content•

Publicly Available Clinical BERT Embeddings

Emily Alsentzer¹, John Murphy¹, Willie Boag¹, Wei-Hung Weng¹, Di Jin¹, Tristan Naumann², Matthew B. A. McDermott¹•Institutions (2)

Massachusetts Institute of Technology¹, Microsoft²

06 Apr 2019-arXiv: Computation and Language

TL;DR: This work explores and releases two BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically, and demonstrates that using a domain-specific model yields performance improvements on 3/5 clinical NLP tasks, establishing a new state-of-the-art on the MedNLI dataset.

Abstract: Contextual word embedding models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly available pre-trained BERT models yet exist. In this work, we address this need by exploring and releasing BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically. We demonstrate that using a domain-specific model yields performance improvements on three common clinical NLP tasks as compared to nonspecific embeddings. We find that these domain-specific models are not as performant on two clinical de-identification tasks, and argue that this is a natural consequence of the differences between de-identified source text and synthetically non-de-identified task text.

Posted Content•

TinyBERT: Distilling BERT for Natural Language Understanding

Xiaoqi Jiao¹, Yichun Yin¹, Lifeng Shang², Xin Jiang², Xiao Chen², Linlin Li², Fang Wang², Qun Liu²•Institutions (2)

Huazhong University of Science and Technology¹, Huawei²

23 Sep 2019-arXiv: Computation and Language

TL;DR: A novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models is proposed and, by leveraging this new KD method, the wealth of knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT.

Abstract: Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the wealth of knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference. TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% of their parameters and about 31% of their inference time. Moreover, TinyBERT with 6 layers performs on par with its teacher BERT-Base.

Posted Content•

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

R. Thomas McCoy¹, Ellie Pavlick², Tal Linzen¹•Institutions (2)

Johns Hopkins University¹, Brown University²

04 Feb 2019-arXiv: Computation and Language

TL;DR: There is substantial room for improvement in NLI systems, and the HANS dataset, which contains many examples where the heuristics fail, can motivate and measure progress in this area.

Abstract: A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.

Posted Content•DOI•

Pre-Training with Whole Word Masking for Chinese BERT

Yiming Cui¹, Wanxiang Che¹, Ting Liu¹, Bing Qin¹, Ziqing Yang, Shijin Wang, Guoping Hu•Institutions (1)

Harbin Institute of Technology¹

19 Jun 2019-arXiv: Computation and Language

TL;DR: The whole word masking (wwm) strategy for Chinese BERT is introduced, along with a series of Chinese pre-trained language models, and a simple but effective model called MacBERT is proposed, which improves upon RoBERTa in several ways.

Abstract: Bidirectional Encoder Representations from Transformers (BERT) has shown marvelous improvements across various NLP tasks. Recently, an upgraded version of BERT has been released with Whole Word Masking (WWM), which mitigates the drawbacks of masking partial WordPiece tokens in pre-training BERT. In this technical report, we adapt whole word masking to Chinese text, masking whole words instead of individual Chinese characters, which could bring another challenge in the Masked Language Model (MLM) pre-training task. The proposed models are verified on various NLP tasks, from sentence level to document level, including machine reading comprehension (CMRC 2018, DRCD, CJRC), natural language inference (XNLI), sentiment classification (ChnSentiCorp), sentence pair matching (LCQMC, BQ Corpus), and document classification (THUCNews). Experimental results on these datasets show that whole word masking could bring another significant gain. Moreover, we also examine the effectiveness of the Chinese pre-trained models: BERT, ERNIE, BERT-wwm, BERT-wwm-ext, RoBERTa-wwm-ext, and RoBERTa-wwm-ext-large. We release all the pre-trained models.
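
The difference between character-level and whole word masking can be illustrated with a small sketch. It assumes the text has already been segmented into words by some external tool and that each Chinese character is one token; the function name, the `mask_prob` default of 0.15, and the label format are assumptions for this example rather than details of the released models. BERT's further step of sometimes substituting a random or unchanged token for `[MASK]` is left out for brevity.

```python
import random

MASK = "[MASK]"

def whole_word_mask(segmented_words, mask_prob=0.15, rng=None):
    """Mask every character of a selected word together.

    segmented_words: list of words, each a string of one or more characters
                     (segmentation is assumed to be done elsewhere).
    Returns (tokens, labels): labels hold the original character at masked
    positions and None elsewhere.
    """
    rng = rng or random.Random()
    tokens, labels = [], []
    for word in segmented_words:
        chars = list(word)
        if rng.random() < mask_prob:
            # The whole word is masked, never just part of it.
            tokens.extend([MASK] * len(chars))
            labels.extend(chars)
        else:
            tokens.extend(chars)
            labels.extend([None] * len(chars))
    return tokens, labels

if __name__ == "__main__":
    words = ["使用", "语言", "模型", "来", "预测"]
    tokens, labels = whole_word_mask(words, mask_prob=0.3, rng=random.Random(1))
    print(tokens)
    print(labels)
```

Because the masking decision is made once per word, the two characters of a word such as 模型 are always either both visible or both masked, which is the property that distinguishes this scheme from per-character masking.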

Posted Content•

wav2vec: Unsupervised Pre-training for Speech Recognition

Steffen Schneider¹, Alexei Baevski², Ronan Collobert², Michael Auli²•Institutions (2)

Technische Universität München¹, Facebook²

11 Apr 2019-arXiv: Computation and Language

TL;DR: wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training; it outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.

Abstract: We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available. Our approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2, the best reported character-based system in the literature while using two orders of magnitude less labeled training data.
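
For readers unfamiliar with the metric quoted above, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows that standard definition only; it is not taken from the paper's evaluation code, and the function name and whitespace tokenization are assumptions for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)

if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A relative reduction such as the one reported in the abstract compares two such rates: (baseline WER - new WER) / baseline WER.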

Posted Content•

Assessing BERT's Syntactic Abilities

Yoav Goldberg¹•Institutions (1)

Allen Institute for Artificial Intelligence¹

16 Jan 2019-arXiv: Computation and Language

TL;DR: The extent to which the recently introduced BERT model captures English syntactic phenomena is assessed, using naturally-occurring subject-verb agreement stimuli; "colorless green ideas" subject-verb agreement stimuli; and manually crafted stimuli for subject-verb agreement and reflexive anaphora phenomena.

Abstract: I assess the extent to which the recently introduced BERT model captures English syntactic phenomena, using (1) naturally-occurring subject-verb agreement stimuli; (2) "colorless green ideas" subject-verb agreement stimuli, in which content words in natural sentences are randomly replaced with words sharing the same part-of-speech and inflection; and (3) manually crafted stimuli for subject-verb agreement and reflexive anaphora phenomena. The BERT model performs remarkably well on all cases.

Proceedings Article•DOI•

CamemBERT: a Tasty French Language Model

Louis Martin¹, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, Benoît Sagot•Institutions (1)

Facebook¹

10 Nov 2019-arXiv: Computation and Language

TL;DR: This paper investigates the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating their language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks.

Abstract: Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models, in all languages except English, very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.

Posted Content•

How multilingual is Multilingual BERT?

Telmo Pires¹, Eva Schlinger¹, Dan Garrette¹•Institutions (1)

Google¹

04 Jun 2019-arXiv: Computation and Language

TL;DR: It is concluded that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs, and that the model can find translation pairs.

Abstract: In this paper, we show that Multilingual BERT (M-BERT), released by Devlin et al. (2018) as a single language model pre-trained from monolingual corpora in 104 languages, is surprisingly good at zero-shot cross-lingual model transfer, in which task-specific annotations in one language are used to fine-tune the model for evaluation in another language. To understand why, we present a large number of probing experiments, showing that transfer is possible even to languages in different scripts, that transfer works best between typologically similar languages, that monolingual corpora can train models for code-switching, and that the model can find translation pairs. From these results, we can conclude that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs.

Posted Content•

ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission

Kexin Huang, Jaan Altosaar, Rajesh Ranganath

10 Apr 2019-arXiv: Computation and Language

TL;DR: ClinicalBERT uncovers high-quality relationships between medical concepts as judged by humans and outperforms baselines on 30-day hospital readmission prediction using both discharge summaries and the first few days of notes in the intensive care unit.

Abstract: Clinical notes contain information about patients that goes beyond structured data like lab values and medications. However, clinical notes have been underused relative to structured data, because notes are high-dimensional and sparse. This work develops and evaluates representations of clinical notes using bidirectional transformers (ClinicalBERT). ClinicalBERT uncovers high-quality relationships between medical concepts as judged by humans. ClinicalBERT outperforms baselines on 30-day hospital readmission prediction using both discharge summaries and the first few days of notes in the intensive care unit. Code and model parameters are available.
