Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer


Journal Article


Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.


Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
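
The text-to-text idea in the abstract can be illustrated with a minimal sketch: every task, whatever its original label type, is rewritten as an input string and a target string. The `TextToTextExample` class, the `to_text_to_text` helper, the task names, and the prefix wording below are all hypothetical and chosen for illustration; they are not taken from the released code.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TextToTextExample:
    """One training pair: the model reads `source` and must emit `target`."""
    source: str
    target: str


def to_text_to_text(task: str, fields: Mapping[str, str], label: str) -> TextToTextExample:
    """Cast a labeled example from any task into an (input text, output text) pair.

    The task name is written into the input as a plain-text prefix so that one
    model can distinguish tasks. The prefix format is an assumption of this sketch.
    """
    if len(fields) == 1:
        # Single-input tasks: "<task>: <text>"
        (text,) = fields.values()
        source = f"{task}: {text}"
    else:
        # Multi-input tasks: "<task> <name1>: <text1> <name2>: <text2>"
        parts = " ".join(f"{name}: {text}" for name, text in fields.items())
        source = f"{task} {parts}"
    # Labels are kept as literal strings, so classification becomes text generation.
    return TextToTextExample(source=source, target=label)


examples = [
    to_text_to_text("summarize", {"text": "Long article body ..."}, "Short summary ..."),
    to_text_to_text("sentiment", {"text": "A delightful film."}, "positive"),
    to_text_to_text(
        "entailment",
        {"premise": "A man is sleeping.", "hypothesis": "A person rests."},
        "entailment",
    ),
]

for ex in examples:
    print(f"{ex.source!r} -> {ex.target!r}")
```

Because inputs and outputs are both plain strings, summarization, classification, and sentence-pair tasks in this sketch all share one data format and could share one model and one loss.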



Citations


Proceedings Article

Automatic Student Network Search for Knowledge Distillation


Zhexi Zhang¹, Wei Zhu, Junchi Yan¹, Peng Gao, Guotong Xie

Shanghai Jiao Tong University¹

10 Jan 2021

TL;DR: NAS-KD uses neural architecture search (NAS) to automatically generate an optimal student network for knowledge distillation of BERT, substantially reducing the size of BERT without much performance sacrifice.


Abstract: Pre-trained language models (PLMs), such as BERT, have achieved outstanding performance on multiple natural language processing (NLP) tasks. However, such pre-trained models usually contain a huge number of parameters and are computationally expensive. The high resource demand hinders their application on resource-restricted devices like mobile phones. Knowledge distillation (KD) is an effective compression approach, aiming at encouraging a light-weight student network to imitate the teacher network, and accordingly latent knowledge is transferred from the teacher to student. However, the great majority of student networks in previous KD methods are manually designed, normally a subnetwork of the teacher network. Transformer is generally utilized as the student for compressing BERT but still contains masses of parameters. Motivated by this, we propose a novel approach named NAS-KD, which automatically generates an optimal student network using neural architecture search (NAS) to enhance the distillation for BERT. Experiment on 7 classification tasks in NLP domain demonstrates that NAS-KD can substantially reduce the size of BERT without much performance sacrifice.
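
The "student imitates teacher" idea described in this abstract is commonly implemented with a soft-target loss. The sketch below shows that generic formulation (KL divergence between temperature-softened teacher and student distributions) in plain Python. It is an assumption that this resembles the loss used in NAS-KD; the abstract does not specify the objective, and the function names and default temperature are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) over softened class distributions.

    The loss is zero when the student reproduces the teacher's distribution
    and grows as the two distributions diverge.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0.0)


teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

print(distillation_loss(close_student, teacher_logits))
print(distillation_loss(far_student, teacher_logits))
```

A student whose logits rank the classes like the teacher's receives a smaller loss than one that inverts the ranking, which is the signal that transfers the teacher's "latent knowledge" during training.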


2 citations

Posted Content

Language Models not just for Pre-training: Fast Online Neural Noisy Channel Modeling


Shruti Bhosale¹, Kyra Yee², Sergey Edunov¹, Michael Auli¹

Facebook¹, Twitter²

13 Nov 2020-arXiv: Computation and Language

TL;DR: This work introduces efficient approximations that make inference with the noisy channel approach as fast as strong ensembles while increasing accuracy, and shows that the noisy channel approach can outperform strong pre-training results by achieving a new state of the art on WMT Romanian-English translation.


Abstract: Pre-training models on vast quantities of unlabeled data has emerged as an effective approach to improving accuracy on many NLP tasks. On the other hand, traditional machine translation has a long history of leveraging unlabeled data through noisy channel modeling. The same idea has recently been shown to achieve strong improvements for neural machine translation. Unfortunately, naive noisy channel modeling with modern sequence to sequence models is up to an order of magnitude slower than alternatives. We address this issue by introducing efficient approximations to make inference with the noisy channel approach as fast as strong ensembles while increasing accuracy. We also show that the noisy channel approach can outperform strong pre-training results by achieving a new state of the art on WMT Romanian-English translation.
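
Noisy channel modeling scores a candidate translation y for a source x with Bayes' rule, combining a channel model p(x | y) with a language model p(y) that can be trained on unlabeled text. A minimal reranking sketch follows. The combination with a direct model score and a single interpolation weight is an assumption of this sketch, the log-probability tables are made-up toy numbers, and none of the paper's efficiency approximations are shown.

```python
from typing import Callable, Sequence


def noisy_channel_rerank(
    source: str,
    candidates: Sequence[str],
    direct_logp: Callable[[str, str], float],   # log p(y | x)
    channel_logp: Callable[[str, str], float],  # log p(x | y)
    lm_logp: Callable[[str], float],            # log p(y)
    weight: float = 1.0,
) -> str:
    """Return the candidate with the best combined score.

    score(y) = log p(y | x) + weight * (log p(x | y) + log p(y))
    """
    def score(candidate: str) -> float:
        channel_term = channel_logp(source, candidate) + lm_logp(candidate)
        return direct_logp(source, candidate) + weight * channel_term

    return max(candidates, key=score)


# Toy log-probabilities for two hypothetical candidates (illustrative only).
DIRECT = {"cand_a": -1.0, "cand_b": -1.2}
CHANNEL = {"cand_a": -3.0, "cand_b": -1.0}
LM = {"cand_a": -2.0, "cand_b": -1.5}

best = noisy_channel_rerank(
    "src",
    ["cand_a", "cand_b"],
    direct_logp=lambda x, y: DIRECT[y],
    channel_logp=lambda x, y: CHANNEL[y],
    lm_logp=lambda y: LM[y],
)
print(best)
```

In this toy setup the direct model alone slightly prefers `cand_a`, while the channel and language model terms favor `cand_b`. Each candidate needs extra model evaluations, which is the source of the slowdown the abstract refers to.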


2 citations

Proceedings Article

Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering


Arij Riabi, Thomas Scialom¹, Rachel Keraron, Benoît Sagot, Djamé Seddah, Jacopo Staiano

University of Paris¹

13 Jan 2021

TL;DR: This article proposed a method to improve cross-lingual question answering performance without requiring additional annotated data, leveraging question generation models to produce synthetic samples in a cross-lingual fashion.


Abstract: Coupled with the availability of large scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performances of state-of-the-art multilingual models are significantly lower when evaluated on non-English data. Due to high data collection costs, it is not realistic to obtain annotated data for each language one desires to support. We propose a method to improve the Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method allows to significantly outperform the baselines trained on English data only. We report a new state-of-the-art on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).


2 citations

Posted Content

Transformer-Based Models for Question Answering on COVID19


Hillary Ngai, Yoona Park, John Chen, Mahboobeh Parsapoor

16 Jan 2021-arXiv: Computation and Language

TL;DR: In this paper, the authors proposed three transformer-based question-answering systems using BERT, ALBERT, and T5 models for Kaggle's COVID-19 Open Research Dataset (CORD-19) challenge.


Abstract: In response to the Kaggle's COVID-19 Open Research Dataset (CORD-19) challenge, we have proposed three transformer-based question-answering systems using BERT, ALBERT, and T5 models. Since the CORD-19 dataset is unlabeled, we have evaluated the question-answering models' performance on two labeled questions answers datasets, CovidQA and CovidGQA. The BERT-based QA system achieved the highest F1 score (26.32), while the ALBERT-based QA system achieved the highest Exact Match (13.04). However, numerous challenges are associated with developing high-performance question-answering systems for the ongoing COVID-19 pandemic and future pandemics. At the end of this paper, we discuss these challenges and suggest potential solutions to address them.


2 citations

Posted Content

RefSum: Refactoring Neural Summarization.


Yixin Liu¹, Zi-Yi Dou¹, Pengfei Liu¹

Carnegie Mellon University¹

15 Apr 2021-arXiv: Computation and Language

TL;DR: This article presented a new framework, Refactor, that provides a unified view of text summarization and summaries combination, and achieved state-of-the-art results on the CNN/DailyMail dataset.


Abstract: Although some recent works show potential complementarity among different state-of-the-art systems, few works try to investigate this problem in text summarization. Researchers in other areas commonly refer to the techniques of reranking or stacking to approach this problem. In this work, we highlight several limitations of previous methods, which motivates us to present a new framework Refactor that provides a unified view of text summarization and summaries combination. Experimentally, we perform a comprehensive evaluation that involves twenty-two base systems, four datasets, and three different application scenarios. Besides new state-of-the-art results on CNN/DailyMail dataset (46.18 ROUGE-1), we also elaborate on how our proposed method addresses the limitations of the traditional methods and the effectiveness of the Refactor model sheds light on insight for performance improvement. Our system can be directly used by other researchers as an off-the-shelf tool to achieve further performance improvements. We open-source all the code and provide a convenient interface to use it: this https URL. We have also made the demo of this work available at: this http URL.


2 citations