Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Home
/
Papers
/
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu - Show less +5 more

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

read less

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic

[...]

Muhammad Abdul-Mageed¹, AbdelRahim A. Elmadany¹, El Moatez Billah Nagoudi¹•Institutions (1)

University of British Columbia¹

01 Aug 2021

TL;DR: The authors introduced two powerful deep bidirectional transformer-based models, ARBERT and MARBERT, for multi-dialectal Arabic language understanding evaluation, which achieved state-of-the-art results across the majority of tasks (37 out of 48 classification tasks, on the 42 datasets).

...read moreread less

Abstract: Pre-trained language models (LMs) are currently integral to many natural language processing systems. Although multilingual LMs were also introduced to serve many languages, these have limitations such as being costly at inference time and the size and diversity of non-English data involved in their pre-training. We remedy these issues for a collection of diverse Arabic varieties by introducing two powerful deep bidirectional transformer-based models, ARBERT and MARBERT. To evaluate our models, we also introduce ARLUE, a new benchmark for multi-dialectal Arabic language understanding evaluation. ARLUE is built using 42 datasets targeting six different task clusters, allowing us to offer a series of standardized experiments under rich conditions. When fine-tuned on ARLUE, our models collectively achieve new state-of-the-art results across the majority of tasks (37 out of 48 classification tasks, on the 42 datasets). Our best model acquires the highest ARLUE score (77.40) across all six task clusters, outperforming all other models including XLM-R Large ( 3.4x larger size). Our models are publicly available at https://github.com/UBC-NLP/marbert and ARLUE will be released through the same repository.

...read moreread less

52 citations

Posted Content•DOI•

Semantic Models for the First-stage Retrieval: A Comprehensive Review.

[...]

Yinqiong Cai, Yixing Fan, Jiafeng Guo, Fei Sun, Ruqing Zhang, Xueqi Cheng - Show less +2 more

08 Mar 2021-arXiv: Information Retrieval

TL;DR: The current landscape of semantic retrieval models from three major paradigms, paying special attention to recent neural-based methods, is described in this article, where the authors review the benchmark datasets, optimization methods and evaluation metrics, and summarize the state-of-theart models.

...read moreread less

Abstract: Multi-stage ranking pipelines have been a practical solution in modern search systems, where the first-stage retrieval is to return a subset of candidate documents, and the latter stages attempt to re-rank those candidates. Unlike the re-ranking stages going through quick technique shifts during the past decades, the first-stage retrieval has long been dominated by classical term-based models. Unfortunately, these models suffer from the vocabulary mismatch problem, which may block the re-ranking stages from relevant documents at the very beginning. Therefore, it has been a long-term desire to build semantic models for the first-stage retrieval that can achieve high recall efficiently. Recently, we have witnessed an explosive growth of research interests on the first-stage semantic retrieval models. We believe it is the right time to survey the current status, learn from existing methods, and gain some insights for future development. In this paper, we describe the current landscape of semantic retrieval models from three major paradigms, paying special attention to recent neural-based methods. We review the benchmark datasets, optimization methods and evaluation metrics, and summarize the state-of-the-art models. We also discuss the unresolved challenges and suggest potentially promising directions for future work.

...read moreread less

52 citations

Posted Content•

A Comprehensive Survey on Word Representation Models: From Classical to State-Of-The-Art Word Representation Language Models

[...]

Usman Naseem¹, Imran Razzak², Shah Khalid Khan³, Mukesh Prasad⁴•Institutions (4)

University of Sydney¹, Deakin University², RMIT University³, Association for Computing Machinery⁴

28 Oct 2020-arXiv: Computation and Language

TL;DR: A variety of text representation methods, and model designs have blossomed in the context of NLP, including SOTA LMs are described, which can transform large volumes of text into effective vector representations capturing the same semantic information.

...read moreread less

Abstract: Word representation has always been an important research area in the history of natural language processing (NLP). Understanding such complex text data is imperative, given that it is rich in information and can be used widely across various applications. In this survey, we explore different word representation models and its power of expression, from the classical to modern-day state-of-the-art word representation language models (LMS). We describe a variety of text representation methods, and model designs have blossomed in the context of NLP, including SOTA LMs. These models can transform large volumes of text into effective vector representations capturing the same semantic information. Further, such representations can be utilized by various machine learning (ML) algorithms for a variety of NLP related tasks. In the end, this survey briefly discusses the commonly used ML and DL based classifiers, evaluation metrics and the applications of these word embeddings in different NLP tasks.

...read moreread less

52 citations

Journal Article•DOI•

Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

[...]

Peter Bell¹, Joachim Fainberg¹, Ondrej Klejch¹, Jinyu Li², Steve Renals¹, Pawel Swietojanski³ - Show less +2 more•Institutions (3)

University of Edinburgh¹, Microsoft², University of New South Wales³

01 Jan 2021

TL;DR: A meta-analysis of the performance of speech recognition adaptation algorithms is presented, based on relative error rate reductions as reported in the literature, to characterize adaptation algorithms as based on embeddings, model parameter adaptation, or data augmentation.

...read moreread less

Abstract: We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data augmentation. We present a meta-analysis of the performance of speech recognition adaptation algorithms, based on relative error rate reductions as reported in the literature.

...read moreread less

51 citations

Proceedings Article•DOI•

SentiX: A Sentiment-Aware Pre-Trained Model for Cross-Domain Sentiment Analysis

[...]

Jie Zhou¹, Junfeng Tian², Rui Wang², Yuanbin Wu³, Wenming Xiao, Liang He³ - Show less +2 more•Institutions (3)

Tsinghua University¹, Alibaba Group², East China Normal University³

01 Dec 2020

TL;DR: This paper pre-train a sentiment-aware language model (SentiX) via domain-invariant sentiment knowledge from large-scale review datasets, and utilize it for cross-domain sentiment analysis task without fine-tuning.

...read moreread less

Abstract: Pre-trained language models have been widely applied to cross-domain NLP tasks like sentiment analysis, achieving state-of-the-art performance. However, due to the variety of users’ emotional expressions across domains, fine-tuning the pre-trained models on the source domain tends to overfit, leading to inferior results on the target domain. In this paper, we pre-train a sentiment-aware language model (SentiX) via domain-invariant sentiment knowledge from large-scale review datasets, and utilize it for cross-domain sentiment analysis task without fine-tuning. We propose several pre-training tasks based on existing lexicons and annotations at both token and sentence levels, such as emoticons, sentiment words, and ratings, without human interference. A series of experiments are conducted and the results indicate the great advantages of our model. We obtain new state-of-the-art results in all the cross-domain sentiment analysis tasks, and our proposed SentiX can be trained with only 1% samples (18 samples) and it achieves better performance than BERT with 90% samples.

...read moreread less

51 citations