Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Home
/
Papers
/
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu - Show less +5 more

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

read less

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Telling BERT's full story: from Local Attention to Global Aggregation

[...]

Damian Pascual¹, Gino Brunner¹, Roger Wattenhofer¹•Institutions (1)

ETH Zurich¹

01 Apr 2021

TL;DR: In this article, the authors take a deep look into the behavior of self-attention heads in the transformer architecture and find that there is a significant mismatch between attention and attribution distributions, caused by the mixing of context inside the model.

...read moreread less

Abstract: We take a deep look into the behaviour of self-attention heads in the transformer architecture. In light of recent work discouraging the use of attention distributions for explaining a model’s behaviour, we show that attention distributions can nevertheless provide insights into the local behaviour of attention heads. This way, we propose a distinction between local patterns revealed by attention and global patterns that refer back to the input, and analyze BERT from both angles. We use gradient attribution to analyze how the output of an attention head depends on the input tokens, effectively extending the local attention-based analysis to account for the mixing of information throughout the transformer layers. We find that there is a significant mismatch between attention and attribution distributions, caused by the mixing of context inside the model. We quantify this discrepancy and observe that interestingly, there are some patterns that persist across all layers despite the mixing.

...read moreread less

9 citations

Posted Content•

Distilling Large Language Models into Tiny and Effective Students using pQRNN.

[...]

Prabhu Kaliamoorthi, Aditya Siddhant, Edward Li, Melvin Johnson¹•Institutions (1)

Google¹

21 Jan 2021-arXiv: Computation and Language

TL;DR: This paper proposed pQRNN, a projection-based embedding-free neural encoder that is tiny and effective for natural language processing tasks, which outperforms LSTM models with pre-trained embeddings despite being 140x smaller.

...read moreread less

Abstract: Large pre-trained multilingual models like mBERT, XLM-R achieve state of the art results on language understanding tasks. However, they are not well suited for latency critical applications on both servers and edge devices. It's important to reduce the memory and compute resources required by these models. To this end, we propose pQRNN, a projection-based embedding-free neural encoder that is tiny and effective for natural language processing tasks. Without pre-training, pQRNNs significantly outperform LSTM models with pre-trained embeddings despite being 140x smaller. With the same number of parameters, they outperform transformer baselines thereby showcasing their parameter efficiency. Additionally, we show that pQRNNs are effective student architectures for distilling large pre-trained language models. We perform careful ablations which study the effect of pQRNN parameters, data augmentation, and distillation settings. On MTOP, a challenging multilingual semantic parsing dataset, pQRNN students achieve 95.9\% of the performance of an mBERT teacher while being 350x smaller. On mATIS, a popular parsing task, pQRNN students on average are able to get to 97.1\% of the teacher while again being 350x smaller. Our strong results suggest that our approach is great for latency-sensitive applications while being able to leverage large mBERT-like models.

...read moreread less

9 citations

Posted Content•

A Two-Phase Approach for Abstractive Podcast Summarization

[...]

Chujie Zheng¹, Kunpeng Zhang², Harry Jiannan Wang¹, Ling Fan³•Institutions (3)

University of Delaware¹, University of Maryland, College Park², Tongji University³

16 Nov 2020-arXiv: Computation and Language

TL;DR: This work selects important sentences from the noisy long podcast transcripts based on sentence similarity to the reference to reduce the redundancy and the associated latent topics to preserve semantics and proposes a two-phase approach: sentence selection and seq2seq learning.

...read moreread less

Abstract: Podcast summarization is different from summarization of other data formats, such as news, patents, and scientific papers in that podcasts are often longer, conversational, colloquial, and full of sponsorship and advertising information, which imposes great challenges for existing models. In this paper, we focus on abstractive podcast summarization and propose a two-phase approach: sentence selection and seq2seq learning. Specifically, we first select important sentences from the noisy long podcast transcripts. The selection is based on sentence similarity to the reference to reduce the redundancy and the associated latent topics to preserve semantics. Then the selected sentences are fed into a pre-trained encoder-decoder framework for the summary generation. Our approach achieves promising results regarding both ROUGE-based measures and human evaluations.

...read moreread less

9 citations

Posted Content•

BiToD: A Bilingual Multi-Domain Dataset For Task-Oriented Dialogue Modeling

[...]

Zhaojiang Lin¹, Andrea Madotto¹, Genta Indra Winata¹, Peng Xu¹, Feijun Jiang², Yuxiang Hu², Chen Shi², Pascale Fung³ - Show less +4 more•Institutions (3)

Hong Kong University of Science and Technology¹, Alibaba Group², Facebook³

05 Jun 2021-arXiv: Computation and Language

TL;DR: The BiToD task-oriented dialogue dataset as discussed by the authors contains 7k multi-domain dialogues (144k utterances) with a large and realistic bilingual knowledge base, which serve as an effective benchmark for evaluating bilingual ToD systems and cross-lingual transfer learning.

...read moreread less

Abstract: Task-oriented dialogue (ToD) benchmarks provide an important avenue to measure progress and develop better conversational agents. However, existing datasets for end-to-end ToD modeling are limited to a single language, hindering the development of robust end-to-end ToD systems for multilingual countries and regions. Here we introduce BiToD, the first bilingual multi-domain dataset for end-to-end task-oriented dialogue modeling. BiToD contains over 7k multi-domain dialogues (144k utterances) with a large and realistic bilingual knowledge base. It serves as an effective benchmark for evaluating bilingual ToD systems and cross-lingual transfer learning approaches. We provide state-of-the-art baselines under three evaluation settings (monolingual, bilingual, and cross-lingual). The analysis of our baselines in different settings highlights 1) the effectiveness of training a bilingual ToD system compared to two independent monolingual ToD systems, and 2) the potential of leveraging a bilingual knowledge base and cross-lingual transfer learning to improve the system performance under low resource condition.

...read moreread less

9 citations

Posted Content•

Domain-matched Pre-training Tasks for Dense Retrieval

[...]

Barlas Oguz, Kushal Lakhotia, Anchit Gupta, Patrick S. H. Lewis, Vladimir Karpukhin, Aleksandra Piktus, Xilun Chen, Sebastian Riedel, Wen-tau Yih, Sonal Gupta, Yashar Mehdad - Show less +7 more

28 Jul 2021-arXiv: Computation and Language

Abstract: Pre-training on larger datasets with ever increasing model size is now a proven recipe for increased performance across almost all NLP tasks. A notable exception is information retrieval, where additional pre-training has so far failed to produce convincing results. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder models on 1) a recently released set of 65 million synthetically generated questions, and 2) 200 million post-comment pairs from a preexisting dataset of Reddit conversations made available by this http URL. We evaluate on a set of information retrieval and dialogue retrieval benchmarks, showing substantial improvements over supervised baselines.

...read moreread less

9 citations