Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
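
The text-to-text idea is easiest to see in code. Below is a minimal sketch using the Hugging Face `transformers` implementation of T5 rather than the paper's original codebase; the task prefixes shown ("translate English to German:", "summarize:", "cola sentence:") are among those the T5 authors use.

```python
# Minimal sketch of the text-to-text format, using the Hugging Face
# `transformers` library (not the paper's original T5 codebase).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is cast as text in, text out; a task prefix tells the
# model which problem to solve.
prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: Transfer learning, where a model is first pre-trained "
    "on a data-rich task before being fine-tuned on a downstream task, "
    "has emerged as a powerful technique in NLP.",
    "cola sentence: The book was written by John.",  # acceptability classification
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```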


Citations
Proceedings ArticleDOI
26 Apr 2021
TL;DR: This paper proposed DistIR, an IR for explicitly representing distributed DNN computation that can capture many popular distribution strategies, such as data, horizontal, and pipeline parallelism, and can be used to automatically search for an optimal distribution strategy.
Abstract: The rapidly growing size of deep neural network (DNN) models and datasets has given rise to a variety of distribution strategies such as data, horizontal, and pipeline parallelism. However, selecting the best set of strategies for a given model and hardware configuration is challenging because debugging and testing on clusters is expensive. In this work we propose DistIR, an IR for explicitly representing distributed DNN computation that can capture many popular distribution strategies. We build an analysis framework for DistIR programs, including a simulator and reference executor that can be used to automatically search for an optimal distribution strategy. Our unified global representation also eases development of new distribution strategies, as one can reuse the lowering to per-rank backend programs. Preliminary results using a grid search over a hybrid data/horizontal/pipeline-parallel space suggest DistIR and its simulator can aid automatic DNN distribution.
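
As a rough illustration of the search that such a simulator enables, here is a hypothetical sketch: `simulate_latency` is an invented toy cost model standing in for DistIR's simulator, and the grid of parallelism degrees is made up.

```python
# Hypothetical sketch of a grid search over a hybrid
# data/horizontal/pipeline-parallel space; the cost model below is a
# stand-in, not DistIR's actual simulator API.
import itertools

def simulate_latency(dp, hp, pp, world_size=8):
    """Toy analytical cost model: compute shrinks with total parallelism,
    communication grows with the number of replicas/shards/stages."""
    if dp * hp * pp != world_size:
        return float("inf")  # infeasible configuration
    compute = 100.0 / (dp * hp * pp)
    comm = 2.0 * (dp - 1) + 3.0 * (hp - 1) + 1.5 * (pp - 1)
    return compute + comm

degrees = [1, 2, 4, 8]
best = min(
    itertools.product(degrees, degrees, degrees),
    key=lambda cfg: simulate_latency(*cfg),
)
print("best (data, horizontal, pipeline) degrees:", best)
```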

5 citations

Proceedings ArticleDOI
Xiaoxue Zang, Lijuan Liu, Maria Wang, Yang Song, Hao Zhang, Jindong Chen
01 Aug 2021
TL;DR: This paper introduced PhotoChat, a dataset of 12k dialogues, each paired with a user photo shared during the conversation; the best image retrieval model achieves 10.4% recall@1 and the best photo-sharing intent prediction model achieves a 58.1% F1 score.
Abstract: We present a new human-human dialogue dataset - PhotoChat, the first dataset that casts light on the photo sharing behavior in online messaging. PhotoChat contains 12k dialogues, each of which is paired with a user photo that is shared during the conversation. Based on this dataset, we propose two tasks to facilitate research on image-text modeling: a photo-sharing intent prediction task that predicts whether one intends to share a photo in the next conversation turn, and a photo retrieval task that retrieves the most relevant photo according to the dialogue context. In addition, for both tasks, we provide baseline models using the state-of-the-art models and report their benchmark performances. The best image retrieval model achieves 10.4% recall@1 (out of 1000 candidates) and the best photo intent prediction model achieves 58.1% F1 score, indicating that the dataset presents interesting yet challenging real-world problems. We are releasing PhotoChat to facilitate future research work among the community.
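
The retrieval metric quoted above is standard recall@k. A small self-contained sketch (with random scores standing in for a real model's similarities) shows how it is computed over 1000 candidates:

```python
# Illustrative sketch of the recall@1 metric reported above; the
# similarity scores here are random, not from the PhotoChat baselines.
import numpy as np

def recall_at_k(scores, gold, k=1):
    """scores: (n_queries, n_candidates) similarity matrix;
    gold: index of the correct photo for each dialogue context."""
    topk = np.argsort(-scores, axis=1)[:, :k]  # indices of top-k candidates
    hits = (topk == gold[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
scores = rng.standard_normal((5, 1000))  # 1000 candidates, as in the paper
gold = rng.integers(0, 1000, size=5)
print(f"recall@1 = {recall_at_k(scores, gold, k=1):.3f}")
```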

5 citations

Posted Content
Hao Cheng, Yelong Shen, Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao
TL;DR: This paper proposed a hybrid approach for leveraging the strengths of both extractive and generative readers, which achieved state-of-the-art results on NaturalQuestions and TriviaQA.
Abstract: To date, most recent work under the retrieval-reader framework for open-domain QA has focused exclusively on either extractive or generative readers. In this paper, we study a hybrid approach for leveraging the strengths of both models. We apply novel techniques to enhance both extractive and generative readers built upon recent pretrained neural language models, and find that proper training methods can provide large improvements over previous state-of-the-art models. We demonstrate that a simple hybrid approach combining answers from both readers can efficiently take advantage of extractive and generative answer inference strategies, and outperforms single models as well as homogeneous ensembles. Our approach outperforms previous state-of-the-art models by 3.3 and 2.7 points in exact match on NaturalQuestions and TriviaQA, respectively.
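
A hedged sketch of the combination step: two placeholder readers each return an answer with a confidence score, and the hybrid picks the higher-scoring one. The paper trains and calibrates its readers far more carefully; `extractive` and `generative` below are invented stand-ins.

```python
# Sketch of combining an extractive and a generative reader by
# confidence score; the readers and scores are placeholders, not the
# paper's actual models.
from typing import Callable, Tuple

def hybrid_answer(
    question: str,
    extractive: Callable[[str], Tuple[str, float]],
    generative: Callable[[str], Tuple[str, float]],
) -> str:
    ext_ans, ext_score = extractive(question)
    gen_ans, gen_score = generative(question)
    # A simple combination rule; making the two scores comparable is
    # where the paper's training techniques matter.
    return ext_ans if ext_score >= gen_score else gen_ans

# Toy stand-ins for the two readers:
extractive = lambda q: ("Paris", 0.81)
generative = lambda q: ("the city of Paris", 0.66)
print(hybrid_answer("What is the capital of France?", extractive, generative))
```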

5 citations

Proceedings ArticleDOI
01 Jun 2021
TL;DR: This article presented a new information extraction system that can automatically construct temporal event graphs from a collection of news documents from multiple sources, multiple languages (English and Spanish for their experiment), and multiple data modalities (speech, text, image and video).
Abstract: We present a new information extraction system that can automatically construct temporal event graphs from a collection of news documents from multiple sources, multiple languages (English and Spanish in our experiment), and multiple data modalities (speech, text, image, and video). The system advances the state of the art in two respects: (1) extending sentence-level event extraction to cross-document, cross-lingual, cross-media event extraction, coreference resolution, and temporal event tracking; (2) using a human-curated event schema library to match and enhance the extraction output. We have made the dockerized system publicly available for research purposes on GitHub, along with a demo video.
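
For intuition, a temporal event graph can be represented as a directed graph whose nodes are cross-document event clusters and whose edges are temporal relations. The sketch below uses `networkx` with invented events; it is not the authors' released system.

```python
# Illustrative temporal event graph: nodes are event clusters merged by
# cross-document coreference, edges are temporal relations. Events and
# attributes here are invented examples.
import networkx as nx

g = nx.DiGraph()
g.add_node("e1", type="Attack", source_doc="en_news_01", modality="text")
g.add_node("e2", type="Arrest", source_doc="es_news_07", modality="video")
# A temporal edge orders the two events on a timeline.
g.add_edge("e1", "e2", relation="BEFORE")

for u, v, data in g.edges(data=True):
    print(u, data["relation"], v)
```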

5 citations

Posted Content
TL;DR: This paper proposed LIME (Learning Inductive bias for Mathematical rEasoning), a pre-training methodology that encodes inductive bias for mathematical reasoning in synthetic datasets and requires only a small fraction of the computation cost of the typical downstream task.
Abstract: While designing inductive bias in neural architectures has been widely studied, we hypothesize that transformer networks are flexible enough to learn inductive bias from suitable generic tasks. Here, we replace architecture engineering by encoding inductive bias in the form of datasets. Inspired by Peirce's view that deduction, induction, and abduction form an irreducible set of reasoning primitives, we design three synthetic tasks that are intended to require the model to have these three abilities. We specifically design these synthetic tasks so that they are devoid of mathematical knowledge, ensuring that only the fundamental reasoning biases can be learned from them. This defines a new pre-training methodology called "LIME" (Learning Inductive bias for Mathematical rEasoning). Models trained with LIME significantly outperform vanilla transformers on three very different large mathematical reasoning benchmarks. Unlike traditional pre-training approaches, which dominate the computation cost, LIME requires only a small fraction of the computation cost of the typical downstream task.
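
As a loose, hypothetical illustration of encoding inductive bias as data, the sketch below generates a deduction-style synthetic example: apply a rewrite rule to a random symbol string. The paper's actual task formats differ; this only conveys the flavor of a task devoid of mathematical knowledge.

```python
# Hypothetical generator for a deduction-style synthetic task, in the
# spirit of (but not identical to) the tasks described above.
import random
import string

def make_deduction_example(rng: random.Random) -> tuple[str, str]:
    symbols = rng.sample(string.ascii_lowercase, 5)
    lhs, rhs = rng.sample(symbols, 2)          # a rewrite rule lhs -> rhs
    seq = "".join(rng.choices(symbols, k=10))  # a random symbol string
    source = f"RULE {lhs}->{rhs} ; APPLY {seq}"
    target = seq.replace(lhs, rhs)             # ground-truth result
    return source, target

rng = random.Random(0)
for _ in range(3):
    src, tgt = make_deduction_example(rng)
    print(src, "=>", tgt)
```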

5 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.