Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
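
To make the text-to-text framing concrete, here is a minimal sketch that runs the publicly released T5 checkpoints through the Hugging Face transformers library (an illustration, not part of this page); the task prefixes mirror those used in the paper, and "t5-small" is simply the smallest public checkpoint.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is "text in, text out": the task is named by a prefix on the input,
# and the answer is read off the decoded output string.
examples = [
    "translate English to German: The house is wonderful.",
    "summarize: Transfer learning, where a model is first pre-trained on a data-rich "
    "task before being fine-tuned on a downstream task, has emerged as a powerful "
    "technique in natural language processing.",
    "cola sentence: The books was on the table.",  # acceptability judged as text
]

for text in examples:
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Translation, summarization, and classification all share the same model, loss, and decoding procedure; only the input string changes.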


Citations
Posted Content
TL;DR: ParaRel, introduced in this paper, is a high-quality resource of cloze-style query English paraphrases, containing a total of 328 paraphrases for 38 relations; using ParaRel, the authors show that the consistency of pretrained language models with respect to factual knowledge is poor.
Abstract: Consistency of a model -- that is, the invariance of its behavior under meaning-preserving alternations in its input -- is a highly desirable property in natural language processing. In this paper we study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge? To this end, we create ParaRel, a high-quality resource of cloze-style query English paraphrases. It contains a total of 328 paraphrases for 38 relations. Using ParaRel, we show that the consistency of all PLMs we experiment with is poor -- though with high variance between relations. Our analysis of the representational spaces of PLMs suggests that they have a poor structure and are currently not suitable for representing knowledge robustly. Finally, we propose a method for improving model consistency and experimentally demonstrate its effectiveness.
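
The consistency notion above can be illustrated with a small sketch (not the authors' ParaRel code): query a masked language model with two paraphrases of the same cloze query and check whether the top predictions agree. The paraphrases below are made-up examples, not drawn from ParaRel.

```python
from transformers import pipeline

# Any masked LM works for the illustration; BERT is just a convenient public checkpoint.
fill = pipeline("fill-mask", model="bert-base-cased")

# Two made-up cloze paraphrases of the same factual query (not taken from ParaRel).
paraphrases = [
    "The capital of France is [MASK].",
    "France's capital city is [MASK].",
]

top_predictions = [fill(p)[0]["token_str"].strip() for p in paraphrases]
consistent = len(set(top_predictions)) == 1
print(top_predictions, "-> consistent" if consistent else "-> inconsistent")
```

Aggregating such agreement checks over many paraphrases and relations gives a consistency score of the kind the paper reports.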

26 citations

Book Chapter
05 Sep 2021
TL;DR: This article proposes a neural-network-based Handwritten Text Recognition (HTR) model that recognizes full pages of handwritten or printed text without image segmentation; it can extract the text present in an image and sequence it correctly without imposing any constraints on the orientation, layout, or size of text and non-text.
Abstract: We present a Neural Network based Handwritten Text Recognition (HTR) model architecture that can be trained to recognize full pages of handwritten or printed text without image segmentation. Being based on an Image to Sequence architecture, it can extract text present in an image and then sequence it correctly without imposing any constraints regarding orientation, layout and size of text and non-text. Further, it can also be trained to generate auxiliary markup related to formatting, layout and content. We use a character-level vocabulary, thereby supporting the language and terminology of any subject. The model achieves a new state of the art in paragraph-level recognition on the IAM dataset. When evaluated on scans of real-world handwritten free-form test answers - beset with curved and slanted lines, drawings, tables, math, chemistry and other symbols - it performs better than all commercially available HTR cloud APIs. It is deployed in production as part of a commercial web application.
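
As a rough illustration of the image-to-sequence idea described above (an assumption-based PyTorch sketch, not the paper's released architecture), a CNN encodes the page image into a feature sequence and a Transformer decoder emits characters from a character-level vocabulary; positional encodings and training code are omitted for brevity.

```python
import torch
import torch.nn as nn

class Img2Seq(nn.Module):
    """Schematic image-to-sequence recognizer: CNN encoder + Transformer decoder."""

    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        # Downsampling CNN turns the page image into a grid of feature vectors.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Character-level vocabulary: one embedding per character/symbol.
        self.char_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, image, prev_chars):
        feats = self.cnn(image)                    # (B, C, H', W')
        memory = feats.flatten(2).transpose(1, 2)  # (B, H'*W', C): encoder "sequence"
        tgt = self.char_emb(prev_chars)            # (B, T, C)
        # Causal mask so each character only attends to earlier ones.
        t = prev_chars.size(1)
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)                    # (B, T, vocab_size) character logits

# Smoke test with a dummy 64x256 grayscale page crop and 5 previously decoded characters.
logits = Img2Seq(vocab_size=100)(torch.randn(1, 1, 64, 256),
                                 torch.zeros(1, 5, dtype=torch.long))
print(logits.shape)  # torch.Size([1, 5, 100])
```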

26 citations

Posted Content
TL;DR: This introduction tells the story of how words are put into computers, explaining word vectors (word embeddings): why they exist, what problems they solve, where they come from, how they have changed over time, and what some of the open questions about them are.
Abstract: This introduction aims to tell the story of how we put words into computers. It is part of the story of the field of natural language processing (NLP), a branch of artificial intelligence. It targets a wide audience with a basic understanding of computer programming, but avoids a detailed mathematical treatment, and it does not present any algorithms. It also does not focus on any particular application of NLP such as translation, question answering, or information extraction. The ideas presented here were developed by many researchers over many decades, so the citations are not exhaustive but rather direct the reader to a handful of papers that are, in the author's view, seminal. After reading this document, you should have a general understanding of word vectors (also known as word embeddings): why they exist, what problems they solve, where they come from, how they have changed over time, and what some of the open questions about them are. Readers already familiar with word vectors are advised to skip to Section 5 for the discussion of the most recent advance, contextual word vectors.
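
For readers new to the topic, the following toy snippet shows what a word vector is in practice: a dense array per word whose geometry encodes similarity. The numbers below are invented for illustration; real embeddings such as word2vec or GloVe are learned from large corpora.

```python
import numpy as np

# Made-up 3-dimensional "embeddings"; real ones typically have 100-1000 dimensions.
vectors = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.78, 0.70, 0.12]),
    "apple": np.array([0.05, 0.10, 0.90]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["king"], vectors["queen"]))  # high: related words
print(cosine(vectors["king"], vectors["apple"]))  # low: unrelated words
```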

26 citations

Proceedings Article
03 May 2021
TL;DR: In this article, the authors present three properties of position embeddings that capture word distance in vector space - translation invariance, monotonicity, and symmetry - and propose a new probing test (called 'identical word probing') and mathematical indicators to quantitatively detect the general attention patterns with respect to these properties.
Abstract: Various Position Embeddings (PEs) have been proposed in Transformer-based architectures (e.g. BERT) to model word order. These are empirically-driven and perform well, but no formal framework exists to systematically study them. To address this, we present three properties of PEs that capture word distance in vector space: translation invariance, monotonicity, and symmetry. These properties formally capture the behaviour of PEs and allow us to reinterpret sinusoidal PEs in a principled way. Moreover, we propose a new probing test (called 'identical word probing') and mathematical indicators to quantitatively detect the general attention patterns with respect to the above properties. An empirical evaluation of seven PEs (and their combinations) for classification (GLUE) and span prediction (SQuAD) shows that: (1) both classification and span prediction benefit from translation invariance and local monotonicity, while symmetry slightly decreases performance; (2) the fully-learnable absolute PE performs better in classification, while relative PEs perform better in span prediction. We contribute the first formal and quantitative analysis of desiderata for PEs, and a principled discussion about their correlation to the performance of typical downstream tasks.
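
The translation-invariance and symmetry properties can be checked directly for sinusoidal position embeddings, since the dot product between two positions depends only on their offset; the snippet below is an illustrative probe, not the paper's exact mathematical indicators.

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model):
    """Standard sinusoidal position embeddings (sin on even dims, cos on odd dims)."""
    pos = np.arange(num_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(128, 64)

# Translation invariance: the score for offset 8 is the same at any absolute position.
print(pe[10] @ pe[18], pe[50] @ pe[58])
# Symmetry: offsets +8 and -8 give the same score (cosine is an even function).
print(pe[20] @ pe[28], pe[20] @ pe[12])
```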

26 citations

Proceedings ArticleDOI
19 Dec 2020
TL;DR: In this article, the authors present manually tagged 2-class and 3-class SA datasets in Bengali and demonstrate that the multilingual BERT model with relevant extensions can be trained via transfer learning over those novel datasets to improve the state-of-the-art performance in sentiment classification tasks.
Abstract: Sentiment analysis (SA) in Bengali is challenging due to this Indo-Aryan language's highly inflected properties, with more than 160 different inflected forms for verbs, 36 different forms for nouns, and 24 different forms for pronouns. The lack of standard labeled datasets in the Bengali domain makes the task of SA even harder. In this paper, we present manually tagged 2-class and 3-class SA datasets in Bengali. We also demonstrate that the multilingual BERT model with relevant extensions can be trained via transfer learning over these novel datasets to improve the state-of-the-art performance in sentiment classification tasks. This deep learning model achieves an accuracy of 71% for 2-class sentiment classification, compared to the current state-of-the-art accuracy of 68%. We also present the very first Bengali SA classifier for the 3-class manually tagged dataset, and our proposed model achieves an accuracy of 60%. We further use this model to analyze the sentiment of public comments in an online daily newspaper. Our analysis shows that people post negative comments for political or sports news more often, while comments on religious articles represent positive sentiment. The dataset and code are publicly available at https://github.com/KhondokerIslam/BengaliSentiment
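
Here is a minimal sketch of the transfer-learning recipe described above: fine-tuning multilingual BERT for 2-class sentiment classification with the Hugging Face transformers library. The two Bengali examples are placeholders, and the actual data and code are in the linked repository.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2  # 2-class sentiment
)

# Placeholder examples; replace with the labeled Bengali comments from the dataset.
texts = ["খুব ভালো লেগেছে", "একদম ভালো না"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # one transfer-learning step on the new task
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```

In practice this step is repeated over the full labeled dataset for a few epochs; only the classification head is new, while the pre-trained multilingual encoder is reused.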

26 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.