Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
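In this text-to-text format, every task, whether translation, classification, or summarization, is handled by feeding the model an input string and training it to produce an output string, with a short task prefix indicating what to do. Below is a minimal sketch of that interface using the publicly released T5 checkpoints through the Hugging Face Transformers library; the library choice and the `t5-small` checkpoint are assumptions of this illustration, not something specified on this page.

```python
# Minimal sketch: exercising a released T5 checkpoint in its text-to-text format.
# Assumes the Hugging Face `transformers` library and the `t5-small` checkpoint;
# the task prefixes follow the convention described in the paper.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def t5_generate(prompt: str, max_new_tokens: int = 60) -> str:
    """One text-to-text call: string in, string out."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Every task is expressed the same way; only the prefix changes.
print(t5_generate("translate English to German: The house is wonderful."))
print(t5_generate("summarize: Transfer learning, where a model is first pre-trained "
                  "on a data-rich task before being fine-tuned on a downstream task, "
                  "has emerged as a powerful technique in natural language processing."))
```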


Citations
Book Chapter
TL;DR: This article investigated the impact of pre-training models (one T5, three Pegasuses, three ProphetNets) on several Wikipedia datasets in English and Indonesian and compared the results to the Wikipedia systems' summaries.
Abstract: This paper surveys several recent abstractive summarization methods: T5, Pegasus, and ProphetNet. We implement the systems in two languages, English and Indonesian, and investigate the impact of pre-training models (one T5, three Pegasuses, three ProphetNets) on several Wikipedia datasets in both languages, comparing the results to the Wikipedia systems' summaries. The T5-Large, the Pegasus-XSum, and the ProphetNet-CNNDM provide the best summarization. The most significant factors influencing ROUGE performance are coverage, density, and compression; the higher these scores, the better the summary. Other factors that influence the ROUGE scores are the pre-training objective, the dataset's characteristics, the dataset used for testing the pre-trained model, and the cross-lingual function. Suggestions for addressing this paper's limitations are: (1) ensure that the dataset used for the pre-training model is sufficiently large and contains adequate instances for handling cross-lingual purposes; (2) keep the advanced process (fine-tuning) reasonable. We recommend using a large dataset with comprehensive topic coverage from many languages before implementing advanced processes such as the train-infer-train procedure for zero-shot translation in the training stage of the pre-training model.
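Since the comparison above rests on ROUGE scores, here is a minimal sketch of how such summaries are typically scored; the Hugging Face `transformers` pipeline, the `google/pegasus-xsum` checkpoint, and the `rouge_score` package are assumptions of this illustration rather than tools named in the abstract.

```python
# Minimal sketch: score a pretrained abstractive summarizer with ROUGE.
# Library and model choices here are illustrative assumptions.
from transformers import pipeline
from rouge_score import rouge_scorer

summarizer = pipeline("summarization", model="google/pegasus-xsum")
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

article = ("Borobudur, in Central Java, is the world's largest Buddhist temple. "
           "Built in the 9th century and restored in the 20th century, it is a "
           "major tourist attraction and a UNESCO World Heritage Site.")
reference = "Borobudur is a 9th-century Buddhist temple in Central Java and a UNESCO World Heritage Site."

prediction = summarizer(article, max_length=48, min_length=8)[0]["summary_text"]
for name, score in scorer.score(reference, prediction).items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```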

2 citations

DOI
01 Sep 2021
TL;DR: In this article, the authors proposed NMT-Stroke, an attack framework that can maliciously divert the translation of a victim NMT model by modeling memory fault injections with the rowhammer attack vector.
Abstract: The rapid development of deep learning has significantly bolstered the performance of natural language processing (NLP) in the form of language modeling. Recent advances in hardware security studies have demonstrated that hardware-based threats can severely jeopardize the integrity of computing systems (e.g., fault attacks on data at rest). Internal adversaries exploiting such hardware vulnerabilities are becoming a major security concern, yet the impact of hardware faults on systems running NLP models has not been fully understood. In this paper, we perform the first investigation of hardware-based fault injections in modern neural machine translation (NMT) models. We find that, compared to neural network classifiers (e.g., CNNs), fault attacks on NMT models present unique challenges. We propose a novel attack framework, NMT-Stroke, that can maliciously divert the translation of a victim NMT model by modeling memory fault injections with the rowhammer attack vector. We design a fault injection strategy that minimizes the bit flips needed to mislead the translation toward an arbitrary natural output sentence. Our evaluation on state-of-the-art Transformer-based NMT models shows that NMT-Stroke can effectively induce the attacker-desired and linguistically sound translation by faulting minimal parameter bits. Our work highlights the significance of understanding the robustness of emerging NLP models in the presence of hardware vulnerabilities, which could lead to new research directions.
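The primitive underlying such attacks is flipping individual bits of model parameters while they sit in memory. The sketch below only emulates a single bit flip in software on a PyTorch weight tensor to show why one flipped exponent bit matters; it is not the paper's rowhammer-based injection strategy, and the choice of tensor and bit position is arbitrary.

```python
# Software emulation of a single-bit fault in a float32 parameter tensor.
# Illustration only: NMT-Stroke induces flips in DRAM via rowhammer and
# selects which bits to fault strategically, which is not modeled here.
import torch

def flip_bit_(weights: torch.Tensor, flat_index: int, bit: int) -> None:
    """Flip one bit of the float32 element at `flat_index`, in place."""
    flat = weights.view(-1)
    elem = flat[flat_index : flat_index + 1]                   # shares storage with `weights`
    flipped = (elem.view(torch.int32) ^ (1 << bit)).view(torch.float32)
    elem.copy_(flipped)

w = torch.randn(4, 4)
print("before:", w[0, 0].item())
flip_bit_(w, flat_index=0, bit=30)   # a high exponent bit: the value changes by orders of magnitude
print("after: ", w[0, 0].item())
```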

2 citations

Posted Content
TL;DR: Kernelized Transformer, as discussed by the authors, approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution, which not only helps in learning a generic kernel end-to-end but also reduces the time and space complexity of Transformers from quadratic to linear.
Abstract: In this work we introduce KERNELIZED TRANSFORMER, a generic, scalable, data-driven framework for learning the kernel function in Transformers. Our framework approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution. This not only helps in learning a generic kernel end-to-end, but also reduces the time and space complexity of Transformers from quadratic to linear. We show that KERNELIZED TRANSFORMERS achieve performance comparable to existing efficient Transformer architectures, in terms of both accuracy and computational efficiency. Our study also demonstrates that the choice of kernel has a substantial impact on performance, and that kernel learning variants are competitive alternatives to fixed kernel Transformers on both long and short sequence tasks.
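To make the quadratic-to-linear claim concrete, the sketch below shows generic feature-map ("kernelized") attention: softmax(QK^T)V is replaced by phi(Q)(phi(K)^T V) with a row-wise normalizer, which can be computed in time linear in the sequence length. The simple elu+1 feature map used here is a stand-in for the learned spectral feature maps described in the paper.

```python
# Minimal sketch of feature-map (kernelized) linear attention in PyTorch.
# The fixed elu+1 feature map is a stand-in for the paper's learned spectral
# feature maps; tensors are (batch, seq_len, dim).
import torch
import torch.nn.functional as F

def feature_map(x: torch.Tensor) -> torch.Tensor:
    """A simple positive feature map phi(x)."""
    return F.elu(x) + 1.0

def linear_attention(q, k, v):
    """Compute phi(Q) (phi(K)^T V), normalized row-wise, in O(seq_len) time."""
    q_f, k_f = feature_map(q), feature_map(k)                   # (B, L, Dk)
    kv = torch.einsum("blf,bld->bfd", k_f, v)                   # sum_l phi(k_l) v_l^T
    z = 1.0 / (torch.einsum("blf,bf->bl", q_f, k_f.sum(dim=1)) + 1e-6)
    return torch.einsum("blf,bfd,bl->bld", q_f, kv, z)

q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
print(linear_attention(q, k, v).shape)   # torch.Size([2, 128, 64])
```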

2 citations

Posted Content
TL;DR: This article used Transformer models within an encoder-decoder architecture with graph and tree representations to generate questions from MIT's 6.036 Introduction to Machine Learning course and trained a machine learning model to answer these questions.
Abstract: Can a machine learn Machine Learning? This work trains a machine learning model to solve machine learning problems from a University undergraduate level course. We generate a new training set of questions and answers consisting of course exercises, homework, and quiz questions from MIT's 6.036 Introduction to Machine Learning course and train a machine learning model to answer these questions. Our system demonstrates an overall accuracy of 96% for open-response questions and 97% for multiple-choice questions, compared with MIT students' average of 93%, achieving grade A performance in the course, all in real-time. Questions cover all 12 topics taught in the course, excluding coding questions or questions with images. Topics include: (i) basic machine learning principles; (ii) perceptrons; (iii) feature extraction and selection; (iv) logistic regression; (v) regression; (vi) neural networks; (vii) advanced neural networks; (viii) convolutional neural networks; (ix) recurrent neural networks; (x) state machines and MDPs; (xi) reinforcement learning; and (xii) decision trees. Our system uses Transformer models within an encoder-decoder architecture with graph and tree representations. An important aspect of our approach is a data-augmentation scheme for generating new example problems. We also train a machine learning model to generate problem hints. Thus, our system automatically generates new questions across topics, answers both open-response questions and multiple-choice questions, classifies problems, and generates problem hints, pushing the envelope of AI for STEM education.
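As a rough illustration of what a template-based data-augmentation scheme for course problems could look like, the sketch below fills a hand-written perceptron question template with random values and computes the matching answer. The template, value ranges, and answer rule are invented for this example and are not taken from the paper.

```python
# Hypothetical template-based augmentation for machine-learning course problems.
# Everything here (template, ranges, answer rule) is invented for illustration.
import random

TEMPLATE = ("A perceptron has weights w = ({w1}, {w2}) and bias b = {b}. "
            "For input x = ({x1}, {x2}), what is its output (0 or 1)?")

def perceptron_output(w1, w2, b, x1, x2):
    """Threshold rule: output 1 if w . x + b > 0, else 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def generate_example(rng: random.Random):
    vals = {k: rng.randint(-3, 3) for k in ("w1", "w2", "b", "x1", "x2")}
    return TEMPLATE.format(**vals), str(perceptron_output(**vals))

rng = random.Random(0)
for _ in range(3):
    question, answer = generate_example(rng)
    print(question, "->", answer)
```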

2 citations

Posted Content
TL;DR: SEDE as discussed by the authors is a dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website, which contains a variety of real-world challenges which were rarely reflected so far in any other semantic parsing dataset.
Abstract: Most available semantic parsing datasets, comprising pairs of natural utterances and logical forms, were collected solely for the purpose of training and evaluating natural language understanding systems. As a result, they do not contain any of the richness and variety of naturally occurring utterances, where humans ask about data they need or are curious about. In this work, we release SEDE, a dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website. We show that these pairs contain a variety of real-world challenges which were rarely reflected in other semantic parsing datasets, propose an evaluation metric based on comparison of partial query clauses that is more suitable for real-world queries, and conduct experiments with strong baselines, showing a large gap between performance on SEDE and on other common datasets.
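The clause-level evaluation idea can be illustrated roughly as follows: split the predicted and gold SQL into their top-level clauses and score per-clause token overlap instead of requiring an exact string match. This is a simplified sketch of the idea, not the paper's exact metric, and the keyword list and scoring details are assumptions.

```python
# Rough sketch of clause-level comparison between a predicted and a gold SQL
# query: split each query into top-level clauses and compute per-clause token
# F1, then average. Simplified illustration only, not the paper's exact metric.
import re

CLAUSE_KEYWORDS = ["select", "from", "where", "group by", "having", "order by", "limit"]
PATTERN = re.compile(r"\b(" + "|".join(CLAUSE_KEYWORDS) + r")\b", re.IGNORECASE)

def split_clauses(sql: str) -> dict:
    """Map each clause keyword to the set of tokens that follow it."""
    parts = PATTERN.split(sql)
    return {kw.lower(): set(body.lower().split())
            for kw, body in zip(parts[1::2], parts[2::2])}

def clause_f1(pred: str, gold: str) -> float:
    pred_c, gold_c = split_clauses(pred), split_clauses(gold)
    scores = []
    for kw in set(pred_c) | set(gold_c):
        p, g = pred_c.get(kw, set()), gold_c.get(kw, set())
        overlap = len(p & g)
        if not p or not g or not overlap:
            scores.append(0.0)
            continue
        precision, recall = overlap / len(p), overlap / len(g)
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores) if scores else 0.0

print(clause_f1("SELECT name FROM users WHERE age > 21",
                "SELECT name FROM users WHERE age >= 21 ORDER BY name"))
```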

2 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.