Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
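In this text-to-text format, every task, whether translation, classification, or summarization, is handled by feeding the model an input string and training it to produce an output string, with a short task prefix indicating what to do. Below is a minimal sketch of that interface using the publicly released T5 checkpoints through the Hugging Face Transformers library; the library choice and the `t5-small` checkpoint are assumptions of this illustration, not something specified on this page.

```python
# Minimal sketch: exercising a released T5 checkpoint in its text-to-text format.
# Assumes the Hugging Face `transformers` library and the `t5-small` checkpoint;
# the task prefixes follow the convention described in the paper.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def t5_generate(prompt: str, max_new_tokens: int = 60) -> str:
    """One text-to-text call: string in, string out."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Every task is expressed the same way; only the prefix changes.
print(t5_generate("translate English to German: The house is wonderful."))
print(t5_generate("summarize: Transfer learning, where a model is first pre-trained "
                  "on a data-rich task before being fine-tuned on a downstream task, "
                  "has emerged as a powerful technique in natural language processing."))
```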


Citations
Book Chapter
TL;DR: This article investigated the impact of pre-training models (one T5, three Pegasuses, three ProphetNets) on several Wikipedia datasets in English and Indonesian and compared the results to the Wikipedia systems' summaries.
Abstract: This paper surveys several recent abstractive summarization methods: T5, Pegasus, and ProphetNet. We implement the systems in two languages, English and Indonesian, and investigate the impact of pre-training models (one T5, three Pegasuses, three ProphetNets) on several Wikipedia datasets in both languages, comparing the results to the Wikipedia systems' summaries. The T5-Large, the Pegasus-XSum, and the ProphetNet-CNNDM provide the best summarization. The most significant factors influencing ROUGE performance are coverage, density, and compression; the higher these scores, the better the summary. Other factors that influence the ROUGE scores are the pre-training objective, the dataset's characteristics, the dataset used for testing the pre-trained model, and the cross-lingual function. Suggestions for addressing this paper's limitations are: (1) ensure that the dataset used for the pre-training model is sufficiently large and contains adequate instances for handling cross-lingual purposes; (2) keep the advanced process (fine-tuning) reasonable. We recommend using a large dataset with comprehensive topic coverage from many languages before implementing advanced processes such as the train-infer-train procedure for zero-shot translation in the training stage of the pre-training model.
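Since the comparison above rests on ROUGE scores, here is a minimal sketch of how such summaries are typically scored; the Hugging Face `transformers` pipeline, the `google/pegasus-xsum` checkpoint, and the `rouge_score` package are assumptions of this illustration rather than tools named in the abstract.

```python
# Minimal sketch: score a pretrained abstractive summarizer with ROUGE.
# Library and model choices here are illustrative assumptions.
from transformers import pipeline
from rouge_score import rouge_scorer

summarizer = pipeline("summarization", model="google/pegasus-xsum")
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

article = ("Borobudur, in Central Java, is the world's largest Buddhist temple. "
           "Built in the 9th century and restored in the 20th century, it is a "
           "major tourist attraction and a UNESCO World Heritage Site.")
reference = "Borobudur is a 9th-century Buddhist temple in Central Java and a UNESCO World Heritage Site."

prediction = summarizer(article, max_length=48, min_length=8)[0]["summary_text"]
for name, score in scorer.score(reference, prediction).items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```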

2 citations

DOI
01 Sep 2021
TL;DR: In this article, the authors proposed NMT-Stroke, an attack framework that can maliciously divert the translation of a victim NMT model by modeling memory fault injections with the rowhammer attack vector.
Abstract: The rapid development of deep learning has significantly bolstered the performance of natural language processing (NLP) in the form of language modeling. Recent advances in hardware security studies have demonstrated that hardware-based threats can severely jeopardize the integrity of computing systems (e.g., fault attacks on data at rest). Internal adversaries exploiting such hardware vulnerabilities are becoming a major security concern, yet the impact of hardware faults on systems running NLP models has not been fully understood. In this paper, we perform the first investigation of hardware-based fault injections in modern neural machine translation (NMT) models. We find that, compared to neural network classifiers (e.g., CNNs), fault attacks on NMT models present unique challenges. We propose a novel attack framework, NMT-Stroke, that can maliciously divert the translation of a victim NMT model by modeling memory fault injections with the rowhammer attack vector. We design a fault injection strategy that minimizes the bit flips needed to mislead the translation toward an arbitrary natural output sentence. Our evaluation on state-of-the-art Transformer-based NMT models shows that NMT-Stroke can effectively induce the attacker-desired and linguistically sound translation by faulting minimal parameter bits. Our work highlights the significance of understanding the robustness of emerging NLP models in the presence of hardware vulnerabilities, which could lead to new research directions.
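The primitive underlying such attacks is flipping individual bits of model parameters while they sit in memory. The sketch below only emulates a single bit flip in software on a PyTorch weight tensor to show why one flipped exponent bit matters; it is not the paper's rowhammer-based injection strategy, and the choice of tensor and bit position is arbitrary.

```python
# Software emulation of a single-bit fault in a float32 parameter tensor.
# Illustration only: NMT-Stroke induces flips in DRAM via rowhammer and
# selects which bits to fault strategically, which is not modeled here.
import torch

def flip_bit_(weights: torch.Tensor, flat_index: int, bit: int) -> None:
    """Flip one bit of the float32 element at `flat_index`, in place."""
    flat = weights.view(-1)
    elem = flat[flat_index : flat_index + 1]                   # shares storage with `weights`
    flipped = (elem.view(torch.int32) ^ (1 << bit)).view(torch.float32)
    elem.copy_(flipped)

w = torch.randn(4, 4)
print("before:", w[0, 0].item())
flip_bit_(w, flat_index=0, bit=30)   # a high exponent bit: the value changes by orders of magnitude
print("after: ", w[0, 0].item())
```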

2 citations

Posted Content
TL;DR: Kernelized Transformer, as discussed by the authors, approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution, which not only helps in learning a generic kernel end-to-end but also reduces the time and space complexity of Transformers from quadratic to linear.
Abstract: In this work we introduce KERNELIZED TRANSFORMER, a generic, scalable, data-driven framework for learning the kernel function in Transformers. Our framework approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution. This not only helps in learning a generic kernel end-to-end, but also reduces the time and space complexity of Transformers from quadratic to linear. We show that KERNELIZED TRANSFORMERS achieve performance comparable to existing efficient Transformer architectures, in terms of both accuracy and computational efficiency. Our study also demonstrates that the choice of kernel has a substantial impact on performance, and that kernel learning variants are competitive alternatives to fixed kernel Transformers on both long and short sequence tasks.
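To make the quadratic-to-linear claim concrete, the sketch below shows generic feature-map ("kernelized") attention: softmax(QK^T)V is replaced by phi(Q)(phi(K)^T V) with a row-wise normalizer, which can be computed in time linear in the sequence length. The simple elu+1 feature map used here is a stand-in for the learned spectral feature maps described in the paper.

```python
# Minimal sketch of feature-map (kernelized) linear attention in PyTorch.
# The fixed elu+1 feature map is a stand-in for the paper's learned spectral
# feature maps; tensors are (batch, seq_len, dim).
import torch
import torch.nn.functional as F

def feature_map(x: torch.Tensor) -> torch.Tensor:
    """A simple positive feature map phi(x)."""
    return F.elu(x) + 1.0

def linear_attention(q, k, v):
    """Compute phi(Q) (phi(K)^T V), normalized row-wise, in O(seq_len) time."""
    q_f, k_f = feature_map(q), feature_map(k)                   # (B, L, Dk)
    kv = torch.einsum("blf,bld->bfd", k_f, v)                   # sum_l phi(k_l) v_l^T
    z = 1.0 / (torch.einsum("blf,bf->bl", q_f, k_f.sum(dim=1)) + 1e-6)
    return torch.einsum("blf,bfd,bl->bld", q_f, kv, z)

q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
print(linear_attention(q, k, v).shape)   # torch.Size([2, 128, 64])
```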

2 citations

Posted Content
TL;DR: This article used Transformer models within an encoder-decoder architecture with graph and tree representations to generate questions from MIT's 6.036 Introduction to Machine Learning course and trained a machine learning model to answer these questions.
Abstract: Can a machine learn Machine Learning? This work trains a machine learning model to solve machine learning problems from a University undergraduate level course. We generate a new training set of questions and answers consisting of course exercises, homework, and quiz questions from MIT's 6.036 Introduction to Machine Learning course and train a machine learning model to answer these questions. Our system demonstrates an overall accuracy of 96% for open-response questions and 97% for multiple-choice questions, compared with MIT students' average of 93%, achieving grade A performance in the course, all in real-time. Questions cover all 12 topics taught in the course, excluding coding questions or questions with images. Topics include: (i) basic machine learning principles; (ii) perceptrons; (iii) feature extraction and selection; (iv) logistic regression; (v) regression; (vi) neural networks; (vii) advanced neural networks; (viii) convolutional neural networks; (ix) recurrent neural networks; (x) state machines and MDPs; (xi) reinforcement learning; and (xii) decision trees. Our system uses Transformer models within an encoder-decoder architecture with graph and tree representations. An important aspect of our approach is a data-augmentation scheme for generating new example problems. We also train a machine learning model to generate problem hints. Thus, our system automatically generates new questions across topics, answers both open-response questions and multiple-choice questions, classifies problems, and generates problem hints, pushing the envelope of AI for STEM education.
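As a rough illustration of what a template-based data-augmentation scheme for course problems could look like, the sketch below fills a hand-written perceptron question template with random values and computes the matching answer. The template, value ranges, and answer rule are invented for this example and are not taken from the paper.

```python
# Hypothetical template-based augmentation for machine-learning course problems.
# Everything here (template, ranges, answer rule) is invented for illustration.
import random

TEMPLATE = ("A perceptron has weights w = ({w1}, {w2}) and bias b = {b}. "
            "For input x = ({x1}, {x2}), what is its output (0 or 1)?")

def perceptron_output(w1, w2, b, x1, x2):
    """Threshold rule: output 1 if w . x + b > 0, else 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def generate_example(rng: random.Random):
    vals = {k: rng.randint(-3, 3) for k in ("w1", "w2", "b", "x1", "x2")}
    return TEMPLATE.format(**vals), str(perceptron_output(**vals))

rng = random.Random(0)
for _ in range(3):
    question, answer = generate_example(rng)
    print(question, "->", answer)
```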

2 citations

Posted Content
TL;DR: SEDE as discussed by the authors is a dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website, which contains a variety of real-world challenges which were rarely reflected so far in any other semantic parsing dataset.
Abstract: Most available semantic parsing datasets, comprising pairs of natural utterances and logical forms, were collected solely for the purpose of training and evaluating natural language understanding systems. As a result, they do not contain any of the richness and variety of naturally occurring utterances, where humans ask about data they need or are curious about. In this work, we release SEDE, a dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website. We show that these pairs contain a variety of real-world challenges which were rarely reflected in other semantic parsing datasets, propose an evaluation metric based on comparison of partial query clauses that is more suitable for real-world queries, and conduct experiments with strong baselines, showing a large gap between performance on SEDE and on other common datasets.
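The clause-level evaluation idea can be illustrated roughly as follows: split the predicted and gold SQL into their top-level clauses and score per-clause token overlap instead of requiring an exact string match. This is a simplified sketch of the idea, not the paper's exact metric, and the keyword list and scoring details are assumptions.

```python
# Rough sketch of clause-level comparison between a predicted and a gold SQL
# query: split each query into top-level clauses and compute per-clause token
# F1, then average. Simplified illustration only, not the paper's exact metric.
import re

CLAUSE_KEYWORDS = ["select", "from", "where", "group by", "having", "order by", "limit"]
PATTERN = re.compile(r"\b(" + "|".join(CLAUSE_KEYWORDS) + r")\b", re.IGNORECASE)

def split_clauses(sql: str) -> dict:
    """Map each clause keyword to the set of tokens that follow it."""
    parts = PATTERN.split(sql)
    return {kw.lower(): set(body.lower().split())
            for kw, body in zip(parts[1::2], parts[2::2])}

def clause_f1(pred: str, gold: str) -> float:
    pred_c, gold_c = split_clauses(pred), split_clauses(gold)
    scores = []
    for kw in set(pred_c) | set(gold_c):
        p, g = pred_c.get(kw, set()), gold_c.get(kw, set())
        overlap = len(p & g)
        if not p or not g or not overlap:
            scores.append(0.0)
            continue
        precision, recall = overlap / len(p), overlap / len(g)
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores) if scores else 0.0

print(clause_f1("SELECT name FROM users WHERE age > 21",
                "SELECT name FROM users WHERE age >= 21 ORDER BY name"))
```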

2 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.