Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Home
/
Papers
/
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu - Show less +5 more

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

read less

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Chatty Goose: A Python Framework for Conversational Search

[...]

Edwin Zhang¹, Sheng-Chieh Lin¹, Jheng-Hong Yang¹, Ronak Pradeep¹, Rodrigo Nogueira¹, Jimmy Lin¹ - Show less +2 more•Institutions (1)

University of Waterloo¹

11 Jul 2021

TL;DR: Chatty Goose as discussed by the authors is an open-source Python conversational search framework that provides strong, reproducible re-ranking pipelines built on recent advances in neural models, which can lower the barrier of entry for research in Conversational Search by providing reproducible baselines that researchers can build on top of.

...read moreread less

Abstract: Chatty Goose is an open-source Python conversational search framework that provides strong, reproducible reranking pipelines built on recent advances in neural models. The framework comprises extensible modular components that integrate with popular libraries such as Transformers by HuggingFace and ParlAI by Facebook. Our aim is to lower the barrier of entry for research in conversational search by providing reproducible baselines that researchers can build on top of. We provide an overview of the framework and demonstrate how to instantiate a new system from scratch. Chatty Goose incorporates improvements to components that we introduced in the TREC 2019 Conversational Assistance Track (CAsT), where our submission represented the top-performing system. Using our framework, a comparable run can be reproduced with just a few lines of code.

...read moreread less

2 citations

Proceedings Article•

Predicting Inductive Biases of Pre-Trained Models

[...]

Charles Lovering¹, Rohan Jha¹, Tal Linzen², Ellie Pavlick¹•Institutions (2)

Brown University¹, New York University²

03 May 2021

TL;DR: This paper test the hypothesis that the extent to which a feature influences a model's decisions can be predicted using a combination of two factors: the feature's "extractability" after pre-training (measured using information-theoretic probing techniques), and the evidence available during fine-tuning.

...read moreread less

Abstract: Most current NLP systems are based on a pre-train-then-fine-tune paradigm, in which a large neural network is first trained in a self-supervised way designed to encourage the network to extract broadly-useful linguistic features, and then fine-tuned for a specific task of interest. Recent work attempts to understand why this recipe works and explain when it fails. Currently, such analyses have produced two sets of apparently-contradictory results. Work that analyzes the representations that result from pre-training (via "probing classifiers") finds evidence that rich features of linguistic structure can be decoded with high accuracy, but work that analyzes model behavior after fine-tuning (via "challenge sets") indicates that decisions are often not based on such structure but rather on spurious heuristics specific to the training set. In this work, we test the hypothesis that the extent to which a feature influences a model's decisions can be predicted using a combination of two factors: The feature's "extractability" after pre-training (measured using information-theoretic probing techniques), and the "evidence" available during fine-tuning (defined as the feature's co-occurrence rate with the label). In experiments with both synthetic and natural language data, we find strong evidence (statistically significant correlations) supporting this hypothesis.

...read moreread less

2 citations

Posted Content•

Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners

[...]

Ningyu Zhang, Luoqiu Li, Xiang Chen, Shumin Deng, Zhen Bi, Chuanqi Tan, Fei Huang, Huajun Chen - Show less +4 more

30 Aug 2021-arXiv: Computation and Language

TL;DR: DifferentiAble pRompT (DART) as discussed by the authors is a pluggable, extensible, and efficient approach which can convert small language models into better few-shot learners without any prompt engineering.

...read moreread less

Abstract: Large-scale pre-trained language models have contributed significantly to natural language processing by demonstrating remarkable abilities as few-shot learners. However, their effectiveness depends mainly on scaling the model parameters and prompt design, hindering their implementation in most real-world applications. This study proposes a novel pluggable, extensible, and efficient approach named DifferentiAble pRompT (DART), which can convert small language models into better few-shot learners without any prompt engineering. The main principle behind this approach involves reformulating potential natural language processing tasks into the task of a pre-trained language model and differentially optimizing the prompt template as well as the target label with backpropagation. Furthermore, the proposed approach can be: (i) Plugged to any pre-trained language models; (ii) Extended to widespread classification tasks. A comprehensive evaluation of standard NLP tasks demonstrates that the proposed approach achieves a better few-shot performance.

...read moreread less

2 citations

Proceedings Article•DOI•

Attribute Value Generation from Product Title using Language Models

[...]

Kalyani Roy, Pawan Goyal, Manish Pandey

01 Aug 2021

TL;DR: In this paper, a generative approach to attribute value extraction using language models is presented, which leverages the large-scale pretraining of the GPT-2 and the T5 text-to-text transformer.

...read moreread less

Abstract: Identifying the value of product attribute is essential for many e-commerce functions such as product search and product recommendations. Therefore, identifying attribute values from unstructured product descriptions is a critical undertaking for any e-commerce retailer. What makes this problem challenging is the diversity of product types and their attributes and values. Existing methods have typically employed multiple types of machine learning models, each of which handles specific product types or attribute classes. This has limited their scalability and generalization for large scale real world e-commerce applications. Previous approaches for this task have formulated the attribute value extraction as a Named Entity Recognition (NER) task or a Question Answering (QA) task. In this paper we have presented a generative approach to the attribute value extraction problem using language models. We leverage the large-scale pretraining of the GPT-2 and the T5 text-to-text transformer to create fine-tuned models that can effectively perform this task. We show that a single general model is very effective for this task over a broad set of product attribute values with the open world assumption. Our approach achieves state-of-the-art performance for different attribute classes, which has previously required a diverse set of models.

...read moreread less

2 citations

Posted Content•

TransVG: End-to-End Visual Grounding with Transformers

[...]

Jiajun Deng¹, Zhengyuan Yang², Tianlang Chen², Wengang Zhou¹, Houqiang Li¹ - Show less +1 more•Institutions (2)

University of Science and Technology of China¹, University of Rochester²

17 Apr 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: TransVG as mentioned in this paper proposes a transformer-based framework for visual grounding to establish the multi-modal correspondence by leveraging transformers, and empirically show that the complex fusion modules can be replaced by a simple stack of transformer encoder layers with higher performance.

...read moreread less

Abstract: In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region onto an image. The state-of-the-art methods, including two-stage or one-stage ones, rely on a complex module with manually-designed mechanisms to perform the query reasoning and multi-modal fusion. However, the involvement of certain mechanisms in fusion module design, such as query decomposition and image scene graph, makes the models easily overfit to datasets with specific scenarios, and limits the plenitudinous interaction between the visual-linguistic context. To avoid this caveat, we propose to establish the multi-modal correspondence by leveraging transformers, and empirically show that the complex fusion modules (e.g., modular attention network, dynamic graph, and multi-modal tree) can be replaced by a simple stack of transformer encoder layers with higher performance. Moreover, we re-formulate the visual grounding as a direct coordinates regression problem and avoid making predictions out of a set of candidates (i.e., region proposals or anchor boxes). Extensive experiments are conducted on five widely used datasets, and a series of state-of-the-art records are set by our TransVG. We build the benchmark of transformer-based visual grounding framework and will make our code available to the public.

...read moreread less

2 citations