Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
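
As a minimal sketch of what a text-to-text format can look like in practice, the Python snippet below casts three different task types as plain (input string, target string) pairs, so that one sequence-to-sequence model could in principle be trained on all of them. The function name `to_text_to_text`, the dictionary fields, and the exact prefix strings are illustrative assumptions and are not taken from the released code.

```python
def to_text_to_text(task, example):
    """Cast one labeled example as an (input_text, target_text) pair of strings.

    The task names, field names, and prefixes are hypothetical choices made
    for this illustration only.
    """
    if task == "translation":
        return ("translate English to German: " + example["source"],
                example["target"])
    if task == "summarization":
        return ("summarize: " + example["document"], example["summary"])
    if task == "classification":
        # The class label is emitted as a literal word, not a class index,
        # so classification uses the same string-to-string interface.
        return ("classify: " + example["sentence"], example["label"])
    raise ValueError(f"unknown task: {task}")


pairs = [
    to_text_to_text("translation",
                    {"source": "That is good.", "target": "Das ist gut."}),
    to_text_to_text("summarization",
                    {"document": "A long news article.", "summary": "A short summary."}),
    to_text_to_text("classification",
                    {"sentence": "The course is jumping well.", "label": "unacceptable"}),
]

for input_text, target_text in pairs:
    print(repr(input_text), "->", repr(target_text))
```

Because every task reduces to the same pair-of-strings interface, the same model, loss, and decoding procedure can be reused across tasks; only the prefix and the target vocabulary differ.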

Citations

Posted Content

Enriching Transformers with Structured Tensor-Product Representations for Abstractive Summarization

Yichen Jiang, Asli Celikyilmaz¹, Paul Smolensky¹, Paul Soulos, Sudha Rao¹, Hamid Palangi¹, Roland Fernandez¹, Caitlin Smith², Mohit Bansal³, Jianfeng Gao¹

Microsoft¹, University of Southern California², University of North Carolina at Chapel Hill³

02 Jun 2021-arXiv: Computation and Language

TL;DR: In this paper, a structural bias is introduced by encoding two separate representations for each token to represent the syntactic structure (with role vectors) and semantic content (with filler vectors) separately, and the model then binds the role and filler vectors into the TPR as the layer output.

Abstract: Abstractive summarization, the task of generating a concise summary of input documents, requires: (1) reasoning over the source document to determine the salient pieces of information scattered across the long document, and (2) composing a cohesive text by reconstructing these salient facts into a shorter summary that faithfully reflects the complex relations connecting these facts. In this paper, we adapt TP-TRANSFORMER (Schlag et al., 2019), an architecture that enriches the original Transformer (Vaswani et al., 2017) with the explicitly compositional Tensor Product Representation (TPR), for the task of abstractive summarization. The key feature of our model is a structural bias that we introduce by encoding two separate representations for each token to represent the syntactic structure (with role vectors) and semantic content (with filler vectors) separately. The model then binds the role and filler vectors into the TPR as the layer output. We argue that the structured intermediate representations enable the model to take better control of the contents (salient facts) and structures (the syntax that connects the facts) when generating the summary. Empirically, we show that our TP-TRANSFORMER outperforms the Transformer and the original TP-TRANSFORMER significantly on several abstractive summarization datasets based on both automatic and human evaluations. On several syntactic and semantic probing tasks, we demonstrate the emergent structural information in the role vectors and improved syntactic interpretability in the TPR layer outputs. Code and models are available at this https URL.
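
To make the idea of binding role and filler vectors concrete, here is a small pure-Python sketch of the classical Tensor Product Representation, in which each filler is bound to its role by an outer product and the bindings are summed. The abstract does not state which binding operation the model's layers use, so the outer-product form, the function names, and the toy vectors below are all assumptions made for illustration.

```python
def outer(filler, role):
    """Outer product of a filler vector and a role vector (a matrix)."""
    return [[f * r for r in role] for f in filler]


def bind(fillers, roles):
    """Classical TPR: the sum over i of outer(filler_i, role_i)."""
    rows, cols = len(fillers[0]), len(roles[0])
    tpr = [[0.0] * cols for _ in range(rows)]
    for filler, role in zip(fillers, roles):
        product = outer(filler, role)
        for i in range(rows):
            for j in range(cols):
                tpr[i][j] += product[i][j]
    return tpr


def unbind(tpr, role):
    """Recover the filler bound to `role`; exact when roles are orthonormal."""
    return [sum(row[j] * role[j] for j in range(len(role))) for row in tpr]


# Toy data: two tokens with 3-dimensional fillers ("content") and
# 2-dimensional orthonormal roles ("structure").
fillers = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
roles = [[1.0, 0.0], [0.0, 1.0]]

tpr = bind(fillers, roles)
recovered = unbind(tpr, roles[1])
print(recovered)
```

With orthonormal roles, unbinding returns the second filler unchanged, which is the property that lets structure and content be stored together yet queried separately.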

Proceedings Article

Think about it! Improving defeasible reasoning by first modeling the question scenario

Aman Madaan¹, Niket Tandon¹, Dheeraj Rajagopal¹, Peter Clark², Yiming Yang¹, Eduard Hovy¹

Carnegie Mellon University¹, Allen Institute for Artificial Intelligence²

24 Oct 2021

TL;DR: This paper proposed a model to first create a graph of relevant influences, and then leverage that graph as an additional input when answering the question, achieving state-of-the-art performance on three different defeasible reasoning datasets.

Abstract: Defeasible reasoning is the mode of reasoning where conclusions can be overturned by taking into account new evidence. Existing cognitive science literature on defeasible reasoning suggests that a person forms a “mental model” of the problem scenario before answering questions. Our research goal asks whether neural models can similarly benefit from envisioning the question scenario before answering a defeasible query. Our approach is, given a question, to have a model first create a graph of relevant influences, and then leverage that graph as an additional input when answering the question. Our system, CURIOUS, achieves a new state-of-the-art on three different defeasible reasoning datasets. This result is significant as it illustrates that performance can be improved by guiding a system to “think about” a question and explicitly model the scenario, rather than answering reflexively.

Posted Content

Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

Haoming Jiang¹, Danqing Zhang¹, Tianyu Cao¹, Bing Yin¹, Tuo Zhao²

Amazon.com¹, Georgia Institute of Technology²

16 Jun 2021-arXiv: Computation and Language

TL;DR: This paper proposes NEEDLE, a multi-stage computational framework with three essential ingredients: weak label completion, a noise-aware loss function, and final fine-tuning over the strongly labeled data.

Abstract: Weak supervision has shown promising results in many natural language processing tasks, such as Named Entity Recognition (NER). Existing work mainly focuses on learning deep NER models only with weak supervision, i.e., without any human annotation, and shows that by merely using weakly labeled data, one can achieve good performance, though it still underperforms fully supervised NER with manually/strongly labeled data. In this paper, we consider a more practical scenario, where we have both a small amount of strongly labeled data and a large amount of weakly labeled data. Unfortunately, we observe that weakly labeled data does not necessarily improve, and can even degrade, model performance (due to the extensive noise in the weak labels) when we train deep NER models over a simple or weighted combination of the strongly labeled and weakly labeled data. To address this issue, we propose a new multi-stage computational framework, NEEDLE, with three essential ingredients: (1) weak label completion, (2) a noise-aware loss function, and (3) final fine-tuning over the strongly labeled data. Through experiments on E-commerce query NER and Biomedical NER, we demonstrate that NEEDLE can effectively suppress the noise of the weak labels and outperforms existing methods. In particular, we achieve new SOTA F1-scores on 3 Biomedical NER datasets: BC5CDR-chem 93.74, BC5CDR-disease 90.69, NCBI-disease 92.28.

Proceedings Article

A Flexible Multi-Task Model for BERT Serving

01 Jan 2022

TL;DR: This article proposes a BERT-based multi-task (MT) framework suited to iterative and incremental development of tasks. It is based on partial fine-tuning, i.e., fine-tuning only some top layers of BERT while keeping the other layers frozen.

Abstract: We present an efficient BERT-based multi-task (MT) framework that is particularly suitable for iterative and incremental development of the tasks. The proposed framework is based on the idea of partial fine-tuning, i.e., fine-tuning only some top layers of BERT while keeping the other layers frozen. For each task, we independently train a single-task (ST) model using partial fine-tuning. Then we compress the task-specific layers in each ST model using knowledge distillation. Those compressed ST models are finally merged into one MT model so that the frozen layers of the former are shared across the tasks. We exemplify our approach on eight GLUE tasks, demonstrating that it is able to achieve 99.6% of the performance of the full fine-tuning method, while reducing its overhead by up to two thirds.
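
The following toy Python sketch illustrates the two structural ideas in this abstract: freezing all but the top layers of an encoder, and sharing the frozen stack across tasks after merging. It models layers as plain dictionaries instead of real BERT weights, ignores the knowledge-distillation compression step, and uses hypothetical helper names and layer counts throughout, so it should be read as an illustration of the bookkeeping, not as the authors' implementation.

```python
def make_encoder(num_layers):
    """Toy encoder: one dict per layer, all trainable by default."""
    return [{"name": f"layer_{i}", "trainable": True} for i in range(num_layers)]


def freeze_bottom(layers, num_trainable_top):
    """Partial fine-tuning: only the top `num_trainable_top` layers stay trainable."""
    cutoff = len(layers) - num_trainable_top
    return [dict(layer, trainable=(i >= cutoff)) for i, layer in enumerate(layers)]


def merged_layer_count(num_layers, num_trainable_top, num_tasks):
    """Layers held by a merged multi-task model.

    The frozen bottom stack is stored once and shared; each task keeps its
    own copy of the top layers (compression of those layers is ignored here).
    """
    shared = num_layers - num_trainable_top
    return shared + num_tasks * num_trainable_top


# Hypothetical configuration: 12 layers, top 3 fine-tuned, 8 tasks.
encoder = freeze_bottom(make_encoder(12), num_trainable_top=3)
trainable_names = [layer["name"] for layer in encoder if layer["trainable"]]
print(trainable_names)

separate = 8 * 12  # eight independent fully fine-tuned models
merged = merged_layer_count(12, 3, 8)
print(separate, merged)
```

The saving comes entirely from the shared frozen stack: the more layers stay frozen, the closer the merged model's size and compute get to those of a single encoder.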

Posted Content

CO2Sum: Contrastive Learning for Factual-Consistent Abstractive Summarization

Wei Liu¹, Huanqin Wu¹, Wenjing Mu¹, Zhen Li¹, Tao Chen¹, Dan Nie

Tencent¹

02 Dec 2021-arXiv: Computation and Language

TL;DR: The authors propose a contrastive learning scheme for factual-consistent abstractive summarization. Applied on the encoder, it helps the model be aware of the factual information contained in the input article; applied on the decoder, it makes the model generate a factually correct output summary.

Abstract: Generating factual-consistent summaries is a challenging task for abstractive summarization. Previous works mainly encode factual information or perform post-correct/rank after decoding. In this paper, we provide a factual-consistent solution from the perspective of contrastive learning, which is a natural extension of previous works. We propose CO2Sum (Contrastive for Consistency), a contrastive learning scheme that can be easily applied on sequence-to-sequence models for factual-consistent abstractive summarization, showing that the model can be fact-aware without modifying the architecture. CO2Sum applies contrastive learning on the encoder, which can help the model be aware of the factual information contained in the input article, or performs contrastive learning on the decoder, which makes the model generate a factually correct output summary. Moreover, these two schemes are orthogonal and can be combined to further improve faithfulness. Comprehensive experiments on public benchmarks demonstrate that CO2Sum improves the faithfulness on large pre-trained language models and reaches competitive results compared to other strong factual-consistent summarization baselines.
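
For readers unfamiliar with contrastive learning, the sketch below implements a generic InfoNCE-style loss in pure Python: it is small when an anchor representation is closer to a positive example than to the negatives. The abstract does not describe how CO2Sum builds its positive and negative pairs or which loss it uses, so the loss form, the temperature value, and the toy 2-dimensional vectors (stand-ins for encoder or decoder representations) are all assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive's similarity."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - logits[0]


# Hypothetical representations: an article, a summary aligned with it,
# and a summary pointing in an unrelated direction.
article = [1.0, 0.0]
aligned_summary = [0.9, 0.1]
unrelated_summary = [0.0, 1.0]

loss_good = contrastive_loss(article, aligned_summary, [unrelated_summary])
loss_bad = contrastive_loss(article, unrelated_summary, [aligned_summary])
print(loss_good < loss_bad)
```

Minimizing such a loss pulls the anchor toward the positive and pushes it away from the negatives, which is the general mechanism a contrastive scheme relies on regardless of where in the model it is applied.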
