Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
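The key idea of the framework is that every task, including translation, summarization, and classification, is cast as feeding the model text and training it to generate target text. Below is a minimal sketch of this text-to-text framing, assuming the Hugging Face transformers library and the public t5-small checkpoint; it is an illustration, not the paper's own code.

```python
# Minimal sketch of the text-to-text framing, assuming the Hugging Face
# `transformers` library and the public `t5-small` checkpoint.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is expressed as input text with a task prefix;
# the answer is produced by generating text.
examples = [
    "translate English to German: The house is wonderful.",
    "summarize: Transfer learning, where a model is first pre-trained on a "
    "data-rich task before being fine-tuned on a downstream task, has emerged "
    "as a powerful technique in natural language processing.",
    "cola sentence: The course is jumping well.",  # acceptability judged as text
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```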


Citations
Posted Content
TL;DR: CATE as mentioned in this paper employs a pairwise pre-training scheme to learn computation-aware encodings using Transformers with cross-attention, which contain dense and contextualized computation information of neural architectures.
Abstract: Recent works (White et al., 2020a; Yan et al., 2020) demonstrate the importance of architecture encodings in Neural Architecture Search (NAS). These encodings encode either structure or computation information of the neural architectures. Compared to structure-aware encodings, computation-aware encodings map architectures with similar accuracies to the same region, which improves the downstream architecture search performance (Zhang et al., 2019; White et al., 2020a). In this work, we introduce a Computation-Aware Transformer-based Encoding method called CATE. Different from existing computation-aware encodings based on fixed transformation (e.g. path encoding), CATE employs a pairwise pre-training scheme to learn computation-aware encodings using Transformers with cross-attention. Such learned encodings contain dense and contextualized computation information of neural architectures. We compare CATE with eleven encodings under three major encoding-dependent NAS subroutines in both small and large search spaces. Our experiments show that CATE is beneficial to the downstream search, especially in the large search space. Moreover, the outside search space experiment demonstrates its superior generalization ability beyond the search space on which it was trained. Our code is available at: this https URL.
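As an illustration of the pairwise cross-attention idea described above, the sketch below encodes two architectures' operation sequences and lets each attend over the other; the module names, dimensions, and operation vocabulary are hypothetical stand-ins, not CATE's actual implementation.

```python
# Illustrative sketch only: pairwise cross-attention between two architecture
# encodings, in the spirit of the approach described above.
import torch
import torch.nn as nn

class PairwiseCrossAttentionEncoder(nn.Module):
    def __init__(self, num_ops=10, d_model=64, n_heads=4):
        super().__init__()
        self.op_embedding = nn.Embedding(num_ops, d_model)
        self.self_attn = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, ops_a, ops_b):
        # Encode each architecture's operation sequence independently...
        h_a = self.self_attn(self.op_embedding(ops_a))
        h_b = self.self_attn(self.op_embedding(ops_b))
        # ...then let each architecture attend over its paired architecture,
        # yielding contextualized, computation-aware encodings.
        z_a, _ = self.cross_attn(h_a, h_b, h_b)
        z_b, _ = self.cross_attn(h_b, h_a, h_a)
        return z_a.mean(dim=1), z_b.mean(dim=1)  # pooled encodings

arch_a = torch.randint(0, 10, (2, 7))  # batch of 2 architectures, 7 operations each
arch_b = torch.randint(0, 10, (2, 7))
enc_a, enc_b = PairwiseCrossAttentionEncoder()(arch_a, arch_b)
print(enc_a.shape, enc_b.shape)  # torch.Size([2, 64]) torch.Size([2, 64])
```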

1 citation

Posted Content
TL;DR: This paper showed that monolingual models pretrained and finetuned on different languages achieve competitive performance compared to the ones that use the same target language, regardless of the pretraining language.
Abstract: Pretrained multilingual models have become a de facto default approach for zero-shot cross-lingual transfer. Previous work has shown that these models are able to achieve cross-lingual representations when pretrained on two or more languages with shared parameters. In this work, we provide evidence that a model can achieve language-agnostic representations even when pretrained on a single language. That is, we find that monolingual models pretrained and finetuned on different languages achieve competitive performance compared to the ones that use the same target language. Surprisingly, the models show similar performance on the same task regardless of the pretraining language. For example, models pretrained on distant languages such as German and Portuguese perform similarly on English tasks.

1 citation

Posted Content
TL;DR: In this paper, a claim matching task is defined to identify pairs of textual messages containing claims that can be served with one fact-check, and a high-quality "teacher" model is trained to address the imbalance in embedding quality between the low and high-resource languages in the dataset.
Abstract: Manual fact-checking does not scale well to serve the needs of the internet. This issue is further compounded in non-English contexts. In this paper, we discuss claim matching as a possible solution to scale fact-checking. We define claim matching as the task of identifying pairs of textual messages containing claims that can be served with one fact-check. We construct a novel dataset of WhatsApp tipline and public group messages alongside fact-checked claims that are first annotated for containing "claim-like statements" and then matched with potentially similar items and annotated for claim matching. Our dataset contains content in high-resource (English, Hindi) and lower-resource (Bengali, Malayalam, Tamil) languages. We train our own embedding model using knowledge distillation and a high-quality "teacher" model in order to address the imbalance in embedding quality between the low- and high-resource languages in our dataset. We provide evaluations on the performance of our solution and compare with baselines and existing state-of-the-art multilingual embedding models, namely LASER and LaBSE. We demonstrate that our performance exceeds LASER and LaBSE in all settings. We release our annotated datasets, codebooks, and trained embedding model to allow for further research.
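A schematic sketch of the knowledge-distillation step described above: a multilingual student embedding model is trained so that its sentence embeddings match those of a high-quality teacher. The toy encoders and random data below are placeholders, not the paper's actual models or data.

```python
# Schematic sketch of embedding-model knowledge distillation.
# The encoders and the random batches are placeholders for illustration only.
import torch
import torch.nn as nn

teacher = nn.EmbeddingBag(num_embeddings=30000, embedding_dim=256)  # stands in for a strong teacher encoder
student = nn.EmbeddingBag(num_embeddings=30000, embedding_dim=256)  # multilingual student being trained

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
mse = nn.MSELoss()

for step in range(100):
    # In practice each batch pairs a source sentence (for the teacher) with its
    # translation (for the student); here both are random token ids for brevity.
    src_tokens = torch.randint(0, 30000, (32, 20))
    tgt_tokens = torch.randint(0, 30000, (32, 20))

    with torch.no_grad():
        target = teacher(src_tokens)          # fixed teacher embedding
    loss = mse(student(tgt_tokens), target)   # pull the student embedding toward it

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```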

1 citation

Proceedings ArticleDOI
01 Aug 2021
TL;DR: This article used minimalistic LNA (LayerNorm and Attention) finetuning to achieve zero-shot cross-lingual and cross-modality transfer by finetuning only 10-50% of the pretrained parameters.
Abstract: We present a simple yet effective approach to build multilingual speech-to-text (ST) translation through efficient transfer learning from a pretrained speech encoder and text decoder. Our key finding is that a minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot cross-lingual and cross-modality transfer ability by finetuning only 10-50% of the pretrained parameters. This effectively leverages large pretrained models, such as wav2vec 2.0 for acoustic modeling and mBART for multilingual text generation, at low training cost. This sets a new state of the art for 36 translation directions (surpassing cascaded ST for 26 of them) on the large-scale multilingual ST benchmark CoVoST 2 (+6.4 BLEU on average for En-X directions and +6.7 BLEU for X-En directions). Our approach demonstrates strong zero-shot performance in a many-to-many multilingual model (+5.6 BLEU on average across 28 non-English directions), making it an appealing approach for attaining high-quality speech translation with improved parameter and data efficiency.
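The LNA recipe amounts to freezing the pretrained model and unfreezing only LayerNorm and attention parameters. A minimal sketch of that idea on a generic Transformer encoder follows; the paper applies it to wav2vec 2.0 and mBART, not to this toy model.

```python
# Sketch of the LNA idea: freeze a pretrained model, then re-enable only
# LayerNorm and attention parameters. Shown on a generic nn.TransformerEncoder.
import torch.nn as nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
)

# Freeze everything first.
for p in model.parameters():
    p.requires_grad = False

# Unfreeze only LayerNorm and (self-)attention parameters.
for module in model.modules():
    if isinstance(module, (nn.LayerNorm, nn.MultiheadAttention)):
        for p in module.parameters():
            p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"finetuning {trainable / total:.1%} of parameters")
```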

1 citation

Journal ArticleDOI
08 Jan 2021-PLOS ONE
TL;DR: This paper proposed a novel high-precision, high-recall neural Multistage BiCross encoder approach: a sequential three-stage ranking pipeline that uses the Okapi BM25 retrieval algorithm together with transformer-based bi-encoder and cross-encoder models to rank documents with respect to a given query.
Abstract: The Coronavirus (COVID-19) pandemic has led to a rapidly growing 'infodemic' of health information online. This has motivated the need for accurate semantic search and retrieval of reliable COVID-19 information across millions of documents, in multiple languages. To address this challenge, this paper proposes a novel high precision and high recall neural Multistage BiCross encoder approach. It is a sequential three-stage ranking pipeline which uses the Okapi BM25 retrieval algorithm and transformer-based bi-encoder and cross-encoder to effectively rank the documents with respect to the given query. We present experimental results from our participation in the Multilingual Information Access (MLIA) shared task on COVID-19 multilingual semantic search. The independently evaluated MLIA results validate our approach and demonstrate that it outperforms other state-of-the-art approaches according to nearly all evaluation metrics in cases of both monolingual and bilingual runs.
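A schematic sketch of a three-stage pipeline in this spirit, with BM25 retrieval followed by bi-encoder and cross-encoder re-ranking, is shown below; the libraries (rank_bm25, sentence-transformers) and checkpoint names are illustrative stand-ins, not the authors' exact configuration.

```python
# Schematic three-stage ranking pipeline: BM25 retrieval, bi-encoder re-ranking,
# then cross-encoder re-ranking. Libraries and checkpoints are illustrative only.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder, util

documents = [
    "COVID-19 spreads primarily through respiratory droplets.",
    "Vaccines reduce the severity of COVID-19 infections.",
    "Hand washing lowers the risk of many infections.",
]  # placeholder corpus; a real system indexes millions of documents
query = "how does the coronavirus spread"

# Stage 1: lexical retrieval with Okapi BM25.
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
bm25_scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(documents)), key=lambda i: bm25_scores[i], reverse=True)[:100]

# Stage 2: re-rank the BM25 candidates with a bi-encoder (dense similarity).
bi_encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
doc_emb = bi_encoder.encode([documents[i] for i in top_k], convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
dense_scores = util.cos_sim(query_emb, doc_emb)[0]
top_k = [top_k[int(i)] for i in dense_scores.argsort(descending=True)[:20]]

# Stage 3: final re-ranking with a cross-encoder over (query, document) pairs.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = cross_encoder.predict([(query, documents[i]) for i in top_k])
ranking = [top_k[int(i)] for i in ce_scores.argsort()[::-1]]
print([documents[i] for i in ranking])
```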

1 citation

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.