Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduces a unified framework that converts all text-based language problems into a text-to-text format and systematically compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors across dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
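
The text-to-text idea is easiest to see in code. The sketch below uses the Hugging Face transformers port of T5 purely for illustration (an assumption on our part; the paper's own release is the text-to-text-transfer-transformer codebase, and the task prefixes shown follow the conventions described in the paper): every task is expressed as input text with a task prefix, and the model always answers with text.

```python
# Minimal sketch of the text-to-text formulation, assuming the Hugging Face
# `transformers` port of T5 (not the paper's own codebase).
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task -- translation, summarization, classification -- becomes
# "text in, text out", distinguished only by a task prefix.
prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: Transfer learning, where a model is first pre-trained on a "
    "data-rich task before being fine-tuned on a downstream task, has emerged "
    "as a powerful technique in natural language processing.",
    "cola sentence: The book fell of the shelf.",  # acceptability judgment, answered as text
]

for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```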


Citations
Journal ArticleDOI
TL;DR: This paper presents a new approach for predicting the thermodynamic properties of perovskites that harnesses deep learning and crystal-structure fingerprinting based on Hirshfeld surface analysis.
Abstract: This paper presents a new approach for predicting thermodynamic properties of perovskites that harnesses deep learning and crystal structure fingerprinting based on Hirshfeld surface analysis. It is demonstrated that convolutional neural network methods capture critical features embedded in two-dimensional Hirshfeld surface fingerprints that enable a quantitative assessment of the formation energy of perovskites. Building on our recent work on lattice parameter prediction from Hirshfeld surface calculations, we show how transfer learning can be used to speed up the training of the neural network, allowing multiple properties to be trained using the same feature extraction layers. We also predict formation energies for various perovskite polymorphs, and our predictions are found to give generally improved performance over a well-established graph network method, but with the methods better suited to different types of datasets. Analysis of the structure types within the dataset reveals the Hirshfeld surface-based method to excel for the less symmetric and similar structures, while the graph network performs better for very symmetric and similar structures.
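
As a rough illustration of the transfer-learning setup described above (a hypothetical sketch, not the authors' code; the fingerprint resolution, layer sizes, and property names are assumptions), a shared convolutional feature extractor can be trained on one property and then frozen while a new regression head is fitted for the next property.

```python
# Hypothetical sketch: shared CNN feature extractor over 2D Hirshfeld-surface
# fingerprints, reused across multiple property-prediction heads.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_feature_extractor(input_shape=(64, 64, 1)):  # assumed fingerprint size
    # Convolutional layers that learn features from the fingerprint image.
    return models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
    ], name="fingerprint_features")

features = build_feature_extractor()

# Property A (e.g. lattice parameter): train extractor and head together.
model_a = models.Sequential([features, layers.Dense(1, name="lattice_parameter")])
model_a.compile(optimizer="adam", loss="mse")
# model_a.fit(fingerprints, lattice_parameters, epochs=...)

# Property B (e.g. formation energy): freeze the shared extractor and train
# only a new regression head, which speeds up training considerably.
features.trainable = False
model_b = models.Sequential([features, layers.Dense(1, name="formation_energy")])
model_b.compile(optimizer="adam", loss="mse")
# model_b.fit(fingerprints, formation_energies, epochs=...)
```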

4 citations

Posted Content
TL;DR: This paper casts a suite of information extraction tasks into a text-to-triple translation framework, enabling task-agnostic translation by leveraging the latent knowledge that a pre-trained language model has about each task.
Abstract: We cast a suite of information extraction tasks into a text-to-triple translation framework. Instead of solving each task relying on task-specific datasets and models, we formalize the task as a translation between task-specific input text and output triples. By taking the task-specific input, we enable a task-agnostic translation by leveraging the latent knowledge that a pre-trained language model has about the task. We further demonstrate that a simple pre-training task of predicting which relational information corresponds to which input text is an effective way to produce task-specific outputs. This enables the zero-shot transfer of our framework to downstream tasks. We study the zero-shot performance of this framework on open information extraction (OIE2016, NYT, WEB, PENN), relation classification (FewRel and TACRED), and factual probe (Google-RE and T-REx). The model transfers non-trivially to most tasks and is often competitive with a fully supervised method without the need for any task-specific training. For instance, we significantly outperform the F1 score of the supervised open information extraction without needing to use its training set.
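
To make the text-to-triple interface concrete, here is a hypothetical sketch using a generic pre-trained sequence-to-sequence checkpoint. The prompt format and model choice are assumptions, and a vanilla checkpoint will not reliably emit well-formed triples without the relational pre-training the abstract describes; the sketch only shows the shared input/output shape of the framework.

```python
# Hypothetical illustration of information extraction as text-to-triple
# translation; not the authors' system.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def text_to_triples(sentence: str) -> str:
    # Task-specific input text in, relational triples out -- the same
    # interface whether the downstream task is open IE, relation
    # classification, or factual probing.
    prompt = f"extract triples: {sentence}"  # assumed prompt format
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(text_to_triples("Barack Obama was born in Honolulu, Hawaii."))
# Intended output shape: (Barack Obama; born in; Honolulu, Hawaii)
```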

4 citations

Posted Content
Felix Stahlberg, Shankar Kumar
TL;DR: The authors proposed Seq2Edits, an open-vocabulary approach to sequence editing for natural language processing (NLP) tasks with a high degree of overlap between input and output texts.
Abstract: We propose Seq2Edits, an open-vocabulary approach to sequence editing for natural language processing (NLP) tasks with a high degree of overlap between input and output texts. In this approach, each sequence-to-sequence transduction is represented as a sequence of edit operations, where each operation either replaces an entire source span with target tokens or keeps it unchanged. We evaluate our method on five NLP tasks (text normalization, sentence fusion, sentence splitting & rephrasing, text simplification, and grammatical error correction) and report competitive results across the board. For grammatical error correction, our method speeds up inference by up to 5.2x compared to full sequence models because inference time depends on the number of edits rather than the number of target tokens. For text normalization, sentence fusion, and grammatical error correction, our approach improves explainability by associating each edit operation with a human-readable tag.
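
A hypothetical sketch of the edit-sequence representation (not the authors' implementation): each operation either keeps a source span unchanged or replaces it with target tokens, so reconstructing the target takes only as many steps as there are edits rather than one step per target token.

```python
# Apply a sequence of (tag, span_end, replacement) edit operations to a
# tokenized source sentence. replacement is None for "keep this span".
from typing import List, Optional, Tuple

EditOp = Tuple[str, int, Optional[str]]

def apply_edits(source_tokens: List[str], edits: List[EditOp]) -> List[str]:
    output, cursor = [], 0
    for tag, span_end, replacement in edits:
        if replacement is None:        # keep the source span as-is
            output.extend(source_tokens[cursor:span_end])
        elif replacement:              # replace the span with new tokens
            output.extend(replacement.split())
        # an empty replacement string deletes the span
        cursor = span_end
    return output

source = "She go to school yesterday .".split()
edits = [
    ("SELF", 1, None),     # keep "She"
    ("VERB", 2, "went"),   # replace "go" -> "went" (human-readable tag)
    ("SELF", 6, None),     # keep the rest of the sentence
]
print(" ".join(apply_edits(source, edits)))  # She went to school yesterday .
```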

4 citations

Proceedings Article
03 May 2021
TL;DR: The authors propose GOLD, an easy-to-optimize algorithm that learns from off-policy demonstrations via importance weighting, upweighting confident tokens and downweighting unconfident ones during training.
Abstract: Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation. This paradigm leads to (i) diverse but low-quality samples due to mismatched learning objective and evaluation metric (likelihood vs. quality) and (ii) exposure bias due to mismatched history distributions (gold vs. model-generated). To alleviate these problems, we frame text generation as a reinforcement learning (RL) problem with expert demonstrations (i.e., the training data), where the goal is to maximize quality given model-generated histories. Prior RL approaches to generation often face optimization issues due to the large action space and sparse reward. We propose GOLD (generation by off-policy learning from demonstrations): an easy-to-optimize algorithm that learns from the off-policy demonstrations by importance weighting. Intuitively, GOLD upweights confident tokens and downweights unconfident ones during training. According to both automatic and human evaluation, models trained by GOLD outperform those trained by MLE and policy gradient on summarization, question generation, and machine translation. Further, they are less sensitive to decoding algorithms and alleviate exposure bias.
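
The core weighting idea can be sketched as a per-token weighted likelihood (a minimal sketch of the intuition only, not the paper's full off-policy derivation or reward estimates): each demonstration token's log-probability is scaled by the model's own detached probability of that token, so confident tokens dominate the gradient.

```python
# Minimal sketch of a GOLD-style weighted MLE loss in PyTorch (illustrative).
import torch
import torch.nn.functional as F

def gold_style_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); targets: (batch, seq_len) token ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    tgt_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Importance weight ~ the model's own probability of the demonstration
    # token, detached so it acts as a stop-gradient per-token weight.
    weights = tgt_log_probs.detach().exp()
    return -(weights * tgt_log_probs).mean()

# Plain MLE, by contrast, weights every demonstration token equally:
# mle_loss = -tgt_log_probs.mean()
```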

4 citations

Posted Content
TL;DR: Whale is the first framework to support various hybrid distributed strategies in one system; it is compatible with TensorFlow and can distribute training tasks by adding a few lines of code, without changing user model code.
Abstract: Data parallelism (DP) has been a common practice to speed up training workloads for a long time. However, with the increase of data size and model size, DP has become less optimal for most distributed training workloads. Moreover, it does not work on models whose parameter size cannot fit into a single GPU's device memory. To enable and further improve industrial-scale giant model training, we present Whale, a unified distributed training framework. It provides comprehensive parallel strategies including data parallelism, model parallelism, operator sharding, pipeline, hybrid strategy, and automatic parallel strategy. To express complex training strategies effectively and efficiently in one framework, Whale IR is designed as the basic unit to explore and implement different distributed strategies. Moreover, Whale enables automatic parallelism by using a meta-driven cost model. Whale is compatible with TensorFlow and can easily distribute training tasks by adding a few code lines without changing user model code. To the best of our knowledge, Whale is the first work that can support various hybrid distributed strategies within one framework. In our experiments with the BERT-Large model, Whale's pipeline strategy is 2.32 times faster than Horovod data parallelism (HDP) on 64 GPUs. In a large-scale image classification task (100,000 classes), Whale's hybrid strategy, which consists of operator sharding and DP, is 14.8 times faster than HDP on 64 GPUs.
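
For context, the sketch below shows the plain synchronous data-parallel baseline in stock TensorFlow that the abstract argues is no longer sufficient; it is not Whale's API, which the abstract says layers operator sharding, pipelining, and hybrid strategies on top of a few added code lines.

```python
# Plain TensorFlow data parallelism: every replica holds a full copy of the
# model, so this pattern breaks once the parameters no longer fit in a single
# GPU's memory -- the limitation hybrid strategies are designed to address.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # synchronous data parallelism

with strategy.scope():
    # The entire model is replicated on every GPU.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1024, activation="relu", input_shape=(512,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# model.fit(dataset, epochs=...)  # gradients are all-reduced across replicas
```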

4 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.