Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduces a unified framework that converts all text-based language problems into a text-to-text format and systematically compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors across dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
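
The text-to-text idea is easiest to see in code. The sketch below uses the Hugging Face transformers port of T5 purely for illustration (an assumption on our part; the paper's own release is the text-to-text-transfer-transformer codebase, and the task prefixes shown follow the conventions described in the paper): every task is expressed as input text with a task prefix, and the model always answers with text.

```python
# Minimal sketch of the text-to-text formulation, assuming the Hugging Face
# `transformers` port of T5 (not the paper's own codebase).
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task -- translation, summarization, classification -- becomes
# "text in, text out", distinguished only by a task prefix.
prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: Transfer learning, where a model is first pre-trained on a "
    "data-rich task before being fine-tuned on a downstream task, has emerged "
    "as a powerful technique in natural language processing.",
    "cola sentence: The book fell of the shelf.",  # acceptability judgment, answered as text
]

for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```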


Citations
Journal ArticleDOI
TL;DR: This paper presents a new approach for predicting the thermodynamic properties of perovskites that harnesses deep learning and crystal-structure fingerprinting based on Hirshfeld surface analysis.
Abstract: This paper presents a new approach for predicting thermodynamic properties of perovskites that harnesses deep learning and crystal structure fingerprinting based on Hirshfeld surface analysis. It is demonstrated that convolutional neural network methods capture critical features embedded in two-dimensional Hirshfeld surface fingerprints that enable a quantitative assessment of the formation energy of perovskites. Building on our recent work on lattice parameter prediction from Hirshfeld surface calculations, we show how transfer learning can be used to speed up the training of the neural network, allowing multiple properties to be trained using the same feature extraction layers. We also predict formation energies for various perovskite polymorphs, and our predictions are found to give generally improved performance over a well-established graph network method, but with the methods better suited to different types of datasets. Analysis of the structure types within the dataset reveals the Hirshfeld surface-based method to excel for the less symmetric and similar structures, while the graph network performs better for very symmetric and similar structures.
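
As a rough illustration of the transfer-learning setup described above (a hypothetical sketch, not the authors' code; the fingerprint resolution, layer sizes, and property names are assumptions), a shared convolutional feature extractor can be trained on one property and then frozen while a new regression head is fitted for the next property.

```python
# Hypothetical sketch: shared CNN feature extractor over 2D Hirshfeld-surface
# fingerprints, reused across multiple property-prediction heads.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_feature_extractor(input_shape=(64, 64, 1)):  # assumed fingerprint size
    # Convolutional layers that learn features from the fingerprint image.
    return models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
    ], name="fingerprint_features")

features = build_feature_extractor()

# Property A (e.g. lattice parameter): train extractor and head together.
model_a = models.Sequential([features, layers.Dense(1, name="lattice_parameter")])
model_a.compile(optimizer="adam", loss="mse")
# model_a.fit(fingerprints, lattice_parameters, epochs=...)

# Property B (e.g. formation energy): freeze the shared extractor and train
# only a new regression head, which speeds up training considerably.
features.trainable = False
model_b = models.Sequential([features, layers.Dense(1, name="formation_energy")])
model_b.compile(optimizer="adam", loss="mse")
# model_b.fit(fingerprints, formation_energies, epochs=...)
```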

4 citations

Posted Content
TL;DR: This paper casts a suite of information extraction tasks into a text-to-triple translation framework, enabling task-agnostic translation by leveraging the latent knowledge that a pre-trained language model has about each task.
Abstract: We cast a suite of information extraction tasks into a text-to-triple translation framework. Instead of solving each task relying on task-specific datasets and models, we formalize the task as a translation between task-specific input text and output triples. By taking the task-specific input, we enable a task-agnostic translation by leveraging the latent knowledge that a pre-trained language model has about the task. We further demonstrate that a simple pre-training task of predicting which relational information corresponds to which input text is an effective way to produce task-specific outputs. This enables the zero-shot transfer of our framework to downstream tasks. We study the zero-shot performance of this framework on open information extraction (OIE2016, NYT, WEB, PENN), relation classification (FewRel and TACRED), and factual probe (Google-RE and T-REx). The model transfers non-trivially to most tasks and is often competitive with a fully supervised method without the need for any task-specific training. For instance, we significantly outperform the F1 score of the supervised open information extraction without needing to use its training set.
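
To make the text-to-triple interface concrete, here is a hypothetical sketch using a generic pre-trained sequence-to-sequence checkpoint. The prompt format and model choice are assumptions, and a vanilla checkpoint will not reliably emit well-formed triples without the relational pre-training the abstract describes; the sketch only shows the shared input/output shape of the framework.

```python
# Hypothetical illustration of information extraction as text-to-triple
# translation; not the authors' system.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def text_to_triples(sentence: str) -> str:
    # Task-specific input text in, relational triples out -- the same
    # interface whether the downstream task is open IE, relation
    # classification, or factual probing.
    prompt = f"extract triples: {sentence}"  # assumed prompt format
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(text_to_triples("Barack Obama was born in Honolulu, Hawaii."))
# Intended output shape: (Barack Obama; born in; Honolulu, Hawaii)
```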

4 citations

Posted Content
Felix Stahlberg, Shankar Kumar
TL;DR: The authors proposed Seq2Edits, an open-vocabulary approach to sequence editing for natural language processing (NLP) tasks with a high degree of overlap between input and output texts.
Abstract: We propose Seq2Edits, an open-vocabulary approach to sequence editing for natural language processing (NLP) tasks with a high degree of overlap between input and output texts. In this approach, each sequence-to-sequence transduction is represented as a sequence of edit operations, where each operation either replaces an entire source span with target tokens or keeps it unchanged. We evaluate our method on five NLP tasks (text normalization, sentence fusion, sentence splitting & rephrasing, text simplification, and grammatical error correction) and report competitive results across the board. For grammatical error correction, our method speeds up inference by up to 5.2x compared to full sequence models because inference time depends on the number of edits rather than the number of target tokens. For text normalization, sentence fusion, and grammatical error correction, our approach improves explainability by associating each edit operation with a human-readable tag.
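
A hypothetical sketch of the edit-sequence representation (not the authors' implementation): each operation either keeps a source span unchanged or replaces it with target tokens, so reconstructing the target takes only as many steps as there are edits rather than one step per target token.

```python
# Apply a sequence of (tag, span_end, replacement) edit operations to a
# tokenized source sentence. replacement is None for "keep this span".
from typing import List, Optional, Tuple

EditOp = Tuple[str, int, Optional[str]]

def apply_edits(source_tokens: List[str], edits: List[EditOp]) -> List[str]:
    output, cursor = [], 0
    for tag, span_end, replacement in edits:
        if replacement is None:        # keep the source span as-is
            output.extend(source_tokens[cursor:span_end])
        elif replacement:              # replace the span with new tokens
            output.extend(replacement.split())
        # an empty replacement string deletes the span
        cursor = span_end
    return output

source = "She go to school yesterday .".split()
edits = [
    ("SELF", 1, None),     # keep "She"
    ("VERB", 2, "went"),   # replace "go" -> "went" (human-readable tag)
    ("SELF", 6, None),     # keep the rest of the sentence
]
print(" ".join(apply_edits(source, edits)))  # She went to school yesterday .
```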

4 citations

Proceedings Article
03 May 2021
TL;DR: The authors propose GOLD, an easy-to-optimize algorithm that learns from off-policy demonstrations via importance weighting, upweighting confident tokens and downweighting unconfident ones during training.
Abstract: Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation. This paradigm leads to (i) diverse but low-quality samples due to mismatched learning objective and evaluation metric (likelihood vs. quality) and (ii) exposure bias due to mismatched history distributions (gold vs. model-generated). To alleviate these problems, we frame text generation as a reinforcement learning (RL) problem with expert demonstrations (i.e., the training data), where the goal is to maximize quality given model-generated histories. Prior RL approaches to generation often face optimization issues due to the large action space and sparse reward. We propose GOLD (generation by off-policy learning from demonstrations): an easy-to-optimize algorithm that learns from the off-policy demonstrations by importance weighting. Intuitively, GOLD upweights confident tokens and downweights unconfident ones during training. According to both automatic and human evaluation, models trained by GOLD outperform those trained by MLE and policy gradient on summarization, question generation, and machine translation. Further, they are less sensitive to decoding algorithms and alleviate exposure bias.
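
The core weighting idea can be sketched as a per-token weighted likelihood (a minimal sketch of the intuition only, not the paper's full off-policy derivation or reward estimates): each demonstration token's log-probability is scaled by the model's own detached probability of that token, so confident tokens dominate the gradient.

```python
# Minimal sketch of a GOLD-style weighted MLE loss in PyTorch (illustrative).
import torch
import torch.nn.functional as F

def gold_style_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); targets: (batch, seq_len) token ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    tgt_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Importance weight ~ the model's own probability of the demonstration
    # token, detached so it acts as a stop-gradient per-token weight.
    weights = tgt_log_probs.detach().exp()
    return -(weights * tgt_log_probs).mean()

# Plain MLE, by contrast, weights every demonstration token equally:
# mle_loss = -tgt_log_probs.mean()
```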

4 citations

Posted Content
TL;DR: Whale is the first framework to support various hybrid distributed strategies in one system; it is compatible with TensorFlow and can distribute training tasks by adding a few lines of code, without changing user model code.
Abstract: Data parallelism (DP) has been a common practice to speed up training workloads for a long time. However, with the increase of data size and model size, DP has become less optimal for most distributed training workloads. Moreover, it does not work on models whose parameter size cannot fit into a single GPU's device memory. To enable and further improve industrial-scale giant model training, we present Whale, a unified distributed training framework. It provides comprehensive parallel strategies including data parallelism, model parallelism, operator sharding, pipeline, hybrid strategy, and automatic parallel strategy. To express complex training strategies effectively and efficiently in one framework, Whale IR is designed as the basic unit to explore and implement different distributed strategies. Moreover, Whale enables automatic parallelism by using a meta-driven cost model. Whale is compatible with TensorFlow and can easily distribute training tasks by adding a few code lines without changing user model code. To the best of our knowledge, Whale is the first work that can support various hybrid distributed strategies within one framework. In our experiments with the BERT-Large model, Whale's pipeline strategy is 2.32 times faster than Horovod data parallelism (HDP) on 64 GPUs. In a large-scale image classification task (100,000 classes), Whale's hybrid strategy, which consists of operator sharding and DP, is 14.8 times faster than HDP on 64 GPUs.
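
For context, the sketch below shows the plain synchronous data-parallel baseline in stock TensorFlow that the abstract argues is no longer sufficient; it is not Whale's API, which the abstract says layers operator sharding, pipelining, and hybrid strategies on top of a few added code lines.

```python
# Plain TensorFlow data parallelism: every replica holds a full copy of the
# model, so this pattern breaks once the parameters no longer fit in a single
# GPU's memory -- the limitation hybrid strategies are designed to address.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # synchronous data parallelism

with strategy.scope():
    # The entire model is replicated on every GPU.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1024, activation="relu", input_shape=(512,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# model.fit(dataset, epochs=...)  # gradients are all-reduced across replicas
```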

4 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.