scispace - formally typeset
Open AccessProceedings ArticleDOI

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Reads0
Chats0
TLDR
BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.
Abstract
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and other recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 3.5 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also replicate other pretraining schemes within the BART framework, to understand their effect on end-task performance.

read more

Content maybe subject to copyright    Report

Citations
More filters
Posted Content

Investigating Pretrained Language Models for Graph-to-Text Generation

TL;DR: It is suggested that the PLMs benefit from similar facts seen during pretraining or fine-tuning, such that they perform well even when the input graph is reduced to a simple bag of node and edge labels.
Posted Content

fairseq S2T: Fast Speech-to-Text Modeling with fairseq

TL;DR: State-of-the-art RNN-based as well as Transformer-based models and open-source detailed training recipes are implemented and seamlessly integrated into S2T workflows for multi-task learning or transfer learning.
Proceedings ArticleDOI

Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization

TL;DR: This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations and then utilizing amulti-view decoder to incorporate differentViews to generate dialogue summaries.
Posted Content

A Survey of Knowledge-Enhanced Text Generation.

TL;DR: A comprehensive review of the research on knowledge-enhanced text generation over the past five years is presented, which includes two parts: (i) general methods and architectures for integrating knowledge into text generation; (ii) specific techniques and applications according to different forms of knowledge data.
Posted Content

Distilling Knowledge from Reader to Retriever for Question Answering

TL;DR: This paper proposes a technique to learn retriever models for downstream tasks, inspired by knowledge distillation, and which does not require annotated pairs of query and documents.
References
More filters
Proceedings Article

Attention is All you Need

TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Posted Content

Efficient Estimation of Word Representations in Vector Space

TL;DR: This paper proposed two novel model architectures for computing continuous vector representations of words from very large data sets, and the quality of these representations is measured in a word similarity task and the results are compared to the previously best performing techniques based on different types of neural networks.
Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

TL;DR: It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Proceedings ArticleDOI

Deep contextualized word representations

TL;DR: This paper introduced a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts (i.e., to model polysemy).
Related Papers (5)
Trending Questions (1)
What is the difference between BART and other denoising sequence-to-sequence pre-training methods?

BART uses a bidirectional encoder and left-to-right decoder, while other methods may use different encoder-decoder architectures.