Challenges in Data-to-Document Generation

doi:10.18653/V1/D17-1239

Home
/
Papers
/
Challenges in Data-to-Document Generation

Proceedings Article•DOI•

Challenges in Data-to-Document Generation

Sam Wiseman¹, Stuart M. Shieber¹, Alexander M. Rush²•Institutions (2)

Harvard University¹, Facebook²

01 Jul 2017-pp 2253-2263

TL;DR: A new, large-scale corpus of data records paired with descriptive documents is introduced, a series of extractive evaluation methods for analyzing performance are proposed, and baseline results are obtained using current neural generation methods.

read less

Abstract: Recent neural models have shown significant progress on the problem of generating short descriptive texts conditioned on a small number of database records. In this work, we suggest a slightly more difficult data-to-text generation task, and investigate how effective current approaches are on this task. In particular, we introduce a new, large-scale corpus of data records paired with descriptive documents, propose a series of extractive evaluation methods for analyzing performance, and obtain baseline results using current neural generation methods. Experiments show that these models produce fluent text, but fail to convincingly approximate human-generated documents. Moreover, even templated baselines exceed the performance of these neural models on some metrics, though copy- and reconstruction-based extensions lead to noticeable improvements.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Posted Content•

The Curious Case of Neural Text Degeneration

[...]

Ari Holtzman¹, Jan Buys¹, Leo Du¹, Maxwell Forbes¹, Yejin Choi² - Show less +1 more•Institutions (2)

University of Washington¹, University of Cape Town²

22 Apr 2019-arXiv: Computation and Language

TL;DR: This paper showed that decoding strategies alone alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model, and they proposed Nucleus Sampling, a simple but effective method to draw the best out of neural generation.

...read moreread less

Abstract: Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive. In this paper, we reveal surprising distributional differences between human text and machine text. In addition, we find that decoding strategies alone can dramatically effect the quality of machine text, even when generated from exactly the same neural language model. Our findings motivate Nucleus Sampling, a simple but effective method to draw the best out of neural generation. By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.

...read moreread less

1,256 citations

Posted Content•

Big Bird: Transformers for Longer Sequences

[...]

Manzil Zaheer, Guru Guruganesh, Avinava Dubey¹, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed - Show less +7 more•Institutions (1)

Google¹

28 Jul 2020-arXiv: Learning

TL;DR: It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.

...read moreread less

Abstract: Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having $O(1)$ global tokens (such as CLS), that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.

...read moreread less

939 citations

Proceedings Article•

The Curious Case of Neural Text Degeneration

[...]

Ari Holtzman¹, Jan Buys², Leo Du², Maxwell Forbes², Yejin Choi¹ - Show less +1 more•Institutions (2)

Allen Institute for Artificial Intelligence¹, University of Washington²

30 Apr 2020

TL;DR: By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.

...read moreread less

Abstract: Despite considerable advances in neural language modeling, it remains an open question what the best decoding strategy is for text generation from a language model (e.g. to generate a story). The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, maximization-based decoding methods such as beam search lead to degeneration — output text that is bland, incoherent, or gets stuck in repetitive loops. To address this we propose Nucleus Sampling, a simple but effective method to draw considerably higher quality text out of neural language models. Our approach avoids text degeneration by truncating the unreliable tail of the probability distribution, sampling from the dynamic nucleus of tokens containing the vast majority of the probability mass. To properly examine current maximization-based and stochastic decoding methods, we compare generations from each of these methods to the distribution of human text along several axes such as likelihood, diversity, and repetition. Our results show that (1) maximization is an inappropriate decoding objective for open-ended text generation, (2) the probability distributions of the best current language models have an unreliable tail which needs to be truncated during generation and (3) Nucleus Sampling is the best decoding strategy for generating long-form text that is both high-quality — as measured by human evaluation — and as diverse as human-written text.

...read moreread less

682 citations

Cites background or methods from "Challenges in Data-to-Document Gene..."

...(2) Computing the optimum argmax sequence from recurrent neural language models is not tractable, so consider two prominent decoding methods for approximating the argmax: Beam search is the most commonly used approximation in practice (Li et al., 2016c; Shen et al., 2017; Wiseman et al., 2017)....
[...]
...Computing the optimum argmax sequence from recurrent neural language models is not tractable, so consider two prominent decoding methods for approximating the argmax: Beam search is the most commonly used approximation in practice (Li et al., 2016c; Shen et al., 2017; Wiseman et al., 2017)....
[...]

Proceedings Article•DOI•

Hierarchical Neural Story Generation

[...]

Angela Fan¹, Michael Lewis², Yann N. Dauphin³•Institutions (3)

Harvard University¹, University of Pittsburgh², Facebook³

13 May 2018

TL;DR: This paper proposed a hierarchical story generation model, where the model first generates a premise, and then transforms it into a passage of text, and further improves the relevance of the story to the prompt by adding a gated multi-scale self-attention mechanism.

...read moreread less

Abstract: We explore story generation: creative systems that can build coherent and fluent passages of text about a topic. We collect a large dataset of 300K human-written stories paired with writing prompts from an online forum. Our dataset enables hierarchical story generation, where the model first generates a premise, and then transforms it into a passage of text. We gain further improvements with a novel form of model fusion that improves the relevance of the story to the prompt, and adding a new gated multi-scale self-attention mechanism to model long-range context. Experiments show large improvements over strong baselines on both automated and human evaluations. Human judges prefer stories generated by our approach to those from a strong non-hierarchical model by a factor of two to one.

...read moreread less

571 citations

Proceedings Article•DOI•

On Faithfulness and Factuality in Abstractive Summarization

[...]

Joshua Maynez¹, Shashi Narayan¹, Bernd Bohnet¹, Ryan McDonald¹•Institutions (1)

Google¹

02 May 2020

TL;DR: It is found that neural abstractive summarization models are highly prone to hallucinate content that is unfaithful to the input document and textual entailment measures better correlate with faithfulness than standard metrics, potentially leading the way to automatic evaluation metrics as well as training and decoding criteria.

...read moreread less

Abstract: It is well known that the standard likelihood training and approximate decoding objectives in neural text generation models lead to less human-like responses for open-ended tasks such as language modeling and story generation. In this paper we have analyzed limitations of these models for abstractive document summarization and found that these models are highly prone to hallucinate content that is unfaithful to the input document. We conducted a large scale human evaluation of several neural abstractive summarization systems to better understand the types of hallucinations they produce. Our human annotators found substantial amounts of hallucinated content in all model generated summaries. However, our analysis does show that pretrained models are better summarizers not only in terms of raw metrics, i.e., ROUGE, but also in generating faithful and factual summaries as evaluated by humans. Furthermore, we show that textual entailment measures better correlate with faithfulness than standard metrics, potentially leading the way to automatic evaluation metrics as well as training and decoding criteria.

...read moreread less

513 citations

Cites background from "Challenges in Data-to-Document Gene..."

...In contrast, tasks such as document summarization (Nenkova and McKeown, 2011; See et al., 2017; Paulus et al., 2018) and data-to-text generation (Lebret et al., 2016; Wiseman et al., 2017) which are not open-ended, require models to be factual and/or faithful to the source text....
[...]
...Such divergence issue between source and target is not uncommon in conditional text generation (Kryscinski et al., 2019a; Wiseman et al., 2017; Dhingra et al., 2019)....
[...]
..., 2018) and data-to-text generation (Lebret et al., 2016; Wiseman et al., 2017) which are not open-ended, require models to be factual and/or faithful to the source text....
[...]
...Dhingra et al. (2019) proposed a new automatic metric, PARENT, for data-to-text generation (Lebret et al., 2016; Wiseman et al., 2017) which aligns n-grams from the reference and generated texts to the source table to measure the accuracy of n-grams that are entailed from the source table....
[...]
...(2019) proposed a new automatic metric, PARENT, for data-to-text generation (Lebret et al., 2016; Wiseman et al., 2017) which aligns n-grams from the reference and generated texts to the source table to measure the accuracy of n-grams that are entailed from the source table....
[...]

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

Long short-term memory

[...]

Sepp Hochreiter¹, Jürgen Schmidhuber²•Institutions (2)

Technische Universität München¹, Dalle Molle Institute for Artificial Intelligence Research²

01 Nov 1997-Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

...read moreread less

Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O. 1. Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

...read moreread less

72,897 citations

Proceedings Article•DOI•

Bleu: a Method for Automatic Evaluation of Machine Translation

[...]

Kishore Papineni¹, Salim Roukos¹, Todd Ward¹, Wei-Jing Zhu¹•Institutions (1)

IBM¹

06 Jul 2002

TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.

...read moreread less

Abstract: Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.

...read moreread less

21,126 citations

Proceedings Article•

Neural Machine Translation by Jointly Learning to Align and Translate

[...]

Dzmitry Bahdanau¹, Kyunghyun Cho², Yoshua Bengio²•Institutions (2)

Jacobs University Bremen¹, Université de Montréal²

01 Jan 2015

TL;DR: It is conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

...read moreread less

Abstract: Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

...read moreread less

20,027 citations

"Challenges in Data-to-Document Gene..." refers methods in this paper

...As a base model we utilize the now standard attentionbased encoder-decoder model (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015)....
[...]

Posted Content•

Neural Machine Translation by Jointly Learning to Align and Translate

[...]

Dzmitry Bahdanau¹, Kyunghyun Cho², Yoshua Bengio²•Institutions (2)

Jacobs University Bremen¹, Université de Montréal²

01 Sep 2014-arXiv: Computation and Language

TL;DR: In this paper, the authors propose to use a soft-searching model to find the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

...read moreread less

14,077 citations

Proceedings Article•

Sequence to Sequence Learning with Neural Networks

[...]

Ilya Sutskever¹, Oriol Vinyals¹, Quoc V. Le¹•Institutions (1)

Google¹

08 Dec 2014

TL;DR: The authors used a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.

...read moreread less

Abstract: Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.

...read moreread less

12,299 citations