Open Access · Proceedings ArticleDOI

Creating Training Corpora for NLG Micro-Planners

TL;DR:
This paper proposes the corpus generation framework as a novel method for creating challenging data sets from which NLG models can be learned that are capable of handling the complex interactions occurring during micro-planning between lexicalisation, aggregation, surface realisation, referring expression generation and sentence segmentation.
Abstract
In this paper, we present a novel framework for semi-automatically creating linguistically challenging micro-planning data-to-text corpora from existing Knowledge Bases. Because our method pairs data of varying size and shape with texts ranging from simple clauses to short texts, a dataset created using this framework provides a challenging benchmark for micro-planning. Another feature of this framework is that it can be applied to any large-scale knowledge base and can therefore be used to train KB verbalisers. We apply our framework to DBpedia data and compare the resulting dataset with that of Wen et al. (2016). We show that while Wen et al.'s dataset is more than twice as large as ours, it is less diverse both in terms of input and in terms of text. We thus propose our corpus generation framework as a novel method for creating challenging data sets from which NLG models can be learned that are capable of handling the complex interactions occurring during micro-planning between lexicalisation, aggregation, surface realisation, referring expression generation and sentence segmentation. To encourage researchers to take up this challenge, we have made available a dataset of 21,855 data/text pairs created using this framework in the context of the WebNLG shared task.
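To make the shape of such a corpus concrete, the sketch below shows what one data/text pair might look like: a small set of DBpedia-style (subject, property, object) triples paired with a verbalisation. The specific triples, text, and the `verbalised_entities` helper are illustrative assumptions, not taken from the released dataset.

```python
# A hypothetical WebNLG-style data/text pair: DBpedia-like RDF triples
# (subject, property, object) paired with a human-written verbalisation.
# The triples, text, and helper below are illustrative only.
pair = {
    "triples": [
        ("Alan_Bean", "birthPlace", "Wheeler,_Texas"),
        ("Alan_Bean", "occupation", "Test_pilot"),
    ],
    "text": "Alan Bean, who was born in Wheeler, Texas, worked as a test pilot.",
}

def verbalised_entities(pair):
    """Naively check which triple subjects/objects surface in the text."""
    text = pair["text"].lower()
    mentioned = set()
    for subj, _prop, obj in pair["triples"]:
        for entity in (subj, obj):
            # Strip underscores and any trailing disambiguation after a comma.
            surface = entity.replace("_", " ").split(",")[0].lower()
            if surface in text:
                mentioned.add(entity)
    return mentioned
```

Note how a single short sentence aggregates two triples and uses a relative clause for one of them; this interaction between aggregation and sentence segmentation is exactly what the benchmark is meant to stress.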


Citations
Proceedings ArticleDOI

The WebNLG Challenge: Generating Text from RDF Data

TL;DR: The microplanning task is introduced, data preparation, evaluation methodology, participant results and a brief description of the participating systems are provided.
Journal ArticleDOI

Survey of Hallucination in Natural Language Generation

TL;DR: This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG by providing a broad overview of the research progress and challenges in the hallucination problem in NLG.
Proceedings ArticleDOI

Extracting Relational Facts by an End-to-End Neural Model with Copy Mechanism

TL;DR: This paper proposes an end-to-end model based on sequence-to-sequence learning with a copy mechanism, which can jointly extract relational facts from sentences of all three overlap classes: Normal, EntityPairOverlap and SingleEntityOverlap.
Proceedings ArticleDOI

GraphRel: Modeling Text as Relational Graphs for Joint Entity and Relation Extraction

TL;DR: GraphRel is an end-to-end relation extraction model which uses graph convolutional networks (GCNs) to jointly learn named entities and relations; it outperforms previous work by 3.2% and 5.8% and achieves a new state-of-the-art for relation extraction.
References
Proceedings ArticleDOI

Bleu: a Method for Automatic Evaluation of Machine Translation

TL;DR: This paper proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
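The core of BLEU is modified n-gram precision combined with a brevity penalty. A minimal pure-Python sketch of modified unigram precision (the clipping step only, not the full metric, which combines precisions up to 4-grams):

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Modified unigram precision: each candidate word is credited at most
    as many times as it occurs in any single reference (count clipping)."""
    cand_counts = Counter(candidate)
    clipped = 0
    for word, count in cand_counts.items():
        max_ref_count = max(Counter(ref)[word] for ref in references)
        clipped += min(count, max_ref_count)
    return clipped / len(candidate)

# Degenerate candidate from the BLEU paper: clipping caps credit for "the"
# at its maximum reference count (2), giving 2/7 rather than 7/7.
cand = "the the the the the the the".split()
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(modified_unigram_precision(cand, refs))  # → 2/7 ≈ 0.286
```

Clipping is what penalises candidates that over-generate common reference words, which plain precision would reward.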
Proceedings ArticleDOI

The Stanford CoreNLP Natural Language Processing Toolkit

TL;DR: The design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis, is described; its broad uptake is attributed to a simple, approachable design, straightforward interfaces, the inclusion of robust, good-quality analysis components, and the absence of heavyweight associated dependencies.
Proceedings Article

SRILM – An Extensible Language Modeling Toolkit

TL;DR: The functionality of the SRILM toolkit is summarized and its design and implementation are discussed, highlighting ease of rapid prototyping, reusability, and combinability of tools.
Proceedings Article

Grammar as a foreign language

TL;DR: The domain-agnostic attention-enhanced sequence-to-sequence model achieves state-of-the-art results on the most widely used syntactic constituency parsing dataset when trained on a large synthetic corpus that was annotated using existing parsers.
Proceedings ArticleDOI

Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems

TL;DR: A statistical language generator based on a semantically controlled Long Short-term Memory (LSTM) structure that can learn from unaligned data by jointly optimising sentence planning and surface realisation with a simple cross-entropy training criterion; language variation can be achieved simply by sampling from output candidates.