SciSpace (formerly Typeset)
Author

Emiel van Miltenburg

Other affiliations: VU University Amsterdam
Bio: Emiel van Miltenburg is an academic researcher from Tilburg University. The author has contributed to research in the topics of Natural language generation and Benchmark (computing). The author has an h-index of 11 and has co-authored 30 publications receiving 445 citations. Previous affiliations of Emiel van Miltenburg include VU University Amsterdam.

Papers
Proceedings ArticleDOI
01 Oct 2019
TL;DR: This paper provides an overview of how human evaluation of Natural Language Generation systems is currently conducted and presents a set of best practices grounded in the literature.
Abstract: Currently, there is little agreement as to how Natural Language Generation (NLG) systems should be evaluated. While there is some agreement regarding automatic metrics, there is a high degree of variation in the way that human evaluation is carried out. This paper provides an overview of how human evaluation is currently conducted, and presents a set of best practices, grounded in the literature. With this paper, we hope to contribute to the quality and consistency of human evaluations in NLG.

152 citations

Proceedings Article
01 Dec 2020
TL;DR: Due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and the field is in urgent need of standard methods and terminology.
Abstract: Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii) quantitative analyses of the annotations, and (iv) a set of recommendations for improving standards in evaluation reporting. We use the annotations as a basis for examining information included in evaluation reports, and levels of consistency in approaches, experimental design and terminology, focusing in particular on the 200+ different terms that have been used for evaluated aspects of quality. We conclude that due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and that the field is in urgent need of standard methods and terminology.

95 citations

Journal ArticleDOI
TL;DR: An overview of how (mostly intrinsic) human evaluation is currently conducted is provided, and a set of best practices, grounded in the literature, is presented and linked to the stages that researchers go through when conducting an evaluation.

70 citations

Posted Content
TL;DR: Automatic and human evaluations together with a qualitative analysis suggest that having explicit intermediate steps in the generation process results in better texts than the ones generated by end-to-end approaches.
Abstract: Traditionally, most data-to-text applications have been designed using a modular pipeline architecture, in which non-linguistic input data is converted into natural language through several intermediate transformations. In contrast, recent neural models for data-to-text generation have been proposed as end-to-end approaches, where the non-linguistic input is rendered in natural language with far fewer explicit intermediate representations in between. This study introduces a systematic comparison between neural pipeline and end-to-end data-to-text approaches for the generation of text from RDF triples. Both architectures were implemented using state-of-the-art deep learning methods, namely encoder-decoder Gated Recurrent Units (GRU) and the Transformer. Automatic and human evaluations, together with a qualitative analysis, suggest that having explicit intermediate steps in the generation process results in better texts than those generated by end-to-end approaches. Moreover, the pipeline models generalize better to unseen inputs. Data and code are publicly available.
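To make the contrast concrete, here is a minimal sketch of a pipeline that maps RDF triples to text through explicit intermediate steps versus a single end-to-end mapping. The stage names and the `neural_model` placeholder are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: pipeline vs. end-to-end data-to-text from RDF triples.
# The stage functions and neural_model() are placeholders, not the paper's code.

from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def neural_model(stage: str, data) -> str:
    """Stand-in for a trained GRU/Transformer module handling one stage."""
    return f"<{stage} output for {data}>"


def pipeline_generate(triples: List[Triple]) -> str:
    # Explicit intermediate steps: each stage produces an inspectable representation.
    ordered = neural_model("ordering", triples)                 # decide triple order
    structured = neural_model("structuring", ordered)           # group triples into sentences
    lexicalized = neural_model("lexicalization", structured)    # choose words and templates
    return neural_model("realization", lexicalized)             # produce the surface text


def end_to_end_generate(triples: List[Triple]) -> str:
    # Single learned mapping from linearized triples straight to text.
    linearized = " ".join(f"{s} | {p} | {o}" for s, p, o in triples)
    return neural_model("end-to-end", linearized)


if __name__ == "__main__":
    data = [("Alan_Bean", "occupation", "Test_pilot")]
    print(pipeline_generate(data))
    print(end_to_end_generate(data))
```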

67 citations

19 May 2016
TL;DR: The authors present evidence against the assumption that the crowdsourced descriptions in the Flickr30K dataset reflect only what can be seen in the images, provide a list of biases and unwarranted inferences found in the dataset, and discuss how to deal with stereotype-driven descriptions in future applications.
Abstract: An untested assumption behind the crowdsourced descriptions of the images in the Flickr30K dataset (Young et al., 2014) is that they "focus only on the information that can be obtained from the image alone" (Hodosh et al., 2013, p. 859). This paper presents some evidence against this assumption, and provides a list of biases and unwarranted inferences that can be found in the Flickr30K dataset. Finally, it considers methods to find examples of these, and discusses how we should deal with stereotype-driven descriptions in future applications.

50 citations


Cited by
01 Jan 2014
TL;DR: How to design the teaching of the Using Language section so that one's teaching model avoids cliché while genuinely embodying the teaching philosophy advocated by the new curriculum standards is a question that English teachers have long been working to answer.
Abstract: In the new-curriculum People's Education Press senior high school English textbooks, Using Language is an indispensable part of every unit. It provides integrated listening, speaking, reading, and writing exercises around the unit's central topic, and serves as the continuation and culmination of that topic. How to design the teaching of the Using Language section so that one's teaching model avoids cliché while genuinely embodying the teaching philosophy advocated by the new curriculum standards is a question that front-line English teachers have been striving to explore.

2,071 citations

Journal ArticleDOI
TL;DR: The authors develop a framework that uses the temporal dynamics of word embeddings to quantify changes in stereotypes and attitudes toward women and ethnic minorities in the United States over the 20th and 21st centuries.
Abstract: Word embeddings are a powerful machine-learning framework that represents each English word by a vector. The geometric relationship between these vectors captures meaningful semantic relationships between the corresponding words. In this paper, we develop a framework to demonstrate how the temporal dynamics of the embedding helps to quantify changes in stereotypes and attitudes toward women and ethnic minorities in the 20th and 21st centuries in the United States. We integrate word embeddings trained on 100 years of text data with the US Census to show that changes in the embedding track closely with demographic and occupation shifts over time. The embedding captures societal shifts, such as the women's movement in the 1960s and Asian immigration into the United States, and also illuminates how specific adjectives and occupations became more closely associated with certain populations over time. Our framework for temporal analysis of word embeddings opens up a fruitful intersection between machine learning and quantitative social science.
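As a rough illustration of how such an association can be quantified, the sketch below compares how close a group's average word vector sits to a set of neutral words (e.g., occupations) in embeddings from different decades. The toy random vectors and the `association` helper are assumptions for illustration only, not the paper's actual method or data.

```python
# Hypothetical sketch: an embedding-based association score between identity
# words and neutral words, compared across decades. The vectors below are
# random placeholders, not real historical embeddings.

import numpy as np

rng = np.random.default_rng(0)
# Placeholder embeddings: one dict per decade, word -> 50-dimensional vector.
decades = {
    1960: {w: rng.normal(size=50) for w in ["she", "he", "nurse", "engineer"]},
    1990: {w: rng.normal(size=50) for w in ["she", "he", "nurse", "engineer"]},
}


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def association(emb, group_words, neutral_words):
    """Mean cosine similarity between a group's average vector and neutral words."""
    group_vec = np.mean([emb[w] for w in group_words], axis=0)
    return float(np.mean([cosine(group_vec, emb[w]) for w in neutral_words]))


for year, emb in decades.items():
    occupations = ["nurse", "engineer"]
    bias = association(emb, ["she"], occupations) - association(emb, ["he"], occupations)
    print(year, round(bias, 3))  # positive: occupations sit closer to "she"
```

Tracking such a score decade by decade is one simple way to make the "temporal dynamics" described in the abstract measurable.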

728 citations

Proceedings ArticleDOI
29 Jul 2017
TL;DR: The authors propose to inject corpus-level constraints for calibrating existing structured prediction models and design an algorithm based on Lagrangian relaxation for collective inference, which results in almost no performance loss for the underlying recognition task but decreases the magnitude of bias amplification.
Abstract: Language is increasingly being used to define rich visual recognition problems with supporting image collections sourced from the web. Structured prediction models are used in these tasks to take advantage of correlations between co-occurring labels and visual input but risk inadvertently encoding social biases found in web corpora. In this work, we study data and models associated with multilabel object classification and visual semantic role labeling. We find that (a) datasets for these tasks contain significant gender bias and (b) models trained on these datasets further amplify existing bias. For example, the activity cooking is over 33% more likely to involve females than males in a training set, and a trained model further amplifies the disparity to 68% at test time. We propose to inject corpus-level constraints for calibrating existing structured prediction models and design an algorithm based on Lagrangian relaxation for collective inference. Our method results in almost no performance loss for the underlying recognition task but decreases the magnitude of bias amplification by 47.5% and 40.5% for multilabel classification and visual semantic role labeling, respectively.
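The bias amplification the abstract describes can be measured with simple counts: compare the gender ratio for an activity in the training labels with the ratio in the model's test-time predictions. The sketch below uses invented counts purely for illustration; they are not the paper's data.

```python
# Hypothetical sketch: measuring bias amplification for one activity label.
# The counts are made up for illustration; they are not the paper's data.

def female_ratio(female_count: int, male_count: int) -> float:
    """Fraction of instances of an activity that carry the female label."""
    return female_count / (female_count + male_count)


# Training-set co-occurrence counts for the activity "cooking" (invented).
train_ratio = female_ratio(female_count=660, male_count=340)   # ~0.66

# Counts over a model's predictions on test images (invented).
pred_ratio = female_ratio(female_count=840, male_count=160)    # ~0.84

amplification = pred_ratio - train_ratio
print(f"training ratio={train_ratio:.2f}, predicted ratio={pred_ratio:.2f}, "
      f"amplification={amplification:.2f}")
```

A corpus-level constraint in the spirit of the paper would then require the predicted ratio to stay within a margin of the training ratio, enforced at inference time via the Lagrangian relaxation algorithm the abstract mentions.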

560 citations

Book ChapterDOI
08 Sep 2018
TL;DR: The authors propose a new Equalizer model that encourages equal gender probability when gender evidence is occluded in a scene and confident predictions when gender evidence is present; its losses can be added to any description model to mitigate the impact of unwanted bias in a description dataset.
Abstract: Most machine learning methods are known to capture and exploit biases of the training data. While some biases are beneficial for learning, others are harmful. Specifically, image captioning models tend to exaggerate biases present in training data (e.g., if a word is present in 60% of training sentences, it might be predicted in 70% of sentences at test time). This can lead to incorrect captions in domains where unbiased captions are desired, or required, due to over-reliance on the learned prior and image context. In this work we investigate generation of gender-specific caption words (e.g. man, woman) based on the person’s appearance or the image context. We introduce a new Equalizer model that encourages equal gender probability when gender evidence is occluded in a scene and confident predictions when gender evidence is present. The resulting model is forced to look at a person rather than use contextual cues to make a gender-specific prediction. The losses that comprise our model, the Appearance Confusion Loss and the Confident Loss, are general, and can be added to any description model in order to mitigate impacts of unwanted bias in a description dataset. Our proposed model has lower error than prior work when describing images with people and mentioning their gender and more closely matches the ground truth ratio of sentences including women to sentences including men. Finally, we show that our model more often looks at people when predicting their gender (https://people.eecs.berkeley.edu/~lisa_anne/snowboard.html).
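The following is one plausible, plain-Python rendering of the intuition behind the two losses named above, operating on the caption model's probabilities for gendered words; it is a sketch of the idea, not the authors' exact formulation.

```python
# Hypothetical sketch of the intuition behind the Appearance Confusion Loss
# and the Confident Loss described above. Functional forms are assumptions,
# not the paper's exact definitions.

def appearance_confusion_loss(p_woman_masked: float, p_man_masked: float) -> float:
    """On images where the person is occluded, gendered words should be
    roughly equally likely, so penalize any gap between their probabilities."""
    return abs(p_woman_masked - p_man_masked)


def confident_loss(p_correct: float, p_wrong: float, eps: float = 1e-6) -> float:
    """On unoccluded images, reward confidence in the correct gendered word:
    the loss shrinks as the wrong word's probability becomes negligible
    relative to the correct one."""
    return p_wrong / (p_correct + eps)


# Toy probabilities from a caption model at the time step of a gendered word.
print(appearance_confusion_loss(0.55, 0.45))          # small gap -> small penalty
print(confident_loss(p_correct=0.9, p_wrong=0.05))    # confident -> small loss
```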

411 citations

Posted Content
TL;DR: This paper presents an empirical study of gender bias in coreference resolution systems, using a Winograd schema-style set of minimal pair sentences that differ only by pronoun gender, and correlates the observed bias with real-world and textual gender statistics.
Abstract: We present an empirical study of gender bias in coreference resolution systems. We first introduce a novel, Winograd schema-style set of minimal pair sentences that differ only by pronoun gender. With these "Winogender schemas," we evaluate and confirm systematic gender bias in three publicly-available coreference resolution systems, and correlate this bias with real-world and textual gender statistics.
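A minimal sketch of how such minimal-pair sentences can be built from a template follows; the template and occupations are invented for illustration and are not items from the Winogender data.

```python
# Hypothetical sketch: building minimal-pair sentences that differ only in
# pronoun gender. The template is illustrative, not taken from Winogender.

from typing import Dict, List

TEMPLATE = "The {occupation} told the {participant} that {pronoun} would arrive soon."

PRONOUNS = {"female": "she", "male": "he", "neutral": "they"}


def minimal_pairs(occupation: str, participant: str) -> List[Dict[str, str]]:
    """Return one sentence per pronoun, identical except for the pronoun."""
    return [
        {
            "gender": gender,
            "sentence": TEMPLATE.format(
                occupation=occupation, participant=participant, pronoun=pronoun
            ),
        }
        for gender, pronoun in PRONOUNS.items()
    ]


for pair in minimal_pairs("technician", "customer"):
    print(pair["gender"], "->", pair["sentence"])
```

Each variant can then be fed to a coreference system; systematic differences in whether the pronoun is linked to the occupation or to the other participant, depending only on the pronoun's gender, indicate bias.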

306 citations