Open Access · Posted Content
Human or Machine: Automating Human Likeliness Evaluation of NLG Texts
Erion Çano, Ondřej Bojar
TL;DR
An attempt to automate the human likeliness evaluation of the output text samples coming from natural language generation methods used to solve several tasks, by using a discrimination procedure based on large pretrained language models and their probability distributions.
Abstract:
Automatic evaluation of various text quality criteria produced by data-driven intelligent methods is very common and useful because it is cheap, fast, and usually yields repeatable results. In this paper, we present an attempt to automate the human likeliness evaluation of the output text samples coming from natural language generation methods used to solve several tasks. We propose to use a human likeliness score that shows the percentage of the output samples from a method that look as if they were written by a human. Instead of having human participants label or rate those samples, we completely automate the process by using a discrimination procedure based on large pretrained language models and their probability distributions. As a follow-up, we plan to perform an empirical analysis of human-written and machine-generated texts to find the optimal setup of this evaluation approach. A validation procedure involving human participants will also check how the automatic evaluation correlates with human judgments.
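The abstract leaves the concrete setup to the planned empirical analysis, so the following is only a minimal sketch of the idea, assuming GPT-2 loaded through the Hugging Face transformers library and a purely illustrative decision threshold: score each sample by its average token log-probability under the pretrained model and report the percentage of samples the thresholded discriminator accepts as human-written.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def avg_log_prob(text):
        # Average per-token log-probability of `text` under GPT-2.
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            # With labels=input_ids, the returned loss is the mean
            # negative log-likelihood per predicted token.
            loss = model(ids, labels=ids).loss
        return -loss.item()

    def human_likeliness_score(samples, threshold=-4.0):
        # Machine-generated text tends to sit in the model's
        # high-probability region, so samples scoring below the
        # (illustrative, untuned) threshold count as human-like here.
        human_like = sum(avg_log_prob(s) < threshold for s in samples)
        return 100.0 * human_like / len(samples)

How the threshold should be set, and whether a trained classifier should replace it, is exactly what the planned analysis and human validation are meant to determine.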
Citations
Proceedings Article
TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation
TL;DR: The Turing Test (TT) benchmark environment TURINGBENCH as discussed by the authors is a dataset with 200k human- or machine-generated samples across 20 labels, including Human, GPT-1, GPT-2 (small, large, and xl variants), GPT-3, GROVER (base, large, and mega), CTRL, XLM, and XLNet (base and large).
Proceedings ArticleDOI
The errors analysis of natural language generation — A case study of Topic-to-Essay generation
TL;DR: The authors used manual evaluation methods to annotate and analyze texts produced by a state-of-the-art Topic-to-Essay generation model, which generates text conditioned on a set of topic words.
Proceedings ArticleDOI
ALIGNMEET: A Comprehensive Tool for Meeting Annotation, Alignment, and Evaluation
TL;DR: ALIGNMEET as discussed by the authors is a comprehensive tool for meeting annotation, alignment, and evaluation, which aims to provide an efficient and clear interface for fast annotation while mitigating the risk of introducing errors.
Posted Content
Automating Text Naturalness Evaluation of NLG Systems
Erion Çano, Ondřej Bojar
TL;DR: An attempt to automate the evaluation of text naturalness, a very important characteristic of natural language generation methods, by using a human likeliness metric the authors define and a discrimination procedure based on large pretrained language models and their probability distributions.
References
Proceedings ArticleDOI
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
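Since the discrimination procedure proposed above rests on exactly such probability distributions, a minimal sketch of reading one out of a pretrained BERT may help; it assumes the Hugging Face transformers API and is not code from any paper listed here.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    inputs = tok("The cat sat on the [MASK].", return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits

    # Distribution over the whole vocabulary at the masked position.
    mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
    probs = torch.softmax(logits[0, mask_pos], dim=-1)
    print(tok.decode([probs.argmax().item()]))  # most likely filler token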
Proceedings ArticleDOI
Bleu: a Method for Automatic Evaluation of Machine Translation
TL;DR: This paper proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
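For readers who want to try the metric, a minimal usage sketch with the sacrebleu package, a common modern implementation rather than the one from the original paper:

    import sacrebleu

    hypotheses = ["the cat is on the mat"]
    references = [["the cat sat on the mat"]]  # one reference stream

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(bleu.score)  # corpus-level BLEU on a 0-100 scale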
Proceedings Article
ROUGE: A Package for Automatic Evaluation of Summaries
TL;DR: Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, all included in the ROUGE summarization evaluation package, together with evaluations of them.
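A similar minimal sketch, assuming the rouge-score package, a Python reimplementation of the measures rather than the original Perl toolkit:

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    # score(reference, candidate) returns precision/recall/F1 per measure.
    scores = scorer.score("the cat sat on the mat",
                          "the cat is on the mat")
    print(scores["rougeL"].fmeasure)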
Proceedings Article
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
Satanjeev Banerjee, Alon Lavie
TL;DR: METEOR is described, an automatic metric for machine translation evaluation based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations, which can be easily extended to include more advanced matching strategies.
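And a minimal METEOR sketch, assuming NLTK's reimplementation of the metric; recent NLTK versions expect pre-tokenized input and need WordNet data for the synonym-matching stage.

    import nltk
    from nltk.translate.meteor_score import meteor_score

    nltk.download("wordnet")  # lexical resource used for synonym matching

    reference = "the cat sat on the mat".split()
    hypothesis = "the cat is on the mat".split()

    # One tokenized hypothesis scored against a list of tokenized
    # references; the result is a score between 0 and 1.
    print(meteor_score([reference], hypothesis))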
Journal ArticleDOI
Recent Trends in Deep Learning Based Natural Language Processing [Review Article]
TL;DR: This paper reviews significant deep-learning-related models and methods that have been employed for numerous NLP tasks and provides a walk-through of their evolution.
Related Papers (5)
Knowledge and Data Processing in a Process of Website Quality Evaluation
Janusz Sobecki, Dmitrij Żatuchin, et al.