Open Access
Posted Content

Human or Machine: Automating Human Likeliness Evaluation of NLG Texts

TLDR
An attempt to automate the human likeliness evaluation of output text samples from natural language generation methods for several tasks, using a discrimination procedure based on large pretrained language models and their probability distributions.
Abstract
Automatic evaluation of various quality criteria of texts produced by data-driven intelligent methods is very common and useful because it is cheap, fast, and usually yields repeatable results. In this paper, we present an attempt to automate the human likeliness evaluation of the output text samples coming from natural language generation methods used to solve several tasks. We propose to use a human likeliness score that shows the percentage of the output samples from a method that look as if they were written by a human. Instead of having human participants label or rate those samples, we completely automate the process by using a discrimination procedure based on large pretrained language models and their probability distributions. As a follow-up, we plan to perform an empirical analysis of human-written and machine-generated texts to find the optimal setup of this evaluation approach. A validation procedure involving human participants will also check how the automatic evaluation correlates with human judgments.
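The abstract does not spell out the discrimination procedure, but a minimal sketch of the general idea might look as follows: score each generated sample with a large pretrained language model and treat low-perplexity samples as "human-like", then report the percentage of such samples as the human likeliness score. The model choice (GPT-2 via HuggingFace transformers), the use of perplexity as the discrimination signal, and the threshold value below are illustrative assumptions, not the authors' actual setup.

```python
# Minimal sketch: score text samples with a pretrained LM and call a sample
# "human-like" when its perplexity falls below an empirically chosen threshold.
# The model ("gpt2") and the threshold (50.0) are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Token-level perplexity of `text` under the pretrained language model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()  # loss is the mean negative log-likelihood

def human_likeliness_score(samples: list[str], threshold: float = 50.0) -> float:
    """Percentage of samples whose perplexity falls below the threshold."""
    human_like = sum(perplexity(s) < threshold for s in samples)
    return 100.0 * human_like / len(samples)

if __name__ == "__main__":
    outputs = [
        "The weather in Paris was mild, with light rain expected in the evening.",
        "weather weather Paris rain rain rain evening evening the the the",
    ]
    print(f"Human likeliness score: {human_likeliness_score(outputs):.1f}%")
```

In practice, the threshold (or a more elaborate classifier over the model's probability distributions) would be tuned empirically against human-written reference texts, which is what the planned empirical analysis and human validation study are meant to establish.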


Citations
Proceedings Article

TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation

TL;DR: The Turing Test (TT) benchmark environment TURINGBENCH as discussed by the authors is a dataset with 200k human- or machine-generated samples across 20 labels, including Human, GPT-1, GPT-2 (small, large, xl), GPT-3, GROVER (base, large, mega), CTRL, XLM, XLNet (base, large), and other neural text generators.
Proceedings ArticleDOI

The errors analysis of natural language generation — A case study of Topic-to-Essay generation

TL;DR: The authors used manual evaluation methods to annotate and analyze text produced by natural language generation (NLG), using a state-of-the-art Topic-to-Essay generation model to generate texts conditioned on a set of topic words.
Proceedings ArticleDOI

ALIGNMEET: A Comprehensive Tool for Meeting Annotation, Alignment, and Evaluation

TL;DR: ALIGNMEET as discussed by the authors is a comprehensive tool for meeting annotation, alignment, and evaluation, which aims to provide an efficient and clear interface for fast annotation while mitigating the risk of introducing errors.
Posted Content

TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation

TL;DR: The TuringBench benchmark as mentioned in this paper is a dataset with 200k human- or machine-generated samples across 20 labels, including Human, GPT-1, GPT-2 (small, medium, large, xl, PyTorch), GPT-3, GROVER (base, large, mega), CTRL, XLM, and XLNet variants.
Posted Content

Automating Text Naturalness Evaluation of NLG Systems.

TL;DR: An attempt to automate the evaluation of text naturalness, a very important characteristic of natural language generation methods, by using a human likeliness metric that the authors define and a discrimination procedure based on large pretrained language models and their probability distributions.
References
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings ArticleDOI

Bleu: a Method for Automatic Evaluation of Machine Translation

TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Proceedings Article

ROUGE: A Package for Automatic Evaluation of Summaries

TL;DR: Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, included in the ROUGE summarization evaluation package, along with their evaluations.
Proceedings Article

METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

TL;DR: METEOR is described, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations, and that can be easily extended to include more advanced matching strategies.
Journal ArticleDOI

Recent Trends in Deep Learning Based Natural Language Processing [Review Article]

TL;DR: This paper reviews significant deep learning related models and methods that have been employed for numerous NLP tasks and provides a walk-through of their evolution.