Open Access
Posted Content

Human or Machine: Automating Human Likeliness Evaluation of NLG Texts

TLDR
An attempt to automate the human likeliness evaluation of output text samples from natural language generation methods for several tasks, using a discrimination procedure based on large pretrained language models and their probability distributions.
Abstract
Automatic evaluation of various quality criteria of texts produced by data-driven intelligent methods is very common and useful because it is cheap, fast, and usually yields repeatable results. In this paper, we present an attempt to automate the human likeliness evaluation of the output text samples coming from natural language generation methods used to solve several tasks. We propose to use a human likeliness score that shows the percentage of the output samples from a method that look as if they were written by a human. Instead of having human participants label or rate those samples, we completely automate the process by using a discrimination procedure based on large pretrained language models and their probability distributions. As a follow-up, we plan to perform an empirical analysis of human-written and machine-generated texts to find the optimal setup of this evaluation approach. A validation procedure involving human participants will also check how the automatic evaluation correlates with human judgments.
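The abstract does not spell out the discrimination procedure, but a minimal sketch of the general idea might look as follows: score each generated sample with a large pretrained language model and treat low-perplexity samples as "human-like", then report the percentage of such samples as the human likeliness score. The model choice (GPT-2 via HuggingFace transformers), the use of perplexity as the discrimination signal, and the threshold value below are illustrative assumptions, not the authors' actual setup.

```python
# Minimal sketch: score text samples with a pretrained LM and call a sample
# "human-like" when its perplexity falls below an empirically chosen threshold.
# The model ("gpt2") and the threshold (50.0) are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Token-level perplexity of `text` under the pretrained language model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()  # loss is the mean negative log-likelihood

def human_likeliness_score(samples: list[str], threshold: float = 50.0) -> float:
    """Percentage of samples whose perplexity falls below the threshold."""
    human_like = sum(perplexity(s) < threshold for s in samples)
    return 100.0 * human_like / len(samples)

if __name__ == "__main__":
    outputs = [
        "The weather in Paris was mild, with light rain expected in the evening.",
        "weather weather Paris rain rain rain evening evening the the the",
    ]
    print(f"Human likeliness score: {human_likeliness_score(outputs):.1f}%")
```

In practice, the threshold (or a more elaborate classifier over the model's probability distributions) would be tuned empirically against human-written reference texts, which is what the planned empirical analysis and human validation study are meant to establish.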


Citations
Proceedings Article

TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation

TL;DR: The Turing Test (TT) benchmark environment TURINGBENCH as discussed by the authors is a dataset with 200k human- or machine-generated samples across 20 labels, including Human, GPT-1, GPT-2 (small, large, xl), GPT-3, GROVER (base, large, mega), CTRL, XLM, XLNet (base, large), and other neural text generators.
Proceedings ArticleDOI

The errors analysis of natural language generation — A case study of Topic-to-Essay generation

TL;DR: The authors used manual evaluation methods to annotate and analyze text produced by natural language generation (NLG), using a state-of-the-art Topic-to-Essay generation model to generate texts conditioned on a set of topic words.
Proceedings ArticleDOI

ALIGNMEET: A Comprehensive Tool for Meeting Annotation, Alignment, and Evaluation

TL;DR: ALIGNMEET as discussed by the authors is a comprehensive tool for meeting annotation, alignment, and evaluation, which aims to provide an efficient and clear interface for fast annotation while mitigating the risk of introducing errors.
Posted Content

TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation

TL;DR: The TuringBench benchmark as mentioned in this paper is a dataset with 200k human- or machine-generated samples across 20 labels, including Human, GPT-1, GPT-2 (small, medium, large, xl, PyTorch), GPT-3, GROVER (base, large, mega), CTRL, XLM, and XLNet variants.
Posted Content

Automating Text Naturalness Evaluation of NLG Systems.

TL;DR: An attempt to automate the evaluation of text naturalness, a very important characteristic of natural language generation methods, by using a human likeliness metric that the authors define and a discrimination procedure based on large pretrained language models and their probability distributions.
References
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings ArticleDOI

Bleu: a Method for Automatic Evaluation of Machine Translation

TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Proceedings Article

ROUGE: A Package for Automatic Evaluation of Summaries

TL;DR: Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, included in the ROUGE summarization evaluation package, along with their evaluations.
Proceedings Article

METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

TL;DR: METEOR is described, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations, and that can be easily extended to include more advanced matching strategies.
Journal ArticleDOI

Recent Trends in Deep Learning Based Natural Language Processing [Review Article]

TL;DR: This paper reviews significant deep learning related models and methods that have been employed for numerous NLP tasks and provides a walk-through of their evolution.