Open Access, Proceedings Article

MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text

TLDR
MCTest is a freely available set of stories and associated questions intended for research on the machine comprehension of text. It requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension.
Abstract
We present MCTest, a freely available set of stories and associated questions intended for research on the machine comprehension of text. Previous work on machine comprehension (e.g., semantic modeling) has made great strides, but primarily focuses either on limited-domain datasets, or on solving a more restricted goal (e.g., open-domain relation extraction). In contrast, MCTest requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension. Reading comprehension can test advanced abilities such as causal reasoning and understanding the world, yet, by being multiple-choice, still provide a clear metric. By being fictional, the answer typically can be found only in the story itself. The stories and questions are also carefully limited to those a young child would understand, reducing the world knowledge that is required for the task. We present the scalable crowd-sourcing methods that allow us to cheaply construct a dataset of 500 stories and 2000 questions. By screening workers (with grammar tests) and stories (with grading), we have ensured that the data is of the same quality as another set that we manually edited, but at one tenth the editing cost. By being open-domain, yet carefully restricted, we hope MCTest will serve to encourage research and provide a clear metric for advancement on the machine comprehension of text.

1 Reading Comprehension

A major goal for NLP is for machines to be able to understand text as well as people. Several research disciplines are focused on this problem: for example, information extraction, relation extraction, semantic role labeling, and recognizing textual entailment. Yet these techniques are necessarily evaluated individually, rather than by how much they advance us towards the end goal. On the other hand, the goal of semantic parsing is the machine comprehension of text (MCT), yet its evaluation requires adherence to a specific knowledge representation, and it is currently unclear what the best representation is for open-domain text.

We believe that it is useful to directly tackle the top-level task of MCT. For this, we need a way to measure progress. One common method for evaluating someone’s understanding of text is by giving them a multiple-choice reading comprehension test. This has the advantage that it is objectively gradable (vs. essays) yet may test a range of abilities such as causal or counterfactual reasoning, inference among relations, or just basic understanding of the world in which the passage is set. Therefore, we propose a multiple-choice reading comprehension task as a way to evaluate progress on MCT.

We have built a reading comprehension dataset containing 500 fictional stories, with 4 multiple-choice questions per story. It was built using methods which can easily scale to at least 5000 stories, since the stories were created, and the curation was done, almost entirely through crowd-sourcing, at a total of $4.00 per story. We plan to periodically update the dataset to ensure that methods are not overfitting to the existing data. The dataset is open-domain, yet restricted to concepts and words that a 7-year-old is expected to understand. This task is still beyond the capability of today’s computers and algorithms.
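Because each question has exactly one correct choice, the metric the task provides is plain accuracy: the fraction of questions answered correctly. The following is a minimal sketch of that scoring, not the authors' evaluation code. The `Question` schema, the toy story, and the word-overlap predictor are all assumptions made for illustration; they are not part of the MCTest release.

```python
import re
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Question:
    """One multiple-choice question about a story (hypothetical schema)."""
    story: str
    text: str
    options: Sequence[str]
    answer_index: int  # index of the correct option in `options`


def words(text: str) -> set:
    """Lowercased word set with punctuation stripped."""
    return set(re.findall(r"[a-z']+", text.lower()))


def accuracy(questions: Sequence[Question],
             predict: Callable[[Question], int]) -> float:
    """Fraction of questions for which `predict` picks the correct option."""
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(1 for q in questions if predict(q) == q.answer_index)
    return correct / len(questions)


def word_overlap_predictor(q: Question) -> int:
    """Toy predictor (an assumption, not a method from the paper):
    pick the option that shares the most words with the story."""
    story_words = words(q.story)
    scores = [len(story_words & words(option)) for option in q.options]
    return scores.index(max(scores))  # first option wins ties


# Invented toy data, only to exercise the scoring function.
toy_story = "Tom has a red ball. He lost it in the park."
toy_questions = [
    Question(toy_story, "What color is the ball?",
             ["blue", "red", "green", "yellow"], 1),
    Question(toy_story, "Where did Tom lose the ball?",
             ["at school", "at home", "in the park", "in the car"], 2),
    Question(toy_story, "Who lost the ball?",
             ["Tom", "the red dog", "a cat", "Sam"], 0),
]

score = accuracy(toy_questions, word_overlap_predictor)
print(f"accuracy = {score:.3f}")
```

On this toy data the overlap predictor answers the first two questions correctly and misses the third, where a distractor happens to share more words with the story than the correct answer does. The metric stays unambiguous even when the predictor is naive, which is the property that makes a multiple-choice format objectively gradable.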

