
Showing papers by Dan Jurafsky published in 2010


Proceedings Article
09 Oct 2010
TL;DR: This work proposes a simple coreference architecture based on a sieve that applies tiers of deterministic coreference models one at a time from highest to lowest precision, and outperforms many state-of-the-art supervised and unsupervised models on several standard corpora.
Abstract: Most coreference resolution models determine if two mentions are coreferent using a single function over a set of constraints or features. This approach can lead to incorrect decisions as lower precision features often overwhelm the smaller number of high precision ones. To overcome this problem, we propose a simple coreference architecture based on a sieve that applies tiers of deterministic coreference models one at a time from highest to lowest precision. Each tier builds on the previous tier's entity cluster output. Further, our model propagates global information by sharing attributes (e.g., gender and number) across mentions in the same cluster. This cautious sieve guarantees that stronger features are given precedence over weaker ones and that each decision is made using all of the information available at the time. The framework is highly modular: new coreference modules can be plugged in without any change to the other modules. In spite of its simplicity, our approach outperforms many state-of-the-art supervised and unsupervised models on several standard corpora. This suggests that sieve-based approaches could be applied to other NLP tasks.
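To make the tiered-sieve idea concrete, here is a minimal sketch. The two passes, the merge test, and the mention fields are illustrative placeholders under assumed data structures, not the paper's actual modules.

```python
# Deterministic passes run one at a time from highest to lowest
# precision; each pass builds on the entity clusters left by the
# previous one, so stronger evidence always takes precedence.

def exact_match(mention, cluster):
    # Highest-precision tier: identical surface strings corefer.
    return any(mention["text"] == other["text"] for other in cluster)

def head_match(mention, cluster):
    # Lower-precision tier: shared head word plus attribute agreement;
    # attributes like gender are shared across the whole cluster.
    return any(mention["head"] == other["head"] and
               mention.get("gender") == other.get("gender")
               for other in cluster)

SIEVE = [exact_match, head_match]  # ordered high -> low precision

def resolve(mentions):
    clusters = [[m] for m in mentions]  # start from singletons
    for tier in SIEVE:
        merged = []
        for cluster in clusters:
            # Attach this cluster to an earlier one if any mention
            # pair passes the current tier's test.
            target = next((c for c in merged
                           if any(tier(m, c) for m in cluster)), None)
            if target is not None:
                target.extend(cluster)  # later tiers see this merge
            else:
                merged.append(cluster)
        clusters = merged
    return clusters
```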

389 citations


Proceedings ArticleDOI
16 May 2010
TL;DR: In this paper, a large-scale evaluation of captchas from the human perspective is presented, with the goal of assessing how much friction captchas present to the average user.
Abstract: Captchas are designed to be easy for humans but hard for machines. However, most recent research has focused only on making them hard for machines. In this paper, we present what is to the best of our knowledge the first large scale evaluation of captchas from the human perspective, with the goal of assessing how much friction captchas present to the average user. For the purpose of this study we have asked workers from Amazon’s Mechanical Turk and an underground captcha-breaking service to solve more than 318,000 captchas issued from the 21 most popular captcha schemes (13 image schemes and 8 audio schemes). Analysis of the resulting data reveals that captchas are often difficult for humans, with audio captchas being particularly problematic. We also find some demographic trends indicating, for example, that non-native speakers of English are slower in general and less accurate on English-centric captcha schemes. Evidence from a week’s worth of eBay captchas (14,000,000 samples) suggests that the solving accuracies found in our study are close to real-world values, and that improving audio captchas should become a priority, as nearly 1% of all captchas are delivered as audio rather than images. Finally, our study also reveals that it is more effective for an attacker to use Mechanical Turk to solve captchas than an underground service.

226 citations


30 Apr 2010
TL;DR: Evidence from a week’s worth of eBay captchas suggests that the solving accuracies found in the study are close to real-world values, and that improving audio captchas should become a priority, as nearly 1% of all captchas are delivered as audio rather than images.

224 citations


Proceedings Article
11 Jul 2010
TL;DR: A system that learns to follow navigational natural language directions by apprenticeship from routes through a map paired with English descriptions, using a reinforcement learning algorithm that grounds the meaning of spatial terms like above and south in geometric properties of paths.
Abstract: We present a system that learns to follow navigational natural language directions. Where traditional models learn from linguistic annotation or word distributions, our approach is grounded in the world, learning by apprenticeship from routes through a map paired with English descriptions. Lacking an explicit alignment between the text and the reference path makes it difficult to determine what portions of the language describe which aspects of the route. We learn this correspondence with a reinforcement learning algorithm, using the deviation of the route we follow from the intended path as a reward signal. We demonstrate that our system successfully grounds the meaning of spatial terms like above and south into geometric properties of paths.
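A toy illustration of the reward signal described above, in which the learner is rewarded for staying close to the reference path. Representing routes as sequences of (x, y) waypoints, and the specific distance measure, are assumptions of this sketch, not the paper's formulation.

```python
def deviation(followed, reference):
    # Mean pointwise distance between the route actually followed
    # and the intended path, compared step by step.
    n = min(len(followed), len(reference))
    dist = sum(((fx - rx) ** 2 + (fy - ry) ** 2) ** 0.5
               for (fx, fy), (rx, ry) in zip(followed[:n], reference[:n]))
    return dist / n

def reward(followed, reference):
    # Smaller deviation -> larger reward; this signal drives updates
    # to the policy mapping direction language to movement actions.
    return -deviation(followed, reference)
```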

212 citations


Journal ArticleDOI
01 Dec 2010
TL;DR: The approach to overcoming issues involved in such a data integration project is discussed; it is relevant both to users of the corpus and to others in the language resource community undertaking similar projects.
Abstract: This paper describes a recently completed common resource for the study of spoken discourse, the NXT-format Switchboard Corpus. Switchboard is a long-standing corpus of telephone conversations (Godfrey et al. in SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of ICASSP-92, pp. 517–520, 1992). We have brought together transcriptions with existing annotations for syntax, disfluency, speech acts, animacy, information status, coreference, and prosody; along with substantial new annotations of focus/contrast, more prosody, syllables and phones. The combined corpus uses the format of the NITE XML Toolkit, which allows these annotations to be browsed and searched as a coherent set (Carletta et al. in Lang Resour Eval J 39(4):313–334, 2005). The resulting corpus is a rich resource for the investigation of the linguistic features of dialogue and how they interact. As well as describing the corpus itself, we discuss our approach to overcoming issues involved in such a data integration project, relevant to both users of the corpus and others in the language resource community undertaking similar projects.

177 citations


Proceedings Article
02 Jun 2010
TL;DR: Three approaches for unsupervised grammar induction that are sensitive to data complexity are presented and applied to Klein and Manning's Dependency Model with Valence, beating the state of the art and generalizing to the Brown corpus.
Abstract: We present three approaches for unsupervised grammar induction that are sensitive to data complexity and apply them to Klein and Manning's Dependency Model with Valence. The first, Baby Steps, bootstraps itself via iterated learning of increasingly longer sentences and requires no initialization. This method substantially exceeds Klein and Manning's published scores and achieves 39.4% accuracy on Section 23 (all sentences) of the Wall Street Journal corpus. The second, Less is More, uses a low-complexity subset of the available data: sentences up to length 15. Focusing on fewer but simpler examples trades off quantity against ambiguity; it attains 44.1% accuracy, using the standard linguistically-informed prior and batch training, beating state-of-the-art. Leapfrog, our third heuristic, combines Less is More with Baby Steps by mixing their models of shorter sentences, then rapidly ramping up exposure to the full training set, driving up accuracy to 45.0%. These trends generalize to the Brown corpus; awareness of data complexity may improve other parsing models and unsupervised algorithms.
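A schematic of the Baby Steps curriculum described above: train on sentences up to length k, then reuse that model to initialize training at length k+1. The trainer interface below is a hypothetical stand-in for EM training of the Dependency Model with Valence, not the authors' code.

```python
def baby_steps(corpus, train_em, max_len=45):
    model = None  # no clever initialization required
    for k in range(1, max_len + 1):
        batch = [s for s in corpus if len(s) <= k]
        if batch:
            # Each stage starts from the previous stage's model, so
            # short, simple sentences scaffold the longer ones.
            model = train_em(batch, init=model)
    return model
```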

177 citations


Proceedings Article
01 May 2010
TL;DR: It is found that constituent parsers, with dependencies extracted from the parse trees they create, systematically outperform algorithms designed specifically for dependency parsing.
Abstract: We investigate a number of approaches to generating Stanford Dependencies, a widely used semantically-oriented dependency representation. We examine algorithms specifically designed for dependency parsing (Nivre, Nivre Eager, Covington, Eisner, and RelEx) as well as dependencies extracted from constituent parse trees created by phrase structure parsers (Charniak, Charniak-Johnson, Bikel, Berkeley and Stanford). We found that constituent parsers systematically outperform algorithms designed specifically for dependency parsing. The most accurate method for generating dependencies is the Charniak-Johnson reranking parser, with 89% (labeled) attachment F1 score. The fastest methods are Nivre, Nivre Eager, and Covington, used with a linear classifier to make local parsing decisions, which can parse the entire Penn Treebank development set (section 22) in less than 10 seconds on an Intel Xeon E5520. However, this speed comes with a substantial drop in F1 score (about 76% for labeled attachment) compared to competing methods. By tuning how much of the search space is explored by the Charniak-Johnson parser, we are able to arrive at a balanced configuration that is both fast and nearly as good as the most accurate approaches.

168 citations


Journal ArticleDOI
TL;DR: It is proposed that doubly confusable pairs, rather than high neighborhood density, may better explain phonetic neighborhood errors in human speech processing.

164 citations


Proceedings ArticleDOI
26 Oct 2010
TL;DR: This work proposes a new task for evaluating the resulting retrieval models, where the retrieval system takes only an abstract as its input and must produce as output the list of references at the end of the abstract's article.
Abstract: Scientists depend on literature search to find prior work that is relevant to their research ideas. We introduce a retrieval model for literature search that incorporates a wide variety of factors important to researchers, and learns the weights of each of these factors by observing citation patterns. We introduce features like topical similarity and author behavioral patterns, and combine these with features from related work like citation count and recency of publication. We present an iterative process for learning weights for these features that alternates between retrieving articles with the current retrieval model, and updating model weights by training a supervised classifier on these articles. We propose a new task for evaluating the resulting retrieval models, where the retrieval system takes only an abstract as its input and must produce as output the list of references at the end of the abstract's article. We evaluate our model on a collection of journal, conference and workshop articles from the ACL Anthology Reference Corpus. Our model achieves a mean average precision of 28.7, a 12.8 point improvement over a term similarity baseline, and a significant improvement both over models using only features from related work and over models without our iterative learning.
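A sketch of the iterative training loop described above: alternate between retrieving candidate articles with the current weights and refitting a supervised classifier on them. The retrieval function, feature extractor, and classifier are assumed interfaces for illustration, not the paper's implementation.

```python
def iterative_training(queries, retrieve, featurize, fit, rounds=5):
    weights = None  # e.g., start from a term-similarity baseline
    for _ in range(rounds):
        examples = []
        for abstract, true_refs in queries:
            for doc in retrieve(abstract, weights):
                # Articles actually cited by the query article are
                # positives; other retrieved articles are negatives.
                examples.append((featurize(abstract, doc),
                                 doc in true_refs))
        weights = fit(examples)  # supervised update of feature weights
    return weights
```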

140 citations


Proceedings Article
15 Jul 2010
TL;DR: It is shown that Viterbi (or "hard") EM is well-suited to unsupervised grammar induction, and is more accurate than standard inside-outside re-estimation (classic EM), significantly faster, and simpler.
Abstract: We show that Viterbi (or "hard") EM is well-suited to unsupervised grammar induction. It is more accurate than standard inside-outside re-estimation (classic EM), significantly faster, and simpler. Our experiments with Klein and Manning's Dependency Model with Valence (DMV) attain state-of-the-art performance, 44.8% accuracy on Section 23 (all sentences) of the Wall Street Journal corpus, without clever initialization; with a good initializer, Viterbi training improves to 47.9%. This generalizes to the Brown corpus, our held-out set, where accuracy reaches 50.8%, a 7.5% gain over previous best results. We find that classic EM learns better from short sentences but cannot cope with longer ones, where Viterbi thrives. However, we explain that both algorithms optimize the wrong objectives and prove that there are fundamental disconnects between the likelihoods of sentences, best parses, and true parses, beyond the well-established discrepancies between likelihood, accuracy and extrinsic performance.
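A schematic contrast of the two E-steps discussed above: classic (soft) EM lets every parse contribute fractional counts weighted by its posterior, while Viterbi (hard) EM counts only the single best parse. The parser and count functions are hypothetical stand-ins.

```python
def soft_em_step(sentences, model, all_parses, posterior, counts):
    expected = {}
    for s in sentences:
        for parse in all_parses(s, model):
            w = posterior(parse, s, model)  # fractional credit
            for event, c in counts(parse).items():
                expected[event] = expected.get(event, 0.0) + w * c
    return expected

def hard_em_step(sentences, model, best_parse, counts):
    expected = {}
    for s in sentences:
        parse = best_parse(s, model)  # Viterbi parse only
        for event, c in counts(parse).items():
            expected[event] = expected.get(event, 0.0) + c
    return expected
```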

93 citations


Proceedings Article
02 Jun 2010
TL;DR: It is shown that people tend to prefer BLEU and NIST trained models to those trained on edit distance based metrics like TER or WER, and that using BLEU or NIST produces models that are more robust to evaluation by other metrics and perform well in human judgments.
Abstract: Translation systems are generally trained to optimize BLEU, but many alternative metrics are available. We explore how optimizing toward various automatic evaluation metrics (BLEU, METEOR, NIST, TER) affects the resulting model. We train a state-of-the-art MT system using MERT on many parameterizations of each metric and evaluate the resulting models on the other metrics and also using human judges. In accordance with popular wisdom, we find that it's important to train on the same metric used in testing. However, we also find that training to a newer metric is only useful to the extent that the MT model's structure and features allow it to take advantage of the metric. Contrasting with TER's good correlation with human judgments, we show that people tend to prefer BLEU and NIST trained models to those trained on edit distance based metrics like TER or WER. Human preferences for METEOR trained models vary depending on the source language. Since using BLEU or NIST produces models that are more robust to evaluation by other metrics and perform well in human judgments, we conclude they are still the best choice for training.

Proceedings Article
11 Jul 2010
TL;DR: It is demonstrated that derived constraints aid grammar induction by training Klein and Manning's Dependency Model with Valence (DMV) on this data set: parsing accuracy on Section 23 (all sentences) of the Wall Street Journal corpus jumps to 50.4%, beating previous state-of-the-art by more than 5%.
Abstract: We show how web mark-up can be used to improve unsupervised dependency parsing. Starting from raw bracketings of four common HTML tags (anchors, bold, italics and underlines), we refine approximate partial phrase boundaries to yield accurate parsing constraints. Conversion procedures fall out of our linguistic analysis of a newly available million-word hyper-text corpus. We demonstrate that derived constraints aid grammar induction by training Klein and Manning's Dependency Model with Valence (DMV) on this data set: parsing accuracy on Section 23 (all sentences) of the Wall Street Journal corpus jumps to 50.4%, beating previous state-of-the-art by more than 5%. Web-scale experiments show that the DMV, perhaps because it is unlexicalized, does not benefit from orders of magnitude more annotated but noisier data. Our model, trained on a single blog, generalizes to 53.3% accuracy out-of-domain, against the Brown corpus --- nearly 10% higher than the previous published best. The fact that web mark-up strongly correlates with syntactic structure may have broad applicability in NLP.
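A minimal sketch of the idea above: the words inside an anchor, bold, italics, or underline tag are read off as approximate partial bracketings that can constrain grammar induction. The regex, whitespace tokenization, and the constraint format are assumptions for illustration.

```python
import re

TAGS = ("a", "b", "i", "u")
SPAN = re.compile(r"<({})\b[^>]*>(.*?)</\1>".format("|".join(TAGS)),
                  re.IGNORECASE | re.DOTALL)

def bracketing_constraints(html):
    """Return marked-up word spans, e.g. the text inside <b>...</b>."""
    constraints = []
    for match in SPAN.finditer(html):
        inner = re.sub(r"<[^>]+>", " ", match.group(2))  # drop nested tags
        words = inner.split()
        if len(words) > 1:  # multi-word spans suggest phrase boundaries
            constraints.append(tuple(words))
    return constraints

# Example: bracketing_constraints('See <b>the new parser</b> here.')
# -> [('the', 'new', 'parser')]
```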

Proceedings Article
01 May 2010
TL;DR: The narrative schema resource described in this paper contains approximately 5000 unique events combined into schemas of varying sizes; the paper describes the resource, how it is learned, and a new evaluation of the coverage of these schemas over unseen documents.
Abstract: This paper describes a new language resource of events and semantic roles that characterize real-world situations. Narrative schemas contain sets of related events (edit and publish), a temporal ordering of the events (edit before publish), and the semantic roles of the participants (authors publish books). This type of world knowledge was central to early research in natural language understanding; scripts, one of the main formalisms, represented common sequences of events that occur in the world. Unfortunately, most of this knowledge was hand-coded and time consuming to create. Current machine learning techniques, as well as a new approach to learning through coreference chains, have allowed us to automatically extract rich event structure from open domain text in the form of narrative schemas. The narrative schema resource described in this paper contains approximately 5000 unique events combined into schemas of varying sizes. We describe the resource, how it is learned, and a new evaluation of the coverage of these schemas over unseen documents.
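A small illustration of the structure a narrative schema encodes, using the abstract's running example: a set of related events, a temporal ordering, and the semantic roles the participants fill. The field names are illustrative, not the released resource's format.

```python
from dataclasses import dataclass

@dataclass
class NarrativeSchema:
    events: set    # related events, e.g. {"edit", "publish"}
    before: list   # temporal ordering, e.g. [("edit", "publish")]
    roles: dict    # participant -> (event, slot) pairs it fills

schema = NarrativeSchema(
    events={"edit", "publish"},
    before=[("edit", "publish")],  # editing happens before publishing
    roles={"author": [("edit", "subject"), ("publish", "subject")],
           "book":   [("edit", "object"),  ("publish", "object")]},
)
```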

Proceedings Article
01 Jun 2010
TL;DR: A new Java-based open source toolkit for phrase-based machine translation is presented; its key innovation is the use of APIs for integrating new features (knowledge sources) into the decoding model and for extracting feature statistics from aligned bitexts.
Abstract: We present a new Java-based open source toolkit for phrase-based machine translation. The key innovation provided by the toolkit is to use APIs for integrating new features (knowledge sources) into the decoding model and for extracting feature statistics from aligned bitexts. The package includes a number of useful features written to these APIs including features for hierarchical reordering, discriminatively trained linear distortion, and syntax based language models. Other useful utilities packaged with the toolkit include: a conditional phrase extraction system that builds a phrase table just for a specific dataset; and an implementation of MERT that allows for pluggable evaluation metrics for both training and evaluation with built in support for a variety of metrics (e.g., TERp, BLEU, METEOR).

Proceedings Article
11 Jul 2010
TL;DR: This paper improves the use of pseudo-words as an evaluation framework for selectional preferences and shows that selectional preferences should instead be evaluated on the data in its entirety.
Abstract: This paper improves the use of pseudo-words as an evaluation framework for selectional preferences. While pseudo-words originally evaluated word sense disambiguation, they are now commonly used to evaluate selectional preferences. A selectional preference model ranks a set of possible arguments for a verb by their semantic fit to the verb. Pseudo-words serve as a proxy evaluation for these decisions. The evaluation takes an argument of a verb like drive (e.g. car), pairs it with an alternative word (e.g. car/rock), and asks a model to identify the original. This paper studies two main aspects of pseudo-word creation that affect performance results. (1) Pseudo-word evaluations often evaluate only a subset of the words. We show that selectional preferences should instead be evaluated on the data in its entirety. (2) Different approaches to selecting partner words can produce overly optimistic evaluations. We offer suggestions to address these factors and present a simple baseline that outperforms the state-of-the-art by 13% absolute on a newspaper domain.
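A compact sketch of the pseudo-word test described above: pair each real verb argument with a confounder and score a preference model on picking the original. The model interface is an assumption, and as the paper argues, both full-data coverage and the choice of partner words matter to the result.

```python
def pseudo_word_accuracy(pairs, model):
    """pairs: (verb, original_arg, confounder_arg) triples, covering
    the data in its entirety rather than a hand-picked subset."""
    correct = 0
    for verb, original, confounder in pairs:
        # The preference model ranks arguments by semantic fit.
        if model.score(verb, original) > model.score(verb, confounder):
            correct += 1
    return correct / len(pairs)

# e.g. ("drive", "car", "rock"): a good model prefers "car".
```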

Proceedings Article
03 Nov 2010
TL;DR: Work in progress is reported on a robotic dialog system that learns names and attributes of objects through spoken interaction with a human teacher, playing a variant of the children’s games “I Spy” and “20 Questions”.
Abstract: Despite efforts to build robust vision systems, robots in new environments inevitably encounter new objects. Traditional supervised learning requires gathering and annotating sample images in the environment, usually in the form of bounding boxes or segmentations. This training interface takes some experience to do correctly and is quite tedious. We report work in progress on a robotic dialog system to learn names and attributes of objects through spoken interaction with a human teacher. The robot and human play a variant of the children’s games “I Spy” and “20 Questions”. In our game, the human places objects of interest in front of the robot, then picks an object in her head. The robot asks a series of natural language questions about the target object, with the goal of pointing at the correct object while asking a minimum number of questions. The questions range from attributes such as color (“Is it red?”) to category questions (“Is it a cup?”). The robot selects questions to ask based on an information gain criterion, seeking to minimize the entropy of the visual model given the answer to the question.
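A sketch of the question-selection rule described above: ask the question whose expected answer most reduces entropy over which object is the target. The probability interfaces are assumptions for illustration, not the system's visual model.

```python
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def best_question(questions, p_target, p_yes):
    """p_target: object -> P(object is the one in the teacher's head);
    p_yes(q, o): P(answer to question q is "yes" if o is the target)."""
    best, best_gain = None, -1.0
    h_prior = entropy(p_target)
    for q in questions:
        gain = h_prior
        for answer in (True, False):
            # Unnormalized posterior over objects given this answer.
            joint = {o: p * (p_yes(q, o) if answer else 1 - p_yes(q, o))
                     for o, p in p_target.items()}
            z = sum(joint.values())  # probability of this answer
            if z == 0:
                continue
            posterior = {o: v / z for o, v in joint.items()}
            gain -= z * entropy(posterior)  # expected posterior entropy
        if gain > best_gain:
            best, best_gain = q, gain
    return best
```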

Proceedings Article
23 Aug 2010
TL;DR: You will find in this volume papers from the 23rd International Conference on Computational Linguistics (COLING 2010), held in Beijing, China on August 23-27, 2010 under the auspices of the International Committee on Computational Linguistics (ICCL) and organized by the Chinese Information Processing Society (CIPS) of China.
Abstract: You will find in this volume papers from the 23rd International Conference on Computational Linguistics (COLING 2010) held in Beijing, China on August 23-27, 2010 under the auspices of the International Committee on Computational Linguistics (ICCL), and organized by the Chinese Information Processing Society (CIPS) of China. For this prestigious natural language processing conference to be held in China is a significant event for computational linguistics and for colleagues in China, demonstrating both the maturity of our field and the development of academic areas in China.

Proceedings Article
01 Aug 2010