
Showing papers by "Dan Jurafsky" published in 2012


Proceedings Article
12 Jul 2012
TL;DR: A novel coreference resolution system that models entities and events jointly, handling nominal and verbal events as well as entities; the joint formulation lets information from event coreference help entity coreference, and vice versa.
Abstract: We introduce a novel coreference resolution system that models entities and events jointly. Our iterative method cautiously constructs clusters of entity and event mentions using linear regression to model cluster merge operations. As clusters are built, information flows between entity and event clusters through features that model semantic role dependencies. Our system handles nominal and verbal events as well as entities, and our joint formulation allows information from event coreference to help entity coreference, and vice versa. In a cross-document domain with comparable documents, joint coreference resolution performs significantly better (over 3 CoNLL F1 points) than two strong baselines that resolve entities and events separately.

223 citations
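
To make the cautious merge loop concrete, here is a minimal Python sketch of score-one-merge-per-iteration clustering. The feature function and weights are caller-supplied placeholders, not the paper's learned linear-regression model over semantic-role dependencies:

```python
from itertools import combinations

def merge_score(c1, c2, weights, featurize):
    """Linear-model score for merging two clusters (higher = more coreferent)."""
    feats = featurize(c1, c2)  # e.g., head-word match, semantic-role overlap
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def cautious_cluster(mentions, weights, featurize, threshold=0.5):
    """Greedily apply the single best-scoring cluster merge per iteration."""
    clusters = [[m] for m in mentions]  # start from singleton clusters
    while len(clusters) > 1:
        score, i, j = max(
            (merge_score(clusters[i], clusters[j], weights, featurize), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if score < threshold:            # stop once no merge is confident enough
            break
        clusters[i].extend(clusters[j])  # merge; features are recomputed next pass
        del clusters[j]                  # j > i, so index i stays valid
    return clusters
```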


Proceedings ArticleDOI
10 Dec 2012
TL;DR: In this paper, the authors build models for rating systems in which such dimensions are explicit, in the sense that users leave separate ratings for each aspect of a product, and evaluate the models on three prediction tasks: uncovering which parts of a review discuss which rated aspects, summarizing reviews by finding the sentences that best explain a user's rating, and recovering aspect ratings that are missing from a user's evaluation.
Abstract: Most online reviews consist of plain-text feedback together with a single numeric score. However, understanding the multiple 'aspects' that contribute to users' ratings may help us to better understand their individual preferences. For example, a user's impression of an audio book presumably depends on aspects such as the story and the narrator, and knowing their opinions on these aspects may help us to recommend better products. In this paper, we build models for rating systems in which such dimensions are explicit, in the sense that users leave separate ratings for each aspect of a product. By introducing new corpora consisting of five million reviews, rated with between three and six aspects, we evaluate our models on three prediction tasks: First, we uncover which parts of a review discuss which of the rated aspects. Second, we summarize reviews by finding the sentences that best explain a user's rating. Finally, since aspect ratings are optional in many of the datasets we consider, we recover ratings that are missing from a user's evaluation. Our model matches state-of-the-art approaches on existing small-scale datasets, while scaling to the real-world datasets we introduce. Moreover, our model is able to 'disentangle' content and sentiment words: we automatically learn content words that are indicative of a particular aspect as well as the aspect-specific sentiment words that are indicative of a particular rating.

170 citations
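
As a toy illustration of the first prediction task (deciding which review sentences discuss which rated aspects), the sketch below scores sentences against hand-picked aspect word sets; the paper learns such content-word associations automatically rather than relying on fixed lexicons:

```python
# Toy aspect assignment via lexical overlap. The aspect word sets are
# invented examples; the model in the paper learns them from data.
ASPECT_WORDS = {
    "story":    {"plot", "story", "characters", "ending"},
    "narrator": {"narrator", "voice", "reading", "performance"},
}

def sentence_aspect(sentence):
    """Pick the aspect whose word set best overlaps the sentence, if any."""
    tokens = set(sentence.lower().split())
    scores = {a: len(tokens & words) for a, words in ASPECT_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

review = ["The plot kept me guessing.", "The narrator's voice was flat."]
print([sentence_aspect(s) for s in review])  # ['story', 'narrator']
```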


Journal ArticleDOI
TL;DR: A new, two-stage, self-supervised algorithm for author disambiguation in large bibliographic databases that shares the advantages of unsupervised approaches (no need for expensive hand labels) as well as supervised approaches (a rich set of features that can be discriminatively trained).
Abstract: We present a new, two-stage, self-supervised algorithm for author disambiguation in large bibliographic databases. In the first “bootstrap” stage, a collection of high-precision features is used to bootstrap a training set with positive and negative examples of coreferring authors. A supervised feature-based classifier is then trained on the bootstrap clusters and used to cluster the authors in a larger unlabeled dataset. Our self-supervised approach shares the advantages of unsupervised approaches (no need for expensive hand labels) as well as supervised approaches (a rich set of features that can be discriminatively trained). The algorithm disambiguates 54,000,000 author instances in Thomson Reuters' Web of Knowledge with a B3 F1 of .807. We analyze parameters and features, particularly those from citation networks, which have not been deeply investigated in author disambiguation. The most important citation feature is self-citation, which can be approximated without expensive extraction of the full network. For the supervised stage, the minor improvement due to other citation features (increasing F1 from .748 to .767) suggests they may not be worth the trouble of extracting from databases that don't already have them. A lean feature set without expensive abstract and title features performs 130 times faster with about equal F1. © 2012 Wiley Periodicals, Inc.

109 citations
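
A sketch of the two-stage idea, with an assumed record schema (email, last, first, coauthors, cited, id) and simplified high-precision rules standing in for the paper's richer feature set:

```python
# Stage 1: label pairs with high-precision rules; stage 2: train a
# discriminative pairwise classifier on those bootstrap labels.
# All record fields and rules here are illustrative assumptions.
from sklearn.linear_model import LogisticRegression

def bootstrap_label(a, b):
    """High-precision heuristic; None means 'unsure', so skip the pair."""
    if a["email"] and a["email"] == b["email"]:
        return 1                 # shared email: almost surely the same author
    if a["last"] != b["last"]:
        return 0                 # different surnames: different authors
    return None

def pair_features(a, b):
    return [
        float(a["first"] == b["first"]),                          # first-name match
        float(bool(set(a["coauthors"]) & set(b["coauthors"]))),   # shared coauthor
        float(a["id"] in b["cited"] or b["id"] in a["cited"]),    # self-citation proxy
    ]

def train_stage2(author_pairs):
    X, y = [], []
    for a, b in author_pairs:
        label = bootstrap_label(a, b)
        if label is not None:    # only rule-labeled pairs enter training
            X.append(pair_features(a, b))
            y.append(label)
    return LogisticRegression().fit(X, y)
```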


Proceedings Article
01 Jun 2012
TL;DR: Results showed that the most important detectable indicator of high-quality poetry was the frequency of references to concrete objects, suggesting that concreteness may be one of the most appealing features of poetry to the modern aesthetic.
Abstract: What makes a poem beautiful? We use computational methods to compare the stylistic and content features employed by award-winning poets and amateur poets. Building upon existing techniques designed to quantitatively analyze style and affect in texts, we examined elements of poetic craft such as diction, sound devices, emotive language, and imagery. Results showed that the most important indicator of high-quality poetry we could detect was the frequency of references to concrete objects. This result highlights the influence of Imagism in contemporary professional poetry, and suggests that concreteness may be one of the most appealing features of poetry to the modern aesthetic. We also report on other features that characterize high-quality poetry and argue that methods from computational linguistics may provide important insights into the analysis of beauty in verbal art.

92 citations
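
For illustration, the concreteness signal can be reduced to a lexicon-overlap rate; the tiny word list below is invented, whereas the study drew on established imagery and affect resources:

```python
# Fraction of a poem's tokens found in a (hypothetical) concreteness lexicon.
CONCRETE = {"stone", "river", "bread", "hands", "snow", "birds"}

def concreteness(poem):
    tokens = [t.strip(".,;:!?") for t in poem.lower().split()]
    return sum(t in CONCRETE for t in tokens) / max(len(tokens), 1)

print(concreteness("Snow settles on the stone bridge over the river."))  # 3/9
```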


Proceedings Article
10 Jul 2012
TL;DR: A people-centered computational history of science that tracks authors over topics, applied to the history of computational linguistics; it finds that the government-sponsored bakeoffs brought new researchers to the field and bridged early topics to modern probabilistic approaches.
Abstract: We develop a people-centered computational history of science that tracks authors over topics and apply it to the history of computational linguistics. We present four findings in this paper. First, we identify the topical subfields authors work on by assigning automatically generated topics to each paper in the ACL Anthology from 1980 to 2008. Next, we identify four distinct research epochs where the pattern of topical overlaps are stable and different from other eras: an early NLP period from 1980 to 1988, the period of US government-sponsored MUC and ATIS evaluations from 1989 to 1994, a transitory period until 2001, and a modern integration period from 2002 onwards. Third, we analyze the flow of authors across topics to discern how some subfields flow into the next, forming different stages of ACL research. We find that the government-sponsored bakeoffs brought new researchers to the field, and bridged early topics to modern probabilistic approaches. Last, we identify steep increases in author retention during the bakeoff era and the modern era, suggesting two points at which the field became more integrated.

60 citations
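
Author retention can be operationalized as the fraction of one year's authors who also publish the following year; a small sketch under an assumed (year, authors) input format:

```python
# Year-over-year author retention from (year, [author names]) records.
from collections import defaultdict

def retention_by_year(papers):
    authors = defaultdict(set)
    for year, names in papers:
        authors[year].update(names)
    return {
        y: len(authors[y] & authors[y + 1]) / len(authors[y])
        for y in sorted(authors) if y + 1 in authors and authors[y]
    }

data = [(1989, ["a", "b"]), (1990, ["b", "c"]), (1991, ["b", "c", "d"])]
print(retention_by_year(data))  # {1989: 0.5, 1990: 1.0}
```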


Proceedings Article
03 Jun 2012
TL;DR: A probabilistic approach for learning to interpret temporal phrases given only a corpus of utterances and the times they reference, achieving an accuracy of 72% on an adapted TempEval-2 task, comparable to state-of-the-art systems.
Abstract: We present a probabilistic approach for learning to interpret temporal phrases given only a corpus of utterances and the times they reference. While most approaches to the task have used regular expressions and similar linear pattern interpretation rules, the possibility of phrasal embedding and modification in time expressions motivates our use of a compositional grammar of time expressions. This grammar is used to construct a latent parse which evaluates to the time the phrase would represent, as a logical parse might evaluate to a concrete entity. In this way, we can employ a loosely supervised EM-style bootstrapping approach to learn these latent parses while capturing both syntactic uncertainty and pragmatic ambiguity in a probabilistic framework. We achieve an accuracy of 72% on an adapted TempEval-2 task -- comparable to state of the art systems.

57 citations
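
To show how a latent parse can evaluate to a concrete time, here is a miniature compositional grammar with invented rules; the actual system learns a far richer probabilistic grammar via EM:

```python
# Parses are functions from a reference date to a concrete date, so a
# candidate parse can be scored by whether it evaluates to the
# annotated time. All rules below are invented miniatures.
from datetime import date, timedelta

TERMINALS = {
    "today":     lambda ref: ref,
    "tomorrow":  lambda ref: ref + timedelta(days=1),
    "yesterday": lambda ref: ref - timedelta(days=1),
}

def days_after(n, inner):
    """Composition rule: 'n days after X' shifts the evaluation of X."""
    return lambda ref: inner(ref) + timedelta(days=n)

ref = date(2012, 6, 3)
parse = days_after(2, TERMINALS["tomorrow"])  # latent parse of "2 days after tomorrow"
print(parse(ref))                             # 2012-06-06
```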


Proceedings Article
10 Jul 2012
TL;DR: A fine-grained study of gender in the field of Natural Language Processing finds that women publish more on dialog, discourse, and sentiment, while men publish more than women on parsing, formal semantics, and finite-state models.
Abstract: Studies of gender balance in academic computer science are typically based on statistics on enrollment and graduation. Going beyond these coarse measures of gender participation, we conduct a fine-grained study of gender in the field of Natural Language Processing. We use topic models (Latent Dirichlet Allocation) to explore the research topics of men and women in the ACL Anthology Network. We find that women publish more on dialog, discourse, and sentiment, while men publish more than women in parsing, formal semantics, and finite state models. To conduct our study we labeled the gender of authors in the ACL Anthology mostly manually, creating a useful resource for other gender studies. Finally, our study of historical patterns in female participation shows that the proportion of women authors in computational linguistics has been continuously increasing, with approximately a 50% increase in the three decades since 1980.

55 citations
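
One aggregation step such a study might use: sum per-paper LDA topic proportions by author gender, then normalize within gender so topic shares are comparable. The field names are assumptions, not the study's actual data format:

```python
from collections import defaultdict

def topic_share_by_gender(papers):
    """papers: dicts with 'topics' (topic -> proportion) and 'genders' (per author)."""
    mass = defaultdict(lambda: defaultdict(float))
    for p in papers:
        for g in p["genders"]:
            for topic, prob in p["topics"].items():
                mass[g][topic] += prob
    # normalize within each gender so topic shares sum to 1
    return {g: {t: v / sum(ts.values()) for t, v in ts.items()}
            for g, ts in mass.items()}

papers = [
    {"topics": {"dialog": 0.7, "parsing": 0.3}, "genders": ["female"]},
    {"topics": {"parsing": 0.9, "dialog": 0.1}, "genders": ["male"]},
]
print(topic_share_by_gender(papers))
```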


Proceedings Article
01 Jun 2012
TL;DR: This paper examines how referential cohesion is expressed in literary and non-literary texts and how this cohesion affects translation, suggesting that incorporating discourse features above the sentence level is an important direction for MT research if it is to be applied to literature.
Abstract: What is the role of textual features above the sentence level in advancing the machine translation of literature? This paper examines how referential cohesion is expressed in literary and non-literary texts and how this cohesion affects translation. We first show in a corpus study on English that literary texts use more dense reference chains to express greater referential cohesion than news. We then compare the referential cohesion of machine versus human translations of Chinese literature and news. While human translators capture the greater referential cohesion of literature, Google translations perform less well at capturing literary cohesion. Our results suggest that incorporating discourse features above the sentence level is an important direction for MT research if it is to be applied to literature.

39 citations
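
Reference-chain density can be approximated directly from coreference chains; a small sketch under an assumed chain representation (lists of mention strings):

```python
def chain_density(chains, n_sentences):
    """Average mentions per chain and chains per sentence."""
    if not chains or n_sentences == 0:
        return 0.0, 0.0
    mentions_per_chain = sum(len(c) for c in chains) / len(chains)
    chains_per_sentence = len(chains) / n_sentences
    return mentions_per_chain, chains_per_sentence

# A literary-style passage: few chains, each densely re-mentioned.
print(chain_density([["Anna", "she", "her", "she"], ["the letter", "it"]], 5))
# (3.0, 0.4)
```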


Proceedings Article
12 Jul 2012
TL;DR: A new family of models for unsupervised parsing, Dependency and Boundary models, that use cues at constituent boundaries to inform head-outward dependency tree generation and induce state-of-the-art dependency grammars for many languages without special knowledge of optimal input sentence lengths or biased, manually-tuned initializers.
Abstract: We present a new family of models for unsupervised parsing, Dependency and Boundary models, that use cues at constituent boundaries to inform head-outward dependency tree generation. We build on three intuitions that are explicit in phrase-structure grammars but only implicit in standard dependency formulations: (i) Distributions of words that occur at sentence boundaries --- such as English determiners --- resemble constituent edges. (ii) Punctuation at sentence boundaries further helps distinguish full sentences from fragments like headlines and titles, allowing us to model grammatical differences between complete and incomplete sentences. (iii) Sentence-internal punctuation boundaries help with longer-distance dependencies, since punctuation correlates with constituent edges. Our models induce state-of-the-art dependency grammars for many languages without special knowledge of optimal input sentence lengths or biased, manually-tuned initializers.
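
Intuition (i) is easy to probe empirically: tally which tokens open sentences, and in English the determiners dominate. A toy sketch:

```python
from collections import Counter

def boundary_word_counts(sentences, k=5):
    """Count sentence-initial tokens, a rough proxy for constituent edges."""
    first = Counter(s.split()[0].lower() for s in sentences if s.split())
    return first.most_common(k)

print(boundary_word_counts(["The cat sat.", "The dog ran.", "A bird sang."]))
# [('the', 2), ('a', 1)]
```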

Proceedings Article
16 Aug 2012
TL;DR: This work proposes initially treating inter-punctuation fragments as atoms, making some simple phrases and clauses of complex sentences available to training sooner; the resulting partial data can be analyzed with reduced parsing models, which are shown to be easier to bootstrap than more nuanced grammars.
Abstract: Modern grammar induction systems often employ curriculum learning strategies that begin by training on a subset of all available input that is considered simpler than the full data. Traditionally, filtering has been at granularities of whole input units, e.g., discarding entire sentences with too many words or punctuation marks. We propose instead viewing interpunctuation fragments as atoms, initially, thus making some simple phrases and clauses of complex sentences available to training sooner. Splitting input text at punctuation in this way improved our state-of-the-art grammar induction pipeline. We observe that resulting partial data, i.e., mostly incomplete sentence fragments, can be analyzed using reduced parsing models which, we show, can be easier to bootstrap than more nuanced grammars. Starting with a new, bare dependency-and-boundary model (DBM-0), our grammar inducer attained 61.2% directed dependency accuracy on Section 23 (all sentences) of the Wall Street Journal corpus: more than 2% higher than previous published results for this task.
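
A minimal sketch of the splitting step, using an illustrative punctuation set rather than the paper's exact segmentation rules:

```python
import re

def fragments(sentence):
    """Split at sentence-internal punctuation; fragments become training atoms."""
    parts = re.split(r'[,;:()"-]+', sentence)
    return [p.strip() for p in parts if p.strip()]

print(fragments("The markets, which had rallied early, closed lower; volume was thin."))
# ['The markets', 'which had rallied early', 'closed lower', 'volume was thin.']
```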

Proceedings Article
07 Jun 2012
TL;DR: It is shown that orthographic cues can be helpful for unsupervised parsing, and combining capitalization with punctuation-induced constraints in inference further improved parsing performance, attaining state-of-the-art levels for many languages.
Abstract: We show that orthographic cues can be helpful for unsupervised parsing. In the Penn Treebank, transitions between upper- and lower-case tokens tend to align with the boundaries of base (English) noun phrases. Such signals can be used as partial bracketing constraints to train a grammar inducer: in our experiments, directed dependency accuracy increased by 2.2% (average over 14 languages having case information). Combining capitalization with punctuation-induced constraints in inference further improved parsing performance, attaining state-of-the-art levels for many languages.
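
A sketch of reading candidate brackets off case transitions; the span rule below is a simplification of the partial bracketing constraints used in the paper:

```python
def case_brackets(tokens):
    """Return (start, end) spans of maximal runs of capitalized tokens."""
    spans, start = [], None
    for i, tok in enumerate(tokens + [""]):   # sentinel closes a final run
        cap = tok[:1].isupper()
        if cap and start is None:
            start = i
        elif not cap and start is not None:
            spans.append((start, i))
            start = None
    return spans

print(case_brackets("Shares of New York Times Co. fell".split()))
# [(0, 1), (2, 6)]
```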


Proceedings Article
26 Jun 2012
TL;DR: This talk describes unsupervised approaches to learning rich knowledge, such as how detonating a bomb relates to destroying a building, or that a perpetrator who was convicted must previously have been arrested.
Abstract: The majority of information on the Internet is expressed in written text. Understanding and extracting this information is crucial to building intelligent systems that can organize this knowledge. Today, most algorithms focus on learning atomic facts and relations. For instance, we can reliably extract facts like "Annapolis is a City" by observing redundant word patterns across a corpus. However, these facts do not capture richer knowledge like the way detonating a bomb is related to destroying a building, or that the perpetrator who was convicted must have been arrested. A structured model of these events and entities is needed for a deeper understanding of language. This talk describes unsupervised approaches to learning such rich knowledge.