Showing papers by Jacob Eisenstein published in 2011

Proceedings Article

Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

Kevin Gimpel¹, Nathan Schneider¹, Brendan O'Connor¹, Dipanjan Das¹, Daniel Mills¹, Jacob Eisenstein¹, Michael Heilman¹, Dani Yogatama¹, Jeffrey Flanigan¹, Noah A. Smith¹

Carnegie Mellon University¹

19 Jun 2011

TL;DR: A tagset is developed, data is annotated, features are developed, and results nearing 90% accuracy are reported on the problem of part-of-speech tagging for English data from the popular micro-blogging service Twitter.

Abstract: We address the problem of part-of-speech tagging for English data from the popular micro-blogging service Twitter. We develop a tagset, annotate data, develop features, and report tagging results nearing 90% accuracy. The data and tools have been made available to the research community with the goal of enabling richer text analysis of Twitter and related social media data sets.

1,053 citations

Proceedings Article

Sparse Additive Generative Models of Text

Jacob Eisenstein¹, Amr Ahmed¹, Eric P. Xing¹

Carnegie Mellon University¹

28 Jun 2011

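TL;DR: Each class label or latent topic is modeled as a deviation in log-frequency from a constant background distribution, which has two key advantages: sparsity can be enforced to prevent overfitting, and generative facets can be combined through simple addition in log space, avoiding the need for latent switching variables.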

Abstract: Generative models of text typically associate a multinomial with every class label or topic. Even in simple models this requires the estimation of thousands of parameters; in multi-faceted latent variable models, standard approaches require additional latent "switching" variables for every token, complicating inference. In this paper, we propose an alternative generative model for text. The central idea is that each class label or latent topic is endowed with a model of the deviation in log-frequency from a constant background distribution. This approach has two key advantages: we can enforce sparsity to prevent overfitting, and we can combine generative facets through simple addition in log space, avoiding the need for latent switching variables. We demonstrate the applicability of this idea to a range of scenarios: classification, topic modeling, and more complex multifaceted generative models.
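
The additive combination described in the abstract can be illustrated with a minimal sketch. It assumes a toy four-word vocabulary and hand-picked deviation vectors; the function name `sage_word_distribution` and all numbers are hypothetical, and the sketch shows only how facets add in log space, not the paper's estimation or sparsity-inducing procedure.

```python
import math

def sage_word_distribution(background_log_freq, *deviations):
    """Combine a background log-frequency vector with any number of
    deviation vectors by addition in log space, then normalize.

    p(w) is proportional to exp(m_w + sum of eta_w over facets).
    """
    logits = list(background_log_freq)
    for eta in deviations:
        logits = [m + e for m, e in zip(logits, eta)]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical background distribution over four words.
background = [math.log(p) for p in (0.4, 0.3, 0.2, 0.1)]

# Hypothetical sparse deviations: most entries are exactly zero.
topic_eta = [0.0, 0.0, 1.5, 0.0]    # a topic that boosts word 2
region_eta = [0.0, -1.0, 0.0, 0.0]  # a second facet that suppresses word 1

p = sage_word_distribution(background, topic_eta, region_eta)
```

Because the facets are summed before normalization, no per-token switching variable is needed to decide which facet generated a word; a deviation of zero simply leaves the background frequency unchanged.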

335 citations

Journal Article

Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, Noah A. Smith

01 Jan 2011 - The Association for Computational Linguistics

152 citations

Proceedings Article

Discovering Sociolinguistic Associations with Structured Sparsity

Jacob Eisenstein¹, Noah A. Smith¹, Eric P. Xing¹

Carnegie Mellon University¹

19 Jun 2011

TL;DR: A method to discover robust and interpretable sociolinguistic associations from raw geotagged text data is presented, using aggregate demographic statistics about the authors' geographic communities to solve a multi-output regression problem between demographics and lexical frequencies.

Abstract: We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors' geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite ℓ1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. We perform two regression studies. First, we use term frequencies to predict demographic attributes; our method identifies a compact set of words that are strongly associated with author demographics. Next, we conjoin demographic attributes into features, which we use to predict term frequencies. The composite regularizer identifies a small number of features, which correspond to communities of authors united by shared demographic and linguistic properties.
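
As a rough illustration of the regularizer named in the abstract, the sketch below computes the standard ℓ1,∞ mixed norm of a coefficient matrix: the sum over rows of the largest absolute value in each row. The matrix `B`, the helper names, and the tolerance are hypothetical, and this is only the penalty term, not the paper's optimization procedure.

```python
def l1_inf_norm(coefficients):
    """Sum over rows of the maximum absolute entry in each row.

    Rows index predictors (for example, words) and columns index
    outputs (for example, demographic attributes).
    """
    return sum(max(abs(v) for v in row) for row in coefficients)

def active_rows(coefficients, tol=1e-12):
    """Indices of rows that are not entirely (numerically) zero."""
    return [i for i, row in enumerate(coefficients)
            if max(abs(v) for v in row) > tol]

# Hypothetical 3-predictor, 3-output coefficient matrix.
B = [
    [0.5, -2.0, 0.1],
    [0.0, 0.0, 0.0],    # a row driven entirely to zero
    [1.0, 0.25, -0.75],
]

penalty = l1_inf_norm(B)  # 2.0 + 0.0 + 1.0
kept = active_rows(B)     # rows 0 and 2 remain
```

Each row contributes only its largest magnitude to the penalty, so a row costs nothing once every entry in it is zero. That is why minimizing a loss plus this penalty tends to discard whole predictors rather than scattered individual coefficients.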

142 citations

Proceedings Article

Unified analysis of streaming news

Amr Ahmed¹, Qirong Ho¹, Jacob Eisenstein¹, Eric P. Xing¹, Alexander J. Smola², Choon Hui Teo²

Carnegie Mellon University¹, Yahoo!²

28 Mar 2011

TL;DR: This paper presents a unified framework to group incoming news articles into temporary but tightly-focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve.

Abstract: News clustering, categorization and analysis are key components of any news portal. They require algorithms capable of dealing with dynamic data to cluster, interpret and to temporally aggregate news articles. These three tasks are often solved separately. In this paper we present a unified framework to group incoming news articles into temporary but tightly-focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve. We achieve this by building a hybrid clustering and topic model. To deal with the available wealth of data we build an efficient parallel inference algorithm by sequential Monte Carlo estimation. Time and memory costs are nearly constant in the length of the history, and the approach scales to hundreds of thousands of documents. We demonstrate the efficiency and accuracy on the publicly available TDT dataset and data of a major internet news site.

90 citations

Proceedings Article

Online Inference for the Infinite Topic-Cluster Model: Storylines from Streaming Text

Amr Ahmed¹, Qirong Ho¹, Choon Hui Teo², Jacob Eisenstein¹, Alexander J. Smola¹, Eric P. Xing¹

Carnegie Mellon University¹, Yahoo!²

14 Jun 2011

TL;DR: The time-dependent topic-cluster model is presented, a hierarchical approach for combining Latent Dirichlet Allocation and clustering via the Recurrent Chinese Restaurant Process which inherits the advantages of both of its constituents, namely interpretability and concise representation.

Abstract: We present the time-dependent topic-cluster model, a hierarchical approach for combining Latent Dirichlet Allocation and clustering via the Recurrent Chinese Restaurant Process. It inherits the advantages of both of its constituents, namely interpretability and concise representation. We show how it can be applied to streaming collections of objects such as real-world feeds in a news portal. We provide details of a parallel Sequential Monte Carlo algorithm to perform inference in the resulting graphical model, which scales to hundreds of thousands of documents.

77 citations

A Mixture Model of Demographic Lexical Variation

Brendan O'Connor, Jacob Eisenstein, Eric P. Xing, Noah A. Smith

01 Jan 2011

TL;DR: A Bayesian generative model of how demographic social factors influence lexical choice is proposed for a corpus of geo-tagged Twitter messages originating from mobile phones, cross-referenced against U.S. Census demographic data.

Abstract: We propose a Bayesian generative model of how demographic social factors influence lexical choice. We apply the method to a corpus of geo-tagged Twitter messages originating from mobile phones, cross-referenced against U.S. Census demographic data. Our method discovers communities jointly defined by linguistic and demographic properties.

27 citations

Posted Content

TopicViz: Semantic Navigation of Document Collections

Jacob Eisenstein, Duen Horng "Polo" Chau, Aniket Kittur, Eric P. Xing

27 Oct 2011 - arXiv: Human-Computer Interaction

TL;DR: This paper shows how topic modeling -- a technique for identifying latent themes across large collections of documents -- can support semantic exploration, and presents TopicViz, an interactive environment for information exploration.

Abstract: When people explore and manage information, they think in terms of topics and themes. However, the software that supports information exploration sees text at only the surface level. In this paper we show how topic modeling -- a technique for identifying latent themes across large collections of documents -- can support semantic exploration. We present TopicViz, an interactive environment for information exploration. TopicViz combines traditional search and citation-graph functionality with a range of novel interactive visualizations, centered around a force-directed layout that links documents to the latent themes discovered by the topic model. We describe several use scenarios in which TopicViz supports rapid sensemaking on large document collections.

12 citations

Proceedings Article

Structured Databases of Named Entities from Bayesian Nonparametrics

Jacob Eisenstein¹, Tae Yano¹, William W. Cohen¹, Noah A. Smith¹, Eric P. Xing¹

Carnegie Mellon University¹

30 Jul 2011

TL;DR: Empirical evaluation shows that a nonparametric Bayesian approach to extracting a structured database of entities from text learns an accurate database of entities and a sensible model of name structure.

Abstract: We present a nonparametric Bayesian approach to extract a structured database of entities from text. Neither the number of entities nor the fields that characterize each entity are provided in advance; the only supervision is a set of five prototype examples. Our method jointly accomplishes three tasks: (i) identifying a set of canonical entities, (ii) inferring a schema for the fields that describe each entity, and (iii) matching entities to their references in raw text. Empirical evaluation shows that the approach learns an accurate database of entities and a sensible model of name structure.

8 citations