
Showing papers by "Jacob Eisenstein published in 2014"


Journal ArticleDOI
TL;DR: This paper studied the relationship between gender, linguistic style, and social networks, using a novel corpus of 14,000 Twitter users and found that social network homophily is correlated with the use of same-gender language markers.
Abstract: We present a study of the relationship between gender, linguistic style, and social networks, using a novel corpus of 14,000 Twitter users. Prior quantitative work on gender often treats this social variable as a female/male binary; we argue for a more nuanced approach. By clustering Twitter users, we find a natural decomposition of the dataset into various styles and topical interests. Many clusters have strong gender orientations, but their use of linguistic resources sometimes directly conflicts with the population-level language statistics. We view these clusters as a more accurate reflection of the multifaceted nature of gendered language styles. Previous corpus-based work has also had little to say about individuals whose linguistic styles defy population-level gender patterns. To identify such individuals, we train a statistical classifier, and measure the classifier confidence for each individual in the dataset. Examining individuals whose language does not match the classifier's model for their gender, we find that they have social networks that include significantly fewer same-gender social connections and that, in general, social network homophily is correlated with the use of same-gender language markers. Pairing computational methods and social theory thus offers a new perspective on how gender emerges as individuals position themselves relative to audiences, topics, and mainstream gender norms.

299 citations
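The classifier-confidence step described in the abstract lends itself to a small illustration. The sketch below is not the authors' model; it is a minimal multinomial Naive Bayes in plain Python (the helpers `train_nb` and `own_label_confidence` are hypothetical names), in which a low posterior probability on a user's own gender label flags individuals whose language defies population-level patterns:

```python
from collections import Counter
import math

def train_nb(docs, labels, alpha=1.0):
    """Multinomial Naive Bayes with add-alpha smoothing; returns a
    function mapping a word list to per-class log-posterior scores."""
    classes = sorted(set(labels))
    counts = {c: Counter() for c in classes}
    priors = Counter(labels)
    for words, c in zip(docs, labels):
        counts[c].update(words)
    vocab = {w for c in classes for w in counts[c]}
    totals = {c: sum(counts[c].values()) for c in classes}

    def log_posterior(words):
        scores = {}
        for c in classes:
            s = math.log(priors[c] / len(labels))
            for w in words:
                s += math.log((counts[c][w] + alpha) /
                              (totals[c] + alpha * len(vocab)))
            scores[c] = s
        return scores

    return log_posterior

def own_label_confidence(log_posterior, words, label):
    """Normalized posterior probability of the user's own label;
    low values flag users whose language defies population-level patterns."""
    scores = log_posterior(words)
    z = max(scores.values())
    total = sum(math.exp(s - z) for s in scores.values())
    return math.exp(scores[label] - z) / total
```

On a toy corpus, a user whose word choices match the population statistics for their label receives high confidence, while a mismatched user scores low; in the study this score is then correlated with the gender makeup of each user's social network.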


Proceedings ArticleDOI
01 Jun 2014
TL;DR: A representation learning approach in which surface features are transformed into a latent space that facilitates RST discourse parsing; the resulting shift-reduce parser obtains substantial improvements over the previous state-of-the-art in predicting relations and nuclearity on the RST Treebank.
Abstract: Text-level discourse parsing is notoriously difficult, as distinctions between discourse relations require subtle semantic judgments that are not easily captured using standard features. In this paper, we present a representation learning approach, in which we transform surface features into a latent space that facilitates RST discourse parsing. By combining the machinery of large-margin transition-based structured prediction with representation learning, our method jointly learns to parse discourse while at the same time learning a discourse-driven projection of surface features. The resulting shift-reduce discourse parser obtains substantial improvements over the previous state-of-the-art in predicting relations and nuclearity on the RST Treebank.

267 citations
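The shift-reduce machinery the abstract refers to can be sketched separately from the learned representation. In the hedged sketch below, the `choose_action` callback is a stand-in for the paper's large-margin classifier over projected surface features; only the generic transition loop is shown:

```python
def shift_reduce_parse(edus, choose_action):
    """Parse a sequence of elementary discourse units (EDUs) into a binary
    discourse tree via shift/reduce transitions. `choose_action` stands in
    for a learned classifier over (projected) surface features."""
    stack, queue = [], list(edus)
    while queue or len(stack) > 1:
        if len(stack) < 2:
            action = "shift"          # cannot reduce with fewer than 2 items
        elif not queue:
            action = "reduce"         # nothing left to shift
        else:
            action = choose_action(stack, queue)
        if action == "shift":
            stack.append(queue.pop(0))
        else:
            right = stack.pop()
            left = stack.pop()
            # a real parser would also predict the relation label and nuclearity
            stack.append(("rel", left, right))
    return stack[0]

# greedy policy stub: always shift when a choice exists
tree = shift_reduce_parse(["e1", "e2", "e3"], lambda s, q: "shift")
```

With the always-shift stub, the three EDUs are consumed and then reduced right-to-left, yielding a right-branching tree; the interesting modeling work in the paper is entirely inside the action classifier and its latent feature projection.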


Journal ArticleDOI
19 Nov 2014-PLOS ONE
TL;DR: Using a latent vector autoregressive model to aggregate across thousands of words, this study identifies high-level patterns in the diffusion of linguistic change over the United States and offers support for prior arguments that focus on geographical proximity and population size.
Abstract: Computer-mediated communication is driving fundamental changes in the nature of written language. We investigate these changes by statistical analysis of a dataset comprising 107 million Twitter messages (authored by 2.7 million unique user accounts). Using a latent vector autoregressive model to aggregate across thousands of words, we identify high-level patterns in diffusion of linguistic change over the United States. Our model is robust to unpredictable changes in Twitter's sampling rate, and provides a probabilistic characterization of the relationship of macro-scale linguistic influence to a set of demographic and geographic predictors. The results of this analysis offer support for prior arguments that focus on geographical proximity and population size. However, demographic similarity – especially with regard to race – plays an even more central role, as cities with similar racial demographics are far more likely to share linguistic influence. Rather than moving towards a single unified “netspeak” dialect, language evolution in computer-mediated communication reproduces existing fault lines in spoken American English.

238 citations
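The autoregressive core of the model can be illustrated with a deliberately simplified scalar version. The paper's latent vector autoregression aggregates thousands of words and is robust to sampling-rate changes; the hypothetical `influence` helper below only shows the basic idea of estimating how much one city's past usage predicts another's present usage (ordinary least squares through the origin, centered series assumed):

```python
def influence(source, target):
    """OLS slope of target_t on source_{t-1}: a crude estimate of how much
    yesterday's word frequency in `source` predicts today's in `target`.
    Both arguments are equal-length time series of (centered) frequencies."""
    x = source[:-1]   # lagged predictor
    y = target[1:]    # aligned response
    num = sum(a * b for a, b in zip(x, y))
    den = sum(a * a for a in x)
    return num / den
```

In the full model, a matrix of such influence coefficients is latent and inferred jointly with demographic and geographic predictors, which is how the race-similarity effect described in the abstract is quantified.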


Posted Content
TL;DR: The authors used a downward compositional pass to predict implicit discourse relations from the distributional representations of the arguments, and also of their coreferent entity mentions, and obtained substantial improvements over the previous state-of-the-art in predicting implicit relations in the Penn Discourse Treebank.
Abstract: Discourse relations bind smaller linguistic units into coherent texts. However, automatically identifying discourse relations is difficult, because it requires understanding the semantics of the linked arguments. A more subtle challenge is that it is not enough to represent the meaning of each argument of a discourse relation, because the relation may depend on links between lower-level components, such as entity mentions. Our solution computes distributional meaning representations by composition up the syntactic parse tree. A key difference from previous work on compositional distributional semantics is that we also compute representations for entity mentions, using a novel downward compositional pass. Discourse relations are predicted from the distributional representations of the arguments, and also of their coreferent entity mentions. The resulting system obtains substantial improvements over the previous state-of-the-art in predicting implicit discourse relations in the Penn Discourse Treebank.

93 citations
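The upward and downward passes can be sketched with toy composition functions. This is not the paper's learned model: simple vector averaging stands in for the learned upward composition, and a fixed linear blend stands in for the downward pass that lets entity mentions inherit context from the whole argument. Trees are `(label, children)` tuples with string leaves:

```python
def upward(node, embed):
    """Bottom-up pass: each node's vector is the mean of its children's
    vectors (a stand-in for the learned composition function)."""
    if isinstance(node, str):
        return {"word": node, "vec": list(embed[node]), "kids": []}
    label, children = node
    kids = [upward(c, embed) for c in children]
    vec = [sum(v) / len(kids) for v in zip(*[k["vec"] for k in kids])]
    return {"label": label, "vec": vec, "kids": kids}

def downward(node, parent_vec=None, mix=0.5):
    """Top-down pass: blend each node's vector with its parent's, so leaf
    mentions end up carrying information about their full argument."""
    if parent_vec is not None:
        node["vec"] = [mix * a + (1 - mix) * b
                       for a, b in zip(node["vec"], parent_vec)]
    for k in node["kids"]:
        downward(k, node["vec"], mix)
    return node
```

A relation classifier would then read features from the argument-root vectors and from the (downward-enriched) vectors of coreferent entity mentions in each argument.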



Proceedings ArticleDOI
01 Jun 2014
TL;DR: Readers are indeed influenced by such framing devices when assessing the certainty of quoted content, and there is no evidence that they consider other factors, such as the source, journalist, or the content itself.
Abstract: How do journalists mark quoted content as certain or uncertain, and how do readers interpret these signals? Predicates such as thinks, claims, and admits offer a range of options for framing quoted content according to the author’s own perceptions of its credibility. We gather a new dataset of direct and indirect quotes from Twitter, and obtain annotations of the perceived certainty of the quoted statements. We then compare the ability of linguistic and extra-linguistic features to predict readers’ assessment of the certainty of quoted content. We see that readers are indeed influenced by such framing devices — and we find no evidence that they consider other factors, such as the source, journalist, or the content itself. In addition, we examine the impact of specific framing devices on perceptions of credibility.

60 citations


Proceedings ArticleDOI
01 Jun 2014
TL;DR: This work proposes a new technique called marginalized structured dropout, which exploits feature structure to obtain a remarkably simple and efficient feature projection in the context of fine-grained part-of-speech tagging on a dataset of historical Portuguese.
Abstract: Unsupervised domain adaptation often relies on transforming the instance representation. However, most such approaches are designed for bag-of-words models, and ignore the structured features present in many problems in NLP. We propose a new technique called marginalized structured dropout, which exploits feature structure to obtain a remarkably simple and efficient feature projection. Applied to the task of fine-grained part-of-speech tagging on a dataset of historical Portuguese, marginalized structured dropout yields state-of-the-art accuracy while increasing speed by more than an order of magnitude over previous work.

33 citations
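The marginalization idea behind structured dropout can be demonstrated on a toy feature vector. The sketch below is an illustration, not the paper's training procedure: if corruption deletes one of K feature templates uniformly at random, then the expected corrupted vector is just each feature scaled by (K-1)/K, so no corrupted copies need to be generated (here `corruptions` enumerates them only to verify the closed form):

```python
def corruptions(x, templates):
    """All single-template deletions of feature vector x (dict: name -> value).
    `templates` maps each feature name to its template id."""
    ids = sorted({templates[f] for f in x})
    return [{f: v for f, v in x.items() if templates[f] != drop}
            for drop in ids]

def marginalized(x, templates):
    """Expected corrupted vector in closed form: each feature survives
    (K-1)/K of the corruptions, so enumeration is unnecessary."""
    k = len({templates[f] for f in x})
    return {f: v * (k - 1) / k for f, v in x.items()}
```

The efficiency claim in the abstract follows from this kind of collapse: the expectation over corruptions is available analytically, so adaptation costs little more than a rescaling.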


Proceedings ArticleDOI
01 Jun 2014
TL;DR: A new approach to inducing the syntactic categories of words, combining their distributional and morphological properties in a joint nonparametric Bayesian model based on the distance-dependent Chinese Restaurant Process, which outperforms competitive alternatives on English POS induction.
Abstract: We present a new approach to inducing the syntactic categories of words, combining their distributional and morphological properties in a joint nonparametric Bayesian model based on the distance-dependent Chinese Restaurant Process. The prior distribution over word clusterings uses a log-linear model of morphological similarity; the likelihood function is the probability of generating vector word embeddings. The weights of the morphology model are learned jointly while inducing part-of-speech clusters, encouraging them to cohere with the distributional features. The resulting algorithm outperforms competitive alternatives on English POS induction.

9 citations
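The distance-dependent CRP prior can be sketched via its customer-link construction. In the sketch below, the `decay` callable is a placeholder for the paper's log-linear model of morphological similarity: each word links to another with probability proportional to the decayed distance, or to itself with probability proportional to a concentration parameter, and clusters are the connected components of the link graph:

```python
import math
import random

def ddcrp_partition(n, decay, dist, alpha=1.0, seed=0):
    """Sample a partition from a distance-dependent CRP.
    Each item links to item j with weight decay(dist(i, j)), or to
    itself with weight alpha; clusters = connected components of links."""
    rng = random.Random(seed)
    links = []
    for i in range(n):
        weights = [alpha if j == i else decay(dist(i, j)) for j in range(n)]
        r = rng.random() * sum(weights)
        for j, w in enumerate(weights):
            r -= w
            if r <= 0:
                links.append(j)
                break
    # connected components via union-find with path halving
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i, j in enumerate(links):
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

In the full model this prior is combined with a likelihood over word embeddings, and the decay function's weights are learned jointly with the induced clusters.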


Posted Content
TL;DR: On a dataset of movie scripts, the system obtains a coherent clustering of address terms, while at the same time making intuitively plausible judgments of the formality of social relations in each film.
Abstract: We present an unsupervised model for inducing signed social networks from the content exchanged across network edges. Inference in this model solves three problems simultaneously: (1) identifying the sign of each edge; (2) characterizing the distribution over content for each edge type; (3) estimating weights for triadic features that map to theoretical models such as structural balance. We apply this model to the problem of inducing the social function of address terms, such as 'Madame', 'comrade', and 'dude'. On a dataset of movie scripts, our system obtains a coherent clustering of address terms, while at the same time making intuitively plausible judgments of the formality of social relations in each film. As an additional contribution, we provide a bootstrapping technique for identifying and tagging address terms in dialogue.

8 citations
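The triadic features mentioned in the abstract are easy to make concrete. The paper learns weights over such features jointly with edge signs and content; the hypothetical `triad_features` helper below only shows the feature computation for the classic structural-balance case, where a triangle is balanced when the product of its edge signs is positive:

```python
from itertools import combinations

def triad_features(signs):
    """Count balanced vs. unbalanced triangles in a signed graph.
    `signs` maps frozenset({u, v}) -> +1 or -1."""
    nodes = set()
    for edge in signs:
        nodes |= edge
    balanced = unbalanced = 0
    for a, b, c in combinations(sorted(nodes), 3):
        edges = [frozenset({a, b}), frozenset({b, c}), frozenset({a, c})]
        if all(e in signs for e in edges):   # only fully observed triangles
            prod = 1
            for e in edges:
                prod *= signs[e]
            if prod > 0:
                balanced += 1
            else:
                unbalanced += 1
    return balanced, unbalanced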


01 Jan 2014
TL;DR: Synthesizing current research on exploratory data analysis with techniques from the fields of computational linguistics and data visualization, a new set of methods are proposed to assist humanities scholars in computationallyassisted exploratory research.
Abstract: On July 19th, 1848, 300 concerned United States citizens gathered in Seneca Falls, New York, for the women’s rights convention that would culminate in the signing of the Declaration of Rights and Sentiments, the first major document (in the US) to call for women’s right to vote. In The North Star, Frederick Douglass, the former slave turned abolitionist, extolled the event as a “grand movement for attaining the civil, social, political, and religious rights of women” (1848). In the Oneida Whig, the same event was ridiculed as the “most shocking and unnatural event ever recorded in the history of womanity” (1848). As demonstrated by these contradictory accounts, published opinions varied greatly -about the women’s rights movement in the nineteenthcentury United States, and about current events generally conceived. Large-scale digitization projects have increasingly enabled humanities scholars to search newspapers, such as those just cited, for significant words and phrases. But exploring more open-ended questions such as, “How did the discourse surrounding women’s rights in the United States change in the wake of the 1848 Seneca Falls Convention?” or “Did the women’s rights movement borrow language from the nation’s contemporaneous anti-slavery campaign?” remains a challenge. Synthesizing current research on exploratory data analysis with techniques from the fields of computational linguistics and data visualization, we propose a new set of methods to assist humanities scholars in computationallyassisted exploratory research.

7 citations


Posted Content
TL;DR: It is shown that a novel but simple feature embedding approach provides better performance, by exploiting the feature template structure common in NLP problems.
Abstract: Representation learning is the dominant technique for unsupervised domain adaptation, but existing approaches often require the specification of "pivot features" that generalize across domains, which are selected by task-specific heuristics. We show that a novel but simple feature embedding approach provides better performance, by exploiting the feature template structure common in NLP problems.


Proceedings ArticleDOI
01 Jun 2014
TL;DR: A novel model for web forums is presented, which captures both thematic content as well as user-specific interests and identifies several topics of concern to individuals who report being on the autism spectrum.
Abstract: Discussion forums offer a new source of insight for the experiences and challenges faced by individuals affected by mental disorders. Language technology can help domain experts gather insight from these forums, by aggregating themes and user behaviors across thousands of conversations. We present a novel model for web forums, which captures both thematic content as well as user-specific interests. Applying this model to the Aspies Central forum (which covers issues related to Asperger’s syndrome and autism spectrum disorder), we identify several topics of concern to individuals who report being on the autism spectrum. We perform the evaluation on the data collected from Aspies Central forum, including 1,939 threads, 29,947 posts and 972 users. Quantitative evaluations demonstrate that the topics extracted by this model are substantially more than those obtained by Latent Dirichlet Allocation and the Author-Topic Model. Qualitative analysis by subjectmatter experts suggests intriguing directions for future investigation.

Posted Content
TL;DR: This article used a downward compositional pass to predict implicit discourse relations in the Penn Discourse Treebank and achieved substantial improvements over the previous state-of-the-art in predicting implicit discourse relation.
Abstract: Discourse relations bind smaller linguistic elements into coherent texts. However, automatically identifying discourse relations is difficult, because it requires understanding the semantics of the linked sentences. A more subtle challenge is that it is not enough to represent the meaning of each sentence of a discourse relation, because the relation may depend on links between lower-level elements, such as entity mentions. Our solution computes distributional meaning representations by composition up the syntactic parse tree. A key difference from previous work on compositional distributional semantics is that we also compute representations for entity mentions, using a novel downward compositional pass. Discourse relations are predicted not only from the distributional representations of the sentences, but also of their coreferent entity mentions. The resulting system obtains substantial improvements over the previous state-of-the-art in predicting implicit discourse relations in the Penn Discourse Treebank.

Journal ArticleDOI
TL;DR: In this article, a hierarchical Bayesian model with joint inference is proposed to infer the semantic properties of documents by leveraging free-text keyphrase annotations, which can be used to summarize single and multiple documents into a set of semantically salient keyphrases.
Abstract: This paper presents a new method for inferring the semantic properties of documents by leveraging free-text keyphrase annotations. Such annotations are becoming increasingly abundant due to the recent dramatic growth in semi-structured, user-generated online content. One especially relevant domain is product reviews, which are often annotated by their authors with pros/cons keyphrases such as a real bargain or good value. These annotations are representative of the underlying semantic properties; however, unlike expert annotations, they are noisy: lay authors may use different labels to denote the same property, and some labels may be missing. To learn using such noisy annotations, we find a hidden paraphrase structure which clusters the keyphrases. The paraphrase structure is linked with a latent topic model of the review texts, enabling the system to predict the properties of unannotated documents and to effectively aggregate the semantic properties of multiple reviews. Our approach is implemented as a hierarchical Bayesian model with joint inference. We find that joint inference increases the robustness of the keyphrase clustering and encourages the latent topics to correlate with semantically meaningful properties. Multiple evaluations demonstrate that our model substantially outperforms alternative approaches for summarizing single and multiple documents into a set of semantically salient keyphrases.


Posted Content
17 Nov 2014
TL;DR: Working without any labeled data, this system offers a coherent clustering of address terms, while at the same time making intuitively plausible judgments of the formality of social relations in each film.
Abstract: We present an unsupervised model for inducing signed social networks from the content exchanged across network edges. Inference in this model solves three problems simultaneously: (1) identifying the sign of each edge; (2) characterizing the distribution over content for each edge type; (3) estimating weights for triadic features that map to theoretical models such as structural balance. We apply this model to the problem of inducing the social function of address terms, such as 'Madame', 'comrade', and 'dude'. On a dataset of movie scripts, our system obtains a coherent clustering of address terms, while at the same time making intuitively plausible judgments of the formality of social relations in each film. As an additional contribution, we provide a bootstrapping technique for identifying and tagging address terms in dialogue.

Proceedings Article
01 Jan 2014
TL;DR: This paper used a downward compositional pass to predict implicit discourse relations in the Penn Discourse Treebank and achieved substantial improvements over the previous state-of-the-art in predicting implicit discourse relation.
Abstract: Discourse relations bind smaller linguistic elements into coherent texts. However, automatically identifying discourse relations is difficult, because it requires understanding the semantics of the linked sentences. A more subtle challenge is that it is not enough to represent the meaning of each sentence of a discourse relation, because the relation may depend on links between lower-level elements, such as entity mentions. Our solution computes distributional meaning representations by composition up the syntactic parse tree. A key difference from previous work on compositional distributional semantics is that we also compute representations for entity mentions, using a novel downward compositional pass. Discourse relations are predicted not only from the distributional representations of the sentences, but also of their coreferent entity mentions. The resulting system obtains substantial improvements over the previous state-of-the-art in predicting implicit discourse relations in the Penn Discourse Treebank.

Journal ArticleDOI
TL;DR: This article used multilingual learning for unsupervised part-of-speech tagging and found that by combining cues from multiple languages, the structure of each becomes more apparent, and showed that using multilingual evidence can achieve impressive performance gains across a range of scenarios.
Abstract: We demonstrate the effectiveness of multilingual learning for unsupervised part-of-speech tagging. The central assumption of our work is that by combining cues from multiple languages, the structure of each becomes more apparent. We consider two ways of applying this intuition to the problem of unsupervised part-of-speech tagging: a model that directly merges tag structures for a pair of languages into a single sequence and a second model which instead incorporates multilingual context using latent variables. Both approaches are formulated as hierarchical Bayesian models, using Markov Chain Monte Carlo sampling techniques for inference. Our results demonstrate that by incorporating multilingual evidence we can achieve impressive performance gains across a range of scenarios. We also found that performance improves steadily as the number of available languages increases.