
Showing papers by "Jacob Eisenstein published in 2014"


Journal ArticleDOI
TL;DR: This paper studied the relationship between gender, linguistic style, and social networks, using a novel corpus of 14,000 Twitter users and found that social network homophily is correlated with the use of same-gender language markers.
Abstract: We present a study of the relationship between gender, linguistic style, and social networks, using a novel corpus of 14,000 Twitter users. Prior quantitative work on gender often treats this social variable as a female/male binary; we argue for a more nuanced approach. By clustering Twitter users, we find a natural decomposition of the dataset into various styles and topical interests. Many clusters have strong gender orientations, but their use of linguistic resources sometimes directly conflicts with the population-level language statistics. We view these clusters as a more accurate reflection of the multifaceted nature of gendered language styles. Previous corpus-based work has also had little to say about individuals whose linguistic styles defy population-level gender patterns. To identify such individuals, we train a statistical classifier, and measure the classifier confidence for each individual in the dataset. Examining individuals whose language does not match the classifier's model for their gender, we find that they have social networks that include significantly fewer same-gender social connections and that, in general, social network homophily is correlated with the use of same-gender language markers. Pairing computational methods and social theory thus offers a new perspective on how gender emerges as individuals position themselves relative to audiences, topics, and mainstream gender norms.

299 citations
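The classifier-confidence step described in the abstract lends itself to a small illustration. The sketch below is not the authors' model; it is a minimal multinomial Naive Bayes in plain Python (the helpers `train_nb` and `own_label_confidence` are hypothetical names), in which a low posterior probability on a user's own gender label flags individuals whose language defies population-level patterns:

```python
from collections import Counter
import math

def train_nb(docs, labels, alpha=1.0):
    """Multinomial Naive Bayes with add-alpha smoothing; returns a
    function mapping a word list to per-class log-posterior scores."""
    classes = sorted(set(labels))
    counts = {c: Counter() for c in classes}
    priors = Counter(labels)
    for words, c in zip(docs, labels):
        counts[c].update(words)
    vocab = {w for c in classes for w in counts[c]}
    totals = {c: sum(counts[c].values()) for c in classes}

    def log_posterior(words):
        scores = {}
        for c in classes:
            s = math.log(priors[c] / len(labels))
            for w in words:
                s += math.log((counts[c][w] + alpha) /
                              (totals[c] + alpha * len(vocab)))
            scores[c] = s
        return scores

    return log_posterior

def own_label_confidence(log_posterior, words, label):
    """Normalized posterior probability of the user's own label;
    low values flag users whose language defies population-level patterns."""
    scores = log_posterior(words)
    z = max(scores.values())
    total = sum(math.exp(s - z) for s in scores.values())
    return math.exp(scores[label] - z) / total
```

On a toy corpus, a user whose word choices match the population statistics for their label receives high confidence, while a mismatched user scores low; in the study this score is then correlated with the gender makeup of each user's social network.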


Proceedings ArticleDOI
01 Jun 2014
TL;DR: A representation learning approach in which surface features are transformed into a latent space that facilitates RST discourse parsing; the resulting shift-reduce parser obtains substantial improvements over the previous state-of-the-art in predicting relations and nuclearity on the RST Treebank.
Abstract: Text-level discourse parsing is notoriously difficult, as distinctions between discourse relations require subtle semantic judgments that are not easily captured using standard features. In this paper, we present a representation learning approach, in which we transform surface features into a latent space that facilitates RST discourse parsing. By combining the machinery of large-margin transition-based structured prediction with representation learning, our method jointly learns to parse discourse while at the same time learning a discourse-driven projection of surface features. The resulting shift-reduce discourse parser obtains substantial improvements over the previous state-of-the-art in predicting relations and nuclearity on the RST Treebank.

267 citations
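The shift-reduce machinery the abstract refers to can be sketched separately from the learned representation. In the hedged sketch below, the `choose_action` callback is a stand-in for the paper's large-margin classifier over projected surface features; only the generic transition loop is shown:

```python
def shift_reduce_parse(edus, choose_action):
    """Parse a sequence of elementary discourse units (EDUs) into a binary
    discourse tree via shift/reduce transitions. `choose_action` stands in
    for a learned classifier over (projected) surface features."""
    stack, queue = [], list(edus)
    while queue or len(stack) > 1:
        if len(stack) < 2:
            action = "shift"          # cannot reduce with fewer than 2 items
        elif not queue:
            action = "reduce"         # nothing left to shift
        else:
            action = choose_action(stack, queue)
        if action == "shift":
            stack.append(queue.pop(0))
        else:
            right = stack.pop()
            left = stack.pop()
            # a real parser would also predict the relation label and nuclearity
            stack.append(("rel", left, right))
    return stack[0]

# greedy policy stub: always shift when a choice exists
tree = shift_reduce_parse(["e1", "e2", "e3"], lambda s, q: "shift")
```

With the always-shift stub, the three EDUs are consumed and then reduced right-to-left, yielding a right-branching tree; the interesting modeling work in the paper is entirely inside the action classifier and its latent feature projection.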


Journal ArticleDOI
19 Nov 2014-PLOS ONE
TL;DR: Using a latent vector autoregressive model to aggregate across thousands of words, this study identifies high-level patterns in the diffusion of linguistic change over the United States and offers support for prior arguments that focus on geographical proximity and population size.
Abstract: Computer-mediated communication is driving fundamental changes in the nature of written language. We investigate these changes by statistical analysis of a dataset comprising 107 million Twitter messages (authored by 2.7 million unique user accounts). Using a latent vector autoregressive model to aggregate across thousands of words, we identify high-level patterns in diffusion of linguistic change over the United States. Our model is robust to unpredictable changes in Twitter's sampling rate, and provides a probabilistic characterization of the relationship of macro-scale linguistic influence to a set of demographic and geographic predictors. The results of this analysis offer support for prior arguments that focus on geographical proximity and population size. However, demographic similarity – especially with regard to race – plays an even more central role, as cities with similar racial demographics are far more likely to share linguistic influence. Rather than moving towards a single unified “netspeak” dialect, language evolution in computer-mediated communication reproduces existing fault lines in spoken American English.

238 citations
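The autoregressive core of the model can be illustrated with a deliberately simplified scalar version. The paper's latent vector autoregression aggregates thousands of words and is robust to sampling-rate changes; the hypothetical `influence` helper below only shows the basic idea of estimating how much one city's past usage predicts another's present usage (ordinary least squares through the origin, centered series assumed):

```python
def influence(source, target):
    """OLS slope of target_t on source_{t-1}: a crude estimate of how much
    yesterday's word frequency in `source` predicts today's in `target`.
    Both arguments are equal-length time series of (centered) frequencies."""
    x = source[:-1]   # lagged predictor
    y = target[1:]    # aligned response
    num = sum(a * b for a, b in zip(x, y))
    den = sum(a * a for a in x)
    return num / den
```

In the full model, a matrix of such influence coefficients is latent and inferred jointly with demographic and geographic predictors, which is how the race-similarity effect described in the abstract is quantified.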


Posted Content
TL;DR: The authors used a downward compositional pass to predict implicit discourse relations from the distributional representations of the arguments, and also of their coreferent entity mentions, and obtained substantial improvements over the previous state-of-the-art in predicting implicit relations in the Penn Discourse Treebank.
Abstract: Discourse relations bind smaller linguistic units into coherent texts. However, automatically identifying discourse relations is difficult, because it requires understanding the semantics of the linked arguments. A more subtle challenge is that it is not enough to represent the meaning of each argument of a discourse relation, because the relation may depend on links between lower-level components, such as entity mentions. Our solution computes distributional meaning representations by composition up the syntactic parse tree. A key difference from previous work on compositional distributional semantics is that we also compute representations for entity mentions, using a novel downward compositional pass. Discourse relations are predicted from the distributional representations of the arguments, and also of their coreferent entity mentions. The resulting system obtains substantial improvements over the previous state-of-the-art in predicting implicit discourse relations in the Penn Discourse Treebank.

93 citations
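The upward and downward passes can be sketched with toy composition functions. This is not the paper's learned model: simple vector averaging stands in for the learned upward composition, and a fixed linear blend stands in for the downward pass that lets entity mentions inherit context from the whole argument. Trees are `(label, children)` tuples with string leaves:

```python
def upward(node, embed):
    """Bottom-up pass: each node's vector is the mean of its children's
    vectors (a stand-in for the learned composition function)."""
    if isinstance(node, str):
        return {"word": node, "vec": list(embed[node]), "kids": []}
    label, children = node
    kids = [upward(c, embed) for c in children]
    vec = [sum(v) / len(kids) for v in zip(*[k["vec"] for k in kids])]
    return {"label": label, "vec": vec, "kids": kids}

def downward(node, parent_vec=None, mix=0.5):
    """Top-down pass: blend each node's vector with its parent's, so leaf
    mentions end up carrying information about their full argument."""
    if parent_vec is not None:
        node["vec"] = [mix * a + (1 - mix) * b
                       for a, b in zip(node["vec"], parent_vec)]
    for k in node["kids"]:
        downward(k, node["vec"], mix)
    return node
```

A relation classifier would then read features from the argument-root vectors and from the (downward-enriched) vectors of coreferent entity mentions in each argument.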



Proceedings ArticleDOI
01 Jun 2014
TL;DR: Readers are indeed influenced by such framing devices when assessing the certainty of quoted content, and there is no evidence that they consider other factors, such as the source, journalist, or the content itself.
Abstract: How do journalists mark quoted content as certain or uncertain, and how do readers interpret these signals? Predicates such as thinks, claims, and admits offer a range of options for framing quoted content according to the author’s own perceptions of its credibility. We gather a new dataset of direct and indirect quotes from Twitter, and obtain annotations of the perceived certainty of the quoted statements. We then compare the ability of linguistic and extra-linguistic features to predict readers’ assessment of the certainty of quoted content. We see that readers are indeed influenced by such framing devices — and we find no evidence that they consider other factors, such as the source, journalist, or the content itself. In addition, we examine the impact of specific framing devices on perceptions of credibility.

60 citations


Proceedings ArticleDOI
01 Jun 2014
TL;DR: This work proposes a new technique called marginalized structured dropout, which exploits feature structure to obtain a remarkably simple and efficient feature projection in the context of fine-grained part-of-speech tagging on a dataset of historical Portuguese.
Abstract: Unsupervised domain adaptation often relies on transforming the instance representation. However, most such approaches are designed for bag-of-words models, and ignore the structured features present in many problems in NLP. We propose a new technique called marginalized structured dropout, which exploits feature structure to obtain a remarkably simple and efficient feature projection. Applied to the task of fine-grained part-of-speech tagging on a dataset of historical Portuguese, marginalized structured dropout yields state-of-the-art accuracy while increasing speed by more than an order of magnitude over previous work.

33 citations
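The marginalization idea behind structured dropout can be demonstrated on a toy feature vector. The sketch below is an illustration, not the paper's training procedure: if corruption deletes one of K feature templates uniformly at random, then the expected corrupted vector is just each feature scaled by (K-1)/K, so no corrupted copies need to be generated (here `corruptions` enumerates them only to verify the closed form):

```python
def corruptions(x, templates):
    """All single-template deletions of feature vector x (dict: name -> value).
    `templates` maps each feature name to its template id."""
    ids = sorted({templates[f] for f in x})
    return [{f: v for f, v in x.items() if templates[f] != drop}
            for drop in ids]

def marginalized(x, templates):
    """Expected corrupted vector in closed form: each feature survives
    (K-1)/K of the corruptions, so enumeration is unnecessary."""
    k = len({templates[f] for f in x})
    return {f: v * (k - 1) / k for f, v in x.items()}
```

The efficiency claim in the abstract follows from this kind of collapse: the expectation over corruptions is available analytically, so adaptation costs little more than a rescaling.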


Proceedings ArticleDOI
01 Jun 2014
TL;DR: A new approach to inducing the syntactic categories of words, combining their distributional and morphological properties in a joint nonparametric Bayesian model based on the distance-dependent Chinese Restaurant Process, which outperforms competitive alternatives on English POS induction.
Abstract: We present a new approach to inducing the syntactic categories of words, combining their distributional and morphological properties in a joint nonparametric Bayesian model based on the distance-dependent Chinese Restaurant Process. The prior distribution over word clusterings uses a log-linear model of morphological similarity; the likelihood function is the probability of generating vector word embeddings. The weights of the morphology model are learned jointly while inducing part-of-speech clusters, encouraging them to cohere with the distributional features. The resulting algorithm outperforms competitive alternatives on English POS induction.

9 citations
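The distance-dependent CRP prior can be sketched via its customer-link construction. In the sketch below, the `decay` callable is a placeholder for the paper's log-linear model of morphological similarity: each word links to another with probability proportional to the decayed distance, or to itself with probability proportional to a concentration parameter, and clusters are the connected components of the link graph:

```python
import math
import random

def ddcrp_partition(n, decay, dist, alpha=1.0, seed=0):
    """Sample a partition from a distance-dependent CRP.
    Each item links to item j with weight decay(dist(i, j)), or to
    itself with weight alpha; clusters = connected components of links."""
    rng = random.Random(seed)
    links = []
    for i in range(n):
        weights = [alpha if j == i else decay(dist(i, j)) for j in range(n)]
        r = rng.random() * sum(weights)
        for j, w in enumerate(weights):
            r -= w
            if r <= 0:
                links.append(j)
                break
    # connected components via union-find with path halving
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i, j in enumerate(links):
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

In the full model this prior is combined with a likelihood over word embeddings, and the decay function's weights are learned jointly with the induced clusters.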


Posted Content
TL;DR: On a dataset of movie scripts, the system obtains a coherent clustering of address terms, while at the same time making intuitively plausible judgments of the formality of social relations in each film.
Abstract: We present an unsupervised model for inducing signed social networks from the content exchanged across network edges. Inference in this model solves three problems simultaneously: (1) identifying the sign of each edge; (2) characterizing the distribution over content for each edge type; (3) estimating weights for triadic features that map to theoretical models such as structural balance. We apply this model to the problem of inducing the social function of address terms, such as 'Madame', 'comrade', and 'dude'. On a dataset of movie scripts, our system obtains a coherent clustering of address terms, while at the same time making intuitively plausible judgments of the formality of social relations in each film. As an additional contribution, we provide a bootstrapping technique for identifying and tagging address terms in dialogue.

8 citations
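The triadic features mentioned in the abstract are easy to make concrete. The paper learns weights over such features jointly with edge signs and content; the hypothetical `triad_features` helper below only shows the feature computation for the classic structural-balance case, where a triangle is balanced when the product of its edge signs is positive:

```python
from itertools import combinations

def triad_features(signs):
    """Count balanced vs. unbalanced triangles in a signed graph.
    `signs` maps frozenset({u, v}) -> +1 or -1."""
    nodes = set()
    for edge in signs:
        nodes |= edge
    balanced = unbalanced = 0
    for a, b, c in combinations(sorted(nodes), 3):
        edges = [frozenset({a, b}), frozenset({b, c}), frozenset({a, c})]
        if all(e in signs for e in edges):   # only fully observed triangles
            prod = 1
            for e in edges:
                prod *= signs[e]
            if prod > 0:
                balanced += 1
            else:
                unbalanced += 1
    return balanced, unbalanced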


01 Jan 2014
TL;DR: Synthesizing current research on exploratory data analysis with techniques from the fields of computational linguistics and data visualization, a new set of methods are proposed to assist humanities scholars in computationallyassisted exploratory research.
Abstract: On July 19th, 1848, 300 concerned United States citizens gathered in Seneca Falls, New York, for the women’s rights convention that would culminate in the signing of the Declaration of Rights and Sentiments, the first major document (in the US) to call for women’s right to vote. In The North Star, Frederick Douglass, the former slave turned abolitionist, extolled the event as a “grand movement for attaining the civil, social, political, and religious rights of women” (1848). In the Oneida Whig, the same event was ridiculed as the “most shocking and unnatural event ever recorded in the history of womanity” (1848). As demonstrated by these contradictory accounts, published opinions varied greatly -about the women’s rights movement in the nineteenthcentury United States, and about current events generally conceived. Large-scale digitization projects have increasingly enabled humanities scholars to search newspapers, such as those just cited, for significant words and phrases. But exploring more open-ended questions such as, “How did the discourse surrounding women’s rights in the United States change in the wake of the 1848 Seneca Falls Convention?” or “Did the women’s rights movement borrow language from the nation’s contemporaneous anti-slavery campaign?” remains a challenge. Synthesizing current research on exploratory data analysis with techniques from the fields of computational linguistics and data visualization, we propose a new set of methods to assist humanities scholars in computationallyassisted exploratory research.

7 citations


Posted Content
TL;DR: It is shown that a novel but simple feature embedding approach provides better performance, by exploiting the feature template structure common in NLP problems.
Abstract: Representation learning is the dominant technique for unsupervised domain adaptation, but existing approaches often require the specification of "pivot features" that generalize across domains, which are selected by task-specific heuristics. We show that a novel but simple feature embedding approach provides better performance, by exploiting the feature template structure common in NLP problems.


Proceedings ArticleDOI
01 Jun 2014
TL;DR: A novel model for web forums is presented, which captures both thematic content as well as user-specific interests and identifies several topics of concern to individuals who report being on the autism spectrum.
Abstract: Discussion forums offer a new source of insight for the experiences and challenges faced by individuals affected by mental disorders. Language technology can help domain experts gather insight from these forums, by aggregating themes and user behaviors across thousands of conversations. We present a novel model for web forums, which captures both thematic content as well as user-specific interests. Applying this model to the Aspies Central forum (which covers issues related to Asperger’s syndrome and autism spectrum disorder), we identify several topics of concern to individuals who report being on the autism spectrum. We perform the evaluation on the data collected from Aspies Central forum, including 1,939 threads, 29,947 posts and 972 users. Quantitative evaluations demonstrate that the topics extracted by this model are substantially more than those obtained by Latent Dirichlet Allocation and the Author-Topic Model. Qualitative analysis by subjectmatter experts suggests intriguing directions for future investigation.

Posted Content
TL;DR: This article used a downward compositional pass to predict implicit discourse relations in the Penn Discourse Treebank and achieved substantial improvements over the previous state-of-the-art in predicting implicit discourse relation.
Abstract: Discourse relations bind smaller linguistic elements into coherent texts. However, automatically identifying discourse relations is difficult, because it requires understanding the semantics of the linked sentences. A more subtle challenge is that it is not enough to represent the meaning of each sentence of a discourse relation, because the relation may depend on links between lower-level elements, such as entity mentions. Our solution computes distributional meaning representations by composition up the syntactic parse tree. A key difference from previous work on compositional distributional semantics is that we also compute representations for entity mentions, using a novel downward compositional pass. Discourse relations are predicted not only from the distributional representations of the sentences, but also of their coreferent entity mentions. The resulting system obtains substantial improvements over the previous state-of-the-art in predicting implicit discourse relations in the Penn Discourse Treebank.

Journal ArticleDOI
TL;DR: In this article, a hierarchical Bayesian model with joint inference is proposed to infer the semantic properties of documents by leveraging free-text keyphrase annotations, which can be used to summarize single and multiple documents into a set of semantically salient keyphrases.
Abstract: This paper presents a new method for inferring the semantic properties of documents by leveraging free-text keyphrase annotations. Such annotations are becoming increasingly abundant due to the recent dramatic growth in semi-structured, user-generated online content. One especially relevant domain is product reviews, which are often annotated by their authors with pros/cons keyphrases such as a real bargain or good value. These annotations are representative of the underlying semantic properties; however, unlike expert annotations, they are noisy: lay authors may use different labels to denote the same property, and some labels may be missing. To learn using such noisy annotations, we find a hidden paraphrase structure which clusters the keyphrases. The paraphrase structure is linked with a latent topic model of the review texts, enabling the system to predict the properties of unannotated documents and to effectively aggregate the semantic properties of multiple reviews. Our approach is implemented as a hierarchical Bayesian model with joint inference. We find that joint inference increases the robustness of the keyphrase clustering and encourages the latent topics to correlate with semantically meaningful properties. Multiple evaluations demonstrate that our model substantially outperforms alternative approaches for summarizing single and multiple documents into a set of semantically salient keyphrases.


Posted Content
17 Nov 2014
TL;DR: Working without any labeled data, this system offers a coherent clustering of address terms, while at the same time making intuitively plausible judgments of the formality of social relations in each film.
Abstract: We present an unsupervised model for inducing signed social networks from the content exchanged across network edges. Inference in this model solves three problems simultaneously: (1) identifying the sign of each edge; (2) characterizing the distribution over content for each edge type; (3) estimating weights for triadic features that map to theoretical models such as structural balance. We apply this model to the problem of inducing the social function of address terms, such as 'Madame', 'comrade', and 'dude'. On a dataset of movie scripts, our system obtains a coherent clustering of address terms, while at the same time making intuitively plausible judgments of the formality of social relations in each film. As an additional contribution, we provide a bootstrapping technique for identifying and tagging address terms in dialogue.

Proceedings Article
01 Jan 2014
TL;DR: This paper used a downward compositional pass to predict implicit discourse relations in the Penn Discourse Treebank and achieved substantial improvements over the previous state-of-the-art in predicting implicit discourse relation.
Abstract: Discourse relations bind smaller linguistic elements into coherent texts. However, automatically identifying discourse relations is difficult, because it requires understanding the semantics of the linked sentences. A more subtle challenge is that it is not enough to represent the meaning of each sentence of a discourse relation, because the relation may depend on links between lower-level elements, such as entity mentions. Our solution computes distributional meaning representations by composition up the syntactic parse tree. A key difference from previous work on compositional distributional semantics is that we also compute representations for entity mentions, using a novel downward compositional pass. Discourse relations are predicted not only from the distributional representations of the sentences, but also of their coreferent entity mentions. The resulting system obtains substantial improvements over the previous state-of-the-art in predicting implicit discourse relations in the Penn Discourse Treebank.

Journal ArticleDOI
TL;DR: This article used multilingual learning for unsupervised part-of-speech tagging and found that by combining cues from multiple languages, the structure of each becomes more apparent, and showed that using multilingual evidence can achieve impressive performance gains across a range of scenarios.
Abstract: We demonstrate the effectiveness of multilingual learning for unsupervised part-of-speech tagging. The central assumption of our work is that by combining cues from multiple languages, the structure of each becomes more apparent. We consider two ways of applying this intuition to the problem of unsupervised part-of-speech tagging: a model that directly merges tag structures for a pair of languages into a single sequence and a second model which instead incorporates multilingual context using latent variables. Both approaches are formulated as hierarchical Bayesian models, using Markov Chain Monte Carlo sampling techniques for inference. Our results demonstrate that by incorporating multilingual evidence we can achieve impressive performance gains across a range of scenarios. We also found that performance improves steadily as the number of available languages increases.