
Showing papers by "Dan Jurafsky" published in 2018


Journal ArticleDOI
TL;DR: A framework is developed to demonstrate how the temporal dynamics of word embeddings help quantify changes in stereotypes and attitudes toward women and ethnic minorities in the 20th and 21st centuries in the United States.
Abstract: Word embeddings are a powerful machine-learning framework that represents each English word by a vector. The geometric relationship between these vectors captures meaningful semantic relationships between the corresponding words. In this paper, we develop a framework to demonstrate how the temporal dynamics of the embedding help quantify changes in stereotypes and attitudes toward women and ethnic minorities in the 20th and 21st centuries in the United States. We integrate word embeddings trained on 100 years of text data with the US Census to show that changes in the embedding track closely with demographic and occupation shifts over time. The embedding captures societal shifts (e.g., the women's movement in the 1960s and Asian immigration into the United States) and also illuminates how specific adjectives and occupations became more closely associated with certain populations over time. Our framework for temporal analysis of word embeddings opens up a fruitful intersection between machine learning and quantitative social science.
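
The paper's core measurement is easy to prototype. Below is a minimal sketch of a relative-norm-distance bias score computed against per-decade embeddings, in the spirit of the paper's method; the word lists and the decade_embeddings loading step are hypothetical placeholders, not the authors' pipeline.

```python
# Sketch: how much closer "neutral" words (e.g., occupations) sit to one
# group's vectors than another's, per decade. Embeddings are a plain
# dict of word -> np.ndarray; all names below are illustrative.
import numpy as np

def group_vector(emb, words):
    """Average, then re-normalize, the vectors of a group's anchor words."""
    vecs = [emb[w] / np.linalg.norm(emb[w]) for w in words if w in emb]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def relative_norm_distance(emb, neutral_words, group1, group2):
    """Negative values mean neutral words sit closer to group1 than group2."""
    v1, v2 = group_vector(emb, group1), group_vector(emb, group2)
    dists = []
    for w in neutral_words:
        if w not in emb:
            continue
        u = emb[w] / np.linalg.norm(emb[w])
        dists.append(np.linalg.norm(u - v1) - np.linalg.norm(u - v2))
    return float(np.mean(dists))

# Hypothetical usage, one embedding per decade (e.g., vectors trained on
# historical Google Books / COHA text):
# for decade, emb in decade_embeddings.items():
#     print(decade, relative_norm_distance(emb, OCCUPATIONS, SHE, HE))
```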

728 citations


Proceedings ArticleDOI
12 May 2018
TL;DR: The authors investigate the role of context in an LSTM LM through ablation studies and find that the model is capable of using about 200 tokens of context on average but sharply distinguishes nearby context (the most recent 50 tokens) from the distant history.
Abstract: We know very little about how neural language models (LM) use prior linguistic context. In this paper, we investigate the role of context in an LSTM LM, through ablation studies. Specifically, we analyze the increase in perplexity when prior context words are shuffled, replaced, or dropped. On two standard datasets, Penn Treebank and WikiText-2, we find that the model is capable of using about 200 tokens of context on average, but sharply distinguishes nearby context (recent 50 tokens) from the distant history. The model is highly sensitive to the order of words within the most recent sentence, but ignores word order in the long-range context (beyond 50 tokens), suggesting the distant past is modeled only as a rough semantic field or topic. We further find that the neural caching model (Grave et al., 2017b) especially helps the LSTM to copy words from within this distant context. Overall, our analysis not only provides a better understanding of how neural LMs use their context, but also sheds light on recent success from cache-based models.
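
The ablations are simple to reproduce in outline. Here is a hedged sketch of the shuffle experiment, assuming a trained PyTorch LSTM language model `lm` that maps a tensor of token ids to per-position next-token logits; the interface and data names are assumptions, not the authors' code.

```python
import math
import torch
import torch.nn.functional as F

def perplexity(lm, ids, context_len=200, shuffle_before=None):
    """Score each token given `context_len` prior tokens; if
    `shuffle_before` is set, shuffle all context tokens farther back
    than that many positions (the "distant history")."""
    nll, count = 0.0, 0
    for t in range(context_len, ids.size(0) - 1):
        ctx = ids[t - context_len:t + 1].clone()
        if shuffle_before is not None:
            far = ctx[:-shuffle_before].clone()
            ctx[:-shuffle_before] = far[torch.randperm(far.size(0))]
        logits = lm(ctx.unsqueeze(0))              # [1, time, vocab]
        nll += F.cross_entropy(logits[0, -1].unsqueeze(0),
                               ids[t + 1].unsqueeze(0)).item()
        count += 1
    return math.exp(nll / count)

# ppl_gap = perplexity(lm, test_ids, shuffle_before=50) \
#           - perplexity(lm, test_ids)   # ~0 if distant order is ignored
```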

230 citations


Proceedings ArticleDOI
23 Apr 2018
TL;DR: This work studies intercommunity interactions across 36,000 communities on Reddit, examining cases where users of one community are mobilized by negative sentiment to comment in another community, and finds that conflicts are marked by the formation of echo chambers.
Abstract: Users organize themselves into communities on web platforms. These communities can interact with one another, often leading to conflicts and toxic interactions. However, little is known about the mechanisms of interactions between communities and how they impact users. Here we study intercommunity interactions across 36,000 communities on Reddit, examining cases where users of one community are mobilized by negative sentiment to comment in another community. We show that such conflicts tend to be initiated by a handful of communities---less than 1% of communities start 74% of conflicts. While conflicts tend to be initiated by highly active community members, they are carried out by significantly less active members. We find that conflicts are marked by formation of echo chambers, where users primarily talk to other users from their own community. In the long-term, conflicts have adverse effects and reduce the overall activity of users in the targeted communities. Our analysis of user interactions also suggests strategies for mitigating the negative impact of conflicts---such as increasing direct engagement between attackers and defenders. Further, we accurately predict whether a conflict will occur by creating a novel LSTM model that combines graph embeddings, user, community, and text features. This model can be used to create an early-warning system for community moderators to prevent conflicts. Altogether, this work presents a data-driven view of community interactions and conflict, and paves the way towards healthier online communities.
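
As a rough illustration (the paper's exact architecture may differ), a conflict classifier in this spirit concatenates graph embeddings of the two communities, scalar user/community features, and an LSTM encoding of the initiating post. All dimensions and names below are illustrative.

```python
import torch
import torch.nn as nn

class ConflictPredictor(nn.Module):
    """Hedged sketch: combine text, graph, and metadata features."""
    def __init__(self, vocab, emb=100, hid=128, graph_dim=64, meta_dim=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Sequential(
            nn.Linear(hid + 2 * graph_dim + meta_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),   # P(conflict) via sigmoid at the caller
        )

    def forward(self, post_ids, src_graph, dst_graph, meta):
        _, (h, _) = self.lstm(self.embed(post_ids))
        feats = torch.cat([h[-1], src_graph, dst_graph, meta], dim=-1)
        return self.out(feats).squeeze(-1)
```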

193 citations


Proceedings Article
01 Jan 2018
TL;DR: This work introduces a framework to efficiently make predictions about conjunctive logical queries -- a flexible but tractable subset of first-order logic -- on incomplete knowledge graphs and demonstrates the utility of this framework in two application studies on real-world datasets with millions of relations.
Abstract: Learning low-dimensional embeddings of knowledge graphs is a powerful approach used to predict unobserved or missing edges between entities. However, an open challenge in this area is developing techniques that can go beyond simple edge prediction and handle more complex logical queries, which might involve multiple unobserved edges, entities, and variables. For instance, given an incomplete biological knowledge graph, we might want to predict "what drugs are likely to target proteins involved with both diseases X and Y?" -- a query that requires reasoning about all possible proteins that might interact with diseases X and Y. Here we introduce a framework to efficiently make predictions about conjunctive logical queries -- a flexible but tractable subset of first-order logic -- on incomplete knowledge graphs. In our approach, we embed graph nodes in a low-dimensional space and represent logical operators as learned geometric operations (e.g., translation, rotation) in this embedding space. By performing logical operations within a low-dimensional embedding space, our approach achieves a time complexity that is linear in the number of query variables, compared to the exponential complexity required by a naive enumeration-based approach. We demonstrate the utility of this framework in two application studies on real-world datasets with millions of relations: predicting logical relationships in a network of drug-gene-disease interactions and in a graph-based representation of social interactions derived from a popular web forum.
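
The geometric-operator idea can be sketched compactly: relation "projection" as a learned relation-specific map and conjunction as a learned pooling over branch embeddings, with answers ranked by dot product so query cost stays linear. The parameterization below is a hedged approximation, not the paper's exact model.

```python
import torch
import torch.nn as nn

class QueryEmbedder(nn.Module):
    def __init__(self, n_entities, n_relations, dim=64):
        super().__init__()
        self.entity = nn.Embedding(n_entities, dim)
        # One learned linear projection per relation type.
        self.project = nn.ModuleList(
            [nn.Linear(dim, dim, bias=False) for _ in range(n_relations)])
        self.intersect = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                       nn.Linear(dim, dim))

    def follow(self, q, relation):
        """Embed 'nodes reachable from q via relation'."""
        return self.project[relation](q)

    def conjoin(self, queries):
        """Embed the conjunction (intersection) of several branch queries."""
        return self.intersect(torch.stack(queries).mean(dim=0))

    def score(self, q, entity_ids):
        """Rank candidate answers by dot product; linear in query size."""
        return self.entity(entity_ids) @ q

# e.g. "proteins targeted by drugs linked to disease X AND disease Y":
# qx = model.follow(model.entity(disease_x), rel_assoc)
# qy = model.follow(model.entity(disease_y), rel_assoc)
# q = model.conjoin([qx, qy]); scores = model.score(q, all_protein_ids)
```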

180 citations


Journal ArticleDOI
TL;DR: This work performs the largest behavioral study of citations to date, analyzing how scientific works frame their contributions through different types of citations and how this framing affects the field as a whole; changes in citation framing are then used to show that the field of NLP is undergoing a significant increase in consensus.
Abstract: Citations have long been used to characterize the state of a scientific field and to identify influential works. However, writers use citations for different purposes, and this varied purpose influ...

176 citations


Posted Content
TL;DR: This paper investigates the role of context in an LSTM LM, through ablation studies, and analyzes the increase in perplexity when prior context words are shuffled, replaced, or dropped to provide a better understanding of how neural LMs use their context.
Abstract: We know very little about how neural language models (LM) use prior linguistic context. In this paper, we investigate the role of context in an LSTM LM, through ablation studies. Specifically, we analyze the increase in perplexity when prior context words are shuffled, replaced, or dropped. On two standard datasets, Penn Treebank and WikiText-2, we find that the model is capable of using about 200 tokens of context on average, but sharply distinguishes nearby context (recent 50 tokens) from the distant history. The model is highly sensitive to the order of words within the most recent sentence, but ignores word order in the long-range context (beyond 50 tokens), suggesting the distant past is modeled only as a rough semantic field or topic. We further find that the neural caching model (Grave et al., 2017b) especially helps the LSTM to copy words from within this distant context. Overall, our analysis not only provides a better understanding of how neural LMs use their context, but also sheds light on recent success from cache-based models.

113 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: This paper develops error generation processes using a neural sequence transduction model trained to translate clean examples to their noisy counterparts, and proposes beam search noising procedures to synthesize additional noisy examples that human evaluators were nearly unable to distinguish from nonsynthesized examples.
Abstract: Translation-based methods for grammar correction that directly map noisy, ungrammatical text to their clean counterparts are able to correct a broad range of errors; however, such techniques are bottlenecked by the need for a large parallel corpus of noisy and clean sentence pairs. In this paper, we consider synthesizing parallel data by noising a clean monolingual corpus. While most previous approaches introduce perturbations using features computed from local context windows, we instead develop error generation processes using a neural sequence transduction model trained to translate clean examples to their noisy counterparts. Given a corpus of clean examples, we propose beam search noising procedures to synthesize additional noisy examples that human evaluators were nearly unable to discriminate from nonsynthesized examples. Surprisingly, when trained on additional data synthesized using our best-performing noising scheme, our model approaches the same performance as when trained on additional nonsynthesized data.
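
One way to realize such a noising procedure is to jitter hypothesis scores during beam decoding of the clean-to-noisy model, so the beam keeps plausible corruptions rather than only the single most likely one. The sketch below assumes a step(seq) interface returning (token, log-prob) candidates; it is an illustration under that assumption, not the authors' implementation.

```python
import heapq
import random

def noised_beam_search(step, start, eos, beam=8, penalty=1.0, max_len=50):
    """Beam-decode a clean->noisy transducer, randomly perturbing scores
    so several distinct plausible corruptions survive in the beam."""
    hyps = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        cand = []
        for score, seq in hyps:
            for tok, logp in step(seq):          # assumed model interface
                s = score + logp - penalty * random.random()
                cand.append((s, seq + [tok]))
        hyps = heapq.nlargest(beam, cand, key=lambda c: c[0])
        finished += [h for h in hyps if h[1][-1] == eos]
        hyps = [h for h in hyps if h[1][-1] != eos]
        if not hyps:
            break
    return [seq for _, seq in
            heapq.nlargest(beam, finished, key=lambda c: c[0])]
```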

112 citations


Proceedings ArticleDOI
01 Aug 2018
TL;DR: This work introduces embedding-based methods for cross-lingually projecting English frames to Russian, and offers new ways to identify subtle media manipulation strategies at the intersection of agenda-setting and framing.
Abstract: Amidst growing concern over media manipulation, NLP attention has focused on overt strategies like censorship and “fake news”. Here, we draw on two concepts from political science literature to explore subtler strategies for government media manipulation: agenda-setting (selecting what topics to cover) and framing (deciding how topics are covered). We analyze 13 years (100K articles) of the Russian newspaper Izvestia and identify a strategy of distraction: articles mention the U.S. more frequently in the month directly following an economic downturn in Russia. We introduce embedding-based methods for cross-lingually projecting English frames to Russian, and discover that these articles emphasize U.S. moral failings and threats to the U.S. Our work offers new ways to identify subtle media manipulation strategies at the intersection of agenda-setting and framing.
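
Once English and Russian embeddings are aligned into a shared space, projecting a frame lexicon reduces to nearest-neighbor search. A minimal sketch under that assumption follows; the frame words, vocabulary, and variable names are placeholders.

```python
import numpy as np

def project_lexicon(en_frame_words, en_vecs, ru_matrix, ru_words, k=10):
    """Return the k Russian words closest to the frame's English centroid.
    en_vecs: dict word -> vector; ru_matrix: [n, d] array aligned with
    ru_words. Both spaces are assumed pre-aligned (e.g., via MUSE)."""
    centroid = np.mean([en_vecs[w] for w in en_frame_words if w in en_vecs],
                       axis=0)
    centroid /= np.linalg.norm(centroid)
    M = ru_matrix / np.linalg.norm(ru_matrix, axis=1, keepdims=True)
    sims = M @ centroid
    return [ru_words[i] for i in np.argsort(-sims)[:k]]

# Hypothetical usage:
# russian_moral_frame = project_lexicon(
#     ["immoral", "corrupt", "dishonest"], en_vectors, ru_matrix, ru_vocab)
```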

101 citations


Proceedings ArticleDOI
TL;DR: In this article, the authors study inter-community interactions across 36,000 communities on Reddit, examining cases where users of one community are mobilized by negative sentiment to comment in another community.
Abstract: Users organize themselves into communities on web platforms. These communities can interact with one another, often leading to conflicts and toxic interactions. However, little is known about the mechanisms of interactions between communities and how they impact users. Here we study intercommunity interactions across 36,000 communities on Reddit, examining cases where users of one community are mobilized by negative sentiment to comment in another community. We show that such conflicts tend to be initiated by a handful of communities---less than 1% of communities start 74% of conflicts. While conflicts tend to be initiated by highly active community members, they are carried out by significantly less active members. We find that conflicts are marked by formation of echo chambers, where users primarily talk to other users from their own community. In the long-term, conflicts have adverse effects and reduce the overall activity of users in the targeted communities. Our analysis of user interactions also suggests strategies for mitigating the negative impact of conflicts---such as increasing direct engagement between attackers and defenders. Further, we accurately predict whether a conflict will occur by creating a novel LSTM model that combines graph embeddings, user, community, and text features. This model can be used to create early-warning systems for community moderators to prevent conflicts. Altogether, this work presents a data-driven view of community interactions and conflict, and paves the way towards healthier online communities.

84 citations


Proceedings Article
01 May 2018
TL;DR: A multi-genre corpus of more than 25M comments from five socially and topically diverse sources, tagged for the gender of the addressee, enables the study of socially important questions like gender bias and has potential uses for downstream applications such as dialogue systems, gender detection or obfuscation, and debiasing language generation.
Abstract: Like many social variables, gender pervasively influences how people communicate with one another. However, prior computational work has largely focused on linguistic gender difference and communications about gender, rather than communications directed to people of that gender, in part due to lack of data. Here, we fill a critical need by introducing a multi-genre corpus of more than 25M comments from five socially and topically diverse sources tagged for the gender of the addressee. Using these data, we describe pilot studies on how differential responses to gender can be measured and analyzed and present 30k annotations for the sentiment and relevance of these responses, showing that across our datasets responses to women are more likely to be emotive and about the speaker as an individual (rather than about the content being responded to). Our dataset enables studying socially important questions like gender bias, and has potential uses for downstream applications such as dialogue systems, gender detection or obfuscation, and debiasing language generation.
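
The differential-response measurements this corpus enables are straightforward to set up. A small sketch follows, with a hypothetical CSV schema (columns addressee_gender and sentiment) standing in for the released annotations.

```python
import pandas as pd
from scipy import stats

# Hypothetical export of the annotated responses.
df = pd.read_csv("responses.csv")   # columns: addressee_gender, sentiment

to_women = df.loc[df.addressee_gender == "W", "sentiment"]
to_men = df.loc[df.addressee_gender == "M", "sentiment"]

# Welch's t-test on mean response sentiment by addressee gender.
t, p = stats.ttest_ind(to_women, to_men, equal_var=False)
print(f"mean(W)={to_women.mean():.3f}  mean(M)={to_men.mean():.3f}  p={p:.2g}")
```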

69 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: The first benchmark comparison of previously proposed coherence models for detecting symptoms of schizophrenia is presented, evaluated on a new dataset of recorded interviews between subjects and clinicians, and a novel computational model for reference incoherence based on ambiguous pronoun usage is proposed.
Abstract: Schizophrenia is a mental disorder that afflicts an estimated 0.7% of adults worldwide. It affects many areas of mental function, often evident from incoherent speech. Diagnosing schizophrenia relies on subjective judgments resulting in disagreements even among trained clinicians. Recent studies have proposed the use of natural language processing for diagnosis by drawing on automatically-extracted linguistic features like discourse coherence and lexicon. Here, we present the first benchmark comparison of previously proposed coherence models for detecting symptoms of schizophrenia and evaluate their performance on a new dataset of recorded interviews between subjects and clinicians. We also present two alternative coherence metrics based on modern sentence embedding techniques that outperform the previous methods on our dataset. Lastly, we propose a novel computational model for reference incoherence based on ambiguous pronoun usage and show that it is a highly predictive feature on our data. While the number of subjects is limited in this pilot study, our results suggest new directions for diagnosing common symptoms of schizophrenia.
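
A coherence metric of the kind described, averaging the similarity of adjacent sentence embeddings, fits in a few lines. The encoder below (sentence-transformers) is a convenient stand-in, not necessarily the embedding model the authors used.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def coherence(sentences):
    """Mean cosine similarity between consecutive sentence embeddings;
    lower scores suggest more tangential, less coherent speech."""
    E = encoder.encode(sentences)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return float(np.mean(np.sum(E[:-1] * E[1:], axis=1)))

# print(coherence(["I went to the store.", "They were out of milk."]))
```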

Proceedings ArticleDOI
01 Jun 2018
TL;DR: Two deep learning algorithms are introduced that induce lexicons that are more predictive and less confound-related than those of standard feature weighting and lexicon induction techniques like regression and log odds; the algorithms are used to induce lexicons that are predictive of timely responses to consumer complaints, enrollment from course descriptions, and sales from product descriptions.
Abstract: NLP algorithms are increasingly used in computational social science to take linguistic observations and predict outcomes like human preferences or actions. Making these social models transparent and interpretable often requires identifying features in the input that predict outcomes while also controlling for potential confounds. We formalize this need as a new task: inducing a lexicon that is predictive of a set of target variables yet uncorrelated to a set of confounding variables. We introduce two deep learning algorithms for the task. The first uses a bifurcated architecture to separate the explanatory power of the text and confounds. The second uses an adversarial discriminator to force confound-invariant text encodings. Both elicit lexicons from learned weights and attentional scores. We use them to induce lexicons that are predictive of timely responses to consumer complaints (controlling for product), enrollment from course descriptions (controlling for subject), and sales from product descriptions (controlling for seller). In each domain our algorithms pick words that are associated with narrative persuasion and that are more predictive and less confound-related than those of standard feature weighting and lexicon induction techniques like regression and log odds.
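
The adversarial variant can be sketched with a gradient-reversal layer: the encoder learns text representations that predict the outcome while denying a confound discriminator any signal, and attention weights over words then supply the lexicon. Layer sizes and choices below are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, g):
        return -g            # flip gradients flowing back to the encoder

class AdversarialLexicon(nn.Module):
    def __init__(self, vocab, dim=64, n_confounds=5):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.attn = nn.Linear(dim, 1)
        self.outcome = nn.Linear(dim, 1)
        self.confound = nn.Linear(dim, n_confounds)

    def forward(self, ids):
        E = self.embed(ids)                           # [batch, len, dim]
        a = torch.softmax(self.attn(E).squeeze(-1), dim=-1)
        h = (a.unsqueeze(-1) * E).sum(dim=1)          # attention-pooled text
        return self.outcome(h), self.confound(GradReverse.apply(h)), a

# Train on outcome_loss + confound_loss; high-attention words then form
# the induced, confound-decorrelated lexicon.
```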

Posted Content
TL;DR: In this article, the authors introduce a framework to efficiently make predictions about conjunctive logical queries (a flexible but tractable subset of first-order logic) on incomplete knowledge graphs.
Abstract: Learning low-dimensional embeddings of knowledge graphs is a powerful approach used to predict unobserved or missing edges between entities. However, an open challenge in this area is developing techniques that can go beyond simple edge prediction and handle more complex logical queries, which might involve multiple unobserved edges, entities, and variables. For instance, given an incomplete biological knowledge graph, we might want to predict "what drugs are likely to target proteins involved with both diseases X and Y?" -- a query that requires reasoning about all possible proteins that might interact with diseases X and Y. Here we introduce a framework to efficiently make predictions about conjunctive logical queries -- a flexible but tractable subset of first-order logic -- on incomplete knowledge graphs. In our approach, we embed graph nodes in a low-dimensional space and represent logical operators as learned geometric operations (e.g., translation, rotation) in this embedding space. By performing logical operations within a low-dimensional embedding space, our approach achieves a time complexity that is linear in the number of query variables, compared to the exponential complexity required by a naive enumeration-based approach. We demonstrate the utility of this framework in two application studies on real-world datasets with millions of relations: predicting logical relationships in a network of drug-gene-disease interactions and in a graph-based representation of social interactions derived from a popular web forum.

Posted Content
TL;DR: This article analyzed 13 years of the Russian newspaper Izvestia and identified a strategy of distraction: articles mention the U.S. more frequently in the month directly following an economic downturn in Russia.
Abstract: Amidst growing concern over media manipulation, NLP attention has focused on overt strategies like censorship and "fake news". Here, we draw on two concepts from the political science literature to explore subtler strategies for government media manipulation: agenda-setting (selecting what topics to cover) and framing (deciding how topics are covered). We analyze 13 years (100K articles) of the Russian newspaper Izvestia and identify a strategy of distraction: articles mention the U.S. more frequently in the month directly following an economic downturn in Russia. We introduce embedding-based methods for cross-lingually projecting English frames to Russian, and discover that these articles emphasize U.S. moral failings and threats to the U.S. Our work offers new ways to identify subtle media manipulation strategies at the intersection of agenda-setting and framing.

Journal Article
TL;DR: A rational model of adjective use is described in which listeners explicitly reason about judgments made by different speakers, formalizing the notion of subjectivity as agreement between speakers and suggesting that adjective ordering can be explained by general principles of human communication and language processing.

Proceedings ArticleDOI
07 Sep 2018
TL;DR: This paper presents a new dataset for TAP, baselines, and a model that successfully uses an ILP to enforce the structural constraints of the problem.
Abstract: To understand a sentence like “whereas only 10% of White Americans live at or below the poverty line, 28% of African Americans do” it is important not only to identify individual facts, e.g., poverty rates of distinct demographic groups, but also the higher-order relations between them, e.g., the disparity between them. In this paper, we propose the task of Textual Analogy Parsing (TAP) to model this higher-order meaning. Given a sentence such as the one above, TAP outputs a frame-style meaning representation which explicitly specifies what is shared (e.g., poverty rates) and what is compared (e.g., White Americans vs. African Americans, 10% vs. 28%) between its component facts. Such a meaning representation can enable new applications that rely on discourse understanding such as automated chart generation from quantitative text. We present a new dataset for TAP, baselines, and a model that successfully uses an ILP to enforce the structural constraints of the problem.
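
To make the ILP step concrete, here is a toy sketch in PuLP over the example sentence: hypothetical per-span role scores, with two simplified structural constraints (one role per span; equal numbers of compared entities and quantities). The roles and scores are stand-ins for the paper's full frame structure.

```python
import pulp

spans = ["White Americans", "African Americans", "10%", "28%"]
roles = ["ENTITY", "QUANT", "NONE"]
score = {  # hypothetical scorer output: (span, role) -> log-score
    ("White Americans", "ENTITY"): 2.0, ("African Americans", "ENTITY"): 1.8,
    ("10%", "QUANT"): 2.2, ("28%", "QUANT"): 2.1,
}

prob = pulp.LpProblem("tap", pulp.LpMaximize)
x = {(s, r): pulp.LpVariable(f"x_{i}_{r}", cat="Binary")
     for i, s in enumerate(spans) for r in roles}
prob += pulp.lpSum(score.get((s, r), -1.0) * x[s, r]
                   for s in spans for r in roles)
for s in spans:                                    # one role per span
    prob += pulp.lpSum(x[s, r] for r in roles) == 1
prob += (pulp.lpSum(x[s, "ENTITY"] for s in spans) ==
         pulp.lpSum(x[s, "QUANT"] for s in spans))  # compared roles pair up
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([(s, r) for s in spans for r in roles if x[s, r].value() == 1])
```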

Journal ArticleDOI
TL;DR: It is demonstrated that the dialog structures produced by the tagger could reveal whether officers follow law enforcement norms like introducing themselves, explaining the reason for the stop, and asking permission for searches.
Abstract: We apply computational dialog methods to police body-worn camera footage to model conversations between police officers and community members in traffic stops. Relying on the theory of institutional...

Posted Content
05 Jun 2018
TL;DR: A framework is introduced to make predictions about conjunctive logical queries, i.e., subgraph relationships, on heterogeneous network data, highlighting how imposing logical structure can make network embeddings more useful for large-scale knowledge discovery.
Abstract: Learning vector embeddings of complex networks is a powerful approach used to predict missing or unobserved edges in network data. However, an open challenge in this area is developing techniques that can reason about $\textit{subgraphs}$ in network data, which can involve the logical conjunction of several edge relationships. Here we introduce a framework to make predictions about conjunctive logical queries---i.e., subgraph relationships---on heterogeneous network data. In our approach, we embed network nodes in a low-dimensional space and represent logical operators as learned geometric operations (e.g., translation, rotation) in this embedding space. We prove that a small set of geometric operations are sufficient to represent conjunctive logical queries on a network, and we introduce a series of increasingly strong implementations of these operators. We demonstrate the utility of this framework in two application studies on networks with millions of edges: predicting unobserved subgraphs in a network of drug-gene-disease interactions and in a network of social interactions derived from a popular web forum. These experiments demonstrate how our framework can efficiently make logical predictions such as "what drugs are likely to target proteins involved with both diseases X and Y?" Together our results highlight how imposing logical structure can make network embeddings more useful for large-scale knowledge discovery.

Posted Content
TL;DR: This article proposed Textual Analogy Parsing (TAP) to model the higher-order relations between the individual facts in a sentence, e.g., the disparity between the poverty rates of different demographic groups.
Abstract: To understand a sentence like "whereas only 10% of White Americans live at or below the poverty line, 28% of African Americans do" it is important not only to identify individual facts, e.g., poverty rates of distinct demographic groups, but also the higher-order relations between them, e.g., the disparity between them. In this paper, we propose the task of Textual Analogy Parsing (TAP) to model this higher-order meaning. The output of TAP is a frame-style meaning representation which explicitly specifies what is shared (e.g., poverty rates) and what is compared (e.g., White Americans vs. African Americans, 10% vs. 28%) between its component facts. Such a meaning representation can enable new applications that rely on discourse understanding such as automated chart generation from quantitative text. We present a new dataset for TAP, baselines, and a model that successfully uses an ILP to enforce the structural constraints of the problem.