
Showing papers by "Dan Jurafsky" published in 2018


Journal ArticleDOI
TL;DR: A framework is developed to demonstrate how the temporal dynamics of word embeddings help quantify changes in stereotypes and attitudes toward women and ethnic minorities in the 20th and 21st centuries in the United States.
Abstract: Word embeddings are a powerful machine-learning framework that represents each English word by a vector. The geometric relationship between these vectors captures meaningful semantic relationships between the corresponding words. In this paper, we develop a framework to demonstrate how the temporal dynamics of the embedding help quantify changes in stereotypes and attitudes toward women and ethnic minorities in the 20th and 21st centuries in the United States. We integrate word embeddings trained on 100 years of text data with the US Census to show that changes in the embedding track closely with demographic and occupation shifts over time. The embedding captures societal shifts (e.g., the women's movement in the 1960s and Asian immigration into the United States) and also illuminates how specific adjectives and occupations became more closely associated with certain populations over time. Our framework for temporal analysis of word embeddings opens up a fruitful intersection between machine learning and quantitative social science.
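
The paper's core measurement is easy to prototype. Below is a minimal sketch of a relative-norm-distance bias score computed against per-decade embeddings, in the spirit of the paper's method; the word lists and the decade_embeddings loading step are hypothetical placeholders, not the authors' pipeline.

```python
# Sketch: how much closer "neutral" words (e.g., occupations) sit to one
# group's vectors than another's, per decade. Embeddings are a plain
# dict of word -> np.ndarray; all names below are illustrative.
import numpy as np

def group_vector(emb, words):
    """Average, then re-normalize, the vectors of a group's anchor words."""
    vecs = [emb[w] / np.linalg.norm(emb[w]) for w in words if w in emb]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def relative_norm_distance(emb, neutral_words, group1, group2):
    """Negative values mean neutral words sit closer to group1 than group2."""
    v1, v2 = group_vector(emb, group1), group_vector(emb, group2)
    dists = []
    for w in neutral_words:
        if w not in emb:
            continue
        u = emb[w] / np.linalg.norm(emb[w])
        dists.append(np.linalg.norm(u - v1) - np.linalg.norm(u - v2))
    return float(np.mean(dists))

# Hypothetical usage, one embedding per decade (e.g., vectors trained on
# historical Google Books / COHA text):
# for decade, emb in decade_embeddings.items():
#     print(decade, relative_norm_distance(emb, OCCUPATIONS, SHE, HE))
```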

728 citations


Proceedings ArticleDOI
12 May 2018
TL;DR: The authors investigate the role of context in an LSTM LM through ablation studies and find that the model is capable of using about 200 tokens of context on average but sharply distinguishes nearby context (the most recent 50 tokens) from the distant history.
Abstract: We know very little about how neural language models (LM) use prior linguistic context. In this paper, we investigate the role of context in an LSTM LM, through ablation studies. Specifically, we analyze the increase in perplexity when prior context words are shuffled, replaced, or dropped. On two standard datasets, Penn Treebank and WikiText-2, we find that the model is capable of using about 200 tokens of context on average, but sharply distinguishes nearby context (recent 50 tokens) from the distant history. The model is highly sensitive to the order of words within the most recent sentence, but ignores word order in the long-range context (beyond 50 tokens), suggesting the distant past is modeled only as a rough semantic field or topic. We further find that the neural caching model (Grave et al., 2017b) especially helps the LSTM to copy words from within this distant context. Overall, our analysis not only provides a better understanding of how neural LMs use their context, but also sheds light on recent success from cache-based models.
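
The ablations are simple to reproduce in outline. Here is a hedged sketch of the shuffle experiment, assuming a trained PyTorch LSTM language model `lm` that maps a tensor of token ids to per-position next-token logits; the interface and data names are assumptions, not the authors' code.

```python
import math
import torch
import torch.nn.functional as F

def perplexity(lm, ids, context_len=200, shuffle_before=None):
    """Score each token given `context_len` prior tokens; if
    `shuffle_before` is set, shuffle all context tokens farther back
    than that many positions (the "distant history")."""
    nll, count = 0.0, 0
    for t in range(context_len, ids.size(0) - 1):
        ctx = ids[t - context_len:t + 1].clone()
        if shuffle_before is not None:
            far = ctx[:-shuffle_before].clone()
            ctx[:-shuffle_before] = far[torch.randperm(far.size(0))]
        logits = lm(ctx.unsqueeze(0))              # [1, time, vocab]
        nll += F.cross_entropy(logits[0, -1].unsqueeze(0),
                               ids[t + 1].unsqueeze(0)).item()
        count += 1
    return math.exp(nll / count)

# ppl_gap = perplexity(lm, test_ids, shuffle_before=50) \
#           - perplexity(lm, test_ids)   # ~0 if distant order is ignored
```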

230 citations


Proceedings ArticleDOI
23 Apr 2018
TL;DR: This work studies intercommunity interactions across 36,000 communities on Reddit, examining cases where users of one community are mobilized by negative sentiment to comment in another community, and finds that conflicts are marked by the formation of echo chambers.
Abstract: Users organize themselves into communities on web platforms. These communities can interact with one another, often leading to conflicts and toxic interactions. However, little is known about the mechanisms of interactions between communities and how they impact users. Here we study intercommunity interactions across 36,000 communities on Reddit, examining cases where users of one community are mobilized by negative sentiment to comment in another community. We show that such conflicts tend to be initiated by a handful of communities---less than 1% of communities start 74% of conflicts. While conflicts tend to be initiated by highly active community members, they are carried out by significantly less active members. We find that conflicts are marked by formation of echo chambers, where users primarily talk to other users from their own community. In the long-term, conflicts have adverse effects and reduce the overall activity of users in the targeted communities. Our analysis of user interactions also suggests strategies for mitigating the negative impact of conflicts---such as increasing direct engagement between attackers and defenders. Further, we accurately predict whether a conflict will occur by creating a novel LSTM model that combines graph embeddings, user, community, and text features. This model can be used to create an early-warning system for community moderators to prevent conflicts. Altogether, this work presents a data-driven view of community interactions and conflict, and paves the way towards healthier online communities.
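
As a rough illustration (the paper's exact architecture may differ), a conflict classifier in this spirit concatenates graph embeddings of the two communities, scalar user/community features, and an LSTM encoding of the initiating post. All dimensions and names below are illustrative.

```python
import torch
import torch.nn as nn

class ConflictPredictor(nn.Module):
    """Hedged sketch: combine text, graph, and metadata features."""
    def __init__(self, vocab, emb=100, hid=128, graph_dim=64, meta_dim=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Sequential(
            nn.Linear(hid + 2 * graph_dim + meta_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),   # P(conflict) via sigmoid at the caller
        )

    def forward(self, post_ids, src_graph, dst_graph, meta):
        _, (h, _) = self.lstm(self.embed(post_ids))
        feats = torch.cat([h[-1], src_graph, dst_graph, meta], dim=-1)
        return self.out(feats).squeeze(-1)
```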

193 citations


Proceedings Article
01 Jan 2018
TL;DR: This work introduces a framework to efficiently make predictions about conjunctive logical queries -- a flexible but tractable subset of first-order logic -- on incomplete knowledge graphs and demonstrates the utility of this framework in two application studies on real-world datasets with millions of relations.
Abstract: Learning low-dimensional embeddings of knowledge graphs is a powerful approach used to predict unobserved or missing edges between entities. However, an open challenge in this area is developing techniques that can go beyond simple edge prediction and handle more complex logical queries, which might involve multiple unobserved edges, entities, and variables. For instance, given an incomplete biological knowledge graph, we might want to predict "what drugs are likely to target proteins involved with both diseases X and Y?" -- a query that requires reasoning about all possible proteins that might interact with diseases X and Y. Here we introduce a framework to efficiently make predictions about conjunctive logical queries -- a flexible but tractable subset of first-order logic -- on incomplete knowledge graphs. In our approach, we embed graph nodes in a low-dimensional space and represent logical operators as learned geometric operations (e.g., translation, rotation) in this embedding space. By performing logical operations within a low-dimensional embedding space, our approach achieves a time complexity that is linear in the number of query variables, compared to the exponential complexity required by a naive enumeration-based approach. We demonstrate the utility of this framework in two application studies on real-world datasets with millions of relations: predicting logical relationships in a network of drug-gene-disease interactions and in a graph-based representation of social interactions derived from a popular web forum.
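
The geometric-operator idea can be sketched compactly: relation "projection" as a learned relation-specific map and conjunction as a learned pooling over branch embeddings, with answers ranked by dot product so query cost stays linear. The parameterization below is a hedged approximation, not the paper's exact model.

```python
import torch
import torch.nn as nn

class QueryEmbedder(nn.Module):
    def __init__(self, n_entities, n_relations, dim=64):
        super().__init__()
        self.entity = nn.Embedding(n_entities, dim)
        # One learned linear projection per relation type.
        self.project = nn.ModuleList(
            [nn.Linear(dim, dim, bias=False) for _ in range(n_relations)])
        self.intersect = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                       nn.Linear(dim, dim))

    def follow(self, q, relation):
        """Embed 'nodes reachable from q via relation'."""
        return self.project[relation](q)

    def conjoin(self, queries):
        """Embed the conjunction (intersection) of several branch queries."""
        return self.intersect(torch.stack(queries).mean(dim=0))

    def score(self, q, entity_ids):
        """Rank candidate answers by dot product; linear in query size."""
        return self.entity(entity_ids) @ q

# e.g. "proteins targeted by drugs linked to disease X AND disease Y":
# qx = model.follow(model.entity(disease_x), rel_assoc)
# qy = model.follow(model.entity(disease_y), rel_assoc)
# q = model.conjoin([qx, qy]); scores = model.score(q, all_protein_ids)
```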

180 citations


Journal ArticleDOI
TL;DR: This work performs the largest behavioral study of citations to date, analyzing how scientific works frame their contributions through different types of citations and how this framing affects the field as a whole; changes in citation framing are then used to show that the field of NLP is undergoing a significant increase in consensus.
Abstract: Citations have long been used to characterize the state of a scientific field and to identify influential works. However, writers use citations for different purposes, and this varied purpose influ...

176 citations


Posted Content
TL;DR: This paper investigates the role of context in an LSTM LM, through ablation studies, and analyzes the increase in perplexity when prior context words are shuffled, replaced, or dropped to provide a better understanding of how neural LMs use their context.
Abstract: We know very little about how neural language models (LM) use prior linguistic context. In this paper, we investigate the role of context in an LSTM LM, through ablation studies. Specifically, we analyze the increase in perplexity when prior context words are shuffled, replaced, or dropped. On two standard datasets, Penn Treebank and WikiText-2, we find that the model is capable of using about 200 tokens of context on average, but sharply distinguishes nearby context (recent 50 tokens) from the distant history. The model is highly sensitive to the order of words within the most recent sentence, but ignores word order in the long-range context (beyond 50 tokens), suggesting the distant past is modeled only as a rough semantic field or topic. We further find that the neural caching model (Grave et al., 2017b) especially helps the LSTM to copy words from within this distant context. Overall, our analysis not only provides a better understanding of how neural LMs use their context, but also sheds light on recent success from cache-based models.

113 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: This paper develops error generation processes using a neural sequence transduction model trained to translate clean examples to their noisy counterparts, and proposes beam search noising procedures to synthesize additional noisy examples that human evaluators were nearly unable to distinguish from nonsynthesized examples.
Abstract: Translation-based methods for grammar correction that directly map noisy, ungrammatical text to their clean counterparts are able to correct a broad range of errors; however, such techniques are bottlenecked by the need for a large parallel corpus of noisy and clean sentence pairs. In this paper, we consider synthesizing parallel data by noising a clean monolingual corpus. While most previous approaches introduce perturbations using features computed from local context windows, we instead develop error generation processes using a neural sequence transduction model trained to translate clean examples to their noisy counterparts. Given a corpus of clean examples, we propose beam search noising procedures to synthesize additional noisy examples that human evaluators were nearly unable to discriminate from nonsynthesized examples. Surprisingly, when trained on additional data synthesized using our best-performing noising scheme, our model approaches the same performance as when trained on additional nonsynthesized data.
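
One way to realize such a noising procedure is to jitter hypothesis scores during beam decoding of the clean-to-noisy model, so the beam keeps plausible corruptions rather than only the single most likely one. The sketch below assumes a step(seq) interface returning (token, log-prob) candidates; it is an illustration under that assumption, not the authors' implementation.

```python
import heapq
import random

def noised_beam_search(step, start, eos, beam=8, penalty=1.0, max_len=50):
    """Beam-decode a clean->noisy transducer, randomly perturbing scores
    so several distinct plausible corruptions survive in the beam."""
    hyps = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        cand = []
        for score, seq in hyps:
            for tok, logp in step(seq):          # assumed model interface
                s = score + logp - penalty * random.random()
                cand.append((s, seq + [tok]))
        hyps = heapq.nlargest(beam, cand, key=lambda c: c[0])
        finished += [h for h in hyps if h[1][-1] == eos]
        hyps = [h for h in hyps if h[1][-1] != eos]
        if not hyps:
            break
    return [seq for _, seq in
            heapq.nlargest(beam, finished, key=lambda c: c[0])]
```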

112 citations


Proceedings ArticleDOI
01 Aug 2018
TL;DR: This work introduces embedding-based methods for cross-lingually projecting English frames to Russian, and offers new ways to identify subtle media manipulation strategies at the intersection of agenda-setting and framing.
Abstract: Amidst growing concern over media manipulation, NLP attention has focused on overt strategies like censorship and “fake news”. Here, we draw on two concepts from political science literature to explore subtler strategies for government media manipulation: agenda-setting (selecting what topics to cover) and framing (deciding how topics are covered). We analyze 13 years (100K articles) of the Russian newspaper Izvestia and identify a strategy of distraction: articles mention the U.S. more frequently in the month directly following an economic downturn in Russia. We introduce embedding-based methods for cross-lingually projecting English frames to Russian, and discover that these articles emphasize U.S. moral failings and threats to the U.S. Our work offers new ways to identify subtle media manipulation strategies at the intersection of agenda-setting and framing.
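
Once English and Russian embeddings are aligned into a shared space, projecting a frame lexicon reduces to nearest-neighbor search. A minimal sketch under that assumption follows; the frame words, vocabulary, and variable names are placeholders.

```python
import numpy as np

def project_lexicon(en_frame_words, en_vecs, ru_matrix, ru_words, k=10):
    """Return the k Russian words closest to the frame's English centroid.
    en_vecs: dict word -> vector; ru_matrix: [n, d] array aligned with
    ru_words. Both spaces are assumed pre-aligned (e.g., via MUSE)."""
    centroid = np.mean([en_vecs[w] for w in en_frame_words if w in en_vecs],
                       axis=0)
    centroid /= np.linalg.norm(centroid)
    M = ru_matrix / np.linalg.norm(ru_matrix, axis=1, keepdims=True)
    sims = M @ centroid
    return [ru_words[i] for i in np.argsort(-sims)[:k]]

# Hypothetical usage:
# russian_moral_frame = project_lexicon(
#     ["immoral", "corrupt", "dishonest"], en_vectors, ru_matrix, ru_vocab)
```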

101 citations


Proceedings ArticleDOI
TL;DR: In this article, the authors study inter-community interactions across 36,000 communities on Reddit, examining cases where users of one community are mobilized by negative sentiment to comment in another community.
Abstract: Users organize themselves into communities on web platforms. These communities can interact with one another, often leading to conflicts and toxic interactions. However, little is known about the mechanisms of interactions between communities and how they impact users. Here we study intercommunity interactions across 36,000 communities on Reddit, examining cases where users of one community are mobilized by negative sentiment to comment in another community. We show that such conflicts tend to be initiated by a handful of communities---less than 1% of communities start 74% of conflicts. While conflicts tend to be initiated by highly active community members, they are carried out by significantly less active members. We find that conflicts are marked by formation of echo chambers, where users primarily talk to other users from their own community. In the long-term, conflicts have adverse effects and reduce the overall activity of users in the targeted communities. Our analysis of user interactions also suggests strategies for mitigating the negative impact of conflicts---such as increasing direct engagement between attackers and defenders. Further, we accurately predict whether a conflict will occur by creating a novel LSTM model that combines graph embeddings, user, community, and text features. This model can be used to create early-warning systems for community moderators to prevent conflicts. Altogether, this work presents a data-driven view of community interactions and conflict, and paves the way towards healthier online communities.

84 citations


Proceedings Article
01 May 2018
TL;DR: A multi-genre corpus of more than 25M comments from five socially and topically diverse sources, tagged for the gender of the addressee, enables the study of socially important questions like gender bias and has potential uses for downstream applications such as dialogue systems, gender detection or obfuscation, and debiasing language generation.
Abstract: Like many social variables, gender pervasively influences how people communicate with one another. However, prior computational work has largely focused on linguistic gender difference and communications about gender, rather than communications directed to people of that gender, in part due to lack of data. Here, we fill a critical need by introducing a multi-genre corpus of more than 25M comments from five socially and topically diverse sources tagged for the gender of the addressee. Using these data, we describe pilot studies on how differential responses to gender can be measured and analyzed and present 30k annotations for the sentiment and relevance of these responses, showing that across our datasets responses to women are more likely to be emotive and about the speaker as an individual (rather than about the content being responded to). Our dataset enables studying socially important questions like gender bias, and has potential uses for downstream applications such as dialogue systems, gender detection or obfuscation, and debiasing language generation.
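
The differential-response measurements this corpus enables are straightforward to set up. A small sketch follows, with a hypothetical CSV schema (columns addressee_gender and sentiment) standing in for the released annotations.

```python
import pandas as pd
from scipy import stats

# Hypothetical export of the annotated responses.
df = pd.read_csv("responses.csv")   # columns: addressee_gender, sentiment

to_women = df.loc[df.addressee_gender == "W", "sentiment"]
to_men = df.loc[df.addressee_gender == "M", "sentiment"]

# Welch's t-test on mean response sentiment by addressee gender.
t, p = stats.ttest_ind(to_women, to_men, equal_var=False)
print(f"mean(W)={to_women.mean():.3f}  mean(M)={to_men.mean():.3f}  p={p:.2g}")
```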

69 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: The first benchmark comparison of previously proposed coherence models for detecting symptoms of schizophrenia is presented, evaluated on a new dataset of recorded interviews between subjects and clinicians, and a novel computational model for reference incoherence based on ambiguous pronoun usage is proposed.
Abstract: Schizophrenia is a mental disorder that afflicts an estimated 0.7% of adults worldwide. It affects many areas of mental function, often evident from incoherent speech. Diagnosing schizophrenia relies on subjective judgments resulting in disagreements even among trained clinicians. Recent studies have proposed the use of natural language processing for diagnosis by drawing on automatically-extracted linguistic features like discourse coherence and lexicon. Here, we present the first benchmark comparison of previously proposed coherence models for detecting symptoms of schizophrenia and evaluate their performance on a new dataset of recorded interviews between subjects and clinicians. We also present two alternative coherence metrics based on modern sentence embedding techniques that outperform the previous methods on our dataset. Lastly, we propose a novel computational model for reference incoherence based on ambiguous pronoun usage and show that it is a highly predictive feature on our data. While the number of subjects is limited in this pilot study, our results suggest new directions for diagnosing common symptoms of schizophrenia.
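
A coherence metric of the kind described, averaging the similarity of adjacent sentence embeddings, fits in a few lines. The encoder below (sentence-transformers) is a convenient stand-in, not necessarily the embedding model the authors used.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def coherence(sentences):
    """Mean cosine similarity between consecutive sentence embeddings;
    lower scores suggest more tangential, less coherent speech."""
    E = encoder.encode(sentences)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return float(np.mean(np.sum(E[:-1] * E[1:], axis=1)))

# print(coherence(["I went to the store.", "They were out of milk."]))
```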

Proceedings ArticleDOI
01 Jun 2018
TL;DR: Two deep learning algorithms are introduced that induce lexicons that are more predictive and less confound-related than those of standard feature weighting and lexicon induction techniques like regression and log odds; the algorithms are used to induce lexicons that are predictive of timely responses to consumer complaints, enrollment from course descriptions, and sales from product descriptions.
Abstract: NLP algorithms are increasingly used in computational social science to take linguistic observations and predict outcomes like human preferences or actions. Making these social models transparent and interpretable often requires identifying features in the input that predict outcomes while also controlling for potential confounds. We formalize this need as a new task: inducing a lexicon that is predictive of a set of target variables yet uncorrelated to a set of confounding variables. We introduce two deep learning algorithms for the task. The first uses a bifurcated architecture to separate the explanatory power of the text and confounds. The second uses an adversarial discriminator to force confound-invariant text encodings. Both elicit lexicons from learned weights and attentional scores. We use them to induce lexicons that are predictive of timely responses to consumer complaints (controlling for product), enrollment from course descriptions (controlling for subject), and sales from product descriptions (controlling for seller). In each domain our algorithms pick words that are associated with narrative persuasion and that are more predictive and less confound-related than those of standard feature weighting and lexicon induction techniques like regression and log odds.
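
The adversarial variant can be sketched with a gradient-reversal layer: the encoder learns text representations that predict the outcome while denying a confound discriminator any signal, and attention weights over words then supply the lexicon. Layer sizes and choices below are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, g):
        return -g            # flip gradients flowing back to the encoder

class AdversarialLexicon(nn.Module):
    def __init__(self, vocab, dim=64, n_confounds=5):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.attn = nn.Linear(dim, 1)
        self.outcome = nn.Linear(dim, 1)
        self.confound = nn.Linear(dim, n_confounds)

    def forward(self, ids):
        E = self.embed(ids)                           # [batch, len, dim]
        a = torch.softmax(self.attn(E).squeeze(-1), dim=-1)
        h = (a.unsqueeze(-1) * E).sum(dim=1)          # attention-pooled text
        return self.outcome(h), self.confound(GradReverse.apply(h)), a

# Train on outcome_loss + confound_loss; high-attention words then form
# the induced, confound-decorrelated lexicon.
```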

Posted Content
TL;DR: In this article, the authors introduce a framework to efficiently make predictions about conjunctive logical queries (a flexible but tractable subset of first-order logic) on incomplete knowledge graphs.
Abstract: Learning low-dimensional embeddings of knowledge graphs is a powerful approach used to predict unobserved or missing edges between entities. However, an open challenge in this area is developing techniques that can go beyond simple edge prediction and handle more complex logical queries, which might involve multiple unobserved edges, entities, and variables. For instance, given an incomplete biological knowledge graph, we might want to predict "what drugs are likely to target proteins involved with both diseases X and Y?" -- a query that requires reasoning about all possible proteins that might interact with diseases X and Y. Here we introduce a framework to efficiently make predictions about conjunctive logical queries -- a flexible but tractable subset of first-order logic -- on incomplete knowledge graphs. In our approach, we embed graph nodes in a low-dimensional space and represent logical operators as learned geometric operations (e.g., translation, rotation) in this embedding space. By performing logical operations within a low-dimensional embedding space, our approach achieves a time complexity that is linear in the number of query variables, compared to the exponential complexity required by a naive enumeration-based approach. We demonstrate the utility of this framework in two application studies on real-world datasets with millions of relations: predicting logical relationships in a network of drug-gene-disease interactions and in a graph-based representation of social interactions derived from a popular web forum.

Posted Content
TL;DR: This article analyzed 13 years of the Russian newspaper Izvestia and identified a strategy of distraction: articles mention the U.S. more frequently in the month directly following an economic downturn in Russia.
Abstract: Amidst growing concern over media manipulation, NLP attention has focused on overt strategies like censorship and "fake news". Here, we draw on two concepts from the political science literature to explore subtler strategies for government media manipulation: agenda-setting (selecting what topics to cover) and framing (deciding how topics are covered). We analyze 13 years (100K articles) of the Russian newspaper Izvestia and identify a strategy of distraction: articles mention the U.S. more frequently in the month directly following an economic downturn in Russia. We introduce embedding-based methods for cross-lingually projecting English frames to Russian, and discover that these articles emphasize U.S. moral failings and threats to the U.S. Our work offers new ways to identify subtle media manipulation strategies at the intersection of agenda-setting and framing.

Journal Article
TL;DR: A rational model of adjective use is described in which listeners explicitly reason about judgments made by different speakers, formalizing the notion of subjectivity as agreement between speakers and suggesting that adjective ordering can be explained by general principles of human communication and language processing.

Proceedings ArticleDOI
07 Sep 2018
TL;DR: This paper presents a new dataset for TAP, baselines, and a model that successfully uses an ILP to enforce the structural constraints of the problem.
Abstract: To understand a sentence like “whereas only 10% of White Americans live at or below the poverty line, 28% of African Americans do” it is important not only to identify individual facts, e.g., poverty rates of distinct demographic groups, but also the higher-order relations between them, e.g., the disparity between them. In this paper, we propose the task of Textual Analogy Parsing (TAP) to model this higher-order meaning. Given a sentence such as the one above, TAP outputs a frame-style meaning representation which explicitly specifies what is shared (e.g., poverty rates) and what is compared (e.g., White Americans vs. African Americans, 10% vs. 28%) between its component facts. Such a meaning representation can enable new applications that rely on discourse understanding such as automated chart generation from quantitative text. We present a new dataset for TAP, baselines, and a model that successfully uses an ILP to enforce the structural constraints of the problem.
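
To make the ILP step concrete, here is a toy sketch in PuLP over the example sentence: hypothetical per-span role scores, with two simplified structural constraints (one role per span; equal numbers of compared entities and quantities). The roles and scores are stand-ins for the paper's full frame structure.

```python
import pulp

spans = ["White Americans", "African Americans", "10%", "28%"]
roles = ["ENTITY", "QUANT", "NONE"]
score = {  # hypothetical scorer output: (span, role) -> log-score
    ("White Americans", "ENTITY"): 2.0, ("African Americans", "ENTITY"): 1.8,
    ("10%", "QUANT"): 2.2, ("28%", "QUANT"): 2.1,
}

prob = pulp.LpProblem("tap", pulp.LpMaximize)
x = {(s, r): pulp.LpVariable(f"x_{i}_{r}", cat="Binary")
     for i, s in enumerate(spans) for r in roles}
prob += pulp.lpSum(score.get((s, r), -1.0) * x[s, r]
                   for s in spans for r in roles)
for s in spans:                                    # one role per span
    prob += pulp.lpSum(x[s, r] for r in roles) == 1
prob += (pulp.lpSum(x[s, "ENTITY"] for s in spans) ==
         pulp.lpSum(x[s, "QUANT"] for s in spans))  # compared roles pair up
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([(s, r) for s in spans for r in roles if x[s, r].value() == 1])
```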

Journal ArticleDOI
TL;DR: It is demonstrated that the dialog structures produced by the tagger could reveal whether officers follow law enforcement norms like introducing themselves, explaining the reason for the stop, and asking permission for searches.
Abstract: We apply computational dialog methods to police body-worn camera footage to model conversations between police officers and community members in traffic stops. Relying on the theory of institutional...

Posted Content
05 Jun 2018
TL;DR: A framework is introduced to make predictions about conjunctive logical queries, i.e., subgraph relationships, on heterogeneous network data, highlighting how imposing logical structure can make network embeddings more useful for large-scale knowledge discovery.
Abstract: Learning vector embeddings of complex networks is a powerful approach used to predict missing or unobserved edges in network data. However, an open challenge in this area is developing techniques that can reason about $\textit{subgraphs}$ in network data, which can involve the logical conjunction of several edge relationships. Here we introduce a framework to make predictions about conjunctive logical queries---i.e., subgraph relationships---on heterogeneous network data. In our approach, we embed network nodes in a low-dimensional space and represent logical operators as learned geometric operations (e.g., translation, rotation) in this embedding space. We prove that a small set of geometric operations are sufficient to represent conjunctive logical queries on a network, and we introduce a series of increasingly strong implementations of these operators. We demonstrate the utility of this framework in two application studies on networks with millions of edges: predicting unobserved subgraphs in a network of drug-gene-disease interactions and in a network of social interactions derived from a popular web forum. These experiments demonstrate how our framework can efficiently make logical predictions such as "what drugs are likely to target proteins involved with both diseases X and Y?" Together our results highlight how imposing logical structure can make network embeddings more useful for large-scale knowledge discovery.

Posted Content
TL;DR: This article proposed Textual Analogy Parsing (TAP) to model the higher-order relations between the individual facts in a sentence, e.g., the disparity between the poverty rates of different demographic groups.
Abstract: To understand a sentence like "whereas only 10% of White Americans live at or below the poverty line, 28% of African Americans do" it is important not only to identify individual facts, e.g., poverty rates of distinct demographic groups, but also the higher-order relations between them, e.g., the disparity between them. In this paper, we propose the task of Textual Analogy Parsing (TAP) to model this higher-order meaning. The output of TAP is a frame-style meaning representation which explicitly specifies what is shared (e.g., poverty rates) and what is compared (e.g., White Americans vs. African Americans, 10% vs. 28%) between its component facts. Such a meaning representation can enable new applications that rely on discourse understanding such as automated chart generation from quantitative text. We present a new dataset for TAP, baselines, and a model that successfully uses an ILP to enforce the structural constraints of the problem.