Showing papers by "Dan Jurafsky" published in 2020


Journal ArticleDOI
TL;DR: This paper used text analysis and machine learning to answer a series of questions: How do we detect scientific innovations? Are underrepresented groups more likely to generate scientific innovations, and are the innovations of underrepresented groups adopted and rewarded?
Abstract: Prior work finds a diversity paradox: Diversity breeds innovation, yet underrepresented groups that diversify organizations have less successful careers within them. Does the diversity paradox hold for scientists as well? We study this by utilizing a near-complete population of ∼1.2 million US doctoral recipients from 1977 to 2015 and following their careers into publishing and faculty positions. We use text analysis and machine learning to answer a series of questions: How do we detect scientific innovations? Are underrepresented groups more likely to generate scientific innovations? And are the innovations of underrepresented groups adopted and rewarded? Our analyses show that underrepresented groups produce higher rates of scientific novelty. However, their novel contributions are devalued and discounted: For example, novel contributions by gender and racial minorities are taken up by other scholars at lower rates than novel contributions by gender and racial majorities, and equally impactful contributions of gender and racial minorities are less likely to result in successful scientific careers than for majority groups. These results suggest there may be unwarranted reproduction of stratification in academic careers that discounts diversity's role in innovation and partly explains the underrepresentation of some groups in academia.

449 citations


Proceedings Article
30 Apr 2020
TL;DR: This article proposed $k$NN-LMs, which extend a pre-trained neural language model by linearly interpolating it with a $k$-nearest neighbors ($k$NN) model.
Abstract: We introduce $k$NN-LMs, which extend a pre-trained neural language model (LM) by linearly interpolating it with a $k$-nearest neighbors ($k$NN) model. The nearest neighbors are computed according to distance in the pre-trained LM embedding space, and can be drawn from any text collection, including the original LM training data. Applying this transformation to a strong Wikitext-103 LM, with neighbors drawn from the original training set, our $k$NN-LM achieves a new state-of-the-art perplexity of 15.79 -- a 2.9 point improvement with no additional training. We also show that this approach has implications for efficiently scaling up to larger training sets and allows for effective domain adaptation, by simply varying the nearest neighbor datastore, again without further training. Qualitatively, the model is particularly helpful in predicting rare patterns, such as factual knowledge. Together, these results strongly suggest that learning similarity between sequences of text is easier than predicting the next word, and that nearest neighbor search is an effective approach for language modeling in the long tail.
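The interpolation at the core of $k$NN-LMs is simple enough to sketch directly. The snippet below is a minimal illustration, assuming a brute-force nearest-neighbor search and made-up parameter names (k, lam, temp); the actual system uses an approximate index over a very large datastore.

```python
import numpy as np

def knn_lm_probs(p_lm, query, keys, values, vocab_size, k=8, lam=0.25, temp=1.0):
    """Sketch of kNN-LM: p = lam * p_knn + (1 - lam) * p_lm.

    p_lm:   (vocab_size,) next-token distribution from the base LM
    query:  (d,) hidden representation of the current context
    keys:   (N, d) cached context representations (the datastore keys)
    values: (N,) integer token that followed each cached context
    """
    # Brute-force L2 distances; a real system would use an approximate index.
    dists = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(dists)[:k]
    # Softmax over negative distances weights each retrieved neighbor.
    w = np.exp(-dists[nn] / temp)
    w /= w.sum()
    # Aggregate neighbor weights onto their target tokens.
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nn], w)
    return lam * p_knn + (1.0 - lam) * p_lm
```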

371 citations


Journal ArticleDOI
TL;DR: Analysis of a large corpus of sociolinguistic interviews with white and African American speakers demonstrates large racial disparities in the performance of five popular commercial ASR systems, and proposes strategies to reduce these performance differences and ensure speech recognition technology is inclusive.
Abstract: Automated speech recognition (ASR) systems, which use sophisticated machine-learning algorithms to convert spoken language to text, have become increasingly widespread, powering popular virtual assistants, facilitating automated closed captioning, and enabling digital dictation platforms for health care. Over the last several years, the quality of these systems has dramatically improved, due both to advances in deep learning and to the collection of large-scale datasets used to train the systems. There is concern, however, that these tools do not work equally well for all subgroups of the population. Here, we examine the ability of five state-of-the-art ASR systems—developed by Amazon, Apple, Google, IBM, and Microsoft—to transcribe structured interviews conducted with 42 white speakers and 73 black speakers. In total, this corpus spans five US cities and consists of 19.8 h of audio matched on the age and gender of the speaker. We found that all five ASR systems exhibited substantial racial disparities, with an average word error rate (WER) of 0.35 for black speakers compared with 0.19 for white speakers. We trace these disparities to the underlying acoustic models used by the ASR systems as the race gap was equally large on a subset of identical phrases spoken by black and white individuals in our corpus. We conclude by proposing strategies—such as using more diverse training datasets that include African American Vernacular English—to reduce these performance differences and ensure speech recognition technology is inclusive.
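The headline disparity is measured in word error rate. As a reference point, here is a minimal WER implementation (word-level Levenshtein distance divided by reference length); averaging it over each speaker group is what yields figures like the 0.35 vs. 0.19 reported above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```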

324 citations


Posted Content
TL;DR: A framework is introduced that makes accounting easier by providing a simple interface for tracking realtime energy consumption and carbon emissions, as well as generating standardized online appendices, and creates a leaderboard for energy efficient reinforcement learning algorithms to incentivize responsible research.
Abstract: Accurate reporting of energy and carbon usage is essential for understanding the potential climate impacts of machine learning research. We introduce a framework that makes this easier by providing a simple interface for tracking realtime energy consumption and carbon emissions, as well as generating standardized online appendices. Utilizing this framework, we create a leaderboard for energy efficient reinforcement learning algorithms to incentivize responsible research in this area as an example for other areas of machine learning. Finally, based on case studies using our framework, we propose strategies for mitigation of carbon emissions and reduction of energy consumption. By making accounting easier, we hope to further the sustainable development of machine learning experiments and spur more research into energy efficient algorithms.
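The framework itself is not reproduced here, but the accounting it automates reduces to energy drawn times grid carbon intensity. The sketch below uses assumed constants (PUE and kg CO2 per kWh) purely for illustration.

```python
def carbon_emissions_kg(avg_power_watts: float,
                        hours: float,
                        pue: float = 1.58,
                        carbon_intensity_kg_per_kwh: float = 0.4) -> float:
    """Back-of-the-envelope CO2 estimate for a training or RL run.

    avg_power_watts: mean draw of the hardware (GPU/CPU/DRAM) during the run
    pue: data-center power usage effectiveness (assumed value)
    carbon_intensity_kg_per_kwh: grid carbon intensity (assumed; varies by region)
    """
    energy_kwh = avg_power_watts * hours / 1000.0
    return energy_kwh * pue * carbon_intensity_kg_per_kwh

# e.g. a 300 W GPU running for 48 h under the assumed constants:
# carbon_emissions_kg(300, 48) -> roughly 9.1 kg CO2eq
```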

196 citations


Posted Content
TL;DR: This work introduces $k$-nearest-neighbor machine translation ($k$NN-MT), which predicts tokens with a nearest neighbor classifier over a large datastore of cached examples, using representations from a neural translation model for similarity search.
Abstract: We introduce $k$-nearest-neighbor machine translation ($k$NN-MT), which predicts tokens with a nearest neighbor classifier over a large datastore of cached examples, using representations from a neural translation model for similarity search. This approach requires no additional training and scales to give the decoder direct access to billions of examples at test time, resulting in a highly expressive model that consistently improves performance across many settings. Simply adding nearest neighbor search improves a state-of-the-art German-English translation model by 1.5 BLEU. $k$NN-MT allows a single model to be adapted to diverse domains by using a domain-specific datastore, improving results by an average of 9.2 BLEU over zero-shot transfer, and achieving new state-of-the-art results -- without training on these domains. A massively multilingual model can also be specialized for particular language pairs, with improvements of 3 BLEU for translating from English into German and Chinese. Qualitatively, $k$NN-MT is easily interpretable; it combines source and target context to retrieve highly relevant examples.
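The datastore construction and per-step retrieval can be sketched much like the $k$NN-LM example above; the function names, squared-L2 distance, and temperature below are illustrative choices, and the real system relies on approximate search (e.g., FAISS) to scale to billions of entries.

```python
import numpy as np

def build_datastore(decoder_states, target_tokens):
    """Cache (decoder hidden state -> next target token) pairs from parallel data.

    decoder_states: (N, d) representations at each target position
    target_tokens:  (N,)   the gold token emitted at that position
    """
    return np.asarray(decoder_states, dtype=np.float32), np.asarray(target_tokens)

def knn_mt_distribution(query, keys, values, vocab_size, k=16, temp=10.0):
    """Nearest-neighbor next-token distribution for one decoding step."""
    dists = ((keys - query) ** 2).sum(axis=1)        # squared L2 distances
    nn = np.argpartition(dists, k)[:k]               # k nearest entries
    w = np.exp(-dists[nn] / temp)
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nn], w)
    return p_knn  # interpolated with the base MT model, as in the kNN-LM sketch above
```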

141 citations


Proceedings ArticleDOI
01 Jul 2020
TL;DR: It is found that while state-of-the-art neural models are effective at high-level categorization of whether a given statement projects unwanted social bias, they are not effective at spelling out more detailed explanations in terms of Social Bias Frames.
Abstract: Warning: this paper contains content that may be offensive or upsetting. Language has the power to reinforce stereotypes and project social biases onto others. At the core of the challenge is that it is rarely what is stated explicitly, but rather the implied meanings, that frame people’s judgments about others. For example, given a statement that “we shouldn’t lower our standards to hire more women,” most listeners will infer the implicature intended by the speaker - that “women (candidates) are less qualified.” Most semantic formalisms, to date, do not capture such pragmatic implications in which people express social biases and power differentials in language. We introduce Social Bias Frames, a new conceptual formalism that aims to model the pragmatic frames in which people project social biases and stereotypes onto others. In addition, we introduce the Social Bias Inference Corpus to support large-scale modelling and evaluation with 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups. We then establish baseline approaches that learn to recover Social Bias Frames from unstructured text. We find that while state-of-the-art neural models are effective at high-level categorization of whether a given statement projects unwanted social bias (80% F1), they are not effective at spelling out more detailed explanations in terms of Social Bias Frames. Our study motivates future work that combines structured pragmatic inference with commonsense reasoning on social implications.

120 citations


Proceedings ArticleDOI
01 Nov 2020
TL;DR: It is concluded that underpowered experiments are common in the NLP literature; the paper gives an overview of best practices for power analysis in NLP and releases a series of notebooks to assist with future power analyses.
Abstract: Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature. In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state of the art models will not be adequately powered. Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied. For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point. To improve the situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.
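The released notebooks are not reproduced here, but a simulation-based power calculation of the kind the paper advocates is short. The sketch below, with assumed effect size, rating variance, and sample size, estimates the power of a two-sample t-test for a human rating study.

```python
import numpy as np
from scipy import stats

def simulated_power(effect=0.1, sd=1.0, n_per_system=300,
                    alpha=0.05, sims=2000, seed=0):
    """Power of a two-sample t-test to detect a small difference in mean
    human ratings between two systems (a sketch under assumed parameters)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(sims):
        a = rng.normal(0.0, sd, n_per_system)      # ratings for system A
        b = rng.normal(effect, sd, n_per_system)   # ratings for system B
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / sims

# simulated_power(effect=0.1, n_per_system=300) comes out well below the
# conventional 0.8 target, illustrating the point about underpowered
# human evaluation designs.
```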

75 citations


Journal ArticleDOI
03 Apr 2020
TL;DR: This paper proposed two strong encoder-decoder baselines for the task of neutralizing biased text in natural language generation, using a BERT-based classifier to identify problematic words and a join embedding through which the classifier can edit the hidden states of the encoder.
Abstract: Texts like news, encyclopedias, and some social media strive for objectivity. Yet bias in the form of inappropriate subjectivity — introducing attitudes via framing, presupposing truth, and casting doubt — remains ubiquitous. This kind of bias erodes our collective trust and fuels social conflict. To address this issue, we introduce a novel testbed for natural language generation: automatically bringing inappropriately subjective text into a neutral point of view (“neutralizing” biased text). We also offer the first parallel corpus of biased language. The corpus contains 180,000 sentence pairs and originates from Wikipedia edits that removed various framings, presuppositions, and attitudes from biased sentences. Last, we propose two strong encoder-decoder baselines for the task. A straightforward yet opaque concurrent system uses a BERT encoder to identify subjective words as part of the generation process. An interpretable and controllable modular algorithm separates these steps, using (1) a BERT-based classifier to identify problematic words and (2) a novel join embedding through which the classifier can edit the hidden states of the encoder. Large-scale human evaluation across four domains (encyclopedias, news headlines, books, and political speeches) suggests that these algorithms are a first step towards the automatic identification and reduction of bias.
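A heavily simplified sketch of the modular idea follows: the detector's per-token bias probability gates an edit vector that is added to the encoder's hidden states before decoding. The class name, projection, and shapes are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class JoinEmbeddingEditor(nn.Module):
    """Illustrative sketch: scale a learned edit vector by the classifier's
    per-token probability that the token introduces subjective bias, and add
    the result to the encoder hidden state."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.edit = nn.Parameter(torch.zeros(hidden_size))  # the "join embedding"
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, enc_states: torch.Tensor, bias_probs: torch.Tensor) -> torch.Tensor:
        # enc_states: (batch, seq, hidden); bias_probs: (batch, seq) in [0, 1]
        edit = self.proj(self.edit).view(1, 1, -1)
        return enc_states + bias_probs.unsqueeze(-1) * edit
```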

72 citations


Proceedings ArticleDOI
29 Sep 2020
TL;DR: This opinion paper formalizes how leaderboards -- in their current form -- can be poor proxies for the NLP community at large and advocates for more transparency on leaderboards, such as the reporting of statistics that are of practical concern.
Abstract: Benchmarks such as GLUE have helped drive advances in NLP by incentivizing the creation of more accurate models. While this leaderboard paradigm has been remarkably successful, a historical focus on performance-based evaluation has been at the expense of other qualities that the NLP community values in models, such as compactness, fairness, and energy efficiency. In this opinion paper, we study the divergence between what is incentivized by leaderboards and what is useful in practice through the lens of microeconomic theory. We frame both the leaderboard and NLP practitioners as consumers and the benefit they get from a model as its utility to them. With this framing, we formalize how leaderboards -- in their current form -- can be poor proxies for the NLP community at large. For example, a highly inefficient model would provide less utility to practitioners but not to a leaderboard, since it is a cost that only the former must bear. To allow practitioners to better estimate a model's utility to them, we advocate for more transparency on leaderboards, such as the reporting of statistics that are of practical concern (e.g., model size, energy efficiency, and inference latency).
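The microeconomic framing can be made concrete with a toy utility function; the functional form and cost weights below are assumptions chosen only to illustrate the divergence the paper describes.

```python
def practitioner_utility(accuracy: float,
                         latency_ms: float,
                         energy_kwh: float,
                         w_latency: float = 0.001,
                         w_energy: float = 0.05) -> float:
    """Toy utility: performance benefit minus costs the practitioner must bear."""
    return accuracy - w_latency * latency_ms - w_energy * energy_kwh

def leaderboard_utility(accuracy: float, *_ignored) -> float:
    """A leaderboard 'consumer' that values only accuracy."""
    return accuracy

# A large but slow model can rank higher on the leaderboard while providing
# lower utility to practitioners, which is the divergence the paper formalizes.
```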

63 citations


Journal ArticleDOI
TL;DR: Computational and corpus evidence is reported for the hypothesis that a prominent subset of these universal properties—those related to word order—result from a process of optimization for efficient communication among humans, trading off the need to reduce complexity with the need to reduce ambiguity.
Abstract: The universal properties of human languages have been the subject of intense study across the language sciences. We report computational and corpus evidence for the hypothesis that a prominent subset of these universal properties-those related to word order-result from a process of optimization for efficient communication among humans, trading off the need to reduce complexity with the need to reduce ambiguity. We formalize these two pressures with information-theoretic and neural-network models of complexity and ambiguity and simulate grammars with optimized word-order parameters on large-scale data from 51 languages. Evolution of grammars toward efficiency results in word-order patterns that predict a large subset of the major word-order correlations across languages.
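Schematically, the optimization described here can be written as minimizing $\mathrm{Complexity}(\theta) + \lambda\,\mathrm{Ambiguity}(\theta)$ over word-order parameters $\theta$, with the two terms instantiated by the paper's information-theoretic and neural-network models and $\lambda$ setting the trade-off; this rendering is a schematic summary rather than the paper's exact formulation.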

60 citations


Journal ArticleDOI
TL;DR: The authors apply techniques from natural language processing (lexicons, word embeddings, topic modeling, etc.) to educational research, using data science techniques to shed new light on fundamental questions in education research.
Abstract: Cutting-edge data science techniques can shed new light on fundamental questions in educational research. We apply techniques from natural language processing (lexicons, word embeddings, topic mode...

Posted Content
TL;DR: In this paper, the authors introduce two simple rewards to encourage the generation of factually complete and consistent radiology reports: one that encourages the system to generate radiology domain entities consistent with the reference, and one that uses natural language inference to encourage these entities to be described in inferentially consistent ways.
Abstract: Neural image-to-text radiology report generation systems offer the potential to improve radiology reporting by reducing the repetitive process of report drafting and identifying possible medical errors. However, existing report generation systems, despite achieving high performances on natural language generation metrics such as CIDEr or BLEU, still suffer from incomplete and inconsistent generations. Here we introduce two new simple rewards to encourage the generation of factually complete and consistent radiology reports: one that encourages the system to generate radiology domain entities consistent with the reference, and one that uses natural language inference to encourage these entities to be described in inferentially consistent ways. We combine these with the novel use of an existing semantic equivalence metric (BERTScore). We further propose a report generation system that optimizes these rewards via reinforcement learning. On two open radiology report datasets, our system substantially improved the F1 score of a clinical information extraction performance by +22.1 (Delta +63.9%). We further show via a human evaluation and a qualitative analysis that our system leads to generations that are more factually complete and consistent compared to the baselines.
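How the three signals might be combined for reinforcement learning can be sketched as a weighted sum; the weights and the three scoring callables below are placeholders, not the paper's exact reward.

```python
def report_reward(generated: str, reference: str,
                  entity_f1, nli_consistency, bertscore_f1,
                  w_ent: float = 1.0, w_nli: float = 1.0, w_sem: float = 1.0) -> float:
    """Combine factual-completeness, factual-consistency, and semantic-equivalence
    signals into a scalar reward for policy-gradient training.

    entity_f1, nli_consistency, and bertscore_f1 are assumed callables returning
    scores in [0, 1]; they stand in for the entity-match reward, the NLI-based
    consistency reward, and BERTScore described in the abstract."""
    return (w_ent * entity_f1(generated, reference)
            + w_nli * nli_consistency(generated, reference)
            + w_sem * bertscore_f1(generated, reference))
```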

Proceedings ArticleDOI
20 May 2020
TL;DR: Conpono, an inter-sentence objective for pretraining language models that models discourse coherence and the distance between sentences, is proposed; Conpono is shown to yield gains of 2%-6% absolute even for tasks that do not explicitly evaluate discourse: textual entailment, common sense reasoning, and reading comprehension.
Abstract: Recent models for unsupervised representation learning of text have employed a number of techniques to improve contextual word representations but have put little focus on discourse-level representations. We propose Conpono, an inter-sentence objective for pretraining language models that models discourse coherence and the distance between sentences. Given an anchor sentence, our model is trained to predict the text k sentences away using a sampled-softmax objective where the candidates consist of neighboring sentences and sentences randomly sampled from the corpus. On the discourse representation benchmark DiscoEval, our model improves over the previous state-of-the-art by up to 13% and on average 4% absolute across 7 tasks. Our model is the same size as BERT-Base, but outperforms the much larger BERT-Large model and other more recent approaches that incorporate discourse. We also show that Conpono yields gains of 2%-6% absolute even for tasks that do not explicitly evaluate discourse: textual entailment (RTE), common sense reasoning (COPA) and reading comprehension (ReCoRD).
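The inter-sentence objective reduces to a softmax over candidate sentences scored against the anchor. The sketch below assumes dot-product scoring over precomputed sentence encodings, with the anchor encoding taken to already reflect the distance k being predicted; variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def conpono_loss(anchor_vec: torch.Tensor,
                 candidate_vecs: torch.Tensor,
                 target_idx: int) -> torch.Tensor:
    """Sketch of the sentence-distance objective.

    anchor_vec:     (d,)   encoding of the anchor sentence
    candidate_vecs: (n, d) encodings of candidates: the true sentence k away,
                    its neighbors, and random sentences from the corpus
    target_idx:     index of the true candidate among the n
    """
    logits = candidate_vecs @ anchor_vec                 # dot-product scores
    return F.cross_entropy(logits.unsqueeze(0),
                           torch.tensor([target_idx]))  # sampled-softmax-style loss
```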

Journal ArticleDOI
03 Jun 2020
TL;DR: It is shown that automatic speech recognition is feasible in psychotherapy, but further improvements in accuracy are needed before widespread use, and that it may not be ready for individual-level safety surveillance.
Abstract: Accurate transcription of audio recordings in psychotherapy would improve therapy effectiveness, clinician training, and safety monitoring. Although automatic speech recognition software is commercially available, its accuracy in mental health settings has not been well described. It is unclear which metrics and thresholds are appropriate for different clinical use cases, which may range from population descriptions to individual safety monitoring. Here we show that automatic speech recognition is feasible in psychotherapy, but further improvements in accuracy are needed before widespread use. Our HIPAA-compliant automatic speech recognition system demonstrated a transcription word error rate of 25%. For depression-related utterances, sensitivity was 80% and positive predictive value was 83%. For clinician-identified harm-related sentences, the word error rate was 34%. These results suggest that automatic speech recognition may support understanding of language patterns and subgroup variation in existing treatments but may not be ready for individual-level safety surveillance.

Journal ArticleDOI
TL;DR: A computational linguistic framework for analyzing dehumanizing language is created by identifying linguistic correlates of salient components of dehumanization and is applied to analyze discussions of LGBTQ people in the New York Times from 1986 to 2015.
Abstract: Dehumanization is a pernicious psychological process that often leads to extreme intergroup bias, hate speech, and violence aimed at targeted social groups. Despite these serious consequences and the wealth of available data, dehumanization has not yet been computationally studied on a large scale. Drawing upon social psychology research, we create a computational linguistic framework for analyzing dehumanizing language by identifying linguistic correlates of salient components of dehumanization. We then apply this framework to analyze discussions of LGBTQ people in the New York Times from 1986 to 2015. Overall, we find increasingly humanizing descriptions of LGBTQ people over time. However, we find that the label homosexual has emerged to be much more strongly associated with dehumanizing attitudes than other labels, such as gay. Our proposed techniques highlight processes of linguistic variation and change in discourses surrounding marginalized groups. Furthermore, the ability to analyze dehumanizing language at a large scale has implications for automatically detecting and understanding media bias as well as abusive language online.
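One component of such a framework can be sketched as an embedding-association measure: how close a group label sits to a set of concept words in a given year's embedding space. The lexicon choice and cosine measure below are illustrative, not the authors' exact operationalization.

```python
import numpy as np

def label_concept_association(label_vec: np.ndarray, concept_vecs: np.ndarray) -> float:
    """Mean cosine similarity between a group label's embedding and a set of
    concept-word embeddings (e.g., a lexicon connoting disgust or negative
    evaluation), one way to quantify a linguistic correlate of dehumanization."""
    label = label_vec / np.linalg.norm(label_vec)
    concepts = concept_vecs / np.linalg.norm(concept_vecs, axis=1, keepdims=True)
    return float((concepts @ label).mean())
```

Computed per year for labels such as homosexual and gay, this kind of score yields the trend over time described in the abstract.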

Journal ArticleDOI
07 Aug 2020
TL;DR: The authors created a computational linguistic framework for analyzing dehumanizing language by identifying linguistic correlates of salient components of dehumanization and applied this framework to analyze discussions of LGBTQ people in the New York Times from 1986 to 2015.
Abstract: Dehumanization is a pernicious psychological process that often leads to extreme intergroup bias, hate speech, and violence aimed at targeted social groups. Despite these serious consequences and the wealth of available data, dehumanization has not yet been computationally studied on a large scale. Drawing upon social psychology research, we create a computational linguistic framework for analyzing dehumanizing language by identifying linguistic correlates of salient components of dehumanization. We then apply this framework to analyze discussions of LGBTQ people in the New York Times from 1986 to 2015. Overall, we find increasingly humanizing descriptions of LGBTQ people over time. However, we find that the label homosexual has emerged to be much more strongly associated with dehumanizing attitudes than other labels, such as gay. Our proposed techniques highlight processes of linguistic variation and change in discourses surrounding marginalized groups. Furthermore, the ability to analyze dehumanizing language at a large scale has implications for automatically detecting and understanding media bias as well as abusive language online.

Posted Content
TL;DR: The authors proposed an inter-sentence objective for pretraining language models that models discourse coherence and the distance between sentences, which achieves state-of-the-art performance on the DiscoEval benchmark.
Abstract: Recent models for unsupervised representation learning of text have employed a number of techniques to improve contextual word representations but have put little focus on discourse-level representations. We propose CONPONO, an inter-sentence objective for pretraining language models that models discourse coherence and the distance between sentences. Given an anchor sentence, our model is trained to predict the text k sentences away using a sampled-softmax objective where the candidates consist of neighboring sentences and sentences randomly sampled from the corpus. On the discourse representation benchmark DiscoEval, our model improves over the previous state-of-the-art by up to 13% and on average 4% absolute across 7 tasks. Our model is the same size as BERT-Base, but outperforms the much larger BERT- Large model and other more recent approaches that incorporate discourse. We also show that CONPONO yields gains of 2%-6% absolute even for tasks that do not explicitly evaluate discourse: textual entailment (RTE), common sense reasoning (COPA) and reading comprehension (ReCoRD).

Posted Content
TL;DR: Experiments on transfer between natural languages show that zero-shot performance on a test language is highly correlated with typological syntactic similarity to the training language, suggesting that representations induced from natural languages correspond to the cross-linguistic syntactic properties studied in linguistic typology.
Abstract: We propose transfer learning as a method for analyzing the encoding of grammatical structure in neural language models. We train LSTMs on non-linguistic data and evaluate their performance on natural language to assess which kinds of data induce generalizable structural features that LSTMs can use for natural language. We find that training on non-linguistic data with latent structure (MIDI music or Java code) improves test performance on natural language, despite no overlap in surface form or vocabulary. To pinpoint the kinds of abstract structure that models may be encoding to lead to this improvement, we run similar experiments with two artificial parentheses languages: one which has a hierarchical recursive structure, and a control which has paired tokens but no recursion. Surprisingly, training a model on either of these artificial languages leads to the same substantial gains when testing on natural language. Further experiments on transfer between natural languages controlling for vocabulary overlap show that zero-shot performance on a test language is highly correlated with typological syntactic similarity to the training language, suggesting that representations induced by pre-training correspond to the cross-linguistic syntactic properties. Our results provide insights into the ways that neural models represent abstract syntactic structure, and also about the kind of structural inductive biases which allow for natural language acquisition.

Journal Article
TL;DR: In this article, the authors present a framework that makes accurate reporting of energy and carbon usage easier by providing a simple interface for tracking real-time energy consumption and carbon emissions, as well as generating standardized online appendices.
Abstract: Accurate reporting of energy and carbon usage is essential for understanding the potential climate impacts of machine learning research. We introduce a framework that makes this easier by providing a simple interface for tracking realtime energy consumption and carbon emissions, as well as generating standardized online appendices. Utilizing this framework, we create a leaderboard for energy efficient reinforcement learning algorithms to incentivize responsible research in this area as an example for other areas of machine learning. Finally, based on case studies using our framework, we propose strategies for mitigation of carbon emissions and reduction of energy consumption. By making accounting easier, we hope to further the sustainable development of machine learning experiments and spur more research into energy efficient algorithms.

Posted Content
TL;DR: This work studies opinion-framing in the global warming debate, an increasingly partisan issue that has received little attention in NLP, and introduces DeSMOG, a dataset of stance-labeled GW sentences, and releases the stance dataset, model, and lexicons of framing devices.
Abstract: Citing opinions is a powerful yet understudied strategy in argumentation. For example, an environmental activist might say, "Leading scientists agree that global warming is a serious concern," framing a clause which affirms their own stance ("that global warming is serious") as an opinion endorsed ("[scientists] agree") by a reputable source ("leading"). In contrast, a global warming denier might frame the same clause as the opinion of an untrustworthy source with a predicate connoting doubt: "Mistaken scientists claim [...]." Our work studies opinion-framing in the global warming (GW) debate, an increasingly partisan issue that has received little attention in NLP. We introduce DeSMOG, a dataset of stance-labeled GW sentences, and train a BERT classifier to study novel aspects of argumentation in how different sides of a debate represent their own and each other's opinions. From 56K news articles, we find that similar linguistic devices for self-affirming and opponent-doubting discourse are used across GW-accepting and skeptic media, though GW-skeptical media shows more opponent-doubt. We also find that authors often characterize sources as hypocritical, by ascribing opinions expressing the author's own view to source entities known to publicly endorse the opposing view. We release our stance dataset, model, and lexicons of framing devices for future work on opinion-framing and the automatic detection of GW stance.

Proceedings ArticleDOI
01 Nov 2020
TL;DR: This article proposed transfer learning as a method for analyzing the encoding of grammatical structure in neural language models, finding that training on non-linguistic data with latent structure improves test performance on natural language, despite no overlap in surface form or vocabulary.
Abstract: We propose transfer learning as a method for analyzing the encoding of grammatical structure in neural language models. We train LSTMs on non-linguistic data and evaluate their performance on natural language to assess which kinds of data induce generalizable structural features that LSTMs can use for natural language. We find that training on non-linguistic data with latent structure (MIDI music or Java code) improves test performance on natural language, despite no overlap in surface form or vocabulary. To pinpoint the kinds of abstract structure that models may be encoding to lead to this improvement, we run similar experiments with two artificial parentheses languages: one which has a hierarchical recursive structure, and a control which has paired tokens but no recursion. Surprisingly, training a model on either of these artificial languages leads to the same substantial gains when testing on natural language. Further experiments on transfer between natural languages controlling for vocabulary overlap show that zero-shot performance on a test language is highly correlated with typological syntactic similarity to the training language, suggesting that representations induced by pre-training correspond to the cross-linguistic syntactic properties. Our results provide insights into the ways that neural models represent abstract syntactic structure, and also about the kind of structural inductive biases which allow for natural language acquisition.

Posted Content
TL;DR: This work applies spectral filters to the activations of a neuron across an input, producing filtered embeddings that perform well on part of speech tagging, dialog speech acts classification, or topic classification, while performing poorly on the other tasks.
Abstract: Language exhibits structure at different scales, ranging from subwords to words, sentences, paragraphs, and documents. To what extent do deep models capture information at these scales, and can we force them to better capture structure across this hierarchy? We approach this question by focusing on individual neurons, analyzing the behavior of their activations at different timescales. We show that signal processing provides a natural framework for separating structure across scales, enabling us to 1) disentangle scale-specific information in existing embeddings and 2) train models to learn more about particular scales. Concretely, we apply spectral filters to the activations of a neuron across an input, producing filtered embeddings that perform well on part of speech tagging (word-level), dialog speech acts classification (utterance-level), or topic classification (document-level), while performing poorly on the other tasks. We also present a prism layer for training models, which uses spectral filters to constrain different neurons to model structure at different scales. Our proposed BERT + Prism model can better predict masked tokens using long-range context and produces multiscale representations that perform better at utterance- and document-level tasks. Our methods are general and readily applicable to other domains besides language, such as images, audio, and video.
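The filtering step can be sketched with standard signal-processing tools: treat each neuron's activation sequence over token positions as a signal and band-limit it. The Butterworth filter and cutoff frequencies below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def filter_activations(activations: np.ndarray, band: str = "low",
                       low: float = 0.02, high: float = 0.2) -> np.ndarray:
    """Apply a Butterworth filter along the token axis of a (seq_len, n_neurons)
    activation matrix; cutoffs are in cycles per token and purely illustrative.
    Sequences should be at least a few dozen tokens long for filtfilt's padding."""
    if band == "low":        # slow-varying, document-scale structure
        b, a = butter(3, low, btype="lowpass")
    elif band == "high":     # fast-varying, word-scale structure
        b, a = butter(3, high, btype="highpass")
    else:                    # intermediate, utterance/sentence-scale structure
        b, a = butter(3, [low, high], btype="bandpass")
    return filtfilt(b, a, activations, axis=0)
```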

Proceedings ArticleDOI
28 Oct 2020
TL;DR: This paper studied opinion-framing in the global warming (GW) debate, an increasingly partisan issue that has received little attention in NLP and introduced DeSMOG, a dataset of stance-labeled GW sentences, and train a BERT classifier to study novel aspects of argumentation in how different sides of a debate represent their own and each other's opinions.
Abstract: Citing opinions is a powerful yet understudied strategy in argumentation. For example, an environmental activist might say, “Leading scientists agree that global warming is a serious concern,” framing a clause which affirms their own stance (“that global warming is serious”) as an opinion endorsed (“[scientists] agree”) by a reputable source (“leading”). In contrast, a global warming denier might frame the same clause as the opinion of an untrustworthy source with a predicate connoting doubt: “Mistaken scientists claim [...].” Our work studies opinion-framing in the global warming (GW) debate, an increasingly partisan issue that has received little attention in NLP. We introduce DeSMOG, a dataset of stance-labeled GW sentences, and train a BERT classifier to study novel aspects of argumentation in how different sides of a debate represent their own and each other’s opinions. From 56K news articles, we find that similar linguistic devices for self-affirming and opponent-doubting discourse are used across GW-accepting and skeptic media, though GW-skeptical media shows more opponent-doubt. We also find that authors often characterize sources as hypocritical, by ascribing opinions expressing the author’s own view to source entities known to publicly endorse the opposing view. We release our stance dataset, model, and lexicons of framing devices for future work on opinion-framing and the automatic detection of GW stance.

Posted Content
30 Apr 2020
TL;DR: It is found that models trained on structured data such as music and Java code have internal representations that help in modelling human language, and that, surprisingly, adding minimal amounts of structure to the training data makes a large difference in transfer to natural language.
Abstract: We propose a novel methodology for analyzing the encoding of grammatical structure in neural language models through transfer learning. We test how a language model can leverage its internal representations to transfer knowledge across languages and symbol systems. We train LSTMs on non-linguistic, structured data and test their performance on human language to assess which kinds of data induce generalizable encodings that LSTMs can use for natural language. We find that models trained on structured data such as music and Java code have internal representations that help in modelling human language, and that, surprisingly, adding minimal amounts of structure to the training data makes a large difference in transfer to natural language. Further experiments on transfer between human languages show that zero-shot performance on a test language is highly correlated with syntactic similarity to the training language, even after removing any vocabulary overlap. This suggests that the internal representations induced from natural languages are typologically coherent: they encode the features and differences outlined in typological studies. Our results provide insights into how neural networks represent linguistic structure, and also about the kinds of structural biases that give learners the ability to model language.

Posted Content
TL;DR: This paper proposed TextCause, an algorithm for estimating causal effects of linguistic properties, which leverages distant supervision to improve the quality of noisy proxies and a pre-trained language model (BERT) to adjust for the text.
Abstract: We consider the problem of using observational data to estimate the causal effects of linguistic properties. For example, does writing a complaint politely lead to a faster response time? How much will a positive product review increase sales? This paper addresses two technical challenges related to the problem before developing a practical method. First, we formalize the causal quantity of interest as the effect of a writer's intent, and establish the assumptions necessary to identify this from observational data. Second, in practice, we only have access to noisy proxies for the linguistic properties of interest -- e.g., predictions from classifiers and lexicons. We propose an estimator for this setting and prove that its bias is bounded when we perform an adjustment for the text. Based on these results, we introduce TextCause, an algorithm for estimating causal effects of linguistic properties. The method leverages (1) distant supervision to improve the quality of noisy proxies, and (2) a pre-trained language model (BERT) to adjust for the text. We show that the proposed method outperforms related approaches when estimating the effect of Amazon review sentiment on semi-simulated sales figures. Finally, we present an applied case study investigating the effects of complaint politeness on bureaucratic response times.
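A generic text-adjusted estimator conveys the flavor of the approach: regress the outcome on the treatment proxy plus text features and average the difference between predicted treated and control outcomes. The linear outcome model below stands in for the paper's BERT-based adjustment and is only a sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def text_adjusted_ate(text_features: np.ndarray,
                      treatment_proxy: np.ndarray,
                      outcome: np.ndarray) -> float:
    """Estimate the effect of a (proxied) linguistic property on an outcome,
    adjusting for the text itself. A simplified stand-in for the TextCause idea,
    with a linear outcome model in place of BERT-based adjustment."""
    X = np.column_stack([treatment_proxy, text_features])
    model = LinearRegression().fit(X, outcome)
    treated = np.column_stack([np.ones_like(treatment_proxy), text_features])
    control = np.column_stack([np.zeros_like(treatment_proxy), text_features])
    return float((model.predict(treated) - model.predict(control)).mean())
```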

Posted Content
TL;DR: This paper studied opinion-framing in the global warming debate and found that authors often characterize sources as hypocritical, by ascribing opinions expressing the author's own view to source entities known to publicly endorse the opposing view.
Abstract: Citing opinions is a powerful yet understudied strategy in argumentation. For example, an environmental activist might say, "Leading scientists agree that global warming is a serious concern," framing a clause which affirms their own stance ("that global warming is serious") as an opinion endorsed ("[scientists] agree") by a reputable source ("leading"). In contrast, a global warming denier might frame the same clause as the opinion of an untrustworthy source with a predicate connoting doubt: "Mistaken scientists claim [...]." Our work studies opinion-framing in the global warming (GW) debate, an increasingly partisan issue that has received little attention in NLP. We introduce Global Warming Stance Dataset (GWSD), a dataset of stance-labeled GW sentences, and train a BERT classifier to study novel aspects of argumentation in how different sides of a debate represent their own and each other's opinions. From 56K news articles, we find that similar linguistic devices for self-affirming and opponent-doubting discourse are used across GW-accepting and skeptic media, though GW-skeptical media shows more opponent-doubt. We also find that authors often characterize sources as hypocritical, by ascribing opinions expressing the author's own view to source entities known to publicly endorse the opposing view. We release our stance dataset, model, and lexicons of framing devices for future work on opinion-framing and the automatic detection of GW stance.

Posted Content
TL;DR: This paper studies the divergence between what is incentivized by leaderboards and what is useful in practice through the lens of microeconomic theory, framing both the leaderboard and NLP practitioners as consumers and the benefit they get from a model as its utility to them.
Abstract: Benchmarks such as GLUE have helped drive advances in NLP by incentivizing the creation of more accurate models. While this leaderboard paradigm has been remarkably successful, a historical focus on performance-based evaluation has been at the expense of other qualities that the NLP community values in models, such as compactness, fairness, and energy efficiency. In this opinion paper, we study the divergence between what is incentivized by leaderboards and what is useful in practice through the lens of microeconomic theory. We frame both the leaderboard and NLP practitioners as consumers and the benefit they get from a model as its utility to them. With this framing, we formalize how leaderboards -- in their current form -- can be poor proxies for the NLP community at large. For example, a highly inefficient model would provide less utility to practitioners but not to a leaderboard, since it is a cost that only the former must bear. To allow practitioners to better estimate a model's utility to them, we advocate for more transparency on leaderboards, such as the reporting of statistics that are of practical concern (e.g., model size, energy efficiency, and inference latency).

Posted Content
TL;DR: This paper found that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point, and that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied.
Abstract: Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature. In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state of the art models will not be adequately powered. Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied. For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point. To improve the situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.

Proceedings Article
01 Jan 2020
TL;DR: The authors apply spectral filters to the activations of a neuron across an input, producing filtered embeddings that perform well on part of speech tagging (word-level), dialog speech acts classification (utterance-level) or topic classification (document-level).
Abstract: Language exhibits structure at different scales, ranging from subwords to words, sentences, paragraphs, and documents. To what extent do deep models capture information at these scales, and can we force them to better capture structure across this hierarchy? We approach this question by focusing on individual neurons, analyzing the behavior of their activations at different timescales. We show that signal processing provides a natural framework for separating structure across scales, enabling us to 1) disentangle scale-specific information in existing embeddings and 2) train models to learn more about particular scales. Concretely, we apply spectral filters to the activations of a neuron across an input, producing filtered embeddings that perform well on part of speech tagging (word-level), dialog speech acts classification (utterance-level), or topic classification (document-level), while performing poorly on the other tasks. We also present a prism layer for training models, which uses spectral filters to constrain different neurons to model structure at different scales. Our proposed BERT + Prism model can better predict masked tokens using long-range context and produces multiscale representations that perform better at utterance- and document-level tasks. Our methods are general and readily applicable to other domains besides language, such as images, audio, and video.

Journal ArticleDOI
TL;DR: Like inexpensive restaurants, expensive American restaurants described healthy items as less appealing and less authentically American than standard foods, but to a lesser extent.
Abstract: Objective: Prior research shows that America's top-selling inexpensive casual dining restaurants use less appealing language to describe healthy menu items than standard items. This may suggest to diners that healthy options are less tasty and enjoyable. The present research asked whether expensive restaurants also use less appealing language to describe healthy items, or whether healthy items are described with equally appealing language as standard items in high status dining contexts. Method: Using Yelp, the name and description of every food item were recorded from the menus of 160 top-rated expensive restaurants across 8 U.S. cities (N items = 3,295; N words = 32,516). Healthy menu items were defined as salads and side vegetables, and standard items as all other dishes (excluding desserts), with high interrater reliability (κ = .89). Descriptive words were categorized into 22 predefined themes, and log likelihood analyses compared normalized theme frequencies from standard item and healthy item descriptions. Results: Healthy items were described with 4.8-times fewer American region words, 2.7-times fewer exciting words, 1.4-times fewer tasty words, and significantly fewer portion size, spicy, artisanal, and foreign region words. Unlike inexpensive restaurants, however, expensive restaurants did not use any health-focused themes to promote healthy items and used several appealing themes more frequently in healthy item descriptions. Conclusions: Like inexpensive restaurants, expensive American restaurants described healthy items as less appealing and less authentically American than standard foods, but to a lesser extent. Implications for ordering behavior and solutions for improving the appeal of healthy menu items are discussed. (PsycInfo Database Record (c) 2020 APA, all rights reserved).
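The log likelihood comparison named in the Method is the Dunning-style statistic common in corpus linguistics; a minimal version for one theme's counts in the two sets of descriptions follows, with the corpus labels as assumptions.

```python
import math

def log_likelihood_ratio(count_a: int, total_a: int,
                         count_b: int, total_b: int) -> float:
    """Dunning's G^2 for a theme that appears count_a times in corpus A
    (e.g., healthy-item descriptions, total_a words) and count_b times in
    corpus B (standard-item descriptions, total_b words)."""
    expected_a = total_a * (count_a + count_b) / (total_a + total_b)
    expected_b = total_b * (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2
```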