Journal ArticleDOI

Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach

TL;DR: This study, the largest of language and personality by an order of magnitude, found striking variations in language with personality, gender, and age.
Abstract: We analyzed 700 million words, phrases, and topic instances collected from the Facebook messages of 75,000 volunteers, who also took standard personality tests, and found striking variations in language with personality, gender, and age. In our open-vocabulary technique, the data itself drives a comprehensive exploration of language that distinguishes people, finding connections that are not captured with traditional closed-vocabulary word-category analyses. Our analyses shed new light on psychosocial processes yielding results that are face valid (e.g., subjects living in high elevations talk about the mountains), tie in with other research (e.g., neurotic people disproportionately use the phrase ‘sick of’ and the word ‘depressed’), suggest new hypotheses (e.g., an active life implies emotional stability), and give detailed insights (males use the possessive ‘my’ when mentioning their ‘wife’ or ‘girlfriend’ more often than females use ‘my’ with ‘husband’ or ‘boyfriend’). To date, this represents the largest study, by an order of magnitude, of language and personality.
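
The open-vocabulary approach described above reduces, at its core, to correlating the relative frequency of every word, phrase, and topic with an outcome across many users, then correcting for the flood of significance tests this produces. A minimal Python sketch of that first step; plain Pearson correlations stand in here for the paper's regression analysis with age and gender covariates, and the names and shapes are my own:

```python
from scipy.stats import pearsonr

def open_vocabulary_correlations(freqs, outcome):
    """Correlate each language feature with an outcome across users.

    freqs   : (n_users, n_features) array of relative frequencies of
              words, phrases, or topics
    outcome : (n_users,) array of scores, e.g. a trait or age
    Returns (feature_index, r, p_value) tuples, to be passed through a
    multiple-testing correction afterwards (see Benjamini-Hochberg below).
    """
    return [(j, *pearsonr(freqs[:, j], outcome))
            for j in range(freqs.shape[1])]
```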

Citations
Journal ArticleDOI
TL;DR: The aim of this review is to describe the chemical characteristics of compounds present in honey, their stability when heated or stored for long periods of time and the parameters of identity and quality.

810 citations

Journal ArticleDOI
TL;DR: This work demonstrates how to recruit participants using Facebook, incentivize them effectively, and maximize their engagement, and outlines the most important opportunities and challenges associated with using Facebook for research.
Abstract: Facebook is rapidly gaining recognition as a powerful research tool for the social sciences. It constitutes a large and diverse pool of participants, who can be selectively recruited for both online and offline studies. Additionally, it facilitates data collection by storing detailed records of its users' demographic profiles, social interactions, and behaviors. With participants' consent, these data can be recorded retrospectively in a convenient, accurate, and inexpensive way. Based on our experience in designing, implementing, and maintaining multiple Facebook-based psychological studies that attracted over 10 million participants, we demonstrate how to recruit participants using Facebook, incentivize them effectively, and maximize their engagement. We also outline the most important opportunities and challenges associated with using Facebook for research, provide several practical guidelines on how to successfully implement studies on Facebook, and finally, discuss ethical considerations.

709 citations


Cites background from "Personality, Gender, and Age in the..."

  • ...…participants are unlikely to have enough time, attention, and knowledge to reliably report on past events attended (e.g., used in Han et al., 2012), the natural language used in their day-to-day conversations (e.g., used in Schwartz et al., 2013), or the shape of their own egocentric network....

Proceedings ArticleDOI
01 Jun 2014
TL;DR: A novel method for gathering data on a range of mental illnesses quickly and cheaply is presented, followed by analyses of four disorders in particular: post-traumatic stress disorder, depression, bipolar disorder, and seasonal affective disorder.
Abstract: The ubiquity of social media provides a rich opportunity to enhance the data available to mental health clinicians and researchers, enabling a better-informed and better-equipped mental health field. We present analysis of mental health phenomena in publicly available Twitter data, demonstrating how rigorous application of simple natural language processing methods can yield insight into specific disorders as well as mental health writ large, along with evidence that as-of-yet undiscovered linguistic signals relevant to mental health exist in social media. We present a novel method for gathering data for a range of mental illnesses quickly and cheaply, then focus on analysis of four in particular: post-traumatic stress disorder (PTSD), depression, bipolar disorder, and seasonal affective disorder (SAD). We intend for these proof-of-concept results to inform the necessary ethical discussion regarding the balance between the utility of such data and the privacy of mental health related information.

570 citations
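
The data-gathering method the abstract calls quick and cheap relies on public self-statements of diagnosis. A hedged sketch of that style of collection; the regex wording, disorder list, and input format are illustrative assumptions, not the authors' exact implementation:

```python
import re

# Illustrative pattern for public self-statements of diagnosis; the
# phrasing and disorder list are assumptions made for this sketch.
DIAGNOSIS_RE = re.compile(
    r"\bi (?:was|am|have been) diagnosed with "
    r"(ptsd|depression|bipolar disorder|seasonal affective disorder)\b",
    re.IGNORECASE,
)

def find_self_reports(posts):
    """Yield (post, disorder) pairs for apparent self-reported diagnoses."""
    for post in posts:
        match = DIAGNOSIS_RE.search(post)
        if match:
            yield post, match.group(1).lower()
```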

Journal ArticleDOI
TL;DR: This article surveys the state of the art in natural language generation, giving an up-to-date synthesis of research on the core tasks in NLG and the architectures in which such tasks are organized.
Abstract: This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past two decades, especially in relation to new (usually data-driven) methods, as well as new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures adopted in which such tasks are organised; (b) highlight a number of recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges in NLG evaluation, relating them to similar challenges faced in other areas of NLP, with an emphasis on different evaluation methods and the relationships between them.

562 citations

Journal ArticleDOI
TL;DR: Results indicated that language-based assessments can constitute valid personality measures: they agreed with self-reports and informant reports of personality, added incremental validity over informant reports, adequately discriminated between traits, and were stable over 6-month intervals.
Abstract: Language use is a psychologically rich, stable individual difference with well-established correlations to personality. We describe a method for assessing personality using an open-vocabulary analysis of language from social media. We compiled the written language from 66,732 Facebook users and their questionnaire-based self-reported Big Five personality traits, and then we built a predictive model of personality based on their language. We used this model to predict the 5 personality factors in a separate sample of 4,824 Facebook users, examining (a) convergence with self-reports of personality at the domain- and facet-level; (b) discriminant validity between predictions of distinct traits; (c) agreement with informant reports of personality; (d) patterns of correlations with external criteria (e.g., number of friends, political attitudes, impulsiveness); and (e) test-retest reliability over 6-month intervals. Results indicated that language-based assessments can constitute valid personality measures: they agreed with self-reports and informant reports of personality, added incremental validity over informant reports, adequately discriminated between traits, exhibited patterns of correlations with external criteria similar to those found with self-reported personality, and were stable over 6-month intervals. Analysis of predictive language can provide rich portraits of the mental life associated with traits. This approach can complement and extend traditional methods, providing researchers with an additional measure that can quickly and cheaply assess large groups of participants with minimal burden.

528 citations


Cites background or methods from "Personality, Gender, and Age in the..."

  • ...In contrast, techniques from computational linguistics offer finer-grained, open-vocabulary methods for language analysis (e.g., Grimmer & Stewart, 2013; O’Connor, Bamman, & Smith, 2011; Schwartz et al., 2013b; Yarkoni, 2010)....

  • ...For example, Schwartz et al. (2013a) illustrated how the language from Twitter can be used to predict the average life satisfaction of U.S. counties....

  • ...…rich source of relevant trait cues (Tausczik & Pennebaker, 2010); it has been used to accurately predict personality by both human (Mehl et al., 2006) and automated judges (e.g., Iacobelli et al., 2011; Mairesse, Walker, Mehl, & Moore, 2007; Schwartz et al., 2013b; Sumner et al., 2012)....

  • ...To date, researchers have evaluated predictive models of psychological characteristics on the basis of predictive accuracy alone, that is, how accurately a model can predict self-reports of personality (e.g., Golbeck et al., 2011; Iacobelli et al., 2011; Schwartz et al., 2013b; Sumner et al., 2012)....

  • ...For example, Schwartz et al. (2013b) found that the words fucking and depression were both highly correlated with neuroticism, but depression is used far less frequently....

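The validation pipeline this abstract lays out, predicting questionnaire traits from language features and checking out-of-sample convergence, can be sketched as a cross-validated ridge regression. The feature matrix, regularization strength, and scoring below are illustrative assumptions, not the authors' exact pipeline:

```python
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def language_based_assessment(X, y):
    """Predict a self-reported Big Five trait from language features.

    X : (n_users, n_features) relative frequencies of words/phrases/topics
    y : (n_users,) questionnaire-based trait scores, e.g. openness
    Returns out-of-sample predictions and their convergence (r) with y.
    """
    model = Ridge(alpha=1.0)  # illustrative regularization strength
    preds = cross_val_predict(model, X, y, cv=10)
    r, _ = pearsonr(preds, y)
    return preds, r
```
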
References
Journal ArticleDOI
TL;DR: In this paper, a different approach to problems of multiple significance testing is presented, which calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate, which is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
Abstract: SUMMARY The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.

83,420 citations


"Personality, Gender, and Age in the..." refers methods in this paper

  • ...makes it harder than necessary to pass significance tests), for this result we applied the Benjamini-Hochberg false discovery rate procedure for multiple hypothesis testing [90]....

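The step-up procedure this abstract proves correct is short enough to state directly in code. A minimal numpy sketch, with an interface of my own:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of hypotheses rejected at false discovery rate q.

    Sort the m p-values, find the largest k with p_(k) <= (k/m) * q,
    and reject the k smallest p-values.
    """
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank meeting the bound
        reject[order[:k + 1]] = True
    return reject
```
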
Journal ArticleDOI
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.

30,570 citations
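
As a concrete illustration of fitting such a model: the citing study used the Mallet package, but the same three-level model is available in gensim, so the sketch below uses gensim on a toy corpus (tokens and topic count are illustrative):

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus: each "document" is one tokenized status update.
docs = [["hiking", "mountains", "weekend", "sunshine"],
        ["exam", "study", "coffee", "library"],
        ["coffee", "sunshine", "weekend"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# A low symmetric alpha favors fewer topics per document, mirroring the
# adjustment quoted below for short Facebook status updates. (The study
# derived 500 topics; 2 suffice for this toy corpus.)
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, alpha=0.30)

# Each topic is a distribution over words.
print(lda.show_topic(0, topn=5))
```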


"Personality, Gender, and Age in the..." refers background or methods in this paper

  • ...The LDA generative model assumes that documents (i.e. Facebook messages) contain a combination of topics, and that topics are a distribution of words; since the words in a document are known, the latent variable of topics can be estimated through Gibbs sampling [74]....

  • ...To use topics as features, we find the probability of a subject’s use of each topic: \(p(\mathrm{topic} \mid \mathrm{subject}) = \sum_{\mathrm{word} \in \mathrm{topic}} p(\mathrm{topic} \mid \mathrm{word})\, p(\mathrm{word} \mid \mathrm{subject})\), where p(word | subject) is the normalized word use by that subject and p(topic | word) is the probability of the topic given the word (a value provided from the LDA procedure)....

  • ...We use an implementation of the LDA algorithm provided by the Mallet package [75], adjusting one parameter (alpha = 0.30) to favor fewer topics per document, since individual Facebook status updates tend to contain fewer topics than the typical documents (newspaper or encyclopedia articles) to which LDA is applied....

  • ...The second type of linguistic feature, topics, consists of word clusters created using Latent Dirichlet Allocation (LDA) [72,73]....

  • ...Language use features include: (a) words and phrases: a sequence of 1 to 3 words found using an emoticon-aware tokenizer and a collocation filter (24,530 features) (b) topics: automatically derived groups of words for a single topic found using the Latent Dirichlet Allocation technique [72,75] (500 features)....

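Once word use is normalized per subject, the quoted formula \(p(\mathrm{topic} \mid \mathrm{subject}) = \sum_{\mathrm{word} \in \mathrm{topic}} p(\mathrm{topic} \mid \mathrm{word})\, p(\mathrm{word} \mid \mathrm{subject})\) is a single matrix product. A numpy sketch, with array shapes of my own framing:

```python
import numpy as np

def topic_usage(word_counts, p_topic_given_word):
    """Compute p(topic | subject) for every subject at once.

    word_counts        : (n_subjects, n_words) raw counts per subject
    p_topic_given_word : (n_words, n_topics), from the LDA procedure
    """
    # p(word | subject): normalize each subject's word use to sum to 1.
    p_word_given_subject = word_counts / word_counts.sum(axis=1, keepdims=True)
    # Sum over words: p(topic|subject) = sum_w p(topic|w) * p(w|subject).
    return p_word_given_subject @ p_topic_given_word
```
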
Proceedings Article
03 Jan 2001
TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

25,546 citations

Journal ArticleDOI
William S. Cleveland
TL;DR: Robust locally weighted regression as discussed by the authors is a method for smoothing a scatterplot, in which the fitted value at \(x_k\) is the value of a polynomial fit to the data using weighted least squares, where the weight for \((x_i, y_i)\) is large if \(x_i\) is close to \(x_k\) and small if it is not.
Abstract: The visual information on a scatterplot can be greatly enhanced, with little additional cost, by computing and plotting smoothed points. Robust locally weighted regression is a method for smoothing a scatterplot, \((x_i, y_i),\ i = 1, \ldots, n\), in which the fitted value at \(x_k\) is the value of a polynomial fit to the data using weighted least squares, where the weight for \((x_i, y_i)\) is large if \(x_i\) is close to \(x_k\) and small if it is not. A robust fitting procedure is used that guards against deviant points distorting the smoothed points. Visual, computational, and statistical issues of robust locally weighted regression are discussed. Several examples, including data on lead intoxication, are used to illustrate the methodology.

10,225 citations


"Personality, Gender, and Age in the..." refers methods in this paper

  • ...Lines are fit from first-order LOESS regression [81] controlled for gender....

  • ...When plotting language as a function of age, we fit first-order LOESS regression lines [81] to the age as the x-axis data and standardized frequency as the y-axis data over all users....

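The usage quoted above, first-order LOESS lines of standardized frequency against age, is locally weighted linear regression with robustness iterations, and statsmodels ships an implementation. A sketch on synthetic data (the data and the frac setting are illustrative):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
age = rng.uniform(13, 65, size=1000)  # x-axis: user age
# y-axis: synthetic standardized frequency of some language feature
freq = np.tanh((age - 30.0) / 10.0) + rng.normal(0.0, 0.3, size=1000)

# frac sets the local window; it=3 robustness iterations guard against
# deviant points distorting the fit, as in Cleveland's procedure.
smoothed = lowess(freq, age, frac=0.5, it=3)
# smoothed[:, 0] is age sorted ascending; smoothed[:, 1] is the fit.
```
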
Journal ArticleDOI
TL;DR: In this paper, an estimation procedure based on adding small positive quantities to the diagonal of X′X is proposed, together with the ridge trace, a method for showing in two dimensions the effects of nonorthogonality.
Abstract: In multiple regression it is shown that parameter estimates based on minimum residual sum of squares have a high probability of being unsatisfactory, if not incorrect, if the prediction vectors are not orthogonal. Proposed is an estimation procedure based on adding small positive quantities to the diagonal of X′X. Introduced is the ridge trace, a method for showing in two dimensions the effects of nonorthogonality. It is then shown how to augment X′X to obtain biased estimates with smaller mean square error.

8,091 citations
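
The estimator proposed here, adding small positive quantities to the diagonal of X′X, has a one-line closed form. A minimal numpy sketch:

```python
import numpy as np

def ridge_estimate(X, y, k=0.1):
    """Ridge regression: beta = (X'X + k*I)^(-1) X'y.

    The small positive k biases the estimate but can reduce its mean
    square error when the columns of X are far from orthogonal.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
```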