Predicting the Political Alignment of Twitter Users
Michael D. Conover, Bruno Gonc¸alves, Jacob Ratkiewicz, Alessandro Flammini and Filippo Menczer
Center for Complex Networks and Systems Research
School of Informatics and Computing
Indiana University, Bloomington
Abstract—The widespread adoption of social media for political communication creates unprecedented opportunities to monitor the opinions of large numbers of politically active individuals in real time. However, without a way to distinguish between users of opposing political alignments, conflicting signals at the individual level may, in the aggregate, obscure partisan differences in opinion that are important to political strategy. In this article we describe several methods for predicting the political alignment of Twitter users based on the content and structure of their political communication in the run-up to the 2010 U.S. midterm elections. Using a data set of 1,000 manually annotated individuals, we find that a support vector machine (SVM) trained on hashtag metadata outperforms an SVM trained on the full text of users’ tweets, yielding predictions of political affiliations with 91% accuracy. Applying latent semantic analysis to the content of users’ tweets we identify hidden structure in the data strongly associated with political affiliation, but do not find that topic detection improves prediction performance. All of these content-based methods are outperformed by a classifier based on the segregated community structure of political information diffusion networks (95% accuracy). We conclude with a practical application of this machinery to web-based political advertising, and outline several approaches to public opinion monitoring based on the techniques developed herein.
I. INTRODUCTION
Political advertising expenditures are steadily increasing [1],
and are estimated to have reached four billion US dollars
during the 2010 U.S. congressional midterm elections [2]. The
recent ‘Citizens United’ Supreme Court ruling, which removed
restrictions on corporate spending in political campaigns, is
likely to accelerate this trend. As a result, political campaigns
are placing more emphasis on social media tools as a low-cost platform for connecting with voters and promoting engagement among users in their political base.
This trend is also fueled in part by the fact that voters
are increasingly engaging with the political process online.
According to the Pew Internet and American Life Project,
fully 73% of adult internet users went online to get news
or information about politics in 2010, with more than one
in five adults (22%) using Twitter or social networking sites
for political purposes [3].
A popular microblogging platform with almost 200 million users [4], Twitter is an outlet for up-to-the-minute status updates, allowing campaigns, candidates and citizens to respond
in real-time to news and political events. From the perspective
of political mobilization, Twitter creates opportunities for viral
marketing efforts that can be leveraged to reach audiences
whose size is disproportionately large relative to the initial
investment.
Of particular interest to political campaigns is how the scale
of the Twitter platform creates the potential to monitor political
opinions in real time. For example, imagine a campaign
interested in tracking voter opinion relating to a specific piece
of legislation. One could easily envision applying sentiment analysis tools to the set of tweets containing keywords relating to the bill. However, without the ability to distinguish between
users with different political affiliations, aggregation over
conflicting partisan signals would likely obscure the nuances
most relevant to political strategy.
Here we explore several different approaches to the problem
of discriminating between users with left- and right-leaning
political alignment using manually annotated training data
covering nearly 1,000 Twitter users actively engaged in the
discussion of U.S. politics. Considering content-based features first, we show that a support vector machine trained
on user-generated metadata achieves 91% overall accuracy
when tasked with predicting whether a user’s tweets express
a ‘left’ or ‘right’ political alignment. Using latent semantic
analysis we identify hidden sources of structural variation
in user-generated metadata that are strongly associated with
individuals’ political alignment.
Taking an interaction-based perspective on political communication, we use network clustering algorithms to extract information about the individuals with whom each user communicates, and show that these topological properties can be used to improve classification accuracy even further. Specifically, we find that the community structure of the network of
political retweets can be used to predict the political alignment
of users with 95% accuracy.
We conclude with a proof-of-concept application based on these classifications, identifying the websites most frequently tweeted by left- and right-leaning users. We show
that domain popularity among politically active Twitter users
is not strongly correlated with overall traffic to a site, a
finding that could allow campaigns to increase returns on
advertising investments by targeting lower-traffic sites that are
very popular among politically active social media users.
II. BACKGROUND
A. The Twitter Platform
Twitter is a popular social networking and microblogging
site where users can broadcast short messages called ‘tweets’
to a global audience. A key feature of this platform is that,
by default, each user’s stream of real-time posts is public.
This fact, combined with its substantial population of users,

TABLE I
HASHTAGS RELATED TO #p2, #tcot, OR BOTH. TWEETS CONTAINING ANY OF THESE HASHTAGS WERE INCLUDED IN OUR SAMPLE.

Just #p2:   #casen #dadt #dc10210 #democrats #du1 #fem2 #gotv #kysen #lgf #ofa #onenation #p2b #pledge #rebelleft #truthout #vote #vote2010 #whyimvotingdemocrat #youcut
Both:       #cspj #dem #dems #desen #gop #hcr #nvsen #obama #ocra #p2 #p21 #phnm #politics #sgp #tcot #teaparty #tlot #topprog #tpp #twisters #votedem
Just #tcot: #912 #ampat #ftrs #glennbeck #hhrs #iamthemob #ma04 #mapoli #palin #palin12 #spwbt #tsot #tweetcongress #ucot #wethepeople
TABLE II
HASHTAGS EXCLUDED FROM THE ANALYSIS DUE TO AMBIGUOUS OR OVERLY BROAD MEANING.

Excl. from #p2:   #economy #gay #glbt #us #wc #lgbt
Excl. from both:  #israel #rs
Excl. from #tcot: #news #qsn #politicalhumor
renders Twitter an extremely valuable resource for commercial
and political data mining and research applications.
One of Twitter’s defining features is that each message is limited to 140 characters. In response to these space constraints, Twitter users have developed metadata annotation schemes which, as we demonstrate, compress substantial amounts of information into a comparatively tiny space. ‘Hashtags’, the metadata feature on which we focus in this paper, are short tokens used to indicate the topic or intended audience of a tweet [5]; for example, #dadt for ‘Don’t Ask Don’t Tell’ or #jlot for ‘Jewish Libertarians on Twitter’. Originally an informal practice, hashtag use is now integrated into the core architecture of the service, allowing users to search for these terms explicitly to retrieve a list of recent tweets about a specific topic.
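As an illustration of the metadata described above, a minimal sketch (not from the paper; the regex and helper name are ours) of pulling hashtag tokens out of tweet text:

```python
import re

# Simple pattern: '#' followed by word characters. This ignores edge cases
# such as Unicode tags, and is meant only as an illustration.
HASHTAG_RE = re.compile(r"#\w+")

def extract_hashtags(text):
    """Return lowercased hashtag tokens found in a tweet."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]

tags = extract_hashtags("Repeal #DADT now! #p2 #dadt")
```

Lowercasing collapses variants such as #DADT and #dadt into a single feature.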
In addition to broadcasting tweets to an audience of followers, Twitter users interact with one another primarily in
two public ways: retweets and mentions. Retweets act as
a form of endorsement, allowing individuals to rebroadcast
content generated by other users, thus raising the content’s
visibility [6]. Mentions serve a different function, as they
allow someone to address a specific user directly through
the public feed, or to refer to an individual in the third
person [7]. These two means of communication serve distinct
and complementary purposes and together act as the primary
mechanisms for explicit, public, user to user interaction on
Twitter.
The free-form nature of the platform, combined with its space limitations and resulting annotation vocabulary, has led to a multitude of uses. Some use the service as a forum
for personal updates and conversation, others as a platform
for receiving and broadcasting real-time news and still others
treat it as an outlet for social commentary and critical culture.
Of particular interest to this study is the role of Twitter as a
platform for political discourse.
B. Data Mining and Twitter
Because Twitter provides a constant stream of real-time updates from around the globe, much research
has focused on detecting noteworthy, unexpected events as
they rise to prominence in the public feed. Examples of this
work include the detection of influenza outbreaks [8], seismic
events [9], and the identification of breaking news stories [10]–
[12]. These applications are similar in many respects to
streaming data mining efforts focused on other media outlets,
such as Kleinberg and Leskovec’s ‘MemeTracker’ [13].
Its large scale and streaming nature make Twitter an ideal
platform for monitoring events in real time. However, many
of the characteristics that have led to Twitter’s widespread
adoption have also made it a prime target for spammers.
The detection of spam accounts and content is an active area
of research [14]–[16]. In related work we investigated the
purposeful spread of misinformation by politically-motivated
parties [17].
Another pertinent line of research in this area relates to
the application of sentiment analysis techniques to the Twitter
corpus. Work by Bollen et al. has shown that indicators derived
from measures of ‘mood’ states on Twitter are temporally
correlated with events such as presidential elections [18]. In
a highly relevant application, Goorha and Ungar used Twitter
data to develop sentiment analysis tools for the Dow Jones
Company to detect significant emerging trends relating to
specific products and companies [19]. Derivations of these techniques could be paired with the machinery from Section IV to accomplish the kind of real-time public opinion monitoring described in the introduction.
C. Data Mining and Political Speech
Formal political speech and activity have also been a target
for data mining applications. The seminal work of Poole and
Rosenthal applied multidimensional scaling to congressional
voting records to quantify the ideological leanings of members
of the first 99 United States Congresses [20]. Similar work by
Thomas et al. used transcripts of floor debates in the House
of Representatives to predict whether a speech segment was
provided in support of or opposition to a specific proposal [21].
Related efforts have been undertaken for more informal, web-based political speech, such as that found on blogs and blog comments [22], [23]. While these studies report reasonable performance, the Twitter stream provides several advantages compared to blog data: Twitter provides a centralized data source, updated in real time, with new sources automatically integrated into the corpus. Moreover, Twitter represents a broad range of individual voices, with tens of thousands of active contributors involved in the political discourse.
III. DATA AND METHODS
A. Political Tweets
The Twitter ‘gardenhose’ streaming API (dev.twitter.com/pages/streaming_api) provides a sample of about 10% of the entire Twitter feed in a machine-readable JSON format. Each

tweet entry is composed of several fields, including a unique
identifier, the text of the tweet, the time it was produced, the
username of the account that produced the tweet, and in the
case of retweets or mentions, the account names of the other
users associated with the tweet.
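A hedged sketch of parsing one such entry; the JSON field names used here (`id`, `text`, `created_at`, `user.screen_name`) follow the classic Twitter schema and are an assumption, not taken from the paper:

```python
import json

# One illustrative gardenhose-style entry (toy data, not a real tweet).
raw = ('{"id": 1, "text": "Get out the vote! #p2",'
       ' "created_at": "Mon Sep 20 2010",'
       ' "user": {"screen_name": "example_user"}}')

tweet = json.loads(raw)
# Keep the fields the analysis needs: identifier, author, and text.
record = (tweet["id"], tweet["user"]["screen_name"], tweet["text"])
```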
This analysis focuses on six weeks of gardenhose data collected as part of a related study on political polarization [24]. The data cover approximately 355 million tweets produced during the period between September 14th and November 1st, 2010, the run-up to the November 4th U.S. congressional midterm elections.
Among all tweets, we consider as political communication any tweet that contained at least one politically relevant hashtag. Political hashtags were identified by performing a simple tag co-occurrence discovery procedure. We began by seeding our sample with two widely used left- and right-leaning political hashtags, #p2 (“Progressives 2.0”) and #tcot (“Top Conservatives on Twitter”). For each of these, we identified the set of hashtags with which it co-occurred in at least one tweet, and ranked the results using the Jaccard coefficient. For the set of tweets S containing a seed hashtag, and the set of tweets T containing another hashtag, the Jaccard coefficient between S and T is given by

    σ(S, T) = |S ∩ T| / |S ∪ T|    (1)

Thus, when the tweets in which a hashtag and seed both occur make up a large portion of the tweets in which either occurs, the two are similar. Using a similarity threshold of 0.005 we identified 66 unique hashtags, eleven of which were excluded due to overly broad or ambiguous meaning (see Tables I and II). The set of all tweets containing any one of these hashtags, 252 thousand in total, is used in all of the following analyses.
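The co-occurrence seeding step can be sketched as follows; the helper name and toy data are ours, and each hashtag's tweet set is modelled as a set of tweet ids:

```python
# Rank candidate hashtags by their Jaccard coefficient (Eq. 1) with a seed tag.
def jaccard(s, t):
    """Jaccard coefficient of two tweet-id sets."""
    return len(s & t) / len(s | t)

# Toy data: tweets (by id) containing each hashtag.
tweets_by_tag = {
    "#p2":   {1, 2, 3, 4},
    "#dadt": {3, 4, 5},
    "#food": {9},
}

seed = tweets_by_tag["#p2"]
ranked = sorted(
    ((tag, jaccard(seed, ids))
     for tag, ids in tweets_by_tag.items() if tag != "#p2"),
    key=lambda pair: pair[1], reverse=True,
)
```

In the full pipeline, tags scoring above the 0.005 threshold against either seed would be kept as political hashtags.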
It’s important to note that politically-motivated individuals
often annotate content with hashtags whose primary audience
would not likely choose to see such information ahead of
time, a phenomenon known as content injection. As a result,
hashtags in this study are frequently associated with users from
both sides of the political spectrum, and therefore this seeding
algorithm does not create a trivial classification scenario [24].
B. Communication Networks
From the set of political tweets we also construct two
networks: one based on mention edges and one based on
retweet edges. In the mention network, nodes representing
users A and B are connected by a weighted, undirected edge if
either user mentioned the other during the analysis period. The
weight of each edge corresponds to the number of mentions
between the two users. The retweet network is constructed in
the same manner: an edge between A and B means that A
retweeted B or viceversa, with the weight representing the
number of retweets between the two. In total, the mention
network consists of 10,142 non-singleton nodes, with 7,175
nodes in its largest connected component (and 119 in the
next-largest). The retweet network is larger, consisting of
23,766 non-singleton nodes, with 18,470 nodes in its largest
connected component (and 102 nodes in the next-largest).
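The weighted, undirected edge-counting described above can be sketched in a few lines; representing the network as a Counter keyed by unordered user pairs is our choice for illustration, not the paper's:

```python
from collections import Counter

def build_retweet_network(retweet_pairs):
    """Weight each undirected user pair by the number of retweets between them."""
    weights = Counter()
    for a, b in retweet_pairs:            # (retweeter, retweeted) user pairs
        weights[frozenset((a, b))] += 1   # undirected: A->B and B->A coincide
    return weights

net = build_retweet_network([("A", "B"), ("B", "A"), ("A", "C")])
```

The mention network is built identically, with mention pairs in place of retweet pairs.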
TABLE III
CONTINGENCY TABLE OF INTER-ANNOTATOR AGREEMENT ON MANUAL CLASSIFICATIONS.

            Left   Ambiguous   Right
Left         303          51      23
Ambiguous     19          32      24
Right         22          59     423
TABLE IV
FINAL CLASS ASSIGNMENTS BASED ON RESOLUTION PROCEDURES
DESCRIBED IN TEXT.
Left Ambiguous Right
373 77 506
C. Labeled Data
Let us now describe the creation of the labeled data used in this study for training and testing our classifiers. We randomly selected a set of 1,000 users who were present in
the largest connected components of both the mention and
retweet networks. All users were individually classified by two
annotators working independently of one another.
Each annotator assigned users to one of three categories:
‘left’, ‘right’, or ‘ambiguous’, based on the content of his or
her tweets produced during the six week study period. The
groups primarily associated with a ‘left’ political alignment
were democrats and progressives; those primarily associated
with a ‘right’ political alignment were republicans, conservatives, libertarians, and the Tea Party. Users coded as ‘ambiguous’ may have been taking part in a political dialogue, but
it was difficult to make a clear determination about political
alignment from the content of their tweets.
Using this coding scheme each of the annotators labeled
1,000 random users. Forty-four accounts producing primarily non-English or spam tweets were considered irrelevant and
excluded from this analysis. Table III shows the classifications
of each annotator and their agreement.
Inter-annotator agreement is quite high for the ‘left’ and
‘right’ categories, but quite marginal for the ‘ambiguous’
category. This means that there were several users for whom
one annotator had the domain knowledge required to infer a
political alignment while the other did not. To address this
issue we assigned a label to a user when either annotator
detected information suggesting a political alignment in the
content of a user’s tweets. This mechanism was used to resolve
ambiguity in 16% of users. Among the 956 relevant users in
the sample there were 45 for whom the annotators explicitly
disagreed about political alignment (‘left’ vs. ‘right’). These
individuals were included in the ‘ambiguous’ category.
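The resolution procedure above amounts to a small decision rule; the function below is our reconstruction of the logic described in the text, not the authors' code:

```python
def resolve(label_a, label_b):
    """Resolve two annotators' labels ('left', 'right', or 'ambiguous')."""
    if label_a == label_b:
        return label_a
    partisan = {label_a, label_b} - {"ambiguous"}
    if len(partisan) == 1:       # one annotator saw a clear alignment
        return partisan.pop()
    return "ambiguous"           # explicit 'left' vs. 'right' disagreement

r1 = resolve("left", "ambiguous")   # partisan label overrides 'ambiguous'
r2 = resolve("left", "right")       # explicit disagreement stays ambiguous
```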
After this resolution procedure, 373 users were labeled by
the human annotators as expressing a ‘left’ political alignment,
506 users were labeled as ‘right’, and 77 were placed in the
‘ambiguous’ category, for a total of 956 users (Table IV).
Ambiguous classifications are a typical result of scarce data
at the individual level, but for completeness we report worst-
case bounds on accuracy for the scenario in which all of these
users are classified incorrectly.

IV. CLASSIFICATION
One of the central goals of this paper is to establish effective
features for discriminating politically left- and right-leaning
individuals. To this end we examine several features from
two broad categories: user-level features based on content
and network-level features based on the relationships between
users. Each feature set is represented in terms of a feature-user matrix M, where M_ij encodes the value for feature i with respect to user j.
For content-based classifications we use linear support
vector machines (SVMs) to discriminate between users in
the ‘left’ and ‘right’ classes. In the simple case of binary
classification, an SVM works by embedding data in a high-
dimensional space and attempting to find the hyperplane that
best separates the two classes [25]. Support vector machines
are widely used for document classification because they
are well-suited to classification tasks based on sparse, high-
dimensional data, such as those commonly associated with text
corpora [26].
To quantify performance for different feature sets we report
the confusion matrix for each classifier, as well as accuracy
scores based on 10-fold cross-validation. For a confusion
matrix containing true left (tl), true right (tr), false left (fl) and false right (fr), the accuracy of a classifier is defined by:

    accuracy = (tl + tr) / (tl + tr + fl + fr)    (2)

where tl is the number of left-leaning users who are correctly classified, and so on.
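Equation 2 translates directly to code; the 2×2 matrix layout assumed here ([[tl, fr], [fl, tr]]) is our convention for this sketch:

```python
def accuracy(conf):
    """Accuracy (Eq. 2) from a 2x2 confusion matrix [[tl, fr], [fl, tr]]."""
    (tl, fr), (fl, tr) = conf
    return (tl + tr) / (tl + tr + fl + fr)

# Hashtag-feature confusion matrix from Table V; the cross-validated
# accuracy reported in the text may differ slightly from this figure.
acc = accuracy([[331, 42], [41, 465]])
```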
A. Content Analysis
1) Full-Text: To establish a performance baseline, we train a support vector machine on a feature-user matrix corresponding to the TFIDF-weighted terms (unigrams) contained in
each user’s tweets [27]. In addition to common stopwords we
remove hashtags, mentions, and URLs from the set of terms
produced by all users, a step we take to facilitate comparison
with other feature sets. Additionally, we exclude terms that
occur only once in the entire corpus because they carry no
generalizable information and increase memory usage. After
these preprocessing steps, the resulting corpus contains 13,080
features, each representing a single term.
To make it clear how we compute vectors for each user and his associated tweets, let us define TFIDF in detail. The TFIDF score for term i with respect to user j is defined in terms of two components, term frequency (TF) and inverse document frequency (IDF). TF measures the relative importance of term i in the set of tweets produced by user j, and is defined as:

    TF_ij = n_ij / Σ_k n_kj    (3)

where n_ij is the number of times term i occurs in all tweets produced by user j, and Σ_k n_kj is the total number of terms in all tweets produced by user j. IDF discounts terms with high overall prominence across all users, and is defined as:

    IDF_i = log(|U| / |U_i|)    (4)
TABLE V
SUMMARY OF CONFUSION MATRICES AND ACCURACY SCORES FOR VARIOUS CLASSIFICATION FEATURES, WITH THE SECTIONS IN WHICH THEY ARE DISCUSSED.

Features          Conf. matrix        Accuracy   Section
Full-Text         [266 107; 75 431]   79.2%      § IV-A1
Hashtags          [331  42; 41 465]   90.8%      § IV-A2
Clusters          [367   6; 38 468]   94.9%      § IV-B
Clusters + Tags   [366   7; 38 468]   94.9%      § IV-B
where U is the set of all users, and U_i is the subset of users who produced term i. A term produced by every user has no discriminative power and its IDF is zero. The product TF_ij · IDF_i measures the extent to which term i occurs frequently in user j’s tweets without occurring in the tweets of too many other users.
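Equations 3 and 4 can be combined into a short TFIDF routine; the function name and toy corpus below are illustrative only:

```python
import math
from collections import Counter

def tfidf(users_terms):
    """TFIDF (Eqs. 3-4): map each user to {term: TF_ij * IDF_i}."""
    n_users = len(users_terms)
    # document frequency: number of users who produced each term
    df = Counter(t for terms in users_terms.values() for t in set(terms))
    scores = {}
    for user, terms in users_terms.items():
        counts, total = Counter(terms), len(terms)
        scores[user] = {t: (c / total) * math.log(n_users / df[t])
                        for t, c in counts.items()}
    return scores

scores = tfidf({"u1": ["vote", "vote", "tax"], "u2": ["vote", "news"]})
# 'vote' is produced by every user, so its IDF (and hence TFIDF) is zero
```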
The classification accuracy for this representation of the
data is 79%, and its confusion matrix is shown in Table V.
The lower accuracy bound for this approach, assuming that all
ambiguous users are incorrectly classified, is 72.6%.
2) Hashtags: Hashtags emerged organically within the
Twitter user community as a way of annotating topics and
threads of discussion. Since these tokens are intended to mark
the content of discussion, we might expect that they contain
substantial information about a user’s political leaning.
In this experiment we populate the feature-user matrix with values corresponding to the relative frequency with which user j used hashtag i. This value is equivalent to the TF measure
from Equation 3, but described in terms of hashtags rather
than unigrams. We note that weighting by IDF did not improve
performance. Eliminating hashtags used by only one user we
are left with 4,701 features. For this classification task we
report an accuracy of 90.8%; see Table V for the confusion
matrix. The lower bound on this approach, assuming that all
ambiguous users were misclassified, is 83.5%.
As evidenced by its higher accuracy score, a classifier that
uses hashtag metadata outperforms one trained on the unigram
baseline data. Analogous findings are observed in biomedical
document classification, where classifiers trained on abstracts
outperform those trained on the articles’ full text [28]. The
reasoning underlying this improvement is that abstracts are
necessarily brief and information rich. In the same way,
Twitter users must condense substantial semantic content into
hashtags, reducing noise and simplifying the classification
task.
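A minimal sketch of this hashtag feature representation (helper name and toy data are ours); in the full pipeline, tags used by only one user would first be dropped:

```python
from collections import Counter

def hashtag_features(users_tags, vocab):
    """Each user as a vector of relative hashtag frequencies (Eq. 3 TF)."""
    rows = {}
    for user, tags in users_tags.items():
        counts, total = Counter(tags), len(tags)
        rows[user] = [counts[t] / total for t in vocab]
    return rows

vocab = ["#p2", "#tcot"]
rows = hashtag_features({"u1": ["#p2", "#p2", "#tcot"], "u2": ["#tcot"]}, vocab)
```

These rows form the feature-user matrix on which the SVM is trained.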
3) Latent Semantic Analysis of Hashtags: Latent semantic
analysis (LSA) is a technique used in text mining to discover a
set of topics present in the documents of a corpus. Based on the
singular value decomposition, LSA is argued to address issues
of polysemy, synonymy, and lexical noise common in text

TABLE VI
MOST EXTREME HASHTAG COEFFICIENTS FOR SECOND LEFT SINGULAR
VECTOR. THIS LINEAR COMBINATION OF HASHTAGS APPEARS TO
CAPTURE VARIANCE ASSOCIATED WITH POLITICAL ALIGNMENT.
Hashtag Coeff. Hashtag Coeff.
#tcot 0.380 #p2 -0.914
#sgp 0.030 #dadt -0.071
#ocra 0.020 #p21 -0.042
#hhrs 0.013 #votedem -0.039
#twisters 0.012 #lgbt -0.038
#tlot 0.011 #p2b -0.032
#whyimvotingdemocrat 0.009 #topprog -0.027
#rs 0.005 #onenation -0.025
#ftrs 0.004 #dems -0.023
#ma04 0.004 #gop -0.021
#tpp 0.003 #hcr -0.017
corpora [29]. Given a feature-document matrix, the singular value decomposition, UΣV^t, produces a factorization in terms of two sets of orthogonal basis vectors, described by U and V^t. The left singular vectors, U, provide a vector basis for terms in the factorized representation, and the right singular vectors, V, provide a basis for the original documents, with the singular values of matrix Σ acting as scaling factors that identify the variance associated with each dimension. In practice, LSA is said to uncover hidden topics present in a corpus, a claim supported by the analytical work of Papadimitriou et al. [30].
We apply this technique to the hashtag-user matrix in an attempt to identify latent factors corresponding to political alignment. The coefficients of the linear combination of hashtags most strongly associated with the second left singular vector, shown in Table VI, suggest that one is present in the data. Hashtags with extreme coefficients for this dimension include #dadt for ‘Don’t Ask Don’t Tell’, #p2 for ‘Progressives 2.0’, #tcot for ‘Top Conservatives on Twitter’, and #ocra for ‘Organized Conservative Resistance Alliance’. The hashtag #whyimvotingdemocrat originally became a trending topic among left-leaning users, but was subsequently hijacked by right-leaning users to express sarcastic reasons they might vote for a Democratic candidate. A consequence of these coefficients is that users who use many left-leaning hashtags will have negative magnitude with respect to this dimension, and users who use many right-leaning hashtags will have positive magnitude in this dimension. Figure 1 shows clear separation between left- and right-leaning users in terms of the first and second right singular vectors.
A support vector machine trained on features describing
users in terms of the first two right singular hashtag vectors
does not improve accuracy compared to hashtag TF scores
alone. Expanding the feature space to the first three LSA
dimensions improves performance by an insignificant amount
(about 0.1%), and the addition of subsequent features only
degrades performance.
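The LSA step can be sketched with NumPy (an assumption; the paper does not name its tooling). Rows of the toy matrix are hashtags, columns are users, and projecting users onto the leading right singular vectors gives low-dimensional coordinates of the kind plotted in Figure 1:

```python
import numpy as np  # assumption: NumPy, not named in the paper

# Toy hashtag-user matrix M (rows = hashtags, columns = users).
M = np.array([
    [3.0, 2.0, 0.0],   # '#p2'-like tag, used by the first two users
    [0.0, 0.0, 4.0],   # '#tcot'-like tag, used only by the third
])

# SVD: columns of U are latent hashtag directions; rows of Vt are the
# right singular vectors that give each user's latent coordinates.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
user_coords = Vt[:2].T   # each user in the first two latent dimensions
```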
B. Network Analysis
The previous two feature sets are based on the content of
each user’s tweets. We might also choose to ignore this content
entirely, focusing instead on the relationships between users.
Fig. 1. Users plotted in the latent semantic space of the first and second
right singular vectors. Colors correspond to class labels.
Many social networks exhibit homophilic properties; that is, users prefer to connect to those more like themselves, and as a consequence structural information can be leveraged to infer properties about nodes that tend to associate with one another [31], [32]. In the following, we focus on the largest
connected component of the retweet network, as previous work
suggests that it may tend to segregate ideologically-opposed
users [24].
The cluster structure of the retweet network was established
by applying a community detection algorithm using the label
propagation method of Raghavan et al. [33]. Starting with an
initial arbitrary label (cluster membership), this greedy method
works by iteratively assigning to each node the label that is
shared by most of its neighbors. Ties are broken randomly
when they occur. Owing to this stochasticity, the label propagation method can return different cluster assignments for the
same graph, even with the same initial conditions. Empirical
analysis highlighted further instability resulting from random
starting conditions: the algorithm easily converges to local
optima.
To address this issue we used initial label assignments based
on the clusters produced by Newman’s leading eigenvector
modularity maximization method for two clusters [34], rather
than assigning labels at random. To verify that consistent
clusters are produced across different runs of the algorithm
for the same starting conditions, we repeated the analysis one
hundred times and compared the label assignments produced
at every run.
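A minimal, deterministic-by-seeding sketch of the label propagation step (our reconstruction, not the authors' code):

```python
import random
from collections import Counter

def label_propagation(neighbors, init_labels, rounds=10, seed=0):
    """Each node repeatedly adopts the label most common among its
    neighbors, with ties broken at random. The paper seeds labels from a
    modularity-based two-way split; here initial labels are passed in."""
    rng = random.Random(seed)
    labels = dict(init_labels)
    for _ in range(rounds):
        for node in neighbors:
            tally = Counter(labels[n] for n in neighbors[node])
            top = max(tally.values())
            labels[node] = rng.choice([l for l, c in tally.items() if c == top])
    return labels

# two loosely connected triangles
graph = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
labels = label_propagation(graph, {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"})
```

With this seeding the two triangles keep their initial two-cluster split, mirroring the stability the authors obtain from deterministic initial assignments.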
The similarity of two label assignments C and C′ over a graph with n nodes can be computed by the Adjusted Rand Index (ARI) [35] as follows. Arbitrarily number the two clusters of C as c_1, c_2, and likewise number the clusters of C′
