Real-time Detection of Content Polluters in Partially Observable
Twitter Networks
Mehwish Nasim
School of Mathematical Sciences
University of Adelaide
Adelaide, Australia
mehwish.nasim@adelaide.edu.au
Andrew Nguyen
School of Mathematical Sciences
University of Adelaide
Adelaide, Australia
andrew.nguyen03@adelaide.edu.au
Nick Lothian
Tyto.ai
Adelaide, Australia
nick.lothian@gmail.com
Robert Cope
School of Mathematical Sciences
University of Adelaide
Adelaide, Australia
robert.cope@adelaide.edu.au
Lewis Mitchell
School of Mathematical Sciences
University of Adelaide
Adelaide, Australia
lewis.mitchell@adelaide.edu.au
ABSTRACT
Content polluters, or bots that hijack a conversation for political or advertising purposes, are a known problem for event prediction, election forecasting, and distinguishing real news from fake news in social media data. Identifying this type of bot is particularly challenging, with state-of-the-art methods utilising large volumes of network data as features for machine learning models. Such datasets are generally not readily available in typical applications which stream social media data for real-time event prediction. In this work we develop a methodology to detect content polluters in social media datasets that are streamed in real time. Applying our method to the problem of civil unrest event prediction in Australia, we identify content polluters from individual tweets, without collecting social network or historical data from individual accounts. We identify some peculiar characteristics of these bots in our dataset and propose metrics for identification of such accounts. We then pose some research questions around this type of bot detection, including: how good Twitter is at detecting content polluters, and how well state-of-the-art methods perform in detecting bots in our dataset.
CCS CONCEPTS
• Information systems → Social networking sites; • Security and privacy → Social network security and privacy;
KEYWORDS
Civil unrest, Social bots, Content polluters, Missing links, Twitter
ACM Reference Format:
Mehwish Nasim, Andrew Nguyen, Nick Lothian, Robert Cope, and Lewis Mitchell. 2018. Real-time Detection of Content Polluters in Partially Observable Twitter Networks. In WWW '18 Companion: The 2018 Web Conference Companion, April 23–27, 2018, Lyon, France. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3184558.3191574

Work undertaken while at Data to Decisions CRC.

This paper is published under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '18 Companion, April 23–27, 2018, Lyon, France
© 2018 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC BY 4.0 License.
ACM ISBN 978-1-4503-5640-4/18/04.
https://doi.org/10.1145/3184558.3191574
1 INTRODUCTION
1.1 Motivation
Bots and content polluters in online social media affect the socio-political state of the world, from meddling in elections [4, 13, 37] to influencing US veterans [15]. In late September 2017, Twitter admitted to Congress that it had found 200 Russian accounts that overlapped with Facebook accounts which were used to sway Americans and create divisions during the elections held in 2016 [37]. Of course, some bots are useful as well, for instance accounts that tweet alerts to people about natural disasters. The problem arises when they try to influence people or spread misinformation. The importance of detecting bots in online social media has produced an active research area on this topic [9, 21].
State-of-the-art methods for bot detection use historical patterns of behaviour and a rich feature set including textual, temporal, and social network features to distinguish automated bots from real human users [35]. However, for real-time applications using large streamed datasets, such methods can be prohibitive due to the sheer volume, velocity, and incompleteness of data samples. In this work we develop a new method to detect one particular type of social bot, content polluters, in streamed microblog datasets such as Twitter. Content polluters are bots that attempt to subvert a genuine discussion by hijacking it for political or advertising purposes. As we will show, these bots are a major concern for applications such as real-time prediction of events, such as social unrest, from social media datasets.
1.2 Problem context
Social unrest prediction is a growing concern for governments worldwide. This is evidenced by DARPA's Open Source Intelligence program, which produced numerous methods to predict the occurrence of future population-level events such as civil unrest, political crises, election outcomes and disease outbreaks [12, 25, 30, 32]. It has been observed that social events are either preceded or followed by changes in population-level communication behaviour, consumption and movement. A large fraction of population-level changes are implicitly reflected in online data such as blogs, online social networks, financial markets, or search queries. Some of these data sources have been shown to effectively detect population-level events in real time. Methods have been developed for predicting such events by fusing publicly available data from multiple sources.

There exists a plethora of research focused on social media-based forecasting models, suggesting that features from micro-blogs such as Twitter can predict and detect population-level events [30]. Once one develops a "gold standard" (ground truth) record of known events (e.g. election results, or protests occurring), models can be trained using open source data to make predictions. A significant challenge for such models is noise reduction: filtering "fake news", removing misclassified or irrelevant tweets, or mitigating the effects of missing data. This is of particular concern, as the changing limits on accessing social media data remain a major challenge for researchers [26]. Access to data through APIs and third parties can be inconsistent, incomplete, and corrupted by noise in the form of bots. Where bots are influencing people through fake social media accounts, they also act as content polluters on social media sites [33]. According to the Digital Forensics Research Lab (DFRL), "They can make a group of six people look like a group of 46,000 people."
The main goal of our work was to find content polluters in a dataset comprising tweets related to Australian social unrest events, in real time and without access to complete profile information of the users. Due to rate limits on the public API and the high cost of accessing data, we were restricted to using only streamed tweets satisfying certain criteria. While the actual event prediction algorithm is not the primary concern of this paper, further detail can be found in Osborne et al. [29].
1.3 Related Work
A social bot is a computer algorithm that automatically produces content and interacts with humans on social media, trying to emulate and possibly alter their behaviour [14]. Social bots inhabit social media platforms, and online social networks are inundated by millions of bots exhibiting increasingly sophisticated, human-like behaviour. In the coming years a proliferation of social media bots is expected as advertisers, criminals, politicians, governments, terrorists, and other organizations attempt to influence populations [34]. This introduces several dimensions along which social bots can be characterised, including social network structure, temporal activity, diffusion patterns, and sentiment expression [14].

Ghosh et al. [16] conducted an analysis of the follower/followee links acquired by over 40,000 spammer accounts suspended by Twitter. They showed that penalizing users for connecting to spammers can be effective because it would de-incentivize users from linking with other users merely to gain influence. Yang et al. [40] found that bot accounts in online social networks connect to each other by chance and integrate into the social network just like normal users. Network information along with content has been shown to detect spam in online social networks [20]. While researchers were proposing various bot-detection models, Lee et al. [24] identified and engaged strangers on social media to effectively propagate information/misinformation. They proposed a model to leverage people's social behaviour (online interactions) and users' wait times for retweeting.
Social bots evolve over time, making them resilient against standard bot detection approaches [9]. They are adept at changing discussion topics and posting activities [38]. Researchers have proposed complex models, such as those based on interaction graphs of suspicious accounts [19, 20, 22, 39]. An adversary often controls multiple social bots, known as sybils. One strategy to detect such accounts relies on investigating social graph structure, on the assumption that sybil accounts link to a small number of legitimate users [7]. Behavioural patterns and sentiment analysis have also been used for bot detection [11]. Such patterns can easily be encoded in features, so machine learning techniques can be used to distinguish bot-like from human-like behaviour. Previous work uses network-based features or content analysis for bot detection, along with indicators such as temporal activity, retweets, and crowd sourcing [10, 36]. Such efforts require substantial network knowledge or the ability to quickly query an API for a complete history of social media postings by suspected bots. However, real-time applications, such as streaming messages based on keywords or geographic locations, render this impractical. A major challenge therefore is developing methodologies to detect and remove bots based on partial information, message histories, and network knowledge, in real time.
In this work we detect bots from individual tweets downloaded for predicting social unrest in Australian cities. Filtering on keywords and the geographic location of events (such as protests, rallies, and civil disturbances) collected in real time leaves a small but informative dataset for prediction. Predictions are generated in real time by analysing data from online social media platforms such as Twitter and validated against hand-labelled "gold standard records" (GSR) [29]. The GSR is created by news analysts; after going through a validation and cleaning process this data is ready to be used as the ground truth. If Twitter data is contaminated with social bots, it can greatly degrade prediction models. It is therefore imperative to develop techniques for detecting and removing social bots from real-time data streams.

Contributions: Our scientific contributions are as follows:
(1) We develop a method to identify social bots in data using only partial information about the user and their tweet history, in real time.
(2) We present a new dataset of hand-labelled bots and legitimate records, and use it to validate our method¹.
(3) We pose a set of research questions for evaluating whether Twitter users, Twitter, or existing state-of-the-art bot detection methods could detect bots in our dataset.
1.4 Dataset
Our dataset consists of timestamped tweets from 1 January 2015 to 31 December 2016 from 5 major capital cities in Australia. Tweets identify one of the following locations: 'Australia', 'Adelaide', 'Brisbane', 'Melbourne', 'Perth', or 'Sydney'. The data are targeted at studying civil unrest and intend to capture ways in which people express opinions and organize marches, rallies, and peaceful/violent protests within Australia. Such events aim to draw attention toward an issue, e.g. infrastructure, taxes, or immigration laws. Australia has a population of about 24.5 million people and, as in many developed countries, predicting civil unrest events is of interest to law enforcement agencies, government bodies, media and academia. Notwithstanding this fact, the literature is devoid of exploratory studies conducted on this population for real-time prediction of civil unrest events. The basic statistics about protest-related tweets in our dataset are reported in Table 1.

Table 1: Data statistics

Parameter                                | Adelaide | Brisbane | Melbourne | Perth | Sydney
Number of tweets                         | 14087    | 5913     | 23720     | 8421  | 31568
Number of unique users                   | 12039    | 3466     | 14611     | 6215  | 14515
Number of unique URLs                    | 548      | 233      | 762       | 456   | 844
Average number of followers (in-degree)  | 8812     | 9624     | 6733      | 5409  | 6052
Average number of friends (out-degree)   | 1223     | 1736     | 1517      | 1643  | 1860
Number of verified accounts              | 293      | 432      | 840       | 209   | 412

Note that the dataset was devoid of information on the alters (followers/friends of egos), except for the total count of alters (numbers of followers and friends).

¹Data can be accessed on http://maths.adelaide.edu.au/mehwish.nasim/
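For context, a minimal sketch of the kind of location filter this collection implies (the city list is from the paper; the matching logic is our illustration, not the authors' pipeline):

```python
CITIES = ("Australia", "Adelaide", "Brisbane", "Melbourne", "Perth", "Sydney")

def matches_location(tweet_text: str) -> bool:
    """Keep a streamed tweet if it identifies one of the target locations."""
    return any(city.lower() in tweet_text.lower() for city in CITIES)

print(matches_location("Rally tomorrow in Melbourne CBD"))  # True
```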
2 DETECTING CONTENT POLLUTERS
We investigate two characteristics of tweets: temporal information and message diversity.

Temporal Patterns: In the first step we were interested in (1) users who tweet frequently, and (2) pairs of users who tweet on the same day using the desired keywords. Since no information about the network of individual users is available, we cannot construct a follower-friend network graph. Instead, we construct a two-mode user-event network: for all the events in the data, we connect two users if they have tweeted on the same event day. We represent this problem in graph-theoretic terms as follows:
Let $G$ be a bipartite graph of users and events. Let $U$ be the set of users and let $V$ be the set of events. Let $u, v \in U$ and let $i, j \in V$. For any $i \in V$, if $N(u) \cap N(v) \neq \emptyset$ then $(u, v) \in E$ in the one-mode projection of the bipartite graph. The neighbourhood $N(v)$ of a vertex $v \in U$ is the set of vertices that are adjacent to $v$. The resulting projection is an undirected loopless multigraph. If the edge set $E$ contains the same edge several times, then $E$ is a multiset. If an edge occurs several times in $E$, the copies of that edge are called parallel edges; graphs that have parallel edges are also called multigraphs.
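To make this construction concrete, the following sketch (ours, not the authors' code; networkx is assumed and the input pairs are illustrative) builds the two-mode user-event graph and its one-mode projection onto users, with edge weights playing the role of parallel-edge multiplicities:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical input: (user_id, event_day) pairs extracted from streamed tweets.
user_events = [("u1", "2016-03-05"), ("u2", "2016-03-05"),
               ("u1", "2016-07-10"), ("u2", "2016-07-10"), ("u3", "2016-07-10")]

B = nx.Graph()
users = {u for u, _ in user_events}
events = {e for _, e in user_events}
B.add_nodes_from(users, bipartite=0)
B.add_nodes_from(events, bipartite=1)
B.add_edges_from(user_events)

# One-mode projection onto users: an edge (u, v) exists iff N(u) and N(v)
# intersect; the weight counts shared event days, i.e. the multiplicity of
# parallel edges in the multigraph formulation above.
G = bipartite.weighted_projected_graph(B, users)

for u, v, d in G.edges(data=True):
    print(u, v, d["weight"])  # dyadic (pairwise) co-tweeting frequency
```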
Similar to other social networks such as friendship networks, event networks are the result of complex sociological processes with a multitude of relations. When such relations are conflated into a dense network, the visualization often resembles a "hairball". Various approaches to declutter drawings of such networks exist in the literature. We use the recent backbone layout approach for network visualization [28], which accounts for strong ties (multiplicity of edges) and uses the union of all maximum spanning trees as a sparsifier to ensure a connected subgraph. In Figure 1b, the thickness of an edge represents how often a pair of nodes tweet on the same event day², whereas the size of a node indicates the individual tweet frequency of a user³.

²Event day was confirmed from the GSR.
³Network visualizations are created in visone (http://www.visone.info/).
(a) The two purple nodes at the right, loosely connected to the core, are bots. They have tweeted together frequently; their individual tweet frequency is low compared to other nodes in the graph, but their dyadic (pairwise) frequency is high.
(b) Two densely connected components in the tweets graph.
Figure 1: Graphs containing bots and legitimate users from the Melbourne events network.

We noticed that bots tweeted together frequently. Their individual tweet frequency is low compared to other nodes in the graph, but their dyadic (pairwise) frequency is high. For instance, the two purple nodes on the right of Figure 1a tweeted together frequently and are only weakly connected to the core. Upon checking their complete profiles, these users were found to be political bots. This motivated us to explore the tweets graph further.

Figure 2: Graph containing bots and legitimate users from the Melbourne events network.

The core of the network (green nodes) was found to consist of news channels and popular blogs in Australia, such as MelbLiveNews, newsonaust, 7NewsMelbourne and LoversMelbourne, to name a few. Media accounts are likely to report population-level events on the day of the events; thus they form a strongly-connected core of the events network graph.
We then clustered all tweets in a similar manner, constructing a graph where two users have an edge between them if they have tweeted on the same day, irrespective of whether there was an event that day or not. We used the Louvain method for clustering the network [5], which is based on the concept of modularity; optimizing the modularity yields the best possible grouping of nodes in a given network. We found two strongly-connected components in the graph: (1) news channels, and (2) bots. We analysed the strongly-connected vertex-induced subgraphs from the network. One such component for the city of Melbourne is shown in Figure 2, which is a strongly-connected component from Figure 1b. Bots are the purple nodes (validated by manual inspection of profiles). Green nodes represent false positives. Orange nodes are not bots but are also not relevant for predictions, since these users were not geographically located in Australia and were tweeting about Victoria in the UK.
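As an illustration of this step (a sketch under our assumptions, not the paper's implementation), the Louvain method shipped with networkx can be applied directly to the weighted co-tweeting graph:

```python
import networkx as nx

# Toy co-tweeting graph; weight = number of days two users tweeted together
# (in practice this is the one-mode projection built earlier).
G = nx.Graph()
G.add_weighted_edges_from([
    ("news1", "news2", 9), ("news2", "news3", 8), ("news1", "news3", 7),
    ("bot1", "bot2", 12), ("bot2", "bot3", 11), ("bot1", "bot3", 10),
    ("news1", "bot1", 1),
])

# Louvain community detection based on modularity optimisation [5].
communities = nx.community.louvain_communities(G, weight="weight", seed=42)
for c in sorted(communities, key=len, reverse=True):
    print(sorted(c))  # densely connected groups, e.g. news accounts vs. bots
```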
Message diversity: We computed the diversity of tweets based upon mentions of URLs and hashtags. We selected the set $K$ of most-tweeted URLs ($|K| = 20$), and then filtered out the users $\bar{U} \subseteq U$ who mentioned those URLs. The motivation for this approach is that an event prediction model should be resilient against bot-URLs that are infrequently mentioned in the tweets, since these will not greatly impact the prediction accuracy. We then computed the following three measures for each of the remaining users: (i) the total number of tweets containing any URL(s), $u_i^{\mathrm{all}}$; (ii) the number of tweets mentioning URL $k \in K$, $u_i^k$; and (iii) the diversity score, i.e., the difference between the two measures, $u_i^d = u_i^{\mathrm{all}} - u_i^k$.
We then plot the diversity score distribution for every user $u \in \bar{U}$, for every URL $k \in K$. This immediately provides some relevant insights about the behaviour of content polluters: Figure 3a shows a legitimate URL (i.e., one linked to by legitimate users), whereas Figures 3b and 3c show bot-URLs (i.e., URLs linked to by bots). Users who tweet these URLs are classified as potential bots. The figures show that the diversity of users linking to legitimate URLs is generally far greater than that of users linking to bot-URLs. The temporal patterns of bot-URL mentions, which were being tweeted at regular intervals, indicated that these users were indeed bots.
We measure the extent of diversity in two ways:
(1) The Gini coefficient ($G \in \mathbb{R}$, $G \in [0,1]$):
$$G = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}\left|u_i^d - u_j^d\right|}{2n\sum_{i=1}^{n} u_i^d}, \qquad (1)$$
where $n$ is the number of users tweeting a particular URL. The Gini coefficient $G$ describes the relative amount of inequality in the distribution of diversity: $G = 0$ indicates complete equality while $G = 1$ indicates complete inequality. A low $G$ suggests coordination among the observations. The Gini coefficient does not measure absolute inequality, and its interpretation can vary from situation to situation. Legitimate accounts such as news channels, newspapers, and famous activists are likely to tweet legitimate and diverse URLs, so the Gini coefficient for legitimate URLs is high compared to that for illegitimate URLs. The Gini coefficient for a sample of ten URLs is shown in Figure 4.
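Equation (1) translates directly into a few lines of code; a sketch, assuming the diversity scores computed above:

```python
import numpy as np

def gini(d):
    """Gini coefficient of diversity scores d, as in Eq. (1)."""
    d = np.asarray(d, dtype=float)
    n = len(d)
    mad = np.abs(d[:, None] - d[None, :]).sum()  # sum_i sum_j |u_i^d - u_j^d|
    return mad / (2 * n * d.sum())

print(gini([5, 5, 5]))      # 0.0: complete equality, i.e. coordinated scores
print(gini([0, 0, 0, 12]))  # high G: a few diverse users dominate
```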
(2) Rank-size rule: We observed that only a fraction of URLs are mentioned very frequently in the tweets, while a very large number of URLs barely find their way into more than a single tweet. It is interesting to note that cities and their ranks also follow a similar distribution; this pattern is generally known as the rank-size rule [31]. It has also been observed in various studies on the calling behaviour of users [2, 3, 27].
We t a curve on every user versus URL-diversity graph and
measure the coecient of determination
R
2
. Values close to zero
indicate that the model explains little of the variability of the re-
sponse data around its mean. For legitimate URLs, we obtained
values close to 1 (Figure 3).
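The paper does not state the fitted functional form, so the sketch below assumes a rank-size power law fitted on log-log axes, with R² computed from the residuals:

```python
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Sorted diversity scores per user for one URL (illustrative values).
y = np.sort(np.array([14.0, 9.0, 6.0, 4.5, 3.5, 2.8, 2.3, 2.0]))[::-1]
rank = np.arange(1, len(y) + 1)

# Power-law (rank-size) fit via log-log linear regression.
slope, intercept = np.polyfit(np.log(rank), np.log(y), 1)
y_hat = np.exp(intercept) * rank ** slope

print(round(r_squared(y, y_hat), 3))  # close to 1 => well-explained, as for legitimate URLs
```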
Recently, Gilani et al. [18] evaluated the characteristics of automated versus human accounts by looking at complete tweet histories. They initially hypothesized that bots tweet a number of different URLs; however, in the actual data they found that humans may also post a number of URLs. Conversely, in this work we looked at the most frequently posted URLs and then, for each URL, analysed how diverse the tweets of the users tweeting that URL are.
Using message diversity on URLs, we detected 849 bots in the data, which we call content polluters. These content polluters contributed about 7% of tweets in the data. We computed some statistics on content polluters versus legitimate users, shown in Figure 5. In [14], the authors argued that social bots tend to have recent accounts with long names. However, we did not find this pattern in our data. The average account age of content polluters was 2.9 years, compared to 4.2 years for legitimate users; this difference was significant (p < 0.01), but it suggests that these particular bot accounts are relatively old and have remained (potentially) undetected by Twitter. Twitter names for bots had on average 11 characters, compared to 12 characters for non-bots. None of the bots had verified Twitter accounts.

Figure 3: Message diversity measured through 3 URLs for bots and genuine users (axes: users vs. diversity). (a) Legitimate (Gini = 0.8, R² = 0.98). (b) Bots (Gini = 0.32, R² = 0). (c) Bots (Gini = 0, R² = 0).
Figure 4: Gini score for ten URLs. A high Gini coefficient indicates a legitimate URL. The three URLs with the lowest Gini coefficients were being tweeted by content-polluting bots. (URLs shown include www.digitaltrends.com, linkis.com, www.9news.com.au, www.theguardian.com, www.facebook.com, www.youtube.com, www.heraldsun.com.au, www.theage.com.au, www.abc.net.au, www.mojahedin.org, and twitter.com.)
A total of 109 political bot accounts were created on 20 February 2014 with only 12 unique names, a strong indication of a bot network. We also found several digital media bot accounts; such accounts aim at becoming famous by attracting followers [6]. One set of such accounts, created on 30 March 2016, consisted of 8 accounts with an average friend count of 4099 and follower count of 1112.
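This pattern suggests a simple supplementary heuristic: group accounts by creation date and flag dates with many accounts but few distinct names. A sketch, with field names assumed (not from the paper's code):

```python
from collections import defaultdict

# Hypothetical input: per-account metadata from the streamed tweets.
accounts = [
    {"name": "newsbotA", "created": "2014-02-20"},
    {"name": "newsbotA", "created": "2014-02-20"},
    {"name": "newsbotB", "created": "2014-02-20"},
    {"name": "jane_doe", "created": "2011-06-03"},
]

by_date = defaultdict(list)
for a in accounts:
    by_date[a["created"]].append(a["name"])

for date, names in by_date.items():
    # Many accounts sharing a creation date with few unique names
    # (e.g. 109 accounts, 12 names on 20 Feb 2014) indicates a bot network.
    if len(names) >= 3 and len(set(names)) < len(names):
        print(date, len(names), "accounts,", len(set(names)), "unique names")
```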
We also explored the dataset from [23] using our algorithm. That dataset contains more than 600k tweets. The Gini coefficient for each subset (bots and non-bots) was around 0.5, so the results were inconclusive. The dataset from Gilani et al. [18] only contained the number of URLs each user mentioned, so it was not possible to check the relative frequency of any particular URL. We argue that the nature of content-polluting bots makes them difficult to distinguish in traditional bot-detection datasets. This motivates our research questions below and the creation of a new human-validated content-pollution dataset in the next section.
3 CREATING A CONTENT-POLLUTING BOT DATASET
Given the peculiarities in the bot accounts that we found in our analysis, we move on to some pertinent research questions.

3.1 Do humans succeed in detecting content polluters?
We conducted a user study to hand-label a set of Twitter accounts that contained an equal number of content polluters (from the list obtained in the previous section) and legitimate accounts. We asked three independent hand-labellers to create the dataset. Participants were first shown several examples of content polluters as well as of legitimate accounts. All three participants were well versed in using Twitter. All participants found it very difficult to assess non-English accounts, even with automatic translation.

The participants recorded the following comments:

Participant 1
Domain knowledge: Advanced Twitter user
Comments: "What I'm struggling with is that, the user doesn't actually initiates a suspicious tweet. He simply retweets a whole bunch of content polluting tweets."
Strategy:
- If the user has tweeted or retweeted from well-known news spam sites, then mark as bot.
- Otherwise look through the pattern of tweets; if the tweeting behaviour is very spammy, for example a highly consistent frequency of tweeting and tweets from a single source, then mark as bot.
- See if they regularly mention and interact with other Twitter users, which is a good sign for a regular account.
- Look at profile details and the follower/followee ratio to distinguish whether it appears to be a regular account or a bot.
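Participant 1's strategy reads like a rule list; purely as an illustration (thresholds and field names are ours, not from the study), it could be encoded as:

```python
def looks_like_bot(acct) -> bool:
    """Heuristic encoding of Participant 1's labelling strategy (illustrative)."""
    if acct["retweets_known_spam_sites"]:
        return True
    # Highly regular posting from a single source suggests automation.
    if acct["posting_regularity"] > 0.9 and acct["n_sources"] == 1:
        return True
    # Regular interaction with other users is a good sign for a real account.
    if acct["mentions_per_tweet"] > 0.2:
        return False
    # Fall back to profile details and the follower/followee ratio.
    return acct["followers"] / max(acct["friends"], 1) < 0.01

example = {
    "retweets_known_spam_sites": False,
    "posting_regularity": 0.95,  # fraction of tweets at near-fixed intervals
    "n_sources": 1,              # distinct tweet sources (clients)
    "mentions_per_tweet": 0.0,
    "followers": 40,
    "friends": 4000,
}
print(looks_like_bot(example))  # True: regular, single-source posting
```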
