Open Access

Detecting Spammers on Twitter

TLDR
This paper uses tweets related to three famous trending topics from 2009 to construct a large labeled collection of users, manually classified into spammers and non-spammers, and identifies a number of characteristics related to tweet content and user social behavior which could potentially be used to detect spammers.
Abstract
With millions of users tweeting around the world, real-time search systems and different types of mining tools are emerging to let people track the repercussions of events and news on Twitter. However, although appealing as mechanisms to ease the spread of news and allow users to discuss events and post their status, these services open opportunities for new forms of spam. Trending topics, the most talked-about items on Twitter at a given point in time, have been seen as an opportunity to generate traffic and revenue. Spammers post tweets containing typical words of a trending topic and URLs, usually obfuscated by URL shorteners, that lead users to completely unrelated websites. This kind of spam can devalue real-time search services unless mechanisms to fight and stop spammers can be found. In this paper we consider the problem of detecting spammers on Twitter. We first collected a large Twitter dataset that includes more than 54 million users, 1.9 billion links, and almost 1.8 billion tweets. Using tweets related to three famous trending topics from 2009, we construct a large labeled collection of users, manually classified into spammers and non-spammers. We then identify a number of characteristics related to tweet content and user social behavior that could potentially be used to detect spammers. We used these characteristics as attributes of a machine learning process for classifying users as either spammers or non-spammers. Our strategy succeeds at detecting most of the spammers while only a small percentage of non-spammers are misclassified: approximately 70% of spammers and 96% of non-spammers were correctly classified. Our results also highlight the most important attributes for spam detection on Twitter.
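The two percentages reported in the abstract are per-class rates: the fraction of true spammers labeled as spammers, and the fraction of true non-spammers labeled as non-spammers. The sketch below shows how such rates can be computed from true and predicted labels. It is a minimal illustration only: the function name `per_class_recall` and the toy labels are assumptions invented here, not the paper's code or data.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: fraction of users with that true label predicted correctly}."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


# Toy labels, invented for illustration only (not the paper's dataset).
y_true = ["spammer"] * 10 + ["non-spammer"] * 25
y_pred = (
    ["spammer"] * 7 + ["non-spammer"] * 3      # 7 of 10 spammers detected
    + ["non-spammer"] * 24 + ["spammer"] * 1   # 24 of 25 non-spammers kept
)

rates = per_class_recall(y_true, y_pred)
for label, rate in rates.items():
    print(f"{label}: {rate:.0%} correctly classified")
```

Reporting the two rates separately, rather than a single accuracy figure, matters when non-spammers greatly outnumber spammers, since overall accuracy would then be dominated by the majority class.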


Citations
Proceedings Article

Information credibility on twitter

TL;DR: There are measurable differences in the way messages propagate that can be used to classify them automatically as credible or not credible, with precision and recall in the range of 70% to 80%.
Journal Article

Processing Social Media Messages in Mass Emergency: A Survey

TL;DR: This survey reviews the state of the art in computational methods for processing social media messages, highlights both their contributions and shortcomings, and methodically examines a series of key subproblems ranging from the detection of events to the creation of actionable and useful summaries.
Journal Article

A Survey of Techniques for Event Detection in Twitter

TL;DR: A survey of techniques for event detection from Twitter streams aimed at finding real‐world occurrences that unfold over space and time and highlights the need for public benchmarks to evaluate the performance of different detection approaches and various features.
Proceedings Article

Design and Evaluation of a Real-Time URL Spam Filtering Service

TL;DR: It is shown that Monarch can provide accurate, real-time protection, but that the underlying characteristics of spam do not generalize across web services, and the distinctions between email and Twitter spam are explored.
Proceedings Article

Truthy: mapping the spread of astroturf in microblog streams

TL;DR: A web service that tracks political memes in Twitter and helps detect astroturfing, smear campaigns, and other misinformation in the context of U.S. political elections is demonstrated.
References
Book

Data Mining: Practical Machine Learning Tools and Techniques

TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Book Chapter

Text Categorization with Support Vector Machines: Learning with Many Relevant Features

TL;DR: This paper explores the use of Support Vector Machines for learning text classifiers from examples and analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task.
Proceedings Article

What is Twitter, a social network or a news media?

TL;DR: In this paper, the authors have crawled the entire Twittersphere and found a non-power-law follower distribution, a short effective diameter, and low reciprocity, which all mark a deviation from known characteristics of human social networks.
Proceedings Article

A Comparative Study on Feature Selection in Text Categorization

TL;DR: This paper finds strong correlations between the DF, IG, and CHI values of a term and suggests that DF thresholding, the simplest method with the lowest computational cost, can be reliably used instead of IG or CHI when computing those measures is too expensive.
Proceedings Article

Measuring User Influence in Twitter: The Million Follower Fallacy

TL;DR: An in-depth comparison of three measures of influence, using a large amount of data collected from Twitter, is presented, suggesting that topological measures such as indegree alone reveal very little about the influence of a user.