Spam filtering for short messages

doi:10.1145/1321440.1321486

Home
/
Papers
/
Spam filtering for short messages

Proceedings Article•DOI•

Spam filtering for short messages

Gordon V. Cormack¹, José María Gómez Hidalgo², Enrique Puertas Sanz²•Institutions (2)

University of Waterloo¹, European University of Madrid²

06 Nov 2007-pp 313-320

TL;DR: It is concluded that content filtering for short messages is surprisingly effective and can be improved substantially using different features, while compression-model filters perform quite well as-is.

read less

Abstract: We consider the problem of content-based spam filtering for short text messages that arise in three contexts: mobile (SMS) communication, blog comments, and email summary information such as might be displayed by a low-bandwidth client. Short messages often consist of only a few words, and therefore present a challenge to traditional bag-of-words based spam filters. Using three corpora of short messages and message fields derived from real SMS, blog, and spam messages, we evaluate feature-based and compression-model-based spam filters. We observe that bag-of-words filters can be improved substantially using different features, while compression-model filters perform quite well as-is. We conclude that content filtering for short messages is surprisingly effective.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Short text classification in twitter to improve information filtering

[...]

Bharath Sriram¹, Dave Fuhry¹, Engin Demir¹, Hakan Ferhatosmanoglu¹, Murat Demirbas² - Show less +1 more•Institutions (2)

Ohio State University¹, University at Buffalo²

19 Jul 2010

TL;DR: A small set of domain-specific features extracted from the author's profile and text is proposed to use to classify short text messages to a predefined set of generic classes such as News, Events, Opinions, Deals, and Private Messages.

...read moreread less

Abstract: In microblogging services such as Twitter, the users may become overwhelmed by the raw data One solution to this problem is the classification of short text messages As short texts do not provide sufficient word occurrences, traditional classification methods such as "Bag-Of-Words" have limitations To address this problem, we propose to use a small set of domain-specific features extracted from the author's profile and text The proposed approach effectively classifies the text to a predefined set of generic classes such as News, Events, Opinions, Deals, and Private Messages

...read moreread less

782 citations

Proceedings Article•DOI•

Contributions to the study of SMS spam filtering: new collection and results

[...]

Tiago A. Almeida¹, José María Gómez Hidalgo, Akebo Yamakami¹•Institutions (1)

State University of Campinas¹

19 Sep 2011

TL;DR: A new real, public and non-encoded SMS spam collection that is the largest one as far as the authors know is offered and the performance achieved by several established machine learning methods is compared.

...read moreread less

Abstract: The growth of mobile phone users has lead to a dramatic increasing of SMS spam messages. In practice, fighting mobile phone spam is difficult by several factors, including the lower rate of SMS that has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. On the other hand, in academic settings, a major handicap is the scarcity of public SMS spam datasets, that are sorely needed for validation and comparison of different classifiers. Moreover, as SMS messages are fairly short, content-based spam filters may have their performance degraded. In this paper, we offer a new real, public and non-encoded SMS spam collection that is the largest one as far as we know. Moreover, we compare the performance achieved by several established machine learning methods. The results indicate that Support Vector Machine outperforms other evaluated classifiers and, hence, it can be used as a good baseline for further comparison.

...read moreread less

369 citations

Cites methods from "Spam filtering for short messages"

...This corpus has been used in the following academic research efforts: [6], [7], and [14]....
[...]

Book•

Email Spam Filtering: A Systematic Review

[...]

Gordon V. Cormack¹•Institutions (1)

University of Waterloo¹

23 Jun 2008

TL;DR: This work examines the definition of spam, the user's information requirements and the role of the spam filter as one component of a large and complex information universe, and outlines several uncertainties and proposes experimental methods to address them.

...read moreread less

Abstract: Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient make them difficult to formalize. A typical email user has a working definition no more formal than "I know it when I see it." Yet, current spam filters are remarkably effective, more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be compromised by more cleverly crafted spam? We survey current and proposed spam filtering techniques with particular emphasis on how well they work. Our primary focus is spam filtering in email; Similarities and differences with spam filtering in other communication and storage media — such as instant messaging and the Web — are addressed peripherally. In doing so we examine the definition of spam, the user's information requirements and the role of the spam filter as one component of a large and complex information universe. Well-known methods are detailed sufficiently to make the exposition self-contained, however, the focus is on considerations unique to spam. Comparisons, wherever possible, use common evaluation measures, and control for differences in experimental setup. Such comparisons are not easy, as benchmarks, measures, and methods for evaluating spam filters are still evolving. We survey these efforts, their results and their limitations. In spite of recent advances in evaluation methodology, many uncertainties (including widely held but unsubstantiated beliefs) remain as to the effectiveness of spam filtering techniques and as to the validity of spam filter evaluation methods. We outline several uncertainties and propose experimental methods to address them.

...read moreread less

259 citations

Cites background from "Spam filtering for short messages"

...While this survey confines itself to email spam, we note that the definitions above apply to any number of communication media, including text and voice messages [31, 45, 84], social networks [206], and blog comments [37, 123]....
[...]

Journal Article•DOI•

Text mining and probabilistic language modeling for online review spam detection

[...]

Raymond Y. K. Lau¹, S. Y. Liao¹, Ron Chi-Wai Kwok¹, Kaiquan Xu², Yunqing Xia³, Yuefeng Li⁴ - Show less +2 more•Institutions (4)

City University of Hong Kong¹, Nanjing University², Tsinghua University³, Queensland University of Technology⁴

05 Jan 2012

TL;DR: The work discussed in this article represents the first successful attempt to apply text mining methods and semantic language models to the detection of fake consumer reviews.

...read moreread less

Abstract: In the era of Web 2.0, huge volumes of consumer reviews are posted to the Internet every day. Manual approaches to detecting and analyzing fake reviews (i.e., spam) are not practical due to the problem of information overload. However, the design and development of automated methods of detecting fake reviews is a challenging research problem. The main reason is that fake reviews are specifically composed to mislead readers, so they may appear the same as legitimate reviews (i.e., ham). As a result, discriminatory features that would enable individual reviews to be classified as spam or ham may not be available. Guided by the design science research methodology, the main contribution of this study is the design and instantiation of novel computational models for detecting fake reviews. In particular, a novel text mining model is developed and integrated into a semantic language model for the detection of untruthful reviews. The models are then evaluated based on a real-world dataset collected from amazon.com. The results of our experiments confirm that the proposed models outperform other well-known baseline models in detecting fake reviews. To the best of our knowledge, the work discussed in this article represents the first successful attempt to apply text mining methods and semantic language models to the detection of fake consumer reviews. A managerial implication of our research is that firms can apply our design artifacts to monitor online consumer reviews to develop effective marketing or product design strategies based on genuine consumer feedback posted to the Internet.

...read moreread less

188 citations

Proceedings Article•

Classifying text messages for the haiti earthquake

[...]

Cornelia Caragea¹, Nathan J. McNeese¹, Anuj R. Jaiswal¹, Greg Traylor¹, Hyun Woo Kim¹, Prasenjit Mitra², Dinghao Wu², Andrea H. Tapia², C. Lee Giles², Bernard J. Jansen², John Yen² - Show less +7 more•Institutions (2)

Pennsylvania State University¹, Penn State College of Information Sciences and Technology²

01 Jan 2011

TL;DR: A reusable information technology infrastructure is developed, called Enhanced Messaging for the Emergency Response Sector (EMERSE), which classifies and aggregates tweets and text messages about the Haiti disaster relief so that non-governmental organizations, relief workers, people in Haiti, and their friends and families can easily access them.

...read moreread less

Abstract: In case of emergencies (e.g., earthquakes, flooding), rapid responses are needed in order to address victims’ requests for help. Social media used around crises involves self-organizing behavior that can produce accurate results, often in advance of official communications. This allows affected population to send tweets or text messages, and hence, make them heard. The ability to classify tweets and text messages automatically, together with the ability to deliver the relevant information to the appropriate personnel are essential for enabling the personnel to timely and efficiently work to address the most urgent needs, and to understand the emergency situation better. In this study, we developed a reusable information technology infrastructure, called Enhanced Messaging for the Emergency Response Sector (EMERSE), which classifies and aggregates tweets and text messages about the Haiti disaster relief so that non-governmental organizations, relief workers, people in Haiti, and their friends and families can easily access them.

...read moreread less

180 citations

Cites background or methods from "Spam filtering for short messages"

...In the remaining of this section, we describe four methods, which produce feature representations that are used as input to machine learning algorithms: (1) the BoWs approach; (2) feature abstraction; (3) feature selection; and (4) Latent Dirichlet Allocation (LDA)....
[...]
...Healy, Delany, and Zamolotskikh (2005), Hidalgo, Bringas, Sanz, and Garcıa (2006), and Cormack, Hidalgo, and Sanz (2007) have previously addressed the problem of identifying spam Proceedings of the 8th International ISCRAM Conference – Lisbon, Portugal, May 2011 2 short messages, by employing…...
[...]
...The messages have been manually labeled into 10 categories: (1) medical emergency; (2) people trapped; (3) food shortage; (4) water shortage; (5) water sanitation; (6) shelter needed; (7) collapsed structure; (8) food distribution; (9) hospital/clinic services; and (10) person news....
[...]

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

Machine learning in automated text categorization

[...]

Fabrizio Sebastiani

01 Mar 2002-ACM Computing Surveys

TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

...read moreread less

Abstract: The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

...read moreread less

7,539 citations

Making large-scale support vector machine learning practical

[...]

Thorsten Joachims

01 Jan 1998

1,660 citations

Book•

Making large-scale support vector machine learning practical

[...]

Thorsten Joachims

08 Feb 1999

TL;DR: This chapter presents algorithmic and computational results developed for SV M light V2.0, which make large-scale SVM training more practical and give guidelines for the application of SVMs to large domains.

...read moreread less

Abstract: Training a support vector machine (SVM) leads to a quadratic optimization problem with bound constraints and one linear equality constraint. Despite the fact that this type of problem is well understood, there are many issues to be considered in designing an SVM learner. In particular, for large learning tasks with many training examples, oo-the-shelf optimization techniques for general quadratic programs quickly become intractable in their memory and time requirements. SV M light1 is an implementation of an SVM learner which addresses the problem of large tasks. This chapter presents algorithmic and computational results developed for SV M light V2.0, which make large-scale SVM training more practical. The results give guidelines for the application of SVMs to large domains.

...read moreread less

1,386 citations

Journal Article•DOI•

Spam Filtering Using Statistical Data Compression Models

[...]

Andrej Bratko, Bogdan Filipič, Gordon V. Cormack, Thomas R. Lynam, Blaž Zupan - Show less +1 more

01 Dec 2006-Journal of Machine Learning Research

TL;DR: A novel approach to spam filtering based on adaptive statistical data compression models that outperform currently established spam filters, as well as a number of methods proposed in previous studies.

...read moreread less

Abstract: Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We evaluate the filtering performance of two different compression algorithms; dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.

...read moreread less

255 citations

Proceedings Article•DOI•

Relaxed online SVMs for spam filtering

[...]

D. Sculley¹, Gabriel Wachman¹•Institutions (1)

Tufts University¹

23 Jul 2007

TL;DR: It is shown that online SVMs indeed give state-of-the-art classification performance on online spam filtering on large benchmark data sets, and that nearly equivalent performance may be achieved by a Relaxed Online SVM (ROSVM) at greatly reduced computational cost.

...read moreread less

Abstract: Spam is a key problem in electronic communication, including large-scale email systems and the growing number of blogs. Content-based filtering is one reliable method of combating this threat in its various forms, but some academic researchers and industrial practitioners disagree on how best to filter spam. The former have advocated the use of Support Vector Machines (SVMs) for content-based filtering, as this machine learning methodology gives state-of-the-art performance for text classification. However, similar performance gains have yet to be demonstrated for online spam filtering. Additionally, practitioners cite the high cost of SVMs as reason to prefer faster (if less statistically robust) Bayesian methods. In this paper, we offer a resolution to this controversy. First, we show that online SVMs indeed give state-of-the-art classification performance on online spam filtering on large benchmark data sets. Second, we show that nearly equivalent performance may be achieved by a Relaxed Online SVM (ROSVM) at greatly reduced computational cost. Our results are experimentally verified on email spam, blog spam, and splog detection tasks.

...read moreread less

246 citations