Proceedings Article

Content based SMS spam filtering

TL;DR: This paper analyzes to what extent Bayesian filtering techniques used to block email spam can be applied to the problem of detecting and stopping mobile spam, and demonstrates that Bayesian filters can be effectively transferred from email to SMS spam.
Abstract: In recent years, we have witnessed a dramatic increase in the volume of spam email. Other related forms of spam are increasingly emerging as significant problems, especially spam on Instant Messaging services (the so-called SPIM) and Short Message Service (SMS) or mobile spam. Like email spam, the SMS spam problem can be approached with legal, economic or technical measures. Among the wide range of technical measures, Bayesian filters are playing a key role in stopping email spam. In this paper, we analyze to what extent Bayesian filtering techniques used to block email spam can be applied to the problem of detecting and stopping mobile spam. In particular, we have built two SMS spam test collections of significant size, in English and Spanish. We have tested on them a number of message representation techniques and Machine Learning algorithms, in terms of effectiveness. Our results demonstrate that Bayesian filtering techniques can be effectively transferred from email to SMS spam.
Citations
Proceedings Article
19 Sep 2011
TL;DR: A new real, public and non-encoded SMS spam collection that is the largest one as far as the authors know is offered and the performance achieved by several established machine learning methods is compared.
Abstract: The growth in mobile phone users has led to a dramatic increase in SMS spam messages. In practice, fighting mobile phone spam is made difficult by several factors, including the lower rate of SMS that has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. On the other hand, in academic settings, a major handicap is the scarcity of public SMS spam datasets, which are sorely needed for validation and comparison of different classifiers. Moreover, as SMS messages are fairly short, content-based spam filters may have their performance degraded. In this paper, we offer a new real, public and non-encoded SMS spam collection that is the largest one as far as we know. Moreover, we compare the performance achieved by several established machine learning methods. The results indicate that Support Vector Machine outperforms other evaluated classifiers and, hence, it can be used as a good baseline for further comparison.

369 citations


Cites methods from "Content based SMS spam filtering"

  • ...This corpus has been used in the following academic research efforts: [6], [7], and [14]....

    [...]

  • ...Table 3: Evaluated classifiers. Basic Naïve Bayes (NB) – Basic NB [2]; Multinomial term frequency NB – MN TF NB [2]; Multinomial Boolean NB – MN Bool NB [2]; Multivariate Bernoulli NB – Bern NB [2]; Boolean NB – Bool NB [2]; Multivariate Gauss NB – Gauss NB [2]; Flexible Bayes – Flex NB [2]; Boosted NB [12]; Linear Support Vector Machine – SVM [10, 13]; Minimum Description Length – MDL [4]; K-Nearest Neighbors – KNN [1, 14] (K = 1, 3 or 5); C4.5 [15, 14]; Boosted C4.5 [14]; PART [11, 14]. 3.1 Results: We carried out this study using the following experiment protocol....

    [...]

  • ...K-Nearest Neighbors – KNN [1, 14] (K = 1, 3 or 5)...

    [...]

Book
23 Jun 2008
TL;DR: This work examines the definition of spam, the user's information requirements and the role of the spam filter as one component of a large and complex information universe, and outlines several uncertainties and proposes experimental methods to address them.
Abstract: Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient make them difficult to formalize. A typical email user has a working definition no more formal than "I know it when I see it." Yet, current spam filters are remarkably effective, more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be compromised by more cleverly crafted spam? We survey current and proposed spam filtering techniques with particular emphasis on how well they work. Our primary focus is spam filtering in email; Similarities and differences with spam filtering in other communication and storage media — such as instant messaging and the Web — are addressed peripherally. In doing so we examine the definition of spam, the user's information requirements and the role of the spam filter as one component of a large and complex information universe. Well-known methods are detailed sufficiently to make the exposition self-contained, however, the focus is on considerations unique to spam. Comparisons, wherever possible, use common evaluation measures, and control for differences in experimental setup. Such comparisons are not easy, as benchmarks, measures, and methods for evaluating spam filters are still evolving. We survey these efforts, their results and their limitations. 
In spite of recent advances in evaluation methodology, many uncertainties (including widely held but unsubstantiated beliefs) remain as to the effectiveness of spam filtering techniques and as to the validity of spam filter evaluation methods. We outline several uncertainties and propose experimental methods to address them.

259 citations


Cites background from "Content based SMS spam filtering"

  • ...While this survey confines itself to email spam, we note that the definitions above apply to any number of communication media, including text and voice messages [31, 45, 84], social networks [206], and blog comments [37, 123]....

    [...]

Proceedings Article
01 Jan 2011
TL;DR: A reusable information technology infrastructure is developed, called Enhanced Messaging for the Emergency Response Sector (EMERSE), which classifies and aggregates tweets and text messages about the Haiti disaster relief so that non-governmental organizations, relief workers, people in Haiti, and their friends and families can easily access them.
Abstract: In case of emergencies (e.g., earthquakes, flooding), rapid responses are needed in order to address victims’ requests for help. Social media used around crises involves self-organizing behavior that can produce accurate results, often in advance of official communications. This allows affected population to send tweets or text messages, and hence, make them heard. The ability to classify tweets and text messages automatically, together with the ability to deliver the relevant information to the appropriate personnel are essential for enabling the personnel to timely and efficiently work to address the most urgent needs, and to understand the emergency situation better. In this study, we developed a reusable information technology infrastructure, called Enhanced Messaging for the Emergency Response Sector (EMERSE), which classifies and aggregates tweets and text messages about the Haiti disaster relief so that non-governmental organizations, relief workers, people in Haiti, and their friends and families can easily access them.

180 citations


Cites background from "Content based SMS spam filtering"

  • ...The messages have been manually labeled into 10 categories: (1) medical emergency; (2) people trapped; (3) food shortage; (4) water shortage; (5) water sanitation; (6) shelter needed; (7) collapsed structure; (8) food distribution; (9) hospital/clinic services; and (10) person news....

    [...]

Journal Article
TL;DR: The need for content-based SMS spam filtering is motivated and the issues with data collection and availability for furthering research in this area are discussed, a large corpus of SMS spam is analyzed, and some initial benchmark results are provided.
Abstract: Highlights:
  • We motivate the need for content-based SMS spam filtering.
  • We discuss similarities/differences between email and SMS spam filtering.
  • We review recent research in SMS spam filtering.
  • We analyse recent SMS spam messages and make a dataset available.
  • Early days, no consensus yet on best techniques but significant challenges exist.
Mobile or SMS spam is a real and growing problem primarily due to the availability of very cheap bulk pre-pay SMS packages and the fact that SMS engenders higher response rates as it is a trusted and personal service. SMS spam filtering is a relatively new task which inherits many issues and solutions from email spam filtering. However it poses its own specific challenges. This paper motivates work on filtering SMS spam and reviews recent developments in SMS spam filtering. The paper also discusses the issues with data collection and availability for furthering research in this area, analyses a large corpus of SMS spam, and provides some initial benchmark results.

164 citations


Cites background or methods from "Content based SMS spam filtering"

  • ...Most work in SMS spam filtering uses some sort of feature selection technique to reduce the large feature space, including Information Gain (Gómez Hidalgo et al., 2006; Sohn et al., 2009) and Mutual Information (Deng & Peng, 2006) which are widely accepted methods in text classification, but also including less commonly used methods such as Expected Cross Entropy (Cai et al., 2008); interestingly, Information Gain is also the most commonly used method for email spam filtering (Guzella & Caminhas, 2009). [Footnote 7: Stylometry is the statistical analysis of linguistic style.]...

    [...]

  • ...A feature set including words, normalised (i.e. lowercase) words, character bi- and tri-grams and word bi-grams suggested by Gómez Hidalgo et al. (2006) has provided a base feature set for much of the work in feature engineering....

    [...]

  • ...Recently the SMS Spam Collection has been made publicly available (Almeida et al., 2011), which is an extension of a corpus previously compiled by Gómez Hidalgo et al. (2006)....

    [...]

  • ...Gómez Hidalgo et al. (2006) evaluated a number of classification algorithms on two SMS spam datasets and concluded that these techniques can be effectively transferred from email to SMS spam filtering, with SVMs being the most suitable....

    [...]
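
The base feature set mentioned in the excerpts above (words, lowercased words, character bi- and tri-grams, and word bi-grams) can be illustrated with a minimal Python sketch. This is not the cited authors' implementation: the function name `extract_features`, the feature prefixes, whitespace tokenization, and the choice to build word bi-grams from lowercased words are all assumptions made for illustration.

```python
import re


def extract_features(message: str) -> list[str]:
    """Return a flat list of string features for one SMS message.

    Assumptions (not taken from the cited papers): words are split on
    whitespace, character n-grams are taken over the raw message text,
    and word bi-grams are built from the lowercased words.
    """
    words = re.findall(r"\S+", message)      # whitespace-delimited words
    lower = [w.lower() for w in words]       # normalised (lowercase) words

    features = [f"w={w}" for w in words]
    features += [f"lw={w}" for w in lower]

    # Character bi-grams (n=2) and tri-grams (n=3) over the raw text.
    for n in (2, 3):
        features += [
            f"c{n}={message[i:i + n]}"
            for i in range(len(message) - n + 1)
        ]

    # Word bi-grams: each pair of adjacent lowercased words.
    features += [f"wb={a}_{b}" for a, b in zip(lower, lower[1:])]
    return features


feats = extract_features("Free SMS")
```

The prefixes (`w=`, `lw=`, `c2=`, `c3=`, `wb=`) keep the feature families apart, so a word and an identical character tri-gram are not conflated when the features are later counted.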

Proceedings Article
06 Nov 2007
TL;DR: It is concluded that content filtering for short messages is surprisingly effective and can be improved substantially using different features, while compression-model filters perform quite well as-is.
Abstract: We consider the problem of content-based spam filtering for short text messages that arise in three contexts: mobile (SMS) communication, blog comments, and email summary information such as might be displayed by a low-bandwidth client. Short messages often consist of only a few words, and therefore present a challenge to traditional bag-of-words based spam filters. Using three corpora of short messages and message fields derived from real SMS, blog, and spam messages, we evaluate feature-based and compression-model-based spam filters. We observe that bag-of-words filters can be improved substantially using different features, while compression-model filters perform quite well as-is. We conclude that content filtering for short messages is surprisingly effective.

140 citations

References
Book
15 Oct 1992
TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Abstract: From the Publisher: Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use, the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation. C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies. This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.

21,674 citations

Journal Article
TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Abstract: The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

7,539 citations


"Content based SMS spam filtering" refers background in this paper

  • ...learning process takes as input the training collection, and consists of the following steps [14]:...

    [...]

Journal Article
01 Mar 2002
TL;DR: This presentation discusses the design and implementation of machine learning algorithms in Java, as well as some of the techniques used to develop and implement these algorithms.
Abstract: 1. What's It All About? 2. Input: Concepts, Instances, Attributes 3. Output: Knowledge Representation 4. Algorithms: The Basic Methods 5. Credibility: Evaluating What's Been Learned 6. Implementations: Real Machine Learning Schemes 7. Moving On: Engineering The Input And Output 8. Nuts And Bolts: Machine Learning Algorithms In Java 9. Looking Forward

5,936 citations


"Content based SMS spam filtering" refers methods in this paper

  • ...Content-based filters can also be built by using Machine Learning techniques applied to a set of pre-classified messages [16]....

    [...]

Proceedings Article
08 Jul 1997
TL;DR: This paper finds strong correlations between the DF, IG and CHI values of a term and suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive.
Abstract: This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ² test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to [...] of unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed, we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to [...] vocabulary reduction but is not competitive at higher vocabulary reduction levels. In contrast, MI had relatively poor performance due to its bias towards favoring rare terms and its sensitivity to probability estimation errors.

5,366 citations


"Content based SMS spam filtering" refers methods in this paper

  • ...We use Information Gain (IG) [18, 19] as attribute quality metric....

    [...]
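
As a rough illustration of Information Gain as an attribute quality metric, the sketch below scores a binary "token present / token absent" attribute against the class labels, using the standard definition IG = H(C) - Σ_v P(A=v)·H(C | A=v). The function names, the representation of messages as token sets, and the toy data are assumptions for illustration, not the setup used in the paper.

```python
import math
from collections import Counter


def entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(docs: list[set[str]], labels: list[str], token: str) -> float:
    """Information Gain of the binary attribute 'token occurs in message'.

    Assumes docs and labels are non-empty and of equal length.
    """
    present = [y for d, y in zip(docs, labels) if token in d]
    absent = [y for d, y in zip(docs, labels) if token not in d]
    n = len(labels)
    # Class entropy remaining after splitting on the attribute.
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


# Hypothetical toy collection: each message is a set of tokens.
docs = [{"free", "win"}, {"free"}, {"hello"}, {"meeting"}]
labels = ["spam", "spam", "ham", "ham"]
ig_free = information_gain(docs, labels, "free")
ig_win = information_gain(docs, labels, "win")
```

In this toy collection "free" separates the two classes perfectly, so it receives the maximum score of 1 bit, while "win" occurs in only one spam message and scores lower. Ranking tokens by this score and keeping the top-scoring ones is one common way to apply the metric for feature selection.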

Book
Gerard Salton
03 Jan 1989

3,571 citations


"Content based SMS spam filtering" refers background in this paper

  • ...Conversion of a message into an attribute-value pairs’ vector [13], where the attributes are the previously defined tokens, and their values can be binary, (relative) frequencies, etc....

    [...]
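
The conversion of a message into an attribute-value vector described in the excerpt above can be sketched as follows. The function name `vectorize`, the `mode` parameter, and the choice to compute relative frequencies over all tokens in the message (including those outside the vocabulary) are assumptions for illustration; the paper may define these values differently.

```python
from collections import Counter


def vectorize(tokens: list[str], vocabulary: list[str], mode: str = "binary") -> list[float]:
    """Map a tokenized message onto a fixed list of attributes (tokens).

    mode="binary":   1.0 if the attribute occurs in the message, else 0.0.
    mode="relative": occurrences of the attribute divided by the total
                     number of tokens in the message (an assumption).
    """
    counts = Counter(tokens)
    if mode == "binary":
        return [1.0 if counts[t] > 0 else 0.0 for t in vocabulary]
    if mode == "relative":
        total = len(tokens)
        return [counts[t] / total if total else 0.0 for t in vocabulary]
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical vocabulary and message.
vocabulary = ["free", "win", "hello"]
tokens = ["free", "free", "win", "now"]
binary_vec = vectorize(tokens, vocabulary, mode="binary")
relative_vec = vectorize(tokens, vocabulary, mode="relative")
```

Every message maps to a vector of the same length as the vocabulary, which is what lets a learning algorithm compare messages attribute by attribute.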