Book ChapterDOI

Automated Spam Detection in Short Text Messages

TL;DR: Experimental results indicate that the proposed algorithm is highly accurate in detecting spam in short messages and can be utilized by a wide variety of users to reduce the volume of spam messages.
Abstract: The growing popularity and reach of short text messages has led to their use in propagating unsolicited advertising, promotional offers, and other unwarranted material to users, producing a high influx of spam messages. To protect users' interests, telecommunication companies have deployed several countermeasures to curb the volume of such spam. However, some spam messages still evade these measures and cause varying degrees of annoyance to users. In this chapter, an automated spam detection algorithm is proposed for the particular problem of short text message spam. The proposed algorithm performs two-class (spam, ham) classification using stylistic and text features specific to short text messages. The algorithm is evaluated on three databases belonging to diverse demographic settings. Experimental results indicate that the proposed algorithm is highly accurate in detecting spam in short messages and can be used by a wide variety of users to reduce the volume of spam messages.
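The abstract does not list the stylistic features the chapter uses. As a rough illustration only, the sketch below computes a handful of features that are commonly cited for short messages; the function name `stylistic_features` and every feature in it are assumptions made for this example, not the chapter's feature set.

```python
import re


def stylistic_features(message: str) -> dict:
    """Compute a few illustrative (hypothetical) stylistic features of a message."""
    tokens = message.split()
    letters = [c for c in message if c.isalpha()]
    return {
        # Overall size of the message in characters and whitespace-separated tokens.
        "length": len(message),
        "num_tokens": len(tokens),
        # Digits often appear in phone numbers and prize amounts.
        "num_digits": sum(c.isdigit() for c in message),
        # Share of letters that are upper case; 0.0 when there are no letters.
        "upper_ratio": sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
        "num_exclaim": message.count("!"),
        # Very loose URL check, enough for a sketch.
        "has_url": bool(re.search(r"https?://|www\.", message, re.IGNORECASE)),
    }
```

A feature dictionary like this could be turned into a numeric vector and passed to any two-class classifier; which classifier and which features work best is an empirical question that this sketch does not answer.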
Citations
Journal ArticleDOI
TL;DR: A new hybrid ensemble approach is proposed that combines the predictions obtained by the classifiers using the original text samples along with their variations created by applying text normalization and semantic indexing techniques, which can improve the text content quality and enhance the performance of the expert systems for spamming detection.
Abstract: A new classifier is presented to detect undesired short text comments. The proposed approach is light, fast, multinomial and offers incremental learning. The impact of applying text normalization and semantic indexing is studied. The results indicate the proposed techniques outperformed most of the approaches. Text normalization and semantic indexing enhanced the classifiers' performance. The popularity and reach of short text messages commonly used in electronic communication have led spammers to use them to propagate undesired content. This content often consists of misleading information, advertisements, viruses, and malware that can be harmful and annoying to users. The dynamic nature of spam messages demands knowledge-based systems with online learning and, therefore, most traditional text categorization techniques cannot be used. In this study, we introduce MDLText, a text classifier based on the minimum description length principle, to the context of filtering undesired short text messages. The proposed approach supports incremental learning and, therefore, its predictive model is scalable and can adapt to continuously evolving spamming techniques. It is also fast, with computational cost increasing linearly with the number of samples and features, which is very desirable for expert systems applied to real-time electronic communication. In addition to being dynamic, these messages are short and usually poorly written, rife with slang, symbols, and abbreviations that hinder text representation, learning, and filtering. In this scenario, we also investigated the benefits of using text normalization and semantic indexing techniques. We showed these techniques can improve the text content quality and, consequently, enhance the performance of the expert systems for spamming detection.
Based on these findings, we propose a new hybrid ensemble approach that combines the predictions obtained by the classifiers using the original text samples along with their variations created by applying text normalization and semantic indexing techniques. It has the advantages of being independent of the classification method and the results indicated it is efficient to filter undesired short text messages.

23 citations


Cites methods from "Automated Spam Detection in Short T..."

  • ...In a recent study, Goswami et al. (2016) used SVM trained with stylistic and text features specific of short text samples to classify SMS messages....


Proceedings ArticleDOI
05 Oct 2020
TL;DR: In this article, the TF-IDF and RF term weighting methods were compared for classifying spam SMS and for using the limited content of SMS messages more meaningfully; the vectors obtained from the data set were weighted by TF-IDF and RF and classified with 5 different classifiers popular in this field.
Abstract: Short message services are one of the most widely used communication services. The increased use of mobile devices and the lowering of SMS costs by operators enable short message services to remain popular. However, this popularity causes tens of users to be exposed to spam SMS every day. The term spam can simply be referred to as unwanted messages by users. Although organizations take measures against spam SMS and there are widely used spam SMS filtering systems, the problem of spam SMS is becoming widespread. There are many studies in the literature for the detection of spam SMS, but new and efficient methods are still needed. In this study, TF-IDF and RF term weighting methods which are frequently used in text mining applications were compared in order to classify spam SMS and to use the limited content of SMSs more meaningfully. The vectors obtained from the data set were weighted by TF-IDF and RF term weighting methods and classified with 5 different classifiers popular in this field.
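For readers unfamiliar with term weighting, the following is a minimal sketch of one common TF-IDF formulation (relative term frequency multiplied by the logarithm of inverse document frequency). It is not taken from the paper: the function name `tfidf`, the tokenized-list input, and the exact formula variant are assumptions, and the RF weighting the paper compares against is not shown.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = (count / doc length) * ln(N / document frequency).
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights
```

Under this variant a term that occurs in every message receives weight zero, which is one reason short, repetitive SMS corpora are sensitive to the choice of weighting scheme.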

4 citations

Proceedings ArticleDOI
15 May 2018
TL;DR: The experimental studies have shown that a developed artificial neural network model is adequate and it can be effectively used for the e-mail messages classification and the scheme of this technology for e-mail messages “spam”/“not spam” classification is shown.
Abstract: In this paper we solve the problem of neural network technology development for e-mail message classification. We analyze basic methods of spam filtering such as sender IP-address analysis, detection of repeated spam messages, and word-based Bayesian filtering. We offer the neural network technology for solving this problem because neural networks are universal approximators and are effective in addressing classification problems. We also offer the scheme of this technology for e-mail messages “spam”/“not spam” classification. The creation of an effective neural network model of spam filtering is performed within the technology of knowledge discovery in databases. For this, a training set is formed, the neural network model is trained, and its value and classifying ability are estimated. The experimental studies have shown that the developed artificial neural network model is adequate and can be effectively used for e-mail message classification. Thus, in this paper we have shown the possibility of using an effective neural network model for e-mail message filtering and have shown a scheme for using an artificial neural network model as part of an intelligent e-mail spam filtering system.

2 citations


Cites methods from "Automated Spam Detection in Short T..."

  • ...SPAM FILTRATION METHODS ANALYSIS The basic methods of spam filtration are [8,12,20,21]:...


Journal ArticleDOI
21 May 2018 - iSys
TL;DR: A simple, fast, scalable, multiclass, and online text classification method based on the minimum description length principle that is effective on instant messaging and SMS spam filtering in both online and offline learning contexts is evaluated.
Abstract: Spam filtering in online instant messages and SMS is a challenging problem nowadays. This is because the messages are often very short and rife with slang, idioms, symbols, emoticons, and abbreviations, which hamper prediction and knowledge discovery. In order to face this problem, we evaluated a simple, fast, scalable, multiclass, and online text classification method based on the minimum description length principle. We conducted experiments using a real and public dataset, which demonstrate that our method is effective on instant messaging and SMS spam filtering in both online and offline learning contexts.

1 citation

References
Journal ArticleDOI
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

40,826 citations

Proceedings ArticleDOI
19 Sep 2011
TL;DR: A new real, public and non-encoded SMS spam collection that is the largest one as far as the authors know is offered and the performance achieved by several established machine learning methods is compared.
Abstract: The growth of mobile phone users has led to a dramatic increase in SMS spam messages. In practice, fighting mobile phone spam is made difficult by several factors, including the lower rate of SMS that has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. On the other hand, in academic settings, a major handicap is the scarcity of public SMS spam datasets, which are sorely needed for validation and comparison of different classifiers. Moreover, as SMS messages are fairly short, content-based spam filters may have their performance degraded. In this paper, we offer a new real, public and non-encoded SMS spam collection that is the largest one as far as we know. Moreover, we compare the performance achieved by several established machine learning methods. The results indicate that Support Vector Machine outperforms other evaluated classifiers and, hence, it can be used as a good baseline for further comparison.

369 citations

Proceedings ArticleDOI
10 Oct 2006
TL;DR: This paper analyzes to what extent Bayesian filtering techniques used to block email spam can be applied to the problem of detecting and stopping mobile spam, and demonstrates that Bayesian filters can be effectively transferred from email to SMS spam.
Abstract: In recent years, we have witnessed a dramatic increase in the volume of spam email. Other related forms of spam are increasingly emerging as an important problem, especially spam on Instant Messaging services (the so-called SPIM) and Short Message Service (SMS) or mobile spam. Like email spam, the SMS spam problem can be approached with legal, economic or technical measures. Among the wide range of technical measures, Bayesian filters are playing a key role in stopping email spam. In this paper, we analyze to what extent Bayesian filtering techniques used to block email spam can be applied to the problem of detecting and stopping mobile spam. In particular, we have built two SMS spam test collections of significant size, in English and Spanish. We have tested on them a number of message representation techniques and Machine Learning algorithms, in terms of effectiveness. Our results demonstrate that Bayesian filtering techniques can be effectively transferred from email to SMS spam.
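To make "Bayesian filtering" concrete, here is a minimal sketch of a multinomial naive Bayes filter with add-one (Laplace) smoothing. It is a generic textbook formulation, not the specific filters or message representations evaluated in the paper; the class name `NaiveBayes`, its methods, and the two-label setup are assumptions made for this example.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial naive Bayes filter for the labels "spam" and "ham"."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, tokens, label):
        # Accumulate per-class word counts and per-class message counts.
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def predict(self, tokens):
        # Assumes at least one training message has been seen for each class.
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total = sum(counts.values())
            for t in tokens:
                score += math.log((counts[t] + 1) / (total + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the add-one smoothing keeps an unseen word from driving a class probability to zero.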

197 citations

Proceedings ArticleDOI
06 Nov 2007
TL;DR: It is concluded that content filtering for short messages is surprisingly effective and can be improved substantially using different features, while compression-model filters perform quite well as-is.
Abstract: We consider the problem of content-based spam filtering for short text messages that arise in three contexts: mobile (SMS) communication, blog comments, and email summary information such as might be displayed by a low-bandwidth client. Short messages often consist of only a few words, and therefore present a challenge to traditional bag-of-words based spam filters. Using three corpora of short messages and message fields derived from real SMS, blog, and spam messages, we evaluate feature-based and compression-model-based spam filters. We observe that bag-of-words filters can be improved substantially using different features, while compression-model filters perform quite well as-is. We conclude that content filtering for short messages is surprisingly effective.

140 citations

Proceedings ArticleDOI
01 Mar 2011
TL;DR: A mobile-based system, SMSAssassin, is developed that can filter SMS spam messages based on Bayesian learning and a sender blacklisting mechanism and uses crowdsourcing to keep itself updated.
Abstract: Due to the increase in use of Short Message Service (SMS) over mobile phones in developing countries, there has been a burst of spam SMSes. Content-based machine learning approaches were effective in filtering email spam. Researchers have used topical and stylistic features of the SMS to classify spam and ham. SMS spam filtering can be largely influenced by the presence of regional words, abbreviations and idioms. We have tested the feasibility of applying Bayesian learning and Support Vector Machine (SVM) based machine learning techniques, which were reported to be most effective in email spam filtering, on an India-centric dataset. In our ongoing research, as an exploratory step, we have developed a mobile-based system, SMSAssassin, that can filter SMS spam messages based on Bayesian learning and a sender blacklisting mechanism. Since spam SMS keywords and patterns keep changing, SMSAssassin uses crowdsourcing to keep itself updated. Using a dataset that we are collecting from users in the real world, we evaluated our approaches and found some interesting results.

95 citations