Journal Article

SMS Spam Detection Using Noncontent Features

Qian Xu, Evan Wei Xiang, Qiang Yang, Jiachun Du, Jieping Zhong
01 Nov 2012 - IEEE Intelligent Systems (IEEE) - Vol. 27, Iss. 6, pp. 44-51
TL;DR: This service-side solution uses graph data mining to distinguish spammers from nonspammers and detect spam without checking a message's contents.
Abstract: Short Message Service text messages are indispensable, but they face a serious problem from spamming. This service-side solution uses graph data mining to distinguish spammers from nonspammers and detect spam without checking a message's contents.
Citations
Journal Article
TL;DR: Different real-world applications have varying definitions of suspicious behaviors, and detection methods often look for the most suspicious parts of the data by optimizing scores, but quantifying the suspiciousness of a behavioral pattern is still an open issue.
Abstract: Different real-world applications have varying definitions of suspicious behaviors. Detection methods often look for the most suspicious parts of the data by optimizing scores, but quantifying the suspiciousness of a behavioral pattern is still an open issue.

116 citations

Proceedings Article
22 Jul 2012
TL;DR: A Supervised Matrix Factorization method with Social Regularization (SMFSR) for spammer detection in social networks that exploits both social activities as well as users' social relations in an innovative and highly scalable manner is proposed.
Abstract: As the popularity of the social media increases, as evidenced in Twitter, Facebook and China's Renren, spamming activities also picked up in numbers and variety. On social network sites, spammers often disguise themselves by creating fake accounts and hijacking normal users' accounts for personal gains. Different from the spammers in traditional systems such as SMS and email, spammers in social media behave like normal users and they continue to change their spamming strategies to fool anti-spamming systems. However, due to the privacy and resource concerns, many social media websites cannot fully monitor all the contents of users, making many of the previous approaches, such as topology-based and content-classification-based methods, infeasible to use. In this paper, we propose a Supervised Matrix Factorization method with Social Regularization (SMFSR) for spammer detection in social networks that exploits both social activities as well as users' social relations in an innovative and highly scalable manner. The proposed method detects spammers collectively based on users' social actions and social relations. We have empirically tested our method on data from Renren.com, which is one of the largest social networks in China, and demonstrated that our new method can improve the detection performance significantly.

109 citations


Cites background from "SMS Spam Detection Using Noncontent..."

  • ...Major research topics in spamming detection include spamming email detection (Blanzieri and Bryl 2008), spamming Web page detection (Gyöngyi and Garcia-Molina 2005), and spamming instant message detection (Xu et al. 2012; Liu et al. 2006)....

    [...]

  • ...…message systems Spammer detection has been studied in message systems for many years, from email systems (Blanzieri and Bryl 2008), to SMS systems (Liu et al. 2006; Xu et al. 2012), and most recently to microblogging websites such as Twitter (Benevenuto et al. 2010; Lee, Eoff, and Caverlee 2011)....

    [...]

Journal Article
TL;DR: Using the dataset from a popular OHC, the research demonstrated that the proposed metric is highly effective in identifying influential users and combining the metric with other traditional measures further improves the identification of influential users.

106 citations

31 Mar 2013
TL;DR: The results indicate that the procedure followed to build the collection does not lead to near-duplicates and, regarding the classifiers, that Support Vector Machines outperform the other evaluated techniques and, hence, can be used as a good baseline for further comparison.
Abstract: The growth of mobile phone users has led to a dramatic increase in SMS spam messages. Recent reports clearly indicate that the volume of mobile phone spam is dramatically increasing year by year. In practice, fighting this plague is made difficult by several factors, including the lower rate of SMS that has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. Probably, one of the major concerns in academic settings is the scarcity of public SMS spam datasets, which are sorely needed for validation and comparison of different classifiers. Moreover, traditional content-based filters may have their performance seriously degraded since SMS messages are fairly short and their text is generally rife with idioms and abbreviations. In this paper, we present details about a new real, public and non-encoded SMS spam collection that is the largest one as far as we know. Moreover, we offer a comprehensive analysis of this dataset in order to ensure that there are no duplicated messages coming from previously existing datasets, since such duplicates may ease the task of learning SMS spam classifiers and could compromise the evaluation of methods. Additionally, we compare the performance achieved by several established machine learning techniques. In summary, the results indicate that the procedure followed to build the collection does not lead to near-duplicates and, regarding the classifiers, that Support Vector Machines outperform the other evaluated techniques and, hence, can be used as a good baseline for further comparison.

84 citations


Cites background from "SMS Spam Detection Using Noncontent..."

  • ...[21] proposed a service-side solution that uses graph data mining to distinguish likely spammers from normal senders....

    [...]

Journal Article
TL;DR: The proposed text processing approach is based on lexicographic and semantic dictionaries along with state-of-the-art techniques for semantic analysis and context detection and aims to alleviate factors that can degrade the algorithms' performance, such as redundancies and inconsistencies.
Abstract: The rapid popularization of smartphones has contributed to the growth of online Instant Messaging and SMS usage as an alternative way of communication. The increasing number of users, along with the trust they inherently have in their devices, makes such messages a propitious environment for spammers. In fact, reports clearly indicate that the volume of spam over Instant Messaging and SMS is dramatically increasing year by year. It represents a challenging problem for traditional filtering methods nowadays, since such messages are usually fairly short and normally rife with slang, idioms, symbols and acronyms that make even tokenization a difficult task. In this scenario, this paper proposes and then evaluates a method to normalize and expand original short and messy text messages in order to acquire better attributes and enhance the classification performance. The proposed text processing approach is based on lexicographic and semantic dictionaries along with state-of-the-art techniques for semantic analysis and context detection. This technique is used to normalize terms and create new attributes in order to change and expand original text samples, aiming to alleviate factors that can degrade the algorithms' performance, such as redundancies and inconsistencies. We have evaluated our approach with a public, real and non-encoded data-set along with several established machine learning methods. Our experiments were diligently designed to ensure statistically sound results, which indicate that the proposed text processing techniques can in fact enhance Instant Messaging and SMS spam filtering.

80 citations


Additional excerpts

  • ...[62] make use of non-content features like time and network traffic in the same learning-based approach....

    [...]

References
Journal Article
TL;DR: The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.

17,017 citations


"SMS Spam Detection Using Noncontent..." refers background in this paper

  • ...Ranking based evaluation metrics are used increasingly in machine learning and data mining community when dealing with imbalanced data [12, 13]....

    [...]
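
As a rough illustration of why ranking-based metrics are attractive for imbalanced data, the sketch below computes the area under the ROC curve as the fraction of (positive, negative) pairs that a scorer ranks correctly. It is not taken from the cited article or from the SMS spam paper; the function name `roc_auc` and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy imbalanced data (hypothetical): 2 spammers among 10 senders.
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.6, 0.1, 0.4, 0.05, 0.15, 0.25]

# Labelling every sender as a nonspammer already gives 80% accuracy here,
# yet says nothing about how well the spammers are ranked.
print(roc_auc(labels, scores))  # 0.9375 (15 of 16 pairs ordered correctly)
```

The pairwise loop is quadratic and only meant for small examples; a sort-based computation would be the usual choice on real data.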

Journal Article
TL;DR: There are several arguments which support the observed high accuracy of SVMs, which are reviewed and numerous examples and proofs of most of the key theorems are given.
Abstract: The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, working through a non-trivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique and when they are global. We describe how support vector training can be practically implemented, and discuss in detail the kernel mapping technique which is used to construct SVM solutions which are nonlinear in the data. We show how Support Vector machines can have very large (even infinite) VC dimension by computing the VC dimension for homogeneous polynomial and Gaussian radial basis function kernels. While very high VC dimension would normally bode ill for generalization performance, and while at present there exists no theory which shows that good generalization performance is guaranteed for SVMs, there are several arguments which support the observed high accuracy of SVMs, which we review. Results of some experiments which were inspired by these arguments are also presented. We give numerous examples and proofs of most of the key theorems. There is new material, and I hope that the reader will find that even old material is cast in a fresh light.

15,696 citations

Journal Article
TL;DR: LIBLINEAR is an open source library for large-scale linear classification that supports logistic regression and linear support vector machines and provides easy-to-use command-line tools and library calls for users and developers.
Abstract: LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.

7,848 citations


"SMS Spam Detection Using Noncontent..." refers methods in this paper

  • ...For the SVM classifier, we use LIBLINEARSVM [17]....

    [...]

  • ...[17] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin....

    [...]

Journal Article
TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Abstract: The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

7,539 citations


"SMS Spam Detection Using Noncontent..." refers background in this paper

  • ...Content-based approaches [5] are among the first to be applied....

    [...]

Proceedings Article
20 Jul 2008
TL;DR: This paper proposes a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings and shows that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the NDCG measure for evaluating ranked results.
Abstract: A recommender system must be able to suggest items that are likely to be preferred by the user. In most systems, the degree of preference is represented by a rating score. Given a database of users' past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. In this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. We measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. Experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the NDCG measure for evaluating ranked results.
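
The idea of comparing users by how they rank items, rather than by the rating values themselves, can be sketched as follows. The abstract does not say which correlation is used, so Kendall's tau over co-rated items is assumed here as one common rank correlation; the function name `rank_similarity` and the toy ratings are hypothetical.

```python
from itertools import combinations


def rank_similarity(ratings_u, ratings_v):
    """Kendall-tau-style similarity over items rated by both users.

    ratings_u and ratings_v map item ids to numeric ratings. Only the
    relative order of each item pair matters, not the rating values.
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0  # assumption: too little overlap is treated as no signal
    concordant = discordant = 0
    for i, j in combinations(common, 2):
        product = (ratings_u[i] - ratings_u[j]) * (ratings_v[i] - ratings_v[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(common) * (len(common) - 1) // 2
    return (concordant - discordant) / total_pairs


# Two users with different rating scales but the same preference order.
alice = {"a": 5, "b": 3, "c": 1}
bob = {"a": 4, "b": 2, "c": 1}
carol = {"a": 1, "b": 3, "c": 5}

print(rank_similarity(alice, bob))    # 1.0, identical ordering
print(rank_similarity(alice, carol))  # -1.0, reversed ordering
```

Pairs tied for either user count toward neither total, so heavily tied ratings pull the score toward zero.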

303 citations


"SMS Spam Detection Using Noncontent..." refers background in this paper

  • ...Ranking based evaluation metrics are used increasingly in machine learning and data mining community when dealing with imbalanced data [12, 13]....

    [...]

  • ...[13] Nathan Nan Liu and Qiang Yang: EigenRank: a ranking-oriented approach to collaborative filtering....

    [...]