Journal Article

Effective Filtering of Unsolicited Messages from Online Social Networks Using Spam Templates and Social Contexts

01 Jul 2020-Wireless Personal Communications (Springer US)-Vol. 113, Iss: 1, pp 519-536
TL;DR: Experimental results demonstrate that the proposed model with the SVM-Polynomial Radial Basis kernel provides better accuracy in spam classification and outperforms the compared state-of-the-art methods.
Abstract: Online social networking sites have grown remarkably over the last decade. Because of this popularity, spammers use these sites to distribute spam messages, employing a variety of techniques to spread them. Spam detection must therefore be robust enough to identify unsolicited messages and deter spammers. Although various spam identification procedures are available, their accuracy still needs to improve. In this work, a method is proposed to recognise and block unsolicited messages. Social context parameters such as trust and strength, together with spam template matching, are considered alongside basic classifiers for effective spam classification. The interaction factors between users are used to calculate strength. Spam templates are generated during training by applying the majority merge operation to the spam messages, and incoming messages are compared against these templates during testing. Trust values are updated after each message is classified. Experimental results demonstrate that the proposed model with the SVM-Polynomial Radial Basis kernel provides better accuracy in spam classification and outperforms the compared state-of-the-art methods.
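The abstract does not spell out how the majority merge operation builds a template, so the following is only a minimal sketch under stated assumptions: it uses the classic greedy majority-merge heuristic for a common supersequence over whitespace-separated tokens, and scores an incoming message by how many of its tokens appear, in order, in the template. The function names `majority_merge` and `template_coverage`, the tokenisation, and the coverage score are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_merge(sequences):
    """Greedy majority-merge heuristic: build a common supersequence of token lists."""
    remaining = [list(seq) for seq in sequences if seq]
    template = []
    while remaining:
        # Count the leading token of every unfinished sequence.
        heads = Counter(seq[0] for seq in remaining)
        token, _ = heads.most_common(1)[0]
        template.append(token)
        # Consume that token from each sequence that currently starts with it.
        remaining = [seq[1:] if seq[0] == token else seq for seq in remaining]
        remaining = [seq for seq in remaining if seq]
    return template


def template_coverage(template, message_tokens):
    """Fraction of message tokens found, in order, inside the template (assumed score)."""
    if not message_tokens:
        return 0.0
    matched = 0
    position = 0
    for token in message_tokens:
        try:
            # Search only to the right of the previous match to preserve order.
            position = template.index(token, position) + 1
            matched += 1
        except ValueError:
            continue
    return matched / len(message_tokens)


# Hypothetical training spam used only for illustration.
spam_messages = [
    "win free prize now".split(),
    "win free cash now".split(),
    "claim free prize now".split(),
]
spam_template = majority_merge(spam_messages)
```

Because the template is a supersequence of every training message, each training message is fully covered by it, while a message that shares no tokens with the template scores zero. How such a score would be combined with trust, strength, and the classifier output is not described in the abstract.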
Citations
Journal Article
TL;DR: In this article, the authors propose converting email text into vector features using the vector space model, constructing a two-dimensional matrix, and using a convolutional neural network (CNN) to identify spam on the Internet.
Abstract: To enhance the filtering of spam on the Internet and improve the experience of Internet users, this paper proposes converting email text into vector features using the vector space model, constructing a two-dimensional matrix, and using a convolutional neural network (CNN) to identify spam. The CNN was compared with two other classifiers, a support vector machine (SVM) and a back-propagation neural network (BPNN), in simulation experiments. The final results showed that the spam recognition algorithm with the CNN as the classifier had better recognition performance than the algorithms with the SVM and BPNN classifiers and was also more advantageous in terms of recognition cost and time; in addition, the CNN had the best recognition performance when the number of extracted features was 15.
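The abstract does not state how the feature vector is weighted or laid out as a matrix, so the sketch below is an assumption-laden illustration only: it uses raw term counts over a fixed vocabulary as the vector space representation, pads the vector with zeros, and folds it row by row into a two-dimensional matrix that a CNN could take as input. The function name `text_to_matrix`, the vocabulary, and the matrix width are hypothetical.

```python
from collections import Counter


def text_to_matrix(text, vocabulary, width):
    """Map text to a term-count vector over `vocabulary`, then fold it into rows."""
    counts = Counter(text.lower().split())
    vector = [counts[term] for term in vocabulary]
    # Pad with zeros so the vector length is a multiple of the row width.
    padding = (-len(vector)) % width
    vector.extend([0] * padding)
    return [vector[i:i + width] for i in range(0, len(vector), width)]


# Hypothetical five-term vocabulary folded into rows of three.
vocabulary = ["free", "prize", "win", "meeting", "lunch"]
matrix = text_to_matrix("Win a FREE prize free", vocabulary, width=3)
# matrix == [[2, 1, 1], [0, 0, 0]]
```

Raw counts are the simplest vector space weighting; a TF-IDF weighting would fit the same layout.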
References
Journal Article
01 Dec 2015
TL;DR: A novel Class A classifier general enough to thwart overfitting, lightweight thanks to the usage of the less costly features, and still able to correctly classify more than 95% of the accounts of the original training set.
Abstract: Fake followers are those Twitter accounts specifically created to inflate the number of followers of a target account. Fake followers are dangerous for the social platform and beyond, since they may alter concepts like popularity and influence in the Twittersphere, hence impacting the economy, politics, and society. In this paper, we contribute along different dimensions. First, we review some of the most relevant existing features and rules (proposed by Academia and Media) for anomalous Twitter account detection. Second, we create a baseline dataset of verified human and fake follower accounts. This baseline dataset is publicly available to the scientific community. Then, we exploit the baseline dataset to train a set of machine-learning classifiers built over the reviewed rules and features. Our results show that most of the rules proposed by Media provide unsatisfactory performance in revealing fake followers, while features proposed in the past by Academia for spam detection provide good results. Building on the most promising features, we revise the classifiers both in terms of reduction of overfitting and cost for gathering the data needed to compute the features. The final result is a novel Class A classifier, general enough to thwart overfitting, lightweight thanks to the usage of the less costly features, and still able to correctly classify more than 95% of the accounts of the original training set. We ultimately perform an information fusion-based sensitivity analysis, to assess the global sensitivity of each of the features employed by the classifier. The findings reported in this paper, other than being supported by a thorough experimental methodology and interesting on their own, also pave the way for further investigation on the novel issue of fake Twitter followers.

340 citations

Journal Article
TL;DR: A comprehensive review on feature selection techniques for text classification, including Nearest Neighbor (NN) method, Naïve Bayes, Support Vector Machine (SVM), Decision Tree (DT), and Neural Networks, is given.
Abstract: Big multimedia data is heterogeneous in essence, that is, the data may be a mixture of video, audio, text, and images. This is due to the prevalence of novel applications in recent years, such as social media, video sharing, and location based services (LBS), etc. In many multimedia applications, for example, video/image tagging and multimedia recommendation, text classification techniques have been used extensively to facilitate multimedia data processing. In this paper, we give a comprehensive review on feature selection techniques for text classification. We begin by introducing some popular representation schemes for documents, and similarity measures used in text classification. Then, we review the most popular text classifiers, including Nearest Neighbor (NN) method, Naive Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), and Neural Networks. Next, we survey four feature selection models, namely the filter, wrapper, embedded and hybrid, discussing pros and cons of the state-of-the-art feature selection approaches. Finally, we conclude the paper and give a brief introduction to some interesting feature selection work that does not belong to the four models.

223 citations

Journal Article
TL;DR: The goal of the present survey is to analyze the data mining techniques that were utilized by social media networks between 2003 and 2015; it suggests that more research be conducted by both academia and industry, since the studies done so far do not exhaustively cover data mining techniques.

128 citations

Journal Article
TL;DR: This work proposes GFTrust, a modified flow-based trust evaluation scheme that addresses path dependence using network flow and models trust decay with the leakage associated with each node; experiments show that it predicts trust in OSNs with high accuracy and verify its preferable properties.
Abstract: In online social networks (OSNs), to evaluate trust from one user to another indirectly connected user, the trust evidence in the trusted paths (i.e., paths built through intermediate trustful users) should be carefully treated. Some paths may overlap with each other, leading to a unique challenge of path dependence, i.e., how to aggregate the trust values of multiple dependent trusted paths. OSNs bear the characteristic of high clustering, which makes the path dependence phenomenon common. Another challenge is trust decay through propagation, i.e., how to propagate trust along a trusted path, considering the possible decay in each node. We analyze the similarity between trust propagation and network flow, and convert a trust evaluation task with path dependence and trust decay into a generalized network flow problem. We propose a modified flow-based trust evaluation scheme, GFTrust, in which we address path dependence using network flow, and model trust decay with the leakage associated with each node. Experimental results, with the real social network data sets of Epinions and Advogato, demonstrate that GFTrust can predict trust in OSNs with high accuracy, and verify its preferable properties.
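To make the idea of trust decay through node leakage concrete, here is a deliberately simplified sketch for a single trusted path. It is not the GFTrust algorithm, which solves a generalized network flow problem over many overlapping paths. The assumptions are: each edge's trust value acts as a capacity, one unit of trust is sent from the source, and each intermediate node loses a fixed fraction of what passes through it. The function name `path_trust` and its parameters are hypothetical.

```python
def path_trust(edge_trust, node_leakage):
    """Trust delivered along one path under a simplified leakage model.

    edge_trust: trust value (capacity) of each edge, ordered source to target.
    node_leakage: fraction lost at each intermediate node (one fewer than edges).
    """
    if len(node_leakage) != len(edge_trust) - 1:
        raise ValueError("need one leakage value per intermediate node")
    flow = 1.0  # hypothetical unit of trust sent from the source
    for index, capacity in enumerate(edge_trust):
        flow = min(flow, capacity)  # an edge cannot carry more than its trust value
        if index < len(node_leakage):
            flow *= 1.0 - node_leakage[index]  # decay at the intermediate node
    return flow
```

For example, with edge trust values 0.9 and 0.8 and a leakage of 0.1 at the single intermediate node, 0.9 enters the node, 0.81 leaves it, and the second edge caps the delivered trust at 0.8. Aggregating several dependent paths requires the flow formulation described in the abstract and is outside this sketch.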

112 citations

Journal Article
TL;DR: A performance evaluation of existing machine learning-based streaming spam detection methods, carried out across the three aspects of data, feature, and model, shows that streaming spam tweet detection is still a big challenge and that a robust detection technique should take all three aspects into account.
Abstract: The popularity of Twitter attracts more and more spammers. Spammers send unwanted tweets to Twitter users to promote websites or services, which are harmful to normal users. In order to stop spammers, researchers have proposed a number of mechanisms. The focus of recent works is on the application of machine learning techniques to Twitter spam detection. However, tweets are retrieved in a streaming way, and Twitter provides the Streaming API for developers and researchers to access public tweets in real time. A performance evaluation of existing machine learning-based streaming spam detection methods has been lacking. In this paper, we bridge the gap by carrying out a performance evaluation from three different aspects: data, feature, and model. A large ground-truth dataset of over 600 million public tweets was created by using a commercial URL-based security tool. For real-time spam detection, we further extracted 12 lightweight features for tweet representation. Spam detection was then transformed into a binary classification problem in the feature space, which can be solved by conventional machine learning algorithms. We evaluated the impact of different factors on spam detection performance, including the spam to nonspam ratio, feature discretization, training data size, data sampling, time-related data, and machine learning algorithms. The results show that streaming spam tweet detection is still a big challenge and that a robust detection technique should take into account the three aspects of data, feature, and model.

102 citations