Journal Article

Leveraging Social Networks for Effective Spam Filtering

Haiying Shen, Ze Li
01 Nov 2014 - IEEE Transactions on Computers (IEEE) - Vol. 63, Iss. 11, pp. 2743-2759
TL;DR: Experimental results show that SOAP can greatly improve the performance of Bayesian spam filters in terms of accuracy, attack-resilience, and efficiency of spam detection.
Abstract: The explosive growth of unsolicited e-mails has prompted the development of numerous spam filter techniques. Bayesian spam filters are superior to static keyword-based spam filters in that they can continuously evolve to tackle new spam by learning keywords in new spam emails. However, Bayesian spam filters are easily poisoned by clever spammers who avoid spam keywords and add many innocuous words in their emails. Also, Bayesian spam filters need a significant amount of time to adapt to a new spam based on user feedback. Moreover, few current spam filters exploit social networks to assist in spam detection. In order to develop an accurate and user-friendly spam filter, we propose a SOcial network Aided Personalized and effective spam filter (SOAP) in this paper. In SOAP, each node connects to its social friends; i.e., nodes form a distributed overlay by directly using social network links as overlay links. Each node uses SOAP to collect information and check spam autonomously in a distributed manner. Unlike previous spam filters that focus on parsing keywords (e.g., Bayesian filters) or building blacklists, SOAP exploits the social relationships among email correspondents and their (dis)interests to detect spam adaptively and automatically. In each node, SOAP integrates four components into the basic Bayesian filter: social closeness-based spam filtering, social interest-based spam filtering, adaptive trust management, and friend notification. We have evaluated the performance of SOAP using simulation based on trace data from Facebook. We also have implemented a SOAP prototype for real-world experiments. Experimental results show that SOAP can greatly improve the performance of Bayesian spam filters in terms of accuracy, attack-resilience, and efficiency of spam detection. The performance of the Bayesian spam filter is SOAP’s lower bound.
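The poisoning weakness described in the abstract can be illustrated with a minimal sketch. This is not SOAP or the authors' filter: it is a generic naive Bayes scorer with add-one smoothing, and the training messages and the function names `train` and `spam_log_odds` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def train(messages):
    """Count word occurrences per class from (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        docs[is_spam] += 1
    return counts, docs


def spam_log_odds(text, counts, docs):
    """Log-odds of spam vs. legitimate under naive Bayes with add-one smoothing."""
    vocab = set(counts[True]) | set(counts[False])
    total = {c: sum(counts[c].values()) for c in (True, False)}
    # Class prior (assumes both classes appear in the training data).
    score = math.log(docs[True] / docs[False])
    for word in text.lower().split():
        p_spam = (counts[True][word] + 1) / (total[True] + len(vocab))
        p_ham = (counts[False][word] + 1) / (total[False] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score


# Hypothetical toy training set: two spam and two legitimate messages.
TRAINING = [
    ("cheap pills buy now", True),
    ("win money now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday with the team", False),
]

counts, docs = train(TRAINING)
plain = spam_log_odds("buy cheap pills", counts, docs)
# The same spam message padded with innocuous words seen in legitimate mail.
padded = spam_log_odds("buy cheap pills meeting agenda monday team lunch", counts, docs)
```

With this toy data, `plain` is positive (the message scores as spam) while `padded` is negative, which is the effect a spammer exploits by padding a message with innocuous words.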
Citations
Journal Article
TL;DR: A novel spam filter integrating an N-gram tf.idf feature selection, modified distribution-based balancing algorithm and a regularized deep multi-layer perceptron NN model with rectified linear units is proposed, which outperforms state-of-the-art spam filters and several machine learning algorithms commonly used to classify text.
Abstract: Rapid growth in the volume of unsolicited and unwanted messages has inspired the development of many anti-spam methods. Supervised anti-spam filters using machine-learning methods have been particularly effective in categorizing spam and non-spam messages. These automatically integrate spam corpora pre-processing, appropriate word lists selection, and the calculation of word weights, usually in a bag-of-words fashion. To develop an accurate spam filter is challenging because spammers attempt to decrease the probability of spam detection by using legitimate words. Complex models are therefore needed to solve such a problem. However, existing spam filtering methods usually converge to a poor local minimum, cannot effectively handle high-dimensional data and suffer from overfitting issues. To overcome these problems, we propose a novel spam filter integrating an N-gram tf.idf feature selection, modified distribution-based balancing algorithm and a regularized deep multi-layer perceptron NN model with rectified linear units (DBB-RDNN-ReL). As demonstrated on four benchmark spam datasets (Enron, SpamAssassin, SMS spam collection and Social networking), the proposed approach enables capturing more complex features from high-dimensional data by additional layers of neurons. Another advantage of this approach is that no additional dimensionality reduction is necessary and spam dataset imbalance is addressed using a modified distribution-based algorithm. We compare the performance of the approach with that of state-of-the-art spam filters (Minimum Description Length, Factorial Design using SVM and NB, Incremental Learning C4.5, and Random Forest, Voting and Convolutional Neural Network) and several machine learning algorithms commonly used to classify text. We show that the proposed model outperforms these other methods in terms of classification accuracy, with fewer false negatives and false positives. 
Notably, the proposed spam filter classifies both major (legitimate) and minor (spam) classes well on personalized / non-personalized and balanced / imbalanced spam datasets. In addition, we show that the proposed model performs better than the results reported by previous studies in terms of accuracy. However, the high computational expenses related to additional hidden layers limit its application as an online spam filter and make it difficult to overcome the problem of concept drift.

70 citations


Cites background from "Leveraging Social Networks for Effe..."

  • ...Increasing the cost of sending spam and reducing the burden spam places on users require highly accurate spam filters [65]....

    [...]

  • ...This is a challenging task because spammers usually attempt to decrease the probability their messages are detected as spam by using legitimate words [65]....

    [...]

Journal Article
TL;DR: An extreme learning machine (ELM)-based supervised machine is proposed for effective spammer detection and could achieve better reliability and feasibility compared with existing SVM-based approaches.
Abstract: Online social networks such as Facebook, Twitter, and Weibo play an important role in people's daily lives. Most existing social network platforms, however, face the challenge of dealing with undesirable users and their malicious spam activities, which disseminate content, malware, viruses, etc. to the legitimate users of the service. The spread of spam degrades user experience and also negatively impacts server-side functions such as data mining, user behavior analysis, and resource recommendation. In this paper, an extreme learning machine (ELM)-based supervised machine is proposed for effective spammer detection. The work first constructs a labeled dataset by crawling Sina Weibo data and manually classifying the corresponding users into spammer and non-spammer categories. A set of features is then extracted from message content and user behavior and applied to the ELM-based spammer classification algorithm. The experiment and evaluation show that the proposed solution provides excellent performance, with true positive rates for spammers and non-spammers reaching 99% and 99.95%, respectively. As the results suggest, the proposed solution could achieve better reliability and feasibility than existing SVM-based approaches.

40 citations


Cites background from "Leveraging Social Networks for Effe..."

  • ...Currently, ELM has been an important research topic due to its high efficiency, easy-implementation, unification of classification and regression, and therefore might be capable to be implemented in social spammer detection field [10]....

    [...]

Journal Article
TL;DR: It is demonstrated that PIF can protect users' private keywords included in the filter from disclosure to others and detect forged filters; trace-driven simulations show that it filters spam efficiently while achieving a high delivery ratio and low latency with acceptable resource consumption.
Abstract: The mobile social network (MSN) is emerging as a promising social network paradigm that enables mobile users to share information in proximity and facilitates their cyber-physical-social interactions. As advertisements, rumors, and spam spread in MSNs, it is necessary to filter spam before it arrives at the recipients to make the MSN energy efficient. To this end, we propose a personalized fine-grained filtering scheme (PIF) with privacy preservation in MSNs. Specifically, we first develop a social-assisted filter distribution scheme, where the filter creators send filters to their social friends (i.e., filter holders). These filter holders store filters and decide to block spam or relay the desired packets through coarse-grained and fine-grained keyword filtering schemes. Meanwhile, the developed cryptographic filtering schemes protect the creator's private information (i.e., keywords) embedded in the filters from being directly disclosed to other users. In addition, we establish a Merkle hash tree that stores filters as leaf nodes, so that filter creators can check whether the distributed filters need to be updated by retrieving the value of the root node. It is demonstrated that PIF can protect users' private keywords included in the filter from disclosure to others and detect forged filters. We also conduct trace-driven simulations to show that PIF can not only filter spam efficiently but also achieve a high delivery ratio and low latency with acceptable resource consumption.
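The Merkle hash tree check mentioned in the abstract can be sketched as follows. This is a generic construction, not the PIF scheme itself: the choice of SHA-256, the rule of duplicating the last node on odd-sized levels, and the name `merkle_root` are assumptions made for illustration.

```python
import hashlib


def _h(data: bytes) -> bytes:
    """SHA-256 digest (hash function chosen here as an assumption)."""
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Return the root hash of a Merkle tree built over a list of byte strings."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            # Odd number of nodes: duplicate the last one (one common convention).
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Hypothetical filters stored as leaves.
filters = [b"filter-A", b"filter-B", b"filter-C"]
root_before = merkle_root(filters)

# Changing any single filter changes the root, so comparing roots is enough
# to tell whether the distributed set is out of date.
filters[1] = b"filter-B-v2"
root_after = merkle_root(filters)
```

Comparing `root_before` with `root_after` detects the update without comparing every filter individually.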

27 citations


Cites background from "Leveraging Social Networks for Effe..."

  • ...MSN users receive various types of information, such as newsletters, personal posts, rumors, and advertisements, most of which are of great value to users....

    [...]

Journal Article
TL;DR: This work validates the proposed approach on the Microsoft Learning to Rank dataset; in the experiments, 3403 of 12211 dangling pages were degraded using the proposed scheme.

27 citations

Book Chapter
29 Nov 2016
TL;DR: It is shown that the RANN-ReL outperforms other methods in terms of classification accuracy, false negative rate, and false positive rate. Notably, it classifies both the major (legitimate) and minor (spam) classes well.
Abstract: The rapid growth of unsolicited and unwanted messages has inspired the development of many anti-spam methods. Machine-learning methods such as Naive Bayes (NB), support vector machines (SVMs), or neural networks (NNs) have been particularly effective in categorizing spam/non-spam messages. They automatically construct word lists and their weights, usually in a bag-of-words fashion. However, traditional multilayer perceptron (MLP) NNs usually suffer from slow optimization convergence to a poor local minimum and from overfitting issues. To overcome this problem, we use a regularized NN with rectified linear units (RANN-ReL) for spam filtering. We compare its performance on three benchmark spam datasets (Enron, SpamAssassin, and SMS spam collection) with four machine learning algorithms commonly used in text classification, namely NB, SVM, MLP, and k-NN. We show that the RANN-ReL outperforms other methods in terms of classification accuracy, false negative rate, and false positive rate. Notably, it classifies both the major (legitimate) and minor (spam) classes well.

22 citations

References
Journal Article
TL;DR: A least squares version for support vector machine (SVM) classifiers that follows from solving a set of linear equations, instead of quadratic programming for classical SVMs.
Abstract: In this letter we discuss a least squares version for support vector machine (SVM) classifiers. Due to equality type constraints in the formulation, the solution follows from solving a set of linear equations, instead of quadratic programming for classical SVMs. The approach is illustrated on a two-spiral benchmark classification problem.
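For readers unfamiliar with the method, the "set of linear equations" can be written out as below. This is the least squares SVM formulation as it is commonly presented; the notation (regularization constant \gamma, feature map \varphi, kernel K, N training pairs (x_k, y_k) with y_k in {-1, +1}) is an assumption and is not copied from the letter.

```latex
% Primal problem: equality constraints replace the inequalities of the classical SVM.
\min_{w,b,e}\ \frac{1}{2} w^{\top} w + \frac{\gamma}{2} \sum_{k=1}^{N} e_k^{2}
\quad \text{s.t.} \quad y_k \left[ w^{\top} \varphi(x_k) + b \right] = 1 - e_k,
\quad k = 1, \dots, N

% Eliminating w and e from the optimality conditions leaves a linear system in (b, \alpha).
\begin{bmatrix} 0 & y^{\top} \\ y & \Omega + \gamma^{-1} I \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ \mathbf{1}_N \end{bmatrix},
\qquad \Omega_{kl} = y_k\, y_l\, K(x_k, x_l)
```

Solving this (N+1)-dimensional system takes the place of the quadratic program of the classical SVM.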

8,811 citations


Additional excerpts

  • ...The second category of content-based approaches includes machine learning-based approaches such as Bayesian filters [6], decision trees [8], [9], Support Vector Machines [10], [11], Bayes Classifiers [12], [13] and combinations of these techniques [14]....

    [...]

Proceedings Article
01 Aug 2000
TL;DR: This paper explores and evaluates the use of directed diffusion for a simple remote-surveillance sensor network and its implications for sensing, communication and computation.
Abstract: Advances in processor, memory and radio technology will enable small and cheap nodes capable of sensing, communication and computation. Networks of such nodes can coordinate to perform distributed sensing of environmental phenomena. In this paper, we explore the directed diffusion paradigm for such coordination. Directed diffusion is datacentric in that all communication is for named data. All nodes in a directed diffusion-based network are application-aware. This enables diffusion to achieve energy savings by selecting empirically good paths and by caching and processing data in-network. We explore and evaluate the use of directed diffusion for a simple remote-surveillance sensor network.

6,061 citations

Journal Article

6,034 citations


"Leveraging Social Networks for Effe..." refers background in this paper

  • ...two persons is less than or equal to 6 [43]....

    [...]

Journal Article
TL;DR: This work reviews localization techniques and evaluates the effectiveness of a very simple connectivity metric method for localization in outdoor environments that makes use of the inherent RF communications capabilities of these devices.
Abstract: Instrumenting the physical world through large networks of wireless sensor nodes, particularly for applications like environmental monitoring of water and soil, requires that these nodes be very small, lightweight, untethered, and unobtrusive. The problem of localization, that is, determining where a given node is physically located in a network, is a challenging one, and yet extremely crucial for many of these applications. Practical considerations such as the small size, form factor, cost and power constraints of nodes preclude the reliance on GPS of all nodes in these networks. We review localization techniques and evaluate the effectiveness of a very simple connectivity metric method for localization in outdoor environments that makes use of the inherent RF communications capabilities of these devices. A fixed number of reference points in the network with overlapping regions of coverage transmit periodic beacon signals. Nodes use a simple connectivity metric, which is more robust to environmental vagaries, to infer proximity to a given subset of these reference points. Nodes localize themselves to the centroid of their proximate reference points. The accuracy of localization is then dependent on the separation distance between two-adjacent reference points and the transmission range of these reference points. Initial experimental results show that the accuracy for 90 percent of our data points is within one-third of the separation distance. However, future work is needed to extend the technique to more cluttered environments.
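The centroid scheme described in the abstract can be sketched in a few lines. The reference-point coordinates, beacon counts, the 0.9 connectivity threshold, and the function names `connected_references` and `centroid` are all hypothetical values chosen for illustration, not figures taken from the paper.

```python
def connected_references(received, sent, threshold=0.9):
    """Return reference points whose received-beacon fraction meets the threshold."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return [ref for ref, n in received.items() if n / sent >= threshold]


def centroid(points):
    """Average the x and y coordinates of a non-empty list of 2-D points."""
    if not points:
        raise ValueError("no proximate reference points")
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


# Hypothetical reference points at the corners of a 10 m square.
REFERENCES = {"A": (0.0, 0.0), "B": (10.0, 0.0), "C": (0.0, 10.0), "D": (10.0, 10.0)}

# Beacons received from each reference point out of 20 sent in the sampling window.
received = {"A": 20, "B": 19, "C": 5, "D": 19}

heard = connected_references(received, sent=20)
estimate = centroid([REFERENCES[ref] for ref in heard])
```

Here reference point C falls below the threshold, so the node places itself at the centroid of A, B, and D.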

3,723 citations

Journal Article
Fritz Heider
TL;DR: Heider's 1946 article examines how a person's attitudes toward other people and toward impersonal entities are organized; it is widely cited as the origin of balance theory.
Abstract: (1946). Attitudes and Cognitive Organization. The Journal of Psychology: Vol. 21, No. 1, pp. 107-112.

3,204 citations