Journal ArticleDOI

CPSFS: A Credible Personalized Spam Filtering Scheme by Crowdsourcing

TL;DR: The experimental results show that the proposed CPSFS improves the accuracy of distinguishing spam from legitimate emails compared with a Bayesian filter alone.
Abstract: Email spam consumes substantial network resources and threatens many systems because of its unwanted or malicious content. This paper proposes the Credible Personalized Spam Filtering Scheme (CPSFS), which classifies spam into two categories, complete-spam and semispam, and filters both. Complete-spam is spam for all users; semispam is an email identified as spam by some users and as regular email by others. Most existing spam filters target complete-spam but ignore semispam. In CPSFS, Bayesian filtering is deployed at email servers to identify complete-spam, while semispam is identified at the client side by crowdsourcing. An email client can distinguish junk from legitimate emails according to spam reports from credible contacts with similar interests. Social trust and interest similarity between users and their contacts are calculated so that spam reports are more accurately targeted to similar users. The experimental results show that the proposed CPSFS improves the accuracy of distinguishing spam from legitimate emails compared with a Bayesian filter alone.
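
The client-side step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so the weighting used here (trust multiplied by interest similarity, compared against a threshold) is an assumption made for illustration, and the names `Report`, `semispam_score`, `is_semispam`, and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """One contact's verdict on an email (hypothetical structure)."""
    trust: float       # social trust in the contact, assumed in [0, 1]
    similarity: float  # interest similarity with the contact, assumed in [0, 1]
    is_spam: bool      # True if the contact reported the email as spam


def semispam_score(reports):
    """Weighted share of contacts who reported the email as spam.

    Assumption: each report is weighted by trust * similarity; the
    abstract does not state the scheme's actual formula.
    """
    total = sum(r.trust * r.similarity for r in reports)
    if total == 0:
        return 0.0  # no credible evidence either way
    spam = sum(r.trust * r.similarity for r in reports if r.is_spam)
    return spam / total


def is_semispam(reports, threshold=0.5):
    """Flag the email when the weighted spam share exceeds the threshold."""
    return semispam_score(reports) > threshold
```

Under this weighting, a report from a trusted contact with similar interests outweighs a contradicting report from a weakly trusted, dissimilar one, which is the behavior the abstract describes qualitatively.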


Citations
Journal ArticleDOI
TL;DR: An enhanced model is proposed for lifelong spam classification, and its overall performance is compared against various other stream mining classification techniques to demonstrate its effectiveness as a lifelong spam email classification method.

17 citations

Journal ArticleDOI
TL;DR: This paper reviews studies applying mobile crowdsourcing to training dataset collection and annotation and proposes a new possible combination of machine learning and crowdsourcing systems.
Abstract: Recently, machine learning has become popular in various fields such as healthcare, smart transportation, networking, and big data. However, the labelled training dataset, which is one of the core components of machine learning, often cannot meet the requirements of quantity, quality, and diversity because of limited data sources. Crowdsourcing systems based on mobile computing seem to address these bottlenecks thanks to their unique advantages; i.e., crowdsourcing lets both professionals and nonprofessionals participate in the collection and annotation process, which can greatly increase the size of the training dataset. Additionally, distributed blockchain technology can be embedded into crowdsourcing systems to make them transparent, secure, traceable, and decentralized. Moreover, truth discovery algorithms can improve the accuracy of annotation, and a reasonable incentive mechanism will attract many workers to provide plenty of data. In this paper, we review studies applying mobile crowdsourcing to training dataset collection and annotation. In addition, after reviewing research on blockchain or incentive mechanisms, we propose a new possible combination of machine learning and crowdsourcing systems.

3 citations

Journal ArticleDOI
TL;DR: The proposed SentiFilter model is a hybrid model that combines both sentimental and behavioral factors to detect unwanted content for each user towards pre-defined topics and is expected to provide an effective automated solution for filtering semi-spam content in favor of personalized preferences.
Abstract: Unwanted content in online social network services is a substantial issue that is continuously growing and negatively affecting the user-browsing experience. Current practices do not provide personalized solutions that meet each individual’s needs and preferences. Therefore, there is a potential demand to provide each user with a personalized level of protection against what he/she perceives as unwanted content. Thus, this paper proposes a personalized filtering model, which we named SentiFilter. It is a hybrid model that combines both sentimental and behavioral factors to detect unwanted content for each user towards pre-defined topics. An experiment involving 80,098 Twitter messages from 32 users was conducted to evaluate the effectiveness of the SentiFilter model. The effectiveness was measured in terms of the consistency between the implicit feedback derived from the SentiFilter model towards five selected topics and the explicit feedback collected explicitly from participants towards the same topics. Results reveal that commenting behavior is more effective than liking behavior to detect unwanted content because of its high consistency with users’ explicit feedback. Findings also indicate that sentiment of users’ comments does not reflect users’ perception of unwanted content. The results of implicit feedback derived from the SentiFilter model accurately agree with users’ explicit feedback by the indication of the low statistical significance difference between the two sets. The proposed model is expected to provide an effective automated solution for filtering semi-spam content in favor of personalized preferences.

3 citations


Cites background from "CPSFS: A Credible Personalized Spam..."

  • ...Therefore, a trust value needs to be assigned and computed for each contact [3]....

    [...]

  • ...[3] classified spam emails into two categories: complete spam and semispam emails....

    [...]

  • ...Studies that involved users' perspectives in identifying spam content have used terms such as semi-spam [3] and grey spam [2]....

    [...]

Journal ArticleDOI
TL;DR: In this paper, the authors developed baseline models of random forest and extreme gradient boost (XGBoost) ensemble algorithms for the detection and classification of spam emails using the Enron1 dataset.
Abstract: Unsolicited emails, popularly referred to as spam, have remained one of the biggest threats to cybersecurity globally. More than half of the emails sent in 2021 were spam, resulting in huge financial losses. The tenacity and perpetual presence of the adversary, the spammer, has necessitated the need for improved efforts at filtering spam. This study, therefore, developed baseline models of random forest and extreme gradient boost (XGBoost) ensemble algorithms for the detection and classification of spam emails using the Enron1 dataset. The developed ensemble models were then optimized using the grid-search cross-validation technique to search the hyperparameter space for optimal hyperparameter values. The performance of the baseline (un-tuned) and the tuned models of both algorithms were evaluated and compared. The impact of hyperparameter tuning on both models was also examined. The findings of the experimental study revealed that the hyperparameter tuning improved the performance of both models when compared with the baseline models. The tuned RF and XGBoost models achieved an accuracy of 97.78% and 98.09%, a sensitivity of 98.44% and 98.84%, and an F1 score of 97.85% and 98.16%, respectively. The XGBoost model outperformed the random forest model. The developed XGBoost model is effective and efficient for spam email detection.

2 citations

References
Journal ArticleDOI
TL;DR: This study proposes a series of acceleration techniques that speed up Bayesian filters based on approximate classifications and demonstrates a 6× speedup over two well-known spam filters while achieving an identical false positive rate and similar false negative rate to the original filters.
Abstract: Statistical-based Bayesian filters have become a popular and important defense against spam. However, despite their effectiveness, their greater processing overhead can prevent them from scaling well for enterprise level mail servers. For example, the dictionary lookups that are characteristic of this approach are limited by the memory access rate, therefore relatively insensitive to increases in CPU speed. We conduct a comprehensive study to address this scaling issue by proposing a series of acceleration techniques that speed up Bayesian filters based on approximate classifications. The core approximation technique uses hash-based lookup and lossy encoding. Lookup approximation is based on the popular Bloom filter data structure with an extension to support value retrieval. Lossy encoding is used to further compress the data structure. While these approximation methods introduce additional errors to a strict Bayesian approach, we show how the errors can be both minimized and biased toward a false negative classification. We demonstrate a 6× speedup over two well-known spam filters (bogofilter and qsf) while achieving an identical false positive rate and similar false negative rate to the original filters.

17 citations


"CPSFS: A Credible Personalized Spam..." refers methods in this paper

  • ...All emails of a user are examined by a Bayesian filter at an email server before they reach clients [28]....

    [...]

Book ChapterDOI
15 Dec 2009
TL;DR: An algorithm and data structure for fast computation of similarity based on the Jaccard coefficient, used to retrieve images with regions similar to those of a query image; the key idea is to use the run-length description of an image to compute the number of overlapped pixels between regions.
Abstract: This paper proposes an algorithm and data structure for fast computation of similarity based on Jaccard coefficient to retrieve images with regions similar to those of a query image. The similarity measures the degree of overlap between the regions of an image and those of another image. The key idea for fast computation of the similarity is to use the runlength description of an image for computing the number of overlapped pixels between the regions. We present an algorithm and data structure, and do experiments on 30,000 images to evaluate the performance of our algorithm. Experiments showed that the proposed algorithm is 5.49 (2.36) times faster than a naive algorithm on the average (the worst). And we theoretically gave fairly good estimates of the computation time.

17 citations


"CPSFS: A Credible Personalized Spam..." refers background in this paper

  • ...The more the mutual interests and disinterests between a user and his or her contacts are, the more similar they are [26]....

    [...]
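
The idea in the excerpt above, combined with the Jaccard coefficient used in the cited chapter, can be sketched as follows. How CPSFS actually combines interests and disinterests is not stated in this excerpt, so pooling the two overlaps into a single ratio is an assumption, and the function names and sample topics are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets: treat as no measurable similarity
    return len(a & b) / len(union)


def interest_similarity(user_likes, user_dislikes, contact_likes, contact_dislikes):
    """Similarity that grows with mutual interests AND mutual disinterests.

    Assumption: shared likes and shared dislikes are pooled into one
    Jaccard-style ratio; the paper's exact formula is not shown here.
    """
    ul, ud = set(user_likes), set(user_dislikes)
    cl, cd = set(contact_likes), set(contact_dislikes)
    shared = len(ul & cl) + len(ud & cd)
    total = len(ul | cl) + len(ud | cd)
    return shared / total if total else 0.0


# Example: one shared interest out of two, one shared disinterest out of two.
score = interest_similarity({"sports", "travel"}, {"ads"}, {"sports"}, {"ads", "loans"})
```

In this example `score` evaluates to 0.5, since 2 of the 4 pooled items are shared.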

Proceedings ArticleDOI
Liu Xin, Shi Leyi, Wang Yao, Xin Zhaojun, Fu Wenjing
28 Oct 2013
TL;DR: By statistical analysis of trust value in social network, this algorithm improved the accuracy of trust transitivity and trust value computing compared with a classical trust algorithm.
Abstract: More and more users are joining social networks. A precise social trust value is critical for application systems such as recommendation systems. For a user, the egocentric network is formed by the user, his friends, and the social relationships between him and other users. We propose an algorithm for inferring dynamic trust based on trust chains and interactions. Indirect trust values are calculated from direct trust values and trust chains in the egocentric network. As the social network evolves, the dynamic trust values result from the interactions between a user and his friends and from trust references. By statistical analysis of trust values in a social network, this algorithm improved the accuracy of trust transitivity and trust value computation compared with a classical trust algorithm.
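
To make the notion of deriving indirect trust from direct trust and trust chains concrete, here is a minimal sketch. The abstract does not specify the propagation rule, so multiplying direct trust values along a chain and taking the strongest chain is a common simplification assumed here, not the paper's algorithm; `chain_trust`, `indirect_trust`, and the sample values are hypothetical.

```python
from math import prod


def chain_trust(direct, chain):
    """Trust along one chain: product of direct trust over consecutive hops.

    `direct` maps (truster, trustee) pairs to values assumed in [0, 1].
    """
    return prod(direct[(a, b)] for a, b in zip(chain, chain[1:]))


def indirect_trust(direct, chains):
    """Assumed aggregation: keep the strongest of the available chains."""
    return max((chain_trust(direct, c) for c in chains), default=0.0)


# Hypothetical egocentric network: user u reaches x via friend f or friend g.
direct = {("u", "f"): 0.9, ("f", "x"): 0.5, ("u", "g"): 0.4, ("g", "x"): 1.0}
trust_in_x = indirect_trust(direct, [["u", "f", "x"], ["u", "g", "x"]])
```

With multiplication, trust can only decay as a chain gets longer, which matches the intuition that a friend of a friend is trusted no more than the weakest link allows.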

14 citations


"CPSFS: A Credible Personalized Spam..." refers methods in this paper

  • ...Social Computing Approach....

    [...]

  • ...We divided the existing work into four types based on the used techniques: the Black/White List, Bayesian, Machine Learning, and Social Computing....

    [...]

  • ...Social trust can be calculated by analyzing Social Computing [15, 16]....

    [...]

Journal ArticleDOI
Xin Liu, Yao Wang, Dehai Zhao, Weishan Zhang, Leyi Shi
TL;DR: Experiments show the distributed patching mechanism proposed in which the patch can tend to hub nodes automatically based on social computing in social networks is more efficient than other patching mechanisms.

7 citations


"CPSFS: A Credible Personalized Spam..." refers methods in this paper

  • ...Social Computing Approach....

    [...]

  • ...We divided the existing work into four types based on the used techniques: the Black/White List, Bayesian, Machine Learning, and Social Computing....

    [...]

  • ...Social trust can be calculated by analyzing Social Computing [15, 16]....

    [...]

Proceedings Article
01 Dec 2011
TL;DR: Mailbook, a user-based collaborative approach to the spam problem, is proposed, in which users exchange vote databases containing the hash values of the emails they perceive as spam.
Abstract: Spam is the main problem of email systems nowadays. The total amount of spam emails account for more than 75% of the total emails exchanged worldwide; recent reports raise this number up to more than 90%. Novel anti-spam solutions are proposed constantly, to be followed by announcements of sophisticated methods to overcome them through the use of advanced software to reach the spammers' goal. In this paper we propose a collaborative spam filter over a social network, exchanging vote databases containing the hash values of the emails perceived as spam by its users, the mailbook. Social networks are blooming nowadays and users are accustomed to their use more and more every day. Our proposal builds upon that strong attachment between friends and people with the same interests and habits. We propose a user based collaborative approach to address the spam problem. Users characterize spam mail and exchange their votes among their friends through mailbook. User profiles are created in mailbook to express user interests, which in turn are used for evaluating mail as spam according to the user's characteristics. Users also form groups of interests which are also used by our method as another mean to evaluate spam for the specific group in a more effective way.
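
The exchange of vote databases keyed by email hashes can be sketched as below. The abstract does not say which hash function or normalization Mailbook uses, so SHA-256 over a lowercased, whitespace-collapsed body is an assumption, and `email_hash`, `merge_votes`, and `spam_votes` are hypothetical names.

```python
import hashlib
from collections import Counter


def email_hash(body: str) -> str:
    """Hash a normalized email body (normalization and SHA-256 are assumptions)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def merge_votes(friend_databases):
    """Combine friends' vote databases, counting one vote per friend per hash."""
    votes = Counter()
    for db in friend_databases:
        votes.update(set(db))  # set() prevents one friend voting twice
    return votes


def spam_votes(body: str, votes: Counter) -> int:
    """Number of friends who marked an email with this hash as spam."""
    return votes[email_hash(body)]


# Two of three friends reported the same message (differing only in case/spacing).
h = email_hash("Buy  NOW")
votes = merge_votes([{h}, {h, "other-hash"}, set()])
```

Because only hashes are exchanged, friends can share verdicts without forwarding message contents; an exact hash, however, will not match messages that spammers vary slightly between recipients.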

4 citations


"CPSFS: A Credible Personalized Spam..." refers methods in this paper

  • ...[21] presented a collaborative method for email filtering called Mailbook which was based on a social network....

    [...]