CPSFS: A Credible Personalized Spam Filtering Scheme by Crowdsourcing

doi:10.1155/2017/1457870

Journal Article

Xin Liu¹, Zou Pingjun¹, Weishan Zhang¹, Jiehan Zhou², Changying Dai¹, Wang Feng¹, Xiaomiao Zhang¹

China University of Petroleum¹, University of Oulu²

27 Dec 2017-Wireless Communications and Mobile Computing (Hindawi)-Vol. 2017, pp 1-9

TL;DR: The experimental results show that the proposed CPSFS distinguishes spam from legitimate emails more accurately than a Bayesian filter alone.

Abstract: Email spam consumes substantial network resources and threatens many systems because of its unwanted or malicious content. Most existing spam filters target only complete-spam and ignore semispam. This paper proposes a novel and comprehensive scheme, CPSFS (Credible Personalized Spam Filtering Scheme), which classifies spam into two categories, complete-spam and semispam, and filters both. Complete-spam is spam for all users; semispam is an email identified as spam by some users and as regular email by others. In CPSFS, Bayesian filtering is deployed at email servers to identify complete-spam, while semispam is identified at the client side by crowdsourcing. An email client can distinguish junk from legitimate emails according to spam reports from credible contacts with similar interests. Social trust and interest similarity between users and their contacts are calculated so that spam reports are more accurately targeted to similar users. The experimental results show that the proposed CPSFS can improve the accuracy rate of distinguishing spam from legitimate emails compared with that of a Bayesian filter alone.
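
The two-stage decision described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's trust or similarity formulas, so everything concrete here is an assumption made for illustration: cosine similarity over topic-weight dictionaries, a report weight of trust times similarity, the two threshold values, and all function names.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two sparse interest vectors (dict: topic -> weight)."""
    dot = sum(w * b.get(topic, 0.0) for topic, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semispam_score(user_interests, reports):
    """Weighted share of contacts who reported the email as spam.

    reports: list of (trust, contact_interests, reported_spam) tuples with
    trust in [0, 1]. The weight trust * similarity is an assumed choice,
    not the formula from the paper.
    """
    weighted_spam = 0.0
    total_weight = 0.0
    for trust, contact_interests, reported_spam in reports:
        weight = trust * cosine_similarity(user_interests, contact_interests)
        total_weight += weight
        if reported_spam:
            weighted_spam += weight
    return weighted_spam / total_weight if total_weight > 0 else 0.0


def classify(server_spam_probability, user_interests, reports,
             server_threshold=0.9, client_threshold=0.5):
    """Server-side check first (complete-spam), then client-side crowdsourcing."""
    if server_spam_probability >= server_threshold:
        return "complete-spam"
    if semispam_score(user_interests, reports) >= client_threshold:
        return "semispam"
    return "legitimate"
```

In this sketch a report from a trusted contact with unrelated interests carries little weight, which mirrors the stated goal of targeting reports to similar users.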

Citations

Journal Article

A lifelong spam emails classification model

Rami Mustafa A. Mohammad¹

University of Dammam¹

23 Jan 2020-Applied Computing and Informatics

TL;DR: An enhanced model is proposed for lifelong spam classification, and its overall performance is compared against various other stream mining classification techniques to demonstrate its effectiveness as a lifelong spam email classification method.

17 citations

Journal Article

Design and Simulation of MIMO Antennas for Mobile Communication

D. S. Bhargava, T. V. Padmavathy, Yadhamuri Vinitha Reddy, Neelapareddy Kavitha, Vuppalapati Hema

01 Dec 2020

7 citations

Journal Article

Blockchain-Based Crowdsourcing Makes Training Dataset of Machine Learning No Longer Be in Short Supply

Haitao Xu, Wei Wei, Yong Qi, Saiyu Qi

26 Jul 2022-Wireless Communications and Mobile Computing

TL;DR: This paper reviews studies applying mobile crowdsourcing to training dataset collection and annotation and proposes a new possible combination of machine learning and crowdsourcing systems.

Abstract: Recently, machine learning has become popular in various fields such as healthcare, smart transportation, networking, and big data. However, the labelled training dataset, one of the core components of machine learning, cannot meet the requirements of quantity, quality, and diversity because of limited data sources. Crowdsourcing systems based on mobile computing appear to address these bottlenecks thanks to their unique advantages: crowdsourcing lets professionals and nonprofessionals participate in the collection and annotation process, which can greatly increase the size of the training dataset. Additionally, distributed blockchain technology can be embedded into crowdsourcing systems to make them transparent, secure, traceable, and decentralized. Moreover, truth discovery algorithms can improve the accuracy of annotation, and a reasonable incentive mechanism will attract many workers to provide plentiful data. In this paper, we review studies applying mobile crowdsourcing to training dataset collection and annotation. In addition, after reviewing research on blockchain or incentive mechanisms, we propose a new possible combination of machine learning and crowdsourcing systems.

3 citations

Journal Article

SentiFilter: A Personalized Filtering Model for Arabic Semi-Spam Content based on Sentimental and Behavioral Analysis

Mashael M. Alsulami, Arwa Yousef Al-Aama

01 Jan 2020-International Journal of Advanced Computer Science and Applications

TL;DR: The proposed SentiFilter model is a hybrid model that combines both sentimental and behavioral factors to detect unwanted content for each user towards pre-defined topics and is expected to provide an effective automated solution for filtering semi-spam content in favor of personalized preferences.

Abstract: Unwanted content in online social network services is a substantial issue that is continuously growing and negatively affecting the user-browsing experience. Current practices do not provide personalized solutions that meet each individual’s needs and preferences. Therefore, there is a potential demand to provide each user with a personalized level of protection against what he/she perceives as unwanted content. Thus, this paper proposes a personalized filtering model, which we named SentiFilter. It is a hybrid model that combines both sentimental and behavioral factors to detect unwanted content for each user towards pre-defined topics. An experiment involving 80,098 Twitter messages from 32 users was conducted to evaluate the effectiveness of the SentiFilter model. The effectiveness was measured in terms of the consistency between the implicit feedback derived from the SentiFilter model towards five selected topics and the explicit feedback collected explicitly from participants towards the same topics. Results reveal that commenting behavior is more effective than liking behavior to detect unwanted content because of its high consistency with users’ explicit feedback. Findings also indicate that sentiment of users’ comments does not reflect users’ perception of unwanted content. The results of implicit feedback derived from the SentiFilter model accurately agree with users’ explicit feedback by the indication of the low statistical significance difference between the two sets. The proposed model is expected to provide an effective automated solution for filtering semi-spam content in favor of personalized preferences.

3 citations

Cites background from "CPSFS: A Credible Personalized Spam..."

...Therefore, a trust value needs to be assigned and computed for each contact [3]....
[...]
...[3] classified spam emails into two categories: complete spam and semispam emails....
[...]
...Studies that involved users’ perspectives in identifying spam content have used terms such as semi-spam [3] and grey spam [2]....
[...]

Journal Article

Hyperparameter Optimization of Ensemble Models for Spam Email Detection

Temidayo Oluwatosin Omotehinwa, David Opeoluwa Oyewola

03 Feb 2023-Applied Sciences

TL;DR: In this paper, the authors developed baseline models of random forest and extreme gradient boost (XGBoost) ensemble algorithms for the detection and classification of spam emails using the Enron1 dataset.

Abstract: Unsolicited emails, popularly referred to as spam, have remained one of the biggest threats to cybersecurity globally. More than half of the emails sent in 2021 were spam, resulting in huge financial losses. The tenacity and perpetual presence of the adversary, the spammer, has necessitated the need for improved efforts at filtering spam. This study, therefore, developed baseline models of random forest and extreme gradient boost (XGBoost) ensemble algorithms for the detection and classification of spam emails using the Enron1 dataset. The developed ensemble models were then optimized using the grid-search cross-validation technique to search the hyperparameter space for optimal hyperparameter values. The performance of the baseline (un-tuned) and the tuned models of both algorithms were evaluated and compared. The impact of hyperparameter tuning on both models was also examined. The findings of the experimental study revealed that the hyperparameter tuning improved the performance of both models when compared with the baseline models. The tuned RF and XGBoost models achieved an accuracy of 97.78% and 98.09%, a sensitivity of 98.44% and 98.84%, and an F1 score of 97.85% and 98.16%, respectively. The XGBoost model outperformed the random forest model. The developed XGBoost model is effective and efficient for spam email detection.
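
For reference, the three reported metrics are usually defined from confusion-matrix counts as below. This is a sketch of the standard definitions, assuming spam is the positive class; the abstract does not state the exact formulas or any averaging the authors used. Here TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives.

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Sensitivity} = \frac{TP}{TP + FN},
\]
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}
\]
```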

2 citations

References

Journal Article

New order preserving encryption model for outsourced databases in cloud environments

Zheli Liu¹, Xiaofeng Chen², Jun Yang¹, Chunfu Jia¹, Ilsun You

Nankai University¹, Xidian University²

01 Jan 2016-Journal of Network and Computer Applications

TL;DR: A new simple OPE model is proposed, which uses message space expansion and nonlinear space split to hide data distribution and frequency; its security against two kinds of attack is analyzed in detail.

92 citations

"CPSFS: A Credible Personalized Spam..." refers background in this paper

...If a recipient clicks a malicious link in the spam message, their personal information may be automatically sent to the spammer via a malicious program, which is an obvious challenge for privacy protection [3, 4]....
[...]

Proceedings Article

Dirichlet-Enhanced Spam Filtering based on Biased Samples

Steffen Bickel¹, Tobias Scheffer¹

Max Planck Society¹

04 Dec 2006

TL;DR: It is observed that bias-corrected learning outperforms naive reliance on the assumption of independent and identically distributed data; Dirichlet-enhanced generalization across users outperforms a single ("one size fits all") filter as well as independent filters for all users.

Abstract: We study a setting that is motivated by the problem of filtering spam messages for many users. Each user receives messages according to an individual, unknown distribution, reflected only in the unlabeled inbox. The spam filter for a user is required to perform well with respect to this distribution. Labeled messages from publicly available sources can be utilized, but they are governed by a distinct distribution, not adequately representing most inboxes. We devise a method that minimizes a loss function with respect to a user's personal distribution based on the available biased sample. A nonparametric hierarchical Bayesian model furthermore generalizes across users by learning a common prior which is imposed on new email accounts. Empirically, we observe that bias-corrected learning outperforms naive reliance on the assumption of independent and identically distributed data; Dirichlet-enhanced generalization across users outperforms a single ("one size fits all") filter as well as independent filters for all users.
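
The idea of minimizing a loss under a user's distribution using only a biased sample rests on a generic reweighting identity, sketched below. The symbols are assumptions introduced here: \(p_{\mathrm{user}}\) is the user's inbox distribution, \(p_{\mathrm{pub}}\) is the distribution of the public labeled data, \(\ell\) is a loss, and \(f\) is the filter. The identity holds whenever \(p_{\mathrm{pub}} > 0\) wherever \(p_{\mathrm{user}} > 0\). The abstract does not specify the estimator the authors actually use, so this should not be read as their method.

```latex
\[
\mathbb{E}_{(x,y)\sim p_{\mathrm{user}}}\bigl[\ell(f(x),y)\bigr]
= \mathbb{E}_{(x,y)\sim p_{\mathrm{pub}}}\!\left[
\frac{p_{\mathrm{user}}(x,y)}{p_{\mathrm{pub}}(x,y)}\,\ell(f(x),y)
\right]
\]
```

In practice the density ratio is unknown and has to be estimated, for example from the unlabeled inbox and the public sample.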

90 citations

"CPSFS: A Credible Personalized Spam..." refers methods in this paper

...Scholkopf and Platt [20] presented a method that minimizes a loss function with respect to user’s personal distribution based on the available biased samples....
[...]

Spam filters: Bayes vs. chi-squared; letters vs. words

Cormac O'Brien¹, Carl Vogel¹

University College Dublin¹

24 Sep 2003

TL;DR: The Naive Bayesian method is examined in relation to the 'Chi by degrees of Freedom' approach, the latter used in the field of authorship identification; statistics based on character-level tokenization prove more effective than those based on word-level tokenization.

Abstract: We compare two statistical methods for identifying spam or junk electronic mail. Spam filters are classifiers which determine whether an email is junk or not. The proliferation of spam email has made electronic filtering vitally important. The magnitude of the problem is discussed. We examine the Naive Bayesian method in relation to the 'Chi by degrees of Freedom' approach, the latter used in the field of authorship identification. Both methods produce very promising results. However, the 'Chi by degrees of Freedom' approach has the advantage of providing significance measures, which will help to reduce false positives. Statistics based on character-level tokenization prove more effective than those based on word-level tokenization.
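
The contrast between word-level and character-level tokenization can be made concrete with a minimal Naive Bayes sketch. It is an illustration only, not the authors' implementation: the class name, the trigram length, the add-one smoothing, and the tiny training set in the usage note are all assumptions, and the 'Chi by degrees of Freedom' method is not covered.

```python
from collections import Counter
from math import log


def word_tokens(text):
    """Word-level tokens: lowercase, split on whitespace."""
    return text.lower().split()


def char_tokens(text, n=3):
    """Character-level tokens: overlapping n-grams of the lowercased text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


class NaiveBayesFilter:
    """Multinomial Naive Bayes with add-one smoothing (illustrative sketch)."""

    def __init__(self, tokenize):
        self.tokenize = tokenize
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(self.tokenize(text))
        self.docs[label] += 1

    def log_score(self, text, label):
        vocab_size = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total = sum(self.counts[label].values())
        # Log prior plus smoothed log likelihood of each token.
        score = log(self.docs[label] / sum(self.docs.values()))
        for tok in self.tokenize(text):
            score += log((self.counts[label][tok] + 1) / (total + vocab_size))
        return score

    def is_spam(self, text):
        if self.docs["spam"] == 0 or self.docs["ham"] == 0:
            raise ValueError("train at least one spam and one ham message first")
        return self.log_score(text, "spam") > self.log_score(text, "ham")
```

Passing `word_tokens` or `char_tokens` to the constructor switches the tokenization while leaving the classifier unchanged, so the two schemes can be compared on the same training data.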

78 citations

"CPSFS: A Credible Personalized Spam..." refers methods in this paper

...O’Brien and Vogel [18] applied the Bayesian algorithm for spam filtering....
[...]

Proceedings Article

SocialFilter: Introducing social trust to collaborative spam mitigation

Michael Sirivianos¹, Kyungbaek Kim², Xiaowei Yang³

Telefónica¹, University of California, Irvine², Duke University³

10 Apr 2011

TL;DR: SocialFilter is the first collaborative unwanted traffic mitigation system that assesses the trustworthiness of spam reporters by both auditing their reports and by leveraging the social network of the reporters' administrators.

Abstract: We propose SocialFilter, a trust-aware collaborative spam mitigation system. Our proposal enables nodes with no email classification functionality to query the network on whether a host is a spammer. It employs Sybil-resilient trust inference to weigh the reports concerning spamming hosts that collaborating spam-detecting nodes (reporters) submit to the system. It weighs the spam reports according to the trustworthiness of their reporters to derive a measure of the system's belief that a host is a spammer. SocialFilter is the first collaborative unwanted traffic mitigation system that assesses the trustworthiness of spam reporters by both auditing their reports and by leveraging the social network of the reporters' administrators. The design and evaluation of our proposal offers us the following lessons: a) it is plausible to introduce Sybil-resilient Online-Social-Network-based trust inference mechanisms to improve the reliability and the attack-resistance of collaborative spam mitigation; b) using social links to obtain the trustworthiness of reports concerning spammers can result in comparable spam-blocking effectiveness with approaches that use social links to rate-limit spam (e.g., Ostra [27]); c) unlike Ostra, in the absence of reports that incriminate benign email senders, SocialFilter yields no false positives.

71 citations

"CPSFS: A Credible Personalized Spam..." refers methods in this paper

...[23] applied social network and trust mechanism for spam filtering....
[...]

Journal Article

The dynamic competitive recommendation algorithm in social network services

Seok Jong Yu¹

Sookmyung Women's University¹

01 Mar 2012-Information Sciences

TL;DR: This study proposes the dynamic competitive recommendation algorithm, based on competition among multiple component algorithms, and shows through performance evaluation on an actual Twitter dataset that it outperforms previous approaches.

57 citations

"CPSFS: A Credible Personalized Spam..." refers background in this paper

...Social trust is a key factor that affects the sharing of knowledge and the development of social relationships [12, 13]: users are more likely to accept suggestions from others with high trust value and interests similarity [14]....
[...]