Journal ArticleDOI

CPSFS: A Credible Personalized Spam Filtering Scheme by Crowdsourcing

TL;DR: The experimental results show that the proposed CPSFS improves the accuracy of distinguishing spam from legitimate emails compared with a Bayesian filter alone.
Abstract: Email spam consumes substantial network resources and threatens many systems because of its unwanted or malicious content. Most existing spam filters target only complete-spam and ignore semispam. This paper proposes CPSFS, a novel and comprehensive Credible Personalized Spam Filtering Scheme, which classifies spam into two categories, complete-spam and semispam, and filters both. Complete-spam is spam for all users; semispam is email identified as spam by some users and as regular email by others. In CPSFS, Bayesian filtering is deployed at email servers to identify complete-spam, while semispam is identified at the client side by crowdsourcing. An email client can distinguish junk from legitimate emails according to spam reports from credible contacts with similar interests. Social trust and interest similarity between users and their contacts are calculated so that spam reports are more accurately targeted to similar users. The experimental results show that CPSFS improves the accuracy of distinguishing spam from legitimate emails compared with a Bayesian filter alone.
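The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the trust or similarity formulas, so the Jaccard similarity over interest sets, the `trust * similarity` weighting, both thresholds, and every name (`Contact`, `semispam_score`, `classify_email`) are hypothetical. The server-side Bayesian score is taken as an input rather than computed.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    trust: float            # assumed social-trust value in [0, 1]
    interests: frozenset    # assumed interest tags for this contact
    reported_spam: bool     # did this contact report the email as spam?


def interest_similarity(a, b):
    """Jaccard similarity of two interest sets (an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def semispam_score(user_interests, contacts):
    """Weighted share of credible, similar contacts who reported the email."""
    total = 0.0
    reported = 0.0
    for c in contacts:
        weight = c.trust * interest_similarity(user_interests, c.interests)
        total += weight
        if c.reported_spam:
            reported += weight
    return reported / total if total else 0.0


def classify_email(bayes_spam_prob, user_interests, contacts,
                   server_threshold=0.9, client_threshold=0.5):
    # Stage 1 (server): a high Bayesian score marks complete-spam.
    if bayes_spam_prob >= server_threshold:
        return "complete-spam"
    # Stage 2 (client): crowdsourced reports decide semispam per user.
    if semispam_score(user_interests, contacts) >= client_threshold:
        return "semispam"
    return "legitimate"


# Toy example: one trusted contact with identical interests reported the email.
me = frozenset({"security", "python"})
contacts = [
    Contact("alice", 0.9, frozenset({"security", "python"}), True),
    Contact("bob", 0.2, frozenset({"cooking"}), False),
]
result = classify_email(0.3, me, contacts)
```

Because the weight multiplies trust by similarity, a report from a contact with no shared interests contributes nothing, which is one simple way to keep reports "targeted to similar users".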


Citations
Journal ArticleDOI
TL;DR: An enhanced model is proposed for lifelong spam classification, and its overall performance is compared against various other stream mining classification techniques to demonstrate its success as a lifelong spam email classification method.

17 citations

Journal ArticleDOI
TL;DR: This paper reviews studies applying mobile crowdsourcing to training dataset collection and annotation and proposes a new possible combination of machine learning and crowdsourcing systems.
Abstract: Recently, machine learning has become popular in various fields such as healthcare, smart transportation, networking, and big data. However, the labelled training dataset, one of the core components of machine learning, often cannot meet the requirements of quantity, quality, and diversity because data sources are limited. Crowdsourcing systems based on mobile computing seem to address these bottlenecks thanks to their unique advantages: crowdsourcing lets both professionals and nonprofessionals participate in the collection and annotation process, which can greatly increase the size of the training dataset. Additionally, distributed blockchain technology can be embedded into crowdsourcing systems to make them transparent, secure, traceable, and decentralized. Moreover, truth discovery algorithms can improve the accuracy of annotation, and a reasonable incentive mechanism will attract many workers to provide plentiful data. In this paper, we review studies applying mobile crowdsourcing to training dataset collection and annotation. In addition, after reviewing research on blockchain or incentive mechanisms, we propose a new possible combination of machine learning and crowdsourcing systems.

3 citations

Journal ArticleDOI
TL;DR: The proposed SentiFilter model is a hybrid model that combines both sentimental and behavioral factors to detect unwanted content for each user towards pre-defined topics and is expected to provide an effective automated solution for filtering semi-spam content in favor of personalized preferences.
Abstract: Unwanted content in online social network services is a substantial issue that is continuously growing and negatively affecting the user-browsing experience. Current practices do not provide personalized solutions that meet each individual’s needs and preferences. Therefore, there is a potential demand to provide each user with a personalized level of protection against what he/she perceives as unwanted content. Thus, this paper proposes a personalized filtering model, which we named SentiFilter. It is a hybrid model that combines both sentimental and behavioral factors to detect unwanted content for each user towards pre-defined topics. An experiment involving 80,098 Twitter messages from 32 users was conducted to evaluate the effectiveness of the SentiFilter model. The effectiveness was measured in terms of the consistency between the implicit feedback derived from the SentiFilter model towards five selected topics and the explicit feedback collected explicitly from participants towards the same topics. Results reveal that commenting behavior is more effective than liking behavior to detect unwanted content because of its high consistency with users’ explicit feedback. Findings also indicate that sentiment of users’ comments does not reflect users’ perception of unwanted content. The results of implicit feedback derived from the SentiFilter model accurately agree with users’ explicit feedback by the indication of the low statistical significance difference between the two sets. The proposed model is expected to provide an effective automated solution for filtering semi-spam content in favor of personalized preferences.

3 citations


Cites background from "CPSFS: A Credible Personalized Spam..."

  • ...Therefore, a trust value needs to be assigned and computed for each contact [3]....


  • ...[3] classified spam emails into two categories: complete spam and semispam emails....


  • ...Studies that involved users’ perspectives in identifying spam content have used terms such as semi-spam [3] and grey spam [2]....


Journal ArticleDOI
TL;DR: In this paper, the authors developed baseline models of random forest and extreme gradient boost (XGBoost) ensemble algorithms for the detection and classification of spam emails using the Enron1 dataset.
Abstract: Unsolicited emails, popularly referred to as spam, have remained one of the biggest threats to cybersecurity globally. More than half of the emails sent in 2021 were spam, resulting in huge financial losses. The tenacity and perpetual presence of the adversary, the spammer, have necessitated improved efforts at filtering spam. This study, therefore, developed baseline models of random forest (RF) and extreme gradient boost (XGBoost) ensemble algorithms for the detection and classification of spam emails using the Enron1 dataset. The developed ensemble models were then optimized using the grid-search cross-validation technique to search the hyperparameter space for optimal hyperparameter values. The performance of the baseline (un-tuned) and the tuned models of both algorithms was evaluated and compared. The impact of hyperparameter tuning on both models was also examined. The findings of the experimental study revealed that hyperparameter tuning improved the performance of both models when compared with the baseline models. The tuned RF and XGBoost models achieved an accuracy of 97.78% and 98.09%, a sensitivity of 98.44% and 98.84%, and an F1 score of 97.85% and 98.16%, respectively. The XGBoost model outperformed the random forest model. The developed XGBoost model is effective and efficient for spam email detection.
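The grid-search cross-validation step mentioned in the abstract can be illustrated with a small standard-library sketch. It is only a stand-in: the study tunes random forest and XGBoost models on Enron1, whereas the one-feature centroid classifier, the `shift` hyperparameter, the helper names, and the toy feature values below are all invented for illustration and are not taken from the paper.

```python
from itertools import product


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search_cv(fit, predict, X, y, param_grid, k=3):
    """Return (best_params, best_mean_accuracy) over all grid combinations."""
    names = sorted(param_grid)
    folds = k_fold_indices(len(X), k)
    best_params, best_score = None, -1.0
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            hits = sum(predict(model, X[i]) == y[i] for i in held_out)
            scores.append(hits / len(held_out))
        mean = sum(scores) / len(scores)
        if mean > best_score:  # keep the first combination with the top score
            best_params, best_score = params, mean
    return best_params, best_score


def fit_centroid(X, y, shift):
    """Toy model: decision threshold = midpoint of class means, plus shift."""
    spam = [x for x, label in zip(X, y) if label == 1]
    ham = [x for x, label in zip(X, y) if label == 0]
    return (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2 + shift


def predict_centroid(threshold, x):
    return 1 if x > threshold else 0


# Made-up feature: count of suspicious tokens per email (1 = spam, 0 = ham).
X = [0, 5, 1, 6, 0, 7, 1, 5, 2, 6, 0, 8]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

best_params, best_score = grid_search_cv(
    fit_centroid, predict_centroid, X, y, {"shift": [-3.0, 0.0, 3.0]}
)
```

Each hyperparameter combination is scored only on folds the model was not fitted on, so the selected value reflects held-out accuracy rather than training fit.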

2 citations

References
Proceedings ArticleDOI
01 Apr 2009
TL;DR: This paper proposes the use of interaction graphs to impart meaning to online social links by quantifying user interactions, and uses both types of graphs to validate two well-known social-based applications (RE and SybilGuard).
Abstract: Social networks are popular platforms for interaction, communication and collaboration between friends. Researchers have recently proposed an emerging class of applications that leverage relationships from social networks to improve security and performance in applications such as email, web browsing and overlay routing. While these applications often cite social network connectivity statistics to support their designs, researchers in psychology and sociology have repeatedly cast doubt on the practice of inferring meaningful relationships from social network connections alone. This leads to the question: Are social links valid indicators of real user interaction? If not, then how can we quantify these factors to form a more accurate model for evaluating socially-enhanced applications? In this paper, we address this question through a detailed study of user interactions in the Facebook social network. We propose the use of interaction graphs to impart meaning to online social links by quantifying user interactions. We analyze interaction graphs derived from Facebook user traces and show that they exhibit significantly lower levels of the "small-world" properties shown in their social graph counterparts. This means that these graphs have fewer "supernodes" with extremely high degree, and overall network diameter increases significantly as a result. To quantify the impact of our observations, we use both types of graphs to validate two well-known social-based applications (RE and SybilGuard). The results reveal new insights into both systems, and confirm our hypothesis that studies of social applications should use real indicators of user interactions in lieu of social graphs.
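The idea of deriving an interaction graph from a social graph can be sketched as below. This is an assumed simplification: the abstract does not define how interactions are counted or thresholded, so the `min_interactions` cutoff, the function names, and the toy links and events are hypothetical.

```python
from collections import Counter


def interaction_graph(social_links, interactions, min_interactions=1):
    """Keep only the social links backed by enough observed interactions."""
    counts = Counter(frozenset(pair) for pair in interactions)
    return {
        frozenset(link)
        for link in social_links
        if counts[frozenset(link)] >= min_interactions
    }


def degrees(edges):
    """Node degree for an undirected edge set."""
    deg = Counter()
    for edge in edges:
        for node in edge:
            deg[node] += 1
    return deg


# Toy data: four declared friendships, three logged interactions.
social = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
events = [("a", "b"), ("b", "a"), ("a", "c")]

graph = interaction_graph(social, events, min_interactions=2)
social_degrees = degrees({frozenset(link) for link in social})
interaction_degrees = degrees(graph)
```

In this toy case node `a` has three social links but only one link that survives the interaction threshold, which mirrors the abstract's observation that interaction graphs contain fewer high-degree "supernodes".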

992 citations


"CPSFS: A Credible Personalized Spam..." refers background in this paper

  • ...The interests of a user in a social network represent the user’s personality [25]....


Journal ArticleDOI
TL;DR: This paper proposes an efficient mutual verifiable provable data possession scheme, which utilizes Diffie-Hellman shared key to construct the homomorphic authenticator and is very efficient compared with the previous PDP schemes, since the bilinear operation is not required.
Abstract: Cloud storage is now a hot research topic in information technology. In cloud storage, data security properties such as data confidentiality, integrity and availability are becoming increasingly important in many commercial applications. Recently, many provable data possession (PDP) schemes have been proposed to protect data integrity. In some cases, the remote data possession checking task has to be delegated to a proxy. However, these PDP schemes are not secure since the proxy stores some state information in cloud storage servers. Hence, in this paper, we propose an efficient mutual verifiable provable data possession scheme, which utilizes a Diffie-Hellman shared key to construct the homomorphic authenticator. In particular, the verifier in our scheme is stateless and independent of the cloud storage service. It is worth noting that the presented scheme is very efficient compared with previous PDP schemes, since the bilinear operation is not required.
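For readers unfamiliar with the Diffie-Hellman shared key the scheme builds on, the key-agreement step alone looks roughly like this. The sketch covers only textbook Diffie-Hellman with deliberately tiny, insecure parameters; how the paper turns the shared key into a homomorphic authenticator is not described in the abstract and is not shown here. The names `dh_public` and `dh_shared` and the private values are illustrative assumptions.

```python
# Textbook toy parameters: far too small for real use, chosen for readability.
P = 23  # public prime modulus
G = 5   # public base


def dh_public(private, g=G, p=P):
    """Public value g^private mod p, sent to the other party."""
    return pow(g, private, p)


def dh_shared(their_public, private, p=P):
    """Shared key (their_public)^private mod p, computed locally."""
    return pow(their_public, private, p)


# Hypothetical private values for the two parties (e.g. data owner, verifier).
owner_private, verifier_private = 4, 3

owner_public = dh_public(owner_private)        # 5^4 mod 23
verifier_public = dh_public(verifier_private)  # 5^3 mod 23

# Each side combines its own private value with the other's public value.
owner_key = dh_shared(verifier_public, owner_private)
verifier_key = dh_shared(owner_public, verifier_private)
```

Both parties arrive at the same key using only modular exponentiation, which is consistent with the abstract's point that no bilinear operation is needed.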

349 citations

Journal ArticleDOI
TL;DR: To effectively handle the streaming nature of tweets, two stream clustering algorithms, StreamKM++ and DenStream, were modified to facilitate spam identification, and the system was able to identify 100% of the spammers in the authors' test while incorrectly detecting only 2.2% of normal users as spammers.

293 citations


"CPSFS: A Credible Personalized Spam..." refers background in this paper

  • ...Spam consumes network bandwidth and brings also other threats to recipients: unwanted advertisements and pornographic content, as well as malicious viruses [1]....


Journal ArticleDOI
TL;DR: An application scenario on trajectory data-analysis-based traffic anomaly detection for VSNs and several research challenges and open issues are highlighted and discussed.
Abstract: Vehicular transportation is an essential part of modern cities. However, the ever increasing number of road accidents, traffic congestion, and other such issues become obstacles for the realization of smart cities. As the integration of the Internet of Vehicles and social networks, vehicular social networks (VSNs) are promising to solve the above-mentioned problems by enabling smart mobility in modern cities, which are likely to pave the way for sustainable development by promoting transportation efficiency. In this article, the definition of and a brief introduction to VSNs are presented first. Existing supporting communication technologies are then summarized. Furthermore, we introduce an application scenario on trajectory data-analysis-based traffic anomaly detection for VSNs. Finally, several research challenges and open issues are highlighted and discussed.

286 citations

Journal ArticleDOI
TL;DR: An automated antispam tool exploits the properties of social networks to distinguish between unsolicited commercial e-mail - spam - and messages associated with people the user knows.
Abstract: Social networks are useful for judging the trustworthiness of outsiders. An automated antispam tool exploits the properties of social networks to distinguish between unsolicited commercial e-mail - spam - and messages associated with people the user knows.

258 citations


"CPSFS: A Credible Personalized Spam..." refers background in this paper

  • ...Similarly, Boykin and Roychowdhury [22] proposed a spam filtering approach based on social networks, which allows users to share the spam information with their friends to identify spam....
