Journal Article

Interaction between feature subset selection techniques and machine learning classifiers for detecting unsolicited emails

TLDR
Results show that Greedy Stepwise Search is a good feature subset selection method for spam email detection and that, among the machine learning classifiers, Support Vector Machine is the best classifier in terms of both accuracy and false positive rate.
Abstract
Detecting spam emails within a set of email files has become a challenging task for researchers. Identifying an effective classifier depends not only on high detection accuracy but also on low false alarm rates and on using as few features as possible. In view of these challenges, this research examines the effects of using features selected by four feature subset selection methods (Genetic, Greedy Stepwise, Best First, and Rank Search) on popular machine learning classifiers: Bayesian, Naive Bayes, Support Vector Machine, Genetic Algorithm, J48, and Random Forest. Tests were performed on three publicly available spam email datasets: "Enron", "SpamAssassin", and "LingSpam". Results show that Greedy Stepwise Search is a good method for feature subset selection for spam email detection. Among the machine learning classifiers, Support Vector Machine was found to be the best in terms of both accuracy and false positive rate. However, the results of Random Forest were very close to those of Support Vector Machine. The Genetic classifier was identified as a weak classifier.
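
To make the two evaluation criteria and the idea of a greedy stepwise search concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say which software was used. The function names, the forward-only search direction, the stopping rule (stop when no single added feature improves the score), and the toy scoring function are all assumptions made for illustration. Spam is treated as the positive class, so a false positive is a legitimate email flagged as spam.

```python
from typing import Callable, List, Sequence, Tuple


def accuracy_and_fpr(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Tuple[float, float]:
    """Return (accuracy, false positive rate); `positive` marks the spam label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1          # legitimate email flagged as spam
        elif truth == positive:
            fn += 1          # spam that slipped through
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


def greedy_stepwise(n_features: int,
                    score: Callable[[List[int]], float]) -> Tuple[List[int], float]:
    """Forward greedy search: repeatedly add the single feature that most
    improves `score`, and stop when no addition gives an improvement."""
    selected: List[int] = []
    best = score(selected)
    remaining = set(range(n_features))
    while remaining:
        candidate, candidate_score = None, best
        for feature in sorted(remaining):
            s = score(selected + [feature])
            if s > candidate_score:
                candidate, candidate_score = feature, s
        if candidate is None:
            break            # no single feature improves the subset
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best


def toy_score(subset: List[int]) -> float:
    """Hypothetical merit function: features 0 and 2 are informative,
    and every selected feature carries a small cost."""
    informative = {0, 2}
    return sum(1.0 for f in subset if f in informative) - 0.1 * len(subset)


if __name__ == "__main__":
    print(greedy_stepwise(4, toy_score)[0])
    print(accuracy_and_fpr([1, 1, 0, 0], [1, 0, 1, 0]))
```

In a real experiment the `score` callable would wrap a subset evaluator, for example the cross-validated accuracy of a classifier trained on only the selected features; the toy function stands in for that here.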

