Interaction between feature subset selection techniques and machine learning classifiers for detecting unsolicited emails

doi:10.1145/2600617.2600622

Journal Article

Shrawan Kumar Trivedi, +1 more

- 01 Mar 2014 -

ACM Sigapp Applied Computing Review

- Vol. 14, Iss: 1, pp 53-61

TLDR

Results show that the Greedy Stepwise Search method is a good feature subset selection method for spam email detection and that, among the machine learning classifiers, Support Vector Machine is the best classifier in terms of both accuracy and false positive rate.

Abstract:

Detecting spam emails within a set of email files has become a challenging task for researchers. An effective classifier is identified not only by high detection accuracy but also by a low false alarm rate and by the need to use as few features as possible. In view of these challenges, this research examines the effects of using features selected by four feature subset selection methods (i.e., Genetic, Greedy Stepwise, Best First, and Rank Search) on popular machine learning classifiers such as Bayesian, Naive Bayes, Support Vector Machine, Genetic Algorithm, J48, and Random Forest. Tests were performed on three publicly available spam email datasets: "Enron", "SpamAssassin", and "LingSpam". The results show that the Greedy Stepwise Search method is a good method for feature subset selection for spam email detection. Among the machine learning classifiers, Support Vector Machine was found to be the best classifier in terms of both accuracy and false positive rate. However, the results of Random Forest were very close to those of Support Vector Machine. The Genetic classifier was identified as a weak classifier.
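The abstract judges classifiers by accuracy and false positive rate, and names greedy stepwise search as one of the feature subset selection methods. The sketch below illustrates both ideas in plain Python. It is not the authors' implementation: the function names, the forward-only search direction, the stopping rule, and the toy scoring function are all assumptions made for illustration. In a real experiment the score would come from evaluating a classifier on the candidate feature subset.

```python
from typing import Callable, List, Sequence, Tuple


def accuracy_and_fpr(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, false positive rate). Label 1 = spam, 0 = legitimate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate mail flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


def greedy_stepwise(n_features: int, score: Callable[[List[int]], float]) -> List[int]:
    """Forward greedy search: repeatedly add the single feature that raises
    the score the most, and stop when no addition improves it."""
    selected: List[int] = []
    best_score = score(selected)
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        trials = [(score(selected + [f]), f) for f in candidates]
        top_score, top_feature = max(trials, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining feature improves the subset
        selected.append(top_feature)
        best_score = top_score
    return selected


# Hypothetical stand-in for "evaluate a classifier on this subset":
# each feature contributes a fixed gain, and every feature costs 3 points.
GAINS = [30, 5, 20, 1]


def toy_score(subset: List[int]) -> float:
    return sum(GAINS[f] for f in subset) - 3 * len(subset)


if __name__ == "__main__":
    print(greedy_stepwise(len(GAINS), toy_score))
    print(accuracy_and_fpr([1, 1, 0, 0], [1, 0, 1, 0]))
```

With the toy score, the search keeps adding features only while each addition outweighs its per-feature cost, which mirrors the abstract's goal of using as few features as possible.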
