Proceedings ArticleDOI

Effect of feature selection methods on machine learning classifiers for detecting email spams

TL;DR: Results show that Greedy Stepwise Search is a good method for feature selection for spam email detection and that, among the Machine Learning Classifiers, Support Vector Machine is the best both in terms of accuracy and False Positive rate.
Abstract: This research presents the effects of using features selected by two feature selection methods, i.e. Genetic Search and Greedy Stepwise Search, on popular Machine Learning Classifiers like Bayesian, Naive Bayes, Support Vector Machine and Genetic Algorithm. Tests were performed on two different publicly available spam email datasets: "Enron" and "SpamAssassin". Results show that Greedy Stepwise Search is a good method for feature selection for spam email detection. Among the Machine Learning Classifiers, Support Vector Machine has been found to be the best both in terms of accuracy and False Positive rate.
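The greedy stepwise search the paper pairs with its classifiers can be sketched as plain forward selection: repeatedly add the single feature that most improves a score, and stop when nothing helps. The toy dataset, token names, and the simple majority-vote scorer below are illustrative assumptions, not the paper's setup.

```python
def accuracy(selected, X, y):
    """Score a feature subset: predict spam (1) when more than half of the
    selected binary features fire in a message."""
    if not selected:
        return 0.0
    correct = 0
    for row, label in zip(X, y):
        votes = sum(row[i] for i in selected)
        pred = 1 if votes * 2 > len(selected) else 0
        correct += (pred == label)
    return correct / len(y)

def greedy_stepwise(X, y, n_features):
    """Forward selection: add the feature with the largest score gain each
    round; stop when no remaining feature improves the score."""
    selected, best = [], 0.0
    while True:
        gains = [(accuracy(selected + [f], X, y), f)
                 for f in range(n_features) if f not in selected]
        if not gains:
            break
        score, feat = max(gains)
        if score <= best:
            break
        selected.append(feat)
        best = score
    return selected, best

# Toy corpus: columns = presence of hypothetical tokens
# ["free", "winner", "meeting", "report"]; label 1 = spam.
X = [
    [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1], [0, 1, 0, 0],  # spam
    [0, 0, 1, 1], [0, 0, 1, 0], [1, 0, 1, 1], [0, 0, 0, 1],  # ham
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

subset, score = greedy_stepwise(X, y, n_features=4)
```

On this toy data the search keeps only the single most discriminative token; in the paper's setting the score would instead come from one of the evaluated classifiers on the Enron or SpamAssassin features.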
Citations
Journal ArticleDOI
TL;DR: Experimental results demonstrate that the use of multi-view data can achieve more accurate email classification than the use of single-view data, and that the approach is more effective as compared to several existing similar algorithms.

41 citations

Proceedings ArticleDOI
31 Jul 2017
TL;DR: A supervised machine learning classification model that has been built to detect the distribution of malicious content in online social networks (OSNs) and is able to improve the classifier performance to 0.92 in recall.
Abstract: The increasing volume of malicious content in social networks requires automated methods to detect and eliminate such content. This paper describes a supervised machine learning classification model that has been built to detect the distribution of malicious content in online social networks (OSNs). Multisource features have been used to detect social network posts that contain malicious Uniform Resource Locators (URLs). These URLs could direct users to websites that contain malicious content, drive-by download attacks, phishing, spam, and scams. For the data collection stage, the Twitter streaming application programming interface (API) was used and VirusTotal was used for labelling the dataset. A random forest classification model was used with a combination of features derived from a range of sources. The random forest model without any tuning and feature selection produced a recall value of 0.89. After further investigation and applying parameter tuning and feature selection methods, however, we were able to improve the classifier performance to 0.92 in recall.

33 citations


Cites background from "Effect of feature selection methods..."

  • ...Due to the inventiveness of spammers, detection systems are bypassed after some time, and the set of features used for spam detection has to be regularly revised [18][19]....

    [...]

Journal ArticleDOI
TL;DR: Results show that the Greedy Stepwise Search method is a good method for feature subset selection for spam email detection and that, among the Machine Learning Classifiers, Support Vector Machine is the best classifier both in terms of accuracy and False Positive rate.
Abstract: Detection of spam emails within a set of email files has become a challenging task for researchers. Identification of an effective classifier is based not only on high accuracy of detection but also on low false alarm rates and the use of as few features as possible. In view of these challenges, this research examines the effects of using features selected by four feature subset selection methods (i.e. Genetic, Greedy Stepwise, Best First, and Rank Search) on popular Machine Learning Classifiers like Bayesian, Naive Bayes, Support Vector Machine, Genetic Algorithm, J48 and Random Forest. Tests were performed on three different publicly available spam email datasets: "Enron", "SpamAssassin" and "LingSpam". Results show that the Greedy Stepwise Search method is a good method for feature subset selection for spam email detection. Among the Machine Learning Classifiers, Support Vector Machine has been found to be the best classifier both in terms of accuracy and False Positive rate. However, results of Random Forest were very close to those of Support Vector Machine. The Genetic classifier was identified as a weak classifier.

19 citations

Proceedings ArticleDOI
03 Dec 2013
TL;DR: Results of this study indicate that the proposed classifier (EGP) is the best classifier among those compared in terms of performance accuracy as well as false positive rate.
Abstract: Identification of unsolicited emails (spams) is now a well-recognized research area within text classification. A good email classifier is not only evaluated by performance accuracy but also by the false positive rate. This research presents an Enhanced Genetic Programming (EGP) approach which works by building an ensemble of classifiers for detecting spams. The proposed classifier is tested on the most informative features of two publicly available corpora (Enron and SpamAssassin), found using the Greedy Stepwise search method. Thereafter, the proposed ensemble of classifiers is compared with various Machine Learning Classifiers: Genetic Programming (GP), Bayesian, Naive Bayes (NB), J48, Random Forest (RF), and SVM. Results of this study indicate that the proposed classifier (EGP) is the best classifier among those compared in terms of performance accuracy as well as false positive rate.
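The ensemble idea above can be sketched as a plain majority vote over base classifiers. The three toy rules below stand in for the evolved GP classifiers and are illustrative assumptions, not the paper's programs.

```python
def vote(classifiers, message):
    """Label a message spam (1) when most base classifiers say spam."""
    votes = sum(clf(message) for clf in classifiers)
    return 1 if votes * 2 > len(classifiers) else 0

# Hypothetical base rules keyed on token presence in the message.
rules = [
    lambda m: 1 if "free" in m else 0,
    lambda m: 1 if "winner" in m else 0,
    lambda m: 1 if "!!!" in m else 0,
]

print(vote(rules, {"free", "winner", "meeting"}))  # two of three fire -> 1
print(vote(rules, {"meeting", "report"}))          # none fire -> 0
```

An EGP-style ensemble differs in that the base classifiers are evolved rather than hand-written, but the combination step is the same shape.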

18 citations


Cites methods from "Effect of feature selection methods..."

  • ...This research uses Greedy Stepwise subset Evaluation method to obtain the most informative feature subset....

    [...]

Proceedings ArticleDOI
04 Mar 2016
TL;DR: Results of this study show that RF (Relief F) is the best feature selection technique among those compared in terms of classification accuracy and false positive rate, whereas DF and χ2 were not so effective.
Abstract: Classification of spam from a set of email files is a challenging research area in the text mining domain. However, machine learning based approaches have been widely experimented with in the literature with enormous success. For excellent learning of the classifiers, a small number of informative features is important. This research presents a comparative study between various supervised feature selection methods such as Document Frequency (DF), Chi-Squared (χ2), Information Gain (IG), Gain Ratio (GR), Relief F (RF), and One R (OR). Two corpora (Enron and SpamAssassin) are selected for this study, where Enron is the main corpus and SpamAssassin is used for validation of the results. A Bayesian classifier is used to classify the given corpora with the help of features selected by the above feature selection techniques. Results of this study show that RF is the best feature selection technique among those compared in terms of classification accuracy and false positive rate, whereas DF and χ2 were not so effective. The Bayesian classifier has proven its worth in this study in terms of good performance accuracy and low false positives.
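The χ2 term score compared above can be computed from a 2x2 contingency table of (term present?) x (spam?) per term, measuring how far the counts deviate from independence. The counts below are toy assumptions, not the Enron/SpamAssassin statistics.

```python
def chi_squared(n11, n10, n01, n00):
    """Chi-squared score for a 2x2 table:
    n11 = spam docs containing the term, n10 = ham docs containing it,
    n01 = spam docs without it,          n00 = ham docs without it."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

# Hypothetical counts: "free" appears in 40 of 50 spam and 5 of 50 ham
# messages; "meeting" appears in 10 of 50 spam and 12 of 50 ham.
score_free = chi_squared(40, 5, 10, 45)      # strongly class-correlated
score_meeting = chi_squared(10, 12, 40, 38)  # nearly independent of class
```

Terms are then ranked by this score and the top-k kept; IG, GR, and the other methods in the study differ only in the scoring function applied to the same tables.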

16 citations


Cites background from "Effect of feature selection methods..."

  • ...The rationale behind selecting these versions was the complexity inherent in the email spam files [6, 7, 8, 9]....

    [...]

References
Book
01 Jan 1975
TL;DR: The founding work in the area of adaptation, whose biologically inspired optimization AI aims to mimic, with notes on some (non-GA) branches of AI.
Abstract: Name of founding work in the area. Adaptation is key to survival and evolution. Evolution implicitly optimizes organisms. AI wants to mimic biological optimization: survival of the fittest; exploration and exploitation; niche finding; robustness across changing environments (mammals vs. dinosaurs); self-regulation, self-repair and self-reproduction. Artificial Intelligence, some definitions: "Making computers do what they do in the movies"; "Making computers do what humans (currently) do best"; "Giving computers common sense; letting them make simple decisions" (do as I want, not what I say); "Anything too new to be pigeonholed". Adaptation and modification is the root of intelligence. Some (non-GA) branches of AI: Expert Systems (rule-based deduction).

32,573 citations


"Effect of feature selection methods..." refers methods in this paper

  • ...1 Genetic Algorithm based Classifier These algorithms use a learning approach based on the principles of natural selection introduced by Holland [18]....

    [...]


  • ...[18] Holland, J. H., "Adaptation in Natural and Artificial Systems," University of Michigan Press, Ann Arbor, MI, 1975....

    [...]

Book ChapterDOI
21 Apr 1998
TL;DR: This paper explores the use of Support Vector Machines for learning text classifiers from examples and analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task.
Abstract: This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task. Empirical results support the theoretical findings. SVMs achieve substantial improvements over the currently best performing methods and behave robustly over a variety of different learning tasks. Furthermore they are fully automatic, eliminating the need for manual parameter tuning.

8,658 citations


"Effect of feature selection methods..." refers background in this paper

  • ...2 Dimensionality reduction: dimensions can be reduced by "Feature selection" or "Feature extraction", "Stop word" elimination (terms that carry no information, such as pronouns, prepositions, and conjunctions) [20], and "Lemmatisation" (grouping the terms that come from the same 'root' word)....

    [...]

  • ...feature vectors: a_ik is defined as the weight of word i that belongs to document k [20]....

    [...]
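The per-word document weights described in the snippets above are commonly tf-idf scores: term frequency within the document, discounted by how many documents contain the term. The tiny corpus below is an illustrative assumption.

```python
import math

# Toy tokenized corpus; each inner list is one document.
docs = [["free", "winner", "free"], ["meeting", "report"], ["free", "meeting"]]

def tf_idf(term, doc, docs):
    """Weight of `term` in `doc`: term frequency times inverse document
    frequency over the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# "free" is frequent in doc 0 but appears in two of three documents,
# so its idf discount is mild.
w = tf_idf("free", docs[0], docs)
```

A document then becomes the vector of these weights over the vocabulary, which is the representation the feature selection methods prune.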

Journal ArticleDOI
Vladimir Vapnik1
TL;DR: How the abstract learning theory established conditions for generalization which are more general than those discussed in classical statistical paradigms are demonstrated and how the understanding of these conditions inspired new algorithmic approaches to function estimation problems are demonstrated.
Abstract: Statistical learning theory was introduced in the late 1960's. Until the 1990's it was a purely theoretical analysis of the problem of function estimation from a given collection of data. In the middle of the 1990's new types of learning algorithms (called support vector machines) based on the developed theory were proposed. This made statistical learning theory not only a tool for the theoretical analysis but also a tool for creating practical algorithms for estimating multidimensional functions. This article presents a very general overview of statistical learning theory including both theoretical and algorithmic aspects of the theory. The goal of this overview is to demonstrate how the abstract learning theory established conditions for generalization which are more general than those discussed in classical statistical paradigms and how the understanding of these conditions inspired new algorithmic approaches to function estimation problems.

5,370 citations


"Effect of feature selection methods..." refers background or methods in this paper

  • ...It takes its inspiration from Statistical Learning Theory and the Structural Risk Minimization principle [6]....

    [...]

  • ...[6] V. N. Vapnik, "An Overview of Statistical Learning Theory," IEEE Trans. on Neural Networks, Vol. 10, No. 5, pp. 988-998, 1999....

    [...]


  • ...[14] Drucker, H., Wu, D., & Vapnik, V. N. (1999)....

    [...]

  • ...SVM [3, 5] uses the concept of “Statistical Learning Theory” proposed by Vapnik [6]....

    [...]

Book ChapterDOI
David D. Lewis1
21 Apr 1998
TL;DR: The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval, and some of the variations used for text retrieval and classification are reviewed.
Abstract: The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive Bayes models used for text retrieval and classification, focusing on the distributional assumptions made about word occurrences in documents.
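A minimal Bernoulli naive Bayes spam scorer in the spirit of the models Lewis reviews: per-class Bernoulli word-presence probabilities with Laplace smoothing, combined in log space. The vocabulary and training messages are toy assumptions.

```python
import math

def train(docs, labels, vocab):
    """Return, per class, the log prior and the smoothed probability that
    each vocabulary word is present in a document of that class."""
    model = {}
    for c in (0, 1):
        idx = [i for i, l in enumerate(labels) if l == c]
        probs = {}
        for w in vocab:
            present = sum(1 for i in idx if w in docs[i])
            probs[w] = (present + 1) / (len(idx) + 2)  # Laplace smoothing
        model[c] = (math.log(len(idx) / len(docs)), probs)
    return model

def predict(model, doc, vocab):
    """Pick the class with the highest log posterior, scoring both word
    presence and absence (the Bernoulli event model)."""
    scores = {}
    for c, (prior, probs) in model.items():
        s = prior
        for w in vocab:
            p = probs[w]
            s += math.log(p) if w in doc else math.log(1 - p)
        scores[c] = s
    return max(scores, key=scores.get)

vocab = ["free", "winner", "meeting"]
docs = [{"free", "winner"}, {"free"}, {"meeting"}, {"meeting", "free"}]
labels = [1, 1, 0, 0]  # 1 = spam
model = train(docs, labels, vocab)
print(predict(model, {"free", "winner"}, vocab))  # -> 1
print(predict(model, {"meeting"}, vocab))         # -> 0
```

The multinomial variant Lewis also discusses would count word occurrences rather than scoring presence/absence over the whole vocabulary.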

2,235 citations


"Effect of feature selection methods..." refers background in this paper

  • ...2 Probabilistic Classifiers: This idea was proposed by Lewis in 1998 [13], who introduced...

    [...]

Journal ArticleDOI
TL;DR: The use of support vector machines in classifying e-mail as spam or nonspam is studied by comparing it to three other classification algorithms: Ripper, Rocchio, and boosting decision trees; SVMs performed best when using binary features.
Abstract: We study the use of support vector machines (SVM) in classifying e-mail as spam or nonspam by comparing it to three other classification algorithms: Ripper, Rocchio, and boosting decision trees. These four algorithms were tested on two different data sets: one data set where the number of features was constrained to the 1000 best features and another data set where the dimensionality was over 7000. SVM performed best when using binary features. For both data sets, boosting trees and SVM had acceptable test performance in terms of accuracy and speed. However, SVM had significantly less training time.

1,536 citations


"Effect of feature selection methods..." refers methods in this paper

  • ...[14] compares the performance of SVM with various machine learning classifiers....

    [...]