Proceedings ArticleDOI

Effect of feature selection methods on machine learning classifiers for detecting email spams

TL;DR: Results show that Greedy Stepwise Search is a good method for feature selection for spam email detection and that, among the Machine Learning Classifiers, Support Vector Machine is the best both in terms of accuracy and False Positive rate.
Abstract: This research presents the effects of using features selected by two feature selection methods, i.e. Genetic Search and Greedy Stepwise Search, on popular Machine Learning Classifiers like Bayesian, Naive Bayes, Support Vector Machine and Genetic Algorithm. Tests were performed on two different publicly available spam email datasets: "Enron" and "SpamAssassin". Results show that Greedy Stepwise Search is a good method for feature selection for spam email detection. Among the Machine Learning Classifiers, Support Vector Machine has been found to be the best both in terms of accuracy and False Positive rate.
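The greedy stepwise search the paper pairs with its classifiers can be sketched as plain forward selection: repeatedly add the single feature that most improves a score, and stop when nothing helps. The toy dataset, token names, and the simple majority-vote scorer below are illustrative assumptions, not the paper's setup.

```python
def accuracy(selected, X, y):
    """Score a feature subset: predict spam (1) when more than half of the
    selected binary features fire in a message."""
    if not selected:
        return 0.0
    correct = 0
    for row, label in zip(X, y):
        votes = sum(row[i] for i in selected)
        pred = 1 if votes * 2 > len(selected) else 0
        correct += (pred == label)
    return correct / len(y)

def greedy_stepwise(X, y, n_features):
    """Forward selection: add the feature with the largest score gain each
    round; stop when no remaining feature improves the score."""
    selected, best = [], 0.0
    while True:
        gains = [(accuracy(selected + [f], X, y), f)
                 for f in range(n_features) if f not in selected]
        if not gains:
            break
        score, feat = max(gains)
        if score <= best:
            break
        selected.append(feat)
        best = score
    return selected, best

# Toy corpus: columns = presence of hypothetical tokens
# ["free", "winner", "meeting", "report"]; label 1 = spam.
X = [
    [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1], [0, 1, 0, 0],  # spam
    [0, 0, 1, 1], [0, 0, 1, 0], [1, 0, 1, 1], [0, 0, 0, 1],  # ham
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

subset, score = greedy_stepwise(X, y, n_features=4)
```

On this toy data the search keeps only the single most discriminative token; in the paper's setting the score would instead come from one of the evaluated classifiers on the Enron or SpamAssassin features.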
Citations
Journal ArticleDOI
TL;DR: Experimental results demonstrate that the use of multi-view data can achieve more accurate email classification than the use of single-view data, and that the approach is more effective as compared to several existing similar algorithms.

41 citations

Proceedings ArticleDOI
31 Jul 2017
TL;DR: A supervised machine learning classification model that has been built to detect the distribution of malicious content in online social networks (OSNs) and is able to improve the classifier performance to 0.92 in recall.
Abstract: The increasing volume of malicious content in social networks requires automated methods to detect and eliminate such content. This paper describes a supervised machine learning classification model that has been built to detect the distribution of malicious content in online social networks (OSNs). Multisource features have been used to detect social network posts that contain malicious Uniform Resource Locators (URLs). These URLs could direct users to websites that contain malicious content, drive-by download attacks, phishing, spam, and scams. For the data collection stage, the Twitter streaming application programming interface (API) was used and VirusTotal was used for labelling the dataset. A random forest classification model was used with a combination of features derived from a range of sources. The random forest model without any tuning and feature selection produced a recall value of 0.89. After further investigation and applying parameter tuning and feature selection methods, however, we were able to improve the classifier performance to 0.92 in recall.

33 citations


Cites background from "Effect of feature selection methods..."

  • ...Due to the inventiveness of spammers, detection systems are bypassed after some time, and the set of features used for spam detection has to be regularly revised [18][19]....

    [...]

Journal ArticleDOI
TL;DR: Results show that the Greedy Stepwise Search method is a good method for feature subset selection for spam email detection and that, among the Machine Learning Classifiers, Support Vector Machine is the best classifier both in terms of accuracy and False Positive rate.
Abstract: Detection of spam emails within a set of email files has become a challenging task for researchers. Identification of an effective classifier is based not only on high accuracy of detection but also on low false alarm rates and the use of as few features as possible. In view of these challenges, this research examines the effects of using features selected by four feature subset selection methods (i.e. Genetic, Greedy Stepwise, Best First, and Rank Search) on popular Machine Learning Classifiers like Bayesian, Naive Bayes, Support Vector Machine, Genetic Algorithm, J48 and Random Forest. Tests were performed on three different publicly available spam email datasets: "Enron", "SpamAssassin" and "LingSpam". Results show that the Greedy Stepwise Search method is a good method for feature subset selection for spam email detection. Among the Machine Learning Classifiers, Support Vector Machine has been found to be the best classifier both in terms of accuracy and False Positive rate. However, results of Random Forest were very close to those of Support Vector Machine. The Genetic classifier was identified as a weak classifier.

19 citations

Proceedings ArticleDOI
03 Dec 2013
TL;DR: Results of this study indicate that the proposed classifier (EGP) is the best classifier among those compared in terms of performance accuracy as well as false positive rate.
Abstract: Identification of unsolicited emails (spams) is now a well-recognized research area within text classification. A good email classifier is not only evaluated by performance accuracy but also by the false positive rate. This research presents an Enhanced Genetic Programming (EGP) approach which works by building an ensemble of classifiers for detecting spams. The proposed classifier is tested on the most informative features of two publicly available corpora (Enron and SpamAssassin), found using the Greedy Stepwise search method. Thereafter, the proposed ensemble of classifiers is compared with various Machine Learning Classifiers: Genetic Programming (GP), Bayesian, Naive Bayes (NB), J48, Random Forest (RF), and SVM. Results of this study indicate that the proposed classifier (EGP) is the best classifier among those compared in terms of performance accuracy as well as false positive rate.
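The ensemble idea above can be sketched as a plain majority vote over base classifiers. The three toy rules below stand in for the evolved GP classifiers and are illustrative assumptions, not the paper's programs.

```python
def vote(classifiers, message):
    """Label a message spam (1) when most base classifiers say spam."""
    votes = sum(clf(message) for clf in classifiers)
    return 1 if votes * 2 > len(classifiers) else 0

# Hypothetical base rules keyed on token presence in the message.
rules = [
    lambda m: 1 if "free" in m else 0,
    lambda m: 1 if "winner" in m else 0,
    lambda m: 1 if "!!!" in m else 0,
]

print(vote(rules, {"free", "winner", "meeting"}))  # two of three fire -> 1
print(vote(rules, {"meeting", "report"}))          # none fire -> 0
```

An EGP-style ensemble differs in that the base classifiers are evolved rather than hand-written, but the combination step is the same shape.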

18 citations


Cites methods from "Effect of feature selection methods..."

  • ...This research uses Greedy Stepwise subset Evaluation method to obtain the most informative feature subset....

    [...]

Proceedings ArticleDOI
04 Mar 2016
TL;DR: Results of this study show that RF (Relief F) is the best feature selection technique among those compared in terms of classification accuracy and false positive rate, whereas DF and χ2 were not so effective.
Abstract: Classification of spam from a set of email files is a challenging research area in the text mining domain. However, machine learning based approaches have been widely experimented with in the literature with enormous success. For excellent learning of the classifiers, a small number of informative features is important. This research presents a comparative study between various supervised feature selection methods such as Document Frequency (DF), Chi-Squared (χ2), Information Gain (IG), Gain Ratio (GR), Relief F (RF), and One R (OR). Two corpora (Enron and SpamAssassin) are selected for this study, where Enron is the main corpus and SpamAssassin is used for validation of the results. A Bayesian classifier is used to classify the given corpora with the help of features selected by the above feature selection techniques. Results of this study show that RF is the best feature selection technique among those compared in terms of classification accuracy and false positive rate, whereas DF and χ2 were not so effective. The Bayesian classifier has proven its worth in this study in terms of good performance accuracy and low false positives.
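The χ2 term score compared above can be computed from a 2x2 contingency table of (term present?) x (spam?) per term, measuring how far the counts deviate from independence. The counts below are toy assumptions, not the Enron/SpamAssassin statistics.

```python
def chi_squared(n11, n10, n01, n00):
    """Chi-squared score for a 2x2 table:
    n11 = spam docs containing the term, n10 = ham docs containing it,
    n01 = spam docs without it,          n00 = ham docs without it."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

# Hypothetical counts: "free" appears in 40 of 50 spam and 5 of 50 ham
# messages; "meeting" appears in 10 of 50 spam and 12 of 50 ham.
score_free = chi_squared(40, 5, 10, 45)      # strongly class-correlated
score_meeting = chi_squared(10, 12, 40, 38)  # nearly independent of class
```

Terms are then ranked by this score and the top-k kept; IG, GR, and the other methods in the study differ only in the scoring function applied to the same tables.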

16 citations


Cites background from "Effect of feature selection methods..."

  • ...The rationale behind selecting these versions was the complexity inherent in the email spam files [6, 7, 8, 9]....

    [...]

References
Book
01 Jan 1975
TL;DR: The founding work in the area of adaptation, whose biologically inspired optimization AI aims to mimic, with notes on some (non-GA) branches of AI.
Abstract: Name of founding work in the area. Adaptation is key to survival and evolution. Evolution implicitly optimizes organisms. AI wants to mimic biological optimization: survival of the fittest; exploration and exploitation; niche finding; robustness across changing environments (mammals vs. dinosaurs); self-regulation, self-repair and self-reproduction. Artificial Intelligence, some definitions: "Making computers do what they do in the movies"; "Making computers do what humans (currently) do best"; "Giving computers common sense; letting them make simple decisions" (do as I want, not what I say); "Anything too new to be pigeonholed". Adaptation and modification is the root of intelligence. Some (non-GA) branches of AI: Expert Systems (rule-based deduction).

32,573 citations


"Effect of feature selection methods..." refers methods in this paper

  • ...1 Genetic Algorithm based Classifier These algorithms use a learning approach based on the principles of natural selection introduced by Holland [18]....

    [...]


  • ...[18] Holland, J. H., "Adaptation in Natural and Artificial Systems," University of Michigan Press, Ann Arbor, MI, 1975....

    [...]

Book ChapterDOI
21 Apr 1998
TL;DR: This paper explores the use of Support Vector Machines for learning text classifiers from examples and analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task.
Abstract: This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task. Empirical results support the theoretical findings. SVMs achieve substantial improvements over the currently best performing methods and behave robustly over a variety of different learning tasks. Furthermore they are fully automatic, eliminating the need for manual parameter tuning.

8,658 citations


"Effect of feature selection methods..." refers background in this paper

  • ...2 Dimensionality reduction: dimensions can be reduced by "Feature selection" or "Feature extraction", "Stop word" elimination (terms that carry no information, such as pronouns, prepositions, and conjunctions) [20], and "Lemmatisation" (grouping the terms that come from the same 'root' word)....

    [...]

  • ...feature vectors: a_ik is defined as the weight of word i that belongs to document k [20]....

    [...]
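The per-word document weights described in the snippets above are commonly tf-idf scores: term frequency within the document, discounted by how many documents contain the term. The tiny corpus below is an illustrative assumption.

```python
import math

# Toy tokenized corpus; each inner list is one document.
docs = [["free", "winner", "free"], ["meeting", "report"], ["free", "meeting"]]

def tf_idf(term, doc, docs):
    """Weight of `term` in `doc`: term frequency times inverse document
    frequency over the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# "free" is frequent in doc 0 but appears in two of three documents,
# so its idf discount is mild.
w = tf_idf("free", docs[0], docs)
```

A document then becomes the vector of these weights over the vocabulary, which is the representation the feature selection methods prune.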

Journal ArticleDOI
Vladimir Vapnik1
TL;DR: How the abstract learning theory established conditions for generalization which are more general than those discussed in classical statistical paradigms are demonstrated and how the understanding of these conditions inspired new algorithmic approaches to function estimation problems are demonstrated.
Abstract: Statistical learning theory was introduced in the late 1960's. Until the 1990's it was a purely theoretical analysis of the problem of function estimation from a given collection of data. In the middle of the 1990's new types of learning algorithms (called support vector machines) based on the developed theory were proposed. This made statistical learning theory not only a tool for the theoretical analysis but also a tool for creating practical algorithms for estimating multidimensional functions. This article presents a very general overview of statistical learning theory including both theoretical and algorithmic aspects of the theory. The goal of this overview is to demonstrate how the abstract learning theory established conditions for generalization which are more general than those discussed in classical statistical paradigms and how the understanding of these conditions inspired new algorithmic approaches to function estimation problems.

5,370 citations


"Effect of feature selection methods..." refers background or methods in this paper

  • ...It takes its inspiration from Statistical Learning Theory and the Structural Risk Minimization principle [6]....

    [...]

  • ...[6] V. N. Vapnik, "An Overview of Statistical Learning Theory," IEEE Trans. on Neural Networks, Vol. 10, No. 5, pp. 988-998, 1999....

    [...]


  • ...[14] Drucker, H., Wu, D., & Vapnik, V. N. (1999)....

    [...]

  • ...SVM [3, 5] uses the concept of “Statistical Learning Theory” proposed by Vapnik [6]....

    [...]

Book ChapterDOI
David D. Lewis1
21 Apr 1998
TL;DR: The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval, and some of the variations used for text retrieval and classification are reviewed.
Abstract: The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive Bayes models used for text retrieval and classification, focusing on the distributional assumptions made about word occurrences in documents.
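A minimal Bernoulli naive Bayes spam scorer in the spirit of the models Lewis reviews: per-class Bernoulli word-presence probabilities with Laplace smoothing, combined in log space. The vocabulary and training messages are toy assumptions.

```python
import math

def train(docs, labels, vocab):
    """Return, per class, the log prior and the smoothed probability that
    each vocabulary word is present in a document of that class."""
    model = {}
    for c in (0, 1):
        idx = [i for i, l in enumerate(labels) if l == c]
        probs = {}
        for w in vocab:
            present = sum(1 for i in idx if w in docs[i])
            probs[w] = (present + 1) / (len(idx) + 2)  # Laplace smoothing
        model[c] = (math.log(len(idx) / len(docs)), probs)
    return model

def predict(model, doc, vocab):
    """Pick the class with the highest log posterior, scoring both word
    presence and absence (the Bernoulli event model)."""
    scores = {}
    for c, (prior, probs) in model.items():
        s = prior
        for w in vocab:
            p = probs[w]
            s += math.log(p) if w in doc else math.log(1 - p)
        scores[c] = s
    return max(scores, key=scores.get)

vocab = ["free", "winner", "meeting"]
docs = [{"free", "winner"}, {"free"}, {"meeting"}, {"meeting", "free"}]
labels = [1, 1, 0, 0]  # 1 = spam
model = train(docs, labels, vocab)
print(predict(model, {"free", "winner"}, vocab))  # -> 1
print(predict(model, {"meeting"}, vocab))         # -> 0
```

The multinomial variant Lewis also discusses would count word occurrences rather than scoring presence/absence over the whole vocabulary.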

2,235 citations


"Effect of feature selection methods..." refers background in this paper

  • ...2 Probabilistic Classifiers: This idea was proposed by Lewis in 1998 [13], who introduced...

    [...]

Journal ArticleDOI
TL;DR: The use of support vector machines in classifying e-mail as spam or nonspam is studied by comparing it to three other classification algorithms: Ripper, Rocchio, and boosting decision trees; SVMs performed best when using binary features.
Abstract: We study the use of support vector machines (SVM) in classifying e-mail as spam or nonspam by comparing it to three other classification algorithms: Ripper, Rocchio, and boosting decision trees. These four algorithms were tested on two different data sets: one data set where the number of features was constrained to the 1000 best features and another data set where the dimensionality was over 7000. SVM performed best when using binary features. For both data sets, boosting trees and SVM had acceptable test performance in terms of accuracy and speed. However, SVM had significantly less training time.

1,536 citations


"Effect of feature selection methods..." refers methods in this paper

  • ...[14] compares the performance of SVM with various machine learning classifiers....

    [...]