Proceedings Article

A Combining Classifiers Approach for Detecting Email Spams

01 Mar 2016, pp. 355-360
TL;DR: Results show that the novel combining classifiers approach gives the best results in comparison with the individual classifiers, in terms of high accuracy and low false positives.
Abstract: Email is a fast and cheap medium for sending and receiving information, but spam is becoming a nuisance for such communication. Good spam filtering requires not only high accuracy but also a low false positive rate. This paper presents a combining classifiers approach with a committee selection mechanism, whose main objective is to combine the individual decisions of good classifiers to obtain the best classification outcome in the spam classification domain. In this context, three classifiers have been selected: "Boosted Bayesian", "Boosted Naive Bayes" and Support Vector Machine (SVM). For combining classifiers, boosted Bayesian and boosted naive Bayes are chosen as members of the committee and SVM is taken as the president. The committee members were selected on the basis of our previous study, in which we found that boosting with AdaBoost improves the performance of probabilistic classifiers. Results show that the novel combining classifiers approach gives the best results in comparison with the individual classifiers, in terms of high accuracy and low false positives. In addition, the greedy stepwise feature search method is found to perform well in this study.
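The committee-and-president idea in the abstract above can be illustrated with a short sketch. The abstract does not state how the president's decision is combined with the members' decisions, so the rule used here (accept a unanimous committee, otherwise defer to the president) is an assumption, and the names `committee_decision`, `SPAM` and `HAM` are hypothetical. The votes are hard-coded stand-ins for the outputs of the two boosted classifiers and the SVM.

```python
from typing import Sequence

SPAM, HAM = 1, 0  # hypothetical label encoding


def committee_decision(member_votes: Sequence[int], president_vote: int) -> int:
    """Assumed rule: a unanimous committee decides; otherwise the president does."""
    if not member_votes:
        return president_vote
    first = member_votes[0]
    if all(vote == first for vote in member_votes):
        return first
    return president_vote


# Each entry: (votes of the two committee members, vote of the president).
emails = [
    ((SPAM, SPAM), HAM),   # members agree on spam
    ((SPAM, HAM), HAM),    # members disagree, president says ham
    ((HAM, HAM), SPAM),    # members agree on ham
]
decisions = [committee_decision(members, president) for members, president in emails]
print(decisions)
```

Under this assumed rule the president only matters when the committee members disagree; other combination rules are equally compatible with the abstract.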
Citations

Journal Article
TL;DR: A novel spam filter integrating an N-gram tf.idf feature selection, modified distribution-based balancing algorithm and a regularized deep multi-layer perceptron NN model with rectified linear units is proposed, which outperforms state-of-the-art spam filters and several machine learning algorithms commonly used to classify text.
Abstract: Rapid growth in the volume of unsolicited and unwanted messages has inspired the development of many anti-spam methods. Supervised anti-spam filters using machine-learning methods have been particularly effective in categorizing spam and non-spam messages. These automatically integrate spam corpora pre-processing, appropriate word lists selection, and the calculation of word weights, usually in a bag-of-words fashion. To develop an accurate spam filter is challenging because spammers attempt to decrease the probability of spam detection by using legitimate words. Complex models are therefore needed to solve such a problem. However, existing spam filtering methods usually converge to a poor local minimum, cannot effectively handle high-dimensional data and suffer from overfitting issues. To overcome these problems, we propose a novel spam filter integrating an N-gram tf.idf feature selection, modified distribution-based balancing algorithm and a regularized deep multi-layer perceptron NN model with rectified linear units (DBB-RDNN-ReL). As demonstrated on four benchmark spam datasets (Enron, SpamAssassin, SMS spam collection and Social networking), the proposed approach enables capturing more complex features from high-dimensional data by additional layers of neurons. Another advantage of this approach is that no additional dimensionality reduction is necessary and spam dataset imbalance is addressed using a modified distribution-based algorithm. We compare the performance of the approach with that of state-of-the-art spam filters (Minimum Description Length, Factorial Design using SVM and NB, Incremental Learning C4.5, and Random Forest, Voting and Convolutional Neural Network) and several machine learning algorithms commonly used to classify text. We show that the proposed model outperforms these other methods in terms of classification accuracy, with fewer false negatives and false positives. 
Notably, the proposed spam filter classifies both major (legitimate) and minor (spam) classes well on personalized / non-personalized and balanced / imbalanced spam datasets. In addition, we show that the proposed model performs better than the results reported by previous studies in terms of accuracy. However, the high computational expenses related to additional hidden layers limit its application as an online spam filter and make it difficult to overcome the problem of concept drift.
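As a rough illustration of the N-gram tf.idf weighting mentioned in this abstract, the sketch below computes weights for word bigrams. The exact variant used by the cited paper is not given here, so the choices are assumptions: whitespace tokenization, term frequency as a relative count, and inverse document frequency as log(N / df). The function names `ngrams` and `tfidf` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return the word n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs: List[str], n: int = 2) -> List[Dict[str, float]]:
    """Assumed weighting: (count / doc length) * log(N / document frequency)."""
    grams_per_doc = [ngrams(doc.lower().split(), n) for doc in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(g for grams in grams_per_doc for g in set(grams))
    num_docs = len(docs)
    weights = []
    for grams in grams_per_doc:
        counts = Counter(grams)
        total = len(grams) or 1  # avoid division by zero for very short documents
        weights.append(
            {g: (c / total) * math.log(num_docs / df[g]) for g, c in counts.items()}
        )
    return weights


docs = ["win free money now", "free money for you", "meeting notes for you"]
weights = tfidf(docs, n=2)
print(weights[0])
```

A bigram that occurs in only one document ("win free") receives a higher weight than one shared by two documents ("free money"), which is the behaviour a feature selection step would rely on.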

49 citations




Cites background from "A Combining Classifiers Approach fo..."

  • ...The combination of boosting and SVM outperformed single classifiers on several benchmark datasets in [71]....

    [...]


Journal Article
TL;DR: A novel spam detection method that combines the artificial bee colony algorithm with a logistic regression classification model is proposed that outperforms other spam detection techniques considered in this study in terms of classification accuracy.
Abstract: Email spam is a serious problem that annoys recipients and wastes their time. Machine-learning methods have been prevalent in spam detection systems owing to their efficiency in classifying mail as solicited or unsolicited. However, existing spam detection techniques usually suffer from low detection rates and cannot efficiently handle high-dimensional data. Therefore, we propose a novel spam detection method that combines the artificial bee colony algorithm with a logistic regression classification model. The empirical results on three publicly available datasets (Enron, CSDMC2010, and TurkishEmail) show that the proposed model can handle high-dimensional data thanks to its highly effective local and global search abilities. We compare the proposed model’s spam detection performance to those of support vector machine, logistic regression, and naive Bayes classifiers, in addition to the performance of the state-of-the-art methods reported by previous studies. We observe that the proposed method outperforms other spam detection techniques considered in this study in terms of classification accuracy.

25 citations


Journal Article
02 Dec 2020, IEEE Access
TL;DR: This paper aims to provide a comprehensive overview of the challenges that ML techniques face in protecting cyberspace against attacks, by presenting a literature on ML techniques for cyber security including intrusion detection, spam detection, and malware detection on computer networks and mobile networks in the last decade.
Abstract: Pervasive growth and usage of the Internet and mobile applications have expanded cyberspace. The cyberspace has become more vulnerable to automated and prolonged cyberattacks. Cyber security techniques provide enhancements in security measures to detect and react against cyberattacks. The previously used security systems are no longer sufficient because cybercriminals are smart enough to evade conventional security systems. Conventional security systems lack efficiency in detecting previously unseen and polymorphic security attacks. Machine learning (ML) techniques are playing a vital role in numerous applications of cyber security. However, despite the ongoing success, there are significant challenges in ensuring the trustworthiness of ML systems. There are incentivized malicious adversaries present in the cyberspace that are willing to game and exploit such ML vulnerabilities. This paper aims to provide a comprehensive overview of the challenges that ML techniques face in protecting cyberspace against attacks, by presenting a literature on ML techniques for cyber security including intrusion detection, spam detection, and malware detection on computer networks and mobile networks in the last decade. It also provides brief descriptions of each ML method, frequently used security datasets, essential ML tools, and evaluation metrics to evaluate a classification model. It finally discusses the challenges of using ML techniques in cyber security. This paper provides the latest extensive bibliography and the current trends of ML in cyber security.

20 citations


Journal Article
TL;DR: A boosted decision tree approach is proposed and used to classify emails as spam or ham; it is found to be highly effective in comparison with other state-of-the-art models used in other studies.
Abstract: Purpose: Email spam classification is becoming a challenging area in the domain of text classification. Precise and robust classifiers are judged not only by classification accuracy but also by sensitivity (correctly classified legitimate emails) and specificity (correctly classified unsolicited emails), captured by both false positive and false negative rates. This paper aims to present a comparative study of various decision tree classifiers (such as AD tree, decision stump and REP tree) with and without different boosting algorithms (bagging, boosting with re-sample and AdaBoost). Design/methodology/approach: Artificial intelligence and text mining approaches have been incorporated in this study. Each decision tree classifier in this study is tested on informative words/features selected from two publicly available data sets (SpamAssassin and LingSpam) using a greedy stepwise feature search method. Findings: Outcomes of this study show that, without boosting, the REP tree provides high performance accuracy, with the AD tree ranking as the second-best performer. Decision stump is found to be the under-performing classifier of this study. However, with boosting, the combination of REP tree and AdaBoost compares favourably with other classification models. If false positive rate and performance accuracy are taken together, AD tree and REP tree with AdaBoost were both found to carry out an effective classification task. Greedy stepwise has proven its worth in this study by selecting a subset of valuable features to identify the correct class of emails. Research limitations/implications: This research focuses only on spam emails written in the English language. The proposed models work with the content (words/features) of email data that is mostly found in the body of the mail. Image spam has not been included in this study. Other messages, such as short message service or multi-media messaging service messages, were not included in this study. Practical implications: In this research, a boosted decision tree approach has been proposed and used to classify email spam and ham files; this is found to be a highly effective approach in comparison with other state-of-the-art models used in other studies. This classifier may be tested for different applications and may provide new insights for developers and researchers. Originality/value: A comparison of decision tree classifiers with and without ensembles has been presented for spam classification.
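The evaluation measures named in this abstract can be made concrete with a small sketch. It follows the abstract's own definitions (sensitivity as the share of legitimate emails classified correctly, specificity as the share of unsolicited emails classified correctly) and additionally assumes that a false positive is a legitimate email flagged as spam. The `Counts` class, its field names and the example numbers are hypothetical, not results from the paper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Counts:
    ham_kept: int      # legitimate emails classified as legitimate
    ham_flagged: int   # legitimate emails classified as spam (assumed false positives)
    spam_caught: int   # spam emails classified as spam
    spam_missed: int   # spam emails classified as legitimate (assumed false negatives)


def rates(c: Counts) -> Dict[str, float]:
    """Compute accuracy and per-class rates; both classes must be non-empty."""
    ham_total = c.ham_kept + c.ham_flagged
    spam_total = c.spam_caught + c.spam_missed
    if ham_total == 0 or spam_total == 0:
        raise ValueError("both classes need at least one email")
    return {
        "accuracy": (c.ham_kept + c.spam_caught) / (ham_total + spam_total),
        "sensitivity": c.ham_kept / ham_total,
        "specificity": c.spam_caught / spam_total,
        "false_positive_rate": c.ham_flagged / ham_total,
        "false_negative_rate": c.spam_missed / spam_total,
    }


# Illustrative counts only.
result = rates(Counts(ham_kept=90, ham_flagged=10, spam_caught=80, spam_missed=20))
print(result)
```

With these definitions a filter can reach high accuracy while still flagging many legitimate emails, which is why the abstract reports the false positive rate alongside accuracy.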

9 citations


Book Chapter
01 Jan 2019
TL;DR: A voting classifier, a type of ensemble learning, is used to calculate the accuracy of different combinations of classifiers, and results show that the voting classifier produces more accurate predictions than an individual classifier.
Abstract: In our daily life, we use email and SMS many times to communicate with each other, but the increase in spam email and SMS has become a headache for both the sender and the receiver. We need a spam detection tool to detect spam. Many spam detection tools are available in the market, but they are not up to the mark because they rely only on an individual classifier or on only one or two combinations of classifiers. In our research, we present different combinations of four classifiers, namely “Gaussian Naive Bayes”, “Multinomial Naive Bayes”, “Bernoulli Naive Bayes”, and “Decision Tree”. We have used a voting classifier, a type of ensemble learning, to calculate the accuracy of different combinations of classifiers. Results show that the voting classifier produces more accurate predictions than an individual classifier. We have also created an Android application to serve the purpose. The mobile application works on the client–server principle: it acts as a client that sends the data selected by the user from the mobile device to the server. At the server, a machine learning script classifies the received data and sends the prediction back to the client.
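A voting classifier of the kind described in this abstract can be sketched in a few lines. The abstract does not say whether hard (majority) or soft (probability-averaged) voting was used, so hard voting is assumed here, and the per-classifier predictions are hypothetical stand-ins rather than outputs of trained models.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most common label among the classifiers' votes."""
    if not labels:
        raise ValueError("at least one vote is required")
    return Counter(labels).most_common(1)[0][0]


# Hypothetical predictions of three classifiers for four messages.
predictions = {
    "gaussian_nb": ["spam", "ham", "ham", "spam"],
    "multinomial_nb": ["spam", "spam", "ham", "ham"],
    "decision_tree": ["ham", "spam", "ham", "spam"],
}

# zip(*...) regroups the lists so that each tuple holds all votes for one message.
combined: List[str] = [majority_vote(votes) for votes in zip(*predictions.values())]
print(combined)
```

An odd number of voters is used so that a two-class vote cannot tie; with an even number, a tie-breaking rule would have to be chosen.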

9 citations


References

01 Jan 1998
TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.
Abstract: A comprehensive look at learning and generalization theory. The statistical theory of learning and generalization concerns the problem of choosing desired functions on the basis of empirical data. Highly applicable to a variety of computer science and robotics fields, this book offers lucid coverage of the theory as a whole. Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.

26,121 citations


"A Combining Classifiers Approach fo..." refers methods in this paper

  • ...It is a statistical sample based method that consists of drawing randomly with replacement from a data set....

    [...]
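The excerpt above describes drawing randomly with replacement from a data set. A minimal sketch of that sampling step follows; the function name `bootstrap_sample`, the fixed seed and the toy data are assumptions made for illustration, and the excerpt does not say how the resulting samples are used afterwards.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def bootstrap_sample(data: Sequence[T], rng: random.Random) -> List[T]:
    """Draw len(data) items uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))


emails = ["e1", "e2", "e3", "e4", "e5"]  # toy stand-ins for training emails
rng = random.Random(0)                    # seeded only to make the run repeatable
sample = bootstrap_sample(emails, rng)
print(sample)
```

Because items are drawn with replacement, a sample typically contains some emails more than once and omits others.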


Book Chapter
21 Apr 1998
TL;DR: This paper explores the use of Support Vector Machines for learning text classifiers from examples and analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task.
Abstract: This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task. Empirical results support the theoretical findings. SVMs achieve substantial improvements over the currently best performing methods and behave robustly over a variety of different learning tasks. Furthermore they are fully automatic, eliminating the need for manual parameter tuning.

8,287 citations


"A Combining Classifiers Approach fo..." refers background in this paper

  • ...An email file is represented as a collection of feature vectors; $a_{ij}$ is defined as word $i$ belonging to feature vector $j$ (Joachims 1998) [12]....

    [...]

  • ...for dimension reduction of the matrix is also involved, such as stop-word removal (removing the least informative words, such as pronouns, prepositions, and conjunctions) (Joachims 1998) [12] and lemmatization (grouping similar informative words; for example, “Perform”, “Performed” and “Performing” can be grouped as “perform”)....

    [...]


Book Chapter
David D. Lewis
21 Apr 1998
TL;DR: The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval, and some of the variations used for text retrieval and classification are reviewed.
Abstract: The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive Bayes models used for text retrieval and classification, focusing on the distributional assumptions made about word occurrences in documents.

2,111 citations


"A Combining Classifiers Approach fo..." refers background in this paper

  • ...This idea [14], suggests a term ij c d P , which is termed as...

    [...]


Journal Article
Abstract: Motivated by a representation for the least squares estimator, we propose a class of weighted jackknife variance estimators for the least squares estimator by deleting any fixed number of observations at a time. They are unbiased for homoscedastic errors and a special case, the delete-one jackknife, is almost unbiased for heteroscedastic errors. The method is extended to cover nonlinear parameters, regression $M$-estimators, nonlinear regression and generalized linear models. Interval estimators can be constructed from the jackknife histogram. Three bootstrap methods are considered. Two are shown to give biased variance estimators and one does not have the bias-robustness property enjoyed by the weighted delete-one jackknife. A general method for resampling residuals is proposed. It gives variance estimators that are bias-robust. Several bias-reducing estimators are proposed. Some simulation results are reported.

1,616 citations


Proceedings Article
01 Jul 1998
TL;DR: This work examines methods for the automated construction of filters to eliminate such unwanted messages from a user’s mail stream, and shows the efficacy of such filters in a real world usage scenario, arguing that this technology is mature enough for deployment.
Abstract: In addressing the growing problem of junk E-mail on the Internet, we examine methods for the automated construction of filters to eliminate such unwanted messages from a user’s mail stream. By casting this problem in a decision theoretic framework, we are able to make use of probabilistic learning methods in conjunction with a notion of differential misclassification cost to produce filters which are especially appropriate for the nuances of this task. While this may appear, at first, to be a straightforward text classification problem, we show that by considering domain-specific features of this problem in addition to the raw text of E-mail messages, we can produce much more accurate filters. Finally, we show the efficacy of such filters in a real world usage scenario, arguing that this technology is mature enough for deployment.

1,515 citations


Network Information
Related Papers (5)
29 Jul 2005

David Maxwell Chickering, Geoffrey J. Hulten +4 more

03 Dec 2013

Shrawan Kumar Trivedi, Shubhamoy Dey

Performance
Metrics
No. of citations received by the Paper in previous years
Year  Citations
2021  3
2020  2
2019  2
2018  5
2017  1