Journal Article

Spam classification: a comparative analysis of different boosted decision tree approaches

Shrawan Kumar Trivedi, +1 more
30 Oct 2018, Vol. 20, Iss. 3, pp. 105-298
TLDR
A boosted decision tree approach is proposed and used to classify emails as spam or ham; it is found to be highly effective in comparison with other state-of-the-art models used in other studies.
Abstract
Purpose: Email spam classification is becoming a challenging area within text classification. Precise and robust classifiers are judged not only by classification accuracy but also by sensitivity (correctly classified legitimate emails) and specificity (correctly classified unsolicited emails), as captured by the false positive and false negative rates. This paper presents a comparative study of several decision tree classifiers (AD tree, decision stump and REP tree) with and without different boosting algorithms (bagging, boosting with re-sample and AdaBoost).

Design/methodology/approach: Artificial intelligence and text mining approaches are used in this study. Each decision tree classifier is tested on informative words/features selected from two publicly available data sets (SpamAssassin and LingSpam) using a greedy step-wise feature search method.

Findings: Without boosting, the REP tree provides high performance accuracy, with the AD tree ranking as the second-best performer. Decision stump is the weakest classifier in this study. With boosting, however, the combination of REP tree and AdaBoost compares favourably with the other classification models. When false positive rate and performance accuracy are considered together, both AD tree and REP tree with AdaBoost carry out the classification task effectively. The greedy step-wise search proved its worth by selecting a subset of valuable features for identifying the correct class of each email.

Research limitations/implications: This research focuses on the classification of spam emails written in English only. The proposed models work with the content (words/features) of email data, which is mostly found in the body of the mail. Image spam is not included in this study, nor are other message types such as short message service or multimedia messaging service.

Practical implications: A boosted decision tree approach is proposed and used to classify emails as spam or ham; it is found to be highly effective in comparison with other state-of-the-art models used in other studies. This classifier may be tested for different applications and may provide new insights for developers and researchers.

Originality/value: A comparison of decision tree classifiers with and without ensemble methods is presented for spam classification.
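
Illustrative sketch: the idea of boosting a weak decision tree, as compared in the abstract, can be shown with a minimal pure-Python version of discrete AdaBoost over decision stumps. This is not the authors' implementation or data. The feature words, the eight toy emails, and all function names below are assumptions made up for illustration, and the scores are computed on the training examples only. The `rates` helper reports the fraction of ham and of spam classified correctly, which correspond to what the abstract calls sensitivity and specificity.

```python
import math

# Toy data (invented for illustration): 1 = word present in the email.
FEATURES = ["free", "winner", "meeting", "report"]
X = [
    [1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 1],  # spam
    [0, 0, 1, 1], [0, 0, 1, 0], [1, 0, 1, 1], [0, 0, 0, 1],  # ham
]
y = [1, 1, 1, 1, -1, -1, -1, -1]  # +1 = spam, -1 = ham


def stump_predict(stump, row):
    """A decision stump tests a single feature and predicts +1 or -1."""
    feature, polarity = stump
    return polarity if row[feature] == 1 else -polarity


def best_stump(X, y, weights):
    """Return the (feature, polarity) stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    for feature in range(len(X[0])):
        for polarity in (1, -1):
            stump = (feature, polarity)
            err = sum(w for row, label, w in zip(X, y, weights)
                      if stump_predict(stump, row) != label)
            if err < best_err:
                best, best_err = stump, err
    return best, best_err


def adaboost(X, y, rounds=3):
    """Discrete AdaBoost: upweight misclassified emails after each round."""
    weights = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(X, y, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this stump
        ensemble.append((alpha, stump))
        weights = [w * math.exp(-alpha * label * stump_predict(stump, row))
                   for row, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


def rates(y_true, y_pred):
    """Fraction of ham and of spam emails that were classified correctly."""
    ham = [p for t, p in zip(y_true, y_pred) if t == -1]
    spam = [p for t, p in zip(y_true, y_pred) if t == 1]
    return {"ham_recall": ham.count(-1) / len(ham),
            "spam_recall": spam.count(1) / len(spam)}


single, _ = best_stump(X, y, [1.0 / len(X)] * len(X))
ensemble = adaboost(X, y, rounds=3)
print("single stump:", rates(y, [stump_predict(single, r) for r in X]))
print("boosted     :", rates(y, [predict(ensemble, r) for r in X]))
```

On this toy set a single stump misses one spam email, while three boosting rounds classify all eight training emails correctly. A real evaluation would use held-out data, as in any comparative study.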
