Journal ArticleDOI

A lifelong spam emails classification model

23 Jan 2020 - Applied Computing and Informatics (no longer published by Elsevier)
TL;DR: An enhanced model is proposed to ensure lifelong spam email classification; its overall performance is compared against several other stream-mining classification techniques to demonstrate its success as a lifelong spam email classification method.
About: This article was published in Applied Computing and Informatics on 2020-01-23 and is currently open access. It has received 17 citations to date.
Citations
Journal ArticleDOI
TL;DR: The authors present the consequences of ignoring the dataset shift problem in spam email detection and show that this shift may lead to severe degradation in the estimated generalisation performance, with error rates reaching values of up to 48.81%.
Abstract: Spam emails have traditionally been seen as just annoying, unsolicited emails containing advertisements, but they increasingly include scams, malware or phishing. In order to ensure the security and integrity of users, organisations and researchers aim to develop robust filters for spam email detection. Recently, most spam filters based on machine learning algorithms published in academic journals report very high performance, but users are still reporting a rising number of frauds and attacks via spam emails. Two main challenges can be found in this field: (a) it is a very dynamic environment prone to the dataset shift problem and (b) it suffers from the presence of an adversarial figure, i.e. the spammer. Unlike classical spam email reviews, this one is particularly focused on the problems that this constantly changing environment poses. Moreover, we analyse the different spammer strategies used for contaminating the emails, and we review the state-of-the-art techniques to develop filters based on machine learning. Finally, we empirically evaluate and present the consequences of ignoring the matter of dataset shift in this practical field. Experimental results show that this shift may lead to severe degradation in the estimated generalisation performance, with error rates reaching values up to 48.81%.

15 citations
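The dataset shift described above can be illustrated with a toy experiment. This is not the paper's setup: the single numeric feature, the Gaussian class distributions, and the nearest-centroid classifier are all invented for illustration, but they reproduce the qualitative effect of a classifier trained on one "era" of emails degrading sharply once spammers adapt.

```python
# Sketch: a nearest-centroid classifier trained on one feature distribution
# degrades badly when that distribution shifts (all data invented).
import random

random.seed(0)

def sample(mean_spam, mean_ham, n=500):
    # One numeric feature per email; label 1 = spam, 0 = ham.
    data = [(random.gauss(mean_spam, 1.0), 1) for _ in range(n)]
    data += [(random.gauss(mean_ham, 1.0), 0) for _ in range(n)]
    return data

train = sample(mean_spam=3.0, mean_ham=0.0)        # original distribution
test_same = sample(mean_spam=3.0, mean_ham=0.0)    # no shift
test_shift = sample(mean_spam=0.5, mean_ham=0.0)   # spammers adapted

# "Train": centroid of each class along the single feature.
c1 = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
c0 = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)

def error_rate(data):
    # Predict the class whose centroid is nearer, then count mistakes.
    wrong = sum(1 for x, y in data
                if (1 if abs(x - c1) < abs(x - c0) else 0) != y)
    return wrong / len(data)

print(f"error without shift: {error_rate(test_same):.2%}")
print(f"error under shift:   {error_rate(test_shift):.2%}")
```

Under the shifted test set the error climbs toward chance-level, mirroring the order of magnitude of degradation (up to 48.81%) that the abstract reports.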

Journal ArticleDOI
TL;DR: An innovative feature selection algorithm called "the Highest Wins" (HW) is proposed and used to build naive Bayes and decision tree intrusion detection classifiers on the well-known Network Security Laboratory-Knowledge Discovery in Databases (NSL-KDD) dataset.
Abstract: The rapid advancement of the Internet stimulates building intelligent data mining systems for detecting intrusion attacks. The performance of such systems might be negatively affected by the big datasets employed in the learning phase. Determining the appropriate group of features within training datasets is an essential phase when building data mining classification models. Nevertheless, the resulting minimized set of features should maintain or even improve the performance of the classification models. Throughout this article, an innovative feature selection algorithm called "the Highest Wins" (HW) is proposed. To evaluate the generalization ability of HW, it has been applied to create classification models using the naive Bayes technique on 10 benchmark datasets. The obtained results were compared against two well-known strategies, namely chi-square and information gain. The experimental results confirmed the competitiveness of the suggested strategy in terms of various evaluation measures such as recall, precision, and error rate, while significantly decreasing the number of selected features in the datasets. Further, HW was used to build naive Bayes and decision tree intrusion detection classifiers using the well-known dataset from the Network Security Laboratory-Knowledge Discovery in Databases (NSL-KDD). The results were promising not just in terms of overall performance, but also in terms of the time needed to build the classification model.

13 citations
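The abstract does not spell out HW's scoring rule, but it names chi-square as one of the two baselines it is compared against. The following sketch shows that baseline on invented binary features: a chi-square statistic over the 2x2 feature/class contingency table, where higher scores mark features more associated with the class label.

```python
# Sketch of the chi-square feature-ranking baseline (HW itself is not
# specified in this summary; features and labels are invented).
from collections import Counter

def chi2_score(feature, labels):
    # Build the 2x2 contingency table: feature value (0/1) vs class (0/1).
    counts = Counter(zip(feature, labels))
    n = len(labels)
    score = 0.0
    for f in (0, 1):
        for c in (0, 1):
            observed = counts[(f, c)]
            expected = (sum(counts[(f, k)] for k in (0, 1)) *
                        sum(counts[(k, c)] for k in (0, 1))) / n
            if expected:
                score += (observed - expected) ** 2 / expected
    return score

labels     = [1, 1, 1, 1, 0, 0, 0, 0]
relevant   = [1, 1, 1, 0, 0, 0, 0, 1]   # mostly tracks the label
irrelevant = [1, 0, 1, 0, 1, 0, 1, 0]   # independent of the label

print(chi2_score(relevant, labels), chi2_score(irrelevant, labels))
```

A feature selector of this kind keeps only the top-scoring features, which is exactly the behaviour (fewer features, comparable accuracy) that HW is evaluated against.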

Journal ArticleDOI
TL;DR: An "Improved RI algorithm" (IRI) is proposed that reduces the search space for generating classification rules by removing all unimportant candidate rule-items while the classification model is being created.

7 citations
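The summary above does not give IRI's exact pruning criterion, so the sketch below only illustrates the general idea it describes: discarding weak candidate rule-items early (here, by a minimum-support threshold, an assumption of this example) so they are never extended, which shrinks the search space of a rule-induction learner. The toy transactions are invented.

```python
# Illustrative sketch of pruning candidate rule-items before extending them
# (the exact IRI criterion is not given here; min-support is an assumption).
from itertools import combinations

transactions = [
    {"free", "offer", "spam"},
    {"free", "spam"},
    {"meeting", "ham"},
    {"offer", "free", "spam"},
]

def frequent_items(transactions, min_support=2):
    # Keep only single items frequent enough to seed candidate rules.
    items = {i for t in transactions for i in t}
    support = {i: sum(i in t for t in transactions) for i in items}
    return {i for i, s in support.items() if s >= min_support}

def candidate_pairs(transactions, min_support=2):
    # Extend only surviving items; pruned items never generate candidates,
    # so the space of rule-items to evaluate stays small.
    seeds = frequent_items(transactions, min_support)
    pairs = combinations(sorted(seeds), 2)
    return [p for p in pairs
            if sum(set(p) <= t for t in transactions) >= min_support]

print(candidate_pairs(transactions))
```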

Journal ArticleDOI
TL;DR: Point-biserial correlation is applied to each feature of the University of California Irvine (UCI) spambase email dataset with respect to the class label in order to select the best features, and the performance of the applied classifiers is evaluated.

6 citations
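Point-biserial correlation, the ranking statistic named in the TL;DR above, measures the association between a numeric feature and a binary label. A minimal sketch follows; the feature values and labels are made up for illustration and are not from the spambase dataset.

```python
# Sketch: point-biserial correlation of one numeric feature against a
# binary class label (values invented; population standard deviation used).
import math

def point_biserial(x, y):
    n = len(x)
    g1 = [v for v, lab in zip(x, y) if lab == 1]
    g0 = [v for v, lab in zip(x, y) if lab == 0]
    m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return (m1 - m0) / s * math.sqrt(len(g1) * len(g0) / n ** 2)

labels  = [1, 1, 1, 0, 0, 0]
feature = [0.9, 0.8, 0.7, 0.2, 0.1, 0.0]  # high values in the spam class
print(round(point_biserial(feature, labels), 3))
```

Features are then ranked by the absolute value of this coefficient, and only the strongest are kept for training.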

Journal ArticleDOI
TL;DR: In this article, a hybrid technique is created by combining the Naive Bayes algorithm and the Markov Random Field; the combination is used to determine the prevalence and configuration of values in a dataset and to perform a basic probabilistic classification operation.

6 citations
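Of the two components named above, only the naive Bayes half lends itself to a short sketch (the Markov Random Field component is omitted here). The toy documents and word features below are invented; the sketch shows the basic probabilistic classification operation the TL;DR refers to, with Laplace smoothing, an assumption of this example.

```python
# Sketch of the naive Bayes component only (MRF part omitted; data invented).
import math
from collections import defaultdict

docs = [("win money now", 1), ("cheap money win", 1),
        ("meeting at noon", 0), ("project meeting notes", 0)]

word_counts = {0: defaultdict(int), 1: defaultdict(int)}
class_counts = defaultdict(int)
for text, label in docs:
    class_counts[label] += 1
    for w in text.split():
        word_counts[label][w] += 1

def predict(text):
    vocab = {w for c in word_counts.values() for w in c}
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        # log prior + Laplace-smoothed log likelihood of each word
        s = math.log(class_counts[label] / len(docs))
        for w in text.split():
            s += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = s
    return max(scores, key=scores.get)

print(predict("win cheap money"))
```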

References
Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on data transformations, ensemble learning, massive data sets and multi-instance learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research.
- Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects
- Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods
- Includes the downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks, in an updated, interactive interface; the algorithms in the toolkit cover data pre-processing, classification, regression, clustering, association rules, and visualization

20,196 citations

Journal ArticleDOI
TL;DR: The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.

17,017 citations

Journal ArticleDOI
01 Mar 2002
TL;DR: This presentation discusses the design and implementation of machine learning algorithms in Java, as well as some of the techniques used to develop and implement these algorithms.
Abstract:
1. What's It All About?
2. Input: Concepts, Instances, Attributes
3. Output: Knowledge Representation
4. Algorithms: The Basic Methods
5. Credibility: Evaluating What's Been Learned
6. Implementations: Real Machine Learning Schemes
7. Moving On: Engineering the Input and Output
8. Nuts and Bolts: Machine Learning Algorithms in Java
9. Looking Forward

5,936 citations

Book ChapterDOI
21 Jun 2000
TL;DR: Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that AdaBoost does not overfit rapidly.
Abstract: Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, Bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that AdaBoost does not overfit rapidly.

5,679 citations
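The core mechanism described in the abstract above, classifying new points by a weighted vote over a set of classifiers, can be sketched in a few lines. The three threshold-rule classifiers and their weights below are invented for illustration; real ensembles would vote over trained models.

```python
# Sketch: (weighted) majority vote over an ensemble of classifiers
# (the classifiers here are trivial threshold rules, invented).
from collections import defaultdict

def weighted_vote(classifiers, weights, x):
    # Each classifier casts a vote with its weight; highest tally wins.
    tally = defaultdict(float)
    for clf, w in zip(classifiers, weights):
        tally[clf(x)] += w
    return max(tally, key=tally.get)

classifiers = [lambda x: int(x > 1), lambda x: int(x > 3), lambda x: int(x > 5)]
weights = [1.0, 2.0, 1.0]

print(weighted_vote(classifiers, weights, 4))
```

Bagging and boosting differ mainly in how the member classifiers and their weights are produced, not in this voting step.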

Journal ArticleDOI
TL;DR: The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art and aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.
Abstract: Concept drift primarily refers to an online supervised learning scenario when the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning in this article, we characterize adaptive learning processes; categorize existing strategies for handling concept drift; overview the most representative, distinct, and popular techniques and algorithms; discuss evaluation methodology of adaptive algorithms; and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.

2,374 citations
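One of the adaptation strategies the survey above categorizes is forgetting-based: keep a sliding window of recent labelled examples and rebuild the decision rule from it, so the model tracks a changing input-to-label relation. The sketch below uses an invented one-feature classifier and window size purely to illustrate that mechanism.

```python
# Sketch: sliding-window adaptation to concept drift (classifier, window
# size, and data all invented for illustration).
from collections import deque

class WindowedMeanClassifier:
    """Predicts 1 when x exceeds the midpoint of recent per-class means."""
    def __init__(self, window=50):
        self.window = deque(maxlen=window)  # old examples fall out

    def update(self, x, y):
        self.window.append((x, y))

    def predict(self, x):
        ones = [v for v, y in self.window if y == 1] or [1.0]
        zeros = [v for v, y in self.window if y == 0] or [0.0]
        threshold = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
        return int(x > threshold)

clf = WindowedMeanClassifier(window=4)
for x, y in [(5, 1), (0, 0), (6, 1), (1, 0)]:   # old concept: spam near 5-6
    clf.update(x, y)
print(clf.predict(4))   # old concept: threshold 3.0, so 4 -> class 1
for x, y in [(9, 1), (4, 0), (10, 1), (5, 0)]:  # drifted: spam near 9-10
    clf.update(x, y)
print(clf.predict(4))   # after drift: threshold 7.0, so 4 -> class 0
```

Because the window forgets old examples, the same input is classified differently before and after the drift, which is precisely the adaptive behaviour the survey's lifelong-learning setting requires.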