A novel committee selection mechanism for combining classifiers to detect unsolicited emails

doi:10.1108/VJIKMS-07-2015-0042

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Spam classification: a comparative analysis of different boosted decision tree approaches

[...]

Shrawan Kumar Trivedi, Prabin Kumar Panigrahi

30 Oct 2018-Journal of Systems and Information Technology

TL;DR: A boosted decision tree approach has been proposed and used to classify email spam and ham files; this is found to be a highly effective approach in comparison with other state-of-the-art modes used in other studies.

...read moreread less

Abstract: Purpose Email spam classification is now becoming a challenging area in the domain of text classification. Precise and robust classifiers are not only judged by classification accuracy but also by sensitivity (correctly classified legitimate emails) and specificity (correctly classified unsolicited emails) towards the accurate classification, captured by both false positive and false negative rates. This paper aims to present a comparative study between various decision tree classifiers (such as AD tree, decision stump and REP tree) with/without different boosting algorithms (bagging, boosting with re-sample and AdaBoost). Design/methodology/approach Artificial intelligence and text mining approaches have been incorporated in this study. Each decision tree classifier in this study is tested on informative words/features selected from the two publically available data sets (SpamAssassin and LingSpam) using a greedy step-wise feature search method. Findings Outcomes of this study show that without boosting, the REP tree provides high performance accuracy with the AD tree ranking as the second-best performer. Decision stump is found to be the under-performing classifier of this study. However, with boosting, the combination of REP tree and AdaBoost compares favourably with other classification models. If the metrics false positive rate and performance accuracy are taken together, AD tree and REP tree with AdaBoost were both found to carry out an effective classification task. Greedy stepwise has proven its worth in this study by selecting a subset of valuable features to identify the correct class of emails. Research limitations/implications This research is focussed on the classification of those email spams that are written in the English language only. The proposed models work with content (words/features) of email data that is mostly found in the body of the mail. Image spam has not been included in this study. Other messages such as short message service or multi-media messaging service were not included in this study. Practical implications In this research, a boosted decision tree approach has been proposed and used to classify email spam and ham files; this is found to be a highly effective approach in comparison with other state-of-the-art modes used in other studies. This classifier may be tested for different applications and may provide new insights for developers and researchers. Originality/value A comparison of decision tree classifiers with/without ensemble has been presented for spam classification.

...read moreread less

13 citations

Journal Article•DOI•

Capturing user sentiments for online Indian movie reviews: A comparative analysis of different machine-learning models

[...]

Shrawan Kumar Trivedi, Shubhamoy Dey, Anil Kumar

06 Aug 2018-The Electronic Library

TL;DR: This research aims to present sentiment analysis of an Indian movie review corpus using natural language processing and various machine learning classifiers, finding that, for the maximum number of features, the RF feature selection approach was found to be the best, with better F-values, a low FP rate and less time needed to train the classifiers.

...read moreread less

Abstract: Purpose Sentiment analysis and opinion mining are emerging areas of research for analyzing Web data and capturing users’ sentiments. This research aims to present sentiment analysis of an Indian movie review corpus using natural language processing and various machine learning classifiers. Design/methodology/approach In this paper, a comparative study between three machine learning classifiers (Bayesian, naive Bayesian and support vector machine [SVM]) was performed. All the classifiers were trained on the words/features of the corpus extracted, using five different feature selection algorithms (Chi-square, info-gain, gain ratio, one-R and relief-F [RF] attributes), and a comparative study was performed between them. The classifiers and feature selection approaches were evaluated using different metrics (F-value, false-positive [FP] rate and training time). Findings The results of this study show that, for the maximum number of features, the RF feature selection approach was found to be the best, with better F-values, a low FP rate and less time needed to train the classifiers, whereas for the least number of features, one-R was better than RF. When the evaluation was performed for machine learning classifiers, SVM was found to be superior, although the Bayesian classifier was comparable with SVM. Originality/value This is a novel research where Indian review data were collected and then a classification model for sentiment polarity (positive/negative) was constructed.

...read moreread less

5 citations

Journal Article•DOI•

Analysing user sentiment of Indian movie reviews: A probabilistic committee selection model

[...]

Shrawan Kumar Trivedi, Shubhamoy Dey

29 Oct 2018-The Electronic Library

TL;DR: A novel probabilistic committee selection classifier (PCC) is proposed and used for classifying movie reviews, and is found to be highly effective in comparison with other state-of-the-art classifiers.

...read moreread less

Abstract: Purpose To be sustainable and competitive in the current business environment, it is useful to understand users’ sentiment towards products and services. This critical task can be achieved via natural language processing and machine learning classifiers. This paper aims to propose a novel probabilistic committee selection classifier (PCC) to analyse and classify the sentiment polarities of movie reviews. Design/methodology/approach An Indian movie review corpus is assembled for this study. Another publicly available movie review polarity corpus is also involved with regard to validating the results. The greedy stepwise search method is used to extract the features/words of the reviews. The performance of the proposed classifier is measured using different metrics, such as F-measure, false positive rate, receiver operating characteristic (ROC) curve and training time. Further, the proposed classifier is compared with other popular machine-learning classifiers, such as Bayesian, Naive Bayes, Decision Tree (J48), Support Vector Machine and Random Forest. Findings The results of this study show that the proposed classifier is good at predicting the positive or negative polarity of movie reviews. Its performance accuracy and the value of the ROC curve of the PCC is found to be the most suitable of all other classifiers tested in this study. This classifier is also found to be efficient at identifying positive sentiments of reviews, where it gives low false positive rates for both the Indian Movie Review and Review Polarity corpora used in this study. The training time of the proposed classifier is found to be slightly higher than that of Bayesian, Naive Bayes and J48. Research limitations/implications Only movie review sentiments written in English are considered. In addition, the proposed committee selection classifier is prepared only using the committee of probabilistic classifiers; however, other classifier committees can also be built, tested and compared with the present experiment scenario. Practical implications In this paper, a novel probabilistic approach is proposed and used for classifying movie reviews, and is found to be highly effective in comparison with other state-of-the-art classifiers. This classifier may be tested for different applications and may provide new insights for developers and researchers. Social implications The proposed PCC may be used to classify different product reviews, and hence may be beneficial to organizations to justify users’ reviews about specific products or services. By using authentic positive and negative sentiments of users, the credibility of the specific product, service or event may be enhanced. PCC may also be applied to other applications, such as spam detection, blog mining, news mining and various other data-mining applications. Originality/value The constructed PCC is novel and was tested on Indian movie review data.

...read moreread less

5 citations

Journal Article•DOI•

A modified content-based evolutionary approach to identify unsolicited emails

[...]

Shrawan Kumar Trivedi¹, Shubhamoy Dey²•Institutions (2)

Indian Institute of Management Ahmedabad¹, Indian Institute of Management Indore²

01 Sep 2019-Knowledge and Information Systems

TL;DR: The findings suggest that the MGP classifier with the greedy stepwise feature search method offers an improvement over alternative methods in detecting unsolicited emails.

...read moreread less

Abstract: This computational research seeks to classify unsolicited versus legitimate emails A modified version of an existing genetic programming (GP) classifier—ie, modified genetic programming (MGP)—is implemented to build an ensemble of classifiers to identify unsolicited emails The proposed classifier is assessed using informative features extracted from two corpora (Enron and SpamAssassin) with the help of the greedy stepwise feature search method Further, a comparative study is performed with other popular classifiers, such as Bayesian network, naive Bayes, decision tree, random forest (RF), support vector machine (SVM), and GP Further the results are validated with 20-fold cross-validation and paired T test The results prove that the proposed classifier performs better in terms of accuracy and false-positive detection in comparison with the other machine learning classifiers tested in this study Using different training and testing a set of email files from the Enron corpus, ensemble-based classifiers, such as boosted SVM, boosted Bayesian, boosted naive Bayesian, RF, and the proposed MGP classifier, are tested and compared on all metrics, including training and testing time The findings suggest that the MGP classifier with the greedy stepwise feature search method offers an improvement over alternative methods in detecting unsolicited emails

...read moreread less

3 citations

Journal Article•DOI•

A study of boosted evolutionary classifiers for detecting spam

[...]

Shrawan Kumar Trivedi, Shubhamoy Dey

01 Nov 2019

TL;DR: This study reveals the fact that evolutionary algorithms are promising in classification and prediction applications where genetic programing with adaptive boosting is turned out not only an accurate classifier but also a sensitive classifier.

...read moreread less

Abstract: Email is a rapid and cheapest medium of sharing information, whereas unsolicited email (spam) is constant trouble in the email communication. The rapid growth of the spam creates a necessity to build a reliable and robust spam classifier. This paper aims to presents a study of evolutionary classifiers (genetic algorithm [GA] and genetic programming [GP]) without/with the help of an ensemble of classifiers method. In this research, the classifiers ensemble has been developed with adaptive boosting technique.,Text mining methods are applied for classifying spam emails and legitimate emails. Two data sets (Enron and SpamAssassin) are taken to test the concerned classifiers. Initially, pre-processing is performed to extract the features/words from email files. Informative feature subset is selected from greedy stepwise feature subset search method. With the help of informative features, a comparative study is performed initially within the evolutionary classifiers and then with other popular machine learning classifiers (Bayesian, naive Bayes and support vector machine).,This study reveals the fact that evolutionary algorithms are promising in classification and prediction applications where genetic programing with adaptive boosting is turned out not only an accurate classifier but also a sensitive classifier. Results show that initially GA performs better than GP but after an ensemble of classifiers (a large number of iterations), GP overshoots GA with significantly higher accuracy. Amongst all classifiers, boosted GP turns out to be not only good regarding classification accuracy but also low false positive (FP) rates, which is considered to be the important criteria in email spam classification. Also, greedy stepwise feature search is found to be an effective method for feature selection in this application domain.,The research implication of this research consists of the reduction in cost incurred because of spam/unsolicited bulk email. Email is a fundamental necessity to share information within a number of units of the organizations to be competitive with the business rivals. In addition, it is continually a hurdle for internet service providers to provide the best emailing services to their customers. Although, the organizations and the internet service providers are continuously adopting novel spam filtering approaches to reduce the number of unwanted emails, the desired effect could not be significantly seen because of the cost of installation, customizable ability and the threat of misclassification of important emails. This research deals with all the issues and challenges faced by internet service providers and organizations.,In this research, the proposed models have not only provided excellent performance accuracy, sensitivity with low FP rate, customizable capability but also worked on reducing the cost of spam. The same models may be used for other applications of text mining also such as sentiment analysis, blog mining, news mining or other text mining research.,A comparison between GP and GAs has been shown with/without ensemble in spam classification application domain.

...read moreread less

1 citations

A novel committee selection mechanism for combining classifiers to detect unsolicited emails

Citations

References

Related Papers (5)

Trending Questions (1)