Journal Article

Ensemble of keyword extraction methods and classifiers in text classification

TLDR
The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.
Abstract
Text classification is a domain with a high-dimensional feature space. Extracting keywords as features can be extremely useful in text classification. An empirical analysis of five statistical keyword extraction methods. A comprehensive analysis of classifier and keyword extraction ensembles. For the ACM collection, a classification accuracy of 93.80% with a Bagging ensemble of Random Forest.

Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. This compact representation can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. Text classification, for instance, faces the challenge of a high-dimensional feature space. Hence, extracting the words most relevant to the content of a document and using these keywords as features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most frequent measure based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and the TextRank algorithm) on classification algorithms and ensemble methods for scientific text document classification (categorization). The study also presents a comprehensive comparison of base learning algorithms (Naive Bayes, support vector machines, logistic regression and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting). To the best of our knowledge, this is the first empirical analysis that evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms.
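As a rough illustration of the simplest of the five methods, most frequent measure based keyword extraction can be sketched as counting content words and keeping the top k. This is a minimal sketch and not the authors' implementation: the function names, the tiny stop-word list, the minimum token length and the alphabetical tie-breaking are all assumptions made for the example.

```python
import re
from collections import Counter
from typing import List, Sequence

# Hypothetical, deliberately small stop-word list (an assumption for this sketch).
STOP_WORDS = frozenset({
    "a", "an", "and", "the", "of", "in", "is", "to", "for", "on", "with",
    "can", "be", "as", "are", "this", "that",
})


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def most_frequent_keywords(text: str, k: int = 5) -> List[str]:
    """Return the k most frequent non-stop-word tokens of a document."""
    counts = Counter(
        token for token in tokenize(text)
        if token not in STOP_WORDS and len(token) > 2
    )
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def keyword_features(text: str, vocabulary: Sequence[str]) -> List[int]:
    """Represent a document as counts over a fixed keyword vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    doc = ("Keyword extraction condenses text. "
           "Keyword extraction helps text classification.")
    keywords = most_frequent_keywords(doc, k=3)
    print(keywords)
    print(keyword_features(doc, keywords))
```

The resulting keyword counts give a much shorter feature vector than a full bag-of-words representation, which is the condensed representation the abstract refers to.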
The classification schemes are compared in terms of classification accuracy, F-measure and area under curve values. To validate the empirical analysis, a two-way ANOVA test is employed. The experimental analysis indicates that a Bagging ensemble of Random Forest with the most frequent based keyword extraction method yields promising results for text classification. For the ACM document collection, the highest average predictive performance (93.80%) is obtained with the most frequent based keyword extraction method and a Bagging ensemble of Random Forest. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.
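Two of the evaluation measures named above, classification accuracy and F-measure, can be computed from true and predicted labels as in the following sketch. The abstract does not say how F-measure is averaged across classes, so the macro average used here is an assumption, and the function names are hypothetical.

```python
from typing import Dict, Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of documents whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure_per_class(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Dict[Hashable, float]:
    """F-measure (harmonic mean of precision and recall) for each class."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # F = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f_measure(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of the per-class F-measures (assumed averaging scheme)."""
    scores = f_measure_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    true_labels = ["a", "a", "b", "b"]
    predicted = ["a", "b", "b", "b"]
    print(accuracy(true_labels, predicted))
    print(macro_f_measure(true_labels, predicted))
```

Area under curve values need ranked classifier scores rather than hard labels, so they are left out of this sketch.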
