Open Access Journal Article
A Comprehensive Survey on various Feature Selection Methods to Categorize Text Documents
B. S. Harish, M. Revanasiddappa +1 more
Abstract:
Feature selection is one of the well-known solutions to the high-dimensionality problem in text categorization. In text categorization, the selection of good features (terms) plays a very important role. Feature selection is a strategy that can be used to improve categorization accuracy, effectiveness and computational efficiency. This paper presents an empirical study of the most widely used feature selection methods, viz. Term Frequency-Inverse Document Frequency (tf·idf), Information Gain (IG), Mutual Information (MI), Chi-Square (χ²), Ambiguity Measure (AM), Term Strength (TS), Term Frequency-Relevance Frequency (tf·rf) and Symbolic Feature Selection (SFS), with five different classifiers (Naive Bayes, K-Nearest Neighbor, Centroid Based Classifier, Support Vector Machine and Symbolic Classifier). Experiments are carried out on standard benchmark datasets such as Reuters-21578, 20-Newsgroups and the 4 University dataset.
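To make one of the listed methods concrete, the following is a minimal sketch of Chi-Square term scoring on a toy corpus. It uses the standard textbook form of the statistic, χ²(t, c) = N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D)), where A, B, C and D are document counts from the term/class contingency table. The function names, the whitespace tokenization and the four example documents are illustrative assumptions, not the setup used in the paper.

```python
def chi_square_scores(docs, labels, target):
    """Return {term: chi-square score} with respect to the class `target`.

    Assumption: documents are tokenized by lowercasing and splitting on
    whitespace, and only term presence (not frequency) is counted.
    """
    n = len(docs)
    doc_terms = [set(d.lower().split()) for d in docs]
    vocab = set().union(*doc_terms)
    n_pos = sum(1 for y in labels if y == target)  # documents in the class
    n_neg = n - n_pos                              # documents outside it
    scores = {}
    for term in vocab:
        # a: in class, contains term     b: not in class, contains term
        a = sum(1 for t, y in zip(doc_terms, labels) if term in t and y == target)
        b = sum(1 for t, y in zip(doc_terms, labels) if term in t and y != target)
        c = n_pos - a  # in class, term absent
        d = n_neg - b  # not in class, term absent
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus, for illustration only.
docs = [
    "stocks rise as markets rally",
    "markets fall on weak earnings",
    "team wins the final match",
    "coach praises the team after match",
]
labels = ["finance", "finance", "sport", "sport"]

scores = chi_square_scores(docs, labels, target="finance")
top = select_top_k(scores, 3)
```

Because the statistic is symmetric, terms that signal the absence of the target class (here "team" and "match") score as highly as terms that signal its presence ("markets"). A function word such as "the" also reaches the maximum score in this tiny corpus, which shows why scores estimated from very few documents should be read with caution.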