Open Access Journal Article

A Comprehensive Survey on various Feature Selection Methods to Categorize Text Documents

B. S. Harish, +1 more
- 17 Apr 2017 - 
- Vol. 164, Iss: 8, pp 1-7
Abstract
Feature selection is one of the well-known solutions to the high-dimensionality problem in text categorization. In text categorization, the selection of good features (terms) plays a very important role. Feature selection is a strategy that can be used to improve categorization accuracy, effectiveness and computational efficiency. This paper presents an empirical study of the most widely used feature selection methods, viz. Term Frequency-Inverse Document Frequency (tf·idf), Information Gain (IG), Mutual Information (MI), Chi-Square (χ²), Ambiguity Measure (AM), Term Strength (TS), Term Frequency-Relevance Frequency (tf·rf) and Symbolic Feature Selection (SFS), with five different classifiers (Naive Bayes, K-Nearest Neighbor, Centroid Based Classifier, Support Vector Machine and Symbolic Classifier). Experiments are carried out on standard benchmark datasets such as Reuters-21578, 20-Newsgroups and the 4 University dataset.
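As a rough illustration of two of the methods named in the abstract, the sketch below scores terms with tf·idf and with the chi-square statistic on a toy corpus. It is not the paper's implementation: the corpus `DOCS`, the function names, the raw-count times log(N/df) variant of tf·idf, and the standard 2x2 contingency form of chi-square are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the paper): (tokens, category) pairs.
DOCS = [
    ("wheat corn price rise".split(), "grain"),
    ("wheat harvest export".split(), "grain"),
    ("oil price barrel".split(), "crude"),
    ("oil export opec".split(), "crude"),
]


def tf_idf(term, doc_tokens, docs):
    """Raw term frequency times log(N / df); one common tf-idf variant."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for tokens, _ in docs if term in tokens)
    return tf * math.log(len(docs) / df) if df else 0.0


def chi_square(term, category, docs):
    """Chi-square statistic of the 2x2 term/category contingency table."""
    n = len(docs)
    # Document counts for each cell of the contingency table.
    with_in = sum(1 for t, c in docs if term in t and c == category)
    with_out = sum(1 for t, c in docs if term in t and c != category)
    without_in = sum(1 for t, c in docs if term not in t and c == category)
    without_out = n - with_in - with_out - without_in
    denom = (
        (with_in + without_in)
        * (with_out + without_out)
        * (with_in + with_out)
        * (without_in + without_out)
    )
    if denom == 0:
        return 0.0
    return n * (with_in * without_out - without_in * with_out) ** 2 / denom


def top_k_terms(docs, k):
    """Keep the k terms with the highest chi-square over all categories."""
    vocab = sorted({t for tokens, _ in docs for t in tokens})
    cats = sorted({c for _, c in docs})
    score = {t: max(chi_square(t, c, docs) for c in cats) for t in vocab}
    # sorted() is stable, so ties keep alphabetical order.
    return sorted(vocab, key=lambda t: -score[t])[:k]
```

On this toy corpus, terms that occur only in one category ("wheat", "oil") receive the highest chi-square scores, while terms spread evenly across categories ("price", "export") score zero, which is the behaviour a feature selector is meant to exploit.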


Citations
Journal ArticleDOI

Machine learning

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Journal ArticleDOI

Feature selection using an improved Chi-square for Arabic text classification

TL;DR: An improved method for Arabic text classification that employs the Chi-square feature selection (referred to, hereafter, as ImpCHI) to enhance the classification performance and outperforms other combinations in terms of precision, recall and f-measures.
Journal ArticleDOI

A Detailed Survey on Topic Modeling for Document and Short Text Data

TL;DR: A detailed survey covering the various topic modeling techniques proposed in the last decade is presented, which focuses on different strategies of extracting the topics in social media text, where the goal is to find and aggregate the topic within short texts.
Journal ArticleDOI

Multi-label Arabic text classification in Online Social Networks

TL;DR: A standard multi-label Arabic dataset is constructed using manual annotation and a semi-supervised annotation technique that can be used for short text classification, sentiment analysis, and multilabel classification and a relationship between topics published in OSNs and hate speech is found.
Journal ArticleDOI

Sarcasm classification: A novel approach by using Content Based Feature Selection Method

TL;DR: The proposed approach to classifying sarcastic text using a content-based feature selection method outperforms the existing methods in terms of Precision, Recall and F-measure on the Amazon product review dataset.
References
Journal ArticleDOI

Machine learning

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Journal ArticleDOI

Term Weighting Approaches in Automatic Text Retrieval

TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.
Journal ArticleDOI

Machine learning in automated text categorization

TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Journal ArticleDOI

Word association norms, mutual information, and lexicography

TL;DR: The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.
Journal Article

An extensive empirical study of feature selection metrics for text classification

TL;DR: An empirical comparison of twelve feature selection methods evaluated on a benchmark of 229 text classification problem instances, revealing that a new feature selection metric, called 'Bi-Normal Separation' (BNS), outperformed the others by a substantial margin in most situations and was the top single choice for all goals except precision.