A Comprehensive Survey on various Feature Selection Methods to Categorize Text Documents

Open Access Journal Article

B. S. Harish, +1 more

- 17 Apr 2017 -

International Journal of Computer Applic...

- Vol. 164, Iss: 8, pp 1-7

Abstract:

Feature selection is one of the well-known solutions to the high-dimensionality problem in text categorization, where the selection of good features (terms) plays a very important role. Feature selection is a strategy that can be used to improve categorization accuracy, effectiveness and computational efficiency. This paper presents an empirical study of the most widely used feature selection methods, viz. Term Frequency-Inverse Document Frequency (tf·idf), Information Gain (IG), Mutual Information (MI), Chi-Square (χ²), Ambiguity Measure (AM), Term Strength (TS), Term Frequency-Relevance Frequency (tf·rf) and Symbolic Feature Selection (SFS), with five different classifiers (Naive Bayes, K-Nearest Neighbor, Centroid Based Classifier, Support Vector Machine and Symbolic Classifier). Experiments are carried out on standard benchmark datasets such as Reuters-21578, 20-Newsgroups and the 4 University dataset.
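As an illustration of what one of the listed methods computes, the following is a minimal sketch of Chi-Square term scoring on a toy corpus. It is not taken from the paper: the function names `chi_square_scores` and `select_top_k`, the toy documents and the alphabetical tie-breaking rule are all assumptions made for this example. It uses the standard 2×2 contingency form of the statistic over document frequencies.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Score each term by its chi-square statistic against one category.

    docs   : list of token lists (hypothetical toy input)
    labels : list of category names, one per document
    target : the category the terms are scored against
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos

    # Document frequency of each term inside and outside the target category.
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (df_pos if y == target else df_neg).update(set(tokens))

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # in category, contains term
        b = df_neg[term]   # outside category, contains term
        c = n_pos - a      # in category, lacks term
        d = n_neg - b      # outside category, lacks term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # A zero denominator means the term (or category) never varies.
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus, invented for this sketch.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export", "price"],
    ["goal", "match", "price"],
    ["match", "team", "goal"],
]
labels = ["grain", "grain", "sport", "sport"]

scores = chi_square_scores(docs, labels, target="grain")
top_terms = select_top_k(scores, 3)
```

Note that the statistic measures dependence in either direction, so a term that appears only outside the target category can score as highly as one that appears only inside it.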

Citations

Journal Article

Machine learning

Thomas G. Dietterich

- 01 Dec 1996 -

ACM Computing Surveys

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.

Journal Article

Feature selection using an improved Chi-square for Arabic text classification

Said Bahassine, +3 more

- 01 Feb 2020 -

Journal of King Saud University - Comput...

TL;DR: An improved method for Arabic text classification that employs the Chi-square feature selection (referred to, hereafter, as ImpCHI) to enhance the classification performance and outperforms other combinations in terms of precision, recall and f-measures.

Journal Article

A Detailed Survey on Topic Modeling for Document and Short Text Data

S. Likhitha

- 19 Aug 2019 -

International Journal of Computer Applic...

TL;DR: A detailed survey covering the various topic modeling techniques proposed in the last decade is presented, focusing on different strategies for extracting topics from social media text, where the goal is to find and aggregate the topics within short texts.

Journal Article

Multi-label Arabic text classification in Online Social Networks

Ahmed Omar, +4 more

- 01 Sep 2021 -

Information Systems

TL;DR: A standard multi-label Arabic dataset is constructed using manual annotation and a semi-supervised annotation technique that can be used for short text classification, sentiment analysis, and multilabel classification and a relationship between topics published in OSNs and hate speech is found.

Journal Article

Sarcasm classification: A novel approach by using Content Based Feature Selection Method

H. M. Keerthi Kumar, +1 more

- 01 Jan 2018 -

Procedia Computer Science

TL;DR: The proposed approach to classifying sarcastic text using a content based feature selection method outperforms the existing methods in terms of Precision, Recall and F-measure on the Amazon product review dataset.

References

Journal Article

Machine learning

Thomas G. Dietterich

- 01 Dec 1996 -

ACM Computing Surveys

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.

Journal Article

Term Weighting Approaches in Automatic Text Retrieval

Gerard Salton, +1 more

- 01 Aug 1988 -

Information Processing and Management

TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.

Journal Article

Machine learning in automated text categorization

Fabrizio Sebastiani

- 01 Mar 2002 -

ACM Computing Surveys

TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

Journal Article

Word association norms, mutual information, and lexicography

Kenneth Church, +1 more

- 01 Mar 1990 -

Computational Linguistics

TL;DR: The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.

Journal Article

An extensive empirical study of feature selection metrics for text classification

George Forman

- 01 Mar 2003 -

Journal of Machine Learning Research

TL;DR: An empirical comparison of twelve feature selection methods evaluated on a benchmark of 229 text classification problem instances, revealing that a new feature selection metric, called 'Bi-Normal Separation' (BNS), outperformed the others by a substantial margin in most situations and was the top single choice for all goals except precision.
