Journal ArticleDOI

A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm

Harun Uğuz
- 01 Oct 2011 - 
- Vol. 24, Iss: 7, pp 1024-1032
TLDR
A two-stage feature selection and feature extraction approach is used to improve the performance of text categorization, and the proposed model achieves high categorization effectiveness as measured by precision, recall and F-measure.
Abstract
Text categorization is widely used when organizing documents in digital form. Due to the increasing number of documents in digital form, automated text categorization has become more promising in the last ten years. A major problem of text categorization is its large number of features, most of which are irrelevant noise that can mislead the classifier. Therefore, feature selection is often used in text categorization to reduce the dimensionality of the feature space and to improve performance. In this study, two-stage feature selection and feature extraction is used to improve the performance of text categorization. In the first stage, each term within the document is ranked according to its importance for classification using the information gain (IG) method. In the second stage, the genetic algorithm (GA) and principal component analysis (PCA) feature selection and feature extraction methods are applied separately to the terms ranked in decreasing order of importance, and a dimension reduction is carried out. Thereby, during text categorization, terms of less importance are ignored, and feature selection and extraction methods are applied only to the terms of highest importance; thus, the computational time and complexity of categorization are reduced. To evaluate the effectiveness of the dimension reduction methods in the proposed model, experiments are conducted using the k-nearest neighbour (KNN) and C4.5 decision tree algorithms on the Reuters-21578 and Classic3 dataset collections. The experimental results show that the proposed model is able to achieve high categorization effectiveness as measured by precision, recall and F-measure.
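As an illustrative sketch of the first stage described above (not the paper's code), the snippet below ranks terms of a toy labelled corpus by information gain, computed as H(C) - H(C | term presence). The corpus, labels, and function names are invented for illustration; the paper applies IG to the Reuters-21578 and Classic3 collections before passing the top-ranked terms to GA or PCA.

```python
# Stage one of the two-stage model: rank each term by information gain (IG).
# Toy corpus: each document is a set of terms plus a class label.
from collections import Counter
from math import log2

docs = [
    ({"oil", "price", "market"}, "trade"),
    ({"oil", "barrel", "opec"}, "trade"),
    ({"match", "goal", "team"}, "sport"),
    ({"team", "season", "coach"}, "sport"),
]

def entropy(labels):
    """Shannon entropy H(C) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(term, corpus):
    """IG of a binary term-presence feature: H(C) - H(C | term)."""
    labels = [label for _, label in corpus]
    present = [label for terms, label in corpus if term in terms]
    absent = [label for terms, label in corpus if term not in terms]
    n = len(corpus)
    conditional = (len(present) / n) * entropy(present) \
                + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional

# Rank the vocabulary in decreasing order of importance, as in stage one;
# stage two would then apply GA or PCA only to the top-ranked terms.
vocab = set().union(*(terms for terms, _ in docs))
ranked = sorted(vocab, key=lambda t: information_gain(t, docs), reverse=True)
```

Terms whose presence perfectly predicts a class (here "oil" and "team") receive the maximum IG and rank first, while terms appearing in a single document rank lower; in the proposed model the low-ranked tail is simply discarded before the second stage.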


Citations
Proceedings ArticleDOI

An Evolutionary Based Multi-Objective Filter Approach for Feature Selection

TL;DR: A novel multi-objective algorithm based on mutual information for feature selection, called multi-objective mutual information (MOMI), which identifies a set of features with minimal redundancy and maximum relevance to the target class.
Proceedings ArticleDOI

Improved information gain feature selection method for Chinese text classification based on word embedding

TL;DR: An improved feature selection method is proposed which uses word embedding to calculate the most similar words to the current dictionary selected by IG algorithm and expand the dictionary with these words under certain regulations.
Journal ArticleDOI

Machine Learning Methods in Smart Lighting Towards Achieving User Comfort: A Survey

TL;DR: A systematic literature review from a bird’s eye view covering full-length research topics on smart lighting, including issues, implementation targets, technological solutions, and prospects and a detailed and extensive overview of emerging machine learning techniques as a key solution to overcome complex problems in smart lighting.
Proceedings ArticleDOI

A hybrid feature selection method for high-dimensional data

TL;DR: An ensemble of three different filter ranking methods, Information Gain, ReliefF and F-score, is used to reduce the dimension of datasets, and the experimental results confirm the capability of the proposed IBGSA.
Proceedings ArticleDOI

Exploring Hybrid Linguistic Feature Sets to Measure Filipino Text Readability

TL;DR: This paper used traditional (TRAD) and lexical (LEX) linguistic features, incorporating language model (LM) features, to improve the identification of readability levels of Filipino storybooks, and found that adding LM predictors to TRAD and LEX to form a hybrid feature set increased the performance of readability models trained using Logistic Regression and Support Vector Machines by 25-32%.
References
Book

Genetic algorithms in search, optimization, and machine learning

TL;DR: In this article, the authors present the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields, including computer programming and mathematics.
Book

Adaptation in natural and artificial systems

TL;DR: Foundational work in the area of adaptation and modification, which aims to mimic biological optimization, and some (non-GA) branches of AI.
Journal ArticleDOI

Induction of Decision Trees

J. R. Quinlan
- 25 Mar 1986 - 
TL;DR: In this paper, an approach to synthesizing decision trees that has been used in a variety of systems is described, one such system, ID3, is presented in detail, and a reported shortcoming of the basic algorithm is discussed.
Journal ArticleDOI

Nearest neighbor pattern classification

TL;DR: The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points, so it may be said that half the classification information in an infinite sample set is contained in the nearest neighbor.
Journal ArticleDOI

Term Weighting Approaches in Automatic Text Retrieval

TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.