Journal ArticleDOI

A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm

Harun Uğuz
01 Oct 2011 · Knowledge-Based Systems (Elsevier) · Vol. 24, Iss. 7, pp. 1024-1032
TL;DR: A two-stage feature selection and feature extraction approach is used to improve the performance of text categorization, and the proposed model achieves high categorization effectiveness as measured by precision, recall and F-measure.
Abstract: Text categorization is widely used when organizing documents in a digital form. Due to the increasing number of documents in digital form, automated text categorization has become more promising in the last ten years. A major problem of text categorization is its large number of features. Most of those are irrelevant noise that can mislead the classifier. Therefore, feature selection is often used in text categorization to reduce the dimensionality of the feature space and to improve performance. In this study, a two-stage feature selection and feature extraction approach is used to improve the performance of text categorization. In the first stage, each term within the document is ranked depending on its importance for classification using the information gain (IG) method. In the second stage, genetic algorithm (GA) and principal component analysis (PCA) feature selection and feature extraction methods are applied separately to the terms ranked in decreasing order of importance, and a dimension reduction is carried out. Thereby, during text categorization, terms of less importance are ignored, and feature selection and extraction methods are applied to the terms of highest importance; thus, the computational time and complexity of categorization is reduced. To evaluate the effectiveness of dimension reduction methods on the proposed model, experiments are conducted using the k-nearest neighbour (KNN) and C4.5 decision tree algorithms on the Reuters-21578 and Classic3 dataset collections for text categorization. The experimental results show that the proposed model is able to achieve high categorization effectiveness as measured by precision, recall and F-measure.
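
The two-stage pipeline is straightforward to prototype. Below is a minimal sketch, assuming scikit-learn (the paper does not prescribe a library): mutual information between term and class serves as the stage-one information-gain ranking, and PCA compresses the retained terms in stage two before a KNN classifier is trained. The corpus, `n_keep`, and `n_components` values are illustrative, not the paper's settings.

```python
# Minimal sketch of the two-stage IG -> PCA -> KNN pipeline (illustrative;
# mutual information stands in for the information-gain score of each term).
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy two-class corpus (the paper uses Reuters-21578 and Classic3).
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(data.data)
y = data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Stage 1: rank every term by its importance for the class variable and
# keep only the highest-ranked terms.
ig = mutual_info_classif(X_tr, y_tr, random_state=0)
n_keep = 500                               # illustrative cut-off
top = np.argsort(ig)[::-1][:n_keep]

# Stage 2: extract features from the retained terms with PCA.
pca = PCA(n_components=50).fit(X_tr[:, top].toarray())
Z_tr = pca.transform(X_tr[:, top].toarray())
Z_te = pca.transform(X_te[:, top].toarray())

# Categorize with KNN and report the F-measure, as in the paper's evaluation.
knn = KNeighborsClassifier(n_neighbors=5).fit(Z_tr, y_tr)
print("F1:", f1_score(y_te, knn.predict(Z_te)))
```
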
Citations
Journal ArticleDOI
TL;DR: A novel feature selection method using the particle swarm optimization (PSO) algorithm (FSPSOTC) solves the feature selection problem by creating a new subset of informative text features that can improve the performance of the text clustering technique and reduce the computational time.

401 citations

Journal ArticleDOI
TL;DR: In this paper, semi-supervised feature selection methods are fully investigated, and two taxonomies of these methods are presented from two different perspectives that represent the hierarchical structure of semi-supervised feature selection methods.

371 citations

Journal ArticleDOI
TL;DR: The results show that the proposed hybrid algorithm (H-FSPSOTC) improved the performance of the clustering algorithm by generating a new subset of more informative features, and it is compared with other algorithms published in the literature.
Abstract: The text clustering technique is an appropriate method used to partition a huge amount of text documents into groups. The size of the documents affects text clustering by decreasing its performance. Moreover, text documents contain sparse and uninformative features, which reduce the performance of the underlying text clustering algorithm and increase the computational time. Feature selection is a fundamental unsupervised learning technique used to select a new subset of informative text features to improve the performance of text clustering and reduce the computational time. This paper proposes a hybrid of the particle swarm optimization algorithm with genetic operators for the feature selection problem. k-means clustering is used to evaluate the effectiveness of the obtained feature subsets. The experiments were conducted using eight common text datasets with varying characteristics. The results show that the proposed hybrid algorithm (H-FSPSOTC) improved the performance of the clustering algorithm by generating a new subset of more informative features. The proposed algorithm is compared with other algorithms published in the literature. Finally, the feature selection technique enables the clustering algorithm to obtain accurate clusters.
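
The idea is easy to sketch, though the following is only an illustration and not the authors' exact H-FSPSOTC: binary particles encode feature subsets, k-means clustering quality (silhouette) scores each particle, and crossover plus bit-flip mutation act as the genetic operators. numpy and scikit-learn are assumed; the data is random stand-in for a document-term matrix.

```python
# Illustrative sketch of PSO-style feature selection with genetic operators
# (not the authors' exact H-FSPSOTC): binary particles encode feature
# subsets, and k-means clustering quality is the fitness.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

def fitness(mask, X, k=3):
    if mask.sum() < 2:                     # guard against near-empty subsets
        return -1.0
    labels = KMeans(n_clusters=k, n_init=5, random_state=0).fit_predict(X[:, mask])
    return silhouette_score(X[:, mask], labels)

def select_features(X, n_particles=10, iters=20, k=3):
    n = X.shape[1]
    swarm = rng.random((n_particles, n)) < 0.5           # binary positions
    scores = np.array([fitness(m, X, k) for m in swarm])
    best = swarm[scores.argmax()].copy()                 # global best subset
    for _ in range(iters):
        for i in range(n_particles):
            child = swarm[i].copy()
            # "Velocity" step: drift each particle toward the global best.
            pull = rng.random(n) < 0.3
            child[pull] = best[pull]
            # Genetic operators: one-point crossover with the best subset,
            # then bit-flip mutation, as in hybrid PSO/GA schemes.
            cut = rng.integers(1, n)
            child[cut:] = best[cut:]
            flip = rng.random(n) < 0.05
            child[flip] = ~child[flip]
            s = fitness(child, X, k)
            if s > scores[i]:              # keep the child only if it improves
                swarm[i], scores[i] = child, s
        best = swarm[scores.argmax()].copy()
    return best

# Usage on random data (stand-in for a TF-IDF document matrix):
X = rng.random((60, 15))
mask = select_features(X)
print("selected feature indices:", np.flatnonzero(mask))
```
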

366 citations


Cites background from "A two-stage feature selection metho..."

  • ...ch/Info/clef/, which consists of 571 words [20]....


  • ...Finally, the solution with high fitness value is the optimal solution so far [20]....


Journal ArticleDOI
01 Apr 2018 · Catena
TL;DR: In this paper, the authors investigated and compared current state-of-the-art ensemble techniques, such as AdaBoost, Bagging, and Rotation Forest, for landslide susceptibility assessment with the J48 Decision Tree (JDT) as the base classifier.
Abstract: Landslides are a manifestation of slope instability causing different kinds of damage affecting life and property. Therefore, high-performance landslide prediction models are useful to government institutions for developing strategies for landslide hazard prevention and mitigation. Development of data-mining-based algorithms shows that high-performance models can be obtained using ensemble frameworks. The primary objective of this study is to investigate and compare the use of current state-of-the-art ensemble techniques, such as AdaBoost, Bagging, and Rotation Forest, for landslide susceptibility assessment with the base classifier of J48 Decision Tree (JDT). The Guangchang district (Jiangxi province, China) was selected as the case study. Firstly, a landslide inventory map with 237 landslide locations was constructed; the landslide locations were then randomly split 70/30 between training and validation. Secondly, fifteen landslide conditioning factors were prepared, such as slope, aspect, altitude, topographic wetness index (TWI), stream power index (SPI), sediment transport index (STI), plan curvature, profile curvature, lithology, distance to faults, distance to rivers, distance to roads, land use, normalized difference vegetation index (NDVI), and rainfall. Relief-F with the 10-fold cross-validation method was applied to quantify the predictive ability of the conditioning factors and for feature selection. Using the JDT and its three ensemble techniques, a total of four landslide susceptibility models were constructed. Finally, the overall performance of the resulting models was assessed and compared using the area under the receiver operating characteristic (ROC) curve (AUC) and statistical indexes. The result showed that all landslide models have high performance (AUC > 0.8). However, the JDT with the Rotation Forest model presents the highest prediction capability (AUC = 0.855), followed by the JDT with AdaBoost (0.850), the JDT with Bagging (0.839), and the JDT alone (0.814). Therefore, the result demonstrates that the JDT with Rotation Forest is the best optimized model in this study, and it can be considered a promising method for landslide susceptibility mapping in similar cases for better accuracy.
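
Comparing a single decision tree against its ensemble versions by AUC is easy to reproduce. A minimal sketch follows, assuming scikit-learn >= 1.2: sklearn's CART DecisionTreeClassifier stands in for J48, Rotation Forest is omitted because scikit-learn does not ship it, and the data is synthetic rather than the Guangchang inventory.

```python
# Sketch comparing a decision tree with its AdaBoost and Bagging ensembles
# by AUC, in the spirit of the study (scikit-learn >= 1.2 assumed; sklearn's
# CART tree stands in for J48, and Rotation Forest is omitted because
# scikit-learn does not include it). Synthetic data, not a real inventory.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 15 "conditioning factors", 70/30 train/validation split as in the paper.
X, y = make_classification(n_samples=500, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

base = DecisionTreeClassifier(max_depth=5, random_state=0)
models = {
    "tree alone": base,
    "AdaBoost": AdaBoostClassifier(estimator=base, n_estimators=50, random_state=0),
    "Bagging": BaggingClassifier(estimator=base, n_estimators=50, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```
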

330 citations

Journal ArticleDOI
TL;DR: This paper presents an unsupervised feature selection method based on ant colony optimization, called UFSACO, which seeks to find the optimal feature subset through several iterations without using any learning algorithms.

304 citations


Cites methods from "A two-stage feature selection metho..."

  • ...Feature selection has been applied to many fields such as text categorization (Chen et al., 2006; Uğuz, 2011; Yang et al., 2011), face recognition (Kanan and Faez, 2008; Yan and Yuan, 2004), cancer classification (Guyon et al....


References
Book
01 Sep 1988
TL;DR: In this article, the authors present the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields, assuming only a minimum of computer programming and mathematics background.
Abstract: From the Publisher: This book brings together, in an informal and tutorial fashion, the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields. Major concepts are illustrated with running examples, and major algorithms are illustrated by Pascal computer programs. No prior knowledge of GAs or genetics is assumed, and only a minimum of computer programming and mathematics background is required.
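
The selection / crossover / mutation loop the book develops fits in a few lines. The following is a purely illustrative toy sketch on the one-max problem (maximize the number of 1 bits); the book's own examples are in Pascal.

```python
# Toy genetic algorithm on the one-max problem, illustrating the
# selection / crossover / mutation loop (illustrative only).
import random

random.seed(0)
N_BITS, POP, GENS = 30, 40, 60

def fitness(ind):
    return sum(ind)                        # number of 1 bits

def tournament(pop):
    a, b = random.sample(pop, 2)           # pick the fitter of two
    return a if fitness(a) >= fitness(b) else b

pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
for gen in range(GENS):
    nxt = []
    while len(nxt) < POP:
        p1, p2 = tournament(pop), tournament(pop)
        cut = random.randrange(1, N_BITS)             # one-point crossover
        child = p1[:cut] + p2[cut:]
        child = [b ^ (random.random() < 0.01) for b in child]  # mutation
        nxt.append(child)
    pop = nxt
print("best fitness:", max(map(fitness, pop)), "of", N_BITS)
```
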

52,797 citations

Book
01 Jan 1975
TL;DR: The founding work in the area of adaptation, which AI aims to mimic as biological optimization, with notes on some (non-GA) branches of AI.
Abstract: Name of the founding work in the area. Adaptation is key to survival and evolution, and evolution implicitly optimizes organisms. AI wants to mimic biological optimization: survival of the fittest; exploration and exploitation; niche finding; robustness across changing environments (mammals vs. dinosaurs); self-regulation, self-repair and self-reproduction. Some definitions of Artificial Intelligence: "Making computers do what they do in the movies"; "Making computers do what humans (currently) do best"; "Giving computers common sense; letting them make simple decisions" (do as I want, not what I say); "Anything too new to be pigeonholed". Adaptation and modification are the root of intelligence. Some (non-GA) branches of AI: expert systems (rule-based deduction).

32,573 citations

Journal ArticleDOI
TL;DR: This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, describes one such system, ID3, in detail, and discusses a reported shortcoming of the basic algorithm.
Abstract: The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.
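
ID3's key step is choosing, at each node, the attribute whose partition of the examples yields the largest information gain. A compact sketch of that computation follows (illustrative, not Quinlan's original implementation).

```python
# Sketch of ID3's split criterion: pick the attribute with the largest
# information gain (illustrative, not Quinlan's original code).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain = H(labels) - sum over values v of P(v) * H(labels | attr=v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def best_attribute(rows, labels, attrs):
    return max(attrs, key=lambda a: information_gain(rows, labels, a))

# Usage: a toy "play tennis"-style table.
rows = [
    {"outlook": "sunny", "windy": False}, {"outlook": "sunny", "windy": True},
    {"outlook": "rain",  "windy": False}, {"outlook": "rain",  "windy": True},
]
labels = ["no", "no", "yes", "yes"]
print(best_attribute(rows, labels, ["outlook", "windy"]))  # -> "outlook"
```
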

17,177 citations

Journal ArticleDOI
TL;DR: The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points; since its probability of error is at most twice the Bayes error, it may be said that half the classification information in an infinite sample set is contained in the nearest neighbor.
Abstract: The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points. This rule is independent of the underlying joint distribution on the sample points and their classifications, and hence the probability of error $R$ of such a rule must be at least as great as the Bayes probability of error $R^{\ast}$, the minimum probability of error over all decision rules taking underlying probability structure into account. However, in a large sample analysis, we will show in the $M$-category case that $R^{\ast} \leq R \leq R^{\ast}\left(2 - \frac{M R^{\ast}}{M-1}\right)$, where these bounds are the tightest possible, for all suitably smooth underlying distributions. Thus for any number of categories, the probability of error of the nearest neighbor rule is bounded above by twice the Bayes probability of error. In this sense, it may be said that half the classification information in an infinite sample set is contained in the nearest neighbor.
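
The rule itself is a few lines of code. A minimal sketch, assuming numpy and synthetic data, with the Cover-Hart bound evaluated for the two-class case:

```python
# The nearest neighbor rule: an unclassified point receives the label of its
# closest previously classified point. Cover and Hart show the large-sample
# error R obeys R* <= R <= R*(2 - M*R*/(M-1)), i.e. at most twice the Bayes
# error R*. The data below is synthetic.
import numpy as np

def nn_classify(X_train, y_train, x):
    """Assign x the label of its nearest training point (Euclidean)."""
    i = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return y_train[i]

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(nn_classify(X_train, y_train, np.array([2.5, 2.5])))  # likely 1

# Two-class (M = 2) form of the bound: R <= 2 R* (1 - R*).
R_star = 0.1
print("upper bound on 1-NN error:", R_star * (2 - 2 * R_star / 1))  # 0.18
```
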

12,243 citations

Journal ArticleDOI
TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.
Abstract: The experimental evidence accumulated over the past 20 years indicates that text indexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other more elaborate text representations. These results depend crucially on the choice of effective term weighting systems. This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.
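
Single-term weighting in the tf-idf family takes only a few lines. A sketch of one standard variant (term frequency times log(N/df)) follows; this is illustrative and not necessarily the exact scheme the paper recommends.

```python
# Sketch of single-term tf-idf weighting: weight each term by its frequency
# in the document times log(N / document frequency). One standard variant,
# not necessarily the paper's exact recommendation.
from collections import Counter
from math import log

docs = [
    "information retrieval with weighted terms",
    "term weighting approaches in text retrieval",
    "genetic algorithms for feature selection",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
df = Counter(t for doc in tokenized for t in set(doc))  # document frequency

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * log(N / df[t]) for t in tf}

for doc in tokenized:
    print({t: round(w, 2) for t, w in tfidf(doc).items()})
```
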

9,460 citations