Journal ArticleDOI

Sentiment analysis of feature ranking methods for classification accuracy

01 Nov 2017-Vol. 263, Iss: 4, pp 042011
TL;DR: Five feature ranking methods, namely document frequency, standard deviation, information gain, chi-square, and weighted log-likelihood ratio, are analyzed; feature selection of this kind helps improve text classification accuracy.
Abstract: Text pre-processing and feature selection are critical steps in text mining. Pre-processing large volumes of data is difficult because unstructured raw data must be converted into a structured format. Traditional methods of processing and weighting were slow and less accurate. To overcome this challenge, feature ranking techniques have been devised. The feature set produced by text pre-processing is fed as input to feature selection, which helps improve text classification accuracy. Of the three feature selection categories available, the filter category is the focus here. Five feature ranking methods are analyzed: document frequency, standard deviation, information gain, chi-square, and weighted log-likelihood ratio.
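Two of the filter-style rankings named in the abstract, document frequency and chi-square, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy corpus, the class labels, and the function names `document_frequency` and `chi_square` are all assumptions made for the example, and the chi-square score uses the standard 2x2 contingency form for one term against one class.

```python
from collections import Counter

def document_frequency(docs):
    """Number of documents in which each term appears at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return df

def chi_square(docs, labels, term, target):
    """Chi-square statistic between a term's presence and membership in one class."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1          # term present, target class
        elif has_term:
            b += 1          # term present, other class
        elif in_class:
            c += 1          # term absent, target class
        else:
            d += 1          # term absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# Hypothetical toy corpus: tokenized documents with sentiment labels.
docs = [["good", "movie"], ["good", "plot"], ["bad", "movie"], ["bad", "plot"]]
labels = ["pos", "pos", "neg", "neg"]

df = document_frequency(docs)
scores = {term: chi_square(docs, labels, term, "pos") for term in df}
# Rank terms by descending chi-square score, breaking ties alphabetically.
ranked = sorted(scores, key=lambda term: (-scores[term], term))
print(ranked)
```

In this toy corpus every term has the same document frequency, so document frequency alone cannot separate the sentiment-bearing terms from the neutral ones, while the chi-square score does. A filter method would keep the top-ranked terms and discard the rest before training a classifier.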
Citations
Dissertation
01 Jan 2018
TL;DR: Extensive experimental results show that the proposed automated CBL approach provides competitive accuracy and achieves, on average, a more than 5% increase in F-measure and predictive accuracy compared with state-of-the-art methods.
Abstract: The case-based learning (CBL) approach has been receiving a lot of attention in medical education as an alternative to the traditional learning environment. This student-centric teaching methodology exposes medical students to the practice of real-world scenarios. Existing systems do not support students' learning outcomes with both computer-based and experiential knowledge-based support for CBL practice. Medical literature contains textual knowledge that can serve as a valuable source for computer-based CBL practice. Designing and developing an automated CBL approach is therefore a challenging problem. To solve this problem, the text mining domain provides the basic framework for constructing domain knowledge, where feature selection is considered one of the most critical requirements for choosing appropriate features. With these facts in view, this research contributes in the following areas: (1) Feature Ranking, where we propose an innovative unified feature scoring algorithm to generate a final ranked list of features; (2) Feature Selection, where we propose an innovative threshold value selection algorithm to define a cut-off point for removing features irrelevant to the domain knowledge construction; and (3) CBL Platform, where we designed and developed an interactive CBL system, combining experiential and domain knowledge, to nurture medical students in their professional learning and development. We perform both quantitative and qualitative evaluation of (1) our proposed methodology on benchmark datasets and (2) our CBL approach.
The extensive experimental results show that our approach provides competitive accuracy and achieves (1) on average, a more than 5% increase in F-measure and predictive accuracy compared with state-of-the-art methods, and (2) a success rate of more than 70% for students’ interaction, group learning, solo learning, and improvement of clinical skills.

4 citations

Book ChapterDOI
01 Jan 1984

2 citations

Proceedings ArticleDOI
01 Dec 2018
TL;DR: A Document Retrieval based on Topic Clustering method could eliminate the query-document distance calculation function in the retrieval process, so it was hoped that the search process would run faster.
Abstract: Document retrieval aims to find documents in a collection of unstructured text that meet the user's information needs. A document retrieval system requires a search engine to perform the entire process automatically, from processing the document text in the collection, through feature selection, feature extraction, and query text processing, to searching for documents relevant to the query. Three main factors affect search engine performance: the feature selection method, the method of weighting features in the document collection, and the method of searching documents in the collection. This paper uses several methods to improve search engine performance. For feature selection, Term Frequency-Inverse Document Frequency based on Luhn's idea is used to select document features. For feature weighting, Fuzzy Gibbs Latent Dirichlet Allocation is used as the feature extraction method to weight the document features. To search for documents relevant to the query, the paper uses a Document Retrieval based on Topic Clustering method. With this method, all documents are clustered based on the feature weights obtained through feature extraction. Clusters that are relevant to the query term combinations are selected, and all documents in those clusters are displayed as search results. The results show that this method can retrieve the set of documents in the clusters that are relevant to the query. The method can therefore eliminate the query-document distance calculation in the retrieval process, so the search process is expected to run faster. Keywords: document retrieval; topic model; clustering

1 citation

References
Journal ArticleDOI
TL;DR: In this article, the authors highlight the disadvantages of this method and present the median absolute deviation, an alternative and more robust measure of dispersion that is easy to implement, and explain the procedures for calculating this indicator in SPSS and R software.

2,647 citations
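The median absolute deviation mentioned in the entry above can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions rather than the procedure from the cited article: the function names are hypothetical, the constant 1.4826 is the usual scaling that makes the statistic comparable to a standard deviation for normally distributed data, and the cutoff of 3 is an arbitrary choice for the example.

```python
from statistics import median

def median_absolute_deviation(values, scale=1.4826):
    """Scaled median of absolute deviations from the median."""
    center = median(values)
    return scale * median([abs(x - center) for x in values])

def flag_outliers(values, cutoff=3.0):
    """Values whose distance from the median exceeds cutoff * MAD (cutoff is an assumption)."""
    center = median(values)
    mad = median_absolute_deviation(values)
    if mad == 0:
        return []  # all deviations collapse to zero; nothing can be flagged
    return [x for x in values if abs(x - center) / mad > cutoff]

sample = [1, 2, 3, 4, 100]  # hypothetical data with one extreme value
print(median_absolute_deviation(sample, scale=1.0))
print(flag_outliers(sample))
```

Because both the center and the spread are medians, the single extreme value in `sample` barely moves either one, which is the robustness property the entry refers to.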

Book ChapterDOI
15 Sep 2008
TL;DR: It is shown that ensemble feature selection techniques show great promise for high-dimensional domains with small sample sizes, and provide more robust feature subsets than a single feature selection technique.
Abstract: Robustness or stability of feature selection techniques is a topic of recent interest, and is an important issue when selected feature subsets are subsequently analysed by domain experts to gain more insight into the problem modelled. In this work, we investigate the use of ensemble feature selection techniques, where multiple feature selection methods are combined to yield more robust results. We show that these techniques show great promise for high-dimensional domains with small sample sizes, and provide more robust feature subsets than a single feature selection technique. In addition, we also investigate the effect of ensemble feature selection techniques on classification performance, giving rise to a new model selection strategy.

587 citations
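One simple way to combine several feature rankings, in the spirit of the ensemble idea described in the entry above, is to average each feature's rank position across the individual rankings. The entry does not specify an aggregation scheme, so mean-rank aggregation here is an assumption, as are the function name `aggregate_rankings` and the example feature names.

```python
def aggregate_rankings(rankings):
    """Combine best-first rankings of the same features by mean rank position."""
    totals = {}
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            totals[feature] = totals.get(feature, 0) + position
    count = len(rankings)
    # Lower mean position is better; ties are broken alphabetically.
    return sorted(totals, key=lambda f: (totals[f] / count, f))

# Hypothetical rankings produced by three different selectors.
rankings = [
    ["gene_a", "gene_b", "gene_c"],
    ["gene_b", "gene_a", "gene_c"],
    ["gene_a", "gene_c", "gene_b"],
]
consensus = aggregate_rankings(rankings)
print(consensus)
```

A feature that only one selector favors is pulled down by the others, so the consensus ordering tends to vary less between data samples than any single ranking does.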

Journal ArticleDOI
01 Jan 2006
TL;DR: This paper introduces a new information gain and divergence-based feature selection method for statistical machine learning-based text categorization without relying on more complex dependence models.
Abstract: Most previous work on feature selection emphasized only the reduction of the high dimensionality of the feature space. But in cases where many features are highly redundant with each other, we must utilize other means, for example, more complex dependence models such as Bayesian network classifiers. In this paper, we introduce a new information gain and divergence-based feature selection method for statistical machine learning-based text categorization without relying on more complex dependence models. Our feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results are given on a number of datasets, showing that our feature selection method is more effective than Koller and Sahami's method [Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international conference on machine learning], which is one of the greedy feature selection methods, and than conventional information gain, which is commonly used in feature selection for text categorization. Moreover, our feature selection method sometimes improves conventional machine learning algorithms enough to outperform support vector machines, which are known to give the best classification accuracy.

330 citations
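The conventional information gain that the entry above uses as a baseline can be sketched as the drop in class entropy once a term's presence or absence is known. This shows only the conventional score, not the redundancy-aware method the entry proposes; the toy corpus and the function names `entropy` and `information_gain` are assumptions for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """Class entropy minus the expected entropy after splitting on term presence."""
    present = [y for tokens, y in zip(docs, labels) if term in tokens]
    absent = [y for tokens, y in zip(docs, labels) if term not in tokens]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional

# Hypothetical toy corpus: tokenized documents with category labels.
docs = [["good", "movie"], ["good", "plot"], ["bad", "movie"], ["bad", "plot"]]
labels = ["pos", "pos", "neg", "neg"]

gains = {term: information_gain(docs, labels, term) for term in ("good", "movie")}
print(gains)
```

Here "good" separates the two classes perfectly and receives the full one bit of gain, while "movie" appears equally in both classes and receives none. Two perfectly redundant terms would both score highly under this measure, which is the weakness the entry's method sets out to address.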

Journal ArticleDOI
TL;DR: A design for a speech recognition interface for an HIWIN robotic endoscope holder and a method is proposed for voice-to-motion calibration that compares the degree of change in the endoscope image for a voice command.
Abstract: Speech recognition is common in electronic appliances and personal services, but its use for industrial and medical purposes is rare because of the presence of motion ambiguity. For minimally invasive surgical robotic assistants, this ambiguity arises because the robotic motion is not calibrated to the camera images. This paper presents a design for a speech recognition interface for an HIWIN robotic endoscope holder. A new intentional speech control is proposed to control movement over long distances. To decrease ambiguity, a method is proposed for voice-to-motion calibration that compares the degree of change in the endoscope image for a voice command. A speech recognition algorithm is implemented on Ubuntu OS, using CMU Sphinx. The control signal is sent to the robot controller using serial-port communication through an RS232 cable. The experimental results show that the proposed intentional speech control strategy has a navigation precision of up to 3.1° of angular displacement for the endoscope. The overall system processing time, including robotic motion, is 3.22 s for ∼1.8-s speech duration. The reference image navigation range is from 2.5 mm for ∼0.5-s speech duration up to 6 mm for ∼1.8-s speech duration, using a setup with the camera tip located at a distance of 5 cm from the remote center of motion point.

92 citations

Journal ArticleDOI
Ping-Ping Wen1, Shao-Ping Shi1, Hao-Dong Xu1, Li-Na Wang1, Jian-Ding Qiu1 
TL;DR: The prediction results serve as useful resources to elucidate the mechanism of arginine or lysine methylation and facilitate hypothesis-driven experimental design and validation; the species-specific model significantly improves predictive performance compared with other general methylation prediction tools.
Abstract: As one of the most important reversible types of post-translational modification, protein methylation catalyzed by methyltransferases carries out many pivotal biological functions and takes part in many essential biological processes. Identification of methylation sites is a prerequisite for decoding methylation regulatory networks in living cells and understanding their physiological roles. Experimental methods are limited by being labor-intensive and time-consuming. In silico approaches offer a cost-effective, high-throughput way to predict potential methylation sites, but previous predictors use only a mixed model and their prediction performance is not fully satisfactory. Recently, with the increasing availability of quantitative methylation datasets in diverse species (especially in eukaryotes), there is a growing need to develop a species-specific predictor. Here, we designed a tool named PSSMe based on the information gain (IG) feature optimization method for species-specific methylation site prediction. The IG method was adopted to analyze the importance and contribution of each feature and then to select the valuable feature vector dimensions to reconstitute a new ordered feature, which was used to build the final prediction model. Our method improves prediction accuracy by about 15% compared with single features. Furthermore, our species-specific model significantly improves predictive performance compared with other general methylation prediction tools. Hence, our prediction results serve as useful resources to elucidate the mechanism of arginine or lysine methylation and facilitate hypothesis-driven experimental design and validation. Availability and Implementation: The online tool is implemented in C# and is freely available at http://bioinfo.ncu.edu.cn/PSSMe.aspx. Contact: jdqiu@ncu.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online.

65 citations