Proceedings ArticleDOI: 10.1109/ICSMC.2008.4811692

SVM ranking with backward search for feature selection in type II diabetes databases

01 Oct 2008-pp 2628-2633
Abstract: Clinical databases have accumulated large quantities of information about patients and their clinical histories. Data mining is the search for relationships and patterns within these data that could provide useful knowledge for effective decision-making. Classification analysis is one of the most widely adopted data mining techniques in healthcare applications, supporting medical diagnosis and improving the quality of patient care. Medical databases are usually high dimensional in nature, and if a training dataset contains irrelevant features (i.e., attributes), classification analysis may produce less accurate results. Data pre-processing is therefore required to prepare the data for data mining and machine learning and to increase predictive accuracy. Feature selection is a pre-processing technique commonly applied to high-dimensional data; its purposes include reducing dimensionality, removing irrelevant and redundant features, reducing the amount of data needed for learning, improving predictive accuracy, and increasing the comprehensibility of the constructed models. Much data mining research has gone into improving the predictive accuracy of classifiers through feature selection. Its importance in medical data mining is considerable, as a diagnosis could then be made with a minimum number of features. Feature selection may reduce the number of clinical measurements required while maintaining, or even enhancing, accuracy and reducing the false negative rate; in medical diagnosis, a reduction in the false negative rate can, literally, be the difference between life and death. In this paper we propose a feature selection approach for finding an optimal feature subset that enhances the classification accuracy of a Naive Bayes classifier. Experiments were conducted on the Pima Indian Diabetes Dataset to assess the effectiveness of our approach.
The results confirm that the SVM Ranking with Backward Search approach leads to promising improvements in feature selection and enhances classification accuracy.
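The pipeline described above (rank features, then search backward through subsets, scoring each candidate with a Naive Bayes classifier) can be sketched in outline. This is a minimal illustration, not the paper's implementation: the ranking criterion below uses absolute class-mean difference per feature as a stand-in for the |w| weights a linear SVM would supply, and subsets are scored by the training-set accuracy of a tiny Gaussian Naive Bayes.

```python
import math

def nb_fit(X, y, feats):
    """Gaussian Naive Bayes: per-class prior plus (mean, variance) per selected feature."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        stats = []
        for j in feats:
            vals = [r[j] for r in rows]
            mu = sum(vals) / len(vals)
            var = sum((v - mu) ** 2 for v in vals) / len(vals) + 1e-9  # avoid zero variance
            stats.append((mu, var))
        model[c] = (len(rows) / len(X), stats)
    return model

def nb_predict(model, x, feats):
    best, best_lp = None, float("-inf")
    for c, (prior, stats) in model.items():
        lp = math.log(prior)
        for (mu, var), j in zip(stats, feats):
            lp -= 0.5 * math.log(2 * math.pi * var) + (x[j] - mu) ** 2 / (2 * var)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

def rank_features(X, y):
    """Illustrative proxy for linear-SVM |w|: absolute class-mean difference per feature."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return [abs(sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg))
            for j in range(len(X[0]))]

def backward_search(X, y):
    """Drop features weakest-rank first; keep the subset with the best NB accuracy."""
    def acc(fs):
        model = nb_fit(X, y, fs)
        return sum(nb_predict(model, x, fs) == t for x, t in zip(X, y)) / len(X)
    scores = rank_features(X, y)
    order = sorted(range(len(scores)), key=lambda j: scores[j])  # weakest first
    feats = list(range(len(scores)))
    best_feats, best_acc = list(feats), acc(feats)
    for j in order[:-1]:                      # always retain the top-ranked feature
        feats.remove(j)
        a = acc(feats)
        if a >= best_acc:                     # prefer smaller subsets on ties
            best_feats, best_acc = list(feats), a
    return best_feats, best_acc
```

On a toy dataset where only the first feature is informative, the search discards the noise feature and keeps the accuracy of the full set.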

Topics: Feature selection (61%), Feature (computer vision) (59%), Statistical classification (56%)

Book ChapterDOI: 10.1016/B978-0-12-206090-8.50028-1
E.R. Davies
01 Jan 1990-
Abstract: This chapter introduces the subject of statistical pattern recognition (SPR). It starts by considering how features are defined and emphasizes that the nearest neighbor algorithm achieves error rates comparable with those of an ideal Bayes’ classifier. The concepts of an optimal number of features, representativeness of the training data, and the need to avoid overfitting to the training data are stressed. The chapter shows that methods such as the support vector machine and artificial neural networks are subject to these same training limitations, although each has its advantages. For neural networks, the multilayer perceptron architecture and back-propagation algorithm are described. The chapter distinguishes between supervised and unsupervised learning, demonstrating the advantages of the latter and showing how methods such as clustering and principal components analysis fit into the SPR framework. The chapter also defines the receiver operating characteristic, which allows an optimum balance between false positives and false negatives to be achieved.

Topics: Overfitting (60%), Artificial neural network (58%), Multilayer perceptron (58%)

1,150 Citations

Open access
Journal ArticleDOI: 10.5120/IJAIS12-450593
Abstract: Classifying data is a common task in machine learning. Data mining plays an essential role in extracting knowledge from enterprises' large operational databases. Data mining in health care is an emerging field of high importance for providing prognosis and a deeper understanding of medical data. Most data mining methods depend on a set of features that define the behaviour of the learning algorithm and directly or indirectly influence the complexity of the resulting models. Heart disease has been the leading cause of death in the world over the past 10 years, and researchers have applied several data mining techniques to its diagnosis. Diabetes is a chronic disease that occurs when the pancreas does not produce enough insulin, or when the body cannot effectively use the insulin it produces. Most of these systems have successfully employed machine learning methods such as Naive Bayes and support vector machines for classification. Support vector machines are a modern machine learning technique that has been applied successfully in many fields. Using diabetes diagnosis data, the system exhibited good accuracy: from attributes such as age, sex, blood pressure and blood sugar, it predicts the chances of a diabetic patient developing heart disease.

Topics: Naive Bayes classifier (51%)

50 Citations

Open access
01 Jan 2013-
Abstract: In the last decade there has been increasing usage of data mining techniques on medical data for discovering useful trends or patterns that are used in diagnosis and decision making. Data mining techniques such as clustering, classification, regression, association rule mining, CART (Classification and Regression Tree) are widely used in healthcare domain. Data mining algorithms, when appropriately used, are capable of improving the quality of prediction, diagnosis and disease classification. The main focus of this paper is to analyze data mining techniques required for medical data mining especially to discover locally frequent diseases such as heart ailments, lung cancer, breast cancer and so on. We evaluate the data mining techniques for finding locally frequent patterns in terms of cost, performance, speed and accuracy. We also compare data mining techniques with conventional methods.

38 Citations

Proceedings ArticleDOI: 10.1109/CIACT.2017.7977277
Narander Kumar, Sabita Khatri
01 Feb 2017-
Abstract: In recent years, the advent of the latest web and data technologies has driven massive data growth in almost every sector. Businesses and leading industries view these huge data repositories as a tool for designing future strategies and prediction models, analyzing patterns and gaining knowledge from unstructured data by applying different data mining techniques. The medical domain has become richer in terms of maintaining digital patient records related to diagnosis and treatment. These repositories can range from patients' personal data, diagnoses and treatment histories to test results, images and various scans. These terabytes of medical data are rich in quantity but weak in information, lacking robust tools to identify hidden patterns of knowledge, particularly in the medical sector. Data mining, a field with well-proven capabilities for identifying hidden patterns and extracting knowledge across research domains, is gaining popularity among researchers and scientists for generating novel and deep insights from these large biomedical datasets as well. Uncovering new biomedical and healthcare knowledge to support clinical decision-making is another dimension of data mining. A broad literature survey shows that early disease prediction is the most in-demand area of research in the healthcare sector. Because the healthcare domain is wide and diseases have differing characteristics, different techniques have their own prediction efficiencies, which can be tuned toward the most optimal configuration. In this work, the authors comprehensively compare different data classification techniques and their prediction accuracy for chronic kidney disease. They compare J48, Naive Bayes, Random Forest, SVM and k-NN classifiers using performance measures such as ROC, kappa statistics, RMSE and MAE in the WEKA tool.
The classifiers are also compared on accuracy measures such as TP rate, FP rate, precision, recall and f-measure, implemented in WEKA. Experimental results show that the random forest classifier has better classification accuracy than the others on the chronic kidney disease dataset.
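The accuracy measures named above all derive from the four cells of a binary confusion matrix. A minimal plain-Python sketch of those definitions (not tied to WEKA's implementation):

```python
def binary_metrics(y_true, y_pred):
    """TP rate, FP rate, precision and f-measure from a binary confusion matrix."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn)            # TP rate = recall = sensitivity
    fpr = fp / (fp + tn)            # FP rate = 1 - specificity
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)  # harmonic mean of precision and recall
    return {"tp_rate": tpr, "fp_rate": fpr, "precision": precision, "f_measure": f1}
```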

Topics: Data classification (56%), Unstructured data (54%), C4.5 algorithm (52%)

30 Citations

Open access
Journal ArticleDOI: 10.2478/AMCS-2014-0009
Abstract: The feature selection problem often occurs in pattern recognition and, more specifically, classification. Although these patterns could contain a large number of features, some of them could prove to be irrelevant, redundant or even detrimental to classification accuracy. It is therefore important to remove such features, which leads to a reduction in problem dimensionality and can eventually improve classification accuracy. In this paper an approach to dimensionality reduction based on differential evolution, which represents a wrapper and explores the solution space, is presented. The solutions, subsets of the whole feature set, are evaluated using the k-nearest neighbour algorithm. High-quality solutions found during execution of the differential evolution fill an archive. A final solution is obtained by conducting k-fold cross-validation on the archive solutions and selecting the best one. Experimental analysis was conducted on several standard test sets. The classification accuracy of the k-nearest neighbour algorithm using the full feature set is compared with the accuracy of the same algorithm using only the subset provided by the proposed approach, as well as by some other optimization algorithms used as wrappers. The analysis shows that the proposed approach successfully determines good feature subsets which may increase classification accuracy.
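The wrapper described above can be sketched compactly: differential evolution searches over real-valued vectors, each thresholded at 0.5 into a binary feature mask, and every candidate mask is scored by k-nearest-neighbour accuracy. This illustrative version simplifies the paper's design in two labelled ways: it uses 1-NN with leave-one-out scoring in place of the archive plus k-fold validation, and it seeds the population with the full feature set so the best solution found can never score worse than using all features.

```python
import random

def loo_1nn_accuracy(X, y, mask):
    """Leave-one-out 1-NN accuracy using only the features where mask is True."""
    feats = [j for j, keep in enumerate(mask) if keep]
    if not feats:
        return 0.0
    correct = 0
    for i in range(len(X)):
        best_d, best_lab = float("inf"), None
        for k in range(len(X)):
            if k == i:
                continue
            d = sum((X[i][j] - X[k][j]) ** 2 for j in feats)
            if d < best_d:
                best_d, best_lab = d, y[k]
        correct += best_lab == y[i]
    return correct / len(X)

def de_feature_select(X, y, pop_size=8, gens=20, F=0.8, CR=0.9, seed=1):
    """DE/rand/1/bin over real vectors; component > 0.5 means 'keep the feature'."""
    rng = random.Random(seed)
    dim = len(X[0])
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    pop[0] = [1.0] * dim                       # seed with the full feature set
    mask = lambda v: [c > 0.5 for c in v]
    fit = [loo_1nn_accuracy(X, y, mask(v)) for v in pop]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            # binomial crossover: with prob CR take the mutant component
            trial = [pop[a][j] + F * (pop[b][j] - pop[c][j])
                     if rng.random() < CR else pop[i][j] for j in range(dim)]
            f = loo_1nn_accuracy(X, y, mask(trial))
            if f >= fit[i]:                    # greedy one-to-one selection
                pop[i], fit[i] = trial, f
    best = max(range(pop_size), key=lambda i: fit[i])
    return mask(pop[best]), fit[best]
```

Because selection is greedy and the full mask is in the initial population, the returned fitness is guaranteed to be at least the all-features accuracy.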

26 Citations


Open access
Book
01 Jan 1973-
Abstract: Provides a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition. The topics treated include Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprocessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.

Topics: Unsupervised learning (57%), Cluster analysis (56%), Linear discriminant analysis (55%)

13,634 Citations

Open access
01 Jan 1998-

12,776 Citations

Open access
Journal ArticleDOI: 10.1016/S0004-3702(97)00043-X
Abstract: In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider how the algorithm and the training set interact. We explore the relation between optimal feature subset selection and relevance. Our wrapper method searches for an optimal feature subset tailored to a particular algorithm and a domain. We study the strengths and weaknesses of the wrapper approach and show a series of improved designs. We compare the wrapper approach to induction without feature subset selection and to Relief, a filter approach to feature subset selection. Significant improvement in accuracy is achieved for some datasets for the two families of induction algorithms used: decision trees and Naive-Bayes.
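The Relief filter used as the baseline above weighs each feature by how well it separates an instance from its nearest neighbour of the other class (the "nearest miss") relative to its nearest neighbour of the same class (the "nearest hit"), without consulting the induction algorithm at all. A minimal deterministic variant (a full pass over the instances instead of Relief's random sampling; binary classes, numeric features):

```python
def relief_weights(X, y):
    """Deterministic Relief sketch: W[j] grows when feature j separates the classes."""
    n, d = len(X), len(X[0])
    # per-feature value range, used to normalize differences into [0, 1]
    span = [max(r[j] for r in X) - min(r[j] for r in X) or 1.0 for j in range(d)]
    W = [0.0] * d
    for i in range(n):
        def nearest(same_class):
            cand = [k for k in range(n)
                    if k != i and (y[k] == y[i]) == same_class]
            return min(cand, key=lambda k: sum((X[i][j] - X[k][j]) ** 2
                                               for j in range(d)))
        hit, miss = nearest(True), nearest(False)
        for j in range(d):
            # reward distance to the miss, penalize distance to the hit
            W[j] += (abs(X[i][j] - X[miss][j]) - abs(X[i][j] - X[hit][j])) / span[j]
    return [w / n for w in W]
```

On a toy problem the informative feature receives a clearly higher weight than the uninformative one, which is the ranking a filter would hand to the learner.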

7,958 Citations

Journal ArticleDOI: 10.1126/SCIENCE.3287615
03 Jun 1988-Science
Abstract: Diagnostic systems of several kinds are used to distinguish between two classes of events, essentially "signals" and "noise". For them, analysis in terms of the "relative operating characteristic" of signal detection theory provides a precise and valid measure of diagnostic accuracy. It is the only measure available that is uninfluenced by decision biases and prior probabilities, and it places the performances of diverse systems on a common, easily interpreted scale. Representative values of this measure are reported here for systems in medical imaging, materials testing, weather forecasting, information retrieval, polygraph lie detection, and aptitude testing. Though the measure itself is sound, the values obtained from tests of diagnostic systems often require qualification because the test data on which they are based are of unsure quality. A common set of problems in testing is faced in all fields. How well these problems are handled, or can be handled in a given field, determines the degree of confidence that can be placed in a measured value of accuracy. Some fields fare much better than others.
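The operating characteristic described above is traced by sweeping a decision threshold over a diagnostic system's scores and plotting the false-positive rate against the true-positive rate; the area under the curve is the bias-free summary measure the article discusses. A small illustrative sketch (simple O(n²) threshold sweep, trapezoidal area):

```python
def roc_curve(y_true, scores):
    """(FPR, TPR) points swept from the strictest threshold to the loosest."""
    P = sum(1 for t in y_true if t == 1)
    N = len(y_true) - P
    pts = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        pts.append((fp / N, tp / P))
    return pts

def auc(pts):
    """Trapezoidal area under a piecewise-linear ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative, which is why it is independent of any particular decision bias.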

Topics: Reliability (statistics) (50%)

7,649 Citations
