
Evolutionary approaches for feature selection in biological data

01 Jan 2014
TL;DR: This dissertation develops feature selection algorithms that incorporate evolutionary strategies, applies them to find the most relevant biomarkers in high-dimensional biological datasets, and evaluates the extracted feature subsets in terms of biomedical domain knowledge and classification accuracy.
Abstract: Data mining techniques have been used widely in many areas such as business, science, engineering and medicine. These techniques allow a vast amount of data to be explored in order to extract useful information. One focus in the health area is finding interesting biomarkers in biomedical data. High-throughput data generated from microarrays and mass spectrometry of biological samples are high dimensional and small in sample size. Examples include DNA microarray datasets with up to 500,000 genes and mass spectrometry data with 300,000 m/z values. While the availability of such datasets can aid in the development of techniques/drugs to improve diagnosis and treatment of diseases, a major challenge is analysing them to extract useful and meaningful information. The aims of this project are: 1) to investigate and develop feature selection algorithms that incorporate various evolutionary strategies, 2) to use the developed algorithms to find the "most relevant" biomarkers contained in biological datasets, and 3) to evaluate the goodness of extracted feature subsets for relevance (examined in terms of existing biomedical domain knowledge and of classification accuracy obtained using different classifiers). The project aims to generate good predictive models for classifying diseased samples from controls.


Citations
Book ChapterDOI
TL;DR: An unsupervised approach for identifying significant genes from microarray gene expression datasets, using a quantum clustering method that represents the data as equations and searches for the most probable set of clusters given the available data.
Abstract: In this paper, we have implemented an unsupervised approach for finding significant genes in microarray gene expression datasets. The proposed method implements a quantum clustering approach to represent gene-expression data as equations and uses this representation to search for the most probable set of clusters given the available data. The main contribution of this approach lies in its ability to take the essential features or genes into account using clustering. Here, we present a novel clustering approach that extends ideas from scale-space clustering and support-vector clustering, used as a feature selection method. Our approach is fundamentally based on representing datapoints or features in a Hilbert space, described by a Schrödinger equation of which the probability function is a solution. This Schrödinger equation contains a potential function that is extended from the initial probability function. The minima of the potential are then treated as cluster centres, and the genes at those centres stand out as representative genes. These genes are evaluated using classifiers, and their performance is recorded over various indices of classification. The experiments show that the classification performance of the reduced set is much better than that of the entire dataset. The only free scale parameter, sigma, is then altered to obtain the highest accuracy, and the corresponding biological significance of the genes is noted.
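The potential-function idea in this abstract can be made concrete with a small sketch. The following is a hypothetical 1-D illustration of a quantum-clustering-style step (not the authors' code, and with the data and sigma invented for the example): a Parzen-window "wave function" is built from Gaussian kernels, the corresponding potential is computed up to an additive constant, and its local minima are read off as cluster centres.

```python
import math

def psi(x, data, sigma):
    """Parzen-window 'wave function': one Gaussian kernel per data point."""
    return sum(math.exp(-(x - d) ** 2 / (2 * sigma ** 2)) for d in data)

def potential(x, data, sigma):
    """Quantum-clustering potential, up to an additive constant: obtained by
    requiring psi to solve a Schrodinger equation; its minima mark centres."""
    num = sum((x - d) ** 2 / (2 * sigma ** 2)
              * math.exp(-(x - d) ** 2 / (2 * sigma ** 2)) for d in data)
    return num / psi(x, data, sigma)

# hypothetical 1-D expression values forming two groups
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
grid = [i / 10 for i in range(61)]
v = [potential(x, data, 0.5) for x in grid]
minima = [grid[i] for i in range(1, len(grid) - 1)
          if v[i] < v[i - 1] and v[i] < v[i + 1]]
print(minima)  # grid minima of the potential sit at the two group centres
```

In the feature-selection setting described above, each data point would be a gene's expression profile rather than a scalar, and the genes nearest the potential minima are kept as the representative subset.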

1 citation

Proceedings ArticleDOI
21 Feb 2016
TL;DR: A hybrid approach incorporating the Nearest Shrunken Centroid (NSC) and Memetic Algorithm (MA) is proposed to automatically search for an optimal range of shrinkage threshold values for the NSC to improve feature selection and classification accuracy.
Abstract: High-throughput technologies such as microarrays and mass spectrometry have produced high-dimensional biological datasets in abundance and with increasing complexity. Prediction Analysis for Microarrays (PAM) is a well-known implementation of the Nearest Shrunken Centroid (NSC) method which has been widely used for classification of biological data. In this paper, a hybrid approach incorporating the NSC method and a Memetic Algorithm (MA) is proposed to automatically search for an optimal range of shrinkage threshold values for the NSC, to improve feature selection and classification accuracy. The approach was evaluated on nine biological datasets, and the results showed improved feature selection stability over existing evolutionary approaches as well as improved classification accuracy.
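The shrinkage threshold being searched for can be illustrated with a minimal sketch of the soft-thresholding at the heart of NSC. This is a simplified, hypothetical version (standardization and the full class structure are omitted), not the authors' implementation:

```python
def shrink(d, delta):
    """Soft-threshold a centroid offset d toward zero by delta."""
    mag = abs(d) - delta
    return (1 if d > 0 else -1) * mag if mag > 0 else 0.0

def shrunken_offsets(offsets, delta):
    """Apply soft-thresholding to per-gene class-centroid offsets.

    A gene whose offsets shrink to zero in every class drops out of the
    classifier entirely, which is how NSC performs feature selection."""
    return [shrink(d, delta) for d in offsets]

offsets = [2.5, -0.3, 0.8, -1.5]  # hypothetical standardized offsets
print(shrunken_offsets(offsets, 1.0))  # -> [1.5, 0.0, 0.0, -0.5]
```

In the hybrid approach described above, the Memetic Algorithm would search over values of `delta`: larger thresholds zero out more genes and therefore select fewer features.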

1 citation


Cites background from "Evolutionary approaches for feature..."

  • ...The parameters that are tuned include population size, crossover probability rate, and mutation probability rate, with these values in the table, being taken from an empirical experiment (Dang, 2014)....


Book ChapterDOI
01 Jan 2021
TL;DR: A clustering-based feature selection algorithm to select the particular gene responsible for a particular disease has been proposed and compared with two other well-established feature selection techniques under three different classification approaches, in terms of accuracy, precision, recall and F-score.
Abstract: Genes are the blueprint for all activities of living systems, helping them to sustain a stable life cycle under normal conditions. Any mistake in genetic regulation can disturb this synchronous activity and cause disease. Identifying the particular disease-causing genes is therefore a significant research area in bioinformatics. In this paper, we propose a clustering-based feature selection algorithm to select the particular genes responsible for a particular disease. We use a well-established clustering algorithm, mean shift clustering, for this purpose. Mathematically, each cluster represents genes with characteristics different from the genes in other clusters. From each cluster we fetch only the cluster centres and test our model on the dataset with reduced dimension. We opted for a density-based approach for its ability to predict the number of clusters by itself. Our algorithm is evaluated on publicly available benchmark datasets and compared with two other well-established feature selection techniques under three different classification approaches, in terms of accuracy, precision, recall and F-score. Our proposed algorithm performed well in most of the cases.
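A minimal sketch of the mean-shift iteration this abstract relies on, run on hypothetical 1-D gene scores (the real method runs in the full expression space and uses the resulting cluster centres as the selected features):

```python
import math

def mean_shift_point(x, data, bandwidth, iters=50):
    """Follow one mean-shift trajectory: repeatedly move x to the
    Gaussian-kernel weighted mean of the data until it settles at a
    density mode."""
    for _ in range(iters):
        w = [math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data]
        x = sum(wi * di for wi, di in zip(w, data)) / sum(w)
    return x

# hypothetical 1-D gene scores forming two groups
data = [1.9, 2.0, 2.1, 6.9, 7.0, 7.1]
modes = sorted({round(mean_shift_point(d, data, 0.5), 3) for d in data})
print(modes)  # one density mode per group
```

Note that the number of modes, and hence the number of clusters, falls out of the density itself rather than being supplied by the user, which is the property the authors cite for choosing a density-based approach.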
References
Book
01 Sep 1988
TL;DR: In this article, the authors present the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields, including computer programming and mathematics.
Abstract: From the Publisher: This book brings together, in an informal and tutorial fashion, the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields. Major concepts are illustrated with running examples, and major algorithms are illustrated by Pascal computer programs. No prior knowledge of GAs or genetics is assumed, and only a minimum of computer programming and mathematics background is required.
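As a rough illustration of the kind of generational loop the book describes (here in Python rather than Pascal; the operator choices and parameter values are illustrative, not taken from the book):

```python
import random

def one_max(bits):
    """Toy fitness: number of 1-bits; a GA should drive this upward."""
    return sum(bits)

def evolve(pop, fitness, pc=0.9, pm=0.01, rng=random):
    """One generation of a simple GA: binary tournament selection,
    one-point crossover, bit-flip mutation."""
    def select():
        a, b = rng.sample(pop, 2)
        return list(a if fitness(a) >= fitness(b) else b)
    nxt = []
    while len(nxt) < len(pop):
        p1, p2 = select(), select()
        if rng.random() < pc:  # one-point crossover
            cut = rng.randrange(1, len(p1))
            p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        for child in (p1, p2):  # independent bit-flip mutation
            nxt.append([b ^ (rng.random() < pm) for b in child])
    return nxt[:len(pop)]

rng = random.Random(1)
pop = [[rng.randrange(2) for _ in range(20)] for _ in range(30)]
for _ in range(40):
    pop = evolve(pop, one_max, rng=rng)
best = max(one_max(ind) for ind in pop)
print(best)  # best fitness after 40 generations, typically at or near 20
```

The same loop structure underlies the evolutionary feature selection discussed in the dissertation, with a bitstring marking which features are kept and the fitness replaced by a classifier-based score.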

52,797 citations

Journal ArticleDOI
TL;DR: This paper suggests a non-dominated sorting-based MOEA, called NSGA-II (Non-dominated Sorting Genetic Algorithm II), which alleviates all of the above three difficulties, and modify the definition of dominance in order to solve constrained multi-objective problems efficiently.
Abstract: Multi-objective evolutionary algorithms (MOEAs) that use non-dominated sorting and sharing have been criticized mainly for: (1) their O(MN³) computational complexity (where M is the number of objectives and N is the population size); (2) their non-elitism approach; and (3) the need to specify a sharing parameter. In this paper, we suggest a non-dominated sorting-based MOEA, called NSGA-II (Non-dominated Sorting Genetic Algorithm II), which alleviates all of the above three difficulties. Specifically, a fast non-dominated sorting approach with O(MN²) computational complexity is presented. Also, a selection operator is presented that creates a mating pool by combining the parent and offspring populations and selecting the best N solutions (with respect to fitness and spread). Simulation results on difficult test problems show that NSGA-II is able, for most problems, to find a much better spread of solutions and better convergence near the true Pareto-optimal front compared to the Pareto-archived evolution strategy and the strength-Pareto evolutionary algorithm - two other elitist MOEAs that pay special attention to creating a diverse Pareto-optimal front. Moreover, we modify the definition of dominance in order to solve constrained multi-objective problems efficiently. Simulation results of the constrained NSGA-II on a number of test problems, including a five-objective, seven-constraint nonlinear problem, are compared with another constrained multi-objective optimizer, and the much better performance of NSGA-II is observed.
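The fast non-dominated sorting step can be sketched as follows. This is an illustrative reimplementation of the idea for minimization problems (crowding distance omitted), not the paper's reference code: each pair of points is compared once to fill domination counts and dominated-sets, then fronts are peeled off.

```python
def dominates(a, b):
    """a dominates b (minimization): no worse in every objective,
    strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(points):
    """Fast non-dominated sorting in the spirit of NSGA-II."""
    n = len(points)
    dominated_by = [[] for _ in range(n)]  # S_p: indices that p dominates
    count = [0] * n                        # n_p: how many points dominate p
    for i in range(n):
        for j in range(n):
            if dominates(points[i], points[j]):
                dominated_by[i].append(j)
            elif dominates(points[j], points[i]):
                count[i] += 1
    fronts, current = [], [i for i in range(n) if count[i] == 0]
    while current:                         # peel fronts off one by one
        fronts.append(current)
        nxt = []
        for i in current:
            for j in dominated_by[i]:
                count[j] -= 1
                if count[j] == 0:
                    nxt.append(j)
        current = nxt
    return fronts

pts = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
print(non_dominated_sort(pts))  # -> [[0, 1, 2], [3], [4]]
```

The pairwise comparison loop performs O(N²) dominance checks of M objectives each, matching the O(MN²) complexity claimed in the abstract.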

37,111 citations


"Evolutionary approaches for feature..." refers background or methods in this paper

  • ...Motivated by 1) the effectiveness of MOEA (NSGA2) in its potential to find multiple solutions, 2) the NSC algorithm in FS and classification, and 3) the automated shrinkage threshold optimization in NSC-GA, a hybrid approach incorporating NSGA2 (Deb et al., 2002) and NSC algorithm (Tibshirani et al....


  • List-of-figures excerpts referring to Deb et al. (2002): Figure 8-6 (crowded tournament selection algorithm), Figure 8-7 (crowding distance algorithm), Figure 8-8 (non-dominated sorting procedure), and Figure 8-9 (steps for generating the new population from the combined population).

  • ...improve the performance of the algorithm and avoid losing good solutions, and not using a sharing parameter provided by the user (Deb et al., 2002)....

01 Jan 1967
TL;DR: Describes the 'k-means' procedure for partitioning an N-dimensional population into k sets on the basis of a sample; the k-means generalize the ordinary sample mean and yield partitions that are reasonably efficient in the sense of within-class variance.
Abstract: The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called 'k-means,' appears to give partitions which are reasonably efficient in the sense of within-class variance. That is, if p is the probability mass function for the population, S = {S_1, S_2, ..., S_k} is a partition of E_N, and u_i, i = 1, 2, ..., k, is the conditional mean of p over the set S_i, then W²(S) = Σ_{i=1}^{k} ∫_{S_i} |z − u_i|² dp(z) tends to be low for the partitions S generated by the method. We say 'tends to be low,' primarily because of intuitive considerations, corroborated to some extent by mathematical analysis and practical computational experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer. Possible applications include methods for similarity grouping, nonlinear prediction, approximating multivariate distributions, and nonparametric tests for independence among several variables. In addition to suggesting practical classification methods, the study of k-means has proved to be theoretically interesting. The k-means concept represents a generalization of the ordinary sample mean, and one is naturally led to study the pertinent asymptotic behavior, the object being to establish some sort of law of large numbers for the k-means. This problem is sufficiently interesting, in fact, for us to devote a good portion of this paper to it. The k-means are defined in section 2.1, and the main results which have been obtained on the asymptotic behavior are given there. The rest of section 2 is devoted to the proofs of these results. Section 3 describes several specific possible applications, and reports some preliminary results from computer experiments conducted to explore the possibilities inherent in the k-means idea. The extension to general metric spaces is indicated briefly in section 4.
The original point of departure for the work described here was a series of problems in optimal classification (MacQueen [9]) which represented special
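A minimal batch-style sketch of the procedure and its within-class variance criterion (MacQueen's original algorithm updates the means one sample at a time; the Lloyd-style batch iterations shown here, on invented 2-D points, minimize the same W²(S)):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans(points, k, iters=20, seed=0):
    """Batch k-means: assign each point to its nearest mean, recompute
    the means, repeat; return the means and the within-class variance."""
    means = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, means[i]))].append(p)
        means = [centroid(c) if c else means[i] for i, c in enumerate(clusters)]
    # within-class variance: the W^2(S) of the abstract above
    w2 = sum(min(dist2(p, m) for m in means) for p in points)
    return means, w2

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, w2 = kmeans(pts, 2)
print(sorted(means), round(w2, 3))  # two well-separated centres, low W^2
```

With two clearly separated groups as here, any choice of distinct initial means recovers the natural partition and W²(S) reduces to the sum of each group's spread about its own mean.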

24,320 citations

Book
08 Sep 2000
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining data streams, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges. * Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects. * Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields. * Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.

23,600 citations

Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. * Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects. * Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods. * Includes the downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks, in an updated, interactive interface. Algorithms in the toolkit cover: data pre-processing, classification, regression, clustering, association rules, and visualization.

20,196 citations