
Evolutionary approaches for feature selection in biological data

01 Jan 2014
TL;DR: This dissertation develops feature selection algorithms that incorporate evolutionary strategies, applies them to find the most relevant biomarkers in high-dimensional biological datasets, and evaluates the extracted feature subsets in terms of biomedical domain knowledge and classification accuracy.
Abstract: Data mining techniques have been used widely in many areas such as business, science, engineering and medicine. These techniques allow a vast amount of data to be explored in order to extract useful information. One focus in the health area is finding interesting biomarkers in biomedical data. High-throughput data generated from microarrays and mass spectrometry of biological samples are high dimensional and small in sample size. Examples include DNA microarray datasets with up to 500,000 genes and mass spectrometry data with 300,000 m/z values. While the availability of such datasets can aid in the development of techniques/drugs to improve diagnosis and treatment of diseases, a major challenge is analysing them to extract useful and meaningful information. The aims of this project are: 1) to investigate and develop feature selection algorithms that incorporate various evolutionary strategies, 2) to use the developed algorithms to find the "most relevant" biomarkers contained in biological datasets, and 3) to evaluate the goodness of extracted feature subsets for relevance (examined in terms of existing biomedical domain knowledge and of classification accuracy obtained using different classifiers). The project aims to generate good predictive models for classifying diseased samples from controls.


Citations
Book ChapterDOI
TL;DR: An unsupervised approach for identifying significant genes from microarray gene expression datasets, using a quantum clustering method that represents the data as equations and searches for the most probable set of clusters given the available data.
Abstract: In this paper, we have implemented an unsupervised approach for finding significant genes in microarray gene expression datasets. The proposed method implements a quantum clustering approach to represent gene-expression data as equations and uses this representation to search for the most probable set of clusters given the available data. The main contribution of this approach lies in its ability to take the essential features or genes into account using clustering. Here, we present a novel clustering approach that extends ideas from scale-space clustering and support-vector clustering, used as a feature selection method. Our approach is fundamentally based on representing datapoints or features in a Hilbert space, described by a Schrödinger equation of which the probability function is a solution. This Schrödinger equation contains a potential function that is extended from the initial probability function. The minima of the potential are then treated as cluster centres, and the genes at those centres stand out as representative genes. These genes are evaluated using classifiers, and their performance is recorded over various indices of classification. The experiments show that the classification performance of the reduced set is much better than that of the entire dataset. The only free scale parameter, sigma, is then altered to obtain the highest accuracy, and the corresponding biological significance of the genes is noted.
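The potential-function idea in this abstract can be made concrete with a small sketch. The following is a hypothetical 1-D illustration of a quantum-clustering-style step (not the authors' code, and with the data and sigma invented for the example): a Parzen-window "wave function" is built from Gaussian kernels, the corresponding potential is computed up to an additive constant, and its local minima are read off as cluster centres.

```python
import math

def psi(x, data, sigma):
    """Parzen-window 'wave function': one Gaussian kernel per data point."""
    return sum(math.exp(-(x - d) ** 2 / (2 * sigma ** 2)) for d in data)

def potential(x, data, sigma):
    """Quantum-clustering potential, up to an additive constant: obtained by
    requiring psi to solve a Schrodinger equation; its minima mark centres."""
    num = sum((x - d) ** 2 / (2 * sigma ** 2)
              * math.exp(-(x - d) ** 2 / (2 * sigma ** 2)) for d in data)
    return num / psi(x, data, sigma)

# hypothetical 1-D expression values forming two groups
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
grid = [i / 10 for i in range(61)]
v = [potential(x, data, 0.5) for x in grid]
minima = [grid[i] for i in range(1, len(grid) - 1)
          if v[i] < v[i - 1] and v[i] < v[i + 1]]
print(minima)  # grid minima of the potential sit at the two group centres
```

In the feature-selection setting described above, each data point would be a gene's expression profile rather than a scalar, and the genes nearest the potential minima are kept as the representative subset.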

1 citation

Proceedings ArticleDOI
21 Feb 2016
TL;DR: A hybrid approach incorporating the Nearest Shrunken Centroid (NSC) and Memetic Algorithm (MA) is proposed to automatically search for an optimal range of shrinkage threshold values for the NSC to improve feature selection and classification accuracy.
Abstract: High-throughput technologies such as microarrays and mass spectrometry have produced high-dimensional biological datasets in abundance and with increasing complexity. Prediction Analysis for Microarrays (PAM) is a well-known implementation of the Nearest Shrunken Centroid (NSC) method which has been widely used for classification of biological data. In this paper, a hybrid approach incorporating the NSC method and a Memetic Algorithm (MA) is proposed to automatically search for an optimal range of shrinkage threshold values for the NSC, to improve feature selection and classification accuracy. The approach was evaluated on nine biological datasets, and the results showed improved feature selection stability over existing evolutionary approaches as well as improved classification accuracy.
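The shrinkage threshold being searched for can be illustrated with a minimal sketch of the soft-thresholding at the heart of NSC. This is a simplified, hypothetical version (standardization and the full class structure are omitted), not the authors' implementation:

```python
def shrink(d, delta):
    """Soft-threshold a centroid offset d toward zero by delta."""
    mag = abs(d) - delta
    return (1 if d > 0 else -1) * mag if mag > 0 else 0.0

def shrunken_offsets(offsets, delta):
    """Apply soft-thresholding to per-gene class-centroid offsets.

    A gene whose offsets shrink to zero in every class drops out of the
    classifier entirely, which is how NSC performs feature selection."""
    return [shrink(d, delta) for d in offsets]

offsets = [2.5, -0.3, 0.8, -1.5]  # hypothetical standardized offsets
print(shrunken_offsets(offsets, 1.0))  # -> [1.5, 0.0, 0.0, -0.5]
```

In the hybrid approach described above, the Memetic Algorithm would search over values of `delta`: larger thresholds zero out more genes and therefore select fewer features.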

1 citation


Cites background from "Evolutionary approaches for feature..."

  • ...The parameters that are tuned include population size, crossover probability rate, and mutation probability rate, with these values in the table, being taken from an empirical experiment (Dang, 2014)....


Book ChapterDOI
01 Jan 2021
TL;DR: A clustering-based feature selection algorithm to select the particular gene responsible for a particular disease has been proposed and compared with two other well-established feature selection techniques under three different classification approaches, in terms of accuracy, precision, recall and F-score.
Abstract: Genes are the blueprint for all activities of living systems, helping them to sustain a stable life cycle under normal conditions. Any mistake in genetic regulation can disturb this synchronous activity and cause disease. Identifying the particular disease-causing genes is therefore a significant research area in bioinformatics. In this paper, we propose a clustering-based feature selection algorithm to select the particular genes responsible for a particular disease. We use a well-established clustering algorithm, mean shift clustering, for this purpose. Mathematically, each cluster represents genes with characteristics different from the genes in other clusters. From each cluster we fetch only the cluster centres and test our model on the dataset with reduced dimension. We opted for a density-based approach for its ability to predict the number of clusters by itself. Our algorithm is evaluated on publicly available benchmark datasets and compared with two other well-established feature selection techniques under three different classification approaches, in terms of accuracy, precision, recall and F-score. Our proposed algorithm performed well in most of the cases.
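A minimal sketch of the mean-shift iteration this abstract relies on, run on hypothetical 1-D gene scores (the real method runs in the full expression space and uses the resulting cluster centres as the selected features):

```python
import math

def mean_shift_point(x, data, bandwidth, iters=50):
    """Follow one mean-shift trajectory: repeatedly move x to the
    Gaussian-kernel weighted mean of the data until it settles at a
    density mode."""
    for _ in range(iters):
        w = [math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data]
        x = sum(wi * di for wi, di in zip(w, data)) / sum(w)
    return x

# hypothetical 1-D gene scores forming two groups
data = [1.9, 2.0, 2.1, 6.9, 7.0, 7.1]
modes = sorted({round(mean_shift_point(d, data, 0.5), 3) for d in data})
print(modes)  # one density mode per group
```

Note that the number of modes, and hence the number of clusters, falls out of the density itself rather than being supplied by the user, which is the property the authors cite for choosing a density-based approach.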
References
Book
01 Sep 1988
TL;DR: In this article, the authors present the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields, including computer programming and mathematics.
Abstract: From the Publisher: This book brings together, in an informal and tutorial fashion, the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields. Major concepts are illustrated with running examples, and major algorithms are illustrated by Pascal computer programs. No prior knowledge of GAs or genetics is assumed, and only a minimum of computer programming and mathematics background is required.
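As a rough illustration of the kind of generational loop the book describes (here in Python rather than Pascal; the operator choices and parameter values are illustrative, not taken from the book):

```python
import random

def one_max(bits):
    """Toy fitness: number of 1-bits; a GA should drive this upward."""
    return sum(bits)

def evolve(pop, fitness, pc=0.9, pm=0.01, rng=random):
    """One generation of a simple GA: binary tournament selection,
    one-point crossover, bit-flip mutation."""
    def select():
        a, b = rng.sample(pop, 2)
        return list(a if fitness(a) >= fitness(b) else b)
    nxt = []
    while len(nxt) < len(pop):
        p1, p2 = select(), select()
        if rng.random() < pc:  # one-point crossover
            cut = rng.randrange(1, len(p1))
            p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        for child in (p1, p2):  # independent bit-flip mutation
            nxt.append([b ^ (rng.random() < pm) for b in child])
    return nxt[:len(pop)]

rng = random.Random(1)
pop = [[rng.randrange(2) for _ in range(20)] for _ in range(30)]
for _ in range(40):
    pop = evolve(pop, one_max, rng=rng)
best = max(one_max(ind) for ind in pop)
print(best)  # best fitness after 40 generations, typically at or near 20
```

The same loop structure underlies the evolutionary feature selection discussed in the dissertation, with a bitstring marking which features are kept and the fitness replaced by a classifier-based score.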

52,797 citations

Journal ArticleDOI
TL;DR: This paper suggests a non-dominated sorting-based MOEA, called NSGA-II (Non-dominated Sorting Genetic Algorithm II), which alleviates all of the above three difficulties, and modify the definition of dominance in order to solve constrained multi-objective problems efficiently.
Abstract: Multi-objective evolutionary algorithms (MOEAs) that use non-dominated sorting and sharing have been criticized mainly for: (1) their O(MN³) computational complexity (where M is the number of objectives and N is the population size); (2) their non-elitism approach; and (3) the need to specify a sharing parameter. In this paper, we suggest a non-dominated sorting-based MOEA, called NSGA-II (Non-dominated Sorting Genetic Algorithm II), which alleviates all of the above three difficulties. Specifically, a fast non-dominated sorting approach with O(MN²) computational complexity is presented. Also, a selection operator is presented that creates a mating pool by combining the parent and offspring populations and selecting the best N solutions (with respect to fitness and spread). Simulation results on difficult test problems show that NSGA-II is able, for most problems, to find a much better spread of solutions and better convergence near the true Pareto-optimal front compared to the Pareto-archived evolution strategy and the strength-Pareto evolutionary algorithm - two other elitist MOEAs that pay special attention to creating a diverse Pareto-optimal front. Moreover, we modify the definition of dominance in order to solve constrained multi-objective problems efficiently. Simulation results of the constrained NSGA-II on a number of test problems, including a five-objective, seven-constraint nonlinear problem, are compared with another constrained multi-objective optimizer, and the much better performance of NSGA-II is observed.
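The fast non-dominated sorting step can be sketched as follows. This is an illustrative reimplementation of the idea for minimization problems (crowding distance omitted), not the paper's reference code: each pair of points is compared once to fill domination counts and dominated-sets, then fronts are peeled off.

```python
def dominates(a, b):
    """a dominates b (minimization): no worse in every objective,
    strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(points):
    """Fast non-dominated sorting in the spirit of NSGA-II."""
    n = len(points)
    dominated_by = [[] for _ in range(n)]  # S_p: indices that p dominates
    count = [0] * n                        # n_p: how many points dominate p
    for i in range(n):
        for j in range(n):
            if dominates(points[i], points[j]):
                dominated_by[i].append(j)
            elif dominates(points[j], points[i]):
                count[i] += 1
    fronts, current = [], [i for i in range(n) if count[i] == 0]
    while current:                         # peel fronts off one by one
        fronts.append(current)
        nxt = []
        for i in current:
            for j in dominated_by[i]:
                count[j] -= 1
                if count[j] == 0:
                    nxt.append(j)
        current = nxt
    return fronts

pts = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
print(non_dominated_sort(pts))  # -> [[0, 1, 2], [3], [4]]
```

The pairwise comparison loop performs O(N²) dominance checks of M objectives each, matching the O(MN²) complexity claimed in the abstract.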

37,111 citations


"Evolutionary approaches for feature..." refers background or methods in this paper

  • ...Motivated by 1) the effectiveness of MOEA (NSGA2) in its potential to find multiple solutions, 2) the NSC algorithm in FS and classification, and 3) the automated shrinkage threshold optimization in NSC-GA, a hybrid approach incorporating NSGA2 (Deb et al., 2002) and NSC algorithm (Tibshirani et al....


  • List-of-figures excerpts referring to Deb et al. (2002): Figure 8-6 (crowded tournament selection algorithm), Figure 8-7 (crowding distance algorithm), Figure 8-8 (non-dominated sorting procedure), and Figure 8-9 (steps for generating the new population from the combined population).

  • ...improve the performance of the algorithm and avoid losing good solutions, and not using a sharing parameter provided by the user (Deb et al., 2002)....

01 Jan 1967
TL;DR: Describes the 'k-means' procedure for partitioning an N-dimensional population into k sets on the basis of a sample; the k-means generalize the ordinary sample mean and yield partitions that are reasonably efficient in the sense of within-class variance.
Abstract: The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called 'k-means,' appears to give partitions which are reasonably efficient in the sense of within-class variance. That is, if p is the probability mass function for the population, S = {S_1, S_2, ..., S_k} is a partition of E_N, and u_i, i = 1, 2, ..., k, is the conditional mean of p over the set S_i, then W²(S) = Σ_{i=1}^{k} ∫_{S_i} |z − u_i|² dp(z) tends to be low for the partitions S generated by the method. We say 'tends to be low,' primarily because of intuitive considerations, corroborated to some extent by mathematical analysis and practical computational experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer. Possible applications include methods for similarity grouping, nonlinear prediction, approximating multivariate distributions, and nonparametric tests for independence among several variables. In addition to suggesting practical classification methods, the study of k-means has proved to be theoretically interesting. The k-means concept represents a generalization of the ordinary sample mean, and one is naturally led to study the pertinent asymptotic behavior, the object being to establish some sort of law of large numbers for the k-means. This problem is sufficiently interesting, in fact, for us to devote a good portion of this paper to it. The k-means are defined in section 2.1, and the main results which have been obtained on the asymptotic behavior are given there. The rest of section 2 is devoted to the proofs of these results. Section 3 describes several specific possible applications, and reports some preliminary results from computer experiments conducted to explore the possibilities inherent in the k-means idea. The extension to general metric spaces is indicated briefly in section 4.
The original point of departure for the work described here was a series of problems in optimal classification (MacQueen [9]) which represented special
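A minimal batch-style sketch of the procedure and its within-class variance criterion (MacQueen's original algorithm updates the means one sample at a time; the Lloyd-style batch iterations shown here, on invented 2-D points, minimize the same W²(S)):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans(points, k, iters=20, seed=0):
    """Batch k-means: assign each point to its nearest mean, recompute
    the means, repeat; return the means and the within-class variance."""
    means = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, means[i]))].append(p)
        means = [centroid(c) if c else means[i] for i, c in enumerate(clusters)]
    # within-class variance: the W^2(S) of the abstract above
    w2 = sum(min(dist2(p, m) for m in means) for p in points)
    return means, w2

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, w2 = kmeans(pts, 2)
print(sorted(means), round(w2, 3))  # two well-separated centres, low W^2
```

With two clearly separated groups as here, any choice of distinct initial means recovers the natural partition and W²(S) reduces to the sum of each group's spread about its own mean.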

24,320 citations

Book
08 Sep 2000
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining data streams, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges. * Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects. * Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields. * Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.

23,600 citations

Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. * Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects. * Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods. * Includes the downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks, in an updated, interactive interface. Algorithms in the toolkit cover: data pre-processing, classification, regression, clustering, association rules, and visualization.

20,196 citations