Proceedings ArticleDOI

The Impact of Representation on the Optimization of Marker Panels for Single-cell RNA Data

TL;DR: This paper presents a GA-based approach to the identification of succinct marker panels and shows that the panels identified by GAs can outperform manually curated solutions. The most relaxed representations, in which the GA can also evolve the number of genes in the panel, are the most effective, especially in the case of 0-knowledge problems.
Abstract: The increasing number of single-cell transcriptomic and single-cell RNA sequencing studies is allowing for a deeper understanding of the molecular processes underlying the normal development of an organism as well as the onset of pathologies. These studies continuously refine the functional roles of known cell populations, and provide their characterization as soon as putatively novel cell populations are detected. In order to isolate the cell populations for further tailored analysis, succinct marker panels, composed of a few cell surface proteins and clusters of differentiation molecules, must be identified. The identification of these marker panels is a challenging computational problem due to its intrinsic combinatorial nature, which makes it an NP-hard problem. Genetic Algorithms (GAs) have been successfully used in Bioinformatics and other biomedical applications to tackle combinatorial problems. We present here a GA-based approach to solve the problem of the identification of succinct marker panels. Since the performance of a GA is strictly related to the representation of the candidate solutions, we propose and compare three alternative representations, able to implicitly introduce different constraints on the search space. For each representation, we perform a fine-tuning of the parameter settings to calibrate the GA, and we show that different representations yield different performance: the most relaxed representations, in which the GA can also evolve the number of genes in the panel, turn out to be the most effective, especially in the case of 0-knowledge problems. Our results also show that the marker panels identified by GAs can outperform manually curated solutions.
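
To make the role of the representation concrete, the following is a minimal sketch of two candidate-solution encodings. The abstract does not describe the three representations in detail, so both encodings, the constants `N_GENES` and `PANEL_SIZE`, and every function name below are illustrative assumptions rather than the authors' method. The fixed-length encoding keeps the panel size constant, while the bit-mask encoding lets mutation change the number of genes in the panel.

```python
import random

N_GENES = 20     # hypothetical size of the candidate gene pool
PANEL_SIZE = 4   # hypothetical panel size for the constrained encoding


def random_fixed_panel(rng):
    """Constrained encoding: exactly PANEL_SIZE distinct gene indices."""
    return sorted(rng.sample(range(N_GENES), PANEL_SIZE))


def random_mask(rng, density=0.2):
    """Relaxed encoding: one bit per gene, so the panel size can evolve."""
    return [rng.random() < density for _ in range(N_GENES)]


def mask_to_panel(mask):
    """Decode a bit mask into the list of selected gene indices."""
    return [i for i, bit in enumerate(mask) if bit]


def mutate_fixed(panel, rng):
    """Replace one gene with a gene outside the panel; size is unchanged."""
    child = list(panel)
    outside = [g for g in range(N_GENES) if g not in child]
    child[rng.randrange(len(child))] = rng.choice(outside)
    return sorted(child)


def mutate_mask(mask, rng, rate=0.05):
    """Flip each bit independently; the panel may grow or shrink."""
    return [(not bit) if rng.random() < rate else bit for bit in mask]
```

Under these assumptions, the mutation operator of the fixed encoding can never leave the set of panels of size `PANEL_SIZE`, whereas the mask encoding searches over panels of every size, which is one way a representation can implicitly constrain or relax the search space.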
Citations
01 Aug 2016
TL;DR: This paper used massively parallel single-cell RNA profiling and optimized computational methods on a heterogeneous class of neurons, mouse retinal bipolar cells (BCs), and derived a molecular classification that identified 15 types, including all types observed previously and two novel types, one of which has a non-canonical morphology and position.
Abstract: Patterns of gene expression can be used to characterize and classify neuronal types. It is challenging, however, to generate taxonomies that fulfill the essential criteria of being comprehensive, harmonizing with conventional classification schemes, and lacking superfluous subdivisions of genuine types. To address these challenges, we used massively parallel single-cell RNA profiling and optimized computational methods on a heterogeneous class of neurons, mouse retinal bipolar cells (BCs). From a population of ∼25,000 BCs, we derived a molecular classification that identified 15 types, including all types observed previously and two novel types, one of which has a non-canonical morphology and position. We validated the classification scheme and identified dozens of novel markers using methods that match molecular expression to cell morphology. This work provides a systematic methodology for achieving comprehensive molecular classification of neurons, identifies novel neuronal types, and uncovers transcriptional differences that distinguish types within a class.

24 citations

Proceedings ArticleDOI
15 Aug 2022
TL;DR: The results show that the multi-objective optimization algorithms are better than GAs, considering both the quality and the consistency of the obtained marker panels, and point out that different representations of the candidate solutions have a relevant impact on the performance of the optimization algorithms.
Abstract: The computational analyses of single-cell data, aimed at elucidating and characterizing the functional roles of known and putative novel cell types, are enabling a thorough understanding of the processes driving cell development and pathology progression. The isolation of specific cell types is a crucial step to perform detailed analyses but requires the identification of succinct marker panels, which include genes that refer to cell surface proteins and clusters of differentiation molecules. This still represents a challenging NP-hard computational problem, which can be tackled through global optimization techniques. In this work, we formulate the marker panel identification problem as a bi-objective optimization problem, where the first objective regards the capability of the marker panels to accurately discriminate different cell types, while the second objective is related to the number of genes to include in the panel. In particular, we compared the performance of two multi-objective optimization algorithms, as well as of Genetic Algorithms (GAs) when considering only the first objective, employing two different representations for the candidate solutions. Our results show that the multi-objective optimization algorithms are better than GAs, considering both the quality and the consistency of the obtained marker panels; moreover, the collected results point out that different representations of the candidate solutions have a relevant impact on the performance of the optimization algorithms.
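
The bi-objective formulation can be illustrated with a small Pareto-dominance sketch. It assumes each candidate panel is summarized by a pair `(accuracy, n_genes)`, with accuracy to be maximized and the number of genes to be minimized; the tuple layout and function names are hypothetical and are not taken from the paper, which does not name the two multi-objective algorithms in this abstract.

```python
def dominates(a, b):
    """Return True if panel a Pareto-dominates panel b.

    Each panel is an (accuracy, n_genes) pair: higher accuracy is better,
    fewer genes is better.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(panels):
    """Keep only the panels that no other panel dominates."""
    return [p for p in panels if not any(dominates(q, p) for q in panels)]


candidates = [(0.90, 5), (0.80, 3), (0.85, 5), (0.80, 4)]
front = pareto_front(candidates)
```

Here `(0.85, 5)` is dominated by `(0.90, 5)` and `(0.80, 4)` by `(0.80, 3)`, so only the two remaining trade-offs survive. A single-objective GA that optimizes accuracy alone has no such mechanism for preferring smaller panels.
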
Proceedings ArticleDOI
15 Aug 2022
TL;DR: This paper proposes a novel fully-automatic computational pipeline, named single-cell Automatic Labeling of cell POpulations (scALPO), which leverages a Long Short-Term Memory Neural Network to assign the cell types and can label the provided clusters by simply relying on marker genes rather than gene expressions.
Abstract: The increasing number of single-cell transcriptomics and single-cell RNA sequencing studies is allowing for a deeper understanding of the molecular processes underlying the normal development of an organism, as well as the onset of pathologies. In this context, cell type annotation represents a crucial step for the analysis of single-cell RNA sequencing data, which is usually performed by means of time-consuming and possibly biased manual processes, carried out by expert biologists. Recently, alternative computational tools have been proposed to realize an automatic cell identification either based on supervised or unsupervised Machine Learning approaches. These methods typically exploit gene expression data of curated marker gene databases to associate gene expression profiles of single cells with a cell type. In this paper, we propose a novel fully-automatic computational pipeline, named single-cell Automatic Labeling of cell POpulations (scALPO), which leverages a Long Short-Term Memory Neural Network to assign the cell types. Specifically, scALPO can label the provided clusters by simply relying on marker genes rather than gene expressions. Our results, obtained by considering two different datasets, show that scALPO outperforms the most promising state-of-the-art approaches (i.e., SCSA and scType), achieving a cell type annotation more similar to the manually-created ground truth.
References
Book
01 Sep 1988
TL;DR: This book presents the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields, assuming only a minimum of computer programming and mathematics background.
Abstract: From the Publisher: This book brings together - in an informal and tutorial fashion - the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields. Major concepts are illustrated with running examples, and major algorithms are illustrated by Pascal computer programs. No prior knowledge of GAs or genetics is assumed, and only a minimum of computer programming and mathematics background is required.

52,797 citations

01 Jan 1989
TL;DR: This book brings together the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields.
Abstract: From the Publisher: This book brings together - in an informal and tutorial fashion - the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields. Major concepts are illustrated with running examples, and major algorithms are illustrated by Pascal computer programs. No prior knowledge of GAs or genetics is assumed, and only a minimum of computer programming and mathematics background is required.

33,034 citations

Book ChapterDOI
Frank Wilcoxon
TL;DR: The comparison of two treatments generally falls into one of two categories: (a) a number of unpaired replications for each of the two treatments, or (b) a number of paired comparisons leading to a series of differences, some of which may be positive and some negative.
Abstract: The comparison of two treatments generally falls into one of the following two categories: (a) we may have a number of replications for each of the two treatments, which are unpaired, or (b) we may have a number of paired comparisons leading to a series of differences, some of which may be positive and some negative. The appropriate methods for testing the significance of the differences of the means in these two cases are described in most of the textbooks on statistical methods.

12,871 citations

BookDOI
01 May 1992
TL;DR: Initially applying his concepts to simply defined artificial systems with limited numbers of parameters, Holland goes on to explore their use in the study of a wide range of complex, naturally occurring processes, concentrating on systems having multiple factors that interact in nonlinear ways.
Abstract: From the Publisher: Genetic algorithms are playing an increasingly important role in studies of complex adaptive systems, ranging from adaptive agents in economic theory to the use of machine learning techniques in the design of complex devices such as aircraft turbines and integrated circuits. Adaptation in Natural and Artificial Systems is the book that initiated this field of study, presenting the theoretical foundations and exploring applications. In its most familiar form, adaptation is a biological process, whereby organisms evolve by rearranging genetic material to survive in environments confronting them. In this now classic work, Holland presents a mathematical model that allows for the nonlinearity of such complex interactions. He demonstrates the model's universality by applying it to economics, physiological psychology, game theory, and artificial intelligence and then outlines the way in which this approach modifies the traditional views of mathematical genetics. Initially applying his concepts to simply defined artificial systems with limited numbers of parameters, Holland goes on to explore their use in the study of a wide range of complex, naturally occurring processes, concentrating on systems having multiple factors that interact in nonlinear ways. Along the way he accounts for major effects of coadaptation and coevolution: the emergence of building blocks, or schemata, that are recombined and passed on to succeeding generations to provide innovations and improvements. John H. Holland is Professor of Psychology and Professor of Electrical Engineering and Computer Science at the University of Michigan. He is also Maxwell Professor at the Santa Fe Institute and is Director of the University of Michigan/Santa Fe Institute Advanced Research Program.

12,584 citations


"The Impact of Representation on the..." refers background in this paper

  • ...GAs are a family of stochastic meta-heuristics for global optimization [21], [22], which exploit Darwinian processes to optimize a given objective function, called fitness function [23]....

    [...]

Journal ArticleDOI
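TL;DR: A statistic $U$ based on the relative ranks of two samples is proposed for testing the hypothesis $f = g$. Tables of its probabilities are computed for samples up to $n = m = 8$, and its limit distribution is shown to be normal if $m, n$ go to infinity in any arbitrary manner.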
Abstract: Let $x$ and $y$ be two random variables with continuous cumulative distribution functions $f$ and $g$. A statistic $U$ depending on the relative ranks of the $x$'s and $y$'s is proposed for testing the hypothesis $f = g$. Wilcoxon proposed an equivalent test in the Biometrics Bulletin, December, 1945, but gave only a few points of the distribution of his statistic. Under the hypothesis $f = g$ the probability of obtaining a given $U$ in a sample of $n$ $x$'s and $m$ $y$'s is the solution of a certain recurrence relation involving $n$ and $m$. Using this recurrence relation tables have been computed giving the probability of $U$ for samples up to $n = m = 8$. At this point the distribution is almost normal. From the recurrence relation explicit expressions for the mean, variance, and fourth moment are obtained. The $2r$th moment is shown to have a certain form which enabled us to prove that the limit distribution is normal if $m, n$ go to infinity in any arbitrary manner. The test is shown to be consistent with respect to the class of alternatives $f(x) > g(x)$ for every $x$.
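
The statistic and its null distribution can be sketched in a few lines. The abstract does not write out the recurrence, so the counting form used below, and the convention that $U$ counts the pairs in which a $y$ precedes an $x$, are stated here as assumptions; the function names are hypothetical and ties are assumed absent, as implied by continuous distributions.

```python
from functools import lru_cache
from math import comb


def u_statistic(xs, ys):
    """Count the (x, y) pairs in which y precedes x, i.e. y < x (no ties)."""
    return sum(1 for x in xs for y in ys if y < x)


@lru_cache(maxsize=None)
def count_arrangements(u, n, m):
    """Number of orderings of n x's and m y's whose statistic equals u.

    Assumed recurrence: condition on the largest observation. If it is an
    x, all m y's precede it (contributing m); if it is a y, it contributes 0.
    """
    if u < 0:
        return 0
    if n == 0 or m == 0:
        return 1 if u == 0 else 0
    return count_arrangements(u - m, n - 1, m) + count_arrangements(u, n, m - 1)


def null_probability(u, n, m):
    """Probability of observing U = u when f = g."""
    return count_arrangements(u, n, m) / comb(n + m, n)
```

For example, `u_statistic([3, 4], [1, 2])` counts all four pairs, the most extreme value for $n = m = 2$, and `null_probability(4, 2, 2)` gives its probability under the hypothesis $f = g$ as one ordering out of six.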

11,055 citations


"The Impact of Representation on the..." refers methods in this paper

  • ...For this purpose, we applied the Mann–Whitney U test with the Bonferroni correction [32]–[34]....

    [...]
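
As a minimal sketch of the correction mentioned in the excerpt above: the Bonferroni procedure multiplies each p-value by the number of comparisons (capping at 1) before comparing it with the significance level. The p-values, the level `alpha = 0.05`, and the function name below are made-up illustrations, not values or code from the paper.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-values, significance flags) for k comparisons."""
    k = len(p_values)
    adjusted = [min(1.0, p * k) for p in p_values]
    significant = [p_adj < alpha for p_adj in adjusted]
    return adjusted, significant


# Hypothetical p-values from three pairwise Mann-Whitney U tests.
adjusted, significant = bonferroni([0.01, 0.04, 0.5])
```

With three comparisons, a raw p-value of 0.04 no longer passes the 0.05 threshold after correction, which is the conservative behavior the procedure is designed for.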