Optimal gene selection for cell type discrimination in single cell analyses

doi:10.1101/599654

Posted Content•DOI•

Bianca Dumitrascu¹, Soledad Villar², Dustin G. Mixon³, Barbara E. Engelhardt¹

Princeton University¹, New York University², Ohio State University³

04 Apr 2019-bioRxiv (Cold Spring Harbor Laboratory)-pp 599654

TL;DR: Given single cell RNA-seq data and a set of cellular labels to discriminate, scGeneFit selects gene transcript markers that jointly optimize cell label recovery using label-aware compressive classification methods, resulting in a substantially more robust and less redundant set of markers than existing methods.

Abstract: Single-cell technologies characterize complex cell populations across multiple data modalities at unprecedented scale and resolution. Multi-omic data for single cell gene expression, in situ hybridization, or single cell chromatin states are increasingly available across diverse tissue types. When isolating specific cell types from a sample of disassociated cells or performing in situ sequencing in collections of heterogeneous cells, one challenging task is to select a small set of informative markers to identify and differentiate specific cell types or cell states as precisely as possible. Given single cell RNA-seq data and a set of cellular labels to discriminate, scGeneFit selects gene transcript markers that jointly optimize cell label recovery using label-aware compressive classification methods, resulting in a substantially more robust and less redundant set of markers than existing methods. When applied to a data set given a hierarchy of cell type labels, the markers found by our method enable the recovery of the label hierarchy through a computationally efficient and principled optimization.
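
To make the idea of jointly selecting markers for label recovery concrete, here is a minimal sketch of one way such a selection could be posed as a linear program. The abstract does not give the formulation, so the objective, the margin `eps`, the small `tie_break` penalty, the function name `select_markers`, and the toy data are all illustrative assumptions and should not be read as the scGeneFit implementation.

```python
import numpy as np
from scipy.optimize import linprog


def select_markers(X, labels, num_markers, eps=1.0, tie_break=1e-3):
    """Pick gene weights so that cells with different labels stay separated.

    Hypothetical formulation (an assumption, not taken from the abstract):
    minimise the total slack needed for every differently-labelled pair to be
    at least `eps` apart in weighted squared distance, under a weight budget.
    """
    n, d = X.shape
    pairs = [(i, k) for i in range(n) for k in range(i + 1, n)
             if labels[i] != labels[k]]
    # Per-gene squared differences for every cross-label pair: shape (m, d).
    delta = np.array([(X[i] - X[k]) ** 2 for i, k in pairs])
    m = len(pairs)

    # Variables: d gene weights followed by m slack variables.
    cost = np.concatenate([np.full(d, tie_break), np.ones(m)])
    # Margin constraints: -delta @ w - slack <= -eps.
    a_margin = np.hstack([-delta, -np.eye(m)])
    b_margin = np.full(m, -eps)
    # Budget constraint: sum of gene weights <= num_markers.
    a_budget = np.concatenate([np.ones(d), np.zeros(m)])[None, :]
    b_budget = np.array([float(num_markers)])

    res = linprog(
        cost,
        A_ub=np.vstack([a_margin, a_budget]),
        b_ub=np.concatenate([b_margin, b_budget]),
        bounds=[(0, 1)] * d + [(0, None)] * m,
        method="highs",
    )
    weights = res.x[:d]
    markers = np.argsort(weights)[::-1][:num_markers]
    return markers, weights


# Toy data: 3 cell types, 6 genes; only genes 0 and 1 carry label signal.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 5)
X = rng.uniform(-0.1, 0.1, size=(15, 6))
X[labels == 0, 0] += 5.0  # gene 0 marks type 0
X[labels == 1, 1] += 5.0  # gene 1 marks type 1
markers, weights = select_markers(X, labels, num_markers=2)
print(sorted(markers.tolist()))
```

Because the weights are chosen jointly over all label pairs, a gene that only repeats the separation already provided by another gene earns no extra weight, which is one way to read "less redundant" in the abstract.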

Citations

Posted Content•

Estimation of Wasserstein distances in the Spiked Transport Model

Jonathan Niles-Weed, Philippe Rigollet

16 Sep 2019-arXiv: Statistics Theory

TL;DR: A new statistical model, the spiked transport model, is proposed to formalize the assumption that two probability distributions differ only on a low-dimensional subspace; the accompanying minimax analysis establishes a lower bound showing that, in the absence of such structure, the plug-in estimator is nearly rate-optimal for estimating the Wasserstein distance in high dimension.

Abstract: We propose a new statistical model, the spiked transport model, which formalizes the assumption that two probability distributions differ only on a low-dimensional subspace. We study the minimax rate of estimation for the Wasserstein distance under this model and show that this low-dimensional structure can be exploited to avoid the curse of dimensionality. As a byproduct of our minimax analysis, we establish a lower bound showing that, in the absence of such structure, the plug-in estimator is nearly rate-optimal for estimating the Wasserstein distance in high dimension. We also give evidence for a statistical-computational gap and conjecture that any computationally efficient estimator is bound to suffer from the curse of dimensionality.
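
For readers unfamiliar with the plug-in estimator mentioned above, the following is a minimal sketch: it computes the Wasserstein distance between the two empirical distributions directly. It assumes equal sample sizes with uniform weights, in which case an optimal coupling is a one-to-one matching; the function name `plugin_wasserstein` and the sample points are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def plugin_wasserstein(x, y, p=2):
    """Wasserstein-p distance between two equal-size empirical samples."""
    # Pairwise transport cost |x_i - y_j|^p.
    cost = cdist(x, y, metric="euclidean") ** p
    # With uniform weights and equal sizes, the optimal plan is a permutation.
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean() ** (1.0 / p)


x = np.array([[0.0, 0.0], [1.0, 0.0]])
y = np.array([[0.0, 3.0], [1.0, 3.0]])
w2 = plugin_wasserstein(x, y)  # the two samples differ only along the second axis
print(w2)
```

In this toy case the samples differ only along one coordinate, which is the kind of low-dimensional structure the spiked transport model formalizes.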

83 citations

Cites background from "Optimal gene selection for cell typ..."

...Mathematical formulations of this problem [31, 69] analyze a model in which two groups of cells differ on a low-dimensional subspace and have independent, identical distributions orthogonal to this subspace—an instance of the model proposed in (2)....
[...]

Journal Article•DOI•

A rank-based marker selection method for high throughput scRNA-seq data.

Alexander Vargo¹, Anna C. Gilbert²

University of Michigan¹, Yale University²

23 Oct 2020-BMC Bioinformatics

TL;DR: RankCorr is a fast method with strong mathematical underpinnings that performs multi-class marker selection in an informed manner and is consistently one of the most optimal marker selection methods on scRNA-seq data.

Abstract: High throughput microfluidic protocols in single cell RNA sequencing (scRNA-seq) collect mRNA counts from up to one million individual cells in a single experiment; this enables high resolution studies of rare cell types and cell development pathways. Determining small sets of genetic markers that can identify specific cell populations is thus one of the major objectives of computational analysis of mRNA counts data. Many tools have been developed for marker selection on single cell data; most of them, however, are based on complex statistical models and handle the multi-class case in an ad-hoc manner. We introduce RankCorr, a fast method with strong mathematical underpinnings that performs multi-class marker selection in an informed manner. RankCorr proceeds by ranking the mRNA counts data before linearly separating the ranked data using a small number of genes. The step of ranking is intuitively natural for scRNA-seq data and provides a non-parametric method for analyzing count data. In addition, we present several performance measures for evaluating the quality of a set of markers when there is no known ground truth. Using these metrics, we compare the performance of RankCorr to a variety of other marker selection methods on an assortment of experimental and synthetic data sets that range in size from several thousand to one million cells. According to the metrics introduced in this work, RankCorr is consistently one of the most optimal marker selection methods on scRNA-seq data. Most methods show similar overall performance, however; thus, the speed of the algorithm is the most important consideration for large data sets (and comparing the markers selected by several methods can be fruitful). RankCorr is fast enough to easily handle the largest data sets and, as such, it is a useful tool to add into computational pipelines when dealing with high throughput scRNA-seq data.
RankCorr software is available for download at https://github.com/ahsv/RankCorr with extensive documentation.
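
The rank-then-select idea can be illustrated with a much simpler stand-in. The sketch below ranks each gene across cells and scores it by the correlation between its ranks and a one-vs-rest class indicator. This is a simplification for illustration only and is not the RankCorr algorithm; the function `rank_markers` and the toy counts are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata


def rank_markers(counts, labels, target, k):
    """Score genes for one cell type by rank correlation with its indicator."""
    # Rank each gene (column) across cells; ties receive average ranks.
    ranks = rankdata(counts, axis=0)
    ranks = ranks - ranks.mean(axis=0)
    # One-vs-rest indicator for the target cell type, centred.
    indicator = np.where(labels == target, 1.0, -1.0)
    indicator = indicator - indicator.mean()
    norm = np.linalg.norm(ranks, axis=0) * np.linalg.norm(indicator)
    # Constant genes have zero norm; give them a score of zero.
    scores = (ranks.T @ indicator) / np.where(norm > 0, norm, 1.0)
    top = np.argsort(-np.abs(scores))[:k]
    return top, scores


labels = np.array([0, 0, 0, 1, 1, 1])
counts = np.array([
    [0, 3, 4],
    [0, 1, 4],
    [1, 2, 4],
    [5, 3, 4],
    [7, 1, 4],
    [9, 2, 4],
])  # gene 0 is high in type 1; gene 1 is uninformative; gene 2 is constant
markers, scores = rank_markers(counts, labels, target=1, k=1)
print(markers, scores)
```

Working on ranks rather than raw counts makes the score insensitive to the scale of expression, which matches the non-parametric motivation given in the abstract.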

24 citations

Cites methods or result from "Optimal gene selection for cell typ..."

...Another recent method [20] defines markers in terms of their overall importance to a clustering, eschewing the notion of markers for specific cell types....
[...]
...The scGeneFit method is introduced in the preprint [20]....
[...]
...The trade-offs between efficiency and marker quality are not yet fully explored in the preprint [20], however (for example, the number of constraints required for quality should probably be related to the number of clusters; this method of random sampling also seems to deemphasize rare cell types)....
[...]
...Therefore, further considering that scGeneFit [20] is presented in a preprint (and is thus subject to significant future change), we report our scGeneFit results in Additional file 1 (Figures 2 and 3 for ZEISEL; Figures 5 and 6 for PAUL) rather than the main manuscript....
[...]

Journal Article•DOI•

Ensemble feature selection for stable biomarker identification and cancer classification from microarray expression data

Aiguo Wang, Huancheng Liu, Jing Yang, Guilin Chen

01 Jan 2022-Computers in Biology and Medicine

TL;DR: An ensemble feature selection framework is proposed to improve the discrimination and stability of the selected features in microarray data, and two aggregation strategies are developed to combine multiple feature subsets into one set.

20 citations

Posted Content•

Tree! I am no Tree! I am a Low Dimensional Hyperbolic Embedding

Rishi Sonthalia¹, Anna C. Gilbert²

University of Michigan¹, Yale University²

08 May 2020-arXiv: Learning

TL;DR: A novel fast algorithm, TreeRep, is presented that, given a $\delta$-hyperbolic metric, learns a tree structure approximating the original metric; when $\delta = 0$, TreeRep is shown analytically to recover the original tree structure exactly.

Abstract: Given data, finding a faithful low-dimensional hyperbolic embedding of the data is a key method by which we can extract hierarchical information or learn representative geometric features of the data. In this paper, we explore a new method for learning hyperbolic representations by taking a metric-first approach. Rather than determining the low-dimensional hyperbolic embedding directly, we learn a tree structure on the data. This tree structure can then be used directly to extract hierarchical information, embedded into a hyperbolic manifold using Sarkar's construction, or used as a tree approximation of the original metric. To this end, we present a novel fast algorithm TreeRep such that, given a $\delta$-hyperbolic metric (for any $\delta \geq 0$), the algorithm learns a tree structure that approximates the original metric. In the case when $\delta = 0$, we show analytically that TreeRep exactly recovers the original tree structure. We show empirically that TreeRep is not only many orders of magnitude faster than previously known algorithms, but also produces metrics with lower average distortion and higher mean average precision than most previous algorithms for learning hyperbolic embeddings, extracting hierarchical information, and approximating metrics via tree metrics.
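
The role of $\delta$ can be made concrete with the standard four-point condition, which the abstract does not spell out: for any four points, the two largest of the three pairwise distance sums differ by at most $2\delta$. The sketch below computes the smallest such $\delta$ by brute force. It is not part of TreeRep, and the function name `four_point_delta` and the two example metrics are illustrative.

```python
from itertools import combinations

import numpy as np


def four_point_delta(D):
    """Smallest delta for which the metric D satisfies the four-point condition."""
    delta = 0.0
    for x, y, z, w in combinations(range(D.shape[0]), 4):
        sums = sorted([
            D[x, y] + D[z, w],
            D[x, z] + D[y, w],
            D[x, w] + D[y, z],
        ])
        # Half the gap between the two largest sums.
        delta = max(delta, (sums[2] - sums[1]) / 2.0)
    return delta


# Shortest-path metric of a path graph on 4 nodes (a tree).
path = np.abs(np.subtract.outer(np.arange(4.0), np.arange(4.0)))
# Shortest-path metric of a 4-cycle with unit edges (not a tree).
cycle = np.array([
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
], dtype=float)
print(four_point_delta(path), four_point_delta(cycle))
```

A tree metric gives $\delta = 0$, the case in which the abstract states that the tree is recovered exactly, while the cycle gives a strictly positive value.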

20 citations

Journal Article•DOI•

A robust nonlinear low-dimensional manifold for single cell RNA-seq data

Archit Verma¹, Barbara E. Engelhardt¹

Princeton University¹

21 Jul 2020-BMC Bioinformatics

TL;DR: A nonlinear latent variable model with robust, heavy-tailed error and adaptive kernel learning to estimate low-dimensional nonlinear structure in scRNA-seq data is presented and is well suited for raw, unfiltered gene counts from high-throughput sequencing technologies for visualization, exploration, and uncertainty estimation of cell states.

Abstract: Modern developments in single-cell sequencing technologies enable broad insights into cellular state. Single-cell RNA sequencing (scRNA-seq) can be used to explore cell types, states, and developmental trajectories to broaden our understanding of cellular heterogeneity in tissues and organs. Analysis of these sparse, high-dimensional experimental results requires dimension reduction. Several methods have been developed to estimate low-dimensional embeddings for filtered and normalized single-cell data. However, methods have yet to be developed for unfiltered and unnormalized count data that estimate uncertainty in the low-dimensional space. We present a nonlinear latent variable model with robust, heavy-tailed error and adaptive kernel learning to estimate low-dimensional nonlinear structure in scRNA-seq data. Gene expression in a single cell is modeled as a noisy draw from a Gaussian process in high dimensions from low-dimensional latent positions. This model is called the Gaussian process latent variable model (GPLVM). We model residual errors with a heavy-tailed Student’s t-distribution to estimate a manifold that is robust to technical and biological noise found in normalized scRNA-seq data. We compare our approach to common dimension reduction tools across a diverse set of scRNA-seq data sets to highlight our model’s ability to enable important downstream tasks such as clustering, inferring cell developmental trajectories, and visualizing high throughput experiments on available experimental data. We show that our adaptive robust statistical approach to estimate a nonlinear manifold is well suited for raw, unfiltered gene counts from high-throughput sequencing technologies for visualization, exploration, and uncertainty estimation of cell states.
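
The generative story described in the abstract (low-dimensional latent positions, a Gaussian process mapping to each gene, heavy-tailed residuals) can be simulated in a few lines. This sketch only draws data from such a model and performs no inference; the fixed RBF kernel, the dimensions, the noise scale, and the degrees of freedom are assumptions chosen for illustration, whereas the paper describes adaptive kernel learning.

```python
import numpy as np


def rbf_kernel(Z, lengthscale=1.0):
    """Squared-exponential kernel between all pairs of latent positions."""
    sq_dist = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)


rng = np.random.default_rng(0)
n_cells, n_genes, latent_dim = 50, 20, 2

# Latent low-dimensional position for every cell.
Z = rng.normal(size=(n_cells, latent_dim))
# Small jitter on the diagonal keeps the Cholesky factorisation stable.
K = rbf_kernel(Z) + 1e-6 * np.eye(n_cells)
chol = np.linalg.cholesky(K)

# Each gene is an independent Gaussian process draw over the latent positions.
F = chol @ rng.normal(size=(n_cells, n_genes))
# Heavy-tailed Student's t residuals (3 degrees of freedom is an arbitrary choice).
Y = F + 0.1 * rng.standard_t(df=3, size=(n_cells, n_genes))
print(Y.shape)
```

Cells that are close in the latent space receive similar expression values across genes, while the Student's t noise occasionally produces large deviations of the kind a Gaussian error model would penalize heavily.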

17 citations

Cites background from "Optimal gene selection for cell typ..."

...Lower-dimensional mappings also provide convenient visualizations that lead to hypothesis generation, and inform analytic methods and future experiments [12, 13]....
[...]

References

Journal Article•DOI•

Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets

Evan Z. Macosko¹,², Anindita Basu¹,², Rahul Satija²,³, James Nemesh¹,², Karthik Shekhar², Melissa Goldman¹,², Itay Tirosh², Allison R. Bialas⁴, Nolan Kamitaki¹,², Emily M. Martersteck¹, John J. Trombetta², David A. Weitz¹, Joshua R. Sanes¹, Alex K. Shalek²,⁵,⁶, Aviv Regev²,⁶,⁷, Steven A. McCarroll¹,²

Harvard University¹, Broad Institute², New York University³, Boston Children's Hospital⁴, Ragon Institute of MGH, MIT and Harvard⁵, Massachusetts Institute of Technology⁶, Howard Hughes Medical Institute⁷

21 May 2015-Cell

TL;DR: Drop-seq profiles thousands of individual cells by separating them into nanoliter-sized aqueous droplets, associating a different barcode with each cell's RNAs, and sequencing them all together, enabling routine transcriptional profiling at single-cell resolution.

5,506 citations

Proceedings Article•

Distance Metric Learning for Large Margin Nearest Neighbor Classification

Kilian Q. Weinberger¹, John Blitzer¹, Lawrence K. Saul¹

University of Pennsylvania¹

05 Dec 2005

TL;DR: In this article, a Mahalanobis distance metric for k-NN classification is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin.

Abstract: We show how to learn a Mahalanobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification, for example achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification.

4,433 citations

Journal Article•DOI•

Massively parallel digital transcriptional profiling of single cells

Grace X.Y. Zheng, Jessica M. Terry, Phillip Belgrader, Paul Ryvkin, Zachary Bent, Ryan Wilson, Solongo B. Ziraldo, Tobias Daniel Wheeler, Geoffrey P. McDermott, Junjie Zhu, Mark T. Gregory¹, Joe Shuga, Luz Montesclaros, Jason G. Underwood², Donald A. Masquelier, Stefanie Y. Nishimura, Michael Schnall-Levin, Paul Wyatt, Christopher Hindson, Rajiv Bharadwaj, Alexander Wong, Kevin D. Ness, Lan Beppu¹, H. Joachim Deeg¹, Christopher McFarland³, Keith R. Loeb¹,², William J. Valente¹,², Nolan G. Ericson¹, Emily A. Stevens¹, Jerald P. Radich¹, Tarjei S. Mikkelsen, Benjamin J. Hindson, Jason H. Bielas

Fred Hutchinson Cancer Research Center¹, University of Washington², Seattle Cancer Care Alliance³

16 Jan 2017-Nature Communications

TL;DR: A droplet-based system that enables 3′ mRNA counting of tens of thousands of single cells per sample is described and sequence variation in the transcriptome data is used to determine host and donor chimerism at single-cell resolution from bone marrow mononuclear cells isolated from transplant patients.

Abstract: Characterizing the transcriptome of individual cells is fundamental to understanding complex biological systems. We describe a droplet-based system that enables 3′ mRNA counting of tens of thousands of single cells per sample. Cell encapsulation, of up to 8 samples at a time, takes place in ∼6 min, with ∼50% cell capture efficiency. To demonstrate the system’s technical performance, we collected transcriptome data from ∼250k single cells across 29 samples. We validated the sensitivity of the system and its ability to detect rare populations using cell lines and synthetic RNAs. We profiled 68k peripheral blood mononuclear cells to demonstrate the system’s ability to characterize large immune populations. Finally, we used sequence variation in the transcriptome data to determine host and donor chimerism at single-cell resolution from bone marrow mononuclear cells isolated from transplant patients. Single-cell gene expression analysis is challenging. This work describes a new droplet-based single cell RNA-seq platform capable of processing tens of thousands of cells across 8 independent samples in minutes, and demonstrates cellular subtypes and host–donor chimerism in transplant patients.

4,219 citations

Journal Article•DOI•

Distance Metric Learning for Large Margin Nearest Neighbor Classification

Kilian Q. Weinberger, Lawrence K. Saul

01 Dec 2009-Journal of Machine Learning Research

TL;DR: This paper shows how to learn a Mahalanobis distance metric for kNN classification from labeled examples and finds that metrics trained in this way lead to significant improvements in kNN classification.

Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner.
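
Two statements in this abstract are easy to check numerically: a Mahalanobis metric M = LᵀL gives the same squared distances as Euclidean distance after the linear map x → Lx, and the margin criterion is a hinge on the gap between a same-class neighbor and a differently labeled example. The sketch below illustrates both; the matrix `L`, the sample points, the unit margin, and the helper names are illustrative assumptions, and no metric is actually learned here.

```python
import numpy as np


def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    diff = x - y
    return float(diff @ M @ diff)


def margin_hinge(x_i, x_target, x_impostor, M, margin=1.0):
    """Hinge penalty when a differently labeled point is not at least
    `margin` farther from x_i than the same-class target neighbor."""
    d_target = mahalanobis_sq(x_i, x_target, M)
    d_impostor = mahalanobis_sq(x_i, x_impostor, M)
    return max(0.0, margin + d_target - d_impostor)


# Equivalence between the metric view and the linear-map view.
L = np.array([[2.0, 0.0], [1.0, 1.0]])
M = L.T @ L
x = np.array([1.0, 2.0])
y = np.array([0.0, 0.0])
d_metric = mahalanobis_sq(x, y, M)
d_linear = float(np.sum((L @ x - L @ y) ** 2))

# Hinge penalty under the identity metric.
identity = np.eye(2)
anchor = np.array([0.0, 0.0])
neighbor = np.array([1.0, 0.0])
loss_far = margin_hinge(anchor, neighbor, np.array([0.0, 3.0]), identity)
loss_near = margin_hinge(anchor, neighbor, np.array([0.0, 1.0]), identity)
print(d_metric, d_linear, loss_far, loss_near)
```

The penalty is zero when the differently labeled point is already well outside the margin and positive when it intrudes, and because it is a maximum of functions linear in M, it is convex in M, consistent with the convex optimization described in the abstract.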

4,157 citations

Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets

Evan Z. Macosko¹,², Anindita Basu¹,², Rahul Satija¹,³, James Nemesh¹,², Karthik Shekhar¹, Melissa Goldman¹,², Itay Tirosh¹, Allison R. Bialas⁴, Nolan Kamitaki¹,², Emily M. Martersteck², John J. Trombetta¹, David A. Weitz², Joshua R. Sanes², Alex K. Shalek¹,⁵,⁶, Aviv Regev¹,⁶,⁷, Steven A. McCarroll¹,²

Broad Institute¹, Harvard University², New York University³, Boston Children's Hospital⁴, Ragon Institute of MGH, MIT and Harvard⁵, Massachusetts Institute of Technology⁶, Howard Hughes Medical Institute⁷

01 May 2015

TL;DR: Drop-seq analyzes mRNA transcripts from thousands of individual cells simultaneously while remembering each transcript's cell of origin; applied to mouse retinal cells, it identifies 39 transcriptionally distinct cell populations, creating a molecular atlas of gene expression for known retinal cell classes and novel candidate cell subtypes.

Abstract: Cells, the basic units of biological structure and function, vary broadly in type and state. Single-cell genomics can characterize cell identity and function, but limitations of ease and scale have prevented its broad application. Here we describe Drop-seq, a strategy for quickly profiling thousands of individual cells by separating them into nanoliter-sized aqueous droplets, associating a different barcode with each cell's RNAs, and sequencing them all together. Drop-seq analyzes mRNA transcripts from thousands of individual cells simultaneously while remembering transcripts' cell of origin. We analyzed transcriptomes from 44,808 mouse retinal cells and identified 39 transcriptionally distinct cell populations, creating a molecular atlas of gene expression for known retinal cell classes and novel candidate cell subtypes. Drop-seq will accelerate biological discovery by enabling routine transcriptional profiling at single-cell resolution.
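
The phrase "remembering transcripts' cell of origin" amounts to grouping pooled reads by their cell barcode after sequencing. The sketch below shows that bookkeeping on a handful of made-up reads. The tuple layout, the molecular identifier used to collapse duplicate reads, the barcode strings, and the gene names are all hypothetical and are not taken from the abstract.

```python
from collections import defaultdict


def count_transcripts(reads):
    """Build a per-cell gene count table from pooled barcoded reads.

    Each read is a (cell_barcode, molecule_id, gene) tuple. Reads sharing all
    three fields are treated as copies of one molecule and counted once.
    """
    seen = set()
    counts = defaultdict(lambda: defaultdict(int))
    for cell_barcode, molecule_id, gene in reads:
        key = (cell_barcode, molecule_id, gene)
        if key in seen:
            continue  # duplicate read of an already counted molecule
        seen.add(key)
        counts[cell_barcode][gene] += 1
    return {cell: dict(genes) for cell, genes in counts.items()}


# Hypothetical pooled reads from two droplets.
reads = [
    ("AAAC", "u1", "Rho"),
    ("AAAC", "u1", "Rho"),  # duplicate of the read above
    ("AAAC", "u2", "Rho"),
    ("GGTT", "u1", "Opn1sw"),
    ("GGTT", "u3", "Rho"),
]
profile = count_transcripts(reads)
print(profile)
```

The resulting cell-by-gene table is the kind of input that downstream steps such as clustering or marker selection operate on.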

3,365 citations