A rank-based marker selection method for high throughput scRNA-seq data.
Alexander Vargo, Anna C. Gilbert
TL;DR: RankCorr is a fast method with strong mathematical underpinnings that performs multi-class marker selection in an informed manner and is consistently one of the best-performing marker selection methods on scRNA-seq data.
Abstract:
High throughput microfluidic protocols in single cell RNA sequencing (scRNA-seq) collect mRNA counts from up to one million individual cells in a single experiment; this enables high resolution studies of rare cell types and cell development pathways. Determining small sets of genetic markers that can identify specific cell populations is thus one of the major objectives of computational analysis of mRNA counts data. Many tools have been developed for marker selection on single cell data; most of them, however, are based on complex statistical models and handle the multi-class case in an ad-hoc manner. We introduce RankCorr, a fast method with strong mathematical underpinnings that performs multi-class marker selection in an informed manner. RankCorr proceeds by ranking the mRNA counts data before linearly separating the ranked data using a small number of genes. The step of ranking is intuitively natural for scRNA-seq data and provides a non-parametric method for analyzing count data. In addition, we present several performance measures for evaluating the quality of a set of markers when there is no known ground truth. Using these metrics, we compare the performance of RankCorr to a variety of other marker selection methods on an assortment of experimental and synthetic data sets that range in size from several thousand to one million cells. According to the metrics introduced in this work, RankCorr is consistently one of the best-performing marker selection methods on scRNA-seq data. Most methods show similar overall performance, however; thus, the speed of the algorithm is the most important consideration for large data sets (and comparing the markers selected by several methods can be fruitful). RankCorr is fast enough to easily handle the largest data sets and, as such, it is a useful tool to add into computational pipelines when dealing with high throughput scRNA-seq data. RankCorr software is available for download at
https://github.com/ahsv/RankCorr
with extensive documentation.
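The abstract describes ranking the counts and then separating the ranked data using a small number of genes. The sketch below only illustrates that general idea and is not the RankCorr algorithm or its API (see the repository above for the actual software). It assumes a simplified one-vs-rest scoring in which each gene's average ranks are correlated with a +1/-1 cluster indicator and the highest signed scores are kept. The function names (`average_ranks`, `pearson`, `select_markers`) and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def select_markers(
    counts: Sequence[Sequence[float]], labels: Sequence[str], n_markers: int
) -> Dict[str, List[int]]:
    """Pick, per cluster, the gene indices whose ranks best track membership.

    counts is a cells-by-genes table; labels holds one cluster label per cell.
    Assumption of this sketch: signed rank correlation is the score.
    """
    n_genes = len(counts[0])
    # Rank each gene (column) across all cells.
    ranked = [average_ranks([row[g] for row in counts]) for g in range(n_genes)]
    markers: Dict[str, List[int]] = {}
    for cluster in sorted(set(labels)):
        indicator = [1.0 if lab == cluster else -1.0 for lab in labels]
        scores = [pearson(col, indicator) for col in ranked]
        top = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
        markers[cluster] = sorted(top[:n_markers])
    return markers


# Toy data (made up): 6 cells, 4 genes, two clusters.
counts = [
    [9, 0, 1, 3],
    [7, 1, 0, 2],
    [8, 0, 2, 3],
    [0, 5, 1, 2],
    [1, 6, 0, 3],
    [0, 4, 2, 2],
]
labels = ["A", "A", "A", "B", "B", "B"]
print(select_markers(counts, labels, n_markers=1))
```

Because the scores depend only on ranks, any monotone rescaling of a gene's counts leaves the selection unchanged, which is one sense in which a rank-based approach is non-parametric.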
Citations
Journal Article
A machine learning method for the discovery of minimum marker gene combinations for cell type identification from single-cell RNA sequencing.
Brian D. Aevermann, Yun Zhang, Mark Novotny, Mohamed Keshk, Trygve E. Bakken, Jeremy A. Miller, Rebecca D. Hodge, Boudewijn P. F. Lelieveldt, Ed S. Lein, Richard H. Scheuermann
TL;DR: In this article, a machine learning-based marker gene selection algorithm, NS-Forest version 2.0, was proposed to identify marker genes for human brain middle temporal gyrus cell types.
Journal Article
Feature selection revisited in the single-cell era.
TL;DR: This article reviews feature selection techniques for single-cell data analysis, considers their scalability, makes general recommendations for each type of selection method, and highlights some of the challenges and future directions.
Journal Article
Single-cell manifold-preserving feature selection for detecting rare cell populations
Shaoheng Liang, Vakul Mohanty, Jinzhuang Dou, Qi Miao, Yuefan Huang, Muharrem Muftuoglu, Li Ding, Weiyi Peng, Ken Chen
TL;DR: It is found that SCMER can identify non-redundant features that sensitively delineate both common cell lineages and rare cellular states and can be used for discovering molecular features in a high-dimensional dataset, designing targeted, cost-effective assays for clinical applications and facilitating multi-modality integration.
Posted Content
SMaSH: A scalable, general marker gene identification framework for single-cell RNA sequencing and Spatial Transcriptomics
TL;DR: SMaSH is a general computational framework for extracting key marker genes from single-cell RNA sequencing data for spatial transcriptomics approaches; it characterises the given data-set better than existing, limited computational approaches for global marker gene calculation.
Journal Article
SMaSH: a scalable, general marker gene identification framework for single-cell RNA-sequencing
TL;DR: SMaSH is a general computational framework for extracting key marker genes from single-cell RNA-sequencing data which reliably characterises highly-specific and niche populations of cells in numerous different biological data-sets.
References
Journal Article
Scikit-learn: Machine Learning in Python
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Edouard Duchesnay
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Journal Article
Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Journal Article
edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.
TL;DR: edgeR is a Bioconductor software package for examining differential expression of replicated count data. It uses an overdispersed Poisson model to account for both biological and technical variability, and empirical Bayes methods to moderate the degree of overdispersion across transcripts, improving the reliability of inference.
Posted Content
Scikit-learn: Machine Learning in Python
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Andreas Müller, Joel Nothman, Gilles Louppe, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Edouard Duchesnay
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems.
Related Papers (5)
Bias, robustness and scalability in differential expression analysis of single-cell RNA-seq data
Charlotte Soneson, Robinson
Benchmarking Computational Doublet-Detection Methods for Single-Cell RNA Sequencing Data.
Nan Miles Xi, Jingyi Jessica Li