Journal Article

Challenges in unsupervised clustering of single-cell RNA-seq data.

01 May 2019 - Nature Reviews Genetics (Nature Publishing Group) - Vol. 20, Iss. 5, pp. 273-282
TL;DR: This Review discusses the multiple algorithmic options for clustering scRNA-seq data, including various technical, biological and computational considerations.
Abstract: Single-cell RNA sequencing (scRNA-seq) allows researchers to collect large catalogues detailing the transcriptomes of individual cells. Unsupervised clustering is of central importance for the analysis of these data, as it is used to identify putative cell types. However, there are many challenges involved. We discuss why clustering is a challenging problem from a computational point of view and what aspects of the data make it challenging. We also consider the difficulties related to the biological interpretation and annotation of the identified clusters.
Citations
Journal Article
TL;DR: This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years in single-cell data science.
Abstract: The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.

677 citations

Journal Article
11 Aug 2021 - Nature
TL;DR: This review describes spatial transcriptomics technologies and analysis tools. Spatial transcriptomics can be used for hypothesis testing with experimental designs that compare time points or conditions, including genetic or environmental perturbations, and is naturally amenable to integration with other data modalities, providing an expandable framework for insight into tissue organization.
Abstract: Deciphering the principles and mechanisms by which gene activity orchestrates complex cellular arrangements in multicellular organisms has far-reaching implications for research in the life sciences. Recent technological advances in next-generation sequencing- and imaging-based approaches have established the power of spatial transcriptomics to measure expression levels of all or most genes systematically throughout tissue space, and have been adopted to generate biological insights in neuroscience, development and plant biology as well as to investigate a range of disease contexts, including cancer. Similar to datasets made possible by genomic sequencing and population health surveys, the large-scale atlases generated by this technology lend themselves to exploratory data analysis for hypothesis generation. Here we review spatial transcriptomic technologies and describe the repertoire of operations available for paths of analysis of the resulting data. Spatial transcriptomics can also be deployed for hypothesis testing using experimental designs that compare time points or conditions—including genetic or environmental perturbations. Finally, spatial transcriptomic data are naturally amenable to integration with other data modalities, providing an expandable framework for insight into tissue organization. This review describes the state of spatial transcriptomics technologies and analysis tools that are being used to generate biological insights in diverse areas of biology.

358 citations

Journal Article
TL;DR: This Perspective highlights open-source software for single-cell analysis released as part of the Bioconductor project, providing an overview for users and developers.
Abstract: Recent technological advancements have enabled the profiling of a large number of genome-wide features in individual cells. However, single-cell data present unique challenges that require the development of specialized methods and software infrastructure to successfully derive biological insights. The Bioconductor project has rapidly grown to meet these demands, hosting community-developed open-source software distributed as R packages. Featuring state-of-the-art computational methods, standardized data infrastructure and interactive data visualization tools, we present an overview and online book (https://osca.bioconductor.org) of single-cell methods for prospective users.

332 citations

Journal Article
TL;DR: This review covers a suite of recently developed techniques that localize RNA within tissue, including multiplexed in situ hybridization and in situ sequencing (here defined as high-plex RNA imaging) and spatial barcoding, together with efforts to integrate them with scRNA-seq.
Abstract: Single-cell RNA sequencing (scRNA-seq) identifies cell subpopulations within tissue but does not capture their spatial distribution nor reveal local networks of intercellular communication acting in situ. A suite of recently developed techniques that localize RNA within tissue, including multiplexed in situ hybridization and in situ sequencing (here defined as high-plex RNA imaging) and spatial barcoding, can help address this issue. However, no method currently provides as complete a scope of the transcriptome as does scRNA-seq, underscoring the need for approaches to integrate single-cell and spatial data. Here, we review efforts to integrate scRNA-seq with spatial transcriptomics, including emerging integrative computational methods, and propose ways to effectively combine current methodologies.

288 citations

Journal Article
TL;DR: An atlas of collagen-producing cells in normal and fibrotic lungs provides a roadmap for studying the roles of these populations in homeostasis and pathologic fibrosis; Cthrc1-expressing fibroblasts, found in fibrotic lungs, show disease-relevant phenotypes.
Abstract: Collagen-producing cells maintain the complex architecture of the lung and drive pathologic scarring in pulmonary fibrosis. Here we perform single-cell RNA-sequencing to identify all collagen-producing cells in normal and fibrotic lungs. We characterize multiple collagen-producing subpopulations with distinct anatomical localizations in different compartments of murine lungs. One subpopulation, characterized by expression of Cthrc1 (collagen triple helix repeat containing 1), emerges in fibrotic lungs and expresses the highest levels of collagens. Single-cell RNA-sequencing of human lungs, including those from idiopathic pulmonary fibrosis and scleroderma patients, demonstrates similar heterogeneity, with CTHRC1-expressing fibroblasts present uniquely in fibrotic lungs. Immunostaining and in situ hybridization show that these cells are concentrated within fibroblastic foci. We purify collagen-producing subpopulations and find disease-relevant phenotypes of Cthrc1-expressing fibroblasts in in vitro and adoptive transfer experiments. Our atlas of collagen-producing cells provides a roadmap for studying the roles of these unique populations in homeostasis and pathologic fibrosis.

271 citations

References
Journal Article
TL;DR: The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.
Abstract: Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.

35,225 citations

Journal Article
TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large datasets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of datasets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the datasets.

30,124 citations

Book
21 Oct 1957
TL;DR: An introduction to the mathematical theory of multistage decision processes that takes a functional equation approach to the discovery of optimum policies.
Abstract: From the Publisher: An introduction to the mathematical theory of multistage decision processes, this text takes a functional equation approach to the discovery of optimum policies. Written by a leading developer of such policies, it presents a series of methods, uniqueness and existence theorems, and examples for solving the relevant equations. The text examines existence and uniqueness theorems, the optimal inventory equation, bottleneck problems in multistage production processes, a new formalism in the calculus of variations, strategies behind multistage games, and Markovian decision processes. Each chapter concludes with a problem set that Eric V. Denardo of Yale University, in his informative new introduction, calls a rich lode of applications and research topics. 1957 edition. 37 figures.

14,187 citations

Journal Article
TL;DR: This work proposes a heuristic method based on modularity optimization that is shown to outperform all other known community detection methods in terms of computation time, while the quality of the communities detected, as measured by the so-called modularity, is very good.
Abstract: We propose a simple method to extract the community structure of large networks. Our method is a heuristic method that is based on modularity optimization. It is shown to outperform all other known community detection methods in terms of computation time. Moreover, the quality of the communities detected is very good, as measured by the so-called modularity. This is shown first by identifying language communities in a Belgian mobile phone network of 2.6 million customers and by analyzing a web graph of 118 million nodes and more than one billion links. The accuracy of our algorithm is also verified on ad-hoc modular networks.

13,519 citations

Journal Article
S. P. Lloyd
TL;DR: This paper derives necessary conditions that the quanta and associated quantization intervals of an optimum finite quantization scheme must satisfy, for any finite number of quanta, when the criterion is minimum average quantization noise power.
Abstract: It has long been realized that in pulse-code modulation (PCM), with a given ensemble of signals to handle, the quantum values should be spaced more closely in the voltage regions where the signal amplitude is more likely to fall. It has been shown by Panter and Dite that, in the limit as the number of quanta becomes infinite, the asymptotic fractional density of quanta per unit voltage should vary as the one-third power of the probability density per unit voltage of signal amplitudes. In this paper the corresponding result for any finite number of quanta is derived; that is, necessary conditions are found that the quanta and associated quantization intervals of an optimum finite quantization scheme must satisfy. The optimization criterion used is that the average quantization noise power be a minimum. It is shown that the result obtained here goes over into the Panter and Dite result as the number of quanta becomes large. The optimum quantization schemes for $2^{b}$ quanta, $b = 1, 2, \cdots, 7$, are given numerically for Gaussian and for Laplacian distribution of signal amplitudes.

11,872 citations
