Challenges in unsupervised clustering of single-cell RNA-seq data.

doi:10.1038/S41576-018-0088-9

Journal Article•DOI•

Vladimir Yu. Kiselev¹, Tallulah S. Andrews¹, Martin Hemberg¹•Institutions (1)

Wellcome Trust Sanger Institute¹

01 May 2019-Nature Reviews Genetics (Nature Publishing Group)-Vol. 20, Iss: 5, pp 273-282

TL;DR: This Review discusses the multiple algorithmic options for clustering scRNA-seq data, including various technical, biological and computational considerations.

Abstract: Single-cell RNA sequencing (scRNA-seq) allows researchers to collect large catalogues detailing the transcriptomes of individual cells. Unsupervised clustering is of central importance for the analysis of these data, as it is used to identify putative cell types. However, there are many challenges involved. We discuss why clustering is a challenging problem from a computational point of view and what aspects of the data make it challenging. We also consider the difficulties related to the biological interpretation and annotation of the identified clusters.

Citations

Journal Article•DOI•

Eleven grand challenges in single-cell data science

David Lähnemann¹,², Johannes Köster³,², Ewa Szczurek⁴, Davis J. McCarthy⁵,⁶, Stephanie C. Hicks⁷, Mark D. Robinson⁸, Catalina A. Vallejos⁹,¹⁰, Kieran R Campbell¹¹,¹², Niko Beerenwinkel¹³,⁸, Ahmed Mahfouz¹⁴,¹⁵, Luca Pinello³,¹⁶, Pavel Skums¹⁷, Alexandros Stamatakis¹⁸,¹⁹, Camille Stephan Otto Attolini, Samuel Aparicio¹²,¹¹, Jasmijn A. Baaijens²⁰, Marleen Balvert²⁰,²¹, Buys de Barbanson²¹, Antonio Cappuccio²², Giacomo Corleone²³, Bas E. Dutilh²⁴,²¹, Maria Florescu²¹, Victor Guryev²⁵, Rens Holmer²⁶, Katharina Jahn⁸,¹³, Thamar Jessurun Lobo²⁵, Emma M. Keizer²⁶, Indu Khatri¹⁵, Szymon M. Kielbasa¹⁵, Jan O. Korbel, Alexey M. Kozlov¹⁹, Tzu Hao Kuo, Boudewijn P. F. Lelieveldt¹⁵,¹⁴, Ion I. Mandoiu²⁷, John C. Marioni²⁸,²⁹,³⁰, Tobias Marschall³¹,³², Felix Mölder², Amir Niknejad³³, Lukasz Raczkowski⁴, Marcel J. T. Reinders¹⁴,¹⁵, Jeroen de Ridder²¹, Antoine-Emmanuel Saliba, Antonios Somarakis¹⁵, Oliver Stegle²⁸,³⁴, Fabian J. Theis, Huan Yang³⁵, Alexander Zelikovsky¹⁷,³⁶, Alice C. McHardy, Benjamin J. Raphael³⁷, Sohrab P. Shah³⁸, Alexander Schönhuth²¹,²⁰•Institutions (38)

University of Düsseldorf¹, University of Duisburg-Essen², Harvard University³, University of Warsaw⁴, University of Melbourne⁵, St. Vincent's Institute of Medical Research⁶, Johns Hopkins University⁷, Swiss Institute of Bioinformatics⁸, The Turing Institute⁹, Western General Hospital¹⁰, University of British Columbia¹¹, BC Cancer Agency¹², ETH Zurich¹³, Delft University of Technology¹⁴, Leiden University Medical Center¹⁵, Broad Institute¹⁶, Georgia State University¹⁷, Karlsruhe Institute of Technology¹⁸, Heidelberg Institute for Theoretical Studies¹⁹, Centrum Wiskunde & Informatica²⁰, Utrecht University²¹, University of Amsterdam²², Imperial College London²³, Radboud University Nijmegen²⁴, University Medical Center Groningen²⁵, Wageningen University and Research Centre²⁶, University of Connecticut²⁷, European Bioinformatics Institute²⁸, University of Cambridge²⁹, Wellcome Trust Sanger Institute³⁰, Saarland University³¹, Max Planck Society³², Zuse Institute Berlin³³, German Cancer Research Center³⁴, Leiden University³⁵, I.M. Sechenov First Moscow State Medical University³⁶, Princeton University³⁷, Memorial Sloan Kettering Cancer Center³⁸

07 Feb 2020-Genome Biology

TL;DR: This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years in single-cell data science.

Abstract: The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.

677 citations

Journal Article•DOI•

Exploring tissue architecture using spatial transcriptomics.

Anjali Rao, Dalia Barkley, Gustavo S. França, Itai Yanai

11 Aug 2021-Nature

TL;DR: This Review describes spatial transcriptomic technologies and analysis tools, which can be used for hypothesis testing with experimental designs that compare time points or conditions, including genetic or environmental perturbations, and which are naturally amenable to integration with other data modalities, providing an expandable framework for insight into tissue organization.

Abstract: Deciphering the principles and mechanisms by which gene activity orchestrates complex cellular arrangements in multicellular organisms has far-reaching implications for research in the life sciences. Recent technological advances in next-generation sequencing- and imaging-based approaches have established the power of spatial transcriptomics to measure expression levels of all or most genes systematically throughout tissue space, and have been adopted to generate biological insights in neuroscience, development and plant biology as well as to investigate a range of disease contexts, including cancer. Similar to datasets made possible by genomic sequencing and population health surveys, the large-scale atlases generated by this technology lend themselves to exploratory data analysis for hypothesis generation. Here we review spatial transcriptomic technologies and describe the repertoire of operations available for paths of analysis of the resulting data. Spatial transcriptomics can also be deployed for hypothesis testing using experimental designs that compare time points or conditions—including genetic or environmental perturbations. Finally, spatial transcriptomic data are naturally amenable to integration with other data modalities, providing an expandable framework for insight into tissue organization. This review describes the state of spatial transcriptomics technologies and analysis tools that are being used to generate biological insights in diverse areas of biology.

358 citations

Journal Article•DOI•

Orchestrating Single-Cell Analysis with Bioconductor

Robert A. Amezquita¹, Aaron T. L. Lun²,³, Etienne Becht¹, Vincent J. Carey⁴, Lindsay N. Carpp¹, Ludwig Geistlinger⁵, Federico Marini, Kevin Rue-Albrecht⁶, Davide Risso⁷,⁸, Charlotte Soneson⁹,¹⁰, Levi Waldron⁵, Hervé Pagès¹, Mike L. Smith, Wolfgang Huber, Martin Morgan¹¹, Raphael Gottardo¹, Stephanie C. Hicks¹²•Institutions (12)

Fred Hutchinson Cancer Research Center¹, University of Cambridge², Genentech³, Brigham and Women's Hospital⁴, City University of New York⁵, University of Oxford⁶, University of Padua⁷, Cornell University⁸, Swiss Institute of Bioinformatics⁹, Friedrich Miescher Institute for Biomedical Research¹⁰, Roswell Park Cancer Institute¹¹, Johns Hopkins University¹²

01 Feb 2020-Nature Methods

TL;DR: This Perspective highlights open-source software for single-cell analysis released as part of the Bioconductor project, providing an overview for users and developers.

Abstract: Recent technological advancements have enabled the profiling of a large number of genome-wide features in individual cells. However, single-cell data present unique challenges that require the development of specialized methods and software infrastructure to successfully derive biological insights. The Bioconductor project has rapidly grown to meet these demands, hosting community-developed open-source software distributed as R packages. Featuring state-of-the-art computational methods, standardized data infrastructure and interactive data visualization tools, we present an overview and online book (https://osca.bioconductor.org) of single-cell methods for prospective users.

332 citations

Journal Article•DOI•

Integrating single-cell and spatial transcriptomics to elucidate intercellular tissue dynamics

Sophia K Longo¹, Margaret Guo¹, Andrew L. Ji¹, Paul A. Khavari²,¹•Institutions (2)

Stanford University¹, Veterans Health Administration²

18 Jun 2021-Nature Reviews Genetics

TL;DR: This Review covers efforts to integrate scRNA-seq with spatial transcriptomics, that is, with recently developed techniques that localize RNA within tissue, including multiplexed in situ hybridization and in situ sequencing (here defined as high-plex RNA imaging) and spatial barcoding.

Abstract: Single-cell RNA sequencing (scRNA-seq) identifies cell subpopulations within tissue but does not capture their spatial distribution nor reveal local networks of intercellular communication acting in situ. A suite of recently developed techniques that localize RNA within tissue, including multiplexed in situ hybridization and in situ sequencing (here defined as high-plex RNA imaging) and spatial barcoding, can help address this issue. However, no method currently provides as complete a scope of the transcriptome as does scRNA-seq, underscoring the need for approaches to integrate single-cell and spatial data. Here, we review efforts to integrate scRNA-seq with spatial transcriptomics, including emerging integrative computational methods, and propose ways to effectively combine current methodologies.

288 citations

Journal Article•DOI•

Collagen-producing lung cell atlas identifies multiple subsets with distinct localization and relevance to fibrosis.

Tatsuya Tsukui¹, Kai-Hui Sun¹, J. Wetter², John R. Wilson-Kanamori³, Lisa A Hazelwood², Neil C. Henderson³, Taylor Adams⁴, Jonas C. Schupp⁴, Sergio Poli⁵, Ivan O. Rosas⁵, Naftali Kaminski⁴, Michael A. Matthay¹, Paul J. Wolters¹, Dean Sheppard¹•Institutions (5)

University of California, San Francisco¹, AbbVie², University of Edinburgh³, Yale University⁴, Brigham and Women's Hospital⁵

21 Apr 2020-Nature Communications

TL;DR: The atlas of collagen-producing cells provides a roadmap for studying the roles of these unique populations in homeostasis and pathologic fibrosis, and identifies Cthrc1-expressing fibroblasts that emerge in fibrotic lungs and show disease-relevant phenotypes.

Abstract: Collagen-producing cells maintain the complex architecture of the lung and drive pathologic scarring in pulmonary fibrosis. Here we perform single-cell RNA-sequencing to identify all collagen-producing cells in normal and fibrotic lungs. We characterize multiple collagen-producing subpopulations with distinct anatomical localizations in different compartments of murine lungs. One subpopulation, characterized by expression of Cthrc1 (collagen triple helix repeat containing 1), emerges in fibrotic lungs and expresses the highest levels of collagens. Single-cell RNA-sequencing of human lungs, including those from idiopathic pulmonary fibrosis and scleroderma patients, demonstrate similar heterogeneity and CTHRC1-expressing fibroblasts present uniquely in fibrotic lungs. Immunostaining and in situ hybridization show that these cells are concentrated within fibroblastic foci. We purify collagen-producing subpopulations and find disease-relevant phenotypes of Cthrc1-expressing fibroblasts in in vitro and adoptive transfer experiments. Our atlas of collagen-producing cells provides a roadmap for studying the roles of these unique populations in homeostasis and pathologic fibrosis.

271 citations

References

Journal Article•DOI•

Gene Ontology: tool for the unification of biology

M Ashburner¹, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. M. Cherry, Allan Peter Davis, Kara Dolinski, Selina S. Dwight, J.T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna E. Lewis, John C. Matese, Joel E. Richardson, M. Ringwald, Gerald M. Rubin, Gavin Sherlock•Institutions (1)

Stanford University¹

01 May 2000-Nature Genetics

TL;DR: The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.

Abstract: Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.

35,225 citations

Journal Article•

Visualizing Data using t-SNE

Laurens van der Maaten, Geoffrey E. Hinton

01 Jan 2008-Journal of Machine Learning Research

TL;DR: t-SNE visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. It is a variation of Stochastic Neighbor Embedding that is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.

Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large datasets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of datasets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the datasets.

30,124 citations

Book•

Dynamic Programming

Richard Ernest Bellman

21 Oct 1957

TL;DR: An introduction to the mathematical theory of multistage decision processes that takes a functional equation approach to the discovery of optimum policies.

Abstract: From the Publisher: An introduction to the mathematical theory of multistage decision processes, this text takes a functional equation approach to the discovery of optimum policies. Written by a leading developer of such policies, it presents a series of methods, uniqueness and existence theorems, and examples for solving the relevant equations. The text examines existence and uniqueness theorems, the optimal inventory equation, bottleneck problems in multistage production processes, a new formalism in the calculus of variations, strategies behind multistage games, and Markovian decision processes. Each chapter concludes with a problem set that Eric V. Denardo of Yale University, in his informative new introduction, calls a rich lode of applications and research topics. 1957 edition. 37 figures.

14,187 citations

Journal Article•DOI•

Fast unfolding of communities in large networks

Vincent D. Blondel¹, Jean-Loup Guillaume²,¹, Renaud Lambiotte³,¹, Etienne Lefebvre¹•Institutions (3)

Université catholique de Louvain¹, Pierre-and-Marie-Curie University², Imperial College London³

04 Mar 2008-arXiv: Physics and Society

TL;DR: This work proposes a heuristic method, based on modularity optimization, that is shown to outperform all other known community detection methods in terms of computation time. The quality of the communities detected is very good, as measured by the so-called modularity.

Abstract: We propose a simple method to extract the community structure of large networks. Our method is a heuristic method that is based on modularity optimization. It is shown to outperform all other known community detection methods in terms of computation time. Moreover, the quality of the communities detected is very good, as measured by the so-called modularity. This is shown first by identifying language communities in a Belgian mobile phone network of 2.6 million customers and by analyzing a web graph of 118 million nodes and more than one billion links. The accuracy of our algorithm is also verified on ad-hoc modular networks.

13,519 citations
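
Note on the entry above: the abstract names modularity both as the quantity the heuristic optimizes and as the measure of community quality, but does not define it. The following is a minimal sketch, assuming the standard Newman-Girvan definition for an undirected, unweighted graph, in which each community contributes its share of intra-community edges minus the squared share of its summed node degrees. The function name `modularity` and the toy graph are illustrative and are not taken from the paper. The sketch only scores a given partition; it does not implement the paper's heuristic.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to a community label.
    Assumed definition: Q = sum over communities c of
    (intra-community edges of c) / m - (summed degree of c / (2 * m)) ** 2,
    where m is the total number of edges.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    intra = defaultdict(int)   # edges with both endpoints in the community
    degree = defaultdict(int)  # summed node degrees per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "all" for node in range(6)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Under this definition, splitting the toy graph into its two triangles scores higher than placing all nodes in a single community, which always scores zero.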

Journal Article•DOI•

Least squares quantization in PCM

S. P. Lloyd¹•Institutions (1)

Bell Labs¹

01 Mar 1982-IEEE Transactions on Information Theory

TL;DR: In this article, the authors derived necessary conditions for any finite number of quanta and associated quantization intervals of an optimum finite quantization scheme to achieve minimum average quantization noise power.

Abstract: It has long been realized that in pulse-code modulation (PCM), with a given ensemble of signals to handle, the quantum values should be spaced more closely in the voltage regions where the signal amplitude is more likely to fall. It has been shown by Panter and Dite that, in the limit as the number of quanta becomes infinite, the asymptotic fractional density of quanta per unit voltage should vary as the one-third power of the probability density per unit voltage of signal amplitudes. In this paper the corresponding result for any finite number of quanta is derived; that is, necessary conditions are found that the quanta and associated quantization intervals of an optimum finite quantization scheme must satisfy. The optimization criterion used is that the average quantization noise power be a minimum. It is shown that the result obtained here goes over into the Panter and Dite result as the number of quanta becomes large. The optimum quantization schemes for $2^{b}$ quanta, $b = 1, 2, \ldots, 7$, are given numerically for Gaussian and for Laplacian distribution of signal amplitudes.

11,872 citations
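
Note on the entry above: the abstract refers to necessary conditions on the quanta and their quantization intervals without stating them. The following is a minimal sketch, assuming the formulation usually attributed to this paper: each interval boundary lies midway between adjacent quanta, and each quantum is the mean of the signal values falling in its interval. It works on a finite list of samples rather than a probability density, and the names `lloyd_quantizer` and `mean_squared_error` and the sample data are illustrative, not taken from the paper.

```python
def lloyd_quantizer(samples, levels, iterations=50):
    """Alternate the two assumed conditions on 1-D samples.

    Returns the quanta (sorted ascending) once they stop changing or
    after `iterations` rounds, whichever comes first.
    """
    levels = sorted(levels)
    for _ in range(iterations):
        # Interval boundaries: midpoints between adjacent quanta.
        bounds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        # Assign each sample to the interval it falls in.
        buckets = [[] for _ in levels]
        for x in samples:
            buckets[sum(x > b for b in bounds)].append(x)
        # Quanta: mean of the samples in each interval
        # (an empty interval keeps its previous quantum).
        new_levels = [
            sum(bucket) / len(bucket) if bucket else old
            for bucket, old in zip(buckets, levels)
        ]
        if new_levels == levels:
            break
        levels = new_levels
    return levels


def mean_squared_error(samples, levels):
    """Average quantization noise power when each sample maps to its nearest quantum."""
    return sum(min((x - q) ** 2 for q in levels) for x in samples) / len(samples)


# Hypothetical data: two well-separated groups of amplitudes, two quanta.
samples = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
start = [0.0, 1.0]
final = lloyd_quantizer(samples, start)
```

On this toy input the quanta settle at the means of the two groups, and the average squared error is lower than with the starting quanta. The iteration only reaches a configuration satisfying the assumed conditions, which is not guaranteed to be the global optimum.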