Integrating single-cell transcriptomic data across different conditions, technologies, and species.

doi:10.1038/NBT.4096

Journal Article•DOI•

Andrew Butler, Paul J. Hoffman, Peter Smibert, Efthymia Papalexi¹, Rahul Satija¹

New York University¹

02 Apr 2018-Nature Biotechnology (Nature Publishing Group)-Vol. 36, Iss: 5, pp 411-420

TL;DR: An analytical strategy for integrating scRNA-seq data sets based on common sources of variation is introduced, enabling the identification of shared populations across data sets and downstream comparative analysis.

Abstract: Computational single-cell RNA-seq (scRNA-seq) methods have been successfully applied to experiments representing a single condition, technology, or species to discover and define cellular phenotypes. However, identifying subpopulations of cells that are present across multiple data sets remains challenging. Here, we introduce an analytical strategy for integrating scRNA-seq data sets based on common sources of variation, enabling the identification of shared populations across data sets and downstream comparative analysis. We apply this approach, implemented in our R toolkit Seurat (http://satijalab.org/seurat/), to align scRNA-seq data sets of peripheral blood mononuclear cells under resting and stimulated conditions, hematopoietic progenitors sequenced using two profiling technologies, and pancreatic cell 'atlases' generated from human and mouse islets. In each case, we learn distinct or transitional cell states jointly across data sets, while boosting statistical power through integrated analysis. Our approach facilitates general comparisons of scRNA-seq data sets, potentially deepening our understanding of how distinct cell states respond to perturbation, disease, and evolution.

Citations

Journal Article•DOI•

Comprehensive Integration of Single-Cell Data.

Tim Stuart, Andrew Butler¹, Paul J. Hoffman, Christoph Hafemeister, Efthymia Papalexi¹, William M. Mauck¹, Yuhan Hao¹, Marlon Stoeckius², Peter Smibert², Rahul Satija¹

New York University¹, Harvard University²

13 Jun 2019-Cell

TL;DR: A strategy to "anchor" diverse datasets together is developed, enabling the integration of single-cell measurements not only across scRNA-seq technologies, but also across different modalities.

7,892 citations

Cites background or methods or result from "Integrating single-cell transcripto..."

...To overcome this, we first jointly reduce the dimensionality of both datasets using diagonalized CCA, then apply L2-normalization to the canonical correlation vectors (Figures 1A and 1B)....
[...]
...We also tested the following existing integration methods on the same holdout datasets: Seurat v2 [Butler et al., 2018], mnnCorrect [Haghverdi et al., 2018], and scanorama [Hie et al., 2019]....
[...]
...Canonical correlation vectors are calculated as described previously [Butler et al., 2018]....
[...]
...3.3 Butler et al., 2018 https://github.com/satijalab/seurat/releases/tag/ v2....
[...]
...For Seurat v2, we used the same feature set as determined for Seurat v3 to run a multi-CCA analysis followed by alignment (RunMultiCCA and AlignSubspace in Seurat v2)....
[...]
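
The first excerpt above describes two steps: a joint dimensionality reduction with diagonalized CCA, followed by L2-normalization of the canonical correlation vectors. A minimal sketch of those two steps is given below. It assumes that "diagonalized" means the within-dataset covariance matrices are treated as diagonal, so that after standardization CCA reduces to a singular value decomposition of the cross-product matrix. The function names, the random input matrices, and the choice of two components are illustrative assumptions and are not taken from the Seurat code base.

```python
import numpy as np


def standardize(m):
    """Center and scale each column (one cell per column) to mean 0, sd 1."""
    m = m - m.mean(axis=0, keepdims=True)
    sd = m.std(axis=0, keepdims=True)
    sd[sd == 0] = 1.0  # avoid division by zero for constant columns
    return m / sd


def diagonal_cca(x, y, n_components=2):
    """Sketch of diagonalized CCA (assumed formulation, see lead-in).

    x: genes x cells_x matrix, y: genes x cells_y matrix, same genes as rows.
    Returns per-cell canonical correlation vectors for each dataset and the
    corresponding singular values.
    """
    xs, ys = standardize(x), standardize(y)
    # Cross-product between the cells of the two datasets (cells_x x cells_y).
    u, s, vt = np.linalg.svd(xs.T @ ys, full_matrices=False)
    return u[:, :n_components], vt.T[:, :n_components], s[:n_components]


def l2_normalize_rows(m):
    """Scale each cell's embedding (row) to unit Euclidean length."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return m / norms


# Hypothetical toy data: 50 shared genes, 30 and 40 cells.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 30))
y = rng.normal(size=(50, 40))

cc_x, cc_y, s = diagonal_cca(x, y, n_components=2)
emb_x = l2_normalize_rows(cc_x)
emb_y = l2_normalize_rows(cc_y)
```

After the normalization step every cell lies on the unit sphere in the shared space, so distances between cells from the two datasets depend on direction rather than on magnitude.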

Journal Article•DOI•

Dimensionality reduction for visualizing single-cell data using UMAP.

Etienne Becht¹, Leland McInnes, John Healy, Charles-Antoine Dutertre¹, Immanuel Kwok¹, Lai Guan Ng¹, Florent Ginhoux¹, Evan W. Newell¹,²

Agency for Science, Technology and Research¹, Fred Hutchinson Cancer Research Center²

01 Jan 2019-Nature Biotechnology

TL;DR: Comparing the performance of UMAP with five other tools, it is found that UMAP provides the fastest run times, highest reproducibility and the most meaningful organization of cell clusters.

Abstract: Advances in single-cell technologies have enabled high-resolution dissection of tissue composition. Several tools for dimensionality reduction are available to analyze the large number of parameters generated in single-cell studies. Recently, a nonlinear dimensionality-reduction technique, uniform manifold approximation and projection (UMAP), was developed for the analysis of any type of high-dimensional data. Here we apply it to biological data, using three well-characterized mass cytometry and single-cell RNA sequencing datasets. Comparing the performance of UMAP with five other tools, we find that UMAP provides the fastest run times, highest reproducibility and the most meaningful organization of cell clusters. The work highlights the use of UMAP for improved visualization and interpretation of single-cell data.

3,016 citations

Posted Content•DOI•

Integrated analysis of multimodal single-cell data

Yuhan Hao¹, Stephanie Hao², Erica Andersen-Nissen³, William M. Mauck¹, Shiwei Zheng¹, Andrew Butler¹, Maddie Jane Lee⁴, Aaron J. Wilk⁴, Charlotte A. Darby¹, Michael Zagar³, Paul Hoffman¹, Marlon Stoeckius², Efthymia Papalexi¹, Eleni P. Mimitou², Jaison Jain¹, Avi Srivastava¹, Tim Stuart¹, Lamar Ballweber Fleming³, Bertrand Z. Yeung, Angela J. Rogers⁴, Juliana M. McElrath³, Catherine A. Blish⁴, Raphael Gottardo³, Peter Smibert², Rahul Satija¹

New York University¹, Harvard University², Fred Hutchinson Cancer Research Center³, Stanford University⁴

12 Oct 2020-bioRxiv

TL;DR: ‘Weighted-nearest neighbor’ analysis is introduced, an unsupervised framework to learn the relative utility of each data type in each cell, enabling an integrative analysis of multiple modalities.

Abstract: The simultaneous measurement of multiple modalities, known as multimodal analysis, represents an exciting frontier for single-cell genomics and necessitates new computational methods that can define cellular states based on multiple data types. Here, we introduce ‘weighted-nearest neighbor’ analysis, an unsupervised framework to learn the relative utility of each data type in each cell, enabling an integrative analysis of multiple modalities. We apply our procedure to a CITE-seq dataset of hundreds of thousands of human white blood cells alongside a panel of 228 antibodies to construct a multimodal reference atlas of the circulating immune system. We demonstrate that integrative analysis substantially improves our ability to resolve cell states and validate the presence of previously unreported lymphoid subpopulations. Moreover, we demonstrate how to leverage this reference to rapidly map new datasets, and to interpret immune responses to vaccination and COVID-19. Our approach represents a broadly applicable strategy to analyze single-cell multimodal datasets, including paired measurements of RNA and chromatin state, and to look beyond the transcriptome towards a unified and multimodal definition of cellular identity. Availability: Installation instructions, documentation, tutorials, and CITE-seq datasets are available at http://www.satijalab.org/seurat

2,924 citations

Cites result from "Integrating single-cell transcripto..."

...We note that in contrast to our previously developed scRNA-seq integration algorithms [37, 54], the reference dataset and visualization can remain constant during this procedure....
[...]

Journal Article•DOI•

Fast, sensitive and accurate integration of single-cell data with Harmony.

Ilya Korsunsky, Nghia Millard, Jean Fan¹, Kamil Slowikowski, Fan Zhang, Kevin Wei², Yuriy Baglaenko, Michael B. Brenner², Po-Ru Loh¹,²,³, Soumya Raychaudhuri

Harvard University¹, Brigham and Women's Hospital², Broad Institute³

18 Nov 2019-Nature Methods

TL;DR: Harmony, for the integration of single-cell transcriptomic data, identifies broad and fine-grained populations, scales to large datasets, and can integrate sequencing- and imaging-based data.

Abstract: The emerging diversity of single-cell RNA-seq datasets allows for the full transcriptional characterization of cell types across a wide variety of biological and clinical conditions. However, it is challenging to analyze them together, particularly when datasets are assayed with different technologies, because biological and technical differences are interspersed. We present Harmony (https://github.com/immunogenomics/harmony), an algorithm that projects cells into a shared embedding in which cells group by cell type rather than dataset-specific conditions. Harmony simultaneously accounts for multiple experimental and biological factors. In six analyses, we demonstrate the superior performance of Harmony to previously published algorithms while requiring fewer computational resources. Harmony enables the integration of ~10⁶ cells on a personal computer. We apply Harmony to peripheral blood mononuclear cells from datasets with large experimental differences, five studies of pancreatic islet cells, mouse embryogenesis datasets and the integration of scRNA-seq with spatial transcriptomics data. Harmony, for the integration of single-cell transcriptomic data, identifies broad and fine-grained populations, scales to large datasets, and can integrate sequencing- and imaging-based data.

2,459 citations

Posted Content•DOI•

Comprehensive integration of single cell data

Tim Stuart, Andrew Butler¹, Paul J. Hoffman, Christoph Hafemeister, Efthymia Papalexi¹, William M. Mauck¹, Marlon Stoeckius², Peter Smibert², Rahul Satija¹

New York University¹, Harvard University²

02 Nov 2018-bioRxiv

TL;DR: This work presents a strategy for comprehensive integration of single cell data, including the assembly of harmonized references, and the transfer of information across datasets, and demonstrates how anchoring can harmonize in-situ gene expression and scRNA-seq datasets.

Abstract: Single cell transcriptomics (scRNA-seq) has transformed our ability to discover and annotate cell types and states, but deep biological understanding requires more than a taxonomic listing of clusters. As new methods arise to measure distinct cellular modalities, including high-dimensional immunophenotypes, chromatin accessibility, and spatial positioning, a key analytical challenge is to integrate these datasets into a harmonized atlas that can be used to better understand cellular identity and function. Here, we develop a computational strategy to "anchor" diverse datasets together, enabling us to integrate and compare single cell measurements not only across scRNA-seq technologies, but different modalities as well. After demonstrating substantial improvement over existing methods for data integration, we anchor scRNA-seq experiments with scATAC-seq datasets to explore chromatin differences in closely related interneuron subsets, and project single cell protein measurements onto a human bone marrow atlas to annotate and characterize lymphocyte populations. Lastly, we demonstrate how anchoring can harmonize in-situ gene expression and scRNA-seq datasets, allowing for the transcriptome-wide imputation of spatial gene expression patterns, and the identification of spatial relationships between mapped cell types in the visual cortex. Our work presents a strategy for comprehensive integration of single cell data, including the assembly of harmonized references, and the transfer of information across datasets. Availability: Installation instructions, documentation, and tutorials are available at: https://www.satijalab.org/seurat

2,037 citations

Cites methods or result from "Integrating single-cell transcripto..."

...As we have previously demonstrated [Butler et al., 2018], the canonical correlation vectors described by CCA effectively capture correlated gene modules that are present in both datasets, representing genes that define a shared biological state....
[...]
...We also tested the following existing integration methods on the same holdout datasets: Seurat v2 [Butler et al., 2018], mnnCorrect [Haghverdi et al....
[...]
...Canonical correlation vectors are calculated as described previously [Butler et al., 2018]....
[...]
...While we have previously suggested using saturation or statistical-resampling based approaches to estimate dataset dimensionality [Butler et al., 2018], a robust fully unsupervised procedure to identify this value remains a fundamental challenge in the analysis of high-dimensional data....
[...]

References

Journal Article•DOI•

limma powers differential expression analyses for RNA-sequencing and microarray studies

Matthew E. Ritchie¹, Belinda Phipson², Di Wu³, Yifang Hu¹, Charity W. Law⁴, Wei Shi¹, Gordon K. Smyth¹,⁵

Walter and Eliza Hall Institute of Medical Research¹, Royal Children's Hospital², Harvard University³, University of Zurich⁴, University of Melbourne⁵

20 Apr 2015-Nucleic Acids Research

TL;DR: The philosophy and design of the limma package is reviewed, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.

Abstract: limma is an R/Bioconductor software package that provides an integrated solution for analysing data from gene expression experiments. It contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. Over the past decade, limma has been a popular choice for gene discovery through differential expression analyses of microarray and high-throughput PCR data. The package contains particularly strong facilities for reading, normalizing and exploring such data. Recently, the capabilities of limma have been significantly expanded in two important directions. First, the package can now perform both differential expression and differential splicing analyses of RNA sequencing (RNA-seq) data. All the downstream analysis tools previously restricted to microarray data are now available for RNA-seq as well. These capabilities allow users to analyse both RNA-seq and microarray data with very similar pipelines. Second, the package is now able to go past the traditional gene-wise expression analyses in a variety of ways, analysing expression profiles in terms of co-regulated sets of genes or in terms of higher-order expression signatures. This provides enhanced possibilities for biological interpretation of gene expression differences. This article reviews the philosophy and design of the limma package, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.

22,147 citations

Journal Article•DOI•

Adjusting batch effects in microarray expression data using empirical Bayes methods

W. Evan Johnson¹, Cheng Li¹, Ariel Rabinovic¹

Harvard University¹

01 Jan 2007-Biostatistics

TL;DR: This paper proposes parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that are robust to outliers in small sample sizes and perform comparably to existing methods for large samples.

Abstract: Non-biological experimental variation or “batch effects” are commonly observed across multiple batches of microarray experiments, often rendering the task of combining data from these batches difficult. The ability to combine microarray data sets is advantageous to researchers to increase statistical power to detect biological phenomena from studies where logistical considerations restrict sample size or in studies that require the sequential hybridization of arrays. In general, it is inappropriate to combine data sets without adjusting for batch effects. Methods have been proposed to filter batch effects from data, but these are often complicated and require large batch sizes (>25) to implement. Because the majority of microarray studies are conducted using much smaller sample sizes, existing methods are not sufficient. We propose parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that is robust to outliers in small sample sizes and performs comparable to existing methods for large samples. We illustrate our methods using two example data sets and show that our methods are justifiable, easy to apply, and useful in practice. Software for our method is freely available at: http://biosun1.harvard.edu/complab/batch/.

6,319 citations

Journal Article•DOI•

Enrichr: a comprehensive gene set enrichment analysis web server 2016 update

Maxim V. Kuleshov¹, Matthew R. Jones¹, Andrew D. Rouillard¹, Nicolas F. Fernandez¹, Qiaonan Duan¹, Zichen Wang¹, Simon Koplev¹, Sherry L. Jenkins¹, Kathleen M. Jagodnik², Alexander Lachmann¹, Michael G. McDermott¹, Caroline D. Monteiro¹, Gregory W. Gundersen¹, Avi Ma'ayan¹

Icahn School of Medicine at Mount Sinai¹, Glenn Research Center²

08 Jul 2016-Nucleic Acids Research

TL;DR: A significant update to Enrichr, a comprehensive resource for curated gene sets and a search engine that accumulates biological knowledge for further biological discoveries, is presented.

Abstract: Enrichment analysis is a popular method for analyzing gene sets generated by genome-wide experiments. Here we present a significant update to one of the tools in this domain called Enrichr. Enrichr currently contains a large collection of diverse gene set libraries available for analysis and download. In total, Enrichr currently contains 180 184 annotated gene sets from 102 gene set libraries. New features have been added to Enrichr including the ability to submit fuzzy sets, upload BED files, improved application programming interface and visualization of the results as clustergrams. Overall, Enrichr is a comprehensive resource for curated gene sets and a search engine that accumulates biological knowledge for further biological discoveries. Enrichr is freely available at: http://amp.pharm.mssm.edu/Enrichr.

6,201 citations

Book Chapter•DOI•

Relations Between Two Sets of Variates

Harold Hotelling¹

Columbia University¹

01 Dec 1936-Biometrika

TL;DR: Concepts of correlation and regression may be applied not only to ordinary one-dimensional variates but also to variates of two or more dimensions, which calls for a detailed study of the relations between sets of correlated variates.

Abstract: Concepts of correlation and regression may be applied not only to ordinary one-dimensional variates but also to variates of two or more dimensions. Marksmen side by side firing simultaneous shots at targets, so that the deviations are in part due to independent individual errors and in part to common causes such as wind, provide a familiar introduction to the theory of correlation; but only the correlation of the horizontal components is ordinarily discussed, whereas the complex consisting of horizontal and vertical deviations may be even more interesting. The wind at two places may be compared, using both components of the velocity in each place. A fluctuating vector is thus matched at each moment with another fluctuating vector. The study of individual differences in mental and physical traits calls for a detailed study of the relations between sets of correlated variates. For example the scores on a number of mental tests may be compared with physical measurements on the same persons. The questions then arise of determining the number and nature of the independent relations of mind and body shown by these data to exist, and of extracting from the multiplicity of correlations in the system suitable characterizations of these independent relations. As another example, the inheritance of intelligence in rats might be studied by applying not one but s different mental tests to N mothers and to a daughter of each

6,122 citations

Journal Article•DOI•

Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets

Evan Z. Macosko¹,², Anindita Basu¹,², Rahul Satija¹,³, James Nemesh¹,², Karthik Shekhar¹, Melissa Goldman¹,², Itay Tirosh¹, Allison R. Bialas⁴, Nolan Kamitaki¹,², Emily M. Martersteck², John J. Trombetta¹, David A. Weitz², Joshua R. Sanes², Alex K. Shalek¹,⁵,⁶, Aviv Regev¹,⁵,⁷, Steven A. McCarroll¹,²

Broad Institute¹, Harvard University², New York University³, Boston Children's Hospital⁴, Massachusetts Institute of Technology⁵, Ragon Institute of MGH, MIT and Harvard⁶, Howard Hughes Medical Institute⁷

21 May 2015-Cell

TL;DR: Drop-seq profiles individual cells by separating them into nanoliter-sized aqueous droplets, associating a different barcode with each cell's RNAs, and sequencing them all together, enabling routine transcriptional profiling at single-cell resolution.

5,506 citations