Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles

doi:10.1073/PNAS.0506580102

Home
/
Papers
/
Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles

Journal Article•DOI•

Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles

Aravind Subramanian¹, Pablo Tamayo¹, Vamsi K. Mootha², Sayan Mukherjee³, Benjamin L. Ebert², Michael A. Gillette², Amanda G. Paulovich⁴, Scott L. Pomeroy², Todd R. Golub², Eric S. Lander¹, Jill P. Mesirov¹ - Show less +7 more•Institutions (4)

Massachusetts Institute of Technology¹, Harvard University², Duke University³, Fred Hutchinson Cancer Research Center⁴

25 Oct 2005-Proceedings of the National Academy of Sciences of the United States of America (National Academy of Sciences)-Vol. 102, Iss: 43, pp 15545-15550

TL;DR: The Gene Set Enrichment Analysis (GSEA) method as discussed by the authors focuses on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation.

read less

Abstract: Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources.

[...]

Da-Wei Huang¹, Brad T. Sherman¹, Richard A. Lempicki¹•Institutions (1)

Science Applications International Corporation¹

01 Jan 2009-Nature Protocols

TL;DR: By following this protocol, investigators are able to gain an in-depth understanding of the biological themes in lists of genes that are enriched in genome-scale studies.

...read moreread less

Abstract: DAVID bioinformatics resources consists of an integrated biological knowledgebase and analytic tools aimed at systematically extracting biological meaning from large gene/protein lists. This protocol explains how to use DAVID, a high-throughput and integrated data-mining environment, to analyze gene lists derived from high-throughput genomic experiments. The procedure first requires uploading a gene list containing any number of common gene identifiers followed by analysis using one or more text and pathway-mining tools such as gene functional classification, functional annotation chart or clustering and functional annotation table. By following this protocol, investigators are able to gain an in-depth understanding of the biological themes in lists of genes that are enriched in genome-scale studies.

...read moreread less

31,015 citations

Journal Article•DOI•

limma powers differential expression analyses for RNA-sequencing and microarray studies

[...]

Matthew E. Ritchie¹, Belinda Phipson², Di Wu³, Yifang Hu¹, Charity W. Law⁴, Wei Shi¹, Gordon K. Smyth⁵, Gordon K. Smyth¹ - Show less +4 more•Institutions (5)

Walter and Eliza Hall Institute of Medical Research¹, Royal Children's Hospital², Harvard University³, University of Zurich⁴, University of Melbourne⁵

20 Apr 2015-Nucleic Acids Research

TL;DR: The philosophy and design of the limma package is reviewed, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.

...read moreread less

Abstract: limma is an R/Bioconductor software package that provides an integrated solution for analysing data from gene expression experiments. It contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. Over the past decade, limma has been a popular choice for gene discovery through differential expression analyses of microarray and high-throughput PCR data. The package contains particularly strong facilities for reading, normalizing and exploring such data. Recently, the capabilities of limma have been significantly expanded in two important directions. First, the package can now perform both differential expression and differential splicing analyses of RNA sequencing (RNA-seq) data. All the downstream analysis tools previously restricted to microarray data are now available for RNA-seq as well. These capabilities allow users to analyse both RNA-seq and microarray data with very similar pipelines. Second, the package is now able to go past the traditional gene-wise expression analyses in a variety of ways, analysing expression profiles in terms of co-regulated sets of genes or in terms of higher-order expression signatures. This provides enhanced possibilities for biological interpretation of gene expression differences. This article reviews the philosophy and design of the limma package, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.

...read moreread less

22,147 citations

Journal Article•DOI•

Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists

[...]

Da-Wei Huang¹, Brad T. Sherman¹, Richard A. Lempicki¹•Institutions (1)

Science Applications International Corporation¹

01 Jan 2009-Nucleic Acids Research

TL;DR: The survey will help tool designers/developers and experienced end users understand the underlying algorithms and pertinent details of particular tool categories/tools, enabling them to make the best choices for their particular research interests.

...read moreread less

Abstract: Functional analysis of large gene lists, derived in most cases from emerging high-throughput genomic, proteomic and bioinformatics scanning approaches, is still a challenging and daunting task. The gene-annotation enrichment analysis is a promising high-throughput strategy that increases the likelihood for investigators to identify biological processes most pertinent to their study. Approximately 68 bioinformatics enrichment tools that are currently available in the community are collected in this survey. Tools are uniquely categorized into three major classes, according to their underlying enrichment algorithms. The comprehensive collections, unique tool classifications and associated questions/issues will provide a more comprehensive and up-to-date view regarding the advantages, pitfalls and recent trends in a simpler tool-class level rather than by a tool-by-tool approach. Thus, the survey will help tool designers/developers and experienced end users understand the underlying algorithms and pertinent details of particular tool categories/tools, enabling them to make the best choices for their particular research interests.

...read moreread less

13,102 citations

Posted Content•

Inductive Representation Learning on Large Graphs

[...]

William L. Hamilton, Rex Ying, Jure Leskovec

07 Jun 2017-arXiv: Social and Information Networks

TL;DR: GraphSAGE is presented, a general, inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data and outperforms strong baselines on three inductive node-classification benchmarks.

...read moreread less

Abstract: Low-dimensional embeddings of nodes in large graphs have proved extremely useful in a variety of prediction tasks, from content recommendation to identifying protein functions. However, most existing approaches require that all nodes in the graph are present during training of the embeddings; these previous approaches are inherently transductive and do not naturally generalize to unseen nodes. Here we present GraphSAGE, a general, inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, we learn a function that generates embeddings by sampling and aggregating features from a node's local neighborhood. Our algorithm outperforms strong baselines on three inductive node-classification benchmarks: we classify the category of unseen nodes in evolving information graphs based on citation and Reddit post data, and we show that our algorithm generalizes to completely unseen graphs using a multi-graph dataset of protein-protein interactions.

...read moreread less

7,926 citations

Journal Article•DOI•

Metascape provides a biologist-oriented resource for the analysis of systems-level datasets.

[...]

Yingyao Zhou¹, Bin Zhou¹, Lars Pache², Max W. Chang³, Alireza Hadj Khodabakhshi¹, Olga Tanaseichuk¹, Christopher Benner³, Sumit K. Chanda² - Show less +4 more•Institutions (3)

Genomics Institute of the Novartis Research Foundation¹, Discovery Institute², University of California, San Diego³

03 Apr 2019-Nature Communications

TL;DR: A biologist-oriented portal that provides a gene list annotation, enrichment and interactome resource and enables integrated analysis of multi-OMICs datasets, Metascape is an effective and efficient tool for experimental biologists to comprehensively analyze and interpret OMICs-based studies in the big data era.

...read moreread less

Abstract: A critical component in the interpretation of systems-level studies is the inference of enriched biological pathways and protein complexes contained within OMICs datasets Successful analysis requires the integration of a broad set of current biological databases and the application of a robust analytical pipeline to produce readily interpretable results Metascape is a web-based portal designed to provide a comprehensive gene list annotation and analysis resource for experimental biologists In terms of design features, Metascape combines functional enrichment, interactome analysis, gene annotation, and membership search to leverage over 40 independent knowledgebases within one integrated portal Additionally, it facilitates comparative analyses of datasets across multiple independent and orthogonal experiments Metascape provides a significantly simplified user experience through a one-click Express Analysis interface to generate interpretable outputs Taken together, Metascape is an effective and efficient tool for experimental biologists to comprehensively analyze and interpret OMICs-based studies in the big data era

...read moreread less

6,282 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

Quantitative monitoring of gene expression patterns with a complementary DNA microarray.

[...]

Mark Schena¹, Dari Shalon¹, Ronald W. Davis¹, Patrick O. Brown¹•Institutions (1)

Stanford University¹

20 Oct 1995-Science

TL;DR: A high-capacity system was developed to monitor the expression of many genes in parallel by means of simultaneous, two-color fluorescence hybridization, which enabled detection of rare transcripts in probe mixtures derived from 2 micrograms of total cellular messenger RNA.

...read moreread less

Abstract: A high-capacity system was developed to monitor the expression of many genes in parallel. Microarrays prepared by high-speed robotic printing of complementary DNAs on glass were used for quantitative expression measurements of the corresponding genes. Because of the small format and high density of the arrays, hybridization volumes of 2 microliters could be used that enabled detection of rare transcripts in probe mixtures derived from 2 micrograms of total cellular messenger RNA. Differential expression measurements of 45 Arabidopsis genes were made by means of simultaneous, two-color fluorescence hybridization.

...read moreread less

10,287 citations

Journal Article•DOI•

PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes

[...]

Vamsi K. Mootha¹, Cecilia M. Lindgren¹, Cecilia M. Lindgren², Karl-Fredrik Eriksson², Aravind Subramanian¹, Smita Sihag¹, J. Lehar¹, Pere Puigserver³, Emma Carlsson², Martin Ridderstråle², Esa Laurila², Nicholas E. Houstis¹, Mark J. Daly¹, Nick Patterson¹, Jill P. Mesirov¹, Todd R. Golub³, Todd R. Golub¹, Pablo Tamayo¹, Bruce M. Spiegelman³, Eric S. Lander¹, Joel N. Hirschhorn³, Joel N. Hirschhorn¹, Joel N. Hirschhorn⁴, David Altshuler, Leif Groop² - Show less +21 more•Institutions (4)

Massachusetts Institute of Technology¹, Lund University², Harvard University³, Boston Children's Hospital⁴

01 Jul 2003-Nature Genetics

TL;DR: An analytical strategy is introduced, Gene Set Enrichment Analysis, designed to detect modest but coordinate changes in the expression of groups of functionally related genes, which identifies a set of genes involved in oxidative phosphorylation whose expression is coordinately decreased in human diabetic muscle.

...read moreread less

Abstract: DNA microarrays can be used to identify gene expression changes characteristic of human disease. This is challenging, however, when relevant differences are subtle at the level of individual genes. We introduce an analytical strategy, Gene Set Enrichment Analysis, designed to detect modest but coordinate changes in the expression of groups of functionally related genes. Using this approach, we identify a set of genes involved in oxidative phosphorylation whose expression is coordinately decreased in human diabetic muscle. Expression of these genes is high at sites of insulin-mediated glucose disposal, activated by PGC-1α and correlated with total-body aerobic capacity. Our results associate this gene set with clinically important variation in human metabolism and illustrate the value of pathway relationships in the analysis of genomic profiling experiments.

...read moreread less

7,997 citations

Book•

Nonparametric Statistical Methods

[...]

Myles Hollander, Douglas A. Wolfe

01 Mar 1973

TL;DR: An ideal text for an upper-level undergraduate or first-year graduate course, Nonparametric Statistical Methods, Second Edition is also an invaluable source for professionals who want to keep abreast of the latest developments within this dynamic branch of modern statistics.

...read moreread less

Abstract: This Second Edition of Myles Hollander and Douglas A. Wolfe's successful Nonparametric Statistical Methods meets the needs of a new generation of users, with completely up-to-date coverage of this important statistical area. Like its predecessor, the revised edition, along with its companion ftp site, aims to equip readers with the conceptual and technical skills necessary to select and apply the appropriate procedures for a given situation. An extensive array of examples drawn from actual experiments illustrates clearly how to use nonparametric approaches to handle one- or two-sample location and dispersion problems, dichotomous data, and one-way and two-way layout problems. An ideal text for an upper-level undergraduate or first-year graduate course, Nonparametric Statistical Methods, Second Edition is also an invaluable source for professionals who want to keep abreast of the latest developments within this dynamic branch of modern statistics.

...read moreread less

7,240 citations

Patent•DOI•

Expression monitoring by hybridization to high density oligonucleotide arrays

[...]

David J. Lockhart¹, Eugene L. Brown¹, Gordon G. Wong¹, Mark S. Chee¹, Thomas R. Gingeras¹ - Show less +1 more•Institutions (1)

Affymetrix¹

13 Sep 1996-Nature Biotechnology

TL;DR: In this article, the authors proposed a method for monitoring the expression levels of a multiplicity of genes by hybridizing a nucleic acid sample to a high density array of oligonucleotide probes and quantifying the hybridized nucleic acids in the array.

...read moreread less

Abstract: This invention provides methods of monitoring the expression levels of a multiplicity of genes. The methods involve hybridizing a nucleic acid sample to a high density array of oligonucleotide probes where the high density array contains oligonucleotide probes complementary to subsequences of target nucleic acids in the nucleic acid sample. In one embodiment, the method involves providing a pool of target nucleic acids comprising RNA transcripts of one or more target genes, or nucleic acids derived from the RNA transcripts, hybridizing said pool of nucleic acids to an array of oligonucleotide probes immobilized on surface, where the array comprising more than 100 different oligonucleotides and each different oligonucleotide is localized in a predetermined region of the surface, the density of the different oligonucleotides is greater than about 60 different oligonucleotides per 1 cm2, and the oligonucleotide probes are complementary to the RNA transcripts or nucleic acids derived from the RNA transcripts; and quantifying the hybridized nucleic acids in the array.

...read moreread less

4,382 citations

Journal Article•DOI•

Nonparametric Statistical Methods

[...]

M. Knott¹, Myles Hollander, Douglas A. Wolfe•Institutions (1)

London School of Economics and Political Science¹

01 Mar 1974

3,841 citations