Journal ArticleDOI

Using predictive specificity to determine when gene set analysis is biologically meaningful

TL;DR: It is shown that heavily annotated (‘multifunctional’) genes are likely to appear in genomics study results and drive the generation of biologically non-specific enrichment results as well as highly fragile significances.
Abstract: Gene set analysis, which translates gene lists into enriched functions, is among the most common bioinformatic methods. Yet few would advocate taking the results at face value. Not only is there no agreement on the algorithms themselves, there is no agreement on how to benchmark them. In this paper, we evaluate the robustness and uniqueness of enrichment results as a means of assessing methods even where correctness is unknown. We show that heavily annotated (‘multifunctional’) genes are likely to appear in genomics study results and drive the generation of biologically non-specific enrichment results as well as highly fragile significances. By providing a means of determining where enrichment analyses report non-specific and non-robust findings, we are able to assess where we can be confident in their use. We find significant progress in recent bias correction methods for enrichment and provide our own software implementation. Our approach can be readily adapted to any pre-existing package.
Citations
Journal ArticleDOI
TL;DR: This protocol describes pathway enrichment analysis of gene lists from RNA-seq and other genomics experiments using g:Profiler, GSEA, Cytoscape and EnrichmentMap software, and describes innovative visualization techniques.
Abstract: Pathway enrichment analysis helps researchers gain mechanistic insight into gene lists generated from genome-scale (omics) experiments. This method identifies biological pathways that are enriched in a gene list more than would be expected by chance. We explain the procedures of pathway enrichment analysis and present a practical step-by-step guide to help interpret gene lists resulting from RNA-seq and genome-sequencing experiments. The protocol comprises three major steps: definition of a gene list from omics data, determination of statistically enriched pathways, and visualization and interpretation of the results. We describe how to use this protocol with published examples of differentially expressed genes and mutated cancer genes; however, the principles can be applied to diverse types of omics data. The protocol describes innovative visualization techniques, provides comprehensive background and troubleshooting guidelines, and uses freely available and frequently updated software, including g:Profiler, Gene Set Enrichment Analysis (GSEA), Cytoscape and EnrichmentMap. The complete protocol can be performed in ~4.5 h and is designed for use by biologists with no prior bioinformatics training.
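The idea of a pathway being "enriched in a gene list more than would be expected by chance" can be sketched with a one-sided hypergeometric (over-representation) test. This is a minimal illustration of one common formulation, not the procedure of any specific tool named above; the function name and its parameters are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_in_set, n_hits, n_overlap):
    """Upper-tail hypergeometric p-value P(X >= n_overlap).

    n_universe: number of genes in the background (assumed well defined)
    n_in_set:   number of background genes annotated to the pathway
    n_hits:     size of the gene list from the experiment
    n_overlap:  number of hit genes that are annotated to the pathway
    """
    total = comb(n_universe, n_hits)
    upper = min(n_in_set, n_hits)
    # Sum the probability of every overlap at least as large as the observed one.
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_hits - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 5 in the pathway, 5 hits, all 5 overlapping.
p = hypergeom_enrichment_p(10, 5, 5, 5)
```

In practice the p-values from many pathways would then need a multiple-testing correction, and the choice of background strongly affects the result.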

958 citations

Journal ArticleDOI
TL;DR: It is suggested that GO evolution may have affected the interpretation and possibly reproducibility of experiments over time and researchers must exercise caution when interpreting GO enrichment analyses and should reexamine previous analyses with the most recent GO version.
Abstract: Gene Ontology (GO) enrichment analysis is ubiquitously used for interpreting high throughput molecular data and generating hypotheses about underlying biological phenomena of experiments. However, the two building blocks of this analysis - the ontology and the annotations - evolve rapidly. We used gene signatures derived from 104 disease analyses to systematically evaluate how enrichment analysis results were affected by evolution of the GO over a decade. We found low consistency between enrichment analyses results obtained with early and more recent GO versions. Furthermore, there continues to be a strong annotation bias in the GO annotations where 58% of the annotations are for 16% of the human genes. Our analysis suggests that GO evolution may have affected the interpretation and possibly reproducibility of experiments over time. Hence, researchers must exercise caution when interpreting GO enrichment analyses and should reexamine previous analyses with the most recent GO version.

94 citations

Journal ArticleDOI
TL;DR: In this article, the authors show that conventional gene category enrichment analysis (GCEA) applied to brain-wide atlas data yields biased results and develop a flexible ensemble-based null model framework to enable appropriate inference in GCEA.
Abstract: Transcriptomic atlases have improved our understanding of the correlations between gene-expression patterns and spatially varying properties of brain structure and function. Gene-category enrichment analysis (GCEA) is a common method to identify functional gene categories that drive these associations, using gene-to-category annotation systems like the Gene Ontology (GO). Here, we show that applying standard GCEA methodology to spatial transcriptomic data is affected by substantial false-positive bias, with GO categories displaying an over 500-fold average inflation of false-positive associations with random neural phenotypes in mouse and human. The estimated false-positive rate of a GO category is associated with its rate of being reported as significantly enriched in the literature, suggesting that published reports are affected by this false-positive bias. We show that within-category gene–gene coexpression and spatial autocorrelation are key drivers of the false-positive bias and introduce flexible ensemble-based null models that can account for these effects, made available as a software toolbox. Identifying enriched gene sets in transcriptomic data is routine analysis. Here, the authors show that conventional gene category enrichment analysis (GCEA) applied to brain-wide atlas data yields biased results and develop a flexible ensemble-based null model framework to enable appropriate inference in GCEA.

43 citations

Book ChapterDOI
TL;DR: This chapter discusses cell migration and neurite outgrowth and the role of these processes in neurodevelopment and NDDs, covering their roles in the formation of the brain and how errors in these processes affect brain development.
Abstract: Despite decades of study, elucidation of the underlying etiology of complex developmental disorders such as autism spectrum disorder (ASD), schizophrenia (SCZ), intellectual disability (ID), and bipolar disorder (BPD) has been hampered by the inability to study human neurons, the heterogeneity of these disorders, and the relevance of animal model systems. Moreover, a majority of these developmental disorders have multifactorial or idiopathic (unknown) causes, making them difficult to model using traditional methods of genetic alteration. Examination of the brains of individuals with ASD and other developmental disorders in both post-mortem and MRI studies shows defects that are suggestive of dysregulation of embryonic and early postnatal development. For ASD, more recent genetic studies have also suggested that risk genes largely converge upon the developing human cerebral cortex between weeks 8 and 24 in utero. Yet, an overwhelming majority of studies in autism rodent models have focused on postnatal development or adult synaptic transmission defects in autism related circuits. Thus, studies looking at early developmental processes such as proliferation, cell migration, and early differentiation, which are essential to build the brain, are largely lacking. Yet, interestingly, a few studies that did assess early neurodevelopment found that alterations in brain structure and function associated with neurodevelopmental disorders (NDDs) begin as early as the initial formation and patterning of the neural tube. By the early to mid-2000s, the derivation of human embryonic stem cells (hESCs) and later induced pluripotent stem cells (iPSCs) allowed us to study living human neural cells in culture for the first time. Specifically, iPSCs gave us the unprecedented ability to study cells derived from individuals with idiopathic disorders. Studies indicate that iPSC-derived neural cells, whether precursors or "matured" neurons, largely resemble cortical cells of embryonic humans from weeks 8 to 24. Thus, these cells are an excellent model to study early human neurodevelopment, particularly in the context of genetically complex diseases. Indeed, since 2011, numerous studies have assessed developmental phenotypes in neurons derived from individuals with both genetic and idiopathic forms of ASD and other NDDs. However, while iPSC-derived neurons are fetal in nature, they are post-mitotic and thus cannot be used to study developmental processes that occur before terminal differentiation. Moreover, it is important to note that during the 8-24-week window of human neurodevelopment, neural precursor cells are actively undergoing proliferation, migration, and early differentiation to form the basic cytoarchitecture of the brain. Thus, by studying NPCs specifically, we could gain insight into how early neurodevelopmental processes contribute to the pathogenesis of NDDs. Indeed, a few studies have explored NPC phenotypes in NDDs and have uncovered dysregulations in cell proliferation. Yet, few studies have explored migration and early differentiation phenotypes of NPCs in NDDs. In this chapter, we will discuss cell migration and neurite outgrowth and the role of these processes in neurodevelopment and NDDs. We will begin by reviewing the processes that are important in early neurodevelopment and early cortical development. We will then delve into the roles of neurite outgrowth and cell migration in the formation of the brain and how errors in these processes affect brain development. We also provide review of a few key molecules that are involved in the regulation of neurite outgrowth and migration while discussing how dysregulations in these molecules can lead to abnormalities in brain structure and function, thereby highlighting their contribution to pathogenesis of NDDs. Then we will discuss whether neurite outgrowth, migration, and the molecules that regulate these processes are associated with ASD. Lastly, we will review the utility of iPSCs in modeling NDDs and discuss future goals for the study of NDDs using this technology.

30 citations

Journal ArticleDOI
TL;DR: It is demonstrated that salamanders regulate water loss using temperature-sensitive gene expression related to blood vessel regeneration and skin lipids, indicating that tissue regeneration may be used for physiological purposes beyond replacing lost limbs.
Abstract: Organisms rely upon external cues to avoid detrimental conditions during environmental change. Rapid water loss, or desiccation, is a universal threat for terrestrial plants and animals, especially under climate change, but the cues that facilitate plastic responses to avoid desiccation are unclear. We integrate acclimation experiments with gene expression analyses to identify the cues that regulate resistance to water loss at the physiological and regulatory level in a montane salamander (Plethodon metcalfi). Here we show that temperature is an important cue for developing a desiccation-resistant phenotype and might act as a reliable cue for organisms across the globe. Gene expression analyses consistently identify regulation of stem cell differentiation and embryonic development of vasculature. The temperature-sensitive blood vessel development suggests that salamanders regulate water loss through the regression and regeneration of capillary beds in the skin, indicating that tissue regeneration may be used for physiological purposes beyond replacing lost limbs. Climate change will threaten plants and animals across the planet by increasing the risk of desiccation. Here, authors demonstrate that salamanders regulate water loss using temperature-sensitive gene expression related to blood vessel regeneration and skin lipids.

26 citations

References
Journal ArticleDOI
TL;DR: In this paper, a different approach to problems of multiple significance testing is presented, which calls for controlling the expected proportion of falsely rejected hypotheses (the false discovery rate), which is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
Abstract: The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.
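The sequential procedure mentioned in this abstract is commonly stated as a step-up rule: sort the m p-values, find the largest rank k with p_(k) <= (k/m) * q, and reject the k smallest. A minimal sketch under that standard formulation follows; the function name and the example p-values are illustrative assumptions, not taken from the paper.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True where the hypothesis is rejected
    by the step-up rule at target false discovery rate q."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k whose p-value satisfies p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max, even if some of them
    # individually missed their own threshold (the "step-up" behaviour).
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Made-up p-values for ten hypotheses.
decisions = benjamini_hochberg(
    [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
)
```

With these made-up values only the two smallest p-values are rejected, whereas an uncorrected 0.05 cutoff would reject five.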

83,420 citations


"Using predictive specificity to det..." refers methods in this paper

  • ...Most methods perform their own multiple hypothesis test corrections, and when able, we specified for Benjamini–Hochberg....

    [...]

  • ...The false discovery rate (FDR) was controlled using the method of Benjamini and Hochberg (28)....

    [...]

Journal ArticleDOI
TL;DR: The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.
Abstract: Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.

35,225 citations


"Using predictive specificity to det..." refers background in this paper

  • ...GO (5) with its annotations (31)) capture the relationships of the detected genes to those functions, we would expect an enrichment analysis to rank those functions highly....

    [...]

  • ...Gene Ontology (GO) (5), KEGG (6) or OMIM (7))....

    [...]

Journal ArticleDOI
TL;DR: The Gene Set Enrichment Analysis (GSEA) method as discussed by the authors focuses on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation.
Abstract: Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.

34,830 citations

Journal ArticleDOI
TL;DR: By following this protocol, investigators are able to gain an in-depth understanding of the biological themes in lists of genes that are enriched in genome-scale studies.
Abstract: DAVID bioinformatics resources consists of an integrated biological knowledgebase and analytic tools aimed at systematically extracting biological meaning from large gene/protein lists. This protocol explains how to use DAVID, a high-throughput and integrated data-mining environment, to analyze gene lists derived from high-throughput genomic experiments. The procedure first requires uploading a gene list containing any number of common gene identifiers followed by analysis using one or more text and pathway-mining tools such as gene functional classification, functional annotation chart or clustering and functional annotation table. By following this protocol, investigators are able to gain an in-depth understanding of the biological themes in lists of genes that are enriched in genome-scale studies.

31,015 citations

Journal ArticleDOI
TL;DR: The Kyoto Encyclopedia of Genes and Genomes (KEGG) as discussed by the authors is a knowledge base for systematic analysis of gene functions in terms of the networks of genes and molecules.
Abstract: Kyoto Encyclopedia of Genes and Genomes (KEGG) is a knowledge base for systematic analysis of gene functions in terms of the networks of genes and molecules. The major component of KEGG is the PATHWAY database that consists of graphical diagrams of biochemical pathways including most of the known metabolic pathways and some of the known regulatory pathways. The pathway information is also represented by the ortholog group tables summarizing orthologous and paralogous gene groups among different organisms. KEGG maintains the GENES database for the gene catalogs of all organisms with complete genomes and selected organisms with partial genomes, which are continuously re-annotated, as well as the LIGAND database for chemical compounds and enzymes. Each gene catalog is associated with the graphical genome map for chromosomal locations that is represented by Java applet. In addition to the data collection efforts, KEGG develops and provides various computational tools, such as for reconstructing biochemical pathways from the complete genome sequence and for predicting gene regulatory networks from the gene expression profiles. The KEGG databases are daily updated and made freely available (http://www.genome.ad.jp/kegg/).

24,024 citations
