scispace - formally typeset
Search or ask a question
Journal ArticleDOI

Large scale comparison of global gene expression patterns in human and mouse

23 Dec 2010-Genome Biology (BioMed Central)-Vol. 11, Iss: 12, pp 1-11
TL;DR: The results indicate that the global patterns of tissue-specific expression of orthologous genes are conserved in human and mouse.
Abstract: It is widely accepted that orthologous genes between species are conserved at the sequence level and perform similar functions in different organisms. However, the level of conservation of gene expression patterns of the orthologous genes in different species has been unclear. To address the issue, we compared gene expression of orthologous genes based on 2,557 human and 1,267 mouse samples with high quality gene expression data, selected from experiments stored in the public microarray repository ArrayExpress. In a principal component analysis (PCA) of combined data from human and mouse samples merged on orthologous probesets, samples largely form distinctive clusters based on their tissue sources when projected onto the top principal components. The most prominent groups are the nervous system, muscle/heart tissues, liver and cell lines. Despite the great differences in sample characteristics and experiment conditions, the overall patterns of these prominent clusters are strikingly similar for human and mouse. We further analyzed data for each tissue separately and found that the most variable genes in each tissue are highly enriched with human-mouse tissue-specific orthologs and the least variable genes in each tissue are enriched with human-mouse housekeeping orthologs. The results indicate that the global patterns of tissue-specific expression of orthologous genes are conserved in human and mouse. The expression of groups of orthologous genes co-varies in the two species, both for the most variable genes and the most ubiquitously expressed genes.

Content maybe subject to copyright    Report

Citations
More filters
Journal ArticleDOI
TL;DR: In this paper , a comparative analysis of protein expression between mouse, rat and human tissues was carried out, showing a high level of correlation among orthologs between all three species in brain, kidney, heart and liver samples.
Abstract: The increasingly large amount of proteomics data in the public domain enables, among other applications, the combined analyses of datasets to create comparative protein expression maps covering different organisms and different biological conditions. Here we have reanalysed public proteomics datasets from mouse and rat tissues (14 and 9 datasets, respectively), to assess baseline protein abundance. Overall, the aggregated dataset contained 23 individual datasets, including a total of 211 samples coming from 34 different tissues across 14 organs, comprising 9 mouse and 3 rat strains, respectively. In all cases, we studied the distribution of canonical proteins between the different organs. The number of canonical proteins per dataset ranged from 273 (tendon) and 9,715 (liver) in mouse, and from 101 (tendon) and 6,130 (kidney) in rat. Then, we studied how protein abundances compared across different datasets and organs for both species. As a key point we carried out a comparative analysis of protein expression between mouse, rat and human tissues. We observed a high level of correlation of protein expression among orthologs between all three species in brain, kidney, heart and liver samples, whereas the correlation of protein expression was generally slightly lower between organs within the same species. Protein expression results have been integrated into the resource Expression Atlas for widespread dissemination.

5 citations

Journal ArticleDOI
TL;DR: In this paper , a cross-species comparison of transcriptomes between humans and cattle was conducted to elucidating evolutionary molecular mechanisms underpinning phenotypic variation between and within species, which can help decipher the genetic and evolutionary basis of complex traits in both species.
Abstract: Abstract Background Cross-species comparison of transcriptomes is important for elucidating evolutionary molecular mechanisms underpinning phenotypic variation between and within species, yet to date it has been essentially limited to model organisms with relatively small sample sizes. Results Here, we systematically analyze and compare 10,830 and 4866 publicly available RNA-seq samples in humans and cattle, respectively, representing 20 common tissues. Focusing on 17,315 orthologous genes, we demonstrate that mean/median gene expression, inter-individual variation of expression, expression quantitative trait loci, and gene co-expression networks are generally conserved between humans and cattle. By examining large-scale genome-wide association studies for 46 human traits (average n = 327,973) and 45 cattle traits (average n = 24,635), we reveal that the heritability of complex traits in both species is significantly more enriched in transcriptionally conserved than diverged genes across tissues. Conclusions In summary, our study provides a comprehensive comparison of transcriptomes between humans and cattle, which might help decipher the genetic and evolutionary basis of complex traits in both species.

5 citations

DissertationDOI
01 Jan 2013
TL;DR: An annotation pipeline is developed, which can effectively identify IncRNAs in entire transcriptomes, and it is demonstrated that positional conservation of lncRNAs with a flanking coding gene is generally independent from the conservation of the lncRNA expression with respect to the coding gene.
Abstract: Long non-coding RNAs (IncRNAs) are a biological entity defined by what they are not, rather than by what they are. This indicates that our knowledge about them is sensibly limited. The aim of my PhD is to gain insights into the evolution and the functions of IncRNAs through computational approaches and the usage of large scale functional genomics dataset. I developed an annotation pipeline, which can effectively identify IncRNAs in entire transcriptomes. The pipeline is able to accurately annotate the coding genes while predicting a conservative estimate of the IncRNA population. It allowed me to show, for the first time, the presence of lncRNA transcription in a diverse range of organisms. Further, I analysed sequence and positional conservation of lncRNAs, demonstrating the presence of short segments of conserved sequence in IncRNAs and the existence of several syntenically conserved non-coding transcripts over large evolutionary distances. However, I also demonstrate that positional conservation of lncRNAs with a flanking coding gene is generally independent from the conservation of the lncRNA expression with respect to the coding gene. Finally, I have characterised the diversity of lncRNA transcription in specific cells and developmental stages of two teleost fishes. In summary, the work presented in the thesis provides novel findings and contributions in the field of lncRNAomics.

5 citations

Posted ContentDOI
01 Jun 2017-bioRxiv
TL;DR: SATORI enables researchers to seamlessly search, browse, and semantically query data repositories via two visualizations that are highly interconnected with a powerful search interface that is informed by a requirements analysis through a series of semi-structured interviews.
Abstract: The number of data sets in biomedical repositories has grown rapidly over the past decade, providing scientists in fields like genomics and other areas of high-throughput biology with tremendous opportunities to re-use data. Scientists are able to test hypotheses computationally instead of generating their own data, to complement their own data sets with data generated by others, and to conduct meta analyses across many data sets. In order to effectively exploit existing data, it is crucial to understand the content of repositories and to discover data relevant to a question of interest. These are challenging tasks, as most repositories currently only support finding data sets through text-based search of metadata and in some cases also through metadata-based browsing. In order to address these challenges, we have developed SATORI - an ontology-guided visual exploration system - that combines a powerful metadata search with a tree map and a node-link diagram that visualize the repository structure, provide context to retrieved data sets, and serve as an interface to drive semantic querying and browsing of the repository. The requirements for SATORI were derived in semi-structured interviews with biomedical data scientists. We demonstrate its utility by describing several usage scenarios using a stem cell data repository, discoveries we made in the process of developing them, and an evaluation of SATORI with domain experts. We have integrated an open-source, web-based implementation of SATORI in the data repository of the Refinery Platform for biomedical data analysis and visualization (http://refinery-platform.org).

5 citations


Additional excerpts

  • ...[49]....

    [...]

Posted ContentDOI
21 Jul 2020-bioRxiv
TL;DR: XGSEA (Cross-species Gene Set Enrichment Analysis) is proposed, with three steps of GSEA; 2) domain adaptation; and 3) regression, which indicates that XGSEA significantly outperformed three baseline methods and confirmed the reliability of XG SEA.
Abstract: Gene set enrichment analysis (GSEA) has been widely used to identify gene sets with statistically significant difference between cases and controls against a large gene set. GSEA needs both phenotype labels and expression of genes. However, gene expression are assessed more often for model organisms than minor species. More importantly, gene expression could not be measured under specific conditions for human, due to high healthy risk of direct experiments, such as non-approved treatment or gene knockout, and then often substituted by mouse. Thus predicting enrichment significance (on a phenotype) of a given gene set of a species (target, say human), by using gene expression measured under the same phenotype of the other species (source, say mouse) is a vital and challenging problem, which we call CROSS-species Gene Set Enrichment Problem (XGSEP). For XGSEP, we propose XGSEA (Cross-species Gene Set Enrichment Analysis), with three steps of: 1) running GSEA for a source species to obtain enrichment scores and p-values of source gene sets; 2) representing the relation between source and target gene sets by domain adaptation; and 3) using regression to predict p-values of target gene sets, based on the representation in 2). We extensively validated XGSEA by using four real data sets under various settings, proving that XGSEA significantly outperformed three baseline methods. A case study of identifying important human pathways for T cell dysfunction and reprogramming from mouse ATAC-Seq data further confirmed the reliability of XGSEA. Source code is available through https://github.com/LiminLi-xjtu/XGSEA Author summary Gene set enrichment analysis (GSEA) is a powerful tool in the gene sets differential analysis given a ranked gene list. GSEA requires complete data, gene expression with phenotype labels. However, gene expression could not be measured under specific conditions for human, due to high risk of direct experiments, such as non-approved treatment or gene knockout, and then often substituted by mouse. Thus no availability of gene expression leads to more challenging problem, CROSS-species Gene Set Enrichment Problem (XGSEP), in which enrichment significance (on a phenotype) of a given gene set of a species (target, say human) is predicted by using gene expression measured under the same phenotype of the other species (source, say mouse). In this work, we propose XGSEA (Cross-species Gene Set Enrichment Analysis) for XGSEP, with three steps of: 1) GSEA; 2) domain adaptation; and 3) regression. The results of four real data sets and a case study indicate that XGSEA significantly outperformed three baseline methods and confirmed the reliability of XGSEA.

5 citations

References
More filters
Journal ArticleDOI
TL;DR: The Gene Set Enrichment Analysis (GSEA) method as discussed by the authors focuses on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation.
Abstract: Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.

34,830 citations

Journal ArticleDOI
TL;DR: There is no obvious downside to using RMA and attaching a standard error (SE) to this quantity using a linear model which removes probe-specific affinities, and the exploratory data analyses of the probe level data motivate a new summary measure that is a robust multi-array average (RMA) of background-adjusted, normalized, and log-transformed PM values.
Abstract: SUMMARY In this paper we report exploratory analyses of high-density oligonucleotide array data from the Affymetrix GeneChip R � system with the objective of improving upon currently used measures of gene expression. Our analyses make use of three data sets: a small experimental study consisting of five MGU74A mouse GeneChip R � arrays, part of the data from an extensive spike-in study conducted by Gene Logic and Wyeth’s Genetics Institute involving 95 HG-U95A human GeneChip R � arrays; and part of a dilution study conducted by Gene Logic involving 75 HG-U95A GeneChip R � arrays. We display some familiar features of the perfect match and mismatch probe ( PM and MM )v alues of these data, and examine the variance–mean relationship with probe-level data from probes believed to be defective, and so delivering noise only. We explain why we need to normalize the arrays to one another using probe level intensities. We then examine the behavior of the PM and MM using spike-in data and assess three commonly used summary measures: Affymetrix’s (i) average difference (AvDiff) and (ii) MAS 5.0 signal, and (iii) the Li and Wong multiplicative model-based expression index (MBEI). The exploratory data analyses of the probe level data motivate a new summary measure that is a robust multiarray average (RMA) of background-adjusted, normalized, and log-transformed PM values. We evaluate the four expression summary measures using the dilution study data, assessing their behavior in terms of bias, variance and (for MBEI and RMA) model fit. Finally, we evaluate the algorithms in terms of their ability to detect known levels of differential expression using the spike-in data. We conclude that there is no obvious downside to using RMA and attaching a standard error (SE) to this quantity using a linear model which removes probe-specific affinities. ∗ To whom correspondence should be addressed

10,711 citations


"Large scale comparison of global ge..." refers methods in this paper

  • ...The resulting 1,323 CEL files were pre-processed using Bioconductor’s RMA package [32] to create an integrated, normalized data matrix....

    [...]

Journal ArticleDOI
TL;DR: In this paper, high-density oligonucleotide arrays offer the opportunity to examine patterns of gene expression on a genome scale, and the authors have designed custom arrays that interrogate the expression of the vast majority of proteinencoding human and mouse genes and have used them to profile a panel of 79 human and 61 mouse tissues.
Abstract: The tissue-specific pattern of mRNA expression can indicate important clues about gene function. High-density oligonucleotide arrays offer the opportunity to examine patterns of gene expression on a genome scale. Toward this end, we have designed custom arrays that interrogate the expression of the vast majority of protein-encoding human and mouse genes and have used them to profile a panel of 79 human and 61 mouse tissues. The resulting data set provides the expression patterns for thousands of predicted genes, as well as known and poorly characterized genes, from mice and humans. We have explored this data set for global trends in gene expression, evaluated commonly used lines of evidence in gene prediction methodologies, and investigated patterns indicative of chromosomal organization of transcription. We describe hundreds of regions of correlated transcription and show that some are subject to both tissue and parental allele-specific expression, suggesting a link between spatial expression and imprinting.

3,513 citations


"Large scale comparison of global ge..." refers background or result in this paper

  • ...While studies suggested that orthologous genes do not share similar expression patterns [1-5], other groups reported the opposite observations [6-9]....

    [...]

  • ...Alternatively, many other studies made use of species-specific arrays to identify coexpressed groups of orthologous genes [4-6,16,17]....

    [...]

Journal ArticleDOI
TL;DR: The ability of the trained ANN models to recognize SRBCTs is demonstrated, and the potential applications of these methods for tumor diagnosis and the identification of candidate targets for therapy are demonstrated.
Abstract: The purpose of this study was to develop a method of classifying cancers to specific diagnostic categories based on their gene expression signatures using artificial neural networks (ANNs). We trained the ANNs using the small, round blue-cell tumors (SRBCTs) as a model. These cancers belong to four distinct diagnostic categories and often present diagnostic dilemmas in clinical practice. The ANNs correctly classified all samples and identified the genes most relevant to the classification. Expression of several of these genes has been reported in SRBCTs, but most have not been associated with these cancers. To test the ability of the trained ANN models to recognize SRBCTs, we analyzed additional blinded samples that were not previously used for the training procedure, and correctly classified them in all cases. This study demonstrates the potential applications of these methods for tumor diagnosis and the identification of candidate targets for therapy.

2,683 citations


"Large scale comparison of global ge..." refers methods in this paper

  • ...PCA has been often used to study high-dimensional data generated by genome-wide gene expression studies [22-25]....

    [...]

Book
27 Jan 2006
TL;DR: In this article, the authors present a detailed case study of R algorithms with publicly available data, and a major section of the book is devoted to fully worked case studies, with a companion website where readers can reproduce every number, figure and table on their own computers.
Abstract: Full four-color book. Some of the editors created the Bioconductor project and Robert Gentleman is one of the two originators of R. All methods are illustrated with publicly available data, and a major section of the book is devoted to fully worked case studies. Code underlying all of the computations that are shown is made available on a companion website, and readers can reproduce every number, figure, and table on their own computers.

2,625 citations

Related Papers (5)