Journal ArticleDOI

Large scale comparison of global gene expression patterns in human and mouse

23 Dec 2010-Genome Biology (BioMed Central)-Vol. 11, Iss: 12, pp 1-11
TL;DR: The results indicate that the global patterns of tissue-specific expression of orthologous genes are conserved in human and mouse.
Abstract: It is widely accepted that orthologous genes between species are conserved at the sequence level and perform similar functions in different organisms. However, the level of conservation of gene expression patterns of the orthologous genes in different species has been unclear. To address the issue, we compared gene expression of orthologous genes based on 2,557 human and 1,267 mouse samples with high quality gene expression data, selected from experiments stored in the public microarray repository ArrayExpress. In a principal component analysis (PCA) of combined data from human and mouse samples merged on orthologous probesets, samples largely form distinctive clusters based on their tissue sources when projected onto the top principal components. The most prominent groups are the nervous system, muscle/heart tissues, liver and cell lines. Despite the great differences in sample characteristics and experiment conditions, the overall patterns of these prominent clusters are strikingly similar for human and mouse. We further analyzed data for each tissue separately and found that the most variable genes in each tissue are highly enriched with human-mouse tissue-specific orthologs and the least variable genes in each tissue are enriched with human-mouse housekeeping orthologs. The results indicate that the global patterns of tissue-specific expression of orthologous genes are conserved in human and mouse. The expression of groups of orthologous genes co-varies in the two species, both for the most variable genes and the most ubiquitously expressed genes.
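
The combined-PCA step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic, the helper names (`pca_project`, `simulate_species`) are hypothetical, and the per-species centering before merging is an assumption made here to remove a species-wide offset. Columns are taken to be already aligned by ortholog pair.

```python
import numpy as np


def pca_project(expr, n_components=2):
    """Project samples (rows) onto the top principal components.

    expr: array of shape (n_samples, n_genes).
    Returns (scores, explained_variance_ratio).
    """
    centered = expr - expr.mean(axis=0)  # center each gene (column)
    # SVD of the centered matrix; rows of vt are the principal axes.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    var = s ** 2
    return scores, var[:n_components] / var.sum()


# Synthetic stand-in data (hypothetical): 3 tissues, 50 ortholog pairs,
# 10 samples per tissue in each species.
rng = np.random.default_rng(0)
n_genes = 50
tissue_profiles = rng.normal(0.0, 3.0, size=(3, n_genes))
tissue_labels = np.repeat(np.arange(3), 10)


def simulate_species(offset):
    """Shared tissue profiles plus a species-wide offset and sample noise."""
    noise = rng.normal(0.0, 0.5, size=(tissue_labels.size, n_genes))
    return tissue_profiles[tissue_labels] + offset + noise


human = simulate_species(offset=0.0)
mouse = simulate_species(offset=1.0)

# Assumption: center each species separately, then stack the samples so
# that one PCA is fitted on the merged matrix.
combined = np.vstack([human - human.mean(axis=0), mouse - mouse.mean(axis=0)])
scores, ratio = pca_project(combined, n_components=2)

# Compare per-tissue centroids of the two species in PC space.
for t in range(3):
    h = scores[:30][tissue_labels == t].mean(axis=0)
    m = scores[30:][tissue_labels == t].mean(axis=0)
    print(f"tissue {t}: human {h.round(2)}, mouse {m.round(2)}")
```

In this toy setting, samples from the same tissue land close together regardless of species, which is the qualitative pattern the abstract reports for real data.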


Citations
Journal ArticleDOI
17 Oct 2013-PLOS ONE
TL;DR: Using PhysioSpace with clinical cancer datasets reveals that such data exhibits large heterogeneity in the number of significant signature associations, indicating shared biological functionalities in disease associated processes.
Abstract: Relating expression signatures from different sources such as cell lines, in vitro cultures from primary cells and biopsy material is an important task in drug development and translational medicine as well as for tracking of cell fate and disease progression. Especially the comparison of large scale gene expression changes to tissue or cell type specific signatures is of high interest for the tracking of cell fate in (trans-) differentiation experiments and for cancer research, which increasingly focuses on shared processes and the involvement of the microenvironment. These signature relation approaches require robust statistical methods to account for the high biological heterogeneity in clinical data and must cope with small sample sizes in lab experiments and common patterns of co-expression in ubiquitous cellular processes. We describe a novel method, called PhysioSpace, to position dynamics of time series data derived from cellular differentiation and disease progression in a genome-wide expression space. The PhysioSpace is defined by a compendium of publicly available gene expression signatures representing a large set of biological phenotypes. The mapping of gene expression changes onto the PhysioSpace leads to a robust ranking of physiologically relevant signatures, as rigorously evaluated via sample-label permutations. A spherical transformation of the data improves the performance, leading to stable results even in case of small sample sizes. Using PhysioSpace with clinical cancer datasets reveals that such data exhibits large heterogeneity in the number of significant signature associations. This behavior was closely associated with the classification endpoint and cancer type under consideration, indicating shared biological functionalities in disease associated processes. 
Even though the time series data of cell line differentiation exhibited responses in larger clusters covering several biologically related patterns, top scoring patterns were highly consistent with a priori known biological information and separated from the rest of response patterns.

21 citations

Journal ArticleDOI
TL;DR: The proposed method works on the observation that noisy data lie on a higher dimension space even though the actual data are embedded in a low dimensional manifold and is able to reduce the noise by around 70%.
Abstract: In this paper, we investigate the problem of sensor node localization and propose a non-linear semi-supervised noise minimization algorithm through iterative manifold learning. The method works on the observation that noisy data lie in a higher-dimensional space even though the actual data are embedded in a low-dimensional manifold. The collective labeled and unlabeled data are represented as a weighted graph. A prediction function is created based on the available labeled data along with manifold learning to exploit the intrinsic geometry. On top of the prediction function, an iterative feedback mechanism is used, which incrementally flattens the higher-dimensional manifold. This reduces the error boundary in each stage for every data point. The results were found to converge after a few iterations. This is followed by localized Procrustes analysis to further reduce the error. Experiments using TelosB motes and simulation with labeled and unlabeled data show that the proposed technique is able to reduce the noise, on average, by around 70%. Results also show that the mechanism is able to localize the sensor nodes with high accuracy and outperforms the baseline method and LapRLS in different conditions.

20 citations


Cites methods from "Large scale comparison of global ge..."

  • ...Linear manifold learning methods like PCA [22]–[25] and MDS [26]–[29] are not suitable for non-linear data....

    [...]

Journal ArticleDOI
TL;DR: A new type of standardized datasets representative for the spatial and temporal dimensions of gene expression result from integrating expression data from a large number of globally normalized and quality controlled public experiments.
Abstract: Reference datasets are often used to compare, interpret or validate experimental data and analytical methods. In the field of gene expression, several reference datasets have been published. Typically, they consist of individual baseline or spike-in experiments carried out in a single laboratory and representing a particular set of conditions. Here, we describe a new type of standardized datasets representative for the spatial and temporal dimensions of gene expression. They result from integrating expression data from a large number of globally normalized and quality controlled public experiments. Expression data is aggregated by anatomical part or stage of development to yield a representative transcriptome for each category. For example, we created a genome-wide expression dataset representing the FDA tissue panel across 35 tissue types. The proposed datasets were created for human and several model organisms and are publicly available at http://www.expressiondata.org.

20 citations


Cites result from "Large scale comparison of global ge..."

  • ...These results confirm previous findings on comparing human and mouse tissues based on datasets that were normalized differently and in which tissue samples are represented individually [18]....

    [...]

Journal ArticleDOI
TL;DR: Evidence is shown that tissue expression profiles, if combined with sequence similarity, can improve the correct assignment of functionally related homologs across species and demonstrate that tissue-specific regulation is the main determinant of transcriptome composition and is highly conserved across mammalian species.
Abstract: Predicting molecular responses in human by extrapolating results from model organisms requires a precise understanding of the architecture and regulation of biological mechanisms across species. Here, we present a large-scale comparative analysis of organ and tissue transcriptomes involving the three mammalian species human, mouse and rat. To this end, we created a unique, highly standardized compendium of tissue expression. Representative tissue specific datasets were aggregated from more than 33,900 Affymetrix expression microarrays. For each organism, we created two expression datasets covering over 55 distinct tissue types with curated data from two independent microarray platforms. Principal component analysis (PCA) revealed that the tissue-specific architecture of transcriptomes is highly conserved between human, mouse and rat. Moreover, tissues with related biological function clustered tightly together, even if the underlying data originated from different labs and experimental settings. Overall, the expression variance caused by tissue type was approximately 10 times higher than the variance caused by perturbations or diseases, except for a subset of cancers and chemicals. Pairs of gene orthologs exhibited higher expression correlation between mouse and rat than with human. Finally, we show evidence that tissue expression profiles, if combined with sequence similarity, can improve the correct assignment of functionally related homologs across species. The results demonstrate that tissue-specific regulation is the main determinant of transcriptome composition and is highly conserved across mammalian species.

20 citations


Cites background or result from "Large scale comparison of global ge..."

  • ...For example, some studies suggested that orthologous genes have dissimilar expression patterns [3,4,9-11], while others reported congruent expression profiles [5-7,12-17]....

    [...]

  • ...[6,20], although here each category in the plot represents an average vector aggregated from a population of samples rather than plotting individual samples in the PCA....

    [...]

  • ...[5,7,16]), or to a larger but only partly overlapping set of tissues between human and mouse [6]....

    [...]

  • ...Most of these studies were restricted to comparing the human and mouse transcriptomes, thereby limiting the interpretation to a bilateral relationship without evidence from further organisms [3-7]....

    [...]

  • ...This result confirms previous findings in a comparison of human and mouse [6,21]....

    [...]

Journal ArticleDOI
TL;DR: The authors studied the role of positive and relaxed selection in the evolution of Cardamine genes and concluded that the selective pressures associated with the habitats typical of C. resedifolia may have caused the rapid evolution of genes involved in cold response.
Abstract: Elucidating the selective and neutral forces underlying molecular evolution is fundamental to understanding the genetic basis of adaptation. Plants have evolved a suite of adaptive responses to cope with variable environmental conditions, but relatively little is known about which genes are involved in such responses. Here we studied molecular evolution on a genome-wide scale in two species of Cardamine with distinct habitat preferences: C. resedifolia, found at high altitudes, and C. impatiens, found at low altitudes. Our analyses focussed on genes that are involved in stress responses to two factors that differentiate the high- and low-altitude habitats, namely temperature and irradiation. High-throughput sequencing was used to obtain gene sequences from C. resedifolia and C. impatiens. Using the available A. thaliana gene sequences and annotation, we identified nearly 3,000 triplets of putative orthologues, including genes involved in cold response, photosynthesis or in general stress responses. By comparing estimated rates of molecular substitution, codon usage, and gene expression in these species with those of Arabidopsis, we were able to evaluate the role of positive and relaxed selection in driving the evolution of Cardamine genes. Our analyses revealed a statistically significant higher rate of molecular substitution in C. resedifolia than in C. impatiens, compatible with more efficient positive selection in the former. Conversely, the genome-wide level of selective pressure is compatible with more relaxed selection in C. impatiens. Moreover, levels of selective pressure were heterogeneous between functional classes and between species, with cold responsive genes evolving particularly fast in C. resedifolia, but not in C. impatiens. 
Overall, our comparative genomic analyses revealed that differences in effective population size might contribute to the differences in the rate of protein evolution and in the levels of selective pressure between the C. impatiens and C. resedifolia lineages. The within-species analyses also revealed evolutionary patterns associated with habitat preference of two Cardamine species. We conclude that the selective pressures associated with the habitats typical of C. resedifolia may have caused the rapid evolution of genes involved in cold response.

18 citations

References
Journal ArticleDOI
TL;DR: The Gene Set Enrichment Analysis (GSEA) method as discussed by the authors focuses on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation.
Abstract: Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.

34,830 citations

Journal ArticleDOI
TL;DR: There is no obvious downside to using RMA and attaching a standard error (SE) to this quantity using a linear model which removes probe-specific affinities, and the exploratory data analyses of the probe level data motivate a new summary measure that is a robust multi-array average (RMA) of background-adjusted, normalized, and log-transformed PM values.
Abstract: In this paper we report exploratory analyses of high-density oligonucleotide array data from the Affymetrix GeneChip® system with the objective of improving upon currently used measures of gene expression. Our analyses make use of three data sets: a small experimental study consisting of five MGU74A mouse GeneChip® arrays, part of the data from an extensive spike-in study conducted by Gene Logic and Wyeth’s Genetics Institute involving 95 HG-U95A human GeneChip® arrays; and part of a dilution study conducted by Gene Logic involving 75 HG-U95A GeneChip® arrays. We display some familiar features of the perfect match and mismatch probe (PM and MM) values of these data, and examine the variance–mean relationship with probe-level data from probes believed to be defective, and so delivering noise only. We explain why we need to normalize the arrays to one another using probe level intensities. We then examine the behavior of the PM and MM using spike-in data and assess three commonly used summary measures: Affymetrix’s (i) average difference (AvDiff) and (ii) MAS 5.0 signal, and (iii) the Li and Wong multiplicative model-based expression index (MBEI). The exploratory data analyses of the probe level data motivate a new summary measure that is a robust multiarray average (RMA) of background-adjusted, normalized, and log-transformed PM values. We evaluate the four expression summary measures using the dilution study data, assessing their behavior in terms of bias, variance and (for MBEI and RMA) model fit. Finally, we evaluate the algorithms in terms of their ability to detect known levels of differential expression using the spike-in data. We conclude that there is no obvious downside to using RMA and attaching a standard error (SE) to this quantity using a linear model which removes probe-specific affinities.

10,711 citations


"Large scale comparison of global ge..." refers methods in this paper

  • ...The resulting 1,323 CEL files were pre-processed using Bioconductor’s RMA package [32] to create an integrated, normalized data matrix....

    [...]

Journal ArticleDOI
TL;DR: The authors designed custom high-density oligonucleotide arrays that interrogate the expression of the vast majority of protein-encoding human and mouse genes and used them to profile a panel of 79 human and 61 mouse tissues.
Abstract: The tissue-specific pattern of mRNA expression can indicate important clues about gene function. High-density oligonucleotide arrays offer the opportunity to examine patterns of gene expression on a genome scale. Toward this end, we have designed custom arrays that interrogate the expression of the vast majority of protein-encoding human and mouse genes and have used them to profile a panel of 79 human and 61 mouse tissues. The resulting data set provides the expression patterns for thousands of predicted genes, as well as known and poorly characterized genes, from mice and humans. We have explored this data set for global trends in gene expression, evaluated commonly used lines of evidence in gene prediction methodologies, and investigated patterns indicative of chromosomal organization of transcription. We describe hundreds of regions of correlated transcription and show that some are subject to both tissue and parental allele-specific expression, suggesting a link between spatial expression and imprinting.

3,513 citations


"Large scale comparison of global ge..." refers background or result in this paper

  • ...While studies suggested that orthologous genes do not share similar expression patterns [1-5], other groups reported the opposite observations [6-9]....

    [...]

  • ...Alternatively, many other studies made use of species-specific arrays to identify coexpressed groups of orthologous genes [4-6,16,17]....

    [...]

Journal ArticleDOI
TL;DR: The ability of the trained ANN models to recognize SRBCTs is demonstrated, and the potential applications of these methods for tumor diagnosis and the identification of candidate targets for therapy are demonstrated.
Abstract: The purpose of this study was to develop a method of classifying cancers to specific diagnostic categories based on their gene expression signatures using artificial neural networks (ANNs). We trained the ANNs using the small, round blue-cell tumors (SRBCTs) as a model. These cancers belong to four distinct diagnostic categories and often present diagnostic dilemmas in clinical practice. The ANNs correctly classified all samples and identified the genes most relevant to the classification. Expression of several of these genes has been reported in SRBCTs, but most have not been associated with these cancers. To test the ability of the trained ANN models to recognize SRBCTs, we analyzed additional blinded samples that were not previously used for the training procedure, and correctly classified them in all cases. This study demonstrates the potential applications of these methods for tumor diagnosis and the identification of candidate targets for therapy.

2,683 citations


"Large scale comparison of global ge..." refers methods in this paper

  • ...PCA has been often used to study high-dimensional data generated by genome-wide gene expression studies [22-25]....

    [...]

Book
27 Jan 2006
TL;DR: In this article, the authors present a detailed case study of R algorithms with publicly available data, and a major section of the book is devoted to fully worked case studies, with a companion website where readers can reproduce every number, figure and table on their own computers.
Abstract: Full four-color book. Some of the editors created the Bioconductor project and Robert Gentleman is one of the two originators of R. All methods are illustrated with publicly available data, and a major section of the book is devoted to fully worked case studies. Code underlying all of the computations that are shown is made available on a companion website, and readers can reproduce every number, figure, and table on their own computers.

2,625 citations
