Reuse of public genome-wide gene expression data

doi:10.1038/NRG3394

Journal Article

Johan Rung¹, Alvis Brazma¹

Wellcome Trust¹

01 Feb 2013, Nature Reviews Genetics (Nat Rev Genet), Vol. 14, Iss. 2, pp. 89-99

TL;DR: This Review discusses the utility of public-domain gene expression data and how researchers are using them, and provides recommendations that can improve the utility of such data.

Abstract: Our understanding of gene expression has changed dramatically over the past decade, largely catalysed by technological developments. High-throughput experiments - microarrays and next-generation sequencing - have generated large amounts of genome-wide gene expression data that are collected in public archives. Added-value databases process, analyse and annotate these data further to make them accessible to every biologist. In this Review, we discuss the utility of the gene expression data that are in the public domain and how researchers are making use of these data. Reuse of public data can be very powerful, but there are many obstacles in data preparation and analysis and in the interpretation of the results. We will discuss these challenges and provide recommendations that we believe can improve the utility of such data.

Citations

Journal Article

K-Profiles: A Nonlinear Clustering Method for Pattern Detection in High Dimensional Data.

Kai Wang¹, Qing Zhao², Jianwei Lu², Tianwei Yu¹

Emory University¹, Tongji University²

03 Aug 2015, BioMed Research International

TL;DR: The authors design the nonlinear K-profiles clustering method, which can be seen as the nonlinear counterpart of the K-means clustering algorithm and has a built-in statistical testing procedure that ensures genes not belonging to any cluster do not impact the estimation of cluster profiles.

Abstract: With modern technologies such as microarray, deep sequencing, and liquid chromatography-mass spectrometry (LC-MS), it is possible to measure the expression levels of thousands of genes/proteins simultaneously to unravel important biological processes. A very first step towards elucidating hidden patterns and understanding the massive data is the application of clustering techniques. Nonlinear relations, which were mostly unutilized in contrast to linear correlations, are prevalent in high-throughput data. In many cases, nonlinear relations can model the biological relationship more precisely and reflect critical patterns in the biological systems. Using the general dependency measure, Distance Based on Conditional Ordered List (DCOL) that we introduced before, we designed the nonlinear K-profiles clustering method, which can be seen as the nonlinear counterpart of the K-means clustering algorithm. The method has a built-in statistical testing procedure that ensures genes not belonging to any cluster do not impact the estimation of cluster profiles. Results from extensive simulation studies showed that K-profiles clustering not only outperformed the traditional linear K-means algorithm, but also presented significantly better performance over our previous General Dependency Hierarchical Clustering (GDHC) algorithm. We further analyzed a gene expression dataset, on which K-profiles clustering generated biologically meaningful results.

1,005 citations

Journal Article

ArrayExpress update—simplifying data submissions

Nikolay Kolesnikov¹, Emma Hastings¹, Maria Keays¹, Olga Melnichuk¹, Y. Amy Tang¹, Eleanor Williams¹, Miroslaw Dylag¹, Natalja Kurbatova¹, Marco Brandizi¹, Tony Burdett¹, Karyn Megy¹, Ekaterina Pilicheva¹, Gabriella Rustici¹, Andrew Tikhonov¹, Helen Parkinson¹, Robert Petryszak¹, Ugis Sarkans¹, Alvis Brazma¹

European Bioinformatics Institute¹

28 Jan 2015, Nucleic Acids Research

TL;DR: The main development over the last two years has been the release of a new data submission tool Annotare, which has reduced the average submission time almost 3-fold and will become the only submission route into ArrayExpress, alongside MAGE-TAB format-based pipelines in the near future.

Abstract: The ArrayExpress Archive of Functional Genomics Data (http://www.ebi.ac.uk/arrayexpress) is an international functional genomics database at the European Bioinformatics Institute (EMBL-EBI) recommended by most journals as a repository for data supporting peer-reviewed publications. It contains data from over 7000 public sequencing and 42 000 array-based studies comprising over 1.5 million assays in total. The proportion of sequencing-based submissions has grown significantly over the last few years and has doubled in the last 18 months, whilst the rate of microarray submissions is growing slightly. All data in ArrayExpress are available in the MAGE-TAB format, which allows robust linking to data analysis and visualization tools and standardized analysis. The main development over the last two years has been the release of a new data submission tool Annotare, which has reduced the average submission time almost 3-fold. In the near future, Annotare will become the only submission route into ArrayExpress, alongside MAGE-TAB format-based pipelines. ArrayExpress is a stable and highly accessed resource. Our future tasks include automation of data flows and further integration with other EMBL-EBI resources for the representation of multi-omics data.

676 citations

Cites background from "Reuse of public genome-wide gene ex..."

...A recent study of a sample of around 100 peer-reviewed publications referring to ArrayExpress (9) showed that about 22% of the ArrayExpress users use our data for computational studies (e....

Journal Article

CellNet: Network Biology Applied to Stem Cell Engineering

Patrick Cahan¹, Hu Li², Samantha A. Morris¹,³, Edroaldo Lummertz da Rocha¹,⁴,⁵, George Q. Daley¹, James J. Collins⁵

Harvard University¹, Mayo Clinic², Howard Hughes Medical Institute³, Universidade Federal de Santa Catarina⁴, Boston University⁵

14 Aug 2014, Cell

TL;DR: Cells derived via directed differentiation more closely resemble their in vivo counterparts than products of direct conversion, as reflected by the establishment of target cell-type gene regulatory networks (GRNs).

477 citations

Cites background or methods from "Reuse of public genome-wide gene ex..."

...decade methods to reconstruct GRNs using genome-wide expression data have matured substantially (Marbach et al., 2012), and expression repositories have accumulated a wide array of biological perturbations (Lukk et al., 2010; Rung and Brazma, 2013), which are needed for accurate GRN reconstruction....

Journal Article

Data reuse and the open data citation advantage

Heather A. Piwowar¹

National Evolutionary Synthesis Center¹

01 Oct 2013, PeerJ

TL;DR: There is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data, and a robust citation benefit from open data is found, although a smaller one than previously reported.

Abstract: Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation benefit”. Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.

423 citations

Cites background or result from "Reuse of public genome-wide gene ex..."

...The citation benefit observed in the current study is consistent with data reuse found in this study and the small-scale annotation reported in Rung & Brazma (2013)....
...Usage statistics from primary data repositories and value-added repositories are also useful sources of insight into reuse patterns (Rung & Brazma, 2013)....

Journal Article

Learning from Co-expression Networks: Possibilities and Challenges

Elise A. R. Serin¹, Harm Nijveen, Henk W. M. Hilhorst¹, Wilco Ligterink¹

Wageningen University and Research Centre¹

08 Apr 2016, Frontiers in Plant Science

TL;DR: This study analyzes integrative genomics strategies used in recent studies that successfully identified candidate genes taking advantage of gene co-expression networks and discusses promising bioinformatics approaches that predict networks for specific purposes.

Abstract: Plants are fascinating and complex organisms. A comprehensive understanding of the organization, function and evolution of plant genes is essential to disentangle important biological processes and to advance crop engineering and breeding strategies. The ultimate aim in deciphering complex biological processes is the discovery of causal genes and regulatory mechanisms controlling these processes. The recent surge of omics data has opened the door to a system-wide understanding of the flow of biological information underlying complex traits. However, dealing with the corresponding large data sets represents a challenging endeavor that calls for the development of powerful bioinformatics methods. A popular approach is the construction and analysis of gene networks. Such networks are often used for genome-wide representation of the complex functional organization of biological systems. Networks based on similarity in gene expression are called (gene) co-expression networks. One of the major applications of gene co-expression networks is the functional annotation of unknown genes. Constructing co-expression networks is generally straightforward. In contrast, the resulting network of connected genes can become very complex, which limits its biological interpretation. Several strategies can be employed to enhance the interpretation of the networks. A strategy in coherence with the biological question addressed needs to be established to infer reliable networks. Additional benefits can be gained from network-based strategies using prior knowledge and data integration to further enhance the elucidation of gene regulatory relationships. As a result, biological networks provide many more applications beyond the simple visualization of co-expressed genes. In this study we review the different approaches for co-expression network inference in plants. We analyse integrative genomics strategies used in recent studies that successfully identified candidate genes taking advantage of gene co-expression networks. Additionally, we discuss promising bioinformatics approaches that predict networks for specific purposes.

244 citations

Cites background from "Reuse of public genome-wide gene ex..."

...It has been reported that nearly one in four studies uses public data to address a biological problem without generating new raw data (Rung and Brazma, 2013)....

References

Journal Article

Gene Ontology: tool for the unification of biology

M Ashburner¹, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. M. Cherry, Allan Peter Davis, Kara Dolinski, Selina S. Dwight, J.T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna E. Lewis, John C. Matese, Joel E. Richardson, M. Ringwald, Gerald M. Rubin, Gavin Sherlock

Stanford University¹

01 May 2000, Nature Genetics

TL;DR: The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.

Abstract: Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.

35,225 citations

Journal Article

Bioconductor: open software development for computational biology and bioinformatics

Robert Gentleman¹, Vincent J. Carey², Douglas M. Bates³, Benjamin M. Bolstad⁴, Marcel Dettling, Sandrine Dudoit⁴, Byron Ellis¹, Laurent Gautier⁵, Yongchao Ge⁶, Jeff Gentry¹, Kurt Hornik⁷, Torsten Hothorn⁸, Wolfgang Huber⁹, Stefano Maria Iacus¹⁰, Rafael A. Irizarry¹¹, Friedrich Leisch⁷, Cheng Li¹, Martin Maechler, A. J. Rossini¹², Günther Sawitzki, Colin A. Smith¹³, Gordon K. Smyth¹⁴, Luke Tierney¹⁵, Jean Yang, Jianhua Zhang¹

Harvard University¹, Brigham and Women's Hospital², University of Wisconsin-Madison³, University of California, Berkeley⁴, Technical University of Denmark⁵, Icahn School of Medicine at Mount Sinai⁶, Vienna University of Technology⁷, University of Erlangen-Nuremberg⁸, German Cancer Research Center⁹, University of Milan¹⁰, Johns Hopkins University¹¹, University of Washington¹², Scripps Research Institute¹³, Walter and Eliza Hall Institute of Medical Research¹⁴, University of Iowa¹⁵

15 Sep 2004, Genome Biology

TL;DR: The paper describes the aims and methods of Bioconductor, an initiative for the collaborative creation of extensible software for computational biology and bioinformatics, and identifies current challenges.

Abstract: The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.

12,142 citations

Journal Article

RNA-Seq: a revolutionary tool for transcriptomics

Zhong Wang¹, Mark Gerstein¹, Michael Snyder¹

Yale University¹

01 Jan 2009, Nature Reviews Genetics

TL;DR: The RNA-Seq approach to transcriptome profiling that uses deep-sequencing technologies provides a far more precise measurement of levels of transcripts and their isoforms than other methods.

Abstract: RNA-Seq is a recently developed approach to transcriptome profiling that uses deep-sequencing technologies. Studies using this method have already altered our view of the extent and complexity of eukaryotic transcriptomes. RNA-Seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods. This article describes the RNA-Seq approach, the challenges associated with its application, and the advances made so far in characterizing several eukaryote transcriptomes.

11,528 citations

Journal Article

Gene Expression Omnibus: NCBI gene expression and hybridization array data repository

Ron Edgar¹, Michael Domrachev¹, Alex E. Lash¹

National Institutes of Health¹

01 Jan 2002, Nucleic Acids Research

TL;DR: The Gene Expression Omnibus (GEO) project was initiated in response to the growing demand for a public repository for high-throughput gene expression data and provides a flexible and open design that facilitates submission, storage and retrieval of heterogeneous data sets from high-throughput gene expression and genomic hybridization experiments.

Abstract: The Gene Expression Omnibus (GEO) project was initiated in response to the growing demand for a public repository for high-throughput gene expression data. GEO provides a flexible and open design that facilitates submission, storage and retrieval of heterogeneous data sets from high-throughput gene expression and genomic hybridization experiments. GEO is not intended to replace in house gene expression databases that benefit from coherent data sets, and which are constructed to facilitate a particular analytic method, but rather complement these by acting as a tertiary, central data distribution hub. The three central data entities of GEO are platforms, samples and series, and were designed with gene expression and genomic hybridization experiments in mind. A platform is, essentially, a list of probes that define what set of molecules may be detected. A sample describes the set of molecules that are being probed and references a single platform used to generate its molecular abundance data. A series organizes samples into the meaningful data sets which make up an experiment. The GEO repository is publicly accessible through the World Wide Web at http://www.ncbi.nlm.nih.gov/geo.

10,968 citations

Journal Article

Quantitative monitoring of gene expression patterns with a complementary DNA microarray.

Mark Schena¹, Dari Shalon¹, Ronald W. Davis¹, Patrick O. Brown¹

Stanford University¹

20 Oct 1995, Science

TL;DR: A high-capacity microarray system was developed to monitor the expression of many genes in parallel; small hybridization volumes enabled detection of rare transcripts in probe mixtures derived from 2 micrograms of total cellular messenger RNA, and differential expression was measured by simultaneous, two-color fluorescence hybridization.

Abstract: A high-capacity system was developed to monitor the expression of many genes in parallel. Microarrays prepared by high-speed robotic printing of complementary DNAs on glass were used for quantitative expression measurements of the corresponding genes. Because of the small format and high density of the arrays, hybridization volumes of 2 microliters could be used that enabled detection of rare transcripts in probe mixtures derived from 2 micrograms of total cellular messenger RNA. Differential expression measurements of 45 Arabidopsis genes were made by means of simultaneous, two-color fluorescence hybridization.

10,287 citations