Author

Johan Rung

Other affiliations: Wellcome Trust Sanger Institute, Charles III University of Madrid, Wellcome Trust Centre for Human Genetics

Bio: Johan Rung is an academic researcher from Wellcome Trust. The author has contributed to research on the topics of information management and information systems. The author has an h-index of 5 and has co-authored 6 publications receiving 755 citations. Previous affiliations of Johan Rung include Wellcome Trust Sanger Institute and Charles III University of Madrid.

Papers

Journal Article•DOI•

ArrayExpress update—trends in database growth and links to data analysis tools

Gabriella Rustici¹, Nikolay Kolesnikov¹, Marco Brandizi¹, Tony Burdett¹, Miroslaw Dylag¹, Ibrahim Emam¹, Anna Farne¹, Emma Hastings¹, Jon Ison¹, Maria Keays¹, Natalja Kurbatova¹, James Malone¹, Roby Mani¹, Annalisa Mupo¹, Rui Pedro Pereira¹, Ekaterina Pilicheva¹, Johan Rung¹, Anjan Sharma¹, Y. Amy Tang¹, Tobias Ternent¹, Andrew Tikhonov¹, Danielle Welter¹, Eleanor Williams¹, Alvis Brazma¹, Helen Parkinson¹, Ugis Sarkans¹•Institutions (1)

Wellcome Trust Sanger Institute¹

27 Nov 2012-Nucleic Acids Research

TL;DR: The ArrayExpress Archive of Functional Genomics Data (ArrayExpress) is one of three international functional genomics public data repositories, alongside the Gene Expression Omnibus at NCBI and the DDBJ Omics Archive, supporting peer-reviewed publications.

Abstract: The ArrayExpress Archive of Functional Genomics Data (http://www.ebi.ac.uk/arrayexpress) is one of three international functional genomics public data repositories, alongside the Gene Expression Omnibus at NCBI and the DDBJ Omics Archive, supporting peer-reviewed publications. It accepts data generated by sequencing or array-based technologies and currently contains data from almost a million assays, from over 30 000 experiments. The proportion of sequencing-based submissions has grown significantly over the last 2 years and has reached, in 2012, 15% of all new data. All data are available from ArrayExpress in MAGE-TAB format, which allows robust linking to data analysis and visualization tools, including Bioconductor and GenomeSpace. Additionally, R objects, for microarray data, and binary alignment format files, for sequencing data, have been generated for a significant proportion of ArrayExpress data.

377 citations

Journal Article•DOI•

Reuse of public genome-wide gene expression data

Johan Rung¹, Alvis Brazma¹•Institutions (1)

Wellcome Trust¹

01 Feb 2013-Nature Reviews Genetics

TL;DR: This Review discusses the utility of the gene expression data that are in the public domain and how researchers are making use of these data, and provides recommendations that can improve the utility of such data.

Abstract: Our understanding of gene expression has changed dramatically over the past decade, largely catalysed by technological developments. High-throughput experiments - microarrays and next-generation sequencing - have generated large amounts of genome-wide gene expression data that are collected in public archives. Added-value databases process, analyse and annotate these data further to make them accessible to every biologist. In this Review, we discuss the utility of the gene expression data that are in the public domain and how researchers are making use of these data. Reuse of public data can be very powerful, but there are many obstacles in data preparation and analysis and in the interpretation of the results. We will discuss these challenges and provide recommendations that we believe can improve the utility of such data.

335 citations

Journal Article•DOI•

A System for Information Management in BioMedical Studies--SIMBioMS

Maria Krestyaninova¹, Andris Zarins², Juris Viksna², Natalja Kurbatova², Peteris Rucevskis², Sudeshna Guha Neogi², Mike Gostev², Teemu Perheentupa², Juha Knuuttila², Amy Barrett², Ilkka Lappalainen², Johan Rung², Karlis Podnieks², Ugis Sarkans², Mark I. McCarthy², Alvis Brazma²•Institutions (2)

European Bioinformatics Institute¹, Wellcome Trust Centre for Human Genetics²

15 Oct 2009-Bioinformatics

TL;DR: SIMBioMS provides a solution for the collection, storage, management and retrieval of information about research subjects and biomedical samples, as well as experimental data obtained using a range of high-throughput technologies, including gene expression, genotyping, proteomics and metabonomics.

Abstract: Summary: SIMBioMS is a web-based open source software system for managing data and information in biomedical studies. It provides a solution for the collection, storage, management and retrieval of information about research subjects and biomedical samples, as well as experimental data obtained using a range of high-throughput technologies, including gene expression, genotyping, proteomics and metabonomics. The system can easily be customized and has proven to be successful in several large-scale multi-site collaborative projects. It is compatible with emerging functional genomics data standards and provides data import and export in accepted standard formats. Protocols for transferring data to durable archives at the European Bioinformatics Institute have been implemented. Availability: The source code, documentation and initialization scripts are available at http://simbioms.org. Contact: support@simbioms.org; mariak@ebi.ac.uk

35 citations

Journal Article•DOI•

SAIL-a software system for sample and phenotype availability across biobanks and cohorts

Mikhail Gostev¹, Julio Fernandez-Banet², Johan Rung², Joern Dietrich², Inga Prokopenko², Samuli Ripatti², Mark I. McCarthy², Alvis Brazma², Maria Krestyaninova²•Institutions (2)

Wellcome Trust¹, Wellcome Trust Centre for Human Genetics²

15 Feb 2011-Bioinformatics

TL;DR: The Sample avAILability system (SAIL) is a web-based application for searching, browsing and annotating biological sample collections or biobank entries. By providing individual-level information on the availability of specific data types and samples within a collection, rather than the actual measurement data, it can facilitate resource integration.

Abstract: Summary: The Sample avAILability system—SAIL—is a web-based application for searching, browsing and annotating biological sample collections or biobank entries. By providing individual-level information on the availability of specific data types (phenotypes, genetic or genomic data) and samples within a collection, rather than the actual measurement data, resource integration can be facilitated. A flexible data structure enables the collection owners to provide descriptive information on their samples using existing or custom vocabularies. Users can query for the available samples by various parameters, combining them via logical expressions. The system can be scaled to hold data from millions of samples with thousands of variables. Availability: SAIL is available under the Affero GPL open source license: https://github.com/sail. Contact: gostev@ebi.ac.uk, support@simbioms.org Supplementary information: Supplementary data are available at Bioinformatics online and from http://www.simbioms.org.

29 citations

Journal Article•DOI•

A fully scalable online pre-processing algorithm for short oligonucleotide microarray atlases

Leo Lahti¹, Aurora Torrente², Laura L. Elo², Alvis Brazma², Johan Rung²•Institutions (2)

University of Helsinki¹, Charles III University of Madrid²

01 May 2013-Nucleic Acids Research

TL;DR: This article presents a fully scalable online-learning algorithm for probe-level analysis and pre-processing of large microarray atlases involving tens of thousands of arrays. The model can use the most comprehensive data collections available to date to pinpoint individual probes affected by noise and biases.

Abstract: Rapid accumulation of large and standardized microarray data collections is opening up novel opportunities for holistic characterization of genome function. The limited scalability of current preprocessing techniques has, however, formed a bottleneck for full utilization of these data resources. Although short oligonucleotide arrays constitute a major source of genome-wide profiling data, scalable probe-level techniques have been available only for few platforms based on precalculated probe effects from restricted reference training sets. To overcome these key limitations, we introduce a fully scalable online-learning algorithm for probe-level analysis and pre-processing of large microarray atlases involving tens of thousands of arrays. In contrast to the alternatives, our algorithm scales up linearly with respect to sample size and is applicable to all short oligonucleotide platforms. The model can use the most comprehensive data collections available to date to pinpoint individual probes affected by noise and biases, providing tools to guide array design and quality control. This is the only available algorithm that can learn probe-level parameters based on sequential hyperparameter updates at small consecutive batches of data, thus circumventing the extensive memory requirements of the standard approaches and opening up novel opportunities to take full advantage of contemporary microarray collections.

28 citations

Cited by

Journal Article•DOI•

The Molecular Signatures Database Hallmark Gene Set Collection

Arthur Liberzon¹, Chet Birger¹, Helga Thorvaldsdottir¹, Mahmoud Ghandi¹, Jill P. Mesirov², Pablo Tamayo²•Institutions (2)

Broad Institute¹, University of California, San Diego²

23 Dec 2015-Cell Systems

TL;DR: A combination of automated approaches and expert curation is used to develop a collection of "hallmark" gene sets as part of MSigDB; each hallmark is a refined gene set, derived from multiple "founder" sets, that conveys a specific biological state or process and displays coherent expression.

Abstract: The Molecular Signatures Database (MSigDB) is one of the most widely used and comprehensive databases of gene sets for performing gene set enrichment analysis. Since its creation, MSigDB has grown beyond its roots in metabolic disease and cancer to include >10,000 gene sets. These better represent a wider range of biological processes and diseases, but the utility of the database is reduced by increased redundancy across, and heterogeneity within, gene sets. To address this challenge, here we use a combination of automated approaches and expert curation to develop a collection of “hallmark” gene sets as part of MSigDB. Each hallmark in this collection consists of a “refined” gene set, derived from multiple “founder” sets, that conveys a specific biological state or process and displays coherent expression. The hallmarks effectively summarize most of the relevant information of the original founder sets and, by reducing both variation and redundancy, provide more refined and concise inputs for gene set enrichment analysis.

6,062 citations

Journal Article•DOI•

Inferring tumour purity and stromal and immune cell admixture from expression data

Kosuke Yoshihara¹, Maria Shahmoradgoli², Emmanuel Martinez, Rahulsimham Vegesna², Hoon Kim, Wandaliz Torres-Garcia², Victor Trevino, Hui Shen³, Peter W. Laird³, Douglas A. Levine⁴, Scott L. Carter⁵, Gad Getz⁵, Katherine Stemke-Hale², Gordon B. Mills, Roel G.W. Verhaak•Institutions (5)

University of Texas at Austin¹, University of Texas MD Anderson Cancer Center², University of Southern California³, Memorial Sloan Kettering Cancer Center⁴, Broad Institute⁵

11 Oct 2013-Nature Communications

TL;DR: ESTIMATE is a method that uses gene expression signatures to infer the fraction of stromal and immune cells in tumour samples; its prediction accuracy is corroborated using 3,809 transcriptional profiles available elsewhere in the public domain.

Abstract: Infiltrating stromal and immune cells form the major fraction of normal cells in tumour tissue and not only perturb the tumour signal in molecular studies but also have an important role in cancer biology. Here we describe 'Estimation of STromal and Immune cells in MAlignant Tumours using Expression data' (ESTIMATE)--a method that uses gene expression signatures to infer the fraction of stromal and immune cells in tumour samples. ESTIMATE scores correlate with DNA copy number-based tumour purity across samples from 11 different tumour types, profiled on Agilent, Affymetrix platforms or based on RNA sequencing and available through The Cancer Genome Atlas. The prediction accuracy is further corroborated using 3,809 transcriptional profiles available elsewhere in the public domain. The ESTIMATE method allows consideration of tumour-associated normal cells in genomic and transcriptomic studies. An R-library is available on https://sourceforge.net/projects/estimateproject/.

4,651 citations

Journal Article•DOI•

PATRIC, the bacterial bioinformatics database and analysis resource

Alice R. Wattam¹, David Abraham¹, Oral Dalay¹, Terry Disz¹, Timothy P. Driscoll¹, Joseph L. Gabbard¹, Joseph J. Gillespie¹, Roger Gough¹, Deborah Hix¹, Ronald W. Kenyon¹, Dustin Machi¹, Chunhong Mao¹, Eric K. Nordberg¹, Robert Olson¹, Ross Overbeek¹, Gordon D. Pusch¹, Maulik Shukla¹, Julie R. Schulman¹, Rick Stevens¹, Daniel E. Sullivan¹, Veronika Vonstein¹, Andrew S. Warren¹, Rebecca Will¹, Meredith J. C. Wilson¹, Hyunseung Yoo¹, Chengdong Zhang¹, Yan Zhang¹, Bruno W. S. Sobral¹•Institutions (1)

University of Maryland, Baltimore¹

01 Jan 2014-Nucleic Acids Research

TL;DR: The Pathosystems Resource Integration Center (PATRIC) is the all-bacterial Bioinformatics Resource Center (BRC); this manuscript describes updates to PATRIC since its initial report in the 2007 NAR Database Issue.

Abstract: The Pathosystems Resource Integration Center (PATRIC) is the all-bacterial Bioinformatics Resource Center (BRC) (http://www.patricbrc.org). A joint effort by two of the original National Institute of Allergy and Infectious Diseases-funded BRCs, PATRIC provides researchers with an online resource that stores and integrates a variety of data types [e.g. genomics, transcriptomics, protein–protein interactions (PPIs), three-dimensional protein structures and sequence typing data] and associated metadata. Datatypes are summarized for individual genomes and across taxonomic levels. All genomes in PATRIC, currently more than 10 000, are consistently annotated using RAST, the Rapid Annotations using Subsystems Technology. Summaries of different data types are also provided for individual genes, where comparisons of different annotations are available, and also include available transcriptomic data. PATRIC provides a variety of ways for researchers to find data of interest and a private workspace where they can store both genomic and gene associations, and their own private data. Both private and public data can be analyzed together using a suite of tools to perform comparative genomic or transcriptomic analysis. PATRIC also includes integrated information related to disease and PPIs. All the data and integrated analysis and visualization tools are freely available. This manuscript describes updates to the PATRIC since its initial report in the 2007 NAR Database Issue.

1,136 citations

Journal Article•DOI•

The Database of Genomic Variants: a curated collection of structural variation in the human genome

Jeffrey R. MacDonald¹, Robert Ziman¹, Ryan K. C. Yuen¹, Lars Feuk¹, Stephen W. Scherer¹•Institutions (1)

The Centre for Applied Genomics¹

01 Jan 2014-Nucleic Acids Research

TL;DR: The core visualization tool (gbrowse) has been upgraded with additional functions to facilitate data analysis and comparison, and a new query tool has been developed to provide flexible and interactive access to the data.

Abstract: Over the past decade, the Database of Genomic Variants (DGV; http://dgv.tcag.ca/) has provided a publicly accessible, comprehensive curated catalogue of structural variation (SV) found in the genomes of control individuals from worldwide populations. Here, we describe updates and new features, which have expanded the utility of DGV for both the basic research and clinical diagnostic communities. The current version of DGV consists of 55 published studies, comprising >2.5 million entries identified in >22 300 genomes. Studies included in DGV are selected from the accessioned data sets in the archival SV databases dbVar (NCBI) and DGVa (EBI), and then further curated for accuracy and validity. The core visualization tool (gbrowse) has been upgraded with additional functions to facilitate data analysis and comparison, and a new query tool has been developed to provide flexible and interactive access to the data. The content from DGV is regularly incorporated into other large-scale genome reference databases and represents a standard data resource for new product and database development, in particular for copy number variation testing in clinical labs. The accurate cataloguing of variants in DGV will continue to enable medical genetics and genome sequencing research.

1,080 citations

Journal Article•DOI•

K-Profiles: A Nonlinear Clustering Method for Pattern Detection in High Dimensional Data.

Kai Wang¹, Qing Zhao², Jianwei Lu², Tianwei Yu¹•Institutions (2)

Emory University¹, Tongji University²

03 Aug 2015-BioMed Research International

TL;DR: The nonlinear K-profiles clustering method is designed, which can be seen as the nonlinear counterpart of the K-means clustering algorithm, and has a built-in statistical testing procedure that ensures genes not belonging to any cluster do not impact the estimation of cluster profiles.

Abstract: With modern technologies such as microarray, deep sequencing, and liquid chromatography-mass spectrometry (LC-MS), it is possible to measure the expression levels of thousands of genes/proteins simultaneously to unravel important biological processes. A very first step towards elucidating hidden patterns and understanding the massive data is the application of clustering techniques. Nonlinear relations, which were mostly unutilized in contrast to linear correlations, are prevalent in high-throughput data. In many cases, nonlinear relations can model the biological relationship more precisely and reflect critical patterns in the biological systems. Using the general dependency measure, Distance Based on Conditional Ordered List (DCOL) that we introduced before, we designed the nonlinear K-profiles clustering method, which can be seen as the nonlinear counterpart of the K-means clustering algorithm. The method has a built-in statistical testing procedure that ensures genes not belonging to any cluster do not impact the estimation of cluster profiles. Results from extensive simulation studies showed that K-profiles clustering not only outperformed traditional linear K-means algorithm, but also presented significantly better performance over our previous General Dependency Hierarchical Clustering (GDHC) algorithm. We further analyzed a gene expression dataset, on which K-profile clustering generated biologically meaningful results.

1,005 citations
