
A human phenome-interactome network of protein complexes implicated in genetic disorders

Abstract
We performed a systematic, large-scale analysis of human protein complexes comprising gene products implicated in many different categories of human disease to create a phenome-interactome network. This was done by integrating quality-controlled interactions of human proteins with a validated, computationally derived phenotype similarity score, permitting identification of previously unknown complexes likely to be associated with disease. Using a phenomic ranking of protein complexes linked to human disease, we developed a Bayesian predictor that in 298 of 669 linkage intervals correctly ranks the known disease-causing protein as the top candidate, and in 870 intervals with no identified disease-causing gene, provides novel candidates implicated in disorders such as retinitis pigmentosa, epithelial ovarian cancer, inflammatory bowel disease, amyotrophic lateral sclerosis, Alzheimer disease, type 2 diabetes and coronary heart disease. Our publicly available draft of protein complexes associated with pathology comprises 506 complexes, which reveal functional relationships between disease-promoting genes that will inform future experimentation.


NATURE BIOTECHNOLOGY VOLUME 25 NUMBER 3 MARCH 2007 309
Kasper Lage(1,6), E Olof Karlberg(1,6), Zenia M Størling(1), Páll Í Ólason(1), Anders G Pedersen(1), Olga Rigina(1), Anders M Hinsby(1), Zeynep Tümer(2), Flemming Pociot(3,4), Niels Tommerup(2), Yves Moreau(5) & Søren Brunak(1)
Several diseases with overlapping clinical manifestations are caused by mutations in different genes that are part of the same functional module. In such instances, the clinical overlap can be attributed to mutations in single genes rendering the complete module dysfunctional (ref. 1). This concept has been applied to searches for disease genes by several computational methods, including, for example, schemes based on Gene Ontology annotations and gene expression data (refs. 2–12). The advent of proteome-wide interaction screens in model organisms has revealed the modularity of the cellular interactome and that many genes exert their functions as components of protein complexes such as cellular machines, rigid structures, dynamic signaling or metabolic networks and post-translational modification systems (ref. 13).
Analyses involving model organisms, and more recently humans, show that direct and indirect interactions often occur between protein pairs responsible for similar phenotypes (refs. 14–22). In humans this relationship can, for example, be observed in various inherited ataxias (ref. 20). These findings hint at the widespread association of protein complexes with human disease and the likelihood that defects in several proteins, alone or in combination, can cause overlapping clinical manifestations. Systematic investigation of these complexes would help to elucidate cellular mechanisms underlying various disorders and prioritize positional candidates identified, for example, by linkage analysis or association studies.
Our strategy is predicated on the simple assumption that mutations in different members of a protein complex (predicted from protein-protein interaction data) lead to comparable phenotypes, the similarities of which can be automatically recognized by text mining. Computational integration of phenotypic data with a high-confidence interaction network of human proteins is required to perform such an analysis for many human diseases simultaneously. This creates a phenome-interactome network. However, there is no single standard vocabulary for phenotypic annotation in humans. Furthermore, protein interaction data are noisy, are scattered among different databases and contain many false positive interactions (ref. 23). Additionally, only a few large-scale protein interaction studies have been finalized for the human proteome (refs. 24,25), rendering the coverage of human protein interaction data too low for a systematic study of protein complexes associated with human disease. Thus, extensive data integration, including conservative incorporation of protein interaction data from model organisms, streamlining of human phenotype data and thorough testing of the resulting method, is required for the systematic investigation of protein complexes associated with human disease.
RESULTS
Construction of a quality-controlled interaction network of human proteins and implementation of a thoroughly benchmarked computational phenotype similarity score allowed us to analyze a human phenome-interactome network. The results show that the 506 disease-associated protein complexes span a wide range of inherited disease categories. We furthermore trained a Bayesian predictor to prioritize candidates in 870 linkage intervals by assigning candidates to protein complexes and ranking these complexes based on the phenotypes associated with their members by text mining. The key steps in our approach are illustrated in Figure 1. Four disease-specific case studies are presented to illustrate how the complexes can be exploited to generate novel hypotheses, which directly suggest specific validation experiments involving particular patient-derived materials.
(1) Center for Biological Sequence Analysis, BioCentrum-DTU, Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark.
(2) Wilhelm Johannsen Centre for Functional Genome Research, Department of Cellular and Molecular Medicine, The Panum Institute, University of Copenhagen, Blegdamsvej 3, DK-2200, Copenhagen N, Denmark.
(3) Institute for Clinical Science, University of Lund, SE-22100 Lund, Sweden.
(4) Steno Diabetes Center, Niels Steensesvej 2, DK-2820 Gentofte, Denmark.
(5) Department of Electrical Engineering, Faculty of Engineering, Katholieke Universiteit Leuven, B-3001 Heverlee, Belgium.
(6) These authors contributed equally to this work.
Correspondence should be addressed to S.B. (brunak@cbs.dtu.dk).
Published online 7 March 2007; doi:10.1038/nbt1295
ANALYSIS
© 2007 Nature Publishing Group http://www.nature.com/naturebiotechnology

Measuring phenotype similarity scores
Text mining techniques are well suited for investigating phenotype-genotype relationships (refs. 8,11,12,14,26–28). Inspired by such techniques, we created a scoring scheme that quantitatively measures the phenotypic overlap of Online Mendelian Inheritance in Man (OMIM) (ref. 29) records (Supplementary Fig. 1 online). For every record we created a phenotype vector consisting of weighted medical terms present in the record, which represent the phenotype described in that particular record. The parsing of the OMIM records was done using MetaMap Transfer (MMTx) (ref. 30), a program that maps text to the Unified Medical Language System (UMLS) (ref. 31) metathesaurus (MTH) concepts. The pairwise phenotypic overlap between records was quantified by calculating the cosine of the angle between normalized vector pairs (ref. 32), which is a standard measure in such analyses. Essentially, the method amounts to detecting words (from the UMLS vocabulary) that are (i) common to the description of the two phenotypes and (ii) do not occur too frequently among all phenotype descriptions and thus are informative about the phenotype under consideration.
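The scoring idea described above can be sketched in a few lines. This is a minimal illustration, not the scheme used in the paper: the inverse-document-frequency weighting, the function names `phenotype_vectors` and `cosine`, and the three toy records with hand-written terms are all assumptions made for the example (the paper derives its terms from OMIM records via MMTx and UMLS and does not give its weighting formula in this section).

```python
import math
from collections import Counter

def phenotype_vectors(records):
    """Map record id -> {term: weight}, down-weighting terms found in many records."""
    n = len(records)
    doc_freq = Counter(t for terms in records.values() for t in set(terms))
    vectors = {}
    for rec_id, terms in records.items():
        counts = Counter(terms)
        # Assumed weighting: term count times inverse document frequency.
        vectors[rec_id] = {t: k * math.log(n / doc_freq[t]) for t, k in counts.items()}
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy records; real input would be UMLS concepts extracted from OMIM.
records = {
    "rec_A": ["night blindness", "retinal dystrophy", "visual field constriction"],
    "rec_B": ["night blindness", "retinal dystrophy", "nystagmus"],
    "rec_C": ["muscular atrophy", "bulbar paralysis"],
}
vecs = phenotype_vectors(records)
sim_ab = cosine(vecs["rec_A"], vecs["rec_B"])  # shares two terms, so above zero
sim_ac = cosine(vecs["rec_A"], vecs["rec_C"])  # shares no terms, so exactly zero
```

Terms shared by both records raise the score, and terms that appear in many records contribute little, which matches conditions (i) and (ii) in the text.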
Even though our approach is comparable to successful methods reported in other contexts (ref. 28), there are a number of problems surrounding the use of MMTx and UMLS (ref. 33), and it is not obvious that the cosine distance between phenotype vectors can accurately capture and quantify the phenotypic overlap between record pairs. To evaluate the reliability of our method, we extracted a large set of ~7,000 OMIM record pairs, which had a high degree of phenotypic overlap. This assertion of phenotypic overlap was based on a combination of the opinion of expert OMIM curators and experts familiar with the diseases under consideration (Supplementary Methods online). To evaluate the phenotypic overlap of record pairs in this set, we manually curated 100 random record pairs. This evaluation showed that over 90% of the pairs consist of records with a high degree of phenotypic overlap (Supplementary Table 1 online).
The reliability of the phenotype similarity score was then tested by fitting a calibration curve of the score against the overlap with the OMIM record pairs (that is, the percentage of the pairs with a given score found among the record pairs). This demonstrates their direct correlation (Supplementary Fig. 2 online). The higher the phenotype similarity score between records measured by our text-mining scheme, the higher the probability that the records had been independently evaluated to have a phenotypic overlap by the OMIM curators, so that indeed the constructed phenotype vectors and scoring scheme produce a reliable measure of phenotypic overlap between OMIM records.
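A calibration of this kind can be approximated by binning scored pairs and counting how many fall in the reference set. The sketch below is an assumption-laden illustration: the function name `calibration_curve`, the equal-width bins, the requirement that scores lie in [0, 1], and the toy pairs are all hypothetical, and the paper fits a curve rather than reporting raw bin fractions.

```python
def calibration_curve(scored_pairs, reference_pairs, n_bins=5):
    """Return [(bin_lower, bin_upper, fraction_in_reference)] for non-empty bins."""
    hits = [0] * n_bins
    totals = [0] * n_bins
    for pair, score in scored_pairs.items():
        # Scores are assumed to lie in [0, 1]; a score of 1.0 goes in the last bin.
        b = min(int(score * n_bins), n_bins - 1)
        totals[b] += 1
        if pair in reference_pairs:
            hits[b] += 1
    return [(b / n_bins, (b + 1) / n_bins, hits[b] / totals[b])
            for b in range(n_bins) if totals[b] > 0]

# Hypothetical record pairs with similarity scores, and a curated reference set.
scored = {
    ("a", "b"): 0.05, ("a", "c"): 0.15, ("b", "c"): 0.45,
    ("c", "d"): 0.55, ("d", "e"): 0.85, ("e", "f"): 0.95, ("a", "f"): 0.90,
}
reference = {("b", "c"), ("d", "e"), ("e", "f"), ("a", "f")}
curve = calibration_curve(scored, reference, n_bins=2)
# Low-score bin: 1 of 3 pairs in the reference set; high-score bin: 3 of 4.
```

A score is well calibrated in the sense of the text when the fraction in the last column rises with the bin's score range.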
Constructing a scored network of human protein interactions
We created a human protein interaction network by pooling human interaction data from several of the largest databases and increased the coverage by transferring data from model organisms. We then devised and tested a network-wide confidence score for all interactions. This score relies on network topology and furthermore considers (i) that interactions from large-scale experiments generally contain more false positives than interactions from small-scale experiments (ref. 23), and (ii) that interactions are more reliable if they have been reproduced in more than one independent interaction experiment (ref. 23). The reliability of this score as a measure of interaction confidence was confirmed by fitting a calibration curve of the score against overlap with a high-confidence set of about 35,000 human interactions (Supplementary Fig. 3 online). The resulting network contains ~343,000 unique interactions between ~8,500 human proteins. Of these, ~62,000 are high-confidence interactions.
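The two considerations (i) and (ii) can be made concrete with a toy score. The formula below is entirely hypothetical and is not the score used in the paper: the paper's score also relies on network topology, which this sketch ignores, and the function names, the logarithmic penalty and the noisy-OR combination are illustrative assumptions only.

```python
import math

def experiment_reliability(n_interactions_reported):
    """Hypothetical reliability that shrinks as an experiment reports more interactions."""
    # Requires n_interactions_reported >= 1.
    return 1.0 / (2.0 + math.log10(n_interactions_reported))

def interaction_confidence(experiment_sizes):
    """Combine independent supporting experiments with a noisy-OR (an assumption)."""
    p_all_wrong = 1.0
    for size in experiment_sizes:
        p_all_wrong *= 1.0 - experiment_reliability(size)
    return 1.0 - p_all_wrong

small_scale = interaction_confidence([5])           # one small-scale study
large_scale = interaction_confidence([5000])        # one large-scale screen
reproduced = interaction_confidence([5000, 5000])   # seen in two large screens
```

With this toy formula, an interaction from a small study outscores one from a large screen, and an interaction reproduced in two screens outscores one seen once, which is the qualitative behaviour the text describes.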
Testing the predictor on 1,404 linkage intervals
We trained a Bayesian predictor to rank known disease-causing proteins in linkage intervals, by assigning candidates to protein complexes and ranking these complexes based on the phenotypes assigned to their members by text mining. The predictor was validated by fivefold cross-validation on a total of 1,404 linkage intervals containing an average of 109 candidates and including one candidate known to be involved in the particular disease. For ranking candidates, the Bayesian predictor takes as input the patient phenotype (e.g., Leber congenital amaurosis) and a linkage interval, and the candidates are ranked by the following three steps (Fig. 1). First, a given positional candidate is queried for high-scoring interaction partners (termed a virtual pull-down of the protein). These interaction partners compose the candidate complex. Second, proteins known to be involved in disease are identified in the candidate complex, and pairwise scores of the phenotypic overlap between diseases of these proteins and the candidate phenotype are assigned. Third, based on the phenotypes represented in the candidate complex, the Bayesian predictor awards a posterior probability score to the candidate in the complex. All candidates in the linkage interval are ranked on the basis of this score. The biological interpretation of a high-scoring candidate is that this protein is likely to be involved in the molecular pathology of the disorder of interest, because it is part of a high-confidence candidate complex in which some proteins are known to be involved in highly similar (or identical) disorders.
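The three steps can be outlined as follows. This is a simplified sketch under stated assumptions: the gene, protein and disease names are placeholders, the confidence cutoff of 0.5 is arbitrary, and the third step uses the best phenotype match in the complex as a stand-in for the Bayesian posterior, whose exact form is not given in this section.

```python
def virtual_pull_down(candidate, interactions, min_confidence=0.5):
    """Step 1: collect partners of `candidate` whose interaction score meets the cutoff."""
    partners = set()
    for (a, b), confidence in interactions.items():
        if confidence >= min_confidence and candidate in (a, b):
            partners.add(b if a == candidate else a)
    return partners

def rank_candidates(candidates, interactions, protein_disease, similarity_to_patient):
    """Steps 2 and 3: score each candidate complex, then rank the candidates."""
    scored = []
    for cand in candidates:
        members = virtual_pull_down(cand, interactions)
        # Step 2: phenotype similarity for members with a known disease.
        sims = [similarity_to_patient[protein_disease[p]]
                for p in members if p in protein_disease]
        # Step 3 placeholder (NOT the Bayesian posterior): best match in the complex.
        scored.append((max(sims, default=0.0), cand))
    ordered = sorted(scored, key=lambda t: (-t[0], t[1]))
    return [(cand, score) for score, cand in ordered]

# Hypothetical inputs: scored interactions, disease annotations, similarity scores.
interactions = {
    ("GENE_A", "P1"): 0.9, ("GENE_A", "P2"): 0.3,
    ("GENE_B", "P3"): 0.8, ("GENE_C", "P4"): 0.7,
}
protein_disease = {"P1": "disease_X", "P2": "disease_Y", "P3": "disease_Z"}
similarity_to_patient = {"disease_X": 0.8, "disease_Y": 0.95, "disease_Z": 0.2}

ranking = rank_candidates(["GENE_A", "GENE_B", "GENE_C"], interactions,
                          protein_disease, similarity_to_patient)
```

In this toy run GENE_A ranks first because its only high-confidence partner is linked to a similar disease; its low-confidence partner P2 is excluded by the pull-down cutoff even though its disease is the closest match.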
Performance of the Bayesian model relying on phenomic
scoring of protein complexes associated with disease
The results of prioritizing candidates in the 1,404 test linkage intervals show that the predictor has both good precision and recall (Fig. 2a). For each disease, we consider the known disease gene as the relevant gene. Our method makes a prediction for a disease if the top-scoring gene for this disease has a score above the threshold of 0.1. This threshold is chosen because predictions scoring below 0.1 approximate the chance of picking the correct gene randomly. The retrieved gene is then this top-scoring gene. Precision (at a given threshold) is the proportion of relevant genes among all retrieved genes (no. of relevant genes retrieved/no. of genes retrieved). Recall is the fraction of the relevant genes that have been retrieved at the same threshold (no. of relevant genes retrieved/no. of relevant genes). For the 1,404 linkage intervals, there are 669 different predictions with a score above 0.1. Among these, there were 298 correctly identified disease genes, so that the precision at this threshold is 45% (that is, 45% of the candidates that ranked number one with a score above 0.1 are correctly identified as genes causing disease) (Fig. 2a), a level of precision far superior to random prediction. At this threshold, the recall is 21%. A plot of precision versus prediction score cutoff shows proportionality between the score and the chance that the candidate is correct. Candidates scoring above 0.9 are correct in more than 65% of the cases (Fig. 2a). Thus, high-scoring candidates are very likely to be correct, and the score awarded to a candidate is a direct indication of the chance that the gene contributes to the disease in question.
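The reported precision and recall follow directly from the counts given above. The short calculation below reproduces them; the random baseline is an added assumption (a uniform pick among the average of 109 candidates per interval) included only to give "random prediction" a rough scale.

```python
n_intervals = 1404    # test linkage intervals, each with one known disease gene
n_predictions = 669   # intervals whose top-ranked candidate scores above 0.1
n_correct = 298       # predictions where the top candidate is the known gene

precision = n_correct / n_predictions   # relevant retrieved / retrieved
recall = n_correct / n_intervals        # relevant retrieved / relevant

# Assumption: random ranking picks uniformly among the average 109 candidates.
random_baseline = 1 / 109

print(f"precision = {precision:.1%}, recall = {recall:.1%}, "
      f"random baseline = {random_baseline:.1%}")
# prints: precision = 44.5%, recall = 21.2%, random baseline = 0.9%
```

Rounded to whole percentages, these are the 45% precision and 21% recall quoted in the text.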
There were two main types of failures to identify the relevant genes. Either the proteins coded by the relevant genes do not have an interaction partner that is involved in a relevant phenotype (which applies to 59% of all intervals), or there is a gene in the region considered a better candidate by the predictor (which applies to 26% of all intervals). These 26% could in theory be correct predictions, as suggested by manual inspection of false predictions with high posterior probabilities. By far the most common failure is the lack of interaction partners involved in similar diseases. In 75% of such cases there were no candidates that scored above the threshold of 0.1. These failures could either be due to a lack of data or because some disease proteins do not interact with proteins involved in similar diseases. It seems most likely that the failures are due to a combination of both.
We also tested a predictor trained on large-scale protein interaction data from which bias related to human diseases was eliminated (Supplementary Methods online). Here we observed a comparable precision to the predictor trained on the full protein interaction data set (Fig. 2b). Using these data, the precision above 0.1 is 25%, and above 0.9, it is 58%. Therefore, although the performance is slightly lower, it is still very high. These results illustrate the value of large-scale protein interaction data from model organisms, if subjected to stringent quality control. The much lower recall (2.3%) is to be expected with less data. This shows that it is possible to accurately identify disease genes using data from model organisms that were not produced specifically to investigate disease relationships.
Because mutational analysis of candidates in linkage intervals is extremely demanding in terms of resources, our method should be valuable for identifying highly likely candidates and thereby facilitating the discovery of novel genes involved in human disease.
Predicting novel disease gene candidates
OMIM contains 870 intervals linked to diseases for which there are no confirmed disease-causing genes. We ranked the genes in these intervals by the method depicted in Figure 1. The full set of predictions above the threshold of 0.1 can be seen in the Supplementary Data. We present the best-scoring candidates made by our predictor in Supplementary Table 2 online. In each of the 91 represented intervals at least one candidate scores above 0.2. In some intervals there are also candidates scoring in the range 0.1–0.2; these are included for completeness, so the table contains a total of 113 candidates in 91 intervals.
All predictions in Supplementary Table 2 were followed up by independent literature studies, where we investigated the distance of the predicted gene to the closest published high-resolution marker. Seven genes were located >20 Mb from such markers (labeled * in Supplementary Table 2 online). We also investigated whether the candidates had previously been associated with the respective disorders, and whether there were inconsistencies between candidates we proposed and those proposed by other groups for the same diseases and intervals.
Twenty-four of the predictions point to genes that are most likely true positives, but where the causative mutation has not yet been identified (annotated with “2” or “2#” in Supplementary Table 2 online). In these cases, our predictions should be seen as further evidence that the genes are involved in the respective diseases. Seven predictions point to genes where a causative mutation has been identified (annotated with “3” in Supplementary Table 2 online). Together, these constitute 31 predictions most likely to be true. Of these, 25 are the best scoring in the interval, and 6 are scored second or lower. Sixteen predictions point to genes for which literature studies show that a different gene is strongly incriminated in the disease, most likely rendering the prediction wrong (annotated with “1#” in Supplementary Table 2 online). Of these, 11 are the best-scoring candidate in the interval and 5 score second or lower. When considering only the best-scoring candidate in each interval (as we have done in the benchmark), 25 are most likely true positives and 11 are most likely false positives. Thus, the precision is 69% (25 of 36), even better than the precision in the benchmark, where predictions above 0.2 have a precision of 49%. Sixty-six of the candidates belong to intervals where there is no evidence in the literature regarding a gene(s) that contributes to the pathology. We consider these as novel candidates. All complexes underlying the candidates scoring 0.1 or above are available for download from the database supporting this work.

Figure 1 Steps in scoring each candidate in a linkage interval. First, a virtual pull-down of each candidate identifies putative protein complexes including the candidate. Each complex is named the candidate complex. Second, proteins responsible for promoting disease are identified in the candidate complex, and the pairwise similarity to the patient phenotype is measured by text-mining. In this case, proteins that are involved in different disorders comparable to Leber congenital amaurosis are colored according to the clinical overlap with this phenotype. The last step involves scoring and ranking the candidates by the Bayesian predictor. Each candidate is scored based on phenotypes associated with the proteins in the candidate complex, and all candidates in the interval are ranked based on this score.
To exemplify the candidate protein complexes underlying the scoring of the Bayesian predictor, we present four case studies of the novel candidates from Supplementary Table 2 online. Similar analysis can be carried out for all 506 complexes in the data set, pointing to specific approaches toward validation of the proposed relationships.
Case studies
Retinitis pigmentosa is a clinically and genetically heterogeneous group of disorders. Common traits are night blindness, constricted visual field and retinal dystrophy. In an associated interval on 2p15–p11 (ref. 34), the Bayesian predictor points to LOC130951 with a score of 0.5232. This protein is uncharacterized but evolutionarily conserved, and it is putatively involved in the disease based on an interaction with CRX (refs. 25,35) (Fig. 3a). CRX is a homeobox transcription factor known to be involved in retinitis pigmentosa and cone rod dystrophy (ref. 36). The candidature of LOC130951 is not obvious, and because both interaction studies reporting the interaction to CRX are large scale, including thousands of interactions, it seems unlikely that LOC130951 would have been chosen as a suitable candidate by manual investigation of the interval.
Epithelial ovarian cancer arises as a result of genetic alterations in the ovarian surface epithelium. In an associated interval on 3p25–p22 (ref. 37), the Bayesian predictor points to Fanconi anemia group D2 protein (FANCD2) with a score of 0.9981. This protein is placed in a complex with breast cancer type 2 susceptibility protein (BRCA2), breast cancer type 1 susceptibility protein (BRCA1) and nibrin isoform 1 (NBN), all of which are involved in ovarian cancer, breast cancer or chromosomal instability disorders (refs. 38–41) (Fig. 3b). Furthermore, other proteins involved in cancer can be identified in the complex (Supplementary Data and Supplementary Fig. 4 online). FANCD2 is part of the BRCA pathway in cisplatin-sensitive cells (ref. 42) and is known to be involved in different types of cancer (ref. 43). However, to our knowledge, a mutation in this gene has never been demonstrated in epithelial ovarian cancer, and we consider it to be a likely candidate in epithelial ovarian cancer in families with linkage to 3p25–p22.
Inflammatory bowel disease is characterized by chronic, relapsing intestinal inflammation. In an associated interval on 6p (refs. 44,45), the Bayesian predictor points to receptor-interacting serine/threonine protein kinase (RIPK1) as the most likely candidate with a score of 0.9984 (Fig. 3c). The candidate complex includes the signaling proteins tumor necrosis factor receptor 2 (TNFRSF1B), tumor necrosis factor precursor (TNF) and tumor necrosis factor receptor precursor (TNFRSF1A), all known to be associated with inflammatory bowel disease or other inflammatory disorders. Furthermore, other proteins involved in inflammation and immune responses can be observed in the complex (Supplementary Data and Supplementary Fig. 5 online). We thus identified a positional candidate, which is placed centrally in a complex of proteins known to be involved in inflammatory bowel disease and other types of inflammation. We note that RIPK1 lies 20.6 Mb from the closest high-resolution marker published. However, considering that all of 6p was screened for candidates, and that several genes lying far from the published markers are most likely true predictions in Supplementary Table 2 online, we believe that RIPK1 is a very likely candidate involved in inflammatory bowel disease.
Amyotrophic lateral sclerosis (ALS) with frontotemporal dementia is a degenerative motor neuron disorder characterized by muscular atrophy, progressive motor neuron function loss and bulbar paralysis. In many families, hereditary ALS is associated with frontotemporal dementia and linkage has been shown to an area on 9q21–q22 (ref. 46). Here, the Bayesian predictor points to two likely candidates: bicaudal D homolog 2 (BICD2) and cytoplasmic isoleucyl-tRNA synthetase (IARS), scoring 0.4351 and 0.2154, respectively. Although BICD2 is scored highest, both candidates are awarded good scores and are plausible candidates for contributing to ALS associated with dementia. However, investigation of the candidate complexes suggests that BICD2 is more likely to be involved in nonfamilial ALS not associated with dementia, because it is part of a complex with dynactin, which is associated with ALS without dementia. IARS is in a complex with superoxide dismutase 1, a protein known to be involved in familial ALS (ref. 47) including dementia (Fig. 3d). Also, the IARS complex contains molecular chaperones and other proteins that have been connected to the disease and other types of dementias (Supplementary Fig. 6 online), and the interaction data underlying the complex are highly reproducible (Supplementary Data online). Both candidates are likely, but the candidate complex underlying IARS is seemingly more relevant to familial ALS, and it is plausible that IARS could be involved in the disease in families with linkage to 9q21–q22. Because little is known about this disorder, the complex revealed here is an interesting new lead concerning its underlying causes.

Figure 2 Performance of the Bayesian predictor. (a) A plot of recall of the predictor against precision shows that precision for high-scoring candidates can approach 65%. We also trained a predictor only on large-scale data where we had removed all data that were related to diseases that were represented in the test set. (b) Prediction score cutoff is plotted for both the predictor trained on all protein interaction data in our network (green line) and the predictor trained only on unbiased large-scale data (blue line). The precision of these two approaches is comparable, showing that it is possible to find disease genes with very high precision, even with unbiased large-scale data inferred from model organisms, if these data are scored correctly.
These case studies indicate the value of data mining our phenome-interactome network and integrating interaction data across multiple organisms for positional candidate prioritization. In the case of retinitis pigmentosa and ALS with frontotemporal dementia, the predictor identifies nonobvious candidates in novel putative complexes supported by a network of reproducible interaction data from humans and multiple model organisms. In the cases of inflammatory bowel disease and epithelial ovarian cancer, we identify partly characterized complexes, where several members are known to be involved in the patient phenotype. However, because there are ~500 positional candidates in the case of inflammatory bowel disease, it would require extensive literature studies to reveal this network and candidate by manual data integration. We thus believe that RIPK1 would probably not have been identified as a good candidate despite prior knowledge of its involvement in a known network contributing to inflammatory responses.
DISCUSSION
We have recently witnessed the emergence of integrative methods for identifying probable disease genes in linkage intervals associated with disease based on data integration involving, for example, Gene Ontology categories and expression data (refs. 2–12). Traditionally these methods are compared by measuring average fold enrichment of positional probability (Supplementary Methods online). If a method ranks the true candidate in the top 10% of all candidates in 50% of the linkage intervals, there is a tenfold enrichment in the successfully predicted intervals and a fivefold enrichment on average. We show that our method increases the probability 108.8 times for the successful predictions and 23.1 times on average, significantly outperforming the other computational methods for positional candidate prioritization, which report 5.6–31.2 times enrichment in the successful linkage intervals and 3.8–19.4 times enrichment on average (Supplementary Table 3 online). The most common failure of our method to correctly identify the disease gene results from the inability to find interaction partners associated with a similar phenotype as the relevant protein. This could result from either a lack of data or the failure of these proteins to interact with proteins involved in similar phenotypes. In 75% of these cases, failure to identify another candidate scoring over 0.1 eliminates the possibility of an incorrect prediction.
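The enrichment arithmetic in the worked example above can be written out explicitly. The sketch assumes, as that example implies, that intervals without a successful prediction contribute no enrichment, so the average is the enrichment in successful intervals multiplied by the fraction of intervals that succeed. Using the benchmark recall (298 of 1,404 intervals) as that fraction is an assumption made here as a consistency check, not a calculation stated in the paper; the function names are hypothetical.

```python
def fold_in_successful(top_fraction):
    """Enrichment when the true gene is narrowed to `top_fraction` of the candidates."""
    return 1.0 / top_fraction

def average_fold(fold_successful, fraction_successful):
    """Average enrichment, assuming failed intervals contribute no enrichment."""
    return fold_successful * fraction_successful

# Worked example from the text: top 10% of candidates in 50% of intervals.
worked_successful = fold_in_successful(0.10)            # tenfold
worked_average = average_fold(worked_successful, 0.50)  # fivefold

# Consistency check (assumption): reported 108.8-fold times the benchmark recall.
reported_average = average_fold(108.8, 298 / 1404)      # about 23.1
```

Under these assumptions the reported 108.8-fold enrichment and the benchmark recall together give roughly the 23.1-fold average quoted in the text.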
Our ability to assign candidates to high-confidence protein complexes and rank these complexes in terms of phenomics has permitted us to present a first draft of 506 protein complexes associated with human disease. The success of our method can be attributed to a combination of factors. First, we integrate experimental protein interaction data with a phenotype similarity scheme, thereby taking advantage of the complete clinical spectrum of related human diseases. Also, we use
Figure 3 Case studies of four candidate complexes. (a–d) These candidate complexes are subjected to virtual pull-down with the best-scoring candidate in retinitis pigmentosa 28 (RP28) (a), epithelial ovarian cancer (EOC) (b), inflammatory bowel disease (IBD) (c) and a high-scoring candidate in amyotrophic lateral sclerosis (ALS) with frontotemporal dementia (d). Solid black circles (c) represent proteins that are the high-scoring candidates in the four disorders. Numbered circles are proteins that interact with the candidate proteins. Colored nodes are proteins identified by our phenotype-similarity scheme. Gray proteins are not predicted by our phenotype-similarity scheme to be implicated in any disease.
