
Fusing literature and full network data improves disease similarity computation.

Ping Li, +2 more
- 30 Aug 2016 - 
- Vol. 17, Iss: 1, pp 326-326
TLDR
A new method for computing disease similarity by integrating medical literature and protein interaction network is proposed, which shows that quality of protein interaction data was more important than its volume, and can be an effective way to compute disease similarity.
Abstract
Identifying relatedness among diseases could help deepen understanding for the underlying pathogenic mechanisms of diseases, and facilitate drug repositioning projects. A number of methods for computing disease similarity had been developed; however, none of them were designed to utilize information of the entire protein interaction network, using instead only those interactions involving disease causing genes. Most of previously published methods required gene-disease association data, unfortunately, many diseases still have very few or no associated genes, which impeded broad adoption of those methods. In this study, we propose a new method (MedNetSim) for computing disease similarity by integrating medical literature and protein interaction network. MedNetSim consists of a network-based method (NetSim), which employs the entire protein interaction network, and a MEDLINE-based method (MedSim), which computes disease similarity by mining the biomedical literature. Among function-based methods, NetSim achieved the best performance. Its average AUC (area under the receiver operating characteristic curve) reached 95.2 %. MedSim, whose performance was even comparable to some function-based methods, acquired the highest average AUC in all semantic-based methods. Integration of MedSim and NetSim (MedNetSim) further improved the average AUC to 96.4 %. We further studied the effectiveness of different data sources. It was found that quality of protein interaction data was more important than its volume. On the contrary, higher volume of gene-disease association data was more beneficial, even with a lower reliability. Utilizing higher volume of disease-related gene data further improved the average AUC of MedNetSim and NetSim to 97.5 % and 96.7 %, respectively. Integrating biomedical literature and protein interaction network can be an effective way to compute disease similarity. 
Lacking sufficient disease-related gene data, literature-based methods such as MedSim can be a great addition to function-based algorithms. It may be beneficial to steer more resources toward studying gene-disease associations and improving the quality of protein interaction data. Disease similarities can be computed using the proposed methods at http://www.digintelli.com:8000/.


RESEARCH ARTICLE Open Access
Fusing literature and full network data
improves disease similarity computation
Ping Li1,2, Yaling Nie1,2 and Jingkai Yu1*
Abstract
Background: Identifying relatedness among diseases could help deepen understanding for the underlying
pathogenic mechanisms of diseases, and facilitate drug repositioning projects. A number of methods for
computing disease similarity had been developed; however, none of them were designed to utilize information of
the entire protein interaction network, using instead only those interactions involving disease causing genes. Most
of the previously published methods required gene-disease association data; unfortunately, many diseases still have
very few or no associated genes, which impeded broad adoption of those methods. In this study, we propose a
new method (MedNetSim) for computing disease similarity by integrating medical literature and protein interaction
network. MedNetSim consists of a network-based method (NetSim), which employs the entire protein interaction
network, and a MEDLINE-based method (MedSim), which computes disease similarity by mining the biomedical
literature.
Results: Among function-based methods, NetSim achieved the best performance. Its average AUC (area under the
receiver operating characteristic curve) reached 95.2 %. MedSim, whose performance was even comparable to
some function-based methods, acquired the highest average AUC in all semantic-based methods. Integration of
MedSim and NetSim (MedNetSim) further improved the average AUC to 96.4 %. We further studied the
effectiveness of different data sources. It was found that quality of protein interaction data was more important
than its volume. On the contrary, higher volume of gene-disease association data was more beneficial, even with a
lower reliability. Utilizing higher volume of disease-related gene data further improved the average AUC of
MedNetSim and NetSim to 97.5 % and 96.7 %, respectively.
Conclusions: Integrating biomedical literature and protein interaction network can be an effective way to compute
disease similarity. Lacking sufficient disease-related gene data, literature-based methods such as MedSim can be
a great addition to function-based algorithms. It may be beneficial to steer more resources toward studying
gene-disease associations and improving the quality of protein interaction data. Disease similarities can be computed
using the proposed methods at http://www.digintelli.com:8000/.
Keywords: Disease similarity, MedSim, NetSim, MedNetSim, Random walk with Restart
Abbreviations: AUC, The area under the ROC curve; BOG, Based on overlapping gene sets method;
comPPI, Common protein-protein interactions of hPPIN and HumanNet; CTD, Comparative toxicogenomics
database; DisGeNET, A database of gene-disease associations; DO, Disease ontology; DOID, Disease ontology
identifier; DSN, Disease similarity network; GAD, Genetic association database; GO, Gene ontology; GO_BP, GO
biological process; GO_CC, GO cellular component; GO_MF, GO molecular function; HPO, Human phenotype
ontology; IC, Information content; IDF, Inverse document frequency; MeSH, Medical subject headings;
* Correspondence: jkyu@ipe.ac.cn
1State Key Laboratory of Biochemical Engineering, Institute of Process
Engineering, Chinese Academy of Sciences, Beijing 100190, China
Full list of author information is available at the end of the article
© 2016 The Author(s). Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Li et al. BMC Bioinformatics (2016) 17:326
DOI 10.1186/s12859-016-1205-4

MG, Myasthenia gravis; MICA, The most informative common ancestor; NLTK, Natural language toolkit;
OMIM, Online mendelian inheritance in man; PSB, Process-similarity based method; ROC, Receiver operating
characteristic; RWR, Random walk with restart; FR, Functional relevance; SIDD, A semantically integrated
disease database; TF, Term frequency; TF-IDF, Term frequency times inverse document frequency; UMLS
ID, Unified medical language system identifier; UMLS, Unified medical language system
Background
Discovering closely related diseases could be helpful in
revealing their common pathophysiology [1, 2]. It may
also be useful for identifying novel drug indications [3],
as similar diseases may have the same or similar therapeutic
targets, which suggests they could be treated with
the same or similar drugs. There has been a growing
interest in quantitatively measuring similarities between
diseases [4–7].
Phenotypic similarity plays an important role in a
number of biological and biomedical applications [8].
In recent years, researchers have designed several methods
based on the Human Phenotype Ontology (HPO) [9]
to find related diseases and predict disease-causing
genes, such as Phenomizer [10], Exomiser [11]
and PhenIX [12]. The HPO provides a controlled and
standardized vocabulary of phenotypic abnormalities
that characterize human diseases. Phenotype similarity
has also become the most common basis for defining
classification rules for diseases. The classifications of disease
terms in Medical Subject Headings (MeSH) [13] and
Disease Ontology (DO) [14] take this approach. To
quantify disease similarity, several semantic-based
methods have thus been proposed based on HPO, MeSH
or DO, such as Resnik [15], Lin [16] and Wang [17].
Resnik's method measures disease similarity based on the
information content (IC) of the most informative common
ancestor (MICA) of two terms. Besides the IC of the
MICA, Lin's method also considers the IC of the two
compared diseases [16]. Wang et al.'s method [17] computes
the similarity of a disease pair by considering the
contribution of all common ancestors in the ontology. It
has been successfully applied to compute similarity between
MeSH [18] terms. All of those semantic-based
methods exploited disease associations based on ontologies
and/or gene annotations. They did not, however,
consider the functional associations between disease-related
gene sets. The BOG (based on overlapping gene
sets) method was thus designed by Mathur and Dinakarpandian
[19]; it calculates disease similarity by
exploiting the co-occurrence of disease-related genes.
Mathur et al. [20] also devised a process-similarity based
(PSB) method. Instead of defining disease similarity as a
function of genes, PSB computes disease similarity based
on the Gene Ontology (GO) [21] biological process terms
associated with those genes. PSB achieved a better
performance than BOG [20]. Functional associations between
genes involve not only GO terms [22], but also
co-expression [23], protein-protein interaction [24], etc.
Cheng et al. recently presented the method FunSim [25],
which measures disease similarity using a weighted human
protein interaction network. The first neighbors of
disease-related genes in the protein network were taken
into account. FunSim further improved on the results of
PSB [25].
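To make the difference between the Resnik and Lin measures concrete, the following is a minimal sketch on a toy ontology. The ontology, the term probabilities and all function names are hypothetical and chosen only for illustration; the formulas are the standard definitions, with IC(t) = -log p(t), Resnik similarity equal to the IC of the MICA, and Lin similarity equal to twice that value divided by the summed IC of the two terms.

```python
import math

# Hypothetical toy ontology: term -> set of direct parents (IS_A edges).
PARENTS = {
    "disease": set(),
    "metabolic disease": {"disease"},
    "diabetes mellitus": {"metabolic disease"},
    "obesity": {"metabolic disease"},
    "asthma": {"disease"},
}

# Hypothetical annotation probabilities p(term); the root has p = 1.
PROB = {
    "disease": 1.0,
    "metabolic disease": 0.25,
    "diabetes mellitus": 0.10,
    "obesity": 0.10,
    "asthma": 0.05,
}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content of a term."""
    return -math.log(PROB[term])


def resnik(a, b):
    """IC of the most informative common ancestor (MICA)."""
    return max(ic(t) for t in ancestors(a) & ancestors(b))


def lin(a, b):
    """Resnik score normalized by the IC of the two compared terms."""
    denom = ic(a) + ic(b)
    return 2.0 * resnik(a, b) / denom if denom > 0 else 0.0
```

In this toy setting, diabetes mellitus and obesity share the informative ancestor "metabolic disease", whereas asthma and obesity share only the root, so both measures return zero for the latter pair.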
Although a number of methods for computing disease
similarity had been developed, no method had been pro-
posed to take advantage of the entire protein interaction
network, beyond using only the first neighbors. A
network-based method (NetSim) is proposed which
takes advantage of the entire interaction network. The
effectiveness of different data sources, including gene-disease
associations and protein-protein interactions, was also
evaluated. Most of the previously developed
methods were based on disease-related genes. However,
many diseases still have very few or no associated genes.
Relying entirely on disease-related genes greatly limits
the utility of those methods. To overcome the limitation,
a new semantic-based similarity measure (MedSim) is
developed to compute disease similarity based on the
MEDLINE database. MedSim and NetSim were eventu-
ally integrated into MedNetSim to further improve com-
puting performance.
Methods
Diseases and gene-disease association databases
The disease terms in DO were chosen as the vocabulary
for describing diseases. The DO database is a biomedical resource
of disease concepts with stable identifiers, organized
by disease etiology [14]. It contains 6,457 non-obsolete
disease terms and 6,819 IS_A relationships
among diseases. The non-obsolete disease terms were
used as the disease vocabulary. Each disease in DO has a
unique identifier, called DOID.
SIDD [26] and DisGeNET [27] were adopted as two
disease-gene association databases (Fig. 1). SIDD inte-
grated five disease-related gene databases: GeneRIF [28],
Online Mendelian Inheritance in Man (OMIM) [29],
Comparative Toxicogenomics Database (CTD) [30],
Genetic Association Database (GAD) [31], and SpliceDi-
sease [32]. In total, SIDD contains 2,427 diseases and
104,052 gene-disease associations (see Additional file 1).
The DisGeNET [27] database integrated human gene-
disease associations from various expert curated databases
and text-mining derived associations including Mendelian,
complex and environmental diseases. Compared to SIDD,
DisGeNET contains more disease-gene associations of lower
reliability, which are derived from literature mining, i.e., LHGDN [33] and
BeFree data [34]. DisGeNET contains 14,619 diseases and
429,111 gene-disease associations. UMLS ID (Unified
Medical Language System Identifier) was used as the
unique identifier for each disease in DisGeNET. We
mapped UMLS ID to DOID, which produced 3,259 dis-
ease terms and 206,403 gene-disease associations (see
Additional file 2). Almost every disease term in DisGeNET
has more associated genes than the same term in SIDD. All source
data were downloaded on or before April 30, 2015.
Protein interaction datasets
Two protein interaction datasets were used (Fig. 2). One
is hPPIN, built in house, which integrated four existing
protein interaction databases, i.e., BioGrid [35], HPRD
[36], IntAct [37], and HomoMINT [38]. Protein identifiers
were mapped to the genes coding for the proteins,
and redundant interactions were removed. The acquired
protein interaction network covered 15,710 human
genes and 143,237 interactions (Fig. 2). The other is
HumanNet [39], which is a genome-scale functional net-
work for human genes. To build HumanNet, 21 diverse
functional genomics and proteomics datasets were evaluated
for their tendencies to link human genes in the
same biological processes. Pairwise gene linkages derived
from the individual datasets were then integrated into a
comprehensive HumanNet [39]. HumanNet contains
476,399 functional linkages among 16,243 human genes
(Fig. 2). Unlike hPPIN which mainly focuses on experi-
mentally verified protein interactions, HumanNet was
constructed based on the functional probability that two
genes belonged to the same biological processes. The
two protein interaction datasets have 13,626 genes and
42,584 interactions in common (called comPPI, Fig. 2).
Additionally, different proportions of hPPIN (5 %, 10 %,
20 %, 40 %, 60 %, 80 %, 90 %) were randomly sampled 20
times and used as the protein interaction datasets to evaluate
the impact of data volume on the proposed method.
Medline-based disease similarity (MedSim)
Biomedical literature contains rich and diverse informa-
tion, such as disease symptoms, pathogenesis, thera-
peutic drugs, and so on. Features representing diseases
were generated through mining the biomedical literature
corpus; the features were then utilized to compute
disease similarity (MedSim method, Fig. 3). MedSim was
not limited to using only one aspect of disease information
(i.e., disease-related genes), but took advantage of all
relevant information that had already been archived in
the literature.
Fig 1 Gene-disease association databases. a: The number of diseases, b: The number of associations between genes and diseases
Fig 2 Protein interaction datasets. a: The number of genes, b: The number of interactions between genes, *: The common protein-protein interactions
Disease corpus
The text corpus contains all MEDLINE abstracts published
up to year 2015. The non-obsolete disease terms
in DO were used as the disease vocabulary. Each disease
term was mapped to the Unified Medical Language System
(UMLS) [40] so that its synonyms could be retrieved. Syn-
onyms were taken directly from DO for diseases that
could not be mapped to UMLS. Every disease term and its
synonyms were then used as keywords to perform
keyword-based queries into MEDLINE to retrieve ab-
stracts related to that disease. To limit computational cost,
only the top 100 most relevant abstracts were selected to
construct the bag-of-words model for diseases. The rele-
vance of an abstract to a disease was defined in Eq. 1.
$$R_{abstract} = \sum_{X \in W} W_{df} \times W_{of} \qquad (1)$$
where $W_{df}$ and $W_{of}$ represent the document frequency
and occurrence frequency of a word X, respectively.
Document frequency $W_{df}$ is the proportion of abstracts
that contain word X; it represents the relevance of
word X to a disease. Occurrence frequency $W_{of}$ is
the number of times word X occurs in an abstract,
measuring the importance of word X in that specific abstract.
For a specific disease, W is defined as the set of
nouns X that appear in the retrieved abstracts and whose
$W_{df}$ is greater than 0.005. A larger $R_{abstract}$ means that
an abstract is more closely related to the disease. Some diseases
have not yet been broadly studied, so fewer than 100
abstracts may be retrieved for them. In those cases,
all retrieved abstracts were used. For each disease, the
selected most relevant abstracts were merged into one
combined document. At the end of preprocessing, every
disease was associated with one document. These documents
together made up the disease corpus.
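The abstract-ranking step of Eq. 1 can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes each abstract has already been reduced to a list of noun tokens (the noun extraction itself is omitted), and the function names `relevance_scores` and `top_abstracts` are hypothetical.

```python
from collections import Counter


def relevance_scores(abstracts, min_df=0.005):
    """Score each tokenized abstract as the sum over words of W_df * W_of."""
    n = len(abstracts)
    # Document frequency W_df: fraction of abstracts containing each word.
    doc_counts = Counter()
    for tokens in abstracts:
        doc_counts.update(set(tokens))
    w_df = {word: count / n for word, count in doc_counts.items()}
    # W: words whose document frequency exceeds the threshold.
    vocab = {word for word, value in w_df.items() if value > min_df}
    scores = []
    for tokens in abstracts:
        w_of = Counter(tokens)  # occurrence frequency within this abstract
        scores.append(sum(w_df[w] * c for w, c in w_of.items() if w in vocab))
    return scores


def top_abstracts(abstracts, k=100):
    """Keep the k highest-scoring abstracts (all of them if fewer than k)."""
    scores = relevance_scores(abstracts)
    order = sorted(range(len(abstracts)), key=lambda i: scores[i], reverse=True)
    return [abstracts[i] for i in order[:k]]
```

For example, with the three toy abstracts `[["asthma", "airway", "asthma"], ["asthma", "insulin"], ["airway"]]`, the first abstract scores highest because it repeats a word that is frequent across the retrieved set.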
Constructing the bag-of-words model and computing
MedSim
The disease corpus was tokenized to obtain the word vocabulary,
using the Python package NLTK (Natural Language
Toolkit, www.nltk.org) to remove non-alphabetic words
and reduce inflected/derived words to their stem. Overly
common (appeared in more than 60 % of the docu-
ments) or rare (appeared in less than 4 documents)
words were removed, as those words could not provide
meaningful information. Each disease was then repre-
sented by a word vector, whose dimensionality is the size
of the word vocabulary. Each dimension was assigned a
weight (TF-IDF, that is, TF times IDF) based on term fre-
quency (TF) and inverse document frequency (IDF)
values. TF is the number of times a word appears in a
document. IDF represents the inverse of the number of
documents containing the word. TF-IDF assigns larger
weights to words that appeared more often in a document
but only in a small percentage of all documents, as those
words are important and informative for that document.
With diseases represented as TF-IDF weighted vectors,
the MedSim of two diseases was measured by calculating
the cosine similarity of the two vectors. Python package
scikit-learn [41] was used to perform the computation.
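The vector construction and cosine comparison described above can be illustrated with a small pure-Python sketch. It is not the authors' scikit-learn pipeline: the function names are hypothetical, the input is assumed to be already tokenized and stemmed, and the IDF is taken as log(N / df), a common variant that may differ from the exact weighting used in the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs, max_df_ratio=0.6, min_df_count=4):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    doc_counts = Counter()
    for tokens in docs:
        doc_counts.update(set(tokens))
    # Drop overly common words (> 60 % of documents) and rare words (< 4 documents).
    vocab = {
        word
        for word, count in doc_counts.items()
        if count >= min_df_count and count / n <= max_df_ratio
    }
    vectors = []
    for tokens in docs:
        tf = Counter(t for t in tokens if t in vocab)
        # Assumed IDF variant: log(N / df).
        vectors.append({w: c * math.log(n / doc_counts[w]) for w, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

On a realistic corpus the default thresholds mirror the 60 % and 4-document cut-offs in the text; for a toy corpus of a few documents, `min_df_count` has to be lowered, otherwise every word is filtered out.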
Network-based disease similarity (NetSim)
Previously published methods were not designed to utilize
the entire protein interaction network. They instead fo-
cused only on the disease-related genes or their first
neighbors in the network. To take full advantage of the
entire protein interaction network, random walk with re-
start (RWR) [42, 43] (see [44] for working details) was
used to measure Functional Relevance (FR) between a
gene g and a gene set G, which is described in Eq. 2.
$$FR_G(g) = \begin{cases} P_{RWR}, & g \in \text{protein interaction network} \\ 1, & g \notin \text{protein interaction network and } g \in G \\ 0, & g \notin \text{protein interaction network and } g \notin G \end{cases} \qquad (2)$$
Fig 3 Overview of MedSim. DO: Human Disease Ontology database; UMLS: Unified Medical Language System
where gene set G is defined to be the seed genes,
that is, the known set of genes associated with a disease.
The initial probability of each seed gene was set to 1.0.
$P_{RWR}$ represents the steady-state probability of
gene g after running RWR on the whole protein interaction
network. According to Eq. 2, a larger probability ($FR_G(g)$)
is assigned to gene g when it sits closer to the gene
set G in the network, which means that
gene g is more functionally related to gene set G.
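A minimal sketch of random walk with restart on an unweighted, undirected network is given below. It is an illustration under stated assumptions rather than the authors' implementation: the restart probability of 0.7, the degree-normalized transition rule, the convergence tolerance and both function names are hypothetical choices, and `functional_relevance` encodes one reading of Eq. 2 in which genes outside the network score 1 if they are seed genes and 0 otherwise.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W * p + r * p0 until the update is negligible.

    adjacency maps each gene to the set of its neighbours (undirected network).
    Each seed gene present in the network starts with probability 1.0.
    """
    nodes = list(adjacency)
    p0 = {n: (1.0 if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            degree = len(adjacency[n])
            if degree == 0:
                continue  # isolated node: nothing to propagate
            share = (1.0 - restart) * p[n] / degree
            for m in adjacency[n]:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


def functional_relevance(gene, seeds, steady_state):
    """Assumed reading of Eq. 2 for genes inside and outside the network."""
    if gene in steady_state:
        return steady_state[gene]
    return 1.0 if gene in seeds else 0.0
```

On a three-gene path A - B - C with A as the only seed, the steady-state values decrease with distance from A, which is the behaviour the text describes for genes sitting closer to the seed set.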
Suppose that $G_1 = \{g_{11}, g_{12}, \ldots\}$ and $G_2 = \{g_{21}, g_{22}, \ldots\}$ are
the seed gene sets for diseases $d_1$ and $d_2$, respectively.
Then, the NetSim of $d_1$ and $d_2$ is defined in Eq. 3.
$$NetSim(G_1, G_2) = \frac{\sum_{1 \le i \le len(G_1)} FR_{G_2}(g_{1i}) + \sum_{1 \le j \le len(G_2)} FR_{G_1}(g_{2j})}{len(G_1) + len(G_2)}, \quad g_{1i} \in G_1,\ g_{2j} \in G_2 \qquad (3)$$
where $len(G_1)$ and $len(G_2)$ are the numbers of genes in
$G_1$ and $G_2$, respectively. The numerator is the sum of the
functional relevance of each $g_{1i}$ to $G_2$ and of each $g_{2j}$ to $G_1$.
A higher NetSim value represents a closer connection between $G_1$ and $G_2$,
which suggests closer ties between diseases $d_1$ and $d_2$.
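The averaging in Eq. 3 can be sketched in a few lines. The function name `netsim` and the way functional relevance is passed in (as two callables) are assumptions made for illustration; how the relevance values are obtained is left outside this sketch.

```python
def netsim(g1, g2, fr_to_g1, fr_to_g2):
    """Average functional relevance across both seed gene sets.

    fr_to_g1(gene) returns the functional relevance of a gene to set G1,
    fr_to_g2(gene) returns the functional relevance of a gene to set G2.
    """
    total = sum(fr_to_g2(g) for g in g1) + sum(fr_to_g1(g) for g in g2)
    size = len(g1) + len(g2)
    return total / size if size else 0.0
```

Each gene of one disease is scored against the gene set of the other disease, so the measure is symmetric in its two arguments by construction.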
MedSim and NetSim are combined into MedNetSim,
which is defined in Eq. 4.
$$MedNetSim(d_1, d_2) = MedSim(d_1, d_2) \times NetSim(G_1, G_2) \qquad (4)$$
where $d_1$ and $d_2$ are two diseases in DO, and $G_1$ and $G_2$
are the seed gene sets for $d_1$ and $d_2$, respectively.
Performance evaluation
Similarities of disease pairs in the benchmark set and
the random set were calculated and ranked in descending
order. Receiver operating characteristic (ROC) [45]
curves were then drawn to evaluate and quantify the
predictive power of the proposed methods. A ROC
curve is a plot of the true positive rate of a classifier as a
function of the false positive rate. The area under the
ROC curve (AUC) is used as a quantitative measure of a
classifier's quality [46]. Disease pairs in the benchmark
set and the random set are defined as positives and negatives,
respectively. True positives are the disease pairs
in the benchmark set that are correctly predicted by a
classifier, and false positives are those disease pairs from
the random set that are predicted to be positives but are not
found in the benchmark set. The larger the share of benchmark
disease pairs that receive high rankings, the better the
AUC value. The benchmark set was taken
from reference [25]. It had 47 diseases and 70 disease
pairs (see Additional file 3) with high similarity derived
from two manually checked datasets by Suthram et al.
[2] and Pakhomov et al. [47]. Cancers were omitted. The
benchmark set contains disease pairs that are expected
to be related to each other, such as Alzheimer's disease
(DOID: 10652) and schizophrenia (DOID: 5419), and diabetes
mellitus (DOID: 9351) and obesity (DOID: 9970).
It also includes some pairs that are not apparently related
but were found to be correlated by various lines of
evidence, such as asthma (DOID: 2841) and diabetes
mellitus, and malaria (DOID: 12365) and anemia (DOID:
2355). To generate a random set, 700 disease pairs were
randomly selected from DO, with disease pairs from the
benchmark set removed.
To get an average AUC of the proposed methods, the above
experiment was iterated 50 times by calculating similarities
of disease pairs in the benchmark set and 50 random sets.
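The evaluation loop can be illustrated with a small sketch that computes AUC directly from similarity scores. It relies on the standard equivalence between AUC and the probability that a randomly chosen positive pair is scored above a randomly chosen negative pair (ties counted as one half); the function names are hypothetical, and the scores are assumed to have been computed beforehand for the benchmark set and for each random set.

```python
def auc(positive_scores, negative_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positive_scores) * len(negative_scores))


def average_auc(positive_scores, negative_score_sets):
    """Mean AUC of one benchmark set against several random sets."""
    values = [auc(positive_scores, negatives) for negatives in negative_score_sets]
    return sum(values) / len(values)
```

With 70 benchmark pairs and 700 random pairs per run, the pairwise comparison involves 49,000 score pairs per run, which is small enough that the quadratic loop needs no optimization.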
MedSim was compared with other semantic-based
methods including Resnik [15], Lin [16] and Wang [17],
based on HPO and DO, respectively. For each disease,
the associated HPO annotations were acquired from
[48], which covered disease-phenotype associations for
over 6000 common, rare, infectious and Mendelian diseases
through a text-mining approach. The HPO-based
disease similarities were defined by calculating the semantic
similarity of their associated HPO phenotypes.
For two diseases ($d_1$, $d_2$), the HPO-based similarity of $d_1$
to $d_2$ is defined as follows:
$$HPO\_sim(d_1 \to d_2) = \operatorname{avg}\left[ \sum_{s \in d_1} \max_{t \in d_2} SemSim(s, t) \right] \qquad (5)$$
where s and t are the annotated phenotypes of $d_1$ and
$d_2$, respectively. SemSim() is one of the methods applied to
compute the semantic similarity of two phenotype terms
(Resnik, Lin or Wang). For each phenotype term of $d_1$,
Eq. 5 finds the best match among the phenotype
terms annotated to $d_2$ and then averages these best-match
scores over all phenotype terms of $d_1$. Note that this similarity
is asymmetric, i.e., $HPO\_sim(d_1 \to d_2)$ is not always equal
to $HPO\_sim(d_2 \to d_1)$. Therefore, we used a symmetric
HPO-based similarity, which is defined in Eq. 6:
$$HPO\_sim(d_1, d_2) = \frac{1}{2} HPO\_sim(d_1 \to d_2) + \frac{1}{2} HPO\_sim(d_2 \to d_1) \qquad (6)$$
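The best-match averaging of Eq. 5 and the symmetrization of Eq. 6 can be sketched as follows. The function names are hypothetical, and `sem_sim` stands for any term-level measure (Resnik, Lin or Wang) supplied by the caller; this is an illustration of the described procedure, not the authors' code.

```python
def directed_sim(terms_a, terms_b, sem_sim):
    """Average, over the terms of A, of each term's best match in B."""
    if not terms_a or not terms_b:
        return 0.0
    best = [max(sem_sim(s, t) for t in terms_b) for s in terms_a]
    return sum(best) / len(best)


def symmetric_sim(terms_a, terms_b, sem_sim):
    """Mean of the two directed similarities."""
    forward = directed_sim(terms_a, terms_b, sem_sim)
    backward = directed_sim(terms_b, terms_a, sem_sim)
    return 0.5 * forward + 0.5 * backward
```

As a quick check of the asymmetry, take a toy `sem_sim` that returns 1.0 for identical terms and 0.0 otherwise: a disease annotated with terms x and y scores 0.5 against a disease annotated only with x, while the reverse direction scores 1.0, and the symmetric value is 0.75.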
The DO-based disease similarities were defined as the
direct semantic similarity of two disease terms in DO,
to which the three semantic-based methods mentioned
above (Resnik, Lin and Wang) were also applied. NetSim
was also compared with other function-based
methods including BOG [19], PSB [20] and FunSim [25].
Parameters of the aforementioned methods were set to the
values used in the original papers.
References
Journal Article

Scikit-learn: Machine Learning in Python

TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Journal ArticleDOI

Gene Ontology: tool for the unification of biology

TL;DR: The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.
Journal ArticleDOI

An introduction to ROC analysis

TL;DR: The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.
Proceedings Article

An Information-Theoretic Definition of Similarity

Dekang Lin
TL;DR: This work presents an informationtheoretic definition of similarity that is applicable as long as there is a probabilistic model and demonstrates how this definition can be used to measure the similarity in a number of different domains.