Fusing literature and full network data improves disease similarity computation.

doi:10.1186/S12859-016-1205-4

RESEARCH ARTICLE Open Access


Ping Li 1,2, Yaling Nie 1,2 and Jingkai Yu 1*

Abstract

Background: Identifying relatedness among diseases could deepen understanding of the underlying pathogenic mechanisms of diseases and facilitate drug repositioning projects. A number of methods for computing disease similarity have been developed; however, none of them was designed to utilize information from the entire protein interaction network, using instead only those interactions involving disease-causing genes. Most previously published methods require gene-disease association data; unfortunately, many diseases still have very few or no associated genes, which has impeded broad adoption of those methods. In this study, we propose a new method (MedNetSim) for computing disease similarity by integrating medical literature and the protein interaction network. MedNetSim consists of a network-based method (NetSim), which employs the entire protein interaction network, and a MEDLINE-based method (MedSim), which computes disease similarity by mining the biomedical literature.

Results: Among function-based methods, NetSim achieved the best performance. Its average AUC (area under the receiver operating characteristic curve) reached 95.2 %. MedSim, whose performance was comparable even to some function-based methods, achieved the highest average AUC among all semantic-based methods. Integration of MedSim and NetSim (MedNetSim) further improved the average AUC to 96.4 %. We further studied the effectiveness of different data sources. The quality of protein interaction data was found to be more important than its volume. In contrast, a higher volume of gene-disease association data was more beneficial, even at lower reliability. Utilizing a higher volume of disease-related gene data further improved the average AUC of MedNetSim and NetSim to 97.5 % and 96.7 %, respectively.

Conclusions: Integrating biomedical literature and the protein interaction network can be an effective way to compute disease similarity. When sufficient disease-related gene data are lacking, literature-based methods such as MedSim can be a great addition to function-based algorithms. It may be beneficial to steer more resources toward studying gene-disease associations and improving the quality of protein interaction data. Disease similarities can be computed using the proposed methods at http://www.digintelli.com:8000/.

Keywords: Disease similarity, MedSim, NetSim, MedNetSim, Random walk with Restart

Abbreviations: AUC, The area under the ROC curve; BOG, Based on overlapping gene sets method; comPPI, Common protein-protein interactions of hPPIN and HumanNet; CTD, Comparative toxicogenomics database; DisGeNET, A database of gene-disease associations; DO, Disease ontology; DOID, Disease ontology identifier; DSN, Disease similarity network; GAD, Genetic association database; GO, Gene ontology; GO_BP, GO biological process; GO_CC, GO cellular component; GO_MF, GO molecular function; HPO, Human phenotype ontology; IC, Information content; IDF, Inverse document frequency; MeSH, Medical subject headings;


* Correspondence: jkyu@ipe.ac.cn

1 State Key Laboratory of Biochemical Engineering, Institute of Process Engineering, Chinese Academy of Sciences, Beijing 100190, China

Full list of author information is available at the end of the article

International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and

reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to

the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver

(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Li et al. BMC Bioinformatics (2016) 17:326

DOI 10.1186/s12859-016-1205-4


MG, Myasthenia gravis; MICA, The most informative common ancestor; NLTK, Natural language toolkit; OMIM, Online mendelian inheritance in man; PSB, Process-similarity based method; ROC, Receiver operating characteristic; RWR, Random walk with restart; FR, Functional relevance; SIDD, A semantically integrated disease database; TF, Term frequency; TF-IDF, Term frequency times inverse document frequency; UMLS ID, Unified medical language system identifier; UMLS, Unified medical language system

Background

Discovering closely related diseases could be helpful in

revealing their common pathophysiology [1, 2]. It may

also be useful for identifying novel drug indications [3],

as similar diseases may have the same or similar thera-

peutic targets, which suggests they could be treated with

the same or similar drugs. There has been a growing

interest in quantitatively measuring similarities between

diseases [4–7].

Phenotypic similarity plays an important role in a

number of biological and biomedical applications [8].

In recent years, researchers have designed several methods based on the Human Phenotype Ontology (HPO) [9] to find related diseases and predict disease-causing genes, such as Phenomizer [10], Exomiser [11] and PhenIX [12]. The HPO provides a controlled and standardized vocabulary of phenotypic abnormalities that characterize human diseases. Phenotypic similarity has also become the most common basis for defining classification rules for diseases. The classifications of disease terms in Medical Subject Headings (MeSH) [13] and Disease Ontology (DO) [14] take this approach. To quantify disease similarity, several semantic-based methods have thus been proposed based on HPO, MeSH or DO, such as Resnik [15], Lin [16] and Wang [17]. Resnik’s method measures disease similarity based on the information content (IC) of the most informative common ancestor (MICA) of two terms. Besides the IC of the MICA, Lin’s method also considers the IC of the two compared diseases [16]. Wang et al.’s method [17] computes the similarity of a disease pair by considering the contributions of all common ancestors in the ontology. It has been successfully applied to compute similarity between MeSH [18] terms. All of those semantic-based methods exploit disease associations based on ontologies and/or gene annotations. They do not, however, consider the functional associations between disease-related gene sets. The BOG (based on overlapping gene sets) method was thus designed by Mathur and Dinakarpandian [19]; it calculates disease similarity by exploiting the co-occurrence of disease-related genes. Mathur et al. [20] also devised a process-similarity based (PSB) method. Instead of defining disease similarity as a function of genes, PSB computes disease similarity based on the Gene Ontology (GO) [21] biological process terms associated with those genes. PSB achieved better performance than BOG [20]. Functional associations between genes involve not only GO terms [22], but also co-expression [23], protein-protein interaction [24], etc. Cheng et al. recently presented the method FunSim [25], which measures disease similarity using a weighted human protein interaction network. The first neighbors of disease-related genes in the protein network were taken into account. FunSim further improved on the results of PSB [25].

Although a number of methods for computing disease similarity have been developed, none takes advantage of the entire protein interaction network beyond the first neighbors of disease-related genes. Here, a network-based method (NetSim) is proposed that takes advantage of the entire interaction network. The effectiveness of different data sources, including gene-disease associations and protein-protein interactions, was also evaluated. Most of the previously developed methods were based on disease-related genes. However, many diseases still have very few or no associated genes. Relying entirely on disease-related genes greatly limits the utility of those methods. To overcome this limitation, a new semantic-based similarity measure (MedSim) was developed to compute disease similarity based on the MEDLINE database. MedSim and NetSim were finally integrated into MedNetSim to further improve performance.

Methods

Diseases and gene-disease association databases

The disease terms in DO were chosen as the vocabulary for describing diseases. The DO database is a biomedical resource of disease concepts with stable identifiers organized by disease etiology [14]. It contains 6,457 non-obsolete disease terms and 6,819 ‘IS_A’ relationships among diseases. The non-obsolete disease terms were used as the disease vocabulary. Each disease in DO has a unique identifier, called DOID.

SIDD [26] and DisGeNET [27] were adopted as two

disease-gene association databases (Fig. 1). SIDD inte-

grated five disease-related gene databases: GeneRIF [28],

Online Mendelian Inheritance in Man (OMIM) [29],

Comparative Toxicogenomics Database (CTD) [30],

Genetic Association Database (GAD) [31], and SpliceDi-

sease [32]. In total, SIDD contains 2,427 diseases and

104,052 gene-disease associations (see Additional file 1).


The DisGeNET [27] database integrates human gene-disease associations from various expert-curated databases and text-mining derived associations, covering Mendelian, complex and environmental diseases. Compared to SIDD, DisGeNET contains more lower-reliability disease-gene associations based on literature mining, i.e., LHGDN [33] and BeFree data [34]. DisGeNET contains 14,619 diseases and 429,111 gene-disease associations. UMLS ID (Unified Medical Language System Identifier) was used as the unique identifier for each disease in DisGeNET. We mapped UMLS IDs to DOIDs, which produced 3,259 disease terms and 206,403 gene-disease associations (see Additional file 2). Almost every disease term has more associated genes in DisGeNET than in SIDD. All source data were downloaded by April 30, 2015.

Protein interaction datasets

Two protein interaction datasets were used (Fig. 2). One is hPPIN, built in house, which integrates four existing protein interaction databases, i.e., BioGrid [35], HPRD [36], IntAct [37], and HomoMINT [38]. Protein identifiers were mapped to the genes coding for the proteins, and redundant interactions were removed. The resulting protein interaction network covers 15,710 human genes and 143,237 interactions (Fig. 2). The other is HumanNet [39], a genome-scale functional network for human genes. To build HumanNet, 21 diverse functional genomics and proteomics datasets were evaluated for their tendencies to link human genes in the same biological processes. Pairwise gene linkages derived from the individual datasets were then integrated into a comprehensive HumanNet [39]. HumanNet contains 476,399 functional linkages among 16,243 human genes (Fig. 2). Unlike hPPIN, which mainly focuses on experimentally verified protein interactions, HumanNet was constructed based on the functional probability that two genes belong to the same biological processes. The two protein interaction datasets have 13,626 genes and 42,584 interactions in common (called comPPI, Fig. 2).

Additionally, different proportions of hPPIN (5 %, 10 %, 20 %, 40 %, 60 %, 80 %, 90 %) were each randomly sampled 20 times and used as the protein interaction datasets to evaluate the impact of data volume on the proposed method.
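The subsampling step can be sketched as follows. This is only an illustration: the function name, the fixed random seed, the rounding of the sample size, and sampling without replacement within each repeat are assumptions, since the text states only that each proportion was randomly sampled 20 times.

```python
import random

def sample_interactions(interactions, proportion, n_repeats=20, seed=0):
    """Return n_repeats random subsets, each holding `proportion` of the edges."""
    rng = random.Random(seed)            # fixed seed (assumption) for repeatability
    edges = sorted(interactions)         # stable order before sampling
    k = round(len(edges) * proportion)   # number of interactions kept per subset
    return [rng.sample(edges, k) for _ in range(n_repeats)]

# Toy network of 10 hypothetical gene pairs (not real hPPIN data).
toy_edges = [("g%d" % i, "g%d" % (i + 1)) for i in range(10)]
subsets = sample_interactions(toy_edges, 0.4)
```

Each subset would then replace the full network when NetSim is recomputed, so that the spread of AUC values across the 20 repeats reflects the effect of data volume alone.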

Medline-based disease similarity (MedSim)

Biomedical literature contains rich and diverse information, such as disease symptoms, pathogenesis, therapeutic drugs, and so on. Features representing diseases were generated by mining the biomedical literature corpus; the features were then used to compute disease similarity (MedSim method, Fig. 3). MedSim is not limited to a single aspect of disease information (i.e., disease-related genes), but takes advantage of all relevant information already archived in the literature.

Disease corpus

Fig 1 Gene-disease association databases. a: The number of diseases; b: The number of associations between genes and diseases

Fig 2 Protein interaction datasets. a: The number of genes; b: The number of interactions between genes; *: The common protein-protein interactions

The text corpus contains all MEDLINE abstracts published up to 2015. The non-obsolete disease terms in DO were used as the disease vocabulary. Each disease term was mapped to the Unified Medical Language System (UMLS) [40] so that its synonyms could be retrieved. Synonyms were taken directly from DO for diseases that could not be mapped to UMLS. Every disease term and its synonyms were then used as keywords to perform keyword-based queries against MEDLINE to retrieve abstracts related to that disease. To limit computational cost, only the top 100 most relevant abstracts were selected to construct the bag-of-words model for each disease. The relevance of an abstract to a disease is defined in Eq. 1.

$$R_{abstract} = \sum_{X \in W} W_{df} \times W_{of} \qquad (1)$$

Where $W_{df}$ and $W_{of}$ represent the document frequency and occurrence frequency of a word X, respectively. Document frequency $W_{df}$ is the proportion of abstracts that contain word X; it represents the relevance of word X to a disease. Occurrence frequency $W_{of}$ is the number of times word X occurs in an abstract; it measures the importance of word X in that specific abstract. For a specific disease, W is defined as the set of nouns (Xs) that appear in the abstracts and whose $W_{df}$ is greater than 0.005. A larger $R_{abstract}$ means that an abstract is more closely related to the disease. Some diseases have not yet been broadly studied, so the number of retrieved abstracts can be less than 100. In those cases, all retrieved abstracts were used. For each disease, the selected most relevant abstracts were merged into one combined document. At the end of preprocessing, every disease was associated with one document. These documents together made up the disease corpus.
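The abstract ranking of Eq. 1 can be illustrated with a minimal sketch. It assumes that noun extraction has already been done (each abstract is given as a list of noun tokens) and that document frequency is computed over the abstracts retrieved for one disease; the function and variable names are hypothetical.

```python
from collections import Counter

def rank_abstracts(abstract_nouns, df_threshold=0.005, top_n=100):
    """abstract_nouns: one list of noun tokens per retrieved abstract.
    Returns (indices of the top_n abstracts by Eq. 1 score, all scores)."""
    n = len(abstract_nouns)
    # W_df: proportion of the retrieved abstracts that contain each noun.
    doc_counts = Counter(w for nouns in abstract_nouns for w in set(nouns))
    w_df = {w: c / n for w, c in doc_counts.items() if c / n > df_threshold}
    scores = []
    for nouns in abstract_nouns:
        w_of = Counter(nouns)  # W_of: occurrences of each noun in this abstract
        scores.append(sum(w_df[w] * k for w, k in w_of.items() if w in w_df))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return order[:top_n], scores

# Three toy abstracts for one hypothetical disease.
toy = [["insulin", "glucose", "insulin"], ["glucose"], ["fracture"]]
order, scores = rank_abstracts(toy)
```

In the toy input, the first abstract scores highest because it repeats a noun and also contains the noun shared with another abstract.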

Constructing the bag-of-words model and computing

MedSim

The disease corpus was tokenized to obtain the word vocabulary, using the Python package NLTK (Natural Language Toolkit, www.nltk.org) to remove non-alphabetic words and reduce inflected/derived words to their stems. Overly common words (appearing in more than 60 % of the documents) and rare words (appearing in fewer than 4 documents) were removed, as those words could not provide meaningful information. Each disease was then represented by a word vector whose dimensionality is the size of the word vocabulary. Each dimension was assigned a weight (TF-IDF, that is, TF times IDF) based on term frequency (TF) and inverse document frequency (IDF) values. TF is the number of times a word appears in a document. IDF represents the inverse of the number of documents containing the word. TF-IDF assigns larger weights to words that appear often in a document but only in a small percentage of all documents, as those words are important and informative for that document.

With diseases represented as TF-IDF weighted vectors,

the MedSim of two diseases was measured by calculating

the cosine similarity of the two vectors. Python package

scikit-learn [41] was used to perform the computation.
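The vector construction and cosine comparison can be sketched with the standard library alone. This is not the scikit-learn computation used in the study: the IDF form log(N/df) is an assumption and may differ from the weighting actually applied, and the function names are hypothetical. The 60 % and 4-document cut-offs follow the text, but the toy call lowers the rare-word cut-off to 1 so that a four-document example is not emptied.

```python
import math
from collections import Counter

def tfidf_vectors(docs, max_df=0.6, min_df=4):
    """docs: dict disease -> list of stemmed tokens. Returns sparse TF-IDF vectors."""
    n = len(docs)
    df = Counter(w for tokens in docs.values() for w in set(tokens))
    # Drop overly common words (> max_df of documents) and rare words (< min_df documents).
    vocab = {w for w, c in df.items() if c >= min_df and c / n <= max_df}
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(t for t in tokens if t in vocab)
        # Weight = term frequency times log(N / document frequency) (assumed IDF form).
        vectors[name] = {w: k * math.log(n / df[w]) for w, k in tf.items()}
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

toy_docs = {
    "d1": ["insulin", "glucos", "insulin"],
    "d2": ["insulin", "glucos"],
    "d3": ["bone", "fractur"],
    "d4": ["bone"],
}
vecs = tfidf_vectors(toy_docs, min_df=1)
sim_12 = cosine(vecs["d1"], vecs["d2"])
sim_13 = cosine(vecs["d1"], vecs["d3"])
```

Diseases whose documents share no retained words, such as d1 and d3 here, receive a similarity of zero, while d1 and d2 score close to one.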

Network-based disease similarity (NetSim)

Previously published methods were not designed to utilize the entire protein interaction network. They instead focused only on the disease-related genes or their first neighbors in the network. To take full advantage of the entire protein interaction network, random walk with restart (RWR) [42, 43] (see [44] for working details) was used to measure the Functional Relevance (FR) between a gene g and a gene set G, as described in Eq. 2.

$$FR_G(g) = \begin{cases} P_{RWR}, & g \in \text{protein interaction network} \\ 1, & g \notin \text{protein interaction network and } g \in G \\ 0, & g \notin \text{protein interaction network and } g \notin G \end{cases} \qquad (2)$$

Fig 3 Overview of MedSim. DO: Human Disease Ontology database; UMLS: Unified Medical Language System


Where gene set G is defined to be the seed genes, that is, the known set of genes associated with a disease. The initial probability of each seed gene was set to 1.0. $P_{RWR}$ represents the steady-state probability of gene g after running RWR on the whole protein interaction network. According to Eq. 2, a larger probability ($FR_G(g)$) is assigned to gene g when it sits closer to the gene set G in the network, which means that gene g is more functionally related to gene set G.
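A minimal sketch of Eq. 2 on an unweighted, undirected network is given below. The restart probability of 0.7, the convergence tolerance, the iteration cap and all function names are assumptions; the text specifies only that each seed gene starts with probability 1.0 and that the steady-state probability is used.

```python
def rwr(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Random walk with restart. adjacency: dict gene -> set of neighbour genes.
    Returns the (approximate) steady-state probability of every gene in the network."""
    p0 = {g: (1.0 if g in seeds else 0.0) for g in adjacency}  # each seed starts at 1.0
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {g: restart * p0[g] for g in adjacency}          # restart at the seeds
        for g, neighbours in adjacency.items():
            if not neighbours:
                continue
            share = (1.0 - restart) * p[g] / len(neighbours)   # walk to each neighbour
            for h in neighbours:
                nxt[h] += share
        delta = sum(abs(nxt[g] - p[g]) for g in adjacency)
        p = nxt
        if delta < tol:
            break
    return p

def functional_relevance(gene, seeds, adjacency, steady_state):
    """Eq. 2: RWR probability inside the network, otherwise 1 for seeds and 0 for the rest."""
    if gene in adjacency:
        return steady_state[gene]
    return 1.0 if gene in seeds else 0.0

# Toy path network A-B-C-D with hypothetical gene names and a single seed.
toy_network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
steady = rwr(toy_network, {"A"})
```

On the toy path, the probability decreases with distance from the seed, which is the behaviour the text describes for genes that sit closer to the seed set.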

Suppose that $G_1 = \{g_{11}, g_{12}, \ldots\}$ and $G_2 = \{g_{21}, g_{22}, \ldots\}$ are the seed gene sets for diseases $d_1$ and $d_2$, respectively. Then, the NetSim of $d_1$ and $d_2$ is defined in Eq. 3.

$$NetSim(G_1, G_2) = \frac{\sum_{1 \le i \le len(G_1)} FR_{G_2}(g_{1i}) + \sum_{1 \le j \le len(G_2)} FR_{G_1}(g_{2j})}{len(G_1) + len(G_2)}, \quad g_{1i} \in G_1,\ g_{2j} \in G_2 \qquad (3)$$

Where $len(G_1)$ and $len(G_2)$ are the numbers of genes in $G_1$ and $G_2$, respectively. The numerator is the sum of the functional relevance of $g_{1i}$ to $G_2$ and of $g_{2j}$ to $G_1$. A higher NetSim value represents a closer connection between $G_1$ and $G_2$, which suggests closer ties between diseases $d_1$ and $d_2$.

MedSim and NetSim are combined into MedNetSim, which is defined in Eq. 4.

$$MedNetSim(d_1, d_2) = MedSim(d_1, d_2) \times NetSim(G_1, G_2) \qquad (4)$$

Where $d_1$ and $d_2$ are two diseases in DO, and $G_1$ and $G_2$ are the seed gene sets for $d_1$ and $d_2$, respectively.
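Eqs. 3 and 4 reduce to a few lines once a functional relevance function is available. In the sketch below, `fr` is passed in as a callable, and `toy_fr` is a hypothetical stand-in with made-up values rather than an RWR result; the MedSim value of 0.5 is likewise illustrative.

```python
def net_sim(genes_1, genes_2, fr):
    """Eq. 3. fr(gene, seed_set) returns the functional relevance of gene to seed_set."""
    total = sum(fr(g, genes_2) for g in genes_1) + sum(fr(g, genes_1) for g in genes_2)
    return total / (len(genes_1) + len(genes_2))

def med_net_sim(med_sim_value, net_sim_value):
    """Eq. 4: product of the literature-based and network-based similarities."""
    return med_sim_value * net_sim_value

# Hypothetical stand-in for FR: 1.0 if the gene is itself a seed, 0.2 otherwise.
def toy_fr(gene, seed_set):
    return 1.0 if gene in seed_set else 0.2

net = net_sim({"A", "B"}, {"B", "C"}, toy_fr)   # (0.2 + 1.0 + 1.0 + 0.2) / 4
combined = med_net_sim(0.5, net)
```

Because the two scores are multiplied, a disease pair is ranked highly only when both the literature and the network evidence support the relationship.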

Performance evaluation

Similarities of disease pairs in the benchmark set and the random set were calculated and ranked in descending order; receiver operating characteristic (ROC) [45] curves were then drawn to evaluate and quantify the predictive power of the proposed methods. A ROC curve is a plot of the true positive rate of a classifier as a function of the false positive rate. The area under the ROC curve (AUC) is used as a quantitative measure of a classifier’s quality [46]. Disease pairs in the benchmark set and the random set are defined as positives and negatives, respectively. True positives are the disease pairs in the benchmark set that are correctly predicted by a classifier, and false positives are disease pairs from the random set that are predicted to be positives but are not found in the benchmark set. The larger the percentage of benchmark disease pairs that receive high rankings, the better the AUC value. The benchmark set was taken from reference [25]. It has 47 diseases and 70 disease pairs (see Additional file 3) with high similarity, derived from two manually checked datasets by Suthram et al. [2] and Pakhomov et al. [47]. Cancers were omitted. The benchmark set contains disease pairs that are expected to be related to each other, such as Alzheimer’s disease (DOID: 10652) and schizophrenia (DOID: 5419), and diabetes mellitus (DOID: 9351) and obesity (DOID: 9970). It also includes some pairs that are not apparently related but were found to be correlated by various lines of evidence, such as asthma (DOID: 2841) and diabetes mellitus, and malaria (DOID: 12365) and anemia (DOID: 2355). For the random set, 700 disease pairs were randomly selected from DO, with disease pairs from the benchmark set removed. To obtain an average AUC for the proposed methods, the above experiment was iterated 50 times by calculating similarities of disease pairs in the benchmark set and 50 random sets.
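The evaluation can be sketched with the rank-based form of the AUC, which equals the probability that a randomly chosen benchmark pair scores higher than a randomly chosen random pair. The function names and the toy scores are illustrative assumptions; in the study the positives would be the 70 benchmark pairs and each negative set one of the 50 random sets of 700 pairs.

```python
def auc(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

def average_auc(positive_scores, negative_sets):
    """Mean AUC of one benchmark set against several random sets."""
    return sum(auc(positive_scores, neg) for neg in negative_sets) / len(negative_sets)

# Toy similarity scores (made up): two benchmark pairs against two random sets.
toy_auc = auc([0.9, 0.8], [0.1, 0.85])
toy_mean = average_auc([0.9, 0.8], [[0.1, 0.85], [0.1, 0.2]])
```

This pairwise form avoids drawing the ROC curve explicitly and gives the same area when scores are compared at every possible threshold.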

MedSim was compared with other semantic-based methods including Resnik [15], Lin [16] and Wang [17], based on HPO and DO, respectively. For each disease, the associated HPO annotations were acquired from [48], which covers disease-phenotype associations for over 6000 common, rare, infectious and Mendelian diseases obtained through a text-mining approach. The HPO-based disease similarities were defined by calculating the semantic similarity of their associated HPO phenotypes. For two diseases ($d_1$, $d_2$), the HPO-based similarity of $d_1$ to $d_2$ is defined as follows:

$$HPO\_sim(d_1 \to d_2) = \operatorname{avg}\left[\sum_{s \in d_1} \max_{t \in d_2} SemSim(s, t)\right] \qquad (5)$$

Where s and t are the annotated phenotypes of $d_1$ and $d_2$, respectively. SemSim() is one of the methods applied to compute the semantic similarity of two phenotype terms, including Resnik, Lin and Wang. For each phenotype term of $d_1$, Eq. 5 finds the “best match” among the phenotype terms annotated to $d_2$, and the average over all phenotype terms of $d_1$ is then calculated. Note that this similarity is asymmetric, i.e., $HPO\_sim(d_1 \to d_2)$ is not always equal to $HPO\_sim(d_2 \to d_1)$. Therefore, we used a symmetric HPO-based similarity, which is defined in Eq. 6:

$$HPO\_sim(d_1, d_2) = \frac{1}{2} HPO\_sim(d_1 \to d_2) + \frac{1}{2} HPO\_sim(d_2 \to d_1) \qquad (6)$$
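The best-match average of Eqs. 5 and 6 can be sketched as below. The term-level similarity is passed in as a callable standing in for Resnik, Lin or Wang; `toy_sem_sim` and its lookup values are hypothetical, and "avg" is read here as the mean over the phenotype terms of the first disease.

```python
def directed_sim(terms_1, terms_2, sem_sim):
    """Eq. 5: mean, over the terms of d1, of the best match among the terms of d2."""
    return sum(max(sem_sim(s, t) for t in terms_2) for s in terms_1) / len(terms_1)

def hpo_sim(terms_1, terms_2, sem_sim):
    """Eq. 6: symmetric similarity as the mean of the two directed similarities."""
    return 0.5 * directed_sim(terms_1, terms_2, sem_sim) + 0.5 * directed_sim(terms_2, terms_1, sem_sim)

# Hypothetical term-to-term similarities, looked up in either order.
toy_table = {("a", "c"): 0.8, ("b", "c"): 0.2}

def toy_sem_sim(s, t):
    return toy_table.get((s, t), toy_table.get((t, s), 0.0))

forward = directed_sim(["a", "b"], ["c"], toy_sem_sim)   # (0.8 + 0.2) / 2
backward = directed_sim(["c"], ["a", "b"], toy_sem_sim)  # max(0.8, 0.2)
symmetric = hpo_sim(["a", "b"], ["c"], toy_sem_sim)
```

The toy case shows the asymmetry noted in the text: the forward and backward values differ, and the symmetric score is their mean.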

The DO-based disease similarities were defined as the direct semantic similarity of two disease terms in DO, to which the three above-mentioned semantic-based methods (Resnik, Lin and Wang) were also applied. NetSim was also compared with other function-based methods including BOG [19], PSB [20] and FunSim [25]. Parameters of the aforementioned methods were set to the values used in the original papers.

