HAL Id: hal-00704248

https://hal.archives-ouvertes.fr/hal-00704248

Submitted on 5 Jun 2012


Case retrieval in medical databases by fusing

heterogeneous information.

Gwénolé Quellec, Mathieu Lamard, Guy Cazuguel, Christian Roux, Béatrice

Cochener

To cite this version:

Gwénolé Quellec, Mathieu Lamard, Guy Cazuguel, Christian Roux, Béatrice Cochener. Case retrieval in medical databases by fusing heterogeneous information. IEEE Transactions on Medical Imaging, Institute of Electrical and Electronics Engineers, 2011, 30 (1), pp.108-18. 10.1109/TMI.2010.2063711. hal-00704248.

IEEE TRANSACTIONS ON MEDICAL IMAGING 1

Case Retrieval in Medical Databases by Fusing

Heterogeneous Information

Gwénolé Quellec, Mathieu Lamard, Guy Cazuguel, Member, IEEE, Christian Roux, Fellow Member, IEEE and Béatrice Cochener

Abstract—A novel content-based heterogeneous information

retrieval framework, particularly well suited to browse med-

ical databases and support new generation Computer Aided

Diagnosis (CADx) systems, is presented in this paper. It was

designed to retrieve possibly incomplete documents, consisting

of several images and semantic information, from a database;

more complex data types such as videos can also be included in

the framework. The proposed retrieval method relies on image processing, in order to characterize each individual image in a document by its digital content, and on information fusion. Once the available images in a query document are characterized, a degree of match between the query document and each reference document stored in the database is defined for each attribute (an image feature or a metadata field). A Bayesian network

is used to recover missing information if need be. Finally, two

novel information fusion methods are proposed to combine these

degrees of match, in order to rank the reference documents

by decreasing relevance for the query. In the ﬁrst method, the

degrees of match are fused by the Bayesian network itself. In

the second method, they are fused by the Dezert-Smarandache

theory: the second approach lets us model our conﬁdence in each

source of information (i.e. each attribute) and take it into account

in the fusion process for a better retrieval performance. The

proposed methods were applied to two heterogeneous medical

databases, a diabetic retinopathy database and a mammography

screening database, for computer aided diagnosis. Precisions at

ﬁve of 0.809±0.158 and 0.821±0.177, respectively, were obtained

for these two databases, which is very promising.

Index Terms—Medical databases, Heterogeneous information

retrieval, Information fusion, Diabetic retinopathy, Mammogra-

phy

I. INTRODUCTION

Two main tasks in Computer Aided Diagnosis (CADx) using medical images are extraction of relevant information from images and combination of the extracted features with other sources of information to automatically or semi-automatically generate a reliable diagnosis. One promising way to achieve the second goal is to take advantage of the growing number of digital medical databases either for heterogeneous data mining, i.e. for extracting new knowledge, or for heterogeneous information retrieval, i.e. for finding similar heterogeneous medical records (e.g. consisting of digital images and metadata). This paper presents a generic solution to use digital medical databases for heterogeneous information retrieval, and solve CADx problems using Case-Based Reasoning (CBR) [1].

Copyright (c) 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

G. Quellec, G. Cazuguel, and C. Roux are with the INSTITUT TELECOM/TELECOM Bretagne, Dpt ITI, Brest, F-29200 France, and also with the Institut National de la Santé et de la Recherche Médicale (INSERM), U650, Brest, F-29200 France (e-mail: gwenole.quellec@telecom-bretagne.eu; guy.cazuguel@telecom-bretagne.eu; christian.roux@telecom-bretagne.eu).

M. Lamard is with the University of Bretagne Occidentale, Brest, F-29200 France, and also with the Institut National de la Santé et de la Recherche Médicale (INSERM), U650, Brest, F-29200 France (e-mail: mathieu.lamard@univ-brest.fr).

B. Cochener is with the Centre Hospitalier Universitaire de Brest, Service d’Ophtalmologie, Brest, F-29200 France, also with the University of Bretagne Occidentale, Brest, F-29200 France, and also with the Institut National de la Santé et de la Recherche Médicale (INSERM), U650, Brest, F-29200 France (e-mail: Beatrice.Cochener-lamard@chu-brest.fr).

CBR was introduced in the early 1980s as a new decision

support tool. It relies on the idea that analogous problems have

similar solutions. In CBR, interpreting a new situation revolves

around the retrieval of relevant documents in a case database.

The knowledge of medical experts is a mixture of textbook

knowledge and experience through real life clinical cases, so

the assumption that analogous problems have similar solutions

makes sense to them. This is the reason why there is a growing

interest in CBR for the development of medical decision

support systems [2]. Medical CBR systems are intended to

be used as follows: should a physician be doubtful about

his/her diagnosis, he/she can send the available data about

the patient to the system; the system selects and displays the

most similar documents, along with their associated medical

interpretations, which may help him/her conﬁrm or invalidate

his/her diagnosis by analogy. Therefore, the purpose of such

a system is not to replace physicians’ diagnosis, but rather to

aid their diagnosis. Medical documents often consist of digital

information such as images and symbolic information such as

clinical annotations. In the case of Diabetic Retinopathy, for

instance, physicians analyze heterogeneous series of images

together with contextual information such as the age, sex and

medical history of the patient. Moreover, medical information

is sometimes incomplete and uncertain, two problems that

require a particular attention. As a consequence, original

CBR systems, designed to process simple documents such

as homogeneous and comprehensive attribute vectors, are

clearly unsuited to complex CADx applications. On one hand,

some CBR systems have been designed to manage symbolic

information [3]. On the other hand, some others, based on

Content-Based Image Retrieval [4], have been designed to manage digital images [5]. However, few attempts have been

made to merge the two kinds of approaches. We consider in

this paper a larger class of problems: CBR in heterogeneous

databases.

To retrieve heterogeneous information, some simple ap-

proaches, based on early fusion (i.e. attributes are fused in

feature space) [6], [7] or late fusion (i.e. attributes are fused

in semantic space) [8], [9], [10] have been presented in the

literature. A few application-specific approaches [11], [12], [13], [14], [15], as well as a generic retrieval system, based on dissimilarity spaces and relevance feedback [16], have also

been presented. We introduce in this paper a novel generic

approach that does not require relevance feedback from the

user. The proposed system is able to manage incomplete

information and the aggregation of heterogeneous attributes:

symbolic and multidimensional digital information (we focus

on digital images, but the same principle can be applied to

any n-dimensional signals). The proposed approach is based

on a Bayesian network and the Dezert-Smarandache theory

(DSmT) [17]. Bayesian networks have been used previously in retrieval systems, either for keyword-based retrieval [18], [19] or for content-based image or video retrieval [20], [21]. The Dezert-Smarandache theory is increasingly used in remote sensing applications [17]; however, to our knowledge, this is its first medical application. In our approach, a

Bayesian network is used to model the relationships between

the different attributes (the extracted features of each digital

image and each contextual information ﬁeld): we associate

each attribute with a variable in the Bayesian network. It lets us

compare incomplete documents: the Bayesian network is used

to estimate the probability of unknown variables (associated

with missing attributes) knowing the value of other variables

(associated with available attributes). Information coming from

each attribute is then used to derive an estimation of the degree

of match between a query document and a reference document

in the database. Then, these estimations are fused; two fusion

operators are introduced in this paper for this purpose. The

ﬁrst fusion operator is incorporated in the Bayesian network:

the computation of the degree of match, with respect to a

given attribute, relies on the design of conditional probabilities

relating this attribute to the overall degree of match. An

evolution of this fusion operator that models our conﬁdence in

each source of information (i.e. each attribute) is introduced. It

is based on the Dezert-Smarandache theory. In order to model

our conﬁdence in each source of information, within this

second fusion operator, an uncertainty component is included

in the belief mass function characterizing the evidence coming

from this source of information.

The main advantage of the proposed approach, over standard

feature selection / feature classiﬁcation approaches, is that a

retrieval model is trained separately for each attribute. This

is useful to process incomplete documents: in the proposed

approach, we simply combine the models associated with all

available attributes; as a comparison, a standard classiﬁer relies

on feature combinations, and therefore may become invalid

when input feature vectors are incomplete. Also, because each

attribute is processed separately, the curse of dimensionality

is avoided. Therefore, it is not necessary to select the most

relevant features: instead, we simply weight each feature by a

conﬁdence measure.

The paper is organized as follows. Section II presents the proposed Bayesian network based retrieval. Section III presents the Bayesian network and Dezert-Smarandache theory based retrieval. These methods are applied in section IV to CADx in two heterogeneous databases: a diabetic retinopathy database and a mammography database. We end with a discussion and a conclusion in section V.

II. BAYESIAN NETWORK BASED RETRIEVAL

A. Description of Bayesian Networks

A Bayesian network [22] is a probabilistic graphical model that represents a set of variables and their probabilistic dependencies. It is a directed acyclic graph whose nodes represent variables, and whose edges encode conditional dependencies between the variables. Examples of Bayesian networks are given in Fig. 1.

[Figure 1: three example graphs, panels (a), (b) and (c)]

Fig. 1. Examples of Bayesian Networks. Fig. (a) shows a chain. Fig. (b) shows a polytree, i.e. a network in which there is at most one (undirected) path between two nodes. Fig. (c) shows a network containing a cycle: <A, D, E, C, A>.

In the example of Fig. 1 (b), the edge from the parent node $A$ to its child node $D$ indicates that variable $A$ has a direct influence on variable $D$. Each edge in the graph is associated with a conditional probability matrix expressing the probability of a child variable given one of its parent variables. For instance, if $A = \{a_0, a_1\}$ and $D = \{d_0, d_1, d_2\}$, then $A \rightarrow D$ is assigned the following $(3 \times 2)$ conditional probability matrix $P(D|A)$:

$$
P(D|A) =
\begin{pmatrix}
P(D = d_0|A = a_0) & P(D = d_0|A = a_1) \\
P(D = d_1|A = a_0) & P(D = d_1|A = a_1) \\
P(D = d_2|A = a_0) & P(D = d_2|A = a_1)
\end{pmatrix}
\tag{1}
$$

A directed acyclic graph is a Bayesian Network relative to a set of variables $\{X_1, \ldots, X_n\}$ if the joint distribution $P(X_1, \ldots, X_n)$ can be expressed as in equation (2):

$$
P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i))
\tag{2}
$$

where parents(X) is the set of nodes such that Y → X is in

the graph ∀ Y ∈ parents(X). Because a Bayesian network

can completely model the variables and their relationships,

it can be used to answer queries about them. Typically, it

is used to estimate unknown probabilities for a subset of

variables when other variables (the evidence variables) are

observed. This process of computing the posterior distribution

of variables, given evidence, is called probabilistic inference.

In Bayesian networks containing cycles, exact inference is an NP-hard problem. Approximate inference algorithms have been proposed, but their accuracies depend on the network’s structure; therefore, they are not general. By transforming the network into a cycle-free hypergraph, and performing inference in this hypergraph, Lauritzen and Spiegelhalter proposed an exact inference algorithm with relatively low complexity [23]; this algorithm was used in the proposed system.
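To make equation (2) and the notion of probabilistic inference concrete, the sketch below works through the two-node network $A \rightarrow D$ used in the example above. It is only an illustration: the probability values are made up, and inference is done by brute-force enumeration, which is practical for tiny networks only and is not the Lauritzen and Spiegelhalter algorithm used in the proposed system.

```python
from itertools import product

# Hypothetical network A -> D, with A in {a0, a1} and D in {d0, d1, d2}.
# All probability values below are invented for illustration.
prior_a = {"a0": 0.7, "a1": 0.3}

# cpt_d[d][a] = P(D = d | A = a): the (3 x 2) matrix of equation (1).
# Each column (fixed a) sums to 1.
cpt_d = {
    "d0": {"a0": 0.5, "a1": 0.1},
    "d1": {"a0": 0.3, "a1": 0.3},
    "d2": {"a0": 0.2, "a1": 0.6},
}


def joint(a, d):
    """Equation (2) for this graph: P(A, D) = P(A) * P(D | A)."""
    return prior_a[a] * cpt_d[d][a]


def posterior_a(d_observed):
    """P(A | D = d_observed), obtained by enumeration and normalization."""
    unnormalized = {a: joint(a, d_observed) for a in prior_a}
    total = sum(unnormalized.values())
    return {a: p / total for a, p in unnormalized.items()}


# The joint distribution sums to 1 over all (a, d) combinations.
total_mass = sum(joint(a, d) for a, d in product(prior_a, cpt_d))

# Observing D = d2 (the evidence) updates our belief about A.
post = posterior_a("d2")
```

Here `post` holds the posterior distribution of the unobserved variable $A$ given the evidence $D = d_2$; the same mechanism is what lets the retrieval system estimate missing attributes from available ones.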

B. Learning a Bayesian Network from Data

A Bayesian network is deﬁned by a structure and the

conditional probability of each node given its parents in that

structure (or its prior probability if it does not have any parent).

These parameters can be learned automatically from data.

Defining the structure consists in finding pairs of nodes $(X, Y)$ directly dependent, i.e. such that:

• $X$ and $Y$ are not independent ($P(X, Y) \neq P(X)P(Y)$)
• There is no node set $Z$ such that $X$ and $Y$ are independent given $Z$ ($P(X, Y|Z) \neq P(X|Z)P(Y|Z)$)

Independence and conditional independence can be assessed by mutual information (see equation (3)) and conditional mutual information (see equation (4)), respectively.

$$
I(X, Y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}
\tag{3}
$$

$$
I(X, Y|Z) = \sum_{x,y,z} P(x, y, z) \log \frac{P(x, y|z)}{P(x|z)P(y|z)}
\tag{4}
$$

Two nodes are independent (resp. conditionally independent) if mutual information (resp. conditional mutual information) is smaller than a given threshold $\epsilon$, $0 \leq \epsilon < 1$. Ideally, $\epsilon$ should be equal to 0. However, in the presence of noise, some meaningless edges (links) can appear. These edges can also unnecessarily increase the computation time. To avoid this, in this study, $\epsilon$ was chosen in advance to be equal to 0.1. This number is independent of dataset cardinality [24].
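The independence tests of equations (3) and (4) can be sketched as follows. This is a minimal illustration, assuming the joint distributions are available as dictionaries and that the logarithm is natural (the base is not stated in the text); the function names and the toy distributions are hypothetical.

```python
import math
from collections import defaultdict

EPSILON = 0.1  # threshold chosen in this study


def mutual_information(joint_xy):
    """Equation (3). joint_xy maps (x, y) -> P(x, y)."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint_xy.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint_xy.items() if p > 0)


def conditional_mutual_information(joint_xyz):
    """Equation (4). joint_xyz maps (x, y, z) -> P(x, y, z)."""
    pz = defaultdict(float)
    pxz, pyz = defaultdict(float), defaultdict(float)
    for (x, y, z), p in joint_xyz.items():
        pz[z] += p
        pxz[(x, z)] += p
        pyz[(y, z)] += p
    # P(x,y|z) / (P(x|z) P(y|z)) = P(x,y,z) P(z) / (P(x,z) P(y,z))
    return sum(p * math.log(p * pz[z] / (pxz[(x, z)] * pyz[(y, z)]))
               for (x, y, z), p in joint_xyz.items() if p > 0)


def is_independent(information, epsilon=EPSILON):
    """Two nodes are declared independent below the threshold."""
    return information < epsilon


# Toy distributions (invented): X and Y independent, then X = Y.
independent_xy = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
identical_xy = {(0, 0): 0.5, (1, 1): 0.5}
# X and Y are both copies of Z: dependent, but independent given Z.
copies_of_z = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
```

With these toy inputs, `identical_xy` yields $I(X, Y) = \log 2$, well above the threshold, whereas conditioning on $Z$ in `copies_of_z` brings the conditional mutual information down to 0, so no direct edge between $X$ and $Y$ would be kept.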

The structure of the Bayesian network, as well as edge orientation, was obtained by Cheng’s algorithm [24]. This algorithm was chosen for its low complexity: it is polynomial in the number of variables, whereas competing algorithms are exponential.

C. Including Images in a Bayesian Network

Contextual information is included as usual in a Bayesian network: a variable with a finite set of states, one for each possible attribute value, is defined for each field.

To include images in a Bayesian network, we ﬁrst deﬁne a

variable for each image in the sequence. For each “image

variable”, we follow the usual steps of Content-Based Image

Retrieval (CBIR) [4]: 1) building a signature for each image (i.e. extracting a feature vector summarizing its digital content), and 2) defining a distance measure between two signatures (see section II-C1). Thus, measuring the distance

between two images comes down to measuring the distance

between two signatures. Similarly, in a Bayesian network,

deﬁning states for an “image variable” comes down to deﬁning

states for the signature of the corresponding images. To

this aim, similar image signatures are clustered, as described

below, and each cluster is associated with a state. Thanks to

this process, image signatures can be included in a Bayesian

network like any other variable.

1) Image Signature and Distance Measure: in previous works on CBIR, we proposed to extract a signature for images from their wavelet transform [25]. These signatures model the distribution of the wavelet coefficients in each subband of the decomposition; as a consequence they provide a multiscale description of images. To characterize the wavelet coefficient distribution in a given subband, Wouwer’s work was applied [26]: Wouwer has shown that this distribution can be modeled by a generalized Gaussian function. The maximum likelihood estimators of the wavelet coefficient distribution in each subband are used as a signature. These estimators can be computed directly from wavelet-based compressed images (such as JPEG-2000 compressed images), which can be useful when a large number of images has to be processed. A simplified version of Do’s generalized Gaussian parameter estimation method [27], [25] is proposed in appendix A to reduce computation times. Any wavelet basis can be used to decompose images. However, the effectiveness of the extracted signatures largely depends on the choice of this basis. For this reason, we proposed to search for an optimal wavelet basis [25] within the lifting scheme framework, which is implemented in the compression standards. To compare two signatures, Do proposed the use of the Kullback-Leibler divergence between wavelet coefficient distributions $P$ and $Q$ in two subbands [27]:

$$
D(P||Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx
\tag{5}
$$

where $p$ and $q$ are the densities of $P$ and $Q$, respectively. A symmetric version of the Kullback-Leibler divergence was used, since clustering algorithms require (symmetric) distance measures:

$$
\frac{1}{2} \left( D(P||Q) + D(Q||P) \right)
\tag{6}
$$

Finally, the distance between two images is defined as a weighted sum of these distances over the subbands, noted WSD; weights are tuned by a genetic algorithm to maximize retrieval performance on the training set [25]. The ability to select a weight vector and a wavelet basis makes this image representation highly tunable. We have shown in previous works the superiority of the proposed image signature, in terms of retrieval performance, over several well-known image signatures [25].
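The distance computation of equations (5) and (6), and the weighted sum over subbands, can be sketched as follows. Several choices here are assumptions made for illustration only: each subband is summarized by a (scale, shape) pair `(alpha, beta)` of the usual generalized Gaussian density, the integral of equation (5) is approximated by a simple midpoint sum on a fixed interval rather than evaluated in closed form, and the weights are given by hand instead of being tuned by a genetic algorithm.

```python
import math


def ggd_pdf(x, alpha, beta):
    """Generalized Gaussian density with scale alpha and shape beta."""
    norm = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return norm * math.exp(-((abs(x) / alpha) ** beta))


def kl_divergence(p, q, lo=-30.0, hi=30.0, steps=60000):
    """Equation (5), approximated by a midpoint sum on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        px, qx = p(x), q(x)
        # Skip points where a density underflows to zero.
        if px > 0.0 and qx > 0.0:
            total += px * math.log(px / qx) * dx
    return total


def symmetric_kl(p, q):
    """Equation (6): average of the two directed divergences."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def wsd(signature_a, signature_b, weights):
    """Weighted sum, over subbands, of the symmetric divergences.

    Each signature is a list of (alpha, beta) pairs, one per subband
    (a hypothetical layout chosen for this sketch).
    """
    total = 0.0
    for (a1, b1), (a2, b2), w in zip(signature_a, signature_b, weights):
        p = lambda x, a=a1, b=b1: ggd_pdf(x, a, b)
        q = lambda x, a=a2, b=b2: ggd_pdf(x, a, b)
        total += w * symmetric_kl(p, q)
    return total


# Two invented two-subband signatures that differ in the first subband.
signature_a = [(1.0, 2.0), (1.0, 2.0)]
signature_b = [(2.0, 2.0), (1.0, 2.0)]
distance = wsd(signature_a, signature_b, weights=[0.5, 0.5])
```

With shape parameter 2 the density reduces to a Gaussian, which gives a convenient sanity check: the symmetric divergence between scales 1 and 2 is 0.5625 analytically, so the weighted distance above is about 0.28. The fixed integration interval is adequate for these toy parameters but would need widening for heavy-tailed subbands (small shape parameter).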

2) Signature Clustering: in order to define several states for an “image variable”, similar images are clustered with an unsupervised classification algorithm, thanks to the image signatures and the associated distance measure above. Any algorithm can be used, provided that the distance measure can be specified. We chose the well-known Fuzzy C-Means algorithm (FCM) [28] and replaced the Euclidean distance by WSD described above. In this algorithm, each document is assigned to each cluster $k = 1..K$ with a fuzzy membership $u_k$, $0 \leq u_k \leq 1$, such that $\sum_{k=1}^{K} u_k = 1$, which can be interpreted as a probability. Finding the right number of clusters is generally a difficult problem. However, when each sample has been assigned a class label, mutual information between clusters and class labels can be used to determine the optimal number of clusters $\hat{K}$ [29] (see equation (7)).

$$
\hat{K} = \operatorname*{argmax}_{K} \sum_{c=1}^{C} \sum_{k=1}^{K} P(c, k) \log_{C+K} \frac{P(c, k)}{P(c)P(k)}
\tag{7}
$$

where $c = 1..C$ are the class labels, $P(c, k)$ is the joint probability distribution function of the class and cluster labels, $P(c)$ and $P(k)$ are the marginal probability distribution functions. Other continuous variables can be discretized similarly: the age of a person, one-dimensional signals, videos, etc.
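The selection of the number of clusters in equation (7) can be sketched as follows. Two simplifying assumptions are made here: the subscript of the logarithm is read as its base, $C + K$, and each sample carries a single hard cluster label, whereas the paper uses fuzzy memberships. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def cluster_score(class_labels, cluster_labels, n_clusters):
    """Criterion of equation (7) for one candidate clustering."""
    n = len(class_labels)
    n_classes = len(set(class_labels))
    joint = Counter(zip(class_labels, cluster_labels))
    class_counts = Counter(class_labels)
    cluster_counts = Counter(cluster_labels)
    log_base = math.log(n_classes + n_clusters)  # log in base C + K
    score = 0.0
    for (c, k), count in joint.items():
        p_ck = count / n
        p_c = class_counts[c] / n
        p_k = cluster_counts[k] / n
        score += p_ck * math.log(p_ck / (p_c * p_k)) / log_base
    return score


def best_cluster_count(class_labels, clusterings):
    """clusterings maps each candidate K to the cluster label of every sample."""
    return max(
        clusterings,
        key=lambda k: cluster_score(class_labels, clusterings[k], k),
    )


# Invented example: six samples, two classes, three candidate values of K.
classes = [0, 0, 0, 1, 1, 1]
candidates = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
}
k_hat = best_cluster_count(classes, candidates)
```

In this toy case the clusterings with $K = 2$ and $K = 3$ carry the same raw mutual information with the class labels, and the base $C + K$ penalizes the larger $K$, so `k_hat` is 2.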

D. System Design

[Figure 2: flowchart. OFFLINE, on the training set: learn the probabilistic relationships between variables (section II-B); compute the correlation between two states of a variable (section II-E2); outputs: intermediate network (Fig. 3(a)) and correlations. QUERY, for a case in the testing set: add a query node Q to the intermediate network (section II-D); compute the probabilistic relationships between the variables and the query node (section II-E); output: query-specific network (Fig. 3(b)). For each case x in the training set: probabilistic inference on the query-specific network using x as evidence (sections II-A, II-F). Result: cases in the training set ranked in decreasing order of P(Q|x).]

Fig. 2. Bayesian Network based Retrieval. Solid-lined arrows mean “leads to” or “is followed by” and dashed-lined arrows mean “is used by”.

Let $x_q$ be a query document and $M$ be the number of attributes.

Definition: A document $x$ is said to be relevant for $x_q$ if $x$ and $x_q$ belong to the same class.

To assess the relevance of each reference document in a database for $x_q$, we define a Bayesian network with the following variables:

• a set of variables $\{A_i, i = 1..M\}$, where $A_i$ represents the $i$-th attribute of $x$,
• a Boolean variable $Q$ = “$x$ is relevant for $x_q$” ($\bar{Q}$ = “$x$ is not relevant for $x_q$”).

The design of the system is described hereafter and illustrated in Fig. 2. To build the network, the first step is to learn the different relationships between the attributes $\{A_i, i = 1..M\}$. So, an intermediate network is built from data, using Cheng’s algorithm (see section II-B). For that purpose, the studied database is divided into a training dataset and a test dataset. Cheng’s algorithm is applied to the training dataset. In our experiments, the query document $x_q$ belongs to the test dataset and $x$ belongs to the training dataset. To build this Bayesian network, a finite number of states $a_{ij}$ is defined for each variable $A_i$, $i = 1..M$. To learn the relationships between these variables, we use the membership degree of any document $y$ in the training dataset to each state $a_{ij}$ of each variable $A_i$, noted $\alpha_{ij}(y)$. If $A_i$ is a nominal variable, $\alpha_{ij}(y)$ is Boolean; for instance, if $y$ is a male then $\alpha_{\text{“sex”},\text{“male”}}(y) = 1$ and $\alpha_{\text{“sex”},\text{“female”}}(y) = 0$. If $A_i$ is a continuous variable (such as an image-based feature), $\alpha_{ik}(y)$ is the fuzzy membership of $y$ to each cluster $k = 1..K$ (see section II-C2). An example of intermediate network is given in Fig. 3 (a).
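The two kinds of membership degrees described above can be illustrated with a short sketch. The nominal case follows the text directly. For the continuous case, the sketch assumes the standard Fuzzy C-Means membership formula with a fuzziness exponent of 2 (this parameter is not given in the text) and takes as input the distances from one document to each cluster center, however those distances are computed; function names are hypothetical.

```python
def nominal_memberships(value, states):
    """Boolean membership of a nominal value to each state of its variable."""
    return {state: 1.0 if state == value else 0.0 for state in states}


def fuzzy_memberships(distances, fuzziness=2.0):
    """Standard FCM membership of one document to each of K clusters.

    distances[k] is the distance from the document to the center of
    cluster k. The memberships are non-negative and sum to 1.
    """
    zeros = [k for k, d in enumerate(distances) if d == 0.0]
    if zeros:
        # The document coincides with one or more cluster centers.
        return [1.0 / len(zeros) if k in zeros else 0.0
                for k in range(len(distances))]
    exponent = 2.0 / (fuzziness - 1.0)
    return [
        1.0 / sum((d_k / d_j) ** exponent for d_j in distances)
        for d_k in distances
    ]


# Nominal attribute: a male patient.
alpha_sex = nominal_memberships("male", ["male", "female"])

# Continuous attribute: a document three times closer to cluster 0
# than to cluster 1 (invented distances).
alpha_image = fuzzy_memberships([1.0, 3.0])
```

Here `alpha_sex` is `{"male": 1.0, "female": 0.0}`, and `alpha_image` is approximately `[0.9, 0.1]`: the document mostly belongs to the nearer cluster while keeping a small membership to the other one.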

[Figure 3: (a) Intermediate network, (b) Query-specific network]

Fig. 3. Retrieval Bayesian Network (built for the database presented in section IV-A). In the example of Fig. (b), attributes $A_1, \ldots, A_6$, $A_8$, $A_{10}$, $A_{13}$, $A_{14}$, $A_{15}$, $A_{17}$, $A_{18}$, $A_{22}$, $A_{23}$ are available for the query document $x_q$, so the associated nodes are then connected to node $Q$.

$Q$ is then integrated in the network. For retrieval, the attributes of $x$ are observable evidence for $Q$; as a consequence, the associated variables should be descendants of $Q$. In the retrieval network, the probabilistic dependences between $Q$ and each variable $A_i$ depend on $x_q$. In fact, $x_q$ specifies which attributes should be found in the retrieved documents in order to meet the user’s needs. So, when the $i$-th attribute of $x_q$ is available, we connect the two nodes $Q$ and $A_i$ and we estimate the associated conditional probability matrix $P_q(A_i = a_{ij}|Q)$ according to $x_q$ (see Fig. 3 (b)). The index $q$ denotes that the probability depends on $x_q$. A query-specific network is obtained: its structure depends on which attributes are available for the query document and the conditional probability matrices depend on the value taken for these available attributes by the query document. This network is used to assess the relevance of any reference document for $x_q$.
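The structural part of this step can be sketched in a few lines. The representation is an assumption made for illustration: the intermediate network is a list of (parent, child) edges, the query document is a dictionary in which missing attributes are set to `None`, and the attribute names are invented. Only the structure is built here; estimating the conditional probability matrices is a separate step.

```python
def build_query_network(intermediate_edges, query_document, query_node="Q"):
    """Return the edges of the query-specific network.

    An edge Q -> A_i is added for every attribute available in the
    query document, so that these attributes become descendants of Q.
    """
    available = [name for name, value in query_document.items()
                 if value is not None]
    query_edges = [(query_node, name) for name in available]
    return list(intermediate_edges) + query_edges


# Hypothetical intermediate network over three attributes.
intermediate = [("A1", "A2"), ("A2", "A3")]

# Hypothetical query document: attribute A2 is missing.
x_q = {"A1": "male", "A2": None, "A3": 0.42}

query_network = build_query_network(intermediate, x_q)
```

In this toy case, `query_network` keeps the two learned edges and adds `("Q", "A1")` and `("Q", "A3")`; the missing attribute `A2` is not connected to `Q`, but it stays in the network and remains linked to the other attributes through the learned edges.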

E. Computing the Conditional Probabilities $P_q(A_i = a_{ij}|Q)$

To compute $P_q(A_i = a_{ij}|Q)$, we first estimate $P_q(Q|A_i = a_{ij})$: the probability that a reference document $x$, with full membership to the state $a_{ij}$ of attribute $A_i$, is relevant. $P_q(A_i = a_{ij}|Q)$ can then be computed thanks to Bayes’ theorem (see equation (8)). The prior probability $P_q(Q)$ is required; it can be estimated by the probability that two documents belong to the same class, i.e. the probability that both documents belong to class 1 or that both documents belong to class 2, etc., hence equation (9):

$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
\tag{8}
$$

$$
P_q(Q) = \sum_{c=1}^{C} (P(c))^2
\tag{9}
$$
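Equations (8) and (9) can be illustrated with a short numerical sketch. The class labels, the state probabilities $P(A_i = a_{ij})$ and the values of $P_q(Q|A_i = a_{ij})$ below are invented for the example, and the function names are hypothetical.

```python
from collections import Counter


def relevance_prior(class_labels):
    """Equation (9): probability that two documents share the same class."""
    n = len(class_labels)
    return sum((count / n) ** 2 for count in Counter(class_labels).values())


def state_given_relevant(p_q_given_state, p_state, prior_q):
    """Equation (8) applied to every state j of one attribute:

    P_q(A_i = a_ij | Q) = P_q(Q | A_i = a_ij) * P(A_i = a_ij) / P_q(Q)
    """
    return {state: p_q_given_state[state] * p_state[state] / prior_q
            for state in p_state}


# Invented training set: 6 documents of class "a", 4 of class "b".
labels = ["a"] * 6 + ["b"] * 4
prior_q = relevance_prior(labels)  # 0.6**2 + 0.4**2

# Invented inputs for one attribute with two states.
p_state = {"s0": 0.5, "s1": 0.5}
p_q_given_state = {"s0": 0.8, "s1": 0.24}

likelihood = state_given_relevant(p_q_given_state, p_state, prior_q)
```

With these numbers the prior is 0.52, and the resulting column of the conditional probability matrix is roughly 0.77 for `s0` and 0.23 for `s1`. The invented inputs were chosen so that the column sums to 1; with estimated quantities this need not hold exactly, in which case a renormalization step may be needed.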