
HAL Id: hal-00704248
https://hal.archives-ouvertes.fr/hal-00704248
Submitted on 5 Jun 2012
Case retrieval in medical databases by fusing
heterogeneous information.
Gwénolé Quellec, Mathieu Lamard, Guy Cazuguel, Christian Roux, Béatrice
Cochener
To cite this version:
Gwénolé Quellec, Mathieu Lamard, Guy Cazuguel, Christian Roux, Béatrice Cochener. Case retrieval
in medical databases by fusing heterogeneous information. IEEE Transactions on Medical Imaging,
Institute of Electrical and Electronics Engineers, 2011, 30 (1), pp.108-18. ⟨10.1109/TMI.2010.2063711⟩.
⟨hal-00704248⟩

IEEE TRANSACTIONS ON MEDICAL IMAGING 1
Case Retrieval in Medical Databases by Fusing
Heterogeneous Information
Gwénolé Quellec, Mathieu Lamard, Guy Cazuguel, Member, IEEE, Christian Roux, Fellow Member, IEEE and
Béatrice Cochener
Abstract—A novel content-based heterogeneous information
retrieval framework, particularly well suited to browse med-
ical databases and support new generation Computer Aided
Diagnosis (CADx) systems, is presented in this paper. It was
designed to retrieve possibly incomplete documents, consisting
of several images and semantic information, from a database;
more complex data types such as videos can also be included in
the framework. The proposed retrieval method relies on image
processing, in order to characterize each individual image in a
document by its digital content, and on information fusion. Once
the available images in a query document are characterized,
a degree of match between the query document and each
reference document stored in the database is defined for each
attribute (an image feature or a metadata field). A Bayesian network
is used to recover missing information if need be. Finally, two
novel information fusion methods are proposed to combine these
degrees of match, in order to rank the reference documents
by decreasing relevance for the query. In the first method, the
degrees of match are fused by the Bayesian network itself. In
the second method, they are fused by the Dezert-Smarandache
theory: the second approach lets us model our confidence in each
source of information (i.e. each attribute) and take it into account
in the fusion process for a better retrieval performance. The
proposed methods were applied to two heterogeneous medical
databases, a diabetic retinopathy database and a mammography
screening database, for computer aided diagnosis. Precisions at
five of 0.809±0.158 and 0.821±0.177, respectively, were obtained
for these two databases, which is very promising.
Index Terms—Medical databases, Heterogeneous information
retrieval, Information fusion, Diabetic retinopathy, Mammogra-
phy
I. INTRODUCTION
Two main tasks in Computer Aided Diagnosis (CADx)
using medical images are extraction of relevant information
from images and combination of the extracted features
with other sources of information to automatically or
semi-automatically generate a reliable diagnosis. One promising
way to achieve the second goal is to take advantage of
the growing number of digital medical databases either for
heterogeneous data mining, i.e. for extracting new knowledge,
or for heterogeneous information retrieval, i.e. for finding
similar heterogeneous medical records (e.g. consisting of
digital images and metadata). This paper presents a generic
solution to use digital medical databases for heterogeneous
information retrieval, and solve CADx problems using
Case-Based Reasoning (CBR) [1].
Copyright (c) 2010 IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be
obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
G. Quellec, G. Cazuguel, and C. Roux are with the INSTITUT TELECOM/TELECOM
Bretagne, Dpt ITI, Brest, F-29200 France, and also with
the Institut National de la Santé et de la Recherche Médicale (INSERM),
U650, Brest, F-29200 France (e-mail: gwenole.quellec@telecom-bretagne.eu;
guy.cazuguel@telecom-bretagne.eu; christian.roux@telecom-bretagne.eu).
M. Lamard is with the University of Bretagne Occidentale, Brest,
F-29200 France, and also with the Institut National de la Santé et de
la Recherche Médicale (INSERM), U650, Brest, F-29200 France (e-mail:
mathieu.lamard@univ-brest.fr).
B. Cochener is with the Centre Hospitalier Universitaire de Brest, Service
d’Ophtalmologie, Brest, F-29200 France, also with the University of Bretagne
Occidentale, Brest, F-29200 France, and also with the Institut National de la
Santé et de la Recherche Médicale (INSERM), U650, Brest, F-29200 France
(e-mail: Beatrice.Cochener-lamard@chu-brest.fr)
CBR was introduced in the early 1980s as a new decision
support tool. It relies on the idea that analogous problems have
similar solutions. In CBR, interpreting a new situation revolves
around the retrieval of relevant documents in a case database.
The knowledge of medical experts is a mixture of textbook
knowledge and experience through real life clinical cases, so
the assumption that analogous problems have similar solutions
makes sense to them. This is the reason why there is a growing
interest in CBR for the development of medical decision
support systems [2]. Medical CBR systems are intended to
be used as follows: should a physician be doubtful about
his/her diagnosis, he/she can send the available data about
the patient to the system; the system selects and displays the
most similar documents, along with their associated medical
interpretations, which may help him/her confirm or invalidate
his/her diagnosis by analogy. Therefore, the purpose of such
a system is not to replace physicians’ diagnosis, but rather to
aid their diagnosis. Medical documents often consist of digital
information such as images and symbolic information such as
clinical annotations. In the case of Diabetic Retinopathy, for
instance, physicians analyze heterogeneous series of images
together with contextual information such as the age, sex and
medical history of the patient. Moreover, medical information
is sometimes incomplete and uncertain, two problems that
require particular attention. As a consequence, original
CBR systems, designed to process simple documents such
as homogeneous and comprehensive attribute vectors, are
clearly unsuited to complex CADx applications. On the one hand,
some CBR systems have been designed to manage symbolic
information [3]. On the other hand, some others, based on
Content-Based Image Retrieval [4], have been designed to
manage digital images [5]. However, few attempts have been
made to merge the two kinds of approaches. We consider in
this paper a larger class of problems: CBR in heterogeneous
databases.
To retrieve heterogeneous information, some simple approaches,
based on early fusion (i.e. attributes are fused in
feature space) [6], [7] or late fusion (i.e. attributes are fused
in semantic space) [8], [9], [10], have been presented in the
literature. A few application-specific approaches [11], [12],
[13], [14], [15], as well as a generic retrieval system based
on dissimilarity spaces and relevance feedback [16], have also
been presented. We introduce in this paper a novel generic
approach that does not require relevance feedback from the
user. The proposed system is able to manage incomplete
information and the aggregation of heterogeneous attributes:
symbolic and multidimensional digital information (we focus
on digital images, but the same principle can be applied to
any n-dimensional signals). The proposed approach is based
on a Bayesian network and the Dezert-Smarandache theory
(DSmT) [17]. Bayesian networks have been used previously
in retrieval systems, either for keyword-based retrieval [18],
[19] or for content-based image or video retrieval [20], [21].
The Dezert-Smarandache theory is increasingly used
in remote sensing applications [17]; however, to our knowledge,
this is its first medical application. In our approach, a
Bayesian network is used to model the relationships between
the different attributes (the extracted features of each digital
image and each contextual information field): we associate
each attribute with a variable in the Bayesian network. It lets us
compare incomplete documents: the Bayesian network is used
to estimate the probability of unknown variables (associated
with missing attributes) knowing the value of other variables
(associated with available attributes). Information coming from
each attribute is then used to derive an estimation of the degree
of match between a query document and a reference document
in the database. Then, these estimations are fused; two fusion
operators are introduced in this paper for this purpose. The
first fusion operator is incorporated in the Bayesian network:
the computation of the degree of match, with respect to a
given attribute, relies on the design of conditional probabilities
relating this attribute to the overall degree of match. An
evolution of this fusion operator that models our confidence in
each source of information (i.e. each attribute) is introduced. It
is based on the Dezert-Smarandache theory. In order to model
our confidence in each source of information, within this
second fusion operator, an uncertainty component is included
in the belief mass function characterizing the evidence coming
from this source of information.
The main advantage of the proposed approach, over standard
feature selection / feature classification approaches, is that a
retrieval model is trained separately for each attribute. This
is useful to process incomplete documents: in the proposed
approach, we simply combine the models associated with all
available attributes; by comparison, a standard classifier relies
on feature combinations, and therefore may become invalid
when input feature vectors are incomplete. Also, because each
attribute is processed separately, the curse of dimensionality
is avoided. Therefore, it is not necessary to select the most
relevant features: instead, we simply weight each feature by a
confidence measure.
The paper is organized as follows. Section II presents
the proposed Bayesian network based retrieval. Section III
presents the Bayesian network and Dezert-Smarandache theory
based retrieval. These methods are applied in section IV to
CADx in two heterogeneous databases: a diabetic retinopathy
database and a mammography database. We end with a
discussion and a conclusion in section V.
II. BAYESIAN NETWORK BASED RETRIEVAL
A. Description of Bayesian Networks
A Bayesian network [22] is a probabilistic graphical model
that represents a set of variables and their probabilistic
dependencies. It is a directed acyclic graph whose nodes represent
variables, and whose edges encode direct dependencies
between the variables (a missing edge encodes a conditional
independence). Examples of Bayesian networks are
given in Fig. 1.
Fig. 1. Examples of Bayesian Networks. Fig. (a) shows a chain. Fig. (b)
shows a polytree, i.e. a network in which there is at most one (undirected)
path between two nodes. Fig. (c) shows a network containing a cycle:
<A, D, E, C, A>.
In the example of Fig. 1 (b), the edge from the parent
node $A$ to its child node $D$ indicates that variable $A$ has a
direct influence on variable $D$. Each edge in the graph is
associated with a conditional probability matrix expressing
the probability of a child variable given one of its parent
variables. For instance, if $A = \{a_0, a_1\}$ and $D = \{d_0, d_1, d_2\}$,
then $A \to D$ is assigned the following $(3 \times 2)$ conditional
probability matrix $P(D|A)$:
$$P(D|A) = \begin{pmatrix}
P(D = d_0|A = a_0) & P(D = d_0|A = a_1) \\
P(D = d_1|A = a_0) & P(D = d_1|A = a_1) \\
P(D = d_2|A = a_0) & P(D = d_2|A = a_1)
\end{pmatrix} \tag{1}$$
A directed acyclic graph is a Bayesian network relative
to a set of variables $\{X_1, \ldots, X_n\}$ if the joint distribution
$P(X_1, \ldots, X_n)$ can be expressed as in equation 2:
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i|\mathrm{parents}(X_i)) \tag{2}$$
where parents(X) is the set of nodes such that Y → X is in
the graph ∀ Y ∈ parents(X). Because a Bayesian network
can completely model the variables and their relationships,
it can be used to answer queries about them. Typically, it
is used to estimate unknown probabilities for a subset of
variables when other variables (the evidence variables) are
observed. This process of computing the posterior distribution
of variables, given evidence, is called probabilistic inference.
In Bayesian networks containing cycles, exact inference is
an NP-hard problem. Approximate inference algorithms have
been proposed, but their accuracy depends on the network’s
structure; therefore, they are not general. By transforming the
network into a cycle-free hypergraph, and performing inference
in this hypergraph, Lauritzen and Spiegelhalter proposed
an exact inference algorithm with relatively low complexity
[23]; this algorithm was used in the proposed system.
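As an illustration of equation 2 and of probabilistic inference, the following minimal sketch builds a hypothetical two-node network A → D and computes a posterior by brute-force enumeration. The probability values are made up for the example, and enumeration is used only because the network is tiny; it is not the Lauritzen and Spiegelhalter algorithm used in the proposed system.

```python
# Hypothetical two-node network A -> D, mirroring the (3 x 2) matrix of equation (1).
# All probability values below are invented for illustration.
prior_a = {"a0": 0.7, "a1": 0.3}                      # P(A)
cpt_d_given_a = {                                     # P(D | A), one column per state of A
    "a0": {"d0": 0.6, "d1": 0.3, "d2": 0.1},
    "a1": {"d0": 0.1, "d1": 0.2, "d2": 0.7},
}


def joint(a, d):
    """Equation (2) for this network: P(A, D) = P(A) * P(D | A)."""
    return prior_a[a] * cpt_d_given_a[a][d]


def posterior_a(d_observed):
    """P(A | D = d_observed), obtained by enumerating the joint and normalising."""
    unnormalised = {a: joint(a, d_observed) for a in prior_a}
    total = sum(unnormalised.values())
    return {a: p / total for a, p in unnormalised.items()}
```

Observing D = d2 here shifts most of the probability mass to a1, since d2 is far more likely under a1 than under a0.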

B. Learning a Bayesian Network from Data
A Bayesian network is defined by a structure and the
conditional probability of each node given its parents in that
structure (or its prior probability if it does not have any parent).
These parameters can be learned automatically from data.
Defining the structure consists in finding pairs of nodes $(X, Y)$
that are directly dependent, i.e. such that:
• $X$ and $Y$ are not independent ($P(X, Y) \neq P(X)P(Y)$)
• There is no node set $Z$ such that $X$ and $Y$ are independent
given $Z$ ($P(X, Y|Z) \neq P(X|Z)P(Y|Z)$)
Independence and conditional independence can be assessed
by mutual information (see equation 3) and conditional mutual
information (see equation 4), respectively.
$$I(X, Y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)} \tag{3}$$
$$I(X, Y|Z) = \sum_{x,y,z} P(x, y, z) \log \frac{P(x, y|z)}{P(x|z)P(y|z)} \tag{4}$$
Two nodes are independent (resp. conditionally independent)
if mutual information (resp. conditional mutual information)
is smaller than a given threshold ε, 0 ≤ ε < 1. Ideally, ε
should be equal to 0. However, in the presence of noise, some
meaningless edges (links) can appear. These edges can also
unnecessarily increase the computation time. To avoid this, in
this study, ε was chosen in advance to be equal to 0.1. This
number is independent of dataset cardinality [24].
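The two independence tests can be sketched as follows, assuming the joint distributions are given as nested lists of probabilities. The function names are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated here.

```python
import math

EPSILON = 0.1  # threshold value used in this study


def mutual_information(joint):
    """I(X, Y) of equation (3); joint[x][y] = P(x, y). Natural log is assumed."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:  # terms with P(x, y) = 0 contribute nothing
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi


def conditional_mutual_information(joint):
    """I(X, Y | Z) of equation (4); joint[x][y][z] = P(x, y, z)."""
    nx, ny, nz = len(joint), len(joint[0]), len(joint[0][0])
    cmi = 0.0
    for z in range(nz):
        pz = sum(joint[x][y][z] for x in range(nx) for y in range(ny))
        if pz == 0.0:
            continue
        for x in range(nx):
            pxz = sum(joint[x][y][z] for y in range(ny))
            for y in range(ny):
                pyz = sum(joint[xx][y][z] for xx in range(nx))
                pxyz = joint[x][y][z]
                if pxyz > 0.0:
                    # P(x,y|z) / (P(x|z) P(y|z)) = P(x,y,z) P(z) / (P(x,z) P(y,z))
                    cmi += pxyz * math.log(pxyz * pz / (pxz * pyz))
    return cmi


def is_dependent(information, epsilon=EPSILON):
    """Two nodes are treated as independent when the measure is below epsilon."""
    return information >= epsilon
```

With a threshold of 0.1, a uniform 2 × 2 joint table (mutual information 0) is declared independent, whereas two perfectly correlated binary variables (mutual information log 2) are declared dependent.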
The structure of the Bayesian network, as well as edge
orientation, was obtained by Cheng’s algorithm [24]. This
algorithm was chosen for its low complexity, which is
polynomial in the number of variables, as opposed to exponential
for competing algorithms.
C. Including Images in a Bayesian Network
Contextual information is included as usual in a Bayesian
network: a variable with a finite set of states, one for each
possible attribute value, is defined for each field.
To include images in a Bayesian network, we first define a
variable for each image in the sequence. For each “image
variable”, we follow the usual steps of Content-Based Image
Retrieval (CBIR) [4]: 1) building a signature for each image
(i.e. extracting a feature vector summarizing its digital
content), and 2) defining a distance measure between two
signatures (see section II-C1). Thus, measuring the distance
between two images comes down to measuring the distance
between two signatures. Similarly, in a Bayesian network,
defining states for an “image variable” comes down to defining
states for the signature of the corresponding images. To
this aim, similar image signatures are clustered, as described
below, and each cluster is associated with a state. Thanks to
this process, image signatures can be included in a Bayesian
network like any other variable.
1) Image Signature and Distance Measure: In previous
work on CBIR, we proposed to extract a signature for
images from their wavelet transform [25]. These signatures
model the distribution of the wavelet coefficients in each
subband of the decomposition; as a consequence, they provide
a multiscale description of images. To characterize the wavelet
coefficient distribution in a given subband, Wouwer’s work
was applied [26]: Wouwer has shown that this distribution can
be modeled by a generalized Gaussian function. The maximum
likelihood estimators of the parameters of the wavelet coefficient
distribution in each subband are used as a signature. These estimators
can be computed directly from wavelet-based compressed
images (such as JPEG-2000 compressed images), which can
be useful when a large number of images has to be processed.
A simplified version of Do’s generalized Gaussian parameter
estimation method [27], [25] is proposed in appendix A to
reduce computation times. Any wavelet basis can be used to
decompose images. However, the effectiveness of the extracted
signatures largely depends on the choice of this basis. For
this reason, we proposed to search for an optimal wavelet
basis [25] within the lifting scheme framework, which is
implemented in the compression standards. To compare two
signatures, Do proposed the use of the Kullback-Leibler
divergence between wavelet coefficient distributions $P$ and $Q$
in two subbands [27]:
$$D(P\|Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx \tag{5}$$
where $p$ and $q$ are the densities of $P$ and $Q$, respectively.
A symmetric version of the Kullback-Leibler divergence was
used, since clustering algorithms require (symmetric) distance
measures:
$$\frac{1}{2} \left( D(P\|Q) + D(Q\|P) \right) \tag{6}$$
Finally, the distance between two images is defined as a
weighted sum of these distances over the subbands, denoted
WSD; weights are tuned by a genetic algorithm to maximize
retrieval performance on the training set [25]. The ability to
select a weight vector and a wavelet basis makes this image
representation highly tunable. In previous work, we showed
that the proposed image signature outperforms several well-known
image signatures in terms of retrieval performance [25].
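The distance computation can be sketched as follows, under a simplifying assumption: each subband is summarized by a strictly positive discrete histogram, so that the integral of equation (5) becomes a sum. The signatures described above use generalized Gaussian parameters instead, and the weights would come from the genetic algorithm; the function names and inputs here are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Discrete analogue of equation (5): sum over x of p(x) log(p(x) / q(x)).

    Assumes q(x) > 0 wherever p(x) > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def symmetric_kl(p, q):
    """Equation (6): average of the two directed divergences."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def weighted_subband_distance(signature_a, signature_b, weights):
    """WSD: weighted sum, over the subbands, of the symmetric divergences.

    Each signature is a list with one histogram per subband.
    """
    return sum(w * symmetric_kl(p, q)
               for w, p, q in zip(weights, signature_a, signature_b))
```

Because equation (6) averages the two directed divergences, swapping the two signatures leaves the distance unchanged, which is the property required by the clustering step.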
2) Signature Clustering: In order to define several states
for an “image variable”, similar images are clustered with
an unsupervised classification algorithm, using the image
signatures and the associated distance measure above. Any
algorithm can be used, provided that the distance measure
can be specified. We chose the well-known Fuzzy C-Means
algorithm (FCM) [28] and replaced the Euclidean distance by
WSD described above. In this algorithm, each document is
assigned to each cluster $k = 1..K$ with a fuzzy membership
$u_k$, $0 \leq u_k \leq 1$, such that $\sum_{k=1}^{K} u_k = 1$, which can
be interpreted as a probability. Finding the right number of
clusters is generally a difficult problem. However, when each
sample has been assigned a class label, mutual information
between clusters and class labels can be used to determine the
optimal number of clusters $\hat{K}$ [29] (see equation (7)).
$$\hat{K} = \operatorname*{argmax}_{K} \sum_{c=1}^{C} \sum_{k=1}^{K} P(c, k) \log_{C+K} \frac{P(c, k)}{P(c)P(k)} \tag{7}$$
where $c = 1..C$ are the class labels, $P(c, k)$ is the joint
probability distribution function of the class and cluster labels, and
$P(c)$ and $P(k)$ are the marginal probability distribution functions.
Other continuous variables can be discretized similarly: the
age of a person, one-dimensional signals, videos, etc.
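The selection of the number of clusters by equation (7) can be sketched as follows. Two assumptions are made: the joint distribution P(c, k) is estimated by averaging the fuzzy memberships of the samples of each class, and the logarithm in equation (7) is read as a logarithm in base C + K. The function names and the `memberships_by_k` dictionary are hypothetical.

```python
import math


def cluster_class_score(memberships, labels, n_classes):
    """Criterion of equation (7) for one clustering.

    memberships[n][k] is the fuzzy membership of sample n to cluster k
    (each row sums to 1); labels[n] is its class index in 0..n_classes-1.
    """
    n_samples = len(labels)
    n_clusters = len(memberships[0])
    # Assumed estimate of P(c, k): average membership to cluster k within class c.
    joint = [[0.0] * n_clusters for _ in range(n_classes)]
    for u, c in zip(memberships, labels):
        for k, u_k in enumerate(u):
            joint[c][k] += u_k / n_samples
    p_c = [sum(row) for row in joint]
    p_k = [sum(col) for col in zip(*joint)]
    base = math.log(n_classes + n_clusters)  # change of base to C + K
    score = 0.0
    for c in range(n_classes):
        for k in range(n_clusters):
            if joint[c][k] > 0.0:
                score += joint[c][k] * math.log(joint[c][k] / (p_c[c] * p_k[k])) / base
    return score


def best_number_of_clusters(memberships_by_k, labels, n_classes):
    """Return the K (a key of memberships_by_k) that maximises the criterion."""
    return max(memberships_by_k,
               key=lambda k: cluster_class_score(memberships_by_k[k], labels, n_classes))
```

In practice, `memberships_by_k` would hold the output of one FCM run per candidate value of K on the training set.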
D. System Design
[Figure 2: flowchart. OFFLINE, on the training set: learn the probabilistic
relationships between variables (section II-B), leading to the intermediate
network (Fig. 3(a)); compute the correlation between two states of a variable
(section II-E2), leading to the correlations. QUERY, case in the testing set:
add a query node Q to the intermediate network (section II-D); compute the
probabilistic relationships between the variables and the query node
(section II-E), leading to the query-specific network (Fig. 3(b)); for each
case x in the training set, perform probabilistic inference on the
query-specific network using x as evidence (sections II-A, II-F); the cases in
the training set are ranked in decreasing order of P(Q|x).]
Fig. 2. Bayesian Network based Retrieval. Solid-lined arrows mean “leads
to” or “is followed by” and dashed-lined arrows mean “is used by”.
Let $x_q$ be a query document and $M$ be the number of
attributes.
Definition: A document $x$ is said to be relevant for $x_q$ if $x$
and $x_q$ belong to the same class.
To assess the relevance of each reference document in a
database for $x_q$, we define a Bayesian network with the
following variables:
• a set of variables $\{A_i, i = 1..M\}$, where $A_i$ represents
the $i$-th attribute of $x$,
• a Boolean variable $Q$ = “$x$ is relevant for $x_q$” ($\bar{Q}$ = “$x$
is not relevant for $x_q$”).
The design of the system is described hereafter and illustrated
in Fig. 2. To build the network, the first step is to learn the
different relationships between the attributes $\{A_i, i = 1..M\}$.
So, an intermediate network is built from data, using Cheng’s
algorithm (see section II-B). For that purpose, the studied
database is divided into a training dataset and a test dataset.
Cheng’s algorithm is applied to the training dataset. In our
experiments, the query document $x_q$ belongs to the test dataset
and $x$ belongs to the training dataset. To build this Bayesian
network, a finite number of states $a_{ij}$ is defined for each
variable $A_i$, $i = 1..M$. To learn the relationships between these
variables, we use the membership degree of any document $y$
in the training dataset to each state $a_{ij}$ of each variable $A_i$,
denoted $\alpha_{ij}(y)$. If $A_i$ is a nominal variable, $\alpha_{ij}(y)$ is Boolean;
for instance, if $y$ is a male then $\alpha_{\text{“sex”},\text{“male”}}(y) = 1$ and
$\alpha_{\text{“sex”},\text{“female”}}(y) = 0$. If $A_i$ is a continuous variable (such
as an image-based feature), $\alpha_{ik}(y)$ is the fuzzy membership
of $y$ to each cluster $k = 1..K$ (see section II-C2). An example
of intermediate network is given in Fig. 3 (a).
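The two kinds of membership degrees can be sketched as follows. The continuous case uses the standard Fuzzy C-Means membership formula; the fuzzifier value of 2 is an assumption, as it is not stated here, and the function names are hypothetical.

```python
def nominal_membership(value, states):
    """Boolean membership of a document to each state of a nominal attribute."""
    return {state: 1.0 if state == value else 0.0 for state in states}


def fuzzy_membership(distances, fuzzifier=2.0):
    """Standard Fuzzy C-Means membership of one sample to each cluster.

    distances[k] is the distance from the sample to the centre of cluster k
    (WSD for images). The fuzzifier value is an assumption.
    """
    # A sample lying exactly on one or more centres belongs fully to them.
    zero = [k for k, d in enumerate(distances) if d == 0.0]
    if zero:
        return [1.0 / len(zero) if k in zero else 0.0
                for k in range(len(distances))]
    exponent = 2.0 / (fuzzifier - 1.0)
    inverse = [d ** -exponent for d in distances]
    total = sum(inverse)
    return [v / total for v in inverse]
```

For example, a sample twice as far from the second centre as from the first receives memberships 0.8 and 0.2 with a fuzzifier of 2, and the memberships always sum to 1.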
(a) Intermediate network
(b) Query-specific network
Fig. 3. Retrieval Bayesian Network (built for the database presented in
section IV-A). In the example of Fig. (b), attributes $A_1, ..., A_6, A_8, A_{10}$,
$A_{13}, A_{14}, A_{15}, A_{17}, A_{18}, A_{22}, A_{23}$ are available for the query document
$x_q$, so the associated nodes are then connected to node $Q$.
$Q$ is then integrated in the network. For retrieval, the
attributes of $x$ are observable evidence for $Q$; as a
consequence, the associated variables should be descendants of
$Q$. In the retrieval network, the probabilistic dependences
between $Q$ and each variable $A_i$ depend on $x_q$. In fact, $x_q$
specifies which attributes should be found in the retrieved
documents in order to meet the user’s needs. So, when the
$i$-th attribute of $x_q$ is available, we connect the two nodes $Q$
and $A_i$ and we estimate the associated conditional probability
matrix $P_q(A_i = a_{ij}|Q)$ according to $x_q$ (see Fig. 3 (b)).
The index $q$ denotes that the probability depends on $x_q$. A
query-specific network is obtained: its structure depends on
which attributes are available for the query document and the
conditional probability matrices depend on the value taken for
these available attributes by the query document. This network
is used to assess the relevance of any reference document for
$x_q$.
E. Computing the Conditional Probabilities $P_q(A_i = a_{ij}|Q)$
To compute $P_q(A_i = a_{ij}|Q)$, we first estimate $P_q(Q|A_i = a_{ij})$:
the probability that a reference document $x$, with full
membership to the state $a_{ij}$ of attribute $A_i$, is relevant.
$P_q(A_i = a_{ij}|Q)$ can then be computed thanks to Bayes’
theorem (see equation (8)). The prior probability $P_q(Q)$ is
required; it can be estimated by the probability that two
documents belong to the same class, i.e. the probability that
both documents belong to class 1 or that both documents
belong to class 2, etc., hence equation 9:
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)} \tag{8}$$
$$P_q(Q) = \sum_{c=1}^{C} (P(c))^2 \tag{9}$$
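Equations (8) and (9) can be sketched as follows. The function names are hypothetical, and the per-state relevance probabilities and state marginals are assumed to be given as dictionaries keyed by state.

```python
def prior_relevance(class_probabilities):
    """Equation (9): probability that two documents belong to the same class."""
    return sum(p * p for p in class_probabilities)


def state_given_relevant(p_relevant_given_state, p_state, p_relevant):
    """Equation (8) applied to each state a of an attribute:
    P(A = a | Q) = P(Q | A = a) * P(A = a) / P(Q).
    """
    return {state: p_relevant_given_state[state] * p_state[state] / p_relevant
            for state in p_state}
```

For two equiprobable classes, the prior relevance is 0.5² + 0.5² = 0.5: a randomly drawn reference document has an even chance of sharing the class of the query.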