Urban, J. and Jose, J.M. (2006) Adaptive image retrieval using a graph model for semantic feature integration. In: 8th ACM International Workshop on Multimedia Information Retrieval (MIR '06), 26-27 October 2006, Santa Barbara, CA, USA, pp. 117-126.
http://eprints.gla.ac.uk/3583/

Adaptive Image Retrieval using a Graph Model for
Semantic Feature Integration
Jana Urban and Joemon M. Jose
Dept. of Computing Science, University of Glasgow
Glasgow, UK
{jana,jj}@dcs.gla.ac.uk
ABSTRACT
The variety of features available to represent multimedia data con-
stitutes a rich pool of information. However, the plethora of data
poses a challenge in terms of feature selection and integration for
effective retrieval. Moreover, to further improve effectiveness, the
retrieval model should ideally incorporate context-dependent fea-
ture representations to allow for retrieval on a higher semantic level.
In this paper we present a retrieval model and learning framework
for the purpose of interactive information retrieval. We describe
how semantic relations between multimedia objects based on user
interaction can be learnt and then integrated with visual and textual
features into a unified framework. The framework models both fea-
ture similarities and semantic relations in a single graph. Querying
in this model is implemented using the theory of random walks. In
addition, we present ideas to implement short-term learning from
relevance feedback. Systematic experimental results validate the
effectiveness of the proposed approach for image retrieval. How-
ever, the model is not restricted to the image domain and could easily be employed for retrieving multimedia data (and even a combination of different domains, e.g. images, audio and text documents).
Categories and Subject Descriptors: H.3.3 [Information Storage
and Retrieval]: Information Search and Retrieval—relevance feed-
back, retrieval models
General Terms: Retrieval Models, Experimentation, Performance
Keywords: semantic features, image retrieval, relevance feedback,
random walks, fusion
1. INTRODUCTION
Ever since the deficiencies of primitive content-based features
were realised, interest has turned to “semantic features” and “se-
mantic retrieval”. Semantic features are now the ultimate goal for facilitating effective retrieval of visual data, but what are they? Smeulders et al. state that “Semantic features aim at encoding interpretations of the image which may be relevant to the application” [15, p. 1361]. There are two important points to note in this
assertion. Firstly, semantics are about interpretation, and secondly
the interpretation is to a large degree domain or context dependent.
An image by itself usually has no intrinsic meaning. The mean-
ing is bestowed upon the image by a human observer regarding the
context of both the observer and the image.
The goal of the semantic approach is to replace the low-level
feature space with a higher-level semantic space, which is closer
to the abstract concepts the user has in mind when looking for an
image. Since the endeavour of obtaining semantic features directly
from the visual attributes was unfruitful, mining for semantic con-
cepts from a knowledge-base has been the focus of research to this
end. Most of the existing attempts towards semantic features can
be broadly categorised in two classes: annotation-based [7, 12] and
user-based [18, 3, 4]. This distinction arises from the nature of the
knowledge-base used: the first method relies on an (at least par-
tially) annotated image corpus from which semantic concepts can
be learnt and propagated to other images, whereas the latter learns
semantic concepts from the user directly. While there are a number
of general concepts that can universally be agreed upon, e.g. an ‘in-
door’ vs. ‘outdoor’ classification, there are more subtle meanings
that are subject to the observer’s interpretation, e.g. ‘a romantic
scene’. The major difference in the two approaches hence lies in
the interpretation context considered for deciphering the image’s
meaning. It should become obvious that the annotation-based ap-
proach can only succeed in taking very general concepts into con-
sideration, as opposed to user-based approaches that are tailored to
the user’s expectations and interpretations.
Our approach is an example of the latter in that contextual in-
formation is mined from user interaction. We have developed a
system, EGO, that encourages its users to manage their retrieval re-
sults on a workspace provided in the interface [20]. While search-
ing for images, the creation of groupings of related images is sup-
ported, inciting the user to break up the task into related facets to
organise their ideas and concepts. The system can then assist the
user by recommending relevant images for selected groups. Previ-
ous user experiments have shown that EGO helps to overcome the
query formulation problem and leads to a more effective and en-
joyable search experience compared to a state-of-the-art relevance
feedback interface [22].
In this work, we use the groupings created in the user experi-
ments to infer a semantic feature. Our underlying assumption is
that all objects (images) in one group share some semantic concept
(user-, usage-, and task-dependent), e.g. images of snowy mountains,
images with high visual contrasts, images that could be used as
background on the front of a flyer. Instead of trying to label these
concepts, however, we simply record that there is a semantic rela-
tion between those images in a group. We refer to these relation-
ships as peer information. Appropriately recorded, the peer infor-
mation can be used to implement long-term learning of semantic
concepts in the system.

In addition to the peer information, low-level visual features and
textual annotations are further sources of information for the re-
trieval (and recommendation) system. However, the combination
of different feature modalities is a big challenge in multimedia re-
trieval [11, 6, 19]. Most state-of-the-art systems treat each feature
individually and fuse the result lists to obtain the final results. How-
ever, the method of fusion is far from obvious and such systems fail
to capture dependencies between the features. Even worse, such
systems have difficulties in exceeding the performance of a text-
only system in information retrieval tasks [11]. Instead of a late
fusion of results, we propose to integrate the different modalities in
a single graph and use the theory of random walks [10] to calculate
retrieval results.
In our model, images, terms, and visual features are represented
as nodes in an Image-Context Graph (ICG). The links between
nodes represent: (1) image attributes (relations between images and
their features); (2) intra-feature relations (feature similarities); and
(3) semantic relations (peer information). We describe a retrieval
model based on random walks that can retrieve both top-matching images and terms for a query (consisting of both image examples and terms). In addition, we show how short-term relevance
feedback learning can be integrated in our model by adapting the
link weights in the ICG. The main contributions of this paper are:
- We propose a group-based contextual feature (peer information) based on mining usage information while searching in a multimedia collection.
- We show how the peer information can be integrated with already existing low-level visual features and textual annotation in a graph model.
- We define various learning strategies in the graph model.
- Through systematic experimental results, the effectiveness of the proposed approach is validated and learning strategies are investigated.
The remainder of this document is organised as follows. Sec-
tion 2 reviews related work. We detail the graph-model and explain
the mathematical background in Section 3. Section 4 introduces
the baseline systems used in the evaluation. It consists of three sep-
arate retrieval models for each feature modality, whose results are
combined using a rank-based list aggregation method. We outline
the experimental methodology in Section 5, followed by the exper-
imental results in Section 6. Finally, we summarise and conclude
the paper in Section 7.
2. RELATED WORK
The theory of Random Walks has been applied to information
retrieval in the form of Google’s famous PageRank algorithm [1].
The idea can be sketched as follows. Imagine a random surfer on
the Web choosing to follow a link on each page at random. Occa-
sionally, the surfer gets stuck in a dead end or in cycles, or simply
gets bored. At these points, they may randomly jump to another page on the Web without following any links.
ank score is to reflect its quality depending on the number of other
pages linking to it based on the random surfer model. The PageR-
ank algorithm can be viewed as a random walk on the Web graph.
The mathematical details will be elaborated in Section 3.1.
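As a minimal, self-contained sketch of this computation (ours, not the paper's; the three-page link graph and all values are invented for illustration), power iteration with uniform teleportation looks like this:

import numpy as np

# Hypothetical three-page Web graph: adjacency[i][j] = 1 if page i links to page j.
adjacency = np.array([[0, 1, 1],
                      [1, 0, 0],
                      [0, 1, 0]], dtype=float)

alpha = 0.15                                           # teleportation probability
P = adjacency / adjacency.sum(axis=1, keepdims=True)   # row-stochastic transition matrix

n = len(P)
pi = np.full(n, 1.0 / n)                               # start from the uniform distribution
for _ in range(100):
    # Follow a random outlink with probability (1 - alpha),
    # teleport to a uniformly random page with probability alpha.
    pi = (1 - alpha) * P.T @ pi + alpha / n

print(pi)  # stationary scores: the PageRank of each page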
2.1 Random Walks in the Image Domain
Graph-based modelling techniques have recently found their way
into the image domain. The two most closely related approaches are its application to relevance feedback learning [3, 4] and to image captioning [12]. Han et al. have proposed to model the relationships between images based on their co-selection in relevance feedback sessions [3]. The weight of the link between two images is the ratio of the frequency of the two images being labelled as positive examples in the same retrieval session to the total frequency of their having been selected together (as positive or negative samples). A semantic similarity measure between two images is then derived from the overall correlation determined by analysing the resulting graph (referred to as the image link network). An overall similarity measure is defined as a weighted linear combination of the semantic similarity and the low-level feature similarity. In contrast, the theory of random
walks is explicitly employed on an image graph in which links be-
tween image nodes are also constructed from relevance feedback
information in [4]. Here the graph is constructed by adding two
special nodes to the graph: a positive absorbing node and a neg-
ative absorbing node. Each positively labelled image receives a
link to the positive absorbing node, while negative examples are
directly linked to the negative absorbing node. As this approach does not discriminate between query sessions, it can only be used for short-term learning.
The second application of random walks in the image domain
is to automatically learn annotations for previously unlabelled im-
ages [12]. A graph, called GCap, is constructed, which contains one node per image, a node for each image region per image, and one node for each term in the vocabulary. Each image is connected to its region nodes and to the terms it is annotated with. Further, regions are linked to their k nearest neighbours. Given an unlabelled image, i.e. an image node I_i that does not have any links to a term node in the graph, a random walk is performed to compute the most probable terms for this image. These are found by calculating the long-term (stationary) probabilities that a random walker finds itself at a particular node, given that it randomly restarts the walk from I_i. The top t terms with the highest stationary probability are returned as the suggested labels.
The semantic link approaches [3, 4] only model the information gained from relevance feedback, which has to be combined with feature-based similarity values in a further step, while the image captioning approach [12] only models image-feature similarities, without any facility for adaptation to relevance feedback. We propose to model both the image-feature relations and the inter-image (or semantic) relations together. Hence there are two vital ingredients to our approach: (1) the integration of semantic as well as low-level features in a graph model, and (2) a learning strategy in the graph model. The latter incorporates two levels of feedback to implement short- and long-term learning from user feedback. By adding links between images that are grouped together, the semantic network is iteratively constructed and reinforced through adaptive link weights, thus implementing a long-term learning strategy. Further, we show how short-term learning can be achieved by introducing feature weights, ensuring that links to feature nodes with a strong feature weight are favoured over feature links with small weights, given a particular query.
3. THE IMAGE-CONTEXT GRAPH
The problems addressed in this paper are (a) how to capture and
model personalised usage information to improve retrieval perfor-
mance, and (b) how to integrate this information with other features
(visual and textual) to model interdependencies between features.
The idea is to represent images and all their attributes (features)
in a graph. The graph consists of a number of layers of vertices:
vertices for all images in the collection, and one layer of vertices
per implemented feature. These layers will contain both visual and
textual features. There are two different types of edges connecting vertices: edges representing a “contains” relationship (i.e. edges
between the image vertices and their attributes), and edges repre-
senting the similarity amongst vertices in the same layer (“similar-
ity edges”). These edges are constructed based on the similarity
between features (similarity between visual feature vectors, simi-
larity between terms) or semantic relationships/co-occurrences of
images. Thus the graph represents the images in context and in the
following it is referred to as the Image-Context Graph, or ICG. An example graph containing three image nodes (I_1, ..., I_3), four term nodes (t_1, ..., t_4), and two types of visual features (f_1, f_2) is depicted in Figure 1.

Figure 1: An example image-context graph
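To make the layered structure concrete, the following sketch (ours; node identifiers are invented, mirroring the three images, four terms, and two visual features of Figure 1) assembles the node layers and the two edge types:

# Node layers of a toy ICG (identifiers are illustrative only).
images = ["I1", "I2", "I3"]
terms = ["t1", "t2", "t3", "t4"]
features = [f"f{j}_{i}" for i in images for j in (1, 2)]  # two visual features per image

edges = set()

def link(u, v):
    edges.add((u, v))
    edges.add((v, u))  # the graph is undirected

# "Contains" edges: each image is linked to its feature nodes and annotated terms.
for i in images:
    link(i, f"f1_{i}")
    link(i, f"f2_{i}")
link("I1", "t1"); link("I2", "t2"); link("I2", "t3"); link("I3", "t4")

# Similarity edges within one feature layer (e.g. nearest-neighbour feature vectors).
link("f1_I1", "f1_I2")

# Peer edge: a user grouped I1 and I3 together.
link("I1", "I3")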
The general recommendation problem (or retrieval problem for
that matter) can be stated as: Given a query, consisting of image
examples and/or terms, compute the most similar images to recom-
mend to the user. In the ICG, this translates to: given a start set of
vertices in the graph, compute those image vertices that are most
likely to be reached starting from the start set.
A solution to this problem can be found in the theory of Random
Walks. The likelihood of passing a node in the ICG is given by
calculating the stationary distribution of the Markov chain induced
by the ICG. By setting the restart vector to the nodes representing
the query items, we can stage a Random Walk with Restarts on the
ICG. This is equivalent to computing a query-biased “PageRank”
of the ICG as will be explained in the following section.
3.1 Mathematical Background
A random walk is a finite-state Markov chain that is time-revers-
ible. Markov chains are frequently used to model physical and con-
ceptual processes that evolve over time, for example the spread of
disease within a population or the modelling of gambling. An intro-
duction to Random Walks and Markov chains can be found in [10].
Let the Markov chain M consist of a finite number of states, say N = {1, 2, ..., n}, and probabilities of a transition occurring between states at discrete time steps. The (one-step) transition probability p_{ij} denotes the conditional probability that M will be in state j at time t + 1 given that it was observed in state i at time t. In general, p^k_{ij} denotes the probability that M proceeds from state i to state j after k transitions. The transition probability matrix P = [p_{ij}] is often used to represent M. The stationary distribution π^T = [π_1, π_2, ..., π_n] represents the long-run proportion of time the chain M spends in each state; π is also referred to as the steady state probability vector. Markov chains are often represented as a graph, or state transition diagram, G. Finally, to make the connection to PageRank: the PageRank scores are equivalent to the stationary distribution π of the Markov chain associated with the Web graph.
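As a small worked illustration (ours, not from the paper), consider a two-state chain with

P = [0.9 0.1; 0.5 0.5]

Solving π = P^T π together with π_1 + π_2 = 1 gives π_1 = 0.9 π_1 + 0.5 π_2, hence π_1 = 5 π_2 and π^T = (5/6, 1/6): this chain spends five sixths of its time in state 1.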
3.1.1 Calculating π
In general, the stationary distribution π of a Markov chain can be found by solving the following eigenvector problem:

π = P^T π    (1)

A unique stationary distribution is guaranteed to exist iff P is a stochastic, irreducible matrix [8].
In the PageRank model, a transition probability matrix P is built from the hyperlink structure of the Web. To create a stochastic, irreducible matrix, Brin and Page suggested to eliminate dangling pages (pages with no outlinks) by linking them to all other pages in the Web [1]. This is achieved by replacing the 0^T rows of the sparse matrix P with dense vectors: the uniform vector (1/n) e^T initially, or a more general probability distribution over all pages, v^T. This stochastic fix can be modelled implicitly by the following transformations (see [8]):

P̄ = P + a v^T    (2)

P̿ = (1 − α) P̄ + α e v^T    (3)

where a is a vector whose elements a_i = 1 if row i in P corresponds to a dangling node, and 0 otherwise; e is the vector of all 1s; 0 ≤ α ≤ 1; and v represents a general probability distribution over the nodes, often referred to as the personalisation or restart vector.
Substituting P̿ in Equation 1 then leads to:

π = ((1 − α) P + ((1 − α) a + α e) v^T)^T π    (4)

π = (1 − α) (P + a v^T)^T π + α v    (5)

with the constraint that π is normalised, such that |π| = 1 and thus e^T π = 1. α is the probability of restarting the random walk from any of the nodes in v.
3.1.2 Parameters of the PageRank Model
α. The value of α denotes the probability of a surfer choosing to jump to a new Web page (teleportation), while they choose to click on hyperlinks with probability (1 − α). A small α places more emphasis on the hyperlink structure of the graph and much less on the teleportation tendencies, and also slows convergence of the iterative computation of PageRank. Originally α = 0.15 was proposed [1].
In the image annotation graph of [12] a value of α = 0.65 was found to be better suited, which they could explain by a relationship to the estimated diameter of the graph.
The personalisation vector v^T. Instead of the uniform distribution (1/n) e^T, a more general distribution v^T > 0 can be used in its place. v^T is often referred to as the personalisation vector or restart vector in random walk terms.
The personalisation vector also allows PageRank to be made query-sensitive. The original PageRank assigns a score to a page proportional to the number of times a random surfer would visit that page, if they surfed indefinitely, following all outlinks with equal probability or occasionally jumping to a random new page chosen with equal probability. If we change the probability distribution given by the personalisation vector v^T, we can introduce a bias such that the surfer jumps with high probability to the pages emphasised in v^T.
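Continuing the toy sketch from Section 2 (again our illustration, not the paper's code; all values are invented), replacing the uniform teleportation term with a restart vector concentrated on one node shifts the scores toward that node's neighbourhood:

import numpy as np

adjacency = np.array([[0, 1, 1],
                      [1, 0, 0],
                      [0, 1, 0]], dtype=float)
P = adjacency / adjacency.sum(axis=1, keepdims=True)
alpha = 0.15

def personalised_pagerank(v, iters=100):
    """Power iteration with restart distribution v in place of the uniform vector."""
    pi = v.copy()
    for _ in range(iters):
        pi = (1 - alpha) * P.T @ pi + alpha * v
    return pi

uniform = np.full(3, 1 / 3)             # classic PageRank
biased = np.array([1.0, 0.0, 0.0])      # always restart at node 0 (query-sensitive)
print(personalised_pagerank(uniform))
print(personalised_pagerank(biased))    # node 0 and its neighbours gain probability mass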

3.2 Constructing the ICG
Let G be the ICG, V the set of vertices in G, and E the set of edges, so that G = (V, E). The graph will be stored in the form of its adjacency matrix M.
3.2.1 The Nodes
There are three types of nodes: image nodes I, term nodes T, and feature nodes F, with V = I ∪ T ∪ F:
- Let I denote the set of all image nodes in G. Add one node per image to the set of image nodes. I_i denotes the node for image i.
- Let T denote the set of all term nodes in G. Add one node for every term in the vocabulary to T. t_i denotes the node for term i.
- Construct the set of visual feature nodes F by adding one node per low-level visual feature for each image. If the number of implemented visual features is v (which is 6 in our case), then |F| = v × |I|. f_{ij} denotes the node for the j-th feature of image i.
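As an illustration of how these layers might be laid out as matrix indices (our sketch; the paper does not prescribe an ordering, and all sizes are invented), with n = |I| images, m = |T| terms, and v visual features per image:

# Hypothetical index layout for the ICG adjacency matrix (illustrative only):
# rows 0..n-1 are image nodes, the next m rows are term nodes, and the
# final n*v rows are the per-image visual feature nodes.
n, m, v = 1000, 500, 6          # toy collection sizes; v = 6 as in the paper

def image_node(i):              # I_i
    return i

def term_node(i):               # t_i
    return n + i

def feature_node(i, j):         # f_ij, the j-th visual feature of image i
    return n + m + i * v + j

total_nodes = n + m + n * v     # |V| = |I| + |T| + |F|, with |F| = v * |I|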
3.2.2 The Edges
There are two types of edges: attribute edges and similarity edges. The first type links images to their attributes; the second type links nodes of the same feature type (term and visual feature nodes) based on the similarity between these nodes. A special type of similarity edge is the peer edge between image nodes themselves, created based on users’ groupings of images.
Attribute Edges. Each image node I_i is linked to all its features. Thus an edge is created to each of its visual feature nodes f_{i1}, ..., f_{iv}. For the textual features, an edge is created between an image node I_i and a term node t_j if image i is annotated with term j.
Similarity Edges. Similar to [12], we propose to create edges between visual features based on their nearest neighbours. Consider a feature node f_{il} representing the l-th feature of image i; then compute the top k nearest neighbours by calculating the similarity score between the feature vector f_{il} and the feature vector f_{jl} for all other images j (1 ≤ j ≤ |I|, j ≠ i). This allows for an adaptive definition of closeness without having to fix a threshold value.
A similar idea could be applied to the term nodes, choosing a similarity measure between terms based on relationships between terms (e.g. using WordNet) or on a collection-based analysis. Since the number of terms contained in an image (annotations) is typically very low (compared to text documents), a collection-based analysis is probably not very significant. Instead we adopt a simple similarity measure: sim(t_i, t_j) = 1 if i = j and 0 otherwise. Using this similarity measure, we obtain an edge that links each term node to itself.
Peer Edges. Finally, the edges between the image nodes themselves are based on user feedback. For each group created by a user, edges are created connecting all the images in that group. An edge between two images i and j has a weight, which generally reflects the frequency of these images co-occurring in groups. However, the weight can also be reduced by negative feedback (see below). These edges represent high-level semantic relationships between images based on their usage.
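The following sketch (ours; the distance measure, k, and all sizes are placeholder choices, not the paper's exact settings) shows how the three edge types could be written into a weighted adjacency matrix M, reusing the index layout sketched in Section 3.2.1:

import numpy as np

n, m, v = 100, 50, 2                       # toy sizes: images, terms, features per image
image_node = lambda i: i
term_node = lambda t: n + t
feature_node = lambda i, l: n + m + i * v + l
M = np.zeros((n + m + n * v, n + m + n * v))

def link(a, b, w=1.0):                     # undirected, weighted edge
    M[a, b] += w
    M[b, a] += w

def build_edges(feature_vecs, annotations, groups, k=5):
    # feature_vecs[l][i]: vector of the l-th visual feature of image i;
    # annotations[i]: term indices of image i; groups: user-created image groups.
    for i in range(n):
        for l in range(v):
            link(image_node(i), feature_node(i, l))            # attribute edge
            dists = sorted((np.linalg.norm(feature_vecs[l][i] - feature_vecs[l][j]), j)
                           for j in range(n) if j != i)
            for _, j in dists[:k]:                             # k-NN similarity edges
                link(feature_node(i, l), feature_node(j, l))
        for t in annotations[i]:
            link(image_node(i), term_node(t))                  # image-term edge
    for group in groups:                                       # peer edges: weight grows
        for a in group:                                        # with group co-occurrence
            for b in group:
                if a < b:
                    link(image_node(a), image_node(b))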
3.3 Evaluating a Query
The objective of retrieval in the graph is to find those image
nodes I that are closest (or best connected) to the query nodes.
The overview of the algorithm is as follows. First, the restart vector
is built from the query nodes. Then, a Random Walk with Restarts
is performed on the graph to estimate the stationary probability distribution π. Finally, the image nodes are returned to the user sorted in descending order by their steady state probability scores. Algorithm 1 shows an overview of these steps.

Algorithm 1 Calculating the query results based on a Random Walk on the ICG
Require: Query consisting of image examples and query terms; M, the adjacency matrix of the ICG; constant 0 < α < 1
Ensure: ||π||_1 = 1 (the L_1 norm of π)
1: Initialise personalisation vector v.
2: M' = normalise(M).
3: Initialise π_0 = v.
4: Set k = 0, the number of iterations.
5: while not converged do
6:   π_{k+1} = (1 − α) M' π_k + α v
7:   Normalise π_{k+1}.
8:   k = k + 1
9: end while
10: return Image documents sorted by their π values after convergence.
Construction of the restart/personalisation vector. Assume a query contains a number of image examples and a set of terms. The personalisation vector v is initialised such that v(u) = 1/q for all nodes u representing the image examples and terms, where q is the size of the query. The remaining elements are set to 0. Choosing the personalisation vector this way ensures that these nodes are favoured in the following Random Walk computation.
Calculating π. Recall from Section 3.1.1 (cf. Equation 1) that the stationary distribution π of a Markov chain can be found by solving the eigenvector problem π = P^T π. In the ICG, there are no dangling nodes due to the way the ICG is constructed, so the transformation to create a stochastic, irreducible matrix representing the ICG (cf. Equation 2) can be simplified to:

P̿ = (1 − α) P + α e v^T    (6)

And the calculation of π can be achieved by:

π = (1 − α) M' π + α v    (7)

where M' (= P^T) is the column-normalised adjacency matrix of the ICG, and α is the probability of restarting the random walk from any of the nodes in v.
The estimation of π is solved in the iterative algorithm detailed in Algorithm 1. The algorithm converges when two consecutive estimates π_k and π_{k+1} are reasonably close together, i.e. |π_k − π_{k+1}| < threshold. The threshold is set to 10^{−6}.
Returning the query results. Finally, we choose the top r image nodes (i.e. the elements π(u_i) from π, where 1 ≤ i ≤ |I|) and present them to the user.
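Putting these steps together, a compact implementation of Algorithm 1 might look as follows (our sketch, with numpy; parameter defaults and the num_images argument are our own conventions, and it assumes the ICG has no isolated nodes, so every column of M can be normalised):

import numpy as np

def evaluate_query(M, query_nodes, num_images, alpha=0.15, threshold=1e-6):
    """Random Walk with Restarts on the ICG (Algorithm 1).
    M: weighted adjacency matrix; query_nodes: indices of the query's image and
    term nodes; image nodes occupy indices 0..num_images-1 by assumption."""
    M_prime = M / M.sum(axis=0, keepdims=True)        # column-normalised adjacency
    v = np.zeros(M.shape[0])
    v[list(query_nodes)] = 1.0 / len(query_nodes)     # restart vector: v(u) = 1/q
    pi = v.copy()                                     # pi_0 = v
    while True:
        pi_next = (1 - alpha) * M_prime @ pi + alpha * v
        pi_next /= pi_next.sum()                      # keep ||pi||_1 = 1
        if np.abs(pi - pi_next).sum() < threshold:    # |pi_k - pi_{k+1}| < 10^-6
            break
        pi = pi_next
    return np.argsort(-pi_next[:num_images])          # images by steady-state score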
3.4 Relevance Feedback
In this section we show how both long- and short-term learning can be implemented in the ICG to create a retrieval system that adapts to its users. On the one hand, relevance feedback is used to build up the semantic or peer network (the subgraph consisting of image nodes and the edges between them) over time. On the other hand, short-term learning is implemented by computing a set of feature weights adapted to the current query.

References
S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 1998.
M.-K. Hu. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, 1962.
A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
M. Sonka, V. Hlavac, and R. Boyle. Image Processing: Analysis and Machine Vision.