Multiple Bernoulli Relevance Models for Image and Video Annotation
S. L. Feng, R. Manmatha and V. Lavrenko
Multimedia Indexing and Retrieval Group
Center for Intelligent Information Retrieval
University of Massachusetts
Amherst, MA, 01003
Abstract
Retrieving images in response to textual queries requires
some knowledge of the semantics of the picture. Here, we
show how we can do both automatic image annotation and
retrieval (using one word queries) from images and videos
using a multiple Bernoulli relevance model. The model as-
sumes that a training set of images or videos along with
keyword annotations is provided. Multiple keywords are
provided for an image and the specific correspondence be-
tween a keyword and an image is not provided. Each im-
age is partitioned into a set of rectangular regions and a
real-valued feature vector is computed over these regions.
The relevance model is a joint probability distribution of
the word annotations and the image feature vectors and is
computed using the training set. The word probabilities are
estimated using a multiple Bernoulli model and the image
feature probabilities using a non-parametric kernel density
estimate. The model is then used to annotate images in
a test set. We show experiments on both images from a
standard Corel data set and a set of video key frames from
NIST’s Video Trec. Comparative experiments show that the
model performs better than a model based on estimating
word probabilities using the popular multinomial distribu-
tion. The results also show that our model significantly out-
performs previously reported results on the task of image
and video annotation.
1. Introduction
Searching and finding large numbers of images and videos
from a database is a challenging problem. The conventional
approach to this problem is to search on image attributes
like color and texture. Such approaches suffer from a num-
ber of problems. They do not really capture the semantics
This work was supported in part by the Center for Intelligent Informa-
tion Retrieval and in part by the National Science Foundation under grant
number IIS-9909073 and in part by SPAWARSYSCEN-SD grant number
N66001-02-1-8903. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor.
of the problem well and they often require people to pose
image queries using color or texture which is difficult for
most people to do. The traditional “low-tech” solution to
this problem practiced by librarians is to annotate each im-
age manually with keywords or captions and then search on
those captions or keywords using a conventional text search
engine. The rationale here is that the keywords capture the
semantic content of the image and help in retrieving the im-
ages. This technique is also used by television news orga-
nizations to retrieve file footage from their videos. While
“low-tech”, such techniques allow text queries and are suc-
cessful in finding the relevant pictures. The main disadvan-
tage with manual annotations is the cost and difficulty of
scaling it to large numbers of images.
Automatically annotating images/videos would solve
this problem while still retaining the advantages of a se-
mantic search. Here, we propose approaches to automati-
cally annotating and retrieving images/videos by learning a
statistical generative model called a relevance model using a
set of annotated training images. The images are partitioned
into rectangles and features are computed over these rectan-
gles. We then learn a joint probability model for (continu-
ous) image features and words called a relevance model and
use this model to annotate test images which we have not
seen. Words are modeled using a multiple Bernoulli pro-
cess and images modeled using a kernel density estimate.
We test this model using a Corel dataset provided by [5]
and show that it outperforms previously reported results on
other models. It performs 4 times better than a model based
on machine translation [5] and better than one which models
word probabilities using a multinomial distribution.
Existing annotation models [5, 3, 7, 8], by analogy with the
text retrieval world, have used the multinomial distribution
to model annotation words. We believe that annotation text
has very different characteristics than full text in documents
and hence a Bernoulli distribution is more appropriate.
In image/video annotation, a multinomial would split the
probability mass between multiple words. For example, if
an image was annotated with “person, grass”, with perfect
annotation the probability for each word would be equal to
0.5. On the other hand, another image which has just one
annotation “person” would have a probability of 1.0 with
perfect annotation. If we want to find images of people,
when rank ordering these images by probability the second
image would be preferred to the first although there is no
reason for preferring one image over another. The problem
can be made much worse when the annotation lengths for
different images differ substantially. A similar effect occurs
when annotations are hierarchical. For example, let one im-
age be annotated “face, male face, Bill Clinton” and a sec-
ond image be annotated with just “face”. The probability
mass would be split three ways (0.33 each) in the first case
while in the second image “face” would have a probability
of 1. Again the second image would be preferred for the
query “face”, although there is no reason for preferring one
over the other. The Bernoulli model avoids this problem
by making decisions about each annotation independent of
the other words. Thus, in all the above examples, each of
the words would have a probability of 1 (assuming perfect
annotation).
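To make the contrast concrete, here is a small Python sketch (our own toy vocabulary and annotations, not the paper's data) that computes the per-word probabilities a maximum-likelihood multinomial and a multiple-Bernoulli model would assign in the situations described above:

# Toy illustration (our own example, not the paper's data) of how a maximum-likelihood
# multinomial splits probability mass across annotation words, while a multiple-Bernoulli
# model scores each word's presence independently.

def multinomial_word_probs(annotation):
    # ML multinomial: the probability mass is shared equally among the annotation words.
    return {w: 1.0 / len(annotation) for w in annotation}

def bernoulli_word_probs(annotation, vocabulary):
    # Multiple Bernoulli: each vocabulary word is either present (1.0) or absent (0.0).
    return {w: (1.0 if w in annotation else 0.0) for w in vocabulary}

vocabulary = ["person", "grass", "face", "male face", "Bill Clinton"]
images = {
    "img1": ["person", "grass"],                    # two annotation words
    "img2": ["person"],                             # a single annotation word
    "img3": ["face", "male face", "Bill Clinton"],  # hierarchical annotation
    "img4": ["face"],
}

for name, ann in images.items():
    print(name, "multinomial:", multinomial_word_probs(ann))
    print(name, "Bernoulli:  ", bernoulli_word_probs(ann, vocabulary))

# Multinomial: P(person|img1) = 0.5 but P(person|img2) = 1.0, so img2 outranks img1
# for the query "person" even though both contain a person; the Bernoulli model
# gives both images P(person) = 1.0 and avoids the bias.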
It has been argued [14] that the Corel dataset is much
easier to annotate and retrieve and does not really capture
the difficulties inherent in more challenging (real) datasets
like the news videos in Trec Video [12]. We therefore experimented
with a subset of news videos (ABC, CNN) from the
Trec Video dataset. We show that in fact we obtain compa-
rable or even better performance (depending on the task) on
this dataset and that again the Bernoulli model outperforms
a multinomial model.
The specific contributions of this work include:
1. A probabilistic generative model which uses a
Bernoulli process to generate words and kernel den-
sity estimate to generate image features. This model
simultaneously learns the joint probabilities of associ-
ating words with image features using a training set
of images with keywords and then generates multiple
probabilistic annotations for each image.
2. Significant improvements in annotation performance
over a number of other models on both a standard
Corel dataset and a real-world news video dataset.
3. Large improvements in annotation performance by us-
ing a rectangular grid instead of regions obtained using
a segmentation algorithm (see [4] for a related result).
4. Substantial improvements in retrieval performance on
one word queries over a multinomial model.
The focus of this paper is on models and not on features.
We use features similar to those used in [5, 3].
The rest of this paper is organized as follows. We first
discuss the multiple Bernoulli relevance model and its rela-
tion to the multinomial relevance model. This is followed
by a discussion of related work in this area. The next sec-
tion describes the datasets and the results obtained. Finally,
we conclude the paper.
2. Multiple-Bernoulli Relevance Model
In this section we describe a statistical model for auto-
matic annotation of images and video frames. Our model is
called Multiple-Bernoulli Relevance Model (MBRM) and
is based on the Continuous-space Relevance Model (CRM)
proposed by [8]. CRM has proved to be very successful
on the tasks of automatic image annotation and retrieval.
In the rest of this section we discuss two shortcomings of
the CRM in the video domain and propose a possible way
of addressing these shortcomings. We then provide a for-
mal description of our model as a generative process and
complete the section with a brief discussion of estimation
details.
2.1 Relation of MBRM and CRM
CRM[8] is a probabilistic model for image annotation and
retrieval. The basic idea behind CRM is to reduce an image
to a set of real-valued feature vectors, and then model the
joint probability of observing feature vectors with possible
annotation words. The feature vectors in [8] are based on
automatic segmentation[10] of the target image into regions
and are modeled using a kernel-based probability density
function. The annotation words are modeled with a multi-
nomial distribution. The joint distribution in [8] of words
and feature vectors relies on a doubly non-parametric ap-
proach, where expectations are computed over each anno-
tated image in the training set.
We believe the CRM model makes two assumptions that
make it ill-suited for annotations in the image/video do-
main.
1. Segmentation: The CRM relies on automatic seg-
mentation of the image into semantically-coherent re-
gions. While the CRM does not make any assumptions
about correspondence of annotation words to image re-
gions, the overall annotation performance is strongly
affected by the quality of segmentation. In addition,
automatic segmentation is a rather expensive process
that is poorly suited for large-scale video datasets.
2. Multinomial: CRM assumes that annotation words
for any given image follow a multinomial distribu-
tion. This is a reasonable assumption in the Corel[5]
dataset, where all annotations are approximately equal
in length and words reflect prominence of objects in the
image. However, in our video dataset [12] individual
frames have hierarchical annotations which do not fol-
low the multinomial distribution. The length of the an-
notations also varies widely for different video frames.
Furthermore, video annotations focus on presence of
an object in a frame, rather than its prominence.
In the next two subsections we show how we can improve
results by modifying these assumptions.
2.1.1 Rectangular image regions
In the current model, rather than attempting segmentation,
we impose a fixed-size rectangular grid on each image. The
image is then represented as a set of tiles. Using a grid
provides a number of advantages. First, there is a very sig-
nificant reduction in the computational time required for the
model. Second, each image now contains a fixed number of
regions, which simplifies parameter estimation. Finally, us-
ing a grid makes it somewhat easier to incorporate context
into the model. For example, relative position could greatly
aid in distinguishing adjacent tiles of water and sky. To eval-
uate the effect of using rectangular regions versus segmen-
tation, we ran experiments with the CRM model but with
rectangular regions as input - we call this CRM-Rectangles.
The experiments in Section 4 show that this alone improves
the mean per-word precision by about 38% - a substantial
improvement in performance. We believe this is because
segmentation is done on a per image basis. The CRM model
cannot undo any problems that occur with segmentation.
However, using a rectangular grid (with more regions than
produced by the segmentation) allows the model to learn
using a much larger set of training images what the correct
association of words and image regions should be.
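As a rough illustration of the grid representation, the following sketch partitions an image into a fixed grid and computes one feature vector per tile. This is written under our own assumptions: the grid size and the simple per-tile statistics are placeholders, not the paper's colour/texture features.

import numpy as np

def grid_features(image, rows, cols):
    # Partition an image (H x W x C array) into a fixed rows x cols grid and compute
    # a feature vector per tile. Mean and standard deviation of each channel are used
    # here purely as placeholders for the colour/texture features of the paper.
    h, w = image.shape[:2]
    features = []
    for i in range(rows):
        for j in range(cols):
            tile = image[i * h // rows:(i + 1) * h // rows,
                         j * w // cols:(j + 1) * w // cols]
            features.append(np.concatenate([tile.mean(axis=(0, 1)), tile.std(axis=(0, 1))]))
    return np.stack(features)   # shape: (rows * cols, feature_dim)

# Every image yields the same, fixed number of tiles, which is what simplifies
# parameter estimation relative to per-image segmentation.
image = np.random.rand(240, 360, 3)          # placeholder image
g = grid_features(image, rows=4, cols=6)     # a 24-tile grid, chosen here arbitrarily
print(g.shape)                               # (24, 6)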
2.1.2 Multiple-Bernoulli word model
Another major contribution of the current model over the
CRM is in our use of the multiple-Bernoulli distribution
for modeling image annotations. In this section we high-
light the differences between the multiple-Bernoulli and
the multinomial model, and articulate why we believe that
multiple-Bernoulli is a better alternative.
The multinomial model is meant to reflect the promi-
nence of words in a given annotation. The event space of
the model is the set of all strings over a given vocabulary,
and consequently words can appear multiple times in the an-
notation. In addition, the probability mass is shared by all
words in the vocabulary, and during the estimation process
the words compete for this probability mass. As a result,
an image I_1 annotated with a single word “face” will assign all probability mass to that word, so P(face|I_1) = 1. At the same time, an image I_2 annotated with two words “face” and “person” will split the probability mass, so P(face|I_2) = 1/2. Thus the multinomial distribution models prominence of a word in the annotation, favoring single words, or words that occur multiple times in an annotation.

Arguably, both images I_1 and I_2 contain a face, so the probability of “face” should be equal. This can be modeled by a multiple-Bernoulli model, which explicitly focuses on presence or absence of words in the annotation, rather than on their prominence. The event space of the multiple-Bernoulli model is the set of all subsets of a given vocabulary. Each subset can be represented as a binary occurrence vector in {0, 1}^V. Individual components of the vector are assumed to be independent and identically (Bernoulli-) distributed given the particular image.

Figure 1: MBRM viewed as a generative process. The annotation w is a binary vector sampled from the underlying multiple-Bernoulli model. The image is produced by first sampling a set of feature vectors {g_1 . . . g_n}, and then generating image regions {r_1 . . . r_n} from the feature vectors. Resulting regions are tiled to form the image.
In our dataset, image annotations are hierarchical and
have greatly varying length. No word is ever used more than
once in any given annotation, so modeling word frequency
is pointless. Finally, words are assigned to the annotation
based on merely the presence of an object in a frame, not
on its prominence. We believe that a Bernoulli model pro-
vides a much closer match for this environment. Our hy-
pothesis is supported by experimental results which will be
discussed in section 4.
2.2 MBRM as a generative model
Let V denote the annotation vocabulary, T denote the train-
ing set of annotated images, and let J be an element of T .
According to the previous section, J is represented as a set of image regions r_J = {r_1 . . . r_n} along with the corresponding annotation w_J ∈ {0, 1}^V. We assume that the process that generated J is based on two distinct probability distributions. First, we assume that the set of annotation words w_J is a result of |V| independent samples from every component of some underlying multiple-Bernoulli distribution P_V(·|J). Second, for each image region r we sample a real-valued feature vector g of dimension k. The feature vector is sampled from some underlying multi-variate density function P_G(·|J). Finally, the rectangular region r is produced according to some unknown distribution conditioned on g. We make no attempt to model the process of generating r from g. The resulting regions r_1 . . . r_n are tiled to form the image.
Now let r_A = {g_1 . . . g_{n_A}} denote the feature vectors of some image A, which is not in the training set T. Similarly, let w_B be some arbitrary subset of V. We would like to model P(r_A, w_B), the joint probability of observing an image defined by r_A together with annotation words w_B. We hypothesize that the observation {r_A, w_B} came from the same process that generated one of the images J in the training set T. However, we don’t know which process that was, and so we compute an expectation over all images J ∈ T. The overall process for jointly generating w_B and r_A is as follows (a small code sketch of this sampling process appears after the list):
1. Pick a training image J ∈ T with probability P_T(J).
2. Sample w_B from a multiple-Bernoulli model P_V(·|J).
3. For a = 1 . . . n_A:
   (a) Sample a generator vector g_a from the probability density P_G(·|J).
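A minimal sketch of this sampling process, assuming the component distributions of Section 2.3 have already been estimated; the function and variable names are our own, and sample_g stands in for a sampler from the kernel density P_G(·|J):

import numpy as np

def sample_observation(training_images, p_V, sample_g, n_A, rng=None):
    # Generate an annotation vector w_B and n_A feature vectors following the
    # three-step process above.
    #   training_images : list of training image identifiers J
    #   p_V             : dict mapping J -> array of Bernoulli parameters P_V(v|J)
    #   sample_g        : function (J, rng) -> one feature vector drawn from P_G(.|J)
    rng = rng or np.random.default_rng()
    # 1. Pick a training image J with the uniform prior P_T(J) = 1/N_T.
    J = training_images[rng.integers(len(training_images))]
    # 2. Sample a binary annotation vector w_B: one independent Bernoulli draw per word.
    w_B = (rng.random(p_V[J].shape[0]) < p_V[J]).astype(int)
    # 3. Sample n_A generator vectors g_a from the density P_G(.|J).
    g = np.stack([sample_g(J, rng) for _ in range(n_A)])
    return w_B, g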
Figure 1 shows a graphical dependency diagram for the
generative process outlined above. We show the process of
generating a simple image consisting of three regions and a
corresponding 3-word annotation. Note that the number of
words in the annotation n_B does not have to be the same as the number of image regions n_A. Formally, the probability of a joint observation {r_A, w_B} is given by:

P(r_A, w_B) = \sum_{J \in T} \Big\{ P_T(J) \prod_{a=1}^{n_A} P_G(g_a|J) \times \prod_{v \in w_B} P_V(v|J) \prod_{v \notin w_B} \big(1 - P_V(v|J)\big) \Big\}    (1)
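A direct, unoptimized sketch of equation (1), computed in log space for numerical stability; log_p_G and p_V stand for the kernel density of equation (3) and the smoothed Bernoulli parameters of equation (4) and are assumed to be supplied (all names are ours):

import numpy as np

def log_joint(g_A, w_B, training_images, log_p_G, p_V):
    # log P(r_A, w_B) as in equation (1).
    #   g_A             : (n_A, k) array of feature vectors of the test image
    #   w_B             : (V,) binary vector marking the candidate annotation words
    #   training_images : list of training image identifiers J
    #   log_p_G         : function (g, J) -> log P_G(g|J)      (equation (3))
    #   p_V             : dict J -> (V,) array of P_V(v|J)     (equation (4))
    log_terms = []
    for J in training_images:
        term = -np.log(len(training_images))                      # log P_T(J), uniform prior
        term += sum(log_p_G(g, J) for g in g_A)                   # feature-vector factor
        pv = p_V[J]
        term += np.sum(np.log(np.where(w_B == 1, pv, 1.0 - pv)))  # Bernoulli word factor
        log_terms.append(term)
    m = max(log_terms)                                             # log-sum-exp over J in T
    return m + np.log(sum(np.exp(t - m) for t in log_terms))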
Equation (1) makes it evident how we can use MBRM
for annotating new images or video frames. Given a new
(un-annotated) image we can split it into regions r_A, compute feature vectors g_1 . . . g_n for each region and then use equation (1) to determine what subset of vocabulary w^* is most likely to co-occur with the set of feature vectors:

w^* = \arg\max_{w \in \{0,1\}^V} \frac{P(r_A, w)}{P(r_A)}    (2)
In practice we only consider subsets of a fixed size (5
words). One can show that the maximization in equation (2)
can be done very efficiently because of the factored nature
of the Bernoulli component. Essentially it can be shown that
the equations may be simplified so that P(w_i|J) may be
computed independently for each word. This simplification
arises because each word occurs at most once as the caption
of an image. Space constraints preclude us from providing
the proof.
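A sketch of that factored per-word scoring, under the same assumptions as the previous snippet: each word v is scored by the sum over J of P_T(J) times the product of P_G(g_a|J) times P_V(v|J), and the highest-scoring words form the annotation (the fixed size of five follows the paper).

import numpy as np

def annotate(g_A, training_images, log_p_G, p_V, vocabulary, num_words=5):
    # Rank vocabulary words independently and keep the top num_words, exploiting the
    # factored form of the Bernoulli component.
    n_T = len(training_images)
    # Per-training-image log weight: log P_T(J) + sum_a log P_G(g_a|J)
    log_w = np.array([-np.log(n_T) + sum(log_p_G(g, J) for g in g_A)
                      for J in training_images])
    weights = np.exp(log_w - log_w.max())                 # rescaled for stability
    pv = np.stack([p_V[J] for J in training_images])      # shape (N_T, V)
    word_scores = weights @ pv                            # proportional to P(v present, r_A)
    top = np.argsort(word_scores)[::-1][:num_words]
    return [vocabulary[i] for i in top]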
2.3 Estimating Parameters of the Model
In this section we will discuss simple but effective estima-
tion techniques for the three components of the model: P_T, P_V and P_G. P_T(J) is the probability of selecting the underlying model of image J to generate some new observation r, w. In the absence of any task knowledge we use a uniform prior P_T(J) = 1/N_T, where N_T is the size of the training set.
P_G(·|J) is a density function responsible for generating the feature vectors g_1 . . . g_n, which are later mapped to image regions r_J according to P_R. We use a non-parametric kernel-based density estimate for the distribution P_G. Assuming g_J = {g_1 . . . g_n} to be the set of regions of image J we estimate:

P_G(g|J) = \frac{1}{n} \sum_{i=1}^{n} \frac{\exp\big( -(g - g_i)^\top \Sigma^{-1} (g - g_i) \big)}{\sqrt{2^k \pi^k |\Sigma|}}    (3)
Equation (3) arises out of placing a Gaussian kernel over
the feature vector g_i of every region of image J. Each kernel is parametrized by the feature covariance matrix Σ. As a matter of convenience we assumed Σ = β·I, where I is the identity matrix. β plays the role of kernel bandwidth: it determines the smoothness of P_G around the support point g_i. The value of β is selected empirically on a held-out portion of the training set T.
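A small sketch of the kernel density of equation (3) with Σ = β·I, evaluated in log space (our own code; β would be tuned on held-out data as described above):

import numpy as np

def log_p_G(g, g_J, beta):
    # log P_G(g|J) from equation (3): a Gaussian-shaped kernel with covariance beta*I
    # placed over each feature vector g_i of training image J.
    #   g    : (k,) query feature vector
    #   g_J  : (n, k) array of the feature vectors of training image J
    #   beta : kernel bandwidth
    n, k = g_J.shape
    sq_mahal = np.sum((g_J - g) ** 2, axis=1) / beta      # (g - g_i)^T Sigma^{-1} (g - g_i)
    log_kernels = -sq_mahal - 0.5 * k * np.log(2 * np.pi * beta)
    m = log_kernels.max()
    return -np.log(n) + m + np.log(np.exp(log_kernels - m).sum())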
P_V(v|J) is the v-th component of the multiple-Bernoulli distribution that is assumed to have generated the annotation w_J of image J ∈ T. The Bayes estimate using a beta prior (conjugate to a Bernoulli) for each word is given by:

P_V(v|J) = \frac{\mu \, \delta_{v,J} + N_v}{\mu + N}    (4)

Here µ is a smoothing parameter estimated using the training and validation set, δ_{v,J} = 1 if the word v occurs in the annotation of image J and zero otherwise. N_v is the number of training images that contain v in the annotation and N is the total number of training images.
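A sketch of the smoothed Bernoulli estimate in equation (4) (our own code; annotations would be the training captions and µ the value chosen on the validation set):

import numpy as np

def estimate_p_V(annotations, vocabulary, mu):
    # Return the Bayes estimate P_V(v|J) of equation (4) for every training image J.
    #   annotations : list of sets of annotation words, one per training image J
    #   mu          : smoothing parameter, tuned on the validation set
    N = len(annotations)
    N_v = np.array([sum(1 for ann in annotations if v in ann) for v in vocabulary],
                   dtype=float)                       # number of images containing word v
    p_V = {}
    for J, ann in enumerate(annotations):
        delta = np.array([1.0 if v in ann else 0.0 for v in vocabulary])
        p_V[J] = (mu * delta + N_v) / (mu + N)
    return p_V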
3. Related Work
Our model differs from traditional object recognition ap-
proaches in a number of ways (for example, [9, 13, 1, 6, 4, 11]). Such approaches require a separate model to be trained for each object to be recognized. That is, even though the form of the statistical model may be the same, learning two
different objects like a car and a person requires two sepa-
rate training runs (one for each object). Each training run
requires positive and negative examples for that particular
object. On the other hand, in the relevance model approach
described here all the annotation words are learned at the
same time - each training image usually has many anno-
tations. While some of the newer object recognition tech-
niques [6] do not require training examples of the objects to
be cut out of the background, they still seem to require one
object in each image. Our model on the other hand can han-
dle multiple objects in the same training image and can also
ascribe annotations to the backgrounds like sky and grass.
Unlike the more traditional object recognition techniques
we label the entire picture and not specific image regions in
a picture. This is, as a librarian’s manual annotation shows, more than sufficient for tasks like retrieving images from a large database. The joint probability model that we propose takes context into account, i.e., from training images it
learns that an elephant is more likely to be associated with
grass and sky and less likely to be associated with buildings
and hence if there are image regions associated with grass,
this increases the probability of recognizing the object as an
elephant. Traditional object recognition models do not do
this.
The model described here is closest in spirit to the an-
notation models proposed by [5, 3, 7, 8, 2]. Duygulu et
al [5] proposed to describe images using a vocabulary of
blobs. First, regions are created using a segmentation al-
gorithm like normalized cuts. For each region, features are
computed and then blobs are generated by clustering the
image features for these regions across images. Each im-
age is generated by using a certain number of these blobs.
Their Translation Model applies one of the classical statis-
tical machine translation models to translate from the set of
keywords of an image to the set of blobs forming the image.
On the surface, MBRM appears to be similar to one of
the intermediate models considered by Blei and Jordan [3].
Specifically, their GM-mixture model employs a similar de-
pendence structure among the random variables involved.
However, the topological structure of MBRM is quite dif-
ferent from the one employed by [3]. GM-mixture assumes
a low-dimensional topology, leading to a fully-parametric
model where 200 or so “latent aspects” are estimated us-
ing the EM algorithm. To contrast that, MBRM makes no
assumptions about the topological structure, and leads to
a doubly non-parametric approach, where expectations are
computed over every individual point in the training set.
In addition they model words using a multinomial process.
Blei and Jordan used a different subset of the Corel dataset
and hence it is difficult to make a direct quantitative com-
parison with their models.
MBRM is also related to the cross-media relevance
model (CMRM) [7], which is also doubly non-parametric.
There are three significant differences between MBRM and
CMRM. First, CMRM is a discrete model and cannot take
advantage of continuous features. In order to use CMRM
for image annotation we have to quantize continuous feature
vectors into a discrete vocabulary (similar to the translation models [5]). MBRM, on the other hand, directly
models continuous features. The second difference is that
CMRM relies on clustering of the feature vectors into blobs.
Annotation quality of the CMRM is very sensitive to clus-
tering errors, and depends on being able to a-priori select
the right cluster granularity: too many clusters will result
in extreme sparseness of the space, while too few will lead
us to confuse different objects in the images. MBRM does
not rely on clustering and consequently does not suffer from
the granularity issues. Finally, CMRM also models words
using a multinomial process.
We would like to stress that the difference between
MBRM and previously discussed models is not merely con-
ceptual. In section 4 we will show that MBRM performs
significantly better than all previously proposed models on
the tasks of image annotation and retrieval. To ensure a fair
comparison, we show results on exactly the same data set
and similar feature representations as used in [5, 7, 8].
4. Experimental Results
We tested the algorithms using two different datasets, the
Corel data set from Duygulu et al [5] and a set of video key
frames from NIST’s Video Trec [12]. To provide a mean-
ingful comparison between MBRM and CRM-Rectangles,
we do comparative experiments using the same set of fea-
tures extracted from the same set of rectangular grids. For
the Corel dataset we also compare the results with those of
Duygulu et al and the CRM model.
4.1. Datasets and Feature sets
The Corel data set consists of 5000 images from 50 Corel
Stock Photo CDs (see footnote 1). Each CD includes 100 images on the
same topic, and each image is also associated with 1-5 key-
words. Overall there are 371 keywords in the dataset. In
experiments, we divided this dataset into 3 parts: a training
set of 4000 images, a validation set of 500 images and a test
set of 500 images. The validation set is used to find model
parameters. After finding the parameters, we merged the
4000 training set and 500 validation set to form a new train-
ing set. This corresponds to the training set of 4500 images
and the test set of 500 images used by Duygulu et al [5].
We used a subset of NIST’s Video Trec dataset (for com-
putational reasons we did not use the entire data set). The
Footnote 1: We thank Kobus Barnard for making the Corel dataset available at http://www.cs.arizona.edu/people/kobus/research/data/eccv 2002
