Multiple Bernoulli Relevance Models for Image and Video Annotation
S. L. Feng, R. Manmatha and V. Lavrenko
Multimedia Indexing and Retrieval Group
Center for Intelligent Information Retrieval
University of Massachusetts
Amherst, MA, 01003
Abstract
Retrieving images in response to textual queries requires
some knowledge of the semantics of the picture. Here, we
show how we can do both automatic image annotation and
retrieval (using one word queries) from images and videos
using a multiple Bernoulli relevance model. The model as-
sumes that a training set of images or videos along with
keyword annotations is provided. Multiple keywords are
provided for an image and the specific correspondence be-
tween a keyword and an image is not provided. Each im-
age is partitioned into a set of rectangular regions and a
real-valued feature vector is computed over these regions.
The relevance model is a joint probability distribution of
the word annotations and the image feature vectors and is
computed using the training set. The word probabilities are
estimated using a multiple Bernoulli model and the image
feature probabilities using a non-parametric kernel density
estimate. The model is then used to annotate images in
a test set. We show experiments on both images from a
standard Corel data set and a set of video key frames from
NIST’s Video Trec. Comparative experiments show that the
model performs better than a model based on estimating
word probabilities using the popular multinomial distribu-
tion. The results also show that our model significantly out-
performs previously reported results on the task of image
and video annotation.
1. Introduction
Searching and finding large numbers of images and videos
from a database is a challenging problem. The conventional
approach to this problem is to search on image attributes
like color and texture. Such approaches suffer from a num-
ber of problems. They do not really capture the semantics
This work was supported in part by the Center for Intelligent Informa-
tion Retrieval and in part by the National Science Foundation under grant
number IIS-9909073 and in part by SPAWARSYSCEN-SD grant number
N66001-02-1-8903. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor.
of the problem well and they often require people to pose
image queries using color or texture which is difficult for
most people to do. The traditional “low-tech” solution to
this problem practiced by librarians is to annotate each im-
age manually with keywords or captions and then search on
those captions or keywords using a conventional text search
engine. The rationale here is that the keywords capture the
semantic content of the image and help in retrieving the im-
ages. This technique is also used by television news orga-
nizations to retrieve file footage from their videos. While
“low-tech”, such techniques allow text queries and are suc-
cessful in finding the relevant pictures. The main disadvan-
tage with manual annotations is the cost and difficulty of
scaling it to large numbers of images.
Automatically annotating images/videos would solve
this problem while still retaining the advantages of a se-
mantic search. Here, we propose approaches to automati-
cally annotating and retrieving images/videos by learning a
statistical generative model called a relevance model using a
set of annotated training images. The images are partitioned
into rectangles and features are computed over these rectan-
gles. We then learn a joint probability model for (continu-
ous) image features and words called a relevance model and
use this model to annotate test images which we have not
seen. Words are modeled using a multiple Bernoulli pro-
cess and images modeled using a kernel density estimate.
We test this model using a Corel dataset provided by [5]
and show that it outperforms previously reported results on
other models. It performs 4 times better than a model based
on machine translation [5] and better than one which models
word probabilities using a multinomial distribution.
Existing annotation models [5, 3, 7, 8], by analogy with the
text retrieval world, have used the multinomial distribution
to model annotation words. We believe that annotation text
has very different characteristics than full text in documents
and hence a Bernoulli distribution is more appropriate.
In image/video annotation, a multinomial would split the
probability mass between multiple words. For example, if
an image was annotated with “person, grass”, with perfect
annotation the probability for each word would be equal to
0.5. On the other hand, another image which has just one
annotation “person” would have a probability of 1.0 with
perfect annotation. If we want to find images of people,
when rank ordering these images by probability the second
image would be preferred to the first although there is no
reason for preferring one image over another. The problem
can be made much worse when the annotation lengths for
different images differ substantially. A similar effect occurs
when annotations are hierarchical. For example, let one im-
age be annotated “face, male face, Bill Clinton” and a sec-
ond image be annotated with just “face”. The probability
mass would be split three ways (0.33 each) in the first case
while in the second image “face” would have a probability
of 1. Again the second image would be preferred for the
query “face”, although there is no reason for preferring one
over the other. The Bernoulli model avoids this problem
by making decisions about each annotation independent of
the other words. Thus, in all the above examples, each of
the words would have a probability of 1 (assuming perfect
annotation).
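To make the contrast concrete, here is a small Python sketch (our own toy vocabulary and annotations, not the paper's data) that computes the per-word probabilities a maximum-likelihood multinomial and a multiple-Bernoulli model would assign in the situations described above:

# Toy illustration (our own example, not the paper's data) of how a maximum-likelihood
# multinomial splits probability mass across annotation words, while a multiple-Bernoulli
# model scores each word's presence independently.

def multinomial_word_probs(annotation):
    # ML multinomial: the probability mass is shared equally among the annotation words.
    return {w: 1.0 / len(annotation) for w in annotation}

def bernoulli_word_probs(annotation, vocabulary):
    # Multiple Bernoulli: each vocabulary word is either present (1.0) or absent (0.0).
    return {w: (1.0 if w in annotation else 0.0) for w in vocabulary}

vocabulary = ["person", "grass", "face", "male face", "Bill Clinton"]
images = {
    "img1": ["person", "grass"],                    # two annotation words
    "img2": ["person"],                             # a single annotation word
    "img3": ["face", "male face", "Bill Clinton"],  # hierarchical annotation
    "img4": ["face"],
}

for name, ann in images.items():
    print(name, "multinomial:", multinomial_word_probs(ann))
    print(name, "Bernoulli:  ", bernoulli_word_probs(ann, vocabulary))

# Multinomial: P(person|img1) = 0.5 but P(person|img2) = 1.0, so img2 outranks img1
# for the query "person" even though both contain a person; the Bernoulli model
# gives both images P(person) = 1.0 and avoids the bias.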
It has been argued [14] that the Corel dataset is much
easier to annotate and retrieve and does not really capture
the difficulties inherent in more challenging (real) datasets
like the news videos in Trec Video [12]. We therefore experimented
with a subset of news videos (ABC, CNN) from the
Trec Video dataset. We show that in fact we obtain compa-
rable or even better performance (depending on the task) on
this dataset and that again the Bernoulli model outperforms
a multinomial model.
The specific contributions of this work include:
1. A probabilistic generative model which uses a
Bernoulli process to generate words and kernel den-
sity estimate to generate image features. This model
simultaneously learns the joint probabilities of associ-
ating words with image features using a training set
of images with keywords and then generates multiple
probabilistic annotations for each image.
2. Significant improvements in annotation performance
over a number of other models on both a standard
Corel dataset and a real-world news video dataset.
3. Large improvements in annotation performance by us-
ing a rectangular grid instead of regions obtained using
a segmentation algorithm (see [4] for a related result).
4. Substantial improvements in retrieval performance on
one word queries over a multinomial model.
The focus of this paper is on models and not on features.
We use features similar to those used in [5, 3].
The rest of this paper is organized as follows. We first
discuss the multiple Bernoulli relevance model and its rela-
tion to the multinomial relevance model. This is followed
by a discussion of related work in this area. The next sec-
tion describes the datasets and the results obtained. Finally,
we conclude the paper.
2. Multiple-Bernoulli Relevance Model
In this section we describe a statistical model for auto-
matic annotation of images and video frames. Our model is
called Multiple-Bernoulli Relevance Model (MBRM) and
is based on the Continuous-space Relevance Model (CRM)
proposed by [8]. CRM has proved to be very successful
on the tasks of automatic image annotation and retrieval.
In the rest of this section we discuss two shortcomings of
the CRM in the video domain and propose a possible way
of addressing these shortcomings. We then provide a for-
mal description of our model as a generative process and
complete the section with a brief discussion of estimation
details.
2.1 Relation of MBRM and CRM
CRM[8] is a probabilistic model for image annotation and
retrieval. The basic idea behind CRM is to reduce an image
to a set of real-valued feature vectors, and then model the
joint probability of observing feature vectors with possible
annotation words. The feature vectors in [8] are based on
automatic segmentation[10] of the target image into regions
and are modeled using a kernel-based probability density
function. The annotation words are modeled with a multi-
nomial distribution. The joint distribution in [8] of words
and feature vectors relies on a doubly non-parametric ap-
proach, where expectations are computed over each anno-
tated image in the training set.
We believe the CRM model makes two assumptions that
make it ill-suited for annotations in the image/video do-
main.
1. Segmentation: The CRM relies on automatic seg-
mentation of the image into semantically-coherent re-
gions. While the CRM does not make any assumptions
about correspondence of annotation words to image re-
gions, the overall annotation performance is strongly
affected by the quality of segmentation. In addition,
automatic segmentation is a rather expensive process
that is poorly suited for large-scale video datasets.
2. Multinomial: CRM assumes that annotation words
for any given image follow a multinomial distribu-
tion. This is a reasonable assumption in the Corel[5]
dataset, where all annotations are approximately equal
in length and words reflect prominence of objects in the
image. However, in our video dataset [12] individual
frames have hierarchical annotations which do not fol-
low the multinomial distribution. The length of the an-
notations also varies widely for different video frames.
Furthermore, video annotations focus on presence of
an object in a frame, rather than its prominence.
In the next two subsections we show how we can improve
results by modifying these assumptions.
2.1.1 Rectangular image regions
In the current model, rather than attempting segmentation,
we impose a fixed-size rectangular grid on each image. The
image is then represented as a set of tiles. Using a grid
provides a number of advantages. First, there is a very sig-
nificant reduction in the computational time required for the
model. Second, each image now contains a fixed number of
regions, which simplifies parameter estimation. Finally, us-
ing a grid makes it somewhat easier to incorporate context
into the model. For example, relative position could greatly
aid in distinguishing adjacent tiles of water and sky. To eval-
uate the effect of using rectangular regions versus segmen-
tation, we ran experiments with the CRM model but with
rectangular regions as input - we call this CRM-Rectangles.
The experiments in Section 4 show that this alone improves
the mean per-word precision by about 38% - a substantial
improvement in performance. We believe this is because
segmentation is done on a per image basis. The CRM model
cannot undo any problems that occur with segmentation.
However, using a rectangular grid (with more regions than
produced by the segmentation) allows the model to learn
using a much larger set of training images what the correct
association of words and image regions should be.
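As a rough illustration of the grid representation, the following sketch partitions an image into a fixed grid and computes one feature vector per tile. This is written under our own assumptions: the grid size and the simple per-tile statistics are placeholders, not the paper's colour/texture features.

import numpy as np

def grid_features(image, rows, cols):
    # Partition an image (H x W x C array) into a fixed rows x cols grid and compute
    # a feature vector per tile. Mean and standard deviation of each channel are used
    # here purely as placeholders for the colour/texture features of the paper.
    h, w = image.shape[:2]
    features = []
    for i in range(rows):
        for j in range(cols):
            tile = image[i * h // rows:(i + 1) * h // rows,
                         j * w // cols:(j + 1) * w // cols]
            features.append(np.concatenate([tile.mean(axis=(0, 1)), tile.std(axis=(0, 1))]))
    return np.stack(features)   # shape: (rows * cols, feature_dim)

# Every image yields the same, fixed number of tiles, which is what simplifies
# parameter estimation relative to per-image segmentation.
image = np.random.rand(240, 360, 3)          # placeholder image
g = grid_features(image, rows=4, cols=6)     # a 24-tile grid, chosen here arbitrarily
print(g.shape)                               # (24, 6)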
2.1.2 Multiple-Bernoulli word model
Another major contribution of the current model over the
CRM is in our use of the multiple-Bernoulli distribution
for modeling image annotations. In this section we high-
light the differences between the multiple-Bernoulli and
the multinomial model, and articulate why we believe that
multiple-Bernoulli is a better alternative.
The multinomial model is meant to reflect the promi-
nence of words in a given annotation. The event space of
the model is the set of all strings over a given vocabulary,
and consequently words can appear multiple times in the an-
notation. In addition, the probability mass is shared by all
words in the vocabulary, and during the estimation process
the words compete for this probability mass. As a result,
an image I_1 annotated with a single word “face” will assign all probability mass to that word, so P(face|I_1) = 1. At the same time, an image I_2 annotated with two words “face” and “person” will split the probability mass, so P(face|I_2) = 1/2. Thus the multinomial distribution models prominence of a word in the annotation, favoring single words, or words that occur multiple times in an annotation.

Arguably, both images I_1 and I_2 contain a face, so the probability of “face” should be equal. This can be modeled by a multiple-Bernoulli model, which explicitly focuses on presence or absence of words in the annotation, rather than on their prominence. The event space of the multiple-Bernoulli model is the set of all subsets of a given vocabulary. Each subset can be represented as a binary occurrence vector in {0, 1}^V. Individual components of the vector are assumed to be independent and identically (Bernoulli-) distributed given the particular image.

Figure 1: MBRM viewed as a generative process. The annotation w is a binary vector sampled from the underlying multiple-Bernoulli model. The image is produced by first sampling a set of feature vectors {g_1 . . . g_n}, and then generating image regions {r_1 . . . r_n} from the feature vectors. Resulting regions are tiled to form the image.
In our dataset, image annotations are hierarchical and
have greatly varying length. No word is ever used more than
once in any given annotation, so modeling word frequency
is pointless. Finally, words are assigned to the annotation
based on merely the presence of an object in a frame, not
on its prominence. We believe that a Bernoulli model pro-
vides a much closer match for this environment. Our hy-
pothesis is supported by experimental results which will be
discussed in section 4.
2.2 MBRM as a generative model
Let V denote the annotation vocabulary, T denote the train-
ing set of annotated images, and let J be an element of T .
According to the previous section, J is represented as a set of image regions r_J = {r_1 . . . r_n} along with the corresponding annotation w_J ∈ {0, 1}^V. We assume that the process that generated J is based on two distinct probability distributions. First, we assume that the set of annotation words w_J is a result of |V| independent samples from every component of some underlying multiple-Bernoulli distribution P_V(·|J). Second, for each image region r we sample a real-valued feature vector g of dimension k. The feature vector is sampled from some underlying multi-variate density function P_G(·|J). Finally, the rectangular region r is produced according to some unknown distribution conditioned on g. We make no attempt to model the process of generating r from g. The resulting regions r_1 . . . r_n are tiled to form the image.
Now let r_A = {g_1 . . . g_{n_A}} denote the feature vectors of some image A, which is not in the training set T. Similarly, let w_B be some arbitrary subset of V. We would like to model P(r_A, w_B), the joint probability of observing an image defined by r_A together with annotation words w_B. We hypothesize that the observation {r_A, w_B} came from the same process that generated one of the images J in the training set T. However, we don’t know which process that was, and so we compute an expectation over all images J ∈ T. The overall process for jointly generating w_B and r_A is as follows (a small code sketch of this sampling process appears after the list):
1. Pick a training image J ∈ T with probability P_T(J).
2. Sample w_B from a multiple-Bernoulli model P_V(·|J).
3. For a = 1 . . . n_A:
   (a) Sample a generator vector g_a from the probability density P_G(·|J).
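A minimal sketch of this sampling process, assuming the component distributions of Section 2.3 have already been estimated; the function and variable names are our own, and sample_g stands in for a sampler from the kernel density P_G(·|J):

import numpy as np

def sample_observation(training_images, p_V, sample_g, n_A, rng=None):
    # Generate an annotation vector w_B and n_A feature vectors following the
    # three-step process above.
    #   training_images : list of training image identifiers J
    #   p_V             : dict mapping J -> array of Bernoulli parameters P_V(v|J)
    #   sample_g        : function (J, rng) -> one feature vector drawn from P_G(.|J)
    rng = rng or np.random.default_rng()
    # 1. Pick a training image J with the uniform prior P_T(J) = 1/N_T.
    J = training_images[rng.integers(len(training_images))]
    # 2. Sample a binary annotation vector w_B: one independent Bernoulli draw per word.
    w_B = (rng.random(p_V[J].shape[0]) < p_V[J]).astype(int)
    # 3. Sample n_A generator vectors g_a from the density P_G(.|J).
    g = np.stack([sample_g(J, rng) for _ in range(n_A)])
    return w_B, g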
Figure 1 shows a graphical dependency diagram for the
generative process outlined above. We show the process of
generating a simple image consisting of three regions and a
corresponding 3-word annotation. Note that the number of
words in the annotation n_B does not have to be the same as the number of image regions n_A. Formally, the probability of a joint observation {r_A, w_B} is given by:

P(r_A, w_B) = \sum_{J \in T} \Big\{ P_T(J) \prod_{a=1}^{n_A} P_G(g_a|J) \times \prod_{v \in w_B} P_V(v|J) \prod_{v \notin w_B} \big(1 - P_V(v|J)\big) \Big\}    (1)
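A direct, unoptimized sketch of equation (1), computed in log space for numerical stability; log_p_G and p_V stand for the kernel density of equation (3) and the smoothed Bernoulli parameters of equation (4) and are assumed to be supplied (all names are ours):

import numpy as np

def log_joint(g_A, w_B, training_images, log_p_G, p_V):
    # log P(r_A, w_B) as in equation (1).
    #   g_A             : (n_A, k) array of feature vectors of the test image
    #   w_B             : (V,) binary vector marking the candidate annotation words
    #   training_images : list of training image identifiers J
    #   log_p_G         : function (g, J) -> log P_G(g|J)      (equation (3))
    #   p_V             : dict J -> (V,) array of P_V(v|J)     (equation (4))
    log_terms = []
    for J in training_images:
        term = -np.log(len(training_images))                      # log P_T(J), uniform prior
        term += sum(log_p_G(g, J) for g in g_A)                   # feature-vector factor
        pv = p_V[J]
        term += np.sum(np.log(np.where(w_B == 1, pv, 1.0 - pv)))  # Bernoulli word factor
        log_terms.append(term)
    m = max(log_terms)                                             # log-sum-exp over J in T
    return m + np.log(sum(np.exp(t - m) for t in log_terms))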
Equation (1) makes it evident how we can use MBRM
for annotating new images or video frames. Given a new
(un-annotated) image we can split it into regions r_A, compute feature vectors g_1 . . . g_n for each region and then use equation (1) to determine what subset of vocabulary w^* is most likely to co-occur with the set of feature vectors:

w^* = \arg\max_{w \in \{0,1\}^V} \frac{P(r_A, w)}{P(r_A)}    (2)
In practice we only consider subsets of a fixed size (5
words). One can show that the maximization in equation (2)
can be done very efficiently because of the factored nature
of the Bernoulli component. Essentially it can be shown that
the equations may be simplified so that P(w_i|J) may be
computed independently for each word. This simplification
arises because each word occurs at most once as the caption
of an image. Space constraints preclude us from providing
the proof.
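A sketch of that factored per-word scoring, under the same assumptions as the previous snippet: each word v is scored by the sum over J of P_T(J) times the product of P_G(g_a|J) times P_V(v|J), and the highest-scoring words form the annotation (the fixed size of five follows the paper).

import numpy as np

def annotate(g_A, training_images, log_p_G, p_V, vocabulary, num_words=5):
    # Rank vocabulary words independently and keep the top num_words, exploiting the
    # factored form of the Bernoulli component.
    n_T = len(training_images)
    # Per-training-image log weight: log P_T(J) + sum_a log P_G(g_a|J)
    log_w = np.array([-np.log(n_T) + sum(log_p_G(g, J) for g in g_A)
                      for J in training_images])
    weights = np.exp(log_w - log_w.max())                 # rescaled for stability
    pv = np.stack([p_V[J] for J in training_images])      # shape (N_T, V)
    word_scores = weights @ pv                            # proportional to P(v present, r_A)
    top = np.argsort(word_scores)[::-1][:num_words]
    return [vocabulary[i] for i in top]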
2.3 Estimating Parameters of the Model
In this section we will discuss simple but effective estima-
tion techniques for the three components of the model: P_T, P_V and P_G. P_T(J) is the probability of selecting the underlying model of image J to generate some new observation r, w. In the absence of any task knowledge we use a uniform prior P_T(J) = 1/N_T, where N_T is the size of the training set.
P_G(·|J) is a density function responsible for generating the feature vectors g_1 . . . g_n, which are later mapped to image regions r_J according to P_R. We use a non-parametric kernel-based density estimate for the distribution P_G. Assuming g_J = {g_1 . . . g_n} to be the set of regions of image J we estimate:

P_G(g|J) = \frac{1}{n} \sum_{i=1}^{n} \frac{\exp\big( -(g - g_i)^\top \Sigma^{-1} (g - g_i) \big)}{\sqrt{2^k \pi^k |\Sigma|}}    (3)
Equation (3) arises out of placing a Gaussian kernel over
the feature vector g_i of every region of image J. Each kernel is parametrized by the feature covariance matrix Σ. As a matter of convenience we assumed Σ = β·I, where I is the identity matrix. β plays the role of kernel bandwidth: it determines the smoothness of P_G around the support point g_i. The value of β is selected empirically on a held-out portion of the training set T.
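A small sketch of the kernel density of equation (3) with Σ = β·I, evaluated in log space (our own code; β would be tuned on held-out data as described above):

import numpy as np

def log_p_G(g, g_J, beta):
    # log P_G(g|J) from equation (3): a Gaussian-shaped kernel with covariance beta*I
    # placed over each feature vector g_i of training image J.
    #   g    : (k,) query feature vector
    #   g_J  : (n, k) array of the feature vectors of training image J
    #   beta : kernel bandwidth
    n, k = g_J.shape
    sq_mahal = np.sum((g_J - g) ** 2, axis=1) / beta      # (g - g_i)^T Sigma^{-1} (g - g_i)
    log_kernels = -sq_mahal - 0.5 * k * np.log(2 * np.pi * beta)
    m = log_kernels.max()
    return -np.log(n) + m + np.log(np.exp(log_kernels - m).sum())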
P_V(v|J) is the v-th component of the multiple-Bernoulli distribution that is assumed to have generated the annotation w_J of image J ∈ T. The Bayes estimate using a beta prior (conjugate to a Bernoulli) for each word is given by:

P_V(v|J) = \frac{\mu \, \delta_{v,J} + N_v}{\mu + N}    (4)

Here µ is a smoothing parameter estimated using the training and validation set, δ_{v,J} = 1 if the word v occurs in the annotation of image J and zero otherwise. N_v is the number of training images that contain v in the annotation and N is the total number of training images.
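A sketch of the smoothed Bernoulli estimate in equation (4) (our own code; annotations would be the training captions and µ the value chosen on the validation set):

import numpy as np

def estimate_p_V(annotations, vocabulary, mu):
    # Return the Bayes estimate P_V(v|J) of equation (4) for every training image J.
    #   annotations : list of sets of annotation words, one per training image J
    #   mu          : smoothing parameter, tuned on the validation set
    N = len(annotations)
    N_v = np.array([sum(1 for ann in annotations if v in ann) for v in vocabulary],
                   dtype=float)                       # number of images containing word v
    p_V = {}
    for J, ann in enumerate(annotations):
        delta = np.array([1.0 if v in ann else 0.0 for v in vocabulary])
        p_V[J] = (mu * delta + N_v) / (mu + N)
    return p_V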
3. Related Work
Our model differs from traditional object recognition ap-
proaches in a number of ways (for example, [9, 13, 1, 6, 4, 11]). Such approaches require a separate model to be trained for each object to be recognized. That is, even though the form of the statistical model may be the same, learning two
different objects like a car and a person requires two sepa-
rate training runs (one for each object). Each training run
requires positive and negative examples for that particular
object. On the other hand, in the relevance model approach
described here all the annotation words are learned at the
same time - each training image usually has many anno-
tations. While some of the newer object recognition tech-
niques [6] do not require training examples of the objects to
be cut out of the background, they still seem to require one
object in each image. Our model on the other hand can han-
dle multiple objects in the same training image and can also
ascribe annotations to the backgrounds like sky and grass.
Unlike the more traditional object recognition techniques
we label the entire picture and not specific image regions in
a picture. This is, as a librarian’s manual annotation shows, more than sufficient for tasks like retrieving images from a large database. The joint probability model that we propose takes context into account, i.e., from training images it
learns that an elephant is more likely to be associated with
grass and sky and less likely to be associated with buildings
and hence if there are image regions associated with grass,
this increases the probability of recognizing the object as an
elephant. Traditional object recognition models do not do
this.
The model described here is closest in spirit to the an-
notation models proposed by [5, 3, 7, 8, 2]. Duygulu et
al [5] proposed to describe images using a vocabulary of
blobs. First, regions are created using a segmentation al-
gorithm like normalized cuts. For each region, features are
computed and then blobs are generated by clustering the
image features for these regions across images. Each im-
age is generated by using a certain number of these blobs.
Their Translation Model applies one of the classical statis-
tical machine translation models to translate from the set of
keywords of an image to the set of blobs forming the image.
On the surface, MBRM appears to be similar to one of
the intermediate models considered by Blei and Jordan [3].
Specifically, their GM-mixture model employs a similar de-
pendence structure among the random variables involved.
However, the topological structure of MBRM is quite dif-
ferent from the one employed by [3]. GM-mixture assumes
a low-dimensional topology, leading to a fully-parametric
model where 200 or so “latent aspects” are estimated us-
ing the EM algorithm. To contrast that, MBRM makes no
assumptions about the topological structure, and leads to
a doubly non-parametric approach, where expectations are
computed over every individual point in the training set.
In addition they model words using a multinomial process.
Blei and Jordan used a different subset of the Corel dataset
and hence it is difficult to make a direct quantitative com-
parison with their models.
MBRM is also related to the cross-media relevance
model (CMRM) [7], which is also doubly non-parametric.
There are three significant differences between MBRM and
CMRM. First, CMRM is a discrete model and cannot take
advantage of continuous features. In order to use CMRM
for image annotation we have to quantize continuous feature
vectors into a discrete vocabulary (similar to the translation models [5]). MBRM, on the other hand, directly
models continuous features. The second difference is that
CMRM relies on clustering of the feature vectors into blobs.
Annotation quality of the CMRM is very sensitive to clus-
tering errors, and depends on being able to a-priori select
the right cluster granularity: too many clusters will result
in extreme sparseness of the space, while too few will lead
us to confuse different objects in the images. MBRM does
not rely on clustering and consequently does not suffer from
the granularity issues. Finally, CMRM also models words
using a multinomial process.
We would like to stress that the difference between
MBRM and previously discussed models is not merely con-
ceptual. In section 4 we will show that MBRM performs
significantly better than all previously proposed models on
the tasks of image annotation and retrieval. To ensure a fair
comparison, we show results on exactly the same data set
and similar feature representations as used in [5, 7, 8].
4. Experimental Results
We tested the algorithms using two different datasets, the
Corel data set from Duygulu et al [5] and a set of video key
frames from NIST’s Video Trec [12]. To provide a mean-
ingful comparison between MBRM and CRM-Rectangles,
we do comparative experiments using the same set of fea-
tures extracted from the same set of rectangular grids. For
the Corel dataset we also compare the results with those of
Duygulu et al and the CRM model.
4.1. Datasets and Feature sets
The Corel data set consists of 5000 images from 50 Corel
Stock Photo CDs (see footnote 1). Each CD includes 100 images on the
same topic, and each image is also associated with 1-5 key-
words. Overall there are 371 keywords in the dataset. In
experiments, we divided this dataset into 3 parts: a training
set of 4000 images, a validation set of 500 images and a test
set of 500 images. The validation set is used to find model
parameters. After finding the parameters, we merged the
4000 training set and 500 validation set to form a new train-
ing set. This corresponds to the training set of 4500 images
and the test set of 500 images used by Duygulu et al [5].
We used a subset of NIST’s Video Trec dataset (for com-
putational reasons we did not use the entire data set). The
Footnote 1: We thank Kobus Barnard for making the Corel dataset available at http://www.cs.arizona.edu/people/kobus/research/data/eccv 2002
