Proceedings ArticleDOI

Generating Image Descriptions Using Semantic Similarities in the Output Space

TL;DR: This work extends the nearest-neighbour based generative phrase prediction model by considering inter-phrase semantic similarities, and re-formulates their objective function for parameter learning by penalizing each pair of phrases unevenly, in a manner similar to that in structured predictions.
Abstract: Automatically generating meaningful descriptions for images has recently emerged as an important area of research. In this direction, a nearest-neighbour based generative phrase prediction model (PPM) proposed by (Gupta et al. 2012) was shown to achieve state-of-the-art results on PASCAL sentence dataset, thanks to the simultaneous use of three different sources of information (i.e. visual clues, corpus statistics and available descriptions). However, they do not utilize semantic similarities among the phrases that might be helpful in relating semantically similar phrases during phrase relevance prediction. In this paper, we extend their model by considering inter-phrase semantic similarities. To compute similarity between two phrases, we consider similarities among their constituent words determined using WordNet. We also re-formulate their objective function for parameter learning by penalizing each pair of phrases unevenly, in a manner similar to that in structured predictions. Various automatic and human evaluations are performed to demonstrate the advantage of our "semantic phrase prediction model" (SPPM) over PPM.

Summary (3 min read)

1. Introduction

  • Along with the outburst of digital photographs on the Internet as well as in personal collections, there has been a parallel growth in the amount of images with relevant and more or less structured captions.
  • Thus, it would not be justifiable to treat the phrases “child” and “building” as equally absent.
  • First, the authors modify their model for predicting a phrase given an image.
  • This is a generic formulation and can be used/extended to other scenarios (such as metric learning in nearest-neighbour based methods [23]) where structured prediction needs to be performed using some nearest-neighbour based model.
  • Since their model relies on consideration of semantics among phrases during prediction, the authors call it “semantic phrase prediction model” (or SPPM).

3. Phrase Prediction Model

  • Given images and corresponding descriptions, a set of phrases Y is extracted using all the descriptions.
  • These phrases are restricted to five different types (considering “subject” and “object” as equivalent for practical purposes): (object), (attribute, object), (object, verb), (verb, prep, object), and (object, prep, object).
  • The motivation behind using Google counts of phrases is to smooth their relative frequencies.
  • In order to learn the two sets of parameters (i.e., the weights wi’s and smoothing parameters μi’s), an objective function analogous to [23] is used.

4. Semantic Phrase Prediction Model

  • This results in penalizing semantically similar phrases (e.g. “person” vs. “man”).
  • Here the authors extend this model by considering semantic similarities among phrases.
  • To begin with, first the authors discuss how to compute semantic similarities.

4.1. Computing Semantic Similarities

  • The authors use the WordNet-based JCN similarity measure [7] to compute semantic similarity between the words a1 and a2.
  • WordNet is a large lexical database of English where words are interlinked in a hierarchy based on their semantic and lexical relationships.
  • It should be noted that the authors cannot compute semantic similarity between two prepositions using WordNet.

4.2. SPPM

  • Such a definition allows us to take into account the structure/semantic inter-dependence among phrases while predicting the relevance of a phrase.
  • Since the authors have modified the conditional probability model for predicting a phrase given an image, they also need to update the objective function of equation 5 accordingly.
  • The implication of Δ(·) is that if two phrases are semantically similar (e.g. “kid” and “child”), then the penalty should be small, and vice-versa.
  • This objective function looks similar to that used in [22] for metric learning in nearest neighbour scenario.
  • The major difference is that there the objective function is defined over samples, and the penalty is based on semantic similarity between two samples (proportional to the number of labels they share).

5.1. Experimental Details

  • The authors follow the same experimental set-up as in [6], and use UIUC PASCAL sentence dataset [19] for evaluation.
  • It has 1,000 images and each image is described using 5 independent sentences.
  • These sentences are used to extract different types of phrases using “collapsed CC-processed dependencies” in the Stanford CoreNLP toolkit [1], giving 12,865 distinct phrases.
  • All features other than GIST are also computed over three equal horizontal and vertical partitions [10].
  • While computing distance between two images (equation 1), L1 distance is used for colour, L2 for scene and texture, and χ2 for shape features.

5.2.2 Human Evaluation

  • Automatically describing an image is significantly different from machine translation or summary generation.
  • Since an image can be described in several ways, it is not justifiable to rely just on automatic evaluation, and hence the need for human evaluation arises.
  • Readability is rated to measure the grammatical correctness of a generated description: (1) Terrible, (2) Mostly comprehensible with some errors, (3) Mostly perfect English sentence.
  • The authors also try to analyze the relative relevance of descriptions generated using PPM and SPPM.

5.3.1 Quantitative Results

  • Table 1 shows the results corresponding to automatic evaluations.
  • One important thing that the authors would like to point out is that it is not fully justifiable to directly compare their results with those of [8] and [24].
  • This is because the data (i.e., the fixed sets of objects, prepositions and verbs) that they use for composing new sentences is very different from the data used by the authors.
  • In [6], it was shown that when same data is used, PPM performs better than both of these.
  • In conclusion, their results are directly comparable only with PPM [6].

5.3.2 Qualitative Results

  • Human evaluation results corresponding to “Readability” and “Relevance” are shown in Table 2.
  • This is because SPPM takes into account semantic similarities among the phrases, which in turn results in generating more coherent descriptions than PPM.
  • For this, the authors show the top ten phrases of the type “object” predicted using the two models for an example image.
  • This is because in SPPM, the relevance (or presence) of a phrase also depends on the presence of other phrases that are semantically similar to it.

6. Conclusion

  • The authors have presented an extension to PPM [6] by incorporating semantic similarities among phrases during phrase prediction and parameter learning steps.
  • As the number of phrases increases, inter-phrase relationships become more prominent.
  • Due to the “long-tail” phenomenon, available data alone might not be sufficient to learn such complex relationships, and thus arises the need to bring in knowledge from other sources.
  • The authors have tried to perform this using WordNet.
  • To the best of their knowledge, this is the first attempt of its kind in this domain, and it can be integrated with other similar models as well.


Generating Image Descriptions Using Semantic Similarities in the Output Space
Yashaswi Verma Ankush Gupta Prashanth Mannem C. V. Jawahar
International Institute of Information Technology, Hyderabad, India
Abstract
Automatically generating meaningful descriptions for images has recently emerged as an important area of research. In this direction, a nearest-neighbour based generative phrase prediction model (PPM) proposed by (Gupta et al. 2012) was shown to achieve state-of-the-art results on the PASCAL sentence dataset, thanks to the simultaneous use of three different sources of information (i.e. visual clues, corpus statistics and available descriptions). However, they do not utilize semantic similarities among the phrases that might be helpful in relating semantically similar phrases during phrase relevance prediction. In this paper, we extend their model by considering inter-phrase semantic similarities. To compute similarity between two phrases, we consider similarities among their constituent words determined using WordNet. We also re-formulate their objective function for parameter learning by penalizing each pair of phrases unevenly, in a manner similar to that in structured predictions. Various automatic and human evaluations are performed to demonstrate the advantage of our “semantic phrase prediction model” (SPPM) over PPM.
1. Introduction
Along with the outburst of digital photographs on the Internet as well as in personal collections, there has been a parallel growth in the amount of images with relevant and more or less structured captions. This has opened up new dimensions to deploy machine learning techniques to study available descriptions, and build systems to describe new images automatically. Analysis of available image descriptions would help to figure out possible relationships that exist among different entities within a sentence (e.g. object, action, preposition, etc.). However, even for simple images, automatically generating such descriptions may be quite complex, thus suggesting the hardness of the problem.

Recently, there have been a few attempts in this direction [2, 6, 8, 9, 12, 15, 17, 24]. Most of these approaches rely on visual clues (global image features and/or trained detectors and classifiers) and generate descriptions in an independent manner. This makes such methods susceptible to linguistic errors during the generation step. An attempt towards addressing this was made in [6] using a nearest-neighbour based model. This model utilizes image descriptions at hand to learn different language constructs and constraints practiced by humans, and associates this information with visual properties of an image. It extracts linguistic phrases of different types (e.g. “white aeroplane”, “aeroplane at airport”, etc.) from available sentences, and uses them to describe new images. The underlying hypothesis of this model is that an image inherits the phrases that are present in the ground-truth of its visually similar images. This simple but conceptually coherent hypothesis resulted in state-of-the-art results on the PASCAL sentence dataset [19] (http://vision.cs.uiuc.edu/pascal-sentences/).

However, this hypothesis has its limitations as well. One such limitation is the ignorance of semantic relationships among the phrases; i.e., presence of one phrase should trigger presence of other phrases that are semantically similar to it. E.g., consider a set of three phrases {“kid”, “child”, “building”}, an image J and its neighbouring image I. If the image I has the phrase “kid” in its ground-truth, then according to the model of [6], it will get associated with J with some probability, while (almost) ignoring the remaining phrases. However, if we look at these phrases, then it can be easily noticed that the phrases “kid” and “child” are semantically very similar, whereas the phrases “child” and “building” are semantically very different. Thus, it would not be justifiable to treat the phrases “child” and “building” as equally absent. That is to say, presence of “kid” should also indicate the presence of the phrase “child”. From the machine learning perspective, this relates with the notion of predicting structured outputs [21]. Intuitively, it asserts that given a true (or positive) label and a set of false (or negative) labels, each negative label should be penalized unevenly depending on its (dis)similarity with the true label.

In this paper, we try to address this limitation of the phrase prediction model (PPM) of [6]. For this, we propose two extensions to PPM. First, we modify their model for predicting a phrase given an image. This is performed by considering semantic similarities among the phrases. And second, we propose a parameter learning formulation in the nearest-neighbour set-up that takes into account the relation (structure) present in the output space.

(structure) present in the output space. This is a generic
formulation and can be used/extended to other scenarios
(such as metric learning in nearest-neighbour based meth-
ods [23]) where structured prediction needs to be performed
using some nearest-neighbour based model. Both of our ex-
tensions utilize semantic similarities among phrases deter-
mined using WordNet [3]. Since our model relies on con-
sideration of semantics among phrases during prediction,
we call it “semantic phrase prediction model” (or
SPPM).
We perform several automatic and human evaluations to
demonstrate the advantage of
SPPM over PPM.
2. Related Works
Here we discuss some of the notable contributions in this domain. In [25], a semi-automatic method is proposed where first an image is parsed and converted into a semantic representation, which is then used by a text parse engine to generate the image description. The visual knowledge is represented using a parse graph which associates objects with WordNet synsets to acquire categorical relationships. Using this, they are able to compose new rule-based grounded symbols (e.g., “zebra” = “horse” + “stripes”). In [8], they use trained detectors and classifiers to predict the objects and attributes present in an image, and simple heuristics to figure out the preposition between any two objects. These predictions are then combined with corpus statistics (frequency of a term in a large text corpus, e.g. Google) and given as an input to a CRF model. The final output is a set of objects, their attributes and a preposition for each pair of objects, which are then mapped to a sentence using a simple template-based approach. Similar to this, [24] relies on detectors and classifiers to predict up to two objects and the overall scene of an image. Along with the preposition, they also predict the action performed by the subject, and combine the predictions using an HMM model. In [12], the outputs of object detectors are combined with frequency counts of different n-grams (n ≤ 5) obtained using the Google-1T data. Their phrase fusion technique specifically infuses some creativity into the output descriptions. Another closely related work with similar motivation is [15].

One of the limitations of most of these methods is that they don’t make use of available descriptions; making use of them may help in avoiding the generation of noisy/absurd descriptions (e.g. “person under road”). Two recent methods [6, 9] try to address this issue by making use of higher-level language constructs, called phrases. A phrase is a collection of syntactically ordered words that is semantically meaningful and complete on its own (e.g., “person pose”, “cow in field”, etc.); the term ‘phrase’ is used here in a more general sense, different from the linguistic sense of phrase. In [9], phrases are extracted from the dataset proposed in [17]. Then, an integer-programming based formulation is used that fuses visual clues with words and phrases to generate sentences. In [6], a nearest-neighbour based model is proposed that simultaneously integrates three different sources of information, i.e. visual clues, corpus statistics and available descriptions. They use linguistic phrases extracted from available sentences to construct descriptions for new images. These two models are closely related with the notion of visual phrases [20], which says that it is more meaningful to detect visual phrases (e.g. “person next to car”) than individual objects in an image.

Apart from these, there are a few other methods that directly transfer one or more complete sentences from a collection of sentences. E.g., the method proposed in [17] transfers multiple descriptions from some other images to a given image. They discuss two ways to perform this: (i) using global image features to find similar images, and (ii) using detectors to re-rank the descriptions obtained after the first step. Their approach mainly relies on a very large collection of one million captioned images. Similar to [17], in [2] also a complete sentence from the training image descriptions is transferred by mapping a given (test) image and available descriptions into a “meaning space” of the form (object, action, scene). This is done using a retrieval based approach combined with an MRF model.
3. Phrase Prediction Model
In this section, we briefly discuss PPM [6]. Given images and corresponding descriptions, a set of phrases Y is extracted using all the descriptions. These phrases are restricted to five different types (considering “subject” and “object” as equivalent for practical purposes): (object), (attribute, object), (object, verb), (verb, prep, object), and (object, prep, object). The dataset takes the form T = {(I_i, Y_i)}, where I_i is an image and Y_i ⊆ Y is its set of phrases. Each image I is represented using a set of n features {f_{1,I}, ..., f_{n,I}}. Given two images I and J, the distance between them is computed using a weighted sum of distances corresponding to each feature as:

D_{I,J} = w_1 d^1_{I,J} + ... + w_n d^n_{I,J} = w · d_{I,J},   (1)

where w_i ≥ 0 denotes the weight corresponding to the i-th feature distance. Using this, for a new image I, its K most similar images T_I^K ⊆ T are picked. Then, the joint probability of associating a phrase y_i ∈ Y with I is given by:

P(y_i, I) = Σ_{J ∈ T_I^K} P_T(J) P_F(I|J) P_Y(y_i|J).   (2)

Here, P_T(J) = 1/K denotes the uniform probability of picking some image J from T_I^K. P_F(I|J) denotes the likelihood of image I given J, defined as:

P_F(I|J) = exp(−D_{I,J}) / Σ_{J′ ∈ T_I^K} exp(−D_{I,J′}).   (3)
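The prediction step in equations 1-3 is easy to sketch in code. The following is a minimal, illustrative Python sketch rather than the authors' implementation: it assumes the exp(−D) form of the likelihood in equation 3, pre-computed distances, and toy data, and it reduces P_Y to a 0/1 indicator (the real model smooths it with Google counts, equation 4).

```python
import numpy as np

def phrase_relevance(dist_to_train, train_phrase_sets, vocab, K=15):
    """Toy sketch of equations 2-3: score every phrase y for a query image I.

    dist_to_train     : (num_train,) array of distances D_{I,J} (equation 1)
    train_phrase_sets : list of sets, ground-truth phrases Y_J per training image
    vocab             : list of all phrases Y
    """
    nbrs = np.argsort(dist_to_train)[:K]        # K most similar images T_I^K
    lik = np.exp(-dist_to_train[nbrs])          # P_F(I|J), assuming the exp(-D) form
    lik /= lik.sum()                            # normalise over the K neighbours
    p_T = 1.0 / K                               # uniform P_T(J)

    scores = {}
    for y in vocab:
        # P_Y(y|J) reduced to a 0/1 indicator here; the paper smooths it with
        # Google counts (equation 4) and, in SPPM, with U_sim (equation 9)
        p_Y = np.array([1.0 if y in train_phrase_sets[j] else 0.0 for j in nbrs])
        scores[y] = float(np.sum(p_T * lik * p_Y))          # equation 2
    return scores

# toy usage with made-up distances and phrase sets
d = np.array([0.2, 0.9, 0.4, 1.5])
Y_train = [{"aeroplane", "sky"}, {"dog"}, {"aeroplane"}, {"bus", "road"}]
print(phrase_relevance(d, Y_train, ["aeroplane", "sky", "dog", "bus"], K=3))
```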

Finally, P_Y(y_i|J) denotes the probability of seeing the phrase y_i given image J, and is defined according to [4]:

P_Y(y_i|J) = (μ_i δ_{y_i,J} + N_i) / (μ_i + N).   (4)

Here, if y_i ∈ Y_J, then δ_{y_i,J} = 1, and 0 otherwise. N_i is the (approximate) Google count of the phrase y_i, N denotes the sum of Google counts of all phrases in Y that are of the same type as that of y_i, and μ_i ≥ 0 is the smoothing parameter. The motivation behind using Google counts of phrases is to smooth their relative frequencies.
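As a small concrete illustration of equation 4, the sketch below uses hypothetical counts; the point is that a frequent phrase keeps a small non-zero probability even when it is absent from Y_J, while μ_i controls how strongly the ground-truth indicator dominates.

```python
def p_phrase_given_image(is_in_YJ, N_i, N, mu_i):
    """Sketch of equation 4: smoothed probability of phrase y_i given image J.

    is_in_YJ : True if y_i appears in the ground-truth phrases Y_J (delta = 1)
    N_i      : approximate Google count of the phrase y_i
    N        : sum of Google counts of all phrases of the same type as y_i
    mu_i     : smoothing parameter, mu_i >= 0
    """
    delta = 1.0 if is_in_YJ else 0.0
    return (mu_i * delta + N_i) / (mu_i + N)

# hypothetical counts: the phrase is absent from Y_J but still common on the web
print(p_phrase_given_image(False, N_i=5e6, N=1e9, mu_i=100.0))
print(p_phrase_given_image(True,  N_i=5e6, N=1e9, mu_i=100.0))
```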
In order to learn the two sets of parameters (i.e., the weights w_i’s and smoothing parameters μ_i’s), an objective function analogous to [23] is used. Given an image J along with its true phrases Y_J, the goal is to learn the parameters such that (i) the probability of predicting the phrases in Y \ Y_J should be minimized, and (ii) the probability of predicting each phrase in Y_J should be more than any other phrase. Precisely, we minimize the following function:

e = Σ_{J, y_k} P(y_k, J) + λ Σ_{(J, y_k, y_j) ∈ M} (P(y_k, J) − P(y_j, J)).   (5)

Here, y_j ∈ Y_J, y_k ∈ Y \ Y_J, M is the set of triples that violate the second constraint stated above, and λ > 0 is used to manage the trade-off between the two terms. The objective function is optimized using a gradient descent method, by learning the w_i’s and μ_i’s in an alternate manner.
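The per-image part of equation 5 can be sketched as below. This is an illustrative reading, assuming that the violation set M contains the triples (J, y_k, y_j) in which a negative phrase scores at least as high as a true phrase; in the paper the loss is summed over all training images and minimised over the w_i’s and μ_i’s by alternating gradient descent.

```python
def ppm_loss_single_image(P, Y_J, lam=1.0):
    """Sketch of equation 5 restricted to one image J.

    P   : dict mapping each phrase y to its predicted score P(y, J)
    Y_J : set of ground-truth phrases of image J
    lam : trade-off parameter lambda > 0
    """
    negatives = [y for y in P if y not in Y_J]
    # first term: total probability mass assigned to phrases outside Y_J
    loss = sum(P[y_k] for y_k in negatives)
    # second term: margin violations, i.e. triples (J, y_k, y_j) in M
    for y_j in (y for y in Y_J if y in P):
        for y_k in negatives:
            if P[y_k] >= P[y_j]:
                loss += lam * (P[y_k] - P[y_j])
    return loss

# toy usage: "kid" is the only true phrase
print(ppm_loss_single_image({"kid": 0.30, "child": 0.25, "building": 0.28}, {"kid"}))
```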
Using equation 2, a ranked list of phrases is obtained, which are then integrated to produce triples of the form {((attribute1, object1), verb), (verb, prep, (attribute2, object2)), (object1, prep, object2)}. These are then mapped to simple sentences using SimpleNLG [5].
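The paper uses SimpleNLG [5] for the final surface realisation. The snippet below is only a hypothetical, heavily simplified stand-in that hard-codes articles, tense and agreement, to make the triple-to-sentence step concrete; it does not reproduce SimpleNLG's behaviour.

```python
def triple_to_sentence(attr1, obj1, verb, prep, attr2, obj2):
    """Hypothetical template realisation of one predicted triple
    ((attr1, obj1), verb), (verb, prep, (attr2, obj2)), (obj1, prep, obj2)."""
    subject = f"A {attr1} {obj1}".strip()
    complement = f"a {attr2} {obj2}".strip()
    # naive progressive form; a real realiser handles morphology properly
    return f"{subject} is {verb}ing {prep} {complement}."

print(triple_to_sentence("white", "aeroplane", "fly", "in", "blue", "sky"))
# -> "A white aeroplane is flying in a blue sky."
```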
4. Semantic Phrase Prediction Model
As discussed before, one of the limitations of PPM is that it treats phrases in a binary manner; i.e., in equation 4, δ_{y_i,J} is either 1 or 0 depending on the presence or absence of y_i in Y_J. This results in penalizing semantically similar phrases (e.g. “person” vs. “man”). Here we extend this model by considering semantic similarities among phrases. To begin with, first we discuss how to compute semantic similarities.
4.1. Computing Semantic Similarities
Let a_1 and a_2 be two words (e.g. “boy” and “man”). We use the WordNet-based JCN similarity measure [7] to compute the semantic similarity between the words a_1 and a_2 (using the code available at http://search.cpan.org/CPAN/authors/id/T/TP/TPEDERSE/WordNet-Similarity-2.05.tar.gz). WordNet is a large lexical database of English where words are interlinked in a hierarchy based on their semantic and lexical relationships. Given a pair of words (a_1, a_2), the JCN similarity measure returns a score s_{a_1 a_2} in the range [0, ∞), with a higher score corresponding to larger similarity and vice-versa. This similarity score is then mapped into the range [0, 1] using the following non-linear transformation described in [11] (denoting s_{a_1 a_2} by s in short):
γ(s) = 1,  if s ≥ 0.1
γ(s) = 0.6 − 0.4 sin((25π/2) s + (3/4) π),  if s ∈ (0.06, 0.1)
γ(s) = 0.6 − 0.6 sin((π/2) (1 − 1/(3.471 s + 0.653))),  if s ≤ 0.06
Using this, we define a similarity function that takes two words as input and returns the semantic similarity score between them computed using the above equation as:

W_sim(a_1, a_2) = γ(s_{a_1 a_2}).   (6)

From this, we compute the semantic dissimilarity score as:

W̄_sim(a_1, a_2) = 1 − W_sim(a_1, a_2).   (7)
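For readers who want to reproduce the word-level scores, a rough sketch using NLTK's WordNet interface is given below (the paper itself uses the WordNet::Similarity Perl package linked above). The choice of the first noun sense and of the Brown information-content file are assumptions made only for illustration; the raw JCN score would then still have to be mapped into [0, 1] with the transformation γ(s) above.

```python
# may require: nltk.download("wordnet") and nltk.download("wordnet_ic")
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

ic = wordnet_ic.ic("ic-brown.dat")   # information content from the Brown corpus

def jcn_score(word1, word2):
    """Raw JCN similarity in [0, inf) between the first noun senses of two words."""
    s1 = wn.synsets(word1, pos=wn.NOUN)[0]
    s2 = wn.synsets(word2, pos=wn.NOUN)[0]
    return s1.jcn_similarity(s2, ic)

print(jcn_score("boy", "man"))         # semantically close words -> larger score
print(jcn_score("child", "building"))  # unrelated words -> smaller score
```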
Based on equation 6, we define the semantic similarity between two phrases (of the same type) as V_sim, which is an average of the semantic similarity between each of their corresponding constituent terms. E.g., if we have two phrases v_1 = (“person”, “walk”) and v_2 = (“boy”, “run”) of the type (object, verb), then their semantic similarity score will be given by V_sim(v_1, v_2) = 0.5 * (W_sim(“person”, “boy”) + W_sim(“walk”, “run”)). It should be noted that we cannot compute semantic similarity between two prepositions using WordNet. So, while computing the semantic similarity between two phrases that contain prepositions in them (i.e., of type (verb, prep, object) or (object, prep, object)), we do not consider the prepositions. Analogous to equation 7, we can compute the semantic dissimilarity score between two phrases as V̄_sim(v_1, v_2) = 1 − V_sim(v_1, v_2). Finally, given a phrase y_i and a set of phrases Y of the same type as that of y_i, we define the semantic similarity between them as

U_sim(y_i, Y) = max_{y_j ∈ Y} V_sim(y_i, y_j).   (8)

In practice, if |Y| = 0 then we set U_sim(y_i, Y) = 0. Also, in order to emphasize more on an exact match, we set U_sim(y_i, Y) to exp(1) if y_i ∈ Y in the above equation.
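Putting the above definitions together, the phrase-level similarities V_sim and U_sim can be sketched as follows. The slot-tagged phrase representation and the toy word similarity are illustrative assumptions; in the paper the word similarity W_sim comes from the WordNet-based γ(s) mapping.

```python
import math

def v_sim(phrase1, phrase2, w_sim, skip_slots=("prep",)):
    """Sketch of V_sim: average word similarity over corresponding constituent
    words of two same-type phrases, ignoring preposition slots (WordNet gives
    no similarity for prepositions).

    phrase1, phrase2 : tuples of (slot, word), e.g. (("object","person"), ("verb","walk"))
    w_sim            : word-level similarity W_sim returning values in [0, 1]
    """
    sims = [w_sim(w1, w2)
            for (slot1, w1), (slot2, w2) in zip(phrase1, phrase2)
            if slot1 not in skip_slots]
    return sum(sims) / len(sims)

def u_sim(phrase, same_type_phrases, w_sim):
    """Sketch of equation 8 with the two special cases mentioned in the text."""
    if phrase in same_type_phrases:
        return math.exp(1)          # emphasise an exact match
    if not same_type_phrases:
        return 0.0                  # |Y| = 0
    return max(v_sim(phrase, other, w_sim) for other in same_type_phrases)

# toy usage reproducing the (person, walk) vs. (boy, run) example
toy_w = lambda a, b: {("person", "boy"): 0.8, ("walk", "run"): 0.6}.get((a, b), 1.0 if a == b else 0.1)
p1 = (("object", "person"), ("verb", "walk"))
p2 = (("object", "boy"), ("verb", "run"))
print(v_sim(p1, p2, toy_w))        # 0.5 * (0.8 + 0.6) = 0.7
print(u_sim(p1, [p2], toy_w))      # best match in the set -> 0.7
```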
4.2. SPPM
In order to benefit from semantic similarity between two phrases while predicting the relevance of some given phrase y_i with Y_J of image J, we need to modify equation 4 accordingly. Let y_i be of type t, and the set of phrases of type t in Y_J be Y_J^t ⊆ Y_J. Then, we re-define P_Y(y_i|J) as:

P_Y(y_i|J) = (μ_i δ_{y_i,J} + N_i) / (μ_i + N),   (9)

where δ_{y_i,J} = U_sim(y_i, Y_J^t). This means that when y_i ∉ Y_J^t, we look for the phrase in Y_J^t that is semantically most similar to y_i and use their similarity score rather than putting a zero.

Figure 1. Difference between the two models, PPM [6] and SPPM (this work). In PPM, the conditional probability of a phrase y_i given an image J depends on whether that phrase is present in the ground-truth phrases of J (i.e. Y_J) or not. When the phrase is not present, the corresponding δ_{y_i,J} (equation 4) becomes zero without considering the semantic similarity of y_i with other phrases in Y_J. This limitation of PPM is addressed in SPPM by finding the phrase in Y_J that is semantically most similar to y_i and using their similarity score instead of zero. In the above example, we have Y_J = {“bus”, “road”, “street”}. Given a phrase y_i = “highway”, δ_{y_i,J} = 0 according to PPM, whereas δ_{y_i,J} = 0.8582 according to SPPM (equation 9), by considering the similarity of “highway” with “road” (i.e., V_sim(“highway”, “road”) = 0.8582).
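The only change relative to PPM is the δ term, which can be sketched in a few lines. The toy U_sim below simply hard-codes the Figure 1 example (“highway” matching “road” with similarity 0.8582) and is not the real WordNet-based score.

```python
def semantic_delta(phrase, same_type_ground_truth, u_sim):
    """Sketch of the delta term in equation 9: the hard 0/1 indicator of
    equation 4 is replaced by the similarity to the best-matching phrase
    of the same type in Y_J, so near-misses are no longer zeroed out."""
    return u_sim(phrase, same_type_ground_truth)

# toy usage mirroring Figure 1: "highway" is absent from Y_J = {bus, road, street}
toy_u_sim = lambda y, Y: 1.0 if y in Y else (0.8582 if y == "highway" else 0.0)
print(semantic_delta("highway", {"bus", "road", "street"}, toy_u_sim))  # 0.8582
print(semantic_delta("person",  {"bus", "road", "street"}, toy_u_sim))  # 0.0
```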
Such a definition allows us to take into account the structure/semantic inter-dependence among phrases while predicting the relevance of a phrase. Since we have modified the conditional probability model for predicting a phrase given an image, we also need to update the objective function of equation 5 accordingly. Given an image J along with its true phrases y_j’s in Y_J, now we additionally need to ensure that the penalty imposed for a higher relevance score of some phrase y_k ∈ Y \ Y_J than any phrase y_j ∈ Y_J should also depend on the semantic similarity between y_j and y_k. This is similar to the notion of predicting structured outputs as discussed in [21]. Precisely, we re-define the objective function as:

e = Σ_{J, y_k} P(y_k, J) + λ Σ_{(J, y_k, y_j) ∈ M} Δ(J, y_k, y_j),   (10)

Δ(J, y_k, y_j) = V̄_sim(y_k, y_j) (P(y_k, J) − P(y_j, J)).   (11)

The implication of Δ(·) is that if two phrases are semantically similar (e.g. “kid” and “child”), then the penalty should be small, and vice-versa. This objective function looks similar to that used in [22] for metric learning in the nearest neighbour scenario. The major difference is that there the objective function is defined over samples, and the penalty is based on semantic similarity between two samples (proportional to the number of labels they share), whereas here the objective function is defined over phrases, and the penalty is based on semantic similarity between two phrases.
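The re-formulated objective can be sketched per image as below, assuming (consistently with the stated implication of Δ) that the weighting term is the phrase dissimilarity 1 − V_sim, so that semantically close pairs such as (“kid”, “child”) contribute only a small penalty; dropping that weight recovers the PPM objective of equation 5.

```python
def sppm_loss_single_image(P, Y_J, v_dissim, lam=1.0):
    """Sketch of equations 10-11 restricted to one image J.

    P        : dict mapping each phrase y to its predicted score P(y, J)
    Y_J      : set of ground-truth phrases of image J
    v_dissim : phrase-level dissimilarity, i.e. 1 - V_sim
    lam      : trade-off parameter lambda > 0
    """
    negatives = [y for y in P if y not in Y_J]
    loss = sum(P[y_k] for y_k in negatives)
    for y_j in (y for y in Y_J if y in P):
        for y_k in negatives:
            if P[y_k] >= P[y_j]:                                      # violating triple in M
                loss += lam * v_dissim(y_k, y_j) * (P[y_k] - P[y_j])  # Delta term
    return loss

# toy usage: the near-synonym "child" is penalised far less than "building"
scores = {"kid": 0.30, "child": 0.35, "building": 0.32}
dissim = lambda a, b: 0.1 if {a, b} == {"kid", "child"} else 0.9
print(sppm_loss_single_image(scores, {"kid"}, dissim))
```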
5. Experiments
5.1. Experimental Details
We follow the same experimental set-up as in [6], and use the UIUC PASCAL sentence dataset [19] for evaluation. It has 1,000 images and each image is described using 5 independent sentences. These sentences are used to extract different types of phrases using the “collapsed CC-processed dependencies” in the Stanford CoreNLP toolkit [1] (http://nlp.stanford.edu/software/corenlp.shtml), giving 12,865 distinct phrases. In order to consider synonyms, WordNet synsets are used to expand each noun up to 3 hyponym levels, resulting in a reduced set of 10,429 phrases. Similar to [6], we partition the dataset into 90% training and 10% testing for learning the parameters, and repeat this over 10 partitions in order to generate descriptions for all the images. During relevance prediction, we consider K = 15 nearest-neighbours from the training data.

For image representation, we use a set of colour (RGB and HSV), texture (Gabor and Haar), scene (GIST [16]) and shape (SIFT [14]) descriptors computed globally. All features other than GIST are also computed over three equal horizontal and vertical partitions [10]. This gives a set of 16 features per image. While computing the distance between two images (equation 1), L1 distance is used for colour, L2 for scene and texture, and χ² for shape features.
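The weighted distance of equation 1 with the per-feature metrics listed above can be sketched as follows; the feature names, dimensionalities and unit weights are placeholders, whereas the paper uses 16 features and the learned, non-negative weights w_i.

```python
import numpy as np

def chi2_dist(x, y, eps=1e-10):
    """Chi-squared distance between two non-negative histograms."""
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def image_distance(feats_I, feats_J, weights, metrics):
    """Sketch of equation 1: D_{I,J} as a weighted sum of per-feature distances."""
    return sum(weights[name] * metrics[name](feats_I[name], feats_J[name])
               for name in feats_I)

# placeholder features: one per family (the paper uses 16 features in total)
metrics = {
    "rgb":  lambda a, b: np.abs(a - b).sum(),    # L1 for colour histograms
    "gist": lambda a, b: np.linalg.norm(a - b),  # L2 for scene/texture descriptors
    "sift": chi2_dist,                           # chi^2 for shape (SIFT) histograms
}
rng = np.random.default_rng(0)
I = {k: rng.random(64) for k in metrics}
J = {k: rng.random(64) for k in metrics}
w = {k: 1.0 for k in metrics}
print(image_distance(I, J, w, metrics))
```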
5.2. Evaluation Measures
In our experiments, we perform both automatic as well as human evaluations for performance analysis.
5.2.1 Automatic Evaluation
For this we use the BLEU [18] and Rouge [13] metrics. These are frequently used for evaluations in the areas of machine translation and automatic summarization, respectively.
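For instance, a BLEU-1 score for a single generated sentence against its reference descriptions can be computed with NLTK as below; the sentences are made up, and Rouge-1 requires a separate package and is omitted here.

```python
from nltk.translate.bleu_score import sentence_bleu

references = [
    "a white aeroplane is flying in the sky".split(),
    "a plane flies over the airport".split(),
]
hypothesis = "a white plane is flying in the sky".split()

# weights=(1, 0, 0, 0) restricts the score to unigram precision, i.e. BLEU-1
print(sentence_bleu(references, hypothesis, weights=(1, 0, 0, 0)))
```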
5.2.2 Human Evaluation
Automatically describing an image is significantly different from machine translation or summary generation. Since an image can be described in several ways, it is not justifiable to rely just on automatic evaluation, and hence the need for human evaluation arises.

Table 1. Automatic evaluation results for sentence generation. (Higher score means better performance.)

Approach            BLEU-1 Score   Rouge-1 Score
BabyTalk [8]        0.30           -
CorpusGuided [24]   -              0.44
PPM [6] w/ syn.     0.41           0.28
PPM [6] w/o syn.    0.36           0.21
SPPM w/ syn.        0.43           0.29
SPPM w/o syn.       0.36           0.20

Table 2. Human evaluation results for “Readability” and “Relevance”. (Higher score means better performance.)

Approach            Readability    Relevance
PPM [6] w/ syn.     2.84           1.49
PPM [6] w/o syn.    2.75           1.32
SPPM w/ syn.        2.93           1.61
SPPM w/o syn.       2.91           1.39
We gather judgements from two human evaluators on 100 images randomly picked from the dataset and take their average. The evaluators are asked to verify three aspects on a Likert scale of {1, 2, 3} [6, 12]:

Readability: To measure the grammatical correctness of the generated description, using the following ratings: (1) Terrible, (2) Mostly comprehensible with some errors, (3) Mostly perfect English sentence.

Relevance: To measure the semantic relevance of the generated sentence, using the following ratings: (1) Totally off, (2) Reasonably relevant, (3) Very relevant.

Relative Relevance: We also try to analyze the relative relevance of descriptions generated using PPM and SPPM. Corresponding to each image, we present the descriptions generated using these two models to the human evaluators (without telling them that they are generated using two different models) and collect judgements based on the following ratings: (1) Description generated by PPM is more relevant, (2) Description generated by SPPM is more relevant, (3) Both descriptions are equally relevant/irrelevant.
5.3. Results and Discussion
5.3.1 Quantitative Results
Table 1 shows the results corresponding to the automatic evaluations. It can be noticed that SPPM shows comparable or superior performance to PPM. One important thing that we would like to point out is that it is not fully justifiable to directly compare our results with those of [8] and [24]. This is because the data (i.e., the fixed sets of objects, prepositions and verbs) that they use for composing new sentences is very different from ours. However, in [6], it was shown that when the same data is used, PPM performs better than both of these.
Table 3. Human evaluation results for “Relative Relevance”. The last column denotes the number of times descriptions generated using the two methods were judged as equally relevant or irrelevant for the given image. (Larger count means better performance.)

           PPM [6] count   SPPM count   Both/None count
w/ syn.    16              28           56
w/o syn.   21              25           54
Figure 3. Example image from the PASCAL sentence dataset along with the top ten “objects” predicted using the two models.
PPM: (1) flap (2) csa (3) symbol (4) aircraft (5) slope (6) crag (7) villa (8) biplane (9) distance (10) sky
SPPM: (1) aeroplane (2) airplane (3) plane (4) sky (5) boat (6) water (7) air (8) aircraft (9) jet (10) gear
Since the data that we use in our experiments is exactly the same as that of PPM, and SPPM performs comparably or better than PPM, we believe that under the same experimental set-up our model would perform better than both [8] and [24]. Also, we do not compare with other works because, since this is an emerging domain, different works have used different evaluation measures (such as [2]), experimental set-ups (such as [15]), or even datasets (such as [9, 17]). In conclusion, our results are directly comparable only with PPM [6].
5.3.2 Qualitative Results
Human evaluation results corresponding to “Readability” and “Relevance” are shown in Table 2. Here, we can notice that SPPM consistently performs better than PPM on all the evaluation metrics. This is because SPPM takes into account semantic similarities among the phrases, which in turn results in generating more coherent descriptions than PPM. This is also highlighted in Figure 2, which shows example descriptions generated using PPM and SPPM. It can be noticed that the words in descriptions generated using SPPM usually show semantic connectedness, which is not always the case with PPM. E.g., compare the descriptions obtained using PPM (in the second row) with those obtained using SPPM (in the fourth row) for the last three images.

In Table 3, results corresponding to “Relative Relevance” are shown. In this case also, SPPM always performs better than PPM. This means that the descriptions generated using SPPM are semantically more relevant than those using PPM.

In Figure 3, we try to get some insight into how the internal functioning of SPPM differs from that of PPM. For this, we show the top ten phrases of the type “object” predicted using the two models for an example image.

Citations
Journal ArticleDOI
TL;DR: In this survey, the image captioning approaches and improvements based on deep neural network are introduced, including the characteristics of the specific techniques.
Abstract: Image captioning is a hot topic of image understanding, and it is composed of two natural parts ("look" and "language expression") which correspond to the two most important fields of artificial intelligence ("machine vision" and "natural language processing"). With the development of deep neural networks and better labeling database, the image captioning techniques have developed quickly. In this survey, the image captioning approaches and improvements based on deep neural network are introduced, including the characteristics of the specific techniques. The early image captioning approach based on deep neural network is the retrieval-based method. The retrieval method makes use of a searching technique to find an appropriate image description. The template-based method separates the image captioning process into object detection and sentence generation. Recently, end-to-end learning-based image captioning method has been verified effective at image captioning. The end-to-end learning techniques can generate more flexible and fluent sentence. In this survey, the image captioning methods are reviewed in detail. Furthermore, some remaining challenges are discussed.

64 citations

Proceedings ArticleDOI
01 Jan 2014
TL;DR: This paper studies two complementary cross-modal prediction tasks: predicting text given an image (“Im2Text”), and predicting image(s) given a piece of text (‘Text2Im’), and proposes a novel Structural SVM based unified formulation for these two tasks.
Abstract: Building bilateral semantic associations between images and texts is among the fundamental problems in computer vision. In this paper, we study two complementary cross-modal prediction tasks: (i) predicting text(s) given an image (“Im2Text”), and (ii) predicting image(s) given a piece of text (“Text2Im”). We make no assumption on the specific form of text; i.e., it could be either a set of labels, phrases, or even captions. We pose both these tasks in a retrieval framework. For Im2Text, given a query image, our goal is to retrieve a ranked list of semantically relevant texts from an independent textcorpus (i.e., texts with no corresponding images). Similarly, for Text2Im, given a query text, we aim to retrieve a ranked list of semantically relevant images from a collection of unannotated images (i.e., images without any associated textual meta-data). We propose a novel Structural SVM based unified formulation for these two tasks. For both visual and textual data, two types of representations are investigated. These are based on: (1) unimodal probability distributions over topics learned using latent Dirichlet allocation, and (2) explicitly learned multi-modal correlations using canonical correlation analysis. Extensive experiments on three popular datasets (two medium and one web-scale) demonstrate that our framework gives promising results compared to existing models under various settings, thus confirming its efficacy for both the tasks.

52 citations


Cites background or methods from "Generating Image Descriptions Using..."

  • ...In contrast to several popular methods such as [5, 6, 10, 11, 15, 19, 22, 35, 37] that assume presence of data from both the modalities (visual and textual) during the testing phase, our approach has a motivation similar to the few known works (e....

  • ...Lately, there have been several attempts that use short captions to describe images [5, 11, 14, 15, 16, 21, 22, 33, 37, 39, 40] (and a few recent efforts such as [9, 28] to describe videos)....

  • ...These have also been used by previous methods [11, 14, 16, 22, 37] that describe images (i....

  • ...Most of these works first try to predict the visual content of an image using some off-the-shelf computer vision technique (such as pre-trained object detectors and/or scene classifiers [14, 16, 21, 39], feature-based similarity with database images [11, 22, 37], or both [15, 22])....

  • ...While most of the earlier as well as recent works have focused on automatically annotating images using semantic labels [4, 6, 10, 19, 35, 38], in the past few years, describing images using phrases [11, 15, 29, 37], or one or more simple captions [5, 11, 14, 15, 16, 21, 22, 26, 37, 39, 40] have attained significant attention....

Proceedings ArticleDOI
07 Dec 2015
TL;DR: A novel phrase-learning method that obtains a subspace in which all feature vectors associated with the same phrase are mapped as mutually close, and classifiers for each phrase are learned, and training samples are shared among co-occurring phrases.
Abstract: Generating captions to describe images is a fundamental problem that combines computer vision and natural language processing. Recent works focus on descriptive phrases, such as "a white dog" to explain the visual composites of an input image. The phrases can not only express objects, attributes, events, and their relations but can also reduce visual complexity. A caption for an input image can be generated by connecting estimated phrases using a grammar model. However, because phrases are combinations of various words, the number of phrases is much larger than the number of single words. Consequently, the accuracy of phrase estimation suffers from too few training samples per phrase. In this paper, we propose a novel phrase-learning method: Common Subspace for Model and Similarity (CoSMoS). In order to overcome the shortage of training samples, CoSMoS obtains a subspace in which (a) all feature vectors associated with the same phrase are mapped as mutually close, (b) classifiers for each phrase are learned, and (c) training samples are shared among co-occurring phrases. Experimental results demonstrate that our system is more accurate than those in earlier work and that the accuracy increases when the dataset from the web increases.

49 citations


Cites background or methods from "Generating Image Descriptions Using..."

  • ...Some works [10, 43] adopt Large Margin Nearest Neighbor (LMNN) classification [45], which is not scalable for the data amount, and templates for caption generation....

  • ...Because it is unclear if the accuracy of captions with a combinatorial optimization method is better than the methods with templates [9,10,43], we experimentally compare them in Sec....

  • ...Various works [10, 22, 28, 41, 43] employ such phrases to generate captions from images....

  • ...In order to represent image contents, such as objects, events, attributes, and their relations, recent works [7, 10, 20,22,28,35,41,43] focus on visual phrases describing image contents and their relations....

  • ...Some researchers [10, 28, 43] use a parser to extract phrases from the ground truth....

Journal ArticleDOI
TL;DR: This paper evaluates ten state-of-the-art approaches for image annotation using the same baseline CNN features and proposes new quantitative measures to examine various issues/aspects in the image annotation domain, such as dataset specific biases, per-label versus per-image evaluation criteria, and the impact of changing the number and type of predicted labels.
Abstract: Automatic image annotation is one of the fundamental problems in computer vision and machine learning. Given an image, here the goal is to predict a set of textual labels that describe the semantics of that image. During the last decade, a large number of image annotation techniques have been proposed that have been shown to achieve encouraging results on various annotation datasets. However, their scope has mostly remained restricted to quantitative results on the test data, thus ignoring various key aspects related to dataset properties and evaluation metrics that inherently affect the performance to a considerable extent. In this paper, first we evaluate ten state-of-the-art (both deep-learning based as well as non-deep-learning based) approaches for image annotation using the same baseline CNN features. Then we propose new quantitative measures to examine various issues/aspects in the image annotation domain, such as dataset specific biases, per-label versus per-image evaluation criteria, and the impact of changing the number and type of predicted labels. We believe the conclusions derived in this paper through thorough empirical analyzes would be helpful in making systematic advancements in this domain.

17 citations

Journal ArticleDOI
TL;DR: A novel and generic approach for cross-modal search based on Structural SVM based unified framework that provides max-margin guarantees and better generalization than competing methods is proposed.

16 citations


Cites background or methods from "Generating Image Descriptions Using..."

  • ...Most of these works first try to predict the visual content of an image using some off-the-shelf computer vision technique (such as pre-trained object detectors and/or scene classifiers [12, 13], feature-based similarity with database images [10, 11], or 215...

  • ...While most of the earlier as well as recent research has focused on automatically annotating images using semantic labels [1, 2, 3, 4, 5, 6, 7], in the past few years, describing images using phrases [8, 9, 10, 11], or one or more simple captions [9, 10, 11, 12, 13, 14, 15, 16] have attained significant attention....

  • ...Captions For captions, we consider two types of evaluation metrics that have been adopted by (1) image caption generation methods (such as [9, 10, 11, 12, 13]), and (2) image-caption retrieval 730...

References
Journal ArticleDOI
TL;DR: The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization and that modeling a holistic representation of the scene informs about its probable semantic category.
Abstract: In this paper, we propose a computational model of the recognition of real world scenes that bypasses the segmentation and the processing of individual objects or regions. The procedure is based on a very low dimensional representation of the scene, that we term the Spatial Envelope. We propose a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the dominant spatial structure of a scene. Then, we show that these dimensions may be reliably estimated using spectral and coarsely localized information. The model generates a multidimensional space in which scenes sharing membership in semantic categories (e.g., streets, highways, coasts) are projected closed together. The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization and that modeling a holistic representation of the scene informs about its probable semantic category.

6,882 citations


"Generating Image Descriptions Using..." refers methods in this paper

  • ...For image representation, we use a set of colour (RGB and HSV), texture (Gabor and Haar), scene (GIST [16]) and shape (SIFT [14]) descriptors computed globally....

Proceedings Article
05 Dec 2005
TL;DR: In this article, a Mahanalobis distance metric for k-NN classification is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin.
Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification.

4,433 citations

Journal ArticleDOI
TL;DR: This paper shows how to learn a Mahalanobis distance metric for kNN classification from labeled examples in a globally integrated manner and finds that metrics trained in this way lead to significant improvements in kNN Classification.
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner.

4,157 citations


"Generating Image Descriptions Using..." refers methods in this paper

  • ..., the weights wi’s and smoothing parameters μi’s), an objective function analogus to [23] is used....

  • ...This is a generic formulation and can be used/extended to other scenarios (such as metric learning in nearest-neighbour based methods [23]) where structured prediction needs to be performed using some nearest-neighbour based model....

01 Aug 1997
TL;DR: This paper presents a new approach for measuring semantic similarity/distance between words and concepts that combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the semantic space constructed by the taxonomy can be better quantified with the computational evidence derived from a distributional analysis of corpus data.
Abstract: This paper presents a new approach for measuring semantic similarity/distance between words and concepts. It combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the semantic space constructed by the taxonomy can be better quantified with the computational evidence derived from a distributional analysis of corpus data. Specifically, the proposed measure is a combined approach that inherits the edge-based approach of the edge counting scheme, which is then enhanced by the node-based approach of the information content calculation. When tested on a common data set of word pair similarity ratings, the proposed approach outperforms other computational models. It gives the highest correlation value (r = 0.828) with a benchmark based on human similarity judgements, whereas an upper bound (r = 0.885) is observed when human subjects replicate the same task.

3,061 citations


"Generating Image Descriptions Using..." refers methods in this paper

  • ...We use WordNet based JCN simiarity measure [7] to compute semantic simiarity between the words a1 and a2(3)....

Proceedings ArticleDOI
27 May 2003
TL;DR: The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprising well with human evaluations, based on various statistical metrics; while direct application of the BLEU evaluation procedure does not always give good results.
Abstract: Following the recent adoption by the machine translation community of automatic evaluation using the BLEU/NIST scoring process, we conduct an in-depth study of a similar idea for evaluating summaries. The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprising well with human evaluations, based on various statistical metrics; while direct application of the BLEU evaluation procedure does not always give good results.

1,644 citations
