Generating Image Descriptions Using Semantic Similarities in the Output Space
Summary
1. Introduction
- With the explosive growth of digital photographs on the Internet as well as in personal collections, there has been a parallel growth in the number of images accompanied by relevant, more or less structured captions.
- Thus, if an image's ground-truth contains a phrase such as "kid", it would not be justifiable to treat the absent phrases "child" and "building" as equally irrelevant.
- First, the authors modify their model for predicting a phrase given an image.
- This is a generic formulation and can be used/extended to other scenarios (such as metric learning in nearest-neighbour based methods [23]) where structured prediction needs to be performed using some nearest-neighbour based model.
- Since their model relies on consideration of semantics among phrases during prediction, the authors call it “semantic phrase prediction model” (or SPPM).
3. Phrase Prediction Model
- Given images and corresponding descriptions, a set of phrases Y is extracted using all the descriptions.
- These phrases are restricted to five different types (considering "subject" and "object" as equivalent for practical purposes): (object), (attribute, object), (object, verb), (verb, prep, object), and (object, prep, object).
- The motivation behind using Google counts of phrases is to smooth their relative frequencies.
- In order to learn the two sets of parameters (i.e., the weights wi's and smoothing parameters μi's), an objective function analogous to [23] is used.
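Below is a minimal sketch of how the smoothed phrase probability underlying this model (cf. equation 4, also quoted in the FAQ at the end) might be computed. The argument names and the reading of Ni as a (Google-count-smoothed) phrase frequency are assumptions made for illustration, not the authors' exact implementation.

```python
def phrase_probability(in_ground_truth: bool, n_i: float, n_total: float, mu_i: float) -> float:
    """Smoothed probability P_Y(y_i | J) of seeing phrase y_i given image J (cf. equation 4).

    in_ground_truth -- delta_{y_i,J}: True if y_i occurs in the ground-truth phrases of J
    n_i             -- frequency of phrase y_i (e.g. smoothed using Google counts)
    n_total         -- total frequency mass N over all phrases
    mu_i            -- per-phrase smoothing parameter, learned together with the weights w_i
    """
    delta = 1.0 if in_ground_truth else 0.0
    return (mu_i * delta + n_i) / (mu_i + n_total)
```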
4. Semantic Phrase Prediction Model
- In the original PPM, all phrases absent from the ground-truth are treated alike, which results in penalizing semantically similar phrases (e.g. "person" vs. "man") as heavily as unrelated ones.
- Here the authors extend this model by considering semantic similarities among phrases.
- To begin with, the authors discuss how to compute semantic similarities.
4.1. Computing Semantic Similarities
- The authors use the WordNet-based JCN similarity measure [7] to compute the semantic similarity between two words a1 and a2; given a pair of words, JCN returns a score in the range [0, ∞), with a higher score corresponding to greater similarity (a small sketch of computing such word-level similarities is given after this list).
- WordNet is a large lexical database of English where words are interlinked in a hierarchy based on their semantic and lexical relationships.
- It should be noted that the authors cannot compute semantic similarity between two prepositions using WordNet.
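The paper computes JCN similarities with a WordNet::Similarity package; as a stand-in, the same idea can be sketched with NLTK's WordNet interface. The use of NLTK, the Brown-corpus information-content file, and the max-over-synsets strategy are assumptions for illustration, not the authors' exact procedure.

```python
# Requires: nltk.download('wordnet'); nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # information-content counts from the Brown corpus

def word_similarity(w1: str, w2: str, pos: str = wn.NOUN) -> float:
    """Best JCN score over all synset pairs of the two words; 0.0 if no pair is comparable."""
    best = 0.0
    for s1 in wn.synsets(w1, pos=pos):
        for s2 in wn.synsets(w2, pos=pos):
            try:
                best = max(best, s1.jcn_similarity(s2, brown_ic))
            except Exception:
                pass  # JCN is undefined for some synset pairs
    return best

print(word_similarity("person", "man"))      # relatively large score
print(word_similarity("child", "building"))  # much smaller score
```

Note that prepositions have no WordNet hierarchy, which is why the authors cannot compute a WordNet-based similarity between two prepositions.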
4.2. SPPM
- Such a definition allows us to take into account the structure/semantic inter-dependence among phrases while predicting the relevance of a phrase.
- Since the authors have modified the conditional probability model for predicting a phrase given an image, they also need to update the objective function of equation 5 accordingly.
- The implication of Δ(·) in equation 11 is that if two phrases are semantically similar (e.g. "kid" and "child"), then the penalty should be small, and vice-versa (see the sketch after this list).
- This objective function looks similar to the one used in [22] for metric learning in the nearest-neighbour scenario.
- The major difference is that there the objective function is defined over samples, and the penalty is based on the semantic similarity between two samples (proportional to the number of labels they share).
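A small sketch of how a phrase-level similarity and the corresponding penalty Δ(·) could look. The phrase similarity averages word-level similarities over aligned components (as in the FAQ example further below); the exact mapping from similarity to penalty used in equation 11 is not reproduced here, so the `1 - similarity` form is only illustrative and assumes a word similarity normalized to [0, 1].

```python
from typing import Callable, Tuple

def phrase_similarity(p1: Tuple[str, ...], p2: Tuple[str, ...],
                      wsim: Callable[[str, str], float]) -> float:
    """Average word-level similarity over aligned components of two phrases of the
    same type, e.g. ("person", "walk") vs. ("boy", "run") for (object, verb) phrases."""
    assert len(p1) == len(p2), "phrases must be of the same type"
    return sum(wsim(a, b) for a, b in zip(p1, p2)) / len(p1)

def delta_penalty(p1: Tuple[str, ...], p2: Tuple[str, ...],
                  wsim: Callable[[str, str], float]) -> float:
    """Illustrative penalty in the spirit of equation 11: semantically similar phrases
    ("kid" vs. "child") incur a small penalty, dissimilar ones a large one."""
    return 1.0 - min(1.0, phrase_similarity(p1, p2, wsim))
```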
5.1. Experimental Details
- The authors follow the same experimental set-up as in [6], and use UIUC PASCAL sentence dataset [19] for evaluation.
- It has 1,000 images and each image is described using 5 independent sentences.
- These sentences are used to extract different types of phrases using "collapsed-ccprocessed-dependencies" in the Stanford CoreNLP toolkit [1], giving 12,865 distinct phrases.
- All features other than GIST are also computed over three equal horizontal and vertical partitions [10].
- While computing distance between two images (equation 1), L1 distance is used for colour, L2 for scene and texture, and χ2 for shape features.
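A sketch of the combined image distance described above (cf. equation 1). The feature names, dictionary layout, and the weighted-sum combination are illustrative assumptions; only the per-feature metrics (L1 for colour, L2 for scene and texture, χ2 for shape) follow the text.

```python
import numpy as np

def chi2_distance(x: np.ndarray, y: np.ndarray, eps: float = 1e-10) -> float:
    """Chi-squared distance between two histograms."""
    return float(0.5 * np.sum((x - y) ** 2 / (x + y + eps)))

def image_distance(f1: dict, f2: dict, weights: dict) -> float:
    """Weighted combination of per-feature distances between two images.
    Feature names and the weighted sum are assumptions; per-feature metrics
    follow the paper: L1 for colour, L2 for scene/texture, chi2 for shape."""
    total = 0.0
    for name, w in weights.items():
        a = np.asarray(f1[name], dtype=float)
        b = np.asarray(f2[name], dtype=float)
        if name in ("rgb", "hsv"):               # colour histograms: L1
            total += w * float(np.sum(np.abs(a - b)))
        elif name in ("gist", "gabor", "haar"):  # scene and texture: L2
            total += w * float(np.linalg.norm(a - b))
        elif name == "sift":                     # shape histograms: chi-squared
            total += w * chi2_distance(a, b)
    return total
```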
5.2.2 Human Evaluation
- Automatically describing an image is significantly different from machine translation or summary generation.
- Table 1 lists the BLEU-1 and Rouge-1 scores of each approach, including BabyTalk [8]; a higher score means better performance.
- Thus, it would not be justifiable to rely just on automatic evaluation, and hence the need for human evaluation arises.
- Grammatical correctness of a generated description is measured by asking humans to give one of the following ratings: (1) Terrible, (2) Mostly comprehensible with some errors, (3) Mostly perfect English sentence.
- The authors also try to analyze the relative relevance of descriptions generated using PPM and SPPM.
5.3.1 Quantitative Results
- Table 1 shows the results corresponding to automatic evaluations.
- One important point the authors make is that it is not fully justifiable to directly compare their results with those of [8] and [24].
- This is because the data (i.e., the fixed sets of objects, prepositions and verbs) that those methods use for composing new sentences is very different from that used by the authors.
- In [6], it was shown that when the same data is used, PPM performs better than both of these.
- In conclusion, their results are directly comparable only with PPM [6].
5.3.2 Qualitative Results
- Human evaluation results corresponding to “Readability” and “Relevance” are shown in Table 2.
- This is because SPPM takes into account semantic similarities among the phrases, which in turn results in generating more coherent descriptions than PPM.
- For this, the authors show the top ten phrases of the type "object" predicted using the two models for an example image (example description: "A groom is posing with a scraggly person.").
- This is because in SPPM, the relevance (or presence) of a phrase also depends on the presence of other phrases that are semantically similar to it.
6. Conclusion
- The authors have presented an extension to PPM [6] by incorporating semantic similarities among phrases during phrase prediction and parameter learning steps.
- As the number of phrases increases, inter-phrase relationships become more prominent.
- Due to the "long-tail" phenomenon, the available data alone might not be sufficient to learn such complex relationships, and thus the need arises to bring in knowledge from other sources.
- The authors have tried to perform this using WordNet.
- To the best of their knowledge, this is the first attempt of its kind in this domain, and it can be integrated with other similar models as well.
Frequently Asked Questions (14)
Q2. How many synonyms are used in WordNet?
In order to consider synonyms, WordNet synsets are used to expand each noun up to 3 hyponym levels, resulting in a reduced set of 10,429 phrases.
Q3. What is the final output of the phrase prediction model?
The final output is a set of objects, their attributes and a preposition for each pair of objects, which are then mapped to a sentence using a simple template-based approach.
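The paper's exact templates are not reproduced in this summary, so the following is only a hypothetical illustration of how one predicted triple could be mapped to a sentence; the example words (including the attribute "young") are illustrative.

```python
def compose_sentence(attr1: str, obj1: str, verb_ing: str, prep: str,
                     attr2: str, obj2: str) -> str:
    """Hypothetical template-based surface realization of one predicted triple."""
    article = "An" if attr1[:1].lower() in "aeiou" else "A"
    return f"{article} {attr1} {obj1} is {verb_ing} {prep} a {attr2} {obj2}."

# e.g. -> "A young groom is posing with a scraggly person."
print(compose_sentence("young", "groom", "posing", "with", "scraggly", "person"))
```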
Q4. What is the underlying hypothesis of this model?
The underlying hypothesis of this model is that an image inherits the phrases that are present in the ground-truth of its visually similar images.
Q5. Why does SPPM consistently perform better than PPM?
This is because SPPM takes into account semantic similarities among the phrases, which in turn results in generating more coherent descriptions than PPM.
Q6. What is the semantic representation of the visual knowledge?
The visual knowledge is represented using a parse graph which associates objects with WordNet synsets to acquire categorical relationships.
Q7. What is the purpose of the nearest neighbor based model?
This model utilizes image descriptions at hand to learn different language constructs and constraints practiced by humans, and associates this information with visual properties of an image.
Q8. Why is the need of bringing in knowledge from other sources?
Due to the "long-tail" phenomenon, the available data alone might not be sufficient to learn such complex relationships, and thus the need arises to bring in knowledge from other sources.
Q9. What is the similarity measure used to compute the semantic similarities between the words?
The authors use the WordNet-based JCN similarity measure [7]; WordNet is a large lexical database of English where words are interlinked in a hierarchy based on their semantic and lexical relationships.
Q10. Why are the authors not comparing with other works?
The authors do not compare with other works because, since this is an emerging domain, different works have used either different evaluation measures (such as [2]), different experimental set-ups (such as [15]), or even different datasets (such as [9, 17]).
Q11. What are the two ways to perform this?
They discuss two ways to perform this: (i) using global image features to find similar images, and (ii) using detectors to re-rank the descriptions obtained after the first step.
Q12. What is the probability of seeing the phrase yi given image?
PY(yi|J) denotes the probability of seeing the phrase yi given image J, and is defined according to [4] as PY(yi|J) = (μi·δyi,J + Ni) / (μi + N) (equation 4), where δyi,J = 1 if yi ∈ YJ and 0 otherwise.
Q13. What is the semantic similarity score between two phrases?
If the authors have two phrases v1=("person", "walk") and v2=("boy", "run") of the type (object, verb), then their semantic similarity score will be given by Vsim(v1, v2) = 0.5 ∗ (Wsim("person","boy") + Wsim("walk","run")).
Q14. What is the similarity measure used to compute the ranked list of phrases?
Using equation 2, a ranked list of phrases is obtained, which are then integrated to produce triples of the form {((attribute1, object1), verb), (verb, prep, (attribute2, object2)), (object1, prep, object2)}.