Generating Image Descriptions Using Semantic Similarities in the Output Space
Citations
A survey on deep neural network-based image captioning
Im2Text and Text2Im: Associating Images and Texts for Cross-Modal Retrieval
Common Subspace for Model and Similarity: Phrase Learning for Caption Generation from Images
Automatic image annotation: the quirks and what works
A support vector approach for cross-modal search of images and texts
References
Collecting Image Annotations Using Amazon's Mechanical Turk
Baby talk: Understanding and generating simple image descriptions
Recognition using visual phrases
Midge: Generating Image Descriptions From Computer Vision Detections
Corpus-Guided Sentence Generation of Natural Images
Frequently Asked Questions (14)
Q2. How many synonyms are used in WordNet?
In order to consider synonyms, WordNet synsets are used to expand each noun up to 3 hyponym levels, resulting in a reduced set of 10,429 phrases.
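A minimal sketch of this expansion step using NLTK's WordNet interface; the traversal details (such as collecting lemma names at every level) are assumptions for illustration rather than the paper's exact procedure:

```python
# Sketch of the synset-expansion idea using NLTK's WordNet interface.
# Requires: pip install nltk; then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def expand_noun(noun, max_levels=3):
    """Collect lemma names of `noun` and of its hyponyms up to `max_levels` deep."""
    expanded = set()
    frontier = wn.synsets(noun, pos=wn.NOUN)
    for _ in range(max_levels + 1):  # the word itself, then 3 hyponym levels
        next_frontier = []
        for syn in frontier:
            expanded.update(lemma.name() for lemma in syn.lemmas())
            next_frontier.extend(syn.hyponyms())
        frontier = next_frontier
    return expanded

print(len(expand_noun("dog")))  # size of the expanded set for "dog"
```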
Q3. What is the final output of the phrase prediction model?
The final output is a set of objects, their attributes and a preposition for each pair of objects, which are then mapped to a sentence using a simple template-based approach.
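A toy sketch of what such a final template step could look like; the sentence template below is an illustrative assumption, not the paper's actual template:

```python
# Toy template-based realization; the template wording itself is an
# assumption for illustration, not the paper's exact template.
def realize(attr1, obj1, prep, attr2, obj2):
    return f"The {attr1} {obj1} is {prep} the {attr2} {obj2}."

print(realize("brown", "dog", "near", "green", "grass"))
# -> "The brown dog is near the green grass."
```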
Q4. What is the underlying hypothesis of this model?
The underlying hypothesis of this model is that an image inherits the phrases that are present in the ground-truth of its visually similar images.
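A rough sketch of what this hypothesis implies computationally; the feature representation, the Euclidean distance, and K are placeholder choices, not the paper's:

```python
# Sketch: a query image "inherits" the phrases attached to its K most
# visually similar training images. Euclidean distance and k=5 are
# placeholder choices, not the paper's.
import numpy as np

def inherited_phrases(query_feat, train_feats, train_phrases, k=5):
    """train_feats: (N, D) feature matrix; train_phrases: list of N phrase sets."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    phrases = set()
    for idx in np.argsort(dists)[:k]:
        phrases.update(train_phrases[idx])
    return phrases
```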
Q5. Why does SPPM consistently perform better than PPM?
This is because SPPM takes into account semantic similarities among the phrases, which in turn results in generating more coherent descriptions than PPM.
Q6. What is the semantic representation of the visual knowledge?
The visual knowledge is represented using a parse graph which associates objects with WordNet synsets to acquire categorical relationships.
Q7. What is the purpose of the nearest neighbor based model?
This model utilizes image descriptions at hand to learn different language constructs and constraints practiced by humans, and associates this information with visual properties of an image.
Q8. Why is there a need to bring in knowledge from other sources?
Due to the "long-tail" phenomenon, the available data alone might not be sufficient to learn such complex relationships, which creates the need to bring in knowledge from other sources.
Q9. What is the similarity measure used to compute the semantic similarities between the words?
A WordNet-based measure is used: WordNet is a large lexical database of English in which words are interlinked in a hierarchy based on their semantic and lexical relationships.
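As an illustration, one standard WordNet-based word similarity can be computed with NLTK; Wu-Palmer similarity is used here as an example, since the answer above does not pin down the paper's exact measure:

```python
# Example word similarity over WordNet using NLTK's Wu-Palmer measure,
# chosen here for illustration, not necessarily the paper's measure.
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def Wsim(w1, w2):
    """Best Wu-Palmer similarity over all synset pairs of the two words."""
    scores = [a.wup_similarity(b) or 0.0
              for a in wn.synsets(w1) for b in wn.synsets(w2)]
    return max(scores, default=0.0)

print(Wsim("person", "boy"))  # a score in [0, 1]; higher means more similar
```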
Q10. Why are the authors not comparing with other works?
The authors do not compare with other works because this is an emerging domain: different works have used different evaluation measures (such as [2]), experimental set-ups (such as [15]), or even datasets (such as [9, 17]).
Q11. What are the two ways to perform this?
They discuss two ways to perform this: (i) using global image features to find similar images, and (ii) using detectors to re-rank the descriptions obtained after the first step.
Q12. What is the probability of seeing the phrase yi given image?
P_Y(y_i | J) denotes the probability of seeing the phrase y_i given image J, and is defined according to [4]:

$$P_Y(y_i \mid J) = \frac{\mu_i \, \delta_{y_i,J} + N_i}{\mu_i + N} \quad (4)$$

Here, if y_i ∈ Y_J, then δ_{y_i,J} = 1, and 0 otherwise.
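A direct transcription of Eq. (4) as reconstructed above; reading N_i as the number of training images whose ground truth contains y_i, N as the total number of training images, and μ_i as a smoothing weight is my interpretation of the symbols, which the answer does not restate:

```python
# Eq. (4) as reconstructed above. The readings of N_i (training count of
# phrase y_i), N (total training images), and mu_i (smoothing weight)
# are assumptions; the answer above does not define them.
def p_phrase_given_image(phrase_in_Y_J, mu_i, N_i, N):
    delta = 1.0 if phrase_in_Y_J else 0.0
    return (mu_i * delta + N_i) / (mu_i + N)

print(p_phrase_given_image(True, 5.0, 40, 1000))   # (5 + 40) / 1005 ≈ 0.0448
print(p_phrase_given_image(False, 5.0, 40, 1000))  # 40 / 1005 ≈ 0.0398
```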
Q13. What is the semantic similarity score between two phrases?
If we have two phrases v1 = ("person", "walk") and v2 = ("boy", "run") of the type (object, verb), then their semantic similarity score is given by Vsim(v1, v2) = 0.5 * (Wsim("person", "boy") + Wsim("walk", "run")).
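The same computation as a snippet, averaging the word-level similarities of corresponding slots (which matches the 0.5-weighted sum above); it reuses the illustrative Wsim() from the sketch under Q9:

```python
# Phrase-level similarity: average the word-level similarities of
# corresponding slots, matching the 0.5-weighted sum in the example.
# Reuses the illustrative Wsim() defined in the sketch under Q9.
def Vsim(v1, v2):
    assert len(v1) == len(v2)
    return sum(Wsim(a, b) for a, b in zip(v1, v2)) / len(v1)

print(Vsim(("person", "walk"), ("boy", "run")))
```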
Q14. What is the similarity measure used to compute the ranked list of phrases?
Using Equation 2, a ranked list of phrases is obtained; these phrases are then integrated to produce triples of the form {((attribute1, object1), verb), (verb, prep, (attribute2, object2)), (object1, prep, object2)}.
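A toy sketch of this integration step; the data layout for the ranked lists is an assumption, and agreement of the shared verb and objects across the three phrases is ignored for brevity:

```python
# Toy integration of top-ranked phrases into one triple of the form
# shown above; the dictionary layout is assumed for illustration, and
# consistency of the shared verb/objects across phrases is not enforced.
ranked = {
    "attr_obj_verb":      [(("brown", "dog"), "run")],
    "verb_prep_attr_obj": [("run", "on", ("green", "grass"))],
    "obj_prep_obj":       [("dog", "on", "grass")],
}
triple = (ranked["attr_obj_verb"][0],
          ranked["verb_prep_attr_obj"][0],
          ranked["obj_prep_obj"][0])
print(triple)
```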