Generating Image Descriptions Using Semantic Similarities in the Output Space
Citations
Generating diagnostic report for medical image by high-middle-level visual information incorporation on double deep learning models.
Scene graph generation by multi-level semantic tasks
On the use of commonsense ontology for multimedia event recounting
CIC Chinese Image Captioning Based on Image Label Information
Fast RF-UIC: A fast unsupervised image captioning model
References
Distinctive Image Features from Scale-Invariant Keypoints
Bleu: a Method for Automatic Evaluation of Machine Translation
WordNet : an electronic lexical database
Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories
Related Papers (5)
Frequently Asked Questions (14)
Q2. How many synonyms are used in WordNet?
In order to consider synonyms, WordNet synsets are used to expand each noun up to 3 hyponym levels, resulting in a reduced set of 10,429 phrases.
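The bounded hyponym expansion can be sketched as a breadth-first walk that stops after three levels. The toy hierarchy below is a hypothetical stand-in for WordNet's synset links; the real model would query WordNet itself:

```python
from typing import Dict, List, Set

# Hypothetical mini-hierarchy standing in for WordNet hyponym links.
HYPONYMS: Dict[str, List[str]] = {
    "animal": ["dog", "cat"],
    "dog": ["puppy"],
    "puppy": ["newborn_puppy"],
    "newborn_puppy": ["tiny_newborn_puppy"],  # level 4: beyond the cutoff
}

def expand(noun: str, max_levels: int = 3) -> Set[str]:
    """Collect a noun together with its hyponyms up to max_levels deep."""
    result = {noun}
    frontier = [noun]
    for _ in range(max_levels):
        next_frontier = []
        for word in frontier:
            for hypo in HYPONYMS.get(word, []):
                if hypo not in result:
                    result.add(hypo)
                    next_frontier.append(hypo)
        frontier = next_frontier
    return result

print(sorted(expand("animal")))
# -> ['animal', 'cat', 'dog', 'newborn_puppy', 'puppy']
```

Note that the level-4 hyponym is excluded, matching the 3-level cutoff described above.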
Q3. What is the final output of the phrase prediction model?
The final output is a set of objects, their attributes and a preposition for each pair of objects, which are then mapped to a sentence using a simple template-based approach.
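A template-based mapping of this kind might look like the following sketch. The exact template wording is an assumption; the paper only states that the mapping is template-based:

```python
def to_sentence(triple):
    """Fill a fixed sentence template from a predicted tuple of
    (attribute1, object1), verb, preposition, (attribute2, object2).
    The template wording here is illustrative, not the paper's exact one."""
    (attr1, obj1), verb, prep, (attr2, obj2) = triple
    return f"The {attr1} {obj1} is {verb} {prep} the {attr2} {obj2}."

print(to_sentence((("young", "boy"), "running", "near", ("red", "car"))))
# -> The young boy is running near the red car.
```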
Q4. What is the underlying hypothesis of this model?
The underlying hypothesis of this model is that an image inherits the phrases that are present in the ground-truth of its visually similar images.
Q5. Why does SPPM consistently perform better than PPM?
This is because SPPM takes into account semantic similarities among the phrases, which in turn results in generating more coherent descriptions than PPM.
Q6. What is the semantic representation of the visual knowledge?
The visual knowledge is represented using a parse graph which associates objects with WordNet synsets to acquire categorical relationships.
Q7. What is the purpose of the nearest neighbor based model?
This model utilizes image descriptions at hand to learn different language constructs and constraints practiced by humans, and associates this information with visual properties of an image.
Q8. Why is the need of bringing in knowledge from other sources?
Due to the phenomenon of "long-tail", the available data alone might not be sufficient to learn such complex relationships, and thus arises the need of bringing in knowledge from other sources.
Q9. What is the similarity measure used to compute the semantic similarities between the words?
The word-level similarities are computed using WordNet, a large lexical database of English in which words are interlinked in a hierarchy based on their semantic and lexical relationships.
Q10. Why are the authors not comparing with other works?
The authors are not comparing with other works because this is an emerging domain: different works have used different evaluation measures (such as [2]), experimental set-ups (such as [15]), or even datasets (such as [9, 17]).
Q11. What are the two ways to perform this?
They discuss two ways to perform this: (i) using global image features to find similar images, and (ii) using detectors to re-rank the descriptions obtained after the first step.
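Step (i) can be sketched as a nearest-neighbor search over global feature vectors. The toy vectors, file names, and cosine similarity below are illustrative; the paper's actual global image features are not specified here:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def nearest_images(query, gallery, k=2):
    """Rank gallery images by global-feature similarity to the query
    and keep the top k (step (i) of the retrieval pipeline)."""
    return sorted(gallery, key=lambda item: cosine(query, item[1]),
                  reverse=True)[:k]

# Hypothetical gallery of (name, global feature vector) pairs.
gallery = [
    ("beach.jpg", [1.0, 0.0]),
    ("forest.jpg", [0.0, 1.0]),
    ("coast.jpg", [0.9, 0.1]),
]
print([name for name, _ in nearest_images([1.0, 0.0], gallery)])
# -> ['beach.jpg', 'coast.jpg']
```

Step (ii), re-ranking with object detectors, would then re-score the descriptions attached to these retrieved neighbors.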
Q12. What is the probability of seeing the phrase yi given image?
P_Y(y_i | J) denotes the probability of seeing the phrase y_i given image J, and is defined according to [4] as equation (4): P_Y(y_i | J) = (μ_i · δ_{y_i,J} + N_i) / (μ_i + N). Here, δ_{y_i,J} = 1 if y_i ∈ Y_J and 0 otherwise.
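This smoothed estimate can be sketched directly. The interpretation of N_i as the count of phrase y_i and N as a total count, and the role of μ_i as a smoothing weight, are assumptions about the notation of [4]:

```python
def p_phrase_given_image(phrase, neighbor_phrases, mu, n_i, n_total):
    """Smoothed probability of a phrase given image J, per equation (4):
    (mu * delta + N_i) / (mu + N), where delta is 1 when the phrase occurs
    in Y_J, the phrase set of J's visually similar images.
    The meaning of N_i (count of this phrase) and N (total count) is an
    assumption about the cited paper's notation."""
    delta = 1.0 if phrase in neighbor_phrases else 0.0
    return (mu * delta + n_i) / (mu + n_total)

# Phrase present among the neighbors' phrases:
print(p_phrase_given_image("dog run", {"dog run", "cat sit"},
                           mu=10.0, n_i=5.0, n_total=100.0))
# (10 * 1 + 5) / (10 + 100) = 15 / 110
```

When the phrase is absent (δ = 0), the estimate falls back to the smoothed background frequency N_i / (μ_i + N).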
Q13. What is the semantic similarity score between two phrases?
If the authors have two phrases v1 = ("person", "walk") and v2 = ("boy", "run") of the type (object, verb), then their semantic similarity score is given by Vsim(v1, v2) = 0.5 · (Wsim("person", "boy") + Wsim("walk", "run")).
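This position-wise averaging can be sketched as follows. The word-similarity table is a hypothetical stand-in; the paper computes Wsim from WordNet:

```python
# Illustrative word-similarity values standing in for the
# WordNet-based Wsim used in the paper.
WSIM = {
    frozenset(["person", "boy"]): 0.8,
    frozenset(["walk", "run"]): 0.6,
}

def wsim(w1, w2):
    """Word-level similarity: 1 for identical words, table lookup otherwise."""
    return 1.0 if w1 == w2 else WSIM.get(frozenset([w1, w2]), 0.0)

def vsim(v1, v2):
    """Phrase-level similarity of two same-type phrases: the average of
    word-level similarities at matching positions."""
    return sum(wsim(a, b) for a, b in zip(v1, v2)) / len(v1)

print(vsim(("person", "walk"), ("boy", "run")))  # 0.5 * (0.8 + 0.6) ≈ 0.7
```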
Q14. What is the similarity measure used to compute the ranked list of phrases?
Using equation 2, a ranked list of phrases is obtained, which are then integrated to produce triples of the form {((attribute1, object1), verb), (verb, prep, (attribute2, object2)), (object1, prep, object2)}.
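The assembly of the three triple forms can be sketched as below. Building them from single top-ranked components is a simplification; the paper integrates full ranked phrase lists:

```python
def build_triples(attr1, obj1, verb, prep, attr2, obj2):
    """Assemble the three triples described in the text:
    ((attribute1, object1), verb), (verb, prep, (attribute2, object2)),
    and (object1, prep, object2). Using one top-ranked value per slot is
    a simplification of the paper's integration over ranked lists."""
    return [
        ((attr1, obj1), verb),
        (verb, prep, (attr2, obj2)),
        (obj1, prep, obj2),
    ]

print(build_triples("young", "boy", "running", "near", "red", "car"))
```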