To account for synonyms, WordNet synsets are used to expand each noun up to three hyponym levels, resulting in a reduced set of 10,429 phrases.
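The hyponym expansion can be sketched as a bounded breadth-first walk down the hyponym hierarchy. The taxonomy below is a toy stand-in, not real WordNet data (the paper uses WordNet synsets, e.g. via a library such as NLTK):

```python
# Toy parent -> direct-hyponym table standing in for WordNet (assumption).
TAXONOMY = {
    "animal": ["dog", "cat"],
    "dog": ["puppy", "corgi"],
    "corgi": ["cardigan"],
}

def expand(noun, taxonomy, max_depth=3):
    """Collect all hyponyms of `noun` reachable within `max_depth` levels."""
    result, frontier = set(), [noun]
    for _ in range(max_depth):
        # Replace the frontier with the next level of hyponyms.
        frontier = [h for n in frontier for h in taxonomy.get(n, [])]
        result.update(frontier)
    return result

print(sorted(expand("animal", TAXONOMY)))
# "cardigan" is only reached because the walk descends three levels.
```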
The final output is a set of objects, their attributes, and a preposition for each pair of objects, which are then mapped to a sentence using a simple template-based approach.
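A minimal sketch of that template step, assuming a hypothetical sentence pattern (the paper's exact template wording is not given here):

```python
def to_sentence(attr_obj1, prep, attr_obj2):
    """Map two (attribute, object) pairs and their preposition to a sentence.
    The 'The {attr} {obj} is {prep} the {attr} {obj}.' pattern is an
    illustrative assumption, not the paper's actual template."""
    a1, o1 = attr_obj1
    a2, o2 = attr_obj2
    return f"The {a1} {o1} is {prep} the {a2} {o2}."

print(to_sentence(("brown", "dog"), "near", ("red", "car")))
# -> "The brown dog is near the red car."
```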
The underlying hypothesis of this model is that an image inherits the phrases that are present in the ground-truth of its visually similar images.
This is because SPPM takes semantic similarities among the phrases into account, which in turn results in more coherent descriptions than PPM.
The visual knowledge is represented using a parse graph which associates objects with WordNet synsets to acquire categorical relationships.
This model uses the image descriptions at hand to learn the different language constructs and constraints employed by humans, and associates this information with the visual properties of an image.
Due to the "long tail" phenomenon, the available data alone might not be sufficient to learn such complex relationships, hence the need to bring in knowledge from other sources.
WordNet is a large lexical database of English where words are interlinked in a hierarchy based on their semantic and lexical relationships.
The authors do not compare with other works because, in this emerging domain, different works have used different evaluation measures (e.g., [2]), experimental set-ups (e.g., [15]), or even datasets (e.g., [9, 17]).
They discuss two ways to perform this: (i) using global image features to find similar images, and (ii) using detectors to re-rank the descriptions obtained after the first step.
P_Y(y_i | J) denotes the probability of seeing the phrase y_i given image J, and is defined according to [4]:

    P_Y(y_i | J) = (μ_i δ_{y_i,J} + N_i) / (μ_i + N).   (4)

Here, δ_{y_i,J} = 1 if y_i ∈ Y_J, and 0 otherwise.
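The estimate in Eq. (4) can be written directly as a function. Interpreting N_i as the phrase-specific count, N as the total count, and μ_i as a smoothing parameter is an assumption based on the form of the equation:

```python
def phrase_prob(yi, YJ, mu_i, N_i, N):
    """Smoothed probability of phrase `yi` given an image whose
    ground-truth phrase set is `YJ`, following the form of Eq. (4):
    (mu_i * delta + N_i) / (mu_i + N)."""
    delta = 1.0 if yi in YJ else 0.0  # delta_{y_i,J}
    return (mu_i * delta + N_i) / (mu_i + N)

# e.g. when the phrase occurs in Y_J with mu_i=2, N_i=3, N=10:
# (2*1 + 3) / (2 + 10) = 5/12
```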
For example, given two phrases v1 = ("person", "walk") and v2 = ("boy", "run") of the type (object, verb), their semantic similarity score is Vsim(v1, v2) = 0.5 · (Wsim("person", "boy") + Wsim("walk", "run")).
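That phrase-level score is just the average of the word-level similarity Wsim over aligned slots of two same-type phrases. The lookup-table Wsim below is a toy stand-in; the paper derives Wsim from WordNet:

```python
def vsim(p1, p2, wsim):
    """Average the word-level similarity over aligned slots of two
    same-length phrase tuples (e.g. (object, verb))."""
    assert len(p1) == len(p2)
    return sum(wsim(a, b) for a, b in zip(p1, p2)) / len(p1)

# Toy word-similarity table (assumed values, not real WordNet scores).
W = {("person", "boy"): 0.8, ("walk", "run"): 0.6}
wsim = lambda a, b: W.get((a, b), 0.0)

print(round(vsim(("person", "walk"), ("boy", "run"), wsim), 3))  # 0.7
```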
Using equation 2, a ranked list of phrases is obtained, which are then integrated to produce triples of the form {((attribute1, object1), verb), (verb, prep, (attribute2, object2)), (object1, prep, object2)}.