YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition
Citations
4,206 citations
3,935 citations
Cites methods from "YouTube2Text: Recognizing and Descr..."
...[11, 30, 17, 3, 6, 17, 41, 42] propose methods for generating sentence descriptions for video, but to our knowledge we present the first application of deep models to the video description task....
[...]
3,513 citations
Cites methods from "YouTube2Text: Recognizing and Descr..."
...Related to VQA are the tasks of image tagging [8, 23], image captioning [24, 14, 34, 7, 13, 45, 9, 19, 32, 21] and video captioning [39, 17], where words or sentences are generated to describe visual content....
[...]
2,365 citations
Cites methods from "YouTube2Text: Recognizing and Descr..."
...Related to VQA are the tasks of image tagging [8, 23], image captioning [24, 14, 34, 7, 13, 45, 9, 19, 32, 21] and video captioning [39, 17], where words or sentences are generated to describe visual content....
[...]
1,945 citations
References
40,826 citations
"YouTube2Text: Recognizing and Descr..." refers to methods in this paper
...[Table 2: Comparison of WUP Similarity] ...which combines the outputs of the SVMs according to the WordNet hierarchy and chooses the appropriate level of generalization by setting the accuracy to some prespecified value (0.9 in our experiments)....
[...]
...Then we combine activity and object descriptors using a multi-channel approach [31] and pass it to a non-linear SVM [3] (see Section 7)....
[...]
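The WUP (Wu-Palmer) similarity named in the Table 2 excerpt above scores two concepts by the depth of their deepest shared ancestor in a hierarchy: 2 * depth(lcs) / (depth(a) + depth(b)). The following is a minimal sketch of that standard formula only, not the paper's evaluation code. The toy hierarchy `PARENT`, the helper names, and the convention that the root has depth 1 are all assumptions made for illustration; the paper uses WordNet.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy (child -> parent); stands in for WordNet.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "vehicle": "entity",
    "car": "vehicle",
}


def path_to_root(node: str) -> List[str]:
    """Return the chain [node, parent, ..., root]."""
    path: List[str] = []
    cur: Optional[str] = node
    while cur is not None:
        path.append(cur)
        cur = PARENT[cur]
    return path


def depth(node: str) -> int:
    """Depth of a node, counting the root as depth 1 (assumed convention)."""
    return len(path_to_root(node))


def wup_similarity(a: str, b: str) -> float:
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # Walking up from a, the first node that is also an ancestor of b
    # is the deepest common ancestor (least common subsumer).
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

In this toy tree, `wup_similarity("dog", "cat")` is 2/3 because the two share `animal` at depth 2, while `wup_similarity("dog", "car")` is 1/3 because they share only the root.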
15,935 citations
"YouTube2Text: Recognizing and Descr..." refers to methods in this paper
...Subsets of this data were previously used by [20] and [14], however, contrary to these works, we use all the videos, and not only the 20 objects included in the PASCAL dataset [9]....
[...]
10,501 citations
"YouTube2Text: Recognizing and Descr..." refers to background or methods in this paper
...Second, for each video we extract 2 frames per second, and for each frame, apply the object detectors proposed in [11] and [18], and select the maximum score assigned to each object in any frame....
[...]
...The flat (FL) baseline predicts the most confident output for each SVM trained over the whole set of labels without any hierarchy or grouping....
[...]
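The frame-level step quoted above (sample 2 frames per second, run object detectors on each frame, keep the maximum score per object over all frames) can be sketched as follows. This is an illustrative sketch only: the detectors themselves are not implemented, and the function names and the dictionary layout of per-frame scores are assumptions, not the paper's code.

```python
from typing import Dict, List


def sample_times(duration_s: float, fps: float = 2.0) -> List[float]:
    """Timestamps in seconds at which frames would be sampled."""
    n_frames = int(duration_s * fps)
    return [i / fps for i in range(n_frames)]


def max_pool_detections(frame_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """For each object label, keep the highest score seen in any frame.

    frame_scores holds one dict per sampled frame, mapping an object
    label to the detector score for that frame (hypothetical layout).
    """
    pooled: Dict[str, float] = {}
    for scores in frame_scores:
        for label, score in scores.items():
            if label not in pooled or score > pooled[label]:
                pooled[label] = score
    return pooled
```

Max pooling keeps the video-level descriptor at a fixed size (one value per object) regardless of video length, at the cost of discarding when in the video the object appeared.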
4,584 citations
"YouTube2Text: Recognizing and Descr..." refers to methods in this paper
...This idea can be used to expand “coarse” activity detections, obtained by training classifiers on available (possibly limited) activity training data, with “finer” activities unseen at training time....
[...]
3,501 citations