Movie Description
Citations
Bulut Tabanlı Bilgisayarlı Görü Kullanılarak Sesli Betimleme Sistem Tasarımı (Audio Description System Design Using Cloud-Based Computer Vision)
Video Caption Dataset for Describing Human Actions in Japanese
More Than Reading Comprehension: A Survey on Datasets and Metrics of Textual Question Answering
Conversational AI Systems for Social Good: Opportunities and Challenges
The Role of the Input in Natural Language Video Description
Frequently Asked Questions (12)
Q2. What is the future work in "Movie description"?
In future work, movie description approaches should aim to achieve rich yet correct and fluent descriptions. Beyond their current challenge on single sentences, the dataset opens new possibilities for understanding stories and plots across multiple sentences in an open-domain scenario on a large scale. Their evaluation server will continue to be available for automatic evaluation.
Q3. What are the frequent verbs in the dataset?
The most frequent verbs are “look up” and “nod”, which are also frequent in the dataset and in the sentences produced by SMT-Best.
Q4. What is the main challenge in the construction of a video annotation dataset?
One of the main challenges in automating the construction of a video annotation dataset derived from AD audio is accurately segmenting the AD output, which is mixed with the original movie soundtrack.
Q5. What are the evaluation measures used for the semantic parsing pipeline?
The automatic evaluation measures include BLEU-1, -2, -3, and -4 (Papineni et al. 2002), METEOR (Denkowski and Lavie 2014), ROUGE-L (Lin 2004), and CIDEr (Vedantam et al. 2015).
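These scores are typically computed with the standard COCO caption evaluation toolkit. As a minimal, self-contained illustration only, the sketch below computes BLEU-1 through BLEU-4 for one candidate sentence with NLTK as a stand-in for that toolkit; the reference and candidate sentences are hypothetical.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference (ground-truth AD) and candidate (system output) sentences.
reference = "someone looks up and nods at the bartender".split()
candidate = "someone looks up and nods".split()

smooth = SmoothingFunction().method1  # avoids zero scores when higher-order n-grams are missing
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights over 1..n-grams gives BLEU-n
    score = sentence_bleu([reference], candidate, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```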
Q6. How do the authors decompose the sentences in a movie?
The authors start by decomposing the typically long sentences present in movie descriptions into smaller clauses using the ClausIE tool (Del Corro and Gemulla 2013).
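ClausIE itself is a Java tool, so the sketch below is only a rough, hypothetical approximation of this clause-splitting step using spaCy's dependency parse; the heuristic of one clause per verbal head is an assumption, not the authors' pipeline.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # requires the small English model to be installed
CLAUSE_DEPS = {"ROOT", "conj", "advcl", "ccomp", "xcomp", "relcl"}

def split_clauses(sentence):
    """Rough heuristic: emit one clause per verbal head, minus any nested clauses."""
    doc = nlp(sentence)
    heads = [t for t in doc if t.pos_ in ("VERB", "AUX") and t.dep_ in CLAUSE_DEPS]
    clauses = []
    for head in heads:
        nested = [h for h in heads if h is not head and h in head.subtree]
        tokens = [t for t in head.subtree if not any(t in n.subtree for n in nested)]
        clauses.append(" ".join(t.text for t in tokens))
    return clauses

print(split_clauses("Someone enters the room and sits down while the others keep talking."))
```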
Q7. How does the evaluation measure measure semantic content?
The authors also use the recently proposed evaluation measure SPICE (Anderson et al. 2016), which aims to compare the semantic content of two descriptions, by matching the information contained in dependency parse trees for both descriptions.
Q8. How do the authors add 2s to the end of each video clip?
In order to compensate for the potential 1–2s misalignment between the AD narrator speaking and the corresponding scene in the movie, the authors automatically add 2s to the end of each video clip.
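As an illustrative sketch of this padding step (file names and timestamps are hypothetical), the end time of each clip can simply be extended by 2s before cutting it out of the movie, e.g. with ffmpeg:

```python
import subprocess

PAD_SECONDS = 2.0  # compensate for the AD narrator lagging slightly behind the scene

def cut_clip(movie_path, start, end, out_path, pad=PAD_SECONDS):
    """Cut the interval [start, end + pad] (in seconds) out of the movie via ffmpeg."""
    cmd = [
        "ffmpeg", "-y",
        "-i", movie_path,
        "-ss", str(start),
        "-to", str(end + pad),  # the 2s padding is applied here
        "-c", "copy",
        out_path,
    ]
    subprocess.run(cmd, check=True)

# Hypothetical AD segment: narration spoken from 612.4s to 618.9s.
cut_clip("movie.mp4", 612.4, 618.9, "clip_0001.mp4")
```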
Q9. What are the properties of a video description dataset?
The authors look at the following properties: availability of multi-sentence descriptions (long videos described continuously with multiple sentences), data domain, source of descriptions and dataset size.
Q10. What is the way to improve the visual representation of video?
Ballas et al. (2016) leverage multiple convolutional maps from different CNN layers to improve the visual representation for activity and video description.
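As a hedged sketch of the general idea (not Ballas et al.'s actual architecture), feature maps from several depths of a CNN can be collected with forward hooks on a torchvision ResNet; the layer names and input size below are illustrative assumptions.

```python
import torch
import torchvision.models as models

model = models.resnet50().eval()  # randomly initialized here; pretrained weights would be used in practice
feature_maps = {}

def save_output(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

# Tap convolutional maps at three different depths of the network.
for name in ("layer2", "layer3", "layer4"):
    getattr(model, name).register_forward_hook(save_output(name))

frame = torch.randn(1, 3, 224, 224)  # one dummy video frame
with torch.no_grad():
    model(frame)

for name, fmap in feature_maps.items():
    print(name, tuple(fmap.shape))  # e.g. layer2 -> (1, 512, 28, 28)
```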
Q11. What is the LSTM used to encode the video?
This submission uses an encoder–decoder framework with two LSTMs: one LSTM encodes the frame sequence of the video and another decodes it into a sentence.
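A minimal PyTorch sketch of such an encoder–decoder is shown below; all dimensions, names, and the teacher-forcing setup are illustrative assumptions, not the submission's actual model.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Encoder LSTM over per-frame CNN features, decoder LSTM over caption tokens."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, caption_tokens):
        # frame_feats: (batch, n_frames, feat_dim); caption_tokens: (batch, seq_len)
        _, video_state = self.encoder(frame_feats)      # summarize the frame sequence
        dec_in = self.embed(caption_tokens)             # embed the (teacher-forced) words
        dec_out, _ = self.decoder(dec_in, video_state)  # decode conditioned on the video
        return self.to_vocab(dec_out)                   # (batch, seq_len, vocab_size) logits

model = VideoCaptioner()
logits = model(torch.randn(2, 40, 2048), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```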
Q12. What is the method used to align scripts to subtitles?
Then the authors use the dynamic programming method of Laptev et al. (2008) to align scripts to subtitles and infer the time-stamps for the description sentences.
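As an illustrative sketch of script-to-subtitle alignment (a generic monotonic dynamic-programming alignment over word overlap, not Laptev et al.'s exact formulation; the example sentences are hypothetical):

```python
def similarity(a, b):
    """Word-overlap (Jaccard) similarity between a script sentence and a subtitle line."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def align(script_sents, subtitle_lines, gap_penalty=-0.1):
    """Needleman-Wunsch-style dynamic programming returning matched index pairs."""
    n, m = len(script_sents), len(subtitle_lines)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * gap_penalty, "up"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * gap_penalty, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (score[i - 1][j - 1] + similarity(script_sents[i - 1], subtitle_lines[j - 1]), "diag"),
                (score[i - 1][j] + gap_penalty, "up"),
                (score[i][j - 1] + gap_penalty, "left"),
            ]
            score[i][j], back[i][j] = max(choices)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:  # trace back the matched (script, subtitle) pairs
        move = back[i][j]
        if move == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "up":
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

print(align(["He looks up and nods."], ["he nods slowly", "the door opens"]))
```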