Journal ArticleDOI

Movie Description

TL;DR: The Large Scale Movie Description Challenge (LSMDC), as discussed by the authors, is a dataset of 128,118 sentences aligned to video clips from 200 movies (around 150 h of video in total).
Abstract: Audio description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. We introduce the Large Scale Movie Description Challenge (LSMDC) which contains a parallel corpus of 128,118 sentences aligned to video clips from 200 movies (around 150 h of video in total). The goal of the challenge is to automatically generate descriptions for the movie clips. First we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in the challenges organized in the context of two workshops at ICCV 2015 and ECCV 2016.

Summary (10 min read)

Jump to: [1 Introduction][2 Related Work][2.1 Image Description][2.2 Video Description][2.3 Movie Scripts and Audio Descriptions][2.4 Works Building on Our Dataset][3 Datasets for Movie Description][3.1.1 Collection of ADs][3.1.2 Collection of Script Data][3.1.3 Manual Sentence-Video Alignment][3.1.4 Visual Features][3.2 The Montreal Video Annotation Dataset (M-VAD)][3.2.1 Collection of ADs][3.2.2 AD Narrations Segmentation Using Vocal Isolation][3.2.3 Movie/AD Alignment and Professional Transcription][3.3 The Large Scale Movie Description Challenge (LSMDC)][3.4 Movie Description Dataset Statistics][3.5 Comparison to Other Video Description Datasets][4 Approaches for Movie Description][4.1.1 Semantic Parsing][4.1.2 SMT][4.2.1 Robust Visual Classifiers][4.2.2 LSTM for Sentence Generation][5 Evaluation on MPII-MD and M-VAD][5.1 Comparison of AD Versus Script Data][5.2 Semantic Parser Evaluation][5.3.1 Automatic Metrics][5.3.2 Human Evaluation][5.4 Movie Description Evaluation][5.4.3 Comparison to Related Work][5.5 Movie Description Analysis][5.5.1 Difficulty Versus Performance][5.5.2 Semantic Analysis][6 The Large Scale Movie Description Challenge][6.1 LSMDC Participants][6.1.1 LSMDC 15 Submissions][6.1.2 LSMDC 16 Submissions][6.2.1 Automatic Evaluation][6.2.2 Human Evaluation][6.3 LSMDC Qualitative Results] and [7 Conclusion]

1 Introduction

  • Audio descriptions (ADs) make movies accessible to millions of blind or visually impaired people.
  • The combination of large datasets and convolutional neural networks (CNNs) has been particularly potent (Krizhevsky et al. 2012).
  • AD narrations are carefully positioned within movies to fit in the natural pauses in the dialogue and are mixed with the original movie soundtrack by professional post-production.
  • As a first study on their dataset the authors benchmark several approaches for movie description.
  • Their Visual-Labels approach first builds robust visual classifiers which distinguish verbs, objects, and places extracted from weak sentence annotations.

2.1 Image Description

  • Much of the recent work has relied on Recurrent Neural Networks (RNNs) and in particular on long short-term memory (LSTM) networks.
  • New datasets have been released, such as the Flickr30k (Young et al. 2014) and MS COCO Captions (Chen et al. 2015), where Chen et al. (2015) also presents a standardized protocol for image captioning evaluation.
  • Other work has analyzed the performance of recent methods, e.g. Devlin et al. (2015) compare them with respect to the novelty of generated descriptions, while also exploring a nearest neighbor baseline that improves over recent methods.

2.2 Video Description

  • In the past video description has been addressed in controlled settings (Barbu et al. 2012; Kojima et al. 2002), on a small scale (Das et al. 2013; Guadarrama et al. 2013; Thomason et al. 2014), or in single domains like cooking (Rohrbach et al. 2014, 2013; Donahue et al. 2015).
  • Donahue et al. (2015) first proposed to describe videos using an LSTM, relying on precomputed CRF scores from Rohrbach et al. (2014).
  • To handle the challenging scenario of movie description, Yao et al. (2015) propose a soft-attention based model which selects the most relevant temporal segments in a video, incorporates 3-D CNN and generates a sentence using an LSTM.
  • Venugopalan et al. (2016) explore the benefit of pre-trained word embeddings and language models for generation on large external text corpora.
  • Shetty and Laaksonen (2015) use dense trajectory features (Wang et al. 2013) extracted for the clips and CNN features extracted at center frames of the clip.

2.3 Movie Scripts and Audio Descriptions

  • Movie scripts have been used for automatic discovery and annotation of scenes and human actions in videos (Duchenne et al. 2009; Laptev et al. 2008; Marszalek et al. 2009). Bojanowski et al. (2013) approach the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts.
  • They rely on the semantic parser SEMAFOR (Das et al. 2012) trained on the FrameNet database (Baker et al. 1998), however, they limit the recognition only to two frames.
  • ADs have also been used to understand which characters interact with each other (Salway et al. 2007).
  • Their corpus is based on the original sources used to create the ADs and contains different kinds of artifacts not present in the actual descriptions, such as dialogs and production notes.

2.4 Works Building on Our Dataset

  • Interestingly, other works, datasets, and challenges are already building upon their data.
  • Zhu et al. (2015b) learn a visual-semantic embedding from their clips and ADs to relate movies to books.
  • Bruni et al. (2016) also learn a joint embedding of videos and descriptions and use this representation to improve activity recognition on the Hollywood 2 dataset Marszalek et al. (2009).
  • Tapaswi et al. (2016) use their AD transcripts for building their MovieQA dataset, which asks natural language questions about movies, requiring an understanding of visual and textual information, such as dialogue and AD, to answer the question.
  • Zhu et al. (2015a) present a fill-in-the-blank challenge for audio description of the current, previous, and next sentence description for a given clip, requiring to understand the temporal context of the clips.

3 Datasets for Movie Description

  • In the following, the authors present how they collect their data for movie description and discuss its properties.
  • The Large Scale Movie Description Challenge is based on two datasets which were originally collected independently.
  • The MPII Movie Description dataset (MPII-MD) consists of AD and script data and uses sentence-level manual alignment of transcribed audio to the actions in the video (Sect. 3.1).
  • M-VAD was collected with DVD data quality and only relies on AD.
  • The challenge includes a submission server for evaluation on public and blind test sets.

3.1.1 Collection of ADs

  • The authors search for Blu-ray movies with ADs in the “Audio Description” section of the British Amazon and select 55 movies of diverse genres (e.g. drama, comedy, action).
  • Then the authors semi-automatically segment out the sections of the AD audio (which is mixed with the original audio stream) with the approach described below.
  • The audio segments are then transcribed by a crowd-sourced transcription service that also provides the time-stamps for each spoken sentence.
  • The precise alignment is important to compute the similarity of both streams.
  • The authors smooth this decision over time using a minimum segment length of 1 s (a minimal smoothing sketch follows this list).
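
The segmentation decision above (AD present vs. absent at each audio frame) can be smoothed with a simple run-length filter. The sketch below is a minimal illustration of such smoothing with a 1 s minimum segment length; the frame-wise detection itself and the helper names are assumptions, not the authors' exact implementation.

```python
import numpy as np

def smooth_ad_mask(ad_active, hop_s, min_len_s=1.0):
    """Remove AD/non-AD runs shorter than min_len_s from a boolean per-frame
    mask (hypothetical helper; the paper only states that the decision is
    smoothed with a minimum segment length of 1 s).
    ad_active: 1-D numpy bool array, hop_s: seconds per frame."""
    min_frames = int(round(min_len_s / hop_s))
    mask = np.asarray(ad_active, dtype=bool).copy()
    # run boundaries: indices where the mask value changes
    change = np.flatnonzero(np.diff(mask.astype(int))) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(mask)]))
    for s, e in zip(starts, ends):
        if e - s < min_frames:
            # absorb runs that are too short into the surrounding label
            mask[s:e] = not mask[s]
    return mask
```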

3.1.2 Collection of Script Data

  • In addition to the ADs the authors mine script web resources (http://www.weeklyscript.com, http://www.simplyscripts.com, http://www.dailyscript.com, http://www.imsdb.com) and select 39 movie scripts.
  • As starting point the authors use the movie scripts from “Hollywood2” (Marszalek et al. 2009) that have highest alignment scores to their movie.
  • The authors found that the “overlap” is quite narrow, so they analyze 11 such movies in their dataset.
  • Following existing approaches, the authors use the dynamic programming method of Laptev et al. (2008) to align scripts to subtitles and infer the time-stamps for the description sentences (a simplified alignment sketch follows this list).
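
The following is a simplified, hedged sketch of monotonic script-to-subtitle alignment by dynamic programming. It uses plain word overlap as the similarity and a one-subtitle-per-sentence matching scheme, which is a stand-in for, not a reproduction of, the method of Laptev et al. (2008).

```python
import numpy as np

def word_overlap(a, b):
    """Simple bag-of-words similarity between two text snippets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, min(len(wa), len(wb)))

def align_script_to_subtitles(script_sents, subtitles):
    """Monotonic alignment of script sentences to time-stamped subtitles via
    dynamic programming (a simplified stand-in for Laptev et al. 2008).
    subtitles: list of (start_s, end_s, text) tuples."""
    n, m = len(script_sents), len(subtitles)
    sim = np.array([[word_overlap(s, sub[2]) for sub in subtitles]
                    for s in script_sents])
    dp = np.full((n + 1, m + 1), -np.inf)
    dp[0, :] = 0.0
    back = np.zeros((n + 1, m + 1), dtype=int)   # 0: skip subtitle, 1: match
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            skip = dp[i, j - 1]
            match = dp[i - 1, j - 1] + sim[i - 1, j - 1]
            if match >= skip:
                dp[i, j], back[i, j] = match, 1
            else:
                dp[i, j], back[i, j] = skip, 0
    # backtrack: each aligned sentence inherits its subtitle's time-stamps
    alignment, i, j = {}, n, m
    while i > 0 and j > 0:
        if back[i, j] == 1:
            alignment[i - 1] = (subtitles[j - 1][0], subtitles[j - 1][1])
            i, j = i - 1, j - 1
        else:
            j -= 1
    return alignment
```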

3.1.3 Manual Sentence-Video Alignment

  • As the AD is added to the original audio stream between the dialogs, there might be a small misalignment between the time of speech and the corresponding visual content.
  • During the manual alignment the authors also filter out: (a) sentences describing the movie introduction/ending (production logo, cast, etc.); (b) texts read from the screen; (c) irrelevant sentences describing something not present in the video; (d) sentences related to audio/sounds/music.
  • For the movie scripts, the reduction in number of words is about 19%, while for ADs it is under 4%.
  • In the case of ADs, filtering mainly happens due to initial/ending movie intervals and transcribed dialogs (when shown as text).
  • If the manually aligned video clip is shorter than 2 s, the authors symmetrically expand it (from beginning and end) to be exactly 2 s long (a small sketch follows this list).
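
A minimal sketch of the symmetric clip expansion mentioned above; clamping at the movie boundaries is an added assumption, since the text only states that clips are expanded from the beginning and end to exactly 2 s.

```python
def expand_clip(start_s, end_s, movie_len_s, min_len_s=2.0):
    """Symmetrically expand a clip shorter than min_len_s to exactly
    min_len_s, clamped to the movie boundaries (boundary handling is an
    assumption; the paper only states symmetric expansion)."""
    length = end_s - start_s
    if length >= min_len_s:
        return start_s, end_s
    pad = (min_len_s - length) / 2.0
    new_start = max(0.0, start_s - pad)
    new_end = min(movie_len_s, new_start + min_len_s)
    new_start = max(0.0, new_end - min_len_s)
    return new_start, new_end
```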

3.1.4 Visual Features

  • The authors extract video clips from the full movie based on the aligned sentence intervals.
  • As discussed earlier, ADs and scripts describe activities, objects and scenes (as well as emotions, which the authors do not explicitly handle with these features, but which might still be captured, e.g. by the context or activities).
  • For each feature (Trajectory, HOG, HOF, MBH) the authors create a codebook with 4,000 clusters and compute the corresponding histograms (a bag-of-words sketch follows this list).
  • Finally, the authors use the recent scene classification CNNs (Zhou et al. 2014) featuring 205 scene classes.
  • The authors mean-pool these CNN responses over the frames of each video clip, using the result as a feature.
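
The sketch below illustrates the two feature encodings described above: a 4,000-word bag-of-words codebook over dense trajectory descriptors and mean-pooled per-frame CNN scores. The use of scikit-learn's k-means and the function names are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(descriptors, k=4000, seed=0):
    """descriptors: (N, D) array of trajectory/HOG/HOF/MBH descriptors
    sampled from the training clips."""
    km = MiniBatchKMeans(n_clusters=k, random_state=seed, batch_size=10000)
    km.fit(descriptors)
    return km

def bow_histogram(codebook, clip_descriptors):
    """L1-normalized histogram of codeword assignments for one clip."""
    words = codebook.predict(clip_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(1.0, hist.sum())

def mean_pool_cnn(frame_features):
    """Mean-pool per-frame CNN features (e.g. scene-CNN scores) over a clip."""
    return np.asarray(frame_features).mean(axis=0)
```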

3.2 The Montreal Video Annotation Dataset (M-VAD)

  • One of the main challenges in automating the construction of a video annotation dataset derived from AD audio is accurately segmenting the AD output, which is mixed with the original movie soundtrack.
  • In Sect. 3.1.1 the authors have introduced a way of semi-automatic AD segmentation.
  • In this section the authors describe a fully automatic method for AD narration isolation and video alignment.
  • When a scene changes rapidly, the narrator will speak multiple sentences without pauses.
  • Such content should be kept together. (Dataset webpage: mpii.de/movie-description.)

3.2.1 Collection of ADs

  • To search for movies with AD the authors use the movie lists provided on the “An Initiative of the American Council of the Blind” and “Media Access Group at WGBH” websites, and buy the movies based on their availability and price.
  • To extract video and audio from the DVDs the authors use the DVDfab software.

3.2.2 AD Narrations Segmentation Using Vocal Isolation

  • Creating a completely automated approach for extracting the relevant narration or annotation from the audio track and refining the alignment of the annotation with the video still poses some challenges.
  • Vocal isolation techniques boost vocals, including dialogues and AD narrations while suppressing background movie sound in stereo signals.
  • The authors align the movie and AD audio signals by taking an FFT of the two audio signals, computing the cross-correlation, measuring similarity for different offsets and selecting the offset which corresponds to the peak cross-correlation (a minimal sketch follows this list).
  • Even in cases where the shapes of the standard movie audio signal and the movie audio mixed with AD are very different (due to the AD mixing process), this procedure is sufficient for the automatic segmentation of AD narration.
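
A minimal sketch of the FFT-based cross-correlation alignment described above: compute the spectra of the two audio tracks, multiply one with the conjugate of the other, invert, and take the lag with peak correlation. Windowing, normalization and resampling details are not specified in the text and are omitted here.

```python
import numpy as np

def estimate_offset(movie_audio, ad_audio, sr):
    """Estimate the time offset between the original movie audio track and
    the AD-mixed track via FFT-based cross-correlation (a minimal sketch of
    the idea in Sect. 3.2.2)."""
    n_fft = 1 << (len(movie_audio) + len(ad_audio) - 1).bit_length()
    X = np.fft.rfft(movie_audio, n_fft)
    Y = np.fft.rfft(ad_audio, n_fft)
    xcorr = np.fft.irfft(X * np.conj(Y), n_fft)   # circular cross-correlation
    lag = int(np.argmax(xcorr))
    if lag > n_fft // 2:                          # map wrap-around to a negative lag
        lag -= n_fft
    return lag / float(sr)                        # offset in seconds
```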

3.2.3 Movie/AD Alignment and Professional Transcription

  • AD audio narration segments are time-stamped based on their automatic AD narration segmentation.
  • In order to compensate for the potential 1–2 s misalignment between the AD narrator speaking and the corresponding scene in the movie, the authors automatically add 2 s to the end of each video clip.
  • Also the authors discard all the transcriptions related to movie introduction/ending which are located at the beginning and the end of movies.
  • In order to obtain high quality text descriptions, the AD audio segments were transcribed with more than 98% transcription accuracy, using a professional transcription service.
  • The authors' audio narration isolation technique allows them to process the audio into small, well-defined time segments and reduce the overall transcription effort and cost.

3.3 The Large Scale Movie Description Challenge (LSMDC)

  • To build their Large Scale Movie Description Challenge (LSMDC), the authors combine the M-VAD and MPII-MD datasets.
  • The authors first identify the overlap between the two, so that the same movie does not appear in the training and test set of the joined dataset.
  • The authors also exclude script-based movie alignments from the validation and test sets of MPII-MD.
  • The authors provide more information about the challenge setup and results in Sect. 6.
  • There is a movie annotation track which asks to select the correct sentence out of five in a multiple-choice test, a retrieval track which asks to retrieve the correct test clip for a given sentence, and a fill-in-the-blank track which requires to predict a missing word in a given description and the corresponding clip.

3.4 Movie Description Dataset Statistics

  • Table 1 presents statistics for the number of words, sentences and clips in their movie description corpora.
  • The authors also report the average/total length of the annotated time intervals.
  • The combined LSMDC 2015 dataset contains over 118K sentence-clip pairs and 158 h of video.
  • This split balances movie genres within each set, which is motivated by the fact that the vocabulary used to describe, say, an action movie could be very different from the vocabulary used in a comedy movie.
  • To compute the part-of-speech statistics for their corpora, the authors tag and stem all words in the datasets with the Stanford Part-Of-Speech (POS) tagger and stemmer toolbox (Toutanova et al.); a rough sketch of such statistics follows this list.
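
A rough sketch of how such part-of-speech and stem statistics can be computed. NLTK is used here as a convenient stand-in for the Stanford POS tagger and stemmer toolbox that the authors actually used, so the resulting numbers would only approximate the reported statistics.

```python
from collections import Counter

import nltk
from nltk.stem import PorterStemmer

# Resource names vary slightly across NLTK versions; adjust if needed.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def pos_and_stem_stats(sentences):
    """Count coarse POS tags and the number of distinct word stems."""
    stemmer = PorterStemmer()
    pos_counts, stem_vocab = Counter(), set()
    for sent in sentences:
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sent)):
            pos_counts[tag[:2]] += 1          # collapse e.g. VBZ/VBD -> VB
            stem_vocab.add(stemmer.stem(word.lower()))
    return pos_counts, len(stem_vocab)

counts, vocab_size = pos_and_stem_stats(["Someone sets down his young daughter."])
print(counts.most_common(5), vocab_size)
```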

3.5 Comparison to Other Video Description Datasets

  • The authors compare their corpus to other existing parallel video corpora in Table 3.
  • The authors look at the following properties: availability of multi-sentence descriptions (long videos described continuously with multiple sentences), data domain, source of descriptions and dataset size.
  • The main limitations of prior datasets include the coverage of a single domain (e.g. cooking, as in Das et al. 2013) and a limited number of video clips.
  • Similar to the MSVD dataset (Chen and Dolan 2011), MSR-VTT is based on YouTube clips.
  • TGIF is a large dataset of 100k image sequences (GIFs) with associated descriptions.

4 Approaches for Movie Description

  • Given a training corpus of aligned videos and sentences the authors want to describe a new unseen test video.
  • The authors' second approach (Sect. 4.2) learns to generate descriptions using a long short-term memory (LSTM) network.
  • While the first approach does not differentiate which features to use for different labels, their second approach defines different semantic groups of labels and uses most relevant visual features for each group.
  • Next, the first approach uses the classifier scores as input to a CRF to predict a semantic representation (SR) (SUBJECT, VERB, OBJECT, LOCATION), and then translates it into a sentence with SMT.
  • Figure 5 shows an overview of the two discussed approaches.

4.1.1 Semantic Parsing

  • Learning from a parallel corpus of videos and natural language sentences is challenging when no annotated intermediate representation is available.
  • The authors lift the words in a sentence to a semantic space of roles and WordNet (Fellbaum 1998) senses by performing SRL (Semantic Role Labeling) and WSD (Word Sense Disambiguation).
  • The authors start by decomposing the typically long sentences present in movie descriptions into smaller clauses using the ClausIE tool (Del Corro and Gemulla 2013).
  • For example, for the verb “shoot” in its killing sense (shoot-3), the role restriction is Agent.animate V Patient.animate PP Instrument.solid.
  • The authors ensure that the selected WordNet verb sense adheres to both the syntactic frame and the semantic role restriction provided by VerbNet.

4.1.2 SMT

  • For the sentence generation the authors build on the two-step translation approach of Rohrbach et al. (2013).
  • As the first step it learns a mapping from the visual input to the semantic representation (SR), modeling pairwise dependencies in a CRF using visual classifiers as unaries.
  • The unaries are trained using an SVM on dense trajectories (Wang and Schmid 2013).
  • In the second step it translates the SR to a sentence using Statistical Machine Translation (SMT) (Koehn et al. 2007).
  • For this the approach uses a concatenated SR as the input language, e.g. cut knife tomato, and a natural sentence as the output language, e.g. a sentence like “a person cuts a tomato with a knife”.

4.2.1 Robust Visual Classifiers

  • For training the authors rely on a parallel corpus of videos and weak sentence annotations.
  • To avoid losing the potential labels in these sentences, the authors match their set of initial labels to the sentences which the parser failed to process.
  • The labels are grouped into verbs, objects, and places, and each group is paired with the most relevant visual features.
  • Finally, the authors discard labels which the classifiers could not learn reliably, as these are likely noisy or not visual.
  • The authors estimate a threshold for the ROC values on a validation set (a filtering sketch follows this list).
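
A minimal sketch of the label-filtering step: each label's classifier is scored on a validation set and labels whose ROC performance falls below a threshold are discarded. The 0.7 AUC threshold and the dictionary-based data layout are illustrative assumptions; the text only states that the threshold is estimated on a validation set.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def filter_reliable_labels(val_scores, val_labels, threshold=0.7):
    """Keep only labels whose classifier reaches a minimum ROC AUC on the
    validation set.
    val_scores / val_labels: dicts label -> 1-D arrays of classifier scores
    and binary ground truth for the validation clips."""
    kept = []
    for label in val_scores:
        y_true, y_score = val_labels[label], val_scores[label]
        if len(np.unique(y_true)) < 2:      # AUC undefined for a single class
            continue
        if roc_auc_score(y_true, y_score) >= threshold:
            kept.append(label)
    return kept
```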

4.2.2 LSTM for Sentence Generation

  • The authors rely on the basic LSTM architecture proposed in Donahue et al. (2015) for video description.
  • The embedding is jointly learned during training of the LSTM.
  • The authors feed in the classifier scores as input to the LSTM which is equivalent to the best variant proposed in Donahue et al. (2015).
  • The authors compare a 1-layer architecture with a 2-layer architecture.
  • To learn a more robust network which is less likely to overfit, the authors rely on dropout (Hinton et al. 2012), i.e. a ratio r of randomly selected units is set to 0 during training, while the remaining units are rescaled so that the expected activation is preserved (a minimal generator sketch follows this list).
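
A minimal PyTorch sketch of an LSTM sentence generator conditioned on visual classifier scores, in the spirit of the setup described above. Layer sizes, the dropout rate, and the way the visual input is concatenated at every time step are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LabelsToSentenceLSTM(nn.Module):
    """Sketch of an LSTM sentence generator whose input at each step is the
    previous word embedding concatenated with the visual label scores."""
    def __init__(self, n_labels, vocab_size, embed_dim=512, hidden=512,
                 num_layers=1, dropout_p=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # learned jointly
        self.drop = nn.Dropout(dropout_p)
        self.lstm = nn.LSTM(embed_dim + n_labels, hidden,
                            num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, label_scores, word_ids):
        # label_scores: (B, n_labels); word_ids: (B, T) previous words
        emb = self.drop(self.embed(word_ids))               # (B, T, E)
        vis = label_scores.unsqueeze(1).expand(-1, word_ids.size(1), -1)
        h, _ = self.lstm(torch.cat([emb, vis], dim=-1))
        return self.out(self.drop(h))                       # (B, T, vocab)
```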

5 Evaluation on MPII-MD and M-VAD

  • In this section the authors evaluate and provide more insights about their movie description datasets MPII-MD and M-VAD.
  • The authors compare ADs to movie scripts (Sect. 5.1), present a short evaluation of their semantic parser (Sect. 5.2), present the automatic and human evaluation metrics for description (Sect. 5.3) and then benchmark the approaches to video description introduced in Sect. 4.
  • The authors conclude this section with an analysis of the different approaches (Sect. 5.5).
  • In Sect. 6 the authors will extend this discussion to the results of the Large Scale Movie Description Challenge.

5.1 Comparison of AD Versus Script Data

  • The authors compare the AD and script data using 11 movies from the MPII-MD dataset where both are available (see Sect. 3.1.2).
  • For these movies the authors select the overlapping time intervals with an intersection-over-union overlap of at least 75%, which results in 279 sentence pairs; they remove 2 pairs which have identical sentences (a temporal-IoU sketch follows this list).
  • Table 5 presents the results of this evaluation.
  • Looking at the more strict evaluation where at least 4 out of 5 judges agree (in brackets in Table 5) there is still a significant margin of 24.5% between ADs and movie scripts for Correctness, and 28.1% for Relevance.
  • This evaluation supports their intuition that scripts contain mistakes and irrelevant content even after being cleaned up and manually aligned.
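
A small sketch of how the overlapping AD/script sentence pairs can be selected with a temporal intersection-over-union threshold of 0.75. The all-pairs matching below is an assumption; the text does not spell out the exact pairing procedure.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two time intervals a=(start, end), b=(start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    if inter == 0.0:
        return 0.0
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union

def overlapping_pairs(ad_intervals, script_intervals, thresh=0.75):
    """Return index pairs of AD and script sentences whose intervals
    overlap with IoU >= thresh."""
    return [(i, j)
            for i, a in enumerate(ad_intervals)
            for j, b in enumerate(script_intervals)
            if temporal_iou(a, b) >= thresh]
```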

5.2 Semantic Parser Evaluation

  • The authors empirically evaluate the various components of the semantic parsing pipeline, namely, clause splitting , POS tagging and chunking (NLP), semantic role labeling , and, word sense disambiguation (WSD).
  • The authors randomly sample 101 sentences from the MPII-MD dataset over which they perform semantic parsing and log the outputs at various stages of the pipeline (similar to Table 4).
  • The authors let three human judges evaluate the results for every token in the clause (similar to evaluating every row in Table 4) with a correct/incorrect label.
  • It is evident that the poorest performing parts are the NLP and the WSD components.
  • Some of the NLP mistakes arise due to incorrect POS tagging.

5.3.1 Automatic Metrics

  • For automatic evaluation the authors rely on the MS COCO Caption Evaluation API (a usage sketch follows this list).
  • The authors also use the recently proposed evaluation measure SPICE (Anderson et al. 2016), which aims to compare the semantic content of two descriptions, by matching the information contained in dependency parse trees for both descriptions.
  • While the authors report all measures for the final evaluation in the LSMDC (Sect. 6), they focus their discussion on METEOR and CIDEr scores in the preliminary evaluations in this section.
  • According to Elliott and Keller (2013) and Vedantam et al. (2015), METEOR/CIDEr supersede previously used measures in terms of agreement with human judgments.
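
A usage sketch for the automatic metrics, assuming the publicly available pycocoevalcap port of the MS COCO Caption Evaluation API (module paths differ slightly between forks, and METEOR additionally requires Java). It is meant to show the expected input format rather than reproduce the official evaluation server.

```python
# Assumption: the pycocoevalcap package (a port of the coco-caption code)
# is installed; in the official pipeline a PTB tokenizer is applied first.
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

def score_captions(references, candidates):
    """references/candidates: dict clip_id -> list of sentences
    (exactly one generated sentence per clip for candidates)."""
    meteor, _ = Meteor().compute_score(references, candidates)
    cider, _ = Cider().compute_score(references, candidates)
    return {"METEOR": meteor, "CIDEr": cider}

refs = {"clip_0001": ["Someone gets in the basket."]}
hyps = {"clip_0001": ["A man climbs into a basket."]}
print(score_captions(refs, hyps))
```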

5.3.2 Human Evaluation

  • The AMT workers are given randomized sentences, and, in addition to some general instruction, the following definitions: Grammar “Rank grammatical correctness of sentences: Judge the fluency and readability of the sentence (independently of the correctness with respect to the video).”.
  • Correctness: “For which sentence is the content more correct with respect to the video (independent of whether it is complete, i.e. describes everything), independent of the grammatical correctness.”
  • In the LSMDC evaluation the authors introduce a new measure, which should capture how useful a description would be for blind people: “Rank the sentences according to how useful they would be for a blind person who would like to understand/follow the movie without seeing it.”

5.4 Movie Description Evaluation

  • As the collected text data comes from the movie context, it contains a lot of information specific to the plot, such as names of the characters.
  • The authors pre-process each sentence in the corpus, transforming the names to “Someone” or “people” (in case of plural).
  • The authors first analyze the performance of the proposed approaches on the MPII-MD dataset, and then evaluate the best version on the M-VAD dataset.
  • The other 83 movies are used for training.
  • On M-VAD the authors use 10 movies for testing, 10 for validation and 72 for training.

5.5 Movie Description Analysis

  • The performance on the movie description datasets (MPII-MD and M-VAD) remains rather low.
  • The authors compare three methods, SMT-Best, S2VT and Visual-Labels, in order to understand where these methods succeed and where they fail.
  • In the following the authors evaluate all three methods on the MPII-MD test set.

5.5.1 Difficulty Versus Performance

  • As a first study the authors sort the test reference sentences by difficulty, where difficulty is defined in multiple ways (two simple measures are sketched after this list).
  • Some of the intuitive sentence difficulty measures are its length and average frequency of its words.
  • Figure 8a shows the performance of the compared methods w.r.t. sentence length.
  • For the word frequency the correlation is even stronger, see Fig. 8b.
  • Visual-Labels consistently outperforms the other two methods, most notably as the difficulty increases.
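
The two difficulty measures mentioned above (sentence length and average word frequency) are straightforward to compute; the sketch below orders reference sentences from hard to easy under these proxies. The exact binning used in Fig. 8 is not reproduced here.

```python
from collections import Counter

def difficulty_ranking(reference_sentences, corpus_sentences):
    """Rank sentences by two simple difficulty proxies: average word
    frequency in the corpus (rarer words -> harder) and sentence length."""
    freq = Counter(w for s in corpus_sentences for w in s.lower().split())
    stats = []
    for s in reference_sentences:
        words = s.lower().split()
        avg_freq = sum(freq[w] for w in words) / max(1, len(words))
        stats.append({"sentence": s, "length": len(words),
                      "avg_word_freq": avg_freq})
    # harder sentences first: low average word frequency, then longer length
    return sorted(stats, key=lambda d: (d["avg_word_freq"], -d["length"]))
```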

5.5.2 Semantic Analysis

  • Next the authors analyze the test reference sentences w.r.t. verb semantics.
  • The most frequent Topics, “motion” and “contact”, which are also visual (e.g. “turn”, “walk”, “sit”), are nevertheless quite challenging, which the authors attribute to their high diversity (see their entropy w.r.t. different verbs and their frequencies in Table 14).
  • The authors look at 100 test reference sentences, where Visual-Labels obtains highest and lowest METEOR scores.
  • Among the worst 100 sentences the authors observe more diversity: 12 contain no verb, 10 mention unusual words (specific to the movie), 24 have no subject, and 29 have a non-human subject.
  • In summary: (a) the test reference sentences that mention verbs like “look” get higher scores due to their high frequency in the dataset.

6 The Large Scale Movie Description Challenge

  • The Large Scale Movie Description Challenge was held twice, first in conjunction with ICCV 2015 (LSMDC 15) and then at ECCV 2016 (LSMDC 16).
  • In the second phase of the challenge the participants were provided with the videos from the blind test set (without textual descriptions).
  • To measure performance of the competing approaches the authors performed both automatic and human evaluation.
  • The submission format was similar to the MS COCO Challenge (Chen et al. 2015) and the authors also used the identical automatic evaluation protocol.
  • In the following the authors review the participants and their results for both LSMDC 15 and LSMDC 16.

6.1 LSMDC Participants

  • The authors received 4 submissions to LSMDC 15, including their Visual-Labels approach.
  • The other submissions are S2VT (Venugopalan et al. 2015b), Temporal Attention (Yao et al. 2015) and Frame-Video-Concept Fusion (Shetty and Laaksonen 2015).
  • As the blind test set is unchanged between LSMDC 2015 and LSMDC 2016, the authors look at all the submitted results jointly.
  • In the following the authors summarize the submissions based on the (sometimes very limited) information provided by the authors.

6.1.1 LSMDC 15 Submissions

  • S2VT (Venugopalan et al. 2015b) Venugopalan et al. (2015b) propose S2VT, an encoder–decoder framework, where a single LSTM encodes the input video, frame by frame, and decodes it into a sentence.
  • The authors note that the results submitted to LSMDC were obtained with a different set of hyper-parameters than the results discussed in the previous section.
  • The hyper-parameters for the submission were chosen to optimize METEOR on the validation set, which resulted in significantly longer but also noisier sentences.
  • Frame-Video-Concept Fusion (Shetty and Laaksonen 2015) Shetty and Laaksonen (2015) evaluate diverse visual features as input for an LSTM generation framework.
  • Specifically they use dense trajectory features (Wang et al. 2013) extracted for the entire clip and VGG (Simonyan and Zisserman 2015) and GoogLeNet (Szegedy et al. 2015) CNN features extracted at the center frame of each clip.

6.1.2 LSMDC 16 Submissions

  • Tel Aviv University This submission retrieves a nearest neighbor from the training set, learning a unified space using Canonical Correlation Analysis (CCA) over textual and visual features (a retrieval sketch follows this list).
  • Aalto University (Shetty and Laaksonen 2016) Shetty and Laaksonen (2016) rely on an ensemble of four models which were trained on the MSR-VTT dataset (Xu et al. 2016) without additional training on the LSMDC dataset.
  • The four models were trained with different combinations of keyframe-based GoogLeNet features and segment-based dense trajectory and C3D features.
  • This work relies on temporal and attribute attention.
  • According to the authors, their VD-ivt model consists of three parallel channels: a basic video description channel, a sentence to sentence channel for language learning, and a channel to fuse visual and textual information.
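
A simplified sketch of CCA-based cross-modal retrieval in the spirit of the retrieval submission described above: project training sentences and a test clip into a shared space and return the most similar training sentence. The feature extraction, the number of components, and the use of scikit-learn's CCA are stand-ins for the participants' actual system.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.preprocessing import normalize

def build_cca_retrieval(train_visual, train_text, n_components=128):
    """Learn a joint visual/text space with CCA and project the training
    sentences into it."""
    cca = CCA(n_components=n_components, max_iter=500)
    cca.fit(train_visual, train_text)
    _, train_text_proj = cca.transform(train_visual, train_text)
    return cca, normalize(train_text_proj)

def retrieve_sentence(cca, train_text_proj, train_sentences, test_clip_feat):
    """Project the test clip into the shared space and return the training
    sentence with the highest cosine similarity."""
    clip_proj = normalize(cca.transform(test_clip_feat[None, :]))
    sims = clip_proj @ train_text_proj.T
    return train_sentences[int(np.argmax(sims))]
```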

6.2.1 Automatic Evaluation

  • The authors first look at the results of the automatic evaluation on the blind test set of LSMDC in Table 15.
  • One reason for the lower scores of Frame-Video-Concept Fusion and Temporal Attention appears to be the generated sentence length, which is much shorter than the reference sentences, as the authors discuss below (see also Table 16).
  • It takes a second place w.r.t. the CIDEr score, while not achieving particularly high scores in other measures.
  • In terms of vocabulary size all approaches fall far below the reference descriptions.
  • Looking at the LSMDC 16 submissions, it is not surprising that the Tel Aviv University retrieval approach achieves the highest diversity among all approaches.

6.2.2 Human Evaluation

  • The authors performed separate human evaluations for LSMDC 15 and LSMDC 16.
  • LSMDC 15 The results of the human evaluation are shown in Table 17.
  • As the authors have to compare more approaches, ranking all of them against each other becomes infeasible.
  • This leads us to the following evaluation protocol which is inspired by the human evaluation metric “M1” in the MS COCO Challenge (Chen et al. 2015).
  • Additionally the authors measure the correlation between the automatic and human evaluation in Fig. 10.

6.3 LSMDC Qualitative Results

  • Figure 11 shows qualitative results from the competing approaches submitted to LSMDC 15.
  • The first two examples are success cases, where most of the approaches are able to describe the video correctly.
  • The third example is an interesting case where visually relevant descriptions, provided by most approaches, do not match the reference description, which focuses on an action happening in the background of the scene (“Someone sets down his young daughter then moves to a small wooden table.”).
  • The last two rows contain partial and complete failures.
  • Tel Aviv University and Visual-Labels are able to capture important details, such as sipping a drink, which the other methods fail to recognize.

7 Conclusion

  • The authors presented a novel dataset of movies with aligned descriptions sourced from movie scripts and ADs (audio descriptions for the blind, also referred to as DVS).
  • The authors' approach to automatic movie description, Visual-Labels, trains visual classifiers and uses their scores as input to an LSTM.
  • When ranking sentences with respect to the criteria “helpful for the blind”, their Visual-Labels is well received by human judges, likely because it includes important aspects provided by the strong visual labels.
  • For the second edition of the challenge the authors introduced a new human evaluation protocol to allow comparison of a large number of approaches.
  • Open access funding provided by Max Planck Society.




Int J Comput Vis
DOI 10.1007/s11263-016-0987-1
Movie Description

Anna Rohrbach¹ · Atousa Torabi³ · Marcus Rohrbach² · Niket Tandon¹ · Christopher Pal⁴ · Hugo Larochelle⁵,⁶ · Aaron Courville⁷ · Bernt Schiele¹

Received: 10 May 2016 / Accepted: 23 December 2016
© The Author(s) 2017. This article is published with open access at Springerlink.com
Abstract  Audio description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. We introduce the Large Scale Movie Description Challenge (LSMDC) which contains a parallel corpus of 128,118 sentences aligned to video clips from 200 movies (around 150 h of video in total). The goal of the challenge is to automatically generate descriptions for the movie clips. First we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in the challenges organized in the context of two workshops at ICCV 2015 and ECCV 2016.

Keywords  Movie description · Video description · Video captioning · Video understanding · Movie description dataset · Movie description challenge · Long short-term memory network · Audio description · LSMDC

Communicated by Margaret Mitchell, John Platt and Kate Saenko.

Corresponding author: Anna Rohrbach, arohrbach@mpi-inf.mpg.de

¹ Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
² ICSI and EECS, UC Berkeley, Berkeley, CA, USA
³ Disney Research, Pittsburgh, PA, USA
⁴ École Polytechnique de Montréal, Montreal, Canada
⁵ Université de Sherbrooke, Sherbrooke, Canada
⁶ Twitter, Cambridge, USA
⁷ Université de Montréal, Montreal, Canada
1 Introduction
Audio descriptions (ADs) make movies accessible to millions of blind or visually impaired people.¹ AD—sometimes also referred to as descriptive video service (DVS)—provides an audio narrative of the “most important aspects of the visual information” (Salway 2007), namely actions, gestures, scenes, and character appearance, as can be seen in Figs. 1 and 2. AD is prepared by trained describers and read by professional narrators. While more and more movies are audio transcribed, it may take up to 60 person-hours to describe a 2-h movie (Lakritz and Salway 2006), resulting in the fact that today only a small subset of movies and TV programs are available for the blind. Consequently, automating this process has the potential to greatly increase accessibility to this media content.

In addition to the benefits for the blind, generating descriptions for video is an interesting task in itself, requiring the combination of core techniques from computer vision and computational linguistics. To understand the visual input one has to reliably recognize scenes, human activities, and participating objects. To generate a good description one has to decide what part of the visual information to verbalize, i.e. recognize what is salient.

¹ In this work we refer for simplicity to “the blind” to account for all blind and visually impaired people which benefit from AD, knowing of the variety of visually impaired and that AD is not accessible to all.

[Fig. 1 Audio description (AD) and movie script samples from the movie “Ugly Truth”.]
Large datasets of objects (Deng et al. 2009) and scenes (Xiao et al. 2010; Zhou et al. 2014) have had an important impact in computer vision and have significantly improved our ability to recognize objects and scenes. The combination of large datasets and convolutional neural networks (CNNs) has been particularly potent (Krizhevsky et al. 2012). To be able to learn how to generate descriptions of visual content, parallel datasets of visual content paired with descriptions are indispensable (Rohrbach et al. 2013). While recently several large datasets have been released which provide images with descriptions (Young et al. 2014; Lin et al. 2014; Ordonez et al. 2011), video description datasets focus on short video clips with single sentence descriptions and have a limited number of video clips (Xu et al. 2016; Chen and Dolan 2011) or are not publicly available (Over et al. 2012). TACoS Multi-Level (Rohrbach et al. 2014) and YouCook (Das et al. 2013) are exceptions as they provide multiple sentence descriptions and longer videos. While these corpora pose challenges in terms of fine-grained recognition, they are restricted to the cooking scenario. In contrast, movies are open domain and realistic, even though, as any other video source (e.g. YouTube or surveillance videos), they have their specific characteristics. ADs and scripts associated with movies provide rich multiple sentence descriptions. They even go beyond this by telling a story which means they facilitate the study of how to extract plots, the understanding of long term semantic dependencies and human interactions from both visual and textual data.
[Fig. 2 Audio description (AD) and movie script samples from the movies “Harry Potter and the Prisoner of Azkaban”, “This is 40”, and “Les Miserables”. Typical mistakes contained in scripts are marked in red italic.]

[Fig. 3 Some of the diverse verbs/actions present in our Large Scale Movie Description Challenge (LSMDC).]
Figures 1 and 2 show examples of ADs and compare them to movie scripts. Scripts have been used for various tasks (Cour et al. 2008; Duchenne et al. 2009; Laptev et al. 2008; Liang et al. 2011; Marszalek et al. 2009), but so far not for video description. The main reason for this is that automatic alignment frequently fails due to the discrepancy between the movie and the script. As scripts are produced prior to the shooting of the movie they are frequently not as precise as the AD (Fig. 2 shows some typical mistakes marked in red italic). A common case is that part of the sentence is correct, while another part contains incorrect/irrelevant information. As can be seen in the examples, AD narrations describe key visual elements of the video such as changes in the scene, people’s appearance, gestures, actions, and their interaction with each other and the scene’s objects in concise and precise language. Figure 3 shows the variability of AD data w.r.t. verbs (actions) and corresponding scenes from the movies.

In this work we present a dataset which provides transcribed ADs, aligned to full length movies. AD narrations are carefully positioned within movies to fit in the natural pauses in the dialogue and are mixed with the original movie soundtrack by professional post-production. To obtain ADs we retrieve audio streams from DVDs/Blu-ray disks, segment out the sections of the AD audio and transcribe them via a crowd-sourced transcription service. The ADs provide an initial temporal alignment, which however does not always cover the full activity in the video. We discuss a way to fully automate both audio-segmentation and temporal alignment, but also manually align each sentence to the movie for all the data. Therefore, in contrast to Salway (2007) and Salway et al. (2007), our dataset provides alignment to the actions in the video, rather than just to the audio track of the description.

In addition we also mine existing movie scripts, pre-align them automatically, similar to Cour et al. (2008) and Laptev et al. (2008), and then manually align the sentences to the movie.
As a first study on our dataset we benchmark several approaches for movie description. We first examine nearest neighbor retrieval using diverse visual features which do not require any additional labels, but retrieve sentences from the training data. Second, we adapt the translation approach of Rohrbach et al. (2013) by automatically extracting an intermediate semantic representation from the sentences using semantic parsing. Third, based on the success of long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber 1997) for the image captioning problem (Donahue et al. 2015; Karpathy and Fei-Fei 2015; Kiros et al. 2015; Vinyals et al. 2015) we propose our approach Visual-Labels. It first builds robust visual classifiers which distinguish verbs, objects, and places extracted from weak sentence annotations. Then the visual classifiers form the input to an LSTM for generating movie descriptions.

The main contribution of this work is the Large Scale Movie Description Challenge (LSMDC)² which provides transcribed and aligned AD and script data sentences. The LSMDC was first presented at the Workshop “Describing and Understanding Video & The Large Scale Movie Description Challenge (LSMDC)”, collocated with ICCV 2015. The second edition, LSMDC 2016, was presented at the “Joint Workshop on Storytelling with Images and Videos and Large Scale Movie Description and Understanding Challenge”, collocated with ECCV 2016. Both challenges include the same public and blind test sets with an evaluation server³ for automatic evaluation. LSMDC is based on the MPII Movie Description dataset (MPII-MD) and the Montreal Video Annotation Dataset (M-VAD) which were initially collected independently but are presented jointly in this work. We detail the data collection and dataset properties in Sect. 3, which includes our approach to automatically collect and align AD data. In Sect. 4 we present several benchmark approaches for movie description, including our Visual-Labels approach which learns robust visual classifiers and generates description using an LSTM. In Sect. 5 we present an evaluation of the benchmark approaches on the M-VAD and MPII-MD datasets, analyzing the influence of the different design choices. Using automatic and human evaluation, we also show that our Visual-Labels approach outperforms prior work on both datasets. In Sect. 5.5 we perform an analysis of prior work and our approach to understand the challenges of the movie description task. In Sect. 6 we present and discuss the results of the LSMDC 2015 and LSMDC 2016.

This work is partially based on the original publications from Rohrbach et al. (2015c, b) and the technical report from Torabi et al. (2015). Torabi et al. (2015) collected M-VAD, Rohrbach et al. (2015c) collected the MPII-MD dataset and presented the translation-based description approach. Rohrbach et al. (2015b) proposed the Visual-Labels approach.

² https://sites.google.com/site/describingmovies/.
³ https://competitions.codalab.org/competitions/6121.
2 Related Work
We discuss recent approaches to image and video description including existing work using movie scripts and ADs. We also discuss works which build on our dataset. We compare our proposed dataset to related video description datasets in Table 3 (Sect. 3.5).
2.1 Image Description
Prior work on image description includes Farhadi et al. (2010), Kulkarni et al. (2011), Kuznetsova et al. (2012, 2014), Li et al. (2011), Mitchell et al. (2012) and Socher et al. (2014). Recently image description has gained increased attention with work such as that of Chen and Zitnick (2015), Donahue et al. (2015), Fang et al. (2015), Karpathy and Fei-Fei (2015), Kiros et al. (2014, 2015), Mao et al. (2015), Vinyals et al. (2015) and Xu et al. (2015a). Much of the recent work has relied on Recurrent Neural Networks (RNNs) and in particular on long short-term memory networks (LSTMs). New datasets have been released, such as the Flickr30k (Young et al. 2014) and MS COCO Captions (Chen et al. 2015), where Chen et al. (2015) also presents a standardized protocol for image captioning evaluation. Other work has analyzed the performance of recent methods, e.g. Devlin et al. (2015) compare them with respect to the novelty of generated descriptions, while also exploring a nearest neighbor baseline that improves over recent methods.
2.2 Video Description
In the past video description has been addressed in controlled settings (Barbu et al. 2012; Kojima et al. 2002), on a small scale (Das et al. 2013; Guadarrama et al. 2013; Thomason et al. 2014) or in single domains like cooking (Rohrbach et al. 2014, 2013; Donahue et al. 2015). Donahue et al. (2015) first proposed to describe videos using an LSTM, relying on precomputed CRF scores from Rohrbach et al. (2014). Later Venugopalan et al. (2015c) extended this work to extract CNN features from frames which are max-pooled over time. Pan et al. (2016b) propose a framework that consists of a 2-/3-D CNN and LSTM trained jointly with a visual-semantic embedding to ensure better coherence between video and text. Xu et al. (2015b) jointly address the language generation and video/language retrieval tasks by learning a joint embedding for a deep video model and a compositional semantic language model. Li et al. (2015) study the problem of summarizing a long video to a single concise description by using ranking based summarization of multiple generated candidate sentences.

Concurrent and Consequent Work  To handle the challenging scenario of movie description, Yao et al. (2015) propose a soft-attention based model which selects the most relevant temporal segments in a video, incorporates 3-D CNN and generates a sentence using an LSTM. Venugopalan et al. (2015b) propose S2VT, an encoder–decoder framework, where a single LSTM encodes the input video frame by frame and decodes it into a sentence. Pan et al. (2016a) extend the video encoding idea by introducing a second LSTM layer which receives input of the first layer, but skips several frames, reducing its temporal depth. Venugopalan et al. (2016) explore the benefit of pre-trained word embeddings and language models for generation on large external text corpora. Shetty and Laaksonen (2015) evaluate different visual features as input for an LSTM generation framework. Specifically they use dense trajectory features (Wang et al. 2013) extracted for the clips and CNN features extracted at center frames of the clip. They find that training concept classifiers on MS COCO with the CNN features, combined with dense trajectories provides the best input for the LSTM. Ballas et al. (2016) leverages multiple convolutional maps from different CNN layers to improve the visual representation for activity and video description. To model multi-sentence description, Yu et al. (2016a) propose to use two stacked RNNs where the first one models words within a sentence and the second one, sentences within a paragraph. Yao et al. (2016) has conducted an interesting study on performance upper bounds for both image and video description tasks on available datasets, including the LSMDC dataset.
2.3 Movie Scripts and Audio Descriptions
Movie scripts have been used for automatic discovery and annotation of scenes and human actions in videos (Duchenne et al. 2009; Laptev et al. 2008; Marszalek et al. 2009), as well as a resource to construct an activity knowledge base (Tandon et al. 2015; de Melo and Tandon 2016). We rely on the approach presented by Laptev et al. (2008) to align movie scripts using subtitles.

Bojanowski et al. (2013) approach the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. They rely on the semantic parser SEMAFOR (Das et al. 2012) trained on the FrameNet database (Baker et al. 1998), however, they limit the recognition only to two frames. Bojanowski et al. (2014) aim to localize individual short actions in longer clips by exploiting the ordering constraints as weak supervision. Bojanowski et al. (2013, 2014), Duchenne et al. (2009), Laptev et al. (2008), Marszalek et al. (2009) proposed datasets focused on extracting several activities from movies. Most of them are part of the “Hollywood2” dataset (Marszalek et al. 2009) which contains 69 movies and 3669 clips. Another line of work (Cour et al. 2009; Everingham et al. 2006; Ramanathan et al. 2014; Sivic et al. 2009; Tapaswi et al. 2012) proposed datasets for character identification targeting TV shows. All the mentioned datasets rely on alignments to movie/TV scripts and none uses ADs.

ADs have also been used to understand which characters interact with each other (Salway et al. 2007). Other prior work has looked at supporting AD production using scripts as an information source (Lakritz and Salway 2006) and automatically finding scene boundaries (Gagnon et al. 2010). Salway (2007) analyses the linguistic properties on a non-public corpus of ADs from 91 movies. Their corpus is based on the original sources to create the ADs and contains different kinds of artifacts not present in actual description, such as dialogs and production notes. In contrast, our text corpus is much cleaner as it consists only of the actual ADs.
2.4 Works Building on Our Dataset
Interestingly, other works, datasets, and challenges are already building upon our data. Zhu et al. (2015b) learn a visual-semantic embedding from our clips and ADs to relate movies to books. Bruni et al. (2016) also learn a joint embedding of videos and descriptions and use this representation to improve activity recognition on the Hollywood 2 dataset (Marszalek et al. 2009). Tapaswi et al. (2016) use our AD transcripts for building their MovieQA dataset, which asks natural language questions about movies, requiring an understanding of visual and textual information, such as dialogue and AD, to answer the question. Zhu et al. (2015a) present a fill-in-the-blank challenge for audio description of the current, previous, and next sentence description for a given clip, requiring to understand the temporal context of the clips.
3 Datasets for Movie Description
In the following, we present how we collect our data for movie description and discuss its properties. The Large Scale Movie Description Challenge (LSMDC) is based on two datasets which were originally collected independently. The MPII Movie Description Dataset (MPII-MD), initially presented by Rohrbach et al. (2015c), was collected from Blu-ray movie data. It consists of AD and script data and uses sentence-level manual alignment of transcribed audio to the actions in the video (Sect. 3.1). In Sect. 3.2 we discuss how to fully automate AD audio segmentation and alignment for the Montreal Video Annotation Dataset (M-VAD), initially presented by Torabi et al. (2015). M-VAD was collected with DVD data quality and only relies on AD. Section 3.3 details the Large Scale Movie Description Challenge (LSMDC) which is based on M-VAD and MPII-MD, but also contains additional movies, and was set up as a challenge. It includes a submission server for evaluation on public and blind test sets. In Sect. 3.4 we present the detailed statistics of our datasets, also see Table 1. In Sect. 3.5 we compare our movie description data to other video description datasets.
3.1 The MPII Movie Description (MPII-MD) Dataset
In the following we describe our approach behind the collection of ADs (Sect. 3.1.1) and script data (Sect. 3.1.2). Then we discuss how to manually align them to the video (Sect. 3.1.3) and which visual features we extracted from the video (Sect. 3.1.4).
3.1.1 Collection of ADs
We search for Blu-ray movies with ADs in the “Audio Description” section of the British Amazon⁴ and select 55 movies of diverse genres (e.g. drama, comedy, action). As ADs are only available in audio format, we first retrieve the audio stream from the Blu-ray HD disks. We use MakeMKV⁵ to extract a Blu-ray in the .mkv file format, and then XMedia Recode⁶ to select and extract the audio streams from it. Then we semi-automatically segment out the sections of the AD audio (which is mixed with the original audio stream) with the approach described below. The audio segments are then transcribed by a crowd-sourced transcription service⁷ that also provides us the time-stamps for each spoken sentence.

⁴ www.amazon.co.uk.
⁵ https://www.makemkv.com/.
⁶ https://www.xmedia-recode.de/.
⁷ CastingWords transcription service, http://castingwords.com/.

Citations
Journal ArticleDOI
TL;DR: This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.
Abstract: Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.

1,945 citations

Proceedings ArticleDOI
15 Jun 2019
TL;DR: To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.
Abstract: Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45%). To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (~65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.

687 citations

Proceedings ArticleDOI
16 Aug 2018
TL;DR: In this paper, the authors introduce the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and present SWAG, a new dataset with 113k multiple choice questions about a rich spectrum of grounded situations.
Abstract: Given a partial description like “she opened the hood of the car,” humans can reason about the situation and anticipate what might come next (”then, she examined the engine”). In this paper, we introduce the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning. We present SWAG, a new dataset with 113k multiple choice questions about a rich spectrum of grounded situations. To address the recurring challenges of the annotation artifacts and human biases found in many existing datasets, we propose Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data. To account for the aggressive adversarial filtering, we use state-of-the-art language models to massively oversample a diverse set of potential counterfactuals. Empirical results demonstrate that while humans can solve the resulting inference problems with high accuracy (88%), various competitive models struggle on our task. We provide comprehensive analysis that indicates significant opportunities for future research.

505 citations
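The Adversarial Filtering procedure summarized above can be illustrated with a short sketch: repeatedly train a stylistic classifier on one random half of the data and, on the other half, replace machine-written endings that the classifier finds too easy. The Python sketch below is a simplified illustration only, not the authors' implementation; featurize and sample_new_ending are hypothetical stand-ins for a stylistic feature extractor and an LM-based ending generator.

# Simplified sketch of Adversarial Filtering (AF); not the authors' code.
# `featurize` and `sample_new_ending` are hypothetical helpers.
import random
import numpy as np
from sklearn.linear_model import LogisticRegression

def adversarial_filtering(contexts, real_endings, fake_pool, featurize,
                          sample_new_ending, rounds=10):
    fakes = [random.choice(fake_pool) for _ in contexts]
    for _ in range(rounds):
        # Train a stylistic classifier on a random half of the data.
        idx = np.random.permutation(len(contexts))
        half = len(idx) // 2
        train, test = idx[:half], idx[half:]
        X = [featurize(contexts[i], real_endings[i]) for i in train] + \
            [featurize(contexts[i], fakes[i]) for i in train]
        y = [1] * half + [0] * half
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        # On the held-out half, replace fakes the classifier finds too easy.
        for i in test:
            p_real = clf.predict_proba([featurize(contexts[i], fakes[i])])[0, 1]
            if p_real < 0.5:                 # easily detected as fake -> resample
                fakes[i] = sample_new_ending(contexts[i])
    return fakes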

Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this paper, a Moment Context Network (MCN) is proposed to localize natural language queries in videos by integrating local and global video features over time, allowing a specific temporal segment, or moment, to be identified in a video given a natural language text description.
Abstract: We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. To address this issue, we propose the Moment Context Network (MCN) which effectively localizes natural language queries in videos by integrating local and global video features over time. A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment. Therefore, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 10,000 unedited, personal videos in diverse visual settings with pairs of localized video segments and referring expressions. We demonstrate that MCN outperforms several baseline methods and believe that our initial results together with the release of DiDeMo will inspire further research on localizing video moments with natural language.

469 citations
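The core idea above can be made concrete with a small sketch: embed a candidate moment (its local features, the global video features, and its temporal endpoints) and the query sentence into a shared space, then rank moments by distance. This is a simplification for illustration, not the published model; feature dimensions are assumptions.

# Sketch of moment retrieval in the spirit of the Moment Context Network.
import torch
import torch.nn as nn

class MomentRanker(nn.Module):
    def __init__(self, vis_dim=2048, text_dim=300, embed_dim=256):
        super().__init__()
        self.vis = nn.Linear(2 * vis_dim + 2, embed_dim)   # local + global + (start, end)
        self.txt = nn.Linear(text_dim, embed_dim)

    def forward(self, local_feat, global_feat, endpoints, query_feat):
        moment = torch.cat([local_feat, global_feat, endpoints], dim=-1)
        # Distance in the shared space; lower means a better match.
        return torch.norm(self.vis(moment) - self.txt(query_feat), dim=-1)

ranker = MomentRanker()
# Score 5 candidate moments of one video against a single query.
scores = ranker(torch.randn(5, 2048), torch.randn(5, 2048),
                torch.rand(5, 2), torch.randn(1, 300).expand(5, -1))
best_moment = scores.argmin()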

Posted Content
TL;DR: It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
Abstract: Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic Youtube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: this http URL.

440 citations
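As a rough illustration of the kind of text-video embedding described above, the sketch below projects video and text features into a shared space and trains with a max-margin ranking loss over in-batch negatives. This is illustrative only; the actual HowTo100M model, features, and loss differ, and the feature dimensions here are assumptions.

# Minimal joint text-video embedding with a max-margin ranking loss (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=4096, text_dim=300, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v @ t.t()          # pairwise cosine similarities

def ranking_loss(sim, margin=0.2):
    # Diagonal entries are matching video/caption pairs; off-diagonal are negatives.
    pos = sim.diag().unsqueeze(1)
    cost = (margin + sim - pos).clamp(min=0)
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    return (cost * mask).mean()

model = JointEmbedding()
sim = model(torch.randn(8, 4096), torch.randn(8, 300))
loss = ranking_loss(sim)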

References
Proceedings Article
03 Dec 2012
TL;DR: A deep convolutional neural network with five convolutional layers (some followed by max-pooling layers), three fully-connected layers, and a final 1000-way softmax achieved state-of-the-art results on ImageNet, as discussed by the authors.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations
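For readers who want to see the architecture summarized above in code, here is a compact PyTorch sketch of an AlexNet-style network: five convolutional layers interleaved with max-pooling, then three fully-connected layers with dropout and a final 1000-way classifier. Layer sizes approximate the original, and torchvision ships a reference implementation; this is only a sketch.

# AlexNet-style architecture sketch (five conv layers, max pooling,
# three fully connected layers, dropout); dimensions approximate the original.
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),   # softmax applied at evaluation time
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

logits = AlexNetSketch()(torch.randn(1, 3, 224, 224))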

Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

72,897 citations
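The mechanism described above is easiest to see in a minimal cell update: multiplicative gates control what enters, stays in, and leaves the memory cell, whose additive update carries error back through time. The sketch below is a modern LSTM cell with a forget gate (which the 1997 formulation did not yet have); in practice torch.nn.LSTM would be used.

# Minimal LSTM cell sketch showing input, forget, and output gates acting on a
# cell state c (the "constant error carousel"). Illustrative only.
import torch

def lstm_cell(x, h, c, W, U, b):
    # W, U, b hold stacked parameters for the i, f, g, o transforms.
    gates = x @ W + h @ U + b
    i, f, g, o = gates.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c = f * c + i * g          # gated, mostly additive update of the memory cell
    h = o * torch.tanh(c)      # exposed hidden state
    return h, c

d_in, d_hid = 8, 16
W = torch.randn(d_in, 4 * d_hid); U = torch.randn(d_hid, 4 * d_hid); b = torch.zeros(4 * d_hid)
h = c = torch.zeros(1, d_hid)
for x in torch.randn(5, 1, d_in):      # unroll over a toy sequence of length 5
    h, c = lstm_cell(x, h, c, W, U, b)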

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations
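The configuration discussed above (very small 3x3 filters, depth pushed to 16-19 weight layers) can be written down directly. The sketch below builds the convolutional part of a VGG-16-like network and is illustrative rather than the authors' code; torchvision also provides reference models.

# VGG-style feature extractor built only from stacked 3x3 convolutions and
# 2x2 max pooling; this follows the 16-layer ("D") configuration as a sketch.
import torch.nn as nn

def vgg16_features():
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
           512, 512, 512, 'M', 512, 512, 512, 'M']
    layers, in_ch = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1), nn.ReLU()]
            in_ch = v
    return nn.Sequential(*layers)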

Proceedings ArticleDOI
Jia Deng1, Wei Dong1, Richard Socher1, Li-Jia Li1, Kai Li1, Li Fei-Fei1 
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations
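A single Inception module makes the "increase depth and width at constant budget" idea concrete: parallel 1x1, 3x3, and 5x5 convolutions plus a pooled branch are concatenated along the channel dimension, with 1x1 convolutions reducing channels before the expensive filters. The PyTorch sketch below is illustrative; the example channel sizes follow the paper's inception (3a) block.

# Sketch of a single Inception module with four parallel branches.
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(),
                                nn.Conv2d(c3_red, c3, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(),
                                nn.Conv2d(c5_red, c5, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1))

    def forward(self, x):
        # Concatenate the branch outputs along the channel dimension.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

# Channel sizes of the "inception (3a)" block: output has 64+128+32+32 = 256 channels.
out = InceptionModule(192, 64, 96, 128, 16, 32, 32)(torch.randn(1, 192, 28, 28))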

Frequently Asked Questions (12)
Q1. What are the contributions in "Movie description" ?

In this work the authors propose a novel dataset which contains transcribed ADs, which are temporally aligned to full length movies. The authors introduce the Large Scale Movie Description Challenge (LSMDC), which contains a parallel corpus of 128,118 sentences aligned to video clips from 200 movies (around 150 h of video in total). Comparing ADs to scripts, the authors find that ADs are more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, the authors present and compare the results of several teams who participated in the challenges.

In future work, movie description approaches should aim to achieve rich yet correct and fluent descriptions. Beyond the current challenge on single sentences, the dataset opens new possibilities to understand stories and plots across multiple sentences in an open-domain scenario on a large scale. Their evaluation server will continue to be available for automatic evaluation.

The most frequent verbs there are “look up” and “nod”, which are also frequent in the dataset and in the sentences produced by SMT-Best. 

One of the main challenges in automating the construction of a video annotation dataset derived from AD audio is accurately segmenting the AD output, which is mixed with the original movie soundtrack. 

The automatic evaluation measures include BLEU-1,-2,-3,-4 (Papineni et al. 2002), METEOR (Denkowski and Lavie 2014), ROUGE-L (Lin 2004), and CIDEr (Vedantam et al. 2015). 
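As a hedged illustration of one of these measures, the snippet below computes sentence-level BLEU-1 through BLEU-4 with NLTK; the official challenge evaluation uses the coco-caption toolkit, and METEOR, ROUGE-L, and CIDEr have their own reference implementations. The example sentences are made up.

# Sentence-level BLEU for a generated description against a reference (illustrative).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["someone", "opens", "the", "door", "and", "walks", "in"]]
hypothesis = ["someone", "walks", "through", "the", "door"]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)
    score = sentence_bleu(references, hypothesis, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")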

The authors start by decomposing the typically long sentences present in movie descriptions into smaller clauses using the ClausIE tool (Del Corro and Gemulla 2013). 
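ClausIE itself is a dependency-parse-based Java tool; the snippet below is only a naive stand-in that splits a long description on coordinating conjunctions, to make the decomposition step concrete. It is not ClausIE and will miss many clause boundaries.

# Naive stand-in for clause decomposition (the real pipeline uses ClausIE).
import re

def split_clauses(sentence):
    parts = re.split(r",?\s+(?:and then|and|but|then)\s+|;\s*", sentence)
    return [p.strip() for p in parts if p.strip()]

print(split_clauses("Someone opens the door, walks in and then sits down."))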

The authors also use the recently proposed evaluation measure SPICE (Anderson et al. 2016), which aims to compare the semantic content of two descriptions, by matching the information contained in dependency parse trees for both descriptions. 

In order to compensate for the potential 1–2s misalignment between the AD narrator speaking and the corresponding scene in the movie, the authors automatically add 2s to the end of each video clip.
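The adjustment itself is a one-liner over the clip boundaries; the sketch below assumes clips are given as (start, end) time-stamps in seconds and optionally clamps to the movie duration. It is an illustration, not the authors' pipeline code.

# Pad each AD-derived clip by 2 seconds at the end to compensate for the
# narrator speaking slightly before or after the described scene.
def pad_clips(clips, pad=2.0, movie_duration=None):
    """clips: list of (start, end) time-stamps in seconds."""
    padded = []
    for start, end in clips:
        new_end = end + pad
        if movie_duration is not None:
            new_end = min(new_end, movie_duration)
        padded.append((start, new_end))
    return padded

print(pad_clips([(12.0, 15.5), (40.2, 43.0)], movie_duration=44.0))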

The authors look at the following properties: availability of multi-sentence descriptions (long videos described continuously with multiple sentences), data domain, source of descriptions and dataset size. 

Ballas et al. (2016) leverage multiple convolutional maps from different CNN layers to improve the visual representation for activity and video description.

This submission uses an encoder–decoder framework with two LSTMs: one LSTM encodes the frame sequence of the video and another decodes it into a sentence.
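The general shape of such a model is sketched below: one LSTM encodes per-frame CNN features, and a second LSTM, initialized with the encoder's final state, decodes the sentence with teacher forcing. This is a generic sequence-to-sequence sketch, not the team's actual submission; feature and vocabulary sizes are assumptions.

# Sketch of an encoder-decoder video captioner with two LSTMs (illustrative).
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000, embed=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed)
        self.decoder = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, caption_tokens):
        # frame_feats: (B, T_frames, feat_dim); caption_tokens: (B, T_words)
        _, state = self.encoder(frame_feats)          # keep only the final state
        dec_in = self.embed(caption_tokens)
        dec_out, _ = self.decoder(dec_in, state)      # teacher forcing
        return self.out(dec_out)                      # (B, T_words, vocab_size)

model = VideoCaptioner()
logits = model(torch.randn(2, 30, 2048), torch.randint(0, 10000, (2, 12)))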

Then the authors use the dynamic programming method of Laptev et al. (2008) to align scripts to subtitles and infer the time-stamps for the description sentences.
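The alignment can be pictured as a monotone dynamic program, similar to sequence alignment: maximize word overlap between script sentences and subtitle lines while keeping both in order, then read the sentence time-stamps off the matched subtitles. The sketch below is a simplification in that spirit, not the exact method of Laptev et al. (2008).

# Simplified monotone DP alignment of script sentences to subtitle lines
# based on word overlap (illustrative).
def word_overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, min(len(wa), len(wb)))

def align(script, subtitles):
    n, m = len(script), len(subtitles)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (score[i - 1][j - 1] + word_overlap(script[i - 1], subtitles[j - 1]), (i - 1, j - 1)),
                (score[i - 1][j], (i - 1, j)),   # skip a script sentence
                (score[i][j - 1], (i, j - 1)),   # skip a subtitle line
            ]
            score[i][j], back[i][j] = max(candidates)
    # Trace back matched pairs; subtitle time-stamps then give the sentence timings.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        pi, pj = back[i][j]
        if (pi, pj) == (i - 1, j - 1):
            pairs.append((i - 1, j - 1))
        i, j = pi, pj
    return pairs[::-1]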