Movie Description
Citations
Bulut Tabanlı Bilgisayarlı Görü Kullanılarak Sesli Betimleme Sistem Tasarımı (Audio Description System Design Using Cloud-Based Computer Vision)
Video Caption Dataset for Describing Human Actions in Japanese
More Than Reading Comprehension: A Survey on Datasets and Metrics of Textual Question Answering
Conversational AI Systems for Social Good: Opportunities and Challenges
The Role of the Input in Natural Language Video Description
Frequently Asked Questions (12)
Q2. What is the future work in "Movie description"?
In future work, movie description approaches should aim to achieve rich yet correct and fluent descriptions. Beyond their current challenge on single sentences, the dataset opens new possibilities for understanding stories and plots across multiple sentences in an open-domain scenario on a large scale. Their evaluation server will continue to be available for automatic evaluation.
Q3. What are the frequent verbs in the dataset?
The most frequent verbs are “look up” and “nod”, which are also frequent in the dataset and in the sentences produced by SMT-Best.
Q4. What is the main challenge in the construction of a video annotation dataset?
One of the main challenges in automating the construction of a video annotation dataset derived from AD audio is accurately segmenting the AD output, which is mixed with the original movie soundtrack.
Q5. What are the evaluation measures used for the semantic parsing pipeline?
The automatic evaluation measures include BLEU-1, -2, -3, and -4 (Papineni et al. 2002), METEOR (Denkowski and Lavie 2014), ROUGE-L (Lin 2004), and CIDEr (Vedantam et al. 2015).
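These scores are typically computed with the standard COCO caption evaluation toolkit. As a minimal, self-contained illustration only, the sketch below computes BLEU-1 through BLEU-4 for one candidate sentence with NLTK as a stand-in for that toolkit; the reference and candidate sentences are hypothetical.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference (ground-truth AD) and candidate (system output) sentences.
reference = "someone looks up and nods at the bartender".split()
candidate = "someone looks up and nods".split()

smooth = SmoothingFunction().method1  # avoids zero scores when higher-order n-grams are missing
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights over 1..n-grams gives BLEU-n
    score = sentence_bleu([reference], candidate, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```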
Q6. How do the authors decompose the sentences in a movie?
The authors start by decomposing the typically long sentences present in movie descriptions into smaller clauses using the ClausIE tool (Del Corro and Gemulla 2013).
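ClausIE itself is a Java tool, so the sketch below is only a rough, hypothetical approximation of this clause-splitting step using spaCy's dependency parse; the heuristic of one clause per verbal head is an assumption, not the authors' pipeline.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # requires the small English model to be installed
CLAUSE_DEPS = {"ROOT", "conj", "advcl", "ccomp", "xcomp", "relcl"}

def split_clauses(sentence):
    """Rough heuristic: emit one clause per verbal head, minus any nested clauses."""
    doc = nlp(sentence)
    heads = [t for t in doc if t.pos_ in ("VERB", "AUX") and t.dep_ in CLAUSE_DEPS]
    clauses = []
    for head in heads:
        nested = [h for h in heads if h is not head and h in head.subtree]
        tokens = [t for t in head.subtree if not any(t in n.subtree for n in nested)]
        clauses.append(" ".join(t.text for t in tokens))
    return clauses

print(split_clauses("Someone enters the room and sits down while the others keep talking."))
```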
Q7. How does the evaluation measure measure semantic content?
The authors also use the recently proposed evaluation measure SPICE (Anderson et al. 2016), which aims to compare the semantic content of two descriptions, by matching the information contained in dependency parse trees for both descriptions.
Q8. How do the authors add 2s to the end of each video clip?
In order to compensate for the potential 1–2s misalignment between the AD narrator speaking and the corresponding scene in the movie, the authors automatically add 2s to the end of each video clip.
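As an illustrative sketch of this padding step (file names and timestamps are hypothetical), the end time of each clip can simply be extended by 2s before cutting it out of the movie, e.g. with ffmpeg:

```python
import subprocess

PAD_SECONDS = 2.0  # compensate for the AD narrator lagging slightly behind the scene

def cut_clip(movie_path, start, end, out_path, pad=PAD_SECONDS):
    """Cut the interval [start, end + pad] (in seconds) out of the movie via ffmpeg."""
    cmd = [
        "ffmpeg", "-y",
        "-i", movie_path,
        "-ss", str(start),
        "-to", str(end + pad),  # the 2s padding is applied here
        "-c", "copy",
        out_path,
    ]
    subprocess.run(cmd, check=True)

# Hypothetical AD segment: narration spoken from 612.4s to 618.9s.
cut_clip("movie.mp4", 612.4, 618.9, "clip_0001.mp4")
```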
Q9. What are the properties of a video description dataset?
The authors look at the following properties: availability of multi-sentence descriptions (long videos described continuously with multiple sentences), data domain, source of descriptions and dataset size.
Q10. What is the way to improve the visual representation of video?
Ballas et al. (2016) leverage multiple convolutional maps from different CNN layers to improve the visual representation for activity and video description.
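As a hedged sketch of the general idea (not Ballas et al.'s actual architecture), feature maps from several depths of a CNN can be collected with forward hooks on a torchvision ResNet; the layer names and input size below are illustrative assumptions.

```python
import torch
import torchvision.models as models

model = models.resnet50().eval()  # randomly initialized here; pretrained weights would be used in practice
feature_maps = {}

def save_output(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

# Tap convolutional maps at three different depths of the network.
for name in ("layer2", "layer3", "layer4"):
    getattr(model, name).register_forward_hook(save_output(name))

frame = torch.randn(1, 3, 224, 224)  # one dummy video frame
with torch.no_grad():
    model(frame)

for name, fmap in feature_maps.items():
    print(name, tuple(fmap.shape))  # e.g. layer2 -> (1, 512, 28, 28)
```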
Q11. What is the LSTM used to encode the video?
This submission uses an encoder–decoder framework with two LSTMs: one LSTM encodes the frame sequence of the video and another decodes it into a sentence.
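A minimal PyTorch sketch of such an encoder–decoder is shown below; all dimensions, names, and the teacher-forcing setup are illustrative assumptions, not the submission's actual model.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Encoder LSTM over per-frame CNN features, decoder LSTM over caption tokens."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, caption_tokens):
        # frame_feats: (batch, n_frames, feat_dim); caption_tokens: (batch, seq_len)
        _, video_state = self.encoder(frame_feats)      # summarize the frame sequence
        dec_in = self.embed(caption_tokens)             # embed the (teacher-forced) words
        dec_out, _ = self.decoder(dec_in, video_state)  # decode conditioned on the video
        return self.to_vocab(dec_out)                   # (batch, seq_len, vocab_size) logits

model = VideoCaptioner()
logits = model(torch.randn(2, 40, 2048), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```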
Q12. What is the method used to align scripts to subtitles?
Then the authors use the dynamic programming method of Laptev et al. (2008) to align scripts to subtitles and infer the time-stamps for the description sentences.
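As an illustrative sketch of script-to-subtitle alignment (a generic monotonic dynamic-programming alignment over word overlap, not Laptev et al.'s exact formulation; the example sentences are hypothetical):

```python
def similarity(a, b):
    """Word-overlap (Jaccard) similarity between a script sentence and a subtitle line."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def align(script_sents, subtitle_lines, gap_penalty=-0.1):
    """Needleman-Wunsch-style dynamic programming returning matched index pairs."""
    n, m = len(script_sents), len(subtitle_lines)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * gap_penalty, "up"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * gap_penalty, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (score[i - 1][j - 1] + similarity(script_sents[i - 1], subtitle_lines[j - 1]), "diag"),
                (score[i - 1][j] + gap_penalty, "up"),
                (score[i][j - 1] + gap_penalty, "left"),
            ]
            score[i][j], back[i][j] = max(choices)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:  # trace back the matched (script, subtitle) pairs
        move = back[i][j]
        if move == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "up":
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

print(align(["He looks up and nods."], ["he nods slowly", "the door opens"]))
```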