Showing papers on "Closed captioning published in 2015"

PDF

Open Access

Posted Content•

A Critical Review of Recurrent Neural Networks for Sequence Learning

[...]

Zachary C. Lipton¹, John Berkowitz, Charles Elkan•Institutions (1)

29 May 2015-arXiv: Learning

TL;DR: The goal of this survey is to provide a selfcontained explication of the state of the art of recurrent neural networks together with a historical perspective and references to primary research.

...read moreread less

Abstract: Countless learning tasks require dealing with sequential data. Image captioning, speech synthesis, and music generation all require that a model produce outputs that are sequences. In other domains, such as time series prediction, video analysis, and musical information retrieval, a model must learn from inputs that are sequences. Interactive tasks, such as translating natural language, engaging in dialogue, and controlling a robot, often demand both capabilities. Recurrent neural networks (RNNs) are connectionist models that capture the dynamics of sequences via cycles in the network of nodes. Unlike standard feedforward neural networks, recurrent networks retain a state that can represent information from an arbitrarily long context window. Although recurrent neural networks have traditionally been dicult to train, and often contain millions of parameters, recent advances in network architectures, optimization techniques, and parallel computation have enabled successful large-scale learning with them. In recent years, systems based on long short-term memory (LSTM) and bidirectional (BRNN) architectures have demonstrated ground-breaking performance on tasks as varied as image captioning, language translation, and handwriting recognition. In this survey, we review and synthesize the research that over the past three decades rst yielded and then made practical these powerful learning models. When appropriate, we reconcile conicting notation and nomenclature. Our goal is to provide a selfcontained explication of the state of the art together with a historical perspective and references to primary research.

...read moreread less

1,792 citations

Posted Content•

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

[...]

Justin Johnson¹, Andrej Karpathy¹, Li Fei-Fei¹•Institutions (1)

Stanford University¹

24 Nov 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: A Fully Convolutional Localization Network (FCLN) architecture is proposed that processes an image with a single, efficient forward pass, requires no external regions proposals, and can be trained end-to-end with asingle round of optimization.

...read moreread less

Abstract: We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and Image Captioning when one predicted region covers the full image. To address the localization and description task jointly we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external regions proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and Recurrent Neural Network language model that generates the label sequences. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000 region-grounded captions. We observe both speed and accuracy improvements over baselines based on current state of the art approaches in both generation and retrieval settings.

...read moreread less

698 citations

Posted Content•

Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks

[...]

Haonan Yu¹, Jiang Wang², Zhiheng Huang³, Yi Yang², Wei Xu² - Show less +1 more•Institutions (3)

Purdue University¹, Baidu², Facebook³

26 Oct 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: An approach that exploits hierarchical Recurrent Neural Networks to tackle the video captioning problem, i.e., generating one or multiple sentences to describe a realistic video, significantly outperforms the current state-of-the-art methods.

...read moreread less

Abstract: We present an approach that exploits hierarchical Recurrent Neural Networks (RNNs) to tackle the video captioning problem, i.e., generating one or multiple sentences to describe a realistic video. Our hierarchical framework contains a sentence generator and a paragraph generator. The sentence generator produces one simple short sentence that describes a specific short video interval. It exploits both temporal- and spatial-attention mechanisms to selectively focus on visual elements during generation. The paragraph generator captures the inter-sentence dependency by taking as input the sentential embedding produced by the sentence generator, combining it with the paragraph history, and outputting the new initial state for the sentence generator. We evaluate our approach on two large-scale benchmark datasets: YouTubeClips and TACoS-MultiLevel. The experiments demonstrate that our approach significantly outperforms the current state-of-the-art methods with BLEU@4 scores 0.499 and 0.305 respectively.

...read moreread less

394 citations

Posted Content•

Generation and Comprehension of Unambiguous Object Descriptions

[...]

Junhua Mao¹, Jonathan Huang², Alexander Toshev², Oana-Maria Camburu³, Alan L. Yuille¹, Kevin Murphy² - Show less +2 more•Institutions (3)

University of California, Los Angeles¹, Google², University of Oxford³

07 Nov 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes a method that can generate an unambiguous description of a specific object or region in an image and which can also comprehend or interpret such an expression to infer which object is being described, and shows that this method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene.

...read moreread less

Abstract: We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene. Our model is inspired by recent successes of deep learning methods for image captioning, but while image captioning is difficult to evaluate, our task allows for easy objective evaluation. We also present a new large-scale dataset for referring expressions, based on MS-COCO. We have released the dataset and a toolbox for visualization and evaluation, see this https URL

...read moreread less

269 citations

Proceedings Article•DOI•

Language Models for Image Captioning: The Quirks and What Works

[...]

Jacob Devlin¹, Hao Cheng², Hao Fang³, Saurabh Gupta⁴, Li Deng³, Xiaodong He³, Geoffrey Zweig³, Margaret Mitchell³ - Show less +4 more•Institutions (4)

BBN Technologies¹, University of Washington², Microsoft³, University of California, Berkeley⁴

07 May 2015

TL;DR: In this paper, the authors compare the merits of different language modeling approaches for the first time by using the same state-of-the-art CNN as input, and examine issues in different approaches, including linguistic irregularities, caption repetition, and data set overlap.

...read moreread less

Abstract: Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a recurrent neural network (RNN) that then generates the caption sequence. In this paper, we compare the merits of these different language modeling approaches for the first time by using the same state-ofthe-art CNN as input. We examine issues in the different approaches, including linguistic irregularities, caption repetition, and data set overlap. By combining key aspects of the ME and RNN methods, we achieve a new record performance over previously published results on the benchmark COCO dataset. However, the gains we see in BLEU do not translate to human judgments.

...read moreread less

242 citations

Posted Content•

Exploring Nearest Neighbor Approaches for Image Captioning

[...]

Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, C. Lawrence Zitnick - Show less +1 more

17 May 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: A variety of nearest neighbor baseline approaches for image captioning find a set of nearest neighbour images in the training set from which a caption may be borrowed for the query image by finding the caption that best represents the "consensus" of the set of candidate captions gathered from the nearest neighbor images.

...read moreread less

Abstract: We explore a variety of nearest neighbor baseline approaches for image captioning. These approaches find a set of nearest neighbor images in the training set from which a caption may be borrowed for the query image. We select a caption for the query image by finding the caption that best represents the "consensus" of the set of candidate captions gathered from the nearest neighbor images. When measured by automatic evaluation metrics on the MS COCO caption evaluation server, these approaches perform as well as many recent approaches that generate novel captions. However, human studies show that a method that generates novel captions is still preferred over the nearest neighbor approach.

...read moreread less

207 citations

Posted Content•

Language Models for Image Captioning: The Quirks and What Works

[...]

Jacob Devlin¹, Hao Cheng², Hao Fang³, Saurabh Gupta⁴, Li Deng³, Xiaodong He³, Geoffrey Zweig³, Margaret Mitchell³ - Show less +4 more•Institutions (4)

BBN Technologies¹, University of Washington², Microsoft³, University of California, Berkeley⁴

07 May 2015-arXiv: Computation and Language

TL;DR: By combining key aspects of the ME and RNN methods, this paper achieves a new record performance over previously published results on the benchmark COCO dataset, however, the gains the authors see in BLEU do not translate to human judgments.

...read moreread less

Abstract: Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a recurrent neural network (RNN) that then generates the caption sequence. In this paper, we compare the merits of these different language modeling approaches for the first time by using the same state-of-the-art CNN as input. We examine issues in the different approaches, including linguistic irregularities, caption repetition, and data set overlap. By combining key aspects of the ME and RNN methods, we achieve a new record performance over previously published results on the benchmark COCO dataset. However, the gains we see in BLEU do not translate to human judgments.

...read moreread less

143 citations

Posted Content•

Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images

[...]

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan L. Yuille - Show less +2 more

25 Apr 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: Zhang et al. as discussed by the authors proposed a transposed weight sharing scheme for image captioning, which not only improves performance on image caption, but also makes the model more suitable for the novel concept learning task.

...read moreread less

Abstract: In this paper, we address the task of learning novel visual concepts, and their interactions with other concepts, from a few images with sentence descriptions. Using linguistic context and visual features, our method is able to efficiently hypothesize the semantic meaning of new words and add them to its word dictionary so that they can be used to describe images which contain these novel concepts. Our method has an image captioning module based on m-RNN with several improvements. In particular, we propose a transposed weight sharing scheme, which not only improves performance on image captioning, but also makes the model more suitable for the novel concept learning task. We propose methods to prevent overfitting the new concepts. In addition, three novel concept datasets are constructed for this new task. In the experiments, we show that our method effectively learns novel visual concepts from a few examples without disturbing the previously learned concepts. The project page is this http URL

...read moreread less

140 citations

Posted Content•

SentiCap: Generating Image Descriptions with Sentiments

[...]

Alexander Mathews¹, Lexing Xie¹, Xuming He¹•Institutions (1)

Australian National University¹

06 Oct 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a switching recurrent neural network with word-level regularization was proposed to produce emotional image captions using only 2000+ training sentences containing sentiments, and the results showed that 84.6% of the generated positive captions were judged as being at least as descriptive as the factual captions.

...read moreread less

Abstract: The recent progress on image recognition and language modeling is making automatic description of image content a reality. However, stylized, non-factual aspects of the written description are missing from the current systems. One such style is descriptions with emotions, which is commonplace in everyday communication, and influences decision-making and interpersonal relationships. We design a system to describe an image with emotions, and present a model that automatically generates captions with positive or negative sentiments. We propose a novel switching recurrent neural network with word-level regularization, which is able to produce emotional image captions using only 2000+ training sentences containing sentiments. We evaluate the captions with different automatic and crowd-sourcing metrics. Our model compares favourably in common quality metrics for image captioning. In 84.6% of cases the generated positive captions were judged as being at least as descriptive as the factual captions. Of these positive captions 88% were confirmed by the crowd-sourced workers as having the appropriate sentiment.

...read moreread less

139 citations

Proceedings Article•DOI•

Learning Like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images

[...]

Junhua Mao¹, Xu Wei², Yi Yang², Jiang Wang², Zhiheng Huang², Alan L. Yuille¹ - Show less +2 more•Institutions (2)

University of California, Los Angeles¹, Baidu²

07 Dec 2015

TL;DR: Using linguistic context and visual features, the method is able to efficiently hypothesize the semantic meaning of new words and add them to its word dictionary so that they can be used to describe images which contain these novel concepts.

...read moreread less

Abstract: In this paper, we address the task of learning novel visual concepts, and their interactions with other concepts, from a few images with sentence descriptions. Using linguistic context and visual features, our method is able to efficiently hypothesize the semantic meaning of new words and add them to its word dictionary so that they can be used to describe images which contain these novel concepts. Our method has an image captioning module based on [38] with several improvements. In particular, we propose a transposed weight sharing scheme, which not only improves performance on image captioning, but also makes the model more suitable for the novel concept learning task. We propose methods to prevent overfitting the new concepts. In addition, three novel concept datasets are constructed for this new task, and are publicly available on the project page. In the experiments, we show that our method effectively learns novel visual concepts from a few examples without disturbing the previously learned concepts. The project page is: www.stat.ucla.edu/junhua. mao/projects/child_learning.html.

...read moreread less

120 citations

Book Chapter•DOI•

The Long-Short Story of Movie Description

[...]

Anna Rohrbach¹, Marcus Rohrbach², Bernt Schiele¹•Institutions (2)

Max Planck Society¹, University of California, Berkeley²

07 Oct 2015

TL;DR: This work shows how to learn robust visual classifiers from the weak annotations of the sentence descriptions to generate a description using an LSTM and achieves the best performance to date on the challenging MPII-MD and M-VAD datasets.

...read moreread less

Abstract: Generating descriptions for videos has many applications including assisting blind people and human-robot interaction. The recent advances in image captioning as well as the release of large-scale movie description datasets such as MPII-MD [28] and M-VAD [31] allow to study this task in more depth. Many of the proposed methods for image captioning rely on pre-trained object classifier CNNs and Long Short-Term Memory recurrent networks (LSTMs) for generating descriptions. While image description focuses on objects, we argue that it is important to distinguish verbs, objects, and places in the setting of movie description. In this work we show how to learn robust visual classifiers from the weak annotations of the sentence descriptions. Based on these classifiers we generate a description using an LSTM. We explore different design choices to build and train the LSTM and achieve the best performance to date on the challenging MPII-MD and M-VAD datasets. We compare and analyze our approach and prior work along various dimensions to better understand the key challenges of the movie description task.

...read moreread less

Journal Article•DOI•

Video Captions Benefit Everyone

[...]

Morton Ann Gernsbacher¹•Institutions (1)

University of Wisconsin-Madison¹

01 Oct 2015-Policy insights from the behavioral and brain sciences

TL;DR: This work states that despite U.S. laws, which require captioning in most workplace and educational contexts, many video audiences and video creators are naïve about the legal mandate to caption, much less the empirical benefit of captions.

...read moreread less

Abstract: Video captions, also known as same-language subtitles, benefit everyone who watches videos (children, adolescents, college students, and adults). More than 100 empirical studies document that captioning a video improves comprehension of, attention to, and memory for the video. Captions are particularly beneficial for persons watching videos in their non-native language, for children and adults learning to read, and for persons who are D/deaf or hard of hearing. However, despite U.S. laws, which require captioning in most workplace and educational contexts, many video audiences and video creators are naive about the legal mandate to caption, much less the empirical benefit of captions.

...read moreread less

Journal Article•DOI•

Enhancing Vocabulary Learning Through Captioned Video: An Eye-Tracking Study

[...]

Maribel Montero Perez¹, Elke Peters², Piet Desmet²•Institutions (2)

Research Foundation - Flanders¹, Katholieke Universiteit Leuven²

01 Jun 2015-The Modern Language Journal

TL;DR: This paper investigated the effect of two attentionenhancing techniques on L2 students' learning and processing of novel French words (i.e., target words) through video with L2 subtitles or captions.

...read moreread less

Abstract: This study investigates the effect of two attention-enhancing techniques on L2 students' learning and processing of novel French words (i.e., target words) through video with L2 subtitles or captions. A combination of eye-movement data and vocabulary tests was gathered to study the effects of Type of Captioning (full or keyword captioning) and Test Announcement, realized by informing (intentional) or not informing (incidental) learners about upcoming vocabulary tests. The study adopted a between-subjects design with two independent variables (Type of Captioning and Test Announcement) resulting in four experimental groups: full captioning, incidental; full captioning, intentional; keyword captioning, incidental; keyword captioning, intentional. Results indicated that learners in the keyword groups outperformed the other groups on the form recognition test. Analyses of learners' total fixation and second pass time on the target words revealed a significant interaction effect between Type of Captioning and Test Announcement. Results also suggest that second pass as well as total fixation duration and word learning positively correlated for learners in the full captioning, intentional group: The longer their fixations on a given word, the more likely correct recognition became. Results are discussed in relation to attention and word learning through video. [ABSTRACT FROM AUTHOR]

...read moreread less

Posted Content•

The Long-Short Story of Movie Description

[...]

Anna Rohrbach¹, Marcus Rohrbach², Bernt Schiele¹•Institutions (2)

Max Planck Society¹, University of California, Berkeley²

04 Jun 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, the authors show how to learn robust visual classifiers from the weak annotations of the sentence descriptions using an LSTM and compare and analyze their approach and prior work along various dimensions to better understand the key challenges of movie description task.

...read moreread less

Abstract: Generating descriptions for videos has many applications including assisting blind people and human-robot interaction. The recent advances in image captioning as well as the release of large-scale movie description datasets such as MPII Movie Description allow to study this task in more depth. Many of the proposed methods for image captioning rely on pre-trained object classifier CNNs and Long-Short Term Memory recurrent networks (LSTMs) for generating descriptions. While image description focuses on objects, we argue that it is important to distinguish verbs, objects, and places in the challenging setting of movie description. In this work we show how to learn robust visual classifiers from the weak annotations of the sentence descriptions. Based on these visual classifiers we learn how to generate a description using an LSTM. We explore different design choices to build and train the LSTM and achieve the best performance to date on the challenging MPII-MD dataset. We compare and analyze our approach and prior work along various dimensions to better understand the key challenges of the movie description task.

...read moreread less

Patent•

Determining importance of scenes based upon closed captioning data

[...]

Jeromey Russell Goetz¹•Institutions (1)

Amazon.com¹

17 Apr 2015

TL;DR: In this paper, the importance of scenes or moments in video content relative to one another is identified and a textual analysis of the closed captioning data is performed, where the importance level of scenes can be ranked with respect to each other.

...read moreread less

Abstract: Disclosed are various embodiments for identifying importance of scenes or moments in video content relative to one another. Closed captioning data is extracted from a video content feature. A textual analysis of the closed captioning data is performed. The importance level of scenes can be ranked with respect to one another.

...read moreread less

Proceedings Article•DOI•

Dynamic Subtitles: The User Experience

[...]

Andy Brown, Rhia Jones, Michael Crabb¹, James Sandford, Matthew Brooks, Mike Armstrong, Caroline Jay² - Show less +3 more•Institutions (2)

University of Dundee¹, University of Manchester²

03 Jun 2015

TL;DR: The testing of dynamic subtitles with hearing-impaired users, and a new analysis of previously collected eye-tracking data, demonstrates that dynamic subtitles can lead to an improved User Experience, although not for all types of subtitle user.

...read moreread less

Abstract: Subtitles (closed captions) on television are typically placed at the bottom-centre of the screen. However, placing subtitles in varying positions, according to the underlying video content (`dynamic subtitles'), has the potential to make the overall viewing experience less disjointed and more immersive. This paper describes the testing of such subtitles with hearing-impaired users, and a new analysis of previously collected eye-tracking data. The qualitative data demonstrates that dynamic subtitles can lead to an improved User Experience, although not for all types of subtitle user. The eye-tracking data was analysed to compare the gaze patterns of subtitle users with a baseline of those for people viewing without subtitles. It was found that gaze patterns of people watching dynamic subtitles were closer to the baseline than those of people watching with traditional subtitles. Finally, some of the factors that need to be considered when authoring dynamic subtitles are discussed.

...read moreread less

Patent•

Semiautomated relay method and apparatus

[...]

Robert M. Engelke, Kevin R. Colwell

27 Feb 2015

TL;DR: In this article, a system for providing voice-to-text captioning service comprising a relay processor that receives voice messages generated by a hearing user during a call, the processor programmed to present the voice messages to a first call assistant, a call assistant device used by the call assistant to generate call assistant generated text corresponding to the hearing user's voice messages, further programmed to run automated voice to text transcription software.

...read moreread less

Abstract: A system for providing voice-to-text captioning service comprising a relay processor that receives voice messages generated by a hearing user during a call, the processor programmed to present the voice messages to a first call assistant, a call assistant device used by the first call assistant to generate call assistant generated text corresponding to the hearing user's voice messages, the processor further programmed to run automated voice-to-text transcription software to generate automated text corresponding to the hearing user's voice messages, use the call assistant generated text, the automated text and the hearing user's voice messages to train the voice-to-text transcription software to more accurately transcribe the hearing user's voice messages to text, determine when the accuracy exceeds an accuracy requirement threshold, during the call and prior to the automated text exceeding the accuracy requirement threshold, transmitting the call assistant generated text to the assisted user's device for display to the assisted user and subsequent to the automated text exceeding the accuracy requirement threshold, transmitting the automated text to the assisted user's device for display to the assisted user.

...read moreread less

Proceedings Article•DOI•

Déjà Image-Captions: A Corpus of Expressive Descriptions in Repetition

[...]

Jianfu Chen¹, Polina Kuznetsova¹, David S. Warren¹, Yejin Choi²•Institutions (2)

Stony Brook University¹, University of Washington²

01 Jan 2015

TL;DR: A new approach to harvesting a large-scale, high quality image-caption corpus that makes a better use of already existing web data with no additional human efforts is presented, focusing on Deja Image-Captions: naturally existing image descriptions that are repeated almost verbatim – by more than one individual for different images.

...read moreread less

Abstract: We present a new approach to harvesting a large-scale, high quality image-caption corpus that makes a better use of already existing web data with no additional human efforts. The key idea is to focus on Deja Image-Captions: naturally existing image descriptions that are repeated almost verbatim – by more than one individual for different images. The resulting corpus provides association structure between 4 million images with 180K unique captions, capturing a rich spectrum of everyday narratives including figurative and pragmatic language. Exploring the use of the new corpus, we also present new conceptual tasks of visually situated paraphrasing, creative image captioning, and creative visual paraphrasing.

...read moreread less

Journal Article•DOI•

Learning Motivation and Adaptive Video Caption Filtering for EFL Learners Using Handheld Devices.

[...]

Ching Kun Hsu¹•Institutions (1)

National Taiwan Normal University¹

20 Jun 2015-ReCALL

TL;DR: The results confirm that different students require different quantities of information to balance listening comprehension and indicate that the proposed adaptive caption filtering approach may be an effective way to improve the skills required for listening proficiency.

...read moreread less

Abstract: The aim of this study was to provide adaptive assistance to improve the listening comprehension of eleventh grade students. This study developed a video-based language learning system for handheld devices, using three levels of caption filtering adapted to student needs. Elementary level captioning excluded 220 English sight words (see Section 1 for definition), but provided captions and Chinese translations for the remaining words. Intermediate level excluded 1000 high frequency English words, but provided captions for the remaining words, and 2200 high frequency English words were excluded at the high intermediate caption filtering level. The result was that the viewers were provided with captions for words that were likely to be unfamiliar to them. Participants in the experimental group were assigned bilingual caption modes according to their pre-test results, while those in the control group were assigned standard caption modes. Our results indicate that students in the experimental group preferred adaptive captions, enjoyed the exercises more, and gained greater intrinsic motivation compared to those in the control group. The results confirm that different students require different quantities of information to balance listening comprehension and indicate that the proposed adaptive caption filtering approach may be an effective way to improve the skills required for listening proficiency.

...read moreread less

Proceedings Article•DOI•

A Distributed Representation Based Query Expansion Approach for Image Captioning

[...]

Semih Yagcioglu¹, Erkut Erdem¹, Aykut Erdem¹, Ruket Cakici²•Institutions (2)

Hacettepe University¹, Middle East Technical University²

01 Jul 2015

TL;DR: The core idea of the method is to translate the given visual query into a distributional semantics based form, which is generated by the average of the sentence vectors extracted from the captions of images visually similar to the input image.

...read moreread less

Abstract: In this paper, we propose a novel query expansion approach for improving transferbased automatic image captioning. The core idea of our method is to translate the given visual query into a distributional semantics based form, which is generated by the average of the sentence vectors extracted from the captions of images visually similar to the input image. Using three image captioning benchmark datasets, we show that our approach provides more accurate results compared to the state-of-theart data-driven methods in terms of both automatic metrics and subjective evaluation.

...read moreread less

Journal Article•DOI•

Toward completely automated vowel extraction: Introducing DARLA

[...]

Sravana Reddy¹, James N. Stanford¹•Institutions (1)

Dartmouth College¹

01 Dec 2015

TL;DR: A fully automated program called DARLA is introduced, which automatically generates transcriptions with ASR and extracts vowels using FAVE and is tested on a dataset of the US Southern Shift and compares the results with semi-automated methods.

...read moreread less

Abstract: Automatic Speech Recognition (ASR) is reaching further and further into everyday life with Apple’s Siri, Google voice search, automated telephone information systems, dictation devices, closed captioning, and other applications. Along with such advances in speech technology, sociolinguists have been considering new methods for alignment and vowel formant extraction, including techniques like the Penn Aligner (Yuan and Liberman, 2008) and the FAVE automated vowel extraction program (Evanini et al., 2009, Rosenfelder et al., 2011). With humans transcribing audio recordings into sentences, these semi-automated methods can produce effective vowel formant measurements (Labov et al., 2013). But as the quality of ASR improves, sociolinguistics may be on the brink of another transformative technology: large-scale, completely automated vowel extraction without any need for human transcription. It would then be possible to quickly extract vowels from virtually limitless hours of recordings, such as YouTube, publicly available audio/video archives, and large-scale personal interviews or streaming video. How far away is this transformative moment? In this article, we introduce a fully automated program called DARLA (short for “Dartmouth Linguistic Automation,” http://darla.dartmouth.edu), which automatically generates transcriptions with ASR and extracts vowels using FAVE. Users simply upload an audio recording of speech, and DARLA produces vowel plots, a table of vowel formants, and probabilities of the phonetic environments for each token. In this paper, we describe DARLA and explore its sociolinguistic applications. We test the system on a dataset of the US Southern Shift and compare the results with semi-automated methods.

...read moreread less

Posted Content•

Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning

[...]

Pingbo Pan¹, Zhongwen Xu², Yi Yang², Fei Wu¹, Yueting Zhuang¹ - Show less +1 more•Institutions (2)

Zhejiang University¹, University of Technology, Sydney²

11 Nov 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes a new approach, namely Hierarchical Recurrent Neural Encoder (HRNE), to exploit temporal information of videos to exploit video temporal structure in a longer range by reducing the length of input information flow, and compositing multiple consecutive inputs at a higher level.

...read moreread less

Abstract: Recently, deep learning approach, especially deep Convolutional Neural Networks (ConvNets), have achieved overwhelming accuracy with fast processing speed for image classification. Incorporating temporal structure with deep ConvNets for video representation becomes a fundamental problem for video content analysis. In this paper, we propose a new approach, namely Hierarchical Recurrent Neural Encoder (HRNE), to exploit temporal information of videos. Compared to recent video representation inference approaches, this paper makes the following three contributions. First, our HRNE is able to efficiently exploit video temporal structure in a longer range by reducing the length of input information flow, and compositing multiple consecutive inputs at a higher level. Second, computation operations are significantly lessened while attaining more non-linearity. Third, HRNE is able to uncover temporal transitions between frame chunks with different granularities, i.e., it can model the temporal transitions between frames as well as the transitions between segments. We apply the new method to video captioning where temporal information plays a crucial role. Experiments demonstrate that our method outperforms the state-of-the-art on video captioning benchmarks. Notably, even using a single network with only RGB stream as input, HRNE beats all the recent systems which combine multiple inputs, such as RGB ConvNet plus 3D ConvNet.

...read moreread less

Posted Content•

Video captioning with recurrent networks based on frame- and video-level features and visual content classification

[...]

Rakshith Shetty, Jorma Laaksonen

09 Dec 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work builds on static image captioning systems with RNN based language models and extends this framework to videos utilizing both static image features and video-specific features, and studies the usefulness of visual content classifiers for caption generation.

...read moreread less

Abstract: In this paper, we describe the system for generating textual descriptions of short video clips using recurrent neural networks (RNN), which we used while participating in the Large Scale Movie Description Challenge 2015 in ICCV 2015. Our work builds on static image captioning systems with RNN based language models and extends this framework to videos utilizing both static image features and video-specific features. In addition, we study the usefulness of visual content classifiers as a source of additional information for caption generation. With experimental results we show that utilizing keyframe based features, dense trajectory video features and content classifier outputs together gives better performance than any one of them individually.

...read moreread less

Proceedings Article•DOI•

Evaluating Alternatives for Better Deaf Accessibility to Selected Web-Based Multimedia

[...]

Brent Shiver¹, Rosalee Wolfe²•Institutions (2)

IBM¹, DePaul University²

26 Oct 2015

TL;DR: Two studies that focused on multimedia accessibility for Internet users who were born deaf or became deaf at an early age confirm that captioning online videos makes the Internet more accessible to the deaf users, even when the captions are automatically generated.

...read moreread less

Abstract: The proliferation of video and audio media on the Internet has created a distinct disadvantage for deaf Internet users. Despite technological and legislative milestones in recent decades in making television and movies more accessible, there has been less progress with online access. A major obstacle to providing captions for Internet media is the high cost of captioning and transcribing services. This paper reports on two studies that focused on multimedia accessibility for Internet users who were born deaf or became deaf at an early age. An initial study attempted to identify priorities for deaf accessibility improvement. A total of 20 deaf and hard-of-hearing participants were interviewed via videophone about their Internet usage and the issues that were the most frustrating. The most common theme was concern over a lack of accessibility for online news. In the second study, a total of 95 deaf and hard-of-hearing participants evaluated different caption styles, some of which were generated through automatic speech recognition.Results from the second study confirm that captioning online videos makes the Internet more accessible to the deaf users, even when the captions are automatically generated. However color-coded captions used to highlight confidence levels were found neither to be beneficial nor detrimental; yet when asked directly about the benefit of color-coding, participants strongly favored the concept.

...read moreread less

Journal Article•DOI•

A Comparison of Comprehension Processes in Sign Language Interpreter Videos with or without Captions

[...]

Matjaž Debevc¹, Danijela Milosevic², Ines Kožuh¹•Institutions (2)

University of Maribor¹, University of Kragujevac²

26 May 2015-PLOS ONE

TL;DR: Suggestions are led for the consistent use of captions in sign language interpreter videos in various media for improved comprehension among deaf and hard of hearing sign language users.

...read moreread less

Abstract: One important theme in captioning is whether the implementation of captions in individual sign language interpreter videos can positively affect viewers’ comprehension when compared with sign language interpreter videos without captions. In our study, an experiment was conducted using four video clips with information about everyday events. Fifty-one deaf and hard of hearing sign language users alternately watched the sign language interpreter videos with, and without, captions. Afterwards, they answered ten questions. The results showed that the presence of captions positively affected their rates of comprehension, which increased by 24% among deaf viewers and 42% among hard of hearing viewers. The most obvious differences in comprehension between watching sign language interpreter videos with and without captions were found for the subjects of hiking and culture, where comprehension was higher when captions were used. The results led to suggestions for the consistent use of captions in sign language interpreter videos in various media.

...read moreread less

Learning foreign languages with ClipFlair: Using captioning and revoicing activities to increase students’ motivation and engagement

[...]

Rocío Baños, Stavroula Sokoli

15 Jan 2015

TL;DR: The rationale and outcomes of ClipFlair, a European-funded project aimed at countering the factors that discourage Foreign Language Learning by providing a motivating, easily accessible online platform to learn a foreign language through revoicing and captioning, are presented.

...read moreread less

Abstract: The purpose of this paper is to present the rationale and outcomes of ClipFlair, a European-funded project aimed at countering the factors that discourage Foreign Language Learning (FLL) by providing a motivating, easily accessible online platform to learn a foreign language through revoicing (e.g. dubbing) and captioning (e.g. subtitling). This paper will reflect on what has been achieved throughout the project and the challenges encountered along the way, in order to share our experience and inspire other FLL tutors in secondary and tertiary education. The focus is on the main outputs of the project: a) ClipFlair Studio, an online platform where users (both tutors and learners) can create, upload and access revoicing and captioning activities to learn a foreign language; b) ClipFlair Gallery, a library of resources containing over 350 activities to learn the 15 languages targeted in the project; and c) ClipFlair Social, an online community where learners, teachers and activity authors can share information.

...read moreread less

Book•

Reading Sounds: Closed-Captioned Media and Popular Culture

[...]

Sean Zdenek

23 Dec 2015

TL;DR: In this article, Zdenek's analysis is an engrossing look at how we make the audible visible, one that proves that better standards for closed captioning create a better entertainment experience for all viewers.

...read moreread less

Abstract: Imagine a common movie scene: a hero confronts a villain. Captioning such a moment would at first glance seem as basic as transcribing the dialogue. But consider the choices involved: How do you convey the sarcasm in a comeback? Do you include a henchman's muttering in the background? Does the villain emit a scream, a grunt, or a howl as he goes down? And how do you note a gunshot without spoiling the scene? These are the choices closed captioners face every day. Captioners must decide whether and how to describe background noises, accents, laughter, musical cues, and even silences. When captioners describe a sound-or choose to ignore it-they are applying their own subjective interpretations to otherwise objective noises, creating meaning that does not necessarily exist in the soundtrack or the script. Reading Sounds looks at closed-captioning as a potent source of meaning in rhetorical analysis. Through nine engrossing chapters, Sean Zdenek demonstrates how the choices captioners make affect the way deaf and hard of hearing viewers experience media. He draws on hundreds of real-life examples, as well as interviews with both professional captioners and regular viewers of closed captioning. Zdenek's analysis is an engrossing look at how we make the audible visible, one that proves that better standards for closed captioning create a better entertainment experience for all viewers.

...read moreread less

Posted Content•

Image Captioning with an Intermediate Attributes Layer.

[...]

Qi Wu, Chunhua Shen, Anton van den Hengel¹, Lingqiao Liu, Anthony Dick - Show less +1 more•Institutions (1)

University of Adelaide¹

03 Jun 2015

TL;DR: This work shows that an intermediate image-to-attributes layer can dramatically improve captioning results over the current approach which directly connects an RNN to a CNN.

...read moreread less

Abstract: Many recent studies in image captioning rely on an architecture which learns the mapping from images to sentences in an end-to-end fashion. However, generating an accurate and complete description requires identifying all entities, their mutual interactions and the context of the image. In this work, we show that an intermediate image-to-attributes layer can dramatically improve captioning results over the current approach which directly connects an RNN to a CNN. We propose a two-stage procedure for training such an attribute-based approach: in the first stage, we mine a number of keywords from the training sentences which we use as semantic attributes for images, and learn the mapping from images to those attributes with a CNN; in the second stage, we learn the mapping from detected attribute occurrence likelihoods to sentence description using LSTM. We then demonstrate the effectiveness of our two-stage model with captioning experiments on three benchmark datasets, which are Flickr8k, Flickr30K and MS COCO.

...read moreread less

Book Chapter•DOI•

Automatic Close Captioning for Live Hungarian Television Broadcast Speech: A Fast and Resource-Efficient Approach

[...]

Ádám Varga, Balázs Tarján¹, Zoltán Tobler, György Szaszák¹, Tibor Fegyó¹, Csaba Bordás, Péter Mihajlik¹ - Show less +3 more•Institutions (1)

Budapest University of Technology and Economics¹

20 Sep 2015

TL;DR: The application of LVCSR (Large Vocabulary Continuous Speech Recognition) technology is investigated for real-time, resource-limited broadcast close captioning for deaf viewers with various models tailored for Hungarian broadcast speech recognition.

...read moreread less

Abstract: In this paper, the application of LVCSR (Large Vocabulary Continuous Speech Recognition) technology is investigated for real-time, resource-limited broadcast close captioning The work focuses on transcribing live broadcast conversation speech to make such programs accessible to deaf viewers Due to computational limitations, real time factor (RTF) and memory requirements are kept low during decoding with various models tailored for Hungarian broadcast speech recognition Two decoders are compared on the direct transcription task of broadcast conversation recordings, and setups employing re-speakers are also tested Moreover, the models are evaluated on a broadcast news transcription task as well, and different language models (LMs) are tested in order to demonstrate the performance of our systems in settings when low memory consumption is a less crucial factor

...read moreread less

Patent•

Methods and systems for recipe management

[...]

Rajendra Singh Sisodia¹, Anand Srinivasan Srinivasan Natesan¹, Aravind Gundumane, Chaitra Bhat•Institutions (1)

Philips¹

04 Nov 2015

TL;DR: In this paper, the authors present a system for obtaining content over the Internet, identifying text within the content (e.g., such as closed captioning or recipe text) or creating text from the content using such technologies as speech recognition, analyzing the text for actionable directions, and translating those actions into instructions suitable for network-connected cooking appliances.

...read moreread less

Abstract: Systems and methods for obtaining content over the Internet, identifying text within the content (e.g., such as closed captioning or recipe text) or creating text from the content using such technologies as speech recognition, analyzing the text for actionable directions, and translating those actionable directions into instructions suitable for network-connected cooking appliances. Certain embodiments provide additional guidance to avoid or correct mistakes in the cooking process, and allow for the customization of recipes to address, e.g., dietary restrictions, culinary preferences, translation into a foreign language, etc.

...read moreread less