Showing papers on "Closed captioning published in 2017"

PDF

Open Access

Posted Content•

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.

[...]

Peter Anderson¹, Xiaodong He, Chris Buehler², Damien Teney³, Mark Johnson, Stephen Gould¹, Lei Zhang² - Show less +3 more•Institutions (3)

Australian National University¹, Microsoft², University of Adelaide³

25 Jul 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.

...read moreread less

Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.

...read moreread less

2,248 citations

Proceedings Article•DOI•

SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning

[...]

Long Chen¹, Hanwang Zhang², Jun Xiao¹, Liqiang Nie³, Jian Shao¹, Wei Liu⁴, Tat-Seng Chua⁵ - Show less +3 more•Institutions (5)

Zhejiang University¹, Columbia University², Shandong University³, Tencent⁴, National University of Singapore⁵

21 Jul 2017

TL;DR: This paper introduces a novel convolutional neural network dubbed SCA-CNN that incorporates Spatial and Channel-wise Attentions in a CNN that significantly outperforms state-of-the-art visual attention-based image captioning methods.

...read moreread less

Abstract: Visual attention has been successfully applied in structural prediction tasks such as visual captioning and question answering. Existing visual attention models are generally spatial, i.e., the attention is modeled as spatial probabilities that re-weight the last conv-layer feature map of a CNN encoding an input image. However, we argue that such spatial attention does not necessarily conform to the attention mechanism — a dynamic feature extractor that combines contextual fixations over time, as CNN features are naturally spatial, channel-wise and multi-layer. In this paper, we introduce a novel convolutional neural network dubbed SCA-CNN that incorporates Spatial and Channel-wise Attentions in a CNN. In the task of image captioning, SCA-CNN dynamically modulates the sentence generation context in multi-layer feature maps, encoding where (i.e., attentive spatial locations at multiple layers) and what (i.e., attentive channels) the visual attention is. We evaluate the proposed SCA-CNN architecture on three benchmark image captioning datasets: Flickr8K, Flickr30K, and MSCOCO. It is consistently observed that SCA-CNN significantly outperforms state-of-the-art visual attention-based image captioning methods.

...read moreread less

1,527 citations

Proceedings Article•DOI•

Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning

[...]

Jiasen Lu¹, Caiming Xiong², Devi Parikh³, Richard Socher²•Institutions (3)

Virginia Tech¹, Salesforce.com², Georgia Institute of Technology³

21 Jul 2017

TL;DR: This paper proposes a novel adaptive attention model with a visual sentinel that sets the new state-of-the-art by a significant margin on image captioning.

...read moreread less

Abstract: Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as the and of. Other words that may seem visual can often be predicted reliably just from the language model e.g., sign after behind a red stop or phone following talking on a cell. In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel. The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K. Our approach sets the new state-of-the-art by a significant margin.

...read moreread less

1,093 citations

Journal Article•DOI•

Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge

[...]

Oriol Vinyals¹, Alexander Toshev¹, Samy Bengio¹, Dumitru Erhan¹•Institutions (1)

Google¹

01 Apr 2017-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: A generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image is presented.

...read moreread less

Abstract: Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. Finally, given the recent surge of interest in this task, a competition was organized in 2015 using the newly released COCO dataset. We describe and analyze the various improvements we applied to our own baseline and show the resulting performance in the competition, which we won ex-aequo with a team from Microsoft Research.

...read moreread less

848 citations

Proceedings Article•DOI•

Dense-Captioning Events in Videos

[...]

Ranjay Krishna¹, Kenji Hata¹, Frederic Ren², Li Fei-Fei¹, Juan Carlos Niebles¹ - Show less +1 more•Institutions (2)

Stanford University¹, University of British Columbia²

01 Oct 2017

TL;DR: In this article, the authors introduce a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language, which is called ActivityNet Captions.

...read moreread less

Abstract: Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language. Our model introduces a variant of an existing proposal module that is designed to capture both short as well as long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time. Finally, we report performances of our model for dense-captioning events, video retrieval and localization.

...read moreread less

551 citations

Journal Article•DOI•

Video Captioning With Attention-Based LSTM and Semantic Consistency

[...]

Lianli Gao¹, Zhao Guo¹, Hanwang Zhang², Xing Xu¹, Heng Tao Shen¹ - Show less +1 more•Institutions (2)

University of Electronic Science and Technology of China¹, Columbia University²

19 Jul 2017-IEEE Transactions on Multimedia

TL;DR: A novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, to transfer videos to natural sentences with competitive or even better results than the state-of-the-art baselines for video captioning in both BLEU and METEOR.

...read moreread less

Abstract: Recent progress in using long short-term memory (LSTM) for image captioning has motivated the exploration of their applications for video captioning. By taking a video as a sequence of features, an LSTM model is trained on video-sentence pairs and learns to associate a video to a sentence. However, most existing methods compress an entire video shot or frame into a static representation, without considering attention mechanism which allows for selecting salient features. Furthermore, existing approaches usually model the translating error, but ignore the correlations between sentence semantics and visual content. To tackle these issues, we propose a novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, to transfer videos to natural sentences. This framework integrates attention mechanism with LSTM to capture salient structures of video, and explores the correlation between multimodal representations (i.e., words and visual content) for generating sentences with rich semantic content. Specifically, we first propose an attention mechanism that uses the dynamic weighted sum of local two-dimensional convolutional neural network representations. Then, an LSTM decoder takes these visual features at time $t$ and the word-embedding feature at time $t$ $-$ 1 to generate important words. Finally, we use multimodal embedding to map the visual and sentence features into a joint space to guarantee the semantic consistence of the sentence description and the video visual content. Experiments on the benchmark datasets demonstrate that our method using single feature can achieve competitive or even better results than the state-of-the-art baselines for video captioning in both BLEU and METEOR.

...read moreread less

548 citations

Proceedings Article•DOI•

Boosting Image Captioning with Attributes

[...]

Ting Yao¹, Yingwei Pan², Yehao Li², Zhaofan Qiu², Tao Mei¹ - Show less +1 more•Institutions (2)

Microsoft¹, University of Science and Technology of China²

01 Oct 2017

TL;DR: Li et al. as discussed by the authors proposed a Long Short-Term Memory with Attributes (LSTM-A) architecture that integrates attributes into the successful Convolutional Neural Networks (CNNs) plus RNNs (RNNs) image captioning framework, by training them in an end-to-end manner.

...read moreread less

Abstract: Automatically describing an image with a natural language has been an emerging challenge in both fields of computer vision and natural language processing. In this paper, we present Long Short-Term Memory with Attributes (LSTM-A) - a novel architecture that integrates attributes into the successful Convolutional Neural Networks (CNNs) plus Recurrent Neural Networks (RNNs) image captioning framework, by training them in an end-to-end manner. Particularly, the learning of attributes is strengthened by integrating inter-attribute correlations into Multiple Instance Learning (MIL). To incorporate attributes into captioning, we construct variants of architectures by feeding image representations and attributes into RNNs in different ways to explore the mutual but also fuzzy relationship between them. Extensive experiments are conducted on COCO image captioning dataset and our framework shows clear improvements when compared to state-of-the-art deep models. More remarkably, we obtain METEOR/CIDEr-D of 25.5%/100.2% on testing data of widely used and publicly available splits in [10] when extracting image representations by GoogleNet and achieve superior performance on COCO captioning Leaderboard.

...read moreread less

547 citations

Proceedings Article•DOI•

Scene Graph Generation from Objects, Phrases and Region Captions

[...]

Yikang Li¹, Wanli Ouyang², Bolei Zhou³, Kun Wang¹, Xiaogang Wang¹ - Show less +1 more•Institutions (3)

The Chinese University of Hong Kong¹, University of Sydney², Massachusetts Institute of Technology³

01 Oct 2017

TL;DR: Zhang et al. as mentioned in this paper proposed a multi-level scene description network (MSDN) to solve the three vision tasks jointly in an end-to-end manner, where object, phrase, and caption regions are aligned with a dynamic graph based on their spatial and semantic connections.

...read moreread less

Abstract: Object detection, scene graph generation and region captioning, which are three scene understanding tasks at different semantic levels, are tied together: scene graphs are generated on top of objects detected in an image with their pairwise relationship predicted, while region captioning gives a language description of the objects, their attributes, relations and other context information. In this work, to leverage the mutual connections across semantic levels, we propose a novel neural network model, termed as Multi-level Scene Description Network (denoted as MSDN), to solve the three vision tasks jointly in an end-to-end manner. Object, phrase, and caption regions are first aligned with a dynamic graph based on their spatial and semantic connections. Then a feature refining structure is used to pass messages across the three levels of semantic tasks through the graph. We benchmark the learned model on three tasks, and show the joint learning across three tasks with our proposed method can bring mutual improvements over previous models. Particularly, on the scene graph generation task, our proposed method outperforms the stateof- art method with more than 3% margin. Code has been made publicly available.

...read moreread less

477 citations

Proceedings Article•DOI•

Semantic Compositional Networks for Visual Captioning

[...]

Zhe Gan¹, Chuang Gan², Xiaodong He³, Yunchen Pu¹, Kenneth Tran¹, Jianfeng Gao¹, Lawrence Carin¹, Li Deng³ - Show less +4 more•Institutions (3)

Duke University¹, Tsinghua University², Microsoft³

21 Jul 2017

TL;DR: In this article, a Semantic Compositional Network (SCN) is developed for image captioning, in which semantic concepts (i.e., tags) are detected from the image, and the probability of each tag is used to compose the parameters in a long short-term memory (LSTM) network.

...read moreread less

Abstract: A Semantic Compositional Network (SCN) is developed for image captioning, in which semantic concepts (i.e., tags) are detected from the image, and the probability of each tag is used to compose the parameters in a long short-term memory (LSTM) network. The SCN extends each weight matrix of the LSTM to an ensemble of tag-dependent weight matrices. The degree to which each member of the ensemble is used to generate an image caption is tied to the image-dependent probability of the corresponding tag. In addition to captioning images, we also extend the SCN to generate captions for video clips. We qualitatively analyze semantic composition in SCNs, and quantitatively evaluate the algorithm on three benchmark datasets: COCO, Flickr30k, and Youtube2Text. Experimental results show that the proposed method significantly outperforms prior state-of-the-art approaches, across multiple evaluation metrics.

...read moreread less

421 citations

Posted Content•

Bottom-Up and Top-Down Attention for Image Captioning and VQA.

[...]

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang - Show less +3 more

25 Jul 2017

TL;DR: A combined bottom-up and topdown attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of the method to VQA.

...read moreread less

356 citations

Proceedings Article•DOI•

A Hierarchical Approach for Generating Descriptive Image Paragraphs

[...]

Jonathan Krause¹, Justin Johnson¹, Ranjay Krishna¹, Li Fei-Fei¹•Institutions (1)

Stanford University¹

01 Jul 2017

TL;DR: A model that decomposes both images and paragraphs into their constituent parts is developed, detecting semantic regions in images and using a hierarchical recurrent neural network to reason about language.

...read moreread less

Abstract: Recent progress on image captioning has made it possible to generate novel sentences describing images in natural language, but compressing an image into a single sentence can describe visual content in only coarse detail. While one new captioning approach, dense captioning, can potentially describe images in finer levels of detail by captioning many regions within an image, it in turn is unable to produce a coherent story for an image. In this paper we overcome these limitations by generating entire paragraphs for describing images, which can tell detailed, unified stories. We develop a model that decomposes both images and paragraphs into their constituent parts, detecting semantic regions in images and using a hierarchical recurrent neural network to reason about language. Linguistic analysis confirms the complexity of the paragraph generation task, and thorough experiments on a new dataset of image and paragraph pairs demonstrate the effectiveness of our approach.

...read moreread less

Posted Content•

Dense-Captioning Events in Videos

[...]

Ranjay Krishna¹, Kenji Hata¹, Frederic Ren², Li Fei-Fei¹, Juan Carlos Niebles¹ - Show less +1 more•Institutions (2)

Stanford University¹, University of British Columbia²

02 May 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors propose a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language, which can capture both short and long events that span minutes.

...read moreread less

Abstract: Most natural videos contain numerous events. For example, in a video of a "man playing a piano", the video might also contain "another man dancing" or "a crowd clapping". We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language. Our model introduces a variant of an existing proposal module that is designed to capture both short as well as long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with it's unique start and end time. Finally, we report performances of our model for dense-captioning events, video retrieval and localization.

...read moreread less

Proceedings Article•DOI•

Deep Reinforcement Learning-Based Image Captioning with Embedding Reward

[...]

Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, Li-Jia Li¹ - Show less +1 more•Institutions (1)

Google¹

09 Nov 2017

TL;DR: A novel decision-making framework for image captioning that combines a policy network and a value network to collaboratively generate captions and outperforms state-of-the-art approaches across different evaluation metrics.

...read moreread less

Abstract: Image captioning is a challenging problem owing to the complexity in understanding the image content and diverse ways of describing it in natural language. Recent advances in deep neural networks have substantially improved the performance of this task. Most state-of-the-art approaches follow an encoder-decoder framework, which generates captions using a sequential recurrent prediction model. However, in this paper, we introduce a novel decision-making framework for image captioning. We utilize a policy network and a value network to collaboratively generate captions. The policy network serves as a local guidance by providing the confidence of predicting the next word according to the current state. Additionally, the value network serves as a global and lookahead guidance by evaluating all possible extensions of the current state. In essence, it adjusts the goal of predicting the correct words towards the goal of generating captions similar to the ground truth captions. We train both networks using an actor-critic reinforcement learning model, with a novel reward defined by visual-semantic embedding. Extensive experiments and analyses on the Microsoft COCO dataset show that the proposed framework outperforms state-of-the-art approaches across different evaluation metrics.

...read moreread less

Proceedings Article•DOI•

Video Captioning with Transferred Semantic Attributes

[...]

Yingwei Pan¹, Ting Yao¹, Houqiang Li¹, Tao Mei¹•Institutions (1)

University of Science and Technology of China¹

01 Jul 2017

TL;DR: Wang et al. as discussed by the authors proposed Long Short-Term Memory with Transferred Semantic Attributes (LSTM-TSA), which incorporates the transferred semantic attributes learnt from images and videos into the CNN plus RNN framework, by training them in an end-to-end manner.

...read moreread less

Abstract: Automatically generating natural language descriptions of videos plays a fundamental challenge for computer vision community. Most recent progress in this problem has been achieved through employing 2-D and/or 3-D Convolutional Neural Networks (CNNs) to encode video content and Recurrent Neural Networks (RNNs) to decode a sentence. In this paper, we present Long Short-Term Memory with Transferred Semantic Attributes (LSTM-TSA)—a novel deep architecture that incorporates the transferred semantic attributes learnt from images and videos into the CNN plus RNN framework, by training them in an end-to-end manner. The design of LSTM-TSA is highly inspired by the facts that 1) semantic attributes play a significant contribution to captioning, and 2) images and videos carry complementary semantics and thus can reinforce each other for captioning. To boost video captioning, we propose a novel transfer unit to model the mutually correlated attributes learnt from images and videos. Extensive experiments are conducted on three public datasets, i.e., MSVD, M-VAD and MPII-MD. Our proposed LSTM-TSA achieves to-date the best published performance in sentence generation on MSVD: 52.8% and 74.0% in terms of BLEU@4 and CIDEr-D. Superior results are also reported on M-VAD and MPII-MD when compared to state-of-the-art methods.

...read moreread less

Posted Content•

Towards Automatic Learning of Procedures from Web Instructional Videos

[...]

Luowei Zhou¹, Chenliang Xu², Jason J. Corso¹•Institutions (2)

University of Michigan¹, University of Rochester²

28 Mar 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: A segment-level recurrent network is proposed for generating procedure segments by modeling the dependencies across segments and it is shown that the proposed model outperforms competitive baselines in procedure segmentation.

...read moreread less

Abstract: The potential for agents, whether embodied or software, to learn by observing other agents performing procedures involving objects and actions is rich. Current research on automatic procedure learning heavily relies on action labels or video subtitles, even during the evaluation phase, which makes them infeasible in real-world scenarios. This leads to our question: can the human-consensus structure of a procedure be learned from a large set of long, unconstrained videos (e.g., instructional videos from YouTube) with only visual evidence? To answer this question, we introduce the problem of procedure segmentation--to segment a video procedure into category-independent procedure segments. Given that no large-scale dataset is available for this problem, we collect a large-scale procedure segmentation dataset with procedure segments temporally localized and described; we use cooking videos and name the dataset YouCook2. We propose a segment-level recurrent network for generating procedure segments by modeling the dependencies across segments. The generated segments can be used as pre-processing for other tasks, such as dense video captioning and event parsing. We show in our experiments that the proposed model outperforms competitive baselines in procedure segmentation.

...read moreread less

Proceedings Article•DOI•

End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering

[...]

Youngjae Yu¹, Hyungjin Ko¹, Jongwook Choi¹, Gunhee Kim¹•Institutions (1)

Seoul National University¹

01 Jul 2017

TL;DR: In this paper, a high-level concept word detector is proposed that can be integrated with any video-to-language models to generate a list of concept words as useful semantic priors for language generation models.

...read moreread less

Abstract: We propose a high-level concept word detector that can be integrated with any video-to-language models. It takes a video as input and generates a list of concept words as useful semantic priors for language generation models. The proposed word detector has two important properties. First, it does not require any external knowledge sources for training. Second, the proposed word detector is trainable in an end-to-end manner jointly with any video-to-language models. To effectively exploit the detected words, we also develop a semantic attention mechanism that selectively focuses on the detected concept words and fuse them with the word encoding and decoding in the language model. In order to demonstrate that the proposed approach indeed improves the performance of multiple video-to-language tasks, we participate in all the four tasks of LSMDC 2016 [18]. Our approach has won three of them, including fill-in-the-blank, multiple-choice test, and movie retrieval.

...read moreread less

Proceedings Article•DOI•

Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training

[...]

Rakshith Shetty¹, Marcus Rohrbach², Lisa Anne Hendricks³, Mario Fritz¹, Bernt Schiele¹ - Show less +1 more•Institutions (3)

Max Planck Society¹, University of California, Berkeley², Adobe Systems³

01 Oct 2017

TL;DR: This work changes the training objective of the caption generator from reproducing ground-truth captions to generating a set of captions that is indistinguishable from human written captions, and employs adversarial training in combination with an approximate Gumbel sampler to implicitly match the generated distribution to the human one.

...read moreread less

Abstract: While strong progress has been made in image captioning recently, machine and human captions are still quite distinct. This is primarily due to the deficiencies in the generated word distribution, vocabulary size, and strong bias in the generators towards frequent captions. Furthermore, humans – rightfully so – generate multiple, diverse captions, due to the inherent ambiguity in the captioning task which is not explicitly considered in today's systems. To address these challenges, we change the training objective of the caption generator from reproducing ground-truth captions to generating a set of captions that is indistinguishable from human written captions. Instead of handcrafting such a learning target, we employ adversarial training in combination with an approximate Gumbel sampler to implicitly match the generated distribution to the human one. While our method achieves comparable performance to the state-of-the-art in terms of the correctness of the captions, we generate a set of diverse captions that are significantly less biased and better match the global uni-, bi- and tri-gram distributions of the human captions.

...read moreread less

Proceedings Article•DOI•

Hierarchical Boundary-Aware Neural Encoder for Video Captioning

[...]

Lorenzo Baraldi¹, Costantino Grana¹, Rita Cucchiara¹•Institutions (1)

University of Modena and Reggio Emilia¹

01 Jul 2017

TL;DR: A novel LSTM cell is proposed which can identify discontinuity points between frames or segments and modify the temporal connections of the encoding layer accordingly and can discover and leverage the hierarchical structure of the video.

...read moreread less

Abstract: The use of Recurrent Neural Networks for video captioning has recently gained a lot of attention, since they can be used both to encode the input video and to generate the corresponding description. In this paper, we present a recurrent video encoding scheme which can discover and leverage the hierarchical structure of the video. Unlike the classical encoder-decoder approach, in which a video is encoded continuously by a recurrent layer, we propose a novel LSTM cell which can identify discontinuity points between frames or segments and modify the temporal connections of the encoding layer accordingly. We evaluate our approach on three large-scale datasets: the Montreal Video Annotation dataset, the MPII Movie Description dataset and the Microsoft Video Description Corpus. Experiments show that our approach can discover appropriate hierarchical representations of input videos and improve the state of the art results on movie description datasets.

...read moreread less

Proceedings Article•DOI•

Areas of Attention for Image Captioning

[...]

Marco Pedersoli¹, Thomas Lucas, Cordelia Schmid, Jakob Verbeek•Institutions (1)

École de technologie supérieure¹

22 Oct 2017

TL;DR: In this paper, an attention-based model for automatic image captioning is proposed, where the dependencies between image regions, caption words, and the state of an RNN language model are modeled using three pairwise interactions.

...read moreread less

Abstract: We propose “Areas of Attention”, a novel attentionbased model for automatic image captioning. Our approach models the dependencies between image regions, caption words, and the state of an RNN language model, using three pairwise interactions. In contrast to previous attentionbased approaches that associate image regions only to the RNN state, our method allows a direct association between caption words and image regions. During training these associations are inferred from image-level captions, akin to weakly-supervised object detector training. These associations help to improve captioning by localizing the corresponding regions during testing. We also propose and compare different ways of generating attention areas: CNN activation grids, object proposals, and spatial transformers nets applied in a convolutional fashion. Spatial transformers give the best results. They allow for image specific attention areas, and can be trained jointly with the rest of the network. Our attention mechanism and spatial transformer attention areas together yield state-of-the-art results on the MSCOCO dataset.

...read moreread less

Proceedings Article•DOI•

Dense Captioning with Joint Inference and Visual Context

[...]

Linjie Yang, Kevin Tang, Jianchao Yang, Li-Jia Li¹•Institutions (1)

Google¹

01 Jul 2017

Abstract: Dense captioning is a newly emerging computer vision topic for understanding images with dense language descriptions. The goal is to densely detect visual concepts (e.g., objects, object parts, and interactions between them) from images, labeling each with a short descriptive phrase. We identify two key challenges of dense captioning that need to be properly addressed when tackling the problem. First, dense visual concept annotations in each image are associated with highly overlapping target regions, making accurate localization of each visual concept challenging. Second, the large amount of visual concepts makes it hard to recognize each of them by appearance alone. We propose a new model pipeline based on two novel ideas, joint inference and context fusion, to alleviate these two challenges. We design our model architecture in a methodical manner and thoroughly evaluate the variations in architecture. Our final model, compact and efficient, achieves state-of-the-art accuracy on Visual Genome [23] for dense captioning with a relative gain of 73% compared to the previous best algorithm. Qualitative experiments also reveal the semantic capabilities of our model in dense captioning.

...read moreread less

Proceedings Article•DOI•

Captioning Images with Diverse Objects

[...]

Subhashini Venugopalan¹, Lisa Anne Hendricks², Marcus Rohrbach², Raymond J. Mooney¹, Trevor Darrell², Kate Saenko³ - Show less +2 more•Institutions (3)

University of Texas at Austin¹, University of California, Berkeley², Boston University³

01 Jul 2017

TL;DR: The authors proposed the Novel Object Captioner (NOC), a deep visual semantic captioning model that can describe a large number of object categories not present in existing image-caption datasets.

...read moreread less

Abstract: Recent captioning models are limited in their ability to scale and describe concepts unseen in paired image-text corpora. We propose the Novel Object Captioner (NOC), a deep visual semantic captioning model that can describe a large number of object categories not present in existing image-caption datasets. Our model takes advantage of external sources – labeled images from object recognition datasets, and semantic knowledge extracted from unannotated text. We propose minimizing a joint objective which can learn from these diverse data sources and leverage distributional semantic embeddings, enabling the model to generalize and describe novel objects outside of image-caption datasets. We demonstrate that our model exploits semantic information to generate captions for hundreds of object categories in the ImageNet object recognition dataset that are not observed in MSCOCO image-caption training data, as well as many categories that are observed very rarely. Both automatic evaluations and human judgements show that our model considerably outperforms prior work in being able to describe many more categories of objects.

...read moreread less

Proceedings Article•DOI•

Learning Multimodal Attention LSTM Networks for Video Captioning

[...]

Jun Xu¹, Ting Yao², Yongdong Zhang¹, Tao Mei•Institutions (2)

University of Science and Technology of China¹, Microsoft²

19 Oct 2017

TL;DR: This work presents a novel deep framework to boost video captioning by learning Multimodal Attention Long-Short Term Memory networks (MA-LSTM), and designs a novel child-sum fusion unit in the MA-L STM to effectively combine different encoded modalities to the initial decoding states.

...read moreread less

Abstract: Automatic generation of video caption is a challenging task as video is an information-intensive media with complex variations. Most existing methods, either based on language templates or sequence learning, have treated video as a flat data sequence while ignoring intrinsic multimodality nature. Observing that different modalities (e.g., frame, motion, and audio streams), as well as the elements within each modality, contribute differently to the sentence generation, we present a novel deep framework to boost video captioning by learning Multimodal Attention Long-Short Term Memory networks (MA-LSTM). Our proposed MA-LSTM fully exploits both multimodal streams and temporal attention to selectively focus on specific elements during the sentence generation. Moreover, we design a novel child-sum fusion unit in the MA-LSTM to effectively combine different encoded modalities to the initial decoding states. Different from existing approaches that employ the same LSTM structure for different modalities, we train modality-specific LSTM to capture the intrinsic representations of individual modalities. The experiments on two benchmark datasets (MSVD and MSR-VTT) show that our MA-LSTM significantly outperforms the state-of-the-art methods with 52.3 BLEU@4 and 70.4 CIDER-D metrics on MSVD dataset, respectively.

...read moreread less

Journal Article•DOI•

Aligning Where to See and What to Tell: Image Captioning with Region-Based Attention and Scene-Specific Contexts

[...]

Kun Fu¹, Junqi Jin¹, Runpeng Cui², Fei Sha³, Changshui Zhang² - Show less +1 more•Institutions (3)

University of Southern California¹, Tsinghua University², University of California, Los Angeles³

01 Dec 2017-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This paper proposes an image captioning system that exploits the parallel structures between images and sentences and makes another novel modeling contribution by introducing scene-specific contexts that capture higher-level semantic information encoded in an image.

...read moreread less

Abstract: Recent progress on automatic generation of image captions has shown that it is possible to describe the most salient information conveyed by images with accurate and meaningful sentences. In this paper, we propose an image captioning system that exploits the parallel structures between images and sentences. In our model, the process of generating the next word, given the previously generated ones, is aligned with the visual perception experience where the attention shifts among the visual regions—such transitions impose a thread of ordering in visual perception. This alignment characterizes the flow of latent meaning, which encodes what is semantically shared by both the visual scene and the text description. Our system also makes another novel modeling contribution by introducing scene-specific contexts that capture higher-level semantic information encoded in an image. The contexts adapt language models for word generation to specific scene types. We benchmark our system and contrast to published results on several popular datasets, using both automatic evaluation metrics and human evaluation. We show that either region-based attention or scene-specific contexts improves systems without those components. Furthermore, combining these two modeling ingredients attains the state-of-the-art performance.

...read moreread less

Proceedings Article•DOI•

Guided Open Vocabulary Image Captioning with Constrained Beam Search

[...]

Peter Anderson¹, Basura Fernando¹, Mark Johnson, Stephen Gould¹•Institutions (1)

Australian National University¹

01 Sep 2017

TL;DR: This work uses constrained beam search to force the inclusion of selected tag words in the output, and fixed, pretrained word embeddings to facilitate vocabulary expansion to previously unseen tag words to achieve state of the art results for out-of- domain captioning on MSCOCO (and improved results for in-domain captioning).

...read moreread less

Abstract: Existing image captioning models do not generalize well to out-of-domain images containing novel scenes or objects. This limitation severely hinders the use of these models in real world applications dealing with images in the wild. We address this problem using a flexible approach that enables existing deep captioning architectures to take advantage of image taggers at test time, without re-training. Our method uses constrained beam search to force the inclusion of selected tag words in the output, and fixed, pretrained word embeddings to facilitate vocabulary expansion to previously unseen tag words. Using this approach we achieve state of the art results for out-of-domain captioning on MSCOCO (and improved results for in-domain captioning). Perhaps surprisingly, our results significantly outperform approaches that incorporate the same tag predictions into the learning algorithm. We also show that we can significantly improve the quality of generated ImageNet captions by leveraging ground-truth labels.

...read moreread less

Proceedings Article•DOI•

Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention

[...]

Theodore Bluche¹, Jérôme Louradour², Ronaldo Messina•Institutions (2)

Centre national de la recherche scientifique¹, Université de Montréal²

01 Nov 2017

TL;DR: In this paper, an attention-based model for end-to-end handwriting recognition is presented. But the main difference is the implementation of covert and overt attention with a multi-dimensional LSTM network, which does not require any segmentation of the input paragraph.

...read moreread less

Abstract: We present an attention-based model for end-to-end handwriting recognition. Our system does not require any segmentation of the input paragraph. The model is inspired by the differentiable attention models presented recently for speech recognition, image captioning or translation. The main difference is the implementation of covert and overt attention with a multi-dimensional LSTM network. Our principal contribution towards handwriting recognition lies in the automatic transcription without a prior segmentation into lines, which was critical in previous approaches. Moreover, the system is able to learn the reading order, enabling it to handle bidirectional scripts such as Arabic. We carried out experiments on the well-known IAM Database and report encouraging results which bring hope to perform full paragraph transcription in the near future.

...read moreread less

Proceedings Article•DOI•

Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning

[...]

Jingkuan Song¹, Lianli Gao², Zhao Guo², Wu Liu³, Dongxiang Zhang², Heng Tao Shen² - Show less +2 more•Institutions (3)

Columbia University¹, University of Electronic Science and Technology of China², Beijing University of Posts and Telecommunications³

01 Jan 2017

TL;DR: The proposed framework utilizes the temporal attention for selecting specific frames to predict the related words, while the adjusted temporal attention is for deciding whether to depend on the visual information or the language context information.

...read moreread less

Proceedings Article•DOI•

Attend to You: Personalized Image Captioning with Context Sequence Memory Networks

[...]

Cesc Chunseong Park, Byeongchang Kim¹, Gunhee Kim¹•Institutions (1)

Seoul National University¹

24 Jul 2017

TL;DR: This work proposes a novel captioning model named Context Sequence Memory Network (CSMN), and shows the effectiveness of the three novel features of CSMN and its performance enhancement for personalized image captioning over state-of-the-art captioning models.

...read moreread less

Abstract: We address personalization issues of image captioning, which have not been discussed yet in previous research. For a query image, we aim to generate a descriptive sentence, accounting for prior knowledge such as the users active vocabularies in previous documents. As applications of personalized image captioning, we tackle two post automation tasks: hashtag prediction and post generation, on our newly collected Instagram dataset, consisting of 1.1M posts from 6.3K users. We propose a novel captioning model named Context Sequence Memory Network (CSMN). Its unique updates over previous memory network models include (i) exploiting memory as a repository for multiple types of context information, (ii) appending previously generated words into memory to capture long-term information without suffering from the vanishing gradient problem, and (iii) adopting CNN memory structure to jointly represent nearby ordered memory slots for better context understanding. With quantitative evaluation and user studies via Amazon Mechanical Turk, we show the effectiveness of the three novel features of CSMN and its performance enhancement for personalized image captioning over state-of-the-art captioning models.

...read moreread less

Journal Article•DOI•

Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image

[...]

Zhenwei Shi¹, Zhengxia Zou¹•Institutions (1)

Beihang University¹

31 Mar 2017-IEEE Transactions on Geoscience and Remote Sensing

TL;DR: This paper has proposed a remote sensing image captioning framework by leveraging the techniques of the recent fast development of deep learning and fully convolutional networks and demonstrates that the proposed method is able to generate robust and comprehensive sentence description with desirable speed performance.

...read moreread less

Abstract: This paper investigates an intriguing question in the remote sensing field: “can a machine generate humanlike language descriptions for a remote sensing image?” The automatic description of a remote sensing image (namely, remote sensing image captioning) is an important but rarely studied task for artificial intelligence. It is more challenging as the description must not only capture the ground elements of different scales, but also express their attributes as well as how these elements interact with each other. Despite the difficulties, we have proposed a remote sensing image captioning framework by leveraging the techniques of the recent fast development of deep learning and fully convolutional networks. The experimental results on a set of high-resolution optical images including Google Earth images and GaoFen-2 satellite images demonstrate that the proposed method is able to generate robust and comprehensive sentence description with desirable speed performance.

...read moreread less

Posted Content•

Cold Fusion: Training Seq2Seq Models Together with Language Models

[...]

Anuroop Sriram¹, Heewoo Jun¹, Sanjeev Satheesh¹, Adam Coates²•Institutions (2)

Baidu¹, Manchester Metropolitan University²

21 Aug 2017-arXiv: Computation and Language

TL;DR: The authors presented the cold fusion method, which leverages a pre-trained language model during training, and showed its effectiveness on the speech recognition task, which is able to better utilize language information enjoying faster convergence and better generalization, and almost complete transfer to a new domain while using less than 10% of the labeled training data.

...read moreread less

Abstract: Sequence-to-sequence (Seq2Seq) models with attention have excelled at tasks which involve generating natural language sentences such as machine translation, image captioning and speech recognition. Performance has further been improved by leveraging unlabeled data, often in the form of a language model. In this work, we present the Cold Fusion method, which leverages a pre-trained language model during training, and show its effectiveness on the speech recognition task. We show that Seq2Seq models with Cold Fusion are able to better utilize language information enjoying i) faster convergence and better generalization, and ii) almost complete transfer to a new domain while using less than 10% of the labeled training data.

...read moreread less

Proceedings Article•DOI•

Top-Down Visual Saliency Guided by Captions

[...]

Vasili Ramanishka¹, Abir Das¹, Jianming Zhang², Kate Saenko¹•Institutions (2)

Boston University¹, Adobe Systems²

01 Jul 2017

TL;DR: This article proposed Caption-Guided Visual Saliency (CGVS) to expose the region-to-word mapping in modern encoder-decoder networks and demonstrate that it is learned implicitly from caption training data, without any pixel-level annotations.

...read moreread less

Abstract: Neural image/video captioning models can generate accurate descriptions, but their internal process of mapping regions to words is a black box and therefore difficult to explain. Top-down neural saliency methods can find important regions given a high-level semantic task such as object classification, but cannot use a natural language sentence as the top-down input for the task. In this paper, we propose Caption-Guided Visual Saliency to expose the region-to-word mapping in modern encoder-decoder networks and demonstrate that it is learned implicitly from caption training data, without any pixel-level annotations. Our approach can produce spatial or spatiotemporal heatmaps for both predicted captions, and for arbitrary query sentences. It recovers saliency without the overhead of introducing explicit attention layers, and can be used to analyze a variety of existing model architectures and improve their design. Evaluation on large-scale video and image datasets demonstrates that our approach achieves comparable captioning performance with existing methods while providing more accurate saliency heatmaps. Our code is available at visionlearninggroup.github.io/caption-guided-saliency/.

...read moreread less

Collapse