Open Access · Posted Content

Bidirectional Long-Short Term Memory for Video Description

TLDR
A novel video captioning framework, termed BiLSTM, which deeply captures bidirectional global temporal structure in video, comprehensively preserves sequential and visual information, and adaptively learns dense visual features and sparse semantic representations for videos and sentences, respectively.
Abstract
Video captioning has been attracting broad research attention in the multimedia community. However, most existing approaches either ignore temporal information among video frames or employ only local contextual temporal knowledge. In this work, we propose a novel video captioning framework, termed Bidirectional Long-Short Term Memory (BiLSTM), which deeply captures bidirectional global temporal structure in video. Specifically, we first devise a joint visual modelling approach that encodes video data by combining a forward LSTM pass, a backward LSTM pass, and visual features from Convolutional Neural Networks (CNNs). Then, we inject the derived video representation into the subsequent language model for initialization. The benefits are twofold: 1) sequential and visual information is comprehensively preserved; and 2) dense visual features and sparse semantic representations are adaptively learned for videos and sentences, respectively. We verify the effectiveness of the proposed framework on a commonly used benchmark, the Microsoft Video Description (MSVD) corpus, and the experimental results demonstrate the superiority of the proposed approach over several state-of-the-art methods.
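
The encoder the abstract describes can be sketched in a few lines of PyTorch. The fragment below is a minimal illustration under stated assumptions (the module name VideoBiLSTMEncoder, the feature and hidden sizes, and the mean-pooling fusion are all hypothetical), not the authors' released implementation:

import torch
import torch.nn as nn

class VideoBiLSTMEncoder(nn.Module):
    # Sketch: CNN frame features -> forward + backward LSTM passes,
    # merged with pooled visual features into a single video code.
    def __init__(self, feat_dim=4096, hidden_dim=512):
        super().__init__()
        # bidirectional=True runs the forward and backward passes jointly
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.fuse = nn.Linear(2 * hidden_dim + feat_dim, hidden_dim)

    def forward(self, frame_feats):            # (batch, frames, feat_dim)
        states, _ = self.bilstm(frame_feats)   # (batch, frames, 2*hidden)
        temporal = states.mean(dim=1)          # global temporal summary
        visual = frame_feats.mean(dim=1)       # global visual summary
        return torch.tanh(self.fuse(torch.cat([temporal, visual], dim=1)))

The resulting video code would then seed the sentence-generating LSTM, e.g. as its initial hidden state, which is the initialization step the abstract refers to.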


Citations
Journal ArticleDOI

GLA: Global–Local Attention for Image Description

TL;DR: The proposed GLA method generates more relevant image description sentences and achieves state-of-the-art performance on the well-known Microsoft COCO caption dataset across several popular evaluation metrics.
Proceedings ArticleDOI

Adaptively Attending to Visual Attributes and Linguistic Knowledge for Captioning

TL;DR: This work designs a key control unit, termed the visual gate, to adaptively decide "when" and "what" the language generator attends to during word generation, and employs a bottom-up workflow to learn a pool of semantic attributes that serve as propositional attention resources.
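
A rough illustration of such a gate, with all names and wiring as assumptions rather than the paper's actual model:

import torch
import torch.nn as nn

class VisualGate(nn.Module):
    # A scalar gate decides, per decoding step, how much the generator
    # relies on attended visual attributes versus its linguistic state.
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, 1)

    def forward(self, h_lang, v_attr):
        # h_lang: decoder hidden state; v_attr: attended attribute vector
        beta = torch.sigmoid(self.gate(h_lang))    # "when"/"what" control
        return beta * v_attr + (1 - beta) * h_lang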
Journal ArticleDOI

Occurrence prediction of cotton pests and diseases by bidirectional long short-term memory networks with climate and atmosphere circulation

TL;DR: Experimental results showed that Bi-LSTM performs well on occurrence prediction of pests and diseases in cotton fields, yielding an Area Under the Curve (AUC) of 0.95, and verified that climate indeed has a strong impact on the occurrence of pests and diseases, while circulation parameters also have some influence.
Posted Content

A Perceptual Prediction Framework for Self Supervised Event Segmentation

TL;DR: In this paper, a self-supervised, predictive learning framework is proposed to segment long, visually complex videos into individual, stable segments that share the same semantics, and a new adaptive learning paradigm is introduced to reduce the effect of catastrophic forgetting in recurrent neural networks.
Journal ArticleDOI

Exploiting the local temporal information for video captioning

TL;DR: A reinforcement-learning-based method that sequentially predicts an adaptive sliding-window size for better event exploration, using a single Monte-Carlo sample to approximate the gradient of the reward-based loss function.
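
The single-sample gradient trick mentioned here is the standard REINFORCE estimator; a generic sketch follows, in which the policy, reward, and action space are placeholders rather than the paper's model:

import torch

def reinforce_step(policy_logits, reward):
    # Draw one Monte-Carlo sample (e.g. a window size) from the policy
    dist = torch.distributions.Categorical(logits=policy_logits)
    action = dist.sample()
    # Minimizing this loss yields -reward * grad log pi(action),
    # a one-sample estimate of the reward objective's gradient
    loss = -reward * dist.log_prob(action)
    return loss, action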
References
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through constant error carousels within special units.
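
The constant error carousel is the additive cell-state update; below is a minimal cell written from the standard modern formulation (not the 1997 paper's exact notation), with packed weight matrices W, U, b as illustrative conventions:

import torch

def lstm_cell(x, h, c, W, U, b):
    # W, U, b pack the input, forget, output, and candidate transforms
    gates = x @ W + h @ U + b
    i, f, o, g = gates.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    c_next = f * c + i * torch.tanh(g)   # additive update: errors flow
    h_next = o * torch.tanh(c_next)      # through c without vanishing
    return h_next, c_next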
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
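
The core design choice, stacking small 3x3 filters to build depth, can be shown in a toy block (channel sizes are illustrative): two stacked 3x3 layers cover a 5x5 receptive field with fewer parameters and more non-linearity than a single 5x5 layer.

import torch.nn as nn

# One VGG-style block: small 3x3 convolutions stacked before pooling
vgg_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
)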
Book ChapterDOI

Microsoft COCO: Common Objects in Context

TL;DR: A new dataset with the goal of advancing the state of the art in object recognition by placing object recognition in the context of the broader question of scene understanding, gathering images of complex everyday scenes containing common objects in their natural context.
Proceedings ArticleDOI

Bleu: a Method for Automatic Evaluation of Machine Translation

TL;DR: This paper proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
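
In practice BLEU can be computed with NLTK's implementation, which is also how caption metrics such as those on MSVD are commonly scripted (the example sentences here are made up):

from nltk.translate.bleu_score import sentence_bleu

reference = [["a", "man", "is", "playing", "a", "guitar"]]
candidate = ["a", "man", "plays", "the", "guitar"]
# weights=(0.5, 0.5) scores BLEU-2 (unigram and bigram precision)
score = sentence_bleu(reference, candidate, weights=(0.5, 0.5))
print(round(score, 3))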
Proceedings ArticleDOI

Caffe: Convolutional Architecture for Fast Feature Embedding

TL;DR: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.