Open Access · Posted Content
Bidirectional Long-Short Term Memory for Video Description
TL;DR: A novel video captioning framework, termed BiLSTM, which deeply captures bidirectional global temporal structure in video, comprehensively preserving sequential and visual information and adaptively learning dense visual features and sparse semantic representations for videos and sentences.
Abstract:
Video captioning has been attracting broad research attention in the multimedia community. However, most existing approaches either ignore temporal information among video frames or employ only local contextual temporal knowledge. In this work, we propose a novel video captioning framework, termed \emph{Bidirectional Long-Short Term Memory} (BiLSTM), which deeply captures bidirectional global temporal structure in video. Specifically, we first devise a joint visual modelling approach to encode video data by combining a forward LSTM pass and a backward LSTM pass with visual features from Convolutional Neural Networks (CNNs). Then, we inject the derived video representation into the subsequent language model for initialization. The benefits are twofold: 1) comprehensively preserving sequential and visual information; and 2) adaptively learning dense visual features and sparse semantic representations for videos and sentences, respectively. We verify the effectiveness of our proposed video captioning framework on a commonly used benchmark, i.e., the Microsoft Video Description (MSVD) corpus, and the experimental results demonstrate the superiority of the proposed approach compared to several state-of-the-art methods.
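The abstract describes encoding a video with a forward LSTM pass, a backward LSTM pass, and CNN visual features, then using the joint representation to initialize a language model. A minimal NumPy sketch of that encoding scheme follows; the dimensions, random parameters, and mean-pooling of CNN features are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gate order in z: input, forget, output, candidate."""
    H = h.shape[0]
    z = W @ x + U @ h + b                      # all gate pre-activations, (4H,)
    i = 1 / (1 + np.exp(-z[0:H]))              # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))            # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))          # output gate
    g = np.tanh(z[3*H:4*H])                    # candidate cell update
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def run_lstm(frames, W, U, b, H):
    """Run an LSTM over a sequence of frame features; return final hidden state."""
    h, c = np.zeros(H), np.zeros(H)
    for x in frames:
        h, c = lstm_step(x, h, c, W, U, b)
    return h

D, H, T = 8, 4, 5                              # toy feature dim, hidden dim, frame count
frames = rng.normal(size=(T, D))               # stand-in for per-frame CNN features

def init_params():
    return (rng.normal(scale=0.1, size=(4*H, D)),
            rng.normal(scale=0.1, size=(4*H, H)),
            np.zeros(4*H))

fw = run_lstm(frames, *init_params(), H)       # forward temporal pass
bw = run_lstm(frames[::-1], *init_params(), H) # backward temporal pass
visual = frames.mean(axis=0)                   # pooled visual features (assumed pooling)

# Joint video representation used to initialize the language model.
video_repr = np.concatenate([fw, bw, visual])
assert video_repr.shape == (2*H + D,)
```

The key design point the abstract emphasizes is that the forward and backward passes together give each frame global temporal context, rather than only the local context a single-direction model sees.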
Citations
Journal Article
GLA: Global–Local Attention for Image Description
TL;DR: The proposed GLA method generates more relevant image description sentences and achieves state-of-the-art performance on the well-known Microsoft COCO caption dataset across several popular evaluation metrics.
Proceedings Article
Adaptively Attending to Visual Attributes and Linguistic Knowledge for Captioning
TL;DR: This work designs a key control unit, termed the visual gate, to adaptively decide "when" and "what" the language generator attends to during word generation, and employs a bottom-up workflow to learn a pool of semantic attributes that serve as propositional attention resources.
Journal Article
Occurrence prediction of cotton pests and diseases by bidirectional long short-term memory networks with climate and atmosphere circulation
TL;DR: Experimental results showed that Bi-LSTM performs well on the occurrence prediction of pests and diseases in cotton fields, yielding an Area Under the Curve (AUC) of 0.95, and verified that climate indeed has a strong impact on the occurrence of pests and diseases, while circulation parameters also have a certain influence.
Posted Content
A Perceptual Prediction Framework for Self Supervised Event Segmentation
TL;DR: In this paper, a self-supervised, predictive learning framework is proposed to segment long, visually complex videos into individual, stable segments that share the same semantics, and a new adaptive learning paradigm is introduced to reduce the effect of catastrophic forgetting in recurrent neural networks.
Journal Article
Exploiting the local temporal information for video captioning
TL;DR: A reinforcement-learning-based method that sequentially predicts an adaptive sliding-window size for better event exploration, using a single Monte Carlo sample to approximate the gradient of the reward-based loss function.
References
Journal Article
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
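The constant error carousel mentioned above refers to the LSTM cell-state update being additive, so error flow along the cell state is scaled only by the forget gates rather than squashed by repeated nonlinearities. A toy numerical illustration (not from the paper) contrasts this with a scalar tanh RNN over 1000 steps:

```python
import numpy as np

# Constant error carousel: c_t = f_t * c_{t-1} + i_t * g_t is additive, so the
# gradient of c_T w.r.t. c_0 is the product of the forget gates. With gates held
# near 1, the gradient survives 1000 steps.
T = 1000
f = np.full(T, 0.999)            # forget gates held near 1 (illustrative value)
lstm_grad = np.prod(f)           # gradient along the carousel, ~exp(-1) ~ 0.37

# A scalar tanh RNN h_t = tanh(w * h_{t-1}): each chain-rule factor is
# w * (1 - tanh(w*h)^2) <= w < 1, so the product vanishes.
w, h = 0.9, 0.5
rnn_grad = 1.0
for _ in range(T):
    rnn_grad *= w * (1 - np.tanh(w * h) ** 2)
    h = np.tanh(w * h)

assert lstm_grad > 0.3           # error signal preserved
assert rnn_grad < 1e-20          # error signal vanished
```

This is exactly the "minimal time lags in excess of 1000 discrete-time steps" property: the multiplicative factor per step stays near 1 along the cell state, instead of decaying geometrically.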
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small convolution filters, and shows that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.
Book Chapter
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence Zitnick
TL;DR: A new dataset with the goal of advancing the state of the art in object recognition by placing object recognition in the context of the broader question of scene understanding, gathering images of complex everyday scenes containing common objects in their natural context.
Proceedings Article
Bleu: a Method for Automatic Evaluation of Machine Translation
TL;DR: This paper proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, correlates highly with human evaluation, and has little marginal cost per run.
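BLEU, summarized above, is also the standard metric for video captioning benchmarks such as MSVD. A toy single-reference sketch of its core idea, clipped n-gram precision combined by geometric mean with a brevity penalty, is below; it omits the smoothing and multi-reference handling of the full metric:

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=2):
    """Toy BLEU: clipped n-gram precisions, geometric mean, brevity penalty.
    Single reference, no smoothing -- an illustration, not the full metric."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        precisions.append(clipped / total)
    # Brevity penalty discourages short candidates that inflate precision.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    if min(precisions) == 0:
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

assert abs(bleu("a man is playing a guitar", "a man is playing a guitar") - 1.0) < 1e-9
assert bleu("the the the", "a man plays") == 0.0
```

The clipping step is what makes BLEU resistant to degenerate candidates that simply repeat a common reference word.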
Proceedings Article
Caffe: Convolutional Architecture for Fast Feature Embedding
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, Trevor Darrell
TL;DR: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.