Proceedings ArticleDOI

Exploiting Objects with LSTMs for Video Categorization

TLDR
This paper proposes to leverage high-level semantic features to open the "black box" of the state-of-the-art temporal model, Long Short Term Memory (LSTM), with an aim to understand what is learned.
Abstract
Temporal dynamics play an important role in video classification. In this paper, we propose to leverage high-level semantic features to open the "black box" of the state-of-the-art temporal model, Long Short Term Memory (LSTM), with an aim to understand what is learned. More specifically, we first extract object features from a state-of-the-art CNN model that is trained to recognize 20K objects. Then we leverage LSTM with the extracted features as inputs to capture the temporal dynamics in videos. In combination with spatial and motion information, we achieve improvements for supervised video categorization. Furthermore, by masking the inputs, we demonstrate what is learned by LSTM, namely (i) which objects are crucial for recognizing a class-of-interest; and (ii) how the LSTM model could assist the temporal localization of these detected objects.
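The pipeline described in the abstract — per-frame object scores fed into an LSTM, with input masking used to probe which objects drive a prediction — can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: the feature dimension (20K object scores), hidden size, class count, and the mask_objects helper are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ObjectLSTMClassifier(nn.Module):
    """Minimal sketch: classify a video from per-frame object-score features."""
    def __init__(self, num_objects=20000, hidden_size=512, num_classes=101):
        super().__init__()
        # The LSTM consumes one object-score vector per frame (batch_first: B x T x D).
        self.lstm = nn.LSTM(num_objects, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, object_scores):
        # object_scores: (batch, num_frames, num_objects) from a pre-trained CNN.
        _, (h_n, _) = self.lstm(object_scores)
        return self.fc(h_n[-1])  # last hidden state summarizes the video

def mask_objects(object_scores, object_ids):
    """Zero out selected object dimensions to test their influence on the prediction
    (an assumed helper mirroring the paper's input-masking analysis)."""
    masked = object_scores.clone()
    masked[:, :, object_ids] = 0.0
    return masked

# Hypothetical usage: compare predictions with and without a group of objects.
model = ObjectLSTMClassifier()
features = torch.rand(2, 30, 20000)               # 2 videos, 30 frames each
full_logits = model(features)
masked_logits = model(mask_objects(features, [5, 42]))
print((full_logits - masked_logits).abs().max())  # large change => those objects matter
```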


Citations
Journal ArticleDOI

Unified Spatio-Temporal Attention Networks for Action Recognition in Videos

TL;DR: Unified Spatio-Temporal Attention Networks (STAN) are proposed in the context of multiple modalities; unlike conventional deep networks that focus on the attention mechanism alone, their temporal attention provides principled and global guidance across different modalities and video segments.
Proceedings ArticleDOI

Collaborative Deep Metric Learning for Video Understanding

TL;DR: A deep network is proposed that embeds videos, using their audio-visual content, onto a metric space that preserves video-to-video relationships; the embeddings are used to tackle various domains including video classification and recommendation, showing significant improvements over state-of-the-art baselines (see the embedding sketch after this citation list).
Proceedings ArticleDOI

Exploring Background-bias for Anomaly Detection in Surveillance Videos

TL;DR: This paper develops a series of experiments to validate the existence of the background-bias phenomenon, which makes deep networks tend to learn background information rather than the pattern of anomalies when recognizing abnormal behavior, and proposes an end-to-end trainable, anomaly-area-guided framework.
Journal ArticleDOI

CI-GNN: Building a Category-Instance Graph for Zero-Shot Video Classification

TL;DR: An end-to-end framework is proposed to directly and collectively model the category-instance, category-category, and instance-instance relationships in the CI-graph; object semantics are adopted as a bridge to generate unified representations for both videos and categories.
Proceedings ArticleDOI

Large-Scale Content-Only Video Recommendation

TL;DR: This paper models recommendation as a video content-based similarity learning problem and learns deep video embeddings trained to predict video relationships identified by a co-watch-based system, using only visual and audio content.
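The video-embedding idea behind the two entries above (Collaborative Deep Metric Learning and Large-Scale Content-Only Video Recommendation) can be illustrated with a minimal triplet-loss sketch. This is an assumed simplification, not the cited systems: the feature dimensions, network shape, and co-watch-derived triplets are placeholders.

```python
import torch
import torch.nn as nn

class VideoEmbedder(nn.Module):
    """Sketch: map pre-extracted audio-visual features onto a metric space."""
    def __init__(self, in_dim=2048, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))

    def forward(self, feats):
        # L2-normalize so distances on the unit sphere reflect video similarity.
        return nn.functional.normalize(self.net(feats), dim=-1)

# Hypothetical triplets (anchor, related, unrelated), e.g. derived from co-watch data.
embedder = VideoEmbedder()
loss_fn = nn.TripletMarginLoss(margin=0.2)
anchor, positive, negative = (torch.rand(8, 2048) for _ in range(3))
loss = loss_fn(embedder(anchor), embedder(positive), embedder(negative))
loss.backward()  # related videos are pulled together, unrelated ones pushed apart
```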
References
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: State-of-the-art ImageNet classification performance was achieved with a deep convolutional neural network that consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax (a rough architectural sketch follows this reference list).
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings ArticleDOI

Going deeper with convolutions

TL;DR: Inception is a deep convolutional neural network architecture that achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Journal ArticleDOI

ImageNet classification with deep convolutional neural networks

TL;DR: A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
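As a rough sketch of the network summarized in the ImageNet classification entries above (five convolutional layers, some followed by max-pooling, then three fully-connected layers ending in a 1000-way softmax), the layer widths and kernel sizes below are illustrative assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn

# Illustrative 5-conv / 3-FC network with a 1000-way output; filter counts and
# kernel sizes are assumptions chosen for brevity.
net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 1000),  # 1000-way class scores; softmax is applied in the loss
)
logits = net(torch.rand(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```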