Open Access · Proceedings ArticleDOI

Lattice Long Short-Term Memory for Human Action Recognition

TL;DR
Lattice-LSTM (L2STM) as discussed by the authors extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations, which effectively enhances the ability to model dynamics across time and addresses the nonstationary issue of long-term motion dynamics without significantly increasing the model complexity.
Abstract
Human actions captured in video sequences are three-dimensional signals characterizing visual appearance and motion dynamics. To learn action patterns, existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and RNNs). CNN-based methods are effective in learning spatial appearances but are limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term Memory (LSTM), are able to learn temporal motion dynamics. However, naively applying RNNs to video sequences in a convolutional manner implicitly assumes that motions in videos are stationary across different spatial locations. This assumption is valid for short-term motions but invalid when the duration of the motion is long.

In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations. This method effectively enhances the ability to model dynamics across time and addresses the non-stationary issue of long-term motion dynamics without significantly increasing the model complexity. Additionally, we introduce a novel multi-modal training procedure for training our network. Unlike traditional two-stream architectures, which use RGB and optical flow information as input, our two-stream model leverages both modalities to jointly train both input gates and both forget gates in the network, rather than treating the two streams as separate entities with no information about the other. We apply this end-to-end system to benchmark datasets (UCF-101 and HMDB-51) of human action recognition. Experiments show that on both datasets, our proposed method outperforms all existing ones based on LSTMs and/or CNNs of similar model complexity.
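The core idea in the abstract, per-location rather than shared recurrent transitions, can be illustrated with a toy single-channel sketch. This is not the paper's implementation: the weights, shapes, and gate here are hypothetical simplifications, contrasting one scalar recurrent weight shared across space (the stationarity assumption) with an independent weight at each spatial location (the lattice idea).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
H, W = 4, 4                       # toy spatial grid
h_prev = rng.standard_normal((H, W))  # previous hidden state, one channel

# Shared transition: a single scalar weight applied everywhere,
# implicitly assuming motion dynamics are stationary across space.
w_shared = 0.5
gate_shared = sigmoid(w_shared * h_prev)

# Lattice-style transition: an independent weight per spatial location,
# letting each position learn its own temporal dynamics.
W_lattice = rng.standard_normal((H, W))
gate_lattice = sigmoid(W_lattice * h_prev)
```

Note the parameter count grows only with the spatial grid size, which matches the abstract's claim of handling non-stationary motion without a large increase in model complexity.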


Citations
Proceedings ArticleDOI

X3D: Expanding Architectures for Efficient Video Recognition

TL;DR: This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth, finding that networks with high spatiotemporal resolution can perform well, while being extremely light in terms of network width and parameters.
Proceedings ArticleDOI

Chinese NER Using Lattice LSTM

TL;DR: The authors proposed a lattice-structured LSTM model for Chinese NER, which encodes a sequence of input characters as well as all potential words that match a lexicon.
Proceedings ArticleDOI

Long-Term Feature Banks for Detailed Video Understanding

TL;DR: In this article, a long-term feature bank is proposed to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds, enabling existing video models to relate the present to the past, and put events in context.
Proceedings ArticleDOI

PoTion: Pose MoTion Representation for Action Recognition

TL;DR: A novel representation that gracefully encodes the movement of some semantic keypoints is introduced that outperforms other state-of-the-art pose representations and is complementary to standard appearance and motion streams.
References
Proceedings ArticleDOI

Action Recognition with Improved Trajectories

TL;DR: Dense trajectories, an efficient video representation for action recognition, are improved by estimating and correcting for camera motion, achieving state-of-the-art results on a variety of datasets.
Book ChapterDOI

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

TL;DR: Temporal Segment Networks (TSN) as discussed by the authors combine a sparse temporal sampling strategy with video-level supervision to enable efficient and effective learning over the whole action video, achieving state-of-the-art performance on HMDB51 and UCF101.
Proceedings Article

Convolutional LSTM Network: a machine learning approach for precipitation nowcasting

TL;DR: In this article, a convolutional LSTM (ConvLSTM) was proposed to better capture spatiotemporal correlations; it consistently outperforms the fully connected LSTM (FC-LSTM).
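The ConvLSTM idea summarized above can be sketched in a minimal single-channel form: gates are computed by convolving the input and hidden state, preserving spatial structure, instead of flattening them into vectors as an FC-LSTM does. The kernel sizes, the hand-rolled `conv2d` helper, and the single input gate shown here are illustrative assumptions, not the paper's full cell.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d(x, k):
    """Valid-mode 2D correlation of x with kernel k (toy helper)."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))    # input frame (one channel)
h = rng.standard_normal((6, 6))    # previous hidden state
Wxi = rng.standard_normal((3, 3))  # input-to-gate kernel (hypothetical)
Whi = rng.standard_normal((3, 3))  # hidden-to-gate kernel (hypothetical)

# Input gate as a spatial map: each location's gate value depends on
# its local neighborhood in both the input and the hidden state.
i_gate = sigmoid(conv2d(x, Wxi) + conv2d(h, Whi))  # shape (4, 4)
```

Because the gate stays a spatial map, the recurrence respects where things happen in the frame, which is exactly the property the L2STM paper builds on and then relaxes per location.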
Proceedings Article

Unsupervised Learning of Video Representations using LSTMs

TL;DR: In this paper, an encoder LSTM maps an input video sequence into a fixed-length representation, which is then decoded by one or more decoder Long Short-Term Memory (LSTM) networks to perform different tasks.
Proceedings ArticleDOI

Beyond short snippets: Deep networks for video classification

TL;DR: In this article, a recurrent neural network with Long Short-Term Memory (LSTM) cells connected to the output of an underlying CNN was proposed to model the video as an ordered sequence of frames.