Book Chapter DOI

Pose-Based Temporal-Spatial Network (PTSN) for Gait Recognition with Carrying and Clothing Variations

28 Oct 2017, pp. 474-483
TL;DR: This work proposes a novel pose-based gait recognition approach that is more robust to clothing and carrying variations, together with a pose-based temporal-spatial network (PTSN) to extract temporal-spatial features, which effectively improves the performance of gait recognition.
Abstract: Gait recognition is one of the most attractive biometric techniques because of its potential for human identification at a distance. However, gait recognition remains challenging in real applications due to the many variations that affect appearance and shape. Appearance-based methods usually compute the gait energy image (GEI), which is extracted from human silhouettes. The GEI is obtained by averaging the silhouettes, and as a result the temporal information is removed. Body joints, in contrast, are invariant to changing clothing and carrying conditions. We propose a novel pose-based gait recognition approach that is more robust to clothing and carrying variations. In addition, a pose-based temporal-spatial network (PTSN) is proposed to extract temporal-spatial features, which effectively improves the performance of gait recognition. Experiments on the challenging CASIA-B dataset show that our method achieves state-of-the-art performance under both carrying and clothing conditions.
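For concreteness, the GEI mentioned above is simply the per-pixel temporal mean of aligned binary silhouettes. A minimal sketch, assuming the silhouettes are already segmented, aligned, and size-normalized (array shapes are illustrative, not prescribed by the paper):

```python
import numpy as np

def gait_energy_image(silhouettes: np.ndarray) -> np.ndarray:
    """Average a stack of aligned binary silhouettes (T, H, W) into a GEI.

    The mean collapses the time axis, which is exactly the loss of
    temporal information that motivates the pose-based PTSN approach.
    """
    assert silhouettes.ndim == 3, "expected (T, H, W)"
    return silhouettes.astype(np.float32).mean(axis=0)
```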
Citations
Journal Article DOI
TL;DR: PoseGait exploits human 3D pose estimated from images by a convolutional neural network as the input feature for gait recognition, and designs spatio-temporal features from the 3D pose to improve the recognition rate.

243 citations

Journal Article DOI
17 Jul 2019
TL;DR: GaitSet learns identity information from a set of independent frames; the method is immune to frame permutation and can naturally integrate frames from different videos filmed under different scenarios, such as diverse viewing angles and different clothing/carrying conditions.
Abstract: As a unique biometric feature that can be recognized at a distance, gait has broad applications in crime prevention, forensic identification and social security. To portray a gait, existing gait recognition methods utilize either a gait template, where temporal information is hard to preserve, or a gait sequence, which must keep unnecessary sequential constraints and thus loses the flexibility of gait recognition. In this paper we present a novel perspective, where a gait is regarded as a set consisting of independent frames. We propose a new network named GaitSet to learn identity information from the set. Based on the set perspective, our method is immune to permutation of frames, and can naturally integrate frames from different videos which have been filmed under different scenarios, such as diverse viewing angles, different clothes/carrying conditions. Experiments show that under normal walking conditions, our single-model method achieves an average rank-1 accuracy of 95.0% on the CASIA-B gait dataset and an 87.1% accuracy on the OU-MVLP gait dataset. These results represent new state-of-the-art recognition accuracy. On various complex scenarios, our model exhibits a significant level of robustness. It achieves accuracies of 87.2% and 70.4% on CASIA-B under bag-carrying and coat-wearing walking conditions, respectively. These outperform the existing best methods by a large margin. The method presented can also achieve a satisfactory accuracy with a small number of frames in a test sample, e.g., 82.5% on CASIA-B with only 7 frames. The source code has been released at https://github.com/AbnerHqC/GaitSet.

236 citations
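The set perspective described above reduces to a permutation-invariant pooling over per-frame features. A minimal sketch of that core idea, using an elementwise max over the frame axis (one of the Set Pooling instantiations discussed in the paper; tensor shapes are illustrative):

```python
import torch

def set_pooling(frame_features: torch.Tensor) -> torch.Tensor:
    """Collapse per-frame features (N, T, C, H, W) into a set-level
    feature (N, C, H, W) by an elementwise max over the frame axis.

    Max is permutation-invariant, so shuffling frames, or mixing frames
    taken from different videos of the same subject, leaves the output
    unchanged.
    """
    return frame_features.max(dim=1).values
```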

Proceedings Article DOI
14 Jun 2020
TL;DR: The Focal Convolution Layer, a new application of convolution, is presented to enhance fine-grained learning of part-level spatial features, and the Micro-motion Capture Module is proposed as a novel way of temporal modeling for the gait task, focusing on short-range temporal features rather than redundant long-range features of the gait cycle.
Abstract: Gait recognition, applied to identify individual walking patterns at a distance, is one of the most promising video-based biometric technologies. At present, most gait recognition methods take the whole human body as a unit to establish spatio-temporal representations. However, we have observed that different parts of the human body exhibit evidently different visual appearances and movement patterns during walking. In the latest literature, employing partial features for human body description has been verified to be beneficial to individual recognition. Taking the above insights together, we assume that each part of the human body needs its own spatio-temporal expression. We therefore propose a novel part-based model, GaitPart, which boosts performance in two respects: On the one hand, the Focal Convolution Layer, a new application of convolution, is presented to enhance fine-grained learning of part-level spatial features. On the other hand, the Micro-motion Capture Module (MCM) is proposed, with several parallel MCMs in GaitPart corresponding to the pre-defined parts of the human body. It is worth mentioning that the MCM is a novel way of temporal modeling for the gait task, which focuses on short-range temporal features rather than redundant long-range features of the cyclic gait. Experiments on two of the most popular public datasets, CASIA-B and OU-MVLP, demonstrate that our method sets a new state-of-the-art on multiple standard benchmarks. The source code will be available at https://github.com/ChaoFan96/GaitPart.

222 citations
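A minimal sketch of the Focal Convolution idea described above: the feature map is split into horizontal strips and the same convolution is applied to each strip independently, so receptive fields never cross part borders. Channel sizes and the number of parts here are illustrative, not GaitPart's exact configuration:

```python
import torch
import torch.nn as nn

class FocalConv2d(nn.Module):
    """Apply a shared conv to each horizontal strip of the input
    independently, restricting receptive fields to one body part."""
    def __init__(self, in_ch: int, out_ch: int, parts: int = 4):
        super().__init__()
        self.parts = parts
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, C, H, W)
        strips = x.chunk(self.parts, dim=2)   # split along the height axis
        return torch.cat([self.conv(s) for s in strips], dim=2)
```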

Journal Article DOI
TL;DR: A new multi-channel gait template, called the period energy image (PEI), and multi-task generative adversarial networks (MGANs), which leverage adversarial training to extract more discriminative features from gait sequences, are proposed.
Abstract: Gait recognition is of great importance in the fields of surveillance and forensics for identifying human beings, since gait is a unique biometric feature that can be perceived efficiently at a distance. However, the accuracy of gait recognition suffers to some extent from both the variation of view angles and deficient gait templates. On one hand, existing cross-view methods focus on transforming gait templates among different views, which may accumulate transformation error over a large variation of view angles. On the other hand, the commonly used gait energy image template loses the temporal information of a gait sequence. To address these problems, this paper proposes multi-task generative adversarial networks (MGANs) for learning view-specific feature representations. In order to preserve more temporal information, we also propose a new multi-channel gait template, called the period energy image (PEI). Based on the assumption of a view-angle manifold, the MGANs can leverage adversarial training to extract more discriminative features from gait sequences. Experiments on the OU-ISIR, CASIA-B, and USF benchmark datasets indicate that, compared with several recently published approaches, PEI + MGANs achieves competitive performance and is more interpretable for cross-view gait recognition.

181 citations
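The PEI template mentioned above preserves coarse temporal information by stacking several phase-binned averages instead of one global average. A rough sketch under the simplifying assumption that the input is a single normalized gait period split into equal consecutive bins (the paper's construction is more elaborate):

```python
import numpy as np

def period_energy_image(silhouettes: np.ndarray, channels: int = 8) -> np.ndarray:
    """Split one gait period (T, H, W) into `channels` consecutive phase
    bins and average the silhouettes within each bin, yielding a
    (channels, H, W) template. Unlike a single GEI, the channel axis
    keeps coarse phase information."""
    assert silhouettes.shape[0] >= channels, "need at least one frame per bin"
    bins = np.array_split(np.arange(silhouettes.shape[0]), channels)
    return np.stack([silhouettes[b].astype(np.float32).mean(axis=0) for b in bins])
```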


Cites background or methods from "Pose-Based Temporal-Spatial Network..."

  • ...However, the performance of gait recognition [1] suffers from various exterior factors including clothing [2], walking speed [3], low resolution [4] and so on....

  • ...The positions of body-joints extracted in their methods provide us with features which are insensitive to clothing and carrying variations [2]....

  • ...Instead of using silhouette images which are sensitive to the clothing and carrying variations, a pose-based temporal-spatial network [2] is proposed to extract dynamic and static information from the key-point of human...

  • ...Recently, deep neural networks based gait recognition methods were introduced in [2], [21], and [30]–[32]....

Journal Article DOI
Qin Zou, Yanling Wang, Qian Wang, Yi Zhao, Qingquan Li
TL;DR: A hybrid deep neural network is proposed for robust gait feature representation, where features in the space and time domains are successively abstracted by a convolutional neural network and a recurrent neural network to obtain good person identification and authentication performance.
Abstract: Compared to other biometrics, gait is difficult to conceal and has the advantage of being unobtrusive. Inertial sensors, such as accelerometers and gyroscopes, are often used to capture gait dynamics. These inertial sensors are commonly integrated into smartphones and are widely used by the average person, which makes gait data convenient and inexpensive to collect. In this paper, we study gait recognition using smartphones in the wild. In contrast to traditional methods, which often require a person to walk along a specified road and/or at a normal walking speed, the proposed method collects inertial gait data under unconstrained conditions without knowing when, where, and how the user walks. To obtain good person identification and authentication performance, deep-learning techniques are presented to learn and model the gait biometrics based on walking data. Specifically, a hybrid deep neural network is proposed for robust gait feature representation, where features in the space and time domains are successively abstracted by a convolutional neural network and a recurrent neural network. In the experiments, two datasets collected by smartphones for a total of 118 subjects are used for evaluations. The experiments show that the proposed method achieves higher than 93.5% and 93.7% accuracies in person identification and authentication, respectively.

166 citations
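A minimal sketch of the CNN-then-RNN pattern the abstract describes, applied to windows of 6-axis inertial data (3 accelerometer + 3 gyroscope channels); layer sizes are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class InertialGaitNet(nn.Module):
    """A 1-D CNN abstracts local motion patterns inside the window, and
    an LSTM then models their temporal order, matching the hybrid
    space/time design described in the abstract."""
    def __init__(self, n_classes: int = 118):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(6, 32, kernel_size=5, padding=2),  # 6 inertial axes
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, 6, T)
        h = self.cnn(x).transpose(1, 2)   # -> (N, T/2, 32) for the LSTM
        out, _ = self.lstm(h)
        return self.fc(out[:, -1])        # classify from the final step
```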


Cites background from "Pose-Based Temporal-Spatial Network..."

  • ...studies bypass the use of silhouette images [12]–[14], e.g....

References
Journal Article DOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

72,897 citations
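For reference, one step of an LSTM cell written out in NumPy, in the now-standard formulation with a forget gate (the forget gate was added to the architecture after this 1997 paper). The additive cell update is the "constant error carousel" that keeps error flow constant:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. x: (D,), h and c: (H,), W: (4H, D), U: (4H, H),
    b: (4H,). The stacked weights are sliced gate by gate."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0 * H:1 * H])   # input gate
    f = sigmoid(z[1 * H:2 * H])   # forget gate
    o = sigmoid(z[2 * H:3 * H])   # output gate
    g = np.tanh(z[3 * H:4 * H])   # candidate cell input
    c_new = f * c + i * g         # constant error carousel: additive update
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```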


"Pose-Based Temporal-Spatial Network..." refers methods in this paper

  • ...Our contribution in this paper is a pose-based temporal-spatial network that combines an LSTM and a Convolutional Neural Network (CNN) to capture the dynamic and static information of a gait sequence....

  • ...Firstly, we use the Long Short-Term Memory (LSTM) [11] to extract the temporal features from gait pose sequences....

  • ...Unlike the MSCNN, we fuse the CNN with the LSTM in the top fully convolutional layer, which effectively boosts the performance of gait recognition....

  • ...They feed the joint heatmap of consecutive frames to Long Short Term Memory (LSTM)....

  • ...CNN for Spatial Features: The LSTM can effectively extract the dynamic information, but it does not have enough capacity to extract the static information of gait, such as the length between ankle and knee....

Proceedings Article
08 Dec 2014
TL;DR: This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Abstract: We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.

6,397 citations
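A minimal sketch of the two-stream prediction the abstract describes: a spatial network sees a single RGB frame, a temporal network sees a stack of horizontal/vertical optical-flow fields from L consecutive frames, and the class scores are fused by averaging (one of the fusion schemes evaluated in the paper; the network internals are left abstract):

```python
import torch
import torch.nn as nn

def two_stream_predict(spatial_net: nn.Module,
                       temporal_net: nn.Module,
                       rgb_frame: torch.Tensor,    # (N, 3, H, W)
                       flow_stack: torch.Tensor    # (N, 2 * L, H, W)
                       ) -> torch.Tensor:
    """Late fusion of appearance and motion streams by averaging
    softmax class scores."""
    s = torch.softmax(spatial_net(rgb_frame), dim=1)
    t = torch.softmax(temporal_net(flow_stack), dim=1)
    return (s + t) / 2
```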


"Pose-Based Temporal-Spatial Network..." refers background in this paper

  • ...[19] trained an additional network on top of optical flow in order to capture temporal information under the framework of CNN....

Proceedings Article DOI
21 Jul 2017
TL;DR: The approach uses a nonparametric representation, referred to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image, and achieves state-of-the-art performance on the MPII Multi-Person benchmark.
Abstract: We present an approach to efficiently detect the 2D pose of multiple people in an image. The approach uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. The architecture encodes global context, allowing a greedy bottom-up parsing step that maintains high accuracy while achieving realtime performance, irrespective of the number of people in the image. The architecture is designed to jointly learn part locations and their association via two branches of the same sequential prediction process. Our method placed first in the inaugural COCO 2016 keypoints challenge, and significantly exceeds the previous state-of-the-art result on the MPII Multi-Person benchmark, both in performance and efficiency.

3,958 citations
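The part association step described above scores each candidate limb by a line integral of the PAF along the segment between two detected joints; in practice the integral is approximated by sampling. A hedged sketch (sample points are assumed to fall inside the map; variable names are illustrative):

```python
import numpy as np

def paf_limb_score(paf: np.ndarray, j1, j2, n_samples: int = 10) -> float:
    """Score a candidate limb between joints j1 and j2 by averaging the
    dot product of the 2-channel PAF with the limb's unit direction at
    points sampled along the segment. paf: (2, H, W); j1, j2: (x, y)."""
    j1, j2 = np.asarray(j1, float), np.asarray(j2, float)
    v = j2 - j1
    norm = np.linalg.norm(v)
    if norm < 1e-6:
        return 0.0
    v = v / norm
    score = 0.0
    for u in np.linspace(0.0, 1.0, n_samples):
        x, y = (j1 + u * (j2 - j1)).astype(int)
        score += paf[0, y, x] * v[0] + paf[1, y, x] * v[1]
    return score / n_samples
```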


"Pose-Based Temporal-Spatial Network..." refers methods in this paper

  • ...However, a recent bottom-up method [6] for pose estimation using deep learning opens the door to revisiting approaches based on dynamic parameters....

  • ...We use a pre-trained model of multi-person 2D pose estimation [6] to acquire the human pose....

Book Chapter DOI
08 Oct 2016
TL;DR: This paper proposes a new supervision signal, called center loss, for face recognition task, which simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers.
Abstract: Convolutional neural networks (CNNs) have been widely used in the computer vision community, significantly improving the state-of-the-art. In most available CNNs, the softmax loss function is used as the supervision signal to train the deep model. In order to enhance the discriminative power of the deeply learned features, this paper proposes a new supervision signal, called center loss, for the face recognition task. Specifically, the center loss simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers. More importantly, we prove that the proposed center loss function is trainable and easy to optimize in CNNs. With the joint supervision of softmax loss and center loss, we can train robust CNNs to obtain deep features with the two key learning objectives, inter-class dispersion and intra-class compactness, which are essential to face recognition. It is encouraging to see that our CNNs (with such joint supervision) achieve state-of-the-art accuracy on several important face recognition benchmarks: Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and the MegaFace Challenge. In particular, our new approach achieves the best results on MegaFace (the largest public-domain face benchmark) under the protocol of a small training set (under 500,000 images and under 20,000 persons), significantly improving the previous results and setting a new state-of-the-art for both face recognition and face verification tasks.

3,464 citations
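A minimal sketch of the center loss described above: one learnable center per class, with the penalty L_c = 1/2 * sum_i ||x_i - c_{y_i}||^2 added to the softmax loss with a weight lambda (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Maintain a learnable center per class and penalize the squared
    distance between each deep feature and its class center."""
    def __init__(self, n_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_classes, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        diff = feats - self.centers[labels]          # (N, feat_dim)
        return 0.5 * (diff ** 2).sum(dim=1).mean()   # batch-averaged L_c

# Joint supervision as in the paper: total = cross_entropy + lam * center_loss
```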