
Showing papers by "Stan Z. Li published in 2014"


Journal Article•
TL;DR: A semi-automatic way to collect face images from the Internet is proposed and a large-scale dataset containing about 10,000 subjects and 500,000 images, called CASIA-WebFace, is built, based on which an 11-layer CNN is used to learn a discriminative representation and obtain state-of-the-art accuracy on LFW and YTF.
Abstract: Pushed by big data and deep convolutional neural networks (CNN), the performance of face recognition is becoming comparable to that of humans. Using private large-scale training datasets, several groups have achieved very high performance on LFW, i.e., 97% to 99%. While there are many open-source implementations of CNNs, no large-scale face dataset is publicly available. The current situation in the field of face recognition is that data is more important than algorithms. To solve this problem, this paper proposes a semi-automatic way to collect face images from the Internet and builds a large-scale dataset containing about 10,000 subjects and 500,000 images, called CASIA-WebFace. Based on this database, we use an 11-layer CNN to learn a discriminative representation and obtain state-of-the-art accuracy on LFW and YTF.

1,705 citations


Proceedings Article•DOI•
24 Aug 2014
TL;DR: A more general way that can learn a similarity metric from image pixels directly by using a "siamese" deep neural network that can jointly learn the color feature, texture feature and metric in a unified framework is proposed.
Abstract: Various hand-crafted features and metric learning methods prevail in the field of person re-identification. Compared to these methods, this paper proposes a more general way that can learn a similarity metric from image pixels directly. By using a "siamese" deep neural network, the proposed method can jointly learn the color feature, texture feature and metric in a unified framework. The network has a symmetric structure with two sub-networks which are connected by a cosine layer. Each sub-network includes two convolutional layers and a fully connected layer. To deal with the big variations of person images, binomial deviance is used to evaluate the cost between similarities and labels, which is proved to be robust to outliers. Experiments on VIPeR illustrate the superior performance of our method, and a cross-database experiment also shows its good generalization.
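The two ingredients named in the abstract — a cosine similarity connecting the sub-networks and a binomial deviance cost — can be sketched in a few lines. This is a minimal numpy illustration of one common form of the loss; the constants alpha, beta and the negative-pair weight are illustrative assumptions, not the paper's values.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def binomial_deviance(s, same, alpha=2.0, beta=0.5, neg_weight=1.0):
    """Binomial deviance cost between a similarity score and a pair label.

    s    : cosine similarity in [-1, 1]
    same : True for a positive (same-person) pair, False otherwise
    alpha, beta and neg_weight are illustrative hyper-parameters.
    """
    m = 1.0 if same else -neg_weight          # signed pair label
    return float(np.log(1.0 + np.exp(-alpha * (s - beta) * m)))

# toy usage: a highly similar pair costs little when labeled positive,
# a lot when labeled negative
a = np.random.randn(128)
print(binomial_deviance(cosine_similarity(a, a), same=True))   # small
print(binomial_deviance(cosine_similarity(a, a), same=False))  # larger
```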

923 citations


Book Chapter•DOI•
06 Sep 2014
TL;DR: This paper proposes a novel salient color names based color descriptor (SCNCD) to describe colors that outperforms the state-of-the-art performance (without user’s feedback optimization) on two challenging datasets (VIPeR and PRID 450S).
Abstract: Color naming, which relates colors with color names, can help people with a semantic analysis of images in many computer vision applications. In this paper, we propose a novel salient color names based color descriptor (SCNCD) to describe colors. SCNCD utilizes salient color names to guarantee that a higher probability will be assigned to the color name which is nearer to the color. Based on SCNCD, color distributions over color names in different color spaces are then obtained and fused to generate a feature representation. Moreover, background information is incorporated and its effect on person re-identification is analyzed. With a simple metric learning method, the proposed approach outperforms the state-of-the-art performance (without user’s feedback optimization) on two challenging datasets (VIPeR and PRID 450S). More importantly, the proposed feature can be obtained very fast if we compute the SCNCD of each color in advance.
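To make the "salient color names" idea concrete, here is a toy numpy sketch that assigns a pixel color a probability distribution over color names, giving higher mass to nearer names and zero mass outside the k nearest. The prototype set and the exponential weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Toy color-name prototypes in RGB; a real color-name set (e.g. the 16
# names used in the paper) would replace these illustrative entries.
COLOR_NAMES = {
    "black": (0, 0, 0), "white": (255, 255, 255),
    "red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
}

def salient_color_name_distribution(rgb, k=3):
    """Probability over color names: nearer names get higher probability,
    and only the k nearest ('salient') names get non-zero mass."""
    names = list(COLOR_NAMES)
    centers = np.array([COLOR_NAMES[n] for n in names], dtype=float)
    d = np.linalg.norm(centers - np.asarray(rgb, dtype=float), axis=1)
    order = np.argsort(d)[:k]                  # k nearest color names
    w = np.exp(-d[order] / d[order].sum())     # nearer -> larger weight
    w /= w.sum()
    return {names[i]: float(p) for i, p in zip(order, w)}

print(salient_color_name_distribution((200, 30, 40)))  # 'red' gets the highest mass
```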

502 citations


Book Chapter•DOI•
06 Sep 2014
TL;DR: The evaluation protocol of the VOT2014 challenge and the results of a comparison of 38 trackers on the benchmark dataset are presented, offering a more systematic comparison of the trackers.
Abstract: The Visual Object Tracking challenge 2014, VOT2014, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 38 trackers are presented. The number of tested trackers makes VOT2014 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2014 challenge that go beyond its VOT2013 predecessor are introduced: (i) a new VOT2014 dataset with full annotation of targets by rotated bounding boxes and per-frame attributes, (ii) extensions of the VOT2013 evaluation methodology, (iii) a new unit for tracking speed assessment less dependent on the hardware and (iv) the VOT2014 evaluation toolkit that significantly speeds up the execution of experiments. The dataset, the evaluation kit and the results are publicly available at the challenge website (http://votchallenge.net).

391 citations


Posted Content•
TL;DR: Instead of hand-designing features, a deep convolutional neural network is relied on to learn features of high discriminative ability in a supervised manner; combined with some data pre-processing, face anti-spoofing performance improves drastically.
Abstract: Despite some progress, hand-crafted texture features, e.g., LBP [23] and LBP-TOP [11], are still unable to capture the most discriminative cues between genuine and fake faces. In this paper, instead of designing features by hand, we rely on a deep convolutional neural network (CNN) to learn features of high discriminative ability in a supervised manner. Combined with some data pre-processing, the face anti-spoofing performance improves drastically. In the experiments, over 70% relative decrease of Half Total Error Rate (HTER) is achieved on two challenging datasets, CASIA [36] and REPLAY-ATTACK [7], compared with the state-of-the-art. Meanwhile, the experimental results from inter-tests between the two datasets indicate that the CNN can obtain features with better generalization ability. Moreover, the nets trained using combined data from the two datasets exhibit less bias between the two datasets.
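The HTER figure the abstract reports is easy to compute from raw scores. A minimal numpy sketch, assuming score arrays where higher means "genuine" and the usual convention of fixing the threshold at the development-set EER:

```python
import numpy as np

def far_frr(genuine, impostor, thr):
    """False accept / false reject rates at a decision threshold."""
    far = np.mean(np.asarray(impostor) >= thr)   # fakes accepted
    frr = np.mean(np.asarray(genuine) < thr)     # real faces rejected
    return far, frr

def hter(genuine_dev, impostor_dev, genuine_test, impostor_test):
    """Half Total Error Rate: pick the EER threshold on the development
    set, then average FAR and FRR on the test set at that threshold."""
    thrs = np.sort(np.concatenate([genuine_dev, impostor_dev]))
    eer_thr = min(thrs, key=lambda t: abs(np.subtract(*far_frr(genuine_dev, impostor_dev, t))))
    far, frr = far_frr(genuine_test, impostor_test, eer_thr)
    return (far + frr) / 2.0
```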

350 citations


Journal Article•DOI•
TL;DR: This paper proposes a method to learn a discriminant face descriptor (DFD) in a data-driven way; it is further applied to the heterogeneous (cross-modality) face recognition problem, where DFD is learned in a coupled way to reduce the gap between features of heterogeneous face images and improve performance on this challenging problem.
Abstract: Local feature descriptors are an important module for face recognition, and descriptors such as Gabor and local binary patterns (LBP) have proven effective. Traditionally, the form of such local descriptors is predefined in a handcrafted way. In this paper, we propose a method to learn a discriminant face descriptor (DFD) in a data-driven way. The idea is to learn the most discriminant local features that minimize the difference of the features between images of the same person and maximize that between images from different people. In particular, we propose to enhance the discriminative ability of face representation in three aspects. First, the discriminant image filters are learned. Second, the optimal neighborhood sampling strategy is softly determined. Third, the dominant patterns are statistically constructed. Discriminative learning is incorporated to extract effective and robust features. We further apply the proposed method to the heterogeneous (cross-modality) face recognition problem and learn DFD in a coupled way (coupled DFD or C-DFD) to reduce the gap between features of heterogeneous face images and improve the performance on this challenging problem. Extensive experiments on the FERET, CAS-PEAL-R1, LFW, and HFB face databases validate the effectiveness of the proposed DFD learning on both homogeneous and heterogeneous face recognition problems. The DFD improves POEM and LQP by about 4.5 percent on the LFW database and the C-DFD enhances the heterogeneous face recognition performance of LBP by over 25 percent.
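The first step — learning discriminant image filters that shrink within-person feature differences and stretch between-person ones — has the flavor of a generalized eigenvalue problem on scatter matrices. Below is an LDA-style numpy sketch in that spirit; the paper's actual objective and optimization differ, so treat this as an illustration only.

```python
import numpy as np
from scipy.linalg import eigh

def learn_discriminant_filters(patches, labels, n_filters=8):
    """LDA-style sketch: find projections that minimize within-person
    scatter and maximize between-person scatter of local patch vectors.

    patches : (N, d) array, each row a vectorized local patch
    labels  : (N,) person identity per patch
    """
    mu = patches.mean(axis=0)
    d = patches.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = patches[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)               # within-person scatter
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)  # between-person scatter
    # generalized eigenproblem Sb w = lambda Sw w; keep top eigenvectors
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    return vecs[:, np.argsort(vals)[::-1][:n_filters]]  # (d, n_filters)
```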

342 citations


Proceedings Article•DOI•
23 Jun 2014
TL;DR: This paper solves the speed bottleneck of the deformable part model (DPM), while maintaining the accuracy in detection on challenging datasets, and achieves state-of-the-art accuracy on pedestrian and face detection tasks with frame-rate speed.
Abstract: This paper solves the speed bottleneck of the deformable part model (DPM), while maintaining the accuracy in detection on challenging datasets. Three prohibitive steps in the cascade version of DPM are accelerated, including 2D correlation between the root filter and the feature map, cascade part pruning and HOG feature extraction. For 2D correlation, the root filter is constrained to be low rank, so that 2D correlation can be calculated by a more efficient linear combination of 1D correlations. A proximal gradient algorithm is adopted to progressively learn the low-rank filter in a discriminative manner. For cascade part pruning, a neighborhood-aware cascade is proposed to capture the dependence in neighborhood regions for aggressive pruning. Instead of explicit computation of part scores, hypotheses can be pruned by scores of neighborhoods under a first-order approximation. For HOG feature extraction, look-up tables are constructed to replace expensive calculations of orientation partition and magnitude with simpler matrix index operations. Extensive experiments show that (a) the proposed method is 4 times faster than the current fastest DPM method with similar accuracy on Pascal VOC, and (b) the proposed method achieves state-of-the-art accuracy on pedestrian and face detection tasks with frame-rate speed.
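The low-rank trick for 2D correlation is easy to verify: a rank-1 filter F = u vᵀ turns one 2D correlation into a row-wise 1D correlation followed by a column-wise one (a rank-r filter needs r such pairs). A minimal numpy/scipy check:

```python
import numpy as np
from scipy.signal import correlate, correlate2d

rng = np.random.default_rng(0)
u, v = rng.standard_normal(5), rng.standard_normal(7)
F = np.outer(u, v)                       # a rank-1 filter
X = rng.standard_normal((40, 40))        # a feature map channel

full_2d = correlate2d(X, F, mode="valid")
# separable version: correlate rows with v, then columns with u
rows = np.apply_along_axis(lambda r: correlate(r, v, mode="valid"), 1, X)
sep = np.apply_along_axis(lambda c: correlate(c, u, mode="valid"), 0, rows)
assert np.allclose(full_2d, sep)         # identical results, fewer multiplies
```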

297 citations


Proceedings Article•DOI•
TL;DR: In this paper, a multi-view face detector using aggregate channel features is proposed, which extends the image channel to diverse types like gradient magnitude and oriented gradient histograms and therefore encodes rich information in a simple form.
Abstract: Face detection has drawn much attention in recent decades since the seminal work by Viola and Jones. While many subsequent works have improved upon it with more powerful learning algorithms, the feature representation used for face detection still can’t meet the demand for effectively and efficiently handling faces with large appearance variance in the wild. To solve this bottleneck, we bring the concept of channel features to the face detection domain, which extends the image channel to diverse types like gradient magnitude and oriented gradient histograms and therefore encodes rich information in a simple form. We adopt a novel variant called aggregate channel features, make a full exploration of feature design, and discover a multi-scale version of features with better performance. To deal with poses of faces in the wild, we propose a multi-view detection approach featuring score re-ranking and detection adjustment. Following the learning pipelines in the Viola-Jones framework, the multi-view face detector using aggregate channel features surpasses current state-of-the-art detectors on the AFW and FDDB test sets, while running at 42 FPS.
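A toy version of the channel computation and block aggregation can be written in a few lines of numpy. Real ACF detectors also use color channels, smoothing and a boosted classifier on top; everything below is illustrative.

```python
import numpy as np

def aggregate_channel_features(gray, n_orients=6, block=4):
    """Minimal sketch of aggregate channel features on a grayscale image:
    build gradient-magnitude and oriented-gradient channels, then
    aggregate each channel by summing over block x block cells."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)           # unsigned orientation
    bins = np.minimum((ang / np.pi * n_orients).astype(int), n_orients - 1)
    channels = [mag]
    for o in range(n_orients):                        # oriented magnitude channels
        channels.append(np.where(bins == o, mag, 0.0))
    h, w = gray.shape
    h, w = h - h % block, w - w % block
    feats = [c[:h, :w].reshape(h // block, block, w // block, block)
              .sum(axis=(1, 3)) for c in channels]    # block aggregation
    return np.stack(feats)                            # (1+n_orients, h/b, w/b)

feat = aggregate_channel_features(np.random.rand(64, 48))
print(feat.shape)                                     # (7, 16, 12)
```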

288 citations


Book Chapter•DOI•
01 Nov 2014
TL;DR: A convolutional neural network is applied to the age estimation problem, which leads to a fully learned end-to-end system that can estimate age from image pixels directly, and the multi-scale analysis strategy is introduced from traditional methods to the CNN, which improves the performance significantly.
Abstract: In the last five years, biologically inspired features (BIF) have consistently held the state-of-the-art results for human age estimation from face images. Recently, researchers have mainly focused on the regression step after feature extraction, such as support vector regression (SVR), partial least squares (PLS), canonical correlation analysis (CCA) and so on. In this paper, we apply a convolutional neural network (CNN) to the age estimation problem, which leads to a fully learned end-to-end system that can estimate age from image pixels directly. Compared with BIF, the proposed method has a deeper structure and the parameters are learned instead of hand-crafted. The multi-scale analysis strategy is also introduced from traditional methods to the CNN, which improves the performance significantly. Furthermore, we train an efficient network in a multi-task way which can do age estimation, gender classification and ethnicity classification well simultaneously. The experiments on MORPH Album 2 illustrate the superiority of the proposed multi-scale CNN over other state-of-the-art methods.
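As a sketch of the multi-task idea (age regression plus gender and ethnicity classification sharing one trunk), here is a minimal PyTorch module. Layer sizes are placeholders, not the paper's network, and the multi-scale strategy, which would feed several scales through parallel trunks, is omitted for brevity.

```python
import torch
import torch.nn as nn

class MultiTaskFaceNet(nn.Module):
    """Illustrative multi-task CNN: a shared convolutional trunk with one
    regression head for age and classification heads for gender and
    ethnicity."""
    def __init__(self, n_ethnicities=3):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(64 * 15 * 15, 256), nn.ReLU(),
        )
        self.age = nn.Linear(256, 1)                # regression head
        self.gender = nn.Linear(256, 2)             # classification head
        self.ethnicity = nn.Linear(256, n_ethnicities)

    def forward(self, x):                           # x: (N, 1, 60, 60)
        h = self.trunk(x)
        return self.age(h), self.gender(h), self.ethnicity(h)

net = MultiTaskFaceNet()
age, gender, eth = net(torch.randn(2, 1, 60, 60))
print(age.shape, gender.shape, eth.shape)           # (2,1) (2,2) (2,3)
```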

219 citations


Posted Content•
TL;DR: Following the learning pipelines in the Viola-Jones framework, the multi-view face detector using aggregate channel features shows competitive performance against state-of-the-art algorithms on the AFW and FDDB test sets, while running at 42 FPS on VGA images.
Abstract: Face detection has drawn much attention in recent decades since the seminal work by Viola and Jones. While many subsequent works have improved upon it with more powerful learning algorithms, the feature representation used for face detection still can't meet the demand for effectively and efficiently handling faces with large appearance variance in the wild. To solve this bottleneck, we bring the concept of channel features to the face detection domain, which extends the image channel to diverse types like gradient magnitude and oriented gradient histograms and therefore encodes rich information in a simple form. We adopt a novel variant called aggregate channel features, make a full exploration of feature design, and discover a multi-scale version of features with better performance. To deal with poses of faces in the wild, we propose a multi-view detection approach featuring score re-ranking and detection adjustment. Following the learning pipelines in the Viola-Jones framework, the multi-view face detector using aggregate channel features shows competitive performance against state-of-the-art algorithms on the AFW and FDDB test sets, while running at 42 FPS on VGA images.

213 citations


Journal Article•DOI•
TL;DR: The co-occurrence of face and body helps to handle large variations, such as heavy occlusions, and further boosts face detection performance; a hierarchical part-based structural model is proposed to capture it explicitly.

Proceedings Article•DOI•
23 Jun 2014
TL;DR: A novel data association approach based on an undirected hierarchical relation hypergraph is proposed, which formulates the tracking task as a hierarchical dense-neighborhoods searching problem on a dynamically constructed undirected affinity graph and makes the tracker robust to spatially close targets with similar appearance.
Abstract: Multi-target tracking is an interesting but challenging task in the computer vision field. Most previous data association based methods merely consider the relationships (e.g. appearance and motion pattern similarities) between detections in a local, limited temporal domain, leading to difficulties in handling long-term occlusion and distinguishing spatially close targets with similar appearance in crowded scenes. In this paper, a novel data association approach based on an undirected hierarchical relation hypergraph is proposed, which formulates the tracking task as a hierarchical dense neighborhoods searching problem on the dynamically constructed undirected affinity graph. The relationships between different detections across the spatiotemporal domain are considered in a high-order way, which makes the tracker robust to spatially close targets with similar appearance. Meanwhile, the hierarchical design of the optimization process makes our tracker more robust to long-term occlusion. Extensive experiments on various challenging datasets (i.e., PETS2009, ParkingLot), including both low- and high-density sequences, demonstrate that the proposed method performs favorably against the state-of-the-art methods.

Book•DOI•
18 Jul 2014
TL;DR: This Handbook of Biometric Anti-Spoofing reviews the state of the art in covert attacks against biometric systems and in deriving countermeasures to these attacks.
Abstract: Presenting the first definitive study of the subject, this Handbook of Biometric Anti-Spoofing reviews the state of the art in covert attacks against biometric systems and in deriving countermeasures to these attacks. Topics and features: provides a detailed introduction to the field of biometric anti-spoofing and a thorough review of the associated literature; examines spoofing attacks against five biometric modalities, namely, fingerprints, face, iris, speaker and gait; discusses anti-spoofing measures for multi-modal biometric systems; reviews evaluation methodologies, international standards and legal and ethical issues; describes current challenges and suggests directions for future research; presents the latest work from a global selection of experts in the field, including members of the TABULA RASA project.

Posted Content•
TL;DR: A more general way that can learn a similarity metric from image pixels directly by using a "siamese" deep neural network that can jointly learn the color feature, texture feature and metric in a unified framework is proposed.
Abstract: Various hand-crafted features and metric learning methods prevail in the field of person re-identification. Compared to these methods, this paper proposes a more general way that can learn a similarity metric from image pixels directly. By using a "siamese" deep neural network, the proposed method can jointly learn the color feature, texture feature and metric in a unified framework. The network has a symmetric structure with two sub-networks which are connected by a cosine function. To deal with the big variations of person images, binomial deviance is used to evaluate the cost between similarities and labels, which is proved to be robust to outliers. Compared to existing research, a more practical setting is studied in the experiments: training and testing on different datasets (cross-dataset person re-identification). In both the "intra-dataset" and "cross-dataset" settings, the superiority of the proposed method is illustrated on VIPeR and PRID.

Proceedings Article•DOI•
TL;DR: It is concluded that the large-scale unconstrained face recognition problem is still largely unresolved, and thus further attention and effort are needed in developing effective feature representations and learning algorithms.
Abstract: Many efforts have been made in recent years to tackle the unconstrained face recognition challenge. For the benchmark of this challenge, the Labeled Faces in the Wild (LFW) database has been widely used. However, the standard LFW protocol is very limited, with only 3,000 genuine and 3,000 impostor matches for classification. Today a 97% accuracy can be achieved with this benchmark, leaving very limited room for algorithm development. Moreover, we argue that this accuracy may be too optimistic because the underlying false accept rate may still be high (e.g. 3%). Furthermore, performance evaluation at low FARs is not statistically sound under the standard protocol due to the limited number of impostor matches. We therefore develop a new benchmark protocol to fully exploit all 13,233 LFW face images for large-scale unconstrained face recognition evaluation under both verification and open-set identification scenarios, with a focus on low FARs. Based on the new benchmark, we evaluate 21 face recognition approaches by combining 3 kinds of features and 7 learning algorithms. The benchmark results show that the best algorithm achieves a 41.66% verification rate at FAR=0.1%, and an 18.07% open-set identification rate at rank 1 and FAR=1%. Accordingly, we conclude that the large-scale unconstrained face recognition problem is still largely unresolved, and thus further attention and effort are needed in developing effective feature representations and learning algorithms. We thereby release a benchmark tool to advance research in this field.
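The headline numbers are verification rates at fixed FARs, which are straightforward to compute from genuine and impostor score sets. A minimal numpy sketch; the exact thresholding convention is an assumption:

```python
import numpy as np

def verification_rate_at_far(genuine, impostor, far=1e-3):
    """Verification rate (true accept rate) at a fixed false accept rate:
    set the threshold so that roughly a fraction `far` of impostor scores
    pass, then count how many genuine scores pass it."""
    impostor = np.sort(impostor)
    thr = impostor[int(np.ceil((1.0 - far) * len(impostor))) - 1]
    return float(np.mean(np.asarray(genuine) > thr))

# toy usage with synthetic score distributions
g = np.random.normal(1.0, 0.5, 5000)    # genuine scores
i = np.random.normal(0.0, 0.5, 50000)   # impostor scores
print(verification_rate_at_far(g, i, far=0.001))
```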

Journal Article•DOI•
TL;DR: A dynamic graph-based tracker (DGT) is proposed to address deformation and occlusion in visual tracking in a unified framework, and shows improved performance over several state-of-the-art trackers, in various challenging scenarios.
Abstract: While some efforts have been made to handle deformation and occlusion in visual tracking, they remain great challenges. In this paper, a dynamic graph-based tracker (DGT) is proposed to address these two challenges in a unified framework. In the dynamic target graph, nodes are the target's local parts encoding appearance information, and edges are the interactions between nodes encoding inner geometric structure information. This graph representation provides much more information for tracking in the presence of deformation and occlusion. Target tracking is then formulated as tracking this dynamic undirected graph, which is also a matching problem between the target graph and the candidate graph. The local parts within the candidate graph are separated from the background with a Markov random field, and spectral clustering is used to solve the graph matching. The final target state is determined through a weighted voting procedure according to the reliability of part correspondence, and refined with recourse to a foreground/background segmentation. An effective online updating mechanism is proposed to update the model, allowing DGT to robustly adapt to variations of target structure. Experimental results show improved performance over several state-of-the-art trackers in various challenging scenarios.

Journal Article•DOI•
TL;DR: This work proposes a transduction method named transductive heterogeneous face matching (THFM) to adapt the VIS-NIR matching learned from training with available image pairs to all people in the target set, and proposes a simple feature representation for effective VIS-NIR matching.
Abstract: Visual versus near infrared (VIS-NIR) face image matching uses an NIR face image as the probe and conventional VIS face images as enrollment. It takes advantage of NIR face technology in tackling illumination changes and low-light conditions, and can cater for more applications where the enrollment is done using VIS face images, such as ID card photos. Existing VIS-NIR techniques assume that during classifier learning, the VIS images of each target person have their NIR counterparts. However, corresponding VIS-NIR image pairs of the same people are not always available, which is often the case, so those methods cannot be applied. To address this problem, we propose a transductive method named transductive heterogeneous face matching (THFM) to adapt the VIS-NIR matching learned from training with available image pairs to all people in the target set. In addition, we propose a simple feature representation for effective VIS-NIR matching, which can be computed in three steps, namely Log-DoG filtering, local encoding, and uniform feature normalization, to reduce the heterogeneities between VIS and NIR images. The transduction approach can reduce the domain difference due to heterogeneous data and learn the discriminative model for target people simultaneously. To the best of our knowledge, this is the first attempt to formulate VIS-NIR matching using transduction to address the generalization problem of matching. Experimental results validate the effectiveness of our proposed method on heterogeneous face biometric databases.
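The first of the three feature steps, Log-DoG filtering, can be sketched directly: take a logarithm to compress illumination, then apply a difference-of-Gaussians band-pass. The sigma values below are illustrative, not the paper's:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def log_dog(image, sigma1=1.0, sigma2=2.0):
    """Log-DoG filtering sketch: log of the image (compresses
    illumination), then subtract two Gaussian blurs (a band-pass
    difference-of-Gaussians)."""
    log_img = np.log1p(image.astype(float))          # log(1 + I), safe at 0
    return gaussian_filter(log_img, sigma1) - gaussian_filter(log_img, sigma2)

filtered = log_dog(np.random.rand(128, 128) * 255)
```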

Book Chapter•DOI•
01 Jan 2014
TL;DR: This chapter presents a multi-spectral face recognition system working in the VIS (Visible) and NIR (Near Infrared) spectrums, which is robust to various spoofing attacks and free of user cooperation, and has two advantages: its recognition rate is higher than that of the VIS subsystem, and users' cooperation is no longer needed.
Abstract: With the wide application of face recognition, spoofing attack is becoming a big threat to its security. Conventional face recognition systems usually adopt behavioral challenge-response or texture analysis methods to resist spoofing attacks; however, these methods require high user cooperation and are sensitive to the imaging quality and environments. In this chapter, we present a multi-spectral face recognition system working in the VIS (Visible) and NIR (Near Infrared) spectrums, which is robust to various spoofing attacks and free of user cooperation. First, we introduce the structure of the system from several aspects, including: imaging device, face landmarking, feature extraction, matching, and the VIS and NIR sub-systems. Then the performance of the multi-spectral system and each subsystem is evaluated and analyzed. Finally, we describe the multi-spectral image-based anti-spoofing module and report its performance under photo attacks. Experiments on a spoofing database show the excellent performance of the proposed system both in recognition rate and anti-spoofing ability. Compared with a conventional VIS face recognition system, the multi-spectral system has two advantages: (1) By combining the VIS and NIR spectrums, the system can resist VIS photo and NIR photo attacks easily, and users’ cooperation is no longer needed, making the system user-friendly and fast. (2) Due to the precise key-point localization, Gabor feature extraction and unsupervised learning, the system is robust to pose, illumination and expression variations. Generally, its recognition rate is higher than that of the VIS subsystem.

Journal Article•DOI•
Longyin Wen, Zhaowei Cai, Zhen Lei, Dong Yi, Stan Z. Li 
TL;DR: A robust spatio-temporal context model based tracker is presented to complete the tracking task in unconstrained environments and extensive experiments validate the superiority of the tracker over other state-of-the-art trackers.
Abstract: Visual tracking is an important but challenging problem in the computer vision field. In the real world, the appearances of the target and its surroundings change continuously over space and time, which provides effective information for tracking the target robustly. However, previous works have not paid enough attention to this spatio-temporal appearance information. In this paper, a robust spatio-temporal context model based tracker is presented to complete the tracking task in unconstrained environments. The tracker is constructed with temporal and spatial appearance context models. The temporal appearance context model captures the historical appearance of the target to prevent the tracker from drifting to the background during long-term tracking. The spatial appearance context model integrates contributors to build a supporting field. The contributors are patches of the same size as the target at key-points automatically discovered around the target. The constructed supporting field provides much more information than the appearance of the target itself and thus ensures the robustness of the tracker in complex environments. Extensive experiments on various challenging databases validate the superiority of our tracker over other state-of-the-art trackers.

Posted Content•
TL;DR: A database collected from a video surveillance setting of 6 cameras, with 200 persons and 7,413 images segmented is presented, and a benchmark protocol for evaluating the performance under the open-set person re-identification scenario is developed.
Abstract: Person re-identification is becoming a hot research topic for developing both machine learning algorithms and video surveillance applications. The task of person re-identification is to determine which person in a gallery has the same identity as a probe image. This task basically assumes that the subject of the probe image belongs to the gallery, that is, the gallery contains this person. However, in practical applications such as searching for a suspect in a video, this assumption is usually not true. In this paper, we consider the open-set person re-identification problem, which includes two sub-tasks, detection and identification. The detection sub-task is to determine the presence of the probe subject in the gallery, and the identification sub-task is to determine which person in the gallery has the same identity as the accepted probe. We present a database collected from a video surveillance setting of 6 cameras, with 200 persons and 7,413 images segmented. Based on this database, we develop a benchmark protocol for evaluating performance under the open-set person re-identification scenario. Several popular metric learning algorithms for person re-identification have been evaluated as baselines. From the baseline performance, we observe that the open-set person re-identification problem is still largely unresolved, and thus further attention and effort are needed.

Book Chapter•DOI•
Yang Hu, Dong Yi, Shengcai Liao, Zhen Lei, Stan Z. Li
01 Nov 2014
TL;DR: A deep learning framework based on convolutional neural networks is presented to learn the person representation instead of existing hand-crafted features, and a cosine metric is used to calculate the similarity.
Abstract: Until now, most existing research on person re-identification has aimed at improving the recognition rate in the single-dataset setting, where the training data and testing data are from the same source. Although such methods obtain high recognition rates in experiments, they usually perform poorly in practical applications. In this paper, we focus on cross-dataset person re-identification, which makes more sense in the real world. We present a deep learning framework based on convolutional neural networks to learn the person representation instead of existing hand-crafted features, and a cosine metric is used to calculate the similarity. Three different datasets, Shinpuhkan2014dataset, CUHK and CASPR, are chosen as the training sets, and we evaluate the performance of the learned person representations on VIPeR. For the training set Shinpuhkan2014dataset, we also evaluate the performance on PRID and iLIDS. Experiments show that our method outperforms the existing cross-dataset methods significantly and even approaches the performance of some methods in the single-dataset setting.

Journal Article•DOI•
TL;DR: DICW computes the image-to-class distance between a query face and those of an enrolled subject by finding the optimal alignment between the query sequence and all sequences of that subject along both the time dimension and within-class dimension.
Abstract: Face recognition (FR) systems in real-world applications need to deal with a wide range of interferences, such as occlusions and disguises in face images. Compared with other forms of interference, such as nonuniform illumination and pose changes, faces with occlusions have not attracted enough attention yet. A novel approach, coined dynamic image-to-class warping (DICW), is proposed in this work to deal with this challenge in FR. The face consists of the forehead, eyes, nose, mouth, and chin in a natural order, and this order does not change despite occlusions. Thus, a face image is partitioned into patches, which are then concatenated in raster scan order to form an ordered sequence. Considering this order information, DICW computes the image-to-class distance between a query face and those of an enrolled subject by finding the optimal alignment between the query sequence and all sequences of that subject along both the time dimension and the within-class dimension. Unlike most existing methods, our method is able to deal with occlusions which exist in both gallery and probe images. Extensive experiments on public face databases with various types of occlusions have confirmed the effectiveness of the proposed method.
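The image-to-class warping can be pictured as dynamic time warping where, at each raster position, the query patch may match the best patch of any enrolled image of the class (the within-class dimension). A simplified numpy sketch with an illustrative warping band; the paper's full recursion is richer:

```python
import numpy as np

def dicw_distance(query, class_images, band=5):
    """Simplified image-to-class warping distance in the spirit of DICW.

    query        : (T, d) patch features of the probe, in raster order
    class_images : list of (T, d) arrays, enrolled images of one person
    """
    query = np.asarray(query)
    stack = np.stack(class_images)                   # (K, T, d)
    # cost[i, j] = distance from query patch i to the best image's patch j
    cost = np.min(np.linalg.norm(stack[:, None, :, :] -
                                 query[None, :, None, :], axis=-1), axis=0)
    T = len(query)
    D = np.full((T + 1, T + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):                        # standard DTW recursion
        for j in range(max(1, i - band), min(T, i + band) + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j - 1],
                                               D[i - 1, j], D[i, j - 1])
    return D[T, T]
```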

Proceedings Article•DOI•
Menglong Yang, Yiguang Liu, Longyin Wen, Zhisheng You, Stan Z. Li
23 Jun 2014
TL;DR: Experimental results show that the proposed framework can handle occlusion well, even including long-duration full occlusions, which may cause tracking failures in the traditional methods.
Abstract: Mutual occlusions among targets can cause track loss or target position deviation, because the observation likelihood of an occluded target may vanish even when we have the estimated location of the target. This paper presents a novel probability framework for multi-target tracking with mutual occlusions. The primary contribution of this work is the introduction of a vectorial occlusion variable as part of the solution. The occlusion variable describes the occlusion states of the targets. This forms the basis of the proposed probability framework, with the following further contributions: 1) Likelihood: A new observation likelihood model is presented, in which the likelihood of an occluded target is computed by referring to both the occluded and occluding targets. 2) Prior: A Markov random field (MRF) is used to model the occlusion prior such that less likely "circular" or "cascading" types of occlusions have lower prior probabilities. Both the occlusion prior and the motion prior take into consideration the state of occlusion. 3) Optimization: A real-time RJMCMC-based algorithm with a new move type called "occlusion state update" is presented. Experimental results show that the proposed framework can handle occlusions well, including even long-duration full occlusions, which may cause tracking failures in traditional methods.

Proceedings Article•DOI•
Yang Yang, Shengcai Liao, Zhen Lei, Dong Yi, Stan Z. Li
24 Aug 2014
TL;DR: Experiments show that the proposed algorithm outperforms the state-of-the-art methods on two public benchmark datasets (VIPeR and PRID 450S) and demonstrates its feasibility and effectiveness.
Abstract: Due to illumination changes, partial occlusions, and object scale differences, person re-identification across disjoint camera views is a challenging problem. To address this problem, a variety of image representations have been put forward. In this paper, the illumination invariance and distinctiveness of different color models, including the proposed color model, are first evaluated. Since color distributions are robust to image scale and partial occlusion, color distributions based on different color models are then calculated and fused in the feature extraction stage. Different color models are robust to different types of illumination, so fusing them lets them complement each other and contributes to better performance. In the feature matching stage, a weighted KISSME is presented to learn a better distance metric than the original KISSME. Experimental results demonstrate its feasibility and effectiveness. Finally, image pairs are matched based on the learned distance metric. Experiments conducted on two public benchmark datasets (VIPeR and PRID 450S) show that the proposed algorithm outperforms the state-of-the-art methods.
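For reference, the base KISSME metric that the weighted variant builds on has a closed form: the difference of inverse covariances of pairwise feature differences, projected to the positive semi-definite cone. A numpy sketch; the paper's weighting of pairs is not reproduced here:

```python
import numpy as np

def kissme(similar_diffs, dissimilar_diffs, reg=1e-6):
    """KISSME sketch: Mahalanobis matrix as the difference between the
    inverse covariances of feature differences for similar and
    dissimilar pairs, clipped to be positive semi-definite.

    *_diffs : (N, d) arrays of differences (x_i - x_j) for labeled pairs
    """
    d = similar_diffs.shape[1]
    cov_s = similar_diffs.T @ similar_diffs / len(similar_diffs)
    cov_d = dissimilar_diffs.T @ dissimilar_diffs / len(dissimilar_diffs)
    M = (np.linalg.inv(cov_s + reg * np.eye(d)) -
         np.linalg.inv(cov_d + reg * np.eye(d)))
    vals, vecs = np.linalg.eigh(M)                  # project onto PSD cone
    return vecs @ np.diag(np.clip(vals, 0.0, None)) @ vecs.T

def kissme_distance(M, x, y):
    diff = x - y
    return float(diff @ M @ diff)
```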

Proceedings Article•DOI•
24 Aug 2014
TL;DR: This paper proposes a new algorithm called Sparse SIFT Flow (SSF) to improve the reconstruction accuracy of the 3D Morphable Model and incorporates SSF into the Multi-Features Framework to construct a robust 3DMM fitting algorithm.
Abstract: The 3D Morphable Model (3DMM) has been widely used in face analysis for many years. The most challenging part of 3DMM is to find the correspondences between 3D points and 2D pixels. Existing methods only use key points, edges, specular highlights and image pixels to complete the task, which is not accurate or robust. This paper proposes a new algorithm called Sparse SIFT Flow (SSF) to improve the reconstruction accuracy. We mark a set of salient points to control the shape of facial components and use SSF to find their corresponding pixels on the input image. We also incorporate SSF into the Multi-Features Framework to construct a robust 3DMM fitting algorithm. Compared with the state-of-the-art, our approach significantly improves the fitting results in the facial component area.

Proceedings Article•DOI•
24 Aug 2014
TL;DR: This paper proposes a novel face descriptor, namely the local gradient order pattern (LGOP), taking into account the ordinal relationship of gradient responses in a local region to obtain a robust face representation, and adopts whitened principal component analysis (WPCA) to reduce the feature dimensionality and improve the computational efficiency.
Abstract: LBP is an effective descriptor for face recognition. LBP encodes the ordinal relationship between the neighborhood samplings and the central one to obtain a robust face representation. However, additional information, like the differences among neighboring pixels, which may be helpful for face recognition, is ignored. On the other hand, gradient information, which enhances edge responses and suppresses external noise like illumination variation, is usually useful for face recognition. In this paper, we propose a novel face descriptor, namely the local gradient order pattern (LGOP), taking into account the ordinal relationship of gradient responses in a local region to obtain a robust face representation. After pattern encoding, a 2-D histogram is adopted to calculate the occurrence frequency of different patterns, and multi-scale histogram features are extracted to represent the face image. We further adopt whitened principal component analysis (WPCA) to reduce the feature dimensionality and improve the computational efficiency. Extensive experiments on FERET, CAS-PEAL and LFW validate the effectiveness of LGOP for both constrained and unconstrained face recognition problems.

Book Chapter•DOI•
06 Sep 2014
TL;DR: The DPM with shape regression (SR-DPM) is more flexible than the traditional DPM, relaxing the fixed anchor location of each part, and provides an analogy to deep neural networks while benefiting from hand-crafted features and models.
Abstract: This paper explores the localization of pre-defined semantic object parts, which is much more challenging than traditional object detection and very important for applications such as face recognition, HCI and fine-grained object recognition. To address this problem, we make two critical improvements over the widely used deformable part model (DPM). The first is that we use appearance-based shape regression to globally estimate the anchor location of each part and then locally refine each part according to the estimated anchor location under the constraint of DPM. The DPM with shape regression (SR-DPM) is more flexible than the traditional DPM because it relaxes the fixed anchor location of each part. It enjoys the same efficient dynamic programming inference as the traditional DPM and can be discriminatively trained via a coordinate descent procedure. The second is that we propose to stack multiple SR-DPMs, where each layer uses the output of the previous SR-DPM as input to progressively refine the result. This provides an analogy to deep neural networks while benefiting from hand-crafted features and models. The proposed methods are applied to human pose estimation, face alignment and general object part localization tasks and achieve state-of-the-art performance.

Book Chapter•DOI•
01 Nov 2014
TL;DR: This work proposes a novel pedestrian attribute classification method which exploits interactions among different attributes and is able to balance the independent decision score against the interactions of other attributes to yield more robust classification results.
Abstract: Recent works have shown that visual attributes are useful in a number of applications, such as object classification, recognition, and retrieval. However, predicting attributes in images with large variations still remains a challenging problem. Several approaches have been proposed for visual attribute classification; however, most of them assume independence among attributes. In fact, to predict one attribute, it is often useful to consider other related attributes. For example, a pedestrian with long hair and a skirt usually implies the female attribute. Motivated by this, we propose a novel pedestrian attribute classification method which exploits interactions among different attributes. First, each attribute classifier is trained independently. Second, for each attribute, we also use the decision scores of the other attribute classifiers to learn an attribute interaction regressor. Finally, prediction of one attribute is achieved by a weighted combination of the independent decision score and the interaction score from the other attributes. The proposed method is able to balance the independent decision score against the interactions from other attributes to yield more robust classification results. Experimental results on the Attributed Pedestrians in Surveillance (APiS 1.0) [1] database validate the effectiveness of the proposed approach for pedestrian attribute classification.
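The three-stage scheme maps naturally onto off-the-shelf components. Below is a hedged sklearn sketch: independent logistic classifiers, a ridge regressor per attribute over the other attributes' scores, and a weighted fusion. The classifier choices and the fusion weight are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

def train_attribute_interaction(X, Y, weight=0.5):
    """Two-stage sketch: independent per-attribute classifiers, then a
    regressor per attribute over the other attributes' decision scores;
    prediction is a weighted sum of both.

    X : (N, d) features,  Y : (N, A) binary attribute labels
    """
    A = Y.shape[1]
    clfs = [LogisticRegression(max_iter=1000).fit(X, Y[:, a]) for a in range(A)]
    scores = np.column_stack([c.decision_function(X) for c in clfs])
    regs = []
    for a in range(A):                      # interaction regressors
        others = np.delete(scores, a, axis=1)
        regs.append(Ridge().fit(others, scores[:, a]))

    def predict(Xq):
        s = np.column_stack([c.decision_function(Xq) for c in clfs])
        out = np.empty_like(s)
        for a in range(A):
            inter = regs[a].predict(np.delete(s, a, axis=1))
            out[:, a] = weight * s[:, a] + (1 - weight) * inter
        return out > 0                      # thresholded attribute decisions

    return predict
```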

Posted Content•
TL;DR: Gabor features are extracted at some localized facial points, and then Restricted Boltzmann Machines (RBMs) are used to learn a shared representation locally to remove the heterogeneity around each facial point.
Abstract: After intensive research, heterogeneous face recognition is still a challenging problem. The main difficulties are owing to the complex relationship between heterogeneous face image spaces. The heterogeneity is always tightly coupled with other variations, which makes the relationship of heterogeneous face images highly nonlinear. Many excellent methods have been proposed to model the nonlinear relationship, but they are apt to overfit to the training set due to limited samples. Inspired by the unsupervised algorithms in deep learning, this paper proposes a novel framework for heterogeneous face recognition. We first extract Gabor features at some localized facial points, and then use Restricted Boltzmann Machines (RBMs) to learn a shared representation locally to remove the heterogeneity around each facial point. Finally, the shared representations of the local RBMs are connected together and processed by PCA. Two problems (Sketch-Photo and NIR-VIS) and three databases are selected to evaluate the proposed method. For the Sketch-Photo problem, we obtain perfect results on the CUFS database. For the NIR-VIS problem, we produce new state-of-the-art performance on the CASIA HFB and NIR-VIS 2.0 databases.

Posted Content•
TL;DR: An effective feature representation called Local Maximal Occurrence (LOMO) and a subspace and metric learning method called Cross-view Quadratic Discriminant Analysis (XQDA) are proposed for person re-identification.
Abstract: Person re-identification is an important technique towards automatic search of a person's presence in a surveillance video. Two fundamental problems are critical for person re-identification, feature representation and metric learning. An effective feature representation should be robust to illumination and viewpoint changes, and a discriminant metric should be learned to match various person images. In this paper, we propose an effective feature representation called Local Maximal Occurrence (LOMO), and a subspace and metric learning method called Cross-view Quadratic Discriminant Analysis (XQDA). The LOMO feature analyzes the horizontal occurrence of local features, and maximizes the occurrence to make a stable representation against viewpoint changes. Besides, to handle illumination variations, we apply the Retinex transform and a scale invariant texture operator. To learn a discriminant metric, we propose to learn a discriminant low dimensional subspace by cross-view quadratic discriminant analysis, and simultaneously, a QDA metric is learned on the derived subspace. We also present a practical computation method for XQDA, as well as its regularization. Experiments on four challenging person re-identification databases, VIPeR, QMUL GRID, CUHK Campus, and CUHK03, show that the proposed method improves the state-of-the-art rank-1 identification rates by 2.2%, 4.88%, 28.91%, and 31.55% on the four databases, respectively.
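XQDA's core computation can be sketched compactly: a generalized eigendecomposition of the extra-personal versus intra-personal difference covariances gives the subspace W, and a KISSME-style metric is formed inside it. The regularization constant and subspace dimension below are illustrative; the paper describes the practical computation and regularization in detail.

```python
import numpy as np
from scipy.linalg import eigh

def xqda(intra_diffs, extra_diffs, dim=50, reg=1e-3):
    """XQDA sketch.

    intra_diffs : (N, d) feature differences of same-person pairs
    extra_diffs : (M, d) feature differences of different-person pairs
    """
    d = intra_diffs.shape[1]
    cov_i = intra_diffs.T @ intra_diffs / len(intra_diffs) + reg * np.eye(d)
    cov_e = extra_diffs.T @ extra_diffs / len(extra_diffs)
    vals, vecs = eigh(cov_e, cov_i)                # maximize extra/intra ratio
    W = vecs[:, np.argsort(vals)[::-1][:dim]]      # (d, dim) subspace
    Si, Se = W.T @ cov_i @ W, W.T @ cov_e @ W
    M = np.linalg.inv(Si) - np.linalg.inv(Se)      # KISSME-style subspace metric
    return W, M

def xqda_distance(W, M, x, y):
    z = W.T @ (x - y)
    return float(z @ M @ z)
```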