
Showing papers by "Stan Z. Li" published in 2016


Proceedings ArticleDOI
27 Jun 2016
TL;DR: 3D Dense Face Alignment (3DDFA), in which a dense 3D face model is fitted to the image via a convolutional neural network (CNN), is proposed, along with a method to synthesize large-scale training samples in profile views to solve the third problem of data labelling.
Abstract: Face alignment, which fits a face model to an image and extracts the semantic meanings of facial pixels, has been an important topic in the CV community. However, most algorithms are designed for faces in small to medium poses (below 45°), lacking the ability to align faces in large poses up to 90°. The challenges are three-fold: Firstly, the commonly used landmark-based face model assumes that all the landmarks are visible and is therefore not suitable for profile views. Secondly, the face appearance varies more dramatically across large poses, ranging from frontal view to profile view. Thirdly, labelling landmarks in large poses is extremely challenging since the invisible landmarks have to be guessed. In this paper, we propose a solution to these three problems in a new alignment framework, called 3D Dense Face Alignment (3DDFA), in which a dense 3D face model is fitted to the image via a convolutional neural network (CNN). We also propose a method to synthesize large-scale training samples in profile views to solve the third problem of data labelling. Experiments on the challenging AFLW database show that our approach achieves significant improvements over state-of-the-art methods.
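
The abstract summarizes the framework but not the fitting loop. The sketch below illustrates the general cascaded CNN regression idea behind dense 3D model fitting, under heavy assumptions: `render_fn` (e.g. a PNCC-style rendering of the current parameters) and the per-stage networks with a Keras-like `predict` interface are hypothetical placeholders, not the authors' components.

    import numpy as np

    def fit_3dmm(image, init_params, cnn_stages, render_fn):
        """Cascaded regression of 3D model parameters (illustrative only).

        image      : face image, H x W x C array
        init_params: initial parameter vector (pose + shape + expression)
        cnn_stages : one trained CNN per cascade iteration (Keras-like .predict)
        render_fn  : renders an intermediate map (e.g. PNCC) from parameters
        """
        params = init_params
        for net in cnn_stages:
            # Render a pose-dependent feature map from the current estimate,
            # stack it with the image, and let this stage predict an update.
            feature = render_fn(params)
            net_input = np.concatenate([image, feature], axis=-1)[None]
            params = params + net.predict(net_input)[0]
        return params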

1,105 citations


Journal ArticleDOI
TL;DR: A new image feature called Normalized Pixel Difference (NPD) is proposed, computed as the difference-to-sum ratio between two pixel values and inspired by the Weber Fraction in experimental psychology; the feature is scale invariant, bounded, and able to reconstruct the original image.
Abstract: We propose a method to address challenges in unconstrained face detection, such as arbitrary pose variations and occlusions. First, a new image feature called Normalized Pixel Difference (NPD) is proposed. The NPD feature is computed as the difference-to-sum ratio between two pixel values, inspired by the Weber Fraction in experimental psychology. The new feature is scale invariant, bounded, and able to reconstruct the original image. Second, we propose a deep quadratic tree to learn the optimal subset of NPD features and their combinations, so that complex face manifolds can be partitioned by the learned rules. This way, only a single soft-cascade classifier is needed to handle unconstrained face detection. Furthermore, we show that the NPD features can be efficiently obtained from a look-up table, and the detection template can be easily scaled, making the proposed face detector very fast. Experimental results on three public face datasets (FDDB, GENKI, and CMU-MIT) show that the proposed method achieves state-of-the-art performance in detecting unconstrained faces with arbitrary pose variations and occlusions in cluttered scenes.
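
The feature is fully specified by the description above, so a compact sketch is possible: f(x, y) = (x - y) / (x + y) over two 8-bit pixel values, precomputed once into a 256 x 256 look-up table (the convention f(0, 0) = 0 is an assumption here).

    import numpy as np

    # NPD as described in the abstract: the difference-to-sum ratio of two
    # pixel values. Because 8-bit pixels take only 256 values, all 256 x 256
    # outcomes can be precomputed once, which is what makes evaluation fast.
    x = np.arange(256, dtype=np.float64)
    xx, yy = np.meshgrid(x, x, indexing="ij")
    with np.errstate(invalid="ignore"):
        NPD_LUT = np.where(xx + yy == 0, 0.0, (xx - yy) / (xx + yy))

    def npd(p1, p2):
        """NPD feature for two 8-bit pixel values via the precomputed table."""
        return NPD_LUT[p1, p2]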

288 citations


Posted Content
TL;DR: A novel moderate positive sample mining method is proposed to train a robust CNN for person re-identification, dealing with the problem of large variation, and the learning is improved by a metric weight constraint so that the learned metric has better generalization ability.
Abstract: Person re-identification is challenging due to the large variations of pose, illumination, occlusion and camera view. Owing to these variations, the pedestrian data is distributed as highly-curved manifolds in the feature space, despite the current convolutional neural networks' (CNNs') capability of feature extraction. However, the distribution is unknown, so it is difficult to use the geodesic distance when comparing two samples. In practice, the current deep embedding methods use the Euclidean distance for training and testing. On the other hand, the manifold learning methods suggest using the Euclidean distance in the local range, combined with the graphical relationship between samples, to approximate the geodesic distance. From this point of view, selecting suitable positive (i.e. intra-class) training samples within a local range is critical for training the CNN embedding, especially when the data has large intra-class variations. In this paper, we propose a novel moderate positive sample mining method to train a robust CNN for person re-identification, dealing with the problem of large variation. In addition, we improve the learning by a metric weight constraint, so that the learned metric has better generalization ability. Experiments show that these two strategies are effective in learning robust deep metrics for person re-identification, and accordingly our deep model significantly outperforms the state-of-the-art methods on several benchmarks of person re-identification. Therefore, the study presented in this paper may be useful in inspiring new designs of deep models for person re-identification.
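
As an illustration of the mining idea, the following hedged sketch follows one plausible reading of the abstract: among an anchor's positives, keep only those that lie closer than the nearest negative (the "local range"), then pick the hardest of the remainder; all names are illustrative, not the authors' code.

    import numpy as np

    def mine_moderate_positive(anchor_feat, pos_feats, neg_feats):
        """Pick a 'moderate' positive for an anchor (illustrative sketch)."""
        d_pos = np.linalg.norm(pos_feats - anchor_feat, axis=1)
        d_neg = np.linalg.norm(neg_feats - anchor_feat, axis=1)
        nearest_neg = d_neg.min()
        # Positives inside the local range bounded by the nearest negative.
        in_range = d_pos[d_pos < nearest_neg]
        if in_range.size == 0:
            # Fall back to the easiest positive if none lies inside the range.
            return pos_feats[np.argmin(d_pos)]
        # The moderate positive: hardest among the in-range positives.
        target = in_range.max()
        return pos_feats[np.where(d_pos == target)[0][0]]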

265 citations


Book ChapterDOI
08 Oct 2016
Abstract: Person re-identification is challenging due to the large variations of pose, illumination, occlusion and camera view. Owing to these variations, the pedestrian data is distributed as highly-curved manifolds in the feature space, despite the current convolutional neural networks' (CNNs') capability of feature extraction. However, the distribution is unknown, so it is difficult to use the geodesic distance when comparing two samples. In practice, the current deep embedding methods use the Euclidean distance for training and testing. On the other hand, the manifold learning methods suggest using the Euclidean distance in the local range, combined with the graphical relationship between samples, to approximate the geodesic distance. From this point of view, selecting suitable positive (i.e. intra-class) training samples within a local range is critical for training the CNN embedding, especially when the data has large intra-class variations. In this paper, we propose a novel moderate positive sample mining method to train a robust CNN for person re-identification, dealing with the problem of large variation. In addition, we improve the learning by a metric weight constraint, so that the learned metric has better generalization ability. Experiments show that these two strategies are effective in learning robust deep metrics for person re-identification, and accordingly our deep model significantly outperforms the state-of-the-art methods on several benchmarks of person re-identification. Therefore, the study presented in this paper may be useful in inspiring new designs of deep models for person re-identification.

232 citations


Proceedings ArticleDOI
01 Jun 2016
TL;DR: Two large multi-modal video datasets for RGB and RGB-D gesture recognition are presented, together with a baseline method based on the bag-of-visual-words model.
Abstract: In this paper, we present two large multi-modal video datasets for RGB and RGB-D gesture recognition: the ChaLearn LAP RGB-D Isolated Gesture Dataset (IsoGD) and the Continuous Gesture Dataset (ConGD). Both datasets are derived from the ChaLearn Gesture Dataset (CGD), which has a total of more than 50,000 gestures for the "one-shot-learning" competition. To increase the potential of the old dataset, we designed new, well-curated datasets composed of 249 gesture labels and including 47,933 gestures whose begin and end frames are manually labeled in the sequences. Using these datasets, we will open two competitions on the CodaLab platform so that researchers can test and compare their methods for "user independent" gesture recognition. The first challenge is designed for gesture spotting and recognition in continuous sequences of gestures, while the second one is designed for gesture classification from segmented data. A baseline method based on the bag-of-visual-words model is also presented.
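
For readers unfamiliar with the baseline, a generic bag-of-visual-words pipeline looks like the following sketch (standard components, not the organizers' exact implementation): local descriptors are clustered into a vocabulary, each video becomes a histogram of word assignments, and a linear classifier is trained on the histograms.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def build_vocabulary(all_descriptors, n_words=1000):
        """Cluster pooled local descriptors into a visual vocabulary."""
        return KMeans(n_clusters=n_words, n_init=4).fit(np.vstack(all_descriptors))

    def encode(descriptors, vocab):
        """Represent one video as a normalized histogram of visual words."""
        words = vocab.predict(descriptors)
        hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float64)
        return hist / max(hist.sum(), 1.0)

    def train_baseline(train_descriptors, labels, vocab):
        X = np.vstack([encode(d, vocab) for d in train_descriptors])
        return LinearSVC().fit(X, labels)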

219 citations


Journal ArticleDOI
TL;DR: A novel spatiotemporal feature extracted from RGB-D data, namely mixed features around sparse keypoints (MFSK), is proposed, which outperforms all currently published approaches on the challenging data of CGD, such as the translated, scaled and occluded subsets.
Abstract: The availability of handy RGB-D sensors has brought about a surge of gesture recognition research and applications. Among various approaches, the one-shot learning approach is advantageous because it requires a minimal amount of data. Here, we provide a thorough review of one-shot learning gesture recognition from RGB-D data and propose a novel spatiotemporal feature extracted from RGB-D data, namely mixed features around sparse keypoints (MFSK). In the review, we analyze the challenges that we are facing, and point out some future research directions which may enlighten researchers in this field. The proposed MFSK feature is robust and invariant to scale, rotation and partial occlusions. To alleviate the insufficiency of one-shot training samples, we augment the training samples by artificially synthesizing versions at various temporal scales, which is beneficial for coping with gestures performed at varying speed. We evaluate the proposed method on the ChaLearn gesture dataset (CGD). The results show that our approach outperforms all currently published approaches on the challenging data of CGD, such as the translated, scaled and occluded subsets. When applied to RGB-D datasets that are not one-shot (e.g., the Cornell Activity Dataset-60 and MSR Daily Activity 3D dataset), the proposed feature also produces very promising results under leave-one-out cross validation or one-shot learning.
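
The temporal-scale augmentation is concrete enough to sketch: from a single training gesture, versions at several speeds are synthesized by resampling frame indices. This is an illustrative sketch only; the scale set and the resampling scheme are assumptions.

    import numpy as np

    def temporal_rescale(frames, scale):
        """Resample a frame sequence to round(len(frames) * scale) frames."""
        n_out = max(1, int(round(len(frames) * scale)))
        idx = np.linspace(0, len(frames) - 1, n_out).round().astype(int)
        return [frames[i] for i in idx]

    def augment_one_shot(frames, scales=(0.5, 0.75, 1.0, 1.25, 1.5)):
        # One training gesture becomes several, each at a different speed.
        return [temporal_rescale(frames, s) for s in scales]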

104 citations


Proceedings ArticleDOI
27 Jun 2016
TL;DR: CRAFT (Cascade Region-proposal-network And FasT-rcnn) is proposed, tackling each task in the detection pipeline with a carefully designed network cascade and achieving state-of-the-art performance on object detection benchmarks.
Abstract: Object detection is a fundamental problem in image understanding. One popular solution is the R-CNN framework [15] and its fast versions [14, 27]. They decompose the object detection problem into two cascaded easier tasks: 1) generating object proposals from images, 2) classifying proposals into various object categories. Although the two tasks are relatively easier, they are not solved perfectly and there is still room for improvement. In this paper, we push the "divide and conquer" solution even further by dividing each task into two sub-tasks. We call the proposed method "CRAFT" (Cascade Region-proposal-network And FasT-rcnn), which tackles each task with a carefully designed network cascade. We show that the cascade structure helps in both tasks: in proposal generation, it provides more compact and better localized object proposals; in object classification, it reduces false positives (mainly between ambiguous categories) by capturing both inter- and intra-category variances. CRAFT achieves consistent and considerable improvement over the state-of-the-art on object detection benchmarks like PASCAL VOC 07/12 and ILSVRC.

87 citations


Proceedings Article
12 Feb 2016
TL;DR: Results on the challenging datasets of face verification and person re-identification show that the proposed novel similarity measure outperforms the state-of-the-art methods.
Abstract: In this paper, we propose a novel similarity measure and then introduce an efficient strategy to learn it by using only similar pairs for person verification. Unlike existing metric learning methods, we consider both the difference and the commonness of an image pair to increase its discriminativeness. Under a pair-constrained Gaussian assumption, we show how to obtain the Gaussian priors (i.e., the corresponding covariance matrices) of dissimilar pairs from those of similar pairs. The application of a log likelihood ratio makes the learning process simple and fast and thus scalable to large datasets. Additionally, our method is able to handle heterogeneous data well. Results on the challenging datasets of face verification (LFW and PubFig) and person re-identification (VIPeR) show that our algorithm outperforms the state-of-the-art methods.
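
Below is a hedged sketch of what such a log-likelihood-ratio similarity can look like under a pair-constrained Gaussian model, scoring both the difference and the commonness of a pair. The covariance matrices are taken as inputs here, whereas the paper derives the dissimilar-pair ones from the similar-pair ones.

    import numpy as np

    def llr_similarity(x1, x2, cov_diff_S, cov_diff_D, cov_sum_S, cov_sum_D):
        """log p(pair | similar) - log p(pair | dissimilar), zero-mean Gaussians."""
        e = x1 - x2              # difference of the pair
        m = x1 + x2              # commonness of the pair
        def quad(v, cov):
            return v @ np.linalg.solve(cov, v)
        # Quadratic terms of the two Gaussian log-densities...
        s = (quad(e, cov_diff_D) - quad(e, cov_diff_S)
             + quad(m, cov_sum_D) - quad(m, cov_sum_S))
        # ...plus the log-determinant terms from the normalizers.
        s += (np.linalg.slogdet(cov_diff_D)[1] - np.linalg.slogdet(cov_diff_S)[1]
              + np.linalg.slogdet(cov_sum_D)[1] - np.linalg.slogdet(cov_sum_S)[1])
        return 0.5 * s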

65 citations


Journal ArticleDOI
TL;DR: Local order constrained IGOs are exploited to generate robust features that enhance the local textures and the order-based coding ability, thus further discovering the intrinsic structure of facial images.
Abstract: Robust descriptor-based subspace learning with complex data is an active topic in pattern analysis and machine intelligence. A few studies concentrate the optimal design on feature representation and metric learning. However, traditionally used single-type features, e.g., image gradient orientations (IGOs), are insufficient to characterize the complete variations in robust and discriminant subspace learning. Meanwhile, discontinuities in edge alignment and feature matching have not been carefully treated in the literature. In this paper, local order constrained IGOs are exploited to generate robust features. As the difference-based filters explicitly consider the local contrasts within neighboring pixel points, the proposed features enhance the local textures and the order-based coding ability, thus further discovering the intrinsic structure of facial images. The multimodal features are automatically fused in the most discriminant subspace. The utilization of an adaptive interaction function suppresses outliers in each dimension for robust similarity measurement and discriminant analysis. The sparsity-driven regression model is modified to adapt to the classification issue of the compact feature representation. Extensive experiments are conducted on several benchmark face data sets, from both controlled and uncontrolled environments, to evaluate our new algorithm.
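
As background, the basic IGO representation the abstract builds on can be sketched as follows: per-pixel gradient orientations mapped onto the unit circle, so that Euclidean comparisons respect angular differences. The local order constraints of the paper are not reproduced here.

    import numpy as np

    def igo_features(image):
        """Per-pixel image gradient orientations on the unit circle."""
        gy, gx = np.gradient(image.astype(np.float64))
        phi = np.arctan2(gy, gx)                   # orientation at each pixel
        # Stacking cos/sin makes Euclidean distance respect angular differences
        # and gives robustness to illumination changes.
        return np.stack([np.cos(phi), np.sin(phi)], axis=-1)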

58 citations


Journal ArticleDOI
TL;DR: This work proposes an algorithm that formulates the multi-object tracking task as one of exploiting hierarchical dense structures on an undirected hypergraph constructed from tracklet affinity, and demonstrates that the proposed algorithm performs favorably against the state-of-the-art methods.
Abstract: Most multi-object tracking algorithms are developed within the tracking-by-detection framework, which considers the pairwise appearance similarities between detection responses or tracklets within a limited temporal window and is thus less effective in handling long-term occlusions or distinguishing spatially close targets with similar appearance in crowded scenes. In this work, we propose an algorithm that formulates the multi-object tracking task as one of exploiting hierarchical dense structures on an undirected hypergraph constructed from tracklet affinity. The dense structures indicate a group of vertices that are inter-connected by a set of hyperedges with high affinity values. The appearance and motion similarities among multiple tracklets across the spatio-temporal domain are considered globally by exploiting high-order similarities rather than pairwise ones, thereby making it easier to distinguish spatially close targets with similar appearance. In addition, the hierarchical design of the optimization process helps the proposed tracking algorithm handle long-term occlusions robustly. Extensive experiments on various challenging datasets of both multi-pedestrian and multi-face tracking tasks demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.

53 citations


Posted Content
TL;DR: The proposed method, "CRAFT" (Cascade Region-proposal-network And FasT-rcnn), tackles each task with a carefully designed network cascade, and the cascade structure is shown to help in both tasks.
Abstract: Object detection is a fundamental problem in image understanding. One popular solution is the R-CNN framework and its fast versions. They decompose the object detection problem into two cascaded easier tasks: 1) generating object proposals from images, 2) classifying proposals into various object categories. Although the two tasks are relatively easier, they are not solved perfectly and there is still room for improvement. In this paper, we push the "divide and conquer" solution even further by dividing each task into two sub-tasks. We call the proposed method "CRAFT" (Cascade Region-proposal-network And FasT-rcnn), which tackles each task with a carefully designed network cascade. We show that the cascade structure helps in both tasks: in proposal generation, it provides more compact and better localized object proposals; in object classification, it reduces false positives (mainly between ambiguous categories) by capturing both inter- and intra-category variances. CRAFT achieves consistent and considerable improvement over the state-of-the-art on object detection benchmarks like PASCAL VOC 07/12 and ILSVRC.

Journal ArticleDOI
TL;DR: This paper extends the original shallow face descriptors to deep discriminant face features by introducing a stacked image descriptor (SID); with a deep structure, more complex facial information can be extracted and the discriminative power and compactness of the feature representation can be improved.
Abstract: Learning-based face descriptors have constantly improved the face recognition performance. Compared with hand-crafted features, learning-based features are considered able to exploit information with better discriminative ability for specific tasks. Motivated by the recent success of deep learning, in this paper, we extend the original shallow face descriptors to deep discriminant face features by introducing a stacked image descriptor (SID). With a deep structure, more complex facial information can be extracted and the discriminative power and compactness of the feature representation can be improved. The SID is learned in a forward optimization way, which is computationally efficient compared with deep learning. Extensive experiments on various face databases are conducted to show that SID is able to achieve high face recognition performance with a compact face representation, compared with other state-of-the-art descriptors.

Posted Content
TL;DR: A convolutional two-stream consensus voting network (2SCVN) is proposed which explicitly models both the short-term and long-term structure of the RGB sequences and significantly improves the recognition accuracy.
Abstract: Recently, the popularity of depth sensors such as Kinect has made depth videos easily available, while their advantages have not been fully exploited. This paper investigates, for gesture recognition, how to exploit the spatial and temporal information complementarily embedded in RGB and depth sequences. We propose a convolutional two-stream consensus voting network (2SCVN) which explicitly models both the short-term and long-term structure of the RGB sequences. To alleviate distractions from background, a 3D depth-saliency ConvNet stream (3DDSN) is aggregated in parallel to identify subtle motion characteristics. These two components in a unified framework significantly improve the recognition accuracy. On the challenging ChaLearn IsoGD benchmark, our proposed method outperforms the first place on the leader-board by a large margin (10.29%) while also achieving the best result on the RGBD-HuDaAct dataset (96.74%). Both quantitative experiments and qualitative analysis show the effectiveness of our proposed framework, and code will be released to facilitate future research.

Proceedings Article
Yang Yang, Zhen Lei, Shifeng Zhang, Hailin Shi, Stan Z. Li
12 Feb 2016
TL;DR: A metric-embedded discriminative vocabulary learning method is proposed for high-level person representation, with application to person re-identification; a new and effective term is introduced which aims at making the same persons closer while pushing different ones farther apart in the metric space.
Abstract: A variety of encoding methods for the bag-of-words (BoW) model have been proposed to encode the local features in image classification. However, most of them are unsupervised and just employ k-means to form the visual vocabulary, thus reducing the discriminative power of the features. In this paper, we propose a metric-embedded discriminative vocabulary learning method for high-level person representation with application to person re-identification. A new and effective term is introduced which aims at making the same persons closer while pushing different ones farther apart in the metric space. With the learned vocabulary, we utilize a linear coding method to encode the image-level features (or holistic image features) for extracting a high-level person representation. Different from traditional unsupervised approaches, our method can explore the relationship (same or not) among the persons. Since there is an analytic solution to the linear coding, it is easy to obtain the final high-level features. The experimental results on person re-identification demonstrate the effectiveness of our proposed algorithm.

Journal ArticleDOI
TL;DR: Extensive experiments show that the proposed JVS method is better than the traditional single-camera video synopsis method in preserving the chronological orders of moving objects among multicamera synopsis videos.
Abstract: Due to an increasing demand for video surveillance, there is an explosive growth of surveillance videos, which poses a big challenge for video storage, browsing, and retrieval. The video synopsis technique has thus been developed to extract and rearrange the moving objects so as to handle the massive video browsing challenge. However, the traditional video synopsis (TVS) method only considers processing videos captured by a single camera, ignoring object interactions in multicamera videos. To address this issue, we propose a novel multicamera joint video synopsis (JVS) algorithm for multicamera surveillance videos. First, a key time stamp (KTS) selection method is designed to find an object's appearing, merging, splitting, and disappearing moments in that object's frame sequence, called a tube. Second, tubes are rearranged by minimizing a global energy function that involves the overall camera views. Compared with the energy function used in TVS, the proposed global energy function considers the chronological orders of tubes not only in the same camera view but also among different camera views. Moreover, the chronological disorder cost term is formulated based on the KTS labels, and improved by considering the visual similarity between two tubes. Finally, the multicamera synopsis videos are separately generated by stitching together the globally rearranged tubes and background images of the same camera view. Extensive experiments show that the proposed JVS method is better than the traditional single-camera video synopsis method in preserving the chronological orders of moving objects among multicamera synopsis videos.

Book ChapterDOI
20 Nov 2016
TL;DR: A novel approach for age estimation based on a single convolutional neural network (CNN) is proposed, which models the randomness of aging with a Gaussian distribution and uses a soft softmax regression function in the network.
Abstract: In this paper, we propose a novel approach based on a single convolutional neural network (CNN) for age estimation. In our proposed network architecture, we first model the randomness of aging with a Gaussian distribution, which is used to calculate the Gaussian integral of an age interval. Then, we present a soft softmax regression function used in the network. The new function applies the aging model to compute the loss. Compared with the traditional softmax function, the new function considers not only the chronological age but also the interval near the true age. Moreover, owing to the complexity of the Gaussian integral in the soft softmax function, a look-up table is built to accelerate this process. All the integrals of age values are calculated offline in advance. We evaluate our method on two public datasets, MORPH II and the Cross-Age Celebrity Dataset (CACD), and experimental results show that the proposed method achieves superior performance compared to the state of the art.
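
The soft-label construction implied by the abstract can be sketched directly: the target for each age bin is the Gaussian integral over that bin, centered at the true age, and the integrals can be precomputed offline into a look-up table. In this sketch, sigma and the unit bin width are assumptions.

    import numpy as np
    from scipy.stats import norm

    AGES = np.arange(0, 101)

    def soft_label(true_age, sigma=2.0):
        """Integral of N(true_age, sigma^2) over [a - 0.5, a + 0.5] per age a."""
        upper = norm.cdf(AGES + 0.5, loc=true_age, scale=sigma)
        lower = norm.cdf(AGES - 0.5, loc=true_age, scale=sigma)
        p = upper - lower
        return p / p.sum()

    # As in the paper, the integrals can be precomputed offline for every
    # possible true age, giving a simple look-up table:
    SOFT_LABEL_LUT = np.stack([soft_label(a) for a in AGES])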

Book ChapterDOI
14 Oct 2016
TL;DR: This paper proposes a new method for facial expression recognition, called multi-scale CNNs, which consists of several sub-CNNs with different scales of input images and can classify facial expressions more accurately than any single-scale sub-CNN.
Abstract: This paper proposes a new method for facial expression recognition, called multi-scale CNNs. It consists of several sub-CNNs with different scales of input images. The sub-CNNs benefit from the variously scaled input images to learn optimized parameters. After training all these sub-CNNs separately, we can predict the facial expression of an image by extracting its features from the last fully connected layer of the sub-CNNs at different scales and mapping the averaged features to the final classification probability. Multi-scale CNNs can classify facial expressions more accurately than any single-scale sub-CNN. On the Facial Expression Recognition 2013 database, multi-scale CNNs achieved an accuracy of 71.80% on the testing set, which is comparable to other state-of-the-art methods.
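
The prediction scheme described above reduces to a few lines. In this sketch, `sub_cnns` and `classifier` are hypothetical stand-ins for the trained components, with assumed `preprocess`, `extract_fc_features`, and `predict_proba` interfaces.

    import numpy as np

    def predict_expression(image, sub_cnns, scales, classifier):
        """Average last-FC features across scale-specific sub-CNNs, then classify."""
        feats = []
        for cnn, s in zip(sub_cnns, scales):
            resized = cnn.preprocess(image, size=s)   # resize to this sub-CNN's input
            feats.append(cnn.extract_fc_features(resized))
        avg_feat = np.mean(feats, axis=0)             # average across scales
        return classifier.predict_proba([avg_feat])[0]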

Book ChapterDOI
14 Oct 2016
TL;DR: A specialized benchmark study that focuses purely on face classification, showing that, without the help of post-processing, the performance of face classification itself is still not very satisfactory, even with a powerful CNN method.
Abstract: Face detection generally involves three steps: block generation, face classification, and post-processing. However, firstly, face detection performance is largely influenced by block generation and post-processing, concealing the performance of the core face classification module. Secondly, implementing and optimizing all three steps entails a heavy workload, which is a big barrier for researchers who only care about classification. Motivated by this, we conduct a specialized benchmark study in this paper, which focuses purely on face classification. We start with face proposals, and build a benchmark dataset with about 3.5 million patches for two-class face/non-face classification. Results with several baseline algorithms show that, without the help of post-processing, the performance of face classification itself is still not very satisfactory, even with a powerful CNN method. We will release this benchmark to help assess the performance of face classification alone, and to ease the participation of other interested researchers.

Book ChapterDOI
Ting Liu, Jun Wan, Tingzhao Yu, Zhen Lei, Stan Z. Li
14 Oct 2016
TL;DR: The proposed MRCNN has two principal advantages: its 8 sub-networks are able to learn the unique age characteristics of their corresponding subregions, and they are packaged together to complement age-related information.
Abstract: As one of the most important biological features, age has tremendous application potential in various areas such as surveillance, human-computer interaction and video detection. In this paper, a new convolutional neural network, namely MRCNN (Multi-Region Convolutional Neural Network), is proposed based on multiple face subregions. It joins multiple face subregions together to estimate age. Each targeted region is analyzed to explore its contribution to age estimation. According to the geometrical properties of the face, we select 8 subregions, construct 8 sub-network structures respectively, and then fuse them at the feature level. The proposed MRCNN has two principal advantages: the 8 sub-networks are able to learn the unique age characteristics of their corresponding subregions, and they are packaged together to complement age-related information. Further, we analyze the estimation accuracy on all age groups. Experiments on MORPH illustrate the superior performance of the proposed MRCNN.

Proceedings ArticleDOI
Hailin Shi, Xiangyu Zhu, Zhen Lei, Shengcai Liao, Stan Z. Li
01 Jun 2016
TL;DR: In this paper, a class-encoder is proposed to minimize the intra-class variations in the feature space, and to learn a good discriminative manifold on a class scale.
Abstract: Deep neural networks usually benefit from unsupervised pre-training, e.g. auto-encoders. However, the classifier further needs supervised fine-tuning for good discrimination. Besides, due to the limits of full connection, the application of auto-encoders is usually limited to small, well aligned images. In this paper, we incorporate the supervised information to propose a novel formulation, namely class-encoder, whose training objective is to reconstruct a sample from another one with an identical label. The class-encoder aims to minimize the intra-class variations in the feature space, and to learn a good discriminative manifold on a class scale. We impose the class-encoder as a constraint on the softmax for better supervised training, and extend the reconstruction to the feature level to tackle the parameter size issue and the translation issue. The experiments show that the class-encoder helps to improve the performance on benchmarks of classification and face recognition. This could also be a promising direction for fast training of face recognition models.
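
The training objective is concrete enough for a hedged sketch: reconstruct one sample from another sample of the same class, and combine that reconstruction constraint with a standard softmax loss. PyTorch is used here; `encoder`, `decoder`, `classifier`, and the weight `lam` are illustrative, not the authors' exact formulation.

    import torch.nn.functional as F

    def class_encoder_loss(encoder, decoder, classifier, x_a, x_b, labels, lam=0.1):
        """x_a, x_b: two batches with identical labels, paired row-by-row."""
        z = encoder(x_a)
        # Reconstruct x_b (same class as x_a), pulling intra-class samples
        # toward a shared representation.
        recon_loss = F.mse_loss(decoder(z), x_b)
        # Standard supervised softmax loss on the encoded features.
        cls_loss = F.cross_entropy(classifier(z), labels)
        return cls_loss + lam * recon_loss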

Journal Article
Jie Yang, Y. Ma, Xu Zhang, Stan Z. Li, Y. Zhang 
TL;DR: An algorithm for selecting initial cluster centers on the basis of a minimum spanning tree (MST) is presented as an initialization method for the k-means algorithm, and the corresponding time complexity is analyzed.
Abstract: The traditional k-means algorithm has been widely used as a simple and efficient clustering method. However, the algorithm often converges to local minima because it is sensitive to the initial cluster centers. In this paper, an algorithm for selecting initial cluster centers on the basis of a minimum spanning tree (MST) is presented. The sets of vertices in the MST with the same degree are regarded as a whole, which is used to find the skeleton data points. Furthermore, a distance measure between skeleton data points that takes both degree and Euclidean distance into account is presented. Finally, an MST-based initialization method for the k-means algorithm is presented, and the corresponding time complexity is analyzed as well. The presented algorithm is tested on five data sets from the UCI Machine Learning Repository. The experimental results illustrate the effectiveness of the presented algorithm compared to three existing initialization methods.
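
A hedged sketch of MST-based seeding in this spirit: build the MST of the data, use vertex degree as the skeleton cue, and greedily pick k seeds among high-degree points that are mutually far apart. The greedy median-distance rule below simplifies the paper's combined degree/Euclidean measure.

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree
    from scipy.spatial.distance import pdist, squareform

    def mst_kmeans_init(X, k):
        """Select k initial centers for k-means from the MST of X (sketch)."""
        dist = squareform(pdist(X))
        mst = minimum_spanning_tree(dist)
        adj = mst + mst.T                       # make the tree undirected
        degree = np.asarray((adj > 0).sum(axis=1)).ravel()
        # Rank points by degree (denser regions first), then greedily keep
        # seeds that are far from every seed already chosen.
        order = np.argsort(-degree)
        seeds = [order[0]]
        for i in order[1:]:
            if len(seeds) == k:
                break
            if all(dist[i, s] > np.median(dist) for s in seeds):
                seeds.append(i)
        # Fall back to the highest-degree remaining points if too few seeds.
        for i in order:
            if len(seeds) == k:
                break
            if i not in seeds:
                seeds.append(i)
        return X[np.array(seeds)]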

Book ChapterDOI
20 Nov 2016
TL;DR: A novel face detection method called Aggregating Visible Components (AVC) is proposed, which addresses pose variations and occlusions simultaneously in a single framework with low complexity.
Abstract: Pose variations and occlusions are two major challenges for unconstrained face detection. Many approaches have been proposed to handle pose variations and occlusions in face detection; however, few of them address the two challenges explicitly and simultaneously in one model. In this paper, we propose a novel face detection method called Aggregating Visible Components (AVC), which addresses pose variations and occlusions simultaneously in a single framework with low complexity. The main contributions of this paper are: (1) by aggregating visible components, which have inherent advantages under occlusion, the proposed method achieves state-of-the-art performance using only hand-crafted features; (2) mapped from the mean shape through a component-invariant mapping, the proposed component detector is more robust to pose variations; (3) a local-to-global aggregation strategy that involves region competition helps alleviate false alarms while enhancing localization accuracy.

Posted Content
Hailin Shi, Xiangyu Zhu, Zhen Lei, Shengcai Liao, Stan Z. Li
TL;DR: The supervised information is incorporated to propose a novel formulation, namely class-encoder, whose training objective is to reconstruct a sample from another one with an identical label; it improves performance on benchmarks of classification and face recognition.
Abstract: Deep neural networks usually benefit from unsupervised pre-training, e.g. auto-encoders. However, the classifier further needs supervised fine-tuning for good discrimination. Besides, due to the limits of full connection, the application of auto-encoders is usually limited to small, well aligned images. In this paper, we incorporate the supervised information to propose a novel formulation, namely class-encoder, whose training objective is to reconstruct a sample from another one with an identical label. The class-encoder aims to minimize the intra-class variations in the feature space, and to learn a good discriminative manifold on a class scale. We impose the class-encoder as a constraint on the softmax for better supervised training, and extend the reconstruction to the feature level to tackle the parameter size issue and the translation issue. The experiments show that the class-encoder helps to improve the performance on benchmarks of classification and face recognition. This could also be a promising direction for fast training of face recognition models.