
Showing papers by "Stan Z. Li" published in 2016


Proceedings ArticleDOI
27 Jun 2016
TL;DR: 3D Dense Face Alignment (3DDFA), in which a dense 3D face model is fitted to the image via a convolutional neural network (CNN), is proposed, along with a method to synthesize large-scale training samples in profile views to solve the third problem of data labelling.
Abstract: Face alignment, which fits a face model to an image and extracts the semantic meanings of facial pixels, has been an important topic in the CV community. However, most algorithms are designed for faces in small to medium poses (below 45°), lacking the ability to align faces in large poses up to 90°. The challenges are three-fold: Firstly, the commonly used landmark-based face model assumes that all the landmarks are visible and is therefore not suitable for profile views. Secondly, the face appearance varies more dramatically across large poses, ranging from frontal view to profile view. Thirdly, labelling landmarks in large poses is extremely challenging since the invisible landmarks have to be guessed. In this paper, we propose a solution to these three problems in a new alignment framework, called 3D Dense Face Alignment (3DDFA), in which a dense 3D face model is fitted to the image via a convolutional neural network (CNN). We also propose a method to synthesize large-scale training samples in profile views to solve the third problem of data labelling. Experiments on the challenging AFLW database show that our approach achieves significant improvements over state-of-the-art methods.
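
The abstract summarizes the framework but not the fitting loop. The sketch below illustrates the general cascaded CNN regression idea behind dense 3D model fitting, under heavy assumptions: `render_fn` (e.g. a PNCC-style rendering of the current parameters) and the per-stage networks with a Keras-like `predict` interface are hypothetical placeholders, not the authors' components.

    import numpy as np

    def fit_3dmm(image, init_params, cnn_stages, render_fn):
        """Cascaded regression of 3D model parameters (illustrative only).

        image      : face image, H x W x C array
        init_params: initial parameter vector (pose + shape + expression)
        cnn_stages : one trained CNN per cascade iteration (Keras-like .predict)
        render_fn  : renders an intermediate map (e.g. PNCC) from parameters
        """
        params = init_params
        for net in cnn_stages:
            # Render a pose-dependent feature map from the current estimate,
            # stack it with the image, and let this stage predict an update.
            feature = render_fn(params)
            net_input = np.concatenate([image, feature], axis=-1)[None]
            params = params + net.predict(net_input)[0]
        return params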

1,105 citations


Journal ArticleDOI
TL;DR: A new image feature called Normalized Pixel Difference (NPD) is proposed, computed as the difference-to-sum ratio between two pixel values and inspired by the Weber Fraction in experimental psychology; the feature is scale invariant, bounded, and able to reconstruct the original image.
Abstract: We propose a method to address challenges in unconstrained face detection, such as arbitrary pose variations and occlusions. First, a new image feature called Normalized Pixel Difference (NPD) is proposed. The NPD feature is computed as the difference-to-sum ratio between two pixel values, inspired by the Weber Fraction in experimental psychology. The new feature is scale invariant, bounded, and able to reconstruct the original image. Second, we propose a deep quadratic tree to learn the optimal subset of NPD features and their combinations, so that complex face manifolds can be partitioned by the learned rules. This way, only a single soft-cascade classifier is needed to handle unconstrained face detection. Furthermore, we show that the NPD features can be efficiently obtained from a look-up table, and the detection template can be easily scaled, making the proposed face detector very fast. Experimental results on three public face datasets (FDDB, GENKI, and CMU-MIT) show that the proposed method achieves state-of-the-art performance in detecting unconstrained faces with arbitrary pose variations and occlusions in cluttered scenes.
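
The feature is fully specified by the description above, so a compact sketch is possible: f(x, y) = (x - y) / (x + y) over two 8-bit pixel values, precomputed once into a 256 x 256 look-up table (the convention f(0, 0) = 0 is an assumption here).

    import numpy as np

    # NPD as described in the abstract: the difference-to-sum ratio of two
    # pixel values. Because 8-bit pixels take only 256 values, all 256 x 256
    # outcomes can be precomputed once, which is what makes evaluation fast.
    x = np.arange(256, dtype=np.float64)
    xx, yy = np.meshgrid(x, x, indexing="ij")
    with np.errstate(invalid="ignore"):
        NPD_LUT = np.where(xx + yy == 0, 0.0, (xx - yy) / (xx + yy))

    def npd(p1, p2):
        """NPD feature for two 8-bit pixel values via the precomputed table."""
        return NPD_LUT[p1, p2]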

288 citations


Posted Content
TL;DR: A novel moderate positive sample mining method is proposed to train a robust CNN for person re-identification, dealing with the problem of large variation, and the learning is improved by a metric weight constraint so that the learned metric has better generalization ability.
Abstract: Person re-identification is challenging due to the large variations of pose, illumination, occlusion and camera view. Owing to these variations, the pedestrian data is distributed as highly-curved manifolds in the feature space, despite the current convolutional neural networks' (CNNs') capability of feature extraction. However, the distribution is unknown, so it is difficult to use the geodesic distance when comparing two samples. In practice, the current deep embedding methods use the Euclidean distance for training and testing. On the other hand, the manifold learning methods suggest using the Euclidean distance in the local range, combined with the graphical relationship between samples, to approximate the geodesic distance. From this point of view, selecting suitable positive (i.e. intra-class) training samples within a local range is critical for training the CNN embedding, especially when the data has large intra-class variations. In this paper, we propose a novel moderate positive sample mining method to train a robust CNN for person re-identification, dealing with the problem of large variation. In addition, we improve the learning by a metric weight constraint, so that the learned metric has better generalization ability. Experiments show that these two strategies are effective in learning robust deep metrics for person re-identification, and accordingly our deep model significantly outperforms the state-of-the-art methods on several benchmarks of person re-identification. Therefore, the study presented in this paper may be useful in inspiring new designs of deep models for person re-identification.
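
As an illustration of the mining idea, the following hedged sketch follows one plausible reading of the abstract: among an anchor's positives, keep only those that lie closer than the nearest negative (the "local range"), then pick the hardest of the remainder; all names are illustrative, not the authors' code.

    import numpy as np

    def mine_moderate_positive(anchor_feat, pos_feats, neg_feats):
        """Pick a 'moderate' positive for an anchor (illustrative sketch)."""
        d_pos = np.linalg.norm(pos_feats - anchor_feat, axis=1)
        d_neg = np.linalg.norm(neg_feats - anchor_feat, axis=1)
        nearest_neg = d_neg.min()
        # Positives inside the local range bounded by the nearest negative.
        in_range = d_pos[d_pos < nearest_neg]
        if in_range.size == 0:
            # Fall back to the easiest positive if none lies inside the range.
            return pos_feats[np.argmin(d_pos)]
        # The moderate positive: hardest among the in-range positives.
        target = in_range.max()
        return pos_feats[np.where(d_pos == target)[0][0]]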

265 citations


Book ChapterDOI
08 Oct 2016
Abstract: Person re-identification is challenging due to the large variations of pose, illumination, occlusion and camera view. Owing to these variations, the pedestrian data is distributed as highly-curved manifolds in the feature space, despite the current convolutional neural networks' (CNNs') capability of feature extraction. However, the distribution is unknown, so it is difficult to use the geodesic distance when comparing two samples. In practice, the current deep embedding methods use the Euclidean distance for training and testing. On the other hand, the manifold learning methods suggest using the Euclidean distance in the local range, combined with the graphical relationship between samples, to approximate the geodesic distance. From this point of view, selecting suitable positive (i.e. intra-class) training samples within a local range is critical for training the CNN embedding, especially when the data has large intra-class variations. In this paper, we propose a novel moderate positive sample mining method to train a robust CNN for person re-identification, dealing with the problem of large variation. In addition, we improve the learning by a metric weight constraint, so that the learned metric has better generalization ability. Experiments show that these two strategies are effective in learning robust deep metrics for person re-identification, and accordingly our deep model significantly outperforms the state-of-the-art methods on several benchmarks of person re-identification. Therefore, the study presented in this paper may be useful in inspiring new designs of deep models for person re-identification.

232 citations


Proceedings ArticleDOI
01 Jun 2016
TL;DR: Two large multi-modal video datasets for RGB and RGB-D gesture recognition are presented, together with a baseline method based on the bag-of-visual-words model.
Abstract: In this paper, we present two large multi-modal video datasets for RGB and RGB-D gesture recognition: the ChaLearn LAP RGB-D Isolated Gesture Dataset (IsoGD) and the Continuous Gesture Dataset (ConGD). Both datasets are derived from the ChaLearn Gesture Dataset (CGD), which has a total of more than 50,000 gestures for the "one-shot-learning" competition. To increase the potential of the old dataset, we designed new, well-curated datasets composed of 249 gesture labels and including 47,933 gestures whose begin and end frames are manually labeled in the sequences. Using these datasets, we will open two competitions on the CodaLab platform so that researchers can test and compare their methods for "user independent" gesture recognition. The first challenge is designed for gesture spotting and recognition in continuous sequences of gestures, while the second one is designed for gesture classification from segmented data. A baseline method based on the bag-of-visual-words model is also presented.
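
For readers unfamiliar with the baseline, a generic bag-of-visual-words pipeline looks like the following sketch (standard components, not the organizers' exact implementation): local descriptors are clustered into a vocabulary, each video becomes a histogram of word assignments, and a linear classifier is trained on the histograms.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def build_vocabulary(all_descriptors, n_words=1000):
        """Cluster pooled local descriptors into a visual vocabulary."""
        return KMeans(n_clusters=n_words, n_init=4).fit(np.vstack(all_descriptors))

    def encode(descriptors, vocab):
        """Represent one video as a normalized histogram of visual words."""
        words = vocab.predict(descriptors)
        hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float64)
        return hist / max(hist.sum(), 1.0)

    def train_baseline(train_descriptors, labels, vocab):
        X = np.vstack([encode(d, vocab) for d in train_descriptors])
        return LinearSVC().fit(X, labels)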

219 citations


Journal ArticleDOI
TL;DR: A novel spatiotemporal feature extracted from RGB-D data, namely mixed features around sparse keypoints (MFSK), is proposed, which outperforms all currently published approaches on the challenging data of CGD, such as the translated, scaled and occluded subsets.
Abstract: The availability of handy RGB-D sensors has brought about a surge of gesture recognition research and applications. Among various approaches, the one-shot learning approach is advantageous because it requires a minimal amount of data. Here, we provide a thorough review of one-shot learning gesture recognition from RGB-D data and propose a novel spatiotemporal feature extracted from RGB-D data, namely mixed features around sparse keypoints (MFSK). In the review, we analyze the challenges that we are facing, and point out some future research directions which may enlighten researchers in this field. The proposed MFSK feature is robust and invariant to scale, rotation and partial occlusions. To alleviate the insufficiency of one-shot training samples, we augment the training samples by artificially synthesizing versions at various temporal scales, which is beneficial for coping with gestures performed at varying speed. We evaluate the proposed method on the ChaLearn gesture dataset (CGD). The results show that our approach outperforms all currently published approaches on the challenging data of CGD, such as the translated, scaled and occluded subsets. When applied to RGB-D datasets that are not one-shot (e.g., the Cornell Activity Dataset-60 and MSR Daily Activity 3D dataset), the proposed feature also produces very promising results under leave-one-out cross validation or one-shot learning.
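
The temporal-scale augmentation is concrete enough to sketch: from a single training gesture, versions at several speeds are synthesized by resampling frame indices. This is an illustrative sketch only; the scale set and the resampling scheme are assumptions.

    import numpy as np

    def temporal_rescale(frames, scale):
        """Resample a frame sequence to round(len(frames) * scale) frames."""
        n_out = max(1, int(round(len(frames) * scale)))
        idx = np.linspace(0, len(frames) - 1, n_out).round().astype(int)
        return [frames[i] for i in idx]

    def augment_one_shot(frames, scales=(0.5, 0.75, 1.0, 1.25, 1.5)):
        # One training gesture becomes several, each at a different speed.
        return [temporal_rescale(frames, s) for s in scales]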

104 citations


Proceedings ArticleDOI
27 Jun 2016
TL;DR: CRAFT (Cascade Region-proposal-network And FasT-rcnn) is proposed, tackling each task in the detection pipeline with a carefully designed network cascade and achieving state-of-the-art performance on object detection benchmarks.
Abstract: Object detection is a fundamental problem in image understanding. One popular solution is the R-CNN framework [15] and its fast versions [14, 27]. They decompose the object detection problem into two cascaded easier tasks: 1) generating object proposals from images, 2) classifying proposals into various object categories. Although the two tasks are relatively easier, they are not solved perfectly and there is still room for improvement. In this paper, we push the "divide and conquer" solution even further by dividing each task into two sub-tasks. We call the proposed method "CRAFT" (Cascade Region-proposal-network And FasT-rcnn), which tackles each task with a carefully designed network cascade. We show that the cascade structure helps in both tasks: in proposal generation, it provides more compact and better localized object proposals; in object classification, it reduces false positives (mainly between ambiguous categories) by capturing both inter- and intra-category variances. CRAFT achieves consistent and considerable improvement over the state-of-the-art on object detection benchmarks like PASCAL VOC 07/12 and ILSVRC.

87 citations


Proceedings Article
12 Feb 2016
TL;DR: Results on the challenging datasets of face verification and person re-identification show that the proposed novel similarity measure outperforms the state-of-the-art methods.
Abstract: In this paper, we propose a novel similarity measure and then introduce an efficient strategy to learn it by using only similar pairs for person verification. Unlike existing metric learning methods, we consider both the difference and the commonness of an image pair to increase its discriminativeness. Under a pair-constrained Gaussian assumption, we show how to obtain the Gaussian priors (i.e., the corresponding covariance matrices) of dissimilar pairs from those of similar pairs. The application of a log likelihood ratio makes the learning process simple and fast and thus scalable to large datasets. Additionally, our method is able to handle heterogeneous data well. Results on the challenging datasets of face verification (LFW and PubFig) and person re-identification (VIPeR) show that our algorithm outperforms the state-of-the-art methods.
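
Below is a hedged sketch of what such a log-likelihood-ratio similarity can look like under a pair-constrained Gaussian model, scoring both the difference and the commonness of a pair. The covariance matrices are taken as inputs here, whereas the paper derives the dissimilar-pair ones from the similar-pair ones.

    import numpy as np

    def llr_similarity(x1, x2, cov_diff_S, cov_diff_D, cov_sum_S, cov_sum_D):
        """log p(pair | similar) - log p(pair | dissimilar), zero-mean Gaussians."""
        e = x1 - x2              # difference of the pair
        m = x1 + x2              # commonness of the pair
        def quad(v, cov):
            return v @ np.linalg.solve(cov, v)
        # Quadratic terms of the two Gaussian log-densities...
        s = (quad(e, cov_diff_D) - quad(e, cov_diff_S)
             + quad(m, cov_sum_D) - quad(m, cov_sum_S))
        # ...plus the log-determinant terms from the normalizers.
        s += (np.linalg.slogdet(cov_diff_D)[1] - np.linalg.slogdet(cov_diff_S)[1]
              + np.linalg.slogdet(cov_sum_D)[1] - np.linalg.slogdet(cov_sum_S)[1])
        return 0.5 * s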

65 citations


Journal ArticleDOI
TL;DR: Local order constrained IGOs are exploited to generate robust features that enhance the local textures and the order-based coding ability, thus further discovering the intrinsic structure of facial images.
Abstract: Robust descriptor-based subspace learning with complex data is an active topic in pattern analysis and machine intelligence. A few studies concentrate the optimal design on feature representation and metric learning. However, traditionally used single-type features, e.g., image gradient orientations (IGOs), are insufficient to characterize the complete variations in robust and discriminant subspace learning. Meanwhile, discontinuities in edge alignment and feature matching have not been carefully treated in the literature. In this paper, local order constrained IGOs are exploited to generate robust features. As the difference-based filters explicitly consider the local contrasts within neighboring pixel points, the proposed features enhance the local textures and the order-based coding ability, thus further discovering the intrinsic structure of facial images. The multimodal features are automatically fused in the most discriminant subspace. The utilization of an adaptive interaction function suppresses outliers in each dimension for robust similarity measurement and discriminant analysis. The sparsity-driven regression model is modified to adapt to the classification issue of the compact feature representation. Extensive experiments are conducted on several benchmark face data sets, from both controlled and uncontrolled environments, to evaluate our new algorithm.
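
As background, the basic IGO representation the abstract builds on can be sketched as follows: per-pixel gradient orientations mapped onto the unit circle, so that Euclidean comparisons respect angular differences. The local order constraints of the paper are not reproduced here.

    import numpy as np

    def igo_features(image):
        """Per-pixel image gradient orientations on the unit circle."""
        gy, gx = np.gradient(image.astype(np.float64))
        phi = np.arctan2(gy, gx)                   # orientation at each pixel
        # Stacking cos/sin makes Euclidean distance respect angular differences
        # and gives robustness to illumination changes.
        return np.stack([np.cos(phi), np.sin(phi)], axis=-1)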

58 citations


Journal ArticleDOI
TL;DR: This work proposes an algorithm that formulates the multi-object tracking task as one of exploiting hierarchical dense structures on an undirected hypergraph constructed from tracklet affinity, and demonstrates that the proposed algorithm performs favorably against the state-of-the-art methods.
Abstract: Most multi-object tracking algorithms are developed within the tracking-by-detection framework, which considers the pairwise appearance similarities between detection responses or tracklets within a limited temporal window and is thus less effective in handling long-term occlusions or distinguishing spatially close targets with similar appearance in crowded scenes. In this work, we propose an algorithm that formulates the multi-object tracking task as one of exploiting hierarchical dense structures on an undirected hypergraph constructed from tracklet affinity. The dense structures indicate a group of vertices that are inter-connected by a set of hyperedges with high affinity values. The appearance and motion similarities among multiple tracklets across the spatio-temporal domain are considered globally by exploiting high-order similarities rather than pairwise ones, thereby making it easier to distinguish spatially close targets with similar appearance. In addition, the hierarchical design of the optimization process helps the proposed tracking algorithm handle long-term occlusions robustly. Extensive experiments on various challenging datasets of both multi-pedestrian and multi-face tracking tasks demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.

53 citations


Posted Content
TL;DR: The proposed method, "CRAFT" (Cascade Region-proposal-network And FasT-rcnn), tackles each task with a carefully designed network cascade, and the cascade structure is shown to help in both tasks.
Abstract: Object detection is a fundamental problem in image understanding. One popular solution is the R-CNN framework and its fast versions. They decompose the object detection problem into two cascaded easier tasks: 1) generating object proposals from images, 2) classifying proposals into various object categories. Although the two tasks are relatively easier, they are not solved perfectly and there is still room for improvement. In this paper, we push the "divide and conquer" solution even further by dividing each task into two sub-tasks. We call the proposed method "CRAFT" (Cascade Region-proposal-network And FasT-rcnn), which tackles each task with a carefully designed network cascade. We show that the cascade structure helps in both tasks: in proposal generation, it provides more compact and better localized object proposals; in object classification, it reduces false positives (mainly between ambiguous categories) by capturing both inter- and intra-category variances. CRAFT achieves consistent and considerable improvement over the state-of-the-art on object detection benchmarks like PASCAL VOC 07/12 and ILSVRC.

Journal ArticleDOI
TL;DR: This paper extends the original shallow face descriptors to deep discriminant face features by introducing a stacked image descriptor (SID); with a deep structure, more complex facial information can be extracted and the discriminative power and compactness of the feature representation can be improved.
Abstract: Learning-based face descriptors have constantly improved the face recognition performance. Compared with hand-crafted features, learning-based features are considered able to exploit information with better discriminative ability for specific tasks. Motivated by the recent success of deep learning, in this paper, we extend the original shallow face descriptors to deep discriminant face features by introducing a stacked image descriptor (SID). With a deep structure, more complex facial information can be extracted and the discriminative power and compactness of the feature representation can be improved. The SID is learned in a forward optimization way, which is computationally efficient compared with deep learning. Extensive experiments on various face databases are conducted to show that SID is able to achieve high face recognition performance with a compact face representation, compared with other state-of-the-art descriptors.

Posted Content
TL;DR: A convolutional two-stream consensus voting network (2SCVN) is proposed which explicitly models both the short-term and long-term structure of the RGB sequences and significantly improves the recognition accuracy.
Abstract: Recently, the popularity of depth sensors such as Kinect has made depth videos easily available, while their advantages have not been fully exploited. This paper investigates, for gesture recognition, how to exploit the spatial and temporal information complementarily embedded in RGB and depth sequences. We propose a convolutional two-stream consensus voting network (2SCVN) which explicitly models both the short-term and long-term structure of the RGB sequences. To alleviate distractions from background, a 3D depth-saliency ConvNet stream (3DDSN) is aggregated in parallel to identify subtle motion characteristics. These two components in a unified framework significantly improve the recognition accuracy. On the challenging ChaLearn IsoGD benchmark, our proposed method outperforms the first place on the leader-board by a large margin (10.29%) while also achieving the best result on the RGBD-HuDaAct dataset (96.74%). Both quantitative experiments and qualitative analysis show the effectiveness of our proposed framework, and code will be released to facilitate future research.

Proceedings Article
Yang Yang, Zhen Lei, Shifeng Zhang, Hailin Shi, Stan Z. Li
12 Feb 2016
TL;DR: A metric-embedded discriminative vocabulary learning method is proposed for high-level person representation, with application to person re-identification; a new and effective term is introduced which aims at making the same persons closer while pushing different ones farther apart in the metric space.
Abstract: A variety of encoding methods for the bag-of-words (BoW) model have been proposed to encode the local features in image classification. However, most of them are unsupervised and just employ k-means to form the visual vocabulary, thus reducing the discriminative power of the features. In this paper, we propose a metric-embedded discriminative vocabulary learning method for high-level person representation with application to person re-identification. A new and effective term is introduced which aims at making the same persons closer while pushing different ones farther apart in the metric space. With the learned vocabulary, we utilize a linear coding method to encode the image-level features (or holistic image features) for extracting a high-level person representation. Different from traditional unsupervised approaches, our method can explore the relationship (same or not) among the persons. Since there is an analytic solution to the linear coding, it is easy to obtain the final high-level features. The experimental results on person re-identification demonstrate the effectiveness of our proposed algorithm.

Journal ArticleDOI
TL;DR: Extensive experiments show that the proposed JVS method is better than the traditional single-camera video synopsis method in preserving the chronological orders of moving objects among multicamera synopsis videos.
Abstract: Due to an increasing demand for video surveillance, there is an explosive growth of surveillance videos, which poses a big challenge for video storage, browsing, and retrieval. The video synopsis technique has thus been developed to extract and rearrange the moving objects so as to handle the massive video browsing challenge. However, the traditional video synopsis (TVS) method only considers processing videos captured by a single camera, ignoring object interactions in multicamera videos. To address this issue, we propose a novel multicamera joint video synopsis (JVS) algorithm for multicamera surveillance videos. First, a key time stamp (KTS) selection method is designed to find an object's appearing, merging, splitting, and disappearing moments in that object's frame sequence, called a tube. Second, tubes are rearranged by minimizing a global energy function that involves the overall camera views. Compared with the energy function used in TVS, the proposed global energy function considers the chronological orders of tubes not only in the same camera view but also among different camera views. Moreover, the chronological disorder cost term is formulated based on the KTS labels, and improved by considering the visual similarity between two tubes. Finally, the multicamera synopsis videos are separately generated by stitching together the globally rearranged tubes and background images of the same camera view. Extensive experiments show that the proposed JVS method is better than the traditional single-camera video synopsis method in preserving the chronological orders of moving objects among multicamera synopsis videos.

Book ChapterDOI
20 Nov 2016
TL;DR: A novel approach for age estimation based on a single convolutional neural network (CNN) is proposed, which models the randomness of aging with a Gaussian distribution and uses a soft softmax regression function in the network.
Abstract: In this paper, we propose a novel approach based on a single convolutional neural network (CNN) for age estimation. In our proposed network architecture, we first model the randomness of aging with a Gaussian distribution, which is used to calculate the Gaussian integral of an age interval. Then, we present a soft softmax regression function used in the network. The new function applies the aging model to compute the loss. Compared with the traditional softmax function, the new function considers not only the chronological age but also the interval near the true age. Moreover, owing to the complexity of the Gaussian integral in the soft softmax function, a look-up table is built to accelerate this process. All the integrals of age values are calculated offline in advance. We evaluate our method on two public datasets, MORPH II and the Cross-Age Celebrity Dataset (CACD), and experimental results show that the proposed method achieves superior performance compared to the state of the art.
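
The soft-label construction implied by the abstract can be sketched directly: the target for each age bin is the Gaussian integral over that bin, centered at the true age, and the integrals can be precomputed offline into a look-up table. In this sketch, sigma and the unit bin width are assumptions.

    import numpy as np
    from scipy.stats import norm

    AGES = np.arange(0, 101)

    def soft_label(true_age, sigma=2.0):
        """Integral of N(true_age, sigma^2) over [a - 0.5, a + 0.5] per age a."""
        upper = norm.cdf(AGES + 0.5, loc=true_age, scale=sigma)
        lower = norm.cdf(AGES - 0.5, loc=true_age, scale=sigma)
        p = upper - lower
        return p / p.sum()

    # As in the paper, the integrals can be precomputed offline for every
    # possible true age, giving a simple look-up table:
    SOFT_LABEL_LUT = np.stack([soft_label(a) for a in AGES])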

Book ChapterDOI
14 Oct 2016
TL;DR: This paper proposes a new method for facial expression recognition, called multi-scale CNNs, which consists of several sub-CNNs with different scales of input images and can classify facial expressions more accurately than any single-scale sub-CNN.
Abstract: This paper proposes a new method for facial expression recognition, called multi-scale CNNs. It consists of several sub-CNNs with different scales of input images. The sub-CNNs benefit from the variously scaled input images to learn optimized parameters. After training all these sub-CNNs separately, we can predict the facial expression of an image by extracting its features from the last fully connected layer of the sub-CNNs at different scales and mapping the averaged features to the final classification probability. Multi-scale CNNs can classify facial expressions more accurately than any single-scale sub-CNN. On the Facial Expression Recognition 2013 database, multi-scale CNNs achieved an accuracy of 71.80% on the testing set, which is comparable to other state-of-the-art methods.
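
The prediction scheme described above reduces to a few lines. In this sketch, `sub_cnns` and `classifier` are hypothetical stand-ins for the trained components, with assumed `preprocess`, `extract_fc_features`, and `predict_proba` interfaces.

    import numpy as np

    def predict_expression(image, sub_cnns, scales, classifier):
        """Average last-FC features across scale-specific sub-CNNs, then classify."""
        feats = []
        for cnn, s in zip(sub_cnns, scales):
            resized = cnn.preprocess(image, size=s)   # resize to this sub-CNN's input
            feats.append(cnn.extract_fc_features(resized))
        avg_feat = np.mean(feats, axis=0)             # average across scales
        return classifier.predict_proba([avg_feat])[0]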

Book ChapterDOI
14 Oct 2016
TL;DR: A specialized benchmark study that focuses purely on face classification, showing that, without the help of post-processing, the performance of face classification itself is still not very satisfactory, even with a powerful CNN method.
Abstract: Face detection generally involves three steps: block generation, face classification, and post-processing. However, firstly, face detection performance is largely influenced by block generation and post-processing, concealing the performance of the core face classification module. Secondly, implementing and optimizing all three steps entails a heavy workload, which is a big barrier for researchers who only care about classification. Motivated by this, we conduct a specialized benchmark study in this paper, which focuses purely on face classification. We start with face proposals, and build a benchmark dataset with about 3.5 million patches for two-class face/non-face classification. Results with several baseline algorithms show that, without the help of post-processing, the performance of face classification itself is still not very satisfactory, even with a powerful CNN method. We will release this benchmark to help assess the performance of face classification alone, and to ease the participation of other interested researchers.

Book ChapterDOI
Ting Liu, Jun Wan, Tingzhao Yu, Zhen Lei, Stan Z. Li
14 Oct 2016
TL;DR: The proposed MRCNN has two principal advantages: its 8 sub-networks are able to learn the unique age characteristics of their corresponding subregions, and they are packaged together to complement age-related information.
Abstract: As one of the most important biological features, age has tremendous application potential in various areas such as surveillance, human-computer interaction and video detection. In this paper, a new convolutional neural network, namely MRCNN (Multi-Region Convolutional Neural Network), is proposed based on multiple face subregions. It joins multiple face subregions together to estimate age. Each targeted region is analyzed to explore its contribution to age estimation. According to the geometrical properties of the face, we select 8 subregions, construct 8 sub-network structures respectively, and then fuse them at the feature level. The proposed MRCNN has two principal advantages: the 8 sub-networks are able to learn the unique age characteristics of their corresponding subregions, and they are packaged together to complement age-related information. Further, we analyze the estimation accuracy on all age groups. Experiments on MORPH illustrate the superior performance of the proposed MRCNN.

Proceedings ArticleDOI
Hailin Shi, Xiangyu Zhu, Zhen Lei, Shengcai Liao, Stan Z. Li
01 Jun 2016
TL;DR: In this paper, a class-encoder is proposed to minimize the intra-class variations in the feature space, and to learn a good discriminative manifold on a class scale.
Abstract: Deep neural networks usually benefit from unsupervised pre-training, e.g. auto-encoders. However, the classifier further needs supervised fine-tuning for good discrimination. Besides, due to the limits of full connection, the application of auto-encoders is usually limited to small, well aligned images. In this paper, we incorporate the supervised information to propose a novel formulation, namely class-encoder, whose training objective is to reconstruct a sample from another one with an identical label. The class-encoder aims to minimize the intra-class variations in the feature space, and to learn a good discriminative manifold on a class scale. We impose the class-encoder as a constraint on the softmax for better supervised training, and extend the reconstruction to the feature level to tackle the parameter size issue and the translation issue. The experiments show that the class-encoder helps to improve the performance on benchmarks of classification and face recognition. This could also be a promising direction for fast training of face recognition models.
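
The training objective is concrete enough for a hedged sketch: reconstruct one sample from another sample of the same class, and combine that reconstruction constraint with a standard softmax loss. PyTorch is used here; `encoder`, `decoder`, `classifier`, and the weight `lam` are illustrative, not the authors' exact formulation.

    import torch.nn.functional as F

    def class_encoder_loss(encoder, decoder, classifier, x_a, x_b, labels, lam=0.1):
        """x_a, x_b: two batches with identical labels, paired row-by-row."""
        z = encoder(x_a)
        # Reconstruct x_b (same class as x_a), pulling intra-class samples
        # toward a shared representation.
        recon_loss = F.mse_loss(decoder(z), x_b)
        # Standard supervised softmax loss on the encoded features.
        cls_loss = F.cross_entropy(classifier(z), labels)
        return cls_loss + lam * recon_loss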

Journal Article
Jie Yang, Y. Ma, Xu Zhang, Stan Z. Li, Y. Zhang 
TL;DR: An algorithm for selecting initial cluster centers on the basis of a minimum spanning tree (MST) is presented as an initialization method for the k-means algorithm, and the corresponding time complexity is analyzed.
Abstract: The traditional k-means algorithm has been widely used as a simple and efficient clustering method. However, the algorithm often converges to local minima because it is sensitive to the initial cluster centers. In this paper, an algorithm for selecting initial cluster centers on the basis of a minimum spanning tree (MST) is presented. The sets of vertices in the MST with the same degree are regarded as a whole, which is used to find the skeleton data points. Furthermore, a distance measure between skeleton data points that takes both degree and Euclidean distance into account is presented. Finally, an MST-based initialization method for the k-means algorithm is presented, and the corresponding time complexity is analyzed as well. The presented algorithm is tested on five data sets from the UCI Machine Learning Repository. The experimental results illustrate the effectiveness of the presented algorithm compared to three existing initialization methods.
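
A hedged sketch of MST-based seeding in this spirit: build the MST of the data, use vertex degree as the skeleton cue, and greedily pick k seeds among high-degree points that are mutually far apart. The greedy median-distance rule below simplifies the paper's combined degree/Euclidean measure.

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree
    from scipy.spatial.distance import pdist, squareform

    def mst_kmeans_init(X, k):
        """Select k initial centers for k-means from the MST of X (sketch)."""
        dist = squareform(pdist(X))
        mst = minimum_spanning_tree(dist)
        adj = mst + mst.T                       # make the tree undirected
        degree = np.asarray((adj > 0).sum(axis=1)).ravel()
        # Rank points by degree (denser regions first), then greedily keep
        # seeds that are far from every seed already chosen.
        order = np.argsort(-degree)
        seeds = [order[0]]
        for i in order[1:]:
            if len(seeds) == k:
                break
            if all(dist[i, s] > np.median(dist) for s in seeds):
                seeds.append(i)
        # Fall back to the highest-degree remaining points if too few seeds.
        for i in order:
            if len(seeds) == k:
                break
            if i not in seeds:
                seeds.append(i)
        return X[np.array(seeds)]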

Book ChapterDOI
20 Nov 2016
TL;DR: A novel face detection method called Aggregating Visible Components (AVC) is proposed, which addresses pose variations and occlusions simultaneously in a single framework with low complexity.
Abstract: Pose variations and occlusions are two major challenges for unconstrained face detection. Many approaches have been proposed to handle pose variations and occlusions in face detection; however, few of them address the two challenges explicitly and simultaneously in one model. In this paper, we propose a novel face detection method called Aggregating Visible Components (AVC), which addresses pose variations and occlusions simultaneously in a single framework with low complexity. The main contributions of this paper are: (1) by aggregating visible components, which have inherent advantages under occlusion, the proposed method achieves state-of-the-art performance using only hand-crafted features; (2) mapped from the mean shape through a component-invariant mapping, the proposed component detector is more robust to pose variations; (3) a local-to-global aggregation strategy that involves region competition helps alleviate false alarms while enhancing localization accuracy.

Posted Content
Hailin Shi, Xiangyu Zhu, Zhen Lei, Shengcai Liao, Stan Z. Li
TL;DR: The supervised information is incorporated to propose a novel formulation, namely class-encoder, whose training objective is to reconstruct a sample from another one with an identical label; it improves performance on benchmarks of classification and face recognition.
Abstract: Deep neural networks usually benefit from unsupervised pre-training, e.g. auto-encoders. However, the classifier further needs supervised fine-tuning for good discrimination. Besides, due to the limits of full connection, the application of auto-encoders is usually limited to small, well aligned images. In this paper, we incorporate the supervised information to propose a novel formulation, namely class-encoder, whose training objective is to reconstruct a sample from another one with an identical label. The class-encoder aims to minimize the intra-class variations in the feature space, and to learn a good discriminative manifold on a class scale. We impose the class-encoder as a constraint on the softmax for better supervised training, and extend the reconstruction to the feature level to tackle the parameter size issue and the translation issue. The experiments show that the class-encoder helps to improve the performance on benchmarks of classification and face recognition. This could also be a promising direction for fast training of face recognition models.