Proceedings ArticleDOI

Spontaneous Facial Micro-Expression Recognition using 3D Spatiotemporal Convolutional Neural Networks

TL;DR: Two 3D-CNN methods, MicroExpSTCNN and MicroExpFuseNet, are proposed for spontaneous facial micro-expression recognition by exploiting spatiotemporal information in a CNN framework; MicroExpSTCNN outperforms the state-of-the-art methods.
Abstract: Facial expression recognition in videos is an active area of research in computer vision. However, fake facial expressions are difficult to recognize even for humans. On the other hand, facial micro-expressions generally represent the actual emotion of a person, as they are spontaneous reactions expressed through the human face. Despite a few attempts at recognizing micro-expressions, the problem is far from solved, as reflected in the poor accuracy of the state-of-the-art methods. A few CNN-based approaches are found in the literature for recognizing micro-expressions from still images. In contrast, a spontaneous micro-expression video contains multiple frames that have to be processed together to encode both spatial and temporal information. This paper proposes two 3D-CNN methods, MicroExpSTCNN and MicroExpFuseNet, for spontaneous facial micro-expression recognition by exploiting spatiotemporal information in a CNN framework. MicroExpSTCNN considers the full spatial information, whereas MicroExpFuseNet is based on the 3D-CNN feature fusion of the eye and mouth regions. The experiments are performed over the CAS(ME)2 and SMIC micro-expression databases. The proposed MicroExpSTCNN model outperforms the state-of-the-art methods.
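To make the spatiotemporal idea concrete, the following is a minimal PyTorch sketch of a MicroExpSTCNN-style network; the channel width, kernel sizes, frame count, and input resolution are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a MicroExpSTCNN-style 3D CNN (assumed hyperparameters).
import torch
import torch.nn as nn

class MicroExp3DCNN(nn.Module):
    def __init__(self, num_classes: int, frames: int = 96, size: int = 64):
        super().__init__()
        # Conv3d kernels span (time, height, width), so spatial and
        # temporal cues are encoded jointly.
        self.features = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=3),
            nn.Dropout3d(0.5),
        )
        with torch.no_grad():  # infer the flattened feature size
            n = self.features(torch.zeros(1, 1, frames, size, size)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, clip):  # clip: (batch, 1, frames, H, W) grayscale video
        return self.classifier(self.features(clip))
```

A MicroExpFuseNet-style variant would run two such feature extractors on eye and mouth crops and concatenate their features before the classifier.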
Citations
Journal ArticleDOI
TL;DR: A novel optimizer, diffGrad, is proposed based on the difference between the present and the immediate past gradient; experiments show that diffGrad outperforms other optimizers and performs uniformly well for training CNNs with different activation functions.
Abstract: Stochastic gradient descent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides information on the direction in which a function has the steepest rate of change. The main problem with basic SGD is that it changes all parameters by equal-sized steps, irrespective of gradient behavior. Hence, an efficient way of optimizing deep networks is to have an adaptive step size for each parameter. Recently, several attempts have been made to improve gradient descent methods, such as AdaGrad, AdaDelta, RMSProp, and adaptive moment estimation (Adam). These methods rely on the square roots of exponential moving averages of squared past gradients and thus do not take advantage of the local change in gradients. In this article, a novel optimizer is proposed based on the difference between the present and the immediate past gradient (i.e., diffGrad). In the proposed diffGrad optimization technique, the step size is adjusted for each parameter so that parameters with rapidly changing gradients take larger steps and parameters with slowly changing gradients take smaller steps. The convergence analysis is done using the regret-bound approach of the online learning framework. A thorough analysis is made over three synthetic complex non-convex functions, and image categorization experiments are conducted over the CIFAR10 and CIFAR100 data sets to compare diffGrad with state-of-the-art optimizers such as SGDM, AdaGrad, AdaDelta, RMSProp, AMSGrad, and Adam. A residual unit (ResNet)-based convolutional neural network (CNN) architecture is used in the experiments. The experiments show that diffGrad outperforms the other optimizers, and that it performs uniformly well for training CNNs with different activation functions. The source code is made publicly available at https://github.com/shivram1987/diffGrad .
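A minimal NumPy sketch of the update as the abstract describes it: an Adam-style step whose per-parameter size is modulated by the difference between the current and the immediately previous gradient. The sigmoid ("friction") form of that modulation is one concrete realization and should be checked against the released code.

```python
# diffGrad-style update: Adam scaled by a gradient-difference "friction" term.
# The sigmoid form of the friction term is an assumption for illustration.
import numpy as np

def diffgrad_step(theta, grad, state, lr=1e-3,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    state["t"] += 1
    t = state["t"]
    # Exponential moving averages of gradient and squared gradient, as in Adam.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad**2
    m_hat = state["m"] / (1 - beta1**t)
    v_hat = state["v"] / (1 - beta2**t)
    # Friction: larger where the gradient changes quickly, smaller where it is flat.
    xi = 1.0 / (1.0 + np.exp(-np.abs(state["g_prev"] - grad)))
    state["g_prev"] = grad.copy()
    return theta - lr * xi * m_hat / (np.sqrt(v_hat) + eps)

# Usage: state = {"t": 0, "m": 0.0, "v": 0.0, "g_prev": np.zeros_like(theta)}
```

Under this rule the friction term stays in (0.5, 1): parameters whose gradients change quickly keep nearly the full Adam step, while parameters in flat regions take roughly half-sized steps.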

140 citations


Cites background from "Spontaneous Facial Micro-Expression..."

  • ...The other applications, where deep learning can be used, include object tracking [13], [14], [15], face anti-spoofing and micro-expression recognition [16], [17], hyper-spectral image classification [18], etc....


Proceedings ArticleDOI
12 Oct 2020
TL;DR: A novel micro-expression recognition approach is proposed that combines Action Units (AUs) and emotion category labels based on facial muscle movements; it outperforms other state-of-the-art methods on both single-database and cross-database micro-expression recognition.
Abstract: Micro-expressions (MEs) are important clues for reflecting the real feelings of humans, and micro-expression recognition (MER) can thus be applied in various real-world applications. However, it is difficult to perceive and interpret MEs correctly. With the advance of deep learning technologies, the accuracy of micro-expression recognition has improved but is still limited by the lack of large-scale datasets. In this paper, we propose a novel micro-expression recognition approach that combines Action Units (AUs) and emotion category labels. Specifically, based on facial muscle movements, we model different AUs using relational information and integrate the AU recognition task with MER. Besides, to overcome the shortcomings of limited and imbalanced training samples, we propose a data augmentation method that can generate image sequences nearly indistinguishable in AU intensity from real-world micro-expression images; this effectively improves performance and is compatible with other micro-expression recognition methods. Experimental results on three mainstream micro-expression datasets, i.e., CASME II, SAMM, and SMIC, demonstrate that our approach outperforms other state-of-the-art methods on both single-database and cross-database micro-expression recognition.
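Read as a learning setup, integrating the AU recognition task with MER amounts to multi-task training. The sketch below shows one plausible form of that pattern: a shared backbone with an emotion head and a multi-label AU head under a weighted joint loss. All names and the loss weighting are hypothetical placeholders, not the paper's architecture.

```python
# Hypothetical multi-task sketch: shared backbone, emotion head, AU head.
import torch
import torch.nn as nn

class AuxAUModel(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int,
                 num_emotions: int, num_aus: int):
        super().__init__()
        self.backbone = backbone                     # shared feature extractor
        self.emotion_head = nn.Linear(feat_dim, num_emotions)
        self.au_head = nn.Linear(feat_dim, num_aus)  # multi-label AU detection

    def forward(self, x):
        f = self.backbone(x)
        return self.emotion_head(f), self.au_head(f)

def joint_loss(emo_logits, au_logits, emo_target, au_target, lam=0.5):
    # Emotion: single-label cross-entropy; AUs: multi-label BCE; lam is assumed.
    ce = nn.functional.cross_entropy(emo_logits, emo_target)
    bce = nn.functional.binary_cross_entropy_with_logits(au_logits, au_target)
    return ce + lam * bce
```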

88 citations


Cites methods from "Spontaneous Facial Micro-Expression..."

  • ...We compare our performance with several state-of-the-art methods including MicroExpSTCNN [36], ELRCN [16], CapsuleNet [43] and MER-GCN [27]....


  • ...For instance, 3DConvNet is used in [36] to discover the spatiotemporal relationship between ME sequences with high-level features....



  • ...ies the MER problem by data-driven approaches [16, 17, 36, 43, 44]....


Journal ArticleDOI
TL;DR: This paper proposes a three-stream convolutional neural network (TSCNN) to recognize MEs by learning ME-discriminative features in three key frames of ME videos, and designs a dynamic-temporal stream, static-spatial stream, and local-spatial stream module for the TSCNN.
Abstract: Micro-expression recognition (MER) has attracted much attention with various practical applications, particularly in clinical diagnosis and interrogations. In this paper, we propose a three-stream convolutional neural network (TSCNN) to recognize MEs by learning ME-discriminative features in three key frames of ME videos. We design dynamic-temporal, static-spatial, and local-spatial stream modules for the TSCNN that respectively attempt to learn and integrate temporal, entire-facial-region, and local-facial-region cues in ME videos, with the goal of recognizing MEs. In addition, to allow the TSCNN to recognize MEs without using the index values of apex frames, we design a reliable apex frame detection algorithm. Extensive experiments are conducted with five public ME databases: CASME II, SMIC-HS, SAMM, CAS(ME)2, and CASME. Our proposed TSCNN is shown to achieve more promising recognition results when compared with many other methods.
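Schematically, the three-stream design fuses per-stream features before classification. The sketch below shows that fusion pattern with the stream internals left abstract; the feature dimensions and the concatenation-based fusion rule are assumptions for illustration.

```python
# Three-stream fusion in the spirit of TSCNN (stream internals assumed).
import torch
import torch.nn as nn

class ThreeStreamFusion(nn.Module):
    def __init__(self, temporal: nn.Module, spatial: nn.Module,
                 local: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        # One encoder per cue: temporal dynamics, whole face, local regions.
        self.temporal, self.spatial, self.local = temporal, spatial, local
        self.classifier = nn.Linear(3 * feat_dim, num_classes)

    def forward(self, flow, face, regions):
        # Each stream maps its own input to a feat_dim feature vector.
        fused = torch.cat([self.temporal(flow),
                           self.spatial(face),
                           self.local(regions)], dim=1)
        return self.classifier(fused)
```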

60 citations


Cites background or methods or result from "Spontaneous Facial Micro-Expression..."

  • ...To ensure fair comparisons and following other methods, such as 3D CNN based techniques in the literature [21], we also report the recognition results under the same conditions as the literature....


  • ...We also compare the TSCNN with two 3D-CNN methods (MicroExpSTCNN and MicroExpFuseNet) that were proposed in [21] using only micro-expression...


  • ...spontaneous micro- and macro-expressions, few methods for micro-expression recognition [21], [65], [76], [77] have been designed and tested using this database....


  • ...The other is the same as that used in [21], which only contains micro-expression videos that have the same samples as the literature....


  • ...The above experimental results show that our TSCNN is very competitive, compared with the two 3D-CNN based methods in [21]....


Proceedings ArticleDOI
19 Jun 2021
TL;DR: Wang et al., as mentioned in this paper, proposed a novel pipeline to learn a facial graph (nodes and edges) representation that captures local subtle variations, and designed an AU-GCN to learn the action units' matrix by embedding and GCN.
Abstract: Micro-expression recognition is challenging because it involves subtle variations in facial organs. In this paper, first, we propose a novel pipeline to learn a facial graph (nodes and edges) representation to capture these local subtle variations. We express the micro-expressions with multi-patches based on facial landmarks and then stack these patches into channels while using a depthwise convolution (DConv) to learn the features inside the patches, namely, node learning. Then, the encoder of the transformer (ETran) is utilized to learn the relationships between the nodes, namely, edge learning. Based on node and edge learning, a learned facial graph representation is obtained. Second, because the occurrence of an expression is closely bound to action units, we design an AU-GCN to learn the action units' matrix by embedding and GCN. Finally, we propose a fusion model to introduce the action units' matrix into the learned facial graph representation. The experiments compare with the SOTA on various evaluation criteria, including common classification on the CASME II and SAMM datasets, and are also conducted following the Micro-Expression Grand Challenge 2019 protocol.
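The node/edge pipeline maps directly onto standard PyTorch modules: a depthwise convolution (groups equal to the number of patch channels) plays the role of node learning, and a transformer encoder over the resulting node embeddings plays the role of edge learning. Patch counts and embedding sizes below are assumptions.

```python
# Sketch of node learning (depthwise conv) + edge learning (transformer encoder).
import torch
import torch.nn as nn

class FacialGraphEncoder(nn.Module):
    def __init__(self, num_patches: int = 28, patch_size: int = 7, d_model: int = 32):
        super().__init__()
        # Depthwise conv: groups == in_channels, so each landmark patch
        # (channel) gets its own filter bank -> "node learning".
        self.dconv = nn.Conv2d(num_patches, num_patches * d_model,
                               kernel_size=patch_size, groups=num_patches)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        # Self-attention between node embeddings -> "edge learning".
        self.etran = nn.TransformerEncoder(layer, num_layers=2)
        self.num_patches, self.d_model = num_patches, d_model

    def forward(self, patches):  # (batch, num_patches, patch_size, patch_size)
        nodes = self.dconv(patches)               # (batch, P * d_model, 1, 1)
        nodes = nodes.view(-1, self.num_patches, self.d_model)
        return self.etran(nodes)                  # learned facial graph representation
```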

46 citations

Proceedings ArticleDOI
06 Aug 2020
TL;DR: Zhang et al. as mentioned in this paper proposed an end-to-end AU-oriented graph classification network, which uses 3D ConvNets to extract AU features and applies GCN layers to discover the dependency laying between AU nodes for ME categorization.
Abstract: A Micro-Expression (ME) is a spontaneous, involuntary movement of the face that can reveal true feelings. Recently, increasing research has paid attention to this field, combining deep learning techniques. Action units (AUs) are the fundamental actions reflecting facial muscle movements, and AU detection has been adopted by many studies to classify facial expressions. However, the time-consuming annotation process makes it difficult to correlate combinations of AUs to specific emotion classes. Inspired by the node-relationship-building Graph Convolutional Network (GCN), we propose an end-to-end AU-oriented graph classification network, namely MER-GCN, which uses 3D ConvNets to extract AU features and applies GCN layers to discover the dependencies lying between AU nodes for ME categorization. To the best of our knowledge, this work is the first end-to-end architecture for Micro-Expression Recognition (MER) using an AU-based GCN. The experimental results show that our approach outperforms CNN-based MER networks.
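The GCN layers stacked over AU node features follow the standard propagation rule H' = ReLU(Â·H·W) with a normalized adjacency Â. A minimal version, with the 3D-ConvNet feature extractor omitted and the AU adjacency left as an input, might look like this (AU count and feature sizes are placeholders):

```python
# Minimal GCN layer over AU node features: H' = ReLU(A_hat @ H @ W).
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        # Symmetric normalization with self-loops: A_hat = D^{-1/2}(A + I)D^{-1/2}.
        A = A + torch.eye(A.size(-1), device=A.device)
        d = A.sum(-1).clamp(min=1e-6).pow(-0.5)
        A_hat = d.unsqueeze(-1) * A * d.unsqueeze(-2)
        return torch.relu(A_hat @ self.weight(H))

# e.g. H: (num_AUs, in_dim) node features from a 3D ConvNet,
#      A: (num_AUs, num_AUs) AU dependency adjacency.
```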

35 citations

References
Proceedings Article
03 Dec 2012
TL;DR: As discussed by the authors, state-of-the-art image classification performance was achieved by a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
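For reference, the described topology can be rendered compactly in PyTorch. The channel and kernel settings below follow the original paper's configuration; details such as local response normalization and the two-GPU split are omitted.

```python
# Compact rendition of the described architecture: five conv layers (some
# followed by max-pooling), three fully-connected layers with dropout, and
# logits for a final 1000-way softmax. Expects 227x227 RGB input.
import torch.nn as nn

alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),  # class logits (softmax applied in the loss)
)
```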

73,978 citations

Book
18 Nov 2016
TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Journal Article
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
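The mechanism is small enough to state in a few lines. This sketch uses the inverted-dropout variant, which rescales activations during training so the single unthinned test-time network needs no weight shrinking; it is equivalent in expectation to the scheme described above.

```python
# Inverted dropout: randomly drop units during training, rescale by 1/p_keep.
import numpy as np

def dropout(x, p_keep=0.5, training=True):
    if not training:
        return x  # the single unthinned network approximates the ensemble
    mask = (np.random.rand(*x.shape) < p_keep)
    return x * mask / p_keep  # rescale so the expected activation is unchanged
```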

33,597 citations

Journal ArticleDOI
TL;DR: This work addresses the task of semantic image segmentation with Deep Learning, proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales, and improves the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
Abstract: In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields of view, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed “DeepLab” system sets the new state-of-the-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7 percent mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
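Atrous convolution and ASPP map directly onto dilated convolutions: parallel 3x3 convolutions with different dilation rates see different effective fields of view, and their outputs are fused. A minimal sketch, with illustrative (not DeepLab's exact) rates and widths:

```python
# ASPP-style head: parallel dilated 3x3 convolutions fused by a 1x1 projection.
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        # padding == dilation keeps the spatial size constant for 3x3 kernels.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.project = nn.Conv2d(len(rates) * out_ch, out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```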

11,856 citations


"Spontaneous Facial Micro-Expression..." refers background in this paper


  • ...Recently, deep learning based methods like convolutional neural networks (CNN) have gained popularity and are widely used to solve various computer vision problems [9] including Image Classification [10], Semantic Segmentation [11], Blind Image Quality Assessment [12], Face Anti-spoofing [13], Routine Colon Cancer Nuclei Classification [14] and many more....

