Proceedings ArticleDOI

Deep Alignment Network: A Convolutional Neural Network for Robust Face Alignment

01 Jul 2017, pp. 2034-2043
TL;DR: The use of entire face images rather than patches allows DAN to handle face images with large variation in head pose and difficult initializations, and reduces the state-of-the-art failure rate by up to 70%.
Abstract: In this paper, we propose Deep Alignment Network (DAN), a robust face alignment method based on a deep neural network architecture. DAN consists of multiple stages, where each stage improves the locations of the facial landmarks estimated by the previous stage. Our method uses entire face images at all stages, contrary to the recently proposed face alignment methods that rely on local patches. This is possible thanks to the use of landmark heatmaps which provide visual information about landmark locations estimated at the previous stages of the algorithm. The use of entire face images rather than patches allows DAN to handle face images with large variation in head pose and difficult initializations. An extensive evaluation on two publicly available datasets shows that DAN reduces the state-of-the-art failure rate by up to 70%. Our method has also been submitted for evaluation as part of the Menpo challenge.
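
The landmark heatmap that carries the previous stage's estimate into the next stage can be made concrete. Below is a minimal illustrative sketch, not the authors' code: it renders a set of estimated landmarks as a single-channel image whose intensity peaks at each landmark and decays with distance to the nearest one. The 112x112 canvas size is an assumption, and DAN's inter-stage image transforms and heatmap cropping are omitted.

    import numpy as np

    def landmark_heatmap(landmarks, size=112):
        """landmarks: (N, 2) float array of (x, y) pixel coordinates."""
        ys, xs = np.mgrid[0:size, 0:size]
        grid = np.stack([xs, ys], axis=-1).astype(np.float32)  # (H, W, 2)
        # Distance from every pixel to its nearest landmark.
        d = np.min(np.linalg.norm(grid[:, :, None, :] -
                                  landmarks[None, None, :, :], axis=-1), axis=-1)
        return 1.0 / (1.0 + d)  # peaks at landmarks, decays with distance

Because the heatmap is an ordinary image channel, the next stage can consume the previous estimate through standard convolutions over the entire face image rather than through local patches.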


Citations
Journal ArticleDOI
TL;DR: The High-Resolution Network (HRNet) as mentioned in this paper maintains high-resolution representations through the whole process by connecting the high-to-low resolution convolution streams in parallel and repeatedly exchanging the information across resolutions.
Abstract: High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel and (ii) repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at https://github.com/HRNet .
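
The two key characteristics can be sketched with two streams and one fusion step. This is illustrative only (assuming PyTorch; channel counts are assumptions, and the real HRNet keeps more streams and repeats this exchange across many stages):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoStreamFusion(nn.Module):
        def __init__(self, c_high=32, c_low=64):
            super().__init__()
            self.high = nn.Conv2d(c_high, c_high, 3, padding=1)  # full-resolution stream
            self.low = nn.Conv2d(c_low, c_low, 3, padding=1)     # half-resolution stream
            self.low_to_high = nn.Conv2d(c_low, c_high, 1)       # channel match before upsampling
            self.high_to_low = nn.Conv2d(c_high, c_low, 3, stride=2, padding=1)

        def forward(self, x_high, x_low):
            h = F.relu(self.high(x_high))
            l = F.relu(self.low(x_low))
            # Exchange information across resolutions while keeping both streams alive.
            h_out = h + F.interpolate(self.low_to_high(l), size=h.shape[-2:],
                                      mode='bilinear', align_corners=False)
            l_out = l + self.high_to_low(h)
            return h_out, l_out

Because the high-resolution stream is never collapsed and later recovered, the fused representation stays spatially precise throughout the network.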

1,162 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: Chen et al. as discussed by the authors proposed a deep end-to-end trainable face super-resolution network (FSRNet), which makes use of the geometry prior, i.e., facial landmark heatmaps and parsing maps, to super-resolve very low-resolution (LR) face images without requiring well-aligned inputs.
Abstract: Face Super-Resolution (SR) is a domain-specific super-resolution problem. Facial prior knowledge can be leveraged to better super-resolve face images. We present a novel deep end-to-end trainable Face Super-Resolution Network (FSRNet), which makes use of the geometry prior, i.e., facial landmark heatmaps and parsing maps, to super-resolve very low-resolution (LR) face images without requiring well-aligned inputs. Specifically, we first construct a coarse SR network to recover a coarse high-resolution (HR) image. The coarse HR image is then sent to two branches: a fine SR encoder, which extracts image features, and a prior information estimation network, which estimates landmark heatmaps and parsing maps. Both the image features and the prior information are sent to a fine SR decoder to recover the HR image. To generate realistic faces, we also propose the Face Super-Resolution Generative Adversarial Network (FSRGAN), which incorporates an adversarial loss into FSRNet. Further, we introduce two related tasks, face alignment and parsing, as new evaluation metrics for face SR, which address the inconsistency of classic metrics w.r.t. visual perception. Extensive experiments show that FSRNet and FSRGAN significantly outperform the state of the art for very-LR face SR, both quantitatively and qualitatively.
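
The data flow described above can be summarized in a short skeleton. This is an illustrative paraphrase, not the released model: each sub-network is reduced to a placeholder convolution, and the 68 landmark-heatmap and 11 parsing-map channels are assumptions borrowed from common face-alignment and face-parsing conventions.

    import torch
    import torch.nn as nn

    def block(ci, co):
        return nn.Sequential(nn.Conv2d(ci, co, 3, padding=1), nn.ReLU())

    class FSRNetSketch(nn.Module):
        def __init__(self, n_landmarks=68, n_parsing=11):
            super().__init__()
            self.coarse_sr = block(3, 3)      # LR input -> coarse HR image
            self.fine_encoder = block(3, 64)  # image features from the coarse HR image
            self.prior_net = block(3, n_landmarks + n_parsing)  # heatmaps + parsing maps
            self.fine_decoder = block(64 + n_landmarks + n_parsing, 3)  # fuse and decode

        def forward(self, lr_image):
            coarse = self.coarse_sr(lr_image)
            feats = self.fine_encoder(coarse)
            priors = self.prior_net(coarse)  # geometry-prior branch
            return self.fine_decoder(torch.cat([feats, priors], dim=1))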

415 citations

Proceedings ArticleDOI
01 Oct 2019
TL;DR: A novel loss function, named Adaptive Wing loss, is proposed that is able to adapt its shape to different types of ground-truth heatmap pixels, penalizing errors on foreground pixels more heavily than on background pixels.
Abstract: Heatmap regression with a deep network has become one of the mainstream approaches to localizing facial landmarks. However, the loss function for heatmap regression is rarely studied. In this paper, we analyze the ideal loss function properties for heatmap regression in face alignment problems. We then propose a novel loss function, named Adaptive Wing loss, that is able to adapt its shape to different types of ground truth heatmap pixels. This adaptability penalizes errors on foreground pixels more heavily than on background pixels. To address the imbalance between foreground and background pixels, we also propose the Weighted Loss Map, which assigns high weights to foreground and difficult background pixels to help the training process focus on the pixels that are crucial to landmark localization. To further improve face alignment accuracy, we introduce boundary prediction and CoordConv with boundary coordinates. Extensive experiments on different benchmarks, including COFW, 300W and WFLW, show that our approach outperforms the state of the art by a significant margin on various evaluation metrics. Besides, the Adaptive Wing loss also helps other heatmap regression tasks.
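
The Adaptive Wing loss is compact enough to state in code. The sketch below follows the piecewise definition in the paper, with omega, theta, epsilon and alpha set to the paper's reported defaults (treat the exact values as assumptions):

    import numpy as np

    def adaptive_wing(y, y_hat, omega=14.0, theta=0.5, epsilon=1.0, alpha=2.1):
        """Elementwise Adaptive Wing loss; y, y_hat are heatmaps in [0, 1]."""
        d = np.abs(y - y_hat)
        p = alpha - y  # the exponent adapts to the ground-truth pixel value
        a = omega * (1.0 / (1.0 + (theta / epsilon) ** p)) * p * \
            ((theta / epsilon) ** (p - 1.0)) / epsilon
        c = theta * a - omega * np.log1p((theta / epsilon) ** p)
        small = omega * np.log1p((d / epsilon) ** p)  # nonlinear region near the target
        large = a * d - c                             # linear region for large errors
        return np.where(d < theta, small, large)

Since the exponent alpha - y shrinks toward 1 on foreground pixels (where y is near 1), small errors there keep a strong gradient, which is the adaptivity the abstract describes.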

183 citations


Cites background from "Deep Alignment Network: A Convoluti..."

  • ...generated from landmark coordinates, is widely used for face alignment [5, 29, 64, 50]....


Proceedings ArticleDOI
01 Jul 2017
TL;DR: A new benchmark for facial landmark localisation that, contrary to previous benchmarks, contains facial images in both (nearly) frontal and profile pose (annotated with a different markup of facial landmarks).
Abstract: In this paper, we present a new benchmark (the Menpo benchmark) for facial landmark localisation and summarise the results of the recent competition, the so-called Menpo Challenge, run in conjunction with CVPR 2017. The Menpo benchmark, contrary to previous benchmarks such as 300-W and 300-VW, contains facial images both in (nearly) frontal and in profile pose (annotated with a different markup of facial landmarks). Furthermore, we considerably increase the number of annotated images so that deep learning algorithms can be robustly applied to the problem. The results of the Menpo Challenge demonstrate that recent deep learning architectures, when trained with an abundance of data, lead to excellent results. Finally, we discuss directions for future benchmarks in the topic.

165 citations


Cites methods from "Deep Alignment Network: A Convoluti..."

  • ...Kowalski: The method in [27] used a VGG-based alignment network to correct similarity transforms and then a fully-convolutional network that regressed to the final shape....


Book ChapterDOI
23 Jul 2020
TL;DR: COCO-WholeBody as discussed by the authors extends the COCO dataset with whole-body annotations, including 133 dense landmarks: 68 on the face, 42 on the hands and 23 on the body and feet.
Abstract: This paper investigates the task of 2D human whole-body pose estimation, which aims to localize dense landmarks on the entire human body including face, hands, body, and feet. As existing datasets do not have whole-body annotations, previous methods have to assemble different deep models trained independently on different datasets of the human face, hand, and body, struggling with dataset biases and large model complexity. To fill this gap, we introduce COCO-WholeBody, which extends the COCO dataset with whole-body annotations. To the best of our knowledge, it is the first benchmark that has manual annotations on the entire human body, including 133 dense landmarks with 68 on the face, 42 on the hands and 23 on the body and feet. A single-network model, named ZoomNet, is devised to take into account the hierarchical structure of the full human body to solve the scale variation of different body parts of the same person. ZoomNet is able to significantly outperform existing methods on the proposed COCO-WholeBody dataset. Extensive experiments show that COCO-WholeBody not only can be used to train deep models from scratch for whole-body pose estimation but also can serve as a powerful pre-training dataset for many different tasks such as facial landmark detection and hand keypoint estimation. The dataset is publicly available at https://github.com/jin-s13/COCO-WholeBody.

116 citations

References
Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
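
The update rule is short enough to state directly. A minimal sketch of one Adam step in NumPy, using the paper's suggested hyper-parameter defaults:

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update; (m, v) are running moment estimates, t counts from 1."""
        m = beta1 * m + (1 - beta1) * grad          # biased first-moment estimate
        v = beta2 * v + (1 - beta2) * grad ** 2     # biased second-moment estimate
        m_hat = m / (1 - beta1 ** t)                # bias correction for zero init
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

The bias-correction terms compensate for the zero initialization of the moment estimates, which is what makes the early steps well scaled.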

111,197 citations


"Deep Alignment Network: A Convoluti..." refers methods in this paper

  • ...For optimization we use Adam stochastic optimization [15] with an initial step size of 0....


Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
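
The pattern the abstract describes, stacks of very small 3x3 convolutions separated by 2x2 max-pooling, is easy to sketch. The block below is illustrative (a truncated VGG-16-like configuration in PyTorch, not the released model):

    import torch.nn as nn

    def vgg_block(in_ch, out_ch, n_convs):
        layers = []
        for i in range(n_convs):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.MaxPool2d(2))  # halve the spatial resolution between stages
        return nn.Sequential(*layers)

    # First three stages of a VGG-16-like feature extractor.
    features = nn.Sequential(vgg_block(3, 64, 2),
                             vgg_block(64, 128, 2),
                             vgg_block(128, 256, 3))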

55,235 citations


"Deep Alignment Network: A Convoluti..." refers methods in this paper

  • ...The overall shape of the feed-forward network was inspired by the network used in [26] for the ImageNet ILSVRC 2014 competition....


Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
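
For reference, a minimal feature-matching sketch using OpenCV's SIFT implementation with Lowe's ratio test (cv2.SIFT_create needs OpenCV 4.4 or newer; the image paths are placeholders, and the Hough clustering and least-squares pose verification stages of the full pipeline are omitted):

    import cv2

    img1 = cv2.imread('view1.png', cv2.IMREAD_GRAYSCALE)  # placeholder paths
    img2 = cv2.imread('view2.png', cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Nearest-neighbor matching with Lowe's ratio test to reject ambiguous matches.
    matcher = cv2.BFMatcher()
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < 0.75 * n.distance]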

46,906 citations


"Deep Alignment Network: A Convoluti..." refers methods in this paper

  • ...variety of CSR-based methods introduced in the literature lie in the choice of the feature extraction method φ and the regression method r_t. For instance, the Supervised Descent Method (SDM) [32] uses SIFT [19] features and a simple linear regressor. LBF [21] takes advantage of sparse features generated from binary trees and intensity differences of individual pixels. LBF uses Support Vector Regression [9]...

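
The cascade this excerpt refers to reduces to a simple loop: each stage extracts features around the current shape estimate and adds a regressed correction, S_{t+1} = S_t + r_t(φ(I, S_t)). A schematic sketch in which extract_features and regressors are placeholders for the φ and r_t named in the excerpt:

    def cascaded_shape_regression(image, initial_shape, extract_features, regressors):
        """Generic CSR loop: S_{t+1} = S_t + r_t(phi(I, S_t))."""
        shape = initial_shape
        for r_t in regressors:                         # one learned regressor per stage
            features = extract_features(image, shape)  # phi: e.g. SIFT patches at landmarks
            shape = shape + r_t(features)              # additive shape correction
        return shape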

Journal Article
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
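
The mechanism is a few lines in its common "inverted dropout" form, sketched below. Note that the paper's original formulation instead scales the weights at test time; the two are equivalent in expectation.

    import numpy as np

    def dropout(x, p_drop=0.5, training=True, rng=None):
        if not training or p_drop == 0.0:
            return x  # test time: a single unthinned network, no rescaling needed
        if rng is None:
            rng = np.random.default_rng()
        mask = rng.random(x.shape) >= p_drop  # keep each unit with probability 1 - p_drop
        return x * mask / (1.0 - p_drop)      # rescale so expectations match test time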

33,597 citations


"Deep Alignment Network: A Convoluti..." refers background in this paper

  • ...A dropout [27] layer is added before the first fully connected layer....


Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
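
A minimal extraction example using scikit-image's HOG implementation, with parameters mirroring the commonly used Dalal-Triggs configuration (9 orientation bins, 8x8-pixel cells, 2x2-cell blocks; channel_axis requires scikit-image 0.19 or newer):

    from skimage import data
    from skimage.feature import hog

    image = data.astronaut()  # bundled sample image
    descriptor = hog(image, orientations=9, pixels_per_cell=(8, 8),
                     cells_per_block=(2, 2), block_norm='L2-Hys',
                     channel_axis=-1)  # flattened block-normalized histogram vector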

31,952 citations


"Deep Alignment Network: A Convoluti..." refers methods in this paper


  • ...Said method extracts HOG [8] features at each of the landmarks and uses a linear model to estimate the error....
