
Showing papers by "Stefanos Zafeiriou" published in 2016


Proceedings ArticleDOI
20 Mar 2016
TL;DR: This paper proposes a solution to the problem of ‘context-aware’ emotional relevant feature extraction, by combining Convolutional Neural Networks (CNNs) with LSTM networks, in order to automatically learn the best representation of the speech signal directly from the raw time representation.
Abstract: The automatic recognition of spontaneous emotions from speech is a challenging task. On the one hand, acoustic features need to be robust enough to capture the emotional content for various styles of speaking, while on the other, machine learning algorithms need to be insensitive to outliers while being able to model the context. Whereas the latter has been tackled by the use of Long Short-Term Memory (LSTM) networks, the former is still under very active investigation, even though more than a decade of research has provided a large set of acoustic descriptors. In this paper, we propose a solution to the problem of ‘context-aware’ emotional relevant feature extraction, by combining Convolutional Neural Networks (CNNs) with LSTM networks, in order to automatically learn the best representation of the speech signal directly from the raw time representation. In this novel work on the so-called end-to-end speech emotion recognition, we show that the use of the proposed topology significantly outperforms the traditional approaches based on signal processing techniques for the prediction of spontaneous and natural emotions on the RECOLA database.

785 citations


Journal ArticleDOI
TL;DR: This paper proposes a semi-automatic annotation technique that was employed to re-annotate most existing facial databases under a unified protocol, and presents the 300 Faces In-The-Wild Challenge (300-W), the first facial landmark localization challenge that was organized twice, in 2013 and 2015.

672 citations


Proceedings ArticleDOI
27 Jun 2016
TL;DR: This paper proposes a combined and jointly trained convolutional recurrent neural network architecture that allows the training of an end-to-end system that attempts to alleviate the drawbacks of cascaded regression.
Abstract: Cascaded regression has recently become the method of choice for solving non-linear least squares problems such as deformable image alignment. Given a sizeable training set, cascaded regression learns a set of generic rules that are sequentially applied to minimise the least squares problem. Despite the success of cascaded regression for problems such as face alignment and head pose estimation, there are several shortcomings arising in the strategies proposed thus far. Specifically, (a) the regressors are learnt independently, (b) the descent directions may cancel one another out and (c) handcrafted features (e.g., HoGs, SIFT etc.) are mainly used to drive the cascade, which may be sub-optimal for the task at hand. In this paper, we propose a combined and jointly trained convolutional recurrent neural network architecture that allows the training of an end-to-end system that attempts to alleviate the aforementioned drawbacks. The recurrent module facilitates the joint optimisation of the regressors by assuming the cascades form a nonlinear dynamical system, in effect fully utilising the information between all cascade levels by introducing a memory unit that shares information across all levels. The convolutional module allows the network to extract features that are specialised for the task at hand and are experimentally shown to outperform hand-crafted features. We show that the application of the proposed architecture for the problem of face alignment results in a strong improvement over the current state-of-the-art.

387 citations


Proceedings ArticleDOI
26 Jun 2016
TL;DR: The presented extensive qualitative and quantitative evaluations reveal that the proposed 3DMM achieves state-of-the-art results, outperforming existing models by a large margin.
Abstract: We present Large Scale Facial Model (LSFM) — a 3D Morphable Model (3DMM) automatically constructed from 9,663 distinct facial identities. To the best of our knowledge LSFM is the largest-scale Morphable Model ever constructed, containing statistical information from a huge variety of the human population. To build such a large model we introduce a novel fully automated and robust Morphable Model construction pipeline. The dataset that LSFM is trained on includes rich demographic information about each subject, allowing for the construction of not only a global 3DMM but also models tailored for specific age, gender or ethnicity groups. As an application example, we utilise the proposed model to perform age classification from 3D shape alone. Furthermore, we perform a systematic analysis of the constructed 3DMMs that showcases their quality and descriptive power. The presented extensive qualitative and quantitative evaluations reveal that the proposed 3DMM achieves state-of-the-art results, outperforming existing models by a large margin. Finally, for the benefit of the research community, we make publicly available the source code of the proposed automatic 3DMM construction pipeline. In addition, the constructed global 3DMM and a variety of bespoke models tailored by age, gender and ethnicity are available on application to researchers involved in medically oriented research.

356 citations


Posted Content
TL;DR: It is shown that, without using prior domain knowledge, a CNN can automatically learn to distinguish among different normal sleep stages, and these results are comparable to state-of-the-art methods with hand-engineered features.
Abstract: We used convolutional neural networks (CNNs) for automatic sleep stage scoring based on single-channel electroencephalography (EEG) to learn task-specific filters for classification without using prior domain knowledge. We used an openly available dataset from 20 healthy young adults for evaluation and applied 20-fold cross-validation. We used class-balanced random sampling within the stochastic gradient descent (SGD) optimization of the CNN to avoid skewed performance in favor of the most represented sleep stages. We achieved high mean F1-score (81%, range 79-83%), mean accuracy across individual sleep stages (82%, range 80-84%) and overall accuracy (74%, range 71-76%) over all subjects. By analyzing and visualizing the filters that our CNN learns, we found that rules learned by the filters correspond to sleep scoring criteria in the American Academy of Sleep Medicine (AASM) manual that human experts follow. Our method's performance is balanced across classes and our results are comparable to state-of-the-art methods with hand-engineered features. We show that, without using prior domain knowledge, a CNN can automatically learn to distinguish among different normal sleep stages.

236 citations
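
The class-balanced random sampling mentioned in the sleep-stage scoring abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `class_balanced_batch`, the toy stage labels and the batch size are all assumptions made for illustration. The idea shown is simply to pick a class uniformly at random first, and only then an example of that class, so that frequent stages do not dominate each SGD mini-batch.

```python
import random
from collections import defaultdict


def class_balanced_batch(labels, batch_size, rng):
    """Return example indices drawn so that each class is equally likely per draw."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    batch = []
    for _ in range(batch_size):
        label = rng.choice(classes)                # pick a class uniformly
        batch.append(rng.choice(by_class[label]))  # then an example of that class
    return batch


# Hypothetical, heavily imbalanced toy labels: one stage dominates.
labels = ["N2"] * 900 + ["N1"] * 50 + ["REM"] * 50
rng = random.Random(0)
batch = class_balanced_batch(labels, 3000, rng)
counts = {c: sum(labels[i] == c for i in batch) for c in set(labels)}
```

With uniform sampling the dominant class would fill roughly 90% of the batch; here each class is expected to contribute about one third of the draws, at the cost of revisiting examples of the rare classes more often.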


Journal ArticleDOI
TL;DR: A method for the Robust Correlated and Individual Component Analysis (RCICA) of two sets of data in the presence of gross, sparse errors is proposed and extended in order to handle temporal incongruities arising in the data.
Abstract: Recovering correlated and individual components of two, possibly temporally misaligned, sets of data is a fundamental task in disciplines such as image, vision, and behavior computing, with application to problems such as multi-modal fusion (via correlated components), predictive analysis, and clustering (via the individual ones). Here, we study the extraction of correlated and individual components under real-world conditions, namely i) the presence of gross non-Gaussian noise and ii) temporally misaligned data. In this light, we propose a method for the Robust Correlated and Individual Component Analysis (RCICA) of two sets of data in the presence of gross, sparse errors. We furthermore extend RCICA in order to handle temporal incongruities arising in the data. To this end, two suitable optimization problems are solved. The generality of the proposed methods is demonstrated by applying them to four applications, namely i) heterogeneous face recognition, ii) multi-modal feature fusion for human behavior analysis (i.e., audio-visual prediction of interest and conflict), iii) face clustering, and iv) the temporal alignment of facial expressions. Experimental results on two synthetic and seven real-world datasets indicate the robustness and effectiveness of the proposed methods on these application domains, outperforming other state-of-the-art methods in the field.

49 citations


Proceedings ArticleDOI
14 Jun 2016
TL;DR: A new comprehensive benchmark for training methodologies, as well as assessing the performance of facial affect/behaviour analysis/understanding in-the-wild, is proposed; to the best of the authors' knowledge, this is the first time that such a benchmark for valence and arousal "in-the-wild" is presented.
Abstract: Well-established databases and benchmarks have been developed in the past 20 years for automatic facial behaviour analysis. Nevertheless, for some important problems regarding analysis of facial behaviour, such as (a) estimation of affect in a continuous dimensional space (e.g., valence and arousal) in videos displaying spontaneous facial behaviour and (b) detection of the activated facial muscles (i.e., facial action unit detection), to the best of our knowledge, well-established in-the-wild databases and benchmarks do not exist. That is, the majority of the publicly available corpora for the above tasks contain samples that have been captured in controlled recording conditions and/or captured under a very specific milieu. Arguably, in order to make further progress in automatic understanding of facial behaviour, datasets that have been captured in-the-wild and in various milieus have to be developed. In this paper, we survey the progress that has been recently made on understanding facial behaviour in-the-wild, the datasets that have been developed so far and the methodologies that have been developed, paying particular attention to deep learning techniques for the task. Finally, we make a significant step further and propose a new comprehensive benchmark for training methodologies, as well as assessing the performance of facial affect/behaviour analysis/understanding in-the-wild. To the best of our knowledge, this is the first time that such a benchmark for valence and arousal "in-the-wild" is presented.

48 citations


Book ChapterDOI
08 Oct 2016
TL;DR: This paper introduces a new, publicly available database for fine-grained material classification, consisting of over 2000 surfaces of fabrics, and shows that the fusion of normals and albedo information outperforms standard methods which rely only on the use of texture information.
Abstract: In this paper we focus on an understudied computer vision problem, particularly how the micro-geometry and the reflectance of a surface can be used to infer its material. To this end, we introduce a new, publicly available database for fine-grained material classification, consisting of over 2000 surfaces of fabrics (http://ibug.doc.ic.ac.uk/resources/fabrics.). The database has been collected using a custom-made portable but cheap and easy to assemble photometric stereo sensor. We use the normal map and the albedo of each surface to recognize its material via the use of handcrafted and learned features and various feature encodings. We also perform garment classification using the same approach. We show that the fusion of normals and albedo information outperforms standard methods which rely only on the use of texture information. Our methodologies, both for data collection and for material classification, can be applied easily to many real-world scenarios including design of new robots able to sense materials and industrial inspection.

47 citations


Proceedings ArticleDOI
01 Jun 2016
TL;DR: The Deep Canonical Time Warping (DCTW), a method which automatically learns complex non-linear representations of multiple time-series, generated such that they are highly correlated, and temporally in alignment, is presented.
Abstract: Machine learning algorithms for the analysis of time-series often depend on the assumption that the utilised data are temporally aligned. Any temporal discrepancies arising in the data are certain to lead to ill-generalisable models, which in turn fail to correctly capture the properties of the task at hand. The temporal alignment of time-series is thus a crucial challenge manifesting in a multitude of applications. Nevertheless, the vast majority of algorithms oriented towards the temporal alignment of time-series are applied directly on the observation space, or utilise simple linear projections. Thus, they fail to capture complex, hierarchical non-linear representations which may prove to be beneficial towards the task of temporal alignment, particularly when dealing with multi-modal data (e.g., aligning visual and acoustic information). To this end, we present the Deep Canonical Time Warping (DCTW), a method which automatically learns complex non-linear representations of multiple time-series, generated such that (i) they are highly correlated, and (ii) temporally in alignment. By means of experiments on four real datasets, we show that the representations learnt via the proposed DCTW significantly outperform state-of-the-art methods in temporal alignment, elegantly handling scenarios with highly heterogeneous features, such as the temporal alignment of acoustic and visual features.

45 citations
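
For context on the DCTW entry above: alignment "directly on the observation space" is what classic dynamic time warping (DTW) does. The sketch below is plain DTW on 1-D sequences, not DCTW itself (it learns no representations); the function name `dtw` and the toy sequences are assumptions chosen for illustration.

```python
def dtw(a, b):
    """Classic DTW on 1-D sequences: return (total cost, warping path)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] = minimal cost of aligning a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from the end to recover which samples were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (D[i - 1][j - 1], i - 1, j - 1),  # match both and advance
            (D[i - 1][j], i - 1, j),          # repeat b[j - 1]
            (D[i][j - 1], i, j - 1),          # repeat a[i - 1]
        ]
        _, i, j = min(steps)
    path.reverse()
    return D[n][m], path


# Toy example: b is a time-stretched copy of a.
a = [0, 1, 2, 3]
b = [0, 0, 1, 2, 2, 3]
cost, path = dtw(a, b)
```

Because the local cost compares raw sample values, this baseline has no obvious way to align sequences with different feature types (for example acoustic versus visual features), which is the setting the abstract targets with learned non-linear representations.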


Journal ArticleDOI
TL;DR: The superiority of the second-order standardized moment average pooling (2Standmap) is suggested and 2Standmap is successfully applied to four challenging tasks namely texture classification, medical image analysis, pain expression recognition, and micro-expression recognition.

41 citations


Posted Content
TL;DR: In this article, a unified and complete view of compositional gradient descent (CGD) algorithms for fitting active appearance models (AAMs) is presented, with three main characteristics: cost function, type of composition, and optimization method.
Abstract: Active Appearance Models (AAMs) are one of the most popular and well-established techniques for modeling deformable objects in computer vision. In this paper, we study the problem of fitting AAMs using Compositional Gradient Descent (CGD) algorithms. We present a unified and complete view of these algorithms and classify them with respect to three main characteristics: i) cost function; ii) type of composition; and iii) optimization method. Furthermore, we extend the previous view by: a) proposing a novel Bayesian cost function that can be interpreted as a general probabilistic formulation of the well-known project-out loss; b) introducing two new types of composition, asymmetric and bidirectional, that combine the gradients of both image and appearance model to derive better convergent and more robust CGD algorithms; and c) providing new valuable insights into existent CGD algorithms by reinterpreting them as direct applications of the Schur complement and the Wiberg method. Finally, in order to encourage open research and facilitate future comparisons with our work, we make the implementation of the algorithms studied in this paper publicly available as part of the Menpo Project.

Posted Content
TL;DR: In this paper, the authors proposed Network Fusion for Composite Community Extraction (NF-CCE), a new class of algorithms, based on four different non-negative matrix factorization models, capable of extracting composite communities in multiplex networks.
Abstract: Networks have been a general tool for representing, analyzing, and modeling relational data arising in several domains. One of the most important aspects of network analysis is community detection or network clustering. Until recently, the major focus has been on discovering community structure in single (i.e., monoplex) networks. However, with the advent of relational data with multiple modalities, multiplex networks, i.e., networks composed of multiple layers representing different aspects of relations, have emerged. Consequently, community detection in multiplex networks, i.e., detecting clusters of nodes shared by all layers, has become a new challenge. In this paper, we propose Network Fusion for Composite Community Extraction (NF-CCE), a new class of algorithms, based on four different non-negative matrix factorization models, capable of extracting composite communities in multiplex networks. Each algorithm works in two steps: first, it finds a non-negative, low-dimensional feature representation of each network layer; then, it fuses the feature representation of layers into a common non-negative, low-dimensional feature representation via collective factorization. The composite clusters are extracted from the common feature representation. We demonstrate the superior performance of our algorithms over the state-of-the-art methods on various types of multiplex networks, including biological, social, economic, citation, phone communication, and brain multiplex networks.

Journal ArticleDOI
TL;DR: A novel deterministic SFA algorithm that is able to identify linear projections that extract the common slowest varying features of two or more sequences and an expectation maximization algorithm to perform inference in a probabilistic formulation of SFA are proposed.
Abstract: A recently introduced latent feature learning technique for time-varying dynamic phenomena analysis is the so-called slow feature analysis (SFA). SFA is a deterministic component analysis technique for multidimensional sequences that, by minimizing the variance of the first-order time derivative approximation of the latent variables, finds uncorrelated projections that extract slowly varying features ordered by their temporal consistency and constancy. In this paper, we propose a number of extensions in both the deterministic and the probabilistic SFA optimization frameworks. In particular, we derive a novel deterministic SFA algorithm that is able to identify linear projections that extract the common slowest varying features of two or more sequences. In addition, we propose an expectation maximization (EM) algorithm to perform inference in a probabilistic formulation of SFA and similarly extend it in order to handle two and more time-varying data sequences. Moreover, we demonstrate that the probabilistic SFA (EM-SFA) algorithm that discovers the common slowest varying latent space of multiple sequences can be combined with dynamic time warping techniques for robust sequence time-alignment. The proposed SFA algorithms were applied for facial behavior analysis, demonstrating their usefulness and appropriateness for this task.
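
The deterministic SFA objective described in the abstract above (minimising the variance of the first-order time-derivative approximation, subject to uncorrelated unit-variance outputs) can be sketched for the standard single-sequence linear case as follows. This is the textbook formulation, not the multi-sequence or EM-based extensions proposed in the paper; the function name `linear_sfa` and the toy signals are assumptions, and the sketch assumes the data covariance is full rank.

```python
import numpy as np


def linear_sfa(x, n_components):
    """Linear SFA on a (T, d) sequence; returns (T, n_components), slowest first."""
    x = x - x.mean(axis=0)
    # Whiten so that any rotation of z has uncorrelated, unit-variance outputs.
    cov = np.cov(x, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    whitener = evecs / np.sqrt(evals)  # scale column j by 1 / sqrt(evals[j])
    z = x @ whitener
    # Covariance of the finite-difference approximation of the time derivative.
    dz = np.diff(z, axis=0)
    dcov = np.cov(dz, rowvar=False)
    _, devecs = np.linalg.eigh(dcov)   # ascending eigenvalues: slowest first
    return z @ devecs[:, :n_components]


# Toy data: two linear mixtures of a slow and a fast sinusoid.
t = np.linspace(0.0, 2.0 * np.pi, 500)
slow, fast = np.sin(t), np.sin(25.0 * t)
x = np.column_stack([slow + fast, slow - 0.5 * fast])
y = linear_sfa(x, 1)
corr = np.corrcoef(y[:, 0], slow)[0, 1]
```

On this toy input the first output should track the slow sinusoid up to sign and scale, since the sign of an eigenvector is arbitrary.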

Journal ArticleDOI
TL;DR: The proposed method has a better convergence rate than any other existing multi-level method for convex problems, and in addition has the same rate as accelerated methods, which is known to be optimal for first-order methods.
Abstract: Composite convex optimization models arise in several applications and are especially prevalent in inverse problems with a sparsity inducing norm and in general convex optimization with simple constraints. The most widely used algorithms for convex composite models are accelerated first order methods; however, they can take a large number of iterations to compute an acceptable solution for large-scale problems. In this paper we propose speeding up first order methods by taking advantage of the structure present in many applications and in image processing in particular. Our method is based on multilevel optimization methods and exploits the fact that many applications that give rise to large-scale models can be modeled using varying degrees of fidelity. We use Nesterov's acceleration techniques together with the multilevel approach to achieve an $\mathcal{O}(1/\sqrt{\epsilon})$ convergence rate, where $\epsilon$ denotes the desired accuracy. The proposed method has a better convergence rate than any other...

Proceedings ArticleDOI
01 Dec 2016
TL;DR: Two novel methods, coined as WSSNMTF (Weighted Simultaneous Symmetric Non-Negative Matrix Tri-Factorization) and NG-WSSNMTF (Natural Gradient WSSNMTF), for fusion and clustering of multi-layer graphs are proposed, which are robust with respect to missing edges and noise.
Abstract: Relational data arising in many domains can be represented by networks (or graphs) with nodes capturing entities and edges representing relationships between these entities. Community detection in networks has become one of the most important problems having a broad range of applications. Until recently, the vast majority of papers have focused on discovering community structures in a single network. However, with the emergence of multi-view network data in many real-world applications and consequently with the advent of multilayer graph representation, community detection in multi-layer graphs has become a new challenge. Multi-layer graphs provide complementary views of connectivity patterns of the same set of vertices. Fusion of the network layers is expected to achieve better clustering performance. In this paper, we propose two novel methods, coined as WSSNMTF (Weighted Simultaneous Symmetric Non-Negative Matrix Tri-Factorization) and NG-WSSNMTF (Natural Gradient WSSNMTF), for fusion and clustering of multi-layer graphs. Both methods are robust with respect to missing edges and noise. We compare the performance of the proposed methods with two baseline methods, as well as with three state-of-the-art methods on synthetic and three real-world datasets. The experimental results indicate superior performance of the proposed methods.

Proceedings ArticleDOI
01 Jun 2016
TL;DR: This paper shows for the first time, to the best of the authors' knowledge, that it is possible to construct SDMs by putting object shapes in dense correspondence, and shows that, by sampling the dense model, a part-based SDM can be learned with its parts being in correspondence.
Abstract: During the past few years we have witnessed the development of many methodologies for building and fitting Statistical Deformable Models (SDMs). The construction of accurate SDMs requires careful annotation of images with regards to a consistent set of landmarks. However, the manual annotation of a large amount of images is a tedious, laborious and expensive procedure. Furthermore, for several deformable objects, e.g. human body, it is difficult to define a consistent set of landmarks, and, thus, it becomes impossible to train humans in order to accurately annotate a collection of images. Nevertheless, for the majority of objects, it is possible to extract the shape by object segmentation or even by shape drawing. In this paper, we show for the first time, to the best of our knowledge, that it is possible to construct SDMs by putting object shapes in dense correspondence. Such SDMs can be built with much less effort for a large battery of objects. Additionally, we show that, by sampling the dense model, a part-based SDM can be learned with its parts being in correspondence. We employ our framework to develop SDMs of human arms and legs, which can be used for the segmentation of the outline of the human body, as well as to provide better and more consistent annotations for body joints.

Proceedings ArticleDOI
19 Aug 2016
TL;DR: The first attempt is proposed to combine the best of these two worlds under a unified model and report state-of-the-art performance on the most recent facial benchmark challenge.
Abstract: The two predominant families of deformable models for the task of face alignment are: (i) discriminative cascaded regression models, and (ii) generative models optimised with Gauss-Newton. Although these approaches have been found to work well in practise, they each suffer from convergence issues. Cascaded regression has no theoretical guarantee of convergence to a local minimum and thus may fail to recover the fine details of the object. Gauss-Newton optimisation is not robust to initialisations that are far from the optimal solution. In this paper, we propose the first, to the best of our knowledge, attempt to combine the best of these two worlds under a unified model and report state-of-the-art performance on the most recent facial benchmark challenge.

Posted Content
TL;DR: DenseReg learns a mapping from image pixels into a dense template grid through a fully convolutional network, using manually annotated facial landmarks to establish a dense correspondence field between a 3D object template and the input image.
Abstract: In this paper we propose to learn a mapping from image pixels into a dense template grid through a fully convolutional network. We formulate this task as a regression problem and train our network by leveraging upon manually annotated facial landmarks "in-the-wild". We use such landmarks to establish a dense correspondence field between a three-dimensional object template and the input image, which then serves as the ground-truth for training our regression system. We show that we can combine ideas from semantic segmentation with regression networks, yielding a highly-accurate "quantized regression" architecture. Our system, called DenseReg, allows us to estimate dense image-to-template correspondences in a fully convolutional manner. As such our network can provide useful correspondence information as a stand-alone system, while when used as an initialization for Statistical Deformable Models we obtain landmark localization results that largely outperform the current state-of-the-art on the challenging 300W benchmark. We thoroughly evaluate our method on a host of facial analysis tasks and also provide qualitative results for dense human body correspondence. We make our code available at this http URL along with supplementary materials.

Book ChapterDOI
01 Jan 2016
TL;DR: This chapter reviews the existing commercial affective gaming applications and introduces new gaming scenarios, outlining some of the most important problems that have to be tackled in order to create more realistic and efficient interactions between players and games and highlighting the challenges such systems must overcome.
Abstract: A typical gaming scenario, as developed in the past 20 years, involves a player interacting with a game using a specialized input device, such as a joystick, a mouse, a keyboard or a proprietary game controller. Recent technological advances have enabled the introduction of more elaborated approaches in which the player is able to interact with the game using body pose, facial expressions, actions, even physiological signals. The future lies in ‘affective gaming’, that is games that will be ‘intelligent’ enough not only to extract the player’s commands provided by speech and gestures, but also to extract behavioural cues, as well as emotional states and adjust the game narrative accordingly, in order to ensure more realistic and satisfactory player experience. In this chapter, we review the area of affective gaming by describing existing approaches and discussing recent technological advances. More precisely, we first elaborate on different sources of affect information in games and proceed with issues such as the affective evaluation of players and affective interaction in games. We summarize the existing commercial affective gaming applications and introduce new gaming scenarios. We outline some of the most important problems that have to be tackled in order to create more realistic and efficient interactions between players and games and conclude by highlighting the challenges such systems must overcome.

Journal ArticleDOI
TL;DR: This paper proposes a global similarity measure that is robust to both intensity inhomogeneities and outliers without requiring prior knowledge of the type of outliers and proposes two novel similarity measures based on the cosine of normalised 3D volumetric gradients.

Journal ArticleDOI
TL;DR: In this article, the authors performed a thorough evaluation of state-of-the-art deformable face tracking pipelines using the recently introduced 300VW benchmark and evaluated many different architectures focusing mainly on the task of on-line deformable face tracking.
Abstract: Recently, technologies such as face detection, facial landmark localisation and face recognition and verification have matured enough to provide effective and efficient solutions for imagery captured under arbitrary conditions (referred to as "in-the-wild"). This is partially attributed to the fact that comprehensive "in-the-wild" benchmarks have been developed for face detection, landmark localisation and recognition/verification. A very important technology that has not been thoroughly evaluated yet is deformable face tracking "in-the-wild". Until now, the performance has mainly been assessed qualitatively by visually assessing the result of a deformable face tracking technology on short videos. In this paper, we perform the first, to the best of our knowledge, thorough evaluation of state-of-the-art deformable face tracking pipelines using the recently introduced 300VW benchmark. We evaluate many different architectures focusing mainly on the task of on-line deformable face tracking. In particular, we compare the following general strategies: (a) generic face detection plus generic facial landmark localisation, (b) generic model free tracking plus generic facial landmark localisation, as well as (c) hybrid approaches using state-of-the-art face detection, model free tracking and facial landmark localisation technologies. Our evaluation reveals future avenues for further research on the topic.

Proceedings ArticleDOI
01 Dec 2016
TL;DR: This paper proposes a novel method for robust and automatic face progression in totally unconstrained conditions that outperforms state-of-the-art age progression methods and improves matching accuracy in a face verification protocol that includes age progression.
Abstract: It has been shown that significant age difference between a probe and gallery face image can decrease the matching accuracy. If the face images can be normalized in age, there can be a huge impact on the face verification accuracy and thus many novel applications such as matching driver's license, passport and visa images with the real person's images can be effectively implemented. Face progression can address this issue by generating a face image for a specific age. Many researchers have attempted to address this problem focusing on predicting older faces from a younger face. In this paper, we propose a novel method for robust and automatic face progression in totally unconstrained conditions. Our method takes into account that faces belonging to the same age-groups share age patterns such as wrinkles while faces across different age-groups share some common patterns such as expressions and skin colors. Given training images of K different age-groups the proposed method learns to recover K low-rank age and one low-rank common components. These extracted components from the learning phase are used to progress an input face to younger as well as older ages in bidirectional fashion. Using standard datasets, we demonstrate that the proposed progression method outperforms state-of-the-art age progression methods and also improves matching accuracy in a face verification protocol that includes age progression.


Proceedings ArticleDOI
01 Jun 2016
TL;DR: It is shown that deformable spatio-temporal alignment of faces can be performed in an unsupervised manner (i.e., without employing face trackers or building person-specific deformable models) and achieves better results than considering the problems independently.
Abstract: Typically, the problems of spatial and temporal alignment of sequences are considered disjoint. That is, in order to align two sequences, a methodology that (non)-rigidly aligns the images is first applied, followed by temporal alignment of the obtained aligned images. In this paper, we propose the first, to the best of our knowledge, methodology that can jointly spatio-temporally align two sequences, which display highly deformable texture-varying objects. We show that by treating the problems of deformable spatial and temporal alignment jointly, we achieve better results than considering the problems independently. Furthermore, we show that deformable spatio-temporal alignment of faces can be performed in an unsupervised manner (i.e., without employing face trackers or building person-specific deformable models).

Journal ArticleDOI
TL;DR: The Menpo Project provides tools for annotating images and meshes with a sparse set of fiducial markers that are useful in a variety of areas in Computer Vision and Machine Learning including object detection, deformable modelling and tracking.
Abstract: The Menpo Project [1] is a BSD-licensed set of tools and software designed to provide an end-to-end pipeline for collection and annotation of image and 3D mesh data. In particular, the Menpo Project provides tools for annotating images and meshes with a sparse set of fiducial markers that we refer to as landmarks. For example, Figure 1 shows an example of a face image that has been annotated with 68 2D landmarks. These landmarks are useful in a variety of areas in Computer Vision and Machine Learning including object detection, deformable modelling and tracking. The Menpo Project aims to enable researchers, practitioners and students to easily annotate new data sources and to investigate existing datasets. Of most interest to the Computer Vision community is the fact that the Menpo Project contains completely open source implementations of a number of state-of-the-art algorithms for face detection and deformable model building. In the Menpo Project, we are actively developing and contributing to the state-of-the-art in deformable modelling [2], [3], [4], [5]. Characteristic examples of widely used state-of-the-art deformable model algorithms are Active Appearance Models [6], [7], Constrained Local Models [8], [9] and Supervised Descent Method [10]. However, there is still a noteworthy lack of high quality open source software in this area. Most existing packages are encrypted, compiled, non-maintained, partly documented, badly structured or difficult to modify. This makes them unsuitable for adoption in cutting edge scientific research. Consequently, research becomes even more difficult since performing a fair comparison between existing methods is, in most cases, infeasible. For this reason, we believe the Menpo Project represents an important contribution towards open science in the area of deformable modelling.
We also believe it is important for deformable modelling to move beyond the established area of facial annotations and to extend to a wide variety of deformable object classes. We hope Menpo can accelerate this progress by providing all of our tools completely free and permissively licensed.