What are the future work guidelines for the proposed approach?

Future work guidelines include devising novel feature representations, as an alternative to chrominance signals, to further improve robustness to varying illumination conditions, as well as exploring the feasibility of combining the predicted HR measurements with visual features for spontaneous emotion classification.

How many subjects participated in the experiment?

The MAHNOB-HCI dataset contains 27 subjects (12 males and 15 females) in total, and each subject participated in two experiments: (i) emotion elicitation and (ii) implicit tagging.

What is the optimal value for M?

The optimal value for $M$ is obtained from the following optimisation problem:

$$\min_{M} \; \|M \circ (F - C)\|_F^2 - \beta\|M\|_1 + \mu\|M - \tilde{M}\|_F^2, \tag{8}$$

which can be rewritten independently for each entry of $M$:

$$\min_{m_{rt} \in \{0,1\}} \; (f_{rt} - c_{rt})^2\, m_{rt} + \mu\,(m_{rt} - \tilde{m}_{rt})^2 - \beta\, m_{rt}.$$

Why does the proposed SAMC achieve higher accuracy than the state-of-the-art?

On this difficult dataset, due to its capacity to select the most reliable chrominance features and ignore the noisy ones, the proposed SAMC achieves significantly higher accuracy than the state-of-the-art.

(Open Access) Self-Adaptive Matrix Completion for Heart Rate Estimation from Face Videos under Realistic Conditions (2016) | Sergey Tulyakov

Q: What is the underlying reason of the chrominance features?

Minimizing the rank is an NP-hard problem, and traditionally a convex surrogate of the rank, the nuclear norm, is used [8]:

$$\min_{E} \; \nu\|E\|_* + \|E - C\|_F^2. \tag{1}$$

Another intrinsic property of the chrominance features is that, since the underlying reason of their oscillation is the internal functioning of the heart, the estimated chrominance features (those of the low-rank estimated matrix) should be enforced to be within the heart-rate's frequency range.

Q: How is the low-rank estimated matrix grouped?

On the one hand, since matrix completion problems are usually approached by reducing the matrix rank, the low-rank estimated matrix naturally groups the rows by their linear dependency.

Q: What are the limitations of the method?

In this work, the authors address the aforementioned limitations by proposing a novel method capable of predicting HR with higher accuracy than the state-of-the-art approaches and of robustly operating on short time sequences in order to detect the instantaneous HR.

Self-Adaptive Matrix Completion for Heart Rate Estimation from Face Videos under Realistic Conditions

Sergey Tulyakov (1), Xavier Alameda-Pineda (1), Elisa Ricci (2,3), Lijun Yin (4), Jeffrey F. Cohn (5,6), Nicu Sebe (1)

(1) University of Trento, Via Sommarive 9, 38123 Trento, Italy
(2) Fondazione Bruno Kessler, Via Sommarive 18, 38123 Trento, Italy
(3) University of Perugia, Via Duranti 93, 06123, Perugia, Italy
(4) State University of New York at Binghamton, Binghamton, NY 13902, USA
(5) Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
(6) Department of Psychology, University of Pittsburgh, Pittsburgh, PA 15260, USA

{sergey.tulyakov,xavier.alamedapineda,niculae.sebe}@unitn.it, eliricci@fbk.eu, lijun@cs.binghamton.edu, jeffcohn@pitt.edu

Abstract

Recent studies in computer vision have shown that, while practically invisible to a human observer, skin color changes due to blood flow can be captured on face videos and, surprisingly, be used to estimate the heart rate (HR). While considerable progress has been made in the last few years, many issues remain open. In particular, state-of-the-art approaches are not robust enough to operate in natural conditions (e.g. in case of spontaneous movements, facial expressions, or illumination changes). In contrast to previous approaches that estimate the HR by processing all the skin pixels inside a fixed region of interest, we introduce a strategy to dynamically select face regions useful for robust HR estimation. Our approach, inspired by recent advances in matrix completion theory, allows us to predict the HR while simultaneously discovering the best regions of the face to be used for estimation. Thorough experimental evaluation conducted on public benchmarks suggests that the proposed approach significantly outperforms state-of-the-art HR estimation methods in naturalistic conditions.

1. Introduction

After it was shown in [23, 18] that changes invisible to the naked eye can be used to estimate the heart rate from a video of human skin, this topic has attracted a lot of attention in the computer vision community. These subtle changes encompass both color [27] and motion [4] and they are induced by the internal functioning of the heart. Since faces appear frequently in videos and due to recent and significant improvements in face tracking and alignment methods [3, 21, 13, 14, 29], facial-based remote heart rate estimation has recently become very popular [17, 30, 10, 25].

Figure 1. Motivation: Given a video sequence, automatic HR estimation from facial features is challenging due to target motion and facial expressions. Facial features extracted over time in different parts of the face (purple rectangles) show different temporal dynamics and are subject to noise, as they are heavily affected by movements and illumination changes. In this paper, we propose a novel approach to simultaneously estimate the HR signal and select the reliable face regions at each time for robust HR prediction.

Classical approaches successfully addressed this problem under laboratory-controlled conditions, i.e. imposing constraints on the subject's movements and requiring the absence of facial expressions and mimics [18, 27, 4]. Therefore, such methods may not be suitable for real world applications, such as monitoring drivers inside a vehicle or people exercising. Long-time analysis constitutes a further limitation of existing works [17, 18, 19]. Indeed, instead of estimating the instantaneous heart rate, they provide the average HR measurement over a long video sequence. The main disadvantage of using a long analysis window is the inability to capture interesting short-time phenomena, such as a sudden HR increase/decrease due to specific emotions [22].

In practice, another problem faced by researchers developing automatic HR measurement approaches is the lack of publicly available datasets recorded under realistic conditions. A notable exception is the MAHNOB-HCI dataset [20], a multimodal dataset for research on emotion recognition and implicit tagging, which also contains HR annotations. Importantly, an extensive evaluation of existing HR measurement methods on MAHNOB-HCI has been performed by Li et al. [17]. However, the MAHNOB-HCI dataset suffers from some limitations, since the recording conditions are quite controlled: most of the video sequences do not contain spontaneous facial expressions, illumination changes or large target movements [17].

In this work, we tackle the aforementioned problems by introducing a novel approach for HR estimation from face videos and providing an extensive evaluation on two datasets: the MAHNOB-HCI, previously used for HR recognition research [17], and a spontaneous dataset with heart rate data and RGB videos (named MMSE-HR), which is a subset of the larger multimodal spontaneous emotion corpus (MMSE) [31] specifically targeted to challenge HR estimation methods.

Inspired by previous methods, we track the face in a given video sequence, so as to follow rigid head movements [17], and extract chrominance features [10] to compensate for illumination variations. Importantly, most previous approaches preselect a face region of interest (ROI) that is kept constant through the entire HR estimation. However, the region containing useful features for HR estimation is a priori different for every frame since major appearance changes are spatially and temporally localized (Fig. 1). Therefore, we propose a principled data-driven approach to automatically detect the face parts useful for HR measurement, that is, to estimate the time-varying mask of useful observations, selecting at each frame the relevant face regions from the chrominance features themselves.

Recent advances in matrix completion (MC) theory [11] have shown the ability to recover missing entries of a matrix that is partially observed, i.e. masked. To the best of the authors' knowledge, we propose the first matrix completion-based learning algorithm able to self-adapt, that is, to automatically select the useful observations, and call it self-adaptive matrix completion (SAMC). Intuitively, while learning the mask allows us to discard those face regions strongly affected by facial expressions or large movements, completing the matrix smooths out the smaller noise associated with the chrominance feature extraction procedure. The experiments we conducted on the MAHNOB-HCI dataset clearly show that our method outperforms the state-of-the-art approaches for HR prediction. To further demonstrate the ability of our method to operate in challenging scenarios, we report a series of tests on the MMSE-HR dataset, where subjects show significant movements and facial expressions.

Thus, the contribution of this paper is three-fold:

• We present a novel approach to address the problem of HR estimation from face videos in realistic conditions. To cope with large facial variations due to spontaneous facial expressions and movements, we propose a principled framework to automatically discard the face regions corresponding to noisy features and only use the reliable ones for HR prediction. The region selection is addressed within a novel matrix completion-based optimization framework, called self-adaptive matrix completion, for which an efficient solver is proposed.

• Our approach is demonstrated to be more accurate than previous methods for average HR estimation on publicly available benchmarks. In addition, we report short-term analysis results to show the ability of our method to detect instantaneous heart rate.

• We perform extensive evaluation on the commonly used MAHNOB-HCI dataset and a spontaneous MMSE-HR dataset including 102 sequences of 40 subjects, moving and performing spontaneous facial expressions. As we show, this dataset is valuable for instantaneous HR estimation.

2. Related Work

In this section, we briefly review previous works on remote heart rate measurement and on matrix completion.

2.1. HR Estimation from Face Videos

Cardiac activity measurement is an essential tool to monitor the subjects' health and is actively used by medical practitioners. Conventional contact methods measure the cardiac cycle with high accuracy. However, they require specific sensors to be attached to the human skin, be it a set of electrocardiogram (ECG) leads, a pulse oximeter, or the more recent fitness tracker. To avoid the use of invasive sensors, non-contact remote HR measurement from visual data has been proposed recently by computer vision researchers.

Verkruysse et al. [23] showed that ambient light and a consumer camera can be used to reveal the cardio-vascular pulse wave and to remotely analyze the vital signs of a person. Poh et al. [18] proposed to use blind source separation on color changes caused by heart activity to extract the HR signal from a face video. In [27] an Eulerian magnification method is used to amplify subtle changes in a video stream and to visualize temporal dynamics of the blood flow. Balakrishnan et al. [4] showed that subtle head motions are affected by cardiac activity, and these motions can be used to extract HR measurements from a video stream.

However, all these methods failed to address the problems of HR estimation in the presence of facial expressions and subject's movements, despite their frequent presence in real-world applications. This limits the use of these approaches to laboratory settings. In [10, 25] a chrominance-based method to relax motion constraints was introduced. However, this approach was tested on a few sequences that are not publicly available, making comparison difficult.

Li et al. [17] proposed an approach based on adaptive filtering to handle illumination and motion issues and they evaluated it on the publicly available MAHNOB-HCI dataset [20]. However, although this work represents a valuable step towards remote HR measurement from visual data, it also shares several major limitations with the previous methods. The output of the method is the average HR, whereas to capture short-term phenomena (e.g. HR variations due to instantaneous emotions) the processing of smaller time intervals is required. A further limitation of [17] is the MAHNOB-HCI dataset itself, since it is collected in a laboratory setting and the subjects are required to wear an invasive EEG measuring device on their head. Additionally, subjects perform neither large movements nor many spontaneous facial expressions.

In this work, we address the aforementioned limitations by proposing a novel method capable of predicting HR with higher accuracy than the state-of-the-art approaches and of robustly operating on short time sequences in order to detect the instantaneous HR. To our knowledge, while previous works [17, 25] have acknowledged the importance of selecting parts of the signal to cope with noise and provide robust HR estimates, this paper is the first to tackle this problem within a principled optimization framework.

2.2. Matrix completion

Matrix completion [11] approaches develop from the idea that an unknown low-rank matrix can be recovered from a small set of entries. This is done by solving an optimization problem, namely, a rank minimization problem subject to some data constraints arising from the small set of entries. Matrix completion has proved successful for many computer vision tasks, when data and labels are noisy or in the case of missing data, such as multi-label image classification [6], image retrieval and tagging [28, 9], manifold correspondence finding [16], head/body pose estimation [1] and emotion recognition from abstract paintings [2]. Most of these works extended the original MC framework by imposing task-specific constraints. For instance, in [9] an MC problem is formulated adding a specific regularizer to address the ambiguous labeling problem. Very importantly, even if most computer-vision papers based on matrix completion address classification tasks, therefore splitting the matrix to be completed between features and labels, MC techniques can be used in general, without any structural splitting. Indeed, in [15] matrix completion is adopted to address the movie recommendation problem, where each column (row) represents a user (movie), and therefore each entry of the matrix shows the suitability of a video for a user. In [16, 15], the MC problem is extended to take into account an underlying graph structure inducing a weighted relationship between the columns/rows of the matrix. In this paper, we were inspired by [16, 15, 1] in modeling the temporal smoothness of the HR signal. However, our method is essentially novel, since we are able to simultaneously recover the unknown low-rank matrix and the underlying data mask, corresponding to the most reliable observations.

3. HR Estimation using SAMC

In this section we describe the proposed approach for HR estimation from face videos, which has four main phases as shown in Figure 2. Phase 1 is devoted to processing face images so as to extract face regions, which are used in phase 2 to compute chrominance features. Phase 3 consists of the joint estimation of the underlying low-rank feature matrix and the mask using SAMC. Finally, phase 4 computes the heart rate from the signal estimate provided by SAMC.
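To make phase 4 concrete, the following minimal sketch estimates the HR as the peak of an averaged periodogram, in the spirit of the Welch-style power spectral density estimation mentioned for Figure 2. The function names, the segment length, the 50% overlap and the 0.7 to 4 Hz search band are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def welch_psd(x, fps, seg_len):
    """Average the periodograms of half-overlapping, Hann-windowed segments."""
    x = np.asarray(x, dtype=float)
    step = seg_len // 2                       # 50% overlap (assumed)
    window = np.hanning(seg_len)
    periodograms = []
    for start in range(0, len(x) - seg_len + 1, step):
        seg = x[start:start + seg_len]
        seg = (seg - seg.mean()) * window     # remove the mean, then taper
        periodograms.append(np.abs(np.fft.rfft(seg)) ** 2)
    freqs = np.fft.rfftfreq(seg_len, d=1.0 / fps)
    return freqs, np.mean(periodograms, axis=0)

def estimate_hr_bpm(x, fps, seg_len=300, f_lo=0.7, f_hi=4.0):
    """Return the strongest frequency inside [f_lo, f_hi] Hz, in beats per minute."""
    freqs, psd = welch_psd(x, fps, seg_len)
    band = (freqs >= f_lo) & (freqs <= f_hi)  # hypothetical plausible-HR band
    return 60.0 * freqs[band][np.argmax(psd[band])]

# Synthetic example: a 1.2 Hz pulse plus a slow 0.2 Hz drift, sampled at 30 fps.
fps = 30.0
t = np.arange(600) / fps
signal = np.sin(2 * np.pi * 1.2 * t) + 0.3 * np.sin(2 * np.pi * 0.2 * t)
hr = estimate_hr_bpm(signal, fps)
print(f"estimated HR: {hr:.1f} bpm")
```

The frequency resolution is `fps / seg_len`, so shorter segments give smoother but coarser spectra, which matters when short windows are used for instantaneous HR.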

3.1. Phases 1 & 2: From Face Videos to Chrominance Features

Inspired by previous methods on remote HR estimation, we use Intraface to localize and track 66 facial landmarks. Many approaches have been employed for face frontalisation [24, 12]. However, in order to preserve the underlying blood flow signal, in the current study we define the facial region of interest (see Fig. 2-Phase 1), from which the HR will be estimated. The potential ROI is then warped to a rectangle using a piece-wise linear warping procedure, before dividing the potential ROI into a grid containing R regions.

Footnote (Intraface): http://www.humansensing.cs.cmu.edu/intraface

Figure 2. Overview of the proposed approach for HR estimation. During the first phase, we automatically detect a set of facial keypoints and use them to define a ROI. This region is then warped to a rectangular area and divided into a grid. For each small sub-region, chrominance features are computed (Phase 2). We then apply SAMC on the matrix of all feature observations to recover a smooth signal, while selecting from which sub-regions the signal is recovered (Phase 3). Welch's method [26] is used to estimate the power spectral density and thus the HR frequency (Phase 4). Panels: 1. Face Region Extraction (ROI extraction, ROI warping); 2. Feature Extraction (Region 1, Region 2, ..., Region R); 3. Self-Adaptive Matrix Completion (observation matrix, prior mask, low-rank matrix, estimated mask); 4. Heart Rate Estimation (power spectral density of the signal estimated using SAMC, magnitude versus frequency in Hz, with the HR frequency marked).

The overall performance of the HR estimation method will strongly depend on the features extracted on each of the R sub-regions of the facial ROI. Ideally, we would select features that are robust to facial movements and expressions, while being discriminant enough to account for the subtle changes in skin color. Currently, the best features for HR estimation are the chrominance features, defined in [10]. The chrominance features for HR estimation are derived from the RGB channels, as follows. For each pixel the chrominance signal $C$ is computed as the linear combination of two signals $X_f$ and $Y_f$, i.e. $C = X_f - \alpha Y_f$, where $\alpha = \sigma(X_f)/\sigma(Y_f)$ and $\sigma(X_f)$, $\sigma(Y_f)$ denote the standard deviations of $X_f$, $Y_f$. The signals $X_f$, $Y_f$ are band-pass filtered signals obtained respectively from the signals $X$ and $Y$, where $X = 3R_n - 2G_n$, $Y = 1.5R_n + G_n - 1.5B_n$ and $R_n$, $G_n$ and $B_n$ are the normalized values of the individual color channels. The color combination coefficients to derive $X$ and $Y$ are computed using a skin-tone standardization approach (see [10] for details). For each region $r = 1, \dots, R$, the final chrominance features are computed averaging the values of the chrominance signals over all the pixels.
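The chrominance computation above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: it works on the mean RGB trace of one region rather than per pixel, normalizes each channel by its temporal mean, and uses an idealized FFT band-pass with an assumed 0.7 to 4 Hz pass band. The function names are hypothetical.

```python
import numpy as np

def bandpass_fft(x, fps, f_lo, f_hi):
    """Idealized band-pass: zero every frequency bin outside [f_lo, f_hi] Hz."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    spectrum[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def chrominance_signal(rgb, fps, f_lo=0.7, f_hi=4.0):
    """rgb: array of shape (T, 3) holding the mean R, G, B of one region per frame."""
    rgb = np.asarray(rgb, dtype=float)
    # Assumed normalization: divide each channel by its temporal mean.
    r_n, g_n, b_n = (rgb / rgb.mean(axis=0)).T
    x = 3.0 * r_n - 2.0 * g_n                 # X = 3 R_n - 2 G_n
    y = 1.5 * r_n + g_n - 1.5 * b_n           # Y = 1.5 R_n + G_n - 1.5 B_n
    x_f = bandpass_fft(x, fps, f_lo, f_hi)
    y_f = bandpass_fft(y, fps, f_lo, f_hi)
    alpha = x_f.std() / y_f.std()             # alpha = sigma(X_f) / sigma(Y_f)
    return x_f - alpha * y_f                  # C = X_f - alpha * Y_f

# Synthetic region trace: a 1.2 Hz pulse with channel-dependent amplitude.
fps = 30.0
t = np.arange(600) / fps
pulse = np.sin(2 * np.pi * 1.2 * t)
rgb = np.stack([0.6 + 0.002 * pulse, 0.4 + 0.004 * pulse, 0.3 + 0.001 * pulse], axis=1)
c = chrominance_signal(rgb, fps)
```

Stacking the resulting signal of every region as one row gives a regions-by-frames observation matrix of the kind used in the next phase.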

3.2. Phase 3: Self-Adaptive Matrix Completion

The estimation of HR from the chrominance features is challenging mainly for two reasons. Firstly, the chrominance features associated with different facial regions are not fully synchronized. In other words, even if the output signals of many regions are synchronized between them (mainstream underlying heart signal), the signal of many other regions may not be in phase with the mainstream. Secondly, face movements and facial expressions induce strong perturbations in the chrominance features. These perturbations are typically local in space and time while large in intensity (Fig. 1). Therefore, we need to localize where these perturbations take place so as not to use them in the HR estimation.

These two main difficulties are intuitively overcome by deriving a matrix completion technique embedding a self-adaptation strategy. On the one hand, since matrix completion problems are usually approached by reducing the matrix rank, the low-rank estimated matrix naturally groups the rows by their linear dependency. In our particular case, two rows are (near) linearly dependent if and only if the output signals they represent are synchronized. Therefore, the underlying HR signal is hypothesized to be in the vector subspace spanned by the largest group of linearly dependent rows of the estimated low-rank matrix.

On the other hand, the estimated low-rank matrix is enforced to resemble the observations. In previous MC approaches [6, 9, 1, 16], the non-observed part of the matrix consisted of the labels of the test set. Thus, the set of unknown matrix entries was fixed and known in advance. The HR estimation problem is slightly different since there are no missing observations, i.e. the matrix is fully observed. However, many of these observations are highly noisy, thus corrupting the estimation of the HR. Importantly, we do not know in advance which are the corrupted observations. This is why we believe that this problem naturally requires some form of adaptation, implying that the method selects the samples with which the learning is performed. Consequently, we name the proposed learning method self-adaptive matrix completion (SAMC).

In order to formalize the self-adaptive matrix completion problem let us assume the existence of $R$ regions where chrominance features are computed during $T$ video frames. This provides a chrominance observations matrix $C \in \mathbb{R}^{R \times T}$. Ideally, in a scenario where we could trust all region features continuously, we would simply estimate the low-rank matrix that better approximates the matrix of observations $C$, by solving $\min_{E} \nu\,\mathrm{rank}(E) + \|E - C\|_F^2$, where $\nu$ is a regularization parameter. Unfortunately, minimizing the rank is an NP-hard problem, and traditionally a convex surrogate of the rank, the nuclear norm, is used [8]:

$$\min_{E} \; \nu\|E\|_* + \|E - C\|_F^2. \tag{1}$$

Another intrinsic property of the chrominance features is that, since the underlying reason of their oscillation is the internal functioning of the heart, we should enforce the estimated chrominance features (those of the low-rank estimated matrix) to be within the heart-rate's frequency range. Inspired by [15, 16, 1] we add a temporal smoothing term by means of a Laplacian matrix $L$:

$$\min_{E} \; \nu\|E\|_* + \|E - C\|_F^2 + \gamma\,\mathrm{Tr}(E L E^\top), \tag{2}$$

where $\gamma$ measures the weight of the temporal smoothing within the learning process. $L$ should encode the relational information between the observations acquired at different instants, thus acting like a relaxed band-pass filter. Indeed, imposing that $e_r$ (the $r$-th row of $E$) is band-pass filtered is equivalent to reducing $\|e_r - e_r T\|^2 = \|e_r \tilde{T}\|^2$, where each column of $T$ is a shifted replica of the band-pass normalized filter tap values so that the product $e_r T$ boils down to a convolution and $\tilde{T}$ is a copy of $T$ with zeros in the diagonal, since the band-pass filter is normalized. Imposing this for all $R$ regions at once writes $\mathrm{Tr}(E \tilde{T} \tilde{T}^\top E^\top)$, and therefore $L = \tilde{T} \tilde{T}^\top$.
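The construction of the smoothing matrix can be checked numerically with a small sketch. The filter taps below are hypothetical (any odd-length filter whose centre tap is normalized to 1 would do), and `filter_matrix` is an illustrative helper, not a function from the paper. The sketch builds a matrix whose columns are shifted copies of the taps, zeroes its diagonal, and verifies that the trace term equals the summed per-row filtering residuals.

```python
import numpy as np

def filter_matrix(taps, n):
    """n x n matrix whose columns are shifted copies of `taps`, centred on the diagonal."""
    taps = np.asarray(taps, dtype=float)
    half = len(taps) // 2
    mat = np.zeros((n, n))
    for col in range(n):
        for k, tap in enumerate(taps):
            row = col + k - half
            if 0 <= row < n:                  # taps falling outside the matrix are dropped
                mat[row, col] = tap
    return mat

# Hypothetical filter taps; the centre tap is normalized to 1.
taps = np.array([-0.25, 0.5, 1.0, 0.5, -0.25])
n_frames = 50
T_mat = filter_matrix(taps, n_frames)
T_tilde = T_mat - np.diag(np.diag(T_mat))     # copy of T with zeros on the diagonal
L = T_tilde @ T_tilde.T                       # smoothing matrix, symmetric by construction

rng = np.random.default_rng(0)
E = rng.standard_normal((4, n_frames))        # 4 regions, 50 frames
trace_term = np.trace(E @ L @ E.T)
residual = sum(np.sum((e - e @ T_mat) ** 2) for e in E)
print(np.isclose(trace_term, residual))
```

Because the diagonal of the filter matrix is 1, the residual of each row after filtering equals, up to sign, that row multiplied by the zero-diagonal copy, which is why the two quantities agree.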

As previously discussed, the estimated matrix should not take into account the observed entries associated with large movements or spontaneous facial expressions. We model this by including a masking binary matrix $M \in \{0, 1\}^{R \times T}$ in the previous equation as [6]:

$$\min_{E} \; \nu\|E\|_* + \|M \circ (E - C)\|_F^2 + \gamma\,\mathrm{Tr}(E L E^\top), \tag{3}$$

where $\circ$ stands for the element-wise (Hadamard) product and the entries of the matrix $M$ are 1 if the corresponding entry in $C$ has to be taken into account for the HR estimation and 0 otherwise.

Importantly, while in the previous studies $M$ was known in advance, in the present study we have to estimate it. We naturally interpret this as a form of adaptation since $M$ is an observation-selection variable indicating from which observations the method should learn at each iteration. The masking matrix $M$ should select the largest possible amount of samples that provide useful information for the estimation of the HR. Moreover, when available, it would be desirable to use a prior for the mask $M$, taking real values between 0 and 1, $\tilde{M} \in [0, 1]^{R \times T}$. The complete SAMC optimization problem writes:

$$\min_{E, M} \; \nu\|E\|_* + \|M \circ (E - C)\|_F^2 + \gamma\,\mathrm{Tr}(E L E^\top) - \beta\|M\|_1 + \mu\|M - \tilde{M}\|_F^2. \tag{4}$$

The parameters $\beta$ and $\mu$ regulate respectively the number of selected observations and the importance of prior information. In this paper the prior mask $\tilde{M}$ is defined as the negative exponential of the local standard deviation of the signal. Our intuition is that, if the signal has small local standard deviation, the chrominance variation within the region is due to the heart-rate and not to head movements or facial expressions, and therefore that matrix entry should be used to estimate the HR.
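A prior mask of this kind can be sketched as below. The text only states that the prior is the negative exponential of the local standard deviation; the sliding-window length, the handling of the sequence borders and the function name `prior_mask` are assumptions introduced for illustration.

```python
import numpy as np

def prior_mask(C, window=15):
    """Return exp(-local standard deviation) for every entry of C (regions x frames)."""
    C = np.asarray(C, dtype=float)
    n_regions, n_frames = C.shape
    half = window // 2                        # assumed: centred window, truncated at borders
    prior = np.empty_like(C)
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        prior[:, t] = np.exp(-C[:, lo:hi].std(axis=1))
    return prior

# Two synthetic regions: a quiet pulse-like row and a heavily perturbed row.
rng = np.random.default_rng(0)
t = np.arange(200) / 30.0
quiet = 0.01 * np.sin(2 * np.pi * 1.2 * t)
noisy = 5.0 * rng.standard_normal(200)
M_prior = prior_mask(np.vstack([quiet, noisy]))
print(M_prior.mean(axis=1))                   # the quiet row gets a prior close to 1
```

Since the standard deviation is non-negative, every prior value lies in the interval (0, 1], which matches the stated range of the prior mask.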

3.2.1 Solving SAMC

The SAMC optimization problem in (4) is not jointly convex in $E$ and $M$. Moreover, even if the masking matrix $M$ were fixed, (4) would contain non-differentiable and differentiable terms and a direct optimization would be challenging. Instead, alternating methods have proven to be successful in solving (i) convex problems with non-differentiable terms and (ii) marginally convex problems that are not jointly convex. More precisely, we derive an optimisation solver based on the alternating direction method of multipliers (ADMM) [5]. In order to derive the associated ADMM method, we first define the augmented Lagrangian problem associated to (4):

$$\min_{E, F, M, Z} \; \nu\|E\|_* + \|M \circ (F - C)\|_F^2 + \gamma\,\mathrm{Tr}(F L F^\top) - \beta\|M\|_1 + \mu\|M - \tilde{M}\|_F^2 + \langle Z, E - F \rangle + \frac{\rho}{2}\|E - F\|_F^2, \tag{5}$$

where $F$ is defined to split the terms of (4) that depend on $E$ into those that are differentiable and those that are not. The variable $Z$ represents the Lagrange multipliers constraining $E$ to be equal to $F$, further regularized by the term $\frac{\rho}{2}\|E - F\|_F^2$. The ADMM solves the optimisation problem by alternating the direction of the optimisation while keeping the other directions fixed. Specifically, solving (5) requires alternating the following three steps until convergence:

E/M-step. With fixed $F$ and $Z$ the optimal value of $E$ is obtained by solving:

$$\min_{E} \; \nu\|E\|_* + \frac{\rho}{2}\|E - F + \rho^{-1} Z\|_F^2. \tag{6}$$

The solution of such problem is given by the shrinkage operator applied to $F - \rho^{-1} Z$, see [7]. Formally, if we write the singular value decomposition of $F - \rho^{-1} Z = U D V^\top$, the optimal value for $E$ is:

$$E^* = U\, S_{\nu/\rho}(D)\, V^\top, \tag{7}$$

where $S_\lambda(x) = \max(0, x - \lambda)$ is the soft-thresholding operator, applied element-wise to $D$ in (7).
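The shrinkage step in (6) and (7) amounts to soft-thresholding the singular values. The sketch below is a minimal illustration under the assumption that the threshold is $\nu/\rho$, as follows from dividing (6) by $\rho$; the function name `e_step` is hypothetical.

```python
import numpy as np

def e_step(F, Z, nu, rho):
    """Soft-threshold the singular values of F - Z / rho with threshold nu / rho."""
    U, d, Vt = np.linalg.svd(F - Z / rho, full_matrices=False)
    d_shrunk = np.maximum(0.0, d - nu / rho)  # S_lambda(x) = max(0, x - lambda)
    return (U * d_shrunk) @ Vt                # U diag(d_shrunk) V^T

# Small check on a diagonal matrix whose singular values are 5, 1 and 0.2.
F = np.diag([5.0, 1.0, 0.2])
Z = np.zeros_like(F)
E_star = e_step(F, Z, nu=0.5, rho=1.0)
print(np.round(E_star, 3))
```

With a threshold of 0.5 the smallest singular value is removed entirely, so the result has lower rank than the input, which is the low-rank-promoting effect of the nuclear norm.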

The optimal value for $M$ is obtained from the following optimisation problem:

$$\min_{M} \; \|M \circ (F - C)\|_F^2 - \beta\|M\|_1 + \mu\|M - \tilde{M}\|_F^2, \tag{8}$$
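Because the entries of $M$ are binary, every term of (8) is a sum over individual entries, so each entry can be chosen by comparing the cost of setting it to 1 with the cost of setting it to 0. The following sketch illustrates that comparison; it is one straightforward way to solve (8) under the binary assumption, and the function name `m_step` and the example values are hypothetical.

```python
import numpy as np

def m_step(F, C, M_prior, beta, mu):
    """Pick, for each entry, the binary value with the lower cost in problem (8)."""
    # Cost of m = 1: (f - c)^2 + mu * (1 - prior)^2 - beta
    cost_one = (F - C) ** 2 + mu * (1.0 - M_prior) ** 2 - beta
    # Cost of m = 0: mu * prior^2
    cost_zero = mu * M_prior ** 2
    return (cost_one < cost_zero).astype(float)

# One region, two frames: a small and a large disagreement between F and C.
F = np.array([[0.1, 2.0]])
C = np.zeros((1, 2))
M_prior = np.full((1, 2), 0.5)
M = m_step(F, C, M_prior, beta=1.0, mu=1.0)
print(M)
```

Entries where the current estimate disagrees strongly with the observation are dropped, while a larger `beta` keeps more observations and a larger `mu` pulls the decision towards the prior.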


References

Distributed Optimization and Statistical Learning Via the Alternating Direction Method of Multipliers

The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms

A Singular Value Thresholding Algorithm for Matrix Completion

Exact Matrix Completion via Convex Optimization

Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization
