Learning deep physiological models of affect

TL;DR: All three key components of an affective model (input, model, output) are touched upon, and the use of deep learning (DL) methodologies for affective modeling from multiple physiological signals is introduced.
Abstract: More than 15 years after the early studies in Affective Computing (AC) [1], the problem of detecting and modeling emotions in the context of human-computer interaction (HCI) remains complex and largely unexplored. The detection and modeling of emotion is, primarily, the study and use of artificial intelligence (AI) techniques for the construction of computational models of emotion. The key challenges one faces when attempting to model emotion [2] are inherent in the vague definitions and fuzzy boundaries of emotion, and in the modeling methodology followed. In this context, open research questions are still present in all key components of the modeling process. These include, first, the appropriateness of the modeling tool employed to map emotional manifestations and responses to annotated affective states; second, the processing of signals that express these manifestations (i.e., model input); and third, the way affective annotation (i.e., model output) is handled. This paper touches upon all three key components of an affective model (i.e., input, model, output) and introduces the use of deep learning (DL) [3], [4], [5] methodologies for affective modeling from multiple physiological signals.

Summary (5 min read)

Introduction

  • In particular, the computational cost of feature selection may increase combinatorially (quadratically, in the greedy case) with respect to the number of features considered [6].
  • The study compares DL against ad-hoc feature extraction on physiological signals, used broadly in the AC literature, showing that DL yields models of equal or significantly higher accuracy when a single signal is used as model input.
  • First, to the best of the authors’ knowledge, this is the first time deep learning is introduced to the domain of psychophysiology, yielding efficient computational models of affect.
  • Finally, the key findings of the paper show the potential of DL as a mechanism for eliminating manual feature extraction and even, on some occasions, bypassing automatic feature selection for affective modeling.

II. COMPUTATIONAL MODELING OF AFFECT

  • Emotions and affect are mental and bodily processes that can be inferred by a human observer from a combination of contextual, behavioral and physiological cues.
  • Part of the complexity of affect modeling emerges from the challenges of finding objective and measurable signals that carry affective information (e.g. body posture, speech and skin conductance) and designing methodologies to collect and label emotional experiences effectively (e.g. induce specific emotions by exposing participants to a set of images).
  • The signals and the affective target values collected shape the modeling task and, thus, influence the efficacy and applicability of dissimilar computational methods.
  • This section gives an overview of the field beyond the input modalities and emotion annotation protocols examined in their case study.

A. Feature Extraction

  • The most common features extracted from unidimensional continuous signals — i.e. temporal sequences of real values such as blood volume pulse, accelerometer data, or speech — are simple statistical features, such as average and standard deviation values, calculated on the time or frequency domains of the raw or the normalized signals (see [15], [16] among others).
  • More complex feature extractors inspired by signal processing methods have also been proposed by several authors.
  • Second, the tracked points are aggregated into discrete Action Units [22], gestures [23] (e.g. lip stretch or head nod) or continuous statistical features (e.g. body contraction index), which are then used to predict the affective state of the user [24].
  • Deep neural network architectures such as convolutional neural networks (CNNs), a popular technique for object recognition in images [25], have also been applied to facial expression recognition.
  • The authors expect that information relevant for prediction can be extracted more effectively using dimensionality reduction methods directly on the raw physiological signals than on a set of designer-selected extracted features.

B. Training Models of Affect

  • The selection of a method to create a model that maps a given set of features to predictions of affective variables is strongly influenced by the dynamic aspect of the features (stationary or sequential) and the format in which training examples are given (continuous values, class labels or ordinal labels).
  • On the other hand, Hidden Markov Models [41], Dynamic Bayesian Networks [42] and Recurrent Neural Networks [43] have been applied for constructing affect detectors that rely on features which change dynamically.
  • In all the above-mentioned studies, the prediction targets are either class labels or continuous values.
  • Alternatively, if ranks are used, the problem of affective modeling becomes one of preference learning.
  • These methods allow the authors to avoid binning together ordinal labels and to work with comparative questionnaires, which provide more reliable self-report data than ratings, as they generate less inconsistency and fewer order effects [12].

III. DEEP ARTIFICIAL NEURAL NETWORKS

  • To bypass the manual ad-hoc feature extraction stage, the authors use a deep model composed of (a) a multi-layer convolutional neural network (CNN) that transforms the raw signals into a reduced set of features which feed (b) a single-layer perceptron (SLP) that predicts affective states (see Fig. 1).
  • The authors' hypothesis is that the automation of feature extraction via deep learning will yield physiological affect detectors of higher predictive power, which, in turn, will deliver affective models of higher accuracy.
  • To train the convolutional neural network (see Section III-A) the authors use denoising auto-encoders [56], an unsupervised learning method that trains filters (feature extractors) which transform the input signal (see Section III-B) so as to capture a distributed representation of its leading factors of variation, without the linearity assumption of PCA.
  • In the case study examined in this paper, target values are given as pairwise comparisons (partial orders of length 2) making error functions commonly used with gradient descent methods, such as the difference of squared errors or cross-entropy, unsuitable for the task.
  • For that purpose, the authors use the rank margin error function for preference data [58], [59] as detailed in Section III-C below.

A. Convolutional Neural Networks

  • Convolutional or time-delay neural networks [25] are hierarchical models that alternate convolutional and pooling layers (see Fig. 1) in order to process large input spaces in which a spatial or temporal relation among the inputs exists (e.g. images, speech or physiological signals).
  • Each neuron scans the input sequentially, assessing at each patch location the similarity to the pattern encoded in its weights.
  • The output of the convolutional layer is the set of feature maps resulting from convolving each of the neurons across the input.
  • As soon as feature maps have been generated, a pooling layer aggregates consecutive values of the feature maps resulting from the previous convolutional layer, reducing their resolution with a pooling function (see Fig. 3).
  • The maximum or average values are the two most commonly used pooling functions, providing max-pooling and average-pooling layers, respectively (a code sketch of both layer types follows this list).
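
A minimal Python/NumPy sketch of these two layer types, under illustrative assumptions (layer sizes, random weights, and a logistic activation are placeholders, not the authors' exact configuration):

    import numpy as np

    def conv1d_layer(x, weights, biases):
        """x: (signal_len,); weights: (n_neurons, patch_len); biases: (n_neurons,)."""
        n_neurons, patch = weights.shape
        n_out = len(x) - patch + 1                # one output per patch position
        maps = np.empty((n_neurons, n_out))
        for t in range(n_out):
            p = x[t:t + patch]
            p = p - p.mean()                      # zero-mean patch, insensitive to baseline
            maps[:, t] = 1.0 / (1.0 + np.exp(-(weights @ p + biases)))
        return maps                               # one feature map per neuron

    def avg_pool(maps, window):
        """Aggregate consecutive feature-map values, reducing their resolution."""
        n = maps.shape[1] // window
        return maps[:, :n * window].reshape(maps.shape[0], n, window).mean(axis=2)

    x = np.random.randn(200)                      # e.g. a window of skin conductance
    maps = conv1d_layer(x, np.random.randn(3, 20), np.zeros(3))  # 3 maps, length 181
    pooled = avg_pool(maps, 3)                    # 3 maps, length 60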

B. Auto-encoders

  • An auto-encoder (AE) [60], [8], [21] is a model that transforms an input space into a new distributed representation (extracted features) by applying a deterministic parametrized function (e.g. single layer of logistic neurons) called the encoder (see Fig. 4).
  • The resulting training objective is to reconstruct the original uncorrupted inputs, i.e., one minimizes the discrepancy between the outputs of the decoder and the original uncorrupted inputs.
  • Auto-encoders are among several unsupervised learning techniques that have provided remarkable improvements to gradient-descent supervised learning [4], especially when the number of labeled examples is small or in transfer settings [62].
  • ANNs that are pretrained using these techniques usually converge to more robust and accurate solutions than ANNs with randomly sampled initial weights.
  • The authors trained the filters of each convolutional layer patchwise, i.e., by considering the input at each position (one patch) in the sequence as one example (a sketch of one denoising auto-encoder update follows below).
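
A sketch of one denoising auto-encoder update, assuming tied encoder/decoder weights, masking noise, and a squared-error reconstruction objective; sizes and learning rate are illustrative, not the authors' settings:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid, lr = 20, 10, 0.05
    W = rng.normal(0.0, 0.1, (n_hid, n_in))    # encoder weights (decoder tied: W.T)
    b, c = np.zeros(n_hid), np.zeros(n_in)     # encoder and decoder biases

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def dae_step(x, noise=0.1):
        x_tilde = x * (rng.random(n_in) > noise)   # corrupt the input (masking noise)
        h = sigmoid(W @ x_tilde + b)               # encoder: distributed representation
        x_hat = sigmoid(W.T @ h + c)               # decoder: reconstruction
        err = x_hat - x                            # compare against the CLEAN input
        d_out = err * x_hat * (1.0 - x_hat)        # backprop through decoder sigmoid
        d_hid = (W @ d_out) * h * (1.0 - h)        # backprop through encoder sigmoid
        W[...] -= lr * (np.outer(d_hid, x_tilde) + np.outer(d_out, h).T)
        b[...] -= lr * d_hid
        c[...] -= lr * d_out
        return 0.5 * float(err @ err)              # reconstruction error

    for _ in range(100):                           # e.g. train on random patches
        dae_step(rng.random(n_in))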

C. Preference Deep Learning

  • The outputs of a trained CNN define a number of learned features extracted from the input signal.
  • These, in turn, may feed any function approximator or classifier that attempts to find a mapping between the input signal and a target output (i.e. affective state in their case).
  • To this aim, the authors use backpropagation [57], which optimizes an error function iteratively across a number of epochs by adjusting the weights of the SLP proportionally to the gradient of the error with respect to the current value of the weights and current data samples.
  • This function decreases linearly as the difference between the predicted value for preferred and non-preferred samples increases.
  • In each training epoch, for every pairwise preference in the training dataset, the output of the neural network is computed for the two data samples in the preference (preferred and non-preferred) and the rank-margin error is backpropagated through the network to obtain the gradient required to update the weights (sketched below).
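
A hedged sketch of the rank-margin update for a linear single-layer perceptron; the margin of 1, the learning rate, and the linear output unit are assumptions for illustration. A bias weight cancels in the pairwise difference and is therefore omitted:

    import numpy as np

    def rank_margin_step(w, x_pref, x_non, lr=0.01, margin=1.0):
        """One update on a single pairwise preference."""
        diff = w @ x_pref - w @ x_non        # how far the preferred sample leads
        loss = max(0.0, margin - diff)       # decreases linearly as the gap widens
        if loss > 0.0:                       # only update when the margin is violated
            w += lr * (x_pref - x_non)       # gradient descent on the rank-margin error
        return loss

    def train_epoch(w, pairs, lr=0.01):
        """pairs: iterable of (preferred_features, non_preferred_features)."""
        return sum(rank_margin_step(w, xp, xn, lr) for xp, xn in pairs)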

D. Automatic Feature Selection

  • Automatic feature selection (FS) is an essential process towards picking those features (deep learned or ad-hoc extracted) that are appropriate for predicting the examined affective states.
  • The authors use Sequential Forward Feature Selection (SFS) for its low computational effort and demonstrated good performance compared to more advanced, yet time-consuming, feature subset selection algorithms such as genetic-based FS [34].
  • While a number of other FS algorithms are available for comparison, in this paper the authors focus on the comparative benefits of learned physiological detectors over ad-hoc designed features.
  • In brief, SFS is a bottom-up search procedure where one feature is added at a time to the current feature set (see e.g. [48]).
  • The feature to be added is selected from the subset of remaining features such that the new feature set generates the maximum value of the performance function over all candidate features for addition (see the sketch after this list).
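
In code, the procedure amounts to the loop below, where evaluate stands in for training and validating a model on a candidate feature subset (a placeholder, not the authors' pipeline); selection stops when no remaining feature improves performance, matching the termination criterion described in the paper:

    def sfs(n_features, evaluate):
        """Sequential Forward Selection over feature indices 0..n_features-1."""
        selected, best = [], float("-inf")
        remaining = set(range(n_features))
        while remaining:
            # score every remaining feature when added to the current set
            score, f = max((evaluate(selected + [f]), f) for f in remaining)
            if score <= best:        # equal or lower performance: stop
                break
            selected.append(f)
            remaining.discard(f)
            best = score
        return selected, best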

IV. THE MAZE-BALL DATASET

  • The dataset used to evaluate the proposed methodology was gathered during an experimental game survey where 36 participants played four pairs of different variants of the same video-game.
  • The test-bed game named Maze-Ball is a 3D prey/predator game that features a ball inside a maze controlled by the arrow keys.
  • Eight different game variants were presented to the players.
  • The authors expected that different camera profiles would induce different experiences and affective states, which would, in turn, reflect on the physiological state of the players, making it possible to predict the players’ affective self-reported preferences using information extracted from their physiology.
  • Blood volume pulse (BVP) and skin conductance (SC) were recorded at 31.25 Hz.

A. Ad-Hoc Extraction of Statistical Features

  • This section lists the statistical features extracted from the two physiological signals monitored.
  • Some features are extracted for both signals while some are signal-dependent as seen in the list below.
  • The choice of those specific statistical features is made in order to cover a fair amount of possible BVP and SC signal dynamics (tonic and phasic) proposed in the majority of previous studies in the field of psychophysiology (e.g. see [15], [65], [51] among many).
  • Both signals (α ∈ {BVP, SC}): average E{α}, standard deviation σ{α}, maximum max{α}, minimum min{α}, and the difference between the maximum and minimum signal recordings (sketched below).
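
The signal-independent part of this list reduces to a few lines of NumPy, as sketched here; the signal-specific features (e.g. phasic SC components) are omitted:

    import numpy as np

    def common_features(signal):
        """Statistics extracted from both BVP and SC signals."""
        s = np.asarray(signal, dtype=float)
        return {
            "mean": s.mean(),                 # E{a}
            "std": s.std(),                   # sigma{a}
            "max": s.max(),
            "min": s.min(),
            "range": s.max() - s.min(),       # max minus min recording
        }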

V. EXPERIMENTS

  • To test the efficacy of DL in constructing accurate models of affect, the authors pretrained several convolutional neural networks — using denoising auto-encoders — to extract features for each of the physiological signals and across all reported affective states in the dataset.
  • In all experiments reported in this paper the final number of features pooled from the CNNs is 15, to match the number of ad-hoc extracted statistical features (see Section IV-A).
  • Independently of model input, the use of preference learning models — which are trained and evaluated using within-participant differences — automatically minimizes the effects of between-participants physiological differences (as noted in [33], [12] among other studies).
  • This section presents the key findings derived from the SC (Section V-A) and the BVP (Section V-B) signals and concludes with the analysis of the fusion of the two physiological signals (Section V-C).
  • The prediction accuracy of the models is calculated as the average 3-fold cross-validation (CV) accuracy (the average percentage of correctly classified pairs on each fold), as sketched below.
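
A sketch of this evaluation protocol, with fit as a placeholder for training any of the models discussed (not the authors' exact pipeline):

    import numpy as np

    def pairwise_accuracy(model, pairs):
        """Fraction of pairs whose preferred sample is ranked higher."""
        hits = sum(model(xp) > model(xn) for xp, xn in pairs)
        return hits / len(pairs)

    def cv3_accuracy(pairs, fit):
        """fit(train_pairs) -> scoring function; returns mean 3-fold CV accuracy."""
        folds = np.array_split(np.arange(len(pairs)), 3)
        accs = []
        for fold in folds:
            held_out = set(fold.tolist())
            train = [p for i, p in enumerate(pairs) if i not in held_out]
            test = [pairs[i] for i in fold]
            accs.append(pairwise_accuracy(fit(train), test))
        return float(np.mean(accs))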

B. Blood Volume Pulse

  • Following the same systematic approach for selecting CNN topology and parameter sets, the authors present two convolutional networks for the experiments on the Blood Volume Pulse (BVP) signal.
  • Figure 5(b) depicts the 45 connection weights of each neuron in CNN_BVP 1×45, which cover 43.2 seconds of the BVP signal's upper envelope.
  • On that basis, the second (N2) and fifth (N5) neurons detect two 10-second-long periods of HR increments, which are separated by a HR decay period.
  • It is worth noting that no other model improved baseline accuracy using all features (see Fig. 7(a)).
  • Given the reported links between fun and heart rate [18], this result suggests that CNN_BVP 1×45 effectively extracted HR information from the BVP signal to predict reported fun.

C. Fusion of SC and BVP

  • To test the effectiveness of learned features in fused models, the authors combined the outputs of the BVP and SC CNN networks presented earlier into one SLP and compared its performance against a combination of all ad-hoc BVP and SC features.
  • For space considerations the authors only present the combination of the best performing CNNs trained on each signal individually.
  • The black bar displayed on each average value represents the standard error (10 runs).
  • The fusion of CNNs from both signals generates models that yield higher prediction accuracies than models built on ad-hoc features across all affective states, using both all features and subsets of selected features (see Fig. 8).
  • In all cases but one (i.e. anxiety prediction with SFS) these performances are significantly higher than the performances of corresponding models built on commonly used ad-hoc statistical features.

VI. DISCUSSION

  • The paper did not provide a thorough analysis of the impact of feature selection on the efficiency of DL, as the focus was put on feature extraction.
  • Furthermore, other automatic feature extraction methods, such as principal component analysis, which is common in domains such as image classification [68], will be explored for psycho-physiological modeling and compared to DL in this domain.
  • The authors argue that DL is expected to be of limited use with low-resolution signals (e.g. player score over time), which could already generate well-defined feature spaces for affective modeling.

VII. CONCLUSIONS

  • This paper introduced the application of deep learning (DL) to the construction of reliable models of affect built on physiological manifestations of emotion.
  • The algorithm proposed employs a number of convolutional layers that learn to extract relevant features from the input signals.
  • The increase in performance is more evident when automatic feature selection is employed.
  • These findings showcased the potential of DL for affective modeling, as both manual feature extraction and automatic feature selection could be ultimately bypassed.
  • With small modifications, the methodology proposed can be applied for affect classification and regression tasks across any type of input signal.


Learning Deep Physiological Models of Affect
Héctor P. Martínez, Yoshua Bengio, and Georgios N. Yannakakis, Member, IEEE
Abstract—Feature extraction and feature selection are crucial
phases in the process of affective modeling. Both, however,
incorporate substantial limitations that hinder the development
of reliable and accurate models of affect. For the purpose of
modeling affect manifested through physiology, this paper builds
on recent advances in machine learning with deep learning
(DL) approaches. The efficiency of DL algorithms that train
artificial neural network models is tested and compared against
standard feature extraction and selection approaches followed
in the literature. Results on a game data corpus containing
players’ physiological signals (i.e. skin conductance and blood
volume pulse) and subjective self-reports of affect reveal that
DL outperforms manual ad-hoc feature extraction as it yields
significantly more accurate affective models. Moreover, it appears
that DL meets and even outperforms affective models that are
boosted by automatic feature selection, for several of the scenarios
examined. As the DL method is generic and applicable to any
affective modeling task, the key findings of the paper suggest
that ad-hoc feature extraction and, to a lesser degree, feature selection could be bypassed.
Index Terms—Deep learning, affective modeling, auto-
encoders, convolutional neural networks, preference learning,
skin conductance, blood volume pulse, signal fusion, games
I. INTRODUCTION
MORE than 15 years after the early studies in Affective Computing (AC) [1], the problem of detecting and
modeling emotions in the context of human-computer inter-
action (HCI) remains complex and largely unexplored. The
detection and modeling of emotion is, primarily, the study
and use of artificial intelligence (AI) techniques for the
construction of computational models of emotion. The key
challenges one faces when attempting to model emotion [2]
are inherent in the vague definitions and fuzzy boundaries
of emotion, and in the modeling methodology followed. In
this context, open research questions are still present in all
key components of the modeling process. These include, first,
the appropriateness of the modeling tool employed to map
emotional manifestations and responses to annotated affective
states; second, the processing of signals that express these
manifestations (i.e. model input); and third, the way affective
annotation (i.e. model output) is handled. This paper touches
upon all three key components of an affective model (i.e.
input, model, output) and introduces the use of deep learning
(DL) [3], [4], [5] methodologies for affective modeling from
multiple physiological signals.
H. P. Martínez is with the IT University of Copenhagen, Center for Computer Games Research. E-mail: hpma@itu.dk
Y. Bengio is with the University of Montreal, Department of Computer Science and Operations Research. E-mail: bengioy@iro.umontreal.ca
G. N. Yannakakis is with the University of Malta, Dept. of Digital Games. E-mail: georgios.yannakakis@um.edu.mt
Traditionally in AC research, behavioral and bodily res-
ponses to stimuli are collected and used as the affective model
input. The input can be of three main types: a) behavioral res-
ponses to emotional stimuli expressed through an interactive
application (e.g. data obtained from a log of actions performed
in a game); b) objective data collected as bodily responses to
stimuli, such as physiological signals and facial expressions;
and c) the context of the interaction. Before these data streams
are fed into the computational model, an automatic or ad-hoc
feature extraction procedure is employed to derive appropriate
signal attributes (e.g. average skin conductance) that will feed
the model. It is also common to introduce an automatic or a
semi-automatic feature selection procedure that picks the most
appropriate of the features extracted.
While the phases of feature extraction and feature selection
are beneficial for affective modeling, they inherit a number
of critical limitations that make their use cumbersome in
highly complex multimodal input spaces. First, manual feature
extraction limits the creativity of attribute design to the expert
(i.e. the AC researcher) resulting in potentially inappropriate
affect detectors that might not be able to capture the man-
ifestations of the affect embedded in the raw input signals.
Second, both feature extraction and feature selection to
a larger degree are computationally expensive phases. In
particular, the computational cost of feature selection may
increase combinatorially (quadratically, in the greedy case)
with respect to the number of features considered [6]. In
general, there is no guarantee that any search algorithm is
able to converge to optimal feature sets for the model; even
exhaustive search may be approximate, since models are often
trained with non-deterministic algorithms.
Our hypothesis is that the use of non-linear unsupervised
and supervised learning methods relying on the principles
of DL [3], [4] can eliminate the limitations of the current
feature extraction and feature selection practices in affective
modeling. We test the hypothesis that DL could construct
feature extractors that are more appropriate than selected ad-
hoc features picked via automatic selection. Learning within
deep artificial neural network (ANN) architectures has proven
to be a powerful machine learning approach for a number
of benchmark problems and domains, including image and
speech recognition [7], [8]. DL allows the automation of fea-
ture extraction (and feature selection, in part) without compro-
mising on the accuracy of the obtained computational models
and the physical meaning of the data attributes extracted [9].
Using deep learning we were able to extract meaningful mul-
timodal data attributes beyond manual ad-hoc feature design.
These learned attributes led to more accurate affective models
and, at the same time, potentially save computational resources
by bypassing the computationally expensive feature selection
phase. Most importantly, with the use of DL we gain simplicity, as multiple signals can be fused and fed directly, with limited preprocessing, to the model for training.

Other common automatic feature extraction techniques
within AC are principal component analysis (PCA) and Fisher
projection. However they are typically applied to a set of
features extracted a priori [10] while we apply DL directly
to the raw data signals. Moreover, DL techniques can operate
with any signal type and are not restricted to discrete signals
as, for example, sequential data mining techniques are [11].
Finally, compared to dynamic affect modeling approaches such
as Hidden Markov Models and Dynamic Bayesian Networks,
DL models are advantageous with respect to their ability
to reduce signal resolution across the several layers of their
architectures.
This paper focuses on developing DL models of affect
using data which are annotated in a ranking format (pairwise
preferences). We emphasize the benefits of preference-based
(or ranking-based) annotations for emotion (e.g. X is more
frustrating than Y) as opposed to rating-based annotation
[12] (such as the self-assessment manikins [13], a tool to
rate levels of arousal and valence in discrete or continuous
scales [14]) and introduce the use of DL algorithms for
preference learning, namely, preference deep learning (PDL).
In this paper, the PDL algorithm proposed is tested on emo-
tional manifestations of relaxation, anxiety, excitement, and
fun, embedded in physiological signals (i.e. skin conductance
and blood volume pulse) derived from a game-based user study
of 36 participants. The study compares DL against ad-hoc
feature extraction on physiological signals, used broadly in
the AC literature, showing that DL yields models of equal or
significantly higher accuracy when a single signal is used as
model input. When the skin conductance and blood volume
pulse signals are fused, DL outperforms standard feature
extraction across all affective states examined. The supremacy
of DL is maintained even when automatic feature selection
is employed to improve models built on ad-hoc features; in
several affective states the performance of models built on
automatically selected ad-hoc features does not surpass or
reach the corresponding accuracy of the PDL approach.
This paper advances the state-of-the-art in affective
modeling in several ways. First, to the best of the authors’
knowledge, this is the first time deep learning is introduced
to the domain of psychophysiology, yielding efficient com-
putational models of affect. Second, the paper shows the
strength of the method when applied to the fusion of different
physiological signals. Third, the paper introduces PDL, i.e. the
use of deep ANN architectures trained on ranked (pairwise
preference) annotations of affect. Finally, the key findings
of the paper show the potential of DL as a mechanism
for eliminating manual feature extraction and even, in some
occasions, bypassing automatic feature selection for affective
modeling.
II. COMPUTATIONAL MODELING OF AFFECT
Emotions and affect are mental and bodily processes that
can be inferred by a human observer from a combination
of contextual, behavioral and physiological cues. Part of the
complexity of affect modeling emerges from the challenges of
finding objective and measurable signals that carry affective
information (e.g. body posture, speech and skin conductance)
and designing methodologies to collect and label emotional
experiences effectively (e.g. induce specific emotions by ex-
posing participants to a set of images). Although this paper
is only concerned with computational aspects of creating
physiological detectors of affect, the signals and the affective
target values collected shape the modeling task and, thus, influ-
ence the efficacy and applicability of dissimilar computational
methods. Consequently, this section gives an overview of
the field beyond the input modalities and emotion annotation
protocols examined in our case study. Furthermore, the studies
surveyed are representative of the two principal applications
of AI for affect modeling and cover the two key research
pillars of this paper: 1) defining feature sets to extract relevant
bits of information from objective data signals (i.e. for feature
extraction), and 2) creating models that map a feature set into
predicted affective states (i.e. for training models of affect).
A. Feature Extraction
In the context of affect detection, we refer to feature extrac-
tion as the process of transforming the raw signals captured by
the hardware (e.g. a skin conductance sensor, a microphone,
or a camera) into a set of inputs suitable for a computational
predictor of affect. The most common features extracted from
unidimensional continuous signals (i.e. temporal sequences of real values such as blood volume pulse, accelerometer data, or speech) are simple statistical features, such as average and
standard deviation values, calculated on the time or frequency
domains of the raw or the normalized signals (see [15],
[16] among others). More complex feature extractors inspired
by signal processing methods have also been proposed by
several authors. For instance, Giakoumis et al. [17] proposed
features extracted from physiological signals using Legendre
and Krawtchouk polynomials while Yannakakis and Hallam
[18] used the approximate entropy [19] and the parameters of
linear, quadratic and exponential regression models fitted to a
heart rate signal. The focus of this paper is on DL methods
that can automatically derive feature extractors from the raw
data, as opposed to a fixed set of hand-crafted extractors that
represent pre-designed statistical features of the signals.
Unidimensional symbolic or discrete signals (i.e. temporal sequences of discrete labels, typically events such as clicking a mouse button or blinking an eye) are usually transformed with ad-hoc statistical feature extractors such as counts,
similarly to continuous signals. Distinctively, Martínez and
Yannakakis [11] used frequent sequence mining methods [20]
to find frequent patterns across different discrete modalities,
namely gameplay events and discrete physiological events.
The count of each pattern was then used as an input feature
to an affect detector. This methodology is only applicable
to discrete signals: continuous signals must be discretized,
which involves a loss of information. To this end, the key
advantage of the DL methodology proposed in this paper is
that it can handle both discrete and continuous signals; a
lossless transformation can convert a discrete signal into a binary continuous signal, which can potentially be fed into a deep network; DL has been successfully applied to classify binary images, e.g. [21].

Affect recognition based on signals with more than one
dimension typically boils down to affect recognition from
images or videos of body movements, posture or facial
expressions. In most studies, a series of relevant points of
the face or body are first detected (e.g. right mouth corner
and right elbow) and tracked along frames. Second, the
tracked points are aggregated into discrete Action Units [22],
gestures [23] (e.g. lip stretch or head nod) or continuous
statistical features (e.g. body contraction index), which are
then used to predict the affective state of the user [24]. Both
above-mentioned feature extraction steps are, by definition,
supervised learning problems as the points to be tracked and
action units to be identified have been defined a priori. While
these problems have been investigated extensively under the
name of facial expression or gesture recognition, we will
not survey them broadly as this paper focuses on methods
for automatically discovering new or unknown features in an
unsupervised manner.
Deep neural network architectures such as convolutional
neural networks (CNNs), as a popular technique for object
recognition in images [25], have also been applied for facial
expression recognition. In [26], CNNs were used to detect
predefined features such as eyes and mouth which later were
used to detect smiles. Contrary to our work, in that study
each of the layers of the CNN was trained independently
using backpropagation, i.e. labeled data was available for
training each level. More recently, Rifai et al. [27] successfully
applied a variant of auto-encoders [21] and convolutional
networks, namely Contractive Convolutional Neural Networks,
to learn features from images of faces and predict the displayed
emotion, breaking the previous state-of-the-art on the Toronto
Face Database [28]. The key differences of this paper with
that study reside in the nature of the dataset and the method
used. While Rifai et al. [27] used a large dataset (over 100,000 samples; 4,178 of them were labeled with an emotion class)
of static images displaying posed emotions, we use a small
dataset (224 samples, labeled with pairwise orders) with a
set of physiological time-series recorded along an emotional
experience. The reduced size of our dataset (which is of the
same magnitude as datasets used in related psychophysiological studies, e.g. [29], [30]) does not allow the extraction of large feature sets (e.g. 9,000 features in [27]), which would
lead to affect models of poor generalizability. The nature of
our preference labels also calls for a modified CNN training
algorithm for affective preference learning which is introduced
in this paper. Furthermore, while the use of CNNs to process
images is extensive, to the best of the authors' knowledge,
CNNs have not been applied before to process (or as a means
to fuse) physiological signals.
As in many other machine learning applications, in affect
detection it is common to apply dimensionality reduction
techniques to the complete set of features extracted. A wide
variety of feature selection (FS) methods have been used
in the literature including sequential forward [31], sequential
floating forward [10], sequential backwards [32], n-best indi-
viduals [33], perceptron [33] and genetic [34] feature selection.
Fisher projection and Principal Component Analysis (PCA)
have been also widely used as dimensionality reducers on
different modalities of AC signals (e.g. see [10] among others).
An auto-encoder can be viewed as a non-linear generalization
of PCA [8]; however, while PCA has been applied in AC
to transpose sets of manually extracted features into low-
dimensional spaces, in this paper auto-encoders are used to
train unsupervised CNNs to transpose subsets of the raw
input signals into a learned set of features. We expect that
information relevant for prediction can be extracted more
effectively using dimensionality reduction methods directly on
the raw physiological signals than on a set of designer-selected
extracted features.
B. Training Models of Affect
The selection of a method to create a model that maps
a given set of features to predictions of affective variables
is strongly influenced by the dynamic aspect of the features
(stationary or sequential) and the format in which training
examples are given (continuous values, class labels or or-
dinal labels). A vast set of off-the-shelf machine learning
(ML) methods have been applied to create models of affect
based on stationary features, irrespective of the specific emo-
tions and modalities involved. These include Linear Discrimi-
nant Analysis [35], Multi-layer Perceptrons [32], K-Nearest
Neighbours [36], Support Vector Machines [37], Decision
Trees [38], Bayesian Networks [39], Gaussian Processes [29]
and Fuzzy-rules [40]. On the other hand, Hidden Markov
Models [41], Dynamic Bayesian Networks [42] and Recurrent
Neural Networks [43] have been applied for constructing affect
detectors that rely on features which change dynamically. In
the approach presented here, deep neural network architectures
reduce hierarchically the resolution of temporal signals down
to a set of features that can be fed to simple stateless models
eliminating the need for complex sequential predictors.
In all the above-mentioned studies, the prediction targets
are either class labels or continuous values. Class labels are
assigned either using an induction protocol (e.g. participants
are asked to self-elicit an emotion [36], presented with stories
to evoke a specific emotion [44]) or via rating- or rank-based
questionnaires given to users experiencing the emotion (self-
reports) or experts (third-person reports). If ratings are used,
they can be binned into discrete or binary classes (e.g. on a
scale from 1 to 5 measuring stress, values above or below
3 correspond to the user at stress or not at all, respectively
[45]) or used as target values for supervised learning (e.g.
two experts rate the amount of sadness of a facial expression
and the average value is used as the sadness intensity [46]).
Alternatively, if ranks are used, the problem of affective
modeling becomes one of preference learning. In this paper
we use object ranking methods (a subset of preference learning algorithms [47], [48]) which train computational models using partial orders among the training samples. These
methods allow us to avoid binning together ordinal labels and
to work with comparative questionnaires, which provide more
reliable self-report data compared to ratings, as they generate
less inconsistency and order effects [12].
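
A toy example (with invented ratings) makes the contrast concrete: binning discards ordinal information, whereas pairwise preferences preserve the within-participant order that object ranking methods exploit:

    # invented 1-5 stress ratings for three sessions of one participant
    ratings = {"A": 4, "B": 2, "C": 3}

    # rating-based: bin on the scale midpoint into binary classes
    classes = {k: ("stress" if v > 3 else "no_stress") for k, v in ratings.items()}

    # rank-based: each pair with unequal ratings yields (preferred, non-preferred)
    pairs = [(a, b) for a in ratings for b in ratings if ratings[a] > ratings[b]]
    # pairs == [("A", "B"), ("A", "C"), ("C", "B")]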

Object ranking methods and comparative (rank) question-
naires have been scarcely explored in the AC literature,
despite their well-known advantages. For example, Tognetti et
al. [49] applied Linear Discriminant Analysis to learn models
of preferences over game experiences based on physiological
statistical features and comparative pairwise self-reports (i.e.
participants played pairs of games and ranked games according
to preference). On the same basis, Yannakakis et al. [50], [51]
and Martínez et al. [34], [33] trained single and multiple layer
perceptrons via genetic algorithms (i.e. neuroevolutionary
preference learning) to learn models for several affective and
cognitive states (e.g. fun, challenge and frustration) using
physiological and behavioral data, and pairwise self-reports.
In this paper we introduce a deep learning methodology for
data given in a ranked format (i.e. Preference Deep Learning)
for the purpose of modeling affect.
III. DEEP ARTIFICIAL NEURAL NETWORKS
We investigate an effective method of learning models that
map signals of user behavior to predictions of affective states.
To bypass the manual ad-hoc feature extraction stage, we use
a deep model composed of (a) a multi-layer convolutional
neural network (CNN) that transforms the raw signals into
a reduced set of features that feed (b) a single-layer per-
ceptron (SLP) which predicts affective states (see Fig. 1).
Our hypothesis is that the automation of feature extraction
via deep learning will yield physiological affect detectors of
higher predictive power, which, in turn, will deliver affective
models of higher accuracy. The advantages of deep learning
techniques mentioned in the introduction of the paper have
led to very promising results in computer vision as they
have outperformed other state-of-the-art methods [52], [53].
Furthermore, convolutional networks have been successfully
applied to dissimilar temporal datasets (e.g. [54], [25]) in-
cluding electroencephalogram (EEG) signals [55] for seizure
prediction.
To train the convolutional neural network (see Section III-A)
we use denoising auto-encoders [56], an unsupervised learning
method to train filters or feature extractors which transform the
information of the input signal (see Section III-B) in order to
capture a distributed representation of its leading factors of
variation, but without the linearity assumption of PCA. The
SLP is then trained using backpropagation [57] to map the
outputs of the CNN to the given affective target values. In
the case study examined in this paper, target values are given
as pairwise comparisons (partial orders of length 2) making
error functions commonly used with gradient descent methods,
such as the difference of squared errors or cross-entropy,
unsuitable for the task. For that purpose, we use the rank
margin error function for preference data [58], [59] as detailed
in Section III-C below. Additionally, we apply an automatic
feature selection method to reduce the dimensionality of the
feature space improving the prediction accuracy of the models
trained (see Section III-D).
A. Convolutional Neural Networks
Convolutional or time-delay neural networks [25] are hier-
archical models that alternate convolutional and pooling layers
(see Fig. 1) in order to process large input spaces in which a spatial or temporal relation among the inputs exists (e.g. images, speech or physiological signals).

Fig. 2. Convolutional layer. The neurons in a convolutional layer take as input a patch on the input signal x. Each of the neurons calculates a weighted sum of the inputs (x · w), adds a bias parameter θ and applies an activation function s(x). The output of each neuron contributes to a different feature map. In order to find patterns that are insensitive to the baseline level of the input signal, x is normalized to zero mean. In this example, the convolutional layer contains 3 neurons with 20 inputs each.
Convolutional layers contain a set of neurons that detect
different patterns on a patch of the input (e.g. a time window
in a time-series or part of an image). The inputs of each neuron
(namely receptive field) determine the size of the patch. Each
neuron contains a number of trainable weights equal to the
number of its inputs and an additional bias parameter (also
trainable); the output is calculated by applying an activation
function (e.g. logistic sigmoid) to the weighted sum of the in-
puts plus the bias (see Fig. 2). Each neuron scans sequentially
the input, assessing at each patch location the similarity to
the pattern encoded on the weights. The consecutive outputs
generated at every location of the input assemble a feature
map (see Fig. 1). The output of the convolutional layer is
the set of feature maps resulting from convolving each of the
neurons across the input. Note that the convolution of each
neuron produces the same number of outputs as the number
of samples in the input signal (e.g. the sequence length) minus
the size of the patch (i.e. the size of the receptive field of the
neuron), plus 1 (see Fig. 1).

Fig. 1. Example of structure of a deep ANN architecture. The architecture contains: (a) a convolutional neural network (CNN) with two convolutional and
two pooling layers, and (b) a single-layer perceptron (SLP) predictor. In the illustrated example the first convolutional layer (3 neurons and patch length of
20 samples) processes a skin conductance signal which is propagated forward through an average-pooling layer (window length of 3 samples). A second
convolutional layer (3 neurons and patch length of 11 samples) processes the subsampled feature maps and the resulting feature maps feed the second
average-pooling layer (window length of 6 samples). The final subsampled feature maps form the output of the CNN which provides a number of extracted
(learned) features which feed the input of the SLP predictor.
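
Following the rules stated above (a convolution yields input_length minus patch_length plus 1 outputs per neuron, and a pooling layer divides the length by its window), the dimensions of the Fig. 1 example can be checked in a few lines; the 640-sample input length is an assumption for illustration:

    def conv_len(n, patch):
        return n - patch + 1        # outputs per neuron after convolution

    def pool_len(n, window):
        return n // window          # outputs per feature map after pooling

    n = 640                         # assumed raw signal length
    n = conv_len(n, 20)             # 1st convolutional layer (patch 20) -> 621
    n = pool_len(n, 3)              # 1st average-pooling (window 3)     -> 207
    n = conv_len(n, 11)             # 2nd convolutional layer (patch 11) -> 197
    n = pool_len(n, 6)              # 2nd average-pooling (window 6)     -> 32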

Citations
Journal ArticleDOI
TL;DR: The experimental results show that neural signatures associated with different emotions do exist and that they share commonality across sessions and individuals; the performance of deep models is also compared with that of shallow models.
Abstract: To investigate critical frequency bands and channels, this paper introduces deep belief networks (DBNs) to constructing EEG-based emotion recognition models for three emotions: positive, neutral and negative. We develop an EEG dataset acquired from 15 subjects. Each subject performs the experiments twice at the interval of a few days. DBNs are trained with differential entropy features extracted from multichannel EEG data. We examine the weights of the trained DBNs and investigate the critical frequency bands and channels. Four different profiles of 4, 6, 9, and 12 channels are selected. The recognition accuracies of these four profiles are relatively stable with the best accuracy of 86.65%, which is even better than that of the original 62 channels. The critical frequency bands and channels determined by using the weights of trained DBNs are consistent with the existing observations. In addition, our experiment results show that neural signatures associated with different emotions do exist and they share commonality across sessions and individuals. We compare the performance of deep models with shallow models. The average accuracies of DBN, SVM, LR, and KNN are 86.08%, 83.99%, 82.70%, and 72.60%, respectively.

1,131 citations


Cites background or methods or result from "Learning deep physiological models ..."

  • ...[5] trained an efficient deep convolution neural net-...


  • ...Recently deep learning methods are also successfully applied to physiological signal processing such as EEG, electromyogram (EMG), electrocardiogram (ECG), and skin resistance (SC), and achieve comparable results in comparison with other conventional methods [5], [32]–[34]....


  • ...Although affective computing has achieved rapid development in recent years, there are still many open problems to be solved [5], [6]....


  • ...can eliminate the limitation of handcrafted features [5]....


01 Jan 2014
TL;DR: This survey article reinterprets the evolution of NLP research as the intersection of three overlapping curves, namely the Syntactics, Semantics, and Pragmatics curves, which will eventually lead NLP research to evolve into natural language understanding.

768 citations

Journal ArticleDOI
TL;DR: This article reinterpreted the evolution of NLP research as the intersection of three overlapping curves, namely the Syntactics, Semantics, and Pragmatics curves, which will eventually lead NLP to evolve into natural language understanding.
Abstract: Natural language processing (NLP) is a theory-motivated range of computational techniques for the automatic analysis and representation of human language. NLP research has evolved from the era of punch cards and batch processing (in which the analysis of a sentence could take up to 7 minutes) to the era of Google and the likes of it (in which millions of webpages can be processed in less than a second). This review paper draws on recent developments in NLP research to look at the past, present, and future of NLP technology in a new light. Borrowing the paradigm of 'jumping curves' from the field of business management and marketing prediction, this survey article reinterprets the evolution of NLP research as the intersection of three overlapping curves, namely the Syntactics, Semantics, and Pragmatics curves, which will eventually lead NLP research to evolve into natural language understanding.

553 citations

Journal ArticleDOI
28 Jun 2018-Sensors
TL;DR: A comprehensive review on physiological signal-based emotion recognition, including emotion models, emotion elicitation methods, the published emotional physiological datasets, features, classifiers, and the whole framework for emotion recognition based on the physiological signals is presented.
Abstract: Emotion recognition based on physiological signals has been a hot topic and applied in many areas such as safe driving, health care and social security. In this paper, we present a comprehensive review on physiological signal-based emotion recognition, including emotion models, emotion elicitation methods, the published emotional physiological datasets, features, classifiers, and the whole framework for emotion recognition based on physiological signals. A summary and comparison of recent studies has been conducted, which reveals the currently existing problems, and future work is discussed.

484 citations


Cites methods from "Learning deep physiological models ..."

  • ...trained an efficient deep convolution neural network (CNN) to classify four cognitive states (relaxation, anxiety, excitement and fun) using skin conductance and blood volume pulse signals [119]....


Posted Content
TL;DR: A novel end-to-end neural network model, Multi-Scale Convolutional Neural Networks (MCNN), which incorporates feature extraction and classification in a single framework, leading to superior feature representation.
Abstract: Time series classification (TSC), the problem of predicting class labels of time series, has been around for decades within the community of data mining and machine learning, and found many important applications such as biomedical engineering and clinical prediction. However, it still remains challenging and falls short of classification accuracy and efficiency. Traditional approaches typically involve extracting discriminative features from the original time series using dynamic time warping (DTW) or shapelet transformation, based on which an off-the-shelf classifier can be applied. These methods are ad-hoc and separate the feature extraction part with the classification part, which limits their accuracy performance. Plus, most existing methods fail to take into account the fact that time series often have features at different time scales. To address these problems, we propose a novel end-to-end neural network model, Multi-Scale Convolutional Neural Networks (MCNN), which incorporates feature extraction and classification in a single framework. Leveraging a novel multi-branch layer and learnable convolutional layers, MCNN automatically extracts features at different scales and frequencies, leading to superior feature representation. MCNN is also computationally efficient, as it naturally leverages GPU computing. We conduct comprehensive empirical evaluation with various existing methods on a large number of benchmark datasets, and show that MCNN advances the state-of-the-art by achieving superior accuracy performance than other leading methods.

408 citations


Cites background from "Learning deep physiological models ..."

  • ...Extensive comparison has shown that convolution operations in CNN have better capability on extracting meaningful features than ad-hoc feature selection [21]....


References
Proceedings Article
03 Dec 2012
TL;DR: State-of-the-art performance was achieved by a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Journal ArticleDOI
28 Jul 2006-Science
TL;DR: In this article, an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data is described.
Abstract: High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

16,717 citations

Journal ArticleDOI
TL;DR: A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.
Abstract: We show how to use "complementary priors" to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.

15,055 citations


"Learning deep physiological models ..." refers background in this paper

  • ...of DL [3], [4] can eliminate the limitations of the current feature extraction and feature selection practices in affective modeling....


  • ...input, model, output) and introduces the use of deep learning (DL) [3], [4], [5] methodologies for affective modeling from multiple physiological signals....


Journal ArticleDOI

12,519 citations


"Learning deep physiological models ..." refers methods in this paper

  • ...[12] (such as the self-assessment manikins [13], a tool to rate levels of arousal and valence in discrete or continuous scales [14]) and introduce the use of DL algorithms for preference learning, namely, preference deep learning (PDL)....


Book
01 Jan 2009
TL;DR: The motivations and principles regarding learning algorithms for deep architectures are discussed, in particular those exploiting as building blocks unsupervised learning of single-layer models such as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks.
Abstract: Can machine learning deliver AI? Theoretical results, inspiration from the brain and cognition, as well as machine learning experiments suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one would need deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers, graphical models with many levels of latent variables, or in complicated propositional formulae re-using many sub-formulae. Each level of the architecture represents features at a different level of abstraction, defined as a composition of lower-level features. Searching the parameter space of deep architectures is a difficult task, but new algorithms have been discovered and a new sub-area has emerged in the machine learning community since 2006, following these discoveries. Learning algorithms such as those for Deep Belief Networks and other related unsupervised learning algorithms have recently been proposed to train deep architectures, yielding exciting results and beating the state-of-the-art in certain areas. Learning Deep Architectures for AI discusses the motivations for and principles of learning algorithms for deep architectures. By analyzing and comparing recent results with different learning algorithms for deep architectures, explanations for their success are proposed and discussed, highlighting challenges and suggesting avenues for future explorations in this area.

7,767 citations


"Learning deep physiological models ..." refers background in this paper

  • ...techniques that have provided remarkable improvements to gradient-descent supervised learning [4], especially when the number of labeled examples is small or in transfer settings [62]....


  • ...of DL [3], [4] can eliminate the limitations of the current feature extraction and feature selection practices in affective modeling....


  • ...input, model, output) and introduces the use of deep learning (DL) [3], [4], [5] methodologies for affective modeling from multiple physiological signals....


Frequently Asked Questions (19)
Q1. What have the authors contributed in "Learning deep physiological models of affect"?

For the purpose of modeling affect manifested through physiology, this paper builds on recent advances in machine learning with deep learning ( DL ) approaches. The efficiency of DL algorithms that train artificial neural network models is tested and compared against standard feature extraction and selection approaches followed in the literature. Moreover, it appears that DL meets and even outperforms affective models that are boosted by automatic feature selection, for several of the scenarios examined. As the DL method is generic and applicable to any affective modeling task, the key findings of the paper suggest that ad-hoc feature extraction and selection — to a lesser degree — could be bypassed. 

Even though the results obtained are more than encouraging with respect to the applicability and efficacy of DL for affective modeling, there are a number of research directions that should be considered in future research. Although in this paper the authors have compared DL to a complete and representative set of ad-hoc features, a wider set of features could be explored in future work. Future work, however, would aim to further test the sensitivity of CNN topologies and parameter sets as well as the generality of the extracted features across physiological datasets, reducing the experimentation effort required for future applications of DL to psychophysiology. 

By defining the reconstruction error as the sum of squared differences between the inputs and the reconstructed inputs, the authors can use a gradient descent method such as backpropagation to train the weights of the model. 

Since the authors are interested in the minimal feature subset that yields the highest performance, the authors terminate the selection procedure when an added feature yields validation performance equal to or lower than the performance obtained without it.

Part of the complexity of affect modeling emerges from the challenges of finding objective and measurable signals that carry affective information (e.g. body posture, speech and skin conductance) and designing methodologies to collect and label emotional experiences effectively (e.g. induce specific emotions by exposing participants to a set of images).

An advantage of ad-hoc extracted statistical features resides in the simplicity of interpreting the physical properties of the signal, as they are usually based on simple statistical metrics.

As soon as feature maps have been generated, a pooling layer aggregates consecutive values of the feature maps resulting from the previous convolutional layer, reducing their resolution with a pooling function (see Fig. 3). 

Despite the challenges that the periodicity of blood volume pulse generates in affective modeling, CNNs managed to extract powerful features to predict two affective states, outperforming the statistical features proposed in the literature and matching more complex data processing methods used in similar studies [67]. 

Note that while all layers of the deep architecture could be trained (including supervised fine-tuning of the CNNs), due to the small number of labeled examples available here, the Preference Deep Learning algorithm is constrained to the last layer (i.e. SLP) of the network in order to avoid overfitting. 

Convolutional layers contain a set of neurons that detect different patterns on a patch of the input (e.g. a time window in a time-series or part of an image). 

The findings, in general, suggest that DL methodologies are highly appropriate for affective modeling and, more importantly, indicate that ad-hoc feature extraction can be redundant for physiology-based modeling.

The fusion of CNNs from both signals generates models that yield higher prediction accuracies than models built on ad-hoc features across all affective states, using both all features and subsets of selected features (see Fig. 8). 

A high output of these neurons would suggest that a change in the experience elicited a heightened level of arousal that decayed naturally seconds after. 

As in the CNNs used in the SC experiments, both topologies are topped up with an average-pooling layer that reduces the length of the outputs from each of the 5 output neurons down to 3 — i.e. the CNNs output 5 feature maps of length 3, which amounts to 15 features.

A vast set of off-the-shelf machine learning (ML) methods have been applied to create models of affect based on stationary features, irrespective of the specific emotions and modalities involved. 

The numbers of inputs of the first convolutional layers of the two CNNs considered were selected to extract features at different time resolutions (20 and 80 inputs corresponding to 12.8 and 51.2 seconds, respectively), thereby giving an indication of the impact the time resolution might have on performance.
For reported fun and excitement, CNN-based feature extraction demonstrates a great advantage in extracting affect-relevant information from BVP, bypassing beat detection and heart rate estimation.

The authors trained the filters of each convolutional layer patchwise, i.e., by considering the input at each position (one patch) in the sequence as one example. 

Auto-encoders are among several unsupervised learning techniques that have provided remarkable improvements to gradient-descent supervised learning [4], especially when the number of labeled examples is small or in transfer settings [62].