Learning deep physiological models of affect
Summary (5 min read)
Introduction
- In particular, the computational cost of feature selection may increase combinatorially (quadratically, in the greedy case) with respect to the number of features considered [6].
- The study compares DL against ad-hoc feature extraction on physiological signals, used broadly in the AC literature, showing that DL yields models of equal or significantly higher accuracy when a single signal is used as model input.
- First, to the best of the authors’ knowledge, this is the first time deep learning is introduced to the domain of psychophysiology, yielding efficient computational models of affect.
- Finally, the key findings of the paper show the potential of DL as a mechanism for eliminating manual feature extraction and even, on some occasions, bypassing automatic feature selection for affective modeling.
II. COMPUTATIONAL MODELING OF AFFECT
- Emotions and affect are mental and bodily processes that can be inferred by a human observer from a combination of contextual, behavioral and physiological cues.
- Part of the complexity of affect modeling emerges from the challenges of finding objective and measurable signals that carry affective information (e.g. body posture, speech and skin conductance) and designing methodologies to collect and label emotional experiences effectively (e.g. induce specific emotions by exposing participants to a set of images).
- The signals and the affective target values collected shape the modeling task and, thus, influence the efficacy and applicability of dissimilar computational methods.
- This section gives an overview of the field beyond the input modalities and emotion annotation protocols examined in their case study.
A. Feature Extraction
- The most common features extracted from unidimensional continuous signals — i.e. temporal sequences of real values such as blood volume pulse, accelerometer data, or speech — are simple statistical features, such as average and standard deviation values, calculated on the time or frequency domains of the raw or the normalized signals (see [15], [16] among others).
- More complex feature extractors inspired by signal processing methods have also been proposed by several authors.
- Second, the tracked points are aggregated into discrete Action Units [22], gestures [23] (e.g. lip stretch or head nod) or continuous statistical features (e.g. body contraction index), which are then used to predict the affective state of the user [24].
- Deep neural network architectures such as convolutional neural networks (CNNs), a popular technique for object recognition in images [25], have also been applied to facial expression recognition.
- The authors expect that information relevant for prediction can be extracted more effectively using dimensionality reduction methods directly on the raw physiological signals than on a set of designer-selected extracted features.
B. Training Models of Affect
- The selection of a method to create a model that maps a given set of features to predictions of affective variables is strongly influenced by the dynamic aspect of the features (stationary or sequential) and the format in which training examples are given (continuous values, class labels or ordinal labels).
- On the other hand, Hidden Markov Models [41], Dynamic Bayesian Networks [42] and Recurrent Neural Networks [43] have been applied for constructing affect detectors that rely on features which change dynamically.
- In all the above-mentioned studies, the prediction targets are either class labels or continuous values.
- Alternatively, if ranks are used, the problem of affective modeling becomes one of preference learning.
- These methods allow us to avoid binning together ordinal labels and to work with comparative questionnaires, which provide more reliable self-report data compared to ratings, as they generate less inconsistency and order effects [12].
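The comparative-questionnaire setup above can be sketched as follows. This is a minimal illustration of the data layout; the function name and session encoding are assumptions for exposition, not the authors' code:

```python
# Sketch (illustrative assumption): a comparative questionnaire yields, for
# each pair of sessions, which one the participant preferred. Preference
# learning consumes these pairs directly instead of binning ratings.

def make_preference_pairs(sessions, responses):
    """Turn comparative responses into (preferred, non_preferred) pairs.

    sessions  -- dict mapping session id -> feature vector
    responses -- list of (winner_id, loser_id) questionnaire answers
    """
    pairs = []
    for winner, loser in responses:
        pairs.append((sessions[winner], sessions[loser]))
    return pairs

sessions = {"A": [0.2, 0.9], "B": [0.7, 0.1], "C": [0.4, 0.5]}
responses = [("A", "B"), ("C", "B")]  # e.g. "A was more fun than B"
pairs = make_preference_pairs(sessions, responses)
print(len(pairs))  # 2
```

Because only relative judgments are stored, no assumption is made about the absolute scale of each participant's ratings.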
III. DEEP ARTIFICIAL NEURAL NETWORKS
- To bypass the manual ad-hoc feature extraction stage, the authors use a deep model composed of (a) a multi-layer convolutional neural network (CNN) that transforms the raw signals into a reduced set of features, which feed (b) a single-layer perceptron (SLP) that predicts affective states (see Fig. 1).
- The authors' hypothesis is that automating feature extraction via deep learning will yield physiological affect detectors of higher predictive power, which, in turn, will deliver affective models of higher accuracy.
- To train the convolutional neural network (see Section III-A), the authors use denoising auto-encoders [56], an unsupervised learning method that trains filters (feature extractors) to transform the input signal (see Section III-B) so as to capture a distributed representation of its leading factors of variation, without the linearity assumption of PCA.
- In the case study examined in this paper, target values are given as pairwise comparisons (partial orders of length 2) making error functions commonly used with gradient descent methods, such as the difference of squared errors or cross-entropy, unsuitable for the task.
- For that purpose, the authors use the rank margin error function for preference data [58], [59] as detailed in Section III-C below.
A. Convolutional Neural Networks
- Convolutional or time-delay neural networks [25] are hierarchical models that alternate convolutional and pooling layers (see Fig. 1) in order to process large input spaces in which a spatial or temporal relation among the inputs exists (e.g. images, speech or physiological signals).
- Each neuron scans the input sequentially, assessing at each patch location the similarity to the pattern encoded in its weights.
- The output of the convolutional layer is the set of feature maps resulting from convolving each of the neurons across the input.
- As soon as feature maps have been generated, a pooling layer aggregates consecutive values of the feature maps resulting from the previous convolutional layer, reducing their resolution with a pooling function (see Fig. 3).
- The maximum or average values are the two most commonly used pooling functions providing max-pooling and average-pooling layers, respectively.
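The convolution-then-pooling pipeline above can be sketched in a few lines of NumPy. This is a minimal one-neuron illustration with made-up layer sizes, not the topology used in the paper; a logistic activation is assumed:

```python
import numpy as np

def convolve1d(signal, weights, bias):
    """Slide one neuron's weight vector across the signal (valid mode),
    producing one feature map; logistic activation on each patch response."""
    patch = len(weights)
    out = np.empty(len(signal) - patch + 1)
    for i in range(len(out)):
        out[i] = weights @ signal[i:i + patch] + bias
    return 1.0 / (1.0 + np.exp(-out))

def avg_pool(feature_map, width):
    """Aggregate consecutive values, reducing resolution by `width`."""
    n = len(feature_map) // width
    return feature_map[:n * width].reshape(n, width).mean(axis=1)

signal = np.sin(np.linspace(0, 4 * np.pi, 40))   # stand-in physiological signal
fmap = convolve1d(signal, np.array([0.5, -0.5, 0.5, -0.5]), 0.1)
pooled = avg_pool(fmap, 4)
print(fmap.shape, pooled.shape)  # (37,) (9,)
```

Swapping `mean` for `max` in `avg_pool` gives the max-pooling variant; a full layer would simply stack one feature map per neuron.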
B. Auto-encoders
- An auto-encoder (AE) [60], [8], [21] is a model that transforms an input space into a new distributed representation (extracted features) by applying a deterministic parametrized function (e.g. single layer of logistic neurons) called the encoder (see Fig. 4).
- In the denoising variant, the encoder receives a corrupted version of the input, and the training objective is to reconstruct the original uncorrupted input, i.e., one minimizes the discrepancy between the outputs of the decoder and the uncorrupted inputs.
- Auto-encoders are among several unsupervised learning techniques that have provided remarkable improvements to gradient-descent supervised learning [4], especially when the number of labeled examples is small or in transfer settings [62].
- ANNs that are pretrained using these techniques usually converge to more robust and accurate solutions than ANNs with randomly sampled initial weights.
- The authors trained the filters of each convolutional layer patchwise, i.e., by considering the input at each position (one patch) in the sequence as one example.
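The patchwise denoising auto-encoder training described above can be sketched as follows. The layer sizes, noise level, and learning rate are illustrative assumptions, not the paper's values; tied encoder/decoder weights and a squared reconstruction error are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

patch, hidden, lr = 8, 3, 0.05              # illustrative sizes, not the paper's
W = rng.normal(0.0, 0.1, (hidden, patch))   # tied weights: decoder uses W.T
b_h, b_v = np.zeros(hidden), np.zeros(patch)

signal = np.sin(np.linspace(0, 6 * np.pi, 200))  # stand-in raw signal
# patchwise training set: the input at each position is one example
patches = np.array([signal[i:i + patch] for i in range(len(signal) - patch + 1)])

for epoch in range(60):
    for x in patches:
        x_noisy = x + rng.normal(0.0, 0.1, patch)  # corrupt the input
        h = sigmoid(W @ x_noisy + b_h)             # encode
        r = W.T @ h + b_v                          # decode (linear output)
        err = r - x                                # compare to the CLEAN input
        dh = (W @ err) * h * (1.0 - h)             # backprop through the encoder
        W -= lr * (np.outer(h, err) + np.outer(dh, x_noisy))
        b_v -= lr * err
        b_h -= lr * dh

recon = W.T @ sigmoid(W @ patches[0] + b_h) + b_v
mse = float(np.mean((recon - patches[0]) ** 2))
print(mse)
```

Each row of `W` plays the role of one convolutional filter: after unsupervised training it can be convolved across the whole signal to produce a feature map.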
C. Preference Deep Learning
- The outputs of a trained CNN define a number of learned features extracted from the input signal.
- These, in turn, may feed any function approximator or classifier that attempts to find a mapping between the input signal and a target output (i.e. affective state in their case).
- To this aim, the authors use backpropagation [57], which optimizes an error function iteratively across a number of epochs by adjusting the weights of the SLP proportionally to the gradient of the error with respect to the current value of the weights and current data samples.
- This error function decreases linearly as the difference between the predicted values for the preferred and non-preferred samples increases, reaching zero once that difference exceeds the margin.
- In each training epoch, for every pairwise preference in the training dataset, the output of the neural network is computed for the two data samples in the preference (preferred and non preferred) and the rank-margin error is backpropagated through the network in order to obtain the gradient required to update the weights.
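The per-pair rank-margin update described above can be sketched for a linear SLP on top of frozen learned features. The margin, learning rate, feature count, and toy data are illustrative assumptions, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, margin, lr = 15, 1.0, 0.1   # illustrative values
w = rng.normal(0.0, 0.1, n_features)    # SLP weights (a bias would cancel in pairs)

def f(x):
    return w @ x   # SLP output for one feature vector

# toy preference data: (preferred, non_preferred) feature-vector pairs
pairs = [(rng.normal(0.5, 1.0, n_features), rng.normal(-0.5, 1.0, n_features))
         for _ in range(40)]

for epoch in range(100):
    for x_pref, x_non in pairs:
        # rank-margin error: zero once the preferred sample outscores
        # the non-preferred one by at least `margin`
        err = max(0.0, margin - (f(x_pref) - f(x_non)))
        if err > 0.0:
            w += lr * (x_pref - x_non)   # gradient step on the linear error

correct = sum(f(p) > f(n) for p, n in pairs)
print(correct, "/", len(pairs))
```

Note that only the last layer is updated here, mirroring the constraint the authors impose to avoid overfitting on few labeled examples.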
D. Automatic Feature Selection
- Automatic feature selection (FS) is an essential process towards picking those features (deep learned or ad-hoc extracted) that are appropriate for predicting the examined affective states.
- The authors use Sequential Forward Feature Selection (SFS) for its low computational cost and its demonstrated good performance compared to more advanced, yet time-consuming, feature subset selection algorithms such as genetic-based FS [34].
- While a number of other FS algorithms are available for comparison, in this paper the authors focus on the comparative benefits of learned physiological detectors over ad-hoc designed features.
- In brief, SFS is a bottom-up search procedure where one feature is added at a time to the current feature set (see e.g. [48]).
- The feature to be added is selected from the subset of the remaining features such that the new feature set generates the maximum value of the performance function over all candidate features for addition.
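The greedy procedure above, including the termination criterion mentioned later in the FAQ (stop when an added feature yields equal or lower validation performance), can be sketched as follows. The `evaluate` function is a stand-in assumption for the model's validation performance:

```python
# Sketch of Sequential Forward Selection: greedily add the feature that
# maximizes a performance function, stopping when no addition improves
# validation performance. `evaluate` is an illustrative stand-in.

def sfs(features, evaluate):
    """features: list of feature ids; evaluate: subset -> performance."""
    selected, best = [], float("-inf")
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            return selected
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        score, feat = max(scored)
        if score <= best:          # equal or lower performance: terminate
            return selected
        selected.append(feat)
        best = score

# toy performance function: only features 0 and 2 help, minus a small
# penalty per added feature
def evaluate(subset):
    return 0.5 + 0.2 * (0 in subset) + 0.1 * (2 in subset) - 0.01 * len(subset)

print(sfs([0, 1, 2, 3], evaluate))  # [0, 2]
```

The greedy loop evaluates O(n) candidates per added feature, hence the quadratic worst-case cost noted in the Introduction.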
IV. THE MAZE-BALL DATASET
- The dataset used to evaluate the proposed methodology was gathered during an experimental game survey where 36 participants played four pairs of different variants of the same video-game.
- The test-bed game named Maze-Ball is a 3D prey/predator game that features a ball inside a maze controlled by the arrow keys.
- Eight different game variants were presented to the players.
- The authors expected that different camera profiles would induce different experiences and affective states, which would, in turn, reflect on the physiological state of the players, making it possible to predict the players’ affective self-reported preferences using information extracted from their physiology.
- Blood volume pulse (BVP) and skin conductance (SC) were recorded at 31.25 Hz.
A. Ad-Hoc Extraction of Statistical Features
- This section lists the statistical features extracted from the two physiological signals monitored.
- Some features are extracted for both signals while some are signal-dependent as seen in the list below.
- The choice of those specific statistical features is made in order to cover a fair amount of possible BVP and SC signal dynamics (tonic and phasic) proposed in the majority of previous studies in the field of psychophysiology (e.g. see [15], [65], [51] among many).
- Both signals (α ∈ {BVP, SC}): average E{α}, standard deviation σ{α}, maximum max{α}, minimum min{α}, and the difference between the maximum and minimum signal recordings.
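The shared statistical features listed above are straightforward to compute. The sketch below covers only the features common to both signals; the signal-specific features and the function name are omitted/assumed for illustration:

```python
import numpy as np

def shared_stat_features(signal):
    """Ad-hoc statistical features computed for either signal (BVP or SC)."""
    a = np.asarray(signal, dtype=float)
    return {
        "mean": a.mean(),                    # E{alpha}
        "std": a.std(),                      # sigma{alpha}
        "max": a.max(),
        "min": a.min(),
        "max_minus_min": a.max() - a.min(),  # dynamic range
    }

feats = shared_stat_features([1.0, 3.0, 2.0, 5.0])
print(feats["max_minus_min"])  # 4.0
```

Such features are easy to interpret physically, which, as the FAQ below notes, is their main advantage over learned representations.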
V. EXPERIMENTS
- To test the efficacy of DL on constructing accurate models of affect the authors pretrained several convolutional neural networks — using denoising auto-encoders — to extract features for each of the physiological signals and across all reported affective states in the dataset.
- In all experiments reported in this paper the final number of features pooled from the CNNs is 15, to match the number of ad-hoc extracted statistical features (see Section IV-A).
- Independently of model input, the use of preference learning models — which are trained and evaluated using within-participant differences — automatically minimizes the effects of between-participants physiological differences (as noted in [33], [12] among other studies).
- This section presents the key findings derived from the SC (Section V-A) and the BVP (Section V-B) signals and concludes with the analysis of the fusion of the two physiological signals (Section V-C).
- The prediction accuracy of the models is calculated as the average 3-fold cross-validation (CV) accuracy (average percentage of correctly classified pairs on each fold).
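The evaluation measure above can be made concrete with a small sketch: average the per-fold fraction of correctly classified preference pairs. The fold counts below are toy values, not results from the paper:

```python
# Sketch of the paper's evaluation measure: average 3-fold cross-validation
# accuracy, i.e. the mean percentage of correctly classified pairs per fold.

def cv_accuracy(fold_results):
    """fold_results: list of (correct_pairs, total_pairs) per fold."""
    return sum(c / t for c, t in fold_results) / len(fold_results) * 100.0

# e.g. three folds with 8/10, 7/10 and 9/10 pairs classified correctly
print(cv_accuracy([(8, 10), (7, 10), (9, 10)]))  # 80.0
```

Averaging per-fold percentages (rather than pooling all pairs) keeps each fold equally weighted even if fold sizes differ.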
B. Blood Volume Pulse
- Following the same systematic approach for selecting CNN topology and parameter sets, the authors present two convolutional networks for the experiments on the Blood Volume Pulse (BVP) signal.
- 1) Deep Learned Features: Figure 5(b) depicts the 45 connection weights of each neuron in CNN BVP 1×45, which cover 43.2 seconds of the BVP signal's upper envelope.
- On that basis, the second (N2) and fifth (N5) neurons detect two 10-second-long periods of HR increments, which are separated by an HR decay period.
- It is worth noting that no other model improved baseline accuracy using all features (see Fig. 7(a)).
- Given the reported links between fun and heart rate [18], this result suggests that CNN BVP 1×45 effectively extracted HR information from the BVP signal to predict reported fun.
C. Fusion of SC and BVP
- To test the effectiveness of learned features in fused models, the authors combined the outputs of the BVP and SC CNN networks presented earlier into one SLP and compared its performance against a combination of all ad-hoc BVP and SC features.
- For space considerations, the authors present only the combination of the best-performing CNNs trained on each signal individually.
- The black bar displayed on each average value represents the standard error (10 runs).
- The fusion of CNNs from both signals generates models that yield higher prediction accuracies than models built on ad-hoc features across all affective states, using both all features and subsets of selected features (see Fig. 8).
- In all cases but one (i.e. anxiety prediction with SFS) these performances are significantly higher than the performances of corresponding models built on commonly used ad-hoc statistical features.
VI. DISCUSSION
- The paper did not provide a thorough analysis of the impact of feature selection on the efficiency of DL, as the focus was on feature extraction.
- Furthermore, other automatic feature extraction methods, such as principal component analysis, which is common in domains such as image classification [68], will be explored for psychophysiological modeling and compared to DL in this domain.
- The authors argue that DL is expected to be of limited use for low-resolution signals (e.g. player score over time), which could already yield well-defined feature spaces for affective modeling.
VII. CONCLUSIONS
- This paper introduced the application of deep learning (DL) to the construction of reliable models of affect built on physiological manifestations of emotion.
- The algorithm proposed employs a number of convolutional layers that learn to extract relevant features from the input signals.
- The increase in performance is more evident when automatic feature selection is employed.
- These findings showcased the potential of DL for affective modeling, as both manual feature extraction and automatic feature selection could be ultimately bypassed.
- With small modifications, the methodology proposed can be applied for affect classification and regression tasks across any type of input signal.
Frequently Asked Questions (19)
Q2. What are the future works in "Learning deep physiological models of affect" ?
Even though the results obtained are more than encouraging with respect to the applicability and efficacy of DL for affective modeling, there are a number of research directions that should be considered in future research. Although in this paper the authors have compared DL to a complete and representative set of ad-hoc features, a wider set of features could be explored in future work. Future work, however, would aim to further test the sensitivity of CNN topologies and parameter sets as well as the generality of the extracted features across physiological datasets, reducing the experimentation effort required for future applications of DL to psychophysiology.
Q3. How can the authors train the weights of the model?
By defining the reconstruction error as the sum of squared differences between the inputs and the reconstructed inputs, the authors can use a gradient descent method such as backpropagation to train the weights of the model.
Q4. What is the function that terminates the selection procedure?
Since the authors are interested in the minimal feature subset that yields the highest performance, they terminate the selection procedure when an added feature yields equal or lower validation performance compared to the performance obtained without it.
Q5. What is the main idea behind the complexity of affect modeling?
Part of the complexity of affect modeling emerges from the challenges of finding objective and measurable signals that carry affective information (e.g. body posture, speech and skin conductance) and designing methodologies to collect and label emotional experiences effectively (e.g. induce specific emotions by exposing participants to a set of images).
Q6. What is the advantage of ad-hoc extracted statistical features?
An advantage of ad-hoc extracted statistical features resides in the simplicity to interpret the physical properties of the signal as they are usually based on simple statistical metrics.
Q7. What is the function that is used to reduce the resolution of the feature maps?
As soon as feature maps have been generated, a pooling layer aggregates consecutive values of the feature maps resulting from the previous convolutional layer, reducing their resolution with a pooling function (see Fig. 3).
Q8. What is the main reason why CNNs are so effective in predicting affective states?
Despite the challenges that the periodicity of blood volume pulse generates in affective modeling, CNNs managed to extract powerful features to predict two affective states, outperforming the statistical features proposed in the literature and matching more complex data processing methods used in similar studies [67].
Q9. Why is the Preference Deep Learning algorithm constrained to the last layer?
Note that while all layers of the deep architecture could be trained (including supervised fine-tuning of the CNNs), due to the small number of labeled examples available here, the Preference Deep Learning algorithm is constrained to the last layer (i.e. SLP) of the network in order to avoid overfitting.
Q10. What is the definition of a convolutional layer?
Convolutional layers contain a set of neurons that detect different patterns on a patch of the input (e.g. a time window in a time-series or part of an image).
Q11. What does the paper suggest that DL methodologies are appropriate for affective modeling?
The results, in general, suggest that DL methodologies are highly appropriate for affective modeling and, more importantly, indicate that ad-hoc feature extraction can be redundant for physiology-based modeling.
Q12. What is the effect of fusion of CNNs on the prediction of affect?
The fusion of CNNs from both signals generates models that yield higher prediction accuracies than models built on ad-hoc features across all affective states, using both all features and subsets of selected features (see Fig. 8).
Q13. What would be the effect of a high output of these neurons?
A high output of these neurons would suggest that a change in the experience elicited a heightened level of arousal that decayed naturally seconds after.
Q14. How many features are output from the CNNs?
As in the CNNs used in the SC experiments, both topologies are topped with an average-pooling layer that reduces the length of the outputs from each of the 5 output neurons down to 3, i.e. the CNNs output 5 feature maps of length 3, which amounts to 15 features.
Q15. What are the main types of ML methods used to create models of affect?
A vast set of off-the-shelf machine learning (ML) methods have been applied to create models of affect based on stationary features, irrespective of the specific emotions and modalities involved.
Q16. How many inputs were selected to extract features at different time resolutions?
The number of inputs of the first convolutional layer of the two CNNs considered were selected to extract features at different time resolutions (20 and 80 inputs, corresponding to 12.8 and 51.2 seconds, respectively), thereby giving an indication of the impact the time resolution might have on performance.
Q17. What is the way to extract affect-relevant information from BVP?
For reported fun and excitement, CNN-based feature extraction demonstrates a great advantage of extracting affect-relevant information from BVP bypassing beat detection and heart rate estimation.
Q18. How do the authors train the filters of each convolutional layer?
The authors trained the filters of each convolutional layer patchwise, i.e., by considering the input at each position (one patch) in the sequence as one example.
Q19. What are the advantages of using auto-encoders?
Auto-encoders are among several unsupervised learning techniques that have provided remarkable improvements to gradient-descent supervised learning [4], especially when the number of labeled examples is small or in transfer settings [62].