A Bayesian framework for video affective representation
Summary
1.1. Overview
- Video and audio on-demand systems are becoming increasingly popular and are likely to replace traditional television.
- The enormous volume and variety of digital multimedia content calls for more efficient multimedia management methods.
- These studies were mostly based on content analysis and textual tags [2].
- Then, the affect representation system fuses the extracted features, the stored personal information, and the metadata to represent the evoked emotion.
- Section 3 details the movie dataset used and the features that have been extracted.
1.2. State of the art
- Video affect representation requires understanding of the intensity and type of user’s affect while watching a video.
- In order to represent affect in video, they first selected video- and audio-content features according to their relation to the valence-arousal space, which was defined as an affect model (for the definition of affect model, see Section 1.3) [4].
- Next, they used color energy, lighting and brightness as valence-related features for an HMM-based valence classification of the previously arousal-categorized shots (a minimal sketch follows this list).
- A personalized affect representation method based on a regression approach for estimating user-felt arousal and valence from multimedia content features and/or from physiological responses was presented by Soleymani et al. [7].
- A relevance vector machine was used to find linear regression weights.
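As a hedged illustration of the HMM-based valence classification mentioned above, the sketch below trains one Gaussian HMM per valence class on sequences of shot features and classifies a new sequence by maximum log-likelihood; the hmmlearn models, feature layout and state counts are assumptions, not the cited authors' implementation.

```python
# Sketch: HMM-based valence classification of arousal-categorized shots.
# One GaussianHMM per valence class; a scene (a sequence of shot features)
# is assigned to the class whose model gives the highest log-likelihood.
# Feature values and model settings are illustrative assumptions.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_valence_hmms(sequences_by_class, n_states=3):
    """sequences_by_class: {class_label: list of (n_shots, n_features) arrays}."""
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack(seqs)                      # stack all shot-feature sequences
        lengths = [len(s) for s in seqs]         # sequence boundaries for hmmlearn
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify_valence(models, shot_features):
    """shot_features: (n_shots, n_features) array for one scene."""
    scores = {label: m.score(shot_features) for label, m in models.items()}
    return max(scores, key=scores.get)           # class with highest log-likelihood
```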
1.3. Affect and Affective representation
- Russell [10] proposed a 3D continuous space called the valence-arousal-dominance space which was based on a self-representation of emotions from multiple subjects.
- The valence axis represents the pleasantness of a situation, from unpleasant to pleasant; the arousal axis expresses the degree of felt excitement, from calm to exciting.
- The most straightforward way to represent an emotion is to use discrete labels such as fear, anxiety and joy; however, label-based representations have several disadvantages.
- The main one is that despite the universality of basic emotions, the labels themselves are not universal.
- Each movie consists of scenes, and each scene consists of a sequence of shots that take place in the same location.
2.1. Arousal estimation with regression on shots
- Informative features for arousal estimation include loudness and energy of the audio signals, motion component, visual excitement and shot duration.
- The RVM is able to reject uninformative features during its training; hence, no further feature selection was used for arousal estimation.
- The shot arousal is estimated with the linear model $\hat{a}_s = w_0 + \sum_{i} w_i z_{i,s}$ (1), where $z_{i,s}$ is the $i$-th content feature of shot $s$ and the weights $w_i$ are obtained from the RVM. After computing arousal at the shot level, the average and maximum arousals of the shots of each scene are computed and used as arousal indicator features for the scene affective classification (a regression sketch follows this list).
- During an exciting scene the arousal-related features do not all remain at their extreme level.
- This was done in such a way that all movies from the dataset except the one to which the shot belonged were used as the training set for the RVM.
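A minimal sketch of the shot-level arousal regression of eq. (1) and the scene-level aggregation described above; sklearn's ARDRegression is used here as a sparse Bayesian stand-in for the relevance vector machine, so the model choice and data layout are assumptions rather than the authors' implementation.

```python
# Sketch: shot-level arousal regression (eq. 1) and scene-level aggregation.
# ARDRegression stands in for the relevance vector machine (RVM): both are
# sparse Bayesian linear models, but this substitution is an assumption.
import numpy as np
from sklearn.linear_model import ARDRegression

def fit_arousal_regressor(Z_train, a_train):
    """Z_train: (n_shots, n_features) content features; a_train: annotated arousal."""
    model = ARDRegression()
    model.fit(Z_train, a_train)     # learns the intercept w_0 and sparse weights w_i
    return model

def scene_arousal_indicators(model, scene_shot_features):
    """Average and maximum of the estimated shot arousals within one scene."""
    a_hat = model.predict(scene_shot_features)   # \hat{a}_s for each shot of the scene
    return a_hat.mean(), a_hat.max()
```

In a leave-one-movie-out setup, `fit_arousal_regressor` would be trained on the shots of all movies except the one containing the scene under evaluation.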
2.2. Bayesian framework and scene classification
- To categorize the valence-arousal space into three affect classes, it was divided into the three areas shown in Figure 2, each corresponding to one class.
- Hence, the authors categorized the lower half of the plane into one class.
- These classes were used as a simple representation for the emotion categories based on the previous literature on emotion assessment [14].
- This feature vector in turn was used for the classification.
- Different methods were evaluated to estimate the posterior probability p(y_j | x_j); a sketch of one way to combine this posterior with contextual priors follows this list.
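The following sketch shows one plausible way a content-based likelihood could be combined with temporal and genre priors under an independence assumption; the exact factorization and probability estimates used in the paper may differ.

```python
# Sketch: combining a content-based likelihood with temporal and genre priors
# to obtain a posterior over the three affect classes (naive Bayes style).
# The factorization shown here is an assumed, simplified combination.
import numpy as np

def scene_posterior(content_likelihood, transition, genre_prior, prev_class):
    """
    content_likelihood: p(x_j | y_j), shape (3,)
    transition:         p(y_j | y_{j-1}), shape (3, 3), rows = previous class
    genre_prior:        p(y_j | genre),   shape (3,)
    prev_class:         class index assigned to the previous scene
    """
    unnorm = content_likelihood * transition[prev_class] * genre_prior
    return unnorm / unnorm.sum()     # posterior p(y_j | x_j, y_{j-1}, genre)

# Example with made-up numbers:
post = scene_posterior(
    content_likelihood=np.array([0.2, 0.5, 0.3]),
    transition=np.array([[0.6, 0.3, 0.1],
                         [0.2, 0.6, 0.2],
                         [0.1, 0.3, 0.6]]),
    genre_prior=np.array([0.3, 0.4, 0.3]),
    prev_class=1,
)
print(post.argmax())   # most probable affect class for the current scene
```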
3. Material description
- A dataset of movies segmented and affectively annotated by arousal and valence is used as the training set.
- The majority of movies were selected either because they were used in similar studies (e.g. [15]), or because they were recent and popular.
- Movie videos were encoded into the MPEG-1 format to extract motion vectors and I-frames for further feature extraction (an extraction sketch follows this list).
- The second information stream, namely sound, has an important impact on the user's affect.
- Textual features were also extracted from the subtitles track of the movies.
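A hedged sketch of how I-frames could be pulled out of the MPEG-1 streams with ffmpeg for later visual feature extraction; the command, filter syntax and file names are assumptions, since the paper does not specify the tooling used.

```python
# Sketch: extracting I-frames from an MPEG-1 encoded movie with ffmpeg,
# for subsequent visual feature extraction. Command, filter expression and
# output naming are assumptions, not the authors' documented pipeline.
import subprocess

def extract_iframes(video_path, out_pattern="iframe_%05d.png"):
    subprocess.run([
        "ffmpeg", "-i", video_path,
        "-vf", r"select=eq(pict_type\,I)",   # keep only intra-coded (I) frames
        "-vsync", "vfr",                     # one output image per selected frame
        out_pattern,
    ], check=True)

extract_iframes("movie.mpg")
```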
3.1. Audio features
- A total of 53 low-level audio features were determined for each of the audio signals.
- To determine the three important audio types (music, speech, environment), the authors implemented a three-class audio type classifier using support vector machines (SVM) operating on low-level audio features computed over one-second segments (a classifier sketch follows this list).
- Extracted audio features by category:
  - MFCC: MFCC coefficients (13 features) [20], derivative of MFCC (13 features), autocorrelation of MFCC (13 features)
  - Energy: average energy of the audio signal [20]
  - Time-frequency: spectrum flux, spectral centroid, delta spectrum magnitude, band energy ratio [20;21]
- (Figure: presence of each audio type in a movie segment, illustrated for movies such as The Thin Red Line and Gangs of New York.)
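Below is a minimal sketch of the three-class audio type classifier (music / speech / environment) operating on low-level features of one-second segments; the librosa feature set and SVM settings are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch: SVM classifier for audio type (music / speech / environmental sound)
# on low-level features of one-second segments. The feature choice (MFCCs,
# spectral centroid, energy) and SVM settings are illustrative assumptions.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def segment_features(y, sr):
    """Low-level features of a one-second audio segment (illustrative choice)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # 13 MFCCs per frame
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # spectral centroid
    rms = librosa.feature.rms(y=y)                             # energy proxy
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           [centroid.mean()], [rms.mean()]])

def train_audio_type_classifier(X, labels):
    """X: feature vectors of labeled one-second segments; labels: music/speech/env."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    return clf.fit(X, labels)
```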
3.2. Visual features
- From a movie director's point of view, lighting key [2;23] and color variance [2] are important tools to evoke emotions.
- The average shot change rate, and shot length variance were extracted to characterize video rhythm.
- Fast-moving scenes or object movement across consecutive frames is also an effective factor for evoking excitement.
- Colors and their proportions are important parameters to elicit emotions [17].
- In order to use colors in the list of video features, a 20-bin color histogram of hue and lightness values in the HSV space was computed for each I-frame and subsequently averaged over all frames, as sketched below.
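A small sketch of the per-I-frame hue histogram averaged over frames, using OpenCV; the 20-bin layout follows the text, but the handling of lightness and the exact color space conversion are simplified assumptions.

```python
# Sketch: 20-bin hue histogram per I-frame, averaged over all I-frames.
# Only the hue channel is shown; the paper also uses lightness, which is
# omitted here for brevity (an assumption of this sketch).
import cv2
import numpy as np

def average_hue_histogram(iframe_paths, n_bins=20):
    hists = []
    for path in iframe_paths:
        bgr = cv2.imread(path)
        hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
        hue = hsv[:, :, 0].ravel()                     # OpenCV hue range is 0..179
        hist, _ = np.histogram(hue, bins=n_bins, range=(0, 180), density=True)
        hists.append(hist)
    return np.mean(hists, axis=0)                      # averaged over all I-frames
```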
3.3. Affective annotation
- The coordinates of a pointer manipulated by the user are continuously recorded during the show time of the stimuli (video, image, or external source) and used as the affect indicators.
- A set of SAM manikins (Self-Assessment Manikins [25]) is generated for different combinations of arousal and valence to help the user understand the emotions related to the regions of the valence-arousal space.
- For example, the positive-excited manikin is generated by combining the positive manikin and the excited manikin.
- The participant was asked to annotate the movies so as to indicate at which times his/her felt emotion had changed.
- The participant was asked to indicate at least one point during each scene, so as not to leave any scene without an assessment (a per-scene aggregation sketch follows this list).
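A minimal sketch, under an assumed data layout, of how the continuously recorded pointer coordinates could be reduced to one valence-arousal annotation per scene by averaging the samples falling within each scene's time span; the timestamped-sample format and scene boundaries are hypothetical.

```python
# Sketch: turning continuously recorded pointer coordinates (valence, arousal)
# into one annotation per scene by averaging samples within the scene's time span.
# The data layout (timestamped samples, scene boundaries) is a hypothetical one.
import numpy as np

def scene_annotations(timestamps, valence, arousal, scene_bounds):
    """scene_bounds: list of (start_time, end_time) tuples, one per scene."""
    labels = []
    for start, end in scene_bounds:
        mask = (timestamps >= start) & (timestamps < end)
        labels.append((valence[mask].mean(), arousal[mask].mean()))
    return labels
```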
4.1. Arousal estimation of shots
- Figure 4 shows a sample arousal curve from part of the film entitled “Silent Hill”.
- The participant’s felt emotion was however not completely in agreement with the estimated curve, as can for instance be observed in the second half of the plot.
- A possible cause for the discrepancy is the low temporal resolution of the self-assessment.
- Another possible cause is experimental weariness: after several minutes of exciting stimuli, a participant's arousal might decrease despite strong movements in the video and loud audio.
- Finally, some emotional feelings might simply not be captured by low-level features; this would for instance be the case for a racist comment in a movie dialogue which evokes disgust for a participant.
4.2. Classification results
- Classification performance was evaluated with the F1 measure, $F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$ (3). For the ten-fold cross-validation, the original samples (movie scenes) were partitioned into 10 subsample sets (an evaluation sketch follows this list).
- The naïve Bayesian classifier results are shown in Table 3-a.
- As with the temporal prior, the genre prior leads to a better estimate of the emotion class.
- The evolution of classification results over consecutive scenes when adding the time prior shows that this prior allows correcting results for some samples that were misclassified using the genre prior only.
- Using physiological signals or audiovisual recordings will help overcome these problems and facilitate this part of the work, by yielding continuous affective annotations without interrupting the user [7].
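A short sketch of the ten-fold cross-validation of a naive Bayesian classifier scored with the F1 measure, using sklearn; the feature matrix, label vector and the macro-averaging of F1 are assumptions, and the temporal/genre priors are not included in this baseline.

```python
# Sketch: ten-fold cross-validation of a Gaussian naive Bayes classifier on
# scene feature vectors, scored with the F1 measure (eq. 3). The feature
# matrix X and three-class labels y are assumed to come from the earlier steps.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

def evaluate_naive_bayes(X, y):
    """X: scene feature vectors, y: three-class affect labels."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="f1_macro")
    return scores.mean()   # average F1 over the 10 folds
```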
5. Conclusions and perspectives
- An affective representation system for estimating felt emotions at the scene level has been proposed using a Bayesian classification framework that allows taking some form of context into account.
- Results showed the advantage of using well chosen priors, such as temporal information provided by the previous scene emotion, and movie genre.
- The F1 classification measure of 54.9% that was obtained on three emotional classes with a naïve Bayesian classifier was improved to 56.5% and 59.5% using only the time prior and the genre prior, respectively.
- This measure finally improved to 63.4% after utilizing all the priors.
- It will also provide us with a better understanding of the feasibility of using group-wise profiles containing some affective characteristics that are shared between users.