ISSN: 1783-7677

Journal on Multimodal User Interfaces 

Springer Science+Business Media
About: Journal on Multimodal User Interfaces is an academic journal published by Springer Science+Business Media. The journal publishes mainly in the areas of Sonification and Usability. It has the ISSN identifier 1783-7677. Over its lifetime, it has published 396 papers, which have received 7360 citations. The journal is also known as: Journal on multimodal user interfaces (Print) & Multimodal user interfaces (Internet).


Papers
Journal Article
TL;DR: In this article, the authors present an approach to learning several specialist models with deep learning techniques, each focusing on one modality: a convolutional neural network, a deep belief net, a K-Means-based bag-of-mouths model, and a relational autoencoder.
Abstract: The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which consider combinations of features from multiple modalities for label assignment. In this paper we present our approach to learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network that captures visual information in detected faces, a deep belief net that represents the audio stream, a K-Means based “bag-of-mouths” model that extracts visual features around the mouth region, and a relational autoencoder that addresses spatio-temporal aspects of the videos. We explore multiple methods for combining the cues from these modalities into one common classifier, which achieves a considerably greater accuracy than predictions from our strongest single-modality classifier. Our method was the winning submission in the 2013 EmotiW challenge and achieved a test set accuracy of 47.67 % on the 2014 dataset.

357 citations
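The abstract above combines cues from several modality-specific models into one classifier. As a rough, hypothetical sketch of the general late-fusion idea (not the authors' actual models, weights, or data; the arrays and the fuse_modalities helper below are illustrative only), a weighted average of per-modality class probabilities can be written as:

```python
import numpy as np

def fuse_modalities(prob_list, weights=None):
    """Combine per-modality class-probability vectors by weighted averaging.

    prob_list : list of length-n_classes arrays, one per modality
                (e.g. face model, audio model, bag-of-mouths, autoencoder).
    weights   : optional per-modality weights; defaults to uniform.
    """
    probs = np.stack(prob_list)                # shape (n_modalities, n_classes)
    if weights is None:
        weights = np.ones(len(prob_list))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # convex combination of modalities
    fused = weights @ probs                    # weighted average per class
    return int(np.argmax(fused)), fused

# Hypothetical per-modality outputs over the seven EmotiW emotion classes
face_probs  = np.array([0.05, 0.10, 0.50, 0.10, 0.10, 0.10, 0.05])
audio_probs = np.array([0.10, 0.05, 0.40, 0.15, 0.10, 0.10, 0.10])
label, fused = fuse_modalities([face_probs, audio_probs], weights=[0.7, 0.3])
```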

Journal Article
TL;DR: This paper builds a hierarchical committee of deep CNNs with exponentially-weighted decision fusion for robust facial expression recognition in the third Emotion Recognition in the Wild (EmotiW2015) challenge.
Abstract: This paper describes our approach towards robust facial expression recognition (FER) for the third Emotion Recognition in the Wild (EmotiW2015) challenge. We train multiple deep convolutional neural networks (deep CNNs) as committee members and combine their decisions. To improve this committee of deep CNNs, we present two strategies: (1) in order to obtain diverse decisions from deep CNNs, we vary network architecture, input normalization, and random weight initialization in training these deep models, and (2) in order to form a better committee in structural and decisional aspects, we construct a hierarchical architecture of the committee with exponentially-weighted decision fusion. In solving a seven-class problem of static FER in the wild for the EmotiW2015, we achieve a test accuracy of 61.6 %. Moreover, on other public FER databases, our hierarchical committee of deep CNNs yields superior performance, outperforming or competing with state-of-the-art results for these databases.

219 citations
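One plausible reading of the exponentially-weighted decision fusion described above is to weight each committee member's class posterior by an exponential function of its validation accuracy. The sketch below is only an illustration under that assumption; the member probabilities, accuracies, and the temperature parameter are hypothetical and not the paper's exact formulation.

```python
import numpy as np

def exp_weighted_fusion(member_probs, val_accuracies, temperature=1.0):
    """Fuse committee members' class posteriors, weighting each member
    exponentially by its (hypothetical) validation accuracy."""
    member_probs = np.stack(member_probs)      # shape (n_members, n_classes)
    acc = np.asarray(val_accuracies, dtype=float)
    w = np.exp(acc / temperature)              # exponential weighting of members
    w = w / w.sum()
    return w @ member_probs                    # fused class distribution

# Three hypothetical committee members on a seven-class FER problem
p1 = np.array([0.60, 0.10, 0.10, 0.05, 0.05, 0.05, 0.05])
p2 = np.array([0.30, 0.40, 0.10, 0.05, 0.05, 0.05, 0.05])
p3 = np.array([0.50, 0.20, 0.10, 0.05, 0.05, 0.05, 0.05])
fused = exp_weighted_fusion([p1, p2, p3], val_accuracies=[0.58, 0.55, 0.60])
predicted_class = int(np.argmax(fused))
```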

Journal Article
TL;DR: The multimodal approach increased the recognition rate by more than 10% compared to the most successful unimodal system, and the best bimodal pairing was ‘gesture-speech’.
Abstract: In this paper a study on multimodal automatic emotion recognition during a speech-based interaction is presented. A database was constructed consisting of people pronouncing a sentence in a scenario where they interacted with an agent using speech. Ten people pronounced a sentence corresponding to a command while making 8 different emotional expressions. Gender was equally represented, with speakers of several different native languages including French, German, Greek and Italian. Facial expression, gesture and acoustic analysis of speech were used to extract features relevant to emotion. For the automatic classification of unimodal, bimodal and multimodal data, a system based on a Bayesian classifier was used. After performing an automatic classification of each modality, the different modalities were combined using a multimodal approach. Fusion at the feature level (before running the classifier) and fusion at the results level (combining the outputs of the per-modality classifiers) were compared. Fusing the multimodal data resulted in a large increase in the recognition rates in comparison to the unimodal systems: the multimodal approach increased the recognition rate by more than 10% compared to the most successful unimodal system. Bimodal emotion recognition based on all combinations of the modalities (i.e., ‘face-gesture’, ‘face-speech’ and ‘gesture-speech’) was also investigated. The results show that the best pairing is ‘gesture-speech’. Using all three modalities resulted in a 3.3% classification improvement over the best bimodal results.

218 citations
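The abstract above compares fusion at the feature level and at the results level around a Bayesian classifier. A minimal sketch of that contrast, using scikit-learn's Gaussian naive Bayes and randomly generated stand-ins for the face, gesture and speech features (the data, dimensions, and helper below are hypothetical, not the study's), might look like:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n_samples, n_classes = 200, 8
y = rng.integers(0, n_classes, size=n_samples)
# Hypothetical stand-ins for the extracted face, gesture, and speech features
face    = rng.normal(size=(n_samples, 10))
gesture = rng.normal(size=(n_samples, 6))
speech  = rng.normal(size=(n_samples, 12))

# Feature-level (early) fusion: concatenate features, train one Bayesian classifier
early_model = GaussianNB().fit(np.hstack([face, gesture, speech]), y)

# Results-level (late) fusion: one classifier per modality, combine the posteriors
models = [GaussianNB().fit(X, y) for X in (face, gesture, speech)]

def late_fusion_predict(feature_sets):
    # Product rule over per-modality class posteriors (one common combination scheme)
    posteriors = [m.predict_proba(X) for m, X in zip(models, feature_sets)]
    return np.argmax(np.prod(posteriors, axis=0), axis=1)

preds = late_fusion_predict([face, gesture, speech])
```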

Journal Article
TL;DR: The authors carried out an objective statistical survey across the various sub-disciplines in the field, applying information analysis and network-theory techniques to answer several key questions; the results reveal that there has been sustained growth in this field.
Abstract: Assistive technology for the visually impaired and blind people is a research field that is gaining increasing prominence owing to an explosion of new interest in it from disparate disciplines. The field has a very relevant social impact on our ever-increasing aging and blind populations. While many excellent state-of-the-art accounts have been written to date, all of them are subjective in nature. We performed an objective statistical survey across the various sub-disciplines in the field and applied information analysis and network-theory techniques to answer several key questions relevant to the field. To analyze the field we compiled an extensive database of scientific research publications over the last two decades. We inferred interesting patterns and statistics concerning the main research areas and underlying themes, identified leading journals and conferences, captured growth patterns of the research field, identified active research communities, and present our interpretation of trends in the field for the near future. Our results reveal that there has been sustained growth in this field, from fewer than 50 publications per year in the mid 1990s to close to 400 scientific publications per year in 2014. Assistive technology for persons with visual impairments is expected to grow at a swift pace and to impact the lives of individuals and the elderly in ways not previously possible.

158 citations
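As a toy illustration of the kind of counting behind the reported growth and venue statistics (the records below are hypothetical stand-ins, not the survey's actual database), publications per year and per venue can be tallied like this:

```python
from collections import Counter

# Hypothetical (year, venue) records standing in for the compiled publication database
records = [
    (1995, "ICCHP"), (1996, "ASSETS"), (2013, "CHI"),
    (2014, "ASSETS"), (2014, "JMUI"), (2014, "CHI"),
]

papers_per_year = Counter(year for year, _ in records)    # growth pattern over time
venue_counts    = Counter(venue for _, venue in records)  # leading journals/conferences

print(sorted(papers_per_year.items()))  # e.g. [(1995, 1), (1996, 1), (2013, 1), (2014, 3)]
print(venue_counts.most_common(2))      # e.g. [('ASSETS', 2), ('CHI', 2)]
```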

Journal Article
TL;DR: The authors exploit the proposition, well known in auditory-visual speech processing, that auditory and visual human communication complement each other, and show the proposed framework’s effectiveness in depression analysis.
Abstract: Depression is a severe mental health disorder with high societal costs. Current clinical practice depends almost exclusively on self-report and clinical opinion, risking a range of subjective biases. The long-term goal of our research is to develop assistive technologies to support clinicians and sufferers in the diagnosis and monitoring of treatment progress in a timely and easily accessible format. In the first phase, we aim to develop a diagnostic aid using affective sensing approaches. This paper describes the progress to date and proposes a novel multimodal framework comprising audio-video fusion for depression diagnosis. We exploit the proposition, well known in auditory-visual speech processing, that auditory and visual human communication complement each other, and we investigate this hypothesis for depression analysis. For the video data analysis, intra-facial muscle movements and the movements of the head and shoulders are analysed by computing spatio-temporal interest points. In addition, various audio features (fundamental frequency f0, loudness, intensity and mel-frequency cepstral coefficients) are computed. Next, a bag of visual features and a bag of audio features are generated separately. In this study, we compare fusion methods at the feature level, score level and decision level. Experiments are performed on an age- and gender-matched clinical dataset of 30 patients and 30 healthy controls. The results from the multimodal experiments show the proposed framework’s effectiveness in depression analysis.

148 citations
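The abstract above builds separate bags of visual and audio features and then compares fusion levels. A minimal sketch of the bag-of-features step and of feature-level fusion, using k-means codebooks over randomly generated descriptors (the descriptors, codebook sizes, and the bag_of_features helper are hypothetical, not the paper's exact pipeline), could look like:

```python
import numpy as np
from sklearn.cluster import KMeans

def bag_of_features(descriptors, codebook):
    """Quantise local descriptors against a learned codebook and return a
    normalised histogram, i.e. the 'bag of features' representation."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(1)
# Hypothetical local descriptors: spatio-temporal interest points from the video
# stream and frame-level audio features (f0, loudness, intensity, MFCCs)
video_desc = rng.normal(size=(500, 32))
audio_desc = rng.normal(size=(800, 16))

video_codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(video_desc)
audio_codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(audio_desc)

# Feature-level fusion: concatenate the two histograms before classification;
# score- and decision-level fusion would instead combine classifier outputs.
fused = np.concatenate([bag_of_features(video_desc, video_codebook),
                        bag_of_features(audio_desc, audio_codebook)])
```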

Performance Metrics
No. of papers from the Journal in previous years
Year  Papers
2023  5
2022  17
2021  43
2020  29
2019  32
2018  23