
Showing papers in "Journal on Multimodal User Interfaces in 2013"


Journal ArticleDOI
TL;DR: The proposition that auditory and visual human communication complement each other, which is well known in auditory-visual speech processing, is exploited, and the proposed framework's effectiveness in depression analysis is shown.
Abstract: Depression is a severe mental health disorder with high societal costs. Current clinical practice depends almost exclusively on self-report and clinical opinion, risking a range of subjective biases. The long-term goal of our research is to develop assistive technologies to support clinicians and sufferers in the diagnosis and monitoring of treatment progress in a timely and easily accessible format. In the first phase, we aim to develop a diagnostic aid using affective sensing approaches. This paper describes the progress to date and proposes a novel multimodal framework comprising audio-video fusion for depression diagnosis. We exploit the proposition that auditory and visual human communication complement each other, which is well known in auditory-visual speech processing, and investigate this hypothesis for depression analysis. For the video data analysis, intra-facial muscle movements and the movements of the head and shoulders are analysed by computing spatio-temporal interest points. In addition, various audio features (fundamental frequency f0, loudness, intensity and mel-frequency cepstral coefficients) are computed. Next, a bag of visual features and a bag of audio features are generated separately. In this study, we compare fusion methods at feature level, score level and decision level. Experiments are performed on an age- and gender-matched clinical dataset of 30 patients and 30 healthy controls. The results from the multimodal experiments show the proposed framework's effectiveness in depression analysis.
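
As a concrete illustration of the three fusion levels compared in the abstract, the sketch below contrasts feature-level, score-level and decision-level fusion on synthetic bag-of-feature histograms with scikit-learn SVMs; the data, dimensions and classifier choice are placeholders, not the authors' actual pipeline.

```python
# Toy sketch of feature-, score- and decision-level fusion on synthetic
# bag-of-feature histograms. Everything here is illustrative only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d_audio, d_video = 60, 50, 80          # 60 subjects, two bag-of-words sizes
X_audio = rng.random((n, d_audio))        # stand-in for the audio histogram
X_video = rng.random((n, d_video))        # stand-in for the visual histogram
y = rng.integers(0, 2, n)                 # 0 = control, 1 = depressed (synthetic)

# Feature-level fusion: concatenate the two histograms before training.
clf_feat = SVC().fit(np.hstack([X_audio, X_video]), y)

# Score-level fusion: one classifier per modality, then average their scores.
clf_a = SVC(probability=True).fit(X_audio, y)
clf_v = SVC(probability=True).fit(X_video, y)
scores = (clf_a.predict_proba(X_audio)[:, 1] + clf_v.predict_proba(X_video)[:, 1]) / 2
pred_score_level = (scores > 0.5).astype(int)

# Decision-level fusion: combine hard labels (here: positive if either modality says so).
pred_decision_level = ((clf_a.predict(X_audio) + clf_v.predict(X_video)) >= 1).astype(int)
```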

148 citations


Journal ArticleDOI
TL;DR: The D64 corpus as mentioned in this paper is a multimodal corpus recorded over two successive days and contains annotations on conversational involvement, speech activity and pauses, as well as information on the average degree of change in the movement of participants.
Abstract: In recent years there has been a substantial debate about the need for increasingly spontaneous, conversational corpora of spoken interaction that are not controlled or task directed. In parallel, the need has arisen for the recording of multimodal corpora that are not restricted to the audio domain alone. With a corpus that would fulfill both needs, it would be possible to investigate the natural coupling, not only in turn-taking and voice, but also in the movement of participants. In the following paper we describe the design and recording of such a corpus and provide some illustrative examples of how such a corpus might be exploited in the study of dynamic interaction. The D64 corpus is a multimodal corpus recorded over two successive days. Each day resulted in approximately 4 h of recordings. In total five participants took part in the recordings, of whom two were female and three were male. Seven video cameras were used, of which at least one was trained on each participant. The Optitrack motion capture kit was used to enrich the recordings with motion information. The D64 corpus comprises annotations on conversational involvement, speech activity and pauses, as well as information on the average degree of change in the movement of participants.

79 citations


Journal ArticleDOI
TL;DR: It is indicated that emergent leadership is related, but not equivalent, to dominance, and while multimodal features bring a moderate degree of effectiveness in inferring the leader, much simpler features extracted from the audio channel are found to give better performance.
Abstract: In this paper we present a multimodal analysis of emergent leadership in small groups using audio-visual features and discuss our experience in designing and collecting a data corpus for this purpose. The ELEA Audio-Visual Synchronized corpus (ELEA AVS) was collected using a light portable setup and contains recordings of small group meetings. The participants in each group performed the winter survival task and filled in questionnaires related to personality and several social concepts such as leadership and dominance. In addition, the corpus includes annotations on participants' performance in the survival task, and also annotations of social concepts from external viewers. Based on this corpus, we present the feasibility of predicting the emergent leader in small groups using automatically extracted audio and visual features, based on speaking turns and visual attention, and we focus specifically on multimodal features that make use of the "looking at participants while speaking" and "looking at while not speaking" measures. Our findings indicate that emergent leadership is related, but not equivalent, to dominance, and while multimodal features bring a moderate degree of effectiveness in inferring the leader, much simpler features extracted from the audio channel are found to give better performance.
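
The "looking at while speaking" style measures mentioned above can be pictured with a small sketch: given per-frame speaking and gaze-target annotations, count how much visual attention each participant receives while speaking versus while silent. The annotation format and toy data below are assumptions, not the ELEA tooling.

```python
# Minimal sketch of one multimodal attention cue; the frame-level
# annotations are invented for illustration.
def attention_while_speaking(speaking, gaze):
    """speaking: {person: [bool per frame]}, gaze: {person: [gaze target per frame]}."""
    totals = {p: {"looked_at_speaking": 0, "looked_at_silent": 0} for p in speaking}
    n_frames = len(next(iter(gaze.values())))
    for t in range(n_frames):
        for p in speaking:
            # how many other participants look at p in this frame
            received = sum(1 for other, targets in gaze.items()
                           if other != p and targets[t] == p)
            key = "looked_at_speaking" if speaking[p][t] else "looked_at_silent"
            totals[p][key] += received
    return totals

speaking = {"A": [1, 1, 0, 0], "B": [0, 0, 1, 1]}
gaze = {"A": ["B", "B", "B", "B"], "B": ["A", "A", "A", "A"]}
print(attention_while_speaking(speaking, gaze))
```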

75 citations


Journal ArticleDOI
TL;DR: A systematically annotated speech and gesture corpus consisting of 25 route-and-landmark-description dialogues, the Bielefeld Speech and Gesture Alignment corpus (SaGA), collected in experimental face-to-face settings is discussed.
Abstract: Communicating face-to-face, interlocutors frequently produce multimodal meaning packages consisting of speech and accompanying gestures. We discuss a systematically annotated speech and gesture corpus consisting of 25 route-and-landmark-description dialogues, the Bielefeld Speech and Gesture Alignment corpus (SaGA), collected in experimental face-to-face settings. We first describe the primary and secondary data of the corpus and its reliability assessment. Then we go into some of the projects carried out using SaGA demonstrating the wide range of its usability: on the empirical side, there is work on gesture typology, individual and contextual parameters influencing gesture production and gestures’ functions for dialogue structure. Speech-gesture interfaces have been established extending unification-based grammars. In addition, the development of a computational model of speech-gesture alignment and its implementation constitutes a research line we focus on.

70 citations


Journal ArticleDOI
TL;DR: It is shown how phenomena of fluid real-time conversation, like adapting to user feedback or smooth turn-keeping, can be realized with ASAP and the overall architectural concept is described, along with specific means of specifying incremental behavior in BML and technical implementations of different modules.
Abstract: Embodied conversational agents still do not achieve the fluidity and smoothness of natural conversational interaction. One main reason is that current systems often respond with big latencies and in inflexible ways. We argue that to overcome these problems, real-time conversational agents need to be based on an underlying architecture that provides two essential features for fast and fluent behavior adaptation: a close bi-directional coordination between input processing and output generation, and incrementality of processing at both stages. We propose an architectural framework for conversational agents (ASAP) providing these two ingredients for fluid real-time conversation. The overall architectural concept is described, along with specific means of specifying incremental behavior in BML and technical implementations of different modules. We show how phenomena of fluid real-time conversation, like adapting to user feedback or smooth turn-keeping, can be realized with ASAP, and we describe in detail an example real-time interaction with the implemented system.

37 citations


Journal ArticleDOI
TL;DR: The World Wide Web Consortium's (W3C) Multimodal Architecture and Interfaces (MMI Architecture) standard is described, an architecture and communications protocol that enables a wide variety of independent modalities to be integrated into multimodal applications.
Abstract: This paper describes the World Wide Web Consortium’s (W3C) Multimodal Architecture and Interfaces (MMI Architecture) standard, an architecture and communications protocol that enables a wide variety of independent modalities to be integrated into multimodal applications. By encapsulating the functionalities of modality components and requiring all control information to go through the Interaction Manager, the MMI Architecture simplifies integrating components from multiple sources.
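
A toy sketch of the event flow the standard prescribes is given below: modality components exchange life-cycle events (e.g. StartRequest, StartResponse, DoneNotification) only via the Interaction Manager. The event field names follow the MMI specification, but the Python classes are a simplified stand-in, not a reference implementation.

```python
# Simplified illustration of MMI life-cycle event routing through the
# Interaction Manager; not a W3C reference implementation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LifeCycleEvent:
    name: str                      # e.g. "StartRequest", "StartResponse", "DoneNotification"
    context: str
    source: str
    target: str
    request_id: str
    data: Optional[dict] = None

class ModalityComponent:
    def __init__(self, name):
        self.name = name
    def handle(self, event, im):
        if event.name == "StartRequest":
            # ... render speech / GUI / gesture output here ...
            im.route(LifeCycleEvent("StartResponse", event.context,
                                    self.name, event.source, event.request_id))
            im.route(LifeCycleEvent("DoneNotification", event.context,
                                    self.name, event.source, event.request_id))

class InteractionManager:
    def __init__(self):
        self.components = {}
    def register(self, mc):
        self.components[mc.name] = mc
    def route(self, event):
        # all control information passes through the Interaction Manager
        print(f"{event.source} -> {event.target}: {event.name}")
        if event.target in self.components:
            self.components[event.target].handle(event, self)

im = InteractionManager()
im.register(ModalityComponent("voice"))
im.route(LifeCycleEvent("StartRequest", "ctx-1", "IM", "voice", "req-1"))
```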

33 citations


Journal ArticleDOI
TL;DR: This paper explains how four computational models of emotions that are oriented towards interactive facial animation were implemented and evaluated in the Multimodal Affective and Reactive Character framework, designed for animating interactive expressive virtual agents with different levels of interactivity.
Abstract: Emotions and their expressions by virtual characters will play a key role in future affective human–machine interfaces. Recent advances in the psychology of emotions and recent progress in computer graphics allow researchers to animate virtual characters that are capable of subtly expressing emotions. Existing virtual agent systems are nevertheless often limited in terms of underlying emotional models, visual realism, and real-time interaction capabilities. In this paper, we explain how we explored four computational models of emotions that are oriented towards interactive facial animation. The models that we implemented correspond to different approaches to emotions: a categorical approach, a dimensional approach, a cognitive approach, and a social approach. We explain how we implemented and evaluated these models in our Multimodal Affective and Reactive Character (MARC) framework. MARC is designed for animating interactive expressive virtual agents with different levels of interactivity. The advantages, drawbacks and complementarity of these approaches are discussed.

30 citations


Journal ArticleDOI
TL;DR: Ravel (Robots with Audiovisual Abilities), a publicly available data set which covers examples of Human Robot Interaction (HRI) scenarios, is introduced, illustrating its appropriateness for carrying out a large variety of HRI experiments.
Abstract: We introduce Ravel (Robots with Audiovisual Abilities), a publicly available data set which covers examples of Human Robot Interaction (HRI) scenarios. These scenarios are recorded using the audio-visual robot head POPEYE, equipped with two cameras and four microphones, two of which are plugged into the ears of a dummy head. All the recordings were performed in a standard room with no special equipment, thus providing a challenging indoor scenario. This data set provides a basis to test and benchmark methods and algorithms for audio-visual scene analysis with the ultimate goal of enabling robots to interact with people in the most natural way. The data acquisition setup, sensor calibration, data annotation and data content are fully detailed. Moreover, three examples of using the recorded data are provided, illustrating its appropriateness for carrying out a large variety of HRI experiments. The Ravel data are publicly available at: http://ravel.humavips.eu/ .

27 citations


Journal ArticleDOI
TL;DR: This article describes the experiences integrating JVoiceXML into the W3C MMI architecture and identifies general limitations with regard to the available design space.
Abstract: Research regarding multimodal interaction led to a multitude of proposals for suitable software architectures. With all architectures describing multimodal systems differently, interoperability is severely hindered. The W3C MMI architecture is a proposed recommendation for a common architecture. In this article, we describe our experiences integrating JVoiceXML into the W3C MMI architecture and identify general limitations with regard to the available design space.

27 citations


Journal ArticleDOI
TL;DR: The development of the user interfaces of a multi-device digital coaching service that provides tailored feedback to users concerning their physical activity level and medication intake and the outcomes of a survey study of user preferences are presented.
Abstract: We present the development of the user interfaces of a multi-device digital coaching service that provides tailored feedback to users concerning their physical activity level and medication intake. We present the outcomes of a survey study of user preferences regarding the situation, device and timing of feedback they receive from their personal attentive digital coach. There are clear preferences among the subjects for different types of messages on different devices. Results were implemented in a first prototype. We present the results of a user evaluation with a real version of the digital health coach and compare them with the results of the survey study.

25 citations


Journal ArticleDOI
TL;DR: This paper presents two realistic datasets acquired in a fully equipped Health Smart Home: one related to distress detection from speech, the other involving 15 participants who were performing several instances of seven activities of daily living.
Abstract: Health Smart Homes are nowadays a widely explored research area, due to the need for automation and telemedicine to support people experiencing loss of autonomy, and also due to the evolution of technology that has led to cheap and efficient sensors. However, collecting data in this area is still very challenging. As a consequence, many studies cannot be validated on real data. In this paper, we present two realistic datasets acquired in a fully equipped Health Smart Home. The first is related to distress detection from speech (450 recorded sentences) and involved 10 participants; the second involved 15 participants who were performing several instances of seven activities of daily living (16 hours of multimodal data).

Journal ArticleDOI
TL;DR: This paper presents a natural and intuitive interface, which uses the Microsoft Kinect, to interactively control an armature by tracking body poses, and allows animators to save time with respect to the traditional animation technique based on keyframing.
Abstract: Most virtual characters' animations are based on armatures to manipulate the characters' body parts (rigging). Armatures behave as the characters' skeletons, and their segments are referred to as bones. Each bone of the skeleton is associated with a well-defined set of vertices defining the character's mesh (skinning), thus allowing animators to control movements and deformations of the character itself. This paper presents a natural and intuitive interface, which uses the Microsoft Kinect, to interactively control an armature by tracking body poses. Animators can animate virtual characters in real-time by their own body poses, thus obtaining realistic and smooth animations. Moreover, the proposed interface allows animators to save time with respect to the traditional animation technique based on keyframing. Different examples are used to compare the Kinect-based interface with the keyframing approach, thus obtaining both an objective and a subjective assessment.
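
Conceptually, such an interface maps tracked Kinect joints onto the armature's bones every frame. The sketch below illustrates that mapping with hypothetical placeholder APIs (read_tracked_joints, Armature); the paper's actual implementation and any real Kinect or 3D-engine API will differ.

```python
# Conceptual pose-to-rig mapping: tracked joint rotations drive armature bones.
# `read_tracked_joints` and `Armature` are hypothetical placeholders.
JOINT_TO_BONE = {
    "ShoulderLeft": "upper_arm.L",
    "ElbowLeft": "forearm.L",
    "HipRight": "thigh.R",
    "KneeRight": "shin.R",
}

class Armature:                      # hypothetical stand-in for a rigged character
    def __init__(self):
        self.bones = {}
    def set_bone_rotation(self, bone, quat):
        self.bones[bone] = quat      # a real engine would update the skinned mesh here

def read_tracked_joints():
    # hypothetical: would return {joint_name: quaternion} from the Kinect skeleton
    return {"ShoulderLeft": (1.0, 0.0, 0.0, 0.0)}

def apply_pose(armature):
    for joint, quat in read_tracked_joints().items():
        bone = JOINT_TO_BONE.get(joint)
        if bone:
            armature.set_bone_rotation(bone, quat)

rig = Armature()
apply_pose(rig)                      # call once per captured frame for real-time animation
print(rig.bones)
```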

Journal ArticleDOI
TL;DR: The AGNES project, a home-based system has been developed that allows connecting elderly with their families, friends and other significant people over the Internet, and it was found that use of the AGNES system had positive effects on the mental state of the users, compared to the control group without the technology.
Abstract: Western societies are confronted with a number of challenges caused by the increasing number of older citizens. One important aspect is the need and wish of older people to live as long as possible in their own home and maintain an independent life. As people grow older, their social networks disperse, with friends and families moving to other parts of town, other cities or even countries. Additionally, people become less mobile with age, leading to less active participation in societal life. Combined, this normal, age-related development leads to increased loneliness and social isolation of older people, with negative effects on their mental and physical health. In the AGNES project, a home-based system has been developed that allows connecting elderly people with their families, friends and other significant people over the Internet. As most older people have limited experience with computers and often special requirements on technology, one focus of AGNES was to develop with the users novel technological means for interacting with their social network. The resulting system uses ambient displays, tangible interfaces and wearable devices providing ubiquitous options for interaction with the network, and secondary sensors for additionally generating carefully chosen information on the person to be relayed to significant persons. Evaluations show that the chosen modalities for interaction are well adopted by the users. Furthermore, it was found that use of the AGNES system had positive effects on the mental state of the users, compared to the control group without the technology.

Journal ArticleDOI
TL;DR: The aim is to identify the current gaps that hinder easier and faster development of multimodal mobile applications using model-based approaches; the analysis also considers the inclusion of model-driven engineering features such as guidance and model reuse, which allow models to be used appropriately and to full benefit.
Abstract: Multimodal interaction has become richer in recent years thanks to the rapid evolution of mobile devices (smartphones/tablets) and their embedded sensors, including accelerometer, gyroscope, global positioning system, near field communication and proximity sensors. Using such sensors, either sequentially or simultaneously, to interact with applications ensures intuitive interaction and user acceptance. Today, the development of multimodal mobile systems incorporating input and output modalities through sensors is a long and difficult task. Despite the fact that numerous model-based approaches have emerged and are supposed to simplify the engineering of multimodal mobile applications, these applications are still generally designed and implemented in an ad hoc way. In order to explain this situation, the present paper reviews, discusses, and analyses different model-based approaches proposed to develop multimodal mobile applications. The analysis considers not only the modelling and generation of mobile multimodality features, but also the inclusion of model-driven engineering features, such as guidance and model reuse, which allow models to be used appropriately and to full benefit. Our aim is to identify the current gaps that hinder easier and faster development of multimodal mobile applications using model-based approaches.

Journal ArticleDOI
TL;DR: A verbal comment is characterized as an informative or exclamatory sentence containing an evaluative adjective or other direct or indirect ways to express evaluations.
Abstract: The paper defines the notion of comment as a communicative act that is not requested by the previous turn and conveys additional information to what has been said before, generally concerning an opinion of its Sender and possibly an evaluation. After distinguishing interpretative versus evaluative comments and focusing on the latter, we characterize a verbal comment as an informative or exclamatory sentence containing an evaluative adjective or other direct or indirect ways to express evaluations. Then, a qualitative study is presented on bodily direct and indirect comments in political debates.

Journal ArticleDOI
TL;DR: Gender is revealed as a momentous influencing factor on the way of interaction in multimodal interactive systems, and the role of individuality is pointed out, while the influence of the system output seems to be quite limited.
Abstract: When developing multimodal interactive systems it is not clear which importance should be given to which modality. In order to study influencing factors on multimodal interaction, we conducted a Wizard of Oz study on a basic recurrent task: 53 subjects performed diverse selections of objects on a screen. The way and modality of interaction was not specified nor predefined by the system, and the users were free in how and what to select. Natural input modalities like speech, gestures, touch, and arbitrary multimodal combinations of these were recorded as dependent variables. As independent variables, subjects’ gender, personality traits, and affinity towards technical devices were surveyed, as well as the system’s varying presentation styles of the selection. Our statistical analyses reveal gender as a momentous influencing factor and point out the role of individuality for the way of interaction, while the influence of the system output seems to be quite limited. This knowledge about the prevalent task of selection will be useful for designing effective and efficient multimodal interactive systems across a wide range of applications and domains.

Journal ArticleDOI
TL;DR: The LDOS-PerAff-1 corpus as discussed by the authors is composed of video clips of subjects' affective responses to visual stimuli, annotated with their personality information using the five-factor personality model.
Abstract: We present the LDOS-PerAff-1 Corpus that bridges the affective computing and recommender system research areas, which makes it unique. The corpus is composed of video clips of subjects’ affective responses to visual stimuli. These affective responses are annotated in the continuous valence-arousal-dominance space. Furthermore, the subjects are annotated with their personality information using the five-factor personality model. We also provide the explicit ratings that the users gave to the images used for the visual stimuli. In the paper we present the results of four experiments conducted with the corpus; an affective content-based recommender system, a personality-based collaborative filtering recommender system, an emotion-detection algorithm and a qualitative study of the latent factors.

Journal ArticleDOI
TL;DR: The results of the two sets of experiments indicate in general that head movements, and to a lesser extent facial expressions, are important indicators of feedback, and that gestures and speech disambiguate each other in the machine learning process.
Abstract: This article deals with multimodal feedback in two Danish multimodal corpora, i.e., a collection of map-task dialogues and a corpus of free conversations in first encounters between pairs of subjects. Machine learning techniques are applied to both sets of data to investigate various relations between the non-verbal behaviour—more specifically head movements and facial expressions—and speech with regard to the expression of feedback. In the map-task data, we study the extent to which the dialogue act type of linguistic feedback expressions can be classified automatically based on the non-verbal features. In the conversational data, on the other hand, non-verbal and speech features are used together to distinguish feedback from other multimodal behaviours. The results of the two sets of experiments indicate in general that head movements, and to a lesser extent facial expressions, are important indicators of feedback, and that gestures and speech disambiguate each other in the machine learning process.
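
The classification experiments described above can be pictured with a minimal sketch: binary non-verbal features (head nods, head shakes, facial expressions, speech overlap) feeding a standard classifier evaluated by cross-validation. The feature set, synthetic labels and learner below are illustrative assumptions, not the study's actual setup.

```python
# Rough sketch of predicting feedback vs. non-feedback from non-verbal features.
# Features and labels are synthetic.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 200
# Columns: head-nod present, head-shake present, smile present, speech overlap (all binary)
X = rng.integers(0, 2, (n, 4))
# Synthetic rule: nods and smiles tend to mark feedback
y = ((X[:, 0] + X[:, 2]) >= 1).astype(int)

clf = DecisionTreeClassifier(max_depth=3)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```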

Journal ArticleDOI
TL;DR: A multimodal notification framework allowing the optimal delivery and handling of multimedia requests and medical alerts in a nursing home is presented to enhance the quality of life of elderly people as well as the efficiency of the medical staff.
Abstract: This article presents a multimodal notification framework allowing the optimal delivery and handling of multimedia requests and medical alerts in a nursing home. Multimodal notifications are automatically adapted to different criteria such as the device characteristics, capabilities and modalities, the urgency of the situation, the semantics of the notification, the recipient, etc. This framework is operated with various applications (e.g., health alert, medicine reminder, activity proposition) whose design has been supported by a user requirement analysis conducted with an elderly population and healthcare professionals (e.g., nurses, caregivers). An acceptability study was performed to understand the users' expectations regarding this new technology and modalities. This study was followed by the evaluation of the proposed services with different real end-users at a pilot site. Results of these studies presented in this paper highlight the added value of the proposed framework for enhancing the quality of life of elderly people as well as the efficiency of the medical staff.
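
The adaptation idea can be sketched as a simple rule that matches a notification's urgency, recipient and modality against device capabilities; the device descriptions and rules below are invented for illustration and are not taken from the paper.

```python
# Toy rule-based delivery of notifications to devices; all values are invented.
DEVICES = [
    {"name": "bedside tablet",  "modalities": {"visual", "audio"},   "reach": "resident"},
    {"name": "nurse pager",     "modalities": {"text", "vibration"}, "reach": "staff"},
    {"name": "hallway display", "modalities": {"visual"},            "reach": "staff"},
]

def deliver(notification):
    urgent = notification["urgency"] == "high"
    recipient = "staff" if urgent else notification["recipient"]   # escalate urgent alerts
    for device in DEVICES:
        if device["reach"] == recipient and notification["modality"] in device["modalities"]:
            return f'{notification["text"]} -> {device["name"]} ({notification["modality"]})'
    return "no suitable device: fall back to default channel"

print(deliver({"text": "Medicine reminder", "urgency": "low",
               "recipient": "resident", "modality": "audio"}))
print(deliver({"text": "Fall detected", "urgency": "high",
               "recipient": "resident", "modality": "text"}))
```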

Journal ArticleDOI
TL;DR: This paper presents a new, publicly available dataset, aimed to be used as a benchmark for Point of Gaze (PoG) detection algorithms, which consists of a set of videos recording the eye motion of human participants as they were looking at, or following, a set of predefined points of interest on a computer visual display unit.
Abstract: This paper presents a new, publicly available dataset, aimed to be used as a benchmark for Point of Gaze (PoG) detection algorithms. The dataset consists of two modalities that can be combined for PoG definition: (a) a set of videos recording the eye motion of human participants as they were looking at, or following, a set of predefined points of interest on a computer visual display unit, and (b) a sequence of 3D head poses synchronized with the video. The eye motion was recorded using a Mobile Eye-XG head-mounted infrared monocular camera, and the head position by using a set of Vicon motion capture cameras. The ground truth of the point of gaze and of the head location and direction in three-dimensional space is provided together with the data. The ground truth regarding the point of gaze is known in advance since the participants are always looking at predefined targets on a monitor.

Journal ArticleDOI
TL;DR: The research work presented here focuses more on the cross-cultural aspect of gestural behavior, defining a common corpus construction protocol aiming to identify cultural patterns within non-verbal behavior across cultures, i.e. German, Greek and Italian.
Abstract: A multimodal, cross-cultural corpus of affective behavior is presented in this research work. The corpus construction process, including issues related to the design and implementation of an experiment, is discussed along with resulting acoustic prosody, facial expression and gesture expressivity features. However, the research work presented here focuses more on the cross-cultural aspect of gestural behavior, defining a common corpus construction protocol aiming to identify cultural patterns within non-verbal behavior across cultures, i.e. German, Greek and Italian. Culture-specific findings regarding gesture expressivity are derived from the affective analysis performed. Additionally, the multimodal aspect, including prosody and facial expressions, is researched in terms of fusion techniques. Finally, a release plan of the corpus to the public domain is discussed, aiming to establish the current corpus as a benchmark multimodal, cross-cultural standard and reference point.

Journal ArticleDOI
TL;DR: A method to induce, record and annotate natural emotions was used to provide multimodal data for dynamic emotion recognition from facial expressions and speech prosody; results from a dynamic recognition algorithm, based on recurrent neural networks, indicate that multi-modal processing surpasses both speech and visual analysis by a wide margin.
Abstract: Recording and annotating a multimodal database of natural expressivity is a task that requires careful planning and implementation, before even starting to apply feature extraction and recognition algorithms. Requirements and characteristics of such databases are inherently different than those of acted behaviour, both in terms of unconstrained expressivity of the human participants, and in terms of the expressed emotions. In this paper, we describe a method to induce, record and annotate natural emotions, which was used to provide multimodal data for dynamic emotion recognition from facial expressions and speech prosody; results from a dynamic recognition algorithm, based on recurrent neural networks, indicate that multimodal processing surpasses both speech and visual analysis by a wide margin. The SAL database was used in the framework of the Humaine Network of Excellence as a common ground for research in everyday, natural emotions.
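
A minimal numpy sketch of the described setup follows: prosody and facial-expression feature streams are fused frame by frame and passed through a small Elman-style recurrent network that classifies the sequence. Dimensions, weights and data are synthetic; the paper's actual recurrent architecture is not reproduced.

```python
# Toy Elman-style recurrence over fused prosody + facial feature frames.
import numpy as np

rng = np.random.default_rng(2)
T, d_prosody, d_face, d_hidden, n_classes = 30, 6, 10, 16, 4

prosody = rng.standard_normal((T, d_prosody))   # e.g. pitch/energy per frame
face = rng.standard_normal((T, d_face))         # e.g. facial animation parameters
x = np.hstack([prosody, face])                  # frame-level (feature) fusion

W_in = rng.standard_normal((d_hidden, d_prosody + d_face)) * 0.1
W_rec = rng.standard_normal((d_hidden, d_hidden)) * 0.1
W_out = rng.standard_normal((n_classes, d_hidden)) * 0.1

h = np.zeros(d_hidden)
for t in range(T):                              # simple recurrent state update
    h = np.tanh(W_in @ x[t] + W_rec @ h)
logits = W_out @ h                              # classify the whole sequence
probs = np.exp(logits) / np.exp(logits).sum()
print("predicted emotion class:", int(probs.argmax()))
```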

Journal ArticleDOI
TL;DR: This work proposes to decrease the manual customisation effort by addressing the cold-start adaptation problem, i.e., predicting interface preferences of individuals and groups for new (unseen) combinations of applications, tasks and devices, based on knowledge regarding preferences of other users.
Abstract: Interaction in smart environments should be adapted to the users' preferences, e.g., utilising modalities appropriate for the situation. While manual customisation of a single application could be feasible, this approach would require too much user effort in the future, when a user interacts with numerous applications with different interfaces, such as a smart car, a smart fridge, a smart shopping assistant, etc. Supporting user groups, jointly interacting with the same application, poses additional challenges: humans tend to respect the preferences of their friends and family members, and thus the preferred interface settings may depend on all group members. This work proposes to decrease the manual customisation effort by addressing the cold-start adaptation problem, i.e., predicting interface preferences of individuals and groups for new (unseen) combinations of applications, tasks and devices, based on knowledge regarding preferences of other users. For predictions we suggest several reasoning strategies and employ a classifier selection approach for automatically choosing the most appropriate strategy for each interface feature in each new situation. The proposed approach is suitable for cases where long interaction histories are not yet available, and it is not restricted to similar interfaces and application domains, as we demonstrate by experiments on predicting preferences of individuals and groups for three different application prototypes: a recipe recommender, a cooking assistant and a car servicing assistant. The results show that the proposed method handles the cold-start problem in various types of unseen situations fairly well: it achieved an average prediction accuracy of 72 ± 1 %. Further studies on user acceptance of predictions with two different user communities have shown that this is a desirable feature for applications in smart environments, even when predictions are not so accurate and when users do not perceive manual customisation as very time-consuming.
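
The classifier-selection step can be pictured as follows: for each interface feature, score a few candidate prediction strategies by cross-validation on known users and apply the winner to the new, unseen situation. The candidate strategies, feature names and data in this sketch are invented stand-ins.

```python
# Toy per-feature strategy selection for cold-start preference prediction.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.integers(0, 2, (80, 5))       # user/application/device descriptors (synthetic)
features = {                          # one label vector per interface feature
    "use_speech_output": rng.integers(0, 2, 80),
    "large_font": (X[:, 0] | X[:, 1]),
}
strategies = {
    "majority": DummyClassifier(strategy="most_frequent"),
    "nearest_users": KNeighborsClassifier(n_neighbors=5),
}

new_situation = rng.integers(0, 2, (1, 5))      # unseen application/task/device combination
for name, y in features.items():
    best = max(strategies, key=lambda s: cross_val_score(strategies[s], X, y, cv=5).mean())
    pred = strategies[best].fit(X, y).predict(new_situation)[0]
    print(f"{name}: strategy={best}, predicted preference={pred}")
```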

Journal ArticleDOI
TL;DR: An annotation scheme for a political debate dataset which is mainly in the form of video annotations and audio annotations is introduced and an automatic categorization system based on a multimodal parametrization is successfully performed.
Abstract: The interaction between Members of Parliament (MPs) is convention-based and rule-regulated. As instantiations of individual and group confrontations, parliamentary debates display well-regulated competing discursive processes. Unauthorised interruptions are spontaneous verbal reactions of MPs who interrupt the current speaker. This paper focuses on the answers of the current speaker to these disruptions. It introduces an annotation scheme for a political debate dataset which is mainly in the form of video annotations and audio annotations. The annotations contain information ranging from general linguistic to domain specific information. Some is annotated with automatic tools, and some is manually annotated. One of the goals is to use the information to predict the categories of the answers by the speaker to the disruptions. A typology of such answers is proposed and an automatic categorization system based on a multimodal parametrization is successfully performed.

Journal ArticleDOI
TL;DR: The Filipino multimodal emotion database FilMED was built with the purpose of developing affective systems for TALA, which is an ambient intelligent empathic space and three automatic affect recognition systems that used FilMED to build the affect models were presented.
Abstract: This paper describes the Filipino multimodal emotion database (FilMED). FilMED was built with the purpose of developing affective systems for TALA, which is an ambient intelligent empathic space. We collected a total of 11,430 audio–video clips showing acted and spontaneous expressions of emotion involving 25 subjects. We used Filipino emotion labels to annotate the emotion, which includes: kasiyahan (happiness), kalungkutan (sadness), galit (anger), takot (fear), gulat (surprise), and pandidiri (disgust). We also engaged 20 coders to annotate the clips with valence and arousal values using Feeltrace. To show the usefulness of the database, we presented three automatic affect recognition systems that used FilMED to build the affect models.

Journal ArticleDOI
TL;DR: This study suggests that relevant interfaces could improve emotional recognition and thus facilitate distant communications.
Abstract: The use of facial interfaces in distant communications highlights the relevance of emotional recognition. However, research on emotional facial expression (EFE) recognition is mainly based on static and posed stimuli, and its results do not transfer well to daily interactions. The purpose of the present study is to compare emotional recognition of authentic EFEs across 11 different interface designs. A widget allowing participants both to recognize an emotion and to assess it on-line was used. Divided-face and compound-face interfaces are compared with a common full frontal interface. Analytic and descriptive on-line results reveal that some interfaces facilitate emotional recognition whereas others impair it. This study suggests that relevant interfaces could improve emotional recognition and thus facilitate distant communications.

Journal ArticleDOI
TL;DR: This work evaluates a backchannel synthesis algorithm for speaker–listener dialogs using an asymmetric version of the SWOZ framework, and reveals patterns of inappropriate behavior in terms of quantity and timing of backchannels.
Abstract: The Switching Wizard of Oz (SWOZ) is a setup to evaluate human behavior synthesis algorithms in online face-to-face interactions. Conversational partners are represented to each other as virtual agents, whose animated behavior is either based on a synthesis algorithm, or driven by the actual behavior of the conversational partner. Human and algorithm have the same expression capabilities. The source is switched at random intervals, which means that the algorithm's behavior can only be identified when it deviates from what is regarded as appropriate. The SWOZ approach is especially suitable for the controlled evaluation of synthesis algorithms that consider a limited set of behaviors. We evaluate a backchannel synthesis algorithm for speaker–listener dialogs using an asymmetric version of the framework. Human speakers talk to virtual listeners that are either controlled by human listeners or by an algorithm. Speakers indicate when they feel they are no longer talking to a human listener. Analysis of these responses reveals patterns of inappropriate behavior in terms of quantity and timing of backchannels. These insights can be used to improve synthesis algorithms.
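
The core switching mechanism can be sketched in a few lines: the source driving the virtual listener alternates between the human listener and the synthesis algorithm at random intervals. The timing parameters below are illustrative, not those used in the study.

```python
# Toy schedule generator for SWOZ-style random source switching.
import random

def swoz_schedule(total_s=120, min_interval_s=5, max_interval_s=20, seed=4):
    """Return a list of (start_time, source) segments for one session."""
    random.seed(seed)
    t, source, segments = 0.0, random.choice(["human", "algorithm"]), []
    while t < total_s:
        segments.append((round(t, 1), source))
        t += random.uniform(min_interval_s, max_interval_s)   # random switch interval
        source = "algorithm" if source == "human" else "human"
    return segments

for start, source in swoz_schedule():
    print(f"{start:6.1f}s  listener driven by: {source}")
```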