Showing papers by "Sebastian Möller" published in 2012


Journal ArticleDOI
TL;DR: It is shown that a certain EEG technique, event-related-potential (ERP) analysis, is a useful and valid tool in quality research, and that quality degradations can be monitored in conscious and presumably non-conscious stages of processing.
Abstract: Common speech quality evaluation methods rely on self-reported opinions after perceiving test stimuli. Whereas these methods, when carefully applied, provide valid and reliable quality indices, they provide little insight into the processes underlying perception and judgment. In this paper, we analyze the performance of electroencephalography (EEG) for indicating different types of degradations in speech stimuli. We show that a certain EEG technique, event-related-potential (ERP) analysis, is a useful and valid tool in quality research. Three experiments are reported which show that quality degradations can be monitored in conscious and presumably non-conscious stages of processing. Potential and limitations of the approach are discussed and lines of future research are drawn.

75 citations
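
As a rough illustration of the ERP analysis named in the abstract above: an ERP is commonly obtained by cutting the continuous EEG into epochs time-locked to stimulus onsets, subtracting a pre-stimulus baseline, and averaging. The sketch below is not the authors' pipeline; the function name `average_erp`, the sample-based `pre`/`post` windows, and the synthetic data are assumptions made for illustration.

```python
from statistics import fmean


def average_erp(signal, event_samples, pre, post):
    """Average baseline-corrected epochs time-locked to the given events.

    signal        -- list of EEG samples from one channel (hypothetical data)
    event_samples -- sample indices of stimulus onsets
    pre, post     -- number of samples kept before / after each onset
    """
    epochs = []
    for onset in event_samples:
        start, stop = onset - pre, onset + post
        if start < 0 or stop > len(signal):
            continue  # skip epochs that run off the recording
        epoch = signal[start:stop]
        # Baseline: mean of the pre-stimulus part of this epoch.
        baseline = fmean(epoch[:pre]) if pre > 0 else 0.0
        epochs.append([x - baseline for x in epoch])
    if not epochs:
        raise ValueError("no complete epochs found")
    # Point-wise mean across epochs.
    return [sum(column) / len(epochs) for column in zip(*epochs)]


# Synthetic example: two events, each followed by a deflection 2 samples later.
eeg = [0.0] * 50
eeg[12], eeg[32] = 2.0, 4.0
erp = average_erp(eeg, event_samples=[10, 30], pre=5, post=5)
```

Averaging is what makes the stimulus-locked response visible: activity unrelated to the stimulus tends to cancel across epochs, while the time-locked component remains.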


Journal ArticleDOI
TL;DR: A taxonomy of the most relevant QoS and QoE aspects which result from multimodal human–machine interactions is developed, which is meant to guide system evaluation and make it more systematic and comparable.
Abstract: Quality of Service (QoS) and Quality of Experience (QoE) have to be considered when designing, building and maintaining services involving multimodal human–machine interaction. In order to guide the assessment and evaluation of such services, we first develop a taxonomy of the most relevant QoS and QoE aspects which result from multimodal human–machine interactions. It consists of three layers: (1) The quality factors influencing QoS and QoE related to the user, the system, and the context of use; (2) the QoS interaction performance aspects describing user and system behavior and performance; and (3) the QoE aspects related to the quality perception and judgment processes taking place within the user. For each of these layers, we then provide metrics which are able to capture the QoS and QoE aspects in a quantitative way, either via questionnaires or performance measures. The metrics are meant to guide system evaluation and make it more systematic and comparable.

43 citations


Proceedings ArticleDOI
05 Jul 2012
TL;DR: Electroencephalography is used to assess the cognitive state of subjects in relation to the quality of auditory speech stimuli. Users listening to degraded audio rated the quality lower than for an undisturbed stimulus, as expected, and became more fatigued during the 20-minute presentation.
Abstract: Common methods to determine the quality of media rely on conscious ratings of a subject's opinion about the quality of presented stimuli. While such methods provide a reliable and valid means of determining quality, they provide little insight into the physiological processes preceding the quality judgment, which, however, may affect the subjective behavior, e.g., in terms of alertness or media usage duration. In this paper we used a non-intrusive physiological method, electroencephalography, to assess the cognitive state of subjects related to the quality of auditory speech stimuli. We show that users listening to degraded audio rated the quality lower in comparison to an undisturbed stimulus as expected and, in addition, got more fatigued during the 20 minute presentation. Indicators of the increased fatigue were Theta and Alpha frequencies of the electroencephalogram data. The results show that the perception of degraded media has long-term influences on physiological processes at the time scale of minutes which may immediately influence customer behavior.

34 citations
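
The abstract above names Theta and Alpha frequencies as indicators of fatigue. A minimal sketch of how power in such bands can be estimated from a single EEG channel is given below, using a plain periodogram. The band edges (4 to 8 Hz for Theta, 8 to 13 Hz for Alpha) are conventional values assumed here, not taken from the paper, and `band_power` is a hypothetical helper.

```python
import cmath
import math


def band_power(samples, fs, f_lo, f_hi):
    """Mean periodogram power of `samples` in the band [f_lo, f_hi) Hz.

    samples -- list of EEG samples from one channel
    fs      -- sampling rate in Hz
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]  # remove the DC offset
    powers = []
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if f_lo <= freq < f_hi:
            # Direct DFT coefficient for bin k (fine for short windows).
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(centered)
            )
            powers.append(abs(coeff) ** 2 / n)
    if not powers:
        raise ValueError("no frequency bins fall inside the requested band")
    return sum(powers) / len(powers)


# Synthetic check: a pure 10 Hz oscillation should show up in Alpha, not Theta.
fs = 64
sig = [math.sin(2 * math.pi * 10 * t / fs) for t in range(128)]
theta = band_power(sig, fs, 4.0, 8.0)
alpha = band_power(sig, fs, 8.0, 13.0)
```

For a study like the one described, such band powers would be computed in successive windows across the 20-minute presentation and compared between the degraded and undisturbed conditions.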


Proceedings ArticleDOI
05 Jul 2012
TL;DR: It is shown that the more degraded a video is, the earlier and higher the P300 amplitude rises, and that the peak amplitudes are highly correlated with the audiovisual Mean Opinion Score (MOS).
Abstract: The subjective evaluation of video quality mostly relies on opinion tests in which test participants judge perceived quality on rating scales. However, these methods provide limited insight into how the quality judgments are formed in the brain. In past studies we showed the general feasibility of complementing opinion tests with physiological measures, such as electroencephalography (EEG), for pure video and audio experiments. To establish EEG as a reliable complementary measurement method in standard quality rating tests, the next step is to validate the method in the audiovisual domain. For this purpose we conducted an experiment using audiovisual stimuli and degraded these in both modalities. We show that the more degraded a video is, the earlier and higher the P300 amplitude rises. In addition, the peak amplitudes are highly correlated with the audiovisual Mean Opinion Score (MOS).

28 citations
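
The relationship reported above (P300 peak amplitude versus MOS) can be sketched in two steps: pick the peak of an averaged ERP inside a search window, then correlate the per-condition peaks with the per-condition MOS. The window limits and the helper names `peak_in_window` and `pearson_r` are assumptions for illustration; the abstract does not state how the peaks were extracted.

```python
import math


def peak_in_window(erp, fs, t_start, t_end):
    """Return (latency_s, amplitude) of the maximum of `erp` between
    t_start and t_end seconds after stimulus onset (sample 0 = onset)."""
    lo = int(round(t_start * fs))
    hi = min(len(erp), int(round(t_end * fs)) + 1)
    if lo >= hi:
        raise ValueError("search window lies outside the ERP")
    idx = max(range(lo, hi), key=lambda i: erp[i])
    return idx / fs, erp[idx]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)


# Toy ERP sampled at 10 Hz; search between 0.1 s and 0.4 s after onset.
latency, amplitude = peak_in_window([0, 1, 5, 2, 0, 9], fs=10,
                                    t_start=0.1, t_end=0.4)
```

With one peak amplitude and one MOS value per degradation level, `pearson_r(peaks, mos_values)` yields the kind of correlation the abstract refers to.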


Proceedings Article
01 Jan 2012
TL;DR: This paper contributes methodology and understanding on how to assess the perceived personality of an unknown speaker by humans and machines.
Abstract: In this paper, we present ongoing experiments and insights regarding automatic and human assessment of perceived personality. While within the INTERSPEECH Speaker Trait Challenge participants will train systems in order to recognize binary targets along the Big 5 personality traits, we will analyze and discuss properties of the data, the labeling scheme and the predictive quality. Conducting factor analyses, estimating reliability, and building regression models capturing dimensions of personality, we compare all results to our own work and introduce a new extension of our personality database. In conclusion, this paper contributes methodology and understanding on how to assess the perceived personality of an unknown speaker by humans and machines. Index Terms: extra-linguistic speech properties, personality modeling from speech, speaker characteristics

18 citations


Proceedings ArticleDOI
01 Dec 2012
TL;DR: Three dimensions can be assigned to naturalness of voice, temporal distortions and calmness in text-to-speech systems and will be used in the future to build a dimension-based quality predictor for synthetic speech.
Abstract: This paper presents research on perceptual quality dimensions of synthetic speech. We generated 57 stimuli from 16/19 female/male German text-to-speech systems (TTS) and asked listeners to judge the perceptual distances between them in a sorting task. Through a subsequent multidimensional scaling algorithm, we extracted three dimensions. Via expert listening and a comparison to ratings gathered on 16 attribute scales, the three dimensions can be assigned to naturalness of voice, temporal distortions and calmness. These dimensions are discussed in detail and compared to the perceptual quality dimensions from previous multidimensional analyses. Moreover, the results are analyzed depending on the type of TTS system. The identified dimensions will be used in the future to build a dimension-based quality predictor for synthetic speech.

15 citations
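
Multidimensional scaling needs a matrix of pairwise dissimilarities as input. One common way to derive such a matrix from a sorting task is to count how often two stimuli were placed in different groups. Whether the study above used exactly this measure is not stated in the abstract, so the sketch below is an assumption, and `sorting_dissimilarity` is a hypothetical helper.

```python
from itertools import combinations


def sorting_dissimilarity(sortings, stimuli):
    """Build a symmetric dissimilarity matrix from sorting data.

    sortings -- one partition per listener; each partition is a list of
                groups, each group a collection of stimulus ids
    stimuli  -- ordered list of all stimulus ids (fixes row/column order)

    Entry [i][j] is the share of listeners who put stimuli i and j into
    different groups (0.0 = always together, 1.0 = never together).
    """
    index = {s: i for i, s in enumerate(stimuli)}
    n = len(stimuli)
    together = [[0] * n for _ in range(n)]
    for partition in sortings:
        for group in partition:
            for a, b in combinations(group, 2):
                i, j = index[a], index[b]
                together[i][j] += 1
                together[j][i] += 1
    listeners = len(sortings)
    return [
        [0.0 if i == j else 1.0 - together[i][j] / listeners for j in range(n)]
        for i in range(n)
    ]


# Two listeners sorting three stimuli.
matrix = sorting_dissimilarity(
    sortings=[[{"a", "b"}, {"c"}], [{"a"}, {"b", "c"}]],
    stimuli=["a", "b", "c"],
)
```

The resulting matrix can be handed to any MDS implementation that accepts precomputed dissimilarities; the number of dimensions to keep (three in the study) is then chosen from the fit.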


DOI
01 Jan 2012
TL;DR: The Dagstuhl Seminar 12181 focused on the further development of an agreed definition of the term Quality of Experience (QoE) in collaboration with the COST Action IC1003 "Qualinet", as well as inventories of possibilities to measure QoE (beyond the usual user polls) and to exploit feedback between users and systems that reflects QoE issues.
Abstract: This report documents the program and the outcomes of Dagstuhl Seminar 12181 "Quality of Experience: From User Perception to Instrumental Metrics". As follow-up of the Dagstuhl Seminar 09192 "From Quality of Service to Quality of Experience", it focused on the further development of an agreed definition of the term Quality of Experience (QoE) in collaboration with the COST Action IC1003 "Qualinet", as well as inventories of possibilities to measure QoE (beyond the usual user polls) and to exploit feedback between users and systems that reflects QoE issues. The report furthermore describes the mode of work throughout the seminar, with focus on personal statements by the participants, results of the group works, and open challenges.

14 citations


Journal ArticleDOI
TL;DR: The results highlight a strong potential for instrumental estimation techniques of TTS quality, with the F0 slope within voiced segments proving particularly useful when integrated in a nonlinear fashion, whereas measures of durational variation perform comparatively weakly.
Abstract: Formal parameters of speech prosody are investigated concerning their ability to estimate the perceptual quality of text-to-speech (TTS) signals. The study is carried out for the German language using a broad database comprising a wide range of TTS systems and text materials. 18 purely acoustic markers, derived from F0 and vocalic/consonantal durations, are analysed individually and in conjunction via cross-validated regression models. The F0 slope within voiced segments proves particularly useful when integrated in a nonlinear fashion, whereas measures of durational variation perform comparatively weakly. The results highlight a strong potential for instrumental estimation techniques of TTS quality.

13 citations


Proceedings Article
30 Jan 2012
TL;DR: Effects for gender, but not for age, have been found for modality preferences: women prefer touch and voice over gesture for many scales assessed, whereas men do not show this pattern consistently.
Abstract: In order to examine user group differences in modality preferences, participants of either gender and two age groups have been asked to rate their experience after interacting with a smart-home system offering unimodal and multimodal input possibilities (voice, free-hand gesture, smartphone touch screen). Effects for gender, but not for age (younger and older adults), have been found for modality preferences. Women prefer touch and voice over gesture for many scales assessed, whereas men do not show this pattern consistently. Instead, they prefer gesture over voice for hedonic quality scales. Comparable results are obtained for technological expertise assessed individually. This interrelation of gender and expertise could not be resolved and is discussed along with consequences of the results obtained. Keywords: multimodal dialog system; evaluation; user factors.

12 citations


Proceedings ArticleDOI
21 Sep 2012
TL;DR: Observed gender differences in the perception of security of mobile phones are described, especially regarding authentication and payment-related features and applications on smartphones.
Abstract: In this paper we describe observed gender differences in the perception of security of mobile phones, especially regarding authentication and payment-related features and applications on smartphones. The data was gathered in a focus group, two surveys and an experiment between November 2009 and September 2011. The data shows significant differences in perceived security and future use of security-related features and applications.

12 citations


Proceedings ArticleDOI
05 Jul 2012
TL;DR: This pilot study explores the use of neurophysiological signals as correlates of image preference characterization. Results are promising: mental states associated with preferred and non-preferred images, as well as a baseline neutral state, could be classified at above-chance levels.
Abstract: Image preference is a subjective factor which plays an important role in Quality-of-Experience (QoE) modelling. Traditionally, preference characterization has been quantified via questionnaires or subjective evaluations. Current advances in neurophysiological signal acquisition, however, have allowed for such “non-measurable” subjective parameters to be quantified objectively. In this pilot study, we explore the use of neurophysiological signals as correlates of image preference characterization. Experiments with seven participants have shown promising results: mental states associated with preferred and non-preferred images, as well as a baseline neutral state, could be classified at above-chance levels.

Journal ArticleDOI
TL;DR: The nine papers in this special issue report the latest findings in subjective and objective methodologies for audio and visual signal processing.
Abstract: The nine papers in this special issue report the latest findings in subjective and objective methodologies for audio and visual signal processing.

Proceedings ArticleDOI
04 Dec 2012
TL;DR: It is concluded that sounds can be unobtrusive, but still convey their intended meaning in a working context as well as in a leisure time situation without being perceived as disturbing.
Abstract: To give feedback on mobile devices, sound is commonly used in different ways. Much research has focused on the learnability of, and user performance with, systems that have audio feedback. But so far, there is no standardized method to evaluate the subjective quality of auditory feedback messages. We describe a study to investigate the affective impression of short audio feedback on mobile devices and their functional connotation in three different contexts. Results show that context influences the affective impression of sounds and that there is a relation between ratings of affective quality and functional applicability. We conclude that sounds can be unobtrusive, but still convey their intended meaning in a working context as well as in a leisure time situation without being perceived as disturbing.


Proceedings Article
01 Jan 2012
TL;DR: Novel approaches for the construction of non-intrusive quality estimators are presented and a substantial degree of systematic influence of prosodic variation on TTS quality is revealed.
Abstract: We present a study on the relation between fundamental frequency (F0) and its perceptual effect in the context of text-to-speech (TTS) synthesis. Features that essentially capture the intonational (macro-prosodic) properties of spoken speech are introduced and analysed with regard to the following questions: (i) How does the prosodic variation of TTS signals differ from natural speech? (ii) Is there a functional relationship between the prosodic variation of TTS signals and its perceived quality? In answering these questions we present novel approaches for the construction of non-intrusive quality estimators. The results reveal a substantial degree of systematic influence of prosodic variation on TTS quality.

Journal Article
TL;DR: A new intrusive speech quality model, called Diagnostic Instrumental Assessment of Listening quality (DIAL), is described. It provides diagnostic information in both narrow-band and super-wideband contexts and, contrary to previous methods, estimates scores on four perceptual quality dimensions.
Abstract: Speech quality models usually estimate the integral quality of the degraded speech files. Such quality values do not inform system developers and telephone service providers on the perceived degradation introduced by the system under study. This paper describes a new intrusive speech quality model, called Diagnostic Instrumental Assessment of Listening quality (DIAL), providing diagnostic information in both narrow-band and super-wideband contexts. Contrary to previous methods, this model estimates scores on four perceptual quality dimensions: directness/frequency content, continuity, noisiness, and loudness. These four dimensions are assumed to define the whole speech quality space.

01 Jan 2012
TL;DR: This article uses parameters of the Fujisaki model, which describes the pitch contour of a speech signal through base frequency, phrase commands, and accent commands, for the quality prediction of synthetic speech.
Abstract: This paper presents research on the use of Fujisaki parameters for the quality prediction of synthetic speech. The Fujisaki model describes the pitch contour of a speech signal through the parameters base frequency, phrase commands, and accent commands. While the base frequency represents the minimum F0 value in the signal, the phrase commands describe the slowly varying components and the accent commands indicate local peaks in the contour. The Fujisaki parameters were assessed for four independent auditorily evaluated databases consisting of synthetic speech generated by over 20 different text-to-speech (TTS) systems. The prosody generation techniques of these systems are unknown to us, i.e. the systems may or may not base their prosody on a Fujisaki-like model. The extracted parameters were used to calculate 47 features (e.g. mean distance between phrase commands, variance of accent command amplitudes, etc.). A stepwise multiple linear regression of these features with the overall quality judgement (MOS) as the response variable led to one quality prediction model per gender. A leave-one-out cross-validation showed the stability of both models. The Pearson correlation R between predicted MOS and auditory MOS was computed per gender and database. The mean correlation reached a value of R > .50. Even though the computed Fujisaki features do not fully capture the auditory quality of TTS stimuli, both models will be helpful for predicting TTS quality. Especially in combination with other features, an increase in accuracy is to be expected.
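
The evaluation scheme described above (fit a regression, check it with leave-one-out cross-validation, report the Pearson correlation between predicted and auditory MOS) can be sketched as follows. To stay self-contained, the sketch uses a single hypothetical feature and ordinary least squares in place of the stepwise multiple regression over 47 features used in the paper; all names and data are illustrative assumptions.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature is constant; slope is undefined")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def leave_one_out_predictions(xs, ys):
    """Predict each item with a model trained on all other items."""
    predictions = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(slope * xs[i] + intercept)
    return predictions


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )


# Hypothetical feature values (one per stimulus) and matching MOS values.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
mos = [3.0, 5.0, 7.0, 9.0, 11.0]  # exactly linear, for illustration only
predicted = leave_one_out_predictions(feature, mos)
r = pearson_r(predicted, mos)
```

Because each prediction comes from a model that never saw the item being predicted, the resulting correlation is a less optimistic estimate than one computed on the training data.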

01 Jan 2012
TL;DR: In this article, the authors investigate whether the social facilitation effect, i.e. the change in performance depending on the presence or absence of others, can also be observed while interacting with anthropomorphic robots.
Abstract: The social facilitation effect is a well-known social-psychological phenomenon. It describes how performance changes depending on the presence or absence of others. The current study investigates if the social facilitation effect can also be observed while interacting with anthropomorphic robots.

Proceedings ArticleDOI
09 Sep 2012
TL;DR: It is shown that predictions of modality choice as well as the overall system quality are possible, and an age effect is observed: if older adults are included, predictions are less precise.
Abstract: A standardized procedure to evaluate the perceived quality of multimodal systems is still lacking. Previous research has, however, shown that the quality ratings for a multimodal system are equal to the weighted sum of the quality ratings of its individual modalities, with the modality that is more frequently used having a stronger influence. These findings suggest that if the choice of modality can be predicted, an estimation of the quality of the multimodal system is possible, based solely on an evaluation of its component modalities. Accordingly, the current study investigates the prediction of modality choice based on quality ratings of the component modalities, in order to achieve accurate quality predictions for multimodal systems. It is shown that predictions of modality choice as well as the overall system quality are possible. Furthermore, an age effect is observed: if older adults are included, predictions are less precise.
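
The weighted-sum relationship mentioned above can be written down in a few lines. The abstract only says that more frequently used modalities weigh more; setting each weight to the modality's share of all interactions is an assumption made here for illustration, and `multimodal_quality` is a hypothetical helper.

```python
def multimodal_quality(ratings, usage_counts):
    """Estimate overall quality as a usage-weighted sum of modality ratings.

    ratings      -- mapping modality -> quality rating of that modality alone
    usage_counts -- mapping modality -> how often the modality was chosen
    """
    total = sum(usage_counts[m] for m in ratings)
    if total <= 0:
        raise ValueError("at least one modality must have been used")
    # Weight = share of interactions carried out with that modality (assumed).
    return sum(ratings[m] * usage_counts[m] / total for m in ratings)


# Voice was used three times as often as touch.
overall = multimodal_quality(
    ratings={"voice": 4.0, "touch": 2.0},
    usage_counts={"voice": 3, "touch": 1},
)
```

If modality choice can be predicted, the predicted usage shares can take the place of observed counts, which is the step the study investigates.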

Proceedings ArticleDOI
14 Oct 2012
TL;DR: MoCCha is presented, a mobile campus application used not only as a subject of research, but as a research platform for a number of scientific disciplines, aiming at the ecological validity that human-subject studies in lab environments potentially lack.
Abstract: In this paper, we present MoCCha, a mobile campus application used not only as a subject of research, but as a research platform for a number of scientific disciplines. Using apps that are available from mobile application stores enables studying user behavior in the field, aiming at the ecological validity that human-subject studies in lab environments potentially lack.

Proceedings ArticleDOI
09 Sep 2012
TL;DR: A model of the belief the user has about the system state and a model of vocabulary alignment are proposed; parameters derived from these models are significantly correlated with the users’ quality perception.
Abstract: As spoken dialog systems become more complex, efficient ways to evaluate them in early development stages are required. User simulation has been successfully used for this purpose. While current user models describe behavior on the level of overt behavior, modeling aspects of cognition can reveal direct insights into usability problems. Thus, in this paper we propose two models related to grounding in dialog: a model of the belief the user has about the system state, and a model of vocabulary alignment. We show that parameters derived from these models are significantly correlated with the users’ quality perception.


Proceedings ArticleDOI
05 Jul 2012
TL;DR: This case study evaluates the performance of multimedia presentations of radiological findings in communicating propositional facts about a patient's health status and shows a slightly higher efficiency of the audiovisual compared to textual or visual-only presentation, and a preference of users for the interactive audiovisual or visual tool.
Abstract: Multimedia presentations are expected to show a higher performance in communicating complex and relational information than single media presentations. In this case study, we evaluate the performance of multimedia presentations of radiological findings in communicating propositional facts about a patient's health status. In collaboration with experts and future users, three radiological communication tools have been developed which differ with respect to the presentation media and interactivity. These tools have been evaluated in a controlled user study concerning their efficiency in communicating propositional facts, as well as concerning their usability in clinical practice. The results show a slightly higher efficiency of the audiovisual compared to textual or visual-only presentation, and a preference of users for the interactive audiovisual or visual tool.

25 Sep 2012
TL;DR: A user simulation is used to connect speech synthesis to a real, state-of-the-art automatic speech recognition (ASR) component deployed in a working commercial SDS via a standard telephone line. The results show that a good text-to-speech synthesis configuration rivals human speech both in recognition scores and in variability.
Abstract: In this paper, we test the effect of using speech synthesis when interacting with a spoken dialog system (SDS). We use a user simulation to connect our speech synthesis to a real, state-of-the-art automatic speech recognition (ASR) component deployed in a working commercial SDS via a standard telephone line. In a series of experiments, we compare human-machine dialogs and their recognition scores with simulated dialogs using synthesis. Our results show that a good text-to-speech synthesis configuration rivals human speech both in recognition scores and in variability. This makes the speech interface in user simulation quite attractive.

25 Sep 2012
TL;DR: This paper addresses models for predicting the quality of speech transmission and communication services by focusing on models which predict individual dimensions of quality, as they are subjectively perceived by a communication partner.
Abstract: In this paper, we address models for predicting the quality of speech transmission and communication services. In contrast to models which aim at predicting the integral quality of speech, i.e., at an index for the overall quality, we focus on models which predict individual dimensions of quality, as they are subjectively perceived by a communication partner. Such predictions of individual dimensions can then be combined to form an integral quality estimation; however, they may also be used separately in order to diagnose the reasons for insufficient quality. Moreover, because they describe particular perceptual effects rather than technical systems, they are less vulnerable to new system variants not included in the model development.

Proceedings ArticleDOI
05 Jul 2012
TL;DR: This paper analyses the applicability of speech, video, and call quality prediction models for video telephony in heterogeneous wireless networks and shows how accurately the quality of video calls in mobile networks can be predicted with the existing approaches, and discloses the major limitations of the individual models.
Abstract: Video calls performed in heterogeneous mobile networks are affected by time-varying quality changes. Packet loss or transmission adaptation by network handover, bit rate switching, or codec changeover, are the potential results of user mobility. In order to optimize the quality under these circumstances, quality monitoring plays an essential role to improve user experience in future mobile networks. However, the existing quality prediction models have not been fully validated for video telephony in future networks. In this paper, we analyse the applicability of speech, video, and call quality prediction models for video telephony in heterogeneous wireless networks. We focus on user perception of changing transmission quality. We show how accurately the quality of video calls in mobile networks can be predicted with the existing approaches, and we disclose the major limitations of the individual models. Thus, this paper contributes to the monitoring of video call quality in future mobile networks.

Proceedings Article
01 Jan 2012
TL;DR: It is argued that approaches to annotate multimodal human face-to-face interaction are not suitable for current device-based human-computer interaction, and existing extensions proposed to established parameters describing the interaction with spoken dialog systems are presented and discussed.
Abstract: In this paper we argue that approaches to annotate multimodal human face-to-face interaction are not suitable for current device-based human-computer interaction. Instead, existing extensions proposed to established parameters describing the interaction with spoken dialog systems are presented and discussed. A standardization activity for annotating user interactions with multimodal systems is needed, e.g. to efficiently extract multimodal interaction parameters useful for evaluation.

Proceedings Article
01 Jun 2012
TL;DR: It is argued that standardized metrics and automatic evaluation tools are necessary for speeding up knowledge generation and development processes for dialog systems.
Abstract: We argue that standardized metrics and automatic evaluation tools are necessary for speeding up knowledge generation and development processes for dialog systems.

Proceedings ArticleDOI
09 Sep 2012
TL;DR: This paper shows how the user simulation environment SpeechEval was included in the development process of three VoiceXML dialogue systems and discusses advantages and drawbacks compared to tests with real users.
Abstract: In this paper we present our experiences in developing a spoken dialogue system supported by tests with a user simulation. Since the code of dialogue systems with modest complexity can easily get unclear, it is almost impossible to deliver error-free systems without user tests in the development process. We show how we included our user simulation environment SpeechEval in the development process of three VoiceXML dialogue systems and discuss advantages and drawbacks compared to tests with real users.

Proceedings Article
01 Jan 2012
TL;DR: The idea is to extract perceptually relevant feature estimations from the speech signal, and combine them with an overall quality metric in order to obtain more reliable as well as more diagnostic predictions of speech quality.
Abstract: In this paper, we present a new framework for the diagnostic prediction of transmitted speech quality. The idea is to extract perceptually relevant feature estimations from the speech signal, and combine them with an overall quality metric in order to obtain more reliable as well as more diagnostic predictions of speech quality. We implement this framework in two complementary ways: In terms of a signal-based model which can be used for online and offline measurement, and in terms of a parametric model which can be used for network planning. The implementations are compared to standard state-of-the-art models and show a similar level of reliability, while providing additional diagnostic value.