
Showing papers by "Sebastian Möller" published in 2013


12 Mar 2013
TL;DR: The concepts and ideas cited in this paper mainly refer to the Quality of Experience of multimedia communication systems, but may be helpful also for other areas where QoE is an issue, and the document will not reflect the opinion of each individual person at all points.
Abstract: This White Paper is a contribution of the European Network on Quality of Experience in Multimedia Systems and Services, Qualinet (COST Action IC 1003, see www.qualinet.eu), to the scientific discussion about the term "Quality of Experience" (QoE) and its underlying concepts. It resulted from the need to agree on a working definition for this term which facilitates the communication of ideas within a multidisciplinary group, where a joint interest around multimedia communication systems exists, however approached from different perspectives. Thus, the concepts and ideas cited in this paper mainly refer to the Quality of Experience of multimedia communication systems, but may be helpful also for other areas where QoE is an issue. The Network of Excellence (NoE) Qualinet aims at extending the notion of network-centric Quality of Service (QoS) in multimedia systems, by relying on the concept of Quality of Experience (QoE). The main scientific objective is the development of methodologies for subjective and objective quality metrics taking into account current and new trends in multimedia communication systems as witnessed by the appearance of new types of content and interactions. A substantial scientific impact on fragmented efforts carried out in this field will be achieved by coordinating the research of European experts under the catalytic COST umbrella. The White Paper has been compiled on the basis of a first open call for ideas which was launched for the February 2012 Qualinet Meeting held in Prague, Czech Republic. The ideas were presented as short statements during that meeting, reflecting the ideas of the persons listed under the headline "Contributors" in the previous section. During the Prague meeting, the ideas have been further discussed and consolidated in the form of a general structure of the present document. 
An open call for authors was issued at that meeting, to which the persons listed as "Authors" in the previous section responded by announcing their willingness to contribute to the preparation of individual sections. For each section, a coordinating author was assigned who coordinated the writing of that section, and who is underlined in the author list preceding each section. The individual sections were then integrated and aligned by an editing group (listed as "Editors" in the previous section), and the entire document was iterated with the entire group of authors. Furthermore, the draft text was discussed with the participants of the Dagstuhl Seminar 12181 "Quality of Experience: From User Perception to Instrumental Metrics", which was held in Schloss Dagstuhl, Germany, May 1-4 2012, and a number of changes were proposed, resulting in the present document. As a result of the writing process and the large number of contributors, authors and editors, the document will not reflect the opinion of each individual person at all points. Still, we hope that it is found to be useful for everybody working in the field of Quality of Experience of multimedia communication systems, and most probably also beyond that field.

686 citations


Journal ArticleDOI
TL;DR: A novel classification approach helps to detect trials with presumably non-conscious processing at the threshold of perception and uncovers a non-trivial confounder between neural hits and neural misses.
Abstract: Objective. Assessing speech quality perception is a challenge typically addressed in behavioral and opinion-seeking experiments. Only recently, neuroimaging methods were introduced, which were used to study the neural processing of quality at group level. However, our electroencephalography (EEG) studies show that the neural correlates of quality perception are highly individual. Therefore, it became necessary to establish dedicated machine learning methods for decoding subject-specific effects. Approach. The effectiveness of our methods is shown by the data of an EEG study that investigates how the quality of spoken vowels is processed neurally. Participants were asked to indicate whether they had perceived a degradation of quality (signal-correlated noise) in vowels, presented in an oddball paradigm. Main results. We find that the P3 amplitude is attenuated with increasing noise. Single-trial analysis allows one to show that this is partly due to an increasing jitter of the P3 component. A novel classification approach helps to detect trials with presumably non-conscious processing at the threshold of perception. We show that this approach uncovers a non-trivial confounder between neural hits and neural misses. Significance. The combined use of EEG signals and machine learning methods results in a significant ‘neural’ gain in sensitivity (in processing quality loss) when compared to standard behavioral evaluation; averaged over 11 subjects, this amounts to a relative improvement in sensitivity of 35%.

45 citations


Proceedings ArticleDOI
03 Jul 2013
TL;DR: An overview of the most-discussed concepts in computer gaming evaluation is provided, taking the perspective of a quality engineer who identifies influence factors, quantifies them in terms of performance metrics, and analyzes their impact on perceived quality features.
Abstract: With the advent of computer games, game providers try to improve their users' experience by ensuring high platform and transmission performance, by developing new interaction techniques and more interesting interfaces, or by launching new game ideas. However, it is still unclear how this affects user-perceived Quality of Experience (QoE). In this paper, we provide an overview of the most-discussed concepts in computer gaming evaluation, taking the perspective of a quality engineer who identifies influence factors, quantifies them in terms of performance metrics, and analyzes their impact on perceived quality features. The concepts are grouped in terms of a taxonomy which can be used for developing empirical test methods as well as instrumental prediction models for computer gaming QoE.

41 citations


Journal ArticleDOI
TL;DR: How perceived auditory and visual qualities integrate to an overall audiovisual quality perception in different experimental contexts is shown, revealing both similarities and differences in terms of magnitude and in which cases they occur.
Abstract: This paper investigates multi-modal aspects of audiovisual quality assessment for interactive communication services. It shows how perceived auditory and visual qualities integrate to an overall audiovisual quality perception in different experimental contexts. Two audiovisual experiments are presented and provide experimental data for the present analysis. First, two experimental contexts are compared, i.e., passive ‘viewing and listening’ and interactive, with regard to their impact on the audiovisual qualities as subjectively perceived by the user. Second, the effects of cross-modal interactions on the assessment of the audio and video qualities are measured for those experimental contexts. The results are compared to the ones found in the literature revealing both similarities and differences in terms of magnitude and also in which cases they occur. Third, the impact of the conversational scenario on the assessment of the auditory and visual qualities is investigated. Finally, results from the literature related to audiovisual integration are gathered by the type of application. A general integration function is proposed for each category, and the performances of these ‘application-oriented’ models demonstrate a direct gain in prediction.
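The abstract does not state the form of the proposed integration functions. As a rough illustration only, the sketch below uses a form that appears frequently in the audiovisual quality literature: a linear combination of the audio rating, the video rating, and their product. The function name `audiovisual_mos` and all coefficient values are hypothetical placeholders, not values from the paper.

```python
def audiovisual_mos(mos_audio, mos_video,
                    alpha=1.0, beta=0.0, gamma=0.0, delta=0.16):
    """Combine audio and video quality ratings into one audiovisual rating.

    Assumed form (not taken from the paper):
        MOS_av = alpha + beta*MOS_a + gamma*MOS_v + delta*MOS_a*MOS_v
    The default coefficients are placeholders chosen only so that the
    output stays roughly within a 1..5 rating scale.
    """
    score = (alpha
             + beta * mos_audio
             + gamma * mos_video
             + delta * mos_audio * mos_video)
    # Clamp to the 1..5 range of a typical mean opinion score scale.
    return min(5.0, max(1.0, score))


# Usage: with a multiplicative term, poor audio limits the overall
# rating even when the video quality is high.
good_both = audiovisual_mos(4.5, 4.5)
poor_audio = audiovisual_mos(1.5, 4.5)
```

An application-oriented variant in the spirit of the abstract would keep one coefficient set per application category; such sets would have to be fitted to subjective test data.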

29 citations


Proceedings Article
27 May 2013
TL;DR: Results of three recent studies are presented, thus showing that neurophysiological correlates can be obtained for i) natural speech and ii) synthesized speech QoE perception, as well as iii) image preference characterization for multimedia QoE evaluation.
Abstract: The human brain is the epicenter of every human action, thus neurophysiology will pave the way for understanding human behavior and cognition and their interplay with Quality of Experience (QoE). Recent advances in neurophysiological monitoring tools have allowed useful QoE constructs to be measured in real-time, such as human cognition, attention, emotion, fatigue, perception and task performance. In this paper, we describe a multimodal neurophysiological experimental facility recently implemented for QoE evaluation. A description of the facility and the available equipment is presented. Results of three recent studies are also presented, thus showing that neurophysiological correlates can be obtained for i) natural speech and ii) synthesized speech QoE perception, as well as iii) image preference characterization for multimedia QoE evaluation.

24 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: This study confirms that the identification task is facilitated if the voices are transmitted through wideband instead of narrowband channels, and that headsets and hands-free phones take greater advantage of the improved bandwidth that is gaining ground rapidly.
Abstract: Together with the variety of networks, diverse terminals and devices, such as telephones with handset or hands-free mode, mobile phones and headsets, are commonly available for everyday calls. We conducted an auditory test to examine the combined influence of these user interfaces, audio bandwidths, coding schemes and packet loss on human speaker identification of previously known voices. The effects of the user interfaces on transmission and reception were tested separately with the different channel impairments. Our study confirms that the identification task is facilitated if the voices are transmitted through wideband instead of narrowband channels, and that headsets and hands-free phones take greater advantage of the improved bandwidth that is gaining ground rapidly.

18 citations


01 Jan 2013
TL;DR: It is shown that even though the 9 tests differ in terms of used synthesizer types, stimulus duration, language, and quality assessment methods, the resulting perceptual quality dimensions can be linked to 5 universal quality dimensions of synthetic speech: naturalness of voice, prosodic quality, fluency and intelligibility, disturbances, and calmness.
Abstract: In this paper, we present a comparative overview of 9 studies on perceptual quality dimensions of synthetic speech. Different subjective assessment techniques have been used to evaluate the text-to-speech (TTS) stimuli in each of these tests: in a semantic differential, the test participants rate every stimulus on a given set of rating scales, while in a paired comparison test, the subjects rate the similarity of pairs of stimuli. Perceptual quality dimensions can be derived from the results of both test methods, either by performing a factor analysis or via multidimensional scaling. We show that even though the 9 tests differ in terms of used synthesizer types, stimulus duration, language, and quality assessment methods, the resulting perceptual quality dimensions can be linked to 5 universal quality dimensions of synthetic speech: (i) naturalness of voice, (ii) prosodic quality, (iii) fluency and intelligibility, (iv) disturbances, and (v) calmness.

18 citations


Proceedings ArticleDOI
01 Nov 2013
TL;DR: This study analyzes how different levels of synthetic speech quality, obtained from different text-to-speech (TTS) systems, affect the emotional response of a user and finds an increase in neuronal activity in the left frontal area with decreasing quality.
Abstract: The tolerance limit for acceptable multimedia quality is changing as more and more high quality services approach the market. Thus, negative emotional reactions towards low quality services may cause user disappointment and are likely to increase churn rate. The current study analyzes how different levels of synthetic speech quality, obtained from different text-to-speech (TTS) systems, affect the emotional response of a user. This is achieved using two methods: subjective, by means of user reports; and neurophysiological by means of electroencephalography (EEG) analysis. More specifically, we analyzed the frontal alpha band power and correlated this with the subjective ratings based on the Self-Assessment Manikin scale. We found an increase in neuronal activity in the left frontal area with decreasing quality and argue that this is due to user disappointment with low quality TTS systems as they become harder to understand.
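The abstract does not say how the frontal alpha band power was computed. A minimal sketch, assuming a plain periodogram over a single EEG channel and an alpha band of 8 to 13 Hz (a conventional range, not stated in the abstract), could look as follows. The function name `band_power` and the synthetic test signal are illustrative assumptions.

```python
import cmath
import math


def band_power(signal, fs, f_lo, f_hi):
    """Power of `signal` in the band [f_lo, f_hi) Hz via a naive DFT.

    signal: list of samples from one channel
    fs:     sampling rate in Hz
    Returns the summed one-sided periodogram over the band, so a sine of
    amplitude A that falls on a DFT bin inside the band contributes A**2 / 2.
    """
    n = len(signal)
    mean_val = sum(signal) / n
    x = [s - mean_val for s in signal]  # remove the DC offset
    power = 0.0
    # Positive-frequency bins only; the Nyquist bin is left out.
    for k in range(1, n // 2):
        freq = k * fs / n
        if f_lo <= freq < f_hi:
            coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                        for t in range(n))
            power += 2.0 * abs(coeff) ** 2 / n ** 2
    return power


# Usage with a synthetic 10 Hz oscillation standing in for an EEG channel.
fs = 128
samples = [math.sin(2 * math.pi * 10 * t / fs) for t in range(2 * fs)]
alpha_power = band_power(samples, fs, 8.0, 13.0)
beta_power = band_power(samples, fs, 13.0, 30.0)
```

For real recordings, an averaged estimate such as Welch's method would usually be preferred over a single periodogram; the naive DFT is used here only to keep the sketch free of dependencies.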

17 citations


Proceedings ArticleDOI
25 Aug 2013
TL;DR: Estimations from the new ITU standard POLQA, its predecessor WB-PESQ and the diagnostic DIAL model are compared to subjective listener judgments and reveal that the instrumental measures are not fully able to cope with ABE-processed speech, particularly in predicting ABE rank orders reliably.
Abstract: During the transition period from narrowband to wideband speech transmission services, Artificial Bandwidth Extension (ABE) algorithms are able to reduce the perceptual degradation of narrowband-transmitted speech signals by extending the audio bandwidth. In this paper, we analyze whether the resulting speech quality can be predicted reliably with instrumental models. Estimations from the new ITU standard POLQA, its predecessor WB-PESQ and the diagnostic DIAL model are compared to subjective listener judgments. This comparison reveals that the instrumental measures are not fully able to cope with ABE-processed speech, particularly in predicting ABE rank orders reliably. Reasons for this finding and corresponding diagnoses are discussed.

14 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: Electroencephalography and self-assessment tools are used to investigate the neural and affective correlates of speech quality perception of reverberant speech, and it is shown that EEG event related potentials (ERP) are a useful tool to monitor the conscious stages of neural-processing during a speech quality assessment task.
Abstract: Subjective speech quality assessment depends on listener “quality” opinions after hearing a particular test speech stimulus. Subjective scores are given based on a perception and quality judgment process that is unique to a particular listener. These processes are postulated to be dependent on the listener's internal reference of what good and bad quality sounds like, as well as their mental and emotional states. To overcome this variability, subjective listening tests often average scores over several listeners. In this paper, we use electroencephalography (EEG) and self-assessment tools to investigate the neural and affective correlates of speech quality perception of reverberant speech, with the goal of obtaining new insights into human speech quality perception in complex listening environments. We show that EEG event related potentials (ERP) are a useful tool to monitor the conscious stages of neural-processing during a speech quality assessment task. Significant correlations were obtained between the so-called P300 ERP component and the reverberation time of the room, as well as between the P300 peak amplitude and emotional self-assessment ratings. These insights could lead to more effective ways of characterizing room acoustics for improved speech quality and intelligibility.

13 citations


Proceedings ArticleDOI
03 Jul 2013
TL;DR: The results show that the perception of degraded media has long-term influences on physiological processes at the time scale of minutes which may influence customer behavior even though it probably stays undetected using purely standard subjective test methods.
Abstract: Quality aspects of media content are usually assessed by using subjective rating methods where subjects provide active feedback towards the perceived quality. The process how this judgment is being formed and the long-term influences of accompanying physiological processes are not represented by these methods. In this paper we used the electroencephalogram (EEG) to assess the cognitive state (vigilance) of listeners measured by the alpha frequency band power with respect to varying bit rate conditions. We show that stimuli with varying and constant bit rate cause a state of reduced vigilance. In addition we show the potential tendency that an interval of high bit rate audio inserted into a low bit rate stimulus increased the vigilance of listeners at the timescale of minutes. The results show that the perception of degraded media has long-term influences on physiological processes at the time scale of minutes which may influence customer behavior even though it probably stays undetected using purely standard subjective test methods.

Journal ArticleDOI
30 Apr 2013
TL;DR: This paper provides an overview of instrumental models for predicting the quality of speech signals on the basis of perceptual and cognitive characteristics of the human auditory system, showing that perception modeling can significantly improve prediction accuracy.
Abstract: This paper provides an overview of instrumental models for predicting the quality of speech signals. On the basis of perceptual and cognitive characteristics of the human auditory system which are relevant for quality judgment, approaches are presented which aim at predicting overall quality, intelligibility, or other quality dimensions from measurable parameters or signal characteristics. The approaches are discussed with respect to their underlying principles, showing that perception modeling can significantly improve prediction accuracy. Application examples are presented which make use of these algorithms for offline or online prediction, adaptation, or intelligibility improvement.

Proceedings ArticleDOI
03 Jul 2013
TL;DR: The present study shows an inverse relationship between TTS speech quality and the amplitude of an EEG evoked response called the 'P300', suggesting an increase in cognitive load as TTS quality decreases, likely due to a reduction in speech intelligibility.
Abstract: Evaluating the quality of text-to-speech systems (TTS) is usually achieved by subjective methods where participants have to rate the stimulus on multiple scales, such as naturalness, prosody, and overall quality. In the present study, we aim towards evaluating TTS system quality using not only conventional subjective methods, but also via a neurophysiological approach based on obtaining neural correlates of TTS quality perception using electroencephalography (EEG). Such an approach allows for better insight into the perception processes involved during the human quality judgement process, and may open doors to innovative subjective testing methods and/or objective measurement tools. In our experiments, we have shown an inverse relationship between TTS speech quality and the amplitude of an EEG evoked response called the 'P300', suggesting an increase in cognitive load as TTS quality decreases, likely due to a reduction in speech intelligibility.


Proceedings ArticleDOI
25 Aug 2013
TL;DR: The first steps towards the development of an evaluation instrument applicable to a wide range of anthropomorphic interfaces, including human-like and therefore more natural interaction, are described.
Abstract: The gulf between user and system can be minimized by adapting the system to the user's natural characteristics. So-called anthropomorphic interfaces represent one strategy of such an adaptation as they are assumed to provide a more human-like and therefore more natural interaction. However, regarding the evaluation of anthropomorphic interfaces, the well-known and empirically tested instruments are limited to educational contexts. Hence, this paper describes the first steps towards the development of an evaluation instrument applicable to a wide range of such interfaces.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: Quality predictors for TTS signals are developed following two different approaches to handle the huge set of speech features: a three-step feature selection followed by a stepwise multiple linear regression and an approach based on support vector machines.
Abstract: We extract 1495 speech features from 2 subjectively evaluated text-to-speech (TTS) databases. These features are extracted from pitch, loudness, MFCCs, spectrals, formants, and intensity. The speech material is synthesized using up to 15 different TTS systems, some of them with up to 8 different voices. We develop quality predictors for TTS signals following two different approaches to handle the huge set of speech features: a three-step feature selection followed by a stepwise multiple linear regression and an approach based on support vector machines. The predictors are cross-validated via 3-fold cross validation (CV) and leave-one-test-out (LOTO) CV. Due to the high number of features we apply a strict CV method where the partitioning is realized prior to the feature scaling and feature selection steps. In comparison we also follow a semi-strict approach where the partitioning effectively takes place after these steps. In the 3-fold CV case we achieve correlations as high as .75 for strict CV and .89 for semi-strict CV. The more ambitious LOTO CV yields correlations around .80 for the male speakers whereas the results for the female voices show the need for improvement.
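The difference between the strict and the semi-strict procedure described above comes down to when the data are partitioned relative to feature scaling (and selection). The following sketch illustrates this for scaling a single feature; it is an assumption-laden illustration with hypothetical helper names, not the authors' implementation, and it omits feature selection and the regression step.

```python
from statistics import mean, pstdev


def k_fold_splits(n_samples, k):
    """Yield (train_idx, test_idx) pairs for contiguous k-fold partitioning."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def fit_scaler(values):
    """Return (mean, std) of one feature; std falls back to 1.0 if constant."""
    return mean(values), (pstdev(values) or 1.0)


def scale(values, mu, sigma):
    """Standardize values with the given mean and standard deviation."""
    return [(v - mu) / sigma for v in values]


def strict_folds(feature, k=3):
    """Strict CV: partition first, then fit the scaler on the training part only."""
    folds = []
    for train_idx, test_idx in k_fold_splits(len(feature), k):
        train = [feature[i] for i in train_idx]
        test = [feature[i] for i in test_idx]
        mu, sigma = fit_scaler(train)  # test samples never influence mu/sigma
        folds.append((scale(train, mu, sigma), scale(test, mu, sigma)))
    return folds


def semi_strict_folds(feature, k=3):
    """Semi-strict CV: scale with statistics of all samples, then partition."""
    mu, sigma = fit_scaler(feature)  # test samples leak into mu/sigma
    scaled = scale(feature, mu, sigma)
    return [([scaled[i] for i in train_idx], [scaled[i] for i in test_idx])
            for train_idx, test_idx in k_fold_splits(len(feature), k)]


# Usage: one feature with an outlier in the last fold.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 60.0]
strict = strict_folds(feature, k=3)
semi = semi_strict_folds(feature, k=3)
```

In the semi-strict variant, held-out samples influence the scaling statistics, which is one plausible reason for its more optimistic correlations; the same reasoning applies to feature selection performed before partitioning.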

DOI
09 Sep 2013
TL;DR: It is concluded that tactile feedback messages are unobtrusive, but have to be designed carefully to convey their intended meaning in a working context as well as in a leisure time situation.
Abstract: On mobile devices, vibrotactile messages are a common way to give feedback to the user. They might be a less obtrusive means to communicate information about the system status compared to auditory feedback. Much research has focused on the possibilities to perceive and discriminate different vibrotactile messages, less on the interpretation of their content. We describe a series of two studies. The aim of the pilot study was to find meaningful vibrotactile messages of which we then wanted to investigate the affective impression and functional connotation on a mobile device within varying staged contexts. Results show that the affective impression of those so-called Tactons is independent of the context. Moreover, we observed a relation between ratings of affective quality and functional applicability. We conclude that tactile feedback messages are unobtrusive, but have to be designed carefully to convey their intended meaning in a working context as well as in a leisure time situation.

Proceedings ArticleDOI
01 Dec 2013
TL;DR: Preliminary findings corroborate that emotions play a significant role in human quality and QoE perception and indicate an increase in delta and beta coupling with a decrease in the speech quality levels.
Abstract: Quality of Experience (QoE) is a human-centric paradigm which produces the blueprint of human behavioral states such as perception, emotion, cognition and expectation. Recent advances in neurophysiological monitoring tools have facilitated the study of frequency, time and location of neuronal activity to an unprecedented degree, as well as opened doors to a better understanding of human cognition, emotions and overall behavioral systems. These neurophysiological insights may provide more accurate and objective characterization of QoE metrics. This paper seeks to investigate neuronal activity generated by three different quality levels of a speech stimulus using electroencephalography (EEG). To this end, an EEG feature was computed based on the coupling between so-called delta and beta EEG frequency bands, which has previously been linked with negative behavioral characteristics (anxiety, frustration, dissatisfaction). The result indicates an increase in delta and beta coupling with a decrease in the speech quality levels. Additionally, neural correlates of subjective affective scores (arousal and valence) were also computed and shown to be inversely proportional to the EEG feature. These preliminary findings corroborate that emotions play a significant role in human quality and QoE perception.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: An embodied conversational agent was developed implementing four strategies to adapt to the user: User tracking and recognition with a camera, remembering interests in topics, remembering preferences concerning the level of detail of information, and changes in confirmation strategy.
Abstract: An embodied conversational agent was developed implementing four strategies to adapt to the user: User tracking and recognition with a camera, remembering interests in topics, remembering preferences concerning the level of detail of information, and changes in confirmation strategy. The agent was integrated into a system providing public information on ICT related projects of research and development for visitors of the laboratories. In an interactive experiment, the adaptive version was compared to a non-adaptive version. The logging data, but not the questionnaire data, shows significant differences, indicating a benefit of the adaptive version in terms of efficiency in interaction. The logging data and results from a final interview are discussed in relation to other work on this subject, concluding on the difficulties to provide not only more efficient interaction, but also higher User Experience.

01 Jan 2013
TL;DR: In this paper, the authors presented an evaluation protocol for the assessment of synthetic speech in audiobook reading tasks for the Blizzard Challenge 2012 and analyzed the results of the subjective listening test.
Abstract: In this paper we present research on perceptual quality dimensions of text-to-speech systems in audiobook reading tasks. Therefore, we proposed a newly developed evaluation protocol for the assessment of synthetic speech in audiobook reading tasks for the Blizzard Challenge 2012. We illustrate the experimental setup of the special audiobook reading task of the Blizzard Challenge 2012 and analyze and interpret the results of the subjective listening test. Via a factor analysis, two quality dimensions could be extracted. Through the correlation between the values of the rating scales and the factor values, the dimensions could be assigned to prosody & rhythm and to the listening pleasure of the user. This confirms the results of the previous study in which the current evaluation protocol was created. Also, a comparison with the perceptual quality dimensions of text-to-speech systems in different use cases led to significant similarities.

Proceedings Article
01 Jan 2013
TL;DR: The development of a mobile phone app for supporting participants and organizers of Interspeech conferences is described, meant as an open tool to be handled under the auspices of ISCA, and can be extended with speech and language functionalities in the future.
Abstract: In this paper, we describe the development of a mobile phone app for supporting participants and organizers of Interspeech conferences. Based on a survey amongst future organizers and attendees, we identified the most relevant functionalities and implemented an initial set of them on two popular platforms, iOS and Android. The app is meant as an open tool to be handled under the auspices of ISCA, and can be extended with speech and language functionalities in the future. This way, we hope to turn it into a community platform which can be used for experimenting with new speech technologies on site. A first version of this app will be presented at Interspeech 2013.

Proceedings Article
01 Jan 2013
TL;DR: This position paper claims that a major obstacle of offering video lectures for public universities appears to be the fact that they intend to compete with prestigious private universities regarding quality of the videos and complexity of the installed platform without being able to provide the additional resources required to do so.
Abstract: This position paper claims that a major obstacle of offering video lectures for public universities appears to be the fact that they intend to compete with prestigious private universities regarding quality of the videos and complexity of the installed platform without being able to provide the additional resources required to do so. We argue that in other areas of teaching this issue has been acknowledged for a long time, and lacking resources are usually compensated for by primarily two means: individually offering provisory course material (manuscripts), and active participation of the student body in administering those. Based on this, a simple system is proposed that mostly draws on existing platforms and tools, and refrains from extensive video editing prior to publishing. We discuss technical and non-technical requirements and possible research directions that result from establishing such low-fidelity video lectures.

Proceedings ArticleDOI
27 Apr 2013
TL;DR: It is shown that EEG is a feasible method for quantifying conscious processing of feedback in different modalities as it correlates highly with subjective ratings and can thus be considered an additional tool for assessing the effectiveness of feedback.
Abstract: To acknowledge information received by a mobile device, a number of feedback modalities are available for which human information processing is still not completely understood. This paper focuses on how different feedback modalities are perceived by users introducing a test method that is new in this field of research. The evaluation is done via standard self-assessment and by analyzing brain activity [electroencephalogram (EEG)]. We conducted an experiment with unimodal and multi-modal feedback combinations, and compared behavioral user data to EEG data. We could show that EEG is a feasible method for quantifying conscious processing of feedback in different modalities as it correlates highly with subjective ratings. EEG can thus be considered an additional tool for assessing the effectiveness of feedback, revealing conscious and potential non-conscious information processing.

01 Jan 2013
TL;DR: How perceived auditory and visual qualities integrate to an overall audiovisual quality perception in different experimental contexts is shown, revealing both similarities and differences in terms of magnitude and in which cases they occur.
Abstract: This paper investigates multi-modal aspects of audiovisual quality assessment for interactive communication services. It shows how perceived auditory and visual qualities integrate to an overall audiovisual quality perception in different experimental contexts. Two audiovisual experiments are presented and provide experimental data for the present analysis. First, two experimental contexts are compared, i.e., passive ‘viewing and listening’ and interactive, with regard to their impact on the audiovisual qualities as subjectively perceived by the user. Second, the effects of cross-modal interactions on the assessment of the audio and video qualities are measured for those experimental contexts. The results are compared to the ones found in the literature revealing both similarities and differences in terms of magnitude and also in which cases they occur. Third, the impact of the conversational scenario on the assessment of the auditory and visual qualities is investigated. Finally, results from the literature related to audiovisual integration are gathered by the type of application. A general integration function is proposed for each category, and the performances of these ‘application-oriented’ models demonstrate a direct gain in prediction.

Proceedings ArticleDOI
01 Jul 2013
TL;DR: Listeners identified the talkers while listening to the degraded conversations, being more accurate for particular transmission scenarios, and show that human speaker identification can be considered as an additional criterion when judging the benefits of enhanced bandwidths.
Abstract: Telecommunication systems available today allow efficient voice transmission through channels of different audio bandwidths and terminated with different user interfaces. However, the sending and receiving user interface, the bandwidth limitation and the effects of lossy signal compression degrade the quality of the received signal and impede an unequivocal recognition of the talker. In the present work the performance of a group of human listeners identifying speakers under different conditions is assessed by transmitting a set of multi-speaker conversations through networks and user interfaces of diverse characteristics. After learning the voices of the participants of the phone calls in clean conditions, listeners identified the talkers while listening to the degraded conversations, being more accurate for particular transmission scenarios. The results show that human speaker identification can be considered as an additional criterion when judging the benefits of enhanced bandwidths.