
Showing papers by "Sebastian Möller" published in 2008


Proceedings ArticleDOI
12 May 2008
TL;DR: An overview of the model, of its integration into a multimedia model predicting audio-visual quality, and of its application to service monitoring is provided, and a performance analysis shows a high correlation with the results of different subjective video quality perception tests.
Abstract: The paper presents a parameter-based model for predicting the perceived quality of transmitted video for IPTV applications. The core model we derived can be applied both to service monitoring and network or service planning. In its current form, the model covers H.264 and MPEG-2 coded video (standard and high definition) transmitted over IP-links. The model includes factors like the coding bit-rate, the packet loss percentage and the type of packet loss handling used by the codec. The paper provides an overview of the model, of its integration into a multimedia model predicting audio-visual quality, and of its application to service monitoring. A performance analysis is presented showing a high correlation with the results of different subjective video quality perception tests. An outlook highlights future model extensions.

88 citations


Journal ArticleDOI
TL;DR: It is shown that, although an accurate prediction of individual ratings is not yet possible with such models, they may still be used for making decisions on component optimization, and are thus helpful tools for the system developer.

54 citations



Proceedings Article
01 May 2008
TL;DR: The results show that the two user groups differ in their speaking style as well as their vocabulary, and older users are far less easy to stereotype than younger users.
Abstract: In this paper, we present the collection and analysis of a spoken dialogue corpus obtained from interactions of older and younger users with a smart-home system. Our aim is to identify the amount and the origin of linguistic differences in the way older and younger users address the system. In addition, we investigate changes in the users’ linguistic behaviour after exposure to the system. The results show that the two user groups differ in their speaking style as well as their vocabulary. In contrast to younger users, who adapt their speaking style to the expected limitations of the system, older users tend to use a speaking style that is closer to human-human communication in terms of sentence complexity and politeness. However, older users are far less easy to stereotype than younger users.

33 citations


Journal ArticleDOI
TL;DR: A normalized log-likelihood measure, computed between perceptual features extracted from synthesized speech and a gender-dependent HMM reference model, is proposed and shown to be a reliable parameter for multidimensional TTS quality diagnosis.
Abstract: In this letter, the first steps toward the development of a signal-based instrumental quality measure for text-to-speech (TTS) systems are described. Hidden Markov models (HMM), trained on naturally-produced speech, serve as artificial text- and speaker-independent reference models against which synthesized speech signals are assessed. A normalized log-likelihood measure, computed between perceptual features extracted from synthesized speech and a gender-dependent HMM reference model, is proposed and shown to be a reliable parameter for multidimensional TTS quality diagnosis. Experiments with subjectively scored synthesized speech data show that the proposed measure attains promising estimation performance for quality dimensions labeled overall impression, listening effort, naturalness, continuity/fluency, and acceptance.

32 citations


Proceedings ArticleDOI
19 May 2008
TL;DR: This contribution introduces an ergonomic technique that aims at seamlessly switching the speech codec in Voice-over-IP calls during vertical handovers, based on SIP/SDP session renegotiation, the establishment of a parallel media stream and RTP packet filtering.
Abstract: Vertical handovers in Next Generation Networks enable a new experience of mobility, since application layer sessions are maintained while roaming across different access networks. For real-time media services like Voice-over-IP, a change in the underlying network technology is particularly challenging. Ongoing calls are not suspended during the handover; however, the handover may be accompanied by an audible gap during the transition time due to lost or delayed packets, and by an adaptation of call parameters such as those of the employed speech codec. This, in turn, may translate into an unfamiliar speech quality perception. This contribution introduces an ergonomic technique that aims at seamlessly switching the speech codec in Voice-over-IP calls during vertical handovers, based on SIP/SDP session renegotiation, the establishment of a parallel media stream and RTP packet filtering. Evaluation results are presented, showing that the proposed approach does not cause any interruption of the audio stream in about 90% of the test cases, clearly outperforming simple re-negotiation of session parameters that does not take a seamless transition into account (interruptions in all test cases). PESQ speech quality estimates reveal a quality advantage of 5.4% on average for the considered scenario.

21 citations



Book ChapterDOI
16 Jun 2008
TL;DR: The results show that users do not make use of all modalities for all tasks, and both modality usage and the range of offered modalities seem to influence subjective user ratings.
Abstract: This paper describes initial results from an evaluation study on the use of different modalities. The application, a web-based media recommender and management system called MediaScout, was installed on two multimodal devices, and on one unimodal device as a control condition. The study aims to investigate whether users make use of multimodality if it is offered and under which circumstances they do so. Moreover, it was studied whether users' stated modality preferences match their actual use of the modalities. The results show that users do not make use of all modalities for all tasks. Modality usage is determined by the task to be performed, as well as the efficiency of the modality for achieving the task goal. Both modality usage and the range of offered modalities seem to influence subjective user ratings.

19 citations


Proceedings ArticleDOI
12 May 2008
TL;DR: An algorithm was developed that allows the obtained "speech-quality-per-call" score to be predicted on the basis of the MOS of the individual utterances and the approach was proven and confirmed for objectively obtained quality scores.
Abstract: We present a method to estimate the listening quality perceived by a subscriber at the end of a common voice telephony conversation. This method was recently introduced in ETSI STQ mobile and was approved as TR 102 506 'Speech Quality per Call'. The idea is to calculate this "speech-quality-per-call" value based on short-term listening quality scores (so-called Mean Opinion Scores, MOS), as they are usually derived by subjective listening-only tests, or based on predictions of short-term scores by means of objective measures. It is shown that a pure linear averaging of short-term scores will not predict the perceived quality of the entire call sufficiently well if the quality is non-stationary over the call. In particular, the "recency effect" and the out-weighting of very bad parts in a call have to be considered in an adequate way. An algorithm was developed that allows the obtained "speech-quality-per-call" score to be predicted on the basis of the MOS of the individual utterances. The algorithm can be applied for various lengths of call and numbers of individual utterances. Since speech quality is usually objectively predicted in real networks, the approach was also proven and confirmed for objectively obtained quality scores. This paper largely follows the work and the decisions taken within ETSI STQ mobile.

19 citations
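
The aggregation idea in the abstract above can be illustrated with a minimal sketch. It is not the algorithm approved as TR 102 506; the function name `call_quality`, the linear recency weights, and the `bad_penalty` factor are hypothetical choices, made only to show why a plain average can overestimate call quality when a bad segment occurs, especially late in the call.

```python
from statistics import mean


def call_quality(utterance_mos, recency_gain=1.0, bad_penalty=0.3):
    """Aggregate per-utterance MOS values (1..5) into one per-call score.

    Hypothetical illustration only: weights grow linearly towards the end
    of the call (recency effect), and the score is pulled towards the
    worst utterance (out-weighting of very bad parts).
    """
    if not utterance_mos:
        raise ValueError("at least one utterance score is required")
    n = len(utterance_mos)
    # Weight 1.0 for the first utterance, 1.0 + recency_gain for the last.
    weights = [1.0 + recency_gain * i / max(n - 1, 1) for i in range(n)]
    weighted = sum(w * m for w, m in zip(weights, utterance_mos)) / sum(weights)
    # Move the recency-weighted mean part of the way towards the minimum.
    score = weighted - bad_penalty * (weighted - min(utterance_mos))
    # Keep the result on the MOS scale.
    return max(1.0, min(5.0, score))


# Three calls with the same plain average but different quality profiles.
stationary = [4.0, 4.0, 4.0, 4.0]
late_drop = [4.5, 4.5, 4.5, 2.5]
early_drop = [2.5, 4.5, 4.5, 4.5]

for name, scores in [("stationary", stationary),
                     ("late_drop", late_drop),
                     ("early_drop", early_drop)]:
    print(name, round(mean(scores), 2), round(call_quality(scores), 2))
```

All three calls have the same arithmetic mean, yet under these assumed weights the call with the late drop scores lowest, followed by the call with the early drop, which is the qualitative behaviour the abstract describes.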



Journal ArticleDOI
TL;DR: In this paper, the authors report on experiments to estimate the speech output quality of telephone services in an instrumental way, using single-ended quality prediction models, addressing both naturally-produced speech and synthesized speech generated with a Text-To-Speech (TTS) system.
Abstract: This paper reports on experiments to estimate the speech output quality of telephone services in an instrumental way, using single-ended quality prediction models. It addresses both naturally-produced speech and synthesized speech generated with a Text-To-Speech (TTS) system. Three auditory tests have been carried out where typical speech samples have been transmitted over various telephone channels, and then judged by listeners with respect to their overall quality. The mean auditory ratings obtained in these tests have been compared to estimates provided by three different single-ended models, one of which is currently recommended by the International Telecommunication Union for predicting the quality of naturally-produced speech. Correlations between auditory and estimated quality scores vary considerably between experiments. It is concluded that the single-ended models mainly predict the effects of the transmission channel, but not of the (naturally-produced or synthesized) source speech material.

Proceedings ArticleDOI
18 Mar 2008
TL;DR: A testbed specially built to investigate the user perception of mobility in NGNs, using the Mobisense setting, which allows the migration of VoIP calls across heterogeneous wireless networks, and simulation of user mobility patterns while roaming between networks.
Abstract: The convergence of different commercial wireless technologies poses new opportunities for always-on connectivity and nearly ubiquitous access in Next Generation Networks (NGNs). The proposed integrated virtual access platform frees users from fixed locations, and allows them to enjoy services while they are on the move. However, the possibility to seamlessly roam (on demand) across wireless technologies is a must in NGNs. Even though a wide body of research has been done on seamless mobility, a thorough analysis of user perception of this phenomenon is required in order to successfully design, and further improve, always-on services and mobility management solutions. This paper describes a testbed specially built to investigate the user perception of mobility in NGNs. The Mobisense testbed enables the mapping of user experience to network conditions, focusing on phenomena caused by a user roaming across diverse wireless technologies and the impact on applications and services. The Mobisense setting helps in the analysis of IP services; in particular, experiments were done to study user-perceived quality of Voice over IP in NGNs. This testbed allows (1) the migration of VoIP calls across heterogeneous wireless networks, (2) network application data tracking, (3) simulation of user mobility patterns, and (4) audio codec changeovers while roaming between networks.

Proceedings Article
01 Jan 2008
TL;DR: A new instrumental measure for end-to-end speech transmission quality is presented which is based on the perceptually relevant dimensions “discontinuity”, “noisiness”, and “coloration”, which were identified through multidimensional analyses.
Abstract: In this contribution, a new instrumental measure for end-to-end speech transmission quality is presented which is based on perceptually relevant dimensions. The paper describes the complete scientific development process of such a measure, starting off from the general framework and concluding with the concrete realization. The measure is based on the dimensions “discontinuity”, “noisiness”, and “coloration”, which were identified through multidimensional analyses. Three dimension estimators are introduced which are capable of predicting so-called dimension impairment factors on the basis of signal parameters. Each dimension impairment factor reflects the degradation with respect to a single perceptual dimension. By combining the impairment factors, integral quality can be estimated. A maximum correlation of r = 0.9 with auditory test results is achieved for a wide range of perceptually different conditions.
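
The abstract does not state how the impairment factors are combined. As a purely illustrative assumption, the sketch below subtracts them from a base value on a 0 to 100 scale, in the style of additive impairment-factor frameworks such as the E-model; the function names, the scale, and the example values are hypothetical and not taken from the paper.

```python
def integral_quality(impairments, base=100.0):
    """Combine per-dimension impairment factors into one quality estimate.

    Assumption: impairments are additive on a 0..100 scale; the paper's
    real mapping is not given in the abstract.
    """
    for name, value in impairments.items():
        if value < 0:
            raise ValueError(f"impairment factor {name!r} must be non-negative")
    total = sum(impairments.values())
    # Clamp at zero so heavily degraded conditions stay on the scale.
    return max(0.0, base - total)


def dominant_dimension(impairments):
    """Return the dimension with the largest impairment (diagnostic use)."""
    return max(impairments, key=impairments.get)


# Hypothetical impairment factors for one transmission condition.
condition = {"discontinuity": 25.0, "noisiness": 10.0, "coloration": 5.0}
print(integral_quality(condition), dominant_dimension(condition))
```

Keeping the per-dimension factors separate until the last step is what makes such a measure diagnostic: the same integral estimate can be traced back to the dimension that contributed most.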

Proceedings ArticleDOI
20 Oct 2008
TL;DR: Three ways of assessing overall quality are shown to be consistent in their results, and a linear combination of speech and visual quality models the overall quality of talking heads to a good degree.
Abstract: In this paper we report the results of a user study evaluating talking heads in the smart home domain. Three noncommercial talking head components are linked to two freely available speech synthesis systems, resulting in six different combinations. The influence of head and voice components on overall quality is analyzed as well as the correlation between them. Three different ways to assess overall quality are presented. It is shown that these three are consistent in their results. Another important result is that in this design speech and visual quality are independent of each other. Furthermore, a linear combination of both quality aspects models overall quality of talking heads to a good degree.

Journal ArticleDOI
TL;DR: The research project aims at the development of an attribute‐based speech‐quality measure, which provides estimates of different attributes of speech samples and then maps them to one integral‐quality estimate.
Abstract: State‐of‐the‐art assessment methods of speech‐transmission quality (e.g., PESQ or TOSQA) predict the mean‐opinion score (MOS) quite accurately, but cannot provide diagnostic information, which is, however, highly desirable for system developers. In our research project, we aim at the development of an attribute‐based speech‐quality measure, which provides estimates of different attributes of speech samples and then maps them to one integral‐quality estimate. Three dominant, mutually orthogonal perceptual dimensions were first identified by auditory experiments and multidimensional analysis (MDA) for narrow‐band speech transmission: “directness/frequency content,” “continuity,” and “noisiness.” The present paper focuses on the further decomposition and measurement of the global dimension “Noisiness.” To this end, an auditory test including samples degraded by different kinds of noises has been conducted. The subsequent MDA indicates that at least two sub‐dimensions (SD), “Speech Contamination” and (perceived) “Additive‐Noise Level,” further describe the global dimension “Noisiness.” The first SD characterizes the degree to which the noise distorts the speech signal as such, whereas the second SD reflects the degree to which the additive circuit or background noise itself annoys the listener. The instrumental estimation methods for both SDs and the mapping to the integral‐quality ratings are presented in this paper.

01 Jan 2008
TL;DR: In this article, the authors analyzed linear regression models predicting user satisfaction with spoken dialog systems, with the outcome that there is some congruity as to the parameters which are correlated.
Abstract: In this paper, we analyze linear regression models predicting user satisfaction with spoken dialog systems. Correlations between interaction parameters and the judgments are analyzed user-wise, with the outcome that there is some congruity as to the parameters which are correlated. However, some users are less predictable than others. A relation is established between the strength of correlations and user characteristics that impact judgment behaviour and scale usage. Prediction models of subgroups with specific characteristics are calculated. The results differ clearly depending on user characteristics. For some groups, transformation of the reply scale improves the prediction.


Book ChapterDOI
16 Jun 2008
TL;DR: It is shown that the weights associated with interaction parameters in the model change depending on the system's major problems, by examining correlations under different quantities of understanding errors in the dialogs.
Abstract: For spoken dialog systems, PARADISE [Walker et al. 1997] provides a framework to train a user satisfaction prediction model on given data. The approach weights and sums interaction parameters to predict a satisfaction metric calculated from a questionnaire. In this paper, we try to tackle a major problem of these models, namely their weak generalizability. We show that the weights associated with interaction parameters in the model change depending on the system's major problems, by examining correlations under different quantities of understanding errors in the dialogs.
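
The "weights and sums" step mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's model: the parameter names, the example dialogs, and the weights are hypothetical, and the weights are simply given here, whereas in a PARADISE-style setup they would be estimated by linear regression against questionnaire scores.

```python
from statistics import mean, pstdev


def z_normalize(values):
    """Z-score normalization, so parameters on different scales are comparable."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def predict_satisfaction(dialogs, weights):
    """Weight and sum normalized interaction parameters for each dialog.

    `dialogs` is a list of dicts mapping parameter name -> raw value;
    `weights` maps parameter name -> weight (hypothetical values here).
    """
    normalized = {
        name: z_normalize([d[name] for d in dialogs]) for name in weights
    }
    return [
        sum(weights[name] * normalized[name][i] for name in weights)
        for i in range(len(dialogs))
    ]


# Hypothetical logged parameters for three dialogs.
dialogs = [
    {"task_success": 1.0, "turns": 8, "asr_errors": 0},
    {"task_success": 1.0, "turns": 14, "asr_errors": 3},
    {"task_success": 0.0, "turns": 20, "asr_errors": 6},
]
weights = {"task_success": 0.5, "turns": -0.2, "asr_errors": -0.3}
print([round(p, 2) for p in predict_satisfaction(dialogs, weights)])
```

Because the weights are tied to the data they were fitted on, estimating them separately on dialogs with few and with many understanding errors would be one way to probe the dependence the abstract reports.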

Book ChapterDOI
01 Jan 2008
TL;DR: This chapter presents standardised methods for both measurement approaches, and an initial evaluation study shows that the interaction parameters correlate only weakly with subjective judgements; thus, both types of evaluation provide complementary types of information.
Abstract: The quality experienced during the interaction with telephone-based spoken dialogue services results from a perception and judgement process. As a consequence, quality has to be measured in a subjective way, with the help of human test persons. To complement subjective quality judgements, parameters can be logged which quantify the flow of the interaction, the behaviour of the user and the system, and the performance of individual system modules during the interaction. Although such parameters are not directly linked to the quality perceived by the user, they provide useful information for system development, optimisation, and maintenance. This chapter presents standardised methods for both measurement approaches. Firstly, a brief overview of subjective evaluation experiments is provided, following Recommendation P.851 issued by the International Telecommunication Union. Secondly, a collection of parameters is presented which has proven to be useful for system design. An initial evaluation study is described which shows that the parameters correlate only weakly with subjective judgements; thus, both types of evaluation provide complementary types of information. Linear regression models may be used to predict subjective judgements from interaction parameters, but their prediction accuracy is still limited.



Proceedings Article
01 Jan 2008
TL;DR: This paper presents an estimator for “noisiness” that is based on three parameters: the level and the color of additive noise contained in a speech signal as well as its amount of signal-correlated noise.
Abstract: The development of instrumental measures that do not only estimate the speech quality of modern telecommunication systems but also analyze it is a current issue in speech processing. Our work aims at an analytic quality measure for telephone-band speech that is based on the instrumental assessment of so-called quality dimensions. They describe different quality-relevant characteristics of speech signals and thus allow for a quality analysis. An overall-quality rating is obtained by a suitable combination of the dimension ratings. For telephone-band speech, three quality dimensions have been identified: “directness/frequency content”, “continuity”, and “noisiness”. This paper presents an estimator for “noisiness” that is based on three parameters: the level and the color of additive noise contained in a speech signal as well as its amount of signal-correlated noise. The dimension estimates show a correlation > 0.95 with the results of auditory tests.

Book ChapterDOI
16 Jun 2008
TL;DR: In order to facilitate the evaluation of advanced spoken dialogue systems (SDSs), the architecture for a new quality prediction model is presented which follows the perception, judgment and action processes which are assumed to take place in a user interacting with a dialogue system.
Abstract: In order to facilitate the evaluation of advanced spoken dialogue systems (SDSs), we present the architecture for a new quality prediction model. The architecture follows the perception, judgment and action processes which are assumed to take place in a user interacting with a dialogue system. It is pointed out which components of the model are already available, and how they may be improved in the future.

01 Jan 2008
TL;DR: A newly developed German Text-To-audiovisual-Speech (TTavS) synthesis system based on the English-speaking HeadZero is described and compared with the German talking head MASSY; the results show a significant user preference for the new German HeadZero.
Abstract: The authors describe a newly developed German Text-To-audiovisual-Speech (TTavS) synthesis system based on the English-speaking HeadZero. Targets of the control parameters of the talking head are generated by mapping German phonemes to the originally English visemic blend-shape controls. The resulting German version of HeadZero and the German talking head MASSY were extended to generate audiovisual speech utterances from text with voices of both the MARY and the MBROLA audio speech synthesizers. A test was designed to evaluate the quality of the talking heads combined with the two synthetic voices in the context of a smart home environment. The results show a significant user preference for the new German HeadZero. Both heads are rated better when combined with the MARY voice.

Journal ArticleDOI
TL;DR: A parametric approach for predicting the quality of IP‐based audio is described; the main parameters are the audio codec, codec bitrate, packet loss characteristics and the audio content.
Abstract: Different multimedia services are increasingly transmitted over a common network infrastructure, e.g. using the Internet Protocol (IP). Examples are the widespread voice over Internet Protocol (VoIP), and Internet Protocol Television (IPTV). The streaming of pure audio over IP has an even longer tradition, with applications such as internet radio. For an efficient development, planning and monitoring of such services, models can be used that predict user‐perceived quality based on technical service characteristics. Speech quality models for telephony are among the most advanced ones in this context, with different model types like the signal‐based PESQ (ITU‐T Rec. P.862, 2001) or the parametric E‐model (ITU‐T Rec. G.107, 2005). In this paper, we describe a parametric approach for predicting the quality of IP‐based audio. The main parameters are the audio codec, codec bitrate, packet loss characteristics and the audio content. We base our considerations on our own listening tests conducted in the framework of IPTV quality assessment, on approaches and test data described in the literature and on complementary knowledge from the fields of speech and video quality models. In this context, we identify similarities and discrepancies between different types of services in the light of a common model framework.

Patent
16 May 2008
TL;DR: In this paper, an apparatus and a method for providing data communication over a packet-based network comprising a network interface (450) which is connectable to the packet based network is described.
Abstract: The invention relates to an apparatus and a method for providing data communication over a packet based network comprising a network interface (450) which is connectable to the packet based network. The apparatus comprises at least two codec means (430, 470) being arranged in parallel and being connected to said network interface (450). A first codec means (470) comprises a first buffer (460) and a second codec means (430) comprises a second buffer (440). A management unit (420) is connected to both codec means (430, 470) and a sound device (410). When a changeover signal is issued, said first codec means (470) continues to decode an input data packet being passed through by said first buffer (460), and said second codec means (430) initiates an encoding process of an output data packet being received from the sound device (410) and directly transmits the encoded output data packet to the network interface (450).

Book ChapterDOI
16 Jun 2008
TL;DR: This paper examines SASSI results obtained by different studies in order to evaluate its reliability and validity, and investigates whether SASSI is also appropriate for multimodal systems.
Abstract: Questionnaires are a widespread method for subjective usability evaluation. Concerning speech-based systems, the SASSI questionnaire is probably the most common one. This paper examines SASSI results obtained by different studies in order to evaluate its reliability and validity. Furthermore, it is investigated whether SASSI is also appropriate for multimodal systems. The results indicate that some SASSI sub-scales need a revision.

Patent
17 Dec 2008
TL;DR: A usability evaluation processing system and method for evaluating usability by simulating an interaction between a user and a system to be evaluated are described; the system comprises a user behavior model unit, a model unit for the system to be evaluated, an automatic testing unit connected to both model units, and an evaluation unit connected to the automatic testing unit, wherein a probabilistic model through a state chart given by the system is used for the user behavior model.
Abstract: A usability evaluation processing system and method for evaluating usability by simulation of an interaction between a user and a system to be evaluated, the usability evaluation processing system comprising a user behavior model unit, a system to be evaluated model unit, an automatic testing unit connected to the user behavior model unit and the system model unit, and an evaluation unit connected to the automatic testing unit, wherein a probabilistic model through a state chart given by the system is used for the user behavior model.

Journal ArticleDOI
TL;DR: Building on Jens Blauert's systemic view of a listener in an auditory experiment, which helps to separate sound events from auditory events and from their descriptions, the talk identifies the components necessary for an algorithmic description of the processes involved in the formation of a quality event.
Abstract: Jens Blauert has introduced a systemic view on a listener in an auditory experiment. This view helps to separate sound events from auditory events and from their descriptions, and to identify and describe the processes involved in such experiments. The notion has later been extended to listeners in a quality judgment situation by Jekosch and Raake, leading to the notion of a "quality event". On the one hand, knowledge of the involved processes is necessary to design appropriate measurement processes for, e.g., sound quality, transmission quality, auditory‐scene quality, or product‐sound quality. On the other hand, such knowledge enables us to define algorithms which estimate quality ‐ or sub‐aspects of it ‐ in the system design process. The talk will mainly follow the second line and will identify components which are necessary for an algorithmic description of the processes involved in the formation of a quality event. Taking the example of telecommunication services, it will be shown which components of...

Book ChapterDOI
01 Jan 2008
TL;DR: This chapter describes recent work to build a testbed for the analysis and evaluation of log-file information collected by telephone-based spoken dialogue platforms; it allows recognition errors caused by channel and user characteristics to be analyzed and the confusability of lexicon and grammar to be evaluated.
Abstract: Spoken dialogue platforms usually log a multitude of information for each interaction between user and system. Such information is potentially very helpful for system evaluation and optimization; however, it is very difficult to interpret log data due to its large amount and specificity. In this chapter, we describe recent work to build a testbed for the analysis and evaluation of log-file information collected by telephone-based spoken dialogue platforms. On the signal level, parameters are extracted which allow recognition errors caused by channel and user characteristics to be analyzed. On the symbolic level, phonemic similarities are computed which allow the confusability of lexicon and grammar to be evaluated. The algorithms are integrated into a graphical user interface for an effective analysis of system performance. They are evaluated on data collected within a usability test of a pre-qualifying application of a large German telecom operator.