
Showing papers by "Sebastian Möller" published in 2019


Proceedings ArticleDOI
12 May 2019
TL;DR: This paper presents a non-intrusive speech quality assessment model, NISQA, which – in contrast to current state-of-the-art models – can predict the quality of super-wideband speech transmission and accurately predicts the quality impact of the packet loss concealment of modern codecs, such as Opus and EVS.
Abstract: The quality of speech communication networks has recently improved significantly by extending the available audio bandwidth from narrowband, firstly to wideband, and then to super-wideband. This bandwidth extension marks the end of the typically muffled sound we know from plain old telephone services. Another reason for increased speech quality is the fully digital, packet-based transmission. However, so far, no speech quality prediction model has been able to estimate super-wideband quality without a clean reference signal. In this paper, we present a non-intrusive speech quality assessment model, NISQA, which – in contrast to current state-of-the-art models – can predict the quality of super-wideband speech transmission. Furthermore, it is able to accurately predict the quality impact of the packet loss concealment of modern codecs, such as Opus and EVS. The model uses a novel approach in which a CNN first estimates the per-frame quality, and an RNN subsequently aggregates the per-frame values over time to estimate the overall speech quality. Averaged over a comprehensive test set, the model achieves an RMSE* of 0.29 (after third-order polynomial mapping) against subjective MOS.
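
The CNN-to-RNN pipeline described in the abstract can be sketched as follows; this is an illustrative reconstruction, not the authors' published code, and all layer sizes, module names, and the spectrogram front end are assumptions:

```python
# Minimal sketch of the CNN -> RNN idea described above (not the authors'
# published code); layer sizes, module names, and the spectrogram front end
# are illustrative assumptions.
import torch
import torch.nn as nn

class FramewiseCNN(nn.Module):
    """Maps each spectrogram patch to a per-frame feature/quality vector."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, x):                # x: (batch * frames, 1, mels, width)
        return self.fc(self.conv(x).flatten(1))

class QualityRNN(nn.Module):
    """Aggregates the per-frame values over time into one MOS estimate."""
    def __init__(self, in_dim=64, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, frames):           # frames: (batch, time, in_dim)
        _, (h_n, _) = self.rnn(frames)
        return self.head(h_n[-1])        # overall quality per utterance
```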

54 citations


Posted Content
TL;DR: This paper presents TextComplexityDE, a dataset of 1000 German sentences taken from 23 Wikipedia articles across 3 different article genres, intended for developing text-complexity prediction models and automatic text simplification for German.
Abstract: This paper presents TextComplexityDE, a dataset consisting of 1000 German sentences taken from 23 Wikipedia articles across 3 different article genres, to be used for developing text-complexity prediction models and automatic text simplification in German. The dataset includes subjective assessments of different text-complexity aspects provided by learners of German at levels A and B. In addition, it contains manual simplifications of 250 of those sentences provided by native speakers, as well as subjective assessments of the simplified sentences by participants from the target group. The subjective ratings were collected using both laboratory studies and a crowdsourcing approach.

20 citations


Journal ArticleDOI
TL;DR: A speech quality model, based on wireless parameters, such as signal-to-noise ratio, Doppler shift, MIMO configurations, and different modulation schemes, is proposed and inserted into the wide-band E-model algorithm.
Abstract: Communication service providers use specialized solutions to evaluate the quality of their services. Also, different mechanisms that increase network robustness are incorporated into current communication systems. One of the most widely accepted techniques to improve transmission performance is the MIMO system. In communication services, voice quality is important in determining the user’s quality of experience. Nowadays, different speech quality assessment methods are used; one of them is the parametric method, which is used for network planning purposes. ITU-T Rec. G.107.1 is the most accepted model for wide-band communication systems. However, it considers neither the degradations occurring in a wireless network nor the quality improvement provided by MIMO systems. Thus, we propose a speech quality model based on wireless parameters such as signal-to-noise ratio, Doppler shift, MIMO configuration, and different modulation schemes. Also, real speech signals encoded by 3 operation modes of the AMR-WB codec are used in the test scenarios. The resulting speech samples were evaluated by the algorithm described in ITU-T Rec. P.862.2, whose scores are used as a reference. With these results, a wireless function named $I_{W-M}$ that relates the wireless network parameters to speech quality is proposed and inserted into the wide-band E-model algorithm. It is worth noting that the main novelty of the proposed $I_{W-M}$ is its consideration of the quality improvement contributed by MIMO systems with different antenna array configurations. The performance validation results demonstrate that including the $I_{W-M}$ values in the global score R yields a reliable model, reaching a Pearson correlation coefficient of 0.976 and a normalized RMSE of 0.144.
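
As a rough illustration of how such a wireless impairment factor could enter the wide-band E-model rating: the structure R = Ro - Is - Id - Ie,eff + A follows ITU-T G.107/G.107.1, but the concrete fit function below is a placeholder assumption, not the paper's fitted $I_{W-M}$:

```python
# Sketch of how an extra wireless impairment could enter the WB E-model
# rating. The structure R = Ro - Is - Id - Ie,eff + A follows ITU-T
# G.107/G.107.1; the i_wm fit below is purely a placeholder, not the
# paper's fitted I_W-M.
def i_wm(snr_db, doppler_hz, n_antennas, modulation):
    """Hypothetical impairment: low SNR and high Doppler hurt, MIMO helps."""
    mimo_gain = 2.0 * (n_antennas - 1)
    mod_penalty = {"QPSK": 0.0, "16QAM": 3.0, "64QAM": 6.0}[modulation]
    return max(0.0, 25.0 - 0.8 * snr_db + 0.05 * doppler_hz
               + mod_penalty - mimo_gain)

def r_wideband(ro=129.0, i_s=0.0, i_d=0.0, ie_eff=0.0, a=0.0, i_wireless=0.0):
    # The wide-band E-model extends the rating scale up to Ro ~ 129.
    return ro - i_s - i_d - ie_eff + a - i_wireless

r = r_wideband(ie_eff=10.0, i_wireless=i_wm(15, 50, 2, "16QAM"))
```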

20 citations


Proceedings ArticleDOI
18 Jun 2019
TL;DR: Investigating the influence of serial-position effects on the Game Experience Questionnaire shows that the GEQ does not suffer from recency, primacy, or peak effects; however, when users are asked about the controllability and responsiveness of the games, a recency effect is present.
Abstract: When a participant is asked to evaluate a stimulus, the judgment is based on the remembered experience, which might differ from the actual experience. This phenomenon is explained by the theory that some moments of an experience, such as its beginning, peak, and end, have more impact on memory. These moments can be recalled with a higher probability than other parts of the experience, and some minor bad moments might be forgotten or forgiven in light of the remaining good experience. Using a subjective study that injects an artificial delay into participants' gameplay, this paper investigates the influence of these serial-position effects on the Game Experience Questionnaire (GEQ). The results show that the GEQ does not suffer from recency, primacy, or peak effects. However, when users are asked about the controllability and responsiveness of the games, a recency effect is present. The paper also shows that the GEQ exhibits a forgiveness effect: participants forgive, or may forget, a bad experience if it coincides with a considerable duration of good experience.

19 citations


Proceedings ArticleDOI
05 Jun 2019
TL;DR: The results show that gamers can adapt to constant delay while playing and change their behavior if the actions in a game are predictable; these findings can inform a network resource allocation technique that manages a congested network by giving higher priority and more resources to games players cannot adapt to than to adaptable ones.
Abstract: Both online and cloud gaming services require very low network delay to create a good Quality of Experience (QoE) for their users. The required network latency cannot be guaranteed due to the current best-effort nature of the network, and as a result, network latency often degrades the gamer’s performance and QoE. In this paper, the adaptability of gamers to different variations of delay is investigated both subjectively and objectively using three self-developed games. The results show that gamers can adapt to constant delay while they are playing and change their behavior if the actions in a game are predictable. Such adaptation leads to a significant increase in gamers’ performance and QoE. The paper also provides evidence that, regardless of performance, frequent delay switching annoys gamers. The results of this study can be used to create a network resource allocation technique that manages a congested network by giving higher priority and more resources to the games players cannot adapt to than to the adaptable ones.

16 citations


Journal ArticleDOI
01 Dec 2019
TL;DR: In this paper, the authors investigated whether a rating performed in a virtual environment is comparable to a rating obtained via a paper questionnaire and how questionnaires for assessing virtual experiences should be designed and integrated into the virtual environment.
Abstract: Current developments in virtual reality (VR) hardware have made immersive VR experiences more affordable through commercially available head-mounted displays. As more studies are likely to be conducted using these devices, the question arises how to embed questionnaires in virtual environments without impairing the immersive user experience. In this work we investigate two different aspects: (1) whether a rating performed in a virtual environment is comparable to a rating obtained via a paper questionnaire and (2) how questionnaires for assessing virtual experiences should be designed and integrated into the virtual environment. For this research, we used our own extended version of VRate—a VR questionnaire asset for Unity. In the first study with 27 participants, we compared ratings assessed within VR with ratings obtained using a paper questionnaire. We found that the ratings gathered in VR are comparable to the ratings gathered in the real world by paper–pencil questionnaires (subscales: global presence, spatial presence, and experience realism). In the second study with 48 participants, we investigated the users’ perceived suitability of the VR questionnaire and the optimal mounting position of the questionnaire (hand-mounted, head-up display, or billboard). Moreover, we investigated whether the questionnaire should be answered in the same or in a separate dedicated virtual environment and how the users’ feeling of presence in VR is influenced by this placement. Results indicate a subjective preference for the billboard position, with a significant preference for billboard over hand-mounted and no significant preference between billboard and head-up display. Regarding the placement of the VR questionnaire (in-scene vs. dedicated virtual environment) we did not find any influence on presence. Finally, we discuss the pros and cons of different placement/mounting options and provide suggestions for designing and implementing questionnaires embedded in virtual environments.

15 citations


Journal ArticleDOI
TL;DR: The present study exemplifies the utility of physiological methods like EEG for dissociating speech degradations not only by their perceived intensity level, but also by their distinctive quality dimension.
Abstract: OBJECTIVE By means of subjective psychophysical methods, the quality of transmitted speech has been decomposed into three perceptual dimensions named 'discontinuity' (F), 'noisiness' (N) and 'coloration' (C). Previous studies using electroencephalography (EEG) have already reported effects of the perceived intensity of single quality dimensions on electrical brain activity. However, it has not yet been investigated whether the dimensions themselves are dissociable at a neurophysiological level of analysis. APPROACH Pursuing this goal in the present study, a high-quality (HQ) recording of a spoken word was degraded on one dimension at a time, resulting in three quality-impaired stimuli (F, N, C) which were on average described as being equal in perceived degradation intensity. Participants performed a three-stimulus oddball task, involving the serial presentation of different stimulus types: (1) HQ or degraded 'standard' stimuli to establish sensory/perceptual quality references. (2) Degraded 'oddball' stimuli to cause random, infrequent deviations from those references. EEG was employed to examine the neuro-electrical correlates of speech quality perception. MAIN RESULTS Emphasis was placed on modulations in the temporal and morphological characteristics of the P300 component of the event-related brain potential (ERP), whose subcomponents P3a and P3b are commonly linked to attentional orienting and task relevance categorization, respectively. Electrophysiological data analysis revealed significant modulations of P300 amplitude and latency by the perceptual dimensions underlying both quality references and oddball stimuli. SIGNIFICANCE The present study exemplifies the utility of physiological methods like EEG for dissociating speech degradations not only by their perceived intensity level, but also by their distinctive quality dimension.

13 citations


Proceedings ArticleDOI
01 Jun 2019
TL;DR: This research proposes a speech quality parametric model (SQPM) based on artificial neural networks that considers both wireless network degradation characteristics and the techniques used to improve transmission quality, and is intended to be useful for wireless network planning tasks.
Abstract: In communication services, speech quality plays an important role in meeting user expectations. Nowadays, there are different objective methods to estimate speech quality. Parametric models consider different factors, such as network parameters, acoustic characteristics, and communication equipment, among others. The most representative parametric models for telephone services are described in ITU-T Rec. G.107 and G.107.1, commonly known as the E-model and the WB E-model, respectively. However, they do not consider wireless network parameters as inputs. In this context, this research proposes a speech quality parametric model (SQPM) based on artificial neural networks that considers both wireless network degradation characteristics and the techniques used to improve transmission quality. For this purpose, a network simulator was built, in which two forward error correction (FEC) codes and four different antenna configurations in a multiple-input-multiple-output (MIMO) system are implemented. To validate the results obtained by the simulator, ITU-T Rec. P.863 and the WB E-model are used. Experimental results show how different wireless network configurations impact speech quality. Performance evaluation results demonstrated a high correlation between the proposed SQPM and the ITU-T Rec. P.863 results, reaching a PCC of 0.9901 and an RMSE of 0.1492. Therefore, our proposal intends to be useful for wireless network planning tasks.
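
A minimal sketch of the SQPM idea, an artificial neural network mapping wireless-network parameters to a P.863-style MOS target, might look as follows; the feature choices, network size, and synthetic training data are assumptions, not the paper's setup:

```python
# Minimal sketch of the SQPM idea: an ANN mapping wireless parameters to a
# P.863-style MOS target. Feature choices, network size, and the synthetic
# training data are assumptions, not the paper's setup.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Columns: SNR [dB], FEC code index, MIMO antennas, modulation index
X = rng.uniform([0, 0, 1, 0], [30, 2, 4, 3], size=(500, 4))
y = 1 + 4 / (1 + np.exp(-(X[:, 0] - 10) / 4))      # toy MOS in (1, 5)

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(32, 16),
                                   max_iter=2000, random_state=0))
model.fit(X, y)
print(model.predict([[15.0, 1.0, 2.0, 1.0]]))      # predicted MOS
```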

12 citations


Proceedings ArticleDOI
05 Jun 2019
TL;DR: This paper describes the development of an automated readability assessment estimator based on supervised learning algorithms over German text corpora, with a Random Forest regressor yielding the best RMSE.
Abstract: Data-driven approaches to readability assessment, using automated linguistic analysis and machine learning methods, are a viable road forward for readability rankings. This paper describes the development of an automated readability assessment estimator based on supervised learning algorithms over German text corpora. For this purpose, natural language processing tools are used to extract 73 linguistic features grouped into traditional, lexical, and morphological features. Feature engineering approaches are employed to select informative features. Different supervised learning models are implemented, with the top-ranked features fed as input. The results show that a Random Forest regressor yields the best result (0.847) on the RMSE measure.
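
For illustration, the supervised setup above (linguistic feature vectors in, a complexity rating out, scored by RMSE) can be sketched with a Random Forest regressor; the 73 real features are replaced by a random stand-in matrix, so this is not the paper's code:

```python
# Sketch of the supervised setup described above: feature vectors in,
# complexity rating out, scored by RMSE. The 73 real linguistic features
# are replaced here by a random stand-in matrix.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 73))        # 73 linguistic features per sentence
y = X[:, :5].mean(axis=1) + rng.normal(scale=0.3, size=1000)  # toy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, rf.predict(X_te)) ** 0.5
print(f"RMSE: {rmse:.3f}")
# Feature selection could rank rf.feature_importances_ and refit on the top k.
```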

12 citations


Proceedings ArticleDOI
05 Jun 2019
TL;DR: Results show that the level of UI complexity has a significant influence on readability, while the positioning of UI elements significantly influences users’ perception of support from the system.
Abstract: In recent years, as virtual reality (VR) technology has developed extensively, more and more people are using it in different fields. One of the fast-developing fields in VR is exergames, a combination of physical exercise and gaming. With the goal of engaging people in physical activity, VR exergames should look and feel good for users. Therefore, the user interface (UI) in VR is important and has to be built in a way that enhances user experience. In this paper, an ergometer is used together with a VR rowing environment as a VR exergame for a study aiming to explore the possibilities of UI in VR. Accordingly, different metrics commonly used to quantify the rowing action (e.g., speed and distance) were visualized. The visualizations differed in positioning (closer to or further away from the player) and in level of complexity (more or fewer metrics, shown either as plain numbers or in a gamified design). During the experiment, participants (N = 27) rowed four times under different conditions depending on where and how the game’s metrics were shown: 1) as a cockpit at the front of the rowing ergometer with a gamified visualization of the metrics; 2) as a coach boat that follows the player with a gamified visualization of the metrics on a screen; 3) as a cockpit at the front of the rowing ergometer with a digital visualization of the metrics; 4) as a coach boat that follows the player with a digital visualization of the metrics on a screen. Results show that the level of UI complexity has a significant influence on readability, while the positioning of UI elements significantly influences users’ perception of support from the system. Furthermore, participants preferred the opposite level of complexity depending on the position where the metrics were shown.

11 citations


Journal ArticleDOI
TL;DR: The obtained results elucidate the importance of contextual and content-related influencing factors in proving the validity of the P300 as a psychophysiological indicator of speech quality change; questions regarding the transfer of ERP-based quality assessment to more practically relevant measurement contexts are discussed.
Abstract: Objective Non-invasive physiological methods like electroencephalography (EEG) are increasingly employed to assess human information processing during exposure to multimedia signals. In the quality engineering field, previous research has promoted the utility of the P300 event-related brain potential (ERP) component for indicating variation in quality perception. The present study provides a starting point for testing whether the P300 and its two subcomponents, P3a and P3b, are truly reflective of changes in the perceived quality of transmitted speech signals given the presence of other, quality-unrelated changes in acoustic stimulation. Approach High-quality and degraded variants of spoken words were presented in a two-feature oddball task, which required participants to actively respond to rarely occurring 'target' stimuli within a series of frequent 'standard' stimuli, thereby eliciting ERP waveforms. Target presentations involved either single quality changes or concurrent double changes in quality and the initial phoneme. Main results When an additional phonological change was present, only the varying quality of the standard stimuli caused significant modulations of P3a and P3b characteristics (N = 32). Thus, the formation of different short-term quality references exerted a persisting influence on the auditory processing of transmitted speech. Significance The obtained results elucidate the importance of contextual and content-related influencing factors in proving the validity of the P300 as a psychophysiological indicator of speech quality change. Associated questions regarding the transfer of ERP-based quality assessment into more practically relevant measurement contexts are discussed.

Proceedings ArticleDOI
15 Sep 2019
TL;DR: A single-ended quality diagnosis model for super-wideband speech communication networks is presented, which predicts the perceived Noisiness, Coloration, and Discontinuity of transmitted speech and can additionally indicate the cause of quality degradation.
Abstract: We present a single-ended quality diagnosis model for super-wideband speech communication networks, which predicts the perceived Noisiness, Coloration, and Discontinuity of transmitted speech. The model is an extension to the single-ended speech quality prediction model NISQA and can additionally indicate the cause of quality degradation. Service providers can use the model independently of the communication system’s technology since it is based on universal perceptual quality dimensions. The prediction model consists of a convolutional neural network that firstly calculates per-frame features of a speech signal and subsequently aggregates the features over time with a recurrent neural network to estimate the speech quality dimensions. The proposed diagnosis model achieves promising results with an average RMSE* of 0.24.

Posted Content
TL;DR: DiaMaT trains a neural text classifier to distinguish human from machine translations; systematic differences between the two classes are then uncovered with neural explainability methods.
Abstract: Evaluating translation models is a trade-off between effort and detail. On the one end of the spectrum there are automatic count-based methods such as BLEU, on the other end linguistic evaluations by humans, which arguably are more informative but also require a disproportionately high effort. To narrow the spectrum, we propose a general approach on how to automatically expose systematic differences between human and machine translations to human experts. Inspired by adversarial settings, we train a neural text classifier to distinguish human from machine translations. A classifier that performs and generalizes well after training should recognize systematic differences between the two classes, which we uncover with neural explainability methods. Our proof-of-concept implementation, DiaMaT, is open source. Applied to a dataset translated by a state-of-the-art neural Transformer model, DiaMaT achieves a classification accuracy of 75% and exposes meaningful differences between humans and the Transformer, amidst the current discussion about human parity.

Proceedings ArticleDOI
01 Jun 2019
TL;DR: A survey investigating the working-environment characteristics of crowd users from German-speaking countries is reported, providing insights aimed at easing the decision-making process when designing subjective user studies.
Abstract: Crowdsourcing has been used extensively for gathering and annotating data cost-efficiently. Nowadays, multiple platforms offer a crowdsourced workforce, yet most of their users are from Asia or English-speaking countries, with few native German speakers. Thus, there is a lack of information regarding the conditions under which German users execute tasks and their habits when taking part in crowdsourcing campaigns. Such information is essential for properly targeting user studies at German crowdworkers. This paper reports on a survey that investigated the working-environment characteristics of users from German-speaking countries. To this end, a study was conducted in which users were asked to provide details about the surroundings in which they normally execute crowdsourcing tasks. Audio and visual data were also collected from each user, further enriching the information gained from the users’ input. We provide insights aimed at easing the decision-making process when designing subjective user studies.

Proceedings ArticleDOI
05 Jun 2019
TL;DR: Results show that the perception of delay and QoE differ depending on the user’s own delay: participants perceived the opponent as delayed even if only their own side had network delay, and rated QoE significantly lower only when their own delay was high.
Abstract: One of the fields where Virtual Reality (VR) is finding a potentially growing market is the combination of exercising and gaming, also called exergaming. When it comes to competition in gaming, it is important to investigate how different levels of delay influence the overall quality of experience (QoE) in VR multiplayer exergames. The experimental setup consisted of a VR application coupled with a rowing ergometer, allowing races between the user and an artificially created opponent that follows the player at a similar speed, keeping the race tight. To investigate the influence of delay, three levels of network delay (30 ms, 100 ms, and 500 ms) were introduced on both the user’s and the opponent’s side and mixed across different conditions. After each session, participants rated perceived flow, sense of presence, and the degree to which they noticed delay in their own or the opponent’s system. Interestingly, results show that the perception of delay and QoE differ depending on the user’s own delay. Participants perceived the opponent as delayed even if only their own side had network delay, and rated QoE significantly lower only when their own delay was high.

Proceedings ArticleDOI
05 Jun 2019
TL;DR: Estimates of VQoE might be delivered in an objective, continuous and concealed manner, thus diminishing any further need for subjective self-reports.
Abstract: As known from everyday contexts of multimedia usage, suddenly occurring quality impairments are capable of causing strong negative emotions in human users. This is particularly the case if the displayed content is highly relevant to current motives and behavioral goals. The present study investigated the effects of visual degradations on the quality perception and emotional state of participants who were exposed to a series of short video clips. After each video playback, participants had to decide whether a certain event happened in the video. For data collection, subjective measures of quality and emotion were complemented by behavioral measures derived from capturing participants’ spontaneous facial expressions. For data analysis, two general approaches were combined: First, a multivariate analysis of variance allowed us to examine the effects of visual degradation factors on perceived quality and subjective emotional dimensions. It mainly revealed that perceived quality and emotional valence were both sensitive to degradation intensity, whereas the impact of degradation length was limited when task-relevant video content had already been obscured. Second, using a machine learning approach, an automatic Video Quality of Experience (VQoE) prediction system based on the recorded facial expressions was derived, demonstrating a strong correlation between facial expressions and perceived quality. In this way, estimates of VQoE might be delivered in an objective, continuous, and concealed manner, thus diminishing any further need for subjective self-reports.

Proceedings ArticleDOI
13 May 2019
TL;DR: This work investigates the intra- and inter-listener agreement within a subjective speech quality assessment task and finds that disagreement can, to some extent, represent a source of information.
Abstract: Crowdsourcing is a great tool for conducting subjective user studies with large numbers of users. Collecting reliable annotations about the quality of speech stimuli is challenging: the task itself is highly subjective, and users in crowdsourcing work without supervision. This work investigates the intra- and inter-listener agreement within a subjective speech quality assessment task. To this end, a study was conducted in the laboratory and in crowdsourcing in which listeners were requested to rate speech stimuli with respect to their overall quality. Ratings were collected on a 5-point scale in accordance with ITU-T Rec. P.800 and P.808, respectively. The speech samples were taken from the database of ITU-T Rec. P.501 Annex D and were presented four times to the listeners. Finally, the crowdsourcing results were contrasted with the ratings collected in the laboratory. A strong and significant Spearman correlation was found between the ratings collected in the two environments. Our analysis shows that while the inter-rater agreement increased the more often the listeners performed the assessment task, the intra-rater reliability remained constant. Our study setup helped to overcome the subjectivity of the task, and we found that disagreement can, to some extent, represent a source of information.
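
The lab-versus-crowd comparison described above reduces to a Spearman correlation between per-stimulus ratings from the two environments; the rating arrays below are made-up stand-ins for the P.800 (lab) and P.808 (crowd) MOS values:

```python
# Spearman correlation between per-stimulus mean ratings from the two
# environments; the arrays are illustrative stand-ins, not study data.
from scipy.stats import spearmanr

lab_mos   = [4.2, 3.8, 2.1, 1.5, 3.3, 4.6, 2.7]   # per-stimulus lab means
crowd_mos = [4.0, 3.9, 2.4, 1.8, 3.1, 4.4, 2.9]   # per-stimulus crowd means

rho, p = spearmanr(lab_mos, crowd_mos)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```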

Journal ArticleDOI
01 Dec 2019
TL;DR: This paper investigates which features were dominant for the classification of quality of life in people with dementia and how the results might be utilized for managing the general QoL of PwD.
Abstract: While the number of dementia cases is steadily increasing, as of today no medication has been developed to cure its underlying causes. Instead, the focus in treatment has shifted to improving quality of life (QoL) for people with dementia (PwD). To this end, some non-pharmacological treatments such as exercising, socializing, and playing games have received increasing attention. PflegeTab is a tablet-based application developed for this purpose. It includes a number of services such as cognitive training games, everyday activity training games, emotional applications, and a biographical picture album. In the present paper, we explore the possibility of QoL prediction for PwD using data collected while nursing home residents (N = 81) played games in PflegeTab. Using features generated from the data and applying linear discriminant analysis for classification, our approach obtained an average accuracy of 74.80% in predicting QoL ratings when measured by Monte Carlo cross-validation. Furthermore, this paper investigates which features were dominant for the classification (prominent features were, e.g., the time needed for task completion) and briefly discusses how the results might be utilized for managing the general QoL of PwD.
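
A minimal sketch of this classification setup, linear discriminant analysis evaluated with Monte Carlo (repeated random-split) cross-validation, is shown below; the features and labels are synthetic placeholders, not PflegeTab data:

```python
# LDA on game-play features, accuracy estimated by Monte Carlo
# cross-validation. Features and labels are synthetic placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(81, 10))            # e.g. task-completion times etc.
y = rng.integers(0, 2, size=81)          # binarized QoL rating (toy)

cv = ShuffleSplit(n_splits=100, test_size=0.25, random_state=0)
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=cv)
print(f"Monte Carlo CV accuracy: {scores.mean():.3f}")
```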

01 Jan 2019
TL;DR: ReTiCo is a Python-based programming framework that utilizes the concepts of incremental processing to create interactive dialogue systems and simulations of conversational behavior; it is focused on simplifying the use of incremental modules to make research on specific tasks of an incremental dialogue system easier.
Abstract: In this paper we present ReTiCo, a Python-based programming framework that utilizes the concepts of incremental processing to create interactive spoken dialogue systems and simulations of conversational behavior. In contrast to already existing toolkits like InproTK, our framework allows for quick visual creation of complex networks and is able to save and load incremental networks for simulations, automated testing, as well as analysis. It is focused on simplifying the use of incremental modules to make research on specific tasks of an incremental dialogue system (like real-time speech signal processing) and on the interaction between different incremental modules easier. We make this framework accessible as open source so that it can be used in research on spoken dialogue systems and conversation simulation.

Proceedings ArticleDOI
01 Jun 2019
TL;DR: Inspired by adversarial settings, a neural text classifier is trained to distinguish human from machine translations; systematic differences between the two classes are then uncovered with neural explainability methods.
Abstract: Evaluating translation models is a trade-off between effort and detail. On the one end of the spectrum there are automatic count-based methods such as BLEU, on the other end linguistic evaluations by humans, which arguably are more informative but also require a disproportionately high effort. To narrow the spectrum, we propose a general approach on how to automatically expose systematic differences between human and machine translations to human experts. Inspired by adversarial settings, we train a neural text classifier to distinguish human from machine translations. A classifier that performs and generalizes well after training should recognize systematic differences between the two classes, which we uncover with neural explainability methods. Our proof-of-concept implementation, DiaMaT, is open source. Applied to a dataset translated by a state-of-the-art neural Transformer model, DiaMaT achieves a classification accuracy of 75% and exposes meaningful differences between humans and the Transformer, amidst the current discussion about human parity.
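
A heavily simplified stand-in for this approach is sketched below: DiaMaT uses a neural classifier with neural explainability methods, whereas here a linear TF-IDF model is substituted so that the "explanation" is simply the weight vector; the data is a toy placeholder:

```python
# Simplified stand-in: train a classifier to separate human from machine
# translations, then inspect which features drive the decision. This is a
# linear substitute for DiaMaT's neural classifier + explainability stack.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["the committee reached an agreement", "the committee has reach agree",
         "she spoke at length about it", "she spoke long time about it"]
labels = [0, 1, 0, 1]                     # 0 = human, 1 = machine (toy data)

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# Features pushing hardest toward the "machine" class:
names = np.array(vec.get_feature_names_out())
top = np.argsort(clf.coef_[0])[-5:]
print(names[top])
```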

Proceedings ArticleDOI
01 Jun 2019
TL;DR: A model based on intra-rater reliability, the root-mean-squared deviation between listeners’ ratings, and age is built to predict listener performance; it is intended to provide a measure of how valid crowdsourcing results are when there are no laboratory results to compare against.
Abstract: Crowdsourcing has become a convenient instrument for bringing subjective user studies to large numbers of users. Data from crowdsourcing can be corrupted by users’ neglect, and different mechanisms have been proposed to address users’ reliability and to ensure valid experimental results. Users who are consistent in their answers, i.e. present a high intra-rater reliability score, are desired for subjective studies. This work investigates the relationship between intra-rater reliability and user performance in the context of a speech quality assessment task. To this end, a crowdsourcing study was conducted in which users were requested to rate speech stimuli with respect to their overall quality. Ratings were collected on a 5-point scale in accordance with ITU-T Rec. P.808. The speech stimuli were taken from the database of ITU-T Rec. P.501 Annex D, and the results are contrasted with ratings collected in a laboratory experiment. Furthermore, a model based on intra-rater reliability, the root-mean-squared deviation between the listeners’ ratings, and age was built to predict listener performance. Such a model is intended to provide a measure of how valid crowdsourcing results are when there are no laboratory results to compare against.
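
One hedged reading of such a predictor is sketched below: per-listener features (intra-rater reliability across repeated ratings, RMSD to the group mean, and age) feed a simple linear model of listener performance; the data, the two-repetition design, and the model form are all assumptions:

```python
# Per-listener features feeding a simple performance model; everything
# below (data, repetition count, target definition) is a placeholder.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

# listeners x stimuli x 2 repetitions of each stimulus
ratings = np.random.default_rng(3).integers(1, 6, size=(30, 20, 2))
group_mean = ratings.mean(axis=(0, 2))

intra = np.array([pearsonr(r[:, 0], r[:, 1])[0] for r in ratings])
rmsd = np.sqrt(((ratings.mean(axis=2) - group_mean) ** 2).mean(axis=1))
age = np.random.default_rng(4).integers(18, 70, size=30)

X = np.column_stack([intra, rmsd, age])
perf = 1 - rmsd + 0.1 * intra            # toy performance target
print(LinearRegression().fit(X, perf).coef_)
```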

Proceedings ArticleDOI
15 Sep 2019
TL;DR: This paper proposes a first version of an extended E-model which addresses both super-wideband and fullband scenarios, and which predicts the effects of speech codecs, packet loss, and delay as the most important degradations to be expected in such scenarios.
Abstract: Parametric models have long been used to plan speech communication services with regard to the quality experienced by their users. These models predict the overall quality experienced by a communication partner on the basis of parameters describing the elements of the transmission channel and the terminal equipment. The most widely used model is the E-model, which is standardized in ITU-T Rec. G.107 for narrowband and in ITU-T Rec. G.107.1 for wideband scenarios. However, with the advent of super-wideband and fullband transmission, the E-model needs to be extended. In this paper, we propose a first version of an extended E-model which addresses both super-wideband and fullband scenarios, and which predicts the effects of speech codecs, packet loss, and delay as the most important degradations to be expected in such scenarios. Predictions are compared to the results of listening-only and conversational tests as well as to signal-based predictions, showing reasonable prediction accuracy.
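
The structure of such an extension can be sketched as follows. The maximum rating of about 148 matches the later-standardized fullband E-model (ITU-T Rec. G.107.2), the packet-loss term follows the familiar Ie,eff pattern from G.107/G.107.1, and the delay term uses the G.107-style Idd formula; the concrete codec values below are illustrative assumptions, and the actual extended model may rescale these constants:

```python
# Sketch of an E-model extended to a fullband rating scale. Codec values
# (ie, bpl) are illustrative; constants may differ in the real extension.
import math

def ie_eff(ie, bpl, ppl, burst_r=1.0):
    # Effective equipment impairment under (random/bursty) packet loss.
    return ie + (95 - ie) * ppl / (ppl / burst_r + bpl)

def idd(ta_ms):
    # Delay impairment for one-way delay Ta (G.107-style formula).
    if ta_ms <= 100:
        return 0.0
    x = math.log10(ta_ms / 100) / math.log10(2)
    return 25 * ((1 + x**6) ** (1/6) - 3 * (1 + (x/3)**6) ** (1/6) + 2)

R_FB_MAX = 148.0
r = R_FB_MAX - ie_eff(ie=5.0, bpl=10.0, ppl=2.0) - idd(250)
print(f"R (fullband scale): {r:.1f}")
```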

Proceedings ArticleDOI
05 Jun 2019
TL;DR: This work analyzes the feasibility and appropriateness of micro-task crowdsourcing for evaluating different summary quality characteristics and reports ongoing work on the crowdsourced evaluation of query-based extractive text summaries.
Abstract: High cost and time consumption are concurrent barriers to research on and application of automated summarization. In order to explore options to overcome these barriers, we analyze the feasibility and appropriateness of micro-task crowdsourcing for evaluating different summary quality characteristics and report ongoing work on the crowdsourced evaluation of query-based extractive text summaries. To do so, we assess and evaluate a number of linguistic quality factors such as grammaticality, non-redundancy, referential clarity, focus, and structure & coherence. Our first results imply that referential clarity, focus, and structure & coherence are the main factors affecting the summary quality perceived by crowdworkers. Further, we compare these results against an initial set of expert annotations that is currently being collected, as well as against an initial set of scores from ROUGE, an automatic quality measure for summary evaluation. Preliminary results show that ROUGE does not correlate with the linguistic quality factors, regardless of whether they are assessed by the crowd or by experts. Further, crowd and expert ratings show the highest degree of correlation when assessing low-quality summaries; the assessments increasingly diverge for high-quality judgments.
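
For reference, the ROUGE score mentioned above can be computed with one widely used Python implementation, the rouge-score package; the summary strings are toy examples:

```python
# ROUGE computed with the `rouge-score` package (pip install rouge-score);
# the reference and candidate strings are toy examples.
from rouge_score import rouge_scorer

reference = "the committee approved the new budget on friday"
candidate = "the new budget was approved by the committee"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
for name, score in scorer.score(reference, candidate).items():
    print(name, f"F1 = {score.fmeasure:.2f}")
```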

Proceedings ArticleDOI
05 Jun 2019
TL;DR: A new method for quality diagnosis of speech communication networks that builds upon recent developments in the field of semantic image segmentation is presented, which uses the deep convolutional network architecture SegNet to label each pixel of a speech spectrogram image as either clean or with its corresponding distortion.
Abstract: There are numerous instrumental tools available to monitor the perceived quality of speech communication networks. However, these tools give no insight into the cause of a quality degradation. In this paper, we present a new method for the quality diagnosis of speech communication networks that builds upon recent developments in the field of semantic image segmentation. The proposed model works non-intrusively, without the need for a clean reference signal. We use the deep convolutional network architecture SegNet and label each pixel of a speech spectrogram image as either clean or as one of the corresponding distortion types. This way, quality degradations can be located directly in the time and frequency domain. To train the model, we created a large database with four different distortion types: packet loss, background noise, GSM buzz, and bandwidth limitation. While processing the speech files, we also generated corresponding ground-truth labels with which we trained SegNet. Our experiments show promising results for this new diagnostic approach, with a mIoU of 0.75.
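
The reported evaluation metric, mean intersection-over-union, is computed per class over pixel label maps; a minimal sketch (with tiny label arrays standing in for spectrogram-sized masks) is:

```python
# Mean intersection-over-union over pixel label maps; the tiny arrays
# stand in for spectrogram masks ("clean" = 0, distortion classes = 1..4).
import numpy as np

def mean_iou(y_true, y_pred, n_classes):
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(y_true == c, y_pred == c).sum()
        union = np.logical_or(y_true == c, y_pred == c).sum()
        if union > 0:                      # skip classes absent in both maps
            ious.append(inter / union)
    return np.mean(ious)

y_true = np.array([[0, 0, 1], [2, 2, 0]])
y_pred = np.array([[0, 1, 1], [2, 0, 0]])
print(f"mIoU = {mean_iou(y_true, y_pred, n_classes=5):.2f}")
```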

01 Jan 2019
TL;DR: This work formulates a ranking schema that can be employed on textual claims to speed up the human fact-checking process; experiments show that the proposed method statistically outperforms the baseline.
Abstract: With the proliferation of online information sources, it has become more and more difficult to judge the trustworthiness of a statement on the Web. Nevertheless, recent advances in natural language processing allow us to analyze information more objectively according to certain criteria, e.g. whether a proposition is factual or opinionated, or even the authority or credibility of an author on a certain topic. In this paper, we formulate a ranking schema that can be employed on textual claims to speed up the human fact-checking process. Our experiments show that the proposed method statistically outperforms the baseline. Additionally, this work describes a multilingual dataset of claims collected from several fact-checking websites, which was used to fine-tune our model.

Proceedings ArticleDOI
24 Apr 2019
TL;DR: The methodology introduced in this research makes it possible to estimate the speech quality when FEC codes are implemented in a communication system.
Abstract: In current communication systems, Forward Error Correction (FEC) codes are used to decrease information losses in the transmission channel and thus improve the signal quality at the reception point. Speech quality is affected by many factors, packet loss being one of the most important. Currently, there are different speech quality assessment methods; for planning purposes, the E-model algorithm is the most representative. However, it does not consider wireless network characteristics and techniques. Based on this fact, the main contribution of this work is to adapt the wide-band (WB) E-model algorithm to evaluate the impact of FEC codes on speech quality. For this purpose, a function named $G_{FECx}(Ms, SNR)$ is proposed that quantifies the quality gain reached by FEC codes under different wireless channel conditions. This function is inserted into the $I_{e,eff,WB}$ impairment factor of the WB E-model algorithm. Experimental results demonstrate that the proposed solution achieves a high correlation with results obtained by ITU-T Rec. P.863, reaching an average PCC and RMSE of 0.990 and 2.629, respectively. Thus, the methodology introduced in this research makes it possible to estimate speech quality when FEC codes are implemented in a communication system.
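
As a hedged illustration of where such a gain term could act: the Ie,eff,WB formula below follows ITU-T G.107.1, while the gain function and the way the gain is subtracted are placeholder assumptions, not the paper's fitted $G_{FECx}(Ms, SNR)$:

```python
# Placeholder FEC gain acting inside the WB E-model's effective equipment
# impairment; the surrounding formula follows ITU-T G.107.1, the gain
# function and insertion point are assumptions.
def g_fec(snr_db, fec_rate):
    """Hypothetical quality gain from FEC, larger at low SNR / strong codes."""
    return max(0.0, (1 - fec_rate) * (20 - snr_db))

def ie_eff_wb(ie_wb, bpl, ppl, snr_db, fec_rate, burst_r=1.0):
    base = ie_wb + (95 - ie_wb) * ppl / (ppl / burst_r + bpl)
    return max(ie_wb, base - g_fec(snr_db, fec_rate))

print(ie_eff_wb(ie_wb=8.0, bpl=12.0, ppl=3.0, snr_db=10, fec_rate=0.5))
```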

Proceedings ArticleDOI
01 Apr 2019
TL;DR: The spectral entropy distance is proposed as a new measure for the objective quality estimation of noisy speech; it gives a robust measure of how noisy a signal is in the presence of active speech.
Abstract: In this paper, we propose to use the spectral entropy distance as a new measure for the objective quality estimation of noisy speech. While the perceived quality estimation of a transmitted speech signal under background noise is fairly straightforward, the estimation of noise on active speech is more complex. For example, an increase in loudness can be confused with noise by common quality measures. Also, other distortions, such as interruptions due to packet loss, can decrease the energy in the degraded signal and thus lead to an underestimation of the noisiness. This is especially critical when the noise is only present during active speech segments, as is the case for quantization noise caused by low-bitrate codecs or voice activity detection at the receiver side. The spectral entropy, however, only considers the frequency composition of a signal and does not depend on the signal energy. Therefore, it gives a robust measure of how noisy a signal is in the presence of active speech. In our experiments, we trained a prediction model based on the spectral entropy and obtained excellent prediction results, showing that the spectral entropy distance is indeed a useful tool for the quality estimation of noisy speech.
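
The core quantity is easy to sketch: the spectral entropy of a frame is the entropy of its normalized power spectrum, which depends on the spectral shape but not on the signal energy; taking the difference to a clean reference frame is one hedged reading of the proposed distance:

```python
# Spectral entropy of a frame from its normalized power spectrum; the
# entropy difference to a clean frame is one hedged reading of the measure.
import numpy as np

def spectral_entropy(frame, eps=1e-12):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    p = spectrum / (spectrum.sum() + eps)        # spectrum as a distribution
    return -np.sum(p * np.log2(p + eps))         # scale-invariant by design

fs = 16000
t = np.arange(fs // 100) / fs                     # one 10 ms frame
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.3 * np.random.default_rng(5).normal(size=t.size)

dist = spectral_entropy(noisy) - spectral_entropy(clean)
print(f"spectral entropy distance: {dist:.2f} bits")
```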