Journal Article

Nonintrusive Quality Assessment of Noise Suppressed Speech With Mel-Filtered Energies and Support Vector Regression

TL;DR: This paper proposes a nonintrusive metric for the quality assessment of noise-suppressed speech and utilizes the sensitivity of FBEs to noise in order to obtain an effective representation of speech towards quality assessment.
Abstract: Objective speech quality assessment is a challenging task which aims to emulate human judgment in the complex and time-consuming task of subjective assessment. It is difficult to perform in line with human perception due to the complex and nonlinear nature of the human auditory system. The challenge lies in representing speech signals using appropriate features and subsequently mapping these features into a quality score. This paper proposes a nonintrusive metric for the quality assessment of noise-suppressed speech. The originality of the proposed approach lies primarily in the use of Mel filter bank energies (FBEs) as features and the use of support vector regression (SVR) for feature mapping. We utilize the sensitivity of FBEs to noise in order to obtain an effective representation of speech for quality assessment. In addition, the use of SVR exploits the advantages of kernels, which allow the regression algorithm to learn complex data patterns via nonlinear transformation for an effective and generalized mapping of features into the quality score. Extensive experiments conducted using two third-party databases with different noise-suppressed speech signals show the effectiveness of the proposed approach.
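As a rough illustration of the feature side of the abstract above, the sketch below computes Mel filter bank energies for a single frame using only the Python standard library. It is not the authors' implementation: the 8 kHz sample rate, the 8 triangular filters, the 64-sample frame, the Hz-to-mel formula, and all function names are assumptions chosen for the demo. The SVR mapping stage is not shown; in practice it would typically be handled by a library such as LIBSVM.

```python
import cmath
import math


def hz_to_mel(hz):
    """A commonly used Hz -> mel mapping (assumed here, not taken from the paper)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for a short demo frame)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filter_bank_energies(frame, sample_rate=8000, num_filters=8):
    """Weight the power spectrum with triangular filters spaced evenly in mel."""
    spec = power_spectrum(frame)
    n_bins = len(spec)
    top_mel = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge points: each filter spans three consecutive edges.
    edges_hz = [mel_to_hz(top_mel * i / (num_filters + 1)) for i in range(num_filters + 2)]
    # Centre frequency of each DFT bin (assumes an even frame length).
    bin_hz = [k * sample_rate / (2.0 * (n_bins - 1)) for k in range(n_bins)]
    energies = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        energy = 0.0
        for f, p in zip(bin_hz, spec):
            if lo < f <= mid:      # rising slope of the triangle
                energy += p * (f - lo) / (mid - lo)
            elif mid < f < hi:     # falling slope of the triangle
                energy += p * (hi - f) / (hi - mid)
        energies.append(energy)
    return energies


# Demo: one 64-sample frame of a 1 kHz tone sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
fbe = mel_filter_bank_energies(tone)
```

For the tone in the demo, most of the energy lands in the filters whose triangles cover 1 kHz, which is the kind of band-wise picture that added noise would spread across the other filters.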
Citations
Journal Article
TL;DR: A historic perspective on mulsemedia work is presented and current developments in the area are reviewed and standardization efforts, via the MPEG-V standard, are described.
Abstract: Mulsemedia (multiple sensorial media) captures a wide variety of research efforts and applications. This article presents a historic perspective on mulsemedia work and reviews current developments in the area. These take place across the traditional multimedia spectrum, from virtual reality applications to computer games, as well as efforts in the arts, gastronomy, and therapy, to mention a few. We also describe standardization efforts, via the MPEG-V standard, and identify future developments and exciting challenges the community needs to overcome.

153 citations


Cites methods from "Nonintrusive Quality Assessment of ..."

  • ...M. Narwaria, W. Lin, I. Mcloughlin, S. Emmanue, and L. T. Chia....

  • ...W. Lin and C.-C. Jay Kuo....

  • ...W. Lin....

  • ...Z. Lu, W. Lin, X. Yang, E. Ong, and S. Yao....

  • ...Second, there is the issue of integration and adaptation where multiple media objects should be used jointly and separately to improve application performance, and distributed multimedia applications should provide transparent delivery of dynamic content in such a way that Authors' addresses: G. Ghinea (corresponding author), Department of Computer Science, Kingston Lane, Uxbridge, UB8 3PH, U.K.; email: george.ghinea@brunel.ac.uk; C. Timmerer, Universitätsstrasse 65-67 A-9020 Klagenfurt Austria; email: christian.timmerer@itec.uni-klu.ac.at; W. Lin, School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798; email: wslin@ntu.edu.sg; S. R. Gulliver, Henley Business School, Whiteknights, Reading, RG6 6UR, U.K.; email: s.r.gulliver@henley.reading.ac.uk....

Proceedings Article
02 Sep 2018
TL;DR: In this paper, an end-to-end, non-intrusive speech quality evaluation model, termed Quality-Net, based on bidirectional long short-term memory (LSTM) was proposed.
Abstract: Nowadays, most objective speech quality assessment tools (e.g., perceptual evaluation of speech quality (PESQ)) are based on the comparison of the degraded/processed speech with its clean counterpart. The need for a "golden" reference considerably restricts the practicality of such assessment tools in real-world scenarios since the clean reference usually cannot be accessed. On the other hand, human beings can readily evaluate speech quality without any reference (e.g., mean opinion score (MOS) tests), implying the existence of an objective and non-intrusive (no clean reference needed) quality assessment mechanism. In this study, we propose a novel end-to-end, non-intrusive speech quality evaluation model, termed Quality-Net, based on bidirectional long short-term memory. The evaluation of utterance-level quality in Quality-Net is based on the frame-level assessment. Frame constraints and sensible initializations of forget gate biases are applied to learn meaningful frame-level quality assessment from the utterance-level quality label. Experimental results show that Quality-Net can yield high correlation to PESQ (0.9 for the noisy speech and 0.84 for the speech processed by speech enhancement). We believe that Quality-Net has the potential to be used in a wide variety of applications of speech signal processing.
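The correlation figures quoted in the abstract above compare predicted scores against PESQ. The abstract does not say which correlation coefficient is used; the sketch below assumes the Pearson linear correlation, and the score lists are made-up values for illustration only, not results from the paper.

```python
import math


def pearson_correlation(predicted, reference):
    """Pearson linear correlation between two equal-length score lists."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(predicted) / len(predicted)
    mean_r = sum(reference) / len(reference)
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0.0 or var_r == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical scores, invented purely for this example.
pesq_scores = [1.2, 2.0, 2.7, 3.1, 3.9]
model_scores = [1.5, 1.9, 2.9, 3.0, 3.6]
r = pearson_correlation(model_scores, pesq_scores)
```

A value of `r` close to 1 means the non-intrusive predictions rise and fall with the reference-based scores, even if the two are on slightly different scales.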

93 citations

Journal Article
TL;DR: Investigations on ASVspoof 2015 challenge database and AVspoof database show that the proposed approach with a linear discriminative classifier yields a better system, irrespective of whether the spoofed signal is replayed to the microphone or is directly injected into the system software process.
Abstract: Automatic speaker verification systems can be spoofed through recorded, synthetic, or voice-converted speech of target speakers. To make these systems practically viable, the detection of such attacks, referred to as presentation attacks, is of paramount interest. In that direction, this paper investigates two aspects: 1) a novel approach to detect presentation attacks where, unlike conventional approaches, no assumptions related to speech signal modeling are made; rather, the attacks are detected by computing first-order and second-order spectral statistics and feeding them to a classifier, and 2) generalization of the presentation attack detection systems across databases. Our investigations on the ASVspoof 2015 challenge database and the AVspoof database show that, when compared to the approaches based on conventional short-term spectral features, the proposed approach with a linear discriminative classifier yields a better system, irrespective of whether the spoofed signal is replayed to the microphone or is directly injected into the system software process. Cross-database investigations show that neither the short-term spectral processing-based approaches nor the proposed approach yields systems which are able to generalize across databases or methods of attack, thus revealing the difficulty of the problem and the need for further resources and research.
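One plausible reading of "first-order and second-order spectral statistics" in the abstract above is the per-bin mean and standard deviation of the log-magnitude spectrum taken over all frames of an utterance. The sketch below follows that reading; the natural logarithm, the population standard deviation, the magnitude floor, and the function name are assumptions, since the abstract does not spell them out.

```python
import math


def spectral_statistics(magnitude_frames, floor=1e-10):
    """Per-bin mean and standard deviation of the log-magnitude spectrum.

    magnitude_frames: list of frames, each a list of magnitudes per frequency bin.
    Returns the means followed by the standard deviations as one feature vector.
    """
    if not magnitude_frames:
        raise ValueError("need at least one frame")
    num_frames = len(magnitude_frames)
    num_bins = len(magnitude_frames[0])
    means, stds = [], []
    for k in range(num_bins):
        # The floor avoids log(0) for silent bins (assumed value).
        logs = [math.log(max(frame[k], floor)) for frame in magnitude_frames]
        mean = sum(logs) / num_frames
        var = sum((v - mean) ** 2 for v in logs) / num_frames
        means.append(mean)
        stds.append(math.sqrt(var))
    return means + stds


# Two frames, two bins: bin 0 is constant, bin 1 varies between e and e^3.
frames = [[1.0, math.e], [1.0, math.e ** 3]]
features = spectral_statistics(frames)
```

The resulting vector has a fixed length regardless of utterance duration, which is what makes it convenient to feed to a simple classifier.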

45 citations


Cites background from "Nonintrusive Quality Assessment of ..."

  • ...In the literature it has been shown that first order and second order spectral statistics can be used to predict speech quality or quality assessment [47], [48]....

Proceedings Article
01 Aug 2016
TL;DR: Quantification of the experimental results suggests that the proposed metric gives more accurate and better-correlated scores than the ITU-T P.563 standard, an existing benchmark for objective, non-intrusive quality assessment.
Abstract: To emulate human perception in quality assessment, an objective metric or assessment method is required, which is a challenging task. Moreover, assessing the quality of speech without any reference or ground truth is altogether more difficult. In this paper, we propose a new non-intrusive speech quality assessment metric for objective evaluation of speech quality. The originality of the proposed scheme lies in using a deep autoencoder to extract low-dimensional features from the spectrum of the speech signal and in finding a mapping between these features and subjective scores using an artificial neural network (ANN). We show that autoencoder features capture noise information better than state-of-the-art Filterbank Energies (FBEs). Quantification of our experimental results suggests that the proposed metric gives more accurate and better-correlated scores than the ITU-T P.563 standard, an existing benchmark for objective, non-intrusive quality assessment.

45 citations


Cites methods from "Nonintrusive Quality Assessment of ..."

  • ...In [12], authors posed quality estimation as a regression problem and used average Mel Frequency Cepstral Coefficients (MFCCs) to find mapping to subjective scores using support vector regression (SVR)....

  • ...In this paper, the problem of speech quality assessment is posed as a regression problem, same as previously done in [12] and [13]....

References
Journal Article
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

40,826 citations

Book
01 Jan 1993
TL;DR: This book covers the fundamentals of speech recognition, from the speech signal and signal processing methods through pattern comparison techniques and hidden Markov models to connected-word and large-vocabulary continuous speech recognition and task-oriented applications.
Abstract: 1. Fundamentals of Speech Recognition. 2. The Speech Signal: Production, Perception, and Acoustic-Phonetic Characterization. 3. Signal Processing and Analysis Methods for Speech Recognition. 4. Pattern Comparison Techniques. 5. Speech Recognition System Design and Implementation Issues. 6. Theory and Implementation of Hidden Markov Models. 7. Speech Recognition Based on Connected Word Models. 8. Large Vocabulary Continuous Speech Recognition. 9. Task-Oriented Applications of Automatic Speech Recognition.

8,442 citations


Additional excerpts

  • ...The last section presents the concluding remarks....

Book
01 Jan 2004
TL;DR: This book provides an easy introduction for students and researchers to the growing field of kernel-based pattern analysis, demonstrating with examples how to handcraft an algorithm or a kernel for a new specific application, and covering all the necessary conceptual and mathematical tools to do so.
Abstract: Kernel methods provide a powerful and unified framework for pattern discovery, motivating algorithms that can act on general types of data (e.g. strings, vectors or text) and look for general types of relations (e.g. rankings, classifications, regressions, clusters). The application areas range from neural networks and pattern recognition to machine learning and data mining. This book, developed from lectures and tutorials, fulfils two major roles: firstly it provides practitioners with a large toolkit of algorithms, kernels and solutions ready to use for standard pattern discovery problems in fields such as bioinformatics, text analysis, image analysis. Secondly it provides an easy introduction for students and researchers to the growing field of kernel-based pattern analysis, demonstrating with examples how to handcraft an algorithm or a kernel for a new specific application, and covering all the necessary conceptual and mathematical tools to do so.

6,050 citations

Journal Article
TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.
Abstract: Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words, therefore the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.
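For readers unfamiliar with the mel-frequency cepstrum mentioned in the abstract above, the sketch below applies the usual cosine transform to the log energies of a mel filter bank, c_i = sum over k of X_k * cos(i * (k - 1/2) * pi / K). The ten coefficients match the abstract; the 20-filter bank used in the demo and the function name are assumptions made for illustration.

```python
import math


def mel_cepstrum(log_energies, num_coeffs=10):
    """Cosine transform of log mel filter bank energies.

    log_energies: log-energy outputs X_1..X_K of the K mel filters.
    Returns coefficients c_1..c_num_coeffs.
    """
    num_filters = len(log_energies)
    return [
        sum(
            x * math.cos(i * (k - 0.5) * math.pi / num_filters)
            for k, x in enumerate(log_energies, start=1)
        )
        for i in range(1, num_coeffs + 1)
    ]


# Demo with an assumed bank of 20 filters.
K = 20
flat = mel_cepstrum([1.0] * K)  # a flat log spectrum has no cepstral shape
tilted = mel_cepstrum([math.cos((k - 0.5) * math.pi / K) for k in range(1, K + 1)])
```

A flat log spectrum yields coefficients near zero, while a spectrum shaped like the first cosine basis function concentrates in the first coefficient, which shows how the transform separates broad spectral shape into a few numbers.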

4,822 citations

Proceedings Article
01 Jan 2000
TL;DR: A database designed to evaluate the performance of speech recognition algorithms in noisy conditions and recognition results are presented for the first standard DSR feature extraction scheme that is based on a cepstral analysis.
Abstract: This paper describes a database designed to evaluate the performance of speech recognition algorithms in noisy conditions. The database may be used for the evaluation of either front-end feature extraction algorithms using a defined HMM recognition back-end or complete recognition systems. The source speech for this database is TIdigits, a connected-digits task spoken by American English talkers (downsampled to 8 kHz). A selection of 8 different real-world noises has been added to the speech over a range of signal-to-noise ratios, and special care has been taken to control the filtering of both the speech and noise. The framework was prepared as a contribution to the ETSI STQ-AURORA DSR Working Group [1]. Aurora is developing standards for Distributed Speech Recognition (DSR) where the speech analysis is done in the telecommunication terminal and the recognition at a central location in the telecom network. The framework is currently being used to evaluate alternative proposals for front-end feature extraction. The database has been made publicly available through ELRA so that other speech researchers can evaluate and compare the performance of noise-robust algorithms. Recognition results are presented for the first standard DSR feature extraction scheme that is based on a cepstral analysis.

1,909 citations