Nonintrusive Quality Assessment of Noise Suppressed Speech With Mel-Filtered Energies and Support Vector Regression

doi:10.1109/TASL.2011.2174223


Journal Article


Manish Narwaria¹, Weisi Lin¹, Ian McLoughlin¹, Sabu Emmanuel¹, Liang-Tien Chia¹

Nanyang Technological University¹

IEEE Transactions on Audio, Speech, and Language Processing (IEEE), vol. 20, no. 4, pp. 1217-1232, 01 May 2012

TL;DR: This paper proposes a nonintrusive metric for assessing the quality of noise-suppressed speech, exploiting the sensitivity of Mel filter bank energies (FBEs) to noise to obtain a speech representation suited to quality assessment.


Abstract: Objective speech quality assessment is a challenging task that aims to emulate human judgment in the complex and time-consuming task of subjective assessment. It is difficult to perform in line with human perception due to the complex and nonlinear nature of the human auditory system. The challenge lies in representing speech signals using appropriate features and subsequently mapping these features into a quality score. This paper proposes a nonintrusive metric for the quality assessment of noise-suppressed speech. The originality of the proposed approach lies primarily in the use of Mel filter bank energies (FBEs) as features and of support vector regression (SVR) for feature mapping. We exploit the sensitivity of FBEs to noise to obtain an effective representation of speech for quality assessment. In addition, SVR exploits the advantages of kernels, which allow the regression algorithm to learn complex data patterns via a nonlinear transformation, giving an effective and generalized mapping of features into the quality score. Extensive experiments conducted using two third-party databases with different noise-suppressed speech signals show the effectiveness of the proposed approach.
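To make "Mel filter bank energies as features" concrete, here is a minimal sketch of log FBE extraction. The settings (25 ms Hann-windowed frames, 10 ms hop, 512-point FFT, 26 triangular filters, HTK-style mel scale) are common defaults assumed for illustration, not the paper's configuration. Likewise, the `utterance_features` helper, which averages frame-level FBEs into one fixed-length vector for a regressor, is a hypothetical pooling step rather than the authors' method.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumed convention; other variants exist)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters equally spaced on the mel scale, shape (n_filters, n_fft//2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge of the triangle
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling edge, peak of 1 at `center`
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_fbe(signal, sr, frame_ms=25.0, hop_ms=10.0, n_fft=512, n_filters=26):
    """Frame-level log Mel filter-bank energies, shape (n_frames, n_filters).

    Assumes len(signal) >= one frame and a frame length no larger than n_fft.
    """
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hanning(frame_len)
    frames = np.stack([signal[t * hop : t * hop + frame_len] * window for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft   # periodogram per frame
    energies = power @ mel_filterbank(n_filters, n_fft, sr).T   # energy in each mel band
    return np.log(energies + 1e-10)  # small floor avoids log(0)

def utterance_features(signal, sr):
    # Hypothetical pooling: mean of the frame-level log FBEs over time
    return log_fbe(signal, sr).mean(axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal(8000)   # 1 s of white noise at 8 kHz as a stand-in signal
feat = utterance_features(x, 8000)
print(feat.shape)  # (26,)
```

Because additive noise raises energy mainly in bands where speech is weak, band-wise energies like these change visibly with noise level, which is the kind of sensitivity the abstract refers to.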


Citations


Journal Article

Mulsemedia: State of the Art, Perspectives, and Challenges


Gheorghita Ghinea¹, Christian Timmerer², Weisi Lin³, Stephen R. Gulliver⁴

Brunel University London¹, Alpen-Adria-Universität Klagenfurt², Nanyang Technological University³, University of Reading⁴

ACM Transactions on Multimedia Computing, Communications, and Applications, 01 Oct 2014

TL;DR: A historic perspective on mulsemedia work is presented and current developments in the area are reviewed and standardization efforts, via the MPEG-V standard, are described.


Abstract: Mulsemedia (multiple sensorial media) captures a wide variety of research efforts and applications. This article presents a historic perspective on mulsemedia work and reviews current developments in the area. These take place across the traditional multimedia spectrum, from virtual reality applications to computer games, as well as in the arts, gastronomy, and therapy, to mention a few. We also describe standardization efforts, via the MPEG-V standard, and identify future developments and exciting challenges the community needs to overcome.


153 citations

Cites methods from "Nonintrusive Quality Assessment of ..."

...M. Narwaria, W. Lin, I. McLoughlin, S. Emmanuel, and L. T. Chia....
...W. Lin and C.-C. Jay Kuo....
...W. Lin....
...Z. Lu, W. Lin, X. Yang, E. Ong, and S. Yao....
...Second, there is the issue of integration and adaptation, where multiple media objects should be used jointly and separately to improve application performance, and distributed multimedia applications should provide transparent delivery of dynamic content in such a way that...

Proceedings Article

Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model Based on BLSTM


Szu-Wei Fu¹, Yu Tsao², Hsin-Te Hwang³, Hsin-Min Wang³

National Taiwan University¹, Center for Information Technology², Academia Sinica³

02 Sep 2018

TL;DR: In this paper, an end-to-end, non-intrusive speech quality evaluation model, termed Quality-Net, based on bidirectional long short-term memory (BLSTM) was proposed.


Abstract: Nowadays, most objective speech quality assessment tools (e.g., perceptual evaluation of speech quality (PESQ)) are based on comparing the degraded/processed speech with its clean counterpart. The need for a "golden" reference considerably restricts the practicality of such assessment tools in real-world scenarios, since the clean reference usually cannot be accessed. On the other hand, human beings can readily evaluate speech quality without any reference (e.g., mean opinion score (MOS) tests), implying the existence of an objective and non-intrusive (no clean reference needed) quality assessment mechanism. In this study, we propose a novel end-to-end, non-intrusive speech quality evaluation model, termed Quality-Net, based on bidirectional long short-term memory. The evaluation of utterance-level quality in Quality-Net is based on frame-level assessment. Frame constraints and sensible initializations of forget gate biases are applied to learn meaningful frame-level quality assessment from the utterance-level quality label. Experimental results show that Quality-Net can yield high correlation to PESQ (0.9 for noisy speech and 0.84 for speech processed by speech enhancement). We believe that Quality-Net has the potential to be used in a wide variety of speech signal processing applications.
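The correlation figures quoted above (0.9 and 0.84) measure agreement between predicted scores and PESQ. Assuming the Pearson linear correlation coefficient is meant (common in this literature, though the abstract does not say so), it can be computed as below. The score lists are made-up toy numbers, not data from the study.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # Undefined (division by zero) if either list is constant.
    return cov / (sx * sy)

predicted = [1.2, 2.1, 2.9, 3.8, 4.1]   # toy model outputs
reference = [1.0, 2.0, 3.0, 4.0, 4.5]   # toy PESQ values
print(round(pearson(predicted, reference), 3))
```

A value near 1 means the predictions rise and fall linearly with the reference; it says nothing about absolute offset, so prediction error (e.g., RMSE) is often reported alongside it.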


93 citations

Posted Content

Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model based on BLSTM


Szu-Wei Fu¹, Yu Tsao², Hsin-Te Hwang³, Hsin-Min Wang³

National Taiwan University¹, Center for Information Technology², Academia Sinica³

arXiv (Sound), 16 Aug 2018

TL;DR: In this article, an end-to-end, non-intrusive speech quality evaluation model, termed Quality-Net, based on bidirectional long short-term memory (BLSTM) was proposed.


46 citations

Journal Article

Long-Term Spectral Statistics for Voice Presentation Attack Detection


Hannah Muckenhirn¹, Pavel Korshunov², Mathew Magimai-Doss², Sébastien Marcel²

École Polytechnique Fédérale de Lausanne¹, Idiap Research Institute²

IEEE Transactions on Audio, Speech, and Language Processing, 01 Nov 2017

TL;DR: Investigations on ASVspoof 2015 challenge database and AVspoof database show that the proposed approach with a linear discriminative classifier yields a better system, irrespective of whether the spoofed signal is replayed to the microphone or is directly injected into the system software process.


Abstract: Automatic speaker verification systems can be spoofed through recorded, synthetic, or voice-converted speech of target speakers. To make these systems practically viable, the detection of such attacks, referred to as presentation attacks, is of paramount interest. In that direction, this paper investigates two aspects: 1) a novel approach to detecting presentation attacks where, unlike conventional approaches, no assumptions related to speech signal modeling are made; instead, attacks are detected by computing first-order and second-order spectral statistics and feeding them to a classifier; and 2) generalization of presentation attack detection systems across databases. Our investigations on the ASVspoof 2015 challenge database and the AVspoof database show that, compared with approaches based on conventional short-term spectral features, the proposed approach with a linear discriminative classifier yields a better system, irrespective of whether the spoofed signal is replayed to the microphone or directly injected into the system software process. Cross-database investigations show that neither the short-term spectral processing-based approaches nor the proposed approach yields systems able to generalize across databases or methods of attack. This reveals the difficulty of the problem and the need for further resources and research.
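As a rough illustration of "first-order and second-order spectral statistics", the sketch below computes, for each frequency bin, the mean and standard deviation of the log magnitude spectrum across all frames of an utterance and concatenates them into one vector for a classifier. The frame length, hop, Hamming window, log floor, and the use of standard deviation rather than variance are all assumptions; the paper's exact configuration is not given here.

```python
import numpy as np

def long_term_spectral_stats(signal, frame_len=512, hop=256):
    """Concatenate per-bin mean and std of the log magnitude spectrum over all frames.

    Assumes len(signal) >= frame_len. Returns a vector of length 2 * (frame_len // 2 + 1).
    """
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[t * hop : t * hop + frame_len] * window for t in range(n_frames)])
    log_mag = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    mean = log_mag.mean(axis=0)   # first-order statistic per frequency bin
    std = log_mag.std(axis=0)     # second-order statistic per frequency bin
    return np.concatenate([mean, std])

rng = np.random.default_rng(1)
feat = long_term_spectral_stats(rng.standard_normal(16000))
print(feat.shape)  # (514,)
```

Unlike frame-level features, this yields one fixed-length vector per utterance, so a simple linear classifier can be applied directly.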


45 citations

Cites background from "Nonintrusive Quality Assessment of ..."

...In the literature it has been shown that first order and second order spectral statistics can be used to predict speech quality or quality assessment [47], [48]....

Proceedings Article

Novel deep autoencoder features for non-intrusive speech quality assessment


Meet H. Soni¹, Hemant A. Patil¹

Dhirubhai Ambani Institute of Information and Communication Technology¹

01 Aug 2016

TL;DR: Quantification of the experimental results suggests that the proposed metric gives more accurate and better-correlated scores than ITU-T P.563, an existing standard benchmark for objective, non-intrusive quality assessment.

...read moreread less

Abstract: Emulating human perception in quality assessment requires an objective metric or assessment method, which is a challenging task. Moreover, assessing the quality of speech without any reference or ground truth is even more difficult. In this paper, we propose a new non-intrusive speech quality assessment metric for objective evaluation of speech quality. The originality of the proposed scheme lies in using a deep autoencoder to extract low-dimensional features from the spectrum of the speech signal and in finding a mapping between these features and subjective scores using an artificial neural network (ANN). We show that autoencoder features capture noise information better than state-of-the-art filterbank energies (FBEs). Quantification of our experimental results suggests that the proposed metric gives more accurate and better-correlated scores than ITU-T P.563, an existing standard benchmark for objective, non-intrusive quality assessment.


45 citations

Cites methods from "Nonintrusive Quality Assessment of ..."

...In [12], authors posed quality estimation as a regression problem and used average Mel Frequency Cepstral Coefficients (MFCCs) to find mapping to subjective scores using support vector regression (SVR)....
...In this paper, the problem of speech quality assessment is posed as a regression problem, same as previously done in [12] and [13]....

References


Journal Article

LIBSVM: A library for support vector machines


Chih-Chung Chang¹, Chih-Jen Lin¹

National Taiwan University¹

ACM Transactions on Intelligent Systems and Technology, 06 May 2011

TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

...read moreread less

Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users easily apply SVMs to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
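LIBSVM's command-line tools read training data as sparse text lines of the form `<label> <index>:<value> ...`, with 1-based indices in ascending order; zero-valued features may be omitted. The sketch below writes feature vectors with quality scores as labels in that format. The helper names `to_libsvm_line` and `write_libsvm` and the toy numbers are hypothetical, introduced only for illustration.

```python
def to_libsvm_line(label, features, precision=6):
    """Format one sample as a LIBSVM text line: '<label> <index>:<value> ...'.

    Indices are 1-based and ascending; exact zeros are skipped (sparse format).
    """
    parts = [f"{label:g}"]
    for i, v in enumerate(features, start=1):
        if v != 0:
            parts.append(f"{i}:{v:.{precision}g}")
    return " ".join(parts)

def write_libsvm(path, labels, feature_rows):
    """Write one LIBSVM line per (label, feature vector) pair."""
    with open(path, "w") as fh:
        for y, x in zip(labels, feature_rows):
            fh.write(to_libsvm_line(y, x) + "\n")

# Toy example: one utterance with a 4-dimensional feature vector and a MOS-like label.
print(to_libsvm_line(3.5, [0.25, 0.0, -1.5, 2.0]))  # 3.5 1:0.25 3:-1.5 4:2
```

With a file written this way, a command such as `svm-train -s 3 -t 2 -c 8 -g 0.05 -p 0.1 train.txt quality.model` would train an epsilon-SVR (`-s 3`) with an RBF kernel (`-t 2`). The `-c`, `-g`, and `-p` values shown are placeholders, not recommended settings; in practice they would be chosen, for example, by cross-validation.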


40,826 citations

Book

Fundamentals of speech recognition


Lawrence R. Rabiner, Biing-Hwang Juang

01 Jan 1993

TL;DR: This book covers the fundamentals of speech recognition, from speech production, perception, and signal-processing front ends to pattern comparison, hidden Markov models, connected-word and large-vocabulary continuous speech recognition, and task-oriented applications.


Abstract: 1. Fundamentals of Speech Recognition. 2. The Speech Signal: Production, Perception, and Acoustic-Phonetic Characterization. 3. Signal Processing and Analysis Methods for Speech Recognition. 4. Pattern Comparison Techniques. 5. Speech Recognition System Design and Implementation Issues. 6. Theory and Implementation of Hidden Markov Models. 7. Speech Recognition Based on Connected Word Models. 8. Large Vocabulary Continuous Speech Recognition. 9. Task-Oriented Applications of Automatic Speech Recognition.


8,442 citations

Additional excerpts

...The last section presents the concluding remarks....

Book

Kernel Methods for Pattern Analysis


John Shawe-Taylor¹, Nello Cristianini²

University of Southampton¹, University of Bristol²

01 Jan 2004

TL;DR: This book provides an easy introduction for students and researchers to the growing field of kernel-based pattern analysis, demonstrating with examples how to handcraft an algorithm or a kernel for a new specific application, and covering all the necessary conceptual and mathematical tools to do so.


Abstract: Kernel methods provide a powerful and unified framework for pattern discovery, motivating algorithms that can act on general types of data (e.g. strings, vectors or text) and look for general types of relations (e.g. rankings, classifications, regressions, clusters). The application areas range from neural networks and pattern recognition to machine learning and data mining. This book, developed from lectures and tutorials, fulfils two major roles: firstly it provides practitioners with a large toolkit of algorithms, kernels and solutions ready to use for standard pattern discovery problems in fields such as bioinformatics, text analysis, image analysis. Secondly it provides an easy introduction for students and researchers to the growing field of kernel-based pattern analysis, demonstrating with examples how to handcraft an algorithm or a kernel for a new specific application, and covering all the necessary conceptual and mathematical tools to do so.
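The kernel idea invoked in the main abstract (learning nonlinear patterns through an implicit nonlinear transformation) can be made concrete with the Gaussian (RBF) kernel k(x, z) = exp(-gamma ||x - z||^2). The sketch computes the Gram matrix that kernel methods such as SVR use in place of explicit feature-space inner products. The RBF kernel is chosen only as a common example, and `gamma` is an assumed hyperparameter.

```python
import numpy as np

def rbf_gram(X, Z, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2)."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    # ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x.z, computed for all pairs at once
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Z ** 2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_gram(X, X)
print(np.round(K, 4))
```

The resulting matrix is symmetric and positive semidefinite with ones on the diagonal. Similarity decays with distance at a rate set by `gamma`, which is why that parameter is usually tuned together with the regularization constant.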


6,050 citations

Journal Article

Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences


S. Davis, Paul Mermelstein¹

Bell-Northern Research¹

IEEE Transactions on Acoustics, Speech, and Signal Processing, 01 Aug 1980

TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.


Abstract: Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words; therefore, the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time-registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.
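Mel-frequency cepstral coefficients are obtained by applying a discrete cosine transform to log mel filter-bank energies, so FBEs are essentially MFCCs before that final decorrelating step. A minimal sketch of the step, assuming an orthonormal DCT-II (scaling conventions vary between implementations) and keeping ten coefficients to echo the count reported above. The 20-band input is arbitrary.

```python
import numpy as np

def dct2_ortho(v):
    """Orthonormal DCT-II of a 1-D array."""
    v = np.asarray(v, dtype=float)
    n = len(v)
    k = np.arange(n)[:, None]          # output (cepstral) index
    m = np.arange(n)[None, :]          # input (filter-band) index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)        # orthonormal scaling for the DC term
    return scale * (basis @ v)

def mfcc_from_log_fbe(frame_log_fbe, n_ceps=10):
    """Keep the first n_ceps DCT coefficients of one frame's log filter-bank energies."""
    return dct2_ortho(frame_log_fbe)[:n_ceps]

frame_log_fbe = np.log(np.linspace(1.0, 20.0, 20))  # arbitrary 20-band example
print(mfcc_from_log_fbe(frame_log_fbe).shape)  # (10,)
```

Truncating to the first few coefficients keeps the smooth spectral envelope and discards fine detail, which is one reason cepstral features are compact. Quality-oriented work that prefers FBEs instead keeps the band-wise energies intact.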


4,822 citations

Proceedings Article

The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions


David Pearce, Hans-Günter Hirsch¹

Ericsson¹

01 Jan 2000

TL;DR: This paper describes a database designed to evaluate the performance of speech recognition algorithms in noisy conditions and presents recognition results for the first standard DSR feature extraction scheme, which is based on cepstral analysis.


Abstract: This paper describes a database designed to evaluate the performance of speech recognition algorithms in noisy conditions. The database may be used either for the evaluation of front-end feature extraction algorithms using a defined HMM recognition back-end or for complete recognition systems. The source speech for this database is TIDigits, consisting of connected digits spoken by American English talkers (downsampled to 8 kHz). A selection of 8 different real-world noises has been added to the speech over a range of signal-to-noise ratios, and special care has been taken to control the filtering of both the speech and the noise. The framework was prepared as a contribution to the ETSI STQ-AURORA DSR Working Group [1]. Aurora is developing standards for Distributed Speech Recognition (DSR), where the speech analysis is done in the telecommunication terminal and the recognition at a central location in the telecom network. The framework is currently being used to evaluate alternative proposals for front-end feature extraction. The database has been made publicly available through ELRA so that other speech researchers can evaluate and compare the performance of noise-robust algorithms. Recognition results are presented for the first standard DSR feature extraction scheme, which is based on cepstral analysis.
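Adding noise "over a range of signal-to-noise ratios" means scaling the noise so that the ratio of speech power to noise power hits a target value in dB. A minimal sketch, assuming SNR is measured on whole-signal average power. The Aurora framework's own level-measurement and filtering procedure is not reproduced here, and the sine tone stands in for real speech.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Return speech + scaled noise so that 10*log10(P_speech / P_noise) == snr_db.

    Assumes the noise is at least as long as the speech; extra samples are dropped.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain g satisfies p_speech / (g**2 * p_noise) = 10**(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def measured_snr_db(speech, mixture):
    """SNR of a mixture relative to its known clean component."""
    residual = mixture - speech
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))

sr = 8000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 440 * t)                 # stand-in for a speech signal
noise = np.random.default_rng(2).standard_normal(sr)
for snr in (20, 10, 0):
    mix = mix_at_snr(speech, noise, snr)
    print(snr, round(measured_snr_db(speech, mix), 2))
```

Whole-signal power counts pauses as part of the speech level. Many evaluation setups instead measure speech level only over active segments, which gives a different effective SNR for the same nominal value.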


1,909 citations